Synthetic Data Schema

Summary

The RiskFabric data schema is designed to mirror a professional financial environment while providing the "white-box" visibility required for advanced machine learning research. It consists of five core entities that represent the hierarchical relationship between a customer and their financial events.

Design Intent

The schema prioritizes Relational Realism over flat-file simplicity. By separating Customers, Accounts, and Cards into distinct tables, the simulation models one-to-many relationships (e.g., a single customer owning multiple accounts, each with its own card instruments). This is essential for testing entity-linking models and network analysis in fraud detection.

The inclusion of the FraudMetadata table is a critical architectural decision. It decouples the simulation ground truth (fraud_target, stored in FraudMetadata) from the operational signal (is_fraud, stored on each Transaction). This allows researchers to train on noisy, realistic labels while validating against the exact latent truth recorded by the generator.
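
To make the decoupling concrete, the following is a minimal sketch of how the noisy label could be cross-tabulated against the ground truth. It uses only the documented fields transaction_id, is_fraud, and fraud_target; the function name label_noise_counts and the toy rows are hypothetical, and real data would be loaded from the Parquet files rather than written inline.

```python
from collections import Counter

def label_noise_counts(transactions, fraud_metadata):
    """Cross-tabulate the noisy is_fraud label against the latent fraud_target."""
    # Index ground truth by transaction_id (FraudMetadata is linked 1:1).
    truth = {row["transaction_id"]: row["fraud_target"] for row in fraud_metadata}
    counts = Counter()
    for txn in transactions:
        target = truth[txn["transaction_id"]]
        observed = txn["is_fraud"]
        if target and observed:
            counts["true_positive"] += 1
        elif target and not observed:
            counts["false_negative"] += 1
        elif observed:
            counts["false_positive"] += 1
        else:
            counts["true_negative"] += 1
    return counts

# Hypothetical toy rows for illustration only.
transactions = [
    {"transaction_id": "t1", "is_fraud": True},
    {"transaction_id": "t2", "is_fraud": False},
    {"transaction_id": "t3", "is_fraud": True},
    {"transaction_id": "t4", "is_fraud": False},
]
fraud_metadata = [
    {"transaction_id": "t1", "fraud_target": True},
    {"transaction_id": "t2", "fraud_target": True},
    {"transaction_id": "t3", "fraud_target": False},
    {"transaction_id": "t4", "fraud_target": False},
]
noise = label_noise_counts(transactions, fraud_metadata)
```

The false_negative and false_positive counts give a direct measure of how far the operational label drifts from the generator's truth.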

Entity Relationship Overview

  • Customer: The primary entity. Owns several Accounts.
  • Account: A financial container (Savings, Current, Credit). Contains several Cards.
  • Card: The instrument used for transactions.
  • Transaction: A financial event linked to a Card, Account, and Customer.
  • FraudMetadata: Ground-truth data linked 1:1 with Transactions to explain the generation context.
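
The hierarchy above implies that every child row should point at an existing parent. A minimal referential-integrity sketch follows, assuming rows are available as plain dictionaries keyed by the documented field names; the helper find_orphans and the sample rows are hypothetical.

```python
def find_orphans(children, fk, parents, pk):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {parent[pk] for parent in parents}
    return [child for child in children if child[fk] not in parent_keys]

# Hypothetical toy rows; account a2 points at a customer that does not exist.
customers = [{"customer_id": "c1"}]
accounts = [
    {"account_id": "a1", "customer_id": "c1"},
    {"account_id": "a2", "customer_id": "c9"},
]
cards = [{"card_id": "k1", "account_id": "a1", "customer_id": "c1"}]

orphan_accounts = find_orphans(accounts, "customer_id", customers, "customer_id")
orphan_cards = find_orphans(cards, "account_id", accounts, "account_id")
```

The same helper can be applied to Transaction rows against Card, Account, and Customer, and to FraudMetadata rows against Transaction.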

👥 Customer (customers.parquet)

Defines the synthetic population's demographics and geographic baseline.

| Field | Type | Description |
|---|---|---|
| customer_id | String | Unique UUID for the customer. |
| name | String | Full name (Indian-centric). |
| age | UInt8 | Age of the customer (18-90). |
| email | String | Synthetic email address. |
| location | String | Full residential address (OSM-based). |
| state | String | Standardized Indian state name. |
| location_type | String | Urban vs. Rural classification. |
| home_latitude | Float64 | WGS84 latitude of home. |
| home_longitude | Float64 | WGS84 longitude of home. |
| home_h3r5 | String | H3 resolution 5 cell index (coarse, roughly district scale, about 250 km² per cell). |
| home_h3r7 | String | H3 resolution 7 cell index (finer, roughly neighborhood scale, about 5 km² per cell). |
| credit_score | UInt16 | Synthetic credit score (300-850). |
| monthly_spend | Float64 | Average expected monthly expenditure. |
| customer_risk_score | Float32 | Baseline risk probability (0.0 to 1.0). |
| is_fraud | Bool | Flag indicating if this customer represents a fraud target. |
| registration_date | String | ISO 8601 date of account registration. |
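
The documented value ranges lend themselves to a simple sanity check. The sketch below flags fields that fall outside the ranges stated in the table (plus the standard WGS84 bounds for latitude and longitude); the function name customer_range_violations and the sample record are hypothetical.

```python
def customer_range_violations(customer):
    """Return the names of fields that fall outside their documented ranges."""
    checks = {
        "age": 18 <= customer["age"] <= 90,
        "credit_score": 300 <= customer["credit_score"] <= 850,
        "customer_risk_score": 0.0 <= customer["customer_risk_score"] <= 1.0,
        "home_latitude": -90.0 <= customer["home_latitude"] <= 90.0,
        "home_longitude": -180.0 <= customer["home_longitude"] <= 180.0,
    }
    return [field for field, ok in checks.items() if not ok]

# Hypothetical record with an out-of-range age and credit score.
sample_customer = {
    "age": 17,
    "credit_score": 900,
    "customer_risk_score": 0.12,
    "home_latitude": 19.076,
    "home_longitude": 72.8777,
}
violations = customer_range_violations(sample_customer)
```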

🏦 Account (accounts.parquet)

The logical banking container for funds.

| Field | Type | Description |
|---|---|---|
| account_id | String | Unique UUID for the account. |
| customer_id | String | FK to Customer. |
| bank_id | String | Identifier for the issuing bank. |
| account_no | String | 12-digit synthetic account number. |
| account_type | String | Savings, Current, or Credit. |
| balance | Float64 | Current funds in the account. |
| status | String | Active, Closed, or Suspended. |
| creation_date | String | The account opening date. |

💳 Card (cards.parquet)

The payment instrument associated with an account.

| Field | Type | Description |
|---|---|---|
| card_id | String | Unique UUID for the card. |
| account_id | String | FK to Account. |
| customer_id | String | FK to Customer. |
| card_number | String | 16-digit synthetic PAN. |
| card_network | String | VISA, Mastercard, or RuPay. |
| card_type | String | Debit or Credit. |
| status | String | Active, Blocked, or Expired. |
| status_reason | String | Reason for status changes (e.g., SIM Swap Suspect). |
| issue_date | String | Card issuance date. |
| activation_date | String | Initial card usage date. |
| expiry_date | String | Card expiry date. |
| issuing_bank | String | Full name of the bank. |
| bank_code | String | Standardized 4-digit bank identifier. |

💸 Transaction (transactions.parquet)

The high-volume stream of financial events.

| Field | Type | Description |
|---|---|---|
| transaction_id | String | Unique UUID for the transaction. |
| card_id | String | FK to Card. |
| account_id | String | FK to Account. |
| customer_id | String | FK to Customer. |
| merchant_id | String | Unique identifier for the merchant. |
| merchant_name | String | Name of the business. |
| merchant_category | String | Category (e.g., GROCERY, TRAVEL). |
| merchant_country | String | Country code of the merchant (defaults to IN). |
| amount | Float64 | Transaction value in base currency. |
| timestamp | String | ISO 8601 high-precision timestamp. |
| transaction_channel | String | online, in-store, UPI, etc. |
| card_present | Bool | Physical card usage flag. |
| user_agent | String | Browser or POS device identifier. |
| ip_address | String | IPv4 address of the requester. |
| status | String | High-level status (Success or Failed). |
| auth_status | String | Banking authorization code (approved/declined). |
| failure_reason | String | Detailed reason for declined transactions. |
| is_fraud | Bool | Noisy Label (includes FN/FP). |
| chargeback | Bool | Flag indicating a later customer dispute. |
| location_lat | Float64 | Latitude of the transaction event. |
| location_long | Float64 | Longitude of the transaction event. |
| h3_r7 | String | H3 Resolution 7 index of the transaction location. |
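
Because each transaction carries card_id, account_id, and customer_id together, the three keys can disagree if a row is corrupted. The sketch below checks that a transaction's account and customer match those recorded on its card. It assumes the generator intends these keys to agree, which the schema implies but does not state; the function name inconsistent_transactions and the toy rows are hypothetical.

```python
def inconsistent_transactions(transactions, cards):
    """Return transaction_ids whose account/customer keys disagree with the card."""
    card_index = {card["card_id"]: card for card in cards}
    bad = []
    for txn in transactions:
        card = card_index.get(txn["card_id"])
        if (
            card is None  # unknown card
            or card["account_id"] != txn["account_id"]
            or card["customer_id"] != txn["customer_id"]
        ):
            bad.append(txn["transaction_id"])
    return bad

# Hypothetical toy rows: t2 names the wrong account, t3 an unknown card.
cards = [{"card_id": "k1", "account_id": "a1", "customer_id": "c1"}]
transactions = [
    {"transaction_id": "t1", "card_id": "k1", "account_id": "a1", "customer_id": "c1"},
    {"transaction_id": "t2", "card_id": "k1", "account_id": "a7", "customer_id": "c1"},
    {"transaction_id": "t3", "card_id": "k9", "account_id": "a1", "customer_id": "c1"},
]
bad_ids = inconsistent_transactions(transactions, cards)
```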

🕵️ Fraud Metadata (fraud_metadata.parquet)

Internal ground-truth for debugging and advanced ML training. This table is not used in standard inference but is vital for "white-box" evaluation.

| Field | Type | Description |
|---|---|---|
| transaction_id | String | FK to Transaction. |
| fraud_target | Bool | Ground Truth (True Fraud flag). |
| fraud_type | String | Profile used (e.g., upi_scam, ato). |
| label_noise | String | Reason for label mismatch (if any). |
| injector_version | String | Engine version. |
| geo_anomaly | Bool | True if location represents an outlier. |
| device_anomaly | Bool | True if device/UA represents an outlier. |
| ip_anomaly | Bool | True if IP represents a known malicious prefix. |
| burst_session | Bool | Part of a rapid-fire sequence. |
| burst_seq | Int32 | Sequence number within a burst session. |
| campaign_id | String | Link to a coordinated attack campaign. |
| campaign_type | String | Coordination type (e.g., coordinated_attack). |
| campaign_phase | String | Phase within the campaign (early, active, late). |
| campaign_day_number | Int32 | Days since campaign start. |
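
As one example of "white-box" evaluation, the campaign fields can be summarized per campaign and phase. This is a minimal sketch assuming that rows outside any campaign carry an empty or null campaign_id (an assumption, since the schema does not say how absence is encoded); the function name campaign_phase_counts and the toy rows are hypothetical.

```python
from collections import Counter

def campaign_phase_counts(fraud_metadata):
    """Count metadata rows per (campaign_id, campaign_phase) pair."""
    counts = Counter()
    for row in fraud_metadata:
        # Assumption: non-campaign rows have campaign_id set to None or "".
        if row.get("campaign_id"):
            counts[(row["campaign_id"], row["campaign_phase"])] += 1
    return counts

# Hypothetical toy rows for illustration only.
metadata_rows = [
    {"transaction_id": "t1", "campaign_id": "cmp-1", "campaign_phase": "early"},
    {"transaction_id": "t2", "campaign_id": "cmp-1", "campaign_phase": "active"},
    {"transaction_id": "t3", "campaign_id": "cmp-1", "campaign_phase": "active"},
    {"transaction_id": "t4", "campaign_id": None, "campaign_phase": None},
]
phase_counts = campaign_phase_counts(metadata_rows)
```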

Known Issues

UUID strings are currently used for all primary keys (customer_id, card_id, etc.). This guarantees global uniqueness, but it increases storage overhead and join latency in ClickHouse compared to integer-based keys. Transitioning to a 64-bit integer ID system is under consideration for future versions.
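
One possible migration path, sketched below, is to assign dense integer surrogate keys in first-seen order and keep the mapping as a lookup table. This is an illustration only, not the planned design; the function name build_surrogate_keys is hypothetical. Converting a UUID directly to an integer is not an option here, because a UUID is 128 bits wide and does not fit in a 64-bit column.

```python
def build_surrogate_keys(uuid_strings):
    """Map each distinct UUID string to a dense integer ID, starting at 1."""
    mapping = {}
    for value in uuid_strings:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping

# Hypothetical customer_id column with a repeated value.
customer_ids = [
    "3f2b8c1e-0000-4000-8000-000000000001",
    "3f2b8c1e-0000-4000-8000-000000000002",
    "3f2b8c1e-0000-4000-8000-000000000001",
]
key_map = build_surrogate_keys(customer_ids)
int_column = [key_map[value] for value in customer_ids]
```

The same mapping has to be applied to every table that references the key, so that foreign keys stay aligned after the conversion.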

Furthermore, a dedicated merchant table is not yet implemented in the output schema. Merchant attributes (merchant_name, merchant_category, merchant_country) are currently denormalized directly into the transaction table, which creates data redundancy and limits merchant-level entity modeling. Breaking merchants into a separate merchants.parquet file is required to complete the star schema.
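
Until that table exists, a merchant dimension can be derived from the transaction rows. The sketch below is one way to do it, assuming merchant attributes are stable per merchant_id, so the first occurrence is kept; the names MERCHANT_FIELDS and split_merchants are hypothetical.

```python
MERCHANT_FIELDS = ("merchant_id", "merchant_name", "merchant_category", "merchant_country")

def split_merchants(transactions):
    """Split denormalized rows into a merchant table and slimmer transactions."""
    merchants = {}
    slim_transactions = []
    for txn in transactions:
        merchant_id = txn["merchant_id"]
        # Keep the first set of attributes seen for each merchant_id.
        merchants.setdefault(merchant_id, {f: txn[f] for f in MERCHANT_FIELDS})
        # Drop merchant attributes but keep merchant_id as the foreign key.
        slim_transactions.append(
            {k: v for k, v in txn.items() if k not in MERCHANT_FIELDS or k == "merchant_id"}
        )
    return list(merchants.values()), slim_transactions

# Hypothetical toy rows: two transactions at the same merchant.
rows = [
    {"transaction_id": "t1", "merchant_id": "m1", "merchant_name": "Fresh Mart",
     "merchant_category": "GROCERY", "merchant_country": "IN", "amount": 250.0},
    {"transaction_id": "t2", "merchant_id": "m1", "merchant_name": "Fresh Mart",
     "merchant_category": "GROCERY", "merchant_country": "IN", "amount": 90.5},
]
merchant_table, slim_rows = split_merchants(rows)
```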