RiskFabric

Rust Python Polars ClickHouse Redpanda Redis Docker License: MIT Deploy mdBook

RiskFabric is a fraud intelligence platform that generates synthetic Indian payment transaction data, processes it through a Medallion ETL pipeline, and produces trained fraud detection models.

✨ Key Features

  • Extreme Throughput: Achieves ~182,000 Transactions Per Second (TPS) using a parallelized "One-Pass" architecture.
  • Agent-Based Realism: Simulates the full lifecycle of Customers, Accounts, and Cards, with behavioral spend profiles driven by real-world heuristics.
  • Geographic Fidelity: Integrates OpenStreetMap (OSM) India data and Uber H3 hexagonal indexing for hyper-realistic spatial spend patterns and location anomalies.
  • Sophisticated Fraud Injection: Includes signatures for UPI Scams, Account Takeover (ATO), Card Not Present (CNP) fraud, and coordinated campaigns.
  • Medallion Data Architecture: A full pipeline taking data from Bronze (Raw) to Silver (Feature Engineered) to Gold (ML-Ready).
  • ML Mastery: Built-in leakage prevention and simulated label noise (False Positives/Negatives) to ensure models are robust and production-ready.

🛠️ Tech Stack

  • Core Engine: Rust (Rayon for parallelization, Rand for deterministic simulation).
  • Real-time Streaming: Redpanda (Kafka-compatible), rdkafka, and Tokio async runtime.
  • Data Processing: Polars 0.51.0 (Lazy API & high-performance transformation).
  • Data Warehouse: PostgreSQL (Spatial/OSM staging), ClickHouse (High-volume transactions), and dbt (Analytical enrichment).
  • Feature Store: Redis (Low-latency state for real-time Z-scores and behavior).
  • Data Ingestion: dlt (Data Load Tool) for MDS integration.
  • Machine Learning: Python (XGBoost) with real-time inference via scorer.py.
  • Infrastructure: Docker/Podman orchestration with Prometheus and Grafana for observability.

📁 Project Structure

🧠 Core Simulation (src/)

  • generators/: Agent-Based Modeling (ABM) logic, entity creation, and fraud mutation engines.
  • models/: Rust structures for Customers, Accounts, Cards, and Transactions.
  • bin/: CLI binaries for data generation (generate.rs), streaming (stream.rs), and preparation.
  • config.rs: Centralized, type-safe configuration engine for simulation parameters.

🥈 ETL & Data Warehouse (src/etl/ & warehouse/)

  • etl/: Multi-stage Polars transformation pipeline (Silver/Gold feature engineering).
  • warehouse/: dbt project for geographic enrichment and merchant risk profiling using PostGIS.
  • dlt/: MDS integration for automated data lake ingestion.

🤖 Machine Learning (src/ml/)

  • train_xgboost.py: Training pipeline with feature sanitization and Out-of-Time (OOT) validation.
  • scorer.py: Real-time inference service consuming from Kafka and stateful Redis features.
  • seed_redis.py: Point-in-time state synchronization between the warehouse and feature store.

🛠️ Infrastructure & Docs

  • docker-compose.yml: Orchestrated local stack (ClickHouse, Postgres, Redpanda, Redis, Grafana).
  • documentation/: Architectural docs and theory of operation (mdBook).
  • data/config/: Behavioral rules and system tuning YAML configurations.

📈 Benchmarks (150k Txns)

| Architecture | Throughput | Total Time | Speedup |
|---|---|---|---|
| Sequential Port | 3,400 TPS | 48.7s | 1x |
| Optimized One-Pass | 182,000 TPS | 4.4s | 53x |

Developed by harshafaik

Your First Generation

Summary

This tutorial provides a step-by-step operational guide for initializing the RiskFabric environment and executing a full synthetic data lifecycle—from world-building to model training.

Prerequisites

The following components must be installed and available:

  • Rust (Latest Stable)
  • Docker or Podman (with Docker Compose support)
  • Python 3.10+
  • Git

Step 0: Infrastructure Setup

The simulation requires several backing services (Postgres, ClickHouse, Redpanda, Redis). These are orchestrated via Docker Compose and must be running before the generation binaries are executed.

# Start the local service stack
docker-compose up -d

Step 1: World Building (Level 0)

Before generating transactions, the physical reference data must be prepared by extracting OpenStreetMap nodes, enriching them via dbt, and exporting them to Parquet.

# 1. Extract raw OSM nodes to Postgres
cargo run --bin prepare_refs -- extract-nodes

# 2. Enrich & Transform (Spatial Joins and Risk Categorization)
dbt run --project-dir warehouse

# 3. Export to Parquet for the generator
# Option A: Rust-based export
cargo run --bin export_references
# Option B: DLT-based export (Recommended)
python dlt/pipelines.py export

The Database Transformation Process

During this step, the Postgres database performs three critical operations to build the "Physical World":

  1. Ingestion: Millions of raw coordinates are copied from OSM PBF files into the staging area.
  2. Spatial Anchoring: dbt uses PostGIS to perform spatial intersections against official Indian boundaries, ensuring every coordinate is anchored to a verified State and District for realistic travel-velocity calculations.
  3. Adversarial DNA: Raw merchant tags are mapped to standardized categories (e.g., LUXURY, GAMBLING) and assigned baseline risk levels, establishing the ground truth for fraud injection.
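Conceptually, the Spatial Anchoring step is a point-in-polygon test of each merchant coordinate against boundary geometries. The real pipeline delegates this to PostGIS; the minimal Python sketch below illustrates the idea with a made-up square "district":

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting point-in-polygon test, the idea behind the PostGIS
    spatial intersection used for anchoring. (Illustrative only.)"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge cross the horizontal ray at this latitude?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical square "district" in (lon, lat) degrees.
district = [(77.0, 12.0), (78.0, 12.0), (78.0, 13.0), (77.0, 13.0)]
```

PostGIS performs the same containment check with indexed geometries, which is what makes it tractable for millions of nodes.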

Step 2: Batch Generation and ETL

The historical dataset used for model training must be generated, ingested into the warehouse, and processed through the feature engineering pipeline.

Configuring the Simulation

Before running the generation, you can tune the scale and behavior of the synthetic population in the data/config/ directory:

  • Population Scale (customer_config.yaml):
    • control.customer_count: Total number of unique agents (Default: 3334).
    • control.transactions_per_customer: Min/Max transaction volume per agent (Default: 400-800).
    • registration.lookback_years: How far back the customer history goes (Default: 5 years).
  • Transaction Patterns (transaction_config.yaml):
    • transactions.lookback_days: Duration of the generated transaction history (Default: 365 days).
    • transactions.amount_range: The global min/max for transaction values (Default: 10 - 50,000 INR).
    • temporal_patterns: Hourly and daily weights that drive circadian rhythms.
  • Fraud Injection (fraud_rules.yaml):
    • fraud_injector.target_share: The percentage of transactions that are intentionally fraudulent (Default: 0.01 or 1%).
    • fraud_injector.default_fp_rate: Baseline "noise" (False Positives) injected into the labels (Default: 0.005).
    • fraud_injector.profiles: Tune the frequency and behavior of specific attack types (UPI Scams, ATO, Velocity Abuse).

# Generate the initial population and history
cargo run --release --bin generate

# Ingest into ClickHouse and run ETL layers
cargo run --bin ingest
cargo run --bin etl -- silver-all
cargo run --bin etl -- gold-master

Step 3: Model Training and Streaming

The final phase involves training the XGBoost classifier, seeding the real-time feature store, and starting the streaming simulation.

Model Configuration

The training script (train_xgboost.py) uses a configuration optimized for high-imbalance datasets:

  • Class Imbalance Handling: The scale_pos_weight is calculated dynamically (Legitimate / Fraud ratio) to ensure the model doesn't ignore the minority fraud class.
  • Hyperparameters:
    • n_estimators: 100
    • max_depth: 6
    • learning_rate: 0.1
    • eval_metric: aucpr
  • Operational Feature Set: The model trains on 12 behavioral features (e.g., spatial_velocity, amount_deviation_z_score), explicitly excluding synthetic IDs (customer_id, etc.) to prevent label leakage.
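The dynamic class-weight calculation can be sketched in a few lines. This is illustrative Python; the function name is an assumption, not the actual train_xgboost.py code:

```python
def scale_pos_weight(labels):
    """Ratio of negative (legitimate) to positive (fraud) samples,
    passed to XGBoost so the minority fraud class is not ignored.
    (Illustrative; not the actual train_xgboost.py code.)"""
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos

# At roughly 1% fraud share, the weight works out to 99.
labels = [1] * 10 + [0] * 990
weight = scale_pos_weight(labels)
```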

# Train the fraud detection model
python src/ml/train_xgboost.py

Model Validation and Interpretability

Before moving to production scoring, you should validate the model's performance and interpret its decision drivers:

  • Performance Testing (test_model.py): Runs the trained model against a test dataset to generate classification reports and conduct threshold analysis (identifying the optimal Precision/Recall trade-off).
  • Explainability (shap_analysis.py): Uses SHAP (SHapley Additive exPlanations) to create visual reports in reports/shap/. This identifies which features (e.g., spatial_velocity) drove the model's flags globally and for each specific fraud profile.
  • Model Metadata (dump_model.py): A developer utility used to inspect the internals of the saved JSON model, verifying feature names, types, and categorical encodings.

# Run performance and threshold analysis
python src/ml/test_model.py

# Generate SHAP interpretability reports
python src/ml/shap_analysis.py

# (Optional) Inspect model metadata
python src/ml/dump_model.py

Starting the Real-time Pipeline

Once the model is validated, seed the feature store and start the inference engine:

# Seed the Redis feature store with warehouse state
python src/ml/seed_redis.py

# Start the real-time scorer and the streaming generator
python src/ml/scorer.py
cargo run --bin stream

Known Issues

The documentation assumes a local container environment. Running without containers may result in database connection failures. Explicit validation of service availability (Kafka, Redis, ClickHouse, Postgres) is required before beginning the tutorial.
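A minimal preflight check along those lines is sketched below. The port numbers are the stack's conventional defaults and may differ in your docker-compose.yml:

```python
import socket

def check_services(services, host="localhost", timeout=1.0):
    """Return the names of services whose TCP port is not accepting
    connections. (Illustrative sketch, not part of the repository.)"""
    down = []
    for name, port in services.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            down.append(name)
    return down

# Conventional default ports; adjust to match your docker-compose.yml.
STACK = {"postgres": 5432, "clickhouse": 8123, "redpanda": 9092, "redis": 6379}
```

Running `check_services(STACK)` before Step 0 surfaces any unreachable backing service instead of failing mid-tutorial.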

Furthermore, the tutorial follows a linear path. Instructions for incremental updates, such as appending new transactions to an existing warehouse, are currently omitted. Implementation of stateful resumption guidance is required for large-scale simulation runs.

How-to Guides

Step-by-step instructions and roadmaps for managing, extending, and operating the RiskFabric simulation environment.

Project Roadmap & Backlog

Summary

The to-do.md document serves as the tactical roadmap for RiskFabric. It details the completed milestones and upcoming engineering tasks required to evolve the simulation from a prototype into a production-grade synthetic data platform.


👥 Customer Generation

  • Location Heuristic Fix: location_type (Urban/Rural) is assigned based on city name or configuration fallback.
  • Spatial Jittering: Implementation of multi-level jittering, including a ~500m drift for residential nodes and a deterministic ~100m drift for transaction events.
  • City Name Fallbacks: Use of "{State} Region" for missing city names to maintain geographic consistency.
  • Demographic Validation: Implementation of Indian-centric naming and email domain distributions via customer_config.yaml.
  • Device & ISP Profiling: Implementation of realistic device fingerprinting and ISP-level behavioral attributes for each customer profile.
  • Feature Correlation: Enforcing structural relationships between Age, Credit Score, and Monthly Spend to ensure dataset realism.
  • Simulation Scalability: Transitioning to a streaming Parquet reader for residential reference data to support multi-million agent populations without memory exhaustion.
  • Demographic Realism Tuning: Implement Name-Gender-State correlation for first names and surnames.
  • Email Distribution Tuning: Align email domain distributions with actual Indian market shares.

💸 Transaction & Merchant Logic

  • One-Pass Chunked Generation: Refactoring of the generator to process cards in batches of 5,000, enabling multi-million transaction generation on standard hardware.
  • Chronological Simulation: Implementation of time-ordered transaction generation with support for temporal burst warping.
  • MCC Mapping: Mapping of OSM categories to standard Merchant Category Codes (MCC) for realistic financial analysis.
  • Budget-Aware Simulation: Transaction amounts are linked to the customer's monthly_spend profile, with noise added to individual events.
  • Weighted Temporal Patterns: Implementation of circadian rhythms via hourly and daily weights in transaction_config.yaml.
  • Device & Agent Persistence: Implementation of persistent devices and realistic app identifiers (e.g., GPay, PhonePe) per payment channel.
  • Amount Distribution Tuning: Remediation of the "Amount Shortcut" by ensuring fraudulent amounts significantly overlap with legitimate spending distributions.
  • Geographic Precision: Implementing the Haversine formula for all spatial velocity and distance calculations to replace Euclidean approximations.
  • Jitter Normalization: Ensure consistent ~100m spatial jittering across all geographic profiles.
  • Rayon Chunk Size Optimization: Explicitly tune chunk_size for parallel generation to optimize throughput.
  • H3 Resolution Consistency: Enforce consistent H3 resolution usage across all spatial calculation layers.
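The Haversine formula referenced in the Geographic Precision item can be sketched as follows (illustrative Python; the engine computes this in Rust):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres, replacing Euclidean
    approximations for spatial-velocity features. (Illustrative Python;
    the engine computes this in Rust.)"""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

distance = haversine_km(19.076, 72.8777, 28.6139, 77.209)  # Mumbai to Delhi
```

Dividing such a distance by the elapsed time between two transactions yields the spatial-velocity signal used downstream.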

🥈 ETL & Infrastructure

  • Unified CLI Tooling: Consolidation of multiple utility binaries into unified etl, prepare_refs, and ingest tools for improved developer experience.
  • Streaming Infrastructure: Integration of Redpanda (Kafka-compatible) for high-throughput, low-latency transaction event streams.
  • Stateful Feature Store: Integration of Redis for sub-millisecond retrieval of behavioral context and running statistical aggregates.
  • Full-Stack Observability: Implementation of Prometheus and Grafana dashboards for real-time monitoring of generation throughput and scoring latency.
  • Zero-Copy Stdin Piping: Optimization of the ETL pipeline to pipe Parquet data directly from Polars to ClickHouse stdin, eliminating intermediate disk I/O.
  • Streaming ETL Implementation: Refactoring of runners to use .scan_parquet() and .sink_parquet() to support 10M+ row benchmarks without memory exhaustion.
  • Infrastructure Hardening: Transitioning from hardcoded credentials to an .env and Docker Secrets management system.
  • Docker Healthcheck Synchronization: Refine depends_on to use service_healthy conditions in docker-compose.yml.
  • Polars Type Consistency: Systematically cast boolean flags and small counters to UInt32 to prevent ClickHouse ingestion panics.
  • ETL Signal Reliability: Re-enable commented-out Silver ETL stages (Campaign, Device IP, Network).
  • ClickHouse Ingestion Stability: Transition to a native driver/HTTP client to replace podman exec dependencies.

🤖 Machine Learning & Model Training

  • "Operational Feature" Pivot: Refactoring of the training pipeline to focus exclusively on behavioral signals, explicitly excluding synthetic metadata to prevent label leakage.
  • SHAP Interpretability: Integration of SHAP (SHapley Additive exPlanations) for global and profile-specific feature importance validation.
  • Real-Time Scoring Service: Development of a stateful inference service (scorer.py) capable of sub-millisecond fraud detection on Kafka streams.
  • Point-in-Time State Seeder: Implementation of seed_redis.py to synchronize historical warehouse state with the real-time feature store using Welford's algorithm.
  • GNN-based Campaign Detection: Transitioning to Graph Neural Networks (GNNs) for coordinated multi-entity attacks, as traditional classifier-based models (e.g., XGBoost) are inherently unsuited for capturing non-local relational patterns.
  • OOT Validation & Drift: Transitioning to Out-of-Time validation and implementing a retraining scheduler to simulate model performance under adversarial concept drift.
  • Seed Redis Robustness: Add existence checks for fact_transactions_gold.
  • Label Noise Calibration: Fine-tune FP/FN rates in fraud.rs for better model convergence.
  • Class Weight Balancing: Implement scale_pos_weight or sampling strategy in XGBoost pipeline.
  • Strict ID Sanitization: Explicitly drop all internal IDs (card_id, customer_id) during training feature engineering.
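Welford's algorithm, referenced in the Point-in-Time State Seeder item above, maintains running aggregates in a single pass. A minimal sketch (seed_redis.py may structure its implementation differently):

```python
class Welford:
    """Running mean/variance in one pass, numerically stable.
    (Illustrative; seed_redis.py may structure this differently.)"""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

w = Welford()
for amount in (100.0, 200.0, 300.0):
    w.update(amount)
# w.mean is 200.0; w.variance is the population variance, 20000/3.
```

Because only (n, mean, m2) need to be stored per customer, the state fits naturally into a Redis hash and supports incremental Z-score updates at scoring time.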

⚙️ Configuration & Tuning

  • Consolidated Control: Integration of all generation volume and parallelism settings into a centralized customer_config.yaml.
  • Modular Fraud Logic: Implementation of a profile-driven mutation engine that decouples adversarial patterns from core simulation code.
  • Product Catalog Centralization: Consolidation of card types, networks, and limits in product_catalog.yaml.
  • Configuration Robustness: Refactoring the configuration loader to provide graceful error handling and support for descriptive error messages.
  • Campaign Attack Implementation: Finalization of coordinated adversarial logic (currently disabled in configuration pending GNN-ready data structures).
  • Dependency & Code Hygiene: Perform security audit of Rust crates and remove deprecated "legacy" code blocks.

📊 Observability & Dashboards

  • Rust Metric Exporter: Integrate prometheus crate into the simulation engine to track TPS/performance.
  • Geographic Visualization: Implement a H3 Geomap panel for fraud hotspot visualization.
  • Materialized View Optimization: Pre-calculate dashboard metrics in ClickHouse to improve query performance.
  • Infrastructure Alerting: Define Prometheus alert rules for critical service failures.
  • Grafana Secret Externalization: Use GF_ environment variables instead of hardcoded creds in datasources.
  • ClickHouse Metrics Activation: Enable the port 9363 Prometheus endpoint in ClickHouse's config.xml.
  • DataSource UID Fixing: Explicitly set UIDs (ClickHouse, Prometheus) in datasources.yaml to prevent panel breakage.
  • Geomap Plugin Cleanup: Remove the deprecated 'worldmap-panel' and ensure the native 'geomap' panel is used for hotspot visualization.

How-to: Add a New Fraud Signature

This guide provides a task-oriented path for developers to inject new fraud behaviors into the RiskFabric engine.

1. Define the Profile

New fraud patterns are defined in src/generators/fraud.rs. Every profile needs:

  • A unique name.
  • A weighted probability in the configuration.
  • A Behavioral and Spatial signature.

2. Implement the Mutator

Add a new branch to the FraudMutator logic.

// Example skeleton
fn mutate_upi_scam(txn: &mut Transaction) {
    // Modify amount, location, or device
}

3. Register in Config

Update data/config/fraud_rules.yaml to include your new profile and its target weight.


Detailed guide coming soon.

Simulation & Generation Engines

This section documents the core modules responsible for agent-based modeling, transaction simulation, and adversarial mutation logic.

Batch Data Generator (generate.rs)

Summary

The generate.rs binary serves as the primary orchestration engine for creating large-scale, labeled synthetic datasets. It generates a complete ecosystem of customers, accounts, cards, and historical transactions, providing the "ground truth" required for training fraud detection models.

Architectural Decisions

The generator uses a chunked execution strategy to handle datasets that exceed available system memory. By processing cards in batches of 5,000, the generator maintains a stable memory profile regardless of the total population size. For spatial lookups, the system implements a multi-tier H3 index (resolutions 4 and 6) and a state-level index. This allows for rapid, localized merchant selection during transaction generation without exhaustive searching of the merchant reference dataset.
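The chunked strategy amounts to fixed-size batching. A sketch in Python (the engine does this in Rust with Rayon; the helper name is made up):

```python
def chunks(cards, size=5000):
    """Yield fixed-size batches so peak memory stays bounded regardless of
    total population size. (Illustrative; the engine batches in Rust.)"""
    for i in range(0, len(cards), size):
        yield cards[i:i + size]

# 12,001 cards split into two full batches plus a remainder.
batch_sizes = [len(b) for b in chunks(list(range(12001)), size=5000)]
```

Each batch is generated, written to a temporary Parquet chunk, and released before the next batch begins, which is why memory stays flat as the population grows.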

The choice of Apache Parquet as the output format ensures that multi-million row datasets remain compressed and performant for the downstream Python-based ML pipeline and Polars-based ETL.

System Integration

generate.rs sits at the start of the RiskFabric lifecycle. It consumes reference Parquet files for merchants and residential locations and produces the four core tables: customers.parquet, accounts.parquet, cards.parquet, and transactions.parquet (including its accompanying fraud_metadata.parquet).

Known Issues

The final merge phase is implemented by writing temporary Parquet chunks to disk and then re-scanning them with the Polars lazy API. While this prevents memory exhaustion during the final join, it introduces disk I/O overhead that affects the "cleanup" phase of generation. Additionally, the 5,000-card chunk size is currently hardcoded; moving this to customer_config.yaml would allow performance tuning based on available RAM capacity.

Streaming Generator (stream.rs)

Summary

The streaming generator produces unlabeled transactions at a configurable rate and publishes them to the raw_transactions Kafka topic for real-time scoring.

It reuses generate_transactions_chunk from the batch pipeline; the core generation logic is untouched. The one-pass architecture is preserved: transactions and fraud metadata are produced in a single traversal and separated at the output layer via UnlabeledTransaction, a struct that mirrors Transaction but omits is_fraud, chargeback, and every other label field. The Kafka payload is therefore guaranteed label-free at the type level.

The generator operates in two modes, controlled by streaming_mode in generator_config.yaml:

  • Pure streaming (streaming_mode: true) — behavioral mutations active, no labels assigned, no metadata collected. Used for live fraud detection.
  • Verification mode (streaming_mode: false) — identical Kafka output, but ground truth labels are captured internally to ground_truth.csv via FraudMetadata. Used to measure scorer precision/recall by joining against fraud_scores after a test run.

The rate limiter targets configurable throughput (default 100 tx/s) using a self-correcting mechanism — each send measures actual Kafka latency and sleeps only the remaining interval, preventing cumulative drift under variable broker response times.
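The self-correcting loop can be sketched as follows (illustrative Python; the actual generator is Rust on Tokio, and the function name is made up):

```python
import time

def stream(send, rate_tps=100.0, n=500):
    """Self-correcting rate limiter: after each send, sleep only the part
    of the interval not already consumed by broker latency, so slow sends
    do not accumulate drift. (Illustrative; the real generator is Rust.)"""
    interval = 1.0 / rate_tps
    for _ in range(n):
        start = time.monotonic()
        send()  # e.g. produce one transaction to Kafka
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)

# Quick demo with a no-op send: 5 events at 500 tx/s takes roughly 10 ms.
stream(lambda: None, rate_tps=500.0, n=5)
```

A naive `sleep(interval)` after every send would add broker latency on top of the interval, so the effective rate would fall below the target whenever the broker slows down; measuring and subtracting avoids that.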

The merchant population is loaded from data/references/ref_merchants.parquet and indexed at H3 resolutions 4 and 6 for spatial locality lookups during generation.

Known Issues

Population size is hardcoded to 1,000 customers, decoupled from the batch pipeline's 10,000-customer population. This should be moved to config to ensure Redis seeding and the streaming population are consistent.

Population Generator (customer_gen.rs)

Summary

The customer_gen.rs module is responsible for the foundational entity creation in the RiskFabric simulation. It generates a synthetic population of customers by synthesizing demographics, geographic data from OpenStreetMap (OSM) reference points, and financial behavioral profiles. This module ensures that every customer is "anchored" to a realistic physical and economic context.

Architectural Decisions

This generator is designed around a Constraint-Based Synthetic Model. Instead of simple randomization, the engine enforces correlations across different entity domains. For example, it programmatically links Credit Score to Age (using an age_weight factor) and Monthly Spend to Location Type (Metro vs. Rural). This ensures that the resulting dataset possesses the structural patterns expected in real-world financial data.

For geographic fidelity, a Spatial Jittering strategy is implemented. By adding a ~500m drift (0.005 degrees) to the original OSM residential nodes, the simulation avoids "clumping" effects where multiple customers would otherwise share identical coordinates. This jittering preserves the overall density of the reference data while providing unique home coordinates for every agent. Note that while transaction-level jitter is deterministic, the initial population jitter is currently stochastic.
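The residential jitter can be sketched as follows (illustrative Python; the engine applies this in Rust, and the helper name is made up):

```python
import random

JITTER_DEG = 0.005  # roughly 500 m of latitude

def jitter_home(lat, lon, rng):
    """Add a stochastic drift so co-located OSM nodes yield unique home
    coordinates. (Illustrative; the engine does this in Rust.)"""
    return (lat + rng.uniform(-JITTER_DEG, JITTER_DEG),
            lon + rng.uniform(-JITTER_DEG, JITTER_DEG))

rng = random.Random(7)
home_lat, home_lon = jitter_home(19.076, 72.8777, rng)  # a Mumbai-area node
```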

The generator uses Probabilistic Location Typing to classify customers into Metro, Urban, or Rural categories based on their proximity to city centers in the reference data. This classification serves as the primary driver for the financial heuristics used in the simulation.

System Integration

customer_gen.rs acts as the first stage of the generation pipeline. It consumes the ref_residential.parquet file and the customer_config.yaml configuration to produce a vector of Customer structs. This vector is passed downstream to the account and card generators to complete the entity hierarchy.

Known Issues

The entire residential reference dataset is currently loaded into memory using Polars' ParquetReader for every generation run. While efficient for populations up to 100,000 customers, this creates a significant memory bottleneck when scaling to millions of agents. Moving to a chunked or streaming approach for reading reference data is required. Additionally, the jitter range (0.005) is currently hardcoded in the source code; moving this to the configuration would allow for different levels of spatial precision.

Financial Entity Linking (account_gen.rs & card_gen.rs)

Summary

The account_gen.rs and card_gen.rs modules are responsible for constructing the financial "graph" of the simulation. They define the hierarchical relationships between customers and their payment instruments, ensuring that every transaction is linked to a valid account and card entity. This layer establishes the structural foundation required for testing entity-linking models and cross-account fraud detection.

Architectural Decisions

These generators prioritize Relational Consistency. Instead of generating accounts and cards in isolation, the system uses a top-down orchestration: Customers drive the creation of Accounts, which in turn drive the creation of Cards. This ensures that every card PAN is programmatically linked back to a specific customer ID, maintaining 100% referential integrity across the multi-million row dataset.

For Entity Density, a probabilistic account ownership model is implemented in account_gen.rs. While every customer is guaranteed a primary account, there is a 50% chance for a customer to own a secondary account (e.g., a "Credit" account in addition to a "Savings" account). This architectural decision allows the simulation to model complex multi-entity behaviors, such as "Balance Transfers" or "Cross-Account Velocity," which are common signals in sophisticated fraud patterns.

In card_gen.rs, an Account-Driven Mapping strategy is used. The card generator iterates over the accounts vector and issues a unique payment instrument for each. This one-to-one mapping simplifies the transaction generation logic while ensuring that the "issuing bank" metadata is correctly inherited from the parent account entity.

System Integration

These modules are the primary components of the batch generation pipeline. They are invoked by generate.rs immediately after the population has been created. The resulting vectors of Account and Card structs are then materialized into Parquet files and passed downstream to the transaction engine.

Known Issues

A hardcoded 50% probability for secondary account creation is currently used. This should be moved to customer_config.yaml to allow for more granular control over the "financial depth" of the population.

Furthermore, Card Metadata (like contactless_limit and online_limit) is currently initialized as empty strings. This prevents the simulation from enforcing realistic "Limit Breaches" during transaction generation. A "Product Catalog" lookup in card_gen.rs is required to populate these fields with realistic values based on the account type, which will enable a new class of "Limit-Based" fraud detection features.

Core Simulation Engine (transaction_gen.rs)

Summary

The transaction_gen.rs module is the primary logic engine of RiskFabric. It is responsible for simulating the financial lifecycle of every card in the system over a specified lookback period (default 365 days). It transforms static entity data into a high-fidelity stream of behavioral events, incorporating spatial realism, temporal patterns, and adversarial mutations in a single execution pass.

Architectural Decisions

The engine uses a One-Pass Parallel Architecture. By using rayon to iterate over cards, all logic—including merchant selection, timestamp generation, amount calculation, and fraud injection—occurs within a single parallelized loop. This eliminates the need for multi-pass joins and is a key factor in the project's performance.

For spatial realism, the system implements a Hierarchical Selection Strategy using H3 indices. Merchants are selected based on a probabilistic proximity model: 80% are "super-local" (Res 6), 15% are "district-level" (Res 4), 3% are "state-level," and 2% are "global." This creates realistic spending clusters around a customer's home while allowing for occasional travel or remote spending.
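The 80/15/3/2 tier draw can be sketched as a cumulative-weight sample (illustrative Python; the engine performs this in Rust against the H3 indices):

```python
import random

TIERS = [("res6_local", 0.80), ("res4_district", 0.15),
         ("state", 0.03), ("global", 0.02)]

def pick_tier(rng):
    """Sample one merchant-selection tier from the 80/15/3/2 distribution.
    (Illustrative; the engine does this in Rust.)"""
    r = rng.random()
    acc = 0.0
    for name, weight in TIERS:
        acc += weight
        if r < acc:
            return name
    return TIERS[-1][0]  # guard against floating-point round-off

rng = random.Random(42)
counts = {}
for _ in range(10_000):
    tier = pick_tier(rng)
    counts[tier] = counts.get(tier, 0) + 1
# counts["res6_local"] lands near 8,000, matching the 80% weight.
```

Once a tier is chosen, the engine only searches the merchants indexed under the corresponding H3 cell (or state bucket), which is what keeps selection fast.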

To ensure reproducibility, Deterministic Seeding is used at the card level. Every card's random number generator is seeded with a combination of the global seed, a salt, and a hash of the card ID. This ensures that a specific card will always generate the exact same transaction history across different runs, provided the global configuration remains unchanged.
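The per-card seeding scheme can be sketched as follows (illustrative Python; the Rust engine combines the same three inputs, though not necessarily via SHA-256):

```python
import hashlib

def card_seed(global_seed: int, salt: str, card_id: str) -> int:
    """Derive a stable 64-bit RNG seed from global seed + salt + card ID.
    (Illustrative: the Rust engine combines the same inputs, though not
    necessarily with SHA-256.)"""
    digest = hashlib.sha256(f"{global_seed}:{salt}:{card_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Same inputs always reproduce the same seed; different cards diverge.
seed_a = card_seed(7, "txn", "CARD-001")
seed_b = card_seed(7, "txn", "CARD-002")
```

Because each card's stream is independent of iteration order, the parallel Rayon loop can schedule cards in any order without affecting the generated history.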

System Integration

This engine is the central utility consumed by both the Batch Generator (generate.rs) and the Streaming Generator (stream.rs). It acts as a pure function that takes configuration, spatial indices, and entity maps as input and produces vectors of Transaction and FraudMetadata as output.

Known Issues

Timestamp generation is implemented by sorting a local vector of dates for each card. While this ensures that transactions are chronologically ordered per card, it does not guarantee a global chronological order across the entire dataset during batch generation. ClickHouse is currently used to perform the final global sort.

Additionally, the spatial distribution weights (80/15/3/2) are hardcoded directly into the logic. Moving these to transaction_config.yaml would allow users to simulate different mobility profiles—for example, a "commuter" population would require a higher Res 4 weight compared to a "rural" population.

Adversary Logic Engine (fraud.rs)

Summary

The fraud.rs module contains the "attack logic" of RiskFabric. It defines the specific behavioral rules used to mutate legitimate transactions into adversarial patterns. This module ensures that synthetic fraud reflects realistic criminal tactics such as velocity abuse, account takeovers, and coordinated campaigns.

Architectural Decisions

This module follows a Profile-Driven Mutation Strategy. The engine interprets profiles from fraud_rules.yaml to dynamically adjust transaction attributes, rather than using hardcoded fraud logic. This allows for experimentation with new fraud signatures without modifying the core simulation code.

For Behavioral Mimicry, a relative amount calculation strategy is implemented. By allowing an attacker to spend within a multiplier range of the customer's average transaction amount (e.g., 0.8x to 1.2x), the engine simulates subtle, low-value fraud that is difficult for simple rule-based systems to detect.
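The multiplier-band mutation can be sketched as follows (illustrative Python; the band endpoints come from the example above, and the helper name is made up):

```python
import random

def mutate_amount(avg_amount, rng, low=0.8, high=1.2):
    """Keep the fraudulent amount within a multiplier band of the victim's
    average spend, so it blends into their normal distribution.
    (Illustrative helper; the engine implements this in Rust.)"""
    return avg_amount * rng.uniform(low, high)

rng = random.Random(3)
amount = mutate_amount(1000.0, rng)  # somewhere in [800, 1200]
```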

To simulate Stateful Attacks, the apply_campaign_logic function is used. This allows the generator to override standard spatial and device signals with persistent attacker metadata (e.g., a shared IP or fixed coordinates). This architectural decision is critical for generating the clustered signals that modern graph-based fraud models are designed to identify.

System Integration

fraud.rs is a stateless logic provider consumed by the transaction_gen.rs module. It acts as a specialized "mutation filter" that takes a completed transaction and a fraud profile and returns a set of behavioral anomalies.

Known Issues

String-based matching (e.g., f_type == "account_takeover") is currently used to determine which mutation logic to apply. This is a fragile pattern that could lead to silent failures if a typo is introduced in the YAML configuration. Refactoring these into a proper Enum would ensure compile-time safety and better performance. Additionally, the calculate_fraud_timestamp logic is currently limited to two specific fraud types; generalizing this to support a wider range of temporal attack patterns is needed.

Central Configuration Engine (config.rs)

Summary

The config.rs module is the architectural backbone of RiskFabric. It provides a strongly-typed, unified interface for all behavioral and operational parameters of the simulation. By mapping multiple YAML files into a hierarchical Rust structure, it ensures that every component—from the simulation engine to the machine learning pipeline—operates with a consistent and validated world-view.

Architectural Decisions

This engine is designed to enforce Type-Safe Behavioral Modeling. Instead of using loose key-value pairs or dynamic JSON, a deep hierarchy of nested structs is implemented. This leverages Rust’s compiler to ensure that any change to the configuration schema in one part of the system is immediately reflected and validated in every other part.

The use of Atomic Multi-File Loading is a critical architectural decision. The AppConfig::load() method reads five separate YAML files (fraud_rules, fraud_tuning, customer_config, transaction_config, and product_catalog) and synthesizes them into a single AppConfig object. This separation of concerns allows specific domains (like "Product Catalog" or "Fraud Rules") to be tuned in isolation without creating massive, unmanageable configuration files.

Safety Defaults are also implemented using serde macros. This ensures that the simulation remains resilient even if the underlying YAML files are missing non-essential keys, providing sensible fallbacks for parameters like the streaming_rate.
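
A Python analogue of this serde-default behavior, assuming hypothetical field names (in Rust, the same effect comes from `#[serde(default)]` attributes on the config structs):

```python
from dataclasses import dataclass, fields

@dataclass
class StreamingConfig:
    # Field defaults play the role of serde's #[serde(default)] fallbacks:
    # keys missing from the YAML simply keep these values.
    streaming_rate: int = 1000
    batch_size: int = 500

def load_streaming_config(raw):
    """Build a config from a parsed-YAML dict, ignoring unknown keys and
    falling back to defaults for missing ones."""
    known = {f.name for f in fields(StreamingConfig)}
    return StreamingConfig(**{k: v for k, v in raw.items() if k in known})
```

This keeps a partially populated YAML file from crashing the load while still surfacing every parameter through a single typed object.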

System Integration

config.rs is widely consumed across the codebase. It is initialized at the entry point of every binary (generate, stream, etl, ingest) and is passed down into the generators as a shared reference. This ensures that the "rules of the world" are identical across the batch, streaming, and ETL layers.

Known Issues

fs::read_to_string and expect calls are currently used in the load() method. This causes the application to panic immediately if a config file is missing or contains a syntax error. While acceptable for a CLI tool, refactoring to return a Result type is required to allow for more graceful error handling and reporting. Additionally, the file paths for the YAML configs are currently hardcoded relative to the project root; a more flexible path resolution strategy is needed to allow RiskFabric to be executed from different directories.

Data Engineering & Warehouse

This section documents the ETL pipelines, warehouse ingestion utilities, and geographic reference preparation tools used to build the RiskFabric environment.

ETL Pipeline System (etl.rs & src/etl/)

Summary

The ETL (Extract, Transform, Load) system is the transformation engine of RiskFabric. It is responsible for converting raw, "bronze" level synthetic transactions into "silver" behavioral features and finally into a "gold" master table ready for machine learning. The system is designed to handle large datasets by leveraging Polars for local transformations and ClickHouse for large-scale joins and persistence.

Architectural Decisions

The system follows a Medallion Architecture (Bronze → Silver → Gold) to ensure data lineage and modularity.

  • Bronze: Raw data as generated by generate.rs.
  • Silver: Subject-specific feature engineering (Customer, Merchant, Sequence, Network, Campaign, and Device/IP). These are calculated using Polars' lazy evaluation for performance.
  • Gold: The final flattened "master" table.

A key design choice is the Hybrid Execution Model. While the feature logic is implemented in Rust using Polars, the pipeline orchestrates data movement between ClickHouse (the primary warehouse) and local memory via Parquet. This allows complex, stateful calculations in Rust (like Welford's algorithm for running variance) that are difficult to express in pure SQL, while still using ClickHouse for efficient storage and final broad joins.
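
Welford's algorithm, mentioned above, is a single-pass update of count, mean, and sum of squared deviations (M2). A minimal sketch of the update step (the Rust implementation is equivalent in structure):

```python
def welford_update(count, mean, m2, x):
    """One step of Welford's online algorithm for running mean/variance."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def population_variance(count, m2):
    """Recover the population variance from the accumulated state."""
    return m2 / count if count > 0 else 0.0
```

Because each update touches only three numbers, the same state can be carried across batches or stored per-card in a feature store without rescanning history.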

System Integration

The ETL system acts as the connective tissue between the Data Generation layer and the Machine Learning layer. It reads from ClickHouse tables (populated via ingest.rs), performs transformations, and writes the results back to ClickHouse. The final fact_transactions_gold table is the direct source for the Python-based training pipeline.

Known Issues

The system currently uses podman exec calls to interact with ClickHouse from within the Rust binary. This approach depends on the local environment's container runtime and shell availability. Transitioning to a proper ClickHouse client library (like clickhouse-rs) will make the pipeline more portable and robust. Additionally, the GoldMaster stage is currently implemented as a raw SQL join in ClickHouse, which duplicates some of the logic found in gold_master.rs. Unifying these two approaches will ensure the batch and streaming feature definitions remain consistent.

Behavioral Feature Engineering (src/etl/features/)

Summary

The src/etl/features/ directory contains the core analytical logic of RiskFabric. It defines how raw synthetic transactions are transformed into behavioral features across multiple domains: Customer history, Merchant risk, Transaction sequences, and Network relationships. These features provide the high-dimensional context required for modern fraud detection models to identify subtle adversarial patterns.

Architectural Decisions

This layer is designed to prioritize Domain-Specific Modularity. By separating feature sets into dedicated modules (e.g., network.rs, sequence.rs), independent iteration on different detection strategies is possible. This modularity ensures that the ETL pipeline can be easily extended with new behavioral signals (like graph-based features or deep-temporal windows) without refactoring the entire transformation engine.

For Transaction Sequencing, a window-based approach is implemented using Polars' shift and over functions. This allows for the calculation of complex stateful features like spatial_velocity and amount_deviation_z_score without the overhead of row-by-row iteration. The decision to perform these calculations at the "Silver" layer ensures that the final "Gold" master table is pre-enriched with predictive signals, reducing the training time for downstream models.
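
A pure-Python sketch of the lag-based `spatial_velocity` calculation, mirroring the Polars `shift().over("card_id")` pattern. It deliberately uses the same flat Euclidean degree distance the current ETL uses (see Known Issues below); the `(card_id, ts_hours, lat, lon)` row layout is illustrative:

```python
def spatial_velocity(rows):
    """Distance-per-hour between consecutive transactions of the same card.
    The first transaction of each card has no prior row and scores 0.0."""
    last = {}  # card_id -> (ts_hours, lat, lon) of the previous transaction
    out = []
    for card, ts, lat, lon in rows:
        if card in last:
            pts, plat, plon = last[card]
            dist = ((lat - plat) ** 2 + (lon - plon) ** 2) ** 0.5
            dt = max(ts - pts, 1e-9)  # guard against zero time gaps
            out.append(dist / dt)
        else:
            out.append(0.0)
        last[card] = (ts, lat, lon)
    return out
```

In the real pipeline this is a vectorized window expression rather than a loop, but the per-card lag semantics are the same.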

In the Network Intelligence module, a "Proxy Entity" strategy is used. Instead of building a full N:N customer relationship graph (which is memory-intensive), the risk reputation of shared entities like IP addresses and User Agents is calculated. This allows the system to identify "Suspicious Clusters" where multiple customers share a single high-fraud entity, capturing coordinated attack signals with high computational efficiency.
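
The proxy-entity idea can be sketched as a per-IP aggregation rather than a customer graph (field names and the `min_customers` threshold are illustrative):

```python
from collections import defaultdict

def ip_risk_reputation(events, min_customers=2):
    """Score each shared IP by its observed fraud rate and the number of
    distinct customers that touch it; flag multi-customer IPs with any
    fraud as suspicious clusters."""
    stats = defaultdict(lambda: {"customers": set(), "txns": 0, "fraud": 0})
    for ip, customer_id, is_fraud in events:
        s = stats[ip]
        s["customers"].add(customer_id)
        s["txns"] += 1
        s["fraud"] += int(is_fraud)
    return {
        ip: {
            "fraud_rate": s["fraud"] / s["txns"],
            "shared_customers": len(s["customers"]),
            "suspicious_cluster": len(s["customers"]) >= min_customers
                                  and s["fraud"] > 0,
        }
        for ip, s in stats.items()
    }
```

Memory scales with the number of distinct IPs rather than with customer pairs, which is what makes this tractable compared to a full N:N relationship graph.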

System Integration

These modules are the primary transformation components of the etl.rs binary. They consume "Bronze" tables from ClickHouse and produce "Silver" feature tables. The logic defined here is also mirrored in the scorer.py service to ensure training-serving parity during real-time inference.

Known Issues

A simple Euclidean distance formula is currently used for Spatial Velocity calculations. As noted in the etl_schema.md, this approximation becomes inaccurate over large distances. Implementation of the Haversine formula within the Polars transformation is required to ensure geographic precision.
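
For reference, the Haversine great-circle distance the Known Issue calls for looks like this (a standalone sketch; in the pipeline it would be expressed as a Polars expression over the lat/lon columns):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

Unlike the flat Euclidean approximation, this stays accurate for cross-country hops, which is exactly where spatial-velocity anomalies matter most.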

Furthermore, the Campaign Detection logic in campaign.rs is currently based on a fixed 48-hour time gap. This is a heuristic that may fail to capture long-running, low-frequency attack campaigns. This threshold should be moved to the configuration or a more dynamic "Sessionization" strategy implemented to account for different adversarial behaviors.

Physical World Transformation (warehouse/)

Summary

The warehouse/ directory contains the SQL-based transformation logic for RiskFabric's physical environment. Using dbt (data build tool) and Postgres/PostGIS, this layer transforms raw OpenStreetMap (OSM) nodes into the "Physical World" reference data (Merchants and Residential points) used by the simulation engine.

Architectural Decisions

This layer prioritizes Geographic High-Fidelity. Instead of relying on the often inconsistent "state" and "district" tags in OSM, a Spatial Join Strategy is implemented. By performing ST_Intersects operations against official geographic boundaries (provided by DataMeet), the transformation layer provides a verified ground truth for every coordinate in the simulation. This ensures that a customer living in "Mumbai" is programmatically anchored to the correct state and district boundaries, which is critical for realistic spatial velocity calculations.

For Merchant Risk Profiles, a categorical mapping strategy is implemented in the stg_merchants model. By mapping raw OSM sub-categories (like jewelry or electronics) to standardized RiskFabric categories and risk levels (LOW, MEDIUM, HIGH), the "Adversarial Ground Truth" is established for the simulation. This architectural decision allows the fraud engine to select high-risk merchants for specific attack profiles without needing to embed merchant-level risk logic into the Rust binaries.
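
The categorical mapping can be sketched as a simple lookup with a conservative default (the table below is a tiny illustrative subset; the real rules live in the dbt `stg_merchants` model):

```python
# Illustrative subset of the OSM sub-category -> (category, risk) mapping.
OSM_TO_RISK = {
    "jewelry": ("JEWELLERY", "HIGH"),
    "electronics": ("ELECTRONICS", "HIGH"),
    "supermarket": ("GROCERY", "LOW"),
    "pharmacy": ("HEALTH", "LOW"),
}

def classify_merchant(osm_subcategory):
    """Map a raw OSM shop tag to a standardized (category, risk_level) pair,
    defaulting unknown tags to MEDIUM risk."""
    return OSM_TO_RISK.get(osm_subcategory, ("OTHER", "MEDIUM"))
```

Keeping this mapping in SQL/dbt rather than in the Rust binaries means a risk-policy change is a model rebuild, not a recompile.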

System Integration

The dbt layer acts as the "Level 0" enrichment engine. It consumes the raw tables populated by prepare_refs.rs and produces the mart_residential and mart_merchants models. These models are then exported to Parquet via export_references.rs or dlt/pipelines.py to be used as the primary lookup data for the simulation generators.

Known Issues

Spatial Joins are performed on every run for the mart models. While this ensures data quality, it is computationally expensive and slow when processing millions of Indian OSM nodes. A "Spatial Indexing" strategy should be implemented or the boundary results materialized into a lookup table to reduce the processing time.

Furthermore, the City Normalization logic is currently based on a simple regex-based macro. This fails to handle the wide variety of spelling variations and transliteration errors found in raw Indian OSM data. A fuzzy-matching strategy or integration of a dedicated geographic gazetteer is needed to ensure more robust city-level clustering in the simulation.

Data Warehouse Ingestor (ingest.rs)

Summary

The ingest.rs binary is the primary data loading utility that populates the RiskFabric data warehouse (ClickHouse). It consumes the raw Parquet output from the batch generator and transforms it into structured "Bronze" tables, providing the necessary foundation for downstream ETL and machine learning operations.

Architectural Decisions

The ingestor handles the initial schema enforcement for the warehouse. A key architectural decision is the use of a two-stage ingestion process for transactions. First, raw data is loaded into fact_transactions_bronze_raw with all fields preserved as strings or basic types. Then, ClickHouse's parseDateTime64BestEffort performs a high-performance conversion into a typed DateTime64 column for the final fact_transactions_bronze table. This approach ensures that data is not lost because of formatting mismatches during the initial bulk load.
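
A Python sketch of the stage-two "best effort" typing idea (the real pipeline delegates this to ClickHouse's `parseDateTime64BestEffort`; the candidate format list here is illustrative):

```python
from datetime import datetime

# Timestamp shapes the raw bronze layer might plausibly contain.
CANDIDATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
]

def parse_best_effort(raw):
    """Try each known format in turn; keep the original string if none
    match, so no row is dropped during the bulk load."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return raw  # preserved as-is for later inspection
```

The key property is the fallback: a malformed timestamp survives the load as a string instead of aborting the batch, matching the "data is not lost" goal of the two-stage design.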

The utility is idempotent, automatically dropping and recreating tables on every run. This simplifies the development lifecycle by ensuring the warehouse reflects the latest state of the synthetic generation configuration.

System Integration

ingest.rs acts as the bridge between the File System layer and the Warehouse layer. It interacts directly with the podman container runtime to execute commands against the riskfabric_clickhouse instance. It is the prerequisite for the etl.rs pipeline, which expects the tables defined here to be present and populated.

Known Issues

Data is currently piped into the warehouse using shell-based cat and podman exec commands. This is inefficient for large datasets and introduces a dependency on the host's shell environment. Refactoring this to use the ClickHouse HTTP interface or a native Rust client will allow for more reliable bulk inserts.

Furthermore, the warehouse schema in ingest.rs has drifted from the Rust model definitions in src/models/. For example:

  • The dim_accounts table in the warehouse is missing the bank_id and account_no fields present in account.rs.
  • The dim_cards table is missing over 10 fields, including issue_date, activation_date, and all usage limit fields defined in card.rs.
  • The dim_customers schema is more aligned but still represents a manual duplication of the Customer struct.

Unifying these schemas, ideally by deriving the ClickHouse DDL directly from the Rust structs, will ensure the warehouse remains a high-fidelity representation of the synthetic population.

Reference Data Preparator (prepare_refs.rs)

Summary

The prepare_refs.rs binary is the "world-building" utility of RiskFabric. It is responsible for ingesting, filtering, and normalizing raw OpenStreetMap (OSM) data and other geographic datasets to create the high-performance reference files used by the simulation generators. It handles the task of mapping physical coordinates to behavioral entities like merchants, residential points, and financial institutions.

Architectural Decisions

This utility is designed to handle Parallel OSM Parsing using the osmpbf library and rayon. Since the raw India PBF file is several gigabytes in size, the preparator uses a map-reduce strategy to extract relevant nodes (residential buildings, shops, and amenities) across all available CPU cores. This allows for the processing of a country's entire geographic dataset in minutes rather than hours.

A key architectural choice is the implementation of Fuzzy State Normalization. OSM data is often inconsistent, with the same state appearing in multiple formats (e.g., "AP," "Andhra Pradesh," or "Andra Pradesh"). A rule-based normalization engine standardizes these variations, ensuring that downstream generators can reliably perform state-level joins and spatial indexing without data gaps.
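
A minimal sketch of rule-based state normalization (the alias table is a tiny illustrative subset of the real rule set):

```python
# Illustrative aliases: abbreviations, canonical names, and known misspellings.
STATE_ALIASES = {
    "ap": "Andhra Pradesh",
    "andhra pradesh": "Andhra Pradesh",
    "andra pradesh": "Andhra Pradesh",  # common OSM misspelling
    "tn": "Tamil Nadu",
    "tamil nadu": "Tamil Nadu",
}

def normalize_state(raw):
    """Lowercase, collapse whitespace, then look up known aliases;
    fall back to title-casing the cleaned input."""
    key = " ".join(raw.strip().lower().split())
    return STATE_ALIASES.get(key, raw.strip().title())
```

The lookup is O(1) per node, which matters when normalizing millions of OSM tags in the parallel extraction pass.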

A Postgres-Based Staging Layer is also integrated for the extraction process. By using the BinaryCopyInWriter for bulk insertion, the preparator moves millions of extracted nodes into a structured database with minimal overhead. This staging layer allows for complex SQL-based cleaning and verification before the final reference Parquet files are exported.

System Integration

prepare_refs.rs is a standalone "Level 0" utility that must be run before synthetic data generation. It populates the data/references/ directory with ref_merchants.parquet, ref_residential.parquet, and other critical lookup tables. These files are then consumed by generate.rs, stream.rs, and customer_gen.rs.

Known Issues

A hardcoded Postgres connection string (postgres://harshafaik:123@localhost:5432/riskfabric) is currently used within the CLI defaults. This is a security and portability issue; it should be moved to an environment variable or a configuration file. Additionally, the utility lacks a unified "Export to Parquet" command—it populates Postgres, but the final conversion to Parquet is often handled by separate, manual scripts. Consolidating the end-to-end pipeline (OSM → Postgres → Parquet) into this single binary would improve the developer experience.

Reference Data Exporter (export_references.rs)

Summary

The export_references.rs binary is the final stage of the reference data preparation pipeline. It extracts cleaned and processed geographic data from the staging database (Postgres) and serializes it into the high-performance Parquet format required by the simulation generators. This utility ensures that the "synthetic world" is correctly typed, indexed, and portable across different environments.

Architectural Decisions

This utility is designed to act as the Final Schema Validator for the reference data. While the prepare_refs.rs utility handles raw extraction and normalization, the exporter ensures that the data is structured exactly as expected by the generators. By using Polars to build the final DataFrames, high-performance memory management and efficient Parquet serialization are leveraged, which is critical when dealing with millions of reference nodes.

A key architectural choice is the Database-to-Parquet decoupling. By exporting processed staging tables into standalone Parquet files, the simulation environment becomes portable. This allows the core RiskFabric generators to run without a live Postgres connection, simplifying the deployment and execution of the simulation on local workstations or in CI/CD pipelines.

System Integration

export_references.rs is a "Level 0" utility that bridges the Staging layer (Postgres) and the Generation layer (Parquet). It is typically run after prepare_refs.rs and any subsequent SQL-based cleaning has been performed on the staging tables. The resulting Parquet files in data/references/ are the direct input for generate.rs, stream.rs, and the various generator modules.

Known Issues

A hardcoded Postgres connection string (postgres://harshafaik:123@localhost:5432/riskfabric) is currently used directly in the source code. This is a duplicate of the issue in prepare_refs.rs and should be unified into a shared configuration or environment variable. Additionally, the exporter manually maps Postgres rows into local vectors before creating the Polars DataFrame. This is inefficient for extremely large datasets; refactoring to use a streaming connector or a more direct Polars-Postgres integration is needed to reduce the memory overhead of the export process.

Reference Data Pipeline (dlt/pipelines.py)

Summary

The dlt/pipelines.py script is the Modern Data Stack (MDS) integration for RiskFabric. It uses the dlt (Data Load Tool) library to manage the extraction and movement of cleaned, enriched geographic data from the staging database (Postgres) into the optimized Parquet reference files used by the generators.

Architectural Decisions

This pipeline is designed to facilitate Declarative Reference Data Export. Instead of custom SQL-to-Parquet conversion logic (as seen in export_references.rs), this script leverages the dlt library’s built-in support for the "filesystem" destination. This allows for automated schema handling and standardized Parquet formatting, which is critical for maintaining consistency between the OSM-derived reference data and the Rust-based simulation.

A key architectural choice was the use of write_disposition="replace". Since the reference data (merchants and residential nodes) represents a "static" world that is fully rebuilt after every OSM extraction, this strategy ensures that the data/references/ directory always contains a clean snapshot of the environment without manual cleanup.

System Integration

dlt/pipelines.py acts as an alternative or supplementary utility to export_references.rs. It bridges the Staging layer (Postgres) and the Local File System layer. It is typically run as part of the "Level 0" world-building phase, specifically after dbt has transformed the raw OSM nodes into the mart_residential and mart_merchants models.

Known Issues

Environment variables (e.g., DESTINATION__FILESYSTEM__BUCKET_URL) are currently used to configure the DLT pipeline directly within the Python script. This approach is fragile and makes it difficult to change the reference directory without modifying the code. These should be moved into a dedicated dlt_config.toml file to align with the library’s best practices. Additionally, the pipeline currently lacks Data Validation tests; dlt "checks" should be implemented to ensure that the exported Parquet files contain the expected number of rows and non-null H3 indices before they are handed off to the generation engine.

Machine Learning Systems

This section documents the model training pipelines, real-time inference services, and metadata utilities required for detecting synthetic fraud patterns.

Machine Learning Training Pipeline (train_xgboost.py)

Summary

The train_xgboost.py script is the primary model development engine for RiskFabric. It extracts features from the ClickHouse "Gold" layer and trains an XGBoost classifier to detect synthetic fraud patterns. It evaluates the learnability of the generated fraud signatures by industry-standard algorithms.

Architectural Decisions

An "Operational Feature" policy is implemented in the training script to prevent data leakage. While the synthetic generator provides explicit labels like geo_anomaly and fraud_type for verification, these are strictly excluded from training. Instead, the model is forced to learn from behavioral proxies such as amount_deviation_z_score, spatial_velocity, and hour_deviation_from_norm. This ensures that the model's performance reflects real-world detectability rather than just learning internal generator flags.
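
The exclusion policy amounts to a denylist applied before training (the column names below are drawn from this document; treat the exact list as illustrative):

```python
# Generator-internal ground-truth labels that must never reach the model.
LEAKY_COLUMNS = {"fraud_type", "geo_anomaly", "fraud_target"}

def operational_features(columns):
    """Keep only operational columns, dropping anything the generator
    emits purely for verification."""
    return [c for c in columns if c not in LEAKY_COLUMNS]
```

Running this filter on the Gold table's column list before building the training matrix guarantees the model only ever sees behavioral proxies.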

The choice of XGBoost with Native Categorical Support allows the model to process high-cardinality fields like merchant_category and transaction_channel directly, without the memory overhead of one-hot encoding. This maintains performance as the synthetic merchant population scales.

System Integration

The training pipeline is the final "offline" consumer of the Data Warehouse layer. It uses the clickhouse-connect library to pull data directly into Polars DataFrames for training. The resulting model is serialized to models/fraud_model_v1.json, which is consumed by the scorer.py service for real-time inference in the streaming pipeline.

Known Issues

A simple 80/20 train/test split with stratification is currently used, but time-series validation is missing. Since fraud patterns evolve over time, a random split can lead to optimistic performance estimates by allowing the model to see future patterns during training. A walk-forward validation strategy is required to better simulate real production deployments. Additionally, XGBoost hyperparameters (like max_depth=6) are currently hardcoded; these should be moved to a ml_tuning.yaml configuration file to allow for automated hyperparameter optimization.
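
The missing walk-forward strategy could be sketched as expanding-window index splits over time-ordered rows (a hypothetical helper, not the project's code):

```python
def walk_forward_splits(n_rows, n_folds=3):
    """Yield (train_end, test_start, test_end) index triples for
    expanding-window validation over time-ordered data: each fold trains
    on everything strictly before its test block, so no future rows leak
    into training."""
    fold = n_rows // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_end = fold * i
        yield train_end, train_end, min(train_end + fold, n_rows)
```

Sorting the Gold table by timestamp and slicing with these triples replaces the random 80/20 split while keeping the rest of the training script unchanged.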

Real-Time Scoring Service (scorer.py)

Summary

The scorer.py service is the production inference engine of RiskFabric. It consumes unlabeled transaction events from Kafka, performs sub-millisecond feature engineering using a Redis-backed feature store, and applies the trained XGBoost model to generate real-time fraud probabilities. It is the final link in the streaming pipeline, providing the "Detection" half of the simulation.

Architectural Decisions

This service is designed around a Stateful Micro-Batching Architecture. To balance high throughput with low latency, feature engineering is performed for each transaction individually, but the final model predictions are grouped into batches of 50. This reduces the overhead of XGBoost inference and ClickHouse persistence while maintaining a P99 latency of approximately 12ms per transaction.

For real-time feature engineering, Welford’s Algorithm is implemented to maintain running means and standard deviations within Redis. This allows for the calculation of an "Operational" amount_deviation_z_score for every transaction without needing to scan historical Parquet files or perform heavy SQL queries. This stateful approach is critical for simulating how behavioral anomalies are detected on a "live" stream.

The service maintains Feature Alignment with the training pipeline by dynamically reordering and casting incoming features to match the exact schema and types (categorical, float, int) exported from the fraud_model_v1.json booster. This prevents "training-serving skew," ensuring that the model's performance in production matches its performance during validation.
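
The alignment step can be sketched as reorder-plus-cast against the schema extracted from the booster (names, dtype tags, and the default-to-zero policy are illustrative):

```python
def align_features(row, expected_order, dtypes):
    """Reorder and cast one incoming event to the column order and types
    the model was trained with; missing keys default to 0 so a sparse
    event still produces a full-width feature vector."""
    casts = {"float": float, "int": int, "categorical": str}
    return [casts[dtypes[name]](row.get(name, 0)) for name in expected_order]
```

Driving `expected_order` and `dtypes` from the model artifact (rather than hardcoding them in the scorer) is what keeps training and serving from drifting apart.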

System Integration

scorer.py sits at the exit point of the Streaming layer. It consumes from the raw_transactions Kafka topic (populated by stream.rs) and writes its decisions to both the fraud_scores ClickHouse table and a downstream Kafka topic for automated blocking. It depends on Redis for its behavioral context and ClickHouse for long-term audit logging and performance monitoring.

Known Issues

A hardcoded THRESHOLD = 0.85 is currently used for flagging transactions as fraud. This should be moved to a configuration file (or a dynamic service) to allow for easier tuning of the precision-recall trade-off. Furthermore, the hour_deviation_from_norm feature is currently a placeholder (0.0). The temporal aggregation logic needs to be implemented in seed_redis.py and the result fetched from Redis so that the model has access to its full set of behavioral signals during real-time inference.

Model Metadata Utility (dump_model.py)

Summary

The dump_model.py script is a specialized inspection utility used to extract the internal schema and feature definitions from a serialized XGBoost model. It ensures that the real-time scoring engine (scorer.py) has exact visibility into the feature names and data types (categorical, float, integer) expected by the binary booster.

Architectural Decisions

This utility is designed to solve the Feature Alignment Problem in production ML. When an XGBoost model is saved as a JSON booster, it encodes its expected input schema. If the inference engine sends features in the wrong order or with the wrong data types, the model may crash or return incorrect results. By using get_booster().feature_names, this utility provides a programmatically verifiable source of truth for the inference interface, allowing the scorer.py to dynamically reorder and cast its input DataFrames to match the model's training state.

The implementation of JSON-Path Extraction for categorical features is a critical design choice. Since XGBoost's native categorical encoding is serialized within the learner block of the JSON file, this utility parses those internal dictionaries. This architectural safety measure ensures verification of the categorical "levels" (e.g., specific merchant categories) the model was exposed to during training, preventing "Unknown Category" errors during real-time scoring.

System Integration

dump_model.py is an auxiliary utility in the Machine Learning layer. It is typically run after train_xgboost.py to verify the model artifact before it is deployed to the scoring service. It acts as a manual "Gatekeeper" for ensuring feature consistency across the pipeline.

Known Issues

A fragile, regex-based approach (re.findall) is currently used to extract categorical strings from the XGBoost JSON. This is an unreliable method that depends on the specific serialization format of the XGBoost version being used. A more robust parser that follows the official XGBoost JSON schema is required. Additionally, the utility currently only prints the metadata to the console; refactoring is needed to export a structured schema.yaml file that the scorer.py can load automatically to configure its inference pipeline.

Infrastructure & Operations

This section documents the local service stack, orchestration configuration, and state synchronization utilities used to operate the RiskFabric environment.

Infrastructure & Local Service Stack

Summary

The RiskFabric simulation is supported by a comprehensive local service stack orchestrated via Docker Compose. This infrastructure provides the multi-modal data environment—relational, columnar, stream, and cache—required to simulate a modern financial technology ecosystem. It enables the end-to-end lifecycle of synthetic data, from geographic world-building to real-time adversarial detection.

Architectural Decisions

The infrastructure is designed using a Multi-Model Database Strategy. By incorporating ClickHouse for high-volume transactions and Postgres/PostGIS for geographic preparation, each stage of the simulation uses the optimal storage engine for its specific data type. The inclusion of Redpanda (a Kafka-compatible event store) and Redis facilitates the real-time scoring path, allowing the simulation to model the sub-millisecond latency requirements of production fraud systems.

For Observability, Prometheus and Grafana are integrated directly into the core stack. This architectural decision transforms RiskFabric from a simple data generator into a performance benchmarking environment. By instrumenting the database exporters and the real-time scorer, system metrics (e.g., Kafka ingestion lag, Redis lookup latency, and model inference time) can be visualized in real-time, providing visibility into the operational impact of different fraud detection strategies.

The use of Healthchecks across all critical services ensures that the generation binaries (ingest.rs, etl.rs) only attempt to connect when the infrastructure is ready. This improves the developer experience by reducing connection-refused errors during the initial cold-start of the simulation environment.

System Integration

The infrastructure is the foundation upon which all RiskFabric binaries execute. The Rust-based generators and Python-based ML services connect to these containers via standardized ports and internal networks. The scorer service is configured to run as a long-lived container, automatically subscribing to the Kafka stream as soon as the stack is up.

Known Issues

A Single-Node Redpanda instance without persistence is currently used. While this is sufficient for local development, it does not support testing "Consumer Group Rebalancing" or "Partition-Level Parallelism," which are common challenges in production streaming systems. A multi-node Redpanda cluster configuration is required to support high-availability testing scenarios.

Furthermore, Postgres and ClickHouse credentials are currently hardcoded as harshafaik:123 across the docker-compose.yml. This security vulnerability prevents the stack from being used in shared or public environments. These credentials should be moved to an .env file, with Docker Secrets used to manage sensitive information more securely.

Redis Feature Seeder (seed_redis.py)

Summary

The seed_redis.py script is an operational utility that initializes the real-time feature store (Redis) with historical data from the warehouse (ClickHouse). It bridges the gap between the batch-trained model and the streaming inference engine by ensuring that every card and customer has immediate behavioral context before real-time transactions start arriving.

Architectural Decisions

This seeder is designed to facilitate Warm-Start Inference. Without this script, the first few transactions for every card in the streaming pipeline would be difficult to score accurately (as there would be no "previous" location for velocity or "previous" amount for Z-score). The seeder extracts the most recent state for every card and customer, including the last 10 transactions, the final coordinate pair, and the cumulative count of events.

A key architectural choice is the Redis Hash/List strategy. Redis Lists (RPUSH) are used to store chronological card history and Hashes (HSET) to store aggregate statistics. This allows scorer.py to perform O(1) lookups for behavioral context, maintaining the strict latency requirements of real-time fraud detection. Furthermore, the seeder explicitly calculates the initial Welford state (Mean and M2) from the warehouse, enabling the online scorer to continue updating statistical variance incrementally without a full history scan.
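
A sketch of the seeding layout using an in-memory stand-in for the two Redis structures (the real script issues RPUSH/HSET via a Redis client against a live instance; key names are illustrative):

```python
class MiniRedis:
    """Tiny in-memory stand-in for the list and hash commands the seeder uses."""
    def __init__(self):
        self.lists, self.hashes = {}, {}
    def rpush(self, key, *values):
        self.lists.setdefault(key, []).extend(values)
    def ltrim_last(self, key, n):
        # Keep only the most recent n entries, like LTRIM key -n -1.
        self.lists[key] = self.lists[key][-n:]
    def hset(self, key, mapping):
        self.hashes.setdefault(key, {}).update(mapping)

def seed_card(r, card_id, amounts, welford_state):
    """Seed one card: recent amounts as a capped list for sequence
    features, Welford aggregates as a hash for incremental z-scores."""
    r.rpush(f"card:{card_id}:history", *amounts)
    r.ltrim_last(f"card:{card_id}:history", 10)
    count, mean, m2 = welford_state
    r.hset(f"card:{card_id}:stats", {"count": count, "mean": mean, "m2": m2})
```

The scorer then reads the hash, applies one Welford update per live transaction, and writes the state back, never needing the full history again.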

System Integration

seed_redis.py acts as a synchronization service between the Warehouse layer (ClickHouse) and the Scoring layer (Redis/Kafka). It must be executed after etl.rs completes (to ensure the "Gold" table is populated) and before stream.rs and scorer.py are started.

Known Issues

The entire feature initialization set is currently pulled into local Python memory before pushing to Redis. For datasets with millions of cards, this may lead to a memory-exhaustion failure. The ClickHouse queries should be refactored to use chunked fetching (cursors), or a parallelized worker pool should be implemented to stream data from the warehouse to Redis in batches. Additionally, a hardcoded password for ClickHouse is currently used; this should be moved to an environment variable to align with project security standards.

Technical Reference

Exhaustive documentation of the schemas, configurations, and developer utilities used to build and manage the RiskFabric simulation.

Synthetic Data Schema

Summary

The RiskFabric data schema is designed to mirror a professional financial environment while providing the "white-box" visibility required for advanced machine learning research. It consists of five core entities that represent the hierarchical relationship between a customer and their financial events.

Design Intent

The schema is structured to prioritize Relational Realism over flat-file simplicity. By separating Customers, Accounts, and Cards into distinct tables, the simulation models complex many-to-one relationships (e.g., a single customer owning multiple accounts, each with different card instruments). This is essential for testing entity-linking models and network analysis in fraud detection.

The inclusion of the FraudMetadata table is a critical architectural decision. It decouples the simulation ground truth (fraud_target) from the operational signal (is_fraud). This allows researchers to train on noisy, real-world signals while validating against the perfect, latent truth of the generator.

Entity Relationship Overview

  • Customer: The primary entity. Owns several Accounts.
  • Account: A financial container (Savings, Current, Credit). Contains several Cards.
  • Card: The instrument used for transactions.
  • Transaction: A financial event linked to a Card, Account, and Customer.
  • FraudMetadata: Ground-truth data linked 1:1 with Transactions to explain the generation context.

👥 Customer (customers.parquet)

Defines the synthetic population's demographics and geographic baseline.

| Field | Type | Description |
|---|---|---|
| customer_id | String | Unique UUID for the customer. |
| name | String | Full name (Indian-centric). |
| age | UInt8 | Age of the customer (18-90). |
| email | String | Synthetic email address. |
| location | String | Full residential address (OSM-based). |
| state | String | Standardized Indian state name. |
| location_type | String | Urban vs. Rural classification. |
| home_latitude | Float64 | WGS84 Latitude of home. |
| home_longitude | Float64 | WGS84 Longitude of home. |
| home_h3r5 | String | H3 Resolution 5 index (Neighborhood level). |
| home_h3r7 | String | H3 Resolution 7 index (Block level). |
| credit_score | UInt16 | Synthetic credit score (300-850). |
| monthly_spend | Float64 | Average expected monthly expenditure. |
| customer_risk_score | Float32 | Baseline risk probability (0.0 to 1.0). |
| is_fraud | Bool | Flag indicating if this customer represents a fraud target. |
| registration_date | String | ISO 8601 date of account registration. |

🏦 Account (accounts.parquet)

The logical banking container for funds.

| Field | Type | Description |
|---|---|---|
| account_id | String | Unique UUID for the account. |
| customer_id | String | FK to Customer. |
| bank_id | String | Identifier for the issuing bank. |
| account_no | String | 12-digit synthetic account number. |
| account_type | String | Savings, Current, or Credit. |
| balance | Float64 | Current funds in the account. |
| status | String | Active, Closed, or Suspended. |
| creation_date | String | The account opening date. |

💳 Card (cards.parquet)

The payment instrument associated with an account.

| Field | Type | Description |
|---|---|---|
| card_id | String | Unique UUID for the card. |
| account_id | String | FK to Account. |
| customer_id | String | FK to Customer. |
| card_number | String | 16-digit synthetic PAN. |
| card_network | String | VISA, Mastercard, or RuPay. |
| card_type | String | Debit or Credit. |
| status | String | Active, Blocked, or Expired. |
| status_reason | String | Reason for status changes (e.g., SIM Swap Suspect). |
| issue_date | String | Card issuance date. |
| activation_date | String | Initial card usage date. |
| expiry_date | String | Card expiry date. |
| issuing_bank | String | Full name of the bank. |
| bank_code | String | Standardized 4-digit bank identifier. |

💸 Transaction (transactions.parquet)

The high-volume stream of financial events.

| Field | Type | Description |
|---|---|---|
| transaction_id | String | Unique UUID for the transaction. |
| card_id | String | FK to Card. |
| account_id | String | FK to Account. |
| customer_id | String | FK to Customer. |
| merchant_id | String | Unique identifier for the merchant. |
| merchant_name | String | Name of the business. |
| merchant_category | String | Category (e.g., GROCERY, TRAVEL). |
| merchant_country | String | Country code of the merchant (defaults to IN). |
| amount | Float64 | Transaction value in base currency. |
| timestamp | String | ISO 8601 high-precision timestamp. |
| transaction_channel | String | online, in-store, UPI, etc. |
| card_present | Bool | Physical card usage flag. |
| user_agent | String | Browser or POS device identifier. |
| ip_address | String | IPv4 address of the requester. |
| status | String | High-level status (Success or Failed). |
| auth_status | String | Banking authorization code (approved/declined). |
| failure_reason | String | Detailed reason for declined transactions. |
| is_fraud | Bool | Noisy Label (includes FN/FP). |
| chargeback | Bool | Flag indicating a later customer dispute. |
| location_lat | Float64 | Latitude of the transaction event. |
| location_long | Float64 | Longitude of the transaction event. |
| h3_r7 | String | H3 Resolution 7 index of the transaction location. |

🕵️ Fraud Metadata (fraud_metadata.parquet)

Internal ground-truth for debugging and advanced ML training. This table is not used in standard inference but is vital for "white-box" evaluation.

| Field | Type | Description |
|---|---|---|
| transaction_id | String | FK to Transaction. |
| fraud_target | Bool | Ground Truth (True Fraud flag). |
| fraud_type | String | Profile used (e.g., upi_scam, ato). |
| label_noise | String | Reason for label mismatch (if any). |
| injector_version | String | Engine version. |
| geo_anomaly | Bool | True if location represents an outlier. |
| device_anomaly | Bool | True if device/UA represents an outlier. |
| ip_anomaly | Bool | True if IP represents a known malicious prefix. |
| burst_session | Bool | Part of a rapid-fire sequence. |
| burst_seq | Int32 | Sequence number within a burst session. |
| campaign_id | String | Link to a coordinated attack campaign. |
| campaign_type | String | Coordination type (e.g., coordinated_attack). |
| campaign_phase | String | Phase within the campaign (early, active, late). |
| campaign_day_number | Int32 | Days since campaign start. |

Known Issues

UUID strings are currently used for all primary keys (customer_id, card_id, etc.). While ensuring global uniqueness, this increases storage overhead and join latency in ClickHouse compared to integer-based keys. Transitioning to a 64-bit integer ID system is under consideration for future versions.

Furthermore, a dedicated Merchant Table is not yet implemented in the output schema. Merchant attributes are currently denormalized directly into the transaction table, creating data redundancy and limiting merchant-level entity modeling. Breaking merchants into a separate merchants.parquet file is required to complete the star schema.

ETL & Feature Schema

Summary

The etl_schema.md document defines the behavioral features and data transformations performed by the RiskFabric ETL pipeline (etl.rs). It acts as the technical contract for the "Silver" and "Gold" layers, detailing how raw synthetic events are transformed into the high-dimensional vectors used for model training and real-time inference.

Design Intent

The feature schema represents a Hybrid Behavioral State, intended to provide models with a multi-domain view of financial events across customer history, merchant risk, and temporal sequences. This approach facilitates sophisticated behavioral modeling, such as Z-scores and velocity-based indicators, similar to production fraud detection systems.

A critical design choice was the use of Welford's Algorithm for statistical aggregates. Calculating running means and variances locally in Rust (and Redis) ensures that features are numerically stable and computationally efficient for both batch processing and low-latency streaming. This architectural decision is intended to eliminate training-serving skew.
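Welford's update itself is only a few lines. A sketch in Python (the project implements the same recurrence in Rust for batch ETL and in the Redis-backed scorer):

```python
def welford_update(count: int, mean: float, m2: float, x: float):
    """One incremental Welford step: returns the updated (count, mean, M2).
    Sample variance at any point is m2 / (count - 1) for count > 1."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)  # note: uses the *updated* mean
    return count, mean, m2

state = (0, 0.0, 0.0)
for amount in [100.0, 200.0, 300.0]:
    state = welford_update(*state, amount)
count, mean, m2 = state
# mean == 200.0; sample variance == m2 / 2 == 10000.0
```

Because each step touches only the running triple, the same code path serves batch backfill and per-event streaming, which is what eliminates training-serving skew.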


🥈 Silver Layer: Behavioral Features

Transaction Sequence Features (fact_transactions_silver)

Calculated at the individual card level to identify temporal and spatial anomalies.

| Field | Description | Logic |
|---|---|---|
| time_since_last | Seconds since the previous event. | T - T_prev |
| spatial_velocity | Speed (km/h) between consecutive events. | Dist(L, L_prev) / (T - T_prev) |
| amount_z_score | Deviation from customer's mean spend. | (Amt - Mean) / StdDev |
| hour_deviation | Deviation from customer's peak spend hour. | Circular variance of timestamp.hour() |
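The first three features can be sketched in a few lines of Python (illustrative only; the pipeline computes these in Rust, and the distance here uses the current Euclidean approximation noted under Known Issues):

```python
import math

def sequence_features(prev, curr, mean_amt, std_amt):
    """Per-card sequence features for the current event.
    prev/curr are (timestamp_s, lat, lon, amount) tuples; mean/std come
    from the card's running Welford state."""
    dt = curr[0] - prev[0]  # time_since_last, in seconds
    # Euclidean degree distance scaled to ~km (1 degree ~ 111 km)
    dist_km = math.hypot(curr[1] - prev[1], curr[2] - prev[2]) * 111.0
    velocity = dist_km / (dt / 3600.0) if dt > 0 else float("inf")
    z = (curr[3] - mean_amt) / std_amt if std_amt > 0 else 0.0
    return dt, velocity, z

dt, vel, z = sequence_features(
    prev=(0, 19.07, 72.87, 500.0),      # Mumbai-area event
    curr=(3600, 28.61, 77.21, 2500.0),  # Delhi-area event, 1 hour later
    mean_amt=500.0, std_amt=400.0,
)
# vel is far above airline speed -> an "impossible travel" signal
```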

Network & Entity Features (network_features_silver)

Identifies high-risk clusters across the payment network.

| Field | Description | Logic |
|---|---|---|
| shared_ip_fraud | Fraud rate of cards sharing the same IP. | SUM(is_fraud) / COUNT(card_id) OVER IP |
| scammer_hub | Flag for known high-risk coordinates. | 1 if Lat/Lon in [hub_coordinates] |
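The shared_ip_fraud window can be expressed as a simple per-IP aggregation. A pure-Python sketch of the same SUM(is_fraud) / COUNT(card_id) logic (the pipeline itself computes this in SQL/Polars):

```python
from collections import defaultdict

def shared_ip_fraud(events):
    """events: iterable of (ip, card_id, is_fraud) tuples.
    Returns the fraud rate per IP, mirroring
    SUM(is_fraud) / COUNT(card_id) OVER (PARTITION BY ip)."""
    frauds = defaultdict(int)
    counts = defaultdict(int)
    for ip, _card, is_fraud in events:
        frauds[ip] += int(is_fraud)
        counts[ip] += 1
    return {ip: frauds[ip] / counts[ip] for ip in counts}

rates = shared_ip_fraud([
    ("103.21.244.12", "c1", True),
    ("103.21.244.12", "c2", True),
    ("103.21.244.12", "c3", False),
    ("10.0.0.1", "c4", False),
])
# the coordinated-attack IP scores 2/3; the benign IP scores 0.0
```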

🥇 Gold Layer: The Master Table

The final flattened table used for model training, joining all Silver behavioral features with the original Bronze transactions.


Known Issues

Spatial Velocity is currently calculated using a Euclidean distance approximation. While computationally efficient, this is inaccurate over long distances. Implementation of the Haversine formula is required to ensure geographic precision for cross-state and international fraud simulations.
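The proposed replacement is the standard Haversine great-circle distance. A self-contained Python sketch:

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Mumbai -> Delhi is roughly 1,150 km great-circle
d = haversine_km(19.0760, 72.8777, 28.6139, 77.2090)
```

Unlike the Euclidean approximation, this stays accurate for the cross-state and international jumps that fraud profiles deliberately inject.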

Furthermore, Feature Freshness is limited to the last 10 transactions in Redis. This prevents the modeling of long-term behavioral baselines for infrequent spenders. Implementing "Stateful Cold Storage" in the ETL pipeline is necessary to retrieve historical data without exceeding real-time feature store capacity.

Configuration Reference

Summary

The config_reference.md document provides a catalog of the behavioral parameters and system-wide settings available in RiskFabric. It details the schema of the YAML configuration files that define the simulation's behavioral rules, ranging from geographic boundaries to fraud injection rates.

Design Intent

The configuration system is designed to be Hierarchical and Domain-Specific. By splitting settings into five distinct YAML files, researchers can perform comparative testing on simulation behaviors (e.g., comparing different fraud population densities) by swapping configuration files. This decoupling ensures the generator can be tuned without recompiling the Rust binaries.

A critical design choice was the use of Semantic Weights. For parameters such as hourly_weights and daily_weights, relative values are used rather than absolute probabilities. This allows the generator to maintain consistent behavioral ratios (e.g., temporal activity peaks) regardless of the total volume of generated data.
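The semantic-weight idea reduces to a normalization step. A sketch (the weight values here are hypothetical, not taken from the shipped YAML):

```python
def normalize_weights(weights):
    """Turn relative semantic weights into sampling probabilities.
    Scaling every weight by a constant leaves the probabilities unchanged,
    so behavioral ratios survive any change in total generated volume."""
    total = sum(weights)
    return [w / total for w in weights]

hourly = [1, 1, 4, 8, 4, 2]  # hypothetical relative activity weights
probs = normalize_weights(hourly)
scaled = normalize_weights([w * 10 for w in hourly])
# probs == scaled: ratios, not magnitudes, drive the temporal shape
```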


📄 Core Configuration Files

fraud_rules.yaml

Defines the individual attack profiles and their behavioral biases.

  • profiles: Mapping of fraud types (e.g., upi_scam) to amount strategies and geographic anomaly probabilities.
  • fraud_patterns: List of common "test amounts" used by attackers for card validation.

customer_config.yaml

Defines the synthetic population's physical and economic footprint.

  • control.customer_count: Total population size for the batch generation run.
  • financials.base_spend: Expected monthly expenditure per location type (Metro, Urban, Rural).

transaction_config.yaml

Defines the "physics" of the transaction stream.

  • geo_bounds: The lat/long bounding box for transaction events.
  • temporal_patterns: The weighted distribution of activity across the 24-hour day and 7-day week.

Known Issues

The Lookback Period (lookback_days) can currently be set independently of the customer registration window. This allows for temporal inconsistencies where transaction history precedes a customer's registration date. Implementing cross-configuration validation is necessary to ensure temporal consistency.
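The proposed cross-configuration check could be as simple as the following sketch (field and function names are hypothetical):

```python
from datetime import date, timedelta

def validate_lookback(lookback_days: int, earliest_registration: date,
                      simulation_start: date) -> None:
    """Reject configurations whose transaction history would begin
    before the earliest customer registration date."""
    history_start = simulation_start - timedelta(days=lookback_days)
    if history_start < earliest_registration:
        raise ValueError(
            f"lookback_days={lookback_days} reaches back to {history_start}, "
            f"before the earliest registration {earliest_registration}"
        )

# A 90-day lookback from 2024-06-01 is consistent with customers
# registered from 2024-01-01 onward; a 400-day lookback is not.
validate_lookback(90, date(2024, 1, 1), date(2024, 6, 1))
```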

Furthermore, the Streaming Rate (streaming_rate) is a global setting. "Dynamic Throughput," which would allow the generator to simulate peak activity hours (e.g., varying tx/s by time of day), is not yet implemented. Modifying the streaming engine to respect the temporal weights defined in transaction_config.yaml is required to create more realistic real-time traffic patterns.

Developer Utilities CLI

Summary

The developer_utilities.md document details specialized binaries and tools designed to support the RiskFabric development lifecycle. These utilities automate auxiliary tasks surrounding synthetic data generation, such as geographic preprocessing, reference data export, and model metadata inspection.

Design Intent

These utilities function as a Developer's Toolkit for the simulation. By decomposing complex tasks—such as OSM node extraction and Parquet serialization—into dedicated CLI binaries, the core generation engine remains focused. This modular approach allows the synthetic environment to be rebuilt independently of the transaction simulation, enabling iteration on geographic density and merchant risk profiles.

A critical design choice was the use of Strongly-Typed Subcommands via the clap library. This provides a consistent, self-documenting interface for every utility, reducing cognitive load and ensuring operational errors are caught during argument parsing.


🔧 Core Utilities

riskfabric-prepare-refs

The primary utility for extracting and normalizing OSM data.

  • extract-nodes: Parallel parsing of PBF files into a Postgres staging layer.
  • map-city-state: Rules-based geographic normalization.

riskfabric-export-references

The serializer bridging the staging database and the generation layer.

  • Function: Converts Postgres tables into H3-indexed Parquet files.

riskfabric-ingest

The automated loader for the ClickHouse data warehouse.

  • Function: Handles schema creation and bulk loading of generated transactions.

Known Issues

Two separate binaries are currently maintained for reference handling (prepare-refs and export-references), which introduces friction in the developer workflow. Consolidation into a Unified "Refs" Command with subcommands for extraction, normalization, and export is planned.

Furthermore, Duplicate Connection Logic exists across several utilities, with database URLs and file paths hardcoded in multiple binaries. Refactoring common CLI logic into a shared riskfabric-cli-core crate is required to ensure consistent handling of parameters like --db-url and --output-dir.

Machine Learning Strategy

Summary

RiskFabric's machine learning strategy is built around the "Operational Model" philosophy. Instead of training on perfect, latent labels provided by the generator, the strategy forces models to learn from behavioral proxies in a multi-stage pipeline that mirrors real-world deployment challenges.

Design Intent

The ML pipeline serves as a Calibration Bench for the generator. Achieving 100% recall on synthetic data indicates that the fraud signatures are insufficient in complexity. Label Noise (FP/FN) and Sanitized Feature Sets are explicitly introduced to create a realistic "Information Gap" between the generator and the learner.

The architecture utilizes XGBoost as its primary classifier, leveraging its native categorical handling and gradient-boosting strengths for tabular financial data. This enables researchers to evaluate feature importance in an interpretable manner, identifying which synthetic signals (e.g., spatial velocity vs. amount deviation) are the most predictive.


🏗️ The Training Pipeline

  1. Ingestion & ETL: Data is extracted from the ClickHouse "Gold" layer via train_xgboost.py.
  2. Sanitization: Internal generator flags (e.g., fraud_type, geo_anomaly) are dropped to prevent data leakage.
  3. Training: XGBoost utilizes a binary:logistic objective with a 20% stratified test split.
  4. Verification: Models are evaluated against both the noisy is_fraud label and the perfect fraud_target.
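The sanitization step (2) can be sketched as a column filter over the generator-internal fields listed in the FraudMetadata schema (shown here in pure Python; the actual train_xgboost.py operates on a dataframe):

```python
# Generator-internal ("white-box") columns that must never reach the model
LEAKAGE_COLUMNS = {
    "fraud_target", "fraud_type", "label_noise", "injector_version",
    "geo_anomaly", "device_anomaly", "ip_anomaly",
    "campaign_id", "campaign_type", "campaign_phase",
}

def sanitize(row: dict) -> dict:
    """Drop ground-truth and injector metadata before training."""
    return {k: v for k, v in row.items() if k not in LEAKAGE_COLUMNS}

row = {"amount": 2500.0, "spatial_velocity": 1163.0,
       "fraud_type": "ato", "geo_anomaly": True, "is_fraud": True}
clean = sanitize(row)
# keeps the noisy training label is_fraud, drops the white-box signals
```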

Known Issues

The current use of Random Stratified Splitting for validation is an architectural limitation. In a financial stream, data is temporally ordered; random splitting allows for "look-ahead bias," where the model may be exposed to a customer's future patterns during training. Transitioning to Out-of-Time (OOT) Validation—training on the first nine months and testing exclusively on the final three—is necessary.
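An out-of-time split is straightforward to express. A minimal sketch (pure Python; a production version would split on calendar months rather than raw timestamps):

```python
def out_of_time_split(rows, cutoff):
    """Split temporally ordered rows: train strictly before `cutoff`,
    test at or after it, eliminating look-ahead bias."""
    rows = sorted(rows, key=lambda r: r["timestamp"])
    train = [r for r in rows if r["timestamp"] < cutoff]
    test = [r for r in rows if r["timestamp"] >= cutoff]
    return train, test

rows = [{"timestamp": t} for t in (1, 5, 3, 9, 7)]
train, test = out_of_time_split(rows, cutoff=6)
# train timestamps: 1, 3, 5; test timestamps: 7, 9
```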

Furthermore, the model is currently static, without a "Concept Drift" simulation to account for fraud signatures changing over time. This makes the accuracy metrics potentially misleading as they do not reflect adversarial evolution. Implementing a Retraining Scheduler is required to evaluate precision degradation as fraud profiles evolve.

Conceptual Explanations

High-level documentation explaining the underlying philosophy, architectural strategies, and simulation logic of RiskFabric.

Theory of Operation

This document explains the underlying philosophy, architecture, and logic of the RiskFabric simulation. It answers the question: "How does the engine actually think?"

1. Agent-Based Modeling (ABM) Philosophy

RiskFabric functions as an Agent-Based Simulator rather than a simple random data generator.

graph TD
    subgraph "World Building"
        OSM[OpenStreetMap Data] --> Prepare[prepare_refs.rs]
        Prepare --> PG[(Postgres / PostGIS)]
        PG --> Parquet[Reference Parquet Files]
    end

    subgraph "Simulation Engine"
        Parquet --> Gen[generate.rs / stream.rs]
        Config[YAML Configs] --> Gen
        Gen --> Trans[Transactions]
    end

    subgraph "Detection Pipeline"
        Trans --> CH[(ClickHouse)]
        CH --> ETL[etl.rs]
        ETL --> Train[train_xgboost.py]
        Train --> Scorer[scorer.py]
        Gen --> Kafka[Redpanda]
        Kafka --> Scorer
    end
  • The Agent: The primary agent, the Customer, drives the logic.
  • The World: OpenStreetMap (OSM) reference nodes (Residential and Merchant points) across India define the physical world.
  • The Rules: Agents follow deterministic rules defined in fraud_rules.yaml and transaction_config.yaml.

Unlike statistical generators that sample from distributions to create flat tables, RiskFabric simulates the lifecycle of financial entities.


2. The Deterministic Lifecycle

To ensure consistency across 10M rows and all tables, RiskFabric follows a strict creation order:

graph LR
    Cust[Customer] -->|1:N| Acc[Account]
    Acc -->|1:N| Card[Card]
    Card -->|1:N| Tx[Transaction]
    Tx -->|linked| Merch[Merchant]
  1. Customer Birth: The generator assigns each customer a name, age, and a Home Coordinate based on real residential OSM nodes.
  2. Financial Anchoring: The system assigns one or more Accounts to every customer.
  3. Payment Instruments: Accounts issue Cards. These cards act as "keys" for generating transaction streams.
  4. The Spend Loop: Each card generates transactions based on the customer's monthly_spend profile.

3. The "One-Pass" Parallel Architecture

Traditional simulators often use multiple passes (e.g., Pass 1: Generate legitimate data, Pass 2: Inject fraud). This approach increases latency and memory usage.

RiskFabric uses a One-Pass Architecture in Rust:

  • Parallelization: The engine uses the Rayon library to process thousands of entities simultaneously across all CPU cores.
  • Unified Logic: Merchant selection, amount calculation, fraud injection, and campaign coordination occur in a single loop.
  • Memory Efficiency: By using "Batched Generation" (5,000 entities per cycle), the engine maintains a constant memory footprint whether generating 1M or 10M rows.

4. Spatial Realism & H3 Indexing

RiskFabric is built for high geographic fidelity.

  • H3 Hierarchies: The system uses Uber’s H3 hexagonal grid. When a user spends, the engine first looks for merchants within the same H3 Resolution 5 cell (neighborhood level) as their home.
  • Local vs. Global Spend: Legitimate transactions remain "local" (same H3 cell) approximately 98% of the time. Fraud profiles (like UPI Scams) explicitly force "Remote" coordinates to simulate offshore or cross-state attacks.

5. Statistical Reproducibility (Seeded PRNG)

Every card in the system has a Deterministic Seed.

let mut card_rng = StdRng::seed_from_u64(global_seed + salt + card_id_hash);

Running the simulation with the same global_seed ensures every transaction for a given card remains identical. This enables Machine Learning reproducibility, allowing for feature adjustments without the underlying ground-truth shifting.


6. Simulated Imperfection (Label Noise)

To mirror real-world banking challenges, RiskFabric implements Noisy Labeling:

  • Ground Truth (fraud_target): The latent indicator of whether the generator injected a specific fraud pattern.
  • Noisy Label (is_fraud): The signal typically available to a bank's operational systems. It includes False Positives (legitimate transactions flagged as fraud) and False Negatives (undetected fraudulent transactions).

This design forces models to learn robustness and generalizable patterns rather than memorizing perfect synthetic signatures.
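A sketch of how the noisy label can be derived from the latent truth (the false-negative and false-positive rates below are illustrative defaults, not the engine's configured values):

```python
import random

def noisy_label(fraud_target: bool, rng: random.Random,
                fn_rate: float = 0.10, fp_rate: float = 0.01) -> bool:
    """Derive the operational is_fraud signal from the latent fraud_target.
    fn_rate: chance a real fraud goes undetected (false negative);
    fp_rate: chance a legitimate event is wrongly flagged (false positive)."""
    if fraud_target:
        return rng.random() >= fn_rate  # occasionally missed
    return rng.random() < fp_rate       # occasionally over-flagged

rng = random.Random(42)
labels = [noisy_label(True, rng) for _ in range(10_000)]
detected = sum(labels)  # roughly 9,000 of 10,000 true frauds detected
```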


7. Hybrid Streaming & Verification Architecture

To support real-time fraud detection, RiskFabric includes a dedicated Streaming Generator that bridges the gap between static datasets and live production environments.

  • One-Pass Consistency: The streaming engine reuses the exact same logic as the batch pipeline but operates on a continuous loop, producing transactions at a configurable rate (default 100 tx/s).
  • Type-Level Safety (Unlabeled Output): To prevent "label leakage" during live scoring, the system uses a specialized UnlabeledTransaction struct. This mirrors the standard transaction but programmatically omits all ground-truth and labeling fields (is_fraud, chargeback, etc.), ensuring the Kafka payload is consistent with a real production stream.
  • Verification Mode: While in verification mode, the generator writes the "Ground Truth" of every streaming transaction to ground_truth.csv. This allows for a post-hoc join against real-time model scores to measure precision and recall in a simulated production environment.
  • Self-Correcting Rate Limiter: The generator measures actual Kafka broker latency for every message sent. It dynamically adjusts its sleep interval to compensate for network jitter, ensuring steady, drift-free throughput over long durations.
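The self-correcting rate limiter reduces to a per-message sleep budget. A sketch of the interval calculation (illustrative; the actual engine implements this in Rust around its Kafka producer):

```python
def next_sleep_interval(target_rate_tps: float, send_latency_s: float) -> float:
    """Sleep for whatever remains of the per-message time budget after
    the measured broker latency, so jitter does not erode throughput."""
    budget = 1.0 / target_rate_tps  # e.g. 10 ms per message at 100 tx/s
    return max(0.0, budget - send_latency_s)

# 100 tx/s with a 3 ms send: sleep the remaining ~7 ms
slow = next_sleep_interval(100.0, 0.003)
# Send slower than the budget: skip sleeping entirely and catch up
stalled = next_sleep_interval(100.0, 0.015)
```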

Fraud Signatures & Attack Patterns

Summary

The fraud_signatures.md document serves as the high-level behavioral specification for the simulation's adversary. It defines both individual fraud profiles and coordinated multi-entity campaigns, providing the theoretical basis for the synthetic anomalies generated by the engine.

Design Intent

These signatures are designed to move beyond "random noise" toward Structured Adversarial Intelligence. Each profile (e.g., UPI Scam, Account Takeover) is anchored in a specific real-world financial threat observed in the Indian market. Layering Campaign Logic on top of individual profiles allows the simulation to model the "clustered" signals that are the hallmark of professional criminal organizations.

A critical design choice was the implementation of Probabilistic Mutation. Instead of making every fraudulent transaction an obvious outlier, configuration-driven probabilities ensure that some fraud looks "legitimate" (e.g., Friendly Fraud). This forces ML models to learn subtle, high-dimensional boundaries rather than simple, hard-coded thresholds.


1. Fraud Profiles (Individual Patterns)

| Profile | Behavioral Signature | Spatial Signature |
|---|---|---|
| UPI Scam | High frequency, small to medium amounts (₹1,500 - ₹20,000). | 90% Geo-Anomaly: Scammer is remote. |
| Account Takeover | High-value transfers, sudden change in device/channel. | 40% Geo-Anomaly: Compromised from distant location. |
| Velocity Abuse | Rapid-fire "testing" transactions (₹1.01, ₹1.23, etc.). | 10% Geo-Anomaly: Low spatial signal. |
| Card Not Present | Online-only channel bias, standard e-commerce amounts. | 30% Geo-Anomaly: Card details used remotely. |
| Friendly Fraud | Legitimate channel/device, standard amounts. | 0% Geo-Anomaly: Customer is physically at home. |

2. Campaign Attack Patterns (Coordinated)

Coordinated Attack

  • Signal: Multiple distinct cards/customers targeted simultaneously by a single entity.
  • Hard Correlation: Every transaction in the campaign shares the exact same IP Address and geographic coordinate (simulating a scammer hub or proxy).
  • Tuning: Coordinated IP is configurable via fraud_tuning.yaml (Default: 103.21.244.12).

Sequential Takeover

  • Signal: A single card experiencing a progressive escalation of fraud.
  • Monotonic Escalation: Each subsequent transaction amount is multiplied by the ato_escalation_rate (Default: 30%).
  • Persistent Location: Once the takeover begins, the geographic coordinate "sticks" to the attacker's location for the remainder of the sequence.
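The monotonic escalation above is simple compound growth. A sketch using the documented default rate of 30% (the function name is illustrative):

```python
def escalation_sequence(base_amount: float, n: int, rate: float = 0.30):
    """Amounts for a sequential-takeover burst: each transaction is the
    previous amount multiplied by (1 + rate), a monotonic escalation."""
    amounts, amt = [], base_amount
    for _ in range(n):
        amounts.append(round(amt, 2))
        amt *= 1.0 + rate
    return amounts

seq = escalation_sequence(1000.0, 4)
# 1000.00 -> 1300.00 -> 1690.00 -> 2197.00, strictly increasing
```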

Known Issues

The "Spatial Signature" for fraud is currently implemented as a simple latitude/longitude jump. While this creates a clear anomaly, it doesn't account for Traveling Legitimate Customers, leading to a higher-than-normal false positive rate in models that rely too heavily on distance-from-home. A "Travel Profile" for legitimate customers is needed to introduce more realistic noise.

Furthermore, campaign logic is currently limited to "Shared IP" and "Shared Coordinate." Account-to-Account (A2A) graph signals, where stolen funds move through a chain of "mule" accounts, are not yet implemented. This is a significant gap in the simulation's "Money Laundering" fidelity and is slated for the next version of the fraud.rs engine.

Synthetic Fraud Profiles

Summary

The fraud_profiles.md document provides a detailed behavioral and statistical breakdown of the five core adversarial signatures simulated by RiskFabric. It explains the contextual logic used by the generator to mimic real-world financial crimes and provides examples of how these patterns manifest in the synthetic data stream.

Design Intent

These profiles are designed to challenge machine learning models by mirroring the statistical "noise" and multi-dimensional anomalies of modern fraud. By shifting from simple "hardcoded amount" rules to Behavioral Multipliers and Contextual Biases, the generator forces downstream models to evaluate combinations of spatial velocity, merchant categories, and temporal deviations.


1. Velocity Abuse

Objective: Simulate a bot network or organized fraud ring rapidly "testing" compromised card details or exploiting a merchant gateway before limits are triggered.

Behavioral Signature

  • Amount Strategy: customer_normal_range with a strict 0.90x to 1.10x multiplier.
  • Primary Signals: Extreme Transaction Frequency (rapid_fire_transaction_flag), High Spatial Velocity (impossible travel), and Specific Merchant Bias (GAMBLING, ENTERTAINMENT).
  • The "Trick": By keeping the transaction amount perfectly aligned with the customer's normal spending habits, it evades simple threshold-based alerts, forcing the model to rely entirely on speed and location.

Example Scenario

A customer whose average transaction is ₹500 has three transactions generated within a 4-minute window for exactly ₹490, ₹510, and ₹495 at three different entertainment merchants located 800km away from their last known physical transaction.


2. Account Takeover (ATO)

Objective: Simulate a malicious actor gaining unauthorized access to a legitimate user's banking app or online portal to drain funds or make high-value purchases.

Behavioral Signature

  • Amount Strategy: customer_normal_range with a tight 0.95x to 1.05x multiplier.
  • Primary Signals: Extreme Spatial Velocity (impossible travel), Temporal Anomaly (occurring during the customer's historical "sleep" hours), and Channel Bias (mobile_banking, online).
  • The "Trick": Similar to Velocity Abuse, the amount does not spike. The anomaly is purely contextual: the transaction occurs on a new device, from a new IP, at 3:00 AM, purchasing from a LUXURY or ELECTRONICS merchant.

Example Scenario

A customer completes an in-store grocery purchase in Mumbai at 8:00 PM. At 3:15 AM the following morning, a mobile banking transfer for a standard amount is initiated from an IP address in Delhi.


3. Card Not Present (CNP) Fraud

Objective: Simulate the unauthorized use of stolen credit card details (PAN, CVV) for online purchases, typically for easily liquidatable goods.

Behavioral Signature

  • Amount Strategy: customer_normal_range with an aggressive 1.0x to 5.0x multiplier.
  • Primary Signals: Channel Bias (100% online), Merchant Category Bias (ELECTRONICS, LUXURY), and elevated amount_deviation_z_score.
  • The "Trick": This profile blends moderate amount spikes with specific merchant categories. It tests the model's ability to correlate the "Online" channel with high-risk retail sectors.

Example Scenario

A customer who typically spends ₹2,000 per transaction across various local stores suddenly has an online transaction for ₹8,500 at an ELECTRONICS merchant, processed without physical card presence.


4. UPI Scam (Social Engineering)

Objective: Simulate phishing or coercive scams where a victim is tricked into authorizing a high-value transfer via the Unified Payments Interface (UPI).

Behavioral Signature

  • Amount Strategy: customer_normal_range with a massive 1.5x to 4.0x multiplier.
  • Primary Signals: Massive amount_deviation_z_score, Channel Bias (Heavily biased toward upi), and Merchant Category Bias (GENERAL_RETAIL, SERVICES).
  • The "Trick": This represents the classic "drain the account" scenario. The model must learn that extreme amount deviations on the UPI channel to unfamiliar service merchants are highly suspicious, even if the device fingerprint appears legitimate.

Example Scenario

A user with an average transaction of ₹300 suddenly authorizes a UPI payment of ₹1,100 to a previously unseen "Services" merchant, heavily deviating from their historical spend pattern.


5. Friendly Fraud (First-Party Fraud)

Objective: Simulate a legitimate customer making a valid purchase (often digital goods or travel) and subsequently filing a false chargeback claim with their bank.

Behavioral Signature

  • Amount Strategy: customer_normal_range with a standard 0.5x to 1.5x multiplier.
  • Primary Signals: None. This profile intentionally lacks spatial, temporal, or behavioral anomalies.
  • The "Trick": This is the hardest profile to detect at the transaction level. The location, device, and amount are all perfectly normal. Detection relies entirely on historical entity-level features, such as the cf_fraud_rate (Customer Fraud Rate) or merchant_category risks (TRAVEL, FOOD_AND_DRINK).

Example Scenario

A customer purchases a ₹1,200 airline ticket online from their home IP address, using their normal device, during their usual active hours. Three weeks later, the transaction is marked with a chargeback flag.

Data Warehouse & dbt Strategy

Summary

The data_warehouse.md document outlines the architectural strategy for RiskFabric's analytical layer. It explains how raw synthetic data is transformed into high-fidelity behavioral entities using a Modern Data Stack (MDS) approach, specifically leveraging ClickHouse for high-volume transactions and Postgres/dbt for geographic enrichment.

Design Intent

The warehouse functions as a Medallion Data Lakehouse, intended to demonstrate how synthetic data can be used to test both machine learning models and the data engineering lifecycle. By using dbt (data build tool), complex geographic filtering and merchant risk assignment are implemented in SQL, allowing for a clear separation between the simulation engine (Rust) and the analytical environment (SQL).

A critical architectural decision was the adoption of a Dual-Warehouse Model. ClickHouse serves as the primary engine for transaction data due to its performance with columnar storage and large-scale joins. Conversely, Postgres is used for "Level 0" geographic preparation (OSM extraction), as it provides mature support for spatial extensions like PostGIS. This approach ensures each part of the simulation utilizes the tool best suited for its specific data type.


🏗️ Warehouse Layers

  1. Bronze (Raw): Direct ingest from Parquet files via ingest.rs.
  2. Silver (Enriched): Entity-level behavioral features (e.g., customer_features_silver).
  3. Gold (Master): The flattened, model-ready fact_transactions_gold.

Known Issues

The system currently utilizes Podman-based container execution to interact with the warehouse from the Rust binaries. This introduces environment-level fragility and limits the simulation's scalability in distributed cloud environments. Transitioning to native ClickHouse and Postgres client libraries is necessary to improve the reliability of the ingestion and transformation stages.

Furthermore, dbt models are split between two different databases (Postgres for references, ClickHouse for transactions). This prevents cross-warehouse joins and requires moving data via Parquet files. Unifying the transformation layer—specifically by moving all "Level 0" geography data into ClickHouse—is required to eliminate manual data-movement steps and simplify dbt pipeline orchestration.

Project Goals & Objectives

Summary

The objectives.md document defines the high-level mission and technical milestones for the RiskFabric project. It outlines the strategic intent behind building a high-fidelity synthetic data generator and the specific problems it aims to solve for the financial technology community.

Design Intent

RiskFabric is designed to address the "Data Paradox" in fraud detection: researchers require large volumes of labeled data to develop effective models, but real-world financial data is sensitive and often inaccessible. By creating a high-fidelity, "white-box" alternative, the project provides a safe environment for testing machine learning algorithms and the operational infrastructure required for real-time fraud detection.

A key strategic objective is the promotion of Infrastructure-as-Code for Simulation. Transitioning from static CSV datasets to dynamic, configuration-driven environments allows organizations to "stress-test" systems against hypothetical scenarios—such as doubling transaction volumes—without requiring production data.


🎯 Key Milestones

  1. High-Fidelity Generation: Reaching 180k+ TPS while maintaining spatial and temporal realism.
  2. Streaming Parity: Ensuring models trained on batch data perform consistently in real-time Kafka environments.
  3. Adversarial Diversity: Expanding the fraud library to include multi-stage attacks like money laundering and mule-account networks.

Known Issues

Focus is currently placed on Individual and Coordinated Fraud, but Macroeconomic Factors remain unimplemented. The simulation assumes spending patterns are unaffected by external events such as inflation or holidays. Implementing a "Global Event Engine" is necessary to simulate seasonal surges and economic shifts, providing a more challenging baseline for detection models.

Furthermore, the project lacks Multi-Currency Support. The simulation is anchored to a single base currency, preventing the modeling of international fraud or cross-border remittance scams. Refactoring the transaction engine to handle dynamic currency conversion and exchange-rate fluctuations is required to support global fintech use cases.

Results & Monitoring

Tracking the evolution of model performance, generation throughput, and ETL efficiency benchmarks.

Machine Learning Metrics & Model Progression

This document tracks the performance and evolution of the fraud detection models trained on RiskFabric synthetic data, progressing from initial leakage-prone baselines to a robust, behavioral production configuration.


Section 1: Early Iterations

The development process began with basic feature sets to establish a baseline for fraud detection performance.

v1 Iteration (Baseline)

The initial model established core feature sets including amount deviations and spatial velocity on a sample population.

  • Accuracy: 0.95
  • ROC AUC Score: 0.9782
  • Recall (Fraud): 0.30 (Identified significant "Recall Gap")

v2 High-Fidelity (Leakage Detected)

Scaling to larger datasets revealed massive performance inflation due to generator artifacts in metadata fields.

  • ROC AUC Score: 0.9993
  • Leakage Identified: Synthetic metadata fields (fraud_target, burst_seq) were providing a "static bypass" for the model.

v2 Iteration (Leakage Prevention)

The feature vector was sanitized to exclude metadata, shifting the focus to behavioral signals.

  • ROC AUC Score: 0.9746
  • Recall (Noisy Labels): 0.72
  • Sanitization: Transitioned from fraud_target to the noisy is_fraud label.

Note: In addition to the leakage issues documented below, v1 and v2 iterations were trained on an incomplete feature set. Behavioral features computed in the Rust ETL layer — including amount_deviation_z_score, spatial_velocity, and granular anomaly flags — were silently dropped before reaching XGBoost due to a narrow Gold table join. The inflated AUC figures in these iterations reflect both metadata leakage and the absence of the features that would have provided genuine behavioral signal.

Section 2: v3 — Production Configuration (Final)

The final model configuration focuses on pure behavioral signals, specifically tuned to handle the extreme class imbalance (1.4% fraud rate) found in realistic production environments.

Training Setup

  • Dataset: 1.5M transactions (Seed 42).
  • Fraud Rate: 1.41% (target_share: 0.01, fp_rate: 0.005).
  • Model: XGBoost binary classifier.
  • Scale Pos Weight: 69.57 (Computed dynamically from training imbalance).
  • Eval Metric: aucpr (Area Under Precision-Recall Curve).
  • Label Noise: 0.5% False Positives and 1% False Negatives deliberately injected.
  • Theoretical Recall Ceiling: 66.7% (Derived from the intentional label noise ratio).
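The class weighting above can be sketched in a few lines; the function and variable names are illustrative, and the exact positive count is approximated from the reported 1.41% rate, so the result lands near (not exactly on) the documented 69.57:

```python
def scale_pos_weight(n_positive: int, n_negative: int) -> float:
    """XGBoost convention: weight the positive class by the imbalance ratio."""
    return n_negative / n_positive

# 1.5M transactions at the reported 1.41% labeled fraud rate (Seed 42 run);
# the exact positive count is an approximation from that rate.
n_total = 1_500_000
n_fraud = round(n_total * 0.0141)          # ~21,150 labeled positives
n_legit = n_total - n_fraud

spw = scale_pos_weight(n_fraud, n_legit)   # ~69.9, close to the reported 69.57

# Parameter dict mirroring the documented setup (aucpr suits heavy imbalance).
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "scale_pos_weight": spw,
}
```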

Feature Importance

The model prioritizes physical and financial anomalies over static identifiers.

| Feature | Importance | Description |
|---|---|---|
| spatial_velocity | 25.38% | Impossible travel speed between transactions |
| amount_deviation_z_score | 20.80% | Spending magnitude relative to customer norm |
| time_since_last_transaction | 12.72% | Temporal burst and frequency detection |
| transaction_channel | 11.60% | Risk associated with specific payment methods |
| merchant_category | 11.08% | Contextual risk of the merchant type |
| hour_deviation_from_norm | 7.40% | Circadian rhythm anomalies |
| merchant_category_switch_flag | 2.89% | Unexpected shifts in merchant category |
| card_present | 2.45% | Physical vs. digital transaction risk |
| transaction_sequence_number | 1.95% | Position within the account lifecycle |
| rapid_fire_transaction_flag | 1.88% | High-velocity sequence identification |

For a detailed narrative of the discovery and resolution of these artifacts, see the Feature Leakage Case Study.

Generalization Results

Validated against three independent populations to ensure robust performance across different random seeds.

| Test Population | Seed | Transactions | AUC |
|---|---|---|---|
| Holdout | 42 (Same) | 1.5M | 84.72% |
| Independent | 8888 (Different) | 1.5M | 79.94% |
| Independent | 5555 (Different) | 3.0M | 79.81% |

Note: The higher AUC on the holdout set is due to distributional overlap with the training population, while the ~80% AUC on independent seeds represents the model's true behavioral generalization.


Section 3: Threshold Operating Points

In a production environment, the model's probability output is mapped to specific operational actions.

| Operating Mode | Threshold | Precision | Recall | F1 | Use Case |
|---|---|---|---|---|---|
| Detection Layer | 0.495 | 10% | 60% | 0.172 | Review queue (broad capture) |
| Triage | 0.645 | 18% | 55% | 0.268 | Early analyst filtering |
| Investigation | 0.736 | 31% | 50% | 0.385 | Analyst workbench |
| High Confidence | 0.842 | 57% | 45% | 0.502 | Escalation decisions |
| Blocking | 0.945 | 73% | 40% | 0.517 | Automatic card block |

The Detection Layer feeds a review queue for manual inspection, while the Blocking Layer is reserved for automated enforcement. The tradeoff between these layers is an operational business decision, not a model failure.
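A minimal sketch of how the probability-to-action mapping could look in the scoring path; the action names and the `route` function are illustrative, not the project's actual `scorer.py` API:

```python
# Thresholds from the operating-point table, strictest first.
THRESHOLDS = [
    (0.945, "block"),         # automatic card block
    (0.842, "escalate"),      # high-confidence escalation
    (0.736, "investigate"),   # analyst workbench
    (0.645, "triage"),        # early analyst filtering
    (0.495, "review"),        # broad-capture review queue
]

def route(score: float) -> str:
    """Map a model probability to the first (strictest) matching action."""
    for threshold, action in THRESHOLDS:
        if score >= threshold:
            return action
    return "allow"

print(route(0.97), route(0.70), route(0.10))  # block triage allow
```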


Section 4: Merchant Category Audit

Leakage verification at the "Blocking" threshold (0.945) confirms that overrepresentation reflects genuine category risk levels rather than static bypasses.

| Category | Global Share | Flag Share | Index | Verified Fraud Rate |
|---|---|---|---|---|
| GAMBLING | 0.07% | 1.09% | 17x | 17.68% |
| ENTERTAINMENT | 1.10% | 14.35% | 13x | 11.20% |
| LUXURY | 1.62% | 8.63% | 5x | 4.91% |
| ELECTRONICS | 3.39% | 10.22% | 3x | 2.40% |
| TRAVEL | 6.14% | 16.29% | 2.6x | 2.53% |
| SERVICES | 5.15% | 11.92% | 2.3x | 2.53% |

All verified fraud rates fall below the 20% threshold, confirming that no single category acts as a near-deterministic fraud rule. The model uses category as a Bayesian prior requiring behavioral confirmation rather than a static classifier.

The GAMBLING index was previously at 103x (documented in the leakage case study); its reduction to 17x after generator retuning, together with the sub-threshold verified fraud rate, confirms it is now a legitimate signal.


Section 5: Known Limitations

Recall Ceiling (66.7%)

Theoretical maximum recall is imposed by deliberate label noise design. The 0.5% false positive rate in fp_rate creates labels that are behaviorally unlearnable. Recall approaching this ceiling represents optimal behavior.
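The 66.7% figure can be reproduced with a short calculation, under the assumption that `fp_rate` applies to the legitimate population and the 1% false-negative rate to true fraud (the exact noise semantics are an assumption here):

```python
# Noise parameters from the v3 training setup.
target_share = 0.01   # true fraud injected by the generator
fn_rate = 0.01        # true frauds relabeled as legitimate
fp_rate = 0.005       # legitimate transactions relabeled as fraud

learnable_pos = target_share * (1 - fn_rate)   # frauds that keep a fraud label
noise_pos = (1 - target_share) * fp_rate       # legitimate rows labeled fraud

# The noise_pos labels are behaviorally unlearnable, so the best achievable
# recall over all labeled positives is bounded by:
recall_ceiling = learnable_pos / (learnable_pos + noise_pos)
print(round(recall_ceiling, 3))  # 0.667
```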

Silver ETL Eager Execution

Sequence features using .over() window functions trigger eager in-memory execution despite Polars lazy API usage. Datasets significantly exceeding available RAM will hit memory pressure. Roadmap: transition to a stateful streaming pre-aggregation pass.

Campaign Detection

Coordinated attack signatures require graph-based reasoning over entity relationships. Individual transactions in a campaign are often behaviorally indistinguishable from legitimate ones when viewed in isolation—this is a structural limitation of single-transaction classifiers.

Feature Leakage Case Study

While designing an XGBoost-based fraud detection model for flagging potentially fraudulent transactions, problems were discovered that made it unsuitable for real-time scoring. The issue revolves around a single feature dominating the model's decision making.

Transactions were synthetically generated for 10,000 customers, roughly 6M transactions in total. After feature engineering, the results were consolidated into a gold_master_table used by XGBoost for training. An AUC score of 0.9079 was achieved, which felt more realistic than the previous test on a smaller dataset (4.3M transactions, AUC 0.97). However, the crux of the issue became apparent when the top features by importance were checked:

Feature importances (AUC: 0.9079):

| Feature | Importance |
|---|---|
| amount | 0.9632 |
| escalating_amounts_flag | 0.0093 |
| cf_night_tx_ratio | 0.0049 |
| transaction_sequence_number | 0.0048 |
| rapid_fire_escalation_flag | 0.0048 |
| time_since_last_transaction | 0.0042 |
| transaction_channel | 0.0039 |
| merchant_category_switch_flag | 0.0027 |
| t.merchant_category | 0.0021 |
| card_present | 0 |

What this means is that when the model sees a suspicious transaction amount, there is a high probability it will flag the transaction as fraud without considering any other characteristics. This is sub-optimal: the model should weigh multiple constraints such as temporal factors and geographic behavior. To address this, two strategies were evaluated: removing amount as a training feature to test performance on purely behavioral flags, or binning the feature to reduce reliance on exact values. Binning produced a negligible change in feature importance, so amount was removed and the training script executed again, which revealed the underlying issue in the system. The AUC dropped to 0.5868, a significant decrease from the previous result, but the resulting feature importance distribution was more revealing:

Feature importances (AUC: 0.5868):

| Feature | Importance |
|---|---|
| escalating_amounts_flag | 0.8865 |
| transaction_channel | 0.0227 |
| time_since_last_transaction | 0.0208 |
| transaction_sequence_number | 0.0203 |
| cf_night_tx_ratio | 0.0194 |
| rapid_fire_escalation_flag | 0.0172 |
| t.merchant_category | 0.0076 |
| merchant_category_switch_flag | 0.0057 |

This indicates that the engineered behavioral features carry too little predictive power for the model to exploit; essentially, the model cannot distinguish them from normal variation. Re-tuning the fraud generator is required to create distinctive behavioral signals, and the pipeline must ensure those features are engineered effectively and carried through to the gold table, so that each fraud signature has distinctive characteristics the model can capture.

Analysis of the pipeline identified a significant gap: the behavioral features engineered in Rust were being dropped before reaching XGBoost:

  1. The "Silently Dropped" Features: In src/etl/features/sequence.rs, high-value signals are calculated that would likely address the 0.58 AUC problem, but they are missing from the ClickHouse tables:

    • amount_deviation_z_score: This measures if ₹5,000 is "normal" for that specific customer. Without this, the model only sees the absolute ₹5,000 and assumes it's fraud because the average transaction is ₹500.
    • fraud_type & campaign_id: These are currently calculated but not stored in the Silver layer.
    • Granular Anomalies: geo_anomaly, device_anomaly, and ip_anomaly are being calculated in the transformation but aren't being selected in the final Gold table join.
  2. The Gold Table Join Is Too Narrow: The run_gold_master function in src/bin/etl.rs only pulls a small subset of columns from the Silver tables, ignoring the very features needed to replace the "Amount Shortcut."
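The dropped amount_deviation_z_score is simple to state in code. A stdlib sketch of the computation (the real implementation lives in the Rust ETL layer; the function name mirrors the feature, the rest is illustrative):

```python
from statistics import mean, pstdev

def amount_deviation_z_score(history: list[float], amount: float) -> float:
    """How many standard deviations this amount sits from the customer norm."""
    mu = mean(history)
    sigma = pstdev(history) or 1.0   # guard against zero-variance customers
    return (amount - mu) / sigma

# A customer who normally spends around Rs. 500 suddenly spends Rs. 5,000:
history = [480.0, 510.0, 495.0, 520.0, 490.0]
z = amount_deviation_z_score(history, 5_000.0)
print(z > 3)  # True: a huge customer-relative deviation
```

This is exactly the signal that lets the model judge whether ₹5,000 is "normal" for that specific customer rather than reacting to the absolute amount.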

The Plan to Fix It:

Step 1: Repair the ETL Pipeline (the "Plumbing" fix)

  • Update the CREATE TABLE statement for fact_transactions_silver to include the missing behavioral columns.
  • Update the run_gold_master query to pull these features into the final training set.
  • Outcome: The model will finally "see" the Z-Score and the behavioral context.

Step 2: Re-tune the Generator (the "Signal" fix)

  • Modify the fraud_rules.yaml configuration so that fraudulent transaction amounts overlap with legitimate amounts.
  • Outcome: The model is forced to stop using "High Amount" as a shortcut and start using the "Z-Score" and "Velocity" features fixed in Step 1.

Feature importances (AUC: 0.8246):

| Feature | Importance |
|---|---|
| amount_deviation_z_score | 0.9069 |
| escalating_amounts_flag | 0.0468 |
| transaction_sequence_number | 0.0122 |
| cf_night_tx_ratio | 0.0121 |
| time_since_last_transaction | 0.0075 |
| transaction_channel | 0.0047 |
| rapid_fire_transaction_flag | 0.0033 |
| merchant_category_switch_flag | 0.0033 |
| t.merchant_category | 0.0033 |
| card_present | 0.0000 |

The score improved to 0.8246 just by fixing the plumbing: no new features, no generator retuning, no architectural changes. The signals were there the whole time. However, amount_deviation_z_score at 90% is the new dominant feature. It is better than raw amount, since it is customer-relative and therefore more meaningful, but it is still a single feature carrying almost everything.

The generator needs to be retuned so that fraud and legitimate amount distributions overlap. If fraudsters are forced to transact at amounts that are normal for the targeted customer, the Z-score becomes less dominant and the behavioral features have to carry more weight.

When fraud amounts overlap with legitimate amounts, the model must rely on:

  • cf_night_tx_ratio — when does this customer normally transact?
  • rapid_fire_transaction_flag — velocity anomaly
  • merchant_category_switch_flag — behavioral deviation
  • time_since_last_transaction — timing patterns

Feature importances (AUC: 0.7960):

| Feature | Importance |
|---|---|
| amount_deviation_z_score | 0.9337 |
| escalating_amounts_flag | 0.0229 |
| cf_night_tx_ratio | 0.0108 |
| transaction_sequence_number | 0.0103 |
| time_since_last_transaction | 0.0071 |
| transaction_channel | 0.0060 |
| rapid_fire_transaction_flag | 0.0044 |
| t.merchant_category | 0.0026 |
| merchant_category_switch_flag | 0.0023 |
| card_present | 0.0000 |

After retuning the generator to create more overlap and amplify behavioral signals, the score dropped slightly to 0.7960. Fraud amounts now blend into legitimate ranges, so amount_deviation_z_score has less to work with, and the model is being forced away from the amount shortcut. AUC dropped because the problem genuinely got harder.

However, amount_deviation_z_score is still at 93% despite the overlap. This means the Z-score is still capturing enough separation between fraud and legitimate amounts to dominate: the overlap wasn't aggressive enough. The problem appears to be that fraud amounts are mostly specific values (₹5,000, ₹8,500, ₹12,000), while legitimate transactions cluster around ₹660. The Z-score therefore still reads the situation as "this customer normally spends ₹660, this transaction is ₹8,500, suspicious." The relative deviation is still huge.

The Z-score only becomes less dominant when fraudsters transact at amounts that are normal for that specific customer. This requires the generator to look up the customer's monthly_spend and generate fraud amounts within their normal range:

```yaml
account_takeover:
  amount_strategy: "customer_normal_range"  # instead of fixed high_value_amounts
  amount_multiplier: 0.8_to_1.2             # within the customer's normal band
```

This makes fraud amounts customer-relative rather than absolute. Before applying this change, the amount-derived features were removed one at a time, starting with escalating_amounts_flag, to evaluate the strength of the behavioral signals on their own.

Feature importances (AUC: 0.6704):

| Feature | Importance |
|---|---|
| amount_deviation_z_score | 0.9101 |
| transaction_sequence_number | 0.0188 |
| cf_night_tx_ratio | 0.0165 |
| escalating_amounts_flag | 0.0152 |
| time_since_last_transaction | 0.0121 |
| transaction_channel | 0.0117 |
| rapid_fire_transaction_flag | 0.0067 |
| t.merchant_category | 0.0045 |
| merchant_category_switch_flag | 0.0044 |
| card_present | 0.0000 |

Removing escalating_amounts_flag dropped AUC to 0.6704 but amount_deviation_z_score remains at 91%.

Every amount-derived feature removed makes the Z-score more dominant. The model is completely anchored to amount-relative signals. The behavioral features — cf_night_tx_ratio, rapid_fire_transaction_flag, merchant_category_switch_flag — collectively contribute approximately 5-6% of decisions.

Removing the Z-score and executing the training once more provides the definitive test. The resulting AUC represents the pure behavioral signal floor — no amount, no Z-score, no escalating amounts. Just time, velocity, channel, merchant, sequence.

Feature importances (AUC: 0.5572):

| Feature | Importance |
|---|---|
| transaction_channel | 0.1691 |
| escalating_amounts_flag | 0.1677 |
| time_since_last_transaction | 0.1644 |
| transaction_sequence_number | 0.1602 |
| cf_night_tx_ratio | 0.1532 |
| rapid_fire_transaction_flag | 0.0851 |
| t.merchant_category | 0.0679 |
| merchant_category_switch_flag | 0.0325 |

The score dropped to 0.5572, and for the first time, feature importance is evenly distributed. No single feature exceeds 17%, with every behavioral signal contributing. This structure represents a balanced feature set.

The challenge is that none of these features possess sufficient signal strength to detect fraud reliably. The model is unable to distinguish fraud because fraudsters in the simulation behave too similarly to legitimate customers. For instance:

  • transaction_channel at 17% — channel bias exists but is weak.
  • cf_night_tx_ratio at 15% — night patterns exist but fraud is not concentrated enough at night to be distinctive.
  • rapid_fire_transaction_flag at 8.5% — velocity fraud occurs but not with sufficient frequency.
  • merchant_category_switch_flag at 3.25% — almost no signal. Fraudsters shop at similar merchants as legitimate customers.

To address this at the root level, the logic responsible for injecting fraudulent signatures and behaviors requires refinement to increase signal strength for training a behaviorally-driven fraud model.

In the RiskFabric project, fraud.rs is primarily responsible for injecting fraud labels and altering transaction behavior according to the fraud signature. Two configurations drive transaction behavior: geo_anomaly_prob and device_anomaly_prob. Inspection of geo_anomaly_prob identified significant limitations:

If a transaction has the geo_anomaly flag set to true, its coordinates are randomized from the global range. While this creates an anomaly, it does not provide a behavioral signal that the model can learn without access to the customer's "Home" coordinates or a feature like "Distance from Home." Consequently, the model only evaluates final_lat and final_lon. Since legitimate transactions are also distributed across India (clustered around specific homes), a random coordinate appears normal to a model lacking home location context.

To resolve this, a new feature, Spatial Velocity, was introduced in the ETL layer. This measures: distance(txn_N, txn_N-1) / time(txn_N, txn_N-1), enabling the model to identify high-velocity spatial anomalies, such as transactions occurring in distant cities within short time intervals.
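The formula can be sketched directly, assuming great-circle (haversine) distance; the actual ETL code may use a different distance approximation, and the record layout here is illustrative:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlmb = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def spatial_velocity_kmh(prev_txn, txn):
    """distance(txn_N, txn_N-1) / time(txn_N, txn_N-1), in km/h."""
    dist = haversine_km(prev_txn["lat"], prev_txn["lon"], txn["lat"], txn["lon"])
    hours = max((txn["ts"] - prev_txn["ts"]) / 3600.0, 1e-9)  # ts: epoch seconds
    return dist / hours

# Delhi -> Mumbai (~1,150 km) thirty minutes apart: impossible travel.
delhi = {"lat": 28.6139, "lon": 77.2090, "ts": 0}
mumbai = {"lat": 19.0760, "lon": 72.8777, "ts": 1800}
print(spatial_velocity_kmh(delhi, mumbai) > 2000)  # True
```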

Feature importances (AUC: 0.6868):

| Feature | Importance |
|---|---|
| spatial_velocity | 0.6126 |
| escalating_amounts_flag | 0.1847 |
| time_since_last_transaction | 0.0652 |
| merchant_category_switch_flag | 0.0643 |
| t.merchant_category | 0.0167 |
| transaction_sequence_number | 0.0148 |
| rapid_fire_transaction_flag | 0.0145 |
| transaction_channel | 0.0144 |
| cf_night_tx_ratio | 0.0128 |
| card_present | 0.0000 |

The AUC increased to 0.6868 from a single feature addition. spatial_velocity at 61% became the dominant behavioral feature—a genuine behavioral signal. This also had a cascading effect on other features:

  • merchant_category_switch_flag increased from 3.25% → 6.43%
  • time_since_last_transaction changed from 16% → 6.52%

Several issues still required attention:

  • spatial_velocity at 61% was too dominant, capturing almost the entire geo_anomaly fraud signal. The implementation at the time teleported fraudsters to random coordinates, almost guaranteed to trigger the impossible travel flag.
  • cf_night_tx_ratio decreased to 1.28%, as night behavior was not sufficiently distinctive in the generator.
  • card_present remained at 0%, indicating CNP fraud was not being captured.

Analysis of the low cf_night_tx_ratio (1.28%) led to an audit of the hourly distribution, particularly under Account Takeover (ATO) fraud. While hourly_weights peaked in the early morning and late evening to simulate attacker activity, the "Night Ratio" was not a strong signal due to legitimate late-night spending and a lack of sharpness in the ATO peak.

This was addressed by updating account_takeover hourly weights to concentrate over 70% of transactions between 00:00 and 04:00. Additionally, the hour_deviation_from_norm feature was introduced in the ETL layer to capture temporal anomalies at the transaction level by determining the absolute deviation from a customer's average transaction hour.
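A minimal sketch of hour_deviation_from_norm; the wrap-around handling (23:00 against a 01:00 norm is a 2-hour gap, not 22) is an assumption about the intent, since the text only specifies absolute deviation from the customer's average hour:

```python
def hour_deviation_from_norm(avg_hour: float, txn_hour: int) -> float:
    """Circular distance between a transaction hour and the customer's norm."""
    raw = abs(txn_hour - avg_hour)
    return min(raw, 24 - raw)   # wrap around the 24-hour clock (assumed)

# Customer normally transacts around 14:00; an ATO burst at 02:00 stands out.
print(hour_deviation_from_norm(14.0, 2))   # 12.0
print(hour_deviation_from_norm(1.0, 23))   # 2.0
```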

Feature importances (AUC: 0.7005):

| Feature | Importance |
|---|---|
| spatial_velocity | 0.6439 |
| escalating_amounts_flag | 0.1450 |
| merchant_category_switch_flag | 0.0664 |
| time_since_last_transaction | 0.0620 |
| t.merchant_category | 0.0165 |
| transaction_channel | 0.0149 |
| rapid_fire_transaction_flag | 0.0136 |
| hour_deviation_from_norm | 0.0130 |
| cf_night_tx_ratio | 0.0124 |
| transaction_sequence_number | 0.0122 |

AUC increased to 0.7005—a small but consistent improvement. hour_deviation_from_norm appeared at 1.3%, registering as a signal. cf_night_tx_ratio remained at 1.24%, and escalating_amounts_flag decreased from 18% → 14.5%, indicating behavioral features were gradually gaining influence.

Despite this, night-based features contributed only ~2.5% combined. The sharpened hourly weights provided marginal benefit, but cf_night_tx_ratio dilution persisted—a small number of ATO transactions does not significantly shift a customer-level ratio.

A higher impact correction involved card_present, which was at 0%. Correcting the 'wiring' for this feature was identified as a high-impact fix, as CNP transactions are by definition not card_present.

Feature importances (AUC: 0.7500):

| Feature | Importance |
|---|---|
| amount_deviation_z_score | 0.4973 |
| spatial_velocity | 0.3402 |
| merchant_category_switch_flag | 0.0497 |
| escalating_amounts_flag | 0.0299 |
| time_since_last_transaction | 0.0289 |
| transaction_sequence_number | 0.0126 |
| cf_night_tx_ratio | 0.0117 |
| t.merchant_category | 0.0091 |
| transaction_channel | 0.0082 |
| hour_deviation_from_norm | 0.0073 |

After restoring the Z-score as a feature, its dominance remained strong but was lower than in previous instances, supplemented by spatial_velocity.

Feature importances (AUC: 0.7491):

| Feature | Importance |
|---|---|
| amount_deviation_z_score | 0.5038 |
| spatial_velocity | 0.3026 |
| card_present | 0.0499 |
| merchant_category_switch_flag | 0.0445 |
| time_since_last_transaction | 0.0271 |
| escalating_amounts_flag | 0.0186 |
| cf_night_tx_ratio | 0.0117 |
| transaction_sequence_number | 0.0114 |
| transaction_channel | 0.0085 |
| t.merchant_category | 0.0083 |

After correcting the CNP wiring, card_present importance increased to 5%. However, rapid_fire_transaction_flag disappeared from the top 10 features. Analysis of the code revealed that this flag used a 300-second (5-minute) threshold, while max_interval_seconds for velocity abuse was set to a random minute within an hour, which was too coarse for signatures depending on second-level timing.

A more realistic temporal pattern for fraud bursts was implemented, ensuring transactions occur in tighter sequences (e.g., seconds apart) via max_burst_interval_seconds. This creates a sharper behavioral signal for the rapid_fire_transaction_flag to capture.
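The interplay between the 300-second threshold and burst spacing can be sketched as follows; timestamps and the flag layout are illustrative, not the ETL's actual schema:

```python
RAPID_FIRE_THRESHOLD_S = 300  # the 5-minute threshold described above

def rapid_fire_flags(timestamps: list[int]) -> list[int]:
    """Flag every transaction arriving < 300s after the previous one."""
    flags = [0]  # the first transaction in a sequence is never rapid-fire
    for prev, curr in zip(timestamps, timestamps[1:]):
        flags.append(1 if curr - prev < RAPID_FIRE_THRESHOLD_S else 0)
    return flags

# A burst seconds apart is captured; hourly spacing is not.
print(rapid_fire_flags([0, 12, 25, 3600, 3620]))  # [0, 1, 1, 0, 1]
```

With burst intervals of a random minute within an hour, most gaps exceed 300 seconds and the flag never fires, which is why tightening max_burst_interval_seconds restores the signal.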

Feature importances (AUC: 0.8085):

| Feature | Importance |
|---|---|
| time_since_last_transaction | 0.3280 |
| rapid_fire_transaction_flag | 0.2707 |
| amount_deviation_z_score | 0.2153 |
| spatial_velocity | 0.0727 |
| card_present | 0.0614 |
| merchant_category_switch_flag | 0.0213 |
| escalating_amounts_flag | 0.0105 |
| transaction_sequence_number | 0.0049 |
| hour_deviation_from_norm | 0.0045 |
| transaction_channel | 0.0042 |

Amount-based features are now in third place at 21%. Temporal behavioral signals—time_since_last_transaction and rapid_fire_transaction_flag—together drive 60% of model decisions. This allows for fraud detection based on behavioral patterns rather than absolute cost, providing logic suitable for real-time scoring of velocity abuse, ATO, and CNP fraud without flagging high-value legitimate transactions.

merchant_category_switch_flag at 2.1% and hour_deviation_from_norm at 0.45% both have potential for growth through future cross-card coordination work.

Performance Benchmarks

This document tracks the evolution of RiskFabric generation performance, focusing on the journey to the 100k+ Transactions Per Second (TPS) milestone.

Test Environment

  • Workload: 10,000 Customers (~15,000 Accounts, ~150,000 Transactions)
  • Format: Parquet (Snappy compression)
  • Hardware: Single Workstation (Multi-threaded Rust)

Milestone Log

1. Initial Port (Sequential Multi-Pass)

Date: February 2026

  • Architecture: Sequential loops for generation, fraud injection, and campaign mutations. Cryptographic Sha256 hashing for reproducibility.
  • Transaction Gen Time: 44.11 seconds
  • Total Runtime: 48.76 seconds
  • Throughput: ~3,400 TPS
  • Bottleneck: Cryptographic hashing overhead and high memory access in multiple sequential passes.

2. Parallel Injection & Hash Optimization

  • Architecture: Parallelized the inject pass using rayon. Optimized hash01 to reduce string allocations.
  • Transaction Gen Time: 35.86 seconds
  • Total Runtime: 40.35 seconds
  • Throughput: ~4,100 TPS
  • Gain: +20% improvement.

3. The "One-Pass" Unified Architecture (Current)

Date: February 2026

  • Architecture:
    • Unified Loop: All logic (Selection, Generation, Fraud, Campaigns) handled in a single parallel pass.
    • Fast PRNG: Swapped Sha256 for StdRng (seeded per card for stability).
    • Reduced Allocations: Replaced UUIDs with synthetic IDs and pre-formatted timestamps.
  • Transaction Gen Time: 0.82 seconds
  • Total Runtime: 4.40 seconds (Includes all file I/O)
  • Throughput: ~182,000 TPS
  • Gain: 53x improvement from baseline.
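The "seeded per card" idea can be illustrated in Python (the engine itself uses Rust's StdRng; only the determinism property is shown here, and the seed-mixing scheme is an assumption):

```python
import random

def card_rng(global_seed: int, card_index: int) -> random.Random:
    """Mix the run seed with a stable per-card index (mixing scheme assumed)."""
    return random.Random((global_seed << 32) | card_index)

def gen_amounts(global_seed: int, card_index: int, n: int) -> list[float]:
    rng = card_rng(global_seed, card_index)
    return [round(rng.uniform(50, 5000), 2) for _ in range(n)]

# Same seed + same card => identical transactions, regardless of which
# worker thread processes the card or in what order.
print(gen_amounts(42, 7, 3) == gen_amounts(42, 7, 3))  # True
```

Because each card owns its RNG stream, cards can be generated in any order across Rayon workers without breaking reproducibility.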

Summary of Optimization Impact

| Stage | Baseline (s) | Optimized (s) | Speedup |
|---|---|---|---|
| Customer Gen | 0.147 | 0.155 | 1x |
| Transaction Gen | 44.110 | 0.823 | 53.6x |
| Parquet Write (Txn) | 3.696 | 2.640 | 1.4x |
| Total Pipeline | 48.763 | 4.402 | 11x |

4. High-Fidelity One-Pass (Tuned)

Date: February 2026

  • Architecture: Added profile-specific geo-anomalies, campaign-coordinated spatial signals, and dynamic failure reasons.
  • Performance: Maintained throughput at ~180,000 TPS despite increased logic complexity.
  • Result: High-quality training data with sharp spatial/temporal signals generated in < 4 seconds for 150k+ transactions.

5. Real-Time Streaming Throughput (Kafka)

Date: March 15, 2026

  • Architecture:
    • Async I/O: Leverages tokio and rdkafka for non-blocking Kafka publication.
    • Self-Correcting Limiter: Measures per-message latency to adjust micro-sleep intervals.
    • Verification Mode Overhead: Minimal (local CSV writes are buffered).
  • Target Throughput: 100 tx/s (Configurable)
  • Actual Throughput: 99.85 tx/s (Average over 1 hour)
  • Publication Latency (P99): 4.2ms to local Kafka broker.
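The limiter's core arithmetic can be sketched synchronously (the real implementation is async via tokio and rdkafka; the class and function names here are illustrative):

```python
import time

def remaining_sleep(interval: float, latency: float) -> float:
    """Sleep budget left after one publish call; never negative."""
    return max(0.0, interval - latency)

class RateLimiter:
    def __init__(self, target_tps: float):
        self.interval = 1.0 / target_tps   # per-message time budget

    def pace(self, publish):
        """Run one publish call, then sleep off the remaining budget."""
        start = time.perf_counter()
        publish()                          # e.g. a Kafka produce call
        latency = time.perf_counter() - start
        time.sleep(remaining_sleep(self.interval, latency))

# At 100 tx/s (10 ms budget), a 4.2 ms publish leaves ~5.8 ms of sleep.
print(round(remaining_sleep(0.010, 0.0042) * 1000, 1))  # 5.8
```

Subtracting the measured latency from the fixed budget is what keeps the observed rate (99.85 tx/s) so close to the 100 tx/s target.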

Throughput Comparison

| Mode | Engine | Transport | Peak Throughput (TPS) |
|---|---|---|---|
| Batch | generate.rs | Local Parquet | ~180,000 |
| Streaming | stream.rs | Kafka Topic | ~1,200 (Unbound)* |

*Note: Streaming throughput is artificially limited to 100 tx/s for realism, but peak unbound performance is ~1,200 tx/s on a single thread.



6. End-to-End Pipeline Stress Test (3M Transactions)

Date: March 2026

  • Workload: 3,334 Customers | 2,984,575 Transactions
  • Scope: Full lifecycle orchestration via stress_test.py (Reset -> Generate -> Ingest -> ETL -> Gold).

| Pipeline Stage | Duration (s) | Throughput / Info |
|---|---|---|
| Generation | 16.96 | ~176,000 TPS |
| ClickHouse Ingestion | 25.54 | ~116,000 Rows/s |
| Silver ETL (Parallel) | 56.61 | Features: Sequence, Merchant, Customer |
| Gold Finalization | 11.32 | Materialized Join |
| Total End-to-End | 110.43 | ~1.8 Minutes |

Benchmark Conclusions

The stress test confirms that the One-Pass Architecture successfully scales to multi-million row datasets while maintaining near-linear throughput. The entire pipeline, including heavy feature engineering and entity joins, completes in under 2 minutes for 3 million transactions, making it suitable for rapid iterative model development.


ETL Performance Optimizations

Summary

The RiskFabric ETL pipeline (etl.rs) is designed for high-fidelity feature engineering using a hybrid Polars and ClickHouse architecture. While functionally robust, the current implementation contains several architectural bottlenecks that limit its scalability to billion-row datasets. This document outlines the identified performance issues and the strategic roadmap for transitioning to a high-concurrency, zero-copy pipeline.

Architectural Decisions

To achieve enterprise-grade throughput, the pipeline is moving toward a Parallel Stream-Oriented Architecture.

The primary decision is the shift to Asynchronous Pipeline Orchestration. By utilizing tokio or rayon, the independent "Silver" ETL stages (Customer, Merchant, Device/IP) will be executed in parallel. This maximizes multi-core utilization and significantly reduces the total wall-clock time of the transformation phase.

The second decision involves Zero-Copy Data Exchange. The current "Double Buffering" strategy—where data is fetched into memory, stored as a vector, and then parsed—is slated for replacement with a streaming architecture. By piping the raw output of the ClickHouse process directly into the Polars ParquetReader and vice-versa, the memory footprint is halved, and intermediate disk I/O for temporary Parquet files is eliminated.

Finally, the transition to Native Driver Connectivity via clickhouse-rs is prioritized over the current podman exec method. This eliminates the process overhead of spawning container instances for every query and provides superior type safety and error propagation.
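The zero-copy piping idea can be sketched with a stand-in consumer process; the real target command (e.g. a clickhouse-client INSERT with a Parquet FORMAT clause) is an assumption and is replaced here by a byte-counting Python child so the sketch is self-contained:

```python
import subprocess
import sys

# Stand-in consumer: reads all of stdin and prints the byte count.
COUNTER = [sys.executable, "-c",
           "import sys; print(len(sys.stdin.buffer.read()))"]

def stream_to_process(cmd, chunks) -> str:
    """Pipe in-memory buffers straight into a child's stdin: no temp files."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    for chunk in chunks:              # e.g. Parquet buffers held in memory
        proc.stdin.write(chunk)
    proc.stdin.close()                # EOF tells the consumer to finish
    out = proc.stdout.read()
    proc.wait()
    return out.decode().strip()

print(stream_to_process(COUNTER, [b"abc", b"defg"]))  # 7
```

Replacing temporary `data/tmp_*.parquet` files with this kind of pipe is what halves the memory footprint and removes the intermediate disk I/O described above.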

System Integration

The optimized ETL system remains the central bridge between the Data Warehouse (ClickHouse) and the Machine Learning Pipeline. By maintaining the Parquet exchange format but moving it through memory pipes rather than physical files, the system ensures that the "handshake" between Polars and ClickHouse remains high-speed while reducing infrastructure dependencies and disk wear.

Performance Benchmarks & Results

| Implementation | Wall-Clock Time | CPU User Time | Memory / Disk Overhead |
|---|---|---|---|
| Baseline (Sequential) | ~22.5 seconds | ~19.6 seconds | High (Temp files + buffers) |
| Optimized (Parallel + Pipes) | ~21.1 - 22.5 seconds | ~32.4 seconds | Low (Streaming + Zero temp files) |

Analysis

The implementation of Rayon-based parallelism and Direct Stdin Piping resulted in a significant increase in CPU utilization (~65% increase in User time), indicating that the Rust transformation engine is now processing multiple stages concurrently.

However, the Wall-Clock Time remained relatively flat. This confirms that the pipeline is currently I/O Bound by the ClickHouse single-node instance. Spawning six parallel podman exec processes causes resource contention at the database level, preventing a linear speedup.

Implemented Improvements

  1. Stage Parallelism: All Silver ETL functions now run concurrently via rayon.
  2. Streaming Ingestion: Parquet data is piped directly from Polars to ClickHouse stdin, eliminating data/tmp_*.parquet file I/O.
  3. Thread-Safe Workspace: Each parallel stage uses isolated logic and unique identifiers to prevent race conditions.
  4. Memory Optimization: Replaced large Vec<u8> output buffers with direct process pipes where possible.

Knowledge Base

Documentation of technical hurdles, resolutions, and ongoing developmental challenges encountered during the project.

Technical Issues & Resolutions

Summary

The issues.md document acts as the primary engineering log for RiskFabric. It captures architectural hurdles, environment-specific bugs, and performance bottlenecks encountered during development, along with their implemented or proposed resolutions.

Design Intent

This document serves as Institutional Knowledge for the project. In complex simulations, the most difficult bugs often arise from the interaction between system layers (e.g., Rust → Kafka → Python). Documenting these issues provides a roadmap for future optimizations and prevents the repetition of architectural errors. Every entry is paired with a specific technical fix validated through benchmarking or regression testing.


🛠️ Data Engine & Type Safety

1. Polars UInt8 Series Creation Error

  • Problem: Polars returned a ComputeError when materializing DataFrames containing 8-bit unsigned integers. This blocked features like is_weekend and other boolean-adjacent flags.
  • Resolution: All flag and counter columns were migrated to DataType::UInt32 to ensure native Polars support and broader ML library compatibility.

2. Polars is_in Panic on Int8

  • Problem: The .dt().weekday() function returns Int8, which caused kernel-level panics during .is_in() membership checks.
  • Resolution: Output from .weekday() is now explicitly cast to Int32, ensuring the comparison set (e.g., &[6i32, 7i32]) matches the target type exactly.

3. ClickHouse Timestamp Precision

  • Problem: Standard DateTime64 ingestion in ClickHouse failed when processing ISO 8601 strings with nanosecond precision.
  • Resolution: Timestamps are landed as String in the Bronze layer. High-precision parsing is deferred to the Silver ETL stage using Polars' .str().to_datetime() for increased flexibility.

🚀 Performance & Scaling

4. Out of Memory (OOM) in Network Linkage

  • Problem: Multi-million row many-to-many joins on IP and User Agent entities caused combinatorial explosions, leading to process termination.
  • Resolution: The architecture shifted from an Edge-List Graph approach to an Entity Reputation model. Risk is now calculated at the entity level and joined back to transactions, reducing complexity from $O(N^2)$ to $O(N)$.

5. OOM in Large-Scale Generation

  • Problem: Single-pass generation of 17M+ transactions exceeded available system RAM.
  • Resolution: The generator was refactored to use a Chunked One-Pass Architecture. The population is processed in batches of 5,000 entities, with transactions flushed to Parquet incrementally to maintain a constant memory profile.
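The chunked driver can be sketched in a few lines of Python (the generator body and the `flush` callback are stand-ins; only the 5,000-entity batch size comes from the text). Each batch's transactions are flushed before the next batch is generated, so peak memory stays roughly constant regardless of population size.

```python
CHUNK_SIZE = 5_000  # entities per batch, as in the chunked one-pass design

def generate_batch(entity_ids):
    # Stand-in for the real transaction generator.
    return [{"entity": e, "amount": 1.0} for e in entity_ids]

def run_chunked(total_entities, flush):
    for start in range(0, total_entities, CHUNK_SIZE):
        batch = list(range(start, min(start + CHUNK_SIZE, total_entities)))
        flush(generate_batch(batch))  # e.g. append to a Parquet writer

flushed = []
run_chunked(12_500, flushed.append)
```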

6. Parquet Serialization Bottleneck

  • Problem: Transaction generation required 44 seconds, with 90% of the time spent in disk I/O and Parquet encoding.
  • Resolution: A One-Pass Parallel Architecture was implemented and the Polars chunk size was optimized. This reduced the total runtime to 4.4 seconds, a 10x improvement.

🤖 Machine Learning & Data Science

7. Label Leakage (Near-Perfect AUC)

  • Problem: Early models achieved 0.9993 AUC by learning internal generator flags (e.g., geo_anomaly) instead of behavioral patterns.
  • Resolution: A strict "Operational Feature" Sanitization step was implemented to drop all internal metadata. The training target was also shifted from the perfect fraud_target to the noisy is_fraud label.
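The sanitization step amounts to an allow-list filter before training. In this sketch only `geo_anomaly`, `fraud_target`, and `is_fraud` come from the text; the other column names are hypothetical examples.

```python
# Internal generator metadata that must never reach the model.
OPERATIONAL_COLUMNS = {"geo_anomaly", "fraud_target", "campaign_id"}

all_columns = ["amount", "geo_anomaly", "merchant_category", "fraud_target",
               "campaign_id", "hour_of_day", "is_fraud"]

# Train on the noisy `is_fraud` label; drop it and every operational
# column from the feature set to prevent leakage.
feature_columns = [c for c in all_columns
                   if c not in OPERATIONAL_COLUMNS and c != "is_fraud"]
```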

8. Observed vs. Configured Fraud Rate Discrepancy

  • Problem: The observed fraud rate (~13.6%) appeared higher than the 12% defined in the configuration.
  • Resolution: Validation confirmed that is_fraud deliberately incorporates simulated label noise (3% FP, 10% FN), resulting in a higher observed ratio than the latent ground truth.
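The arithmetic behind the validation: with a latent fraud rate of 12%, a 10% false-negative rate removes some true positives while a 3% false-positive rate mislabels part of the much larger legitimate population, and the net effect pushes the observed rate up to roughly 13.4%, in line with the ~13.6% measured.

```python
# Observed positive rate under label noise:
#   P(observed fraud) = p_true * (1 - FN) + (1 - p_true) * FP
p_true, fp, fn = 0.12, 0.03, 0.10
observed = p_true * (1 - fn) + (1 - p_true) * fp
# 0.12 * 0.90 + 0.88 * 0.03 = 0.1344, i.e. ~13.4%
```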

9. High Fraud Prevalence in Initial Runs

  • Problem: Approximately 86% of customers experienced fraud due to high default configuration values.
  • Resolution: The target_share parameter was tuned to 0.005 (0.5% transaction rate) to align with industry benchmarks for sparse fraud data.

Known Issues

There is ongoing difficulty with Container Runtime Variability. The podman exec calls used in ingestion and ETL pipelines behave inconsistently across Linux and macOS environments, causing failures in the data warehouse loading process. Transitioning to native database drivers is required to eliminate dependency on the host's container CLI.

Furthermore, Memory Management during Reference Extraction is currently insufficient. When processing large OSM PBF files, the prepare_refs.rs binary can consume significant RAM. Implementing a "Spill-to-Disk" strategy for the parallel map-reduce operation is necessary to maintain a memory footprint below 4GB.
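One possible shape for that spill-to-disk map-reduce, sketched in Python on toy data (the real prepare_refs.rs operates on OSM node references): each map task counts references in its chunk and spills the partial result to disk, and the reduce pass streams the spill files back one at a time instead of holding every partial in RAM.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

def map_chunk(idx, chunk, spill_dir: Path) -> Path:
    # Count references in one chunk, then spill the partial to disk.
    path = spill_dir / f"part-{idx}.spill"
    path.write_bytes(pickle.dumps(Counter(chunk)))
    return path

def reduce_spills(paths):
    # Merge spills one at a time, so only one partial is in memory.
    total = Counter()
    for p in paths:
        total.update(pickle.loads(p.read_bytes()))
        p.unlink()
    return total

with tempfile.TemporaryDirectory() as d:
    spill_dir = Path(d)
    chunks = [[1, 2, 2], [2, 3], [1, 1]]  # stand-ins for OSM ref batches
    spills = [map_chunk(i, c, spill_dir) for i, c in enumerate(chunks)]
    merged = reduce_spills(spills)
```

The memory ceiling then depends on the chunk size and the size of a single spill file, not on the total number of references in the PBF.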