RiskFabric
RiskFabric is a fraud intelligence platform that generates synthetic Indian payment transaction data, processes it through a Medallion ETL pipeline, and produces trained fraud detection models.
✨ Key Features
- Extreme Throughput: Achieves ~182,000 Transactions Per Second (TPS) using a parallelized "One-Pass" architecture.
- Agent-Based Realism: Simulates the full lifecycle of Customers, Accounts, and Cards, with behavioral spend profiles driven by real-world heuristics.
- Geographic Fidelity: Integrates OpenStreetMap (OSM) India data and Uber H3 hexagonal indexing for hyper-realistic spatial spend patterns and location anomalies.
- Sophisticated Fraud Injection: Includes signatures for UPI Scams, Account Takeover (ATO), Card Not Present (CNP) fraud, and coordinated campaigns.
- Medallion Data Architecture: A full pipeline taking data from Bronze (Raw) to Silver (Feature Engineered) to Gold (ML-Ready).
- ML Mastery: Built-in leakage prevention and simulated label noise (False Positives/Negatives) to ensure models are robust and production-ready.
🛠️ Tech Stack
- Core Engine: Rust (Rayon for parallelization, Rand for deterministic simulation).
- Real-time Streaming: Redpanda (Kafka-compatible), rdkafka, and the Tokio async runtime.
- Data Processing: Polars 0.51.0 (Lazy API & high-performance transformation).
- Data Warehouse: PostgreSQL (Spatial/OSM staging), ClickHouse (High-volume transactions), and dbt (Analytical enrichment).
- Feature Store: Redis (Low-latency state for real-time Z-scores and behavior).
- Data Ingestion: dlt (Data Load Tool) for MDS integration.
- Machine Learning: Python (XGBoost) with real-time inference via scorer.py.
- Infrastructure: Docker/Podman orchestration with Prometheus and Grafana for observability.
📁 Project Structure
🧠 Core Simulation (src/)
- generators/: Agent-Based Modeling (ABM) logic, entity creation, and fraud mutation engines.
- models/: Rust structures for Customers, Accounts, Cards, and Transactions.
- bin/: CLI binaries for data generation (generate.rs), streaming (stream.rs), and preparation.
- config.rs: Centralized, type-safe configuration engine for simulation parameters.
🥈 ETL & Data Warehouse (src/etl/ & warehouse/)
- etl/: Multi-stage Polars transformation pipeline (Silver/Gold feature engineering).
- warehouse/: dbt project for geographic enrichment and merchant risk profiling using PostGIS.
- dlt/: MDS integration for automated data lake ingestion.
🤖 Machine Learning (src/ml/)
- train_xgboost.py: Training pipeline with feature sanitization and OOT validation.
- scorer.py: Real-time inference service consuming from Kafka and stateful Redis features.
- seed_redis.py: Point-in-time state synchronization between the warehouse and feature store.
🛠️ Infrastructure & Docs
- docker-compose.yml: Orchestrated local stack (ClickHouse, Postgres, Redpanda, Redis, Grafana).
- documentation/: Architectural docs and theory of operation (mdBook).
- data/config/: Behavioral rules and system tuning YAML configurations.
📈 Benchmarks (150k Txns)
| Architecture | Throughput | Total Time | Speedup |
|---|---|---|---|
| Sequential Port | 3,400 TPS | 48.7s | 1x |
| Optimized One-Pass | 182,000 TPS | 4.4s | 53x |
Developed by harshafaik
Your First Generation
Summary
This tutorial provides a step-by-step operational guide for initializing the RiskFabric environment and executing a full synthetic data lifecycle—from world-building to model training.
Prerequisites
The following components must be installed and available:
- Rust (Latest Stable)
- Docker or Podman (with Docker Compose support)
- Python 3.10+
- Git
Step 0: Infrastructure Setup
The simulation requires several backing services (Postgres, ClickHouse, Redpanda, Redis). These are orchestrated via Docker Compose and must be running before the generation binaries are executed.
# Start the local service stack
docker-compose up -d
Step 1: World Building (Level 0)
Before generating transactions, the physical reference data must be prepared by extracting OpenStreetMap nodes, enriching them via dbt, and exporting them to Parquet.
# 1. Extract raw OSM nodes to Postgres
cargo run --bin prepare_refs -- extract-nodes
# 2. Enrich & Transform (Spatial Joins and Risk Categorization)
dbt run --project-dir warehouse
# 3. Export to Parquet for the generator
# Option A: Rust-based export
cargo run --bin export_references
# Option B: DLT-based export (Recommended)
python dlt/pipelines.py export
The Database Transformation Process
During this step, the Postgres database performs three critical operations to build the "Physical World":
- Ingestion: Millions of raw coordinates are copied from OSM PBF files into the staging area.
- Spatial Anchoring: dbt uses PostGIS to perform spatial intersections against official Indian boundaries, ensuring every coordinate is anchored to a verified State and District for realistic travel-velocity calculations.
- Adversarial DNA: Raw merchant tags are mapped to standardized categories (e.g., LUXURY, GAMBLING) and assigned baseline risk levels, establishing the ground truth for fraud injection.
Step 2: Batch Generation and ETL
The historical dataset used for model training must be generated, ingested into the warehouse, and processed through the feature engineering pipeline.
Configuring the Simulation
Before running the generation, you can tune the scale and behavior of the synthetic population in the data/config/ directory:
- Population Scale (customer_config.yaml):
  - control.customer_count: Total number of unique agents (Default: 3334).
  - control.transactions_per_customer: Min/Max transaction volume per agent (Default: 400-800).
  - registration.lookback_years: How far back the customer history goes (Default: 5 years).
- Transaction Patterns (transaction_config.yaml):
  - transactions.lookback_days: Duration of the generated transaction history (Default: 365 days).
  - transactions.amount_range: The global min/max for transaction values (Default: 10 - 50,000 INR).
  - temporal_patterns: Hourly and daily weights that drive circadian rhythms.
- Fraud Injection (fraud_rules.yaml):
  - fraud_injector.target_share: The percentage of transactions that are intentionally fraudulent (Default: 0.01, or 1%).
  - fraud_injector.default_fp_rate: Baseline "noise" (False Positives) injected into the labels (Default: 0.005).
  - fraud_injector.profiles: Tune the frequency and behavior of specific attack types (UPI Scams, ATO, Velocity Abuse).
# Generate the initial population and history
cargo run --release --bin generate
# Ingest into ClickHouse and run ETL layers
cargo run --bin ingest
cargo run --bin etl -- silver-all
cargo run --bin etl -- gold-master
Step 3: Model Training and Streaming
The final phase involves training the XGBoost classifier, seeding the real-time feature store, and starting the streaming simulation.
Model Configuration
The training script (train_xgboost.py) uses a configuration tuned for high-imbalance datasets:
- Class Imbalance Handling: scale_pos_weight is calculated dynamically (Legitimate/Fraud ratio) to ensure the model doesn't ignore the minority fraud class.
- Hyperparameters: n_estimators: 100, max_depth: 6, learning_rate: 0.1, eval_metric: aucpr.
- Operational Feature Set: The model trains on 12 behavioral features (e.g., spatial_velocity, amount_deviation_z_score), explicitly excluding synthetic IDs (customer_id, etc.) to prevent label leakage.
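The dynamic imbalance handling described above can be sketched as follows. This is a minimal illustration, not the actual train_xgboost.py source: the label vector is hypothetical sample data, and the parameter dict simply mirrors the hyperparameters listed.

```python
# Sketch of dynamic scale_pos_weight computation for an imbalanced dataset.
# The labels below are hypothetical sample data, not RiskFabric output.

def compute_scale_pos_weight(y):
    """Ratio of legitimate (0) to fraudulent (1) labels."""
    n_fraud = sum(y)
    n_legit = len(y) - n_fraud
    return n_legit / n_fraud

# 1% fraud, matching the default target_share of 0.01
y = [1] * 10 + [0] * 990
weight = compute_scale_pos_weight(y)  # 990 legitimate / 10 fraud = 99.0

# Hyperparameters as described for train_xgboost.py
params = {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1,
    "eval_metric": "aucpr",
    "scale_pos_weight": weight,
}
```

Because the weight is derived from the observed class ratio, retuning is unnecessary when the configured fraud share changes.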
# Train the fraud detection model
python src/ml/train_xgboost.py
Model Validation and Interpretability
Before moving to production scoring, you should validate the model's performance and interpret its decision drivers:
- Performance Testing (test_model.py): Runs the trained model against a test dataset to generate classification reports and conduct threshold analysis (identifying the optimal Precision/Recall trade-off).
- Explainability (shap_analysis.py): Uses SHAP (SHapley Additive exPlanations) to create visual reports in reports/shap/. This identifies which features (e.g., spatial_velocity) drove the model's flags globally and for each specific fraud profile.
- Model Metadata (dump_model.py): A developer utility used to inspect the internals of the saved JSON model, verifying feature names, types, and categorical encodings.
# Run performance and threshold analysis
python src/ml/test_model.py
# Generate SHAP interpretability reports
python src/ml/shap_analysis.py
# (Optional) Inspect model metadata
python src/ml/dump_model.py
Starting the Real-time Pipeline
Once the model is validated, seed the feature store and start the inference engine:
# Seed the Redis feature store with warehouse state
python src/ml/seed_redis.py
# Start the real-time scorer and the streaming generator
python src/ml/scorer.py
cargo run --bin stream
Known Issues
The documentation assumes a local container environment. Running without containers may result in database connection failures. Explicit validation of service availability (Kafka, Redis, ClickHouse, Postgres) is required before beginning the tutorial.
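A lightweight pre-flight check along these lines could close that gap. This is a hypothetical sketch: the port numbers follow the common defaults for each service and may differ from your docker-compose.yml.

```python
import socket

# Common default ports; adjust to match your compose file (assumption, not verified)
SERVICES = {
    "Postgres": ("127.0.0.1", 5432),
    "ClickHouse": ("127.0.0.1", 8123),
    "Redpanda": ("127.0.0.1", 9092),
    "Redis": ("127.0.0.1", 6379),
}

def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(services):
    """Return the names of services that are not reachable."""
    return [name for name, (host, port) in services.items()
            if not is_reachable(host, port)]

if __name__ == "__main__":
    missing = preflight(SERVICES)
    if missing:
        print(f"Not reachable: {', '.join(missing)} - start the stack first.")
```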
Furthermore, the tutorial follows a linear path. Instructions for incremental updates, such as appending new transactions to an existing warehouse, are currently omitted. Implementation of stateful resumption guidance is required for large-scale simulation runs.
How-to Guides
Step-by-step instructions and roadmaps for managing, extending, and operating the RiskFabric simulation environment.
Project Roadmap & Backlog
Summary
The to-do.md document serves as the tactical roadmap for RiskFabric. It details the completed milestones and upcoming engineering tasks required to evolve the simulation from a prototype into a production-grade synthetic data platform.
👥 Customer Generation
- Location Heuristic Fix: location_type (Urban/Rural) is assigned based on city name or configuration fallback.
- Spatial Jittering: Implementation of multi-level jittering, including a ~500m drift for residential nodes and a deterministic ~100m drift for transaction events.
- City Name Fallbacks: Use of "{State} Region" for missing city names to maintain geographic consistency.
- Demographic Validation: Implementation of Indian-centric naming and email domain distributions via customer_config.yaml.
- Device & ISP Profiling: Implementation of realistic device fingerprinting and ISP-level behavioral attributes for each customer profile.
- Feature Correlation: Enforcing structural relationships between Age, Credit Score, and Monthly Spend to ensure dataset realism.
- Simulation Scalability: Transitioning to a streaming Parquet reader for residential reference data to support multi-million agent populations without memory exhaustion.
- Demographic Realism Tuning: Implement Name-Gender-State correlation for first names and surnames.
- Email Distribution Tuning: Align email domain distributions with actual Indian market shares.
💸 Transaction & Merchant Logic
- One-Pass Chunked Generation: Refactoring of the generator to process cards in batches of 5,000, enabling multi-million transaction generation on standard hardware.
- Chronological Simulation: Implementation of time-ordered transaction generation with support for temporal burst warping.
- MCC Mapping: Mapping of OSM categories to standard Merchant Category Codes (MCC) for realistic financial analysis.
- Budget-Aware Simulation: Transaction amounts are linked to the customer's monthly_spend profile, with noise added to individual events.
- Weighted Temporal Patterns: Implementation of circadian rhythms via hourly and daily weights in transaction_config.yaml.
- Device & Agent Persistence: Implementation of persistent devices and realistic app identifiers (e.g., GPay, PhonePe) per payment channel.
- Amount Distribution Tuning: Remediation of the "Amount Shortcut" by ensuring fraudulent amounts significantly overlap with legitimate spending distributions.
- Geographic Precision: Implementing the Haversine formula for all spatial velocity and distance calculations to replace Euclidean approximations.
- Jitter Normalization: Ensure consistent ~100m spatial jittering across all geographic profiles.
- Rayon Chunk Size Optimization: Explicitly tune chunk_size for parallel generation to optimize throughput.
- H3 Resolution Consistency: Enforce consistent H3 resolution usage across all spatial calculation layers.
🥈 ETL & Infrastructure
- Unified CLI Tooling: Consolidation of multiple utility binaries into unified etl, prepare_refs, and ingest tools for improved developer experience.
- Streaming Infrastructure: Integration of Redpanda (Kafka-compatible) for high-throughput, low-latency transaction event streams.
- Stateful Feature Store: Integration of Redis for sub-millisecond retrieval of behavioral context and running statistical aggregates.
- Full-Stack Observability: Implementation of Prometheus and Grafana dashboards for real-time monitoring of generation throughput and scoring latency.
- Zero-Copy Stdin Piping: Optimization of the ETL pipeline to pipe Parquet data directly from Polars to ClickHouse stdin, eliminating intermediate disk I/O.
- Streaming ETL Implementation: Refactoring of runners to use .scan_parquet() and .sink_parquet() to support 10M+ row benchmarks without memory exhaustion.
- Infrastructure Hardening: Transitioning from hardcoded credentials to an .env and Docker Secrets management system.
- Docker Healthcheck Synchronization: Refine depends_on to use service_healthy conditions in docker-compose.yml.
- Polars Type Consistency: Systematically cast boolean flags and small counters to UInt32 to prevent ClickHouse ingestion panics.
- ETL Signal Reliability: Re-enable commented-out Silver ETL stages (Campaign, Device IP, Network).
- ClickHouse Ingestion Stability: Transition to a native driver/HTTP client to replace podman exec dependencies.
🤖 Machine Learning & Model Training
- "Operational Feature" Pivot: Refactoring of the training pipeline to focus exclusively on behavioral signals, explicitly excluding synthetic metadata to prevent label leakage.
- SHAP Interpretability: Integration of SHAP (SHapley Additive exPlanations) for global and profile-specific feature importance validation.
- Real-Time Scoring Service: Development of a stateful inference service (scorer.py) capable of sub-millisecond fraud detection on Kafka streams.
- Point-in-Time State Seeder: Implementation of seed_redis.py to synchronize historical warehouse state with the real-time feature store using Welford's algorithm.
- GNN-based Campaign Detection: Transitioning to Graph Neural Networks (GNNs) for coordinated multi-entity attacks, as traditional classifier-based models (e.g., XGBoost) are inherently unsuited for capturing non-local relational patterns.
- OOT Validation & Drift: Transitioning to Out-of-Time validation and implementing a retraining scheduler to simulate model performance under adversarial concept drift.
- Seed Redis Robustness: Add existence checks for fact_transactions_gold.
- Label Noise Calibration: Fine-tune FP/FN rates in fraud.rs for better model convergence.
- Class Weight Balancing: Implement scale_pos_weight or a sampling strategy in the XGBoost pipeline.
- Strict ID Sanitization: Explicitly drop all internal IDs (card_id, customer_id) during training feature engineering.
⚙️ Configuration & Tuning
- Consolidated Control: Integration of all generation volume and parallelism settings into a centralized customer_config.yaml.
- Modular Fraud Logic: Implementation of a profile-driven mutation engine that decouples adversarial patterns from core simulation code.
- Product Catalog Centralization: Consolidation of card types, networks, and limits in product_catalog.yaml.
- Configuration Robustness: Refactoring the configuration loader to provide graceful error handling and support for descriptive error messages.
- Campaign Attack Implementation: Finalization of coordinated adversarial logic (currently disabled in configuration pending GNN-ready data structures).
- Dependency & Code Hygiene: Perform a security audit of Rust crates and remove deprecated "legacy" code blocks.
- Dependency & Code Hygiene: Perform security audit of Rust crates and remove deprecated "legacy" code blocks.
📊 Observability & Dashboards
- Rust Metric Exporter: Integrate the prometheus crate into the simulation engine to track TPS/performance.
- Geographic Visualization: Implement an H3 Geomap panel for fraud hotspot visualization.
- Materialized View Optimization: Pre-calculate dashboard metrics in ClickHouse to improve query performance.
- Infrastructure Alerting: Define Prometheus alert rules for critical service failures.
- Grafana Secret Externalization: Use GF_ environment variables instead of hardcoded credentials in datasources.
- ClickHouse Metrics Activation: Enable the port 9363 Prometheus endpoint in ClickHouse's config.xml.
- DataSource UID Fixing: Explicitly set UIDs (ClickHouse, Prometheus) in datasources.yaml to prevent panel breakage.
- Geomap Plugin Cleanup: Remove the deprecated 'worldmap-panel' and ensure the native 'geomap' panel is used for hotspot visualization.
How-to: Add a New Fraud Signature
This guide provides a task-oriented path for developers to inject new fraud behaviors into the RiskFabric engine.
1. Define the Profile
New fraud patterns are defined in src/generators/fraud.rs. Every profile needs:
- A unique name.
- A weighted probability in the configuration.
- A Behavioral and Spatial signature.
2. Implement the Mutator
Add a new branch to the FraudMutator logic.
// Example skeleton
fn mutate_upi_scam(txn: &mut Transaction) {
    // Modify amount, location, or device
}
3. Register in Config
Update data/config/fraud_rules.yaml to include your new profile and its target weight.
Detailed guide coming soon.
Simulation & Generation Engines
This section documents the core modules responsible for agent-based modeling, transaction simulation, and adversarial mutation logic.
Batch Data Generator (generate.rs)
Summary
The generate.rs binary serves as the primary orchestration engine for creating large-scale, labeled synthetic datasets. It generates a complete ecosystem of customers, accounts, cards, and historical transactions, providing the "ground truth" required for training fraud detection models.
Architectural Decisions
The generator uses a chunked execution strategy to handle datasets that exceed available system memory. By processing cards in batches of 5,000, the generator maintains a stable memory profile regardless of the total population size. For spatial lookups, the system implements a multi-tier H3 index (resolutions 4 and 6) and a state-level index. This allows for rapid, localized merchant selection during transaction generation without exhaustive searching of the merchant reference dataset.
The choice of Apache Parquet as the output format ensures that multi-million row datasets remain compressed and performant for the downstream Python-based ML pipeline and Polars-based ETL.
System Integration
generate.rs sits at the start of the RiskFabric lifecycle. It consumes reference Parquet files for merchants and residential locations and produces the four core tables: customers.parquet, accounts.parquet, cards.parquet, and transactions.parquet (including its accompanying fraud_metadata.parquet).
Known Issues
The final merge phase is implemented by writing temporary Parquet chunks to disk and then re-scanning them with the Polars lazy API. While this prevents memory exhaustion during the final join, it introduces disk I/O overhead that affects the "cleanup" phase of generation. Additionally, the 5,000-card chunk size is currently hardcoded; moving this to customer_config.yaml would allow performance tuning based on available RAM capacity.
Language: Rust
The streaming generator produces unlabeled transactions at a configurable rate and publishes them to the raw_transactions Kafka topic for real-time scoring.
It reuses generate_transactions_chunk from the batch pipeline — the core generation logic is untouched. The one-pass architecture is preserved: transactions and fraud metadata are produced in a single traversal, then separated at the output layer via UnlabeledTransaction, which is a struct that mirrors Transaction but omits is_fraud, chargeback, and all label fields entirely. The Kafka payload is guaranteed label-free at the type level.
The generator operates in two modes, controlled by streaming_mode in generator_config.yaml:
- Pure streaming (streaming_mode: true) — behavioral mutations active, no labels assigned, no metadata collected. Used for live fraud detection.
- Verification mode (streaming_mode: false) — identical Kafka output, but ground truth labels are captured internally to ground_truth.csv via FraudMetadata. Used to measure scorer precision/recall by joining against fraud_scores after a test run.
The rate limiter targets configurable throughput (default 100 tx/s) using a self-correcting mechanism — each send measures actual Kafka latency and sleeps only the remaining interval, preventing cumulative drift under variable broker response times.
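The self-correcting pacing described above can be sketched as follows, in Python for illustration (the Rust implementation differs in detail):

```python
import time

def paced_send(send_fn, events, rate_per_sec):
    """Send events at a target rate, subtracting each send's measured latency
    from the sleep interval so slow sends do not accumulate drift."""
    interval = 1.0 / rate_per_sec
    for event in events:
        start = time.monotonic()
        send_fn(event)                  # e.g. a Kafka produce call
        elapsed = time.monotonic() - start
        remaining = interval - elapsed  # sleep only what is left of the slot
        if remaining > 0:
            time.sleep(remaining)

sent = []
paced_send(sent.append, range(5), rate_per_sec=1000)
```

A naive `sleep(interval)` after each send would drift by the broker latency on every event; subtracting the measured latency keeps the long-run rate on target.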
The merchant population is loaded from data/references/ref_merchants.parquet and indexed at H3 resolutions 4 and 6 for spatial locality lookups during generation.
Known issue: Population size is hardcoded to 1,000 customers, decoupled from the batch pipeline's 10,000 customer population. This should be moved to config to ensure Redis seeding and streaming population are consistent.
Population Generator (customer_gen.rs)
Summary
The customer_gen.rs module is responsible for the foundational entity creation in the RiskFabric simulation. It generates a synthetic population of customers by synthesizing demographics, geographic data from OpenStreetMap (OSM) reference points, and financial behavioral profiles. This module ensures that every customer is "anchored" to a realistic physical and economic context.
Architectural Decisions
This generator is designed around a Constraint-Based Synthetic Model. Instead of simple randomization, the engine enforces correlations across different entity domains. For example, it programmatically links Credit Score to Age (using an age_weight factor) and Monthly Spend to Location Type (Metro vs. Rural). This ensures that the resulting dataset possesses the structural patterns expected in real-world financial data.
For geographic fidelity, a Spatial Jittering strategy is implemented. By adding a ~500m drift (0.005 degrees) to the original OSM residential nodes, the simulation avoids "clumping" effects where multiple customers would otherwise share identical coordinates. This jittering preserves the overall density of the reference data while providing unique home coordinates for every agent. Note that while transaction-level jitter is deterministic, the initial population jitter is currently stochastic.
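The residential jitter amounts to a bounded random offset on each axis; a minimal sketch in Python (the function name is illustrative, and the 0.005-degree bound comes from the text above):

```python
import random

JITTER_DEG = 0.005  # roughly 500 m of latitude

def jitter_home(lat, lon, rng):
    """Offset an OSM residential node by up to ±0.005 degrees per axis so
    co-located reference points yield distinct home coordinates."""
    return (lat + rng.uniform(-JITTER_DEG, JITTER_DEG),
            lon + rng.uniform(-JITTER_DEG, JITTER_DEG))

rng = random.Random(42)
home = jitter_home(19.0760, 72.8777, rng)
```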
The generator uses Probabilistic Location Typing to classify customers into Metro, Urban, or Rural categories based on their proximity to city centers in the reference data. This classification serves as the primary driver for the financial heuristics used in the simulation.
System Integration
customer_gen.rs acts as the first stage of the generation pipeline. It consumes the ref_residential.parquet file and the customer_config.yaml configuration to produce a vector of Customer structs. This vector is passed downstream to the account and card generators to complete the entity hierarchy.
Known Issues
The entire residential reference dataset is currently loaded into memory using Polars' ParquetReader for every generation run. While efficient for populations up to 100,000 customers, this creates a significant memory bottleneck when scaling to millions of agents. Moving to a chunked or streaming approach for reading reference data is required. Additionally, the jitter range (0.005) is currently hardcoded in the source code; moving this to the configuration would allow for different levels of spatial precision.
Financial Entity Linking (account_gen.rs & card_gen.rs)
Summary
The account_gen.rs and card_gen.rs modules are responsible for constructing the financial "graph" of the simulation. They define the hierarchical relationships between customers and their payment instruments, ensuring that every transaction is linked to a valid account and card entity. This layer establishes the structural foundation required for testing entity-linking models and cross-account fraud detection.
Architectural Decisions
These generators prioritize Relational Consistency. Instead of generating accounts and cards in isolation, the system uses a top-down orchestration: Customers drive the creation of Accounts, which in turn drive the creation of Cards. This ensures that every card PAN is programmatically linked back to a specific customer ID, maintaining 100% referential integrity across the multi-million row dataset.
For Entity Density, a probabilistic account ownership model is implemented in account_gen.rs. While every customer is guaranteed a primary account, there is a 50% chance for a customer to own a secondary account (e.g., a "Credit" account in addition to a "Savings" account). This architectural decision allows the simulation to model complex multi-entity behaviors, such as "Balance Transfers" or "Cross-Account Velocity," which are common signals in sophisticated fraud patterns.
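The ownership model reduces to a guaranteed primary plus a coin-flip secondary; a sketch in Python (field names and account types are illustrative, not the actual Rust structs):

```python
import random

def build_accounts(customer_ids, rng, secondary_prob=0.5):
    """Every customer gets a primary Savings account; with probability
    secondary_prob they also get a secondary Credit account."""
    accounts = []
    for cid in customer_ids:
        accounts.append({"customer_id": cid, "account_type": "Savings"})
        if rng.random() < secondary_prob:
            accounts.append({"customer_id": cid, "account_type": "Credit"})
    return accounts

rng = random.Random(7)
accounts = build_accounts(range(1000), rng)
```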
In card_gen.rs, an Account-Driven Mapping strategy is used. The card generator iterates over the accounts vector and issues a unique payment instrument for each. This one-to-one mapping simplifies the transaction generation logic while ensuring that the "issuing bank" metadata is correctly inherited from the parent account entity.
System Integration
These modules are the primary components of the batch generation pipeline. They are invoked by generate.rs immediately after the population has been created. The resulting vectors of Account and Card structs are then materialized into Parquet files and passed downstream to the transaction engine.
Known Issues
A hardcoded 50% probability for secondary account creation is currently used. This should be moved to customer_config.yaml to allow for more granular control over the "financial depth" of the population.
Furthermore, Card Metadata (like contactless_limit and online_limit) is currently initialized as empty strings. This prevents the simulation from enforcing realistic "Limit Breaches" during transaction generation. A "Product Catalog" lookup in card_gen.rs is required to populate these fields with realistic values based on the account type, which will enable a new class of "Limit-Based" fraud detection features.
Core Simulation Engine (transaction_gen.rs)
Summary
The transaction_gen.rs module is the primary logic engine of RiskFabric. It is responsible for simulating the financial lifecycle of every card in the system over a specified lookback period (default 365 days). It transforms static entity data into a high-fidelity stream of behavioral events, incorporating spatial realism, temporal patterns, and adversarial mutations in a single execution pass.
Architectural Decisions
The engine uses a One-Pass Parallel Architecture. By using rayon to iterate over cards, all logic—including merchant selection, timestamp generation, amount calculation, and fraud injection—occurs within a single parallelized loop. This eliminates the need for multi-pass joins and is a key factor in the project's performance.
For spatial realism, the system implements a Hierarchical Selection Strategy using H3 indices. Merchants are selected based on a probabilistic proximity model: 80% are "super-local" (Res 6), 15% are "district-level" (Res 4), 3% are "state-level," and 2% are "global." This creates realistic spending clusters around a customer's home while allowing for occasional travel or remote spending.
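The proximity model is a weighted tier draw followed by a merchant lookup in the chosen index; the draw itself can be sketched as follows under the stated 80/15/3/2 weights (tier names are illustrative):

```python
import random

# Weights from the hierarchical selection strategy described above
TIERS = [("super_local_res6", 0.80),
         ("district_res4", 0.15),
         ("state_level", 0.03),
         ("global", 0.02)]

def pick_tier(rng):
    """Pick a merchant search tier by cumulative probability."""
    roll = rng.random()
    cumulative = 0.0
    for name, weight in TIERS:
        cumulative += weight
        if roll < cumulative:
            return name
    return TIERS[-1][0]  # guard against floating-point rounding

rng = random.Random(0)
counts = {name: 0 for name, _ in TIERS}
for _ in range(10_000):
    counts[pick_tier(rng)] += 1
```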
To ensure reproducibility, Deterministic Seeding is used at the card level. Every card's random number generator is seeded with a combination of the global seed, a salt, and a hash of the card ID. This ensures that a specific card will always generate the exact same transaction history across different runs, provided the global configuration remains unchanged.
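The per-card seeding scheme can be sketched as follows. The hash construction here is illustrative (the Rust code combines the global seed, a salt, and a card-ID hash, not necessarily via SHA-256), but the reproducibility property is the same:

```python
import hashlib
import random

def card_rng(global_seed, salt, card_id):
    """Derive a deterministic per-card RNG from the global seed, a salt,
    and a stable hash of the card ID."""
    material = f"{global_seed}:{salt}:{card_id}".encode()
    digest = hashlib.sha256(material).digest()
    seed = int.from_bytes(digest[:8], "little")
    return random.Random(seed)

# Same inputs always reproduce the same draw sequence; different cards diverge
a = card_rng(42, "txn", "CARD-0001")
b = card_rng(42, "txn", "CARD-0001")
c = card_rng(42, "txn", "CARD-0002")
```

Note that a language-level `hash()` would not work here, since it is not stable across runs; a cryptographic digest (or any stable hash) is required for cross-run reproducibility.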
System Integration
This engine is the central utility consumed by both the Batch Generator (generate.rs) and the Streaming Generator (stream.rs). It acts as a pure function that takes configuration, spatial indices, and entity maps as input and produces vectors of Transaction and FraudMetadata as output.
Known Issues
Timestamp generation is implemented by sorting a local vector of dates for each card. While this ensures that transactions are chronologically ordered per card, it does not guarantee a global chronological order across the entire dataset during batch generation. ClickHouse is currently used to perform the final global sort.
Additionally, the spatial distribution weights (80/15/3/2) are hardcoded directly into the logic. Moving these to transaction_config.yaml would allow users to simulate different mobility profiles—for example, a "commuter" population would require a higher Res 4 weight compared to a "rural" population.
Adversary Logic Engine (fraud.rs)
Summary
The fraud.rs module contains the "attack logic" of RiskFabric. It defines the specific behavioral rules used to mutate legitimate transactions into adversarial patterns. This module ensures that synthetic fraud reflects realistic criminal tactics such as velocity abuse, account takeovers, and coordinated campaigns.
Architectural Decisions
This module follows a Profile-Driven Mutation Strategy. The engine interprets profiles from fraud_rules.yaml to dynamically adjust transaction attributes, rather than using hardcoded fraud logic. This allows for experimentation with new fraud signatures without modifying the core simulation code.
For Behavioral Mimicry, a relative amount calculation strategy is implemented. By allowing an attacker to spend within a multiplier range of the customer's average transaction amount (e.g., 0.8x to 1.2x), the engine simulates subtle, low-value fraud that is difficult for simple rule-based systems to detect.
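The relative amount mutation can be sketched as a bounded multiplier on the victim's average spend (the 0.8x-1.2x bounds come from the example above; the function name is illustrative):

```python
import random

def mutate_amount(avg_amount, rng, low=0.8, high=1.2):
    """Scale a fraudulent spend relative to the customer's average transaction,
    keeping it inside a band where fixed amount thresholds cannot catch it."""
    return round(avg_amount * rng.uniform(low, high), 2)

rng = random.Random(1)
fraud_amount = mutate_amount(2500.0, rng)  # stays near the victim's usual spend
```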
To simulate Stateful Attacks, the apply_campaign_logic function is used. This allows the generator to override standard spatial and device signals with persistent attacker metadata (e.g., a shared IP or fixed coordinates). This architectural decision is critical for generating the clustered signals that modern graph-based fraud models are designed to identify.
System Integration
fraud.rs is a stateless logic provider consumed by the transaction_gen.rs module. It acts as a specialized "mutation filter" that takes a completed transaction and a fraud profile and returns a set of behavioral anomalies.
Known Issues
String-based matching (e.g., f_type == "account_takeover") is currently used to determine which mutation logic to apply. This is a fragile pattern that could lead to silent failures if a typo is introduced in the YAML configuration. Refactoring these into a proper Enum would ensure compile-time safety and better performance. Additionally, the calculate_fraud_timestamp logic is currently limited to two specific fraud types; generalizing this to support a wider range of temporal attack patterns is needed.
Central Configuration Engine (config.rs)
Summary
The config.rs module is the architectural backbone of RiskFabric. It provides a strongly-typed, unified interface for all behavioral and operational parameters of the simulation. By mapping multiple YAML files into a hierarchical Rust structure, it ensures that every component—from the simulation engine to the machine learning pipeline—operates with a consistent and validated world-view.
Architectural Decisions
This engine is designed to enforce Type-Safe Behavioral Modeling. Instead of using loose key-value pairs or dynamic JSON, a deep hierarchy of nested structs is implemented. This leverages Rust’s compiler to ensure that any change to the configuration schema in one part of the system is immediately reflected and validated in every other part.
The use of Atomic Multi-File Loading is a critical architectural decision. The AppConfig::load() method reads five separate YAML files (fraud_rules, fraud_tuning, customer_config, transaction_config, and product_catalog) and synthesizes them into a single AppConfig object. This separation of concerns allows specific domains (like "Product Catalog" or "Fraud Rules") to be tuned in isolation without creating massive, unmanageable configuration files.
Safety Defaults are also implemented using serde macros. This ensures that the simulation remains resilient even if the underlying YAML files are missing non-essential keys, providing sensible fallbacks for parameters like the streaming_rate.
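The safety-default pattern (serde's field-level defaults in Rust) can be illustrated with an analogous Python sketch; the default value of 1000 is a placeholder, not the project's actual setting:

```python
from dataclasses import dataclass

@dataclass
class StreamConfig:
    # A missing key in the YAML falls back to this value instead of
    # failing the load (placeholder default for illustration).
    streaming_rate: int = 1000

def load_stream_config(raw: dict) -> StreamConfig:
    # Pass through only the keys the struct actually defines,
    # so unrelated YAML entries cannot break construction.
    known = {k: v for k, v in raw.items()
             if k in StreamConfig.__dataclass_fields__}
    return StreamConfig(**known)
```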
System Integration
config.rs is widely consumed across the codebase. It is initialized at the entry point of every binary (generate, stream, etl, ingest) and is passed down into the generators as a shared reference. This ensures that the "rules of the world" are identical across the batch, streaming, and ETL layers.
Known Issues
fs::read_to_string and expect calls are currently used in the load() method. This causes the application to panic immediately if a config file is missing or contains a syntax error. While acceptable for a CLI tool, refactoring to return a Result type is required to allow for more graceful error handling and reporting. Additionally, the file paths for the YAML configs are currently hardcoded relative to the project root; a more flexible path resolution strategy is needed to allow RiskFabric to be executed from different directories.
Data Engineering & Warehouse
This section documents the ETL pipelines, warehouse ingestion utilities, and geographic reference preparation tools used to build the RiskFabric environment.
ETL Pipeline System (etl.rs & src/etl/)
Summary
The ETL (Extract, Transform, Load) system is the transformation engine of RiskFabric. It is responsible for converting raw, "bronze" level synthetic transactions into "silver" behavioral features and finally into a "gold" master table ready for machine learning. The system is designed to handle large datasets by leveraging Polars for local transformations and ClickHouse for large-scale joins and persistence.
Architectural Decisions
The system follows a Medallion Architecture (Bronze → Silver → Gold) to ensure data lineage and modularity.
- Bronze: Raw data as generated by generate.rs.
- Silver: Subject-specific feature engineering (Customer, Merchant, Sequence, Network, Campaign, and Device/IP). These are calculated using Polars' lazy evaluation for performance.
- Gold: The final flattened "master" table.
A key design choice is the Hybrid Execution Model. While the feature logic is implemented in Rust using Polars, the pipeline orchestrates data movement between ClickHouse (the primary warehouse) and local memory via Parquet. This allows complex, stateful calculations in Rust (like Welford's algorithm for running variance) that are difficult to express in pure SQL, while still using ClickHouse for efficient storage and final broad joins.
System Integration
The ETL system acts as the connective tissue between the Data Generation layer and the Machine Learning layer. It reads from ClickHouse tables (populated via ingest.rs), performs transformations, and writes the results back to ClickHouse. The final fact_transactions_gold table is the direct source for the Python-based training pipeline.
Known Issues
The system currently uses podman exec calls to interact with ClickHouse from within the Rust binary. This approach depends on the local environment's container runtime and shell availability. Transitioning to a proper ClickHouse client library (like clickhouse-rs) will make the pipeline more portable and robust. Additionally, the GoldMaster stage is currently implemented as a raw SQL join in ClickHouse, which duplicates some of the logic found in gold_master.rs. Unifying these two approaches will ensure the batch and streaming feature definitions remain consistent.
Behavioral Feature Engineering (src/etl/features/)
Summary
The src/etl/features/ directory contains the core analytical logic of RiskFabric. It defines how raw synthetic transactions are transformed into behavioral features across multiple domains: Customer history, Merchant risk, Transaction sequences, and Network relationships. These features provide the high-dimensional context required for modern fraud detection models to identify subtle adversarial patterns.
Architectural Decisions
This layer is designed to prioritize Domain-Specific Modularity. By separating feature sets into dedicated modules (e.g., network.rs, sequence.rs), independent iteration on different detection strategies is possible. This modularity ensures that the ETL pipeline can be easily extended with new behavioral signals (like graph-based features or deep-temporal windows) without refactoring the entire transformation engine.
For Transaction Sequencing, a window-based approach is implemented using Polars' shift and over functions. This allows for the calculation of complex stateful features like spatial_velocity and amount_deviation_z_score without the overhead of row-by-row iteration. The decision to perform these calculations at the "Silver" layer ensures that the final "Gold" master table is pre-enriched with predictive signals, reducing the training time for downstream models.
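The windowed z-score idea can be sketched in plain Python (the real pipeline expresses this with Polars shift/over expressions; the window size here is illustrative):

```python
from collections import defaultdict, deque
from statistics import mean, stdev

def amount_z_scores(txns, window=5):
    """Per-card z-score of each amount against its preceding window.

    txns: iterable of (card_id, amount) in chronological order.
    Returns one score per transaction, 0.0 while a card's history is
    too short, mirroring a shift/over window computation.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    scores = []
    for card_id, amount in txns:
        past = history[card_id]
        if len(past) >= 2:
            mu, sigma = mean(past), stdev(past)
            scores.append((amount - mu) / sigma if sigma > 0 else 0.0)
        else:
            scores.append(0.0)
        past.append(amount)
    return scores
```

A sudden 500-rupee spend after a history around 100 produces a very large deviation score, which is exactly the signal the Silver layer pre-computes for the model.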
In the Network Intelligence module, a "Proxy Entity" strategy is used. Instead of building a full N:N customer relationship graph (which is memory-intensive), the risk reputation of shared entities like IP addresses and User Agents is calculated. This allows the system to identify "Suspicious Clusters" where multiple customers share a single high-fraud entity, capturing coordinated attack signals with high computational efficiency.
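A minimal sketch of the proxy-entity strategy, assuming simplified thresholds (the real module's cluster criteria may differ):

```python
from collections import defaultdict

def shared_entity_risk(events, min_customers=3, min_fraud_rate=0.5):
    """Flag proxy entities (e.g., IP addresses) shared by many customers
    with a high observed fraud rate.

    events: iterable of (entity, customer_id, is_fraud).
    Returns {entity: fraud_rate} for suspicious clusters only.
    """
    customers = defaultdict(set)
    frauds = defaultdict(int)
    totals = defaultdict(int)
    for entity, customer_id, is_fraud in events:
        customers[entity].add(customer_id)
        totals[entity] += 1
        frauds[entity] += int(is_fraud)
    return {
        e: frauds[e] / totals[e]
        for e in totals
        if len(customers[e]) >= min_customers
        and frauds[e] / totals[e] >= min_fraud_rate
    }
```

Because only per-entity aggregates are kept, memory scales with the number of distinct IPs and User Agents rather than with customer pairs.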
System Integration
These modules are the primary transformation components of the etl.rs binary. They consume "Bronze" tables from ClickHouse and produce "Silver" feature tables. The logic defined here is also mirrored in the scorer.py service to ensure training-serving parity during real-time inference.
Known Issues
A simple Euclidean distance formula is currently used for Spatial Velocity calculations. As noted in the etl_schema.md, this approximation becomes inaccurate over large distances. Implementation of the Haversine formula within the Polars transformation is required to ensure geographic precision.
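The required Haversine computation is standard; a reference sketch (to be ported into the Polars expression) looks like this:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

For intra-city distances the Euclidean approximation is close, but for a Mumbai-to-Delhi hop (roughly 1,150 km) the curvature correction matters for velocity-based anomaly thresholds.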
Furthermore, the Campaign Detection logic in campaign.rs is currently based on a fixed 48-hour time gap. This is a heuristic that may fail to capture long-running, low-frequency attack campaigns. This threshold should be moved to the configuration or a more dynamic "Sessionization" strategy implemented to account for different adversarial behaviors.
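The gap-based sessionization being described reduces to a simple grouping rule, sketched here with the threshold as a parameter rather than a hardcoded constant:

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap_hours=48):
    """Group timestamps into campaigns: a new campaign starts whenever
    the gap to the previous event exceeds gap_hours."""
    gap = timedelta(hours=gap_hours)
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```

Making gap_hours configurable (or adaptive per merchant category) is the first step toward the dynamic strategy suggested above.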
Physical World Transformation (warehouse/)
Summary
The warehouse/ directory contains the SQL-based transformation logic for RiskFabric's physical environment. Using dbt (data build tool) and Postgres/PostGIS, this layer transforms raw OpenStreetMap (OSM) nodes into the "Physical World" reference data (Merchants and Residential points) used by the simulation engine.
Architectural Decisions
This layer prioritizes Geographic High-Fidelity. Instead of relying on the often inconsistent "state" and "district" tags in OSM, a Spatial Join Strategy is implemented. By performing ST_Intersects operations against official geographic boundaries (provided by DataMeet), the transformation layer provides a verified ground truth for every coordinate in the simulation. This ensures that a customer living in "Mumbai" is programmatically anchored to the correct state and district boundaries, which is critical for realistic spatial velocity calculations.
For Merchant Risk Profiles, a categorical mapping strategy is implemented in the stg_merchants model. By mapping raw OSM sub-categories (like jewelry or electronics) to standardized RiskFabric categories and risk levels (LOW, MEDIUM, HIGH), the "Adversarial Ground Truth" is established for the simulation. This architectural decision allows the fraud engine to select high-risk merchants for specific attack profiles without needing to embed merchant-level risk logic into the Rust binaries.
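The categorical mapping amounts to a lookup with a default bucket; a sketch (the specific category names and risk assignments are illustrative, not the actual stg_merchants mapping):

```python
# Hypothetical subset of the OSM-subcategory-to-risk mapping.
RISK_BY_CATEGORY = {
    "jewelry": ("LUXURY", "HIGH"),
    "electronics": ("ELECTRONICS", "HIGH"),
    "supermarket": ("GROCERY", "LOW"),
}

def classify_merchant(osm_subcategory: str):
    """Map a raw OSM shop tag to a (category, risk_level) pair,
    defaulting unknown tags to a medium-risk bucket."""
    return RISK_BY_CATEGORY.get(osm_subcategory, ("OTHER", "MEDIUM"))
```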
System Integration
The dbt layer acts as the "Level 0" enrichment engine. It consumes the raw tables populated by prepare_refs.rs and produces the mart_residential and mart_merchants models. These models are then exported to Parquet via export_references.rs or dlt/pipelines.py to be used as the primary lookup data for the simulation generators.
Known Issues
Spatial Joins are performed on every run for the mart models. While this ensures data quality, it is computationally expensive and slow when processing millions of Indian OSM nodes. A "Spatial Indexing" strategy should be implemented or the boundary results materialized into a lookup table to reduce the processing time.
Furthermore, the City Normalization logic is currently based on a simple regex-based macro. This fails to handle the wide variety of spelling variations and transliteration errors found in raw Indian OSM data. A fuzzy-matching strategy or integration of a dedicated geographic gazetteer is needed to ensure more robust city-level clustering in the simulation.
Data Warehouse Ingestor (ingest.rs)
Summary
The ingest.rs binary is the primary data loading utility that populates the RiskFabric data warehouse (ClickHouse). It consumes the raw Parquet output from the batch generator and transforms it into structured "Bronze" tables, providing the necessary foundation for downstream ETL and machine learning operations.
Architectural Decisions
The ingestor handles the initial schema enforcement for the warehouse. A key architectural decision is the use of a two-stage ingestion process for transactions. First, raw data is loaded into fact_transactions_bronze_raw with all fields preserved as strings or basic types. Then, ClickHouse's parseDateTime64BestEffort performs a high-performance conversion into a typed DateTime64 column for the final fact_transactions_bronze table. This approach ensures that data is not lost because of formatting mismatches during the initial bulk load.
The utility is idempotent, automatically dropping and recreating tables on every run. This simplifies the development lifecycle by ensuring the warehouse reflects the latest state of the synthetic generation configuration.
System Integration
ingest.rs acts as the bridge between the File System layer and the Warehouse layer. It interacts directly with the podman container runtime to execute commands against the riskfabric_clickhouse instance. It is the prerequisite for the etl.rs pipeline, which expects the tables defined here to be present and populated.
Known Issues
Data is currently piped into the warehouse using shell-based cat and podman exec commands. This is inefficient for large datasets and introduces a dependency on the host's shell environment. Refactoring this to use the ClickHouse HTTP interface or a native Rust client will allow for more reliable bulk inserts.
Furthermore, the warehouse schema in ingest.rs has drifted from the Rust model definitions in src/models/. For example:
- The dim_accounts table in the warehouse is missing the bank_id and account_no fields present in account.rs.
- The dim_cards table is missing over 10 fields, including issue_date, activation_date, and all usage limit fields defined in card.rs.
- The dim_customers schema is more closely aligned but still represents a manual duplication of the Customer struct.
Unifying these schemas, ideally by deriving the ClickHouse DDL directly from the Rust structs, will ensure the warehouse remains a high-fidelity representation of the synthetic population.
Reference Data Preparator (prepare_refs.rs)
Summary
The prepare_refs.rs binary is the "world-building" utility of RiskFabric. It is responsible for ingesting, filtering, and normalizing raw OpenStreetMap (OSM) data and other geographic datasets to create the high-performance reference files used by the simulation generators. It handles the task of mapping physical coordinates to behavioral entities like merchants, residential points, and financial institutions.
Architectural Decisions
This utility is designed to handle Parallel OSM Parsing using the osmpbf library and rayon. Since the raw India PBF file is several gigabytes in size, the preparator uses a map-reduce strategy to extract relevant nodes (residential buildings, shops, and amenities) across all available CPU cores. This allows for the processing of a country's entire geographic dataset in minutes rather than hours.
A key architectural choice is the implementation of Fuzzy State Normalization. OSM data is often inconsistent, with the same state appearing in multiple formats (e.g., "AP," "Andhra Pradesh," or "Andra Pradesh"). A rule-based normalization engine standardizes these variations, ensuring that downstream generators can reliably perform state-level joins and spatial indexing without data gaps.
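A minimal sketch of the rule-based normalization, assuming an illustrative two-state rule set (the real engine covers all Indian states and many more spelling variants):

```python
import re
from typing import Optional

# Each rule maps a family of observed spellings to one canonical name.
STATE_RULES = [
    (re.compile(r"^(ap|andh?ra\s*pradesh)$", re.IGNORECASE), "Andhra Pradesh"),
    (re.compile(r"^(tn|tamil\s*nadu)$", re.IGNORECASE), "Tamil Nadu"),
]

def normalize_state(raw: str) -> Optional[str]:
    """Return the canonical state name, or None if no rule matches."""
    cleaned = raw.strip()
    for pattern, canonical in STATE_RULES:
        if pattern.match(cleaned):
            return canonical
    return None
```

Unmatched values returning None (rather than passing through silently) is what lets downstream joins detect and report data gaps.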
A Postgres-Based Staging Layer is also integrated for the extraction process. By using the BinaryCopyInWriter for bulk insertion, the preparator moves millions of extracted nodes into a structured database with minimal overhead. This staging layer allows for complex SQL-based cleaning and verification before the final reference Parquet files are exported.
System Integration
prepare_refs.rs is a standalone "Level 0" utility that must be run before synthetic data generation. It populates the data/references/ directory with ref_merchants.parquet, ref_residential.parquet, and other critical lookup tables. These files are then consumed by generate.rs, stream.rs, and customer_gen.rs.
Known Issues
A hardcoded Postgres connection string (postgres://harshafaik:123@localhost:5432/riskfabric) is currently used within the CLI defaults. This is a security and portability issue; it should be moved to an environment variable or a configuration file. Additionally, the utility lacks a unified "Export to Parquet" command—it populates Postgres, but the final conversion to Parquet is often handled by separate, manual scripts. Consolidating the end-to-end pipeline (OSM → Postgres → Parquet) into this single binary would improve the developer experience.
Reference Data Exporter (export_references.rs)
Summary
The export_references.rs binary is the final stage of the reference data preparation pipeline. It extracts cleaned and processed geographic data from the staging database (Postgres) and serializes it into the high-performance Parquet format required by the simulation generators. This utility ensures that the "synthetic world" is correctly typed, indexed, and portable across different environments.
Architectural Decisions
This utility is designed to act as the Final Schema Validator for the reference data. While the prepare_refs.rs utility handles raw extraction and normalization, the exporter ensures that the data is structured exactly as expected by the generators. By using Polars to build the final DataFrames, high-performance memory management and efficient Parquet serialization are leveraged, which is critical when dealing with millions of reference nodes.
A key architectural choice is the Database-to-Parquet decoupling. By exporting processed staging tables into standalone Parquet files, the simulation environment becomes portable. This allows the core RiskFabric generators to run without a live Postgres connection, simplifying the deployment and execution of the simulation on local workstations or in CI/CD pipelines.
System Integration
export_references.rs is a "Level 0" utility that bridges the Staging layer (Postgres) and the Generation layer (Parquet). It is typically run after prepare_refs.rs and any subsequent SQL-based cleaning has been performed on the staging tables. The resulting Parquet files in data/references/ are the direct input for generate.rs, stream.rs, and the various generator modules.
Known Issues
A hardcoded Postgres connection string (postgres://harshafaik:123@localhost:5432/riskfabric) is currently used directly in the source code. This is a duplicate of the issue in prepare_refs.rs and should be unified into a shared configuration or environment variable. Additionally, the exporter manually maps Postgres rows into local vectors before creating the Polars DataFrame. This is inefficient for extremely large datasets; refactoring to use a streaming connector or a more direct Polars-Postgres integration is needed to reduce the memory overhead of the export process.
Reference Data Pipeline (dlt/pipelines.py)
Summary
The dlt/pipelines.py script is the Modern Data Stack (MDS) integration for RiskFabric. It uses the dlt (Data Load Tool) library to manage the extraction and movement of cleaned, enriched geographic data from the staging database (Postgres) into the optimized Parquet reference files used by the generators.
Architectural Decisions
This pipeline is designed to facilitate Declarative Reference Data Export. Instead of custom SQL-to-Parquet conversion logic (as seen in export_references.rs), this script leverages the dlt library’s built-in support for the "filesystem" destination. This allows for automated schema handling and standardized Parquet formatting, which is critical for maintaining consistency between the OSM-derived reference data and the Rust-based simulation.
A key architectural choice was the use of write_disposition="replace". Since the reference data (merchants and residential nodes) represents a "static" world that is fully rebuilt after every OSM extraction, this strategy ensures that the data/references/ directory always contains a clean snapshot of the environment without manual cleanup.
System Integration
dlt/pipelines.py acts as an alternative or supplementary utility to export_references.rs. It bridges the Staging layer (Postgres) and the Local File System layer. It is typically run as part of the "Level 0" world-building phase, specifically after dbt has transformed the raw OSM nodes into the mart_residential and mart_merchants models.
Known Issues
Environment variables (e.g., DESTINATION__FILESYSTEM__BUCKET_URL) are currently used to configure the DLT pipeline directly within the Python script. This approach is fragile and makes it difficult to change the reference directory without modifying the code. These should be moved into a dedicated dlt_config.toml file to align with the library’s best practices. Additionally, the pipeline currently lacks Data Validation tests; dlt "checks" should be implemented to ensure that the exported Parquet files contain the expected number of rows and non-null H3 indices before they are handed off to the generation engine.
Machine Learning Systems
This section documents the model training pipelines, real-time inference services, and metadata utilities required for detecting synthetic fraud patterns.
Machine Learning Training Pipeline (train_xgboost.py)
Summary
The train_xgboost.py script is the primary model development engine for RiskFabric. It extracts features from the ClickHouse "Gold" layer and trains an XGBoost classifier to detect synthetic fraud patterns. It evaluates the learnability of the generated fraud signatures by industry-standard algorithms.
Architectural Decisions
An "Operational Feature" policy is implemented in the training script to prevent data leakage. While the synthetic generator provides explicit labels like geo_anomaly and fraud_type for verification, these are strictly excluded from training. Instead, the model is forced to learn from behavioral proxies such as amount_deviation_z_score, spatial_velocity, and hour_deviation_from_norm. This ensures that the model's performance reflects real-world detectability rather than just learning internal generator flags.
The choice of XGBoost with Native Categorical Support allows the model to process high-cardinality fields like merchant_category and transaction_channel directly, without the memory overhead of one-hot encoding. This maintains performance as the synthetic merchant population scales.
System Integration
The training pipeline is the final "offline" consumer of the Data Warehouse layer. It uses the clickhouse-connect library to pull data directly into Polars DataFrames for training. The resulting model is serialized to models/fraud_model_v1.json, which is consumed by the scorer.py service for real-time inference in the streaming pipeline.
Known Issues
A simple 80/20 train/test split with stratification is currently used, but time-series validation is missing. Since fraud patterns evolve over time, a random split can lead to optimistic performance estimates by allowing the model to see future patterns during training. A walk-forward validation strategy is required to better simulate real production deployments. Additionally, XGBoost hyperparameters (like max_depth=6) are currently hardcoded; these should be moved to a ml_tuning.yaml configuration file to allow for automated hyperparameter optimization.
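The walk-forward scheme being proposed can be sketched as expanding-window splits where every test fold is strictly later than its training data (fold sizing here is a simple illustrative choice):

```python
def walk_forward_splits(timestamps, n_folds=3):
    """Yield (train_idx, test_idx) pairs in which each test fold is
    strictly later in time than its training data, so the model never
    sees future patterns during training."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    fold = len(order) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        yield order[: k * fold], order[k * fold : (k + 1) * fold]
```

Averaging metrics across the folds gives a performance estimate closer to what a production deployment retrained on a schedule would see.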
Real-Time Scoring Service (scorer.py)
Summary
The scorer.py service is the production inference engine of RiskFabric. It consumes unlabeled transaction events from Kafka, performs sub-millisecond feature engineering using a Redis-backed feature store, and applies the trained XGBoost model to generate real-time fraud probabilities. The service serves as the final link in the streaming pipeline, providing the "Detection" half of the simulation.
Architectural Decisions
This service is designed around a Stateful Micro-Batching Architecture. To balance high throughput with low latency, feature engineering is performed for each transaction individually, but the final model predictions are grouped into batches of 50. This reduces the overhead of XGBoost inference and ClickHouse persistence while maintaining a P99 latency of approximately 12ms per transaction.
For real-time feature engineering, Welford’s Algorithm is implemented to maintain running means and standard deviations within Redis. This allows for the calculation of an "Operational" amount_deviation_z_score for every transaction without needing to scan historical Parquet files or perform heavy SQL queries. This stateful approach is critical for simulating how behavioral anomalies are detected on a "live" stream.
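Welford's update keeps only (count, mean, M2) per card, which is what makes the Redis hash layout sufficient. A reference sketch of the update and the resulting z-score:

```python
def welford_update(state, amount):
    """One incremental Welford step over (count, mean, M2) state, as
    would be stored per card in a Redis hash."""
    count, mean, m2 = state
    count += 1
    delta = amount - mean
    mean += delta / count
    m2 += delta * (amount - mean)
    return count, mean, m2

def z_score(state, amount):
    """Operational amount deviation against the running statistics."""
    count, mean, m2 = state
    if count < 2:
        return 0.0
    std = (m2 / (count - 1)) ** 0.5  # sample variance from M2
    return (amount - mean) / std if std > 0 else 0.0
```

Each scored transaction first reads the three fields, computes its z-score, then writes the updated state back, so no historical scan is ever needed on the hot path.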
The service maintains Feature Alignment with the training pipeline by dynamically reordering and casting incoming features to match the exact schema and types (categorical, float, int) exported from the fraud_model_v1.json booster. This prevents "training-serving skew," ensuring that the model's performance in production matches its performance during validation.
System Integration
scorer.py sits at the exit point of the Streaming layer. It consumes from the raw_transactions Kafka topic (populated by stream.rs) and writes its decisions to both the fraud_scores ClickHouse table and a downstream Kafka topic for automated blocking. It depends on Redis for its behavioral context and ClickHouse for long-term audit logging and performance monitoring.
Known Issues
A hardcoded THRESHOLD = 0.85 is currently used for flagging transactions as fraud. This should be moved to a configuration file (or a dynamic service) to allow for easier tuning of the precision-recall trade-off. Furthermore, the hour_deviation_from_norm feature is currently a placeholder (0.0). Implementation of the temporal aggregation logic in seed_redis.py and fetching it from Redis is required to ensure the model has access to its full set of behavioral signals during real-time inference.
Model Metadata Utility (dump_model.py)
Summary
The dump_model.py script is a specialized inspection utility used to extract the internal schema and feature definitions from a serialized XGBoost model. It ensures that the real-time scoring engine (scorer.py) has exact visibility into the feature names and data types (categorical, float, integer) expected by the binary booster.
Architectural Decisions
This utility is designed to solve the Feature Alignment Problem in production ML. When an XGBoost model is saved as a JSON booster, it encodes its expected input schema. If the inference engine sends features in the wrong order or with the wrong data types, the model may crash or return incorrect results. By using get_booster().feature_names, this utility provides a programmatically verifiable source of truth for the inference interface, allowing the scorer.py to dynamically reorder and cast its input DataFrames to match the model's training state.
The implementation of JSON-Path Extraction for categorical features is a critical design choice. Since XGBoost's native categorical encoding is serialized within the learner block of the JSON file, this utility parses those internal dictionaries. This architectural safety measure ensures verification of the categorical "levels" (e.g., specific merchant categories) the model was exposed to during training, preventing "Unknown Category" errors during real-time scoring.
System Integration
dump_model.py is an auxiliary utility in the Machine Learning layer. It is typically run after train_xgboost.py to verify the model artifact before it is deployed to the scoring service. It acts as a manual "Gatekeeper" for ensuring feature consistency across the pipeline.
Known Issues
A fragile, regex-based approach (re.findall) is currently used to extract categorical strings from the XGBoost JSON. This is an unreliable method that depends on the specific serialization format of the XGBoost version being used. A more robust parser that follows the official XGBoost JSON schema is required. Additionally, the utility currently only prints the metadata to the console; refactoring is needed to export a structured schema.yaml file that the scorer.py can load automatically to configure its inference pipeline.
Infrastructure & Operations
This section documents the local service stack, orchestration configuration, and state synchronization utilities used to operate the RiskFabric environment.
Infrastructure & Local Service Stack
Summary
The RiskFabric simulation is supported by a comprehensive local service stack orchestrated via Docker Compose. This infrastructure provides the multi-modal data environment—relational, columnar, stream, and cache—required to simulate a modern financial technology ecosystem. It enables the end-to-end lifecycle of synthetic data, from geographic world-building to real-time adversarial detection.
Architectural Decisions
The infrastructure is designed using a Multi-Model Database Strategy. By incorporating ClickHouse for high-volume transactions and Postgres/PostGIS for geographic preparation, each stage of the simulation uses the optimal storage engine for its specific data type. The inclusion of Redpanda (a Kafka-compatible event store) and Redis facilitates the real-time scoring path, allowing the simulation to model the sub-millisecond latency requirements of production fraud systems.
For Observability, Prometheus and Grafana are integrated directly into the core stack. This architectural decision transforms RiskFabric from a simple data generator into a performance benchmarking environment. By instrumenting the database exporters and the real-time scorer, system metrics (e.g., Kafka ingestion lag, Redis lookup latency, and model inference time) can be visualized in real-time, providing visibility into the operational impact of different fraud detection strategies.
The use of Healthchecks across all critical services ensures that the generation binaries (ingest.rs, etl.rs) only attempt to connect when the infrastructure is ready. This improves the developer experience by reducing connection-refused errors during the initial cold-start of the simulation environment.
System Integration
The infrastructure is the foundation upon which all RiskFabric binaries execute. The Rust-based generators and Python-based ML services connect to these containers via standardized ports and internal networks. The scorer service is configured to run as a long-lived container, automatically subscribing to the Kafka stream as soon as the stack is up.
Known Issues
A Single-Node Redpanda instance without persistence is currently used. While this is sufficient for local development, it does not support testing "Consumer Group Rebalancing" or "Partition-Level Parallelism," which are common challenges in production streaming systems. A multi-node Redpanda cluster configuration is required to support high-availability testing scenarios.
Furthermore, Postgres and ClickHouse credentials are currently hardcoded as harshafaik:123 across the docker-compose.yml. This security vulnerability prevents the stack from being used in shared or public environments. These credentials should be moved to an .env file, with Docker Secrets managing sensitive values in shared deployments.
Redis Feature Seeder (seed_redis.py)
Summary
The seed_redis.py script is an operational utility that initializes the real-time feature store (Redis) with historical data from the warehouse (ClickHouse). It bridges the gap between the batch-trained model and the streaming inference engine by ensuring that every card and customer has immediate behavioral context before real-time transactions start arriving.
Architectural Decisions
This seeder is designed to facilitate Warm-Start Inference. Without this script, the first few transactions for every card in the streaming pipeline would be difficult to score accurately (as there would be no "previous" location for velocity or "previous" amount for Z-score). The seeder extracts the most recent state for every card and customer, including the last 10 transactions, the final coordinate pair, and the cumulative count of events.
A key architectural choice is the Redis Hash/List strategy. Redis Lists (RPUSH) are used to store chronological card history and Hashes (HSET) to store aggregate statistics. This allows scorer.py to perform O(1) lookups for behavioral context, maintaining the strict latency requirements of real-time fraud detection. Furthermore, the seeder explicitly calculates the initial Welford state (Mean and M2) from the warehouse, enabling the online scorer to continue updating statistical variance incrementally without a full history scan.
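The List/Hash layout can be sketched as follows. FakeRedis is a minimal in-memory stand-in (rpush/ltrim/hset only) so the seeding logic is runnable without a live server; the key names are illustrative:

```python
class FakeRedis:
    """Tiny in-memory stand-in for redis.Redis, for illustration only."""
    def __init__(self):
        self.lists, self.hashes = {}, {}
    def rpush(self, key, *values):
        self.lists.setdefault(key, []).extend(values)
    def ltrim(self, key, start, stop):
        items = self.lists.get(key, [])
        # Mirror Redis semantics for a negative-index tail trim.
        self.lists[key] = items[start: stop + 1 if stop != -1 else None]
    def hset(self, key, mapping):
        self.hashes.setdefault(key, {}).update(mapping)

def seed_card(client, card_id, history, stats, keep=10):
    """Push chronological amounts into a list and aggregates into a
    hash, keeping only the last `keep` events, so the scorer can read
    behavioral context with O(1) lookups."""
    client.rpush(f"card:{card_id}:history", *history)
    client.ltrim(f"card:{card_id}:history", -keep, -1)
    client.hset(f"card:{card_id}:stats", mapping=stats)
```

Against a real redis-py client the same calls apply unchanged, since the stub mimics the rpush/ltrim/hset(mapping=...) signatures.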
System Integration
seed_redis.py acts as a synchronization service between the Warehouse layer (ClickHouse) and the Scoring layer (Redis/Kafka). It must be executed after etl.rs completes (to ensure the "Gold" table is populated) and before stream.rs and scorer.py are started.
Known Issues
The entire feature initialization set is currently pulled into local Python memory before pushing to Redis. For datasets with millions of cards, this may lead to a memory-exhaustion failure. The ClickHouse queries should be refactored to use chunked fetching (cursors), or a parallelized worker pool should be implemented to stream data from the warehouse to Redis in batches. Additionally, a hardcoded password for ClickHouse is currently used; this should be moved to an environment variable to align with project security standards.
Technical Reference
Exhaustive documentation of the schemas, configurations, and developer utilities used to build and manage the RiskFabric simulation.
Synthetic Data Schema
Summary
The RiskFabric data schema is designed to mirror a professional financial environment while providing the "white-box" visibility required for advanced machine learning research. It consists of five core entities that represent the hierarchical relationship between a customer and their financial events.
Design Intent
The schema is structured to prioritize Relational Realism over flat-file simplicity. By separating Customers, Accounts, and Cards into distinct tables, the simulation models complex many-to-one relationships (e.g., a single customer owning multiple accounts, each with different card instruments). This is essential for testing entity-linking models and network analysis in fraud detection.
The inclusion of the FraudMetadata table is a critical architectural decision. It decouples the simulation ground truth (fraud_target) from the operational signal (is_fraud). This allows researchers to train on noisy, real-world signals while validating against the perfect, latent truth of the generator.
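The ground-truth/operational split can be sketched as a label-noise pass over fraud_target; the FP/FN rates here are illustrative placeholders, not the project's tuned values:

```python
import random

def apply_label_noise(fraud_targets, fp_rate=0.01, fn_rate=0.05, seed=42):
    """Derive the operational is_fraud signal from the ground-truth
    fraud_target by flipping labels at the given false-positive and
    false-negative rates."""
    rng = random.Random(seed)
    noisy = []
    for target in fraud_targets:
        if target:
            # A real fraud is missed with probability fn_rate.
            noisy.append(rng.random() >= fn_rate)
        else:
            # A legitimate event is falsely flagged with probability fp_rate.
            noisy.append(rng.random() < fp_rate)
    return noisy
```

Training on the noisy is_fraud column while evaluating against the clean fraud_target column is what lets researchers quantify model robustness to imperfect labels.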
Entity Relationship Overview
- Customer: The primary entity. Owns several Accounts.
- Account: A financial container (Savings, Current, Credit). Contains several Cards.
- Card: The instrument used for transactions.
- Transaction: A financial event linked to a Card, Account, and Customer.
- FraudMetadata: Ground-truth data linked 1:1 with Transactions to explain the generation context.
👥 Customer (customers.parquet)
Defines the synthetic population's demographics and geographic baseline.
| Field | Type | Description |
|---|---|---|
| `customer_id` | String | Unique UUID for the customer. |
| `name` | String | Full name (Indian-centric). |
| `age` | UInt8 | Age of the customer (18-90). |
| `email` | String | Synthetic email address. |
| `location` | String | Full residential address (OSM-based). |
| `state` | String | Standardized Indian state name. |
| `location_type` | String | Urban vs. Rural classification. |
| `home_latitude` | Float64 | WGS84 Latitude of home. |
| `home_longitude` | Float64 | WGS84 Longitude of home. |
| `home_h3r5` | String | H3 Resolution 5 index (Neighborhood level). |
| `home_h3r7` | String | H3 Resolution 7 index (Block level). |
| `credit_score` | UInt16 | Synthetic credit score (300-850). |
| `monthly_spend` | Float64 | Average expected monthly expenditure. |
| `customer_risk_score` | Float32 | Baseline risk probability (0.0 to 1.0). |
| `is_fraud` | Bool | Flag indicating if this customer represents a fraud target. |
| `registration_date` | String | ISO 8601 date of account registration. |
🏦 Account (accounts.parquet)
The logical banking container for funds.
| Field | Type | Description |
|---|---|---|
| `account_id` | String | Unique UUID for the account. |
| `customer_id` | String | FK to Customer. |
| `bank_id` | String | Identifier for the issuing bank. |
| `account_no` | String | 12-digit synthetic account number. |
| `account_type` | String | Savings, Current, or Credit. |
| `balance` | Float64 | Current funds in the account. |
| `status` | String | Active, Closed, or Suspended. |
| `creation_date` | String | The account opening date. |
💳 Card (cards.parquet)
The payment instrument associated with an account.
| Field | Type | Description |
|---|---|---|
| `card_id` | String | Unique UUID for the card. |
| `account_id` | String | FK to Account. |
| `customer_id` | String | FK to Customer. |
| `card_number` | String | 16-digit synthetic PAN. |
| `card_network` | String | VISA, Mastercard, or RuPay. |
| `card_type` | String | Debit or Credit. |
| `status` | String | Active, Blocked, or Expired. |
| `status_reason` | String | Reason for status changes (e.g., SIM Swap Suspect). |
| `issue_date` | String | Card issuance date. |
| `activation_date` | String | Initial card usage date. |
| `expiry_date` | String | Card expiry date. |
| `issuing_bank` | String | Full name of the bank. |
| `bank_code` | String | Standardized 4-digit bank identifier. |
💸 Transaction (transactions.parquet)
The high-volume stream of financial events.
| Field | Type | Description |
|---|---|---|
| `transaction_id` | String | Unique UUID for the transaction. |
| `card_id` | String | FK to Card. |
| `account_id` | String | FK to Account. |
| `customer_id` | String | FK to Customer. |
| `merchant_id` | String | Unique identifier for the merchant. |
| `merchant_name` | String | Name of the business. |
| `merchant_category` | String | Category (e.g., GROCERY, TRAVEL). |
| `merchant_country` | String | Country code of the merchant (defaults to IN). |
| `amount` | Float64 | Transaction value in base currency. |
| `timestamp` | String | ISO 8601 high-precision timestamp. |
| `transaction_channel` | String | online, in-store, UPI, etc. |
| `card_present` | Bool | Physical card usage flag. |
| `user_agent` | String | Browser or POS device identifier. |
| `ip_address` | String | IPv4 address of the requester. |
| `status` | String | High-level status (Success or Failed). |
| `auth_status` | String | Banking authorization code (approved/declined). |
| `failure_reason` | String | Detailed reason for declined transactions. |
| `is_fraud` | Bool | Noisy label (includes FN/FP). |
| `chargeback` | Bool | Flag indicating a later customer dispute. |
| `location_lat` | Float64 | Latitude of the transaction event. |
| `location_long` | Float64 | Longitude of the transaction event. |
| `h3_r7` | String | H3 Resolution 7 index of the transaction location. |
🕵️ Fraud Metadata (fraud_metadata.parquet)
Internal ground-truth for debugging and advanced ML training. This table is not used in standard inference but is vital for "white-box" evaluation.
| Field | Type | Description |
|---|---|---|
| `transaction_id` | String | FK to Transaction. |
| `fraud_target` | Bool | Ground Truth (True Fraud flag). |
| `fraud_type` | String | Profile used (e.g., `upi_scam`, `ato`). |
| `label_noise` | String | Reason for label mismatch (if any). |
| `injector_version` | String | Engine version. |
| `geo_anomaly` | Bool | True if location represents an outlier. |
| `device_anomaly` | Bool | True if device/UA represents an outlier. |
| `ip_anomaly` | Bool | True if IP represents a known malicious prefix. |
| `burst_session` | Bool | Part of a rapid-fire sequence. |
| `burst_seq` | Int32 | Sequence number within a burst session. |
| `campaign_id` | String | Link to a coordinated attack campaign. |
| `campaign_type` | String | Coordination type (e.g., `coordinated_attack`). |
| `campaign_phase` | String | Phase within the campaign (early, active, late). |
| `campaign_day_number` | Int32 | Days since campaign start. |
Known Issues
UUID strings are currently used for all primary keys (customer_id, card_id, etc.). While UUIDs guarantee global uniqueness, they increase storage overhead and join latency in ClickHouse compared to integer-based keys. Transitioning to a 64-bit integer ID system is under consideration for future versions.
Furthermore, a dedicated Merchant Table is not yet implemented in the output schema. Merchant attributes are currently denormalized directly into the transaction table, creating data redundancy and limiting merchant-level entity modeling. Breaking merchants into a separate merchants.parquet file is required to complete the star schema.
ETL & Feature Schema
Summary
The etl_schema.md document defines the behavioral features and data transformations performed by the RiskFabric ETL pipeline (etl.rs). It acts as the technical contract for the "Silver" and "Gold" layers, detailing how raw synthetic events are transformed into the high-dimensional vectors used for model training and real-time inference.
Design Intent
The feature schema represents a Hybrid Behavioral State, intended to provide models with a multi-domain view of financial events across customer history, merchant risk, and temporal sequences. This approach facilitates sophisticated behavioral modeling, such as Z-scores and velocity-based indicators, similar to production fraud detection systems.
A critical design choice was the use of Welford's Algorithm for statistical aggregates. Calculating running means and variances locally in Rust (and Redis) ensures that features are numerically stable and computationally efficient for both batch processing and low-latency streaming. This architectural decision is intended to eliminate training-serving skew.
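Welford's algorithm updates a running mean and variance in a single pass, without retaining transaction history. A minimal sketch of the update step (illustrative Python; the engine implements the equivalent in Rust and mirrors the state in Redis):

```python
class WelfordStats:
    """One-pass running mean/variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0        # observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the *updated* mean

    def variance(self) -> float:
        """Sample variance; 0.0 until at least two observations exist."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

Because the entire state is the triple `(n, mean, m2)`, the same values can be persisted per card and updated identically in the batch and streaming paths, which is what allows batch training features and real-time Z-scores to agree.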
🥈 Silver Layer: Behavioral Features
Transaction Sequence Features (fact_transactions_silver)
Calculated at the individual card level to identify temporal and spatial anomalies.
| Field | Description | Logic |
|---|---|---|
| `time_since_last` | Seconds since the previous event. | `T - T_prev` |
| `spatial_velocity` | Speed (km/h) between consecutive events. | `Dist(L, L_prev) / (T - T_prev)` |
| `amount_z_score` | Deviation from customer's mean spend. | `(Amt - Mean) / StdDev` |
| `hour_deviation` | Deviation from customer's peak spend hour. | Circular variance of `timestamp.hour()` |
Network & Entity Features (network_features_silver)
Identifies high-risk clusters across the payment network.
| Field | Description | Logic |
|---|---|---|
| `shared_ip_fraud` | Fraud rate of cards sharing the same IP. | `SUM(is_fraud) / COUNT(card_id) OVER IP` |
| `scammer_hub` | Flag for known high-risk coordinates. | `1 if Lat/Lon in [hub_coordinates]` |
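The `shared_ip_fraud` window reduces to a per-IP aggregate: fraudulent events divided by total events on that IP. An illustrative Python equivalent of the SQL logic (the real pipeline computes this in the Silver layer; field names follow the transaction schema):

```python
from collections import defaultdict

def shared_ip_fraud_rate(transactions):
    """Per-IP fraud rate: SUM(is_fraud) / COUNT(*) over each ip_address."""
    frauds = defaultdict(int)
    counts = defaultdict(int)
    for tx in transactions:
        ip = tx["ip_address"]
        counts[ip] += 1
        frauds[ip] += int(tx["is_fraud"])
    return {ip: frauds[ip] / counts[ip] for ip in counts}
```

Every card that transacted through a flagged IP inherits the elevated rate, which is what surfaces coordinated-campaign clusters to the model.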
🥇 Gold Layer: The Master Table
The final flattened table used for model training, joining all Silver behavioral features with the original Bronze transactions.
Known Issues
Spatial Velocity is currently calculated using a Euclidean distance approximation. While computationally efficient, this is inaccurate over long distances. Implementation of the Haversine formula is required to ensure geographic precision for cross-state and international fraud simulations.
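The Haversine formula computes great-circle distance directly from two lat/long pairs. A drop-in sketch (Python for illustration; the ETL itself is Rust):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

For a Mumbai-to-Delhi hop this yields roughly 1,150 km; a flat Euclidean approximation over raw degrees drifts noticeably at that scale, which inflates or deflates the `spatial_velocity` feature for cross-state fraud.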
Furthermore, Feature Freshness is limited to the last 10 transactions in Redis. This prevents the modeling of long-term behavioral baselines for infrequent spenders. Implementing "Stateful Cold Storage" in the ETL pipeline is necessary to retrieve historical data without exceeding real-time feature store capacity.
Configuration Reference
Summary
The config_reference.md document provides a catalog of the behavioral parameters and system-wide settings available in RiskFabric. It details the schema of the YAML configuration files that define the simulation's behavioral rules, ranging from geographic boundaries to fraud injection rates.
Design Intent
The configuration system is designed to be Hierarchical and Domain-Specific. By splitting settings into five distinct YAML files, researchers can perform comparative testing on simulation behaviors (e.g., comparing different fraud population densities) by swapping configuration files. This decoupling ensures the generator can be tuned without recompiling the Rust binaries.
A critical design choice was the use of Semantic Weights. For parameters such as hourly_weights and daily_weights, relative values are used rather than absolute probabilities. This allows the generator to maintain consistent behavioral ratios (e.g., temporal activity peaks) regardless of the total volume of generated data.
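Relative weights become sampling probabilities only after normalization, so the ratios between time slots survive any change in total volume. A minimal sketch (the weight values below are made up for illustration, not taken from transaction_config.yaml):

```python
def normalize_weights(weights):
    """Turn relative semantic weights into probabilities summing to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

# Hypothetical hourly_weights fragment: 18:00 stays exactly 3x busier
# than 03:00 whether the run generates 1M or 10M rows.
hourly = normalize_weights({"03:00": 1.0, "12:00": 2.0, "18:00": 3.0})
```

The generator can then sample each transaction's hour from the normalized distribution, keeping the configured activity peaks intact at any scale.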
📄 Core Configuration Files
fraud_rules.yaml
Defines the individual attack profiles and their behavioral biases.
- `profiles`: Mapping of fraud types (e.g., `upi_scam`) to amount strategies and geographic anomaly probabilities.
- `fraud_patterns`: List of common "test amounts" used by attackers for card validation.
customer_config.yaml
Defines the synthetic population's physical and economic footprint.
- `control.customer_count`: Total population size for the batch generation run.
- `financials.base_spend`: Expected monthly expenditure per location type (Metro, Urban, Rural).
transaction_config.yaml
Defines the "physics" of the transaction stream.
- `geo_bounds`: The lat/long bounding box for transaction events.
- `temporal_patterns`: The weighted distribution of activity across the 24-hour day and 7-day week.
Known Issues
The Lookback Period (lookback_days) can currently be set independently of the customer registration window. This allows for temporal inconsistencies where transaction history precedes a customer's registration date. Implementing cross-configuration validation is necessary to ensure temporal consistency.
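Such cross-configuration validation only needs to compare the two windows at startup and fail fast. A hypothetical sketch (the function and parameter names are illustrative, not the current config API):

```python
from datetime import date, timedelta

def validate_lookback(lookback_days: int, earliest_registration: date,
                      run_date: date) -> None:
    """Reject configs whose history window starts before any customer existed."""
    history_start = run_date - timedelta(days=lookback_days)
    if history_start < earliest_registration:
        raise ValueError(
            f"lookback_days={lookback_days} reaches back to {history_start}, "
            f"before the earliest registration date {earliest_registration}"
        )
```

Running this check when the YAML files are loaded would prevent generated transaction history from predating a customer's registration date.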
Furthermore, the Streaming Rate (streaming_rate) is a global setting. "Dynamic Throughput," which would allow the generator to simulate peak activity hours (e.g., varying tx/s by time of day), is not yet implemented. Modifying the streaming engine to respect the temporal weights defined in transaction_config.yaml is required to create more realistic real-time traffic patterns.
Developer Utilities CLI
Summary
The developer_utilities.md document details specialized binaries and tools designed to support the RiskFabric development lifecycle. These utilities automate auxiliary tasks surrounding synthetic data generation, such as geographic preprocessing, reference data export, and model metadata inspection.
Design Intent
These utilities function as a Developer's Toolkit for the simulation. By decomposing complex tasks—such as OSM node extraction and Parquet serialization—into dedicated CLI binaries, the core generation engine remains focused. This modular approach allows the synthetic environment to be rebuilt independently of the transaction simulation, enabling iteration on geographic density and merchant risk profiles.
A critical design choice was the use of Strongly-Typed Subcommands via the clap library. This provides a consistent, self-documenting interface for every utility, reducing cognitive load and ensuring operational errors are caught during argument parsing.
🔧 Core Utilities
riskfabric-prepare-refs
The primary utility for extracting and normalizing OSM data.
- `extract-nodes`: Parallel parsing of PBF files into a Postgres staging layer.
- `map-city-state`: Rules-based geographic normalization.
riskfabric-export-references
The serializer bridging the staging database and the generation layer.
- Function: Converts Postgres tables into H3-indexed Parquet files.
riskfabric-ingest
The automated loader for the ClickHouse data warehouse.
- Function: Handles schema creation and bulk loading of generated transactions.
Known Issues
Two separate binaries are currently maintained for reference handling (prepare-refs and export-references), which introduces friction in the developer workflow. Consolidation into a Unified "Refs" Command with subcommands for extraction, normalization, and export is planned.
Furthermore, Duplicate Connection Logic exists across several utilities, with database URLs and file paths hardcoded in multiple binaries. Refactoring common CLI logic into a shared riskfabric-cli-core crate is required to ensure consistent handling of parameters like --db-url and --output-dir.
Machine Learning Strategy
Summary
RiskFabric's machine learning strategy is built around the "Operational Model" philosophy. Instead of training on perfect, latent labels provided by the generator, the strategy forces models to learn from behavioral proxies in a multi-stage pipeline that mirrors real-world deployment challenges.
Design Intent
The ML pipeline serves as a Calibration Bench for the generator. Achieving 100% recall on synthetic data indicates that the fraud signatures are insufficient in complexity. Label Noise (FP/FN) and Sanitized Feature Sets are explicitly introduced to create a realistic "Information Gap" between the generator and the learner.
The architecture utilizes XGBoost as its primary classifier, leveraging its native categorical handling and gradient-boosting strengths for tabular financial data. This enables researchers to evaluate feature importance in an interpretable manner, identifying which synthetic signals (e.g., spatial velocity vs. amount deviation) are the most predictive.
🏗️ The Training Pipeline
- Ingestion & ETL: Data is extracted from the ClickHouse "Gold" layer via `train_xgboost.py`.
- Sanitization: Internal generator flags (e.g., `fraud_type`, `geo_anomaly`) are dropped to prevent data leakage.
- Training: XGBoost utilizes a `binary:logistic` objective with a 20% stratified test split.
- Verification: Models are evaluated against both the noisy `is_fraud` label and the perfect `fraud_target`.
Known Issues
The current use of Random Stratified Splitting for validation is an architectural limitation. In a financial stream, data is temporally ordered; random splitting allows for "look-ahead bias," where the model may be exposed to a customer's future patterns during training. Transitioning to Out-of-Time (OOT) Validation—training on the first nine months and testing exclusively on the final three—is necessary.
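An Out-of-Time split is a sort-then-cut on the timestamp rather than a random shuffle. A minimal sketch over generic rows (illustrative; the actual pipeline would apply the same cut to the Gold table inside train_xgboost.py):

```python
def oot_split(rows, timestamp_key, train_fraction=0.75):
    """Sort by time and cut: the earliest fraction trains, the rest tests.

    With 12 months of data and train_fraction=0.75 this is "train on the
    first nine months, test on the final three". No test row predates any
    training row, which removes look-ahead bias.
    """
    ordered = sorted(rows, key=timestamp_key)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

The trade-off versus stratified splitting is that the test fraud rate is whatever the final window contains, so metrics should be reported alongside the test-period class balance.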
Furthermore, the model is currently static, without a "Concept Drift" simulation to account for fraud signatures changing over time. This makes the accuracy metrics potentially misleading as they do not reflect adversarial evolution. Implementing a Retraining Scheduler is required to evaluate precision degradation as fraud profiles evolve.
Conceptual Explanations
High-level documentation explaining the underlying philosophy, architectural strategies, and simulation logic of RiskFabric.
Theory of Operation
This document explains the underlying philosophy, architecture, and logic of the RiskFabric simulation. It answers the question: "How does the engine actually think?"
1. Agent-Based Simulation (ABM) Philosophy
RiskFabric functions as an Agent-Based Simulator rather than a simple random data generator.
```mermaid
graph TD
    subgraph "World Building"
        OSM[OpenStreetMap Data] --> Prepare[prepare_refs.rs]
        Prepare --> PG[(Postgres / PostGIS)]
        PG --> Parquet[Reference Parquet Files]
    end
    subgraph "Simulation Engine"
        Parquet --> Gen[generate.rs / stream.rs]
        Config[YAML Configs] --> Gen
        Gen --> Trans[Transactions]
    end
    subgraph "Detection Pipeline"
        Trans --> CH[(ClickHouse)]
        CH --> ETL[etl.rs]
        ETL --> Train[train_xgboost.py]
        Train --> Scorer[scorer.py]
        Gen --> Kafka[Redpanda]
        Kafka --> Scorer
    end
```
- The Agent: The primary agent, the `Customer`, drives the logic.
- The World: OpenStreetMap (OSM) reference nodes (Residential and Merchant points) across India define the physical world.
- The Rules: Agents follow deterministic rules defined in `fraud_rules.yaml` and `transaction_config.yaml`.
Unlike statistical generators that sample from distributions to create flat tables, RiskFabric simulates the lifecycle of financial entities.
2. The Deterministic Lifecycle
To ensure consistency across 10M rows and all tables, RiskFabric follows a strict creation order:
```mermaid
graph LR
    Cust[Customer] -->|1:N| Acc[Account]
    Acc -->|1:N| Card[Card]
    Card -->|1:N| Tx[Transaction]
    Tx -->|linked| Merch[Merchant]
```
- Customer Birth: The generator assigns each customer a name, age, and a Home Coordinate based on real residential OSM nodes.
- Financial Anchoring: The system assigns one or more `Accounts` to every customer.
- Payment Instruments: Accounts issue `Cards`. These cards act as "keys" for generating transaction streams.
- The Spend Loop: Each card generates transactions based on the customer's `monthly_spend` profile.
3. The "One-Pass" Parallel Architecture
Traditional simulators often use multiple passes (e.g., Pass 1: Generate legitimate data, Pass 2: Inject fraud). This approach increases latency and memory usage.
RiskFabric uses a One-Pass Architecture in Rust:
- Parallelization: The engine uses the `Rayon` library to process thousands of entities simultaneously across all CPU cores.
- Unified Logic: Merchant selection, amount calculation, fraud injection, and campaign coordination occur in a single loop.
- Memory Efficiency: By using "Batched Generation" (5,000 entities per cycle), the engine maintains a constant memory footprint whether generating 1M or 10M rows.
4. Spatial Realism & H3 Indexing
RiskFabric is built for high geographic fidelity.
- H3 Hierarchies: The system uses Uber’s H3 hexagonal grid. When a user spends, the engine first looks for merchants within the same H3 Resolution 5 cell (neighborhood level) as their home.
- Local vs. Global Spend: Legitimate transactions remain "local" (same H3 cell) approximately 98% of the time. Fraud profiles (like UPI Scams) explicitly force "Remote" coordinates to simulate offshore or cross-state attacks.
5. Statistical Reproducibility (Seeded PRNG)
Every card in the system has a Deterministic Seed.
```rust
let mut card_rng = StdRng::seed_from_u64(global_seed + salt + card_id_hash);
```
Running the simulation with the same global_seed ensures every transaction for a given card remains identical. This enables Machine Learning reproducibility, allowing for feature adjustments without the underlying ground-truth shifting.
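The same guarantee can be illustrated with Python's `random` module (an illustrative analogue; the engine itself derives a Rust `StdRng` per card, and the salt/hash names below mirror that pattern):

```python
import random

def card_rng(global_seed: int, salt: int, card_id_hash: int) -> random.Random:
    """Derive a deterministic per-card RNG from a global seed, as the engine does."""
    return random.Random(global_seed + salt + card_id_hash)

# Two independent runs with the same global_seed replay the identical
# draw sequence for this card, so its transactions never shift.
rng_a = card_rng(42, 7, 1001)
rng_b = card_rng(42, 7, 1001)
```

Because each card owns its own seeded stream, regenerating the dataset with extra features does not perturb any other card's transactions.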
6. Simulated Imperfection (Label Noise)
To mirror real-world banking challenges, RiskFabric implements Noisy Labeling:
- Ground Truth (`fraud_target`): The latent indicator of whether the generator injected a specific fraud pattern.
- Noisy Label (`is_fraud`): The signal typically available to a bank's operational systems. It includes False Positives (legitimate transactions flagged as fraud) and False Negatives (undetected fraudulent transactions).
This design forces models to learn robustness and generalizable patterns rather than memorizing perfect synthetic signatures.
7. Hybrid Streaming & Verification Architecture
To support real-time fraud detection, RiskFabric includes a dedicated Streaming Generator that bridges the gap between static datasets and live production environments.
- One-Pass Consistency: The streaming engine reuses the exact same logic as the batch pipeline but operates on a continuous loop, producing transactions at a configurable rate (default 100 tx/s).
- Type-Level Safety (Unlabeled Output): To prevent "label leakage" during live scoring, the system uses a specialized `UnlabeledTransaction` struct. This mirrors the standard transaction but programmatically omits all ground-truth and labeling fields (`is_fraud`, `chargeback`, etc.), ensuring the Kafka payload is consistent with a real production stream.
- Verification Mode: While in verification mode, the generator writes the "Ground Truth" of every streaming transaction to `ground_truth.csv`. This allows for a post-hoc join against real-time model scores to measure precision and recall in a simulated production environment.
- Self-Correcting Rate Limiter: The generator measures actual Kafka broker latency for every message sent. It dynamically adjusts its sleep interval to compensate for network jitter, ensuring steady, drift-free throughput over long durations.
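The self-correcting limiter described above reduces to one control rule: subtract the measured send latency from the per-message time budget. A simplified sketch of that rule (Python for illustration; the function name and structure are assumptions, not the stream.rs API):

```python
def next_sleep(target_rate_tps: float, send_latency_s: float) -> float:
    """Sleep for the target interval minus time already spent sending.

    At 100 tx/s the budget per message is 10 ms; if the broker ack took
    4 ms, sleep only the remaining 6 ms so throughput does not drift.
    """
    interval = 1.0 / target_rate_tps
    return max(0.0, interval - send_latency_s)
```

When the broker is slower than the budget the sleep clamps to zero, so the generator falls behind gracefully instead of accumulating negative sleep debt.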
Fraud Signatures & Attack Patterns
Summary
The fraud_signatures.md document serves as the high-level behavioral specification for the simulation's adversary. It defines both individual fraud profiles and coordinated multi-entity campaigns, providing the theoretical basis for the synthetic anomalies generated by the engine.
Design Intent
I designed these signatures to move beyond "random noise" and toward Structured Adversarial Intelligence. Each profile (e.g., UPI Scam, Account Takeover) is anchored in a specific real-world financial threat observed in the Indian market. By layering Campaign Logic on top of individual profiles, I allow the simulation to model the "clustered" signals that are the hallmark of professional criminal organizations.
A critical design choice was the implementation of Probabilistic Mutation. Instead of every fraudulent transaction being an obvious outlier, I use configuration-driven probabilities to ensure that some fraud looks "legitimate" (e.g., Friendly Fraud). This forces ML models to learn subtle, high-dimensional boundaries rather than simple, hard-coded thresholds.
1. Fraud Profiles (Individual Patterns)
| Profile | Behavioral Signature | Spatial Signature |
|---|---|---|
| UPI Scam | High frequency, small to medium amounts (₹1,500 - ₹20,000). | 90% Geo-Anomaly: Scammer is remote. |
| Account Takeover | High-value transfers, sudden change in device/channel. | 40% Geo-Anomaly: Compromised from distant location. |
| Velocity Abuse | Rapid-fire "testing" transactions (₹1.01, ₹1.23, etc.). | 10% Geo-Anomaly: Low spatial signal. |
| Card Not Present | Online-only channel bias, standard e-commerce amounts. | 30% Geo-Anomaly: Card details used remotely. |
| Friendly Fraud | Legitimate channel/device, standard amounts. | 0% Geo-Anomaly: Customer is physically at home. |
2. Campaign Attack Patterns (Coordinated)
Coordinated Attack
- Signal: Multiple distinct cards/customers targeted simultaneously by a single entity.
- Hard Correlation: Every transaction in the campaign shares the exact same IP Address and geographic coordinate (simulating a scammer hub or proxy).
- Tuning: Coordinated IP is configurable via `fraud_tuning.yaml` (default: `103.21.244.12`).
Sequential Takeover
- Signal: A single card experiencing a progressive escalation of fraud.
- Monotonic Escalation: Each subsequent transaction amount is multiplied by the `ato_escalation_rate` (default: 30%).
- Persistent Location: Once the takeover begins, the geographic coordinate "sticks" to the attacker's location for the remainder of the sequence.
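With the default `ato_escalation_rate` of 30%, each step multiplies the previous amount by 1.3, so the sequence is geometric. A quick illustrative sketch (the `escalate` helper is hypothetical, not part of the engine):

```python
def escalate(base_amount: float, rate: float, steps: int) -> list[float]:
    """Monotonic ATO escalation: amount_n = base * (1 + rate) ** n."""
    return [round(base_amount * (1 + rate) ** n, 2) for n in range(steps)]
```

Starting from ₹1,000, four steps at the default rate produce 1000.00, 1300.00, 1690.00, 2197.00, giving models a strictly increasing amount signal to latch onto.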
Known Issues
I have currently implemented the "Spatial Signature" for fraud as a simple latitude/longitude jump. While this creates a clear anomaly, it doesn't account for Traveling Legitimate Customers. This leads to a higher-than-normal false positive rate in models that rely too heavily on distance-from-home. I need to implement a "Travel Profile" for legitimate customers to introduce more realistic noise.
Furthermore, my campaign logic is currently limited to "Shared IP" and "Shared Coordinate." I haven't yet implemented Account-to-Account (A2A) graph signals, where stolen funds are moved through a chain of "mule" accounts. This is a significant gap in the simulation's "Money Laundering" fidelity that I need to address in the next version of the fraud.rs engine.
Synthetic Fraud Profiles
Summary
The fraud_profiles.md document provides a detailed behavioral and statistical breakdown of the five core adversarial signatures simulated by RiskFabric. It explains the contextual logic used by the generator to mimic real-world financial crimes and provides examples of how these patterns manifest in the synthetic data stream.
Design Intent
These profiles are designed to challenge machine learning models by mirroring the statistical "noise" and multi-dimensional anomalies of modern fraud. By shifting from simple "hardcoded amount" rules to Behavioral Multipliers and Contextual Biases, the generator forces downstream models to evaluate combinations of spatial velocity, merchant categories, and temporal deviations.
1. Velocity Abuse
Objective: Simulate a bot network or organized fraud ring rapidly "testing" compromised card details or exploiting a merchant gateway before limits are triggered.
Behavioral Signature
- Amount Strategy: `customer_normal_range` with a strict 0.90x to 1.10x multiplier.
- Primary Signals: Extreme Transaction Frequency (`rapid_fire_transaction_flag`), High Spatial Velocity (impossible travel), and Specific Merchant Bias (`GAMBLING`, `ENTERTAINMENT`).
- The "Trick": By keeping the transaction amount perfectly aligned with the customer's normal spending habits, it evades simple threshold-based alerts, forcing the model to rely entirely on speed and location.
Example Scenario
A customer whose average transaction is ₹500 has three transactions generated within a 4-minute window for exactly ₹490, ₹510, and ₹495 at three different entertainment merchants located 800km away from their last known physical transaction.
2. Account Takeover (ATO)
Objective: Simulate a malicious actor gaining unauthorized access to a legitimate user's banking app or online portal to drain funds or make high-value purchases.
Behavioral Signature
- Amount Strategy: `customer_normal_range` with a tight 0.95x to 1.05x multiplier.
- Primary Signals: Extreme Spatial Velocity (impossible travel), Temporal Anomaly (occurring during the customer's historical "sleep" hours), and Channel Bias (`mobile_banking`, `online`).
- The "Trick": Similar to Velocity Abuse, the amount does not spike. The anomaly is purely contextual: the transaction occurs on a new device, from a new IP, at 3:00 AM, purchasing from a `LUXURY` or `ELECTRONICS` merchant.
Example Scenario
A customer completes an in-store grocery purchase in Mumbai at 8:00 PM. At 3:15 AM the following morning, a mobile banking transfer for a standard amount is initiated from an IP address in Delhi.
3. Card Not Present (CNP) Fraud
Objective: Simulate the unauthorized use of stolen credit card details (PAN, CVV) for online purchases, typically for easily liquidatable goods.
Behavioral Signature
- Amount Strategy: `customer_normal_range` with an aggressive 1.0x to 5.0x multiplier.
- Primary Signals: Channel Bias (100% `online`), Merchant Category Bias (`ELECTRONICS`, `LUXURY`), and an elevated `amount_deviation_z_score`.
- The "Trick": This profile blends moderate amount spikes with specific merchant categories. It tests the model's ability to correlate the "Online" channel with high-risk retail sectors.
Example Scenario
A customer who typically spends ₹2,000 per transaction across various local stores suddenly has an online transaction for ₹8,500 at an ELECTRONICS merchant, processed without physical card presence.
4. UPI Scam (Social Engineering)
Objective: Simulate phishing or coercive scams where a victim is tricked into authorizing a high-value transfer via the Unified Payments Interface (UPI).
Behavioral Signature
- Amount Strategy: `customer_normal_range` with a massive 1.5x to 4.0x multiplier.
- Primary Signals: Massive `amount_deviation_z_score`, Channel Bias (heavily biased toward `upi`), and Merchant Category Bias (`GENERAL_RETAIL`, `SERVICES`).
- The "Trick": This represents the classic "drain the account" scenario. The model must learn that extreme amount deviations on the UPI channel to unfamiliar service merchants are highly suspicious, even if the device fingerprint appears legitimate.
Example Scenario
A user with an average transaction of ₹300 suddenly authorizes a UPI payment of ₹1,100 to a previously unseen "Services" merchant, heavily deviating from their historical spend pattern.
5. Friendly Fraud (First-Party Fraud)
Objective: Simulate a legitimate customer making a valid purchase (often digital goods or travel) and subsequently filing a false chargeback claim with their bank.
Behavioral Signature
- Amount Strategy: `customer_normal_range` with a standard 0.5x to 1.5x multiplier.
- Primary Signals: None. This profile intentionally lacks spatial, temporal, or behavioral anomalies.
- The "Trick": This is the hardest profile to detect at the transaction level. The location, device, and amount are all perfectly normal. Detection relies entirely on historical entity-level features, such as the `cf_fraud_rate` (Customer Fraud Rate) or `merchant_category` risks (`TRAVEL`, `FOOD_AND_DRINK`).
Example Scenario
A customer purchases a ₹1,200 airline ticket online from their home IP address, using their normal device, during their usual active hours. Three weeks later, the transaction is marked with a chargeback flag.
Data Warehouse & dbt Strategy
Summary
The data_warehouse.md document outlines the architectural strategy for RiskFabric's analytical layer. It explains how raw synthetic data is transformed into high-fidelity behavioral entities using a Modern Data Stack (MDS) approach, specifically leveraging ClickHouse for high-volume transactions and Postgres/dbt for geographic enrichment.
Design Intent
The warehouse functions as a Medallion Data Lakehouse, intended to demonstrate how synthetic data can be used to test both machine learning models and the data engineering lifecycle. By using dbt (data build tool), complex geographic filtering and merchant risk assignment are implemented in SQL, allowing for a clear separation between the simulation engine (Rust) and the analytical environment (SQL).
A critical architectural decision was the adoption of a Dual-Warehouse Model. ClickHouse serves as the primary engine for transaction data due to its performance with columnar storage and large-scale joins. Conversely, Postgres is used for "Level 0" geographic preparation (OSM extraction), as it provides mature support for spatial extensions like PostGIS. This approach ensures each part of the simulation utilizes the tool best suited for its specific data type.
🏗️ Warehouse Layers
- Bronze (Raw): Direct ingest from Parquet files via `ingest.rs`.
- Silver (Enriched): Entity-level behavioral features (e.g., `customer_features_silver`).
- Gold (Master): The flattened, model-ready `fact_transactions_gold`.
Known Issues
The system currently utilizes Podman-based container execution to interact with the warehouse from the Rust binaries. This introduces environment-level fragility and limits the simulation's scalability in distributed cloud environments. Transitioning to native ClickHouse and Postgres client libraries is necessary to improve the reliability of the ingestion and transformation stages.
Furthermore, dbt models are split between two different databases (Postgres for references, ClickHouse for transactions). This prevents cross-warehouse joins and requires moving data via Parquet files. Unifying the transformation layer—specifically by moving all "Level 0" geography data into ClickHouse—is required to eliminate manual data-movement steps and simplify dbt pipeline orchestration.
Project Goals & Objectives
Summary
The objectives.md document defines the high-level mission and technical milestones for the RiskFabric project. It outlines the strategic intent behind building a high-fidelity synthetic data generator and the specific problems it aims to solve for the financial technology community.
Design Intent
RiskFabric is designed to address the "Data Paradox" in fraud detection: researchers require large volumes of labeled data to develop effective models, but real-world financial data is sensitive and often inaccessible. By creating a high-fidelity, "white-box" alternative, the project provides a safe environment for testing machine learning algorithms and the operational infrastructure required for real-time fraud detection.
A key strategic objective is the promotion of Infrastructure-as-Code for Simulation. Transitioning from static CSV datasets to dynamic, configuration-driven environments allows organizations to "stress-test" systems against hypothetical scenarios—such as doubling transaction volumes—without requiring production data.
🎯 Key Milestones
- High-Fidelity Generation: Reaching 180k+ TPS while maintaining spatial and temporal realism.
- Streaming Parity: Ensuring models trained on batch data perform consistently in real-time Kafka environments.
- Adversarial Diversity: Expanding the fraud library to include multi-stage attacks like money laundering and mule-account networks.
Known Issues
Focus is currently placed on Individual and Coordinated Fraud, but Macroeconomic Factors remain unimplemented. The simulation assumes spending patterns are unaffected by external events such as inflation or holidays. Implementing a "Global Event Engine" is necessary to simulate seasonal surges and economic shifts, providing a more challenging baseline for detection models.
Furthermore, the project lacks Multi-Currency Support. The simulation is anchored to a single base currency, preventing the modeling of international fraud or cross-border remittance scams. Refactoring the transaction engine to handle dynamic currency conversion and exchange-rate fluctuations is required to support global fintech use cases.
Results & Monitoring
Tracking the evolution of model performance, generation throughput, and ETL efficiency benchmarks.
Machine Learning Metrics & Model Progression
This document tracks the performance and evolution of the fraud detection models trained on RiskFabric synthetic data, progressing from initial leakage-prone baselines to a robust, behavioral production configuration.
Section 1: Early Iterations
The development process began with basic feature sets to establish a baseline for fraud detection performance.
v1 Iteration (Baseline)
The initial model established core feature sets including amount deviations and spatial velocity on a sample population.
- Accuracy: 0.95
- ROC AUC Score: 0.9782
- Recall (Fraud): 0.30 (Identified significant "Recall Gap")
v2 High-Fidelity (Leakage Detected)
Scaling to larger datasets revealed massive performance inflation due to generator artifacts in metadata fields.
- ROC AUC Score: 0.9993
- Leakage Identified: Synthetic metadata fields (`fraud_target`, `burst_seq`) were providing a "static bypass" for the model.
v2 Iteration (Leakage Prevention)
The feature vector was sanitized to exclude metadata, shifting the focus to behavioral signals.
- ROC AUC Score: 0.9746
- Recall (Noisy Labels): 0.72
- Sanitization: Transitioned from `fraud_target` to the noisy `is_fraud` label.
Note: In addition to the leakage issues documented below, v1 and v2 iterations were trained on an incomplete feature set. Behavioral features computed in the Rust ETL layer — including amount_deviation_z_score, spatial_velocity, and granular anomaly flags — were silently dropped before reaching XGBoost due to a narrow Gold table join. The inflated AUC figures in these iterations reflect both metadata leakage and the absence of the features that would have provided genuine behavioral signal.
Section 2: v3 — Production Configuration (Final)
The final model configuration focuses on pure behavioral signals, specifically tuned to handle the extreme class imbalance (1.4% fraud rate) found in realistic production environments.
Training Setup
- Dataset: 1.5M transactions (Seed 42).
- Fraud Rate: 1.41% (`target_share: 0.01`, `fp_rate: 0.005`).
- Model: XGBoost binary classifier.
- Scale Pos Weight: 69.57 (Computed dynamically from training imbalance).
- Eval Metric: `aucpr` (Area Under Precision-Recall Curve).
- Label Noise: 0.5% False Positives and 1% False Negatives deliberately injected.
- Theoretical Recall Ceiling: 66.7% (Derived from the intentional label noise ratio).
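The dynamically computed class weight described above can be sketched as follows. The labels here are a random stand-in matching the documented ~1.41% fraud rate, not the actual dataset, and the parameter dict is illustrative rather than the project's training script:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the training labels: ~1.41% positives, matching the
# documented fraud rate of the v3 dataset.
y = (rng.random(1_500_000) < 0.0141).astype(np.int8)

# scale_pos_weight = negatives / positives, computed from the actual
# training split rather than hard-coded.
scale_pos_weight = float((y == 0).sum() / (y == 1).sum())

# Hypothetical XGBoost parameter dict using that weight.
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",  # PR-based metric suits the ~1.4% class imbalance
    "scale_pos_weight": scale_pos_weight,
}
print(round(scale_pos_weight, 2))
```

With the documented 1.41% fraud rate this lands near the reported 69.57; the exact value depends on the realized training split.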
Feature Importance
The model prioritizes physical and financial anomalies over static identifiers.
| Feature | Importance | Description |
|---|---|---|
| spatial_velocity | 25.38% | Impossible travel speed between transactions |
| amount_deviation_z_score | 20.80% | Spending magnitude relative to customer norm |
| time_since_last_transaction | 12.72% | Temporal burst and frequency detection |
| transaction_channel | 11.60% | Risk associated with specific payment methods |
| merchant_category | 11.08% | Contextual risk of the merchant type |
| hour_deviation_from_norm | 7.40% | Circadian rhythm anomalies |
| merchant_category_switch_flag | 2.89% | Unexpected shifts in merchant category |
| card_present | 2.45% | Physical vs. digital transaction risk |
| transaction_sequence_number | 1.95% | Position within the account lifecycle |
| rapid_fire_transaction_flag | 1.88% | High-velocity sequence identification |
For a detailed narrative of the discovery and resolution of these artifacts, see the Feature Leakage Case Study.
Generalization Results
Validated against three independent populations to ensure robust performance across different random seeds.
| Test Population | Seed | Transactions | AUC |
|---|---|---|---|
| Holdout | 42 (Same) | 1.5M | 84.72% |
| Independent | 8888 (Different) | 1.5M | 79.94% |
| Independent | 5555 (Different) | 3.0M | 79.81% |
Note: The higher AUC on the holdout set is due to distributional overlap with the training population, while the ~80% AUC on independent seeds represents the model's true behavioral generalization.
Section 3: Threshold Operating Points
In a production environment, the model's probability output is mapped to specific operational actions.
| Operating Mode | Threshold | Precision | Recall | F1 | Use Case |
|---|---|---|---|---|---|
| Detection Layer | 0.495 | 10% | 60% | 0.172 | Review queue — broad capture |
| Triage | 0.645 | 18% | 55% | 0.268 | Early analyst filtering |
| Investigation | 0.736 | 31% | 50% | 0.385 | Analyst workbench |
| High Confidence | 0.842 | 57% | 45% | 0.502 | Escalation decisions |
| Blocking | 0.945 | 73% | 40% | 0.517 | Automatic card block |
The Detection Layer feeds a review queue for manual inspection, while the Blocking Layer is reserved for automated enforcement. The tradeoff between these layers is an operational business decision, not a model failure.
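The table above can be read as a cascading routing rule. A minimal sketch, using the documented thresholds; the `route` function and action names are illustrative, not the project's scorer API:

```python
# Map a model probability to the operating modes in the table above.
# Thresholds are taken from the documented operating points.
OPERATING_POINTS = [
    (0.945, "block"),        # automatic card block
    (0.842, "escalate"),     # high-confidence escalation
    (0.736, "investigate"),  # analyst workbench
    (0.645, "triage"),       # early analyst filtering
    (0.495, "review"),       # broad-capture review queue
]

def route(score: float) -> str:
    """Return the highest-severity operating mode whose threshold is met."""
    for threshold, action in OPERATING_POINTS:
        if score >= threshold:
            return action
    return "allow"

print(route(0.97), route(0.70), route(0.10))  # block triage allow
```

A single probability output thus serves every operational layer; moving a threshold trades precision against recall without retraining.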
Section 4: Merchant Category Audit
Leakage verification at the "Blocking" threshold (0.945) confirms that overrepresentation reflects genuine category risk levels rather than static bypasses.
| Category | Global Share | Flag Share | Index | Verified Fraud Rate |
|---|---|---|---|---|
| GAMBLING | 0.07% | 1.09% | 17x | 17.68% |
| ENTERTAINMENT | 1.10% | 14.35% | 13x | 11.20% |
| LUXURY | 1.62% | 8.63% | 5x | 4.91% |
| ELECTRONICS | 3.39% | 10.22% | 3x | 2.40% |
| TRAVEL | 6.14% | 16.29% | 2.6x | 2.53% |
| SERVICES | 5.15% | 11.92% | 2.3x | 2.53% |
All verified fraud rates fall below the 20% threshold, confirming that no single category acts as a near-deterministic fraud rule. The model uses category as a Bayesian prior requiring behavioral confirmation rather than a static classifier.
The GAMBLING index was previously at 103x (documented in the leakage case study); its reduction to 17x after generator retuning and the verified fraud rate confirms it is now a legitimate signal.
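The Index column is simply the ratio of the two share columns. A sketch using the GAMBLING row; note that the rounded percentages printed in the table reproduce roughly 16x rather than the 17x presumably computed from unrounded values:

```python
def overrepresentation_index(global_share: float, flag_share: float) -> float:
    """How much more often a category appears among flagged transactions
    than in the overall transaction population."""
    return flag_share / global_share

# GAMBLING row: 0.07% of all transactions, 1.09% of flagged ones.
idx = overrepresentation_index(global_share=0.0007, flag_share=0.0109)
print(f"{idx:.1f}x")  # ~15.6x from the rounded table inputs
```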
Section 5: Known Limitations
Recall Ceiling (66.7%)
Theoretical maximum recall is imposed by the deliberate label-noise design. The 0.5% false-positive rate (`fp_rate`) injects positive labels that are behaviorally unlearnable, so recall approaching this ceiling represents optimal behavior.
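One consistent reading of the documented figures: positive labels are a mix of true fraud (`target_share`) and injected false positives (`fp_rate`), and only the true-fraud portion is learnable:

```python
# Documented generator settings: ~1% true fraud, plus 0.5% legitimate
# transactions deliberately mislabeled as fraud.
target_share = 0.01
fp_rate = 0.005

# A behavioral model can recover the true-fraud positives but not the
# mislabeled-legitimate ones, capping recall at:
recall_ceiling = target_share / (target_share + fp_rate)
print(round(recall_ceiling, 3))  # 0.667
```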
Silver ETL Eager Execution
Sequence features using .over() window functions trigger eager in-memory execution despite Polars lazy API usage. Datasets significantly exceeding available RAM will hit memory pressure. Roadmap: transition to a stateful streaming pre-aggregation pass.
Campaign Detection
Coordinated attack signatures require graph-based reasoning over entity relationships. Individual transactions in a campaign are often behaviorally indistinguishable from legitimate ones when viewed in isolation—this is a structural limitation of single-transaction classifiers.
Feature Leakage Case Study
While developing an XGBoost model for flagging potentially fraudulent transactions, problems were discovered that made it unsuitable for real-time scoring. The core issue was a single feature dominating the model's decision making.
Transactions were synthetically generated for 10,000 customers, totaling around 6M transactions. After feature engineering, they were joined into a `gold_master_table` used by XGBoost for training. The resulting AUC score of 0.9079 felt more realistic than the previous run on a smaller dataset (4.3M transactions, AUC 0.97). However, the crux of the issue became apparent when the top features by importance were checked:
| Features (AUC) | 0.9079 |
|---|---|
| amount | 0.9632 |
| escalating_amounts_flag | 0.0093 |
| cf_night_tx_ratio | 0.0049 |
| transaction_sequence_number | 0.0048 |
| rapid_fire_escalation_flag | 0.0048 |
| time_since_last_transaction | 0.0042 |
| transaction_channel | 0.0039 |
| merchant_category_switch_flag | 0.0027 |
| t.merchant_category | 0.0021 |
| card_present | 0 |
What this means is that if the model sees a suspicious transaction amount, there is a high probability it will flag the transaction as fraud without considering other characteristics. This is sub-optimal: the model should weigh multiple constraints such as temporal factors and geographic behavior. To address this, two strategies were evaluated: removing amount as a training feature to test performance on purely behavioral flags, or using feature binning to reduce reliance on exact values. Testing the second option showed negligible difference in feature importance, so the first option was evaluated, which revealed the underlying issue in the system. After removing amount as a feature and executing the training script again, an ROC AUC of 0.5868 was achieved, a significant decrease from the previous result, but the resulting feature importance distribution was more revealing:
| Features (AUC) | 0.5868 |
|---|---|
| escalating_amounts_flag | 0.8865 |
| transaction_channel | 0.0227 |
| time_since_last_transaction | 0.0208 |
| transaction_sequence_number | 0.0203 |
| cf_night_tx_ratio | 0.0194 |
| rapid_fire_escalation_flag | 0.0172 |
| t.merchant_category | 0.0076 |
| merchant_category_switch_flag | 0.0057 |
This indicates that the engineered behavioral features have too little predictive power for the model to capture; essentially, the model cannot distinguish them from normal variation. Re-tuning the fraud generator is required to create distinctive behavioral signals, and the pipeline must ensure those features are engineered effectively and carried through to the gold table, so that each fraud signature has distinctive characteristics the model can capture.
Analysis of the pipeline identified a significant gap: the behavioral features engineered in Rust were being dropped before reaching XGBoost:
1. The "Silently Dropped" Features: In `src/etl/features/sequence.rs`, high-value signals are calculated that would likely address the 0.58 AUC problem, but they are missing from the ClickHouse tables:
   - `amount_deviation_z_score`: This measures whether ₹5,000 is "normal" for that specific customer. Without it, the model only sees the absolute ₹5,000 and assumes fraud because the average transaction is ₹500.
   - `fraud_type` & `campaign_id`: These are currently calculated but not stored in the Silver layer.
   - Granular Anomalies: `geo_anomaly`, `device_anomaly`, and `ip_anomaly` are calculated in the transformation but are not selected in the final Gold table join.
2. The Gold Table Join Is Too Narrow: The `run_gold_master` function in `src/bin/etl.rs` only pulls a small subset of columns from the Silver tables, ignoring the very features needed to replace the "Amount Shortcut."
The Plan to Fix It:
Step 1: Repair the ETL Pipeline (the "Plumbing" fix)
- Update the `CREATE TABLE` statement for `fact_transactions_silver` to include the missing behavioral columns.
- Update the `run_gold_master` query to pull these features into the final training set.
- Outcome: The model will finally "see" the Z-score and the behavioral context.
Step 2: Re-tune the Generator (the "Signal" fix)
- Modify the `fraud_rules.yaml` configuration so fraudulent transaction amounts overlap with legitimate amounts.
- Outcome: The model is forced to stop using "High Amount" as a shortcut and start using the Z-score and velocity features fixed in Step 1.
| Features (AUC) | 0.8246 |
|---|---|
| amount_deviation_z_score | 0.9069 |
| escalating_amounts_flag | 0.0468 |
| transaction_sequence_number | 0.0122 |
| cf_night_tx_ratio | 0.0121 |
| time_since_last_transaction | 0.0075 |
| transaction_channel | 0.0047 |
| rapid_fire_transaction_flag | 0.0033 |
| merchant_category_switch_flag | 0.0033 |
| t.merchant_category | 0.0033 |
| card_present | 0.0000 |
The score improved to 0.8246 just by fixing the plumbing. No new features, no generator retuning, no architectural changes. The signals were there the whole time. However, amount_deviation_z_score at 90% is the new dominant feature. It is better than raw amount, since it is customer-relative and therefore more meaningful, but it is still a single feature carrying almost everything.
The generator needs to be retuned to overlap fraud and legitimate amount distributions. Force fraudsters to transact at amounts that are normal for that customer — the Z-score becomes less dominant, behavioral features have to carry more weight.
When fraud amounts overlap with legitimate amounts, the model must rely on:
cf_night_tx_ratio— when does this customer normally transact?rapid_fire_transaction_flag— velocity anomalymerchant_category_switch_flag— behavioral deviationtime_since_last_transaction— timing patterns
| Features (AUC) | 0.7960 |
|---|---|
| amount_deviation_z_score | 0.9337 |
| escalating_amounts_flag | 0.0229 |
| cf_night_tx_ratio | 0.0108 |
| transaction_sequence_number | 0.0103 |
| time_since_last_transaction | 0.0071 |
| transaction_channel | 0.0060 |
| rapid_fire_transaction_flag | 0.0044 |
| t.merchant_category | 0.0026 |
| merchant_category_switch_flag | 0.0023 |
| card_present | 0.0000 |
After retuning the generator to create more overlap and amplify behavioral signals, the score dropped slightly to 0.7960: fraud amounts now blend into legitimate ranges, so amount_deviation_z_score has less to work with. The model is being forced away from the amount shortcut. AUC dropped because the problem genuinely got harder.
However, amount_deviation_z_score is still at 93% despite the overlap. This means the Z-score is still capturing enough separation between fraud and legitimate amounts to dominate; the overlap was not aggressive enough. The problem appears to be that fraud amounts are mostly specific values (₹5000, ₹8500, ₹12000) while legitimate transactions cluster around ₹660. The Z-score therefore still reads the situation as "this customer normally spends ₹660, this transaction is ₹8500, suspicious." The relative deviation is still huge.
The Z-score only becomes less dominant when fraudsters transact at amounts that are normal for that specific customer. This requires the generator to look up the customer's monthly_spend and generate fraud amounts within their normal range:
    account_takeover:
      amount_strategy: "customer_normal_range"   # instead of fixed high_value_amounts
      amount_multiplier: 0.8_to_1.2              # within the customer's normal band
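In Python terms, the hypothetical `customer_normal_range` strategy might look like this sketch. The function name and the average-ticket input are illustrative, not the generator's actual API:

```python
import random

def fraud_amount(customer_avg_ticket: float, rng: random.Random) -> float:
    """Sample a fraud amount inside the customer's normal band.

    Hypothetical `customer_normal_range` strategy: instead of fixed
    high-value amounts (e.g. ₹8,500), scale the customer's own typical
    ticket by a multiplier drawn from [0.8, 1.2].
    """
    multiplier = rng.uniform(0.8, 1.2)
    return round(customer_avg_ticket * multiplier, 2)

rng = random.Random(42)
# A customer who normally spends about ₹660 per transaction:
amounts = [fraud_amount(660.0, rng) for _ in range(3)]
print(amounts)  # all within ₹528-₹792, so the Z-score stays small
```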
This makes fraud amounts customer-relative rather than absolute. In this iteration, the model was trained without amount_deviation_z_score and escalating_amounts_flag to evaluate the strength of the behavioral signals.
| Features (AUC) | 0.6704 |
|---|---|
| amount_deviation_z_score | 0.9101 |
| transaction_sequence_number | 0.0188 |
| cf_night_tx_ratio | 0.0165 |
| escalating_amounts_flag | 0.0152 |
| time_since_last_transaction | 0.0121 |
| transaction_channel | 0.0117 |
| rapid_fire_transaction_flag | 0.0067 |
| t.merchant_category | 0.0045 |
| merchant_category_switch_flag | 0.0044 |
| card_present | 0.0000 |
Removing escalating_amounts_flag dropped AUC to 0.6704 but amount_deviation_z_score remains at 91%.
Every amount-derived feature removed makes the Z-score more dominant. The model is completely anchored to amount-relative signals. The behavioral features — cf_night_tx_ratio, rapid_fire_transaction_flag, merchant_category_switch_flag — collectively contribute approximately 5-6% of decisions.
Removing the Z-score and executing the training once more provides the definitive test. The resulting AUC represents the pure behavioral signal floor — no amount, no Z-score, no escalating amounts. Just time, velocity, channel, merchant, sequence.
| Features (AUC) | 0.5572 |
|---|---|
| transaction_channel | 0.1691 |
| escalating_amounts_flag | 0.1677 |
| time_since_last_transaction | 0.1644 |
| transaction_sequence_number | 0.1602 |
| cf_night_tx_ratio | 0.1532 |
| rapid_fire_transaction_flag | 0.0851 |
| t.merchant_category | 0.0679 |
| merchant_category_switch_flag | 0.0325 |
The score dropped to 0.5572, and for the first time, feature importance is evenly distributed. No single feature exceeds 17%, with every behavioral signal contributing. This structure represents a balanced feature set.
The challenge is that none of these features possess sufficient signal strength to detect fraud reliably. The model is unable to distinguish fraud because fraudsters in the simulation behave too similarly to legitimate customers. For instance:
transaction_channelat 17% — channel bias exists but is weak.cf_night_tx_ratioat 15% — night patterns exist but fraud is not concentrated enough at night to be distinctive.rapid_fire_transaction_flagat 8.5% — velocity fraud occurs but not with sufficient frequency.merchant_category_switch_flagat 3.25% — almost no signal. Fraudsters shop at similar merchants as legitimate customers.
To address this at the root level, the logic responsible for injecting fraudulent signatures and behaviors requires refinement to increase signal strength for training a behaviorally-driven fraud model.
In the RiskFabric project, fraud.rs is primarily responsible for injecting fraud labels and altering transaction behavior according to the fraud signature. Two configurations drive transaction behavior: geo_anomaly_prob and device_anomaly_prob. Inspection of geo_anomaly_prob identified significant limitations:
If a transaction has the geo_anomaly flag set to true, its coordinates are randomized from the global range. While this creates an anomaly, it does not provide a behavioral signal that the model can learn without access to the customer's "Home" coordinates or a feature like "Distance from Home." Consequently, the model only evaluates final_lat and final_lon. Since legitimate transactions are also distributed across India (clustered around specific homes), a random coordinate appears normal to a model lacking home location context.
To resolve this, a new feature, Spatial Velocity, was introduced in the ETL layer. This measures: distance(txn_N, txn_N-1) / time(txn_N, txn_N-1), enabling the model to identify high-velocity spatial anomalies, such as transactions occurring in distant cities within short time intervals.
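A minimal computation of the feature as defined above, using haversine distance in kilometres over elapsed hours (the project's implementation lives in the Rust ETL layer; the dict-based transaction records here are illustrative):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def spatial_velocity(prev, curr):
    """distance(txn_N, txn_N-1) / time(txn_N, txn_N-1), in km/h."""
    dist = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    hours = (curr["ts"] - prev["ts"]) / 3600.0
    return dist / hours if hours > 0 else float("inf")

# Mumbai -> Delhi (~1,150 km) within 30 minutes: impossible travel.
prev = {"lat": 19.0760, "lon": 72.8777, "ts": 0}
curr = {"lat": 28.6139, "lon": 77.2090, "ts": 1800}
print(spatial_velocity(prev, curr) > 900)  # True: far above airliner speed
```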
| Features (AUC) | 0.6868 |
|---|---|
| spatial_velocity | 0.6126 |
| escalating_amounts_flag | 0.1847 |
| time_since_last_transaction | 0.0652 |
| merchant_category_switch_flag | 0.0643 |
| t.merchant_category | 0.0167 |
| transaction_sequence_number | 0.0148 |
| rapid_fire_transaction_flag | 0.0145 |
| transaction_channel | 0.0144 |
| cf_night_tx_ratio | 0.0128 |
| card_present | 0.0000 |
The AUC increased to 0.6868 from a single feature addition. spatial_velocity at 61% became the dominant behavioral feature—a genuine behavioral signal. This also had a cascading effect on other features:
merchant_category_switch_flagincreased from 3.25% → 6.43%time_since_last_transactionchanged from 16% → 6.52%
Several issues still required attention:
spatial_velocityat 61% was too dominant, capturing almost the entiregeo_anomalyfraud signal. The implementation at the time teleported fraudsters to random coordinates, almost guaranteed to trigger the impossible travel flag.cf_night_tx_ratiodecreased to 1.28%, as night behavior was not sufficiently distinctive in the generator.card_presentremained at 0%, indicating CNP fraud was not being captured.
Analysis of the low cf_night_tx_ratio (1.28%) led to an audit of the hourly distribution, particularly under Account Takeover (ATO) fraud. While hourly_weights peaked in the early morning and late evening to simulate attacker activity, the "Night Ratio" was not a strong signal due to legitimate late-night spending and a lack of sharpness in the ATO peak.
This was addressed by updating account_takeover hourly weights to concentrate over 70% of transactions between 00:00 and 04:00. Additionally, the hour_deviation_from_norm feature was introduced in the ETL layer to capture temporal anomalies at the transaction level by determining the absolute deviation from a customer's average transaction hour.
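The document defines the feature only as the absolute deviation from the customer's average transaction hour; a circular distance (so 23:00 vs 01:00 is 2 hours apart, not 22) is one reasonable sketch of it:

```python
def hour_deviation_from_norm(txn_hour: int, customer_avg_hour: float) -> float:
    """Absolute deviation from the customer's average hour, on a 24h circle.

    Circular distance keeps 23:00 vs 01:00 at 2 hours apart rather than 22;
    the clock wraps at midnight.
    """
    d = abs(txn_hour - customer_avg_hour) % 24
    return min(d, 24 - d)

# A daytime shopper (average hour 14:00) transacting at 02:00 under a
# simulated account takeover:
print(hour_deviation_from_norm(2, 14.0))   # 12.0
print(hour_deviation_from_norm(23, 1.0))   # 2.0
```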
| Features (AUC) | 0.7005 |
|---|---|
| spatial_velocity | 0.6439 |
| escalating_amounts_flag | 0.1450 |
| merchant_category_switch_flag | 0.0664 |
| time_since_last_transaction | 0.0620 |
| t.merchant_category | 0.0165 |
| transaction_channel | 0.0149 |
| rapid_fire_transaction_flag | 0.0136 |
| hour_deviation_from_norm | 0.0130 |
| cf_night_tx_ratio | 0.0124 |
| transaction_sequence_number | 0.0122 |
AUC increased to 0.7005—a small but consistent improvement. hour_deviation_from_norm appeared at 1.3%, registering as a signal. cf_night_tx_ratio remained at 1.24%, and escalating_amounts_flag decreased from 18% → 14.5%, indicating behavioral features were gradually gaining influence.
Despite this, night-based features contributed only ~2.5% combined. The sharpened hourly weights provided marginal benefit, but cf_night_tx_ratio dilution persisted—a small number of ATO transactions does not significantly shift a customer-level ratio.
A higher-impact correction involved `card_present`, which sat at 0% importance. Fixing the wiring for this feature was an obvious win, as CNP transactions are by definition not card-present.
| Features (AUC) | 0.7500 |
|---|---|
| amount_deviation_z_score | 0.4973 |
| spatial_velocity | 0.3402 |
| merchant_category_switch_flag | 0.0497 |
| escalating_amounts_flag | 0.0299 |
| time_since_last_transaction | 0.0289 |
| transaction_sequence_number | 0.0126 |
| cf_night_tx_ratio | 0.0117 |
| t.merchant_category | 0.0091 |
| transaction_channel | 0.0082 |
| hour_deviation_from_norm | 0.0073 |
After restoring the Z-score as a feature, its dominance remained strong but was lower than in previous instances, supplemented by spatial_velocity.
| Features (AUC) | 0.7491 |
|---|---|
| amount_deviation_z_score | 0.5038 |
| spatial_velocity | 0.3026 |
| card_present | 0.0499 |
| merchant_category_switch_flag | 0.0445 |
| time_since_last_transaction | 0.0271 |
| escalating_amounts_flag | 0.0186 |
| cf_night_tx_ratio | 0.0117 |
| transaction_sequence_number | 0.0114 |
| transaction_channel | 0.0085 |
| t.merchant_category | 0.0083 |
After correcting the CNP wiring, `card_present` rose to 5%. However, `rapid_fire_transaction_flag` disappeared from the top 10 features. Analysis of the code revealed that this flag used a 300-second (5-minute) threshold, while `max_interval_seconds` for velocity abuse was set to a random minute within an hour, which was too coarse for signatures depending on second-level timing.
A more realistic temporal pattern for fraud bursts was implemented, ensuring transactions occur in tighter sequences (e.g., seconds apart) via max_burst_interval_seconds. This creates a sharper behavioral signal for the rapid_fire_transaction_flag to capture.
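The flag's logic can be sketched as follows, using the documented 300-second threshold (the function name is illustrative; the real feature is computed in the Rust ETL):

```python
def rapid_fire_flags(timestamps: list[int], threshold_s: int = 300) -> list[int]:
    """Flag transactions arriving within `threshold_s` of the previous one
    on the same card. The first transaction has no predecessor and is 0."""
    if not timestamps:
        return []
    flags = [0]
    for prev, curr in zip(timestamps, timestamps[1:]):
        flags.append(1 if curr - prev <= threshold_s else 0)
    return flags

# A burst spaced seconds apart (post-fix generator) vs the old
# minute-granularity spacing that slipped past the 5-minute window:
print(rapid_fire_flags([0, 12, 25, 3700]))      # [0, 1, 1, 0]
print(rapid_fire_flags([0, 1800, 3600, 5400]))  # [0, 0, 0, 0]
```

The second call shows why the original generator produced no signal: hour-scale gaps never trip a 300-second threshold.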
| Features (AUC) | 0.8085 |
|---|---|
| time_since_last_transaction | 0.3280 |
| rapid_fire_transaction_flag | 0.2707 |
| amount_deviation_z_score | 0.2153 |
| spatial_velocity | 0.0727 |
| card_present | 0.0614 |
| merchant_category_switch_flag | 0.0213 |
| escalating_amounts_flag | 0.0105 |
| transaction_sequence_number | 0.0049 |
| hour_deviation_from_norm | 0.0045 |
| transaction_channel | 0.0042 |
Amount-based features are now in third place at 21%. Temporal behavioral signals—time_since_last_transaction and rapid_fire_transaction_flag—together drive 60% of model decisions. This allows for fraud detection based on behavioral patterns rather than absolute cost, providing logic suitable for real-time scoring of velocity abuse, ATO, and CNP fraud without flagging high-value legitimate transactions.
merchant_category_switch_flag at 2.1% and hour_deviation_from_norm at 0.45% both have potential for growth through future cross-card coordination work.
Performance Benchmarks
This document tracks the evolution of riskfabric generation performance, focusing on the journey to the 100k+ Transactions Per Second (TPS) milestone.
Test Environment
- Workload: 10,000 Customers (~15,000 Accounts, ~150,000 Transactions)
- Format: Parquet (Snappy compression)
- Hardware: Single Workstation (Multi-threaded Rust)
Milestone Log
1. Initial Port (Sequential Multi-Pass)
Date: February 2026
- Architecture: Sequential loops for generation, fraud injection, and campaign mutations. Cryptographic `Sha256` hashing for reproducibility.
- Transaction Gen Time: 44.11 seconds
- Total Runtime: 48.76 seconds
- Throughput: ~3,400 TPS
- Bottleneck: Cryptographic hashing overhead and high memory access in multiple sequential passes.
2. Parallel Injection & Hash Optimization
- Architecture: Parallelized the `inject` pass using `rayon`. Optimized `hash01` to reduce string allocations.
- Transaction Gen Time: 35.86 seconds
- Total Runtime: 40.35 seconds
- Throughput: ~4,100 TPS
- Gain: +20% improvement.
3. The "One-Pass" Unified Architecture (Current)
Date: February 2026
- Architecture:
- Unified Loop: All logic (Selection, Generation, Fraud, Campaigns) handled in a single parallel pass.
- Fast PRNG: Swapped `Sha256` for `StdRng` (seeded per card for stability).
- Reduced Allocations: Replaced UUIDs with synthetic IDs and pre-formatted timestamps.
- Transaction Gen Time: 0.82 seconds
- Total Runtime: 4.40 seconds (Includes all file I/O)
- Throughput: ~182,000 TPS
- Gain: 53x improvement from baseline.
Summary of Optimization Impact
| Stage | Baseline (s) | Optimized (s) | Speedup |
|---|---|---|---|
| Customer Gen | 0.147 | 0.155 | 1x |
| Transaction Gen | 44.110 | 0.823 | 53.6x |
| Parquet Write (Txn) | 3.696 | 2.640 | 1.4x |
| Total Pipeline | 48.763 | 4.402 | 11x |
4. High-Fidelity One-Pass (Tuned)
Date: February 2026
- Architecture: Added profile-specific geo-anomalies, campaign-coordinated spatial signals, and dynamic failure reasons.
- Performance: Maintained throughput at ~180,000 TPS despite increased logic complexity.
- Result: High-quality training data with sharp spatial/temporal signals generated in < 4 seconds for 150k+ transactions.
5. Real-Time Streaming Throughput (Kafka)
Date: March 15, 2026
- Architecture:
- Async I/O: Leverages `tokio` and `rdkafka` for non-blocking Kafka publication.
- Self-Correcting Limiter: Measures per-message latency to adjust micro-sleep intervals.
- Verification Mode Overhead: Minimal (local CSV writes are buffered).
- Target Throughput: 100 tx/s (Configurable)
- Actual Throughput: 99.85 tx/s (Average over 1 hour)
- Publication Latency (P99): 4.2ms to local Kafka broker.
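The self-correcting limiter can be sketched as follows (a Python illustration under stated assumptions; the actual implementation is async Rust on `tokio`, and the class, smoothing factor, and simulated 4 ms publish latency are all illustrative):

```python
import time

class SelfCorrectingLimiter:
    """Adjusts a micro-sleep so observed throughput tracks a target rate,
    compensating for the time each publish call itself consumes."""

    def __init__(self, target_tps: float):
        self.interval = 1.0 / target_tps  # ideal gap between messages
        self.sleep_s = self.interval      # current micro-sleep estimate

    def pace(self, publish_latency_s: float) -> None:
        # Sleep only for the budget left over after the publish call,
        # nudging the estimate toward that remainder (EWMA smoothing).
        remainder = max(self.interval - publish_latency_s, 0.0)
        self.sleep_s = 0.9 * self.sleep_s + 0.1 * remainder
        time.sleep(self.sleep_s)

limiter = SelfCorrectingLimiter(target_tps=100)
start = time.monotonic()
for _ in range(20):
    time.sleep(0.004)                  # stand-in for the publish call (~4 ms)
    limiter.pace(publish_latency_s=0.004)
elapsed = time.monotonic() - start
print(f"~{20 / elapsed:.0f} msg/s")    # roughly tracks the 100 tx/s target
```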
Throughput Comparison
| Mode | Engine | Transport | Peak Throughput (TPS) |
|---|---|---|---|
| Batch | generate.rs | Local Parquet | ~180,000 |
| Streaming | stream.rs | Kafka Topic | ~1,200 (Unbound)* |
*Note: Streaming throughput is artificially limited to 100 tx/s for realism, but peak unbound performance is ~1,200 tx/s on a single thread.
6. End-to-End Pipeline Stress Test (3M Transactions)
Date: March 2026
- Workload: 3,334 Customers | 2,984,575 Transactions
- Scope: Full lifecycle orchestration via `stress_test.py` (Reset -> Generate -> Ingest -> ETL -> Gold).
| Pipeline Stage | Duration (s) | Throughput / Info |
|---|---|---|
| Generation | 16.96s | ~176,000 TPS |
| ClickHouse Ingestion | 25.54s | ~116,000 Rows/s |
| Silver ETL (Parallel) | 56.61s | Features: Sequence, Merchant, Customer |
| Gold Finalization | 11.32s | Materialized Join |
| Total End-to-End | 110.43s | ~1.8 Minutes |
Benchmark Conclusions
The stress test confirms that the One-Pass Architecture successfully scales to multi-million row datasets while maintaining near-linear throughput. The entire pipeline, including heavy feature engineering and entity joins, completes in under 2 minutes for 3 million transactions, making it suitable for rapid iterative model development.
ETL Performance Optimizations
Summary
The RiskFabric ETL pipeline (etl.rs) is designed for high-fidelity feature engineering using a hybrid Polars and ClickHouse architecture. While functionally robust, the current implementation contains several architectural bottlenecks that limit its scalability to billion-row datasets. This document outlines the identified performance issues and the strategic roadmap for transitioning to a high-concurrency, zero-copy pipeline.
Architectural Decisions
To achieve enterprise-grade throughput, the pipeline is moving toward a Parallel Stream-Oriented Architecture.
The primary decision is the shift to Asynchronous Pipeline Orchestration. By utilizing tokio or rayon, the independent "Silver" ETL stages (Customer, Merchant, Device/IP) will be executed in parallel. This maximizes multi-core utilization and significantly reduces the total wall-clock time of the transformation phase.
The second decision involves Zero-Copy Data Exchange. The current "Double Buffering" strategy—where data is fetched into memory, stored as a vector, and then parsed—is slated for replacement with a streaming architecture. By piping the raw output of the ClickHouse process directly into the Polars ParquetReader and vice-versa, the memory footprint is halved, and intermediate disk I/O for temporary Parquet files is eliminated.
Finally, the transition to Native Driver Connectivity via clickhouse-rs is prioritized over the current podman exec method. This eliminates the process overhead of spawning container instances for every query and provides superior type safety and error propagation.
System Integration
The optimized ETL system remains the central bridge between the Data Warehouse (ClickHouse) and the Machine Learning Pipeline. By maintaining the Parquet exchange format but moving it through memory pipes rather than physical files, the system ensures that the "handshake" between Polars and ClickHouse remains high-speed while reducing infrastructure dependencies and disk wear.
Performance Benchmarks & Results
| Implementation | Wall-Clock Time | CPU User Time | Memory / Disk Overhead |
|---|---|---|---|
| Baseline (Sequential) | ~22.5 seconds | ~19.6 seconds | High (Temp files + buffers) |
| Optimized (Parallel + Pipes) | ~21.1 - 22.5 seconds | ~32.4 seconds | Low (Streaming + Zero temp files) |
Analysis
The implementation of Rayon-based parallelism and Direct Stdin Piping resulted in a significant increase in CPU utilization (~65% increase in User time), indicating that the Rust transformation engine is now processing multiple stages concurrently.
However, the Wall-Clock Time remained relatively flat. This confirms that the pipeline is currently I/O Bound by the ClickHouse single-node instance. Spawning six parallel podman exec processes causes resource contention at the database level, preventing a linear speedup.
Implemented Improvements
- Stage Parallelism: All Silver ETL functions now run concurrently via `rayon`.
- Streaming Ingestion: Parquet data is piped directly from Polars to ClickHouse `stdin`, eliminating `data/tmp_*.parquet` file I/O.
- Thread-Safe Workspace: Each parallel stage uses isolated logic and unique identifiers to prevent race conditions.
- Memory Optimization: Replaced large `Vec<u8>` output buffers with direct process pipes where possible.
Knowledge Base
Documentation of technical hurdles, resolutions, and ongoing developmental challenges encountered during the project.
Technical Issues & Resolutions
Summary
The `issues.md` document acts as the primary engineering log for RiskFabric. It captures architectural hurdles, environment-specific bugs, and performance bottlenecks encountered during development, along with their implemented or proposed resolutions.
Design Intent
This document serves as Institutional Knowledge for the project. In complex simulations, the most difficult bugs often arise from the interaction between system layers (e.g., Rust → Kafka → Python). Documenting these issues provides a roadmap for future optimizations and prevents the repetition of architectural errors. Every entry is paired with a specific technical fix validated through benchmarking or regression testing.
🛠️ Data Engine & Type Safety
1. Polars UInt8 Series Creation Error
- Problem: Polars returned a `ComputeError` when materializing DataFrames containing 8-bit unsigned integers. This blocked features like `is_weekend` and other boolean-adjacent flags.
- Resolution: All flag and counter columns were migrated to `DataType::UInt32` to ensure native Polars support and broader ML library compatibility.
2. Polars is_in Panic on Int8
- Problem: The `.dt().weekday()` function returns `Int8`, which caused kernel-level panics during `.is_in()` membership checks.
- Resolution: Output from `.weekday()` is now explicitly cast to `Int32`, ensuring the comparison set (e.g., `&[6i32, 7i32]`) matches the target type exactly.
3. ClickHouse Timestamp Precision
- Problem: Standard `DateTime64` ingestion in ClickHouse failed when processing ISO 8601 strings with nanosecond precision.
- Resolution: Timestamps are landed as `String` in the Bronze layer. High-precision parsing is deferred to the Silver ETL stage using Polars' `.str().to_datetime()` for increased flexibility.
🚀 Performance & Scaling
4. Out of Memory (OOM) in Network Linkage
- Problem: Multi-million row many-to-many joins on IP and User Agent entities caused combinatorial explosions, leading to process termination.
- Resolution: The architecture shifted from an Edge-List Graph approach to an Entity Reputation model. Risk is now calculated at the entity level and joined back to transactions, reducing complexity from $O(N^2)$ to $O(N)$.
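The reputation model can be sketched in plain Python: one linear pass aggregates fraud statistics per IP, and a second linear pass joins the scalar score back onto each transaction. Field names (`ip`, `ip_risk`) are illustrative, not the project's schema.

```python
from collections import defaultdict

transactions = [
    {"txn_id": 1, "ip": "10.0.0.1", "is_fraud": 1},
    {"txn_id": 2, "ip": "10.0.0.1", "is_fraud": 0},
    {"txn_id": 3, "ip": "10.0.0.2", "is_fraud": 0},
    {"txn_id": 4, "ip": "10.0.0.1", "is_fraud": 1},
]

# Pass 1 (O(N)): aggregate fraud counts per entity instead of
# materializing every txn-to-txn edge through a shared IP.
stats = defaultdict(lambda: [0, 0])  # ip -> [fraud_count, total]
for t in transactions:
    stats[t["ip"]][0] += t["is_fraud"]
    stats[t["ip"]][1] += 1

reputation = {ip: fraud / total for ip, (fraud, total) in stats.items()}

# Pass 2 (O(N)): join the scalar reputation score back onto each txn.
for t in transactions:
    t["ip_risk"] = reputation[t["ip"]]

print(reputation)
```

The edge-list approach would have generated one record per pair of transactions sharing an IP; collapsing to a per-entity score keeps the output size proportional to the input.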
5. OOM in Large-Scale Generation
- Problem: Single-pass generation of 17M+ transactions exceeded available system RAM.
- Resolution: The generator was refactored to use a Chunked One-Pass Architecture. The population is processed in batches of 5,000 entities, with transactions flushed to Parquet incrementally to maintain a constant memory profile.
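A standard-library sketch of the chunked approach (the real generator writes each flushed batch to Parquet via Polars; the row shape here is invented): only one chunk of entities is materialized at a time, so peak memory is bounded by the chunk size rather than the population size.

```python
CHUNK_SIZE = 5_000  # entities per batch, mirroring the generator's setting

def generate_transactions(n_entities, txns_per_entity=3):
    """Yield one batch of synthetic rows per entity chunk."""
    for start in range(0, n_entities, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, n_entities)
        batch = [
            {"entity_id": e, "txn_no": i}
            for e in range(start, end)
            for i in range(txns_per_entity)
        ]
        yield batch  # in the real pipeline: flushed to a Parquet chunk

total = 0
peak_batch = 0
for batch in generate_transactions(12_500):
    total += len(batch)
    # Track the memory high-water mark: it never exceeds one chunk.
    peak_batch = max(peak_batch, len(batch))

print(total, peak_batch)
```

Because each batch is dropped after flushing, the resident set stays roughly constant no matter how many total transactions are generated.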
6. Parquet Serialization Bottleneck
- Problem: Transaction generation required 44 seconds, with 90% of the time spent in disk I/O and Parquet encoding.
- Resolution: A One-Pass Parallel Architecture was implemented and the Polars chunk size was optimized. This reduced the total runtime to 4.4 seconds, an 11x improvement.
🤖 Machine Learning & Data Science
7. Label Leakage (Near-Perfect AUC)
- Problem: Early models achieved 0.9993 AUC by learning internal generator flags (e.g., `geo_anomaly`) instead of behavioral patterns.
- Resolution: A strict "Operational Feature" sanitization step was implemented to drop all internal metadata. The training target was also shifted from the perfect `fraud_target` to the noisy `is_fraud` label.
8. Observed vs. Configured Fraud Rate Discrepancy
- Problem: The observed fraud rate (~13.6%) appeared higher than the 12% defined in the configuration.
- Resolution: Validation confirmed that `is_fraud` deliberately incorporates simulated label noise (3% FP, 10% FN), resulting in a higher observed ratio than the latent ground truth.
9. High Fraud Prevalence in Initial Runs
- Problem: Approximately 86% of customers experienced fraud due to high default configuration values.
- Resolution: The `target_share` parameter was tuned to 0.005 (a 0.5% transaction-level fraud rate) to align with industry benchmarks for sparse fraud data.
Known Issues
There is ongoing difficulty with Container Runtime Variability. The podman exec calls used in ingestion and ETL pipelines behave inconsistently across Linux and macOS environments, causing failures in the data warehouse loading process. Transitioning to native database drivers is required to eliminate dependency on the host's container CLI.
Furthermore, Memory Management during Reference Extraction is currently insufficient. When processing large OSM PBF files, the prepare_refs.rs binary can consume significant RAM. Implementing a "Spill-to-Disk" strategy for the parallel map-reduce operation is necessary to maintain a memory footprint below 4GB.
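One possible shape for that spill-to-disk map-reduce, sketched in Python with the standard library (this is a hypothetical design, not the `prepare_refs.rs` implementation): when the in-memory reference-count map exceeds a threshold, it is flushed to disk as a sorted run, and the runs are later stream-merged with `heapq.merge`. The tiny threshold is for illustration only; a real one would be sized to the 4GB budget.

```python
import heapq
import os
import tempfile

SPILL_THRESHOLD = 4  # distinct keys held in memory before spilling

def spill(counts, runs):
    """Write the in-memory partial counts as a sorted run on disk."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        for key in sorted(counts):
            f.write(f"{key}\t{counts[key]}\n")
    runs.append(path)
    counts.clear()

def count_refs(refs):
    counts, runs = {}, []
    for ref in refs:
        counts[ref] = counts.get(ref, 0) + 1
        if len(counts) >= SPILL_THRESHOLD:
            spill(counts, runs)
    if counts:
        spill(counts, runs)

    # Merge phase: heapq.merge streams the sorted runs line by line,
    # so only one line per run is in memory at a time. (For simplicity
    # the merged totals are collected in a dict here; a production
    # version would stream the output as well.)
    merged = {}
    files = [open(p) for p in runs]
    for line in heapq.merge(*files):
        key, n = line.rstrip("\n").split("\t")
        merged[key] = merged.get(key, 0) + int(n)
    for f in files:
        f.close()
    for p in runs:
        os.remove(p)
    return merged

result = count_refs(["n1", "n2", "n3", "n1", "n4", "n5", "n1", "n2"])
print(result)
```

Because each run is sorted before it hits disk, duplicate keys from different runs arrive adjacently during the merge, which is what allows the summation to proceed in a single bounded-memory pass.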