Batch Data Generator (`generate.rs`)

Summary

The generate.rs binary serves as the primary orchestration engine for creating large-scale, labeled synthetic datasets. It generates a complete ecosystem of customers, accounts, cards, and historical transactions, providing the "ground truth" required for training fraud detection models.

Architectural Decisions

The generator uses a chunked execution strategy to handle datasets that exceed available system memory. By processing cards in batches of 5,000, the generator maintains a stable memory profile regardless of the total population size. For spatial lookups, the system implements a multi-tier H3 index (resolutions 4 and 6) and a state-level index. This allows for rapid, localized merchant selection during transaction generation without exhaustive searching of the merchant reference dataset.

The choice of Apache Parquet as the output format ensures that multi-million row datasets remain compressed and performant for the downstream Python-based ML pipeline and Polars-based ETL.

System Integration

generate.rs sits at the start of the RiskFabric lifecycle. It consumes reference Parquet files for merchants and residential locations and produces the four core tables: customers.parquet, accounts.parquet, cards.parquet, and transactions.parquet (including its accompanying fraud_metadata.parquet).

Known Issues

The final merge phase is implemented by writing temporary Parquet chunks to disk and then re-scanning them with the Polars lazy API. While this prevents memory exhaustion during the final join, it introduces disk I/O overhead that affects the "cleanup" phase of generation. Additionally, the 5,000-card chunk size is currently hardcoded; moving this to customer_config.yaml would allow performance tuning based on available RAM capacity.

RiskFabric Documentation

Batch Data Generator (generate.rs)

Summary

Architectural Decisions

System Integration

Known Issues

Batch Data Generator (`generate.rs`)