ETL Pipeline System (etl.rs & src/etl/)
Summary
The ETL (Extract, Transform, Load) system is the transformation engine of RiskFabric. It is responsible for converting raw, "bronze" level synthetic transactions into "silver" behavioral features and finally into a "gold" master table ready for machine learning. The system is designed to handle large datasets by leveraging Polars for local transformations and ClickHouse for large-scale joins and persistence.
Architectural Decisions
The system follows a Medallion Architecture (Bronze → Silver → Gold) to ensure data lineage and modularity.
- Bronze: Raw data as generated by generate.rs.
- Silver: Subject-specific feature engineering (Customer, Merchant, Sequence, Network, Campaign, and Device/IP). These are calculated using Polars' lazy evaluation for performance.
- Gold: The final flattened "master" table.
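The layering above can be sketched as a small type model. This is only an illustration: the `Layer` and `Stage` enums and the `is_layer_ordered` helper are hypothetical names, not taken from etl.rs. The variants mirror the subjects listed above plus the GoldMaster stage, and the check encodes the assumption that a run never schedules a gold stage before a silver one.

```rust
/// Medallion layers, ordered from rawest to most refined.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Layer {
    Bronze,
    Silver,
    Gold,
}

/// Hypothetical stage list: one variant per silver subject, plus the gold join.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Customer,
    Merchant,
    Sequence,
    Network,
    Campaign,
    DeviceIp,
    GoldMaster,
}

impl Stage {
    /// The layer this stage writes to.
    pub fn output_layer(self) -> Layer {
        match self {
            Stage::GoldMaster => Layer::Gold,
            _ => Layer::Silver,
        }
    }
}

/// Returns true when no stage is scheduled before a stage of a lower layer,
/// i.e. every silver stage runs before the gold join.
pub fn is_layer_ordered(stages: &[Stage]) -> bool {
    stages
        .windows(2)
        .all(|pair| pair[0].output_layer() <= pair[1].output_layer())
}
```

With this model, a plan such as `[Customer, Merchant, GoldMaster]` passes the check, while `[GoldMaster, Customer]` does not.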
A key design choice is the Hybrid Execution Model. While the feature logic is implemented in Rust using Polars, the pipeline orchestrates data movement between ClickHouse (the primary warehouse) and local memory via Parquet. This allows complex, stateful calculations in Rust (like Welford's algorithm for running variance) that are difficult to express in pure SQL, while still using ClickHouse for efficient storage and final broad joins.
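To illustrate the kind of stateful calculation mentioned above, here is a minimal sketch of Welford's online algorithm in plain Rust. The `Welford` struct and its method names are hypothetical and not taken from the RiskFabric source; the real implementation may differ, for example in how it is wired into a Polars pipeline.

```rust
/// Running mean and variance accumulator (Welford's online algorithm).
/// Processes one value at a time without storing the history.
#[derive(Debug, Default, Clone, Copy)]
pub struct Welford {
    count: u64,
    mean: f64,
    /// Sum of squared deviations from the current mean.
    m2: f64,
}

impl Welford {
    /// Folds a new observation into the running statistics.
    pub fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean; // deviation from the old mean
        self.mean += delta / self.count as f64;
        let delta2 = x - self.mean; // deviation from the updated mean
        self.m2 += delta * delta2;
    }

    /// Running mean, or None before the first observation.
    pub fn mean(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.mean)
        }
    }

    /// Unbiased sample variance, or None with fewer than two observations.
    pub fn sample_variance(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }
}
```

Feeding the values 2, 4, 4, 4, 5, 5, 7, 9 through `update` yields a mean of 5 and a sample variance of 32/7 (about 4.57). The state is three numbers per key, which is what makes this easy to carry through a per-customer or per-merchant scan in Rust and awkward to express as a pure SQL aggregate over an ordered stream.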
System Integration
The ETL system acts as the connective tissue between the Data Generation layer and the Machine Learning layer. It reads from ClickHouse tables (populated via ingest.rs), performs transformations, and writes the results back to ClickHouse. The final fact_transactions_gold table is the direct source for the Python-based training pipeline.
Known Issues
The system currently interacts with ClickHouse by issuing podman exec calls from within the Rust binary, which ties the pipeline to the local environment's container runtime and shell availability. Transitioning to a proper ClickHouse client library (like clickhouse-rs) would make the pipeline more portable and robust. Additionally, the GoldMaster stage is currently implemented as a raw SQL join in ClickHouse, which duplicates some of the logic found in gold_master.rs. Unifying these two approaches would ensure the batch and streaming feature definitions remain consistent.
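For context, a container-based ClickHouse call of this kind might look roughly like the sketch below. The function names, the `container` parameter, and the exact argument list are assumptions for illustration; the actual invocation in etl.rs may differ. This variant passes arguments directly to `podman` rather than through a shell, which removes the shell dependency but not the dependency on the container runtime.

```rust
use std::process::Command;

/// Builds the argument list for `podman exec <container> clickhouse-client --query <query>`.
/// `container` is a hypothetical container name supplied by the caller.
pub fn clickhouse_exec_args(container: &str, query: &str) -> Vec<String> {
    ["exec", container, "clickhouse-client", "--query", query]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Builds (but does not run) the corresponding `podman` command.
/// Each argument is passed as a separate argv entry, so the query text
/// is not interpreted or re-quoted by a shell.
pub fn clickhouse_exec_command(container: &str, query: &str) -> Command {
    let mut cmd = Command::new("podman");
    cmd.args(clickhouse_exec_args(container, query));
    cmd
}
```

A caller would run the result with `.output()` and inspect the exit status and stderr. Keeping the command construction in one function also gives a single place to swap in a native client later.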