ETL & Feature Schema

Summary

The etl_schema.md document defines the behavioral features and data transformations performed by the RiskFabric ETL pipeline (etl.rs). It acts as the technical contract for the "Silver" and "Gold" layers, detailing how raw synthetic events are transformed into the high-dimensional vectors used for model training and real-time inference.

Design Intent

The feature schema represents a Hybrid Behavioral State, intended to provide models with a multi-domain view of financial events across customer history, merchant risk, and temporal sequences. This approach facilitates sophisticated behavioral modeling, such as Z-scores and velocity-based indicators, similar to production fraud detection systems.

A critical design choice was the use of Welford's Algorithm for statistical aggregates. Calculating running means and variances locally in Rust (and Redis) ensures that features are numerically stable and computationally efficient for both batch processing and low-latency streaming. This architectural decision is intended to eliminate training-serving skew.

🥈 Silver Layer: Behavioral Features

Transaction Sequence Features (`fact_transactions_silver`)

Calculated at the individual card level to identify temporal and spatial anomalies.

Field	Description	Logic
`time_since_last`	Seconds since the previous event.	`T - T_prev`
`spatial_velocity`	Speed (km/h) between consecutive events.	`Dist(L, L_prev) / (T - T_prev)`
`amount_z_score`	Deviation from customer's mean spend.	`(Amt - Mean) / StdDev`
`hour_deviation`	Deviation from customer's peak spend hour.	Circular variance of `timestamp.hour()`

Network & Entity Features (`network_features_silver`)

Identifies high-risk clusters across the payment network.

Field	Description	Logic
`shared_ip_fraud`	Fraud rate of cards sharing the same IP.	`SUM(is_fraud) / COUNT(card_id) OVER IP`
`scammer_hub`	Flag for known high-risk coordinates.	`1 if Lat/Lon in [hub_coordinates]`

🥇 Gold Layer: The Master Table

The final flattened table used for model training, joining all Silver behavioral features with the original Bronze transactions.

Known Issues

Spatial Velocity is currently calculated using a Euclidean distance approximation. While computationally efficient, this is inaccurate over long distances. Implementation of the Haversine formula is required to ensure geographic precision for cross-state and international fraud simulations.

Furthermore, Feature Freshness is limited to the last 10 transactions in Redis. This prevents the modeling of long-term behavioral baselines for infrequent spenders. Implementing "Stateful Cold Storage" in the ETL pipeline is necessary to retrieve historical data without exceeding real-time feature store capacity.

RiskFabric Documentation