ETL & Feature Schema

Summary

The etl_schema.md document defines the behavioral features and data transformations performed by the RiskFabric ETL pipeline (etl.rs). It acts as the technical contract for the "Silver" and "Gold" layers, detailing how raw synthetic events are transformed into the high-dimensional vectors used for model training and real-time inference.

Design Intent

The feature schema represents a Hybrid Behavioral State, intended to provide models with a multi-domain view of financial events across customer history, merchant risk, and temporal sequences. This approach facilitates sophisticated behavioral modeling, such as Z-scores and velocity-based indicators, similar to production fraud detection systems.

A critical design choice was the use of Welford's Algorithm for statistical aggregates. Calculating running means and variances locally in Rust (and Redis) ensures that features are numerically stable and computationally efficient for both batch processing and low-latency streaming. This architectural decision is intended to eliminate training-serving skew.


🥈 Silver Layer: Behavioral Features

Transaction Sequence Features (fact_transactions_silver)

Calculated at the individual card level to identify temporal and spatial anomalies.

FieldDescriptionLogic
time_since_lastSeconds since the previous event.T - T_prev
spatial_velocitySpeed (km/h) between consecutive events.Dist(L, L_prev) / (T - T_prev)
amount_z_scoreDeviation from customer's mean spend.(Amt - Mean) / StdDev
hour_deviationDeviation from customer's peak spend hour.Circular variance of timestamp.hour()

Network & Entity Features (network_features_silver)

Identifies high-risk clusters across the payment network.

FieldDescriptionLogic
shared_ip_fraudFraud rate of cards sharing the same IP.SUM(is_fraud) / COUNT(card_id) OVER IP
scammer_hubFlag for known high-risk coordinates.1 if Lat/Lon in [hub_coordinates]

🥇 Gold Layer: The Master Table

The final flattened table used for model training, joining all Silver behavioral features with the original Bronze transactions.


Known Issues

Spatial Velocity is currently calculated using a Euclidean distance approximation. While computationally efficient, this is inaccurate over long distances. Implementation of the Haversine formula is required to ensure geographic precision for cross-state and international fraud simulations.

Furthermore, Feature Freshness is limited to the last 10 transactions in Redis. This prevents the modeling of long-term behavioral baselines for infrequent spenders. Implementing "Stateful Cold Storage" in the ETL pipeline is necessary to retrieve historical data without exceeding real-time feature store capacity.