Machine Learning Metrics & Model Progression

This document tracks the performance and evolution of the fraud detection models trained on RiskFabric synthetic data, progressing from initial leakage-prone baselines to a robust, behavioral production configuration.


Section 1: Early Iterations

The development process began with basic feature sets to establish a baseline for fraud detection performance.

v1 Iteration (Baseline)

The initial model established core feature sets including amount deviations and spatial velocity on a sample population.

  • Accuracy: 0.95
  • ROC AUC Score: 0.9782
  • Recall (Fraud): 0.30 (revealed a significant "recall gap")

v2 High-Fidelity (Leakage Detected)

Scaling to larger datasets revealed massive performance inflation due to generator artifacts in metadata fields.

  • ROC AUC Score: 0.9993
  • Leakage Identified: Synthetic metadata fields (fraud_target, burst_seq) were providing a "static bypass" for the model.

v2 Iteration (Leakage Prevention)

The feature vector was sanitized to exclude metadata, shifting the focus to behavioral signals.

  • ROC AUC Score: 0.9746
  • Recall (Noisy Labels): 0.72
  • Sanitization: Transitioned from fraud_target to the noisy is_fraud label.

Note: In addition to the leakage issues documented above, the v1 and v2 iterations were trained on an incomplete feature set. Behavioral features computed in the Rust ETL layer — including amount_deviation_z_score, spatial_velocity, and granular anomaly flags — were silently dropped before reaching XGBoost due to a narrow Gold table join. The inflated AUC figures in these iterations therefore reflect both metadata leakage and the absence of the features that would have provided genuine behavioral signal.
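As an illustration of the kind of behavioral signal that was being dropped, a simplified, hypothetical reimplementation of amount_deviation_z_score in plain Python (the production feature is computed in the Rust ETL layer; the function below is a sketch, not the actual implementation):

```python
from statistics import mean, stdev

def amount_deviation_z_score(history: list[float], amount: float) -> float:
    """Standard score of a transaction amount against the customer's
    historical spending (simplified sketch of the ETL feature)."""
    if len(history) < 2:
        return 0.0  # not enough history to estimate a deviation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0.0:
        return 0.0  # perfectly uniform history: no deviation measurable
    return (amount - mu) / sigma
```

A customer who normally spends around $10 per transaction would produce a very large z-score on a $500 purchase, which is exactly the "spending magnitude relative to customer norm" signal ranked second in the feature importance table.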

Section 2: v3 — Production Configuration (Final)

The final model configuration focuses on pure behavioral signals, specifically tuned to handle the extreme class imbalance (1.4% fraud rate) found in realistic production environments.

Training Setup

  • Dataset: 1.5M transactions (Seed 42).
  • Fraud Rate: 1.41% (target_share: 0.01, fp_rate: 0.005).
  • Model: XGBoost binary classifier.
  • Scale Pos Weight: 69.57 (Computed dynamically from training imbalance).
  • Eval Metric: aucpr (Area Under Precision-Recall Curve).
  • Label Noise: 0.5% False Positives and 1% False Negatives deliberately injected.
  • Theoretical Recall Ceiling: 66.7% (Derived from the intentional label noise ratio).
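The scale_pos_weight value above is the standard XGBoost re-weighting for imbalanced binary classification: the negative-to-positive count ratio in the training split. A minimal sketch (the counts below are illustrative approximations derived from the 1.5M / 1.41% figures; the pipeline computes the ratio from the actual training fold, which is why it lands at 69.57 rather than exactly this value):

```python
def scale_pos_weight(n_positive: int, n_negative: int) -> float:
    """Negative/positive ratio used to up-weight the minority (fraud) class."""
    return n_negative / n_positive

# Approximate counts for a 1.5M-row dataset at a 1.41% fraud rate.
spw = scale_pos_weight(n_positive=21_150, n_negative=1_478_850)  # ~69.9
```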

Feature Importance

The model prioritizes physical and financial anomalies over static identifiers.

| Feature | Importance | Description |
|---|---|---|
| spatial_velocity | 25.38% | Impossible travel speed between transactions |
| amount_deviation_z_score | 20.80% | Spending magnitude relative to customer norm |
| time_since_last_transaction | 12.72% | Temporal burst and frequency detection |
| transaction_channel | 11.60% | Risk associated with specific payment methods |
| merchant_category | 11.08% | Contextual risk of the merchant type |
| hour_deviation_from_norm | 7.40% | Circadian rhythm anomalies |
| merchant_category_switch_flag | 2.89% | Unexpected shifts in merchant category |
| card_present | 2.45% | Physical vs. digital transaction risk |
| transaction_sequence_number | 1.95% | Position within the account lifecycle |
| rapid_fire_transaction_flag | 1.88% | High-velocity sequence identification |

For a detailed narrative of the discovery and resolution of these artifacts, see the Feature Leakage Case Study.

Generalization Results

The model was validated against three test populations, two of them generated with independent random seeds, to ensure robust performance beyond the training distribution.

| Test Population | Seed | Transactions | AUC |
|---|---|---|---|
| Holdout | 42 (Same) | 1.5M | 84.72% |
| Independent | 8888 (Different) | 1.5M | 79.94% |
| Independent | 5555 (Different) | 3.0M | 79.81% |

Note: The higher AUC on the holdout set is due to distributional overlap with the training population, while the ~80% AUC on independent seeds represents the model's true behavioral generalization.


Section 3: Threshold Operating Points

In a production environment, the model's probability output is mapped to specific operational actions.

| Operating Mode | Threshold | Precision | Recall | F1 | Use Case |
|---|---|---|---|---|---|
| Detection Layer | 0.495 | 10% | 60% | 0.172 | Review queue (broad capture) |
| Triage | 0.645 | 18% | 55% | 0.268 | Early analyst filtering |
| Investigation | 0.736 | 31% | 50% | 0.385 | Analyst workbench |
| High Confidence | 0.842 | 57% | 45% | 0.502 | Escalation decisions |
| Blocking | 0.945 | 73% | 40% | 0.517 | Automatic card block |

The Detection Layer feeds a review queue for manual inspection, while the Blocking Layer is reserved for automated enforcement. The tradeoff between these layers is an operational business decision, not a model failure.
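A minimal sketch of how the operating points above might be wired into a scoring service (the function and action names are hypothetical; the thresholds come from the table):

```python
# Thresholds from the operating-point table, highest first so the most
# severe qualifying action wins.
OPERATING_POINTS = [
    (0.945, "block"),        # Blocking: automatic card block
    (0.842, "escalate"),     # High Confidence: escalation decisions
    (0.736, "investigate"),  # Investigation: analyst workbench
    (0.645, "triage"),       # Triage: early analyst filtering
    (0.495, "review"),       # Detection Layer: review queue
]

def action_for_score(score: float) -> str:
    """Map a model probability to the most severe applicable action."""
    for threshold, action in OPERATING_POINTS:
        if score >= threshold:
            return action
    return "allow"
```

Keeping the table ordered from the highest threshold down makes the first match the strictest action, so a score of 0.95 blocks rather than merely entering the review queue.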


Section 4: Merchant Category Audit

Leakage verification at the "Blocking" threshold (0.945) confirms that overrepresentation reflects genuine category risk levels rather than static bypasses.

| Category | Global Share | Flag Share | Index | Verified Fraud Rate |
|---|---|---|---|---|
| GAMBLING | 0.07% | 1.09% | 17x | 17.68% |
| ENTERTAINMENT | 1.10% | 14.35% | 13x | 11.20% |
| LUXURY | 1.62% | 8.63% | 5x | 4.91% |
| ELECTRONICS | 3.39% | 10.22% | 3x | 2.40% |
| TRAVEL | 6.14% | 16.29% | 2.6x | 2.53% |
| SERVICES | 5.15% | 11.92% | 2.3x | 2.53% |

All verified fraud rates fall below the 20% threshold, confirming that no single category acts as a near-deterministic fraud rule. The model uses category as a Bayesian prior requiring behavioral confirmation rather than a static classifier.
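The Index column above is the overrepresentation ratio: a category's share among flagged transactions divided by its share of the overall population. A sketch of the audit arithmetic (note the table values are rounded, so recomputed indices can differ slightly from the published ones):

```python
def overrepresentation_index(global_share: float, flag_share: float) -> float:
    """How many times more often a category appears among flagged
    transactions than in the overall transaction population."""
    return flag_share / global_share

# e.g. LUXURY: 8.63% of flags vs 1.62% of all transactions -> roughly 5.3x
luxury_index = overrepresentation_index(0.0162, 0.0863)
```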

The GAMBLING index was previously at 103x (documented in the leakage case study); its reduction to 17x after generator retuning, together with the verified fraud rate, confirms it is now a legitimate signal.


Section 5: Known Limitations

Recall Ceiling (66.7%)

The theoretical maximum recall is imposed by the deliberate label-noise design: the 0.5% fp_rate injects positive labels onto behaviorally legitimate transactions that no behavioral model can recover, while the 1% false-negative rate removes genuine fraud from the positive label set. Recall approaching this ceiling therefore represents optimal behavior, not underperformance.
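One way the 66.7% figure can be derived from the injection rates, assuming fp_rate and the false-negative rate act as simple label flips on legitimate and fraudulent transactions respectively (a sketch of the arithmetic, not the generator's actual code):

```python
def recall_ceiling(target_share: float, fp_rate: float, fn_rate: float) -> float:
    """Maximum achievable recall against noisy labels: genuine fraud still
    labeled positive, divided by all positive labels (genuine + injected)."""
    learnable = target_share * (1 - fn_rate)    # fraud kept as positive
    injected = (1 - target_share) * fp_rate     # legit flipped to positive
    return learnable / (learnable + injected)

ceiling = recall_ceiling(0.01, 0.005, 0.01)  # ~0.667
```

With target_share = 0.01, fp_rate = 0.005, and a 1% false-negative rate, the learnable positives are 0.0099 of transactions against 0.01485 total positive labels, giving the documented two-thirds ceiling.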

Silver ETL Eager Execution

Sequence features built with .over() window functions trigger eager in-memory execution despite use of the Polars lazy API. Datasets significantly exceeding available RAM will therefore hit memory pressure. Roadmap: transition to a stateful streaming pre-aggregation pass.

Campaign Detection

Coordinated attack signatures require graph-based reasoning over entity relationships. Individual transactions in a campaign are often behaviorally indistinguishable from legitimate ones when viewed in isolation—this is a structural limitation of single-transaction classifiers.