Machine Learning & Risk Prediction
This document outlines the evolution of the Vehicle Risk Prediction model, the challenges of working with synthetic data, and the path to achieving a realistic predictive system.
The Challenge: The "Perfect" Model Fallacy
Initially, the XGBoost model achieved an AUC-ROC score of 1.0000. A nominally perfect score is a red flag rather than a success, and here it revealed a critical flaw in the machine learning design: Label Leakage.
The Root Cause: Deterministic Leakage
In the first iteration, the model was "cheating" by observing the exact same variables used to calculate the ground truth health score.
* The Target: is_at_risk was derived from a linear SQL formula in the Gold layer.
* The Features: The model was given the exact averages (e.g., avg_coolant_temp) used in that formula.
* The Result: The model didn't learn mechanical failure patterns; it simply reverse-engineered the SQL thresholds. The sketch below makes this concrete.
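Here is a minimal sketch of the leaky setup. The thresholds, values, and DataFrame layout are illustrative stand-ins, not the project's actual Gold-layer formula:

```python
import pandas as pd

# Hypothetical Gold-layer aggregates for three vehicles (illustrative values).
df = pd.DataFrame({
    "avg_coolant_temp":    [92.0, 115.0, 96.0],
    "avg_battery_voltage": [13.9, 11.8, 14.1],
})

# Ground-truth rule (illustrative thresholds): a simple threshold formula
# computed over the averaged sensor columns.
df["is_at_risk"] = (
    (df["avg_coolant_temp"] > 110) | (df["avg_battery_voltage"] < 12.0)
).astype(int)

# Leaky design: the feature matrix contains the exact inputs of the rule,
# so a tree model only needs to rediscover the thresholds -> AUC-ROC = 1.0.
X_leaky = df[["avg_coolant_temp", "avg_battery_voltage"]]
y = df["is_at_risk"]
```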
Step 1: Implementing "Digital Twin" Physics
To break the deterministic link, we refactored the simulator to move from random offsets to a physics-based correlation engine.
Key Physical Correlations
- Air Flow (MAF): Realistically scales with both RPM and Throttle Position.
- Thermal Dynamics: Engine temperature is now a function of Engine Load. High RPMs under heavy throttle cause the coolant temperature to rise above average.
- Alternator Charging: Battery voltage fluctuations are tied to RPM, simulating the alternator's charging cycle. (A simulator sketch follows this list.)
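A minimal sketch of such a correlation engine, assuming a per-tick simulator function. All coefficients and the load approximation are illustrative placeholders, not the project's calibrated physics:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_tick(rpm: float, throttle: float, ambient_temp: float = 20.0) -> dict:
    """One simulator tick with physics-based correlations (illustrative coefficients)."""
    # MAF scales with both RPM and throttle position, plus sensor noise.
    maf = 0.004 * rpm * (0.5 + 0.5 * throttle) + rng.normal(0, 0.3)

    # Thermal dynamics: coolant temp is a function of engine load,
    # approximated here as normalized RPM times throttle.
    load = (rpm / 6000.0) * throttle
    coolant_temp = 88.0 + 25.0 * load + rng.normal(0, 0.8)

    # Alternator charging: voltage rises with RPM until the regulator caps it.
    voltage = min(12.2 + 2.4 * (rpm / 6000.0), 14.4) + rng.normal(0, 0.05)

    return {"rpm": rpm, "throttle": throttle, "maf": maf,
            "coolant_temp": coolant_temp, "battery_voltage": voltage}

print(simulate_tick(rpm=3500, throttle=0.8))
```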
Step 2: Breaking the Leakage
Latent Factors (Hidden Variables)
We introduced "Latent Factors" that affect the sensors but are not provided to the model (a sketch follows the list):
* Ambient Temperature: A hidden weather variable that offsets engine heat. The model must now distinguish between "Normal high heat" (a hot day) and "Anomalous high heat" (engine failure).
* Latent Pre-Failure Noise: A "latent instability" state in which sensors begin to "smell bad" (increased jitter) before they actually trigger a hard anomaly.
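A sketch of how these latent factors might be overlaid on a reading. The field names match the simulator sketch above, and the offset magnitudes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

def apply_latent_factors(reading: dict, ambient_temp: float, pre_failure: bool) -> dict:
    """Overlay hidden variables the model never sees (illustrative magnitudes)."""
    out = dict(reading)

    # Ambient temperature shifts engine heat: a hot day raises coolant temp
    # without any fault being present.
    out["coolant_temp"] += 0.35 * (ambient_temp - 20.0)

    # Latent pre-failure instability: sensors turn jittery before any hard
    # anomaly fires.
    if pre_failure:
        out["coolant_temp"] += rng.normal(0, 2.5)
        out["battery_voltage"] += rng.normal(0, 0.15)
    return out

reading = {"coolant_temp": 96.0, "battery_voltage": 13.8}
print(apply_latent_factors(reading, ambient_temp=34.0, pre_failure=True))
```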
Feature Engineering (Moving to Snapshots)
We removed the "Smoking Gun" features from train.py:
* Removed: avg_coolant_temp and avg_battery_voltage (The direct SQL inputs).
* Added: maf_rpm_ratio (Efficiency), volt_volatility (Electrical stability), and max_coolant_temp_delta (Thermal rate of change). A feature-engineering sketch follows.
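A sketch of the snapshot feature engineering, assuming a hypothetical telemetry DataFrame of per-tick readings; the column names and sample values are illustrative:

```python
import pandas as pd

# Hypothetical per-tick telemetry for two vehicles (illustrative values).
telemetry = pd.DataFrame({
    "vehicle_id":      ["A", "A", "A", "B", "B", "B"],
    "rpm":             [2100, 2300, 2250, 3400, 3600, 3550],
    "maf":             [8.1, 8.9, 8.6, 13.0, 14.2, 13.8],
    "battery_voltage": [13.8, 13.9, 13.8, 13.2, 14.4, 12.9],
    "coolant_temp":    [90.5, 91.0, 90.8, 96.0, 104.5, 99.0],
})

def build_features(window: pd.DataFrame) -> pd.Series:
    """Leakage-free snapshot features; the raw SQL averages stay out."""
    return pd.Series({
        # Efficiency: air flow per revolution.
        "maf_rpm_ratio": (window["maf"] / window["rpm"]).mean(),
        # Electrical stability: jittery voltage hints at charging faults.
        "volt_volatility": window["battery_voltage"].std(),
        # Thermal rate of change between consecutive readings.
        "max_coolant_temp_delta": window["coolant_temp"].diff().abs().max(),
    })

features = telemetry.groupby("vehicle_id").apply(build_features)
print(features)
```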
Step 3: Elastic Scoring in the Gold Layer
To prevent a "Data Swamp" where 70% of vehicles were marked as "Fair," we implemented Elastic Scoring:
* Dead Zones: No penalties are applied for normal physical fluctuations (e.g., coolant temperatures up to 108°C under load).
* Exponential Scaling: Penalties for degraded vehicles now scale non-linearly (pow(delta, 1.8)), ensuring that truly failing vehicles are clearly separated from healthy ones; see the sketch below.
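A sketch of the resulting penalty curve for coolant temperature. The 108°C dead zone and the 1.8 exponent come from this document; the scale factor and function shape are assumptions (the real logic lives in the Gold-layer SQL):

```python
def coolant_penalty(temp_c: float,
                    dead_zone_max: float = 108.0,
                    exponent: float = 1.8,
                    scale: float = 0.5) -> float:
    """Elastic scoring sketch: no penalty inside the dead zone; beyond it,
    penalties grow as pow(delta, 1.8), separating failing vehicles sharply."""
    delta = temp_c - dead_zone_max
    if delta <= 0:
        return 0.0  # dead zone: normal fluctuation under load
    return scale * pow(delta, exponent)

# 109°C barely registers; 125°C is penalized far more than linearly.
print(coolant_penalty(105.0), coolant_penalty(109.0), coolant_penalty(125.0))
```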
Final Result: Realistic Predictive Power
After these refactors, the XGBoost model achieved a realistic AUC-ROC Score of 0.9786.
| Metric | Result |
|---|---|
| Accuracy | 94% |
| AUC-ROC | 0.9786 |
| Healthy vehicles (count) | 446 |
| At-Risk vehicles (count) | 178 |
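For reference, a self-contained sketch of how such headline metrics can be computed with scikit-learn and XGBoost. The synthetic feature table and hyperparameters here are placeholders, not the project's train.py:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
from xgboost import XGBClassifier

# Synthetic stand-in for the engineered feature table (illustrative only;
# 624 rows mirrors the 446 + 178 vehicles reported above).
rng = np.random.default_rng(0)
X = rng.normal(size=(624, 3))  # three snapshot features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.7, 624) > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print("AUC-ROC :", round(roc_auc_score(y_test, proba), 4))
print("Accuracy:", round(accuracy_score(y_test, proba >= 0.5), 4))
```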
Why this matters
The model is now learning latent mechanical signatures rather than simple thresholds. It can identify a vehicle as "At-Risk" even before it triggers a hard anomaly by observing subtle changes in efficiency ratios and sensor volatility—exactly how a production-grade predictive maintenance system operates.