CanFlow Documentation
Welcome to the CanFlow project documentation. CanFlow is a high-fidelity vehicle telemetry simulator designed to emit realistic OBD-II and GPS data for fleet management and anomaly detection testing.
🚀 Key Features
📡 High-Fidelity Simulation
- Physics-Based Correlations: Real-world relationship modeling where Air Flow (MAF) depends on RPM/Throttle, and Engine Temperature scales with Load.
- Latent Environmental Noise: Hidden variables like
ambient_temperatureandroad_inclinecreate realistic signal overlap. - Pre-Failure "Smells": Sensors exhibit latent instability (jitter and drift) for a duration before triggering hard anomalies.
- Diverse Fleet: Supports 25+ unique Indian vehicle models across Passenger (12V) and Commercial (24V) classes.
⚡ Real-Time Streaming & Warehouse
- Kafka Integration: Live streaming of JSON telemetry via Confluent Cloud with Schema Registry validation.
- ClickHouse Storage: High-performance OLAP storage for millions of telemetry rows.
- dbt "Elastic" Scoring: A sophisticated health scoring engine in the Gold layer that uses nonlinear penalties and "dead zones" to handle normal high-load operating ranges.
🤖 Machine Learning
- XGBoost Risk Prediction: A robust classifier that identifies at-risk vehicles using latent indicators (MAF/RPM efficiency, voltage volatility) rather than simple threshold breaks.
- Anti-Leakage Design: Built to avoid "label leakage" by hiding deterministic SQL triggers from the model, resulting in a realistic AUC-ROC of ~0.97.
🏗️ Architecture
graph LR
A[Vehicle Agents] -->|JSON| B(Kafka / Confluent)
B --> C[Stream Consumer]
C -->|Batch Insert| D[(ClickHouse Bronze)]
D --> E{dbt Transforms}
E --> F[(Silver Telemetry)]
F --> G[(Gold Health Metrics)]
G --> H[XGBoost Risk Model]
H --> I[Grafana Dashboard]
🛠️ Tech Stack
- Language: Python 3.x
- Streaming: Confluent Kafka
- Database: ClickHouse (OLAP)
- Warehouse: dbt (Data Build Tool)
- ML: XGBoost, Scikit-Learn, Pandas
- Visualization: Grafana
- Infrastructure: Docker Compose
🏁 Getting Started
1. Prerequisites
- Docker & Docker Compose
- Python 3.9+ (Virtual Environment recommended)
- Confluent Cloud Account (for Kafka/Schema Registry)
2. Quick Run
# Start Infrastructure
docker-compose up -d
# Start the Fleet Simulator
python simulator/simulator.py
# Start the Stream Consumer
python stream/consumer.py
📂 Project Structure
/
├── config/ # Configuration files (YAML)
│ ├── fleet_config.yaml # Vehicle model profiles & categories
│ └── health_config.yaml # Probabilistic fault & recovery settings
├── dashboard/ # Visualization (Grafana)
├── documentation/ # Technical guides (MkDocs source)
├── models/ # ML artifacts & training logic
├── simulator/ # Data generation (Physics & Kafka)
├── stream/ # Real-time processing & ingestion
├── warehouse/ # Data transformation (dbt SQL)
├── docker-compose.yml # Infrastructure orchestration
└── README.md # Repository overview
📄 Technical Guides
- Machine Learning Journey: From label leakage to realistic AUC.
- Known Issues: Technical hurdles and resolutions.
- Future Ideas: Roadmap for increasing simulation uncertainty.
- Anomaly Detection: Stage 1 (Injection) vs Stage 2 (Detection).