Technical Issues & Resolutions
Summary
The issues.md document acts as the primary engineering log for RiskFabric. It captures architectural hurdles, environment-specific bugs, and performance bottlenecks encountered during development, along with their implemented or proposed resolutions.
Design Intent
This document serves as Institutional Knowledge for the project. In complex simulations, the most difficult bugs often arise from the interaction between system layers (e.g., Rust → Kafka → Python). Documenting these issues provides a roadmap for future optimizations and prevents the repetition of architectural errors. Every entry is paired with a specific technical fix validated through benchmarking or regression testing.
🛠️ Data Engine & Type Safety
1. Polars UInt8 Series Creation Error
- Problem: Polars returned a
ComputeErrorwhen materializing DataFrames containing 8-bit unsigned integers. This blocked features likeis_weekendand other boolean-adjacent flags. - Resolution: All flag and counter columns were migrated to
DataType::UInt32to ensure native Polars support and broader ML library compatibility.
2. Polars is_in Panic on Int8
- Problem: The
.dt().weekday()function returnsInt8, which caused kernel-level panics during.is_in()membership checks. - Resolution: Output from
.weekday()is now explicitly cast toInt32, ensuring the comparison set (e.g.,&[6i32, 7i32]) matches the target type exactly.
3. ClickHouse Timestamp Precision
- Problem: Standard
DateTime64ingestion in ClickHouse failed when processing ISO 8601 strings with nanosecond precision. - Resolution: Timestamps are landed as
Stringin the Bronze layer. High-precision parsing is deferred to the Silver ETL stage using Polars'.str().to_datetime()for increased flexibility.
🚀 Performance & Scaling
4. Out of Memory (OOM) in Network Linkage
- Problem: Multi-million row many-to-many joins on IP and User Agent entities caused combinatorial explosions, leading to process termination.
- Resolution: The architecture shifted from an Edge-List Graph approach to an Entity Reputation model. Risk is now calculated at the entity level and joined back to transactions, reducing complexity from $O(N^2)$ to $O(N)$.
5. OOM in Large-Scale Generation
- Problem: Single-pass generation of 17M+ transactions exceeded available system RAM.
- Resolution: The generator was refactored to use a Chunked One-Pass Architecture. The population is processed in batches of 5,000 entities, with transactions flushed to Parquet incrementally to maintain a constant memory profile.
6. Parquet Serialization Bottleneck
- Problem: Transaction generation required 44 seconds, with 90% of the time spent in disk I/O and Parquet encoding.
- Resolution: A One-Pass Parallel Architecture was implemented and the Polars chunk size was optimized. This reduced the total runtime to 4.4 seconds, an 11x improvement.
🤖 Machine Learning & Data Science
7. Label Leakage (Near-Perfect AUC)
- Problem: Early models achieved 0.9993 AUC by learning internal generator flags (e.g.,
geo_anomaly) instead of behavioral patterns. - Resolution: A strict "Operational Feature" Sanitization step was implemented to drop all internal metadata. The training target was also shifted from the perfect
fraud_targetto the noisyis_fraudlabel.
8. Observed vs. Configured Fraud Rate Discrepancy
- Problem: The observed fraud rate (~13.6%) appeared higher than the 12% defined in the configuration.
- Resolution: Validation confirmed that
is_frauddeliberately incorporates simulated label noise (3% FP, 10% FN), resulting in a higher observed ratio than the latent ground truth.
9. High Fraud Prevalence in Initial Runs
- Problem: Approximately 86% of customers experienced fraud due to high default configuration values.
- Resolution: The
target_shareparameter was tuned to 0.005 (0.5% transaction rate) to align with industry benchmarks for sparse fraud data.
Known Issues
There is ongoing difficulty with Container Runtime Variability. The podman exec calls used in ingestion and ETL pipelines behave inconsistently across Linux and macOS environments, causing failures in the data warehouse loading process. Transitioning to native database drivers is required to eliminate dependency on the host's container CLI.
Furthermore, Memory Management during Reference Extraction is currently insufficient. When processing large OSM PBF files, the prepare_refs.rs binary can consume significant RAM. Implementing a "Spill-to-Disk" strategy for the parallel map-reduce operation is necessary to maintain a memory footprint below 4GB.