---
license: mit
tags:
- tabular-classification
- lightgbm
- xgboost
- ensemble
- waterborne-disease
- early-warning-system
---

# ArogyaJal Early Warning System for Waterborne Disease Outbreaks

## Model Description

The **ArogyaJal Early Warning System v2.0** is a production-grade machine learning model that predicts waterborne disease outbreaks. Developed by the ML Engineering Team, it issues early alerts for potential outbreaks so that interventions can begin in time. It is an ensemble of LightGBM and XGBoost classifiers, trained on data from 50 villages over a 730-day period.

## Problem Formulation

The model addresses a **binary classification task**: predicting whether the 7-day forward sum of reported waterborne disease cases will be greater than or equal to 3, which is treated as an outbreak. A single yes/no target keeps the resulting alerts clear and actionable.

## Dataset

The model was trained and validated on a dataset with the following key characteristics:

* **Total Observations**: 36,200 samples
* **Number of Villages**: 50
* **Date Range**: January 1, 2023, to December 24, 2024
* **Class Imbalance**: 72.0% negative (no outbreak), 28.0% positive (outbreak), a 2.57:1 imbalance ratio.
* **Validation**: Panel-safe, stratified, time-based validation was used to prevent temporal leakage.

## Feature Engineering

The model uses 44 strictly causal features, engineered through a vectorized pipeline.
These features capture information from IoT sensor data and reported cases, including:

* Missingness indicators
* Imputed IoT features
* Temporal lags (t-1, t-3, t-7, t-14) for IoT parameters and reported cases
* Rolling statistics (7-day and 14-day mean/std) for turbidity and reported cases
* 3-day rate of change (derivatives) for pH, turbidity, TDS, and conductivity
* Cyclical seasonal features (sin/cos of day of year)
* Water quality composite score
* Interaction features (`turb_x_cases_lag1`, `ph_deviation`, `tds_x_turb`)
* Cumulative burden (7-day and 14-day rolling sum of reported cases)

## Model Architecture

The final production model is a **weighted ensemble** of two Optuna-tuned gradient boosting models:

* **LightGBM Binary Classifier**
* **XGBoost Binary Classifier**

The ensemble combines their predicted probabilities with fixed weights: `0.90 * P_LightGBM + 0.10 * P_XGBoost`.

## Performance Metrics

The model meets every target on a holdout test set (7,250 samples):

| Metric | Value | Target | Status |
|---|---|---|---|
| **PR-AUC** | **0.9437** | ≥ 0.60 | ✓ EXCEEDED |
| **ROC-AUC** | **0.9528** | ≥ 0.85 | ✓ EXCEEDED |
| **Recall (Sensitivity)** | **0.8712 (87.1%)** | ≥ 80.0% | ✓ EXCEEDED |
| **Precision** | **0.9060 (90.6%)** | ≥ 50.0% | ✓ EXCEEDED |
| **F1-Score** | **0.8883** | ≥ 0.60 | ✓ EXCEEDED |
| **False Positive Rate** | **4.5%** | ≤ 30.0% | ✓ EXCEEDED |
| **Optimal Threshold** | **0.3774** | - | - |

**Confusion Matrix (Holdout Test Set)**:

* **True Negatives (TN)**: 4,633
* **False Positives (FP)**: 217
* **False Negatives (FN)**: 309
* **True Positives (TP)**: 2,091

## Explainability (SHAP)

SHAP analysis identified the most influential features for outbreak prediction:

1. **turbidity_roll_std7**: Short-term (7-day) volatility in water turbidity.
2. **turbidity_roll_mean7**: 7-day mean turbidity (baseline water cloudiness).
3. **turbidity_roc3**: 3-day rate of change in turbidity.
4. **turbidity_lag3** & **turbidity_lag1**: Lagged turbidity values.
5. **tds_x_turb**: Interaction of total dissolved solids and turbidity.
6. **season_cos** & **season_sin**: Cyclical seasonal patterns.
7. **reported_cases_roll_std7**: 7-day volatility in reported cases.

This indicates that rapid increases and high volatility in water turbidity, coupled with seasonal patterns, are strong predictors of outbreaks.

## Usage

To use the ArogyaJal Early Warning System for inference, you will need the following artifacts:

* `model_lgbm_v2.pkl`
* `model_xgb_v2.pkl`
* `ensemble_config.json`
* `arogyajal_inference.py` (the inference script)

### Example Inference Code

```python
import pandas as pd
from arogyajal_inference import OutbreakWarningSystem

# Assuming model files and config are in the same directory
warning_system = OutbreakWarningSystem(
    lgb_path="model_lgbm_v2.pkl",
    xgb_path="model_xgb_v2.pkl",
    config_path="ensemble_config.json"
)

# Prepare new data for prediction (example DataFrame structure)
df_new = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-12-25', '2024-12-26']),
    'village_id': ['VIL_001', 'VIL_001'],
    'ph': [7.2, 7.3],
    'turbidity': [0.8, 1.5],
    'tds': [32000, 32500],
    'conductivity': [460, 470],
    'reported_cases': [0, 1]
})

# Score the new rows, then extract the rows flagged as outbreak alerts
results = warning_system.predict(df_new)
alerts = warning_system.get_alerts(results)

print("Outbreak Alerts:")
print(alerts)
```

## Installation

To run the inference script, you will need the following Python packages:

```bash
pip install pandas numpy lightgbm xgboost scikit-learn
```

## Model Maintenance

Re-tune the model quarterly using the `arogyajal_train_v2.py` script to adapt to seasonal shifts and maintain performance.
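
After a re-tuning run, it can help to sanity-check the ensemble weighting, the decision threshold, and the headline metrics against the values in this card. The following is a minimal sketch, not the project's actual code: the weights, threshold, and confusion-matrix counts are taken from the sections above, while the function names (`ensemble_probability`, `outbreak_alert`, `metrics_from_confusion`), the `>=` comparison at the threshold, and the sample probabilities are assumptions made for illustration.

```python
import numpy as np

# Values taken from this model card.
W_LGBM, W_XGB = 0.90, 0.10
THRESHOLD = 0.3774


def ensemble_probability(p_lgbm, p_xgb):
    """Weighted average of the two models' outbreak probabilities."""
    p_lgbm = np.asarray(p_lgbm, dtype=float)
    p_xgb = np.asarray(p_xgb, dtype=float)
    return W_LGBM * p_lgbm + W_XGB * p_xgb


def outbreak_alert(p_lgbm, p_xgb, threshold=THRESHOLD):
    """Flag an alert where the ensemble probability reaches the threshold.

    Assumption: the comparison is inclusive (>=); the card does not say.
    """
    return ensemble_probability(p_lgbm, p_xgb) >= threshold


def metrics_from_confusion(tn, fp, fn, tp):
    """Derive precision, recall, F1, and false positive rate from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }


# Confusion-matrix counts from the holdout test set reported above.
reported = metrics_from_confusion(tn=4633, fp=217, fn=309, tp=2091)
for name, value in reported.items():
    print(f"{name}: {value:.4f}")

# Made-up probabilities for two village-days (illustrative only).
print(outbreak_alert([0.40, 0.10], [0.20, 0.90]))
```

The derived precision, recall, F1, and false positive rate agree with the Performance Metrics table to within rounding. Swapping in the counts from a new holdout run gives a quick check that a re-tuned model still meets the targets.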