Breathe Easy AQI Predictor (XGBoost Tuned)

Model Description

This is an XGBoost regression model trained to predict the Air Quality Index (AQI) of Indian cities from historical pollutant concentrations and engineered temporal features. It is the flagship model of the "Breathe Easy" project and reaches an R² of 0.925 in evaluation.

The model was tuned with Optuna's Bayesian optimization (50 trials) to improve generalization across India's diverse geographic locations and seasonal variations.

Uses

Direct Use

You can use the model to predict the daily AQI value based on the previous days' pollutant data, rolling averages, and cyclical date features.

Downstream Use

This model is intended to be embedded in real-time Streamlit dashboards and monitoring tools to provide localized air quality forecasts and raise public health alerts.

Out-of-Scope Use

The model relies heavily on the specific distribution of pollutants found in India (CPCB metrics). Deploying this model out-of-the-box for European or North American cities (which use different AQI formula scales and have different baseline pollutant ratios) without retraining will likely result in inaccurate predictions.

Training Data

The model was trained on the city_day.csv dataset, which contains daily aggregations of pollutants across major Indian cities from 2015 to 2020.

Core Features Engineered:

  • Temporal features (rolling 3- and 7-day means, lags of 1 and 2 days)
  • Cyclical encodings for Month and Day of the week (Sin/Cos transformations)
  • Target-encoded City names
  • Pollutant ratios (e.g., PM2.5 to PM10 ratio)
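
The feature groups above can be sketched in pandas roughly as follows. This is an illustration, not the project's actual pipeline: the column names (City, Date, PM2.5, PM10, AQI), the output feature names, the helper names add_features and target_encode_city, and the choice to shift by one day before taking rolling means are all assumptions.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag, rolling, cyclical and ratio features (column names are assumed)."""
    df = df.sort_values(["City", "Date"]).copy()
    by_city = df.groupby("City")["AQI"]

    # Lags of 1 and 2 days, computed within each city.
    df["Lag_1_AQI"] = by_city.shift(1)
    df["Lag_2_AQI"] = by_city.shift(2)

    # Rolling 3- and 7-day means over previous days only (shifted by one day
    # so the current day's AQI never leaks into its own features).
    for window in (3, 7):
        df[f"Roll_{window}_AQI"] = by_city.transform(
            lambda s, w=window: s.shift(1).rolling(w, min_periods=1).mean()
        )

    # Cyclical encodings: month has period 12, day of week has period 7.
    month = df["Date"].dt.month
    dow = df["Date"].dt.dayofweek
    df["Month_sin"] = np.sin(2 * np.pi * month / 12)
    df["Month_cos"] = np.cos(2 * np.pi * month / 12)
    df["Dow_sin"] = np.sin(2 * np.pi * dow / 7)
    df["Dow_cos"] = np.cos(2 * np.pi * dow / 7)

    # Pollutant ratio; a zero PM10 reading yields NaN instead of infinity.
    df["PM25_PM10_ratio"] = df["PM2.5"] / df["PM10"].replace(0, np.nan)
    return df


def target_encode_city(train: pd.DataFrame, other: pd.DataFrame) -> pd.Series:
    """Mean AQI per city, learned on training rows only.

    Cities not seen in training fall back to the global training mean.
    """
    city_means = train.groupby("City")["AQI"].mean()
    return other["City"].map(city_means).fillna(train["AQI"].mean())
```

Fitting the city encoding on training rows only, then applying it to validation and test rows, keeps target information from later years out of the features.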

Training Details & Hyperparameters

The model's hyperparameters were optimized using Optuna over 50 trials.

  • Framework: XGBoost
  • Objective: reg:squarederror
  • Evaluation Metric: RMSE (during training)

Evaluation

The model was evaluated using a strict temporal split to prevent data leakage (Training: 2015-2018, Validation: 2019, Test: 2020).
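
The year boundaries above translate into a simple date-based split. The sketch below assumes a datetime column named Date; the helper name temporal_split is illustrative.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, date_col: str = "Date"):
    """Split by calendar year: train 2015-2018, validation 2019, test 2020."""
    year = df[date_col].dt.year
    train = df[year.between(2015, 2018)]  # bounds are inclusive
    val = df[year == 2019]
    test = df[year == 2020]
    return train, val, test
```

Because every validation and test row is later than every training row, features such as lags and rolling means never draw on future observations at training time.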

Metrics

Metric         | Score  | Note
R² (R-squared) | 0.925  | Explains ~92.5% of variance in AQI
MAE            | 11.42  | Small relative to AQI values that range up to 500+
RMSE           | 22.90  |
MAPE           | 11.68% |

Note: CatBoost achieved a slightly higher R² (0.928) in internal comparisons, but this tuned XGBoost model was selected for its balance of accuracy and fast inference for dashboard deployment.
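
For reference, the four reported metrics follow their standard definitions and can be computed with NumPy as in this minimal sketch. The helper name regression_metrics is illustrative, and the MAPE term assumes no true AQI value is zero.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return R², MAE, RMSE and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(err)),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "mape_pct": 100.0 * np.mean(np.abs(err / y_true)),
    }
```

RMSE penalizes large errors more heavily than MAE, which is consistent with the reported RMSE (22.90) being about twice the MAE (11.42): a minority of days with large misses contributes disproportionately.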

Explainability (SHAP)

Global feature importance using SHAP reveals that:

  1. PM2.5 and PM10 concentrations strongly dictate the predicted AQI.
  2. Lag_1_AQI (yesterday's AQI) is heavily relied upon, acting as a powerful baseline anchor for the tree splits.
  3. Month_cos (seasonal impact) is a secondary but highly distinct driver, specifically accounting for the severe winter pollution spikes (November - January).

How to Get Started with the Model

Because the model is saved as a pickled scikit-learn/XGBoost pipeline, you can load it directly with pickle or joblib.

import pickle
import pandas as pd

# 1. Load the model
with open("models/best_xgboost_tuned.pkl", "rb") as f:
    model = pickle.load(f)

# 2. Prepare your features 
# (Ensure your DataFrame matches the 20+ columns generated during feature engineering)
# X_new = pd.DataFrame(...)

# 3. Predict AQI
# predictions = model.predict(X_new)
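
Before calling predict, it can help to check that the new DataFrame carries the same columns, in the same order, as the training features. The helper below is a hypothetical sketch; the name align_features is not part of this project, and the list of expected columns has to be supplied by you.

```python
import pandas as pd


def align_features(X_new: pd.DataFrame, expected_columns) -> pd.DataFrame:
    """Return X_new with exactly the expected columns, in the expected order."""
    expected = list(expected_columns)
    missing = [c for c in expected if c not in X_new.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    # Extra columns are dropped; the rest are reordered to match training.
    return X_new[expected]
```

If the loaded object exposes scikit-learn's feature_names_in_ attribute, that array may serve as expected_columns; otherwise the list has to come from the feature engineering code.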