metadata
title: Retail Demand Forecaster
emoji: 📈
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.9.1
app_file: app/gradio_app.py
pinned: false
python_version: '3.11'

Retail Demand Forecasting


An end-to-end retail demand forecasting pipeline. It compares five approaches, from a seasonal naive baseline to Amazon Chronos-2 (a 2025 foundation model), and adds probabilistic prediction intervals, MLflow tracking, and a live Gradio demo.

Live Demo → | GitHub →


Highlights

| What | Detail |
| --- | --- |
| Dataset | Store Sales (Corporación Favorita): 54 stores, 33 families, 4.5 years, plus oil price and holidays |
| Models | Seasonal Naive → AutoARIMA → LightGBM → Amazon Chronos-2 → Ensemble |
| Best model | Ensemble (LightGBM-Optuna + Chronos-ft): RMSLE 0.1610, MASE 0.835 |
| Fine-tuning | Chronos-2: zero-shot 0.2040 → 1000-step fine-tune 0.1690 → ensemble 0.1610 |
| Key insight | Ensemble wins only when both components are strong; fine-tuning Chronos was the unlock |
| Prediction intervals | 80% + 90% bands via conformal prediction |
| Metric | RMSLE, which penalises under-forecasting (stockout > overstock in cost) |
| Experiment tracking | MLflow: all model runs logged |
| API | FastAPI /forecast endpoint |
| UI | Gradio: interactive 28-day forecast chart |
| Deployment | HuggingFace Spaces |
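
The table mentions conformal prediction intervals without saying which variant is used. As a rough illustration only, the sketch below shows split conformal prediction with absolute residuals: the band half-width is a finite-sample-corrected quantile of the errors on a held-out calibration set. The function name and the toy numbers are hypothetical and are not taken from this repository.

```python
import math

def conformal_half_width(calib_actuals, calib_forecasts, coverage):
    """Half-width q so that forecast +/- q targets the requested coverage.

    Assumes calibration and future errors are exchangeable (split conformal).
    """
    scores = sorted(abs(a - f) for a, f in zip(calib_actuals, calib_forecasts))
    n = len(scores)
    rank = math.ceil((n + 1) * coverage)  # finite-sample corrected rank
    if rank > n:
        raise ValueError("calibration set too small for this coverage level")
    return scores[rank - 1]

# Toy calibration data (hypothetical): 20 actuals against a flat forecast of 11.
calib_actuals = [10, 12, 9, 15, 11, 13, 8, 14, 10, 12,
                 11, 9, 13, 10, 12, 14, 9, 11, 10, 13]
calib_forecasts = [11] * 20

q80 = conformal_half_width(calib_actuals, calib_forecasts, 0.80)
q90 = conformal_half_width(calib_actuals, calib_forecasts, 0.90)

point_forecast = 12.0
band_80 = (max(0.0, point_forecast - q80), point_forecast + q80)
band_90 = (max(0.0, point_forecast - q90), point_forecast + q90)
```

The lower bound is clipped at zero because unit sales cannot be negative; whether the project does the same is not stated.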

Architecture

Store Sales CSV (Kaggle) / M5 fallback (datasetsforecast)
  └─► data_loader.py    (load, fill date gaps, train/test split)
        └─► features.py  (lag, rolling, calendar, oil price, holiday features)
              ├─► train_lgbm.py    (LightGBM via mlforecast + MLflow)
              ├─► train_chronos.py (Chronos-2 zero-shot: no training, requires GPU)
              └─► experiments.py   (5-model comparison -> model_meta.json)
                    └─► evaluate.py (forecast plots, metrics comparison)
                          ├─► api/main.py       (FastAPI /forecast)
                          └─► app/gradio_app.py (HF Spaces UI)
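
The "fill date gaps" step in data_loader.py is not shown here. A minimal sketch of the idea, assuming missing calendar days are filled with zero sales (the actual fill rule in the repository may differ), could look like this. The function name is hypothetical.

```python
from datetime import date, timedelta

def fill_date_gaps(sales_by_day, fill_value=0.0):
    """Return (day, value) pairs for every calendar day from first to last observation."""
    first, last = min(sales_by_day), max(sales_by_day)
    n_days = (last - first).days + 1
    days = [first + timedelta(days=i) for i in range(n_days)]
    # Days absent from the raw data get `fill_value` (an assumption of this sketch).
    return [(d, sales_by_day.get(d, fill_value)) for d in days]

# Toy series with one missing day in the middle.
raw = {date(2016, 12, 24): 120.0, date(2016, 12, 26): 95.0}
filled = fill_date_gaps(raw)
```

A regular daily index matters because lag features such as t-7 assume that seven rows back means seven days back.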

Quickstart

# 1. Clone & install
git clone https://github.com/Fikri645/demand-forecasting
cd demand-forecasting
pip install -r requirements-dev.txt

# 2a. (Option A) Download Store Sales from Kaggle, put the zip in data/raw/, then:
python scripts/download_data.py

# 2b. (Option B) Auto-download M5 via datasetsforecast (no Kaggle needed)
#     Just run the script; it will use M5 as fallback automatically
python scripts/download_data.py

# 3. Run full experiment (5 models + MLflow logging)
python -m src.experiments

# 4. Generate evaluation plots
python -m src.evaluate

# 5. Run API locally
uvicorn api.main:app --reload

# 6. Run Gradio UI
python app/gradio_app.py

Or via make:

make install && make data && make experiments && make evaluate

Project Structure

demand-forecasting/
├── data/processed/         # train.parquet, test.parquet
├── src/
│   ├── config.py           # paths, constants
│   ├── data_loader.py      # Store Sales (Favorita) loading + gap fill + M5 fallback
│   ├── features.py         # lag, rolling, calendar feature engineering
│   ├── metrics.py          # RMSE, MAE, RMSLE, MASE, coverage
│   ├── train_lgbm.py       # LightGBM via mlforecast
│   ├── train_chronos.py    # Amazon Chronos-2 (zero-shot)
│   ├── experiments.py      # 5-model comparison + MLflow
│   └── evaluate.py         # forecast + comparison plots
├── api/main.py             # FastAPI /forecast endpoint
├── app/gradio_app.py       # Gradio UI (HF Spaces)
├── notebooks/01_eda.ipynb  # Exploratory Data Analysis
├── tests/                  # pytest (metrics, features, API schemas)
├── Makefile
└── requirements-dev.txt

Dataset: Store Sales (Corporación Favorita)

The Store Sales - Time Series Forecasting competition (Kaggle) uses real data from Ecuador's largest grocery chain:

  • 54 stores, 33 product families, daily unit sales
  • 4.5 years: 2013-01-01 to 2017-08-15 (1,684 days)
  • External features: oil price (Ecuador is oil-dependent, so economic shocks affect spending), national/regional holidays, promotions
  • This project uses the top 300 series by total sales volume
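
Selecting the highest-volume series can be sketched as below. This is an illustration under assumptions: the series IDs (store number plus family) and the function name are hypothetical, and the repository may do the selection differently, for example with a dataframe group-by.

```python
from collections import defaultdict

def top_series_by_volume(rows, n):
    """rows: iterable of (series_id, sales). Returns the n ids with the largest total sales."""
    totals = defaultdict(float)
    for series_id, sales in rows:
        totals[series_id] += sales
    return sorted(totals, key=totals.get, reverse=True)[:n]

# Hypothetical (series_id, daily sales) records.
rows = [
    ("1_GROCERY", 900.0),
    ("1_BOOKS", 2.0),
    ("2_GROCERY", 700.0),
    ("1_GROCERY", 850.0),
    ("1_BOOKS", 0.0),
]
top2 = top_series_by_volume(rows, n=2)
```

In the project the same idea would be applied with n=300 across the store and family combinations.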

Source: Kaggle Store Sales Competition

M5 (Walmart, via datasetsforecast) is used as an automatic fallback if the Store Sales CSV is not present.


Model Details

Seasonal Naive (baseline)

Forecast = same weekday last week. Any real model must beat this.
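
A minimal sketch of the baseline, written in plain Python rather than with the forecasting library the project uses: each forecast step copies the value from the same weekday in the last observed week.

```python
def seasonal_naive(history, horizon, season_length=7):
    """Repeat the last `season_length` observations to cover `horizon` steps."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]

history = [10, 12, 11, 13, 20, 30, 25,   # week 1
           11, 13, 12, 14, 22, 31, 27]   # week 2
forecast = seasonal_naive(history, horizon=28)
```

For a 28-day horizon this simply tiles the last week four times, which is why it is a useful floor: it encodes weekly seasonality and nothing else.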

AutoARIMA

statsforecast AutoARIMA with weekly seasonality. Automatic order selection via AIC.

LightGBM + Feature Engineering

mlforecast with automatic lag generation:

  • Lags: t-7, t-14, t-21, t-28, t-35, t-42, t-56, t-364 (52 weeks back, i.e. the same weekday one year earlier)
  • Rolling: 7-day and 28-day mean, std, max per series
  • Calendar: day-of-week, month, quarter, is-weekend, month-start/end
  • Price: normalised sell price, price change %
  • External: oil price, promotion flag, holiday flag (Store Sales specific)
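
To make the feature list concrete, here is a dependency-free sketch of a few of these features for a single series. In the project they are generated by mlforecast; the function below is hypothetical, uses only a subset of the lags, and assumes that index 0 of the series falls on a Monday.

```python
from statistics import mean

LAGS = (7, 14, 28)   # subset of the lags listed above
ROLL_WINDOW = 7

def make_features(y, t):
    """Features for predicting y[t], built only from values at least 7 days old."""
    feats = {f"lag_{k}": y[t - k] for k in LAGS}
    # Rolling mean over the 7 values ending at t-7, so no recent target leaks in.
    window = y[t - 7 - ROLL_WINDOW + 1 : t - 7 + 1]
    feats["rolling_mean_7"] = mean(window)
    feats["day_of_week"] = t % 7            # assumes index 0 is a Monday
    feats["is_weekend"] = int(t % 7 >= 5)
    return feats

y = list(range(100, 140))   # toy series: 40 days, y[t] = 100 + t
feats = make_features(y, t=35)
```

Computing the rolling statistics on lagged values rather than on the most recent days is a design choice of this sketch; it keeps every feature available at forecast time for a multi-day horizon.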

Amazon Chronos-2 (2025 SOTA)

Zero-shot foundation model: no task-specific training is needed. It loads pre-trained weights (amazon/chronos-t5-small, about 46M parameters) from HuggingFace and generates 100 probabilistic sample paths, which are summarised as P10/P50/P90 quantiles.
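
The step from sample paths to quantile bands can be sketched without the model itself. The code below uses random numbers as a stand-in for the 100 sampled paths (an assumption for illustration) and takes per-step quantiles with linear interpolation. The helper names are hypothetical.

```python
import random

def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def summarise_samples(samples, qs=(0.1, 0.5, 0.9)):
    """samples: list of sample paths, each a list with one value per forecast step."""
    horizon = len(samples[0])
    bands = {q: [] for q in qs}
    for step in range(horizon):
        values = sorted(path[step] for path in samples)
        for q in qs:
            bands[q].append(quantile(values, q))
    return bands

rng = random.Random(0)
# Stand-in for model output: 100 sample paths over a 28-day horizon.
samples = [[max(0.0, rng.gauss(50, 10)) for _ in range(28)] for _ in range(100)]
bands = summarise_samples(samples)
```

The P10 to P90 range gives an 80% interval per forecast day, and P50 serves as the point forecast.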

Chronos-2 (Oct 2025) natively supports cross-series dependencies, exogenous features, and multivariate forecasting. Its zero-shot performance is reported to be competitive with fully supervised models.

Requirements: Chronos needs PyTorch with CUDA and sufficient virtual memory (page file >= 8GB on Windows). After increasing virtual memory, run python -m src.train_chronos. The code itself is complete and ready to run.

Ensemble

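Weighted average of the two forecasts, e.g. LightGBM × 0.6 + Chronos × 0.4; the best ensemble in the results uses 0.5 / 0.5 with fine-tuned Chronos. It combines domain-feature awareness with temporal pattern recognition. Run python -m src.experiments after Chronos is available.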
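
As a sketch, the blend is a per-step convex combination of the two point forecasts. The function name and the toy numbers are hypothetical.

```python
def ensemble(lgbm_forecast, chronos_forecast, w_lgbm=0.6):
    """Weighted average of two forecasts of equal length; Chronos gets 1 - w_lgbm."""
    if not 0.0 <= w_lgbm <= 1.0:
        raise ValueError("w_lgbm must be in [0, 1]")
    if len(lgbm_forecast) != len(chronos_forecast):
        raise ValueError("forecasts must cover the same horizon")
    return [w_lgbm * a + (1.0 - w_lgbm) * b
            for a, b in zip(lgbm_forecast, chronos_forecast)]

blended = ensemble([100.0, 80.0], [90.0, 100.0])               # 0.6 / 0.4
equal = ensemble([100.0, 80.0], [90.0, 100.0], w_lgbm=0.5)     # 0.5 / 0.5
```

Keeping the weights summing to one means the blend always lies between the two component forecasts.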


Results: 28-Day Forecast on Store Sales (300 series)

| Model | RMSLE | MASE | Notes |
| --- | --- | --- | --- |
| Seasonal Naive | 0.2145 | 1.109 | Benchmark floor |
| AutoARIMA | 0.2105 | 1.121 | Worse than naive on MASE, only marginally better on RMSLE |
| Chronos-2 (zero-shot) | 0.2040 | 1.038 | Beats AutoARIMA with zero training |
| LightGBM (default) | 0.1672 | 0.877 | Strong baseline |
| LightGBM (Optuna, 50 trials) | 0.1671 | 0.880 | Marginal gain; default was already good |
| Chronos-2 (fine-tuned, 1000 steps) | 0.1690 | 0.863 | 17.2% lower RMSLE than zero-shot |
| Chronos-2 (extended, 3000 steps) | 0.1688 | 0.863 | Converged at ~1000 steps |
| Ensemble (LGB-Optuna × 0.5 + Chronos-ft × 0.5) | 0.1610 | 0.835 | 🏆 Best: 25% lower RMSLE than naive |

Key findings:

  • Ensemble wins, but only when both components are strong. Zero-shot Chronos dragged the first ensemble down. Once Chronos was fine-tuned, a 50/50 ensemble cut RMSLE to 0.1610 (about 3.7% better than the best single model).
  • LightGBM was already near-optimal. 50 Optuna trials improved RMSLE by only 0.0001, so the default hyperparameters were already well tuned for this data. Lesson: HPO has diminishing returns when the model class fits the data well.
  • Chronos converges fast. The jump from zero-shot (0.2040) to 1000 steps (0.1690) is large; going from 1000 to 3000 steps gains only 0.0002 more. Pre-training provides a warm start that requires relatively few gradient updates.
  • Foundation models + feature engineering are complementary. Chronos captures long-range temporal patterns; LightGBM captures domain features (oil price, promotions, day-of-week). Neither alone beats the combination.
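
The percentages quoted in the table and findings follow directly from the RMSLE column. A short check, using the values from the results table (the dictionary keys are just labels chosen for this sketch):

```python
def relative_improvement(baseline, candidate):
    """Percentage reduction in an error metric where lower is better."""
    return 100.0 * (baseline - candidate) / baseline

rmsle = {
    "naive": 0.2145,
    "chronos_zero_shot": 0.2040,
    "lgbm_optuna": 0.1671,
    "chronos_ft": 0.1690,
    "ensemble": 0.1610,
}

ft_gain = relative_improvement(rmsle["chronos_zero_shot"], rmsle["chronos_ft"])  # about 17.2
vs_naive = relative_improvement(rmsle["naive"], rmsle["ensemble"])               # about 24.9
vs_best_single = relative_improvement(rmsle["lgbm_optuna"], rmsle["ensemble"])   # about 3.7
```

Note that the 3.7% figure is measured against LightGBM-Optuna, the stronger of the two components; against fine-tuned Chronos the gain is larger.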

Why RMSLE?

In retail, running out of stock costs more than overstock. RMSLE operates in log-space, which:

  1. Penalises under-forecasting more than over-forecasting of the same absolute size
  2. Gives equal relative weight to low-volume and high-volume SKUs
  3. Aligns the metric with actual business cost structure
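
The asymmetry in point 1 is easy to verify numerically. The sketch below uses the standard RMSLE definition with log1p; the implementation in src/metrics.py is not shown here and may differ in details such as clipping negative forecasts.

```python
import math

def rmsle(actual, forecast):
    """Root mean squared error between log1p(forecast) and log1p(actual)."""
    sq = [(math.log1p(f) - math.log1p(a)) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(sq) / len(sq))

# Same absolute miss of 50 units on an actual of 100.
under = rmsle([100.0], [50.0])    # forecast too low
over = rmsle([100.0], [150.0])    # forecast too high

# Same relative miss (half the actual) at very different volumes.
low_volume = rmsle([10.0], [5.0])
high_volume = rmsle([1000.0], [500.0])
```

Here `under` is larger than `over` because the log compresses large values, and `low_volume` and `high_volume` come out on a similar scale even though the absolute errors differ by a factor of 100.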

What I Learned

  • Ensemble wins only when both components are competitive. Zero-shot Chronos (RMSLE 0.2040) + LightGBM (0.1672) = 0.1722 (worse than LightGBM alone). Fine-tuned Chronos (0.1690) + LightGBM-Optuna (0.1671) = 0.1610 (new best). The lesson: fix the weaker model first, then ensemble.
  • LightGBM is already near-optimal with default hyperparameters. 50 Optuna trials improved RMSLE by only 0.0001. When the model class fits the data well, HPO has diminishing returns.
  • Foundation models converge fast from pre-training. Zero-shot → 1000 steps: RMSLE drops by 0.035. 1000 → 3000 steps: only 0.0002 more. Pre-training on diverse time series provides a warm start, so most of the adaptation happens early in fine-tuning.
  • Chronos + LightGBM are complementary. Chronos captures long-range temporal structure and seasonal patterns; LightGBM captures domain features (oil price, promotions, day-of-week). The ensemble gain suggests their errors are only weakly correlated.
  • AutoARIMA struggles on complex retail data. Its MASE of 1.121 is worse than seasonal naive (1.109). Lag features + calendar + oil price give tree models context that ARIMA's linear structure cannot model.
  • MASE < 1.0 is the real bar. Only LightGBM, fine-tuned Chronos, and their ensemble clear it. AutoARIMA (1.121) and zero-shot Chronos (1.038) both stay above 1.0, although zero-shot Chronos still edges out the Seasonal Naive row (1.109).
  • lag_364 (same weekday one year earlier) is critical. Annual cycles in retail (back-to-school, holidays, oil price cycles) are only captured by a roughly one-year lag; shorter lags miss them entirely.
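
For reference, the MASE bar mentioned above can be sketched as follows. This uses the common definition, scaling the forecast MAE by the in-sample MAE of a seasonal naive forecast; the season length of 7 and the function name are assumptions, since src/metrics.py is not shown here.

```python
def mase(train, actual, forecast, season_length=7):
    """Forecast MAE divided by the in-sample MAE of the seasonal naive forecast."""
    naive_errors = [abs(train[t] - train[t - season_length])
                    for t in range(season_length, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is perfectly seasonal; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

train = [10.0] * 7 + [12.0] * 7      # week-over-week change of 2 on every day
score = mase(train, actual=[12.0, 12.0], forecast=[13.0, 11.0])
```

Under this definition the scale comes from the training period while the error comes from the test period, which is one possible reason a Seasonal Naive row can score above 1.0 out of sample.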