title: Retail Demand Forecaster
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.9.1
app_file: app/gradio_app.py
pinned: false
python_version: '3.11'
Retail Demand Forecasting
End-to-end retail demand forecasting pipeline. Compares 5 approaches from naive baseline to Amazon Chronos-2 (2025 SOTA foundation model), with probabilistic prediction intervals, MLflow tracking, and a live Gradio demo.
Highlights
| What | Detail |
|---|---|
| Dataset | Store Sales (Corporación Favorita): 54 stores, 33 families, 4.5 years + oil price + holidays |
| Models | Seasonal Naive → AutoARIMA → LightGBM → Amazon Chronos-2 → Ensemble |
| Best model | Ensemble (LightGBM-Optuna + Chronos-ft): RMSLE 0.1610, MASE 0.835 |
| Fine-tuning | Chronos-2: zero-shot 0.2040 → 1000-step fine-tune 0.1690 → ensemble 0.1610 |
| Key insight | Ensemble wins only when both components are strong; fine-tuning Chronos was the unlock |
| Prediction intervals | 80% + 90% bands via conformal prediction |
| Metric | RMSLE, which penalises under-forecasting (stockout > overstock in cost) |
| Experiment tracking | MLflow, all model runs logged |
| API | FastAPI /forecast endpoint |
| UI | Gradio, interactive 28-day forecast chart |
| Deployment | HuggingFace Spaces |
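The prediction-interval row above mentions conformal prediction. As a rough illustration (not necessarily how this repo computes its bands), a split-conformal interval can be built from absolute residuals on a held-out calibration window. The function name and the residual values below are hypothetical.

```python
import math


def conformal_half_width(calibration_residuals, coverage):
    """Half-width of a split-conformal interval from held-out residuals."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * coverage).
    # Capping at n is a simplification; strictly the band is unbounded there.
    rank = min(n, math.ceil((n + 1) * coverage))
    return scores[rank - 1]


# Made-up calibration errors (actual minus forecast) for one series.
residuals = [-4.0, 2.5, -1.0, 6.0, 0.5, -3.0, 1.5, -2.0, 5.0]
point_forecast = 50.0

for coverage in (0.80, 0.90):
    width = conformal_half_width(residuals, coverage)
    lower, upper = point_forecast - width, point_forecast + width
    print(f"{coverage:.0%} band: [{lower}, {upper}]")
```

The 90% band is never narrower than the 80% band, because it takes a higher-ranked residual as its half-width.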
Architecture
Store Sales CSV (Kaggle) / M5 fallback (datasetsforecast)
├──► data_loader.py (load, fill date gaps, train/test split)
├──► features.py (lag, rolling, calendar, oil price, holiday features)
├──► train_lgbm.py (LightGBM via mlforecast + MLflow)
├──► train_chronos.py (Chronos-2 zero-shot: no training, requires GPU)
├──► experiments.py (5-model comparison -> model_meta.json)
├──► evaluate.py (forecast plots, metrics comparison)
├──► api/main.py (FastAPI /forecast)
└──► app/gradio_app.py (HF Spaces UI)
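The first pipeline stage (load, fill date gaps, train/test split) could look roughly like the sketch below. The column names `unique_id`, `ds`, `y`, both function names, and the choice to fill missing days with zero sales are assumptions for illustration; the actual data_loader.py may differ.

```python
import pandas as pd


def fill_date_gaps(df, freq="D"):
    """Reindex every series onto one complete calendar; missing days become 0."""
    full_range = pd.date_range(df["ds"].min(), df["ds"].max(), freq=freq)
    filled = []
    for series_id, group in df.groupby("unique_id"):
        complete = (
            group.set_index("ds")["y"]
            .reindex(full_range, fill_value=0.0)  # assumption: a gap means no sales
            .rename_axis("ds")
            .reset_index()
        )
        complete["unique_id"] = series_id
        filled.append(complete)
    return pd.concat(filled, ignore_index=True)[["unique_id", "ds", "y"]]


def split_last_days(df, horizon=28):
    """Hold out the final `horizon` days as the test set."""
    cutoff = df["ds"].max() - pd.Timedelta(days=horizon)
    return df[df["ds"] <= cutoff], df[df["ds"] > cutoff]


raw = pd.DataFrame({
    "unique_id": ["store_1_GROCERY"] * 3,
    "ds": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-04"]),
    "y": [5.0, 7.0, 9.0],
})
filled = fill_date_gaps(raw)           # 2017-01-03 is inserted with y = 0.0
train, test = split_last_days(filled, horizon=1)
```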
Quickstart
# 1. Clone & install
git clone https://github.com/Fikri645/demand-forecasting
cd demand-forecasting
pip install -r requirements-dev.txt
# 2a. (Option A) Download Store Sales from Kaggle, put the zip in data/raw/, then:
python scripts/download_data.py
# 2b. (Option B) Auto-download M5 via datasetsforecast (no Kaggle needed)
# Just run the script; it will use M5 as fallback automatically
python scripts/download_data.py
# 3. Run full experiment (5 models + MLflow logging)
python -m src.experiments
# 4. Generate evaluation plots
python -m src.evaluate
# 5. Run API locally
uvicorn api.main:app --reload
# 6. Run Gradio UI
python app/gradio_app.py
Or via make:
make install && make data && make experiments && make evaluate
Project Structure
demand-forecasting/
├── data/processed/ # train.parquet, test.parquet
├── src/
│   ├── config.py # paths, constants
│   ├── data_loader.py # Store Sales (Favorita) loading + gap fill + M5 fallback
│   ├── features.py # lag, rolling, calendar feature engineering
│   ├── metrics.py # RMSE, MAE, RMSLE, MASE, coverage
│   ├── train_lgbm.py # LightGBM via mlforecast
│   ├── train_chronos.py # Amazon Chronos-2 (zero-shot)
│   ├── experiments.py # 5-model comparison + MLflow
│   └── evaluate.py # forecast + comparison plots
├── api/main.py # FastAPI /forecast endpoint
├── app/gradio_app.py # Gradio UI (HF Spaces)
├── notebooks/01_eda.ipynb # Exploratory Data Analysis
├── tests/ # pytest (metrics, features, API schemas)
├── Makefile
└── requirements-dev.txt
Dataset: Store Sales (Corporación Favorita)
The Store Sales - Time Series Forecasting competition (Kaggle) uses real data from Ecuador's largest grocery chain:
- 54 stores, 33 product families, daily unit sales
- 4.5 years: 2013-01-01 to 2017-08-15 (1,684 days)
- External features: oil price (Ecuador is oil-dependent, so economic shocks affect spending), national/regional holidays, promotions
- This portfolio project uses the top 300 series by total volume
Source: Kaggle Store Sales Competition
M5 (Walmart, via datasetsforecast) is available as an automatic fallback if the Store Sales CSV is not present.
Model Details
Seasonal Naive (baseline)
Forecast = same weekday last week. Any real model must beat this.
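A minimal sketch of this baseline, assuming a plain list of daily values and that horizons longer than one week simply repeat the last observed week (the function name is hypothetical; the repo may implement this through a forecasting library instead):

```python
def seasonal_naive(history, horizon, season_length=7):
    """Repeat the last full season of `history` across the forecast horizon."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[step % season_length] for step in range(horizon)]


daily_sales = [10, 12, 11, 13, 20, 30, 25,   # week 1 (made-up values)
               11, 13, 12, 14, 22, 31, 27]   # week 2
forecast = seasonal_naive(daily_sales, horizon=28)
# Each forecast day copies the value from the same weekday in week 2.
```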
AutoARIMA
statsforecast AutoARIMA with weekly seasonality. Automatic order selection via AIC.
LightGBM + Feature Engineering
mlforecast with automatic lag generation:
- Lags: t-7, t-14, t-21, t-28, t-35, t-42, t-56, t-364 (52 weeks back, i.e. the same weekday last year)
- Rolling: 7-day and 28-day mean, std, max per series
- Calendar: day-of-week, month, quarter, is-weekend, month-start/end
- Price: normalised sell price, price change %
- External: oil price, promotion flag, holiday flag (Store Sales specific)
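The lag, rolling, and calendar features listed above can be sketched in plain pandas. This is only an illustration: the repo uses mlforecast for lag generation, and the column names (`unique_id`, `ds`, `y`), the function name, and the 7-day shift before rolling are assumptions made here.

```python
import pandas as pd


def add_lag_features(df, lags=(7, 14, 28), windows=(7, 28)):
    """Add per-series lag, rolling-mean, and calendar columns to a long frame."""
    out = df.sort_values(["unique_id", "ds"]).copy()
    grouped = out.groupby("unique_id")["y"]
    for lag in lags:
        out[f"lag_{lag}"] = grouped.shift(lag)
    for window in windows:
        # Shift by 7 days first so each window only sees already-known values.
        out[f"rolling_mean_{window}"] = grouped.transform(
            lambda s, w=window: s.shift(7).rolling(w).mean()
        )
    out["dayofweek"] = out["ds"].dt.dayofweek
    out["is_weekend"] = out["dayofweek"] >= 5
    return out


dates = pd.date_range("2017-01-01", periods=40, freq="D")
frame = pd.DataFrame({"unique_id": "store_1_GROCERY", "ds": dates, "y": range(40)})
features = add_lag_features(frame)
```

Rows at the start of each series contain NaN in the lag columns because there is no earlier history to look back on; tree models such as LightGBM accept these missing values directly.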
Amazon Chronos-2 (2025 SOTA)
Zero-shot foundation model: no training data needed. Loads pre-trained weights (amazon/chronos-t5-small, about 46M params) from HuggingFace. Generates 100 probabilistic samples -> P10/P50/P90 quantiles.
Chronos-2 (Oct 2025) natively supports cross-series dependencies, exogenous features, and multivariate forecasting. Zero-shot performance competitive with fully-supervised models.
Requirements: Chronos needs PyTorch with CUDA and sufficient virtual memory (page file >= 8GB on Windows). Run python -m src.train_chronos after increasing virtual memory. Code is complete and ready.
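The step from 100 sample paths to P10/P50/P90 quantiles is a per-day quantile over the sample axis. The sketch below uses random numbers as a stand-in for model output, so no Chronos call is shown; the array shape (samples × horizon) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Stand-in for model output: 100 sample paths over a 28-day horizon.
samples = rng.gamma(shape=5.0, scale=20.0, size=(100, 28))

# Quantiles are taken across samples (axis 0), one value per forecast day.
p10, p50, p90 = np.quantile(samples, [0.1, 0.5, 0.9], axis=0)

# Share of sample values falling inside the P10-P90 band on each day.
inside = ((samples >= p10) & (samples <= p90)).mean(axis=0)
print(p50[:3], inside.mean())
```

P50 serves as the point forecast, and the P10 to P90 range gives an 80% band taken directly from the samples.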
Ensemble
Weighted average: LightGBM × 0.6 + Chronos × 0.4. The best ensemble in the results table uses 0.5/0.5 weights with fine-tuned Chronos. Combines domain-feature awareness with temporal pattern recognition. Run python -m src.experiments after Chronos is available.
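A weighted average of two aligned point forecasts can be sketched as follows. The function name and the forecast values are made up for illustration; the weights are passed in, so either weighting mentioned in this README can be tried.

```python
def weighted_ensemble(forecasts, weights):
    """Blend aligned point forecasts; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    names = list(forecasts)
    horizon = len(forecasts[names[0]])
    return [
        sum(weights[name] * forecasts[name][step] for name in names)
        for step in range(horizon)
    ]


lgbm_forecast = [100.0, 120.0, 90.0]      # made-up values
chronos_forecast = [110.0, 100.0, 95.0]   # made-up values
blend = weighted_ensemble(
    {"lgbm": lgbm_forecast, "chronos": chronos_forecast},
    {"lgbm": 0.6, "chronos": 0.4},
)
```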
Results: 28-Day Forecast on Store Sales (300 series)
| Model | RMSLE | MASE | Notes |
|---|---|---|---|
| Seasonal Naive | 0.2145 | 1.109 | Benchmark floor |
| AutoARIMA | 0.2105 | 1.121 | Slightly better RMSLE than naive, but worse MASE |
| Chronos-2 (zero-shot) | 0.2040 | 1.038 | Beats AutoARIMA with zero training |
| LightGBM (default) | 0.1672 | 0.877 | Strong baseline |
| LightGBM (Optuna, 50 trials) | 0.1671 | 0.880 | Marginal RMSLE gain; default was already good |
| Chronos-2 (fine-tuned, 1000 steps) | 0.1690 | 0.863 | 17.2% lower RMSLE than zero-shot |
| Chronos-2 (extended, 3000 steps) | 0.1688 | 0.863 | Converged at ~1000 steps |
| Ensemble (LGB-Optuna × 0.5 + Chronos-ft × 0.5) | 0.1610 | 0.835 | Best: 25% lower RMSLE than naive |
Key findings:
- Ensemble wins, but only when both components are strong. Zero-shot Chronos dragged the first ensemble down. Once Chronos was fine-tuned, a 50/50 ensemble cut RMSLE to 0.1610 (3.7% better than the best single model, LightGBM-Optuna at 0.1671).
- LightGBM was already near-optimal. 50 Optuna trials improved RMSLE by only 0.0001; the default hyperparameters were already well chosen. Lesson: diminishing returns on HPO when the model class fits the data well.
- Chronos converges fast. The jump from zero-shot (0.2040) to 1000 steps (0.1690) is large (0.035); going from 1000 to 3000 steps gains only 0.0002 more. Pre-training provides a warm start that requires very few gradient updates.
- Foundation models + feature engineering are complementary. Chronos captures long-range temporal patterns; LightGBM captures domain features (oil price, promotions, day-of-week). Neither alone beats the combination.
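The percentages quoted in the results can be reproduced from the RMSLE column with a one-line relative-improvement formula. The helper name is hypothetical; the numbers are copied from the table above.

```python
def relative_improvement(baseline, candidate):
    """Fractional reduction in an error metric (positive = candidate is better)."""
    return (baseline - candidate) / baseline


rmsle_scores = {
    "seasonal_naive": 0.2145,
    "chronos_zero_shot": 0.2040,
    "chronos_finetuned": 0.1690,
    "lgbm_optuna": 0.1671,
    "ensemble": 0.1610,
}

finetune_gain = relative_improvement(
    rmsle_scores["chronos_zero_shot"], rmsle_scores["chronos_finetuned"]
)
ensemble_vs_naive = relative_improvement(
    rmsle_scores["seasonal_naive"], rmsle_scores["ensemble"]
)
ensemble_vs_best_single = relative_improvement(
    rmsle_scores["lgbm_optuna"], rmsle_scores["ensemble"]
)
print(f"{finetune_gain:.1%} {ensemble_vs_naive:.1%} {ensemble_vs_best_single:.1%}")
```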
Why RMSLE?
In retail, running out of stock costs more than overstock. RMSLE operates in log-space, which:
- Penalises under-forecasting more than over-forecasting
- Gives equal relative weight to low-volume and high-volume SKUs
- Aligns the metric with actual business cost structure
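The asymmetry is easy to check numerically. The sketch below assumes the standard RMSLE definition with log1p and is not necessarily the implementation in src/metrics.py; it compares an under-forecast and an over-forecast that miss by the same number of units.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


actual = [100.0]
under = rmsle(actual, [60.0])    # forecast is 40 units too low
over = rmsle(actual, [140.0])    # forecast is 40 units too high
print(f"under-forecast RMSLE: {under:.4f}")
print(f"over-forecast RMSLE:  {over:.4f}")
```

For the same absolute miss, the under-forecast scores worse because the log ratio between 101 and 61 is larger than the one between 141 and 101.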
What I Learned
- Ensemble wins only when both components are competitive. Zero-shot Chronos (RMSLE 0.2040) + LightGBM (0.1672) = 0.1722 (worse than LightGBM alone). Fine-tuned Chronos (0.1690) + LightGBM-Optuna (0.1671) = 0.1610 (new best). The lesson: fix the weaker model first, then ensemble.
- LightGBM is already near-optimal with default hyperparameters. 50 Optuna trials improved RMSLE by only 0.0001. When the model class fits the data well, HPO has diminishing returns.
- Foundation models converge fast from pre-training. Zero-shot → 1000 steps: RMSLE drops 0.035 (massive). 1000 → 3000 steps: only 0.0002. Pre-training on diverse time series provides a warm start; most adaptation happens in the first few hundred steps.
- Chronos + LightGBM are complementary. Chronos captures long-range temporal structure and seasonal patterns; LightGBM captures domain features (oil price, promotions, day-of-week). Their errors are not correlated, hence the ensemble gain.
- AutoARIMA fails on complex retail. Its MASE of 1.121 is worse than seasonal naive (1.109). Lag features + calendar + oil price give tree models the context that ARIMA's linear structure can't model.
- MASE < 1.0 is the real bar. Only LightGBM, fine-tuned Chronos, and their ensemble clear it. AutoARIMA (1.121) and zero-shot Chronos (1.038) both stay above 1.0, although zero-shot Chronos still scores below Seasonal Naive's 1.109.
- lag_364 (same weekday last year) is critical. Annual cycles in retail (back-to-school, holidays, oil price cycles) are only captured by a 1-year lag; shorter lags miss this entirely.
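For reference on the MASE < 1.0 bar, here is a minimal sketch of the metric, assuming the common definition that scales forecast MAE by the in-sample MAE of a seasonal-naive forecast with a 7-day season. The scaling used in src/metrics.py may differ, and all values below are made up.

```python
def mase(y_true, y_pred, y_train, season_length=7):
    """Mean absolute scaled error against an in-sample seasonal-naive forecast."""
    forecast_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [
        abs(y_train[i] - y_train[i - season_length])
        for i in range(season_length, len(y_train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    return forecast_mae / scale


train_sales = [10, 12, 11, 13, 20, 30, 25,
               12, 14, 13, 15, 22, 32, 27]   # each weekday rises by 2 week over week
actual_sales = [13, 15, 14]
model_forecast = [12, 15, 16]
score = mase(actual_sales, model_forecast, train_sales)
```

A score below 1.0 means the forecast's average error is smaller than the error a seasonal-naive forecast made on the training data.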