---
title: Retail Demand Forecaster
emoji: 📈
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.9.1"
app_file: app/gradio_app.py
pinned: false
python_version: "3.11"
---

# Retail Demand Forecasting

![CI](https://github.com/Fikri645/demand-forecasting/actions/workflows/ci.yml/badge.svg)
![Python](https://img.shields.io/badge/python-3.11-blue)
![LightGBM](https://img.shields.io/badge/LightGBM-4.x-green)
[![HF Spaces](https://img.shields.io/badge/🤗%20HuggingFace-Space-yellow)](https://huggingface.co/spaces/fikri0o0/demand-forecasting)
![License](https://img.shields.io/badge/license-MIT-green)

End-to-end retail demand forecasting pipeline. Compares **5 approaches** from naive baseline to Amazon Chronos-2 (2025 SOTA foundation model), with probabilistic prediction intervals, MLflow tracking, and a live Gradio demo.

**[Live Demo →](https://huggingface.co/spaces/fikri0o0/demand-forecasting)** | **[GitHub →](https://github.com/Fikri645/demand-forecasting)**

---

## Highlights

| What | Detail |
|---|---|
| **Dataset** | Store Sales (Corporación Favorita): 54 stores, 33 families, 4.5 years + oil price + holidays |
| **Models** | Seasonal Naive → AutoARIMA → LightGBM → Amazon Chronos-2 → Ensemble |
| **Best model** | Ensemble (LightGBM-Optuna + Chronos-ft): RMSLE **0.1610**, MASE **0.835** |
| **Fine-tuning** | Chronos-2: zero-shot 0.2040 → 1000-step fine-tune **0.1690** → ensemble **0.1610** |
| **Key insight** | Ensemble wins only when both components are strong; fine-tuning Chronos was the unlock |
| **Prediction intervals** | 80% + 90% bands via conformal prediction |
| **Metric** | RMSLE: penalises under-forecasting (stockout > overstock in cost) |
| **Experiment tracking** | MLflow: all model runs logged |
| **API** | FastAPI `/forecast` endpoint |
| **UI** | Gradio: interactive 28-day forecast chart |
| **Deployment** | HuggingFace Spaces |

---

## Architecture

```
Store Sales CSV (Kaggle) / M5 fallback (datasetsforecast)
    └─► data_loader.py       (load, fill date gaps, train/test split)
         └─► features.py     (lag, rolling, calendar, oil price, holiday features)
              ├─► train_lgbm.py     (LightGBM via mlforecast + MLflow)
              ├─► train_chronos.py  (Chronos-2 zero-shot; no training, requires GPU)
              └─► experiments.py    (5-model comparison -> model_meta.json)
                   └─► evaluate.py  (forecast plots, metrics comparison)
                        ├─► api/main.py        (FastAPI /forecast)
                        └─► app/gradio_app.py  (HF Spaces UI)
```

---

## Quickstart

```bash
# 1. Clone & install
git clone https://github.com/Fikri645/demand-forecasting
cd demand-forecasting
pip install -r requirements-dev.txt

# 2a. (Option A) Download Store Sales from Kaggle, put the zip in data/raw/, then:
python scripts/download_data.py

# 2b. (Option B) Auto-download M5 via datasetsforecast (no Kaggle needed)
# Just run the script; it falls back to M5 automatically
python scripts/download_data.py

# 3. Run full experiment (5 models + MLflow logging)
python -m src.experiments

# 4. Generate evaluation plots
python -m src.evaluate

# 5. Run API locally
uvicorn api.main:app --reload

# 6. Run Gradio UI
python app/gradio_app.py
```

Or via `make`:

```bash
make install && make data && make experiments && make evaluate
```

---

## Project Structure

```
demand-forecasting/
├── data/processed/          # train.parquet, test.parquet
├── src/
│   ├── config.py            # paths, constants
│   ├── data_loader.py       # Store Sales (Favorita) loading + gap fill + M5 fallback
│   ├── features.py          # lag, rolling, calendar feature engineering
│   ├── metrics.py           # RMSE, MAE, RMSLE, MASE, coverage
│   ├── train_lgbm.py        # LightGBM via mlforecast
│   ├── train_chronos.py     # Amazon Chronos-2 (zero-shot)
│   ├── experiments.py       # 5-model comparison + MLflow
│   └── evaluate.py          # forecast + comparison plots
├── api/main.py              # FastAPI /forecast endpoint
├── app/gradio_app.py        # Gradio UI (HF Spaces)
├── notebooks/01_eda.ipynb   # Exploratory Data Analysis
├── tests/                   # pytest (metrics, features, API schemas)
├── Makefile
└── requirements-dev.txt
```

---

## Dataset: Store Sales (Corporación Favorita)

The **Store Sales - Time Series Forecasting** competition (Kaggle) uses real data from Ecuador's largest grocery chain:

- **54 stores**, 33 product families, daily unit sales
- **4.5 years**: 2013-01-01 to 2017-08-15 (1,684 days)
- External features: **oil price** (Ecuador is oil-dependent, so economic shocks affect spending), **national/regional holidays**, **promotions**
- The portfolio uses the top 300 series by total volume

Source: [Kaggle Store Sales Competition](https://www.kaggle.com/competitions/store-sales-time-series-forecasting)

> M5 (Walmart, via `datasetsforecast`) is available as an automatic fallback if the CSV is not present.

---

## Model Details

### Seasonal Naive (baseline)

Forecast = same weekday last week. Any real model must beat this.

### AutoARIMA

`statsforecast` AutoARIMA with weekly seasonality. Automatic order selection via AIC.

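
For reference, the Seasonal Naive baseline described above ("same weekday last week") can be sketched in a few lines. This is only an illustration, assuming a plain list of daily sales and a 7-day season; the name `seasonal_naive` and its parameters are hypothetical and not taken from the repository.

```python
from typing import List, Sequence


def seasonal_naive(history: Sequence[float], horizon: int = 28,
                   season_length: int = 7) -> List[float]:
    """Repeat the last observed season across the forecast horizon."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = list(history[-season_length:])
    # Day i of the forecast copies the value observed on the same
    # weekday in the most recent week of history.
    return [last_season[i % season_length] for i in range(horizon)]


if __name__ == "__main__":
    # Two weeks of made-up daily sales (illustrative data only).
    two_weeks = [10, 12, 11, 13, 20, 30, 25,
                 11, 13, 12, 14, 22, 33, 27]
    print(seasonal_naive(two_weeks, horizon=10))
```

Because the 28-day horizon is a multiple of 7, this rule simply tiles the last observed week four times.
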
### LightGBM + Feature Engineering

`mlforecast` with automatic lag generation:

- **Lags**: t-7, t-14, t-21, t-28, t-35, t-42, t-56, t-364 (same weekday one year earlier, 52 × 7 days)
- **Rolling**: 7-day and 28-day mean, std, max per series
- **Calendar**: day-of-week, month, quarter, is-weekend, month-start/end
- **Price**: normalised sell price, price change %
- **External**: oil price, promotion flag, holiday flag (Store Sales specific)

### Amazon Chronos-2 (2025 SOTA)

Zero-shot foundation model: no training data needed. Loads pre-trained weights (`amazon/chronos-t5-small`, 46M params) from HuggingFace. Generates 100 probabilistic sample paths, summarised as P10/P50/P90 quantiles.

> Chronos-2 (Oct 2025) natively supports cross-series dependencies, exogenous features, and multivariate forecasting. Its zero-shot performance is reported to be competitive with fully supervised models.

**Requirements:** Chronos needs PyTorch with CUDA and sufficient virtual memory (page file >= 8GB on Windows). Run `python -m src.train_chronos` after increasing virtual memory. The code is complete and ready to run.

### Ensemble

Weighted average: LightGBM × 0.6 + Chronos × 0.4. Combines domain-feature awareness with temporal pattern recognition. The best result in the Results table uses a 0.5/0.5 blend with fine-tuned Chronos. Run `python -m src.experiments` once Chronos is available.

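
The weighted average above amounts to an element-wise blend of the two point forecasts. A minimal sketch, assuming both models produce forecasts of equal length for the same series; `blend_forecasts` and `w_lgbm` are illustrative names, not the repository's API.

```python
from typing import List, Sequence


def blend_forecasts(lgbm: Sequence[float], chronos: Sequence[float],
                    w_lgbm: float = 0.6) -> List[float]:
    """Element-wise weighted average of two point forecasts."""
    if len(lgbm) != len(chronos):
        raise ValueError("forecasts must cover the same horizon")
    if not 0.0 <= w_lgbm <= 1.0:
        raise ValueError("w_lgbm must lie in [0, 1]")
    w_chronos = 1.0 - w_lgbm  # weights sum to 1 so the scale is preserved
    return [w_lgbm * a + w_chronos * b for a, b in zip(lgbm, chronos)]


if __name__ == "__main__":
    # Made-up 3-day forecasts (illustrative data only).
    lgbm_pred = [100.0, 120.0, 90.0]
    chronos_pred = [110.0, 100.0, 95.0]
    print(blend_forecasts(lgbm_pred, chronos_pred))              # 0.6 / 0.4
    print(blend_forecasts(lgbm_pred, chronos_pred, w_lgbm=0.5))  # 0.5 / 0.5
```

Keeping the weight as a single parameter makes it easy to compare the 0.6/0.4 and 0.5/0.5 settings on a validation window before fixing one.
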

---

## Results: 28-Day Forecast on Store Sales (300 series)

| Model | RMSLE | MASE | Notes |
|---|---|---|---|
| Seasonal Naive | 0.2145 | 1.109 | Benchmark floor |
| AutoARIMA | 0.2105 | 1.121 | Worse than naive on MASE, marginally better on RMSLE |
| Chronos-2 (zero-shot) | 0.2040 | 1.038 | Beats AutoARIMA with zero training |
| LightGBM (default) | 0.1672 | 0.877 | Strong baseline |
| LightGBM (Optuna, 50 trials) | 0.1671 | 0.880 | Marginal gain; default was already good |
| Chronos-2 (fine-tuned, 1000 steps) | 0.1690 | 0.863 | 17.2% lower RMSLE than zero-shot |
| Chronos-2 (extended, 3000 steps) | 0.1688 | 0.863 | Converged at ~1000 steps |
| **Ensemble (LGB-Optuna × 0.5 + Chronos-ft × 0.5)** | **0.1610** | **0.835** | **🏆 Best: 25% lower RMSLE than naive** |

**Key findings:**

- **Ensemble wins, but only when both components are strong.** Zero-shot Chronos dragged the first ensemble down. Once Chronos was fine-tuned, a 50/50 ensemble cut RMSLE to 0.1610, 3.7% better than the best single model (LightGBM-Optuna, 0.1671).
- **LightGBM was already near-optimal.** 50 Optuna trials improved RMSLE by only 0.0001; the default hyperparameters were well-calibrated. Lesson: HPO has diminishing returns when the model class fits the data well.
- **Chronos converges fast.** The jump from zero-shot (0.2040) to 1000 steps (0.1690) is large; going from 1000 to 3000 steps gains only 0.0002 more. Pre-training provides a warm start that requires relatively few gradient updates.
- **Foundation models + feature engineering are complementary.** Chronos captures long-range temporal patterns; LightGBM captures domain features (oil price, promotions, day-of-week). Neither alone beats the combination.

---

## Why RMSLE?

In retail, **running out of stock costs more than overstock**. RMSLE operates in log-space, which:

1. Penalises under-forecasting more than over-forecasting for the same absolute error
2. Gives equal relative weight to low-volume and high-volume SKUs
3. Aligns the metric with actual business cost structure

---

## What I Learned

- **Ensemble wins only when both components are competitive.** Blending zero-shot Chronos (RMSLE 0.2040) with LightGBM (0.1672) gave 0.1722, worse than LightGBM alone. Blending fine-tuned Chronos (0.1690) with LightGBM-Optuna (0.1671) gave **0.1610**, the new best. The lesson: fix the weaker model first, then ensemble.
- **LightGBM is already near-optimal with default hyperparameters.** 50 Optuna trials improved RMSLE by only 0.0001. When the model class fits the data well, HPO has diminishing returns.
- **Foundation models converge fast from pre-training.** Zero-shot → 1000 steps: RMSLE drops by 0.035. 1000 → 3000 steps: only 0.0002. Pre-training on diverse time series provides a warm start, and nearly all of the adaptation happens within the first 1000 steps.
- **Chronos + LightGBM are complementary.** Chronos captures long-range temporal structure and seasonal patterns; LightGBM captures domain features (oil price, promotions, day-of-week). The ensemble gain suggests their errors are only weakly correlated.
- **AutoARIMA fails on complex retail.** Its MASE of 1.121 is worse than seasonal naive (1.109). Lag features + calendar + oil price give tree models context that ARIMA's linear structure can't model.
- **MASE < 1.0 is the real bar.** Only LightGBM, fine-tuned Chronos, and their ensemble clear it. AutoARIMA (1.121) and zero-shot Chronos (1.038) both stay above 1.0.
- **lag_364 (same weekday one year earlier) is critical.** Annual cycles in retail (back-to-school, holidays, oil price cycles) are only captured by a roughly 1-year lag; shorter lags miss this entirely.
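
To make the two headline metrics concrete, here is a minimal sketch of RMSLE and MASE under their standard textbook definitions. It assumes MASE is scaled by the in-sample seasonal naive error with a 7-day season; the project's own `src/metrics.py` may differ in detail, and the function signatures below are illustrative.

```python
import math
from typing import Sequence


def rmsle(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Root mean squared error between log1p(forecast) and log1p(actual)."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    sq = [(math.log1p(f) - math.log1p(a)) ** 2
          for a, f in zip(actual, forecast)]
    return math.sqrt(sum(sq) / len(sq))


def mase(actual: Sequence[float], forecast: Sequence[float],
         train: Sequence[float], season_length: int = 7) -> float:
    """Forecast MAE divided by the in-sample seasonal naive MAE.

    Assumption: the scale uses a seasonal naive with `season_length` = 7.
    """
    if len(train) <= season_length:
        raise ValueError("train must be longer than one season")
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    naive_err = [abs(train[t] - train[t - season_length])
                 for t in range(season_length, len(train))]
    scale = sum(naive_err) / len(naive_err)
    if scale == 0:
        raise ValueError("seasonal naive error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


if __name__ == "__main__":
    # Same absolute error of 50 units, opposite directions.
    print("under-forecast:", rmsle([100], [50]))
    print("over-forecast: ", rmsle([100], [150]))
```

With the same absolute error of 50 units, the under-forecast gets the larger RMSLE, which is the asymmetry the "Why RMSLE?" section relies on. A MASE below 1.0 means the forecast's error is smaller than the seasonal naive error measured on the training history.
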