# Toto 2.0 Family and Friends — GIFT-Eval artifact bundle
Pre-computed artifacts for replicating the Toto 2.0 Family and Friends (short form: Toto-2.0-FnF) submission to the GIFT-Eval benchmark. The ensemble is an FFORMA-style (Montero-Manso et al., International Journal of Forecasting, 2020) meta-learner that gates a pool of foundation models on a per-(frequency, term) bucket basis using XGBoost over time-series features.
The replication notebook lives in the GIFT-Eval repo at `notebooks/toto_2_0_fnf.ipynb`.
## ✨ Key Features

- **Per-bucket gating:** a separate XGBoost head per `(frequency, term)` bucket — each bucket learns its own softmax over the model pool, so the ensemble can specialize without one global gate trading off across regimes.
- **No retraining at inference:** the bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
- **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
## 🧩 Model pool
The meta-learner outputs softmax weights over 10 foundation models (column order matters — it is tied to the booster's class indices):
| # | Model | Family |
|---|-------|--------|
| 0 | `chronos-2` | Chronos |
| 1 | `timesfm-2.5` | TimesFM |
| 2 | `flowstate` | FlowState |
| 3 | `tirex` | TiRex |
| 4 | `patchtst-fm` | PatchTST |
| 5 | `toto-2.0-4m` | Toto 2.0 |
| 6 | `toto-2.0-22m` | Toto 2.0 |
| 7 | `toto-2.0-313m` | Toto 2.0 |
| 8 | `toto-2.0-1b` | Toto 2.0 |
| 9 | `toto-2.0-2.5b` | Toto 2.0 |
## 📦 Bundle layout

```
booster_manifest.json                  ~4.8 GB — base64-encoded XGBoost boosters keyed by "<canonical_freq>|<term>"
feature_columns.json                   train-time column order expected by the booster
feature_types.json                     XGBoost feature_types ("c" = categorical, "q" = float)
categories.json                        {"freq": [...], "domain": [...]} train-time category vocabularies
models.json                            list of model names in column order (column index ↔ model)
test_features/<ds_dirname>/
    test_features.npz                  (n_windows, n_tsfeatures) tsfeatures from the lookback context preceding each window
    test_metadata.npz                  dataset-level scalars only (seasonality, prediction_length, num_variates, freq, domain)
test_predictions/<model>/<ds_dirname>/
    test_predictions.npz               (n_windows, 9, prediction_length) quantile forecasts at QUANTILE_LEVELS = [0.1, ..., 0.9]
```

`ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
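As a minimal illustration of the naming convention (the helper name `ds_dirname` is hypothetical, chosen to match the placeholder in the layout above):

```python
def ds_dirname(pretty_name: str, freq: str, term: str) -> str:
    # GIFT-Eval canonical directory name: <pretty_name>_<freq>_<term>
    return f"{pretty_name}_{freq}_{term}"

# The M4 weekly short-horizon split:
assert ds_dirname("m4_weekly", "W", "short") == "m4_weekly_W_short"
```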
## ⚡ How the booster is used

Per `(dataset, term)`:

- Load `test_features.npz` and `test_metadata.npz`. Reindex the tsfeatures to `feature_columns.json` — columns missing in this dataset's tsfeatures (e.g. `seasonal_strength` on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (`seasonality`, `prediction_length`, `num_variates`) and categorical features (`freq`, `domain`) using the train-time categorical vocabularies in `categories.json`. The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.
- Look up the bucket booster for `(canonical_freq, term)`, where `canonical_freq` strips pandas anchor suffixes (`W-TUE` → `W`, `Q-DEC` → `Q`). `booster.predict(..., output_margin=True)` returns raw class logits of shape `(n_windows, 10)`; softmax over the model axis gives the per-window weights.
- Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weighted-sum across the model axis → final quantile forecast.
- Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
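The gating arithmetic in the steps above can be sketched with numpy alone. This is a minimal sketch on synthetic shapes: `canonical_freq` and `gate_and_blend` are hypothetical helper names, and in the real pipeline the logits come from the bucket booster rather than a random generator.

```python
import numpy as np

def canonical_freq(freq: str) -> str:
    # Strip pandas anchor suffixes: "W-TUE" -> "W", "Q-DEC" -> "Q".
    return freq.split("-")[0]

def gate_and_blend(logits: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """Blend per-model quantile forecasts with per-window softmax weights.

    logits: (n_windows, n_models) raw margins, as returned by
            booster.predict(..., output_margin=True)
    preds:  (n_windows, n_models, n_quantiles, prediction_length)
    returns (n_windows, n_quantiles, prediction_length)
    """
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)               # per-window model weights
    return np.einsum("wm,wmqp->wqp", w, preds)      # weighted sum over models

# Bundle shapes: 10 models, 9 quantile levels; dummy 4 windows, horizon 12.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
preds = rng.normal(size=(4, 10, 9, 12))
final = gate_and_blend(logits, preds)
assert canonical_freq("W-TUE") == "W"
assert final.shape == (4, 9, 12)
```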
## 🔁 Reproducing from scratch

Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of passing them straight to `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the `tsfeatures` library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
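A save wrapper along these lines can round-trip the per-window forecasts; this is a hedged sketch, not the notebook's actual "Optional B" code, and the npz key name `predictions` is an assumption of this sketch:

```python
import os
import tempfile

import numpy as np

QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def save_window_forecasts(path: str, quantile_forecasts) -> None:
    # Persist a (n_windows, 9, prediction_length) quantile array to .npz.
    arr = np.asarray(quantile_forecasts, dtype=np.float32)
    assert arr.ndim == 3 and arr.shape[1] == len(QUANTILE_LEVELS)
    np.savez_compressed(path, predictions=arr)

# Round-trip check on a dummy (2 windows, 9 quantiles, horizon 8) array.
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "test_predictions.npz")
    save_window_forecasts(p, np.zeros((2, 9, 8)))
    assert np.load(p)["predictions"].shape == (2, 9, 8)
```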
## 🔗 Additional Resources
- GIFT-Eval benchmark — leaderboard hosting this submission
- Replication notebook — fast-path scoring + optional regeneration of every artifact in this bundle
- Toto 2.0 family — base Toto checkpoints (4M → 2.5B)
- Toto GitHub repository — Toto 2.0 source code
- BOOM dataset — Datadog's observability time-series benchmark
- Datadog blog post — Toto 2.0 announcement
## 📖 Citation
(citation coming soon)
## 📝 License
Apache 2.0. Each base model retains its original license — see the linked HF repos in the model pool table.