# Toto 2.0 Family and Friends — GIFT-Eval artifact bundle
Pre-computed artifacts for replicating the Toto 2.0 Family and Friends (short form: Toto-2.0-FnF) submission to the GIFT-Eval benchmark. The ensemble is an FFORMA-style (Montero-Manso et al., International Journal of Forecasting, 2020) meta-learner that gates a pool of foundation models on a per-(frequency, term) bucket basis using XGBoost over time-series features.
The replication notebook lives in the GIFT-Eval repo at `notebooks/toto_2_0_fnf.ipynb`.
## ✨ Key Features

- **Per-bucket gating:** a separate XGBoost head per `(frequency, term)` bucket — each bucket learns its own softmax over the model pool, so the ensemble can specialize without one global gate trading off across regimes.
- **No retraining at inference:** the bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
- **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
## 🧩 Model pool
The meta-learner outputs softmax weights over 10 foundation models (column order matters — it is tied to the booster's class indices):
| # | Model | Family |
|---|-------|--------|
| 0 | `chronos-2` | Chronos |
| 1 | `timesfm-2.5` | TimesFM |
| 2 | `flowstate` | FlowState |
| 3 | `tirex` | TiRex |
| 4 | `patchtst-fm` | PatchTST |
| 5 | `toto-2.0-4m` | Toto 2.0 |
| 6 | `toto-2.0-22m` | Toto 2.0 |
| 7 | `toto-2.0-313m` | Toto 2.0 |
| 8 | `toto-2.0-1b` | Toto 2.0 |
| 9 | `toto-2.0-2.5b` | Toto 2.0 |
## 📦 Bundle layout

```
booster_manifest.json                  ~4.8 GB — base64-encoded XGBoost boosters keyed by "<canonical_freq>|<term>"
feature_columns.json                   train-time column order expected by the booster
feature_types.json                     XGBoost feature_types ("c" = categorical, "q" = float)
categories.json                        {"freq": [...], "domain": [...]} train-time category vocabularies
models.json                            list of model names in column order (column index ↔ model)
test_features/<ds_dirname>/
    test_features.npz                  (n_windows, n_tsfeatures) tsfeatures from the lookback context preceding each window
    test_metadata.npz                  dataset-level scalars only (seasonality, prediction_length, num_variates, freq, domain)
test_predictions/<model>/<ds_dirname>/
    test_predictions.npz               (n_windows, 9, prediction_length) quantile forecasts at QUANTILE_LEVELS = [0.1, ..., 0.9]
```

`ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
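As a minimal illustration of the naming convention (the helper name `ds_dirname` is hypothetical, chosen to match the placeholder in the layout above):

```python
def ds_dirname(pretty_name: str, freq: str, term: str) -> str:
    # GIFT-Eval canonical directory name: <pretty_name>_<freq>_<term>
    return f"{pretty_name}_{freq}_{term}"

# The M4 weekly short-horizon split:
assert ds_dirname("m4_weekly", "W", "short") == "m4_weekly_W_short"
```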
## ⚡ How the booster is used

Per `(dataset, term)`:

- Load `test_features.npz` and `test_metadata.npz`. Reindex the tsfeatures to `feature_columns.json` — columns missing in this dataset's tsfeatures (e.g. `seasonal_strength` on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (`seasonality`, `prediction_length`, `num_variates`) and categorical features (`freq`, `domain`) using the train-time categorical vocabularies in `categories.json`. The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.
- Look up the bucket booster for `(canonical_freq, term)`, where `canonical_freq` strips pandas anchor suffixes (`W-TUE` → `W`, `Q-DEC` → `Q`). `booster.predict(..., output_margin=True)` returns raw class logits of shape `(n_windows, 10)`; softmax over the model axis gives the per-window weights.
- Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weighted-sum across the model axis → final quantile forecast.
- Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
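The gating arithmetic in the steps above can be sketched with numpy alone. This is a minimal sketch on synthetic shapes: `canonical_freq` and `gate_and_blend` are hypothetical helper names, and in the real pipeline the logits come from the bucket booster rather than a random generator.

```python
import numpy as np

def canonical_freq(freq: str) -> str:
    # Strip pandas anchor suffixes: "W-TUE" -> "W", "Q-DEC" -> "Q".
    return freq.split("-")[0]

def gate_and_blend(logits: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """Blend per-model quantile forecasts with per-window softmax weights.

    logits: (n_windows, n_models) raw margins, as returned by
            booster.predict(..., output_margin=True)
    preds:  (n_windows, n_models, n_quantiles, prediction_length)
    returns (n_windows, n_quantiles, prediction_length)
    """
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)               # per-window model weights
    return np.einsum("wm,wmqp->wqp", w, preds)      # weighted sum over models

# Bundle shapes: 10 models, 9 quantile levels; dummy 4 windows, horizon 12.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
preds = rng.normal(size=(4, 10, 9, 12))
final = gate_and_blend(logits, preds)
assert canonical_freq("W-TUE") == "W"
assert final.shape == (4, 9, 12)
```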
## 🔁 Reproducing from scratch

Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of passing them straight to `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the `tsfeatures` library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
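A save wrapper along these lines can round-trip the per-window forecasts; this is a hedged sketch, not the notebook's actual "Optional B" code, and the npz key name `predictions` is an assumption of this sketch:

```python
import os
import tempfile

import numpy as np

QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def save_window_forecasts(path: str, quantile_forecasts) -> None:
    # Persist a (n_windows, 9, prediction_length) quantile array to .npz.
    arr = np.asarray(quantile_forecasts, dtype=np.float32)
    assert arr.ndim == 3 and arr.shape[1] == len(QUANTILE_LEVELS)
    np.savez_compressed(path, predictions=arr)

# Round-trip check on a dummy (2 windows, 9 quantiles, horizon 8) array.
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "test_predictions.npz")
    save_window_forecasts(p, np.zeros((2, 9, 8)))
    assert np.load(p)["predictions"].shape == (2, 9, 8)
```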
## 🔗 Additional Resources
- GIFT-Eval benchmark — leaderboard hosting this submission
- Replication notebook — fast-path scoring + optional regeneration of every artifact in this bundle
- Toto 2.0 family — base Toto checkpoints (4M → 2.5B)
- Toto GitHub repository — Toto 2.0 source code
- BOOM dataset — Datadog's observability time-series benchmark
- Datadog blog post — Toto 2.0 announcement
## 📖 Citation
(citation coming soon)
## 📝 License
Apache 2.0. Each base model retains its original license — see the linked HF repos in the model pool table.