arxiv:2605.20119

Toto 2.0: Time Series Forecasting Enters the Scaling Era

Published on May 19

Submitted by Emaad Khwaja on May 21

Datadog

Authors:

Emaad Khwaja

Abstract

AI-generated summary: Time series foundation models demonstrate scalable forecasting performance across parameter sizes, with Toto 2.0 achieving state-of-the-art results on multiple benchmarks through a unified training approach.

We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.

Community

Emaad (paper author, paper submitter), about 12 hours ago

Toto 2.0 is designed to answer a simple and open question: Can time series foundation models (TSFMs) improve as they scale?

Our results show they can. The highlights:

Scaling that works. Every size improves on the one below it, with no sign of saturation at 2.5B.
Best in class on every benchmark we tested. Toto 2.0 takes the top spots on BOOM (Datadog's observability forecasting benchmark), GIFT-Eval (the standard general-purpose benchmark), and TIME (a new contamination-resistant zero-shot benchmark).
A generational jump from Toto 1.0. At matched forecast quality, Toto 2.0 is 7× more parameter-efficient than Toto 1.0, and it is dramatically faster at inference.
Trained on observability and synthetic data, generalizes broadly. Toto 2.0 does not see any public forecasting data during pretraining, yet leads the field on general-purpose benchmarks.