razvan
/

timesfm-analysis

ml-intern

razvan committed 4 days ago

Commit

db0f6e8

1 Parent(s): 1f32aed

Add comprehensive TimesFM analysis

Files changed (1)

timesfm_analysis.md +166 -0

	@@ -0,0 +1,166 @@

+# TimesFM: What It Is, What It's Good For, What It's Bad At, and How to Decide
+## The Architecture in 30 Seconds
+TimesFM is a **decoder-only transformer where the "tokens" are patches of real-valued time series**, not discrete words. Think GPT, but instead of predicting the next word, it predicts the next chunk of a time series.
+| Design Choice | Detail | Implication |
+|---|---|---|
+| **Tokenization** | Non-overlapping patches of 32 time points → MLP → 1280-dim embedding | Continuous values, no quantization. Contrast with Chronos, which discretizes values into bins. |
+| **Asymmetric patches** | Input patch = 32 pts, Output patch = 128 pts | Generates 128 future steps per autoregressive step → fast inference, but coarse for short series |
+| **Attention** | Causal self-attention (each patch sees only past patches) | Standard GPT-style. No cross-series attention. |
+| **Normalization** | RevIN (instance normalization by context mean/std) | Shallow stationarization. Does not handle structural breaks. |
+| **Loss** | Quantile (pinball) regression | Produces point forecasts + experimental quantile heads. **Quantiles are NOT calibrated post-training.** |
+| **Size** | v1.0: 200M params (20 layers) / v2.0: 500M (50 layers) / v2.5: 200M (HF-native) | Runs on a single GPU. Lightweight by LLM standards. |
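To make the patch arithmetic in the table concrete, here is a minimal sketch in plain Python. It is not the TimesFM implementation: the helper names are hypothetical, and dropping the oldest points that do not fill a whole patch is an assumption made for simplicity (the real model handles partial patches differently).

```python
import math

INPUT_PATCH_LEN = 32    # points per input patch (from the table above)
OUTPUT_PATCH_LEN = 128  # points emitted per decoding step (from the table above)


def to_patches(series, patch_len=INPUT_PATCH_LEN):
    """Split a series into non-overlapping patches, keeping the most recent points.

    Assumption: the oldest points that do not fill a whole patch are dropped.
    """
    n_patches = len(series) // patch_len
    start = len(series) - n_patches * patch_len
    return [series[i:i + patch_len] for i in range(start, len(series), patch_len)]


def decoding_steps(horizon, output_patch_len=OUTPUT_PATCH_LEN):
    """Number of autoregressive steps needed to cover a forecast horizon."""
    return math.ceil(horizon / output_patch_len)


context = list(range(500))            # 500 observations of a toy series
patches = to_patches(context)
print(len(patches), len(patches[0]))  # number of patches, points per patch
print(decoding_steps(24))             # short horizon still costs one full step
print(decoding_steps(512))            # long horizon needs only a few steps
```

A 24-step forecast still consumes a full 128-point output patch, which is the sense in which the asymmetric design is "coarse" for short horizons.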
+---
+## What It Was Trained On — And Why This Matters More Than Architecture
+| Source | Scale | Character |
+|---|---|---|
+| **Wikipedia page views** | ~374 **billion** time points | Smooth, trend + weekly/annual seasonality |
+| **Synthetic (ARMA + sines + trends)** | ~6B time points | Textbook stationary processes |
+| **Google Trends** | ~537M | Search interest curves — smooth, seasonal |
+| **M4 competition** | ~23M | Mixed-frequency competition data |
+| **Electricity, traffic, weather, retail** | ~200M combined | Real-world but tiny fraction of training mix |
+**The training distribution is overwhelmingly smooth consumer-web engagement data.** This is the single most important fact for deciding if TimesFM will work for you. The model has an extraordinary prior on trend-plus-seasonality patterns. It has a weak prior on everything else.
+v2.0 added subsets of the LOTSA corpus (cloud VMs, ERA5 weather, buildings energy), broadening domain coverage — but financial return data, clinical vitals, industrial sensor anomalies, and log-scale series remain **absent from pretraining**.
+---
+## Where It's Genuinely Good
+### 1. Zero-shot univariate forecasting at scale
+You have 50,000 SKUs, each with 2 years of daily sales. You need forecasts by tomorrow. No time to train. TimesFM gives you a competitive point forecast in seconds per series, with zero training. On the GIFT-Eval benchmark (the most rigorous, contamination-controlled evaluation available), **TimesFM ranks #1 in the Economic/Finance domain** for point forecasts.
+### 2. Rapid prototyping / establishing a baseline
+Before committing to a supervised model, run TimesFM zero-shot. If it's already 90% as good as your production model, that suggests your engineered features add little beyond what the series' own history already contains. If it's terrible, that tells you the series distribution is far from its training corpus.
+### 3. Medium-horizon, hourly-to-daily granularity
+Its training data density is highest at hourly/daily frequencies. Forecast horizons of 1–512 steps in this regime are its sweet spot.
+### 4. Fine-tuning with small domain data
+Standard transformer → LoRA adapters work. With 1k–10k domain series, fine-tuning is well-documented and significantly boosts performance over zero-shot (demonstrated in medical, energy, and retail domains).
+### 5. Compute-constrained deployment
+200M params, Apache 2.0 license, v2.5 ships natively in `transformers`. Inference is fast. This is a real engineering advantage over Moirai-Large or Chronos-Large.
+---
+## Where It Fails — The Hard Constraints
+### ❌ 1. Strictly Univariate — No Cross-Series Modeling
+Each series is processed independently. There is **zero cross-asset attention**. No covariates (holidays, promotions, macro factors). The paper explicitly states: *"Currently the model is not pretrained with covariates."*
+**For a quant**: If your signal comes from cross-sectional relationships (pairs trading, factor models, lead-lag), TimesFM cannot see them. Period.
+### ❌ 2. Uncalibrated Uncertainty
+The quantile heads are experimental and the model card explicitly warns they are *"not calibrated after pretraining."* If you need VaR, prediction intervals for safety stock, or any decision that depends on the tails of the forecast distribution — **do not trust TimesFM's uncertainty estimates**. Use Chronos (cross-entropy loss over discrete bins → naturally calibrated) or add a conformal prediction wrapper post-hoc.
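The conformal wrapper mentioned above can be sketched in a few lines. This is a minimal split-conformal example under stated assumptions: the "forecasts" are synthetic stand-ins rather than TimesFM output, the function names are hypothetical, and the coverage guarantee relies on calibration and test errors being exchangeable, which time series often violate.

```python
import math
import random


def conformal_halfwidth(calib_actuals, calib_forecasts, alpha=0.1):
    """Half-width q so that [forecast - q, forecast + q] targets 1 - alpha coverage."""
    scores = sorted(abs(y - f) for y, f in zip(calib_actuals, calib_forecasts))
    n = len(scores)
    # Finite-sample rank used by split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[rank - 1]


def coverage(actuals, forecasts, halfwidth):
    """Fraction of actuals that fall inside the symmetric interval."""
    hits = sum(abs(y - f) <= halfwidth for y, f in zip(actuals, forecasts))
    return hits / len(actuals)


rng = random.Random(0)
# Stand-in data: "forecasts" are the truth plus Gaussian error.
truth = [rng.gauss(100, 10) for _ in range(2000)]
preds = [y + rng.gauss(0, 5) for y in truth]

q = conformal_halfwidth(truth[:1000], preds[:1000], alpha=0.1)
held_out_coverage = coverage(truth[1000:], preds[1000:], q)
print(f"half-width: {q:.2f}")
print(f"held-out coverage: {held_out_coverage:.3f}")
```

The same `coverage` check, applied to the model's own quantile outputs on held-out data, is a quick way to see how far off the uncalibrated heads are for your series.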
+### ❌ 3. Financial Return Prediction
+The most damning evidence: a comprehensive 2025 study ([arxiv:2511.18578](https://arxiv.org/abs/2511.18578)) tested zero-shot TimesFM on daily excess-return prediction across **94 countries, 1990–2023**. Result: **underperforms XGBoost, CatBoost, and LightGBM** on R²_OOS, directional accuracy, and F1. Fine-tuning helps modestly but still doesn't beat tree ensembles. Pre-training from scratch on financial data helps significantly — confirming the domain mismatch hypothesis.
+**Translation**: The model learned "how Wikipedia page views behave." Financial returns are a fundamentally different DGP — low SNR, heavy tails, regime switches, non-stationarity. The pretrained prior doesn't help and may actively hurt.
+### ❌ 4. Structural Breaks / Regime Change
+RevIN normalizes by context mean and std. If the DGP changes mid-context (COVID shock, rate regime change, mean reversion → trending), the model treats pre- and post-break data uniformly. There is no mechanism for break detection or regime-conditional forecasting.
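A toy illustration of why whole-context normalization struggles with a level shift. This is a sketch of plain instance normalization, assumed here to be a reasonable proxy for the model's scaling step, not code taken from TimesFM.

```python
import statistics


def instance_normalize(context):
    """Standardize a context window by its own mean and standard deviation."""
    mu = statistics.fmean(context)
    sigma = statistics.pstdev(context) or 1.0  # guard against constant series
    return [(x - mu) / sigma for x in context], mu, sigma


# Toy context: a level of 10 for 200 steps, then a jump to 50 for 56 steps.
context = [10.0] * 200 + [50.0] * 56
normalized, mu, sigma = instance_normalize(context)

print(f"context mean = {mu:.2f}, std = {sigma:.2f}")
print(f"latest normalized value = {normalized[-1]:.2f}")
```

The context mean lands at 18.75, a level the series never actually occupies, and the standard deviation is driven almost entirely by the jump rather than by within-regime noise.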
+### ❌ 5. Context Window at High Frequencies
+- v1.0: max 512 points, v2.0: max 2048 points
+- At 1-second resolution: 2048 points ≈ **34 minutes of history**
+- At tick-level: essentially useless
+- Even at 1-minute bars: 2048 points ≈ 34 hours, i.e. about 1.4 calendar days of a 24-hour market or roughly 5 regular 6.5-hour equity sessions
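The arithmetic behind these figures is easy to reproduce for your own sampling rate. A small sketch follows; the 6.5-hour session length is an assumption (a US-equity-style regular session), so adjust it for your market.

```python
MAX_CONTEXT = {"v1.0": 512, "v2.0": 2048}  # points, as listed above
SECONDS_PER_BAR = {"1-second": 1, "1-minute": 60, "1-hour": 3600, "1-day": 86400}
REGULAR_SESSION_HOURS = 6.5  # assumption: US-equity-style regular trading session


def history_covered_hours(n_points, seconds_per_bar):
    """Wall-clock hours spanned by n_points consecutive bars."""
    return n_points * seconds_per_bar / 3600


for name, sec in SECONDS_PER_BAR.items():
    hours = history_covered_hours(MAX_CONTEXT["v2.0"], sec)
    print(f"{name:>9} bars: {hours:10.2f} hours "
          f"({hours / 24:8.2f} calendar days, "
          f"{hours / REGULAR_SESSION_HOURS:8.2f} sessions)")
```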
+### ❌ 6. Irregular / Event-Driven Series
+v2.0 handles NaN via linear interpolation before inference — a crude band-aid. Transaction data with variable gaps, clinical vitals with missing readings, event-triggered sensor logs all need careful preprocessing that may destroy the signal.
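To see what linear gap-filling does to a signal, here is a minimal stand-alone sketch. It is not the TimesFM preprocessing code; in particular, holding the nearest value at the edges is an assumption made for illustration.

```python
import math


def interpolate_gaps(values):
    """Fill NaN gaps linearly; leading/trailing gaps copy the nearest observation."""
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    # Edge gaps: no second anchor point, so hold the nearest value.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: straight line between neighbouring observations.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


nan = float("nan")
# Whatever happened at indices 2-4 went unrecorded; interpolation flattens it.
series = [1.0, 2.0, nan, nan, nan, 2.0, 1.0]
print(interpolate_gaps(series))
```

If the missing readings coincided with a spike or a dropout event, the filled series shows a flat segment instead, which is exactly the kind of signal loss described above.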
+### ❌ 7. Exponential Growth
+Training data is dominated by linear/polynomial trends and ARMA processes. Exponential dynamics (cumulative infection curves, viral growth, compound returns) are underrepresented. Log-transform your data if this applies.
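The log-transform advice can be wrapped around any forecaster. The sketch below uses a hypothetical linear-extrapolation function as a stand-in for a model with a linear-trend prior; it is not TimesFM, and `log1p`/`expm1` assume non-negative data.

```python
import math


def naive_linear_forecast(history, horizon):
    """Hypothetical stand-in for a forecaster that extrapolates linear trends."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * (h + 1) for h in range(horizon)]


def forecast_in_log_space(history, horizon, forecaster):
    """Forecast log1p(y), then map predictions back with expm1."""
    log_history = [math.log1p(y) for y in history]
    return [math.expm1(z) for z in forecaster(log_history, horizon)]


def mape(actual, pred):
    """Mean absolute percentage error, as a fraction."""
    return sum(abs(a - p) / a for a, p in zip(actual, pred)) / len(actual)


# Toy series growing 5% per step.
history = [100 * 1.05 ** t for t in range(60)]
actual_next = [100 * 1.05 ** t for t in range(60, 70)]

raw = naive_linear_forecast(history, 10)
logged = forecast_in_log_space(history, 10, naive_linear_forecast)

print(f"MAPE, raw scale: {mape(actual_next, raw):.3f}")
print(f"MAPE, log scale: {mape(actual_next, logged):.3f}")
```

In log space the exponential series becomes close to a straight line, so a linear-trend prior fits it far better than on the raw scale.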
+---
+## The Benchmark Contamination Problem
+This is critical and often ignored: TimesFM's training corpus (Wikipedia, Google Trends, M4, ETT) **overlaps with standard benchmarks**. A dedicated paper ([arxiv:2510.13654](https://arxiv.org/abs/2510.13654)) identifies that published zero-shot numbers for TimesFM, Chronos, and Moirai all suffer from potential information leakage. **All published zero-shot numbers should be read with a 10–20% skepticism discount.** Run your own held-out evaluation.
+---
+## The Competitive Landscape (Late 2025)
+| Model | Probabilistic? | Multivariate? | Key Advantage Over TimesFM |
+|---|---|---|---|
+| **Chronos / Chronos-2** (Amazon) | ✅ Calibrated | ✅ (v2) | Calibrated uncertainty; v2 adds multivariate + covariates |
+| **Moirai / Moirai-2.0** (Salesforce) | ✅ Mixture dist | ✅ (v1) / ❌ (v2) | Wins GIFT-Eval overall; Moirai-2.0 is 30× smaller at similar accuracy |
+| **Moirai-MoE** (Salesforce) | ✅ | ❌ | No frequency heuristic needed — token-level specialization via sparse MoE |
+| **Lag-Llama** (Academic) | ✅ Student-t | ❌ | Lag features for long-range dependencies |
+| **TabPFN-v2** (Prior-Data Fitted Networks) | N/A | N/A | **Beats all TSFMs on GIFT-Eval** when time series is framed as tabular prediction with lag features |
+That last row is the most sobering: a tabular foundation model with engineered lag features outperforms every time series foundation model on the most rigorous benchmark. The implication is that **feature engineering + a strong tabular model remains an extremely competitive baseline** for many forecasting tasks.
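"Framing a time series as tabular prediction with lag features" amounts to building a table like the one below. This is a generic sketch with a hypothetical helper and arbitrarily chosen lags; it does not use the TabPFN API, and any tabular regressor could consume the resulting rows.

```python
def make_lag_table(series, lags=(1, 2, 7)):
    """Turn a series into (features, target) rows for a tabular regressor."""
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(series)):
        rows.append([series[t - lag] for lag in lags])  # values at t-1, t-2, t-7
        targets.append(series[t])                       # value to predict at t
    return rows, targets


series = [float(v) for v in range(10)]  # toy series 0.0 .. 9.0
X, y = make_lag_table(series)
print(X[0], "->", y[0])  # first usable row starts at t = 7
print(len(X))            # rows available after dropping the warm-up period
```

Calendar features (day of week, month) and rolling statistics are typically added as extra columns in the same way.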
+---
+## How to Decide If TimesFM Will Work for Your Use Case
+```
+1. Is your problem fundamentally multivariate or covariate-driven?
+   ├─ YES → Stop. Use Chronos-2, Moirai, TFT, or a supervised model.
+   └─ NO (each series is self-contained) → Continue
+2. Do you need calibrated prediction intervals?
+   ├─ YES → Use Chronos (or add conformal prediction to TimesFM).
+   └─ NO (point forecast sufficient) → Continue
+3. Is your target variable financial returns / alpha signal?
+   ├─ YES → Use XGBoost/CatBoost/LightGBM with lag features.
+   │         TSFMs empirically lose here across 94 countries.
+   └─ NO → Continue
+4. Is your series regular, contiguous, and ≥32 data points?
+   ├─ NO → Heavy preprocessing needed. Consider alternatives.
+   └─ YES → Continue
+5. Does your series exhibit smooth trend + seasonality
+   (like web traffic, retail demand, energy consumption)?
+   ├─ YES → TimesFM is likely excellent. Run zero-shot first.
+   └─ NO (chaotic, heavy-tailed, structural breaks) → Expect degradation.
+         Fine-tune on domain data or use supervised models.
+6. Is your context window sufficient?
+   ├─ Need >2048 points of history → TimesFM v2.0 won't fit it.
+   └─ ≤2048 → Continue
+7. Practical test: Run TimesFM zero-shot on a held-out sample.
+   Compare against:
+   (a) Seasonal naïve baseline
+   (b) XGBoost with lag features
+   (c) Chronos zero-shot
+   If TimesFM beats (a) and is within 10% of (b),
+   it's a viable production candidate.
+```
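Step 7 can be turned into a small acceptance check. The sketch below is one possible reading, with assumptions labeled: "within 10% of (b)" is interpreted as "at most 10% worse on MAE", the weekly season length is arbitrary, and the candidate and tabular errors are placeholders to be replaced with the results of your own TimesFM and XGBoost runs.

```python
def seasonal_naive(history, horizon, season=7):
    """Repeat the last observed season forward."""
    return [history[len(history) - season + (h % season)] for h in range(horizon)]


def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)


def is_viable(candidate_mae, naive_mae, tabular_mae, tolerance=0.10):
    """Beat seasonal naive and be at most `tolerance` worse than the tabular model."""
    return candidate_mae < naive_mae and candidate_mae <= (1 + tolerance) * tabular_mae


# Toy weekly pattern with a mild upward trend; the last 14 points are held out.
series = [10 + (t % 7) + 0.1 * t for t in range(98)]
history, holdout = series[:-14], series[-14:]

naive_mae = mae(holdout, seasonal_naive(history, 14))
print(f"seasonal-naive MAE: {naive_mae:.2f}")

# Placeholder errors: substitute the MAE of your own model runs here.
print(is_viable(candidate_mae=0.60, naive_mae=naive_mae, tabular_mae=0.58))
```

Running the same comparison over several rolling cutoffs, rather than a single split, gives a more trustworthy verdict.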
+---
+## The Bottom Line
+TimesFM is a **fast, lightweight, zero-shot point-forecast engine with a strong prior on smooth, seasonal, consumer-internet-style time series**. It helped popularize the decoder-only patched architecture for time series, though not every later TSFM adopted it (Chronos, for example, builds on an encoder-decoder T5). For univariate demand forecasting, KPI monitoring, web traffic, and energy consumption, it's an excellent starting point.
+It has three **structural limitations** that no amount of scaling will fix: (1) univariate-only, (2) uncalibrated uncertainty, (3) training distribution dominated by Wikipedia/Google Trends rather than your domain. For financial return prediction specifically, the empirical evidence is clear: it loses to tree-based models with lag features.
+**The sharpest insight**: If your series "looks like Wikipedia traffic" — regular frequency, trend + seasonality, medium noise — TimesFM will probably work well zero-shot. If it doesn't look like that, treat TimesFM as a feature extractor or a fine-tuning base, not a finished product.
+---
+## Essential References
+| Resource | What It Contains |
+|---|---|
+| [arxiv:2310.10688](https://arxiv.org/abs/2310.10688) | Original TimesFM paper (ICML 2024); Sections 4 (architecture), 5 (training data), A.1 (limitations) |
+| [arxiv:2403.07815](https://arxiv.org/abs/2403.07815) | Chronos paper; Section 5.7 has qualitative failure mode analysis across TSFMs |
+| [arxiv:2410.10393](https://arxiv.org/abs/2410.10393) | GIFT-Eval benchmark; Table 6 has domain-level breakdown of all models |
+| [arxiv:2510.13654](https://arxiv.org/abs/2510.13654) | Information leakage in TSFM benchmarking — critical for trusting published numbers |
+| [arxiv:2511.18578](https://arxiv.org/abs/2511.18578) | Finance-specific evaluation; zero-shot TSFMs beaten by tree models |
+| [arxiv:2410.10469](https://arxiv.org/abs/2410.10469) | Moirai-MoE; articulates the frequency-embedding coarseness problem in TimesFM |
+| [arxiv:2501.02945](https://arxiv.org/abs/2501.02945) | TabPFN-v2; tabular model that beats all TSFMs on GIFT-Eval |