Ray0202
update 02.17.2026
949c705
TITLE = """<h1 align="center" id="space-title">TemporalBench Leaderboard</h1>"""
INTRODUCTION_TEXT = """
This leaderboard presents **offline** evaluation results for agent configurations on the
TemporalBench benchmark. It is a pure visualization and validation layer: no agents are
executed here, and no LLM APIs are called.
"""
LLM_BENCHMARKS_TEXT = """
The paper describing this benchmark is *TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks* (https://arxiv.org/abs/2602.13272). We also maintain a public leaderboard and welcome submissions from state-of-the-art models: https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard
## What this leaderboard shows
- One row per evaluated agent configuration
- Task-family MCQ metrics for TemporalBench (T1–T4)
- Forecasting metrics for T2/T4 (sMAPE, MAE) and MIMIC OW metrics when provided
- Dataset-level results for: FreshRetailNet, PSML, Causal Chambers, MIMIC
## Data requirements
Results are loaded from a local JSON or CSV file. Each record must include:
- Identity fields: `agent_name`, `agent_type`, `base_model`
- Required metrics: `T1_acc`, `T2_acc`, `T3_acc`, `T4_acc` (computed overall)
- Optional metrics:
- Overall forecasting: `T2_sMAPE`, `T2_MAE`, `T4_sMAPE`, `T4_MAE`
- MIMIC overall OW: `MIMIC_T2_OW_sMAPE`, `MIMIC_T2_OW_RMSSE`, `MIMIC_T4_OW_sMAPE`, `MIMIC_T4_OW_RMSSE`
- Dataset-level metrics: `<Dataset>_T{1..4}_acc` and forecasting metrics per dataset
## Overall computation
Overall T1–T4 accuracy and T2/T4 forecasting metrics are computed as weighted averages
from dataset-level results using question/series counts. Missing values are ignored.
## Submission workflow
Uploads are stored locally for manual review.
For a valid submission, please provide **two files**:
1. `results_on_dev_dataset.json`
- This follows the leaderboard metrics format.
- It should include task-level metrics only (e.g., T1-T4 and forecasting metrics).
2. `results_on_test_dataset.json`
- This should include per-example outputs on the test split.
- For each example, include at least:
- `id`
- `tier`
- `source_dataset`
- `label`
- `output` (required when the example contains a forecasting task)
We also strongly encourage including model and system metadata, such as:
- model architecture code
- LLM(s) used
- key implementation details needed for result verification
Approved submissions should then be merged into the main results file to appear on the leaderboard.
## Data access
The dataset is available at:
```
https://huggingface.co/datasets/Melady/TemporalBench
```
It includes all test tasks and a `forecast_metrics_utils.py` file that documents the
standard metric computation utilities.
"""
EVALUATION_QUEUE_TEXT = ""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{weng2026temporalbenchbenchmarkevaluatingllmbased,
title={TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks},
author={Muyan Weng and Defu Cao and Wei Yang and Yashaswi Sharma and Yan Liu},
year={2026},
eprint={2602.13272},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.13272},
}
"""