TITLE = """<h1 align="center" id="space-title">TemporalBench Leaderboard</h1>"""
INTRODUCTION_TEXT = """
This leaderboard presents **offline** evaluation results for agent configurations on the
TemporalBench benchmark. It is a pure visualization and validation layer: no agents are
executed here, and no LLM APIs are called.
"""
LLM_BENCHMARKS_TEXT = """
This benchmark is described in the paper *TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks* (https://arxiv.org/abs/2602.13272). The public leaderboard welcomes submissions from state-of-the-art models: https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard
## What this leaderboard shows
- One row per evaluated agent configuration
- Task-family MCQ metrics for TemporalBench (T1–T4)
- Forecasting metrics for T2/T4 (sMAPE, MAE) and MIMIC OW metrics when provided
- Dataset-level results for: FreshRetailNet, PSML, Causal Chambers, MIMIC
## Data requirements
Results are loaded from a local JSON or CSV file. Each record must include:
- Identity fields: `agent_name`, `agent_type`, `base_model`
- Required metrics: `T1_acc`, `T2_acc`, `T3_acc`, `T4_acc` (computed overall)
- Optional metrics:
- Overall forecasting: `T2_sMAPE`, `T2_MAE`, `T4_sMAPE`, `T4_MAE`
- MIMIC overall OW: `MIMIC_T2_OW_sMAPE`, `MIMIC_T2_OW_RMSSE`, `MIMIC_T4_OW_sMAPE`, `MIMIC_T4_OW_RMSSE`
- Dataset-level metrics: `<Dataset>_T{1..4}_acc` and forecasting metrics per dataset
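
As a rough sketch of these requirements (field names are taken from the lists above; the agent values and metric numbers are made up for illustration), a minimal record and a required-field check might look like:

```python
# Illustrative record -- values are invented, field names follow the spec above.
record = {
    "agent_name": "example-agent",
    "agent_type": "react",
    "base_model": "example-llm",
    "T1_acc": 0.71, "T2_acc": 0.64, "T3_acc": 0.58, "T4_acc": 0.49,
    # Optional fields such as "T2_sMAPE" or "MIMIC_T2_OW_RMSSE" may be added.
}

REQUIRED = ["agent_name", "agent_type", "base_model",
            "T1_acc", "T2_acc", "T3_acc", "T4_acc"]

# Collect any required fields that are absent from the record.
missing = [key for key in REQUIRED if key not in record]
assert not missing, f"missing required fields: {missing}"
```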
## Overall computation
Overall T1–T4 accuracy and T2/T4 forecasting metrics are computed as weighted averages
from dataset-level results using question/series counts. Missing values are ignored.
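
A hedged sketch of this aggregation (the function name and inputs are illustrative, not the leaderboard's actual code): each dataset contributes its metric weighted by its question/series count, and `None` entries are skipped.

```python
def weighted_overall(values, counts):
    """Count-weighted average of dataset-level metrics, ignoring None values."""
    pairs = [(v, n) for v, n in zip(values, counts) if v is not None]
    if not pairs:
        return None  # no dataset reported this metric
    total = sum(n for _, n in pairs)
    return sum(v * n for v, n in pairs) / total

# e.g. T1 accuracy on three datasets with different question counts;
# the middle dataset is missing and is excluded from the average.
print(weighted_overall([0.8, None, 0.6], [100, 50, 300]))  # 0.65
```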
## Submission workflow
Uploads are stored locally for manual review.
For a valid submission, please provide **two files**:
1. `results_on_dev_dataset.json`
- This follows the leaderboard metrics format.
   - It should include task-level metrics only (e.g., T1–T4 accuracy and forecasting metrics).
2. `results_on_test_dataset.json`
- This should include per-example outputs on the test split.
- For each example, include at least:
- `id`
- `tier`
- `source_dataset`
- `label`
- `output` (required when the example contains a forecasting task)
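
One illustrative per-example entry for `results_on_test_dataset.json` (field names follow the list above; the `id`, `label`, and forecast values are invented placeholders):

```python
import json

# Hypothetical test-split entry; real ids/labels come from the dataset.
example = {
    "id": "example-0001",
    "tier": "T2",
    "source_dataset": "PSML",
    "label": "B",
    "output": [101.2, 99.8, 103.4],  # required only for forecasting tasks
}
print(json.dumps(example))
```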
We also strongly encourage including model and system metadata, such as:
- model architecture code
- LLM(s) used
- key implementation details needed for result verification
Once approved, submissions are merged into the main results file and appear on the leaderboard.
## Data access
The dataset is available at:
```
https://huggingface.co/datasets/Melady/TemporalBench
```
It includes all test tasks and a `forecast_metrics_utils.py` file that provides the
standard metric-computation utilities.
"""
EVALUATION_QUEUE_TEXT = ""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{weng2026temporalbenchbenchmarkevaluatingllmbased,
title={TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks},
author={Muyan Weng and Defu Cao and Wei Yang and Yashaswi Sharma and Yan Liu},
year={2026},
eprint={2602.13272},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.13272},
}
"""