TITLE = """<h1 align="center" id="space-title">TemporalBench Leaderboard</h1>"""

INTRODUCTION_TEXT = """
This leaderboard presents **offline** evaluation results for agent configurations on the
TemporalBench benchmark. It is a pure visualization and validation layer: no agents are
executed here, and no LLM APIs are called.
"""

LLM_BENCHMARKS_TEXT = """
The paper describing this benchmark is *TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks* (https://arxiv.org/abs/2602.13272). We also maintain a public leaderboard and welcome submissions from state-of-the-art models: https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard

## What this leaderboard shows

- One row per evaluated agent configuration
- Task-family MCQ metrics for TemporalBench (T1–T4)
- Forecasting metrics for T2/T4 (sMAPE, MAE) and MIMIC OW metrics when provided
- Dataset-level results for: FreshRetailNet, PSML, Causal Chambers, MIMIC

## Data requirements

Results are loaded from a local JSON or CSV file. Each record must include:

- Identity fields: `agent_name`, `agent_type`, `base_model`
- Required metrics: `T1_acc`, `T2_acc`, `T3_acc`, `T4_acc` (computed overall)
- Optional metrics:
  - Overall forecasting: `T2_sMAPE`, `T2_MAE`, `T4_sMAPE`, `T4_MAE`
  - MIMIC overall OW: `MIMIC_T2_OW_sMAPE`, `MIMIC_T2_OW_RMSSE`, `MIMIC_T4_OW_sMAPE`, `MIMIC_T4_OW_RMSSE`
  - Dataset-level metrics: `<Dataset>_T{1..4}_acc` and forecasting metrics per dataset
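
A minimal record satisfying these requirements might look like the following sketch (all values and names are illustrative, not real results; dataset-level keys follow the `<Dataset>_T{1..4}_acc` pattern):

```json
{
  "agent_name": "my-agent-v1",
  "agent_type": "react",
  "base_model": "example-llm",
  "T1_acc": 0.81,
  "T2_acc": 0.74,
  "T3_acc": 0.69,
  "T4_acc": 0.66,
  "T2_sMAPE": 21.5,
  "T2_MAE": 3.2,
  "PSML_T1_acc": 0.78
}
```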

## Overall computation

Overall T1–T4 accuracy and T2/T4 forecasting metrics are computed as weighted averages
from dataset-level results using question/series counts. Missing values are ignored.
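As a minimal sketch of this aggregation (assuming weights are per-dataset question/series counts; the function name and inputs are illustrative, not the leaderboard's actual code):

```python
def weighted_average(values, weights):
    # Weighted mean over paired (value, weight) entries; None values are
    # skipped, mirroring the "missing values are ignored" rule above.
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    if not pairs:
        return None
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)

# Example: overall T1 accuracy from three datasets, weighted by question counts.
t1_accs = [0.80, None, 0.60]        # one dataset has no T1 result
question_counts = [100, 50, 200]
overall_t1 = weighted_average(t1_accs, question_counts)  # (0.80*100 + 0.60*200) / 300
```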

## Submission workflow

Uploads are stored locally for manual review.

For a valid submission, please provide **two files**:

1. `results_on_dev_dataset.json`
   - This follows the leaderboard metrics format.
   - It should include task-level metrics only (e.g., T1–T4 accuracies and forecasting metrics).

2. `results_on_test_dataset.json`
   - This should include per-example outputs on the test split.
   - For each example, include at least:
     - `id`
     - `tier`
     - `source_dataset`
     - `label`
     - `output` (required when the example contains a forecasting task)
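
For illustration, a single per-example entry could look like this (field values are made up; `output` is shown here as a numeric forecast for a forecasting example):

```json
{
  "id": "psml-t2-0001",
  "tier": "T2",
  "source_dataset": "PSML",
  "label": "B",
  "output": [112.4, 118.9, 121.3]
}
```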

We also strongly encourage including model and system metadata, such as:
- model architecture code
- LLM(s) used
- key implementation details needed for result verification

Approved submissions are then merged into the main results file and appear on the leaderboard.

## Data access

The dataset is available at:
```
https://huggingface.co/datasets/Melady/TemporalBench
```
It includes all test tasks and a `forecast_metrics_utils.py` file that documents the
standard metric computation utilities.
"""

EVALUATION_QUEUE_TEXT = ""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{weng2026temporalbenchbenchmarkevaluatingllmbased,
      title={TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks}, 
      author={Muyan Weng and Defu Cao and Wei Yang and Yashaswi Sharma and Yan Liu},
      year={2026},
      eprint={2602.13272},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.13272}, 
}
"""