TITLE = """<h1 align="center" id="space-title">TemporalBench Leaderboard</h1>"""

INTRODUCTION_TEXT = """
This leaderboard presents **offline** evaluation results for agent configurations on the
TemporalBench benchmark. It is a pure visualization and validation layer: no agents are
executed here, and no LLM APIs are called.
"""
LLM_BENCHMARKS_TEXT = """
The paper describing this benchmark is *TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks* (https://arxiv.org/abs/2602.13272).
We also maintain a public leaderboard and welcome submissions from state-of-the-art models: https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard

## What this leaderboard shows

- One row per evaluated agent configuration
- Task-family MCQ metrics for TemporalBench (T1–T4)
- Forecasting metrics for T2/T4 (sMAPE, MAE) and MIMIC OW metrics when provided
- Dataset-level results for FreshRetailNet, PSML, Causal Chambers, and MIMIC

## Data requirements

Results are loaded from a local JSON or CSV file. Each record must include:

- Identity fields: `agent_name`, `agent_type`, `base_model`
- Required metrics: `T1_acc`, `T2_acc`, `T3_acc`, `T4_acc` (computed overall)
- Optional metrics:
  - Overall forecasting: `T2_sMAPE`, `T2_MAE`, `T4_sMAPE`, `T4_MAE`
  - MIMIC overall OW: `MIMIC_T2_OW_sMAPE`, `MIMIC_T2_OW_RMSSE`, `MIMIC_T4_OW_sMAPE`, `MIMIC_T4_OW_RMSSE`
  - Dataset-level metrics: `<Dataset>_T{1..4}_acc` and per-dataset forecasting metrics
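As a concrete illustration, a minimal valid record could look like the following sketch. The field names come from the requirements above, but every value (and the agent/model names) is invented for illustration:

```python
# Hypothetical example record: field names follow the requirements above,
# but every value shown here is made up for illustration.
record = {
    # Identity fields (required)
    "agent_name": "example-agent",
    "agent_type": "react",        # hypothetical agent type
    "base_model": "example-llm",
    # Required overall MCQ metrics
    "T1_acc": 0.71,
    "T2_acc": 0.64,
    "T3_acc": 0.58,
    "T4_acc": 0.52,
    # A couple of optional overall forecasting metrics
    "T2_sMAPE": 23.4,
    "T2_MAE": 1.8,
}

# A loader could check the required fields before rendering the row:
REQUIRED = ["agent_name", "agent_type", "base_model",
            "T1_acc", "T2_acc", "T3_acc", "T4_acc"]
missing = [k for k in REQUIRED if k not in record]
assert not missing, f"record is missing required fields: {missing}"
```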

## Overall computation

Overall T1–T4 accuracy and T2/T4 forecasting metrics are computed as weighted averages
of the dataset-level results, weighted by question/series counts. Missing values are ignored.
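A minimal sketch of this weighting scheme (the actual field names and counts used by the leaderboard are not shown here, so treat this as an illustration of the mechanics, not the leaderboard's code):

```python
def weighted_overall(dataset_metrics):
    # Combine dataset-level values into one overall number.
    # dataset_metrics maps dataset name -> (value, count), where value is
    # e.g. a T1 accuracy and count is the number of questions/series behind
    # it. Entries with value None are skipped, matching the
    # "missing values are ignored" rule.
    total, weight = 0.0, 0
    for value, count in dataset_metrics.values():
        if value is None:
            continue
        total += value * count
        weight += count
    return total / weight if weight else None

# Hypothetical dataset-level T1 accuracies with question counts:
overall = weighted_overall({
    "FreshRetailNet": (0.80, 100),
    "PSML": (0.60, 50),
    "Causal Chambers": (None, 40),  # missing -> ignored
})
# (0.80 * 100 + 0.60 * 50) / 150
```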

## Submission workflow

Uploads are stored locally for manual review.
For a valid submission, please provide **two files**:

1. `results_on_dev_dataset.json`
   - This follows the leaderboard metrics format.
   - It should include task-level metrics only (e.g., T1–T4 accuracy and forecasting metrics).
2. `results_on_test_dataset.json`
   - This should include per-example outputs on the test split.
   - For each example, include at least:
     - `id`
     - `tier`
     - `source_dataset`
     - `label`
     - `output` (required when the example contains a forecasting task)

We also strongly encourage including model and system metadata, such as:

- model architecture code
- the LLM(s) used
- key implementation details needed for result verification

Once approved, submissions are merged into the main results file and appear on the leaderboard.
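For instance, a single entry in `results_on_test_dataset.json` might look like the sketch below. Only the field names come from the list above; the `id`, dataset, label, and forecast values are all invented:

```python
import json

# Hypothetical per-example entry for results_on_test_dataset.json.
# Field names follow the submission requirements; values are invented.
example = {
    "id": "psml-000123",
    "tier": "T2",
    "source_dataset": "PSML",
    "label": "B",
    "output": [101.2, 99.8, 103.5],  # forecast values; required for forecasting tasks
}

# The file itself would then be a JSON list of such entries:
print(json.dumps([example], indent=2))
```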

## Data access

The dataset is available at:

```
https://huggingface.co/datasets/Melady/TemporalBench
```

It includes all test tasks and a `forecast_metrics_utils.py` file that documents the
standard metric computation utilities.
| """ | |
| EVALUATION_QUEUE_TEXT = "" | |
| CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" | |
| CITATION_BUTTON_TEXT = r""" | |
| @misc{weng2026temporalbenchbenchmarkevaluatingllmbased, | |
| title={TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks}, | |
| author={Muyan Weng and Defu Cao and Wei Yang and Yashaswi Sharma and Yan Liu}, | |
| year={2026}, | |
| eprint={2602.13272}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.AI}, | |
| url={https://arxiv.org/abs/2602.13272}, | |
| } | |
| """ | |