TITLE = """

TemporalBench Leaderboard

""" INTRODUCTION_TEXT = """ This leaderboard presents **offline** evaluation results for agent configurations on the TemporalBench benchmark. It is a pure visualization and validation layer: no agents are executed here, and no LLM APIs are called. """ LLM_BENCHMARKS_TEXT = """ The paper describing this benchmark is *TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks* (https://arxiv.org/abs/2602.13272). We also maintain a public leaderboard and welcome submissions from state-of-the-art models: https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard ## What this leaderboard shows - One row per evaluated agent configuration - Task-family MCQ metrics for TemporalBench (T1–T4) - Forecasting metrics for T2/T4 (sMAPE, MAE) and MIMIC OW metrics when provided - Dataset-level results for: FreshRetailNet, PSML, Causal Chambers, MIMIC ## Data requirements Results are loaded from a local JSON or CSV file. Each record must include: - Identity fields: `agent_name`, `agent_type`, `base_model` - Required metrics: `T1_acc`, `T2_acc`, `T3_acc`, `T4_acc` (computed overall) - Optional metrics: - Overall forecasting: `T2_sMAPE`, `T2_MAE`, `T4_sMAPE`, `T4_MAE` - MIMIC overall OW: `MIMIC_T2_OW_sMAPE`, `MIMIC_T2_OW_RMSSE`, `MIMIC_T4_OW_sMAPE`, `MIMIC_T4_OW_RMSSE` - Dataset-level metrics: `_T{1..4}_acc` and forecasting metrics per dataset ## Overall computation Overall T1–T4 accuracy and T2/T4 forecasting metrics are computed as weighted averages from dataset-level results using question/series counts. Missing values are ignored. ## Submission workflow Uploads are stored locally for manual review. For a valid submission, please provide **two files**: 1. `results_on_dev_dataset.json` - This follows the leaderboard metrics format. - It should include task-level metrics only (e.g., T1-T4 and forecasting metrics). 2. `results_on_test_dataset.json` - This should include per-example outputs on the test split. 
   - For each example, include at least:
     - `id`
     - `tier`
     - `source_dataset`
     - `label`
     - `output` (required when the example contains a forecasting task)

We also strongly encourage including model and system metadata, such as:

- model architecture code
- LLM(s) used
- key implementation details needed for result verification

Approved submissions should then be merged into the main results file to appear on the leaderboard.

## Data access

The dataset is available at:

```
https://huggingface.co/datasets/Melady/TemporalBench
```

It includes all test tasks and a `forecast_metrics_utils.py` file that documents the standard metric computation utilities.
"""

EVALUATION_QUEUE_TEXT = ""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"

CITATION_BUTTON_TEXT = r"""
@misc{weng2026temporalbenchbenchmarkevaluatingllmbased,
      title={TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks},
      author={Muyan Weng and Defu Cao and Wei Yang and Yashaswi Sharma and Yan Liu},
      year={2026},
      eprint={2602.13272},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.13272},
}
"""
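The count-weighted aggregation described under "Overall computation" can be sketched as below. This is a minimal illustration, not the app's actual implementation; the field names in the docstring and tests (e.g., a per-dataset count key) are assumptions.

```python
def weighted_overall(records, metric_key, count_key):
    """Weighted average of ``metric_key`` over dataset-level records,
    weighted by ``count_key`` (e.g., a question/series count).
    Records missing either field are ignored, per the docs above.
    Returns None when no record contributes a value."""
    num = 0.0
    den = 0
    for rec in records:
        value = rec.get(metric_key)
        count = rec.get(count_key)
        if value is None or count is None:
            continue  # missing values are skipped, not treated as zero
        num += value * count
        den += count
    return num / den if den else None
```

For example, dataset-level accuracies of 0.8 (100 questions) and 0.5 (50 questions) combine to an overall accuracy of 0.7, while a record with no accuracy value is ignored.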
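A minimal sketch of how the per-example requirements for `results_on_test_dataset.json` could be checked before review. The `is_forecast` flag used to decide whether `output` is required is a hypothetical field, not part of the documented schema.

```python
# Per-example fields required by the submission workflow above.
REQUIRED_EXAMPLE_FIELDS = ("id", "tier", "source_dataset", "label")


def validate_test_examples(examples):
    """Return a list of human-readable errors for examples that are
    missing required fields. ``output`` is only required for
    forecasting tasks, signalled here by an assumed ``is_forecast``
    flag."""
    errors = []
    for i, ex in enumerate(examples):
        for field in REQUIRED_EXAMPLE_FIELDS:
            if field not in ex:
                errors.append(f"example {i}: missing '{field}'")
        if ex.get("is_forecast") and "output" not in ex:
            errors.append(f"example {i}: forecasting task missing 'output'")
    return errors
```

An empty return value means the file passes this structural check; actual submissions still go through manual review as described above.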