Spaces:

Melady
/

TemporalBench_Leaderboard

Running

TemporalBench_Leaderboard / src /about.py

Ray0202

update 02.17.2026

949c705 1 day ago

3.34 kB

	TITLE = """<h1 align="center" id="space-title">TemporalBench Leaderboard</h1>"""

	INTRODUCTION_TEXT = """
	This leaderboard presents offline evaluation results for agent configurations on the
	TemporalBench benchmark. It is a pure visualization and validation layer: no agents are
	executed here, and no LLM APIs are called.
	"""

	LLM_BENCHMARKS_TEXT = """
	The paper describing this benchmark is TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks (https://arxiv.org/abs/2602.13272). We also maintain a public leaderboard and welcome submissions from state-of-the-art models: https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard

	## What this leaderboard shows

	- One row per evaluated agent configuration
	- Task-family MCQ metrics for TemporalBench (T1–T4)
	- Forecasting metrics for T2/T4 (sMAPE, MAE) and MIMIC OW metrics when provided
	- Dataset-level results for: FreshRetailNet, PSML, Causal Chambers, MIMIC

	## Data requirements

	Results are loaded from a local JSON or CSV file. Each record must include:

	- Identity fields: `agent_name`, `agent_type`, `base_model`
	- Required metrics: `T1_acc`, `T2_acc`, `T3_acc`, `T4_acc` (computed overall)
	- Optional metrics:
	- Overall forecasting: `T2_sMAPE`, `T2_MAE`, `T4_sMAPE`, `T4_MAE`
	- MIMIC overall OW: `MIMIC_T2_OW_sMAPE`, `MIMIC_T2_OW_RMSSE`, `MIMIC_T4_OW_sMAPE`, `MIMIC_T4_OW_RMSSE`
	- Dataset-level metrics: `<Dataset>_T{1..4}_acc` and forecasting metrics per dataset

	## Overall computation

	Overall T1–T4 accuracy and T2/T4 forecasting metrics are computed as weighted averages
	from dataset-level results using question/series counts. Missing values are ignored.

	## Submission workflow

	Uploads are stored locally for manual review.

	For a valid submission, please provide two files:

	1. `results_on_dev_dataset.json`
	- This follows the leaderboard metrics format.
	- It should include task-level metrics only (e.g., T1-T4 and forecasting metrics).

	2. `results_on_test_dataset.json`
	- This should include per-example outputs on the test split.
	- For each example, include at least:
	- `id`
	- `tier`
	- `source_dataset`
	- `label`
	- `output` (required when the example contains a forecasting task)

	We also strongly encourage including model and system metadata, such as:
	- model architecture code
	- LLM(s) used
	- key implementation details needed for result verification

	Approved submissions should then be merged into the main results file to appear on the leaderboard.

	## Data access

	The dataset is available at:
	```
	https://huggingface.co/datasets/Melady/TemporalBench
	```
	It includes all test tasks and a `forecast_metrics_utils.py` file that documents the
	standard metric computation utilities.
	"""

	EVALUATION_QUEUE_TEXT = ""

	CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
	CITATION_BUTTON_TEXT = r"""
	@misc{weng2026temporalbenchbenchmarkevaluatingllmbased,
	title={TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks},
	author={Muyan Weng and Defu Cao and Wei Yang and Yashaswi Sharma and Yan Liu},
	year={2026},
	eprint={2602.13272},
	archivePrefix={arXiv},
	primaryClass={cs.AI},
	url={https://arxiv.org/abs/2602.13272},
	}
	"""