---
title: TimeBench Eval
datasets:
- TimeBench
tags:
- evaluate
- metric
- temporal reasoning
description: Evaluation metric for the TimeBench temporal reasoning benchmark by Chu et al. (2023).
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
emoji: ⏰
colorFrom: purple
colorTo: red
---
# Metric Card for TimeBench Eval

## Metric Description

This metric is designed for the **TimeBench** benchmark (Chu et al., 2023), which evaluates temporal reasoning abilities in large language models. It uses modified prompts from the [ADeLe paper by Zhou et al., 2025](https://arxiv.org/abs/2503.06378). It supports multiple task types with different evaluation strategies:

- **TempReason, TimeQA, MenatQA**: Uses SQuAD-style exact match and F1 scoring
- **Date Arithmetic**: Parses and compares dates for exact match
- **TimeDial**: Set-based comparison of selected multiple-choice options (A-D)

The metric expects model outputs to contain the answer in the format `"Thus, the correct answer is: <answer>"`.
It performs the following steps:

1. Extracts the answer from the model's prediction string using the marker `"Thus, the correct answer is:"` (a sketch of this step follows the list).
2. Applies task-specific evaluation:
   - For QA tasks: Computes SQuAD exact match and F1 scores
   - For Date Arithmetic: Parses dates and compares them (day is normalized to 1)
   - For TimeDial: Extracts option letters (A-D) and computes set-based exact match and F1
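As a rough illustration of step 1, the sketch below shows marker-based answer extraction. `extract_answer` and its exact trimming rules are assumptions for illustration, not the metric's actual implementation:

```python
MARKER = "Thus, the correct answer is:"

def extract_answer(prediction: str):
    """Hypothetical helper: return the text after the last occurrence of
    MARKER, with surrounding whitespace and a trailing period removed."""
    idx = prediction.rfind(MARKER)
    if idx == -1:
        return None  # extraction fails when the marker is absent
    answer = prediction[idx + len(MARKER):].strip()
    return answer.rstrip(".").strip() or None

print(extract_answer("Let me think... Thus, the correct answer is: Aug, 1987."))
# Aug, 1987
```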
## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/timebench_eval")

# Example for the Date Arithmetic task
predictions = [
    "Let me solve this step by step... Thus, the correct answer is: Aug, 1987.",
    "Calculating the date... Thus, the correct answer is: January 2020.",
]
references = ["Aug, 1987", "Feb, 2020"]
result = metric.compute(
    predictions=predictions,
    references=references,
    task="Date Arithmetic",
)
print(result)
# {"exact_match": [1, 0]}

# Example for the TempReason/TimeQA/MenatQA tasks
predictions = [
    "Based on the context... Thus, the correct answer is: Cardiff City.",
    "The answer cannot be determined. Thus, the correct answer is: unanswerable",
]
references = ["Cardiff City", "unanswerable"]
result = metric.compute(
    predictions=predictions,
    references=references,
    task="MenatQA",
)
print(result)
# {"exact_match": [1.0, 1.0], "f1": [1.0, 1.0]}

# Example for the TimeDial task (multiple choice)
predictions = [
    "Options B and C are correct. Thus, the correct answer is: B, C.",
]
references = ["B. No more than ten minutes && C. No more than five minutes"]
result = metric.compute(
    predictions=predictions,
    references=references,
    task="TimeDial",
)
print(result)
# {"exact_match": [1], "f1": [1.0]}
```
### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the answer after the marker `"Thus, the correct answer is:"`.
- **references** (`list` of `str`): List of reference answers.
- **task** (`str`): The task type being evaluated. Must be one of:
  - `"TempReason"`: Temporal reasoning QA
  - `"TimeQA"`: Time-based QA
  - `"MenatQA"`: Multiple Sensitive Factors Time QA
  - `"Date Arithmetic"`: Date calculation tasks
  - `"TimeDial"`: Dialogue-based temporal multiple choice
### Output Values

The metric returns a dictionary with the following keys (depending on task):

- **exact_match** (`list` of `float` or `int`): Exact match scores for each prediction (0 or 1).
- **f1** (`list` of `float`): F1 scores for each prediction (0.0 to 1.0). Returned for all tasks except Date Arithmetic.

Scores range from 0.0 to 1.0, with higher values indicating better performance.
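Because scores are returned per example, corpus-level numbers can be obtained by averaging the lists. A minimal sketch, where the `result` values are made up for illustration:

```python
result = {"exact_match": [1.0, 0.0, 1.0], "f1": [1.0, 0.4, 1.0]}

# Average per-example scores into corpus-level numbers
aggregate = {key: sum(values) / len(values) for key, values in result.items()}
print(aggregate)
# {'exact_match': 0.6666666666666666, 'f1': 0.8}
```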
#### Values from Popular Papers

Refer to the [original TimeBench paper](https://arxiv.org/abs/2311.17667) for baseline performance values across various language models.
## Limitations and Bias

- The metric relies on the marker `"Thus, the correct answer is:"` to extract answers. If the model output does not follow this exact format, extraction will fail and return `None`.
- For Date Arithmetic, dates are parsed using `dateutil.parser` with the day normalized to 1. Unparseable dates will result in `None` comparisons.
- For TimeDial, only options A-D are recognized. The extraction looks for standalone letters at word boundaries (both this and the date parsing above are sketched after this list).
- The metric assumes predictions and references are properly aligned (same-length lists).
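A minimal sketch of these two behaviours, assuming standard `dateutil` and `re` semantics; `normalize_date` and `extract_options` are illustrative names, not the metric's actual helpers:

```python
import re
from dateutil import parser as dateparser

def normalize_date(text: str):
    """Illustrative: parse a date string and normalize the day to 1."""
    try:
        return dateparser.parse(text).replace(day=1)
    except (ValueError, OverflowError):
        return None  # unparseable dates lead to a None comparison

def extract_options(text: str):
    """Illustrative: collect standalone option letters A-D at word boundaries."""
    return set(re.findall(r"\b([A-D])\b", text))

print(normalize_date("Aug, 1987") == normalize_date("August 1987"))  # True
print(extract_options("B, C."))  # {'B', 'C'} (set order may vary)
```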
## Citation

```bibtex
@software{abbood2026timebench_eval,
  title = {TimeBench Eval},
  author = {Abbood, Auss},
  year = {2026},
  url = {https://huggingface.co/spaces/aauss/timebench_eval}
}
```
## Further References

- [TimeBench paper](https://arxiv.org/abs/2311.17667)
- [ADeLe battery dataset (Zhou et al., 2025), which adapts TimeBench](https://huggingface.co/datasets/CFI-Kinds-of-Intelligence/ADeLe_battery_v1dot0)