aauss committed
Commit 71222fd
1 Parent(s): fde9678

Fill out README. Update pyproject.toml.

Files changed (4)
  1. README.md +106 -18
  2. pyproject.toml +10 -2
  3. timebench_eval.py +31 -38
  4. uv.lock +11 -13
README.md CHANGED
@@ -1,50 +1,138 @@
  ---
- title: Timebench Eval
  datasets:
  - TimeBench
  tags:
  - evaluate
  - metric
- description: 'TODO: add a description here'
  sdk: gradio
  sdk_version: 6.3.0
  app_file: app.py
  pinned: false
  ---

- # Metric Card for Timebench Eval
-
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*

  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

  #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

  ## Further References
- *Add any useful further references.*
  ---
+ title: TimeBench Eval
  datasets:
  - TimeBench
  tags:
  - evaluate
  - metric
+ - temporal reasoning
+ description: Evaluation metric for the TimeBench temporal reasoning benchmark by Chu et al. (2023).
  sdk: gradio
  sdk_version: 6.3.0
  app_file: app.py
  pinned: false
+ emoji: ⏰
+ colorFrom: purple
+ colorTo: red
  ---

+ # Metric Card for TimeBench Eval

  ## Metric Description
+
+ This metric is designed for the **TimeBench** benchmark (Chu et al., 2023), which evaluates temporal reasoning abilities in large language models. It uses modified prompts from the [ADeLe paper by Zhou et al., 2025](https://arxiv.org/abs/2503.06378). It supports multiple task types with different evaluation strategies:
+
+ - **TempReason, TimeQA, MenatQA**: SQuAD-style exact match and F1 scoring
+ - **Date Arithmetic**: Parses and compares dates for exact match
+ - **TimeDial**: Set-based comparison of selected multiple-choice options (A-D)
+
+ The metric expects model outputs to contain the answer in the format `"Thus, the correct answer is: <answer>"`.
+
+ It performs the following steps:
+
+ 1. Extracts the answer from the model's prediction string using the marker `"Thus, the correct answer is:"`.
+ 2. Applies task-specific evaluation:
+    - For QA tasks: computes SQuAD exact match and F1 scores
+    - For Date Arithmetic: parses dates and compares them (the day is normalized to 1)
+    - For TimeDial: extracts option letters (A-D) and computes set-based exact match and F1

  ## How to Use

+ Load the metric with the `evaluate` library:
+
+ ```python
+ import evaluate
+
+ metric = evaluate.load("aauss/timebench_eval")
+
+ # Example for the Date Arithmetic task
+ predictions = [
+     "Let me solve this step by step... Thus, the correct answer is: Aug, 1987.",
+     "Calculating the date... Thus, the correct answer is: January 2020.",
+ ]
+ references = ["Aug, 1987", "Feb, 2020"]
+
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     task="Date Arithmetic",
+ )
+ print(result)
+ # {"exact_match": [1, 0]}
+
+ # Example for the TempReason/TimeQA/MenatQA tasks
+ predictions = [
+     "Based on the context... Thus, the correct answer is: Cardiff City.",
+     "The answer cannot be determined. Thus, the correct answer is: unanswerable",
+ ]
+ references = ["Cardiff City", "unanswerable"]
+
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     task="MenatQA",
+ )
+ print(result)
+ # {"exact_match": [1.0, 1.0], "f1": [1.0, 1.0]}
+
+ # Example for the TimeDial task (multiple choice)
+ predictions = [
+     "Options B and C are correct. Thus, the correct answer is: B, C.",
+ ]
+ references = ["B. No more than ten minutes && C. No more than five minutes"]
+
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     task="TimeDial",
+ )
+ print(result)
+ # {"exact_match": [1], "f1": [1.0]}
+ ```

  ### Inputs
+
+ - **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the answer after the marker `"Thus, the correct answer is:"`.
+ - **references** (`list` of `str`): List of reference answers.
+ - **task** (`str`): The task type being evaluated. Must be one of:
+   - `"TempReason"`: temporal reasoning QA
+   - `"TimeQA"`: time-based QA
+   - `"MenatQA"`: Multiple Sensitive Factors Time QA
+   - `"Date Arithmetic"`: date calculation tasks
+   - `"TimeDial"`: dialogue-based temporal multiple choice

  ### Output Values

+ The metric returns a dictionary with the following keys (depending on the task):
+
+ - **exact_match** (`list` of `float` or `int`): Exact match score for each prediction (0 or 1).
+ - **f1** (`list` of `float`): F1 score for each prediction (0.0 to 1.0). Returned for all tasks except Date Arithmetic.
+
+ Scores range from 0.0 to 1.0, with higher values indicating better performance.

  #### Values from Popular Papers

+ Refer to the [original TimeBench paper](https://arxiv.org/abs/2311.17667) for baseline performance values across various language models.

  ## Limitations and Bias
+
+ - The metric relies on the marker `"Thus, the correct answer is:"` to extract answers. If the model output does not follow this exact format, extraction fails and returns `None`.
+ - For Date Arithmetic, dates are parsed using `dateutil.parser` with the day normalized to 1. Unparseable dates result in `None` comparisons.
+ - For TimeDial, only options A-D are recognized. Extraction looks for standalone letters at word boundaries.
+ - The metric assumes predictions and references are properly aligned (lists of the same length).

  ## Citation
+
+ ```bibtex
+ @software{abbood2026timebench_eval,
+   title={TimeBench Eval},
+   author={Abbood, Auss},
+   year={2026},
+   url={https://huggingface.co/spaces/aauss/timebench_eval}
+ }
+ ```

  ## Further References
+
+ - [TimeBench paper](https://arxiv.org/abs/2311.17667)
+ - [ADeLe battery dataset, which adopts TimeBench](https://huggingface.co/datasets/CFI-Kinds-of-Intelligence/ADeLe_battery_v1dot0)
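
To make the extraction and Date Arithmetic behavior described in the new README concrete, here is a minimal illustrative sketch. The helper names are hypothetical (the actual implementation lives in `timebench_eval.py`), and it assumes `dateutil` parsing with the day normalized to 1, as stated in the Limitations section:

```python
import re
from datetime import datetime

from dateutil import parser
from dateutil.parser import ParserError

# Marker that model outputs are expected to contain (see Metric Description above).
ANSWER_MARKER = "Thus, the correct answer is:"


def extract_answer(response: str) -> str | None:
    """Return the text after the answer marker, or None if the marker is missing."""
    match = re.search(re.escape(ANSWER_MARKER) + r"\s*(.+)", response, flags=re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip().rstrip(".")


def dates_match(prediction: str, reference: str) -> bool:
    """Parse both strings as dates, normalize the day to 1, and compare them."""
    default = datetime(1900, 1, 1)  # fills in any fields missing from the string
    try:
        pred = parser.parse(prediction, default=default).replace(day=1)
        ref = parser.parse(reference, default=default).replace(day=1)
    except (ParserError, OverflowError):
        return False
    return pred == ref


answer = extract_answer("Let me solve this... Thus, the correct answer is: Aug, 1987.")
print(dates_match(answer, "Aug, 1987"))  # True: both parse to 1987-08-01
```

Normalizing the day makes month-level answers such as `Aug, 1987` comparable even when the model never states a specific day.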
pyproject.toml CHANGED
@@ -1,13 +1,21 @@
  [project]
  name = "timebench-eval"
  version = "0.1.0"
- description = "Add your description here"
  readme = "README.md"
  requires-python = ">=3.12"
  dependencies = [
      "evaluate==0.4.6",
      "huggingface-hub<0.25",
-     "pip>=25.3",
      "pytest>=9.0.2",
      "ruff>=0.14.11",
  ]
  [project]
  name = "timebench-eval"
  version = "0.1.0"
+ description = "Evaluation metric for the TimeBench temporal reasoning benchmark"
  readme = "README.md"
  requires-python = ">=3.12"
  dependencies = [
      "evaluate==0.4.6",
      "huggingface-hub<0.25",
+     "python-dateutil>=2.8",
+     "datasets>=2.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
      "pytest>=9.0.2",
      "ruff>=0.14.11",
  ]
+
+ [tool.pytest.ini_options]
+ pythonpath = ["."]
timebench_eval.py CHANGED
@@ -11,86 +11,75 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
- """TODO: Add a description here."""

  import re
  from datetime import datetime

  from dateutil import parser
  from dateutil.parser import ParserError

- import evaluate
- import datasets
-
-
- # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
- title = {A great new module},
- authors={huggingface, Inc.},
- year={2020}
  }
  """

- # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
  """


- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
  Args:
-     predictions: list of predictions to score. Each predictions
-         should be a string with tokens separated by spaces.
-     references: list of reference for each prediction. Each
-         reference should be a string with tokens separated by spaces.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
  Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.
-
-     >>> my_new_module = evaluate.load("my_new_module")
-     >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
      >>> print(results)
-     {'accuracy': 1.0}
  """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-

  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class TimebenchEval(evaluate.Metric):
-     """TODO: Short description of my evaluation module."""

      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.squad_metric = evaluate.load("squad")

      def _info(self):
-         # TODO: Specifies the evaluate.EvaluationModuleInfo object
          return evaluate.MetricInfo(
-             # This is the description that will appear on the modules page.
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
-             # This defines the format of each prediction and reference
              features=datasets.Features(
                  {
                      "predictions": datasets.Value("string"),
                      "references": datasets.Value("string"),
                  }
              ),
-             # Homepage of the module for documentation
-             homepage="http://module.homepage",
-             # Additional links to the codebase or references
-             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-             reference_urls=["http://path.to.reference.url/new_module"],
          )

      def _compute(

  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
+ """Evaluation metric for the TimeBench temporal reasoning benchmark."""

  import re
  from datetime import datetime

+ import datasets
+ import evaluate
  from dateutil import parser
  from dateutil.parser import ParserError

  _CITATION = """\
+ @software{abbood2026timebench_eval,
+   title={TimeBench Eval},
+   author={Abbood, Auss},
+   year={2026},
+   url={https://huggingface.co/spaces/aauss/timebench_eval}
  }
  """

  _DESCRIPTION = """\
+ Evaluation metric for the TimeBench benchmark, which assesses temporal reasoning
+ abilities in large language models. Supports multiple task types including TempReason,
+ TimeQA, MenatQA, Date Arithmetic, and TimeDial.
  """


  _KWARGS_DESCRIPTION = """
+ Calculates evaluation metrics for temporal reasoning tasks.
  Args:
+     predictions: list of prediction strings from the model. Each prediction
+         should contain the marker "Thus, the correct answer is:" followed by the answer.
+     references: list of reference answer strings.
+     task: the task type, one of "TempReason", "TimeQA", "MenatQA", "Date Arithmetic", or "TimeDial".
  Returns:
+     exact_match: list of exact match scores (0 or 1) for each prediction.
+     f1: list of F1 scores for each prediction (for applicable tasks).
  Examples:
+     >>> timebench_eval = evaluate.load("aauss/timebench_eval")
+     >>> predictions = ["Let me think... Thus, the correct answer is: Aug, 1987."]
+     >>> references = ["Aug, 1987"]
+     >>> results = timebench_eval.compute(predictions=predictions, references=references, task="Date Arithmetic")
      >>> print(results)
+     {'exact_match': [1]}
  """


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class TimebenchEval(evaluate.Metric):
+     """Evaluation metric for TimeBench temporal reasoning tasks."""

      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.squad_metric = evaluate.load("squad")

      def _info(self):
          return evaluate.MetricInfo(
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
              features=datasets.Features(
                  {
                      "predictions": datasets.Value("string"),
                      "references": datasets.Value("string"),
                  }
              ),
+             homepage="https://huggingface.co/spaces/aauss/timebench_eval",
+             codebase_urls=["https://huggingface.co/spaces/aauss/timebench_eval/tree/main"],
+             reference_urls=["https://huggingface.co/datasets/ulab-ai/Time-Bench"],
          )

      def _compute(
@@ -117,6 +106,10 @@ class TimebenchEval(evaluate.Metric):
              return self._compare_dates(predictions, references)
          elif task == "TimeDial":
              return self._compute_timedial(predictions, references)

      @staticmethod
      def _extract_answer(response: str) -> str | None:

              return self._compare_dates(predictions, references)
          elif task == "TimeDial":
              return self._compute_timedial(predictions, references)
+         else:
+             raise ValueError(
+                 f"Unknown task: {task}. Expected one of: TempReason, TimeQA, MenatQA, Date Arithmetic, TimeDial"
+             )

      @staticmethod
      def _extract_answer(response: str) -> str | None:
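
For illustration, a minimal sketch of the set-based TimeDial scoring that the docstring and README describe: standalone option letters A-D are collected from both strings and compared as sets. The function names here are hypothetical; the module's actual `_compute_timedial` may differ in detail.

```python
import re


def extract_options(text: str) -> set[str]:
    """Collect standalone option letters A-D found at word boundaries."""
    return set(re.findall(r"\b([A-D])\b", text))


def timedial_scores(prediction: str, reference: str) -> tuple[int, float]:
    """Set-based exact match and F1 between predicted and reference option letters."""
    pred, ref = extract_options(prediction), extract_options(reference)
    exact_match = int(pred == ref)
    overlap = len(pred & ref)
    if overlap == 0:
        return exact_match, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return exact_match, 2 * precision * recall / (precision + recall)


print(timedial_scores(
    "Options B and C are correct. Thus, the correct answer is: B, C.",
    "B. No more than ten minutes && C. No more than five minutes",
))  # (1, 1.0)
```

The word-boundary pattern is why only standalone letters are recognized, as noted in the README's Limitations section.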
uv.lock CHANGED
@@ -628,15 +628,6 @@ wheels = [
      { url = "https://files.pythonhosted.org/packages/70/44/5191d2e4026f86a2a109053e194d3ba7a31a2d10a9c2348368c63ed4e85a/pandas-2.3.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:3869faf4bd07b3b66a9f462417d0ca3a9df29a9f6abd5d0d0dbab15dac7abe87", size = 13202175, upload-time = "2025-09-29T23:31:59.173Z" },
  ]

- [[package]]
- name = "pip"
- version = "25.3"
- source = { registry = "https://pypi.org/simple" }
- sdist = { url = "https://files.pythonhosted.org/packages/fe/6e/74a3f0179a4a73a53d66ce57fdb4de0080a8baa1de0063de206d6167acc2/pip-25.3.tar.gz", hash = "sha256:8d0538dbbd7babbd207f261ed969c65de439f6bc9e5dbd3b3b9a77f25d95f343", size = 1803014, upload-time = "2025-10-25T00:55:41.394Z" }
- wheels = [
-     { url = "https://files.pythonhosted.org/packages/44/3c/d717024885424591d5376220b5e836c2d5293ce2011523c9de23ff7bf068/pip-25.3-py3-none-any.whl", hash = "sha256:9655943313a94722b7774661c21049070f6bbb0a1516bf02f7c8d5d9201514cd", size = 1778622, upload-time = "2025-10-25T00:55:39.247Z" },
- ]
-
  [[package]]
  name = "pluggy"
  version = "1.6.0"

      { url = "https://files.pythonhosted.org/packages/70/44/5191d2e4026f86a2a109053e194d3ba7a31a2d10a9c2348368c63ed4e85a/pandas-2.3.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:3869faf4bd07b3b66a9f462417d0ca3a9df29a9f6abd5d0d0dbab15dac7abe87", size = 13202175, upload-time = "2025-09-29T23:31:59.173Z" },
  ]

  [[package]]
  name = "pluggy"
  version = "1.6.0"
@@ -920,21 +911,28 @@ name = "timebench-eval"
  version = "0.1.0"
  source = { virtual = "." }
  dependencies = [
      { name = "evaluate" },
      { name = "huggingface-hub" },
-     { name = "pip" },
      { name = "pytest" },
      { name = "ruff" },
  ]

  [package.metadata]
  requires-dist = [
      { name = "evaluate", specifier = "==0.4.6" },
      { name = "huggingface-hub", specifier = "<0.25" },
-     { name = "pip", specifier = ">=25.3" },
-     { name = "pytest", specifier = ">=9.0.2" },
-     { name = "ruff", specifier = ">=0.14.11" },
  ]

  [[package]]
  name = "tqdm"

  version = "0.1.0"
  source = { virtual = "." }
  dependencies = [
+     { name = "datasets" },
      { name = "evaluate" },
      { name = "huggingface-hub" },
+     { name = "python-dateutil" },
+ ]
+
+ [package.optional-dependencies]
+ dev = [
      { name = "pytest" },
      { name = "ruff" },
  ]

  [package.metadata]
  requires-dist = [
+     { name = "datasets", specifier = ">=2.0" },
      { name = "evaluate", specifier = "==0.4.6" },
      { name = "huggingface-hub", specifier = "<0.25" },
+     { name = "pytest", marker = "extra == 'dev'", specifier = ">=9.0.2" },
+     { name = "python-dateutil", specifier = ">=2.8" },
+     { name = "ruff", marker = "extra == 'dev'", specifier = ">=0.14.11" },
  ]
+ provides-extras = ["dev"]

  [[package]]
  name = "tqdm"