aauss committed
Commit 71222fd
1 Parent(s): fde9678

Fill out README. Update pyproject.toml.

Files changed (4)
  1. README.md +106 -18
  2. pyproject.toml +10 -2
  3. timebench_eval.py +31 -38
  4. uv.lock +11 -13
README.md CHANGED
@@ -1,50 +1,138 @@
  ---
- title: Timebench Eval
  datasets:
  - TimeBench
  tags:
  - evaluate
  - metric
- description: 'TODO: add a description here'
  sdk: gradio
  sdk_version: 6.3.0
  app_file: app.py
  pinned: false
  ---

- # Metric Card for Timebench Eval
-
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*

  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

  #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

  ## Further References
- *Add any useful further references.*
  ---
+ title: TimeBench Eval
  datasets:
  - TimeBench
  tags:
  - evaluate
  - metric
+ - temporal reasoning
+ description: Evaluation metric for the TimeBench temporal reasoning benchmark by Chu et al. (2023).
  sdk: gradio
  sdk_version: 6.3.0
  app_file: app.py
  pinned: false
+ emoji: ⏰
+ colorFrom: purple
+ colorTo: red
  ---

+ # Metric Card for TimeBench Eval

  ## Metric Description
+
+ This metric is designed for the **TimeBench** benchmark (Chu et al., 2023), which evaluates temporal reasoning abilities in large language models. It uses modified prompts from the [ADeLe paper by Zhou et al., 2025](https://arxiv.org/abs/2503.06378). It supports multiple task types with different evaluation strategies:
+
+ - **TempReason, TimeQA, MenatQA**: SQuAD-style exact match and F1 scoring
+ - **Date Arithmetic**: Parses and compares dates for exact match
+ - **TimeDial**: Set-based comparison of selected multiple-choice options (A-D)
+
+ The metric expects model outputs to contain the answer in the format `"Thus, the correct answer is: <answer>"`.
+
+ It performs the following steps:
+
+ 1. Extracts the answer from the model's prediction string using the marker `"Thus, the correct answer is:"`.
+ 2. Applies task-specific evaluation:
+    - For QA tasks: computes SQuAD exact match and F1 scores
+    - For Date Arithmetic: parses dates and compares them (the day is normalized to 1)
+    - For TimeDial: extracts option letters (A-D) and computes set-based exact match and F1

  ## How to Use

+ Load the metric with the `evaluate` library:
+
+ ```python
+ import evaluate
+
+ metric = evaluate.load("aauss/timebench_eval")
+
+ # Example for the Date Arithmetic task
+ predictions = [
+     "Let me solve this step by step... Thus, the correct answer is: Aug, 1987.",
+     "Calculating the date... Thus, the correct answer is: January 2020.",
+ ]
+ references = ["Aug, 1987", "Feb, 2020"]
+
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     task="Date Arithmetic",
+ )
+ print(result)
+ # {"exact_match": [1, 0]}
+
+ # Example for the TempReason/TimeQA/MenatQA tasks
+ predictions = [
+     "Based on the context... Thus, the correct answer is: Cardiff City.",
+     "The answer cannot be determined. Thus, the correct answer is: unanswerable",
+ ]
+ references = ["Cardiff City", "unanswerable"]
+
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     task="MenatQA",
+ )
+ print(result)
+ # {"exact_match": [1.0, 1.0], "f1": [1.0, 1.0]}
+
+ # Example for the TimeDial task (multiple choice)
+ predictions = [
+     "Options B and C are correct. Thus, the correct answer is: B, C.",
+ ]
+ references = ["B. No more than ten minutes && C. No more than five minutes"]
+
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     task="TimeDial",
+ )
+ print(result)
+ # {"exact_match": [1], "f1": [1.0]}
+ ```

  ### Inputs
+
+ - **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the answer after the marker `"Thus, the correct answer is:"`.
+ - **references** (`list` of `str`): List of reference answers.
+ - **task** (`str`): The task type being evaluated. Must be one of:
+   - `"TempReason"`: temporal reasoning QA
+   - `"TimeQA"`: time-based QA
+   - `"MenatQA"`: Multiple Sensitive Factors Time QA
+   - `"Date Arithmetic"`: date calculation tasks
+   - `"TimeDial"`: dialogue-based temporal multiple choice

  ### Output Values

+ The metric returns a dictionary with the following keys (depending on the task):
+
+ - **exact_match** (`list` of `float` or `int`): Exact match score for each prediction (0 or 1).
+ - **f1** (`list` of `float`): F1 score for each prediction (0.0 to 1.0). Returned for all tasks except Date Arithmetic.
+
+ Scores range from 0.0 to 1.0, with higher values indicating better performance.

  #### Values from Popular Papers

+ Refer to the [original TimeBench paper](https://arxiv.org/abs/2311.17667) for baseline performance values across various language models.

  ## Limitations and Bias
+
+ - The metric relies on the marker `"Thus, the correct answer is:"` to extract answers. If the model output does not follow this exact format, extraction fails and returns `None`.
+ - For Date Arithmetic, dates are parsed using `dateutil.parser` with the day normalized to 1. Unparseable dates result in `None` comparisons.
+ - For TimeDial, only options A-D are recognized. Extraction looks for standalone letters at word boundaries.
+ - The metric assumes predictions and references are properly aligned (lists of the same length).

  ## Citation
+
+ ```bibtex
+ @software{abbood2026timebench_eval,
+   title={TimeBench Eval},
+   author={Abbood, Auss},
+   year={2026},
+   url={https://huggingface.co/spaces/aauss/timebench_eval}
+ }
+ ```

  ## Further References
+
+ - [TimeBench paper](https://arxiv.org/abs/2311.17667)
+ - [ADeLe battery dataset, which adopts TimeBench](https://huggingface.co/datasets/CFI-Kinds-of-Intelligence/ADeLe_battery_v1dot0)
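
To make the extraction and Date Arithmetic behavior described in the new README concrete, here is a minimal illustrative sketch. The helper names are hypothetical (the actual implementation lives in `timebench_eval.py`), and it assumes `dateutil` parsing with the day normalized to 1, as stated in the Limitations section:

```python
import re
from datetime import datetime

from dateutil import parser
from dateutil.parser import ParserError

# Marker that model outputs are expected to contain (see Metric Description above).
ANSWER_MARKER = "Thus, the correct answer is:"


def extract_answer(response: str) -> str | None:
    """Return the text after the answer marker, or None if the marker is missing."""
    match = re.search(re.escape(ANSWER_MARKER) + r"\s*(.+)", response, flags=re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip().rstrip(".")


def dates_match(prediction: str, reference: str) -> bool:
    """Parse both strings as dates, normalize the day to 1, and compare them."""
    default = datetime(1900, 1, 1)  # fills in any fields missing from the string
    try:
        pred = parser.parse(prediction, default=default).replace(day=1)
        ref = parser.parse(reference, default=default).replace(day=1)
    except (ParserError, OverflowError):
        return False
    return pred == ref


answer = extract_answer("Let me solve this... Thus, the correct answer is: Aug, 1987.")
print(dates_match(answer, "Aug, 1987"))  # True: both parse to 1987-08-01
```

Normalizing the day makes month-level answers such as `Aug, 1987` comparable even when the model never states a specific day.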
pyproject.toml CHANGED
@@ -1,13 +1,21 @@
  [project]
  name = "timebench-eval"
  version = "0.1.0"
- description = "Add your description here"
  readme = "README.md"
  requires-python = ">=3.12"
  dependencies = [
      "evaluate==0.4.6",
      "huggingface-hub<0.25",
-     "pip>=25.3",
      "pytest>=9.0.2",
      "ruff>=0.14.11",
  ]
  [project]
  name = "timebench-eval"
  version = "0.1.0"
+ description = "Evaluation metric for the TimeBench temporal reasoning benchmark"
  readme = "README.md"
  requires-python = ">=3.12"
  dependencies = [
      "evaluate==0.4.6",
      "huggingface-hub<0.25",
+     "python-dateutil>=2.8",
+     "datasets>=2.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
      "pytest>=9.0.2",
      "ruff>=0.14.11",
  ]
+
+ [tool.pytest.ini_options]
+ pythonpath = ["."]
timebench_eval.py CHANGED
@@ -11,86 +11,75 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
- """TODO: Add a description here."""

  import re
  from datetime import datetime

  from dateutil import parser
  from dateutil.parser import ParserError

- import evaluate
- import datasets
-
-
- # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
- title = {A great new module},
- authors={huggingface, Inc.},
- year={2020}
  }
  """

- # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
  """


- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
  Args:
-     predictions: list of predictions to score. Each predictions
-         should be a string with tokens separated by spaces.
-     references: list of reference for each prediction. Each
-         reference should be a string with tokens separated by spaces.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
  Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.
-
-     >>> my_new_module = evaluate.load("my_new_module")
-     >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
      >>> print(results)
-     {'accuracy': 1.0}
  """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-

  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class TimebenchEval(evaluate.Metric):
-     """TODO: Short description of my evaluation module."""

      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.squad_metric = evaluate.load("squad")

      def _info(self):
-         # TODO: Specifies the evaluate.EvaluationModuleInfo object
          return evaluate.MetricInfo(
-             # This is the description that will appear on the modules page.
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
-             # This defines the format of each prediction and reference
              features=datasets.Features(
                  {
                      "predictions": datasets.Value("string"),
                      "references": datasets.Value("string"),
                  }
              ),
-             # Homepage of the module for documentation
-             homepage="http://module.homepage",
-             # Additional links to the codebase or references
-             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-             reference_urls=["http://path.to.reference.url/new_module"],
          )

      def _compute(

  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
+ """Evaluation metric for the TimeBench temporal reasoning benchmark."""

  import re
  from datetime import datetime

+ import datasets
+ import evaluate
  from dateutil import parser
  from dateutil.parser import ParserError

  _CITATION = """\
+ @software{abbood2026timebench_eval,
+   title={TimeBench Eval},
+   author={Abbood, Auss},
+   year={2026},
+   url={https://huggingface.co/spaces/aauss/timebench_eval}
  }
  """

  _DESCRIPTION = """\
+ Evaluation metric for the TimeBench benchmark, which assesses temporal reasoning
+ abilities in large language models. Supports multiple task types including TempReason,
+ TimeQA, MenatQA, Date Arithmetic, and TimeDial.
  """


  _KWARGS_DESCRIPTION = """
+ Calculates evaluation metrics for temporal reasoning tasks.
  Args:
+     predictions: list of prediction strings from the model. Each prediction
+         should contain the marker "Thus, the correct answer is:" followed by the answer.
+     references: list of reference answer strings.
+     task: the task type, one of "TempReason", "TimeQA", "MenatQA", "Date Arithmetic", or "TimeDial".
  Returns:
+     exact_match: list of exact match scores (0 or 1) for each prediction.
+     f1: list of F1 scores for each prediction (for applicable tasks).
  Examples:
+     >>> timebench_eval = evaluate.load("aauss/timebench_eval")
+     >>> predictions = ["Let me think... Thus, the correct answer is: Aug, 1987."]
+     >>> references = ["Aug, 1987"]
+     >>> results = timebench_eval.compute(predictions=predictions, references=references, task="Date Arithmetic")
      >>> print(results)
+     {'exact_match': [1]}
  """


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class TimebenchEval(evaluate.Metric):
+     """Evaluation metric for TimeBench temporal reasoning tasks."""

      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.squad_metric = evaluate.load("squad")

      def _info(self):
          return evaluate.MetricInfo(
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
              features=datasets.Features(
                  {
                      "predictions": datasets.Value("string"),
                      "references": datasets.Value("string"),
                  }
              ),
+             homepage="https://huggingface.co/spaces/aauss/timebench_eval",
+             codebase_urls=["https://huggingface.co/spaces/aauss/timebench_eval/tree/main"],
+             reference_urls=["https://huggingface.co/datasets/ulab-ai/Time-Bench"],
          )

      def _compute(
@@ -117,6 +106,10 @@ class TimebenchEval(evaluate.Metric):
              return self._compare_dates(predictions, references)
          elif task == "TimeDial":
              return self._compute_timedial(predictions, references)

      @staticmethod
      def _extract_answer(response: str) -> str | None:

              return self._compare_dates(predictions, references)
          elif task == "TimeDial":
              return self._compute_timedial(predictions, references)
+         else:
+             raise ValueError(
+                 f"Unknown task: {task}. Expected one of: TempReason, TimeQA, MenatQA, Date Arithmetic, TimeDial"
+             )

      @staticmethod
      def _extract_answer(response: str) -> str | None:
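
For illustration, a minimal sketch of the set-based TimeDial scoring that the docstring and README describe: standalone option letters A-D are collected from both strings and compared as sets. The function names here are hypothetical; the module's actual `_compute_timedial` may differ in detail.

```python
import re


def extract_options(text: str) -> set[str]:
    """Collect standalone option letters A-D found at word boundaries."""
    return set(re.findall(r"\b([A-D])\b", text))


def timedial_scores(prediction: str, reference: str) -> tuple[int, float]:
    """Set-based exact match and F1 between predicted and reference option letters."""
    pred, ref = extract_options(prediction), extract_options(reference)
    exact_match = int(pred == ref)
    overlap = len(pred & ref)
    if overlap == 0:
        return exact_match, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return exact_match, 2 * precision * recall / (precision + recall)


print(timedial_scores(
    "Options B and C are correct. Thus, the correct answer is: B, C.",
    "B. No more than ten minutes && C. No more than five minutes",
))  # (1, 1.0)
```

The word-boundary pattern is why only standalone letters are recognized, as noted in the README's Limitations section.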
uv.lock CHANGED
@@ -628,15 +628,6 @@ wheels = [
      { url = "https://files.pythonhosted.org/packages/70/44/5191d2e4026f86a2a109053e194d3ba7a31a2d10a9c2348368c63ed4e85a/pandas-2.3.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:3869faf4bd07b3b66a9f462417d0ca3a9df29a9f6abd5d0d0dbab15dac7abe87", size = 13202175, upload-time = "2025-09-29T23:31:59.173Z" },
  ]

- [[package]]
- name = "pip"
- version = "25.3"
- source = { registry = "https://pypi.org/simple" }
- sdist = { url = "https://files.pythonhosted.org/packages/fe/6e/74a3f0179a4a73a53d66ce57fdb4de0080a8baa1de0063de206d6167acc2/pip-25.3.tar.gz", hash = "sha256:8d0538dbbd7babbd207f261ed969c65de439f6bc9e5dbd3b3b9a77f25d95f343", size = 1803014, upload-time = "2025-10-25T00:55:41.394Z" }
- wheels = [
-     { url = "https://files.pythonhosted.org/packages/44/3c/d717024885424591d5376220b5e836c2d5293ce2011523c9de23ff7bf068/pip-25.3-py3-none-any.whl", hash = "sha256:9655943313a94722b7774661c21049070f6bbb0a1516bf02f7c8d5d9201514cd", size = 1778622, upload-time = "2025-10-25T00:55:39.247Z" },
- ]
-
  [[package]]
  name = "pluggy"
  version = "1.6.0"

      { url = "https://files.pythonhosted.org/packages/70/44/5191d2e4026f86a2a109053e194d3ba7a31a2d10a9c2348368c63ed4e85a/pandas-2.3.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:3869faf4bd07b3b66a9f462417d0ca3a9df29a9f6abd5d0d0dbab15dac7abe87", size = 13202175, upload-time = "2025-09-29T23:31:59.173Z" },
  ]

  [[package]]
  name = "pluggy"
  version = "1.6.0"
@@ -920,21 +911,28 @@ name = "timebench-eval"
  version = "0.1.0"
  source = { virtual = "." }
  dependencies = [
      { name = "evaluate" },
      { name = "huggingface-hub" },
-     { name = "pip" },
      { name = "pytest" },
      { name = "ruff" },
  ]

  [package.metadata]
  requires-dist = [
      { name = "evaluate", specifier = "==0.4.6" },
      { name = "huggingface-hub", specifier = "<0.25" },
-     { name = "pip", specifier = ">=25.3" },
-     { name = "pytest", specifier = ">=9.0.2" },
-     { name = "ruff", specifier = ">=0.14.11" },
  ]

  [[package]]
  name = "tqdm"

  version = "0.1.0"
  source = { virtual = "." }
  dependencies = [
+     { name = "datasets" },
      { name = "evaluate" },
      { name = "huggingface-hub" },
+     { name = "python-dateutil" },
+ ]
+
+ [package.optional-dependencies]
+ dev = [
      { name = "pytest" },
      { name = "ruff" },
  ]

  [package.metadata]
  requires-dist = [
+     { name = "datasets", specifier = ">=2.0" },
      { name = "evaluate", specifier = "==0.4.6" },
      { name = "huggingface-hub", specifier = "<0.25" },
+     { name = "pytest", marker = "extra == 'dev'", specifier = ">=9.0.2" },
+     { name = "python-dateutil", specifier = ">=2.8" },
+     { name = "ruff", marker = "extra == 'dev'", specifier = ">=0.14.11" },
  ]
+ provides-extras = ["dev"]

  [[package]]
  name = "tqdm"