aauss committed
Commit 6bb843b · 1 Parent(s): 2664bc6

Implement and test metric for TempReason, TimeQA, MenatQA and Date Arithmetic.
.gitignore ADDED
@@ -0,0 +1,10 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
pyproject.toml ADDED
@@ -0,0 +1,13 @@
+ [project]
+ name = "timebench-eval"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     "evaluate==0.4.6",
+     "huggingface-hub<0.25",
+     "pip>=25.3",
+     "pytest>=9.0.2",
+     "ruff>=0.14.11",
+ ]
tests/conftest.py ADDED
@@ -0,0 +1,109 @@
+ from textwrap import dedent
+
+ PREDICTION_1 = dedent("""\
+     Let's think step by step:
+
+     Brayan Perea was born on 25 February 1993.
+     He began his professional career with Deportivo Cali, making his first team debut in 2011.
+     In 2013, he joined Lazio (an Italian Serie A club) on a five-year deal for €2.5 million.
+     His debut for Lazio was on 25 September 2013.
+     The question asks which team he played for in November 2014.
+     From the context, it states: "Perea spent the following campaigns out on loan, representing Perugia, Troyes AC and CD Lugo."
+     "The following campaigns" refers to after joining Lazio in 2013.
+     So, after joining Lazio in 2013, he went on loan during the 2013–2014 and 2014–2015 seasons.
+     The loans were to Perugia, Troyes AC, and CD Lugo.
+     Since the timeline is sequential:
+     ...
+
+     However, we must answer based only on the provided context.
+
+     And since the text says "spend the following campaigns out on loan, representing Perugia, Troyes AC and CD Lugo" — in that sequence — and November 2014 falls in the 2014–2015 season (second campaign), then the club would be Troyes AC.
+
+     Therefore, in November 2014, he was playing for Troyes AC.
+
+     Thus, the correct answer is: Troyes AC.
+
+     Thus, the correct answer is: Troyes AC.""")
+
+ PREDICTION_2 = dedent("""\
+     Let's solve this step by step:
+
+     We are given a date: August 1814.
+     We are asked to find the date that is 10 years before that.
+     Subtracting 10 years from 1814:
+     1814 − 10 = 1804.
+     The month remains the same (August), since we're only subtracting years, not months or days.
+     Therefore, 10 years before August 1814 is August 1804.
+     Thus, the correct answer is: August 1804.""")
+
+ PREDICTION_3 = dedent("""\
+     Step 1: Identify the person in question — John Williams (footballer, born 1968).
+
+     Step 2: Determine which teams he played for according to the context.
+
+     The context lists the teams he played for:
+
+     Premier League: Coventry City
+     Football League: Swansea City, Notts County, Stoke City, Wycombe Wanderers, Hereford United, Walsall, Exeter City, Cardiff City, York City, Darlington
+     Non-league: Cradley Town, Kidderminster Harriers, Bath City, Redditch United, Evesham United, Weston-super-Mare, Stourbridge, Willenhall Town, Boldmere St. Michaels
+     Step 3: Consider the timeline provided: from 1998 to 2000, and that he released Cardiff City from his contract in 2002.
+
+     Note: The release from Cardiff City in 2002 implies that he was playing for Cardiff City at some point prior to 2002.
+
+     Step 4: Determine which team he played for between 1998 and 2000.
+
+     Since the list does not include specific dates or timelines for each team, we must infer based on logical sequence and context. However, the given context does not provide any specific start or end years for the clubs.
+
+     We know he played for Cardiff City and left in 2002 — so he must have been with Cardiff City at some point before 2002. But we are asked about 1998–2000.
+
+     Can we determine which team he played for during that window?
+
+     The context does not specify when he joined or left any of these clubs, nor does it provide dates for transfers. Therefore, we cannot definitively assign a team to the period 1998–2000 based solely on the given information.
+
+     Additionally, the question implies that he released Cardiff City in 2002 — meaning he was with them before that. That means he could have been at Cardiff City during 1998–2000. But that is speculation.
+
+     Since the context doesn't provide any details on transfer dates or timelines for individual clubs, and we cannot determine which team he was at between 1998 and 2000 based on the information provided, we cannot with certainty assign a team to that time period.
+
+     Hence, the information is insufficient to determine which team John Williams played for between 1998 and 2000.
+
+     Therefore, the answer cannot be determined from the context.
+
+     Thus, the correct answer is: [unanswerable]
+     """)
+
+ PREDICTION_4 = dedent("""\
+     Let's break this down step by step:
+
+     The question is about John Williams (footballer, born 1968) and specifically which team he played for between 1998 and 2000.
+
+     We are told that he released Cardiff City from his contract in 2002. This implies that he was at Cardiff City before 2002, and likely during or before the 1998–2000 timeframe.
+
+     However, the context provided lists the clubs he played for, which are:
+
+     Coventry City (Premier League)
+     Swansea City
+     Notts County
+     Stoke City
+     Wycombe Wanderers
+     Hereford United
+     Walsall
+     Exeter City
+     Cardiff City
+     York City
+     Darlington
+     Non-league clubs: Cradley Town, Kidderminster Harriers, Bath City, Redditch United, Evesham United, Weston-super-Mare, Stourbridge, Willenhall Town, Boldmere St. Michaels
+     The question asks which team he played for from 1998 to 2000. Since he left Cardiff City in 2002 (released from contract), he must have been at Cardiff City during or before that time. But we are not given exact dates for his tenure at each club.
+
+     However, the club "Cardiff City" appears in the list, and since he left in 2002, it is logical that he played for Cardiff City around the 1998–2000 period.
+
+     The key point: the question is specifically asking about 1998 to 2000. There is no mention of any specific team during that window in the context — all clubs are listed, but with no timeframes.
+
+     Given that Cardiff City is the only club explicitly mentioned that he left in 2002, and assuming he was with them during that timeframe unless stated otherwise, it is plausible that he played for Cardiff City from 1998 to 2000.
+
+     But the context does not provide any dates for when he joined or left any of the clubs. So, without timeframes, we cannot definitively assign 1998–2000 to any club.
+
+     We have no evidence in the context that he played for any team between 1998 and 2000 other than those listed.
+
+     Since the only team he left in 2002 is Cardiff City, and the timeframe in question is 1998–2000, it is the most likely candidate.
+
+     Thus, the correct answer is: Cardiff City.""")
tests/test_answer_extraction.py ADDED
@@ -0,0 +1,16 @@
+ import pytest
+ from timebench_eval.timebench_eval import TimebenchEval
+ from conftest import PREDICTION_1, PREDICTION_2, PREDICTION_3, PREDICTION_4
+
+
+ @pytest.mark.parametrize(
+     "prediction,extracted_answer",
+     [
+         (PREDICTION_1, "Troyes AC"),
+         (PREDICTION_2, "August 1804"),
+         (PREDICTION_3, "unanswerable"),
+         (PREDICTION_4, "Cardiff City"),
+     ],
+ )
+ def test_answer_extraction(prediction, extracted_answer):
+     assert TimebenchEval._extract_answer(prediction) == extracted_answer
tests/test_metrics.py ADDED
@@ -0,0 +1,58 @@
+ from timebench_eval.timebench_eval import TimebenchEval
+ import pytest
+ from conftest import PREDICTION_1, PREDICTION_2, PREDICTION_3, PREDICTION_4
+
+
+ @pytest.mark.parametrize(
+     "prediction,reference,task,expected_metrics",
+     [
+         (
+             PREDICTION_1,
+             "Troyes AC",
+             "TempReason",
+             {
+                 "exact_match": [1],
+                 "f1": [1],
+             },
+         ),
+         (
+             PREDICTION_2,
+             "Aug, 1804",
+             "Date Arithmetic",
+             {
+                 "exact_match": [1],
+             },
+         ),
+         (
+             PREDICTION_3,
+             "unanswerable",
+             "MenatQA",
+             {
+                 "exact_match": [1],
+                 "f1": [1],
+             },
+         ),
+         (
+             PREDICTION_4,
+             "Cardiff City",
+             "MenatQA",
+             {
+                 "exact_match": [1],
+                 "f1": [1],
+             },
+         ),
+     ],
+ )
+ def test_eval(prediction, reference, task, expected_metrics):
+     metrics = TimebenchEval()._compute([prediction], [reference], task)
+     assert metrics == expected_metrics
+
+
+ def test_eval_many():
+     metrics = TimebenchEval()._compute(
+         [PREDICTION_3, PREDICTION_4], ["unanswerable", "Cardiff City"], "MenatQA"
+     )
+     assert metrics == {
+         "exact_match": [1, 1],
+         "f1": [1, 1],
+     }
timebench_eval/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
timebench_eval/__init__.py ADDED
File without changes
timebench_eval/app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("aauss/timebench_eval")
+ launch_gradio_widget(module)
timebench_eval/timebench_eval.py ADDED
@@ -0,0 +1,161 @@
+ # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """Evaluation metric for the TempReason, TimeQA, MenatQA and Date Arithmetic tasks."""
+
+ from dateutil import parser
+ from dateutil.parser import ParserError
+
+
+ import evaluate
+ import datasets
+ import numpy as np
+
+
+ # TODO: Add BibTeX citation
+ _CITATION = """\
+ @InProceedings{huggingface:module,
+ title = {A great new module},
+ authors={huggingface, Inc.},
+ year={2020}
+ }
+ """
+
+ _DESCRIPTION = """\
+ Scores model responses for the TempReason, TimeQA, MenatQA and Date Arithmetic tasks.
+ The final answer is extracted from each response and compared with the reference answer.
+ """
+
+
+ _KWARGS_DESCRIPTION = """
+ Scores model responses against reference answers for one task.
+ Args:
+     predictions: list of model responses. The answer is read from the text
+         that follows the last "Thus, the correct answer is:" in each response.
+     references: list of reference answers, one per prediction.
+     task: one of "TempReason", "TimeQA", "MenatQA" or "Date Arithmetic".
+ Returns:
+     exact_match: list with one exact-match score (0 or 1) per prediction,
+     f1: list with one token-level F1 score per prediction
+         (not returned for "Date Arithmetic"),
+ Examples:
+     >>> timebench_eval = evaluate.load("aauss/timebench_eval")
+     >>> results = timebench_eval.compute(
+     ...     predictions=["Thus, the correct answer is: August 1804."],
+     ...     references=["Aug, 1804"],
+     ...     task="Date Arithmetic",
+     ... )
+     >>> print(results)
+     {'exact_match': [1]}
+ """
+
+ # TODO: Define external resources urls if needed
+ BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class TimebenchEval(evaluate.Metric):
+     """Exact-match and F1 scoring for TimeBench-style temporal reasoning answers."""
+
+     def __init__(self, *args, **kwargs):
+         super().__init__(*args, **kwargs)
+         self.squad_metric = evaluate.load("squad")
+
+     def _info(self):
+         # TODO: Specifies the evaluate.EvaluationModuleInfo object
+         return evaluate.MetricInfo(
+             # This is the description that will appear on the modules page.
+             module_type="metric",
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             # Predictions and references are free-text strings, not integers.
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string"),
+                     "references": datasets.Value("string"),
+                 }
+             ),
+             # Homepage of the module for documentation
+             homepage="http://module.homepage",
+             # Additional links to the codebase or references
+             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
+             reference_urls=["http://path.to.reference.url/new_module"],
+         )
+
+     @staticmethod
+     def _extract_answer(response: str) -> str | None:
+         """Extract the answer from the response"""
+         marker = "Thus, the correct answer is:"
+
+         if marker not in response:
+             return None
+         answer = response.split(marker)[-1]
+         # Take only the first line (stops at newlines if model continues)
+         answer = answer.strip().split("\n")[0]
+         answer = answer.rstrip(".!?").strip()
+         if "unanswerable" in answer.lower():
+             return "unanswerable"
+         return answer or None
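To make the extraction rule easier to check in isolation, here is a minimal standalone sketch of the same logic. The names `MARKER` and `extract_answer` are hypothetical and exist only for this illustration; the marker string is copied from the method above, and `rpartition` is swapped in for `split(...)[-1]` as an equivalent way to keep the text after the last marker.

```python
from typing import Optional

# Marker string copied from the diff above; the constant name is hypothetical.
MARKER = "Thus, the correct answer is:"


def extract_answer(response: str) -> Optional[str]:
    """Return the answer that follows the last marker, or None if there is none."""
    if MARKER not in response:
        return None
    # rpartition keeps only the text after the last occurrence of the marker.
    tail = response.rpartition(MARKER)[2].strip()
    # Keep the first line only, then drop trailing sentence punctuation.
    answer = tail.split("\n")[0].rstrip(".!?").strip()
    if "unanswerable" in answer.lower():
        return "unanswerable"
    return answer or None


if __name__ == "__main__":
    print(extract_answer("Reasoning... Thus, the correct answer is: Troyes AC."))
    print(extract_answer("Thus, the correct answer is: [unanswerable]\n"))
    print(extract_answer("A response without the marker"))
```

Because only the last marker counts, a response that repeats the closing sentence still yields a single answer, and a response without the marker yields `None`, which callers need to handle explicitly.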
+
+     def _call_squad(self, predictions, references):
+         """Score each prediction against its reference with the SQuAD metric."""
+         exact_matches = []
+         f1_scores = []
+
+         for pred, ref in zip(predictions, references):
+             # A response without the answer marker is scored as an empty answer.
+             formatted_pred = [
+                 {"id": "0", "prediction_text": self._extract_answer(pred) or ""}
+             ]
+             # References are gold answers and carry no marker, so use them as-is.
+             formatted_ref = [
+                 {"id": "0", "answers": {"text": [ref], "answer_start": [0]}}
+             ]
+
+             results = self.squad_metric.compute(
+                 predictions=formatted_pred, references=formatted_ref
+             )
+             # The SQuAD metric reports percentages; rescale to the range [0, 1].
+             exact_matches.append(results["exact_match"] / 100)
+             f1_scores.append(results["f1"] / 100)
+
+         return {
+             "exact_match": exact_matches,
+             "f1": f1_scores,
+         }
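For readers who want to see what the two scores measure, the following is a rough pure-Python sketch of SQuAD-style exact match and token-level F1. It follows the commonly described SQuAD v1.1 normalization (lowercase, strip punctuation and English articles, collapse whitespace) as an assumption; it is not the `evaluate` implementation, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if both strings are identical after normalization, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Troyes AC", "troyes ac."))
    print(token_f1("Cardiff City FC", "Cardiff City"))
```

The F1 score gives partial credit when the extracted answer contains extra or missing tokens, which is why the method above reports it alongside exact match for the question-answering tasks.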
+
+     @staticmethod
+     def _parse_historical_date(date_str):
+         """Parse a date string and normalize it to the first day of its month."""
+         if not date_str:
+             # No answer could be extracted, so there is nothing to parse.
+             return None
+         try:
+             return parser.parse(date_str).replace(day=1)
+         except (ParserError, OverflowError):
+             return None
+
+     def _compare_dates(self, predictions, references):
+         predictions = [
+             self._parse_historical_date(self._extract_answer(pred))
+             for pred in predictions
+         ]
+         references = [self._parse_historical_date(ref) for ref in references]
+         return {
+             "exact_match": [
+                 # Two unparseable dates (both None) must not count as a match.
+                 1 if pred is not None and pred == ref else 0
+                 for pred, ref in zip(predictions, references)
+             ],
+         }
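The month-level comparison can also be sketched with the standard library only. This version swaps `dateutil` for `datetime.strptime` with an explicit list of formats, so it is stricter than the method above; the format list and the names `MONTH_YEAR_FORMATS`, `parse_month_year` and `same_month` are assumptions made for illustration, and English month names assume the default C locale.

```python
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical format list; extend it to match the date styles in your data.
MONTH_YEAR_FORMATS = ("%B %Y", "%b %Y", "%B, %Y", "%b, %Y")


def parse_month_year(text: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (year, month) for a month-year string, or None if it does not parse."""
    if not text:
        return None
    for fmt in MONTH_YEAR_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # Try the next format.
        return parsed.year, parsed.month
    return None


def same_month(prediction: Optional[str], reference: Optional[str]) -> int:
    """1 if both strings parse to the same year and month, else 0."""
    pred = parse_month_year(prediction)
    ref = parse_month_year(reference)
    # An unparseable prediction never matches, even if the reference also fails.
    return int(pred is not None and pred == ref)


if __name__ == "__main__":
    print(same_month("August 1804", "Aug, 1804"))
    print(same_month("August 1804", "July 1804"))
```

Comparing `(year, month)` tuples makes the intent explicit: the day is ignored, so differently formatted spellings of the same month compare equal.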
+
+     def _compute(self, predictions, references, task: str):
+         """Returns the scores"""
+         if task in [
+             "TempReason",
+             "TimeQA",
+             "MenatQA",
+         ]:
+             return self._call_squad(predictions, references)
+         elif task == "Date Arithmetic":
+             return self._compare_dates(predictions, references)
+         # Fail loudly instead of silently returning None for an unknown task.
+         raise ValueError(f"Unsupported task: {task!r}")