Spaces: aauss/timebench_eval

aauss committed on Jan 13

Commit

6bb843b

1 Parent(s): 2664bc6

Implement and test metric for TempReason, TimeQA, MenatQA and Date Arithmetic.


Files changed (10)

.gitignore +10 -0
.python-version +1 -0
pyproject.toml +13 -0
tests/conftest.py +109 -0
tests/test_answer_extraction.py +16 -0
tests/test_metrics.py +58 -0
timebench_eval/.gitattributes +35 -0
timebench_eval/__init__.py +0 -0
timebench_eval/app.py +6 -0
timebench_eval/timebench_eval.py +161 -0

.gitignore ADDED

	@@ -0,0 +1,10 @@

+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+# Virtual environments
+.venv

.python-version ADDED

	@@ -0,0 +1 @@

+3.12

pyproject.toml ADDED

	@@ -0,0 +1,13 @@

+[project]
+name = "timebench-eval"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.12"
+dependencies = [
+    "evaluate==0.4.6",
+    "huggingface-hub<0.25",
+    "pip>=25.3",
+    "pytest>=9.0.2",
+    "ruff>=0.14.11",
+]

tests/conftest.py ADDED

	@@ -0,0 +1,109 @@

+from textwrap import dedent
+PREDICTION_1 = dedent("""\
+    Let's think step by step:
+    Brayan Perea was born on 25 February 1993.
+    He began his professional career with Deportivo Cali, making his first team debut in 2011.
+    In 2013, he joined Lazio (an Italian Serie A club) on a five-year deal for €2.5 million.
+    His debut for Lazio was on 25 September 2013.
+    The question asks which team he played for in November 2014.
+    From the context, it states: "Perea spent the following campaigns out on loan, representing Perugia, Troyes AC and CD Lugo."
+    "The following campaigns" refers to after joining Lazio in 2013.
+    So, after joining Lazio in 2013, he went on loan during the 2013–2014 and 2014–2015 seasons.
+    The loans were to Perugia, Troyes AC, and CD Lugo.
+    Since the timeline is sequential:
+    ...
+    However, we must answer based only on the provided context.
+    And since the text says "spend the following campaigns out on loan, representing Perugia, Troyes AC and CD Lugo" — in that sequence — and November 2014 falls in the 2014–2015 season (second campaign), then the club would be Troyes AC.
+    Therefore, in November 2014, he was playing for Troyes AC.
+    Thus, the correct answer is: Troyes AC.
+    Thus, the correct answer is: Troyes AC.""")
+PREDICTION_2 = dedent("""\
+    Let's solve this step by step:
+    We are given a date: August 1814.
+    We are asked to find the date that is 10 years before that.
+    Subtracting 10 years from 1814:
+    1814 − 10 = 1804.
+    The month remains the same (August), since we're only subtracting years, not months or days.
+    Therefore, 10 years before August 1814 is August 1804.
+    Thus, the correct answer is: August 1804.""")
+PREDICTION_3 = dedent("""\
+    Step 1: Identify the person in question — John Williams (footballer, born 1968).
+    Step 2: Determine which teams he played for according to the context.
+    The context lists the teams he played for:
+    Premier League: Coventry City
+    Football League: Swansea City, Notts County, Stoke City, Wycombe Wanderers, Hereford United, Walsall, Exeter City, Cardiff City, York City, Darlington
+    Non-league: Cradley Town, Kidderminster Harriers, Bath City, Redditch United, Evesham United, Weston-super-Mare, Stourbridge, Willenhall Town, Boldmere St. Michaels
+    Step 3: Consider the timeline provided: from 1998 to 2000, and that he released Cardiff City from his contract in 2002.
+    Note: The release from Cardiff City in 2002 implies that he was playing for Cardiff City at some point prior to 2002.
+    Step 4: Determine which team he played for between 1998 and 2000.
+    Since the list does not include specific dates or timelines for each team, we must infer based on logical sequence and context. However, the given context does not provide any specific start or end years for the clubs.
+    We know he played for Cardiff City and left in 2002 — so he must have been with Cardiff City at some point before 2002. But we are asked about 1998–2000.
+    Can we determine which team he played for during that window?
+    The context does not specify when he joined or left any of these clubs, nor does it provide dates for transfers. Therefore, we cannot definitively assign a team to the period 1998–2000 based solely on the given information.
+    Additionally, the question implies that he released Cardiff City in 2002 — meaning he was with them before that. That means he could have been at Cardiff City during 1998–2000. But that is speculation.
+    Since the context doesn't provide any details on transfer dates or timelines for individual clubs, and we cannot determine which team he was at between 1998 and 2000 based on the information provided, we cannot with certainty assign a team to that time period.
+    Hence, the information is insufficient to determine which team John Williams played for between 1998 and 2000.
+    Therefore, the answer cannot be determined from the context.
+    Thus, the correct answer is: [unanswerable]
+""")
+PREDICTION_4 = dedent("""\
+    Let's break this down step by step:
+    The question is about John Williams (footballer, born 1968) and specifically which team he played for between 1998 and 2000.
+    We are told that he released Cardiff City from his contract in 2002. This implies that he was at Cardiff City before 2002, and likely during or before the 1998–2000 timeframe.
+    However, the context provided lists the clubs he played for, which are:
+    Coventry City (Premier League)
+    Swansea City
+    Notts County
+    Stoke City
+    Wycombe Wanderers
+    Hereford United
+    Walsall
+    Exeter City
+    Cardiff City
+    York City
+    Darlington
+    Non-league clubs: Cradley Town, Kidderminster Harriers, Bath City, Redditch United, Evesham United, Weston-super-Mare, Stourbridge, Willenhall Town, Boldmere St. Michaels
+    The question asks which team he played for from 1998 to 2000. Since he left Cardiff City in 2002 (released from contract), he must have been at Cardiff City during or before that time. But we are not given exact dates for his tenure at each club.
+    However, the club "Cardiff City" appears in the list, and since he left in 2002, it is logical that he played for Cardiff City around the 1998–2000 period.
+    The key point: the question is specifically asking about 1998 to 2000. There is no mention of any specific team during that window in the context — all clubs are listed, but with no timeframes.
+    Given that Cardiff City is the only club explicitly mentioned that he left in 2002, and assuming he was with them during that timeframe unless stated otherwise, it is plausible that he played for Cardiff City from 1998 to 2000.
+    But the context does not provide any dates for when he joined or left any of the clubs. So, without timeframes, we cannot definitively assign 1998–2000 to any club.
+    We have no evidence in the context that he played for any team between 1998 and 2000 other than those listed.
+    Since the only team he left in 2002 is Cardiff City, and the timeframe in question is 1998–2000, it is the most likely candidate.
+    Thus, the correct answer is: Cardiff City.""")

tests/test_answer_extraction.py ADDED

	@@ -0,0 +1,16 @@

+import pytest
+from timebench_eval.timebench_eval import TimebenchEval
+from conftest import PREDICTION_1, PREDICTION_2, PREDICTION_3, PREDICTION_4
+@pytest.mark.parametrize(
+    "prediction,extracted_answer",
+    [
+        (PREDICTION_1, "Troyes AC"),
+        (PREDICTION_2, "August 1804"),
+        (PREDICTION_3, "unanswerable"),
+        (PREDICTION_4, "Cardiff City"),
+    ],
+)
+def test_answer_extraction(prediction, extracted_answer):
+    assert TimebenchEval._extract_answer(prediction) == extracted_answer
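
The extraction behaviour these tests exercise can be tried without installing `evaluate`. The following is a minimal standalone sketch that mirrors the logic of `_extract_answer` from this commit; the function name `extract_answer` and the constant `MARKER` are illustrative and not part of the module.

```python
from typing import Optional

# Marker the model is expected to emit before its final answer.
MARKER = "Thus, the correct answer is:"


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last marker, or None if no answer is found."""
    if MARKER not in response:
        return None
    # Keep only what follows the final occurrence of the marker.
    tail = response.split(MARKER)[-1]
    # Stop at the first line break in case the model keeps generating.
    answer = tail.strip().split("\n")[0]
    # Drop trailing sentence punctuation and surrounding whitespace.
    answer = answer.rstrip(".!?").strip()
    # Any answer mentioning "unanswerable" collapses to one canonical label.
    if "unanswerable" in answer.lower():
        return "unanswerable"
    return answer or None


print(extract_answer("Reasoning...\nThus, the correct answer is: Troyes AC."))  # Troyes AC
print(extract_answer("Thus, the correct answer is: [unanswerable]\n"))  # unanswerable
print(extract_answer("No final answer given."))  # None
```

Splitting on the last occurrence of the marker means a response that repeats its conclusion, as `PREDICTION_1` does, still yields a single answer.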

tests/test_metrics.py ADDED

	@@ -0,0 +1,58 @@

+from timebench_eval.timebench_eval import TimebenchEval
+import pytest
+from conftest import PREDICTION_1, PREDICTION_2, PREDICTION_3, PREDICTION_4
+@pytest.mark.parametrize(
+    "prediction,reference,task,expected_metrics",
+    [
+        (
+            PREDICTION_1,
+            "Troyes AC",
+            "TempReason",
+            {
+                "exact_match": [1],
+                "f1": [1],
+            },
+        ),
+        (
+            PREDICTION_2,
+            "Aug, 1804",
+            "Date Arithmetic",
+            {
+                "exact_match": [1],
+            },
+        ),
+        (
+            PREDICTION_3,
+            "unanswerable",
+            "MenatQA",
+            {
+                "exact_match": [1],
+                "f1": [1],
+            },
+        ),
+        (
+            PREDICTION_4,
+            "Cardiff City",
+            "MenatQA",
+            {
+                "exact_match": [1],
+                "f1": [1],
+            },
+        ),
+    ],
+)
+def test_eval(prediction, reference, task, expected_metrics):
+    metrics = TimebenchEval()._compute([prediction], [reference], task)
+    assert metrics == expected_metrics
+def test_eval_many():
+    metrics = TimebenchEval()._compute(
+        [PREDICTION_3, PREDICTION_4], ["unanswerable", "Cardiff City"], "MenatQA"
+    )
+    assert metrics == {
+        "exact_match": [1, 1],
+        "f1": [1, 1],
+    }
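
For readers unfamiliar with the `exact_match` and `f1` values asserted above, here is a rough sketch of SQuAD-style scoring, assuming the usual SQuAD v1.1 normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace). It approximates what the `squad` metric loaded by this module computes and is not the module's own code; all function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Troyes AC", "troyes ac"))  # 1.0
print(round(token_f1("Cardiff City FC", "Cardiff City"), 2))  # 0.8
```

Because normalization strips punctuation, an extracted answer such as `[unanswerable]` would match the reference `unanswerable` even without the explicit collapsing done in `_extract_answer`.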

timebench_eval/.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

timebench_eval/__init__.py ADDED

File without changes

timebench_eval/app.py ADDED

	@@ -0,0 +1,6 @@

+import evaluate
+from evaluate.utils import launch_gradio_widget
+module = evaluate.load("aauss/timebench_eval")
+launch_gradio_widget(module)

timebench_eval/timebench_eval.py ADDED

	@@ -0,0 +1,161 @@

+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Evaluation metric for the TempReason, TimeQA, MenatQA and Date Arithmetic tasks."""
+from dateutil import parser
+from dateutil.parser import ParserError
+import evaluate
+import datasets
+import numpy as np
+# TODO: Add BibTeX citation
+_CITATION = """\
+@InProceedings{huggingface:module,
+title = {A great new module},
+authors={huggingface, Inc.},
+year={2020}
+}
+"""
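+# Description of the module
+_DESCRIPTION = """\
+Scores model responses on TempReason, TimeQA and MenatQA with SQuAD exact match and F1, and on Date Arithmetic with a month-level exact date match. The final answer is read from the text after "Thus, the correct answer is:".
+"""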
+# Description of the arguments of the module
+_KWARGS_DESCRIPTION = """
+Scores model responses against reference answers for one task.
+Args:
+    predictions: list of model responses to score. Each response
+        should be a string that states its final answer after the
+        marker "Thus, the correct answer is:".
+    references: list of reference answers, one string per prediction.
+    task: one of "TempReason", "TimeQA", "MenatQA" or "Date Arithmetic".
+Returns:
+    exact_match: list with one score per prediction, 1 for a match and 0 otherwise,
+    f1: list of token-level F1 scores per prediction (not returned for "Date Arithmetic"),
+Examples:
+    >>> timebench_eval = evaluate.load("aauss/timebench_eval")
+    >>> results = timebench_eval.compute(
+    ...     predictions=["Thus, the correct answer is: Troyes AC."],
+    ...     references=["Troyes AC"],
+    ...     task="TempReason",
+    ... )
+"""
+# TODO: Define external resources urls if needed
+BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class TimebenchEval(evaluate.Metric):
+    """Scores responses for TempReason, TimeQA, MenatQA and Date Arithmetic."""
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.squad_metric = evaluate.load("squad")
+    def _info(self):
+        # TODO: Specifies the evaluate.EvaluationModuleInfo object
+        return evaluate.MetricInfo(
+            # This is the description that will appear on the modules page.
+            module_type="metric",
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            # This defines the format of each prediction and reference.
+            # Both are free-text strings, not integers.
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Value("string"),
+                    "references": datasets.Value("string"),
+                }
+            ),
+            # Homepage of the module for documentation
+            homepage="http://module.homepage",
+            # Additional links to the codebase or references
+            codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
+            reference_urls=["http://path.to.reference.url/new_module"],
+        )
+    @staticmethod
+    def _extract_answer(response: str) -> str | None:
+        """Extract the answer from the response"""
+        marker = "Thus, the correct answer is:"
+        if marker not in response:
+            return None
+        answer = response.split(marker)[-1]
+        # Take only the first line (stops at newlines if model continues)
+        answer = answer.strip().split("\n")[0]
+        answer = answer.rstrip(".!?").strip()
+        if "unanswerable" in answer.lower():
+            return "unanswerable"
+        return answer or None
+    def _call_squad(self, predictions, references):
+        """Score each prediction against its reference with the SQuAD metric."""
+        exact_matches = []
+        f1_scores = []
+        for pred, ref in zip(predictions, references):
+            # An empty string stands in for a missing answer so normalization does not fail.
+            formatted_pred = [
+                {"id": "0", "prediction_text": self._extract_answer(pred) or ""}
+            ]
+            # References are gold answers without the answer marker, so use them as-is.
+            formatted_ref = [
+                {"id": "0", "answers": {"text": [ref], "answer_start": [0]}}
+            ]
+            results = self.squad_metric.compute(
+                predictions=formatted_pred, references=formatted_ref
+            )
+            # The SQuAD metric reports percentages; rescale to the 0-1 range.
+            exact_matches.append(results["exact_match"] / 100)
+            f1_scores.append(results["f1"] / 100)
+        return {
+            "exact_match": exact_matches,
+            "f1": f1_scores,
+        }
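+    @staticmethod
+    def _parse_historical_date(date_str):
+        """Parse a date string to month precision; return None if it cannot be parsed."""
+        # _extract_answer may return None, which parser.parse does not accept.
+        if not date_str:
+            return None
+        try:
+            return parser.parse(date_str).replace(day=1)
+        except (ParserError, OverflowError):
+            return None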
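+    def _compare_dates(self, predictions, references):
+        """Compare predicted and reference dates at month precision."""
+        predictions = [
+            self._parse_historical_date(self._extract_answer(pred))
+            for pred in predictions
+        ]
+        references = [self._parse_historical_date(ref) for ref in references]
+        return {
+            # An unparseable prediction never counts as a match, even if the reference also fails to parse.
+            "exact_match": [
+                1 if pred is not None and pred == ref else 0
+                for pred, ref in zip(predictions, references)
+            ],
+        }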
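+    def _compute(self, predictions, references, task: str):
+        """Return per-example scores for the given task."""
+        if task in [
+            "TempReason",
+            "TimeQA",
+            "MenatQA",
+        ]:
+            return self._call_squad(predictions, references)
+        elif task == "Date Arithmetic":
+            return self._compare_dates(predictions, references)
+        # Fail loudly instead of silently returning None for an unknown task.
+        raise ValueError(f"Unsupported task: {task!r}")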
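
The Date Arithmetic branch compares dates at month precision, which is why the test reference "Aug, 1804" matches the prediction "August 1804". Below is a hedged, dependency-free sketch of that idea. It swaps `dateutil.parser` for the standard library's `datetime.strptime` with a small, assumed list of formats, so it accepts far fewer spellings than the module does; `FORMATS`, `parse_year_month` and `dates_match` are illustrative names.

```python
from datetime import datetime
from typing import Optional, Tuple

# Assumed formats for illustration only; dateutil accepts many more variants.
FORMATS = ("%B %Y", "%B, %Y", "%b %Y", "%b, %Y")


def parse_year_month(text: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (year, month) for a recognised date string, else None."""
    if not text:
        return None
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # Try the next format.
        return (parsed.year, parsed.month)
    return None


def dates_match(prediction: Optional[str], reference: Optional[str]) -> int:
    """1 if both strings parse to the same year and month, else 0."""
    pred = parse_year_month(prediction)
    ref = parse_year_month(reference)
    # A prediction that fails to parse never counts as a match.
    return int(pred is not None and pred == ref)


print(dates_match("August 1804", "Aug, 1804"))  # 1
print(dates_match("August 1805", "Aug, 1804"))  # 0
print(dates_match(None, "Aug, 1804"))  # 0
```

Reducing both sides to a `(year, month)` pair plays the same role as `.replace(day=1)` in `_parse_historical_date`: any day component is ignored before comparison.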