First commit.
- .gitignore +10 -0
- .python-version +1 -0
- README.md +76 -20
- pyproject.toml +14 -0
- tests/test_metric.py +18 -0
- tram_accuracy.py +41 -44
- uv.lock +0 -0
.gitignore
ADDED
@@ -0,0 +1,10 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
.python-version
ADDED
@@ -0,0 +1 @@
3.12
README.md
CHANGED
@@ -1,49 +1,105 @@
 ---
 title: TRAM Accuracy
-datasets:
+datasets:
+- Warrieryes/TRAM-Temporal
 tags:
-- evaluate
-- metric
-
+- evaluate
+- metric
+- temporal reasoning
+- multiple choice
+description: Accuracy metric for the (multiple choice) TRAM benchmark by Wang et al. (2024).
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
 pinned: false
+emoji: 🚊
+colorFrom: red
+colorTo: white
 ---
 
 # Metric Card for TRAM Accuracy
 
-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
-
+
+This metric is designed for the **TRAM** benchmark (Wang et al., 2024). It measures the accuracy of model predictions on multiple-choice temporal reasoning tasks. The metric expects model outputs to contain the answer in the format `"The final answer is (X)"` where X is a letter from A to D.
+
+It performs the following steps:
+
+1. Extracts the final answer from the model's prediction string using a regex pattern that matches `"The final answer is (A/B/C/D)"`.
+2. Compares the extracted letter to the reference answer.
+3. Calculates accuracy as the proportion of correct matches.
 
 ## How to Use
-*Give general statement of how to use the metric*
 
-
+You can load the metric using the `evaluate` library:
+
+```python
+import evaluate
+
+metric = evaluate.load("aauss/tram_accuracy")
+
+predictions = [
+    "Let me analyze this step by step... The final answer is (A).",
+    "Based on my reasoning, the events should be ordered as shown. The final answer is (B).",
+    "After careful consideration, The final answer is (C).",
+]
+
+references = ["A", "B", "D"]
+
+# Get average accuracy
+result = metric.compute(
+    predictions=predictions,
+    references=references,
+)
+print(result)
+>>> {"accuracy": 0.6666666666666666}
+
+# Get per-sample accuracy
+result = metric.compute(
+    predictions=predictions,
+    references=references,
+    return_average=False,
+)
+print(result)
+>>> {"accuracy": [1, 1, 0]}
+```
 
 ### Inputs
-
-- **
+
+- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `"The final answer is (X)"` where X is A, B, C, or D.
+- **references** (`list` of `str`): List of reference answers. Each reference should be a single letter (A, B, C, or D) representing the correct answer.
+- **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.
 
 ### Output Values
 
-
+The metric returns a dictionary with the following key:
+
+- **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating correctness per sample if `return_average=False`.
 
-
+This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.
 
-####
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+#### Reported performance from original publication
 
-
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
+Refer to the [original TRAM paper](https://arxiv.org/abs/2310.00835) for baseline performance values across various language models.
 
 ## Limitations and Bias
-
+
+- The metric relies on a regex pattern `[Tt]he final answer is .([A-D]).` to extract the answer. If the model output does not follow this exact format, extraction will fail and the prediction will be marked as incorrect.
+- The metric is case-insensitive for "The/the" but requires the answer letter to be uppercase (A-D).
+- Only supports multiple-choice questions with options A through D.
+- If a prediction contains multiple instances of the pattern, only the first match is used.
 
 ## Citation
-
+
+```bibtex
+@InProceedings{auss:tram_accuracy,
+title = {TRAM Accuracy},
+authors = {Auss Abbood},
+year = {2025}
+}
+```
 
 ## Further References
-
+
+- [TRAM Benchmark Paper](https://arxiv.org/abs/2310.00835)
+- [TRAM Dataset on Hugging Face](https://huggingface.co/datasets/Warrieryes/TRAM-Temporal)
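As a side note on the extraction rule described in the Limitations section above, here is a minimal standalone sketch (illustrative only, not from the committed files) showing which kinds of model output the quoted pattern `[Tt]he final answer is .([A-D]).` does and does not pick up; the sample strings are made up for demonstration:

```python
import re

# Pattern quoted in the metric card; the dots match the parentheses around the letter.
TRAM_ANSWER_REGEX = re.compile(r"[Tt]he final answer is .([A-D]).")

samples = [
    "Step-by-step reasoning... The final answer is (B).",  # matches -> "B"
    "the final answer is (D).",                            # lowercase "the" still matches -> "D"
    "The final answer is (b).",                            # lowercase letter: no match, scored incorrect
    "Answer: B",                                           # phrase missing: no match, scored incorrect
]

for text in samples:
    match = TRAM_ANSWER_REGEX.search(text)
    print(match.group(1) if match else None)
# Prints: B, D, None, None
```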
pyproject.toml
ADDED
@@ -0,0 +1,14 @@
[project]
name = "tram-accuracy"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "pip>=25.3",
    "evaluate==0.4.6",
    "huggingface-hub<0.25",
    "cookiecutter>=2.6",
    "ruff>=0.14.8",
    "pytest>=9.0.2",
]
tests/test_metric.py
ADDED
@@ -0,0 +1,18 @@
from tram_accuracy import TRAMAccuracy


def test_tram_accuracy():
    predictions = [
"Let's analyze each event and determine its historical accuracy and approximate date to arrange them in chronological order.\n\n(1) The FIFA World Cup is held in Africa for the first time. \n→ This is historically inaccurate. The FIFA World Cup has never been held in Africa. The first World Cup was in 1930 (Uruguay), and Africa has never hosted it. The closest is the 2010 World Cup in South Africa, which was the first time the tournament was held on the African continent. So, this event is factually correct only if interpreted as \"the first time the World Cup was held in Africa,\" which happened in 2010.\n\n(2) After 45 years' reign, Jiajing Emperor died in the Forbidden City, Longqing Emperor ascended the throne of Ming Dynasty. \n→ The Jiajing Emperor reigned from 1521 to 1567 (46 years). He died in 1567, and the Longqing Emperor succeeded him in 1567. This is a well-documented event in Chinese history. So, this occurred in 1567.\n\n(3) Independence of Malta. \n→ Malta gained independence from British rule in 1964. It was a British colony until then. So, this event occurred in 1964.\n\n(4) Earliest recorded air raid, as Austria launches from land and from the ship SMS Vulcano some 200 incendiary balloons against Venice. \n→ This is a fictional or historically inaccurate event. During World War I, Austria-Hungary did not launch incendiary balloons against Venice. The first recorded use of air raids was during World War I, and the use of balloons for attacks was limited. However, the most plausible reference is the 1915 incident where Austria-Hungary launched incendiary balloons over Venice. This is a real, though minor, event and occurred in 1915.\n\n(5) The Western European Union is established. \n→ The Western European Union (WEU) was established in 1954, as a political and military alliance between Western European countries, later succeeded by the EU. So, this occurred in 1954.\n\nNow, let's assign approximate dates:\n\n- (2) Jiajing Emperor's death and Longqing's ascension: 1567 \n- (4) Air raid on Venice with balloons: 1915 \n- (5) Western European Union established: 1954 \n- (3) Independence of Malta: 1964 \n- (1) First FIFA World Cup in Africa: 2010\n\nSo, chronological order is:\n\n(2) 1567 \n(4) 1915 \n(5) 1954 \n(3) 1964 \n(1) 2010\n\nThus, the correct chronological sequence is: (2), (4), (5), (3), (1)\n\nNow, check the options:\n\n(A) (2), (4), (5), (3), (1) → ✅ Matches our order \n(B) (1), (5), (3), (4), (2) → Incorrect (starts with 2010) \n(C) (1), (3), (2), (4), (5) → Incorrect (starts with 2010)\n\nTherefore, the correct answer is (A).\n\nThe final answer is (A)",
'The instruction presents a scenario about coal lasting about 300 years at current rates of use and asks what people did after knowing that coal will not last forever. It then poses a question about whether "alternative sources of funds meet its obligations" is possible, which appears to be a misaligned or irrelevant question in the context of coal depletion.\n\nThe original scenario is about energy resources (coal), not financial obligations or funding. Therefore, the statement "alternative sources of funds meet its obligations" does not logically follow or apply to the context of coal usage or sustainability.\n\nThus, the question is not relevant or meaningful in this context.\n\nThe final answer is (A) FALSE.',
'To determine if the sequence of events is in correct chronological order, let\'s analyze the events logically:\n\n(1) Leo joined a renowned orchestra. \n(2) Leo started learning the violin. \n(3) Leo released his debut solo album. \n(4) Leo performed at his first public concert.\n\nFirst, one cannot "join a renowned orchestra" before learning to play the violin. So, (2) "Leo started learning the violin" must come before (1) "Leo joined a renowned orchestra."\n\nNext, performing at a public concert (4) typically happens after one has learned the instrument and possibly joined an orchestra. However, a first public performance could occur before releasing a solo album, or it could be part of the process leading to the album.\n\nBut the key issue is the order of (2) and (1). Learning the violin (2) must precede joining an orchestra (1). So (2) should come before (1), not after.\n\nIn the given sequence:\n(1) Leo joined a renowned orchestra — before (2) Leo started learning the violin? That is illogical.\n\nTherefore, the sequence is not in correct chronological order.\n\nThe final answer is (C) FALSE.',
    ]
    references = ["A", "A", "C"]
    accuracy = TRAMAccuracy()._compute(predictions, references, return_average=False)[
        "accuracy"
    ]
    assert accuracy == [1, 1, 1]
    accuracy = TRAMAccuracy()._compute(predictions, references, return_average=True)[
        "accuracy"
    ]
    assert accuracy == 1.0
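The committed test only exercises predictions that all parse successfully. As a hedged complement (illustrative, not part of this commit), a check for the documented fallback, where an unparseable prediction is scored as incorrect, might look like this, assuming the same `TRAMAccuracy._compute` interface:

```python
from tram_accuracy import TRAMAccuracy


def test_unparseable_prediction_scores_zero():
    # The first response never states "The final answer is (X)", so extraction
    # fails and it should count as incorrect; the second parses and matches.
    predictions = ["I believe option B is correct.", "The final answer is (C)."]
    references = ["B", "C"]
    result = TRAMAccuracy()._compute(predictions, references, return_average=False)
    assert result["accuracy"] == [0, 1]
```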
tram_accuracy.py
CHANGED
@@ -13,53 +13,43 @@
 # limitations under the License.
 """TODO: Add a description here."""
 
+import re
 import evaluate
 import datasets
 
 
-# TODO: Add BibTeX citation
 _CITATION = """\
-@InProceedings{
-title = {
-authors={
-year={
+@InProceedings{auss:tram_accuracy,
+title = {TRAM Accuracy},
+authors={Auss Abbood},
+year={2025}
 }
 """
 
-# TODO: Add description of the module here
 _DESCRIPTION = """\
-
+Accuracy metric for the (multiple choice) TRAM datasets by Wang et al. (2024).
 """
 
 
-# TODO: Add description of the arguments of the module here
 _KWARGS_DESCRIPTION = """
-Calculates
+Calculates the accuracy for the TRAM datasets by extracting the final answer from the prediction and comparing it to the reference answer.
 Args:
-    predictions: list of predictions to score. Each
-        should be a string with
+    predictions: list of predictions to score. Each prediction
+        should be a string with the model's response, which contains the final answer.
     references: list of reference for each prediction. Each
-        reference
+        reference should be a single letter representing the correct answer.
+    return_average: whether to return the average accuracy or the accuracy for each prediction.
 Returns:
-    accuracy:
-    another_score: description of the second score,
-Examples:
-    Examples should be written in doctest format, and should illustrate how
-    to use the function.
-
-    >>> my_new_module = evaluate.load("my_new_module")
-    >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-    >>> print(results)
-    {'accuracy': 1.0}
+    accuracy: the accuracy for the TRAM datasets.
 """
 
-
-
+
+TRAM_ANSWER_REGEX = re.compile(r"[Tt]he final answer is .([A-D]).")
 
 
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class TRAMAccuracy(evaluate.Metric):
-    """
+    """Calculates the accuracy for the (multiple choice) TRAM datasets by extracting the final answer from the prediction and comparing it to the reference answer."""
 
     def _info(self):
         # TODO: Specifies the evaluate.EvaluationModuleInfo object
@@ -70,26 +60,33 @@ class TRAMAccuracy(evaluate.Metric):
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
             # This defines the format of each prediction and reference
-            features=datasets.Features(
-
-
-
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Value("string"),
+                    "references": datasets.Value("string"),
+                }
+            ),
             # Homepage of the module for documentation
-            homepage="http://module.homepage",
+            # homepage="http://module.homepage",
             # Additional links to the codebase or references
-            codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-            reference_urls=["http://path.to.reference.url/new_module"]
+            # codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
+            # reference_urls=["http://path.to.reference.url/new_module"],
         )
 
-    def
-        """
-
-
-
-
-
-
-
-
-
+    def _compute(self, predictions, references, return_average=True):
+        """Returns the accuracy for the (multiple choice) TRAM datasets."""
+        predictions_matches = [
+            TRAM_ANSWER_REGEX.search(prediction) for prediction in predictions
+        ]
+        predictions_extracted = [
+            match.group(1) if match is not None else None
+            for match in predictions_matches
+        ]
+        accuracy = [
+            1 if response == label else 0
+            for response, label in zip(predictions_extracted, references)
+        ]
+        if return_average:
+            return {"accuracy": sum(accuracy) / len(accuracy)}
+        else:
+            return {"accuracy": accuracy}
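One behaviour worth calling out from the implementation above: `re.search` returns the first occurrence of the pattern, so a prediction that restates its answer is scored on the earliest statement, as noted in the README's limitations. A small illustrative sketch (assuming the module is importable locally as `tram_accuracy`; the prediction string is made up):

```python
from tram_accuracy import TRAMAccuracy

# The model first commits to (A) and later "corrects" itself to (B);
# only the first match is extracted, so the prediction is scored as "A".
prediction = "The final answer is (A). Wait, on reflection, The final answer is (B)."
result = TRAMAccuracy()._compute([prediction], ["B"], return_average=False)
print(result)  # {'accuracy': [0]}
```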
uv.lock
ADDED
The diff for this file is too large to render.