First commit.
- .gitignore +10 -0
- .python-version +1 -0
- README.md +76 -20
- pyproject.toml +14 -0
- tests/test_metric.py +18 -0
- tram_accuracy.py +41 -44
- uv.lock +0 -0
.gitignore
ADDED
@@ -0,0 +1,10 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
.python-version
ADDED
@@ -0,0 +1 @@
3.12
README.md
CHANGED
@@ -1,49 +1,105 @@
 ---
 title: TRAM Accuracy
-datasets:
+datasets:
+- Warrieryes/TRAM-Temporal
 tags:
-- evaluate
-- metric
-
+- evaluate
+- metric
+- temporal reasoning
+- multiple choice
+description: Accuracy metric for the (multiple choice) TRAM benchmark by Wang et al. (2024).
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
 pinned: false
+emoji: 🚊
+colorFrom: red
+colorTo: white
 ---
 
 # Metric Card for TRAM Accuracy
 
-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
-
+
+This metric is designed for the **TRAM** benchmark (Wang et al., 2024). It measures the accuracy of model predictions on multiple-choice temporal reasoning tasks. The metric expects model outputs to contain the answer in the format `"The final answer is (X)"` where X is a letter from A to D.
+
+It performs the following steps:
+
+1. Extracts the final answer from the model's prediction string using a regex pattern that matches `"The final answer is (A/B/C/D)"`.
+2. Compares the extracted letter to the reference answer.
+3. Calculates accuracy as the proportion of correct matches.
 
 ## How to Use
-*Give general statement of how to use the metric*
 
-
+You can load the metric using the `evaluate` library:
+
+```python
+import evaluate
+
+metric = evaluate.load("aauss/tram_accuracy")
+
+predictions = [
+    "Let me analyze this step by step... The final answer is (A).",
+    "Based on my reasoning, the events should be ordered as shown. The final answer is (B).",
+    "After careful consideration, The final answer is (C).",
+]
+
+references = ["A", "B", "D"]
+
+# Get average accuracy
+result = metric.compute(
+    predictions=predictions,
+    references=references,
+)
+print(result)
+>>> {"accuracy": 0.6666666666666666}
+
+# Get per-sample accuracy
+result = metric.compute(
+    predictions=predictions,
+    references=references,
+    return_average=False,
+)
+print(result)
+>>> {"accuracy": [1, 1, 0]}
+```
 
 ### Inputs
-
-- **
+
+- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `"The final answer is (X)"` where X is A, B, C, or D.
+- **references** (`list` of `str`): List of reference answers. Each reference should be a single letter (A, B, C, or D) representing the correct answer.
+- **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.
 
 ### Output Values
 
-
+The metric returns a dictionary with the following key:
+
+- **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating correctness per sample if `return_average=False`.
 
-
+This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.
 
-####
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+#### Reported performance from original publication
 
-
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
+Refer to the [original TRAM paper](https://arxiv.org/abs/2310.00835) for baseline performance values across various language models.
 
 ## Limitations and Bias
-
+
+- The metric relies on a regex pattern `[Tt]he final answer is .([A-D]).` to extract the answer. If the model output does not follow this exact format, extraction will fail and the prediction will be marked as incorrect.
+- The metric is case-insensitive for "The/the" but requires the answer letter to be uppercase (A-D).
+- Only supports multiple-choice questions with options A through D.
+- If a prediction contains multiple instances of the pattern, only the first match is used.
 
 ## Citation
-
+
+```bibtex
+@InProceedings{auss:tram_accuracy,
+title = {TRAM Accuracy},
+authors = {Auss Abbood},
+year = {2025}
+}
+```
 
 ## Further References
-
+
+- [TRAM Benchmark Paper](https://arxiv.org/abs/2310.00835)
+- [TRAM Dataset on Hugging Face](https://huggingface.co/datasets/Warrieryes/TRAM-Temporal)
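As a side note on the extraction rule described in the Limitations section above, here is a minimal standalone sketch (illustrative only, not from the committed files) showing which kinds of model output the quoted pattern `[Tt]he final answer is .([A-D]).` does and does not pick up; the sample strings are made up for demonstration:

```python
import re

# Pattern quoted in the metric card; the dots match the parentheses around the letter.
TRAM_ANSWER_REGEX = re.compile(r"[Tt]he final answer is .([A-D]).")

samples = [
    "Step-by-step reasoning... The final answer is (B).",  # matches -> "B"
    "the final answer is (D).",                            # lowercase "the" still matches -> "D"
    "The final answer is (b).",                            # lowercase letter: no match, scored incorrect
    "Answer: B",                                           # phrase missing: no match, scored incorrect
]

for text in samples:
    match = TRAM_ANSWER_REGEX.search(text)
    print(match.group(1) if match else None)
# Prints: B, D, None, None
```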
pyproject.toml
ADDED
@@ -0,0 +1,14 @@
[project]
name = "tram-accuracy"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "pip>=25.3",
    "evaluate==0.4.6",
    "huggingface-hub<0.25",
    "cookiecutter>=2.6",
    "ruff>=0.14.8",
    "pytest>=9.0.2",
]
tests/test_metric.py
ADDED
@@ -0,0 +1,18 @@
from tram_accuracy import TRAMAccuracy


def test_tram_accuracy():
    predictions = [
"Let's analyze each event and determine its historical accuracy and approximate date to arrange them in chronological order.\n\n(1) The FIFA World Cup is held in Africa for the first time. \n→ This is historically inaccurate. The FIFA World Cup has never been held in Africa. The first World Cup was in 1930 (Uruguay), and Africa has never hosted it. The closest is the 2010 World Cup in South Africa, which was the first time the tournament was held on the African continent. So, this event is factually correct only if interpreted as \"the first time the World Cup was held in Africa,\" which happened in 2010.\n\n(2) After 45 years' reign, Jiajing Emperor died in the Forbidden City, Longqing Emperor ascended the throne of Ming Dynasty. \n→ The Jiajing Emperor reigned from 1521 to 1567 (46 years). He died in 1567, and the Longqing Emperor succeeded him in 1567. This is a well-documented event in Chinese history. So, this occurred in 1567.\n\n(3) Independence of Malta. \n→ Malta gained independence from British rule in 1964. It was a British colony until then. So, this event occurred in 1964.\n\n(4) Earliest recorded air raid, as Austria launches from land and from the ship SMS Vulcano some 200 incendiary balloons against Venice. \n→ This is a fictional or historically inaccurate event. During World War I, Austria-Hungary did not launch incendiary balloons against Venice. The first recorded use of air raids was during World War I, and the use of balloons for attacks was limited. However, the most plausible reference is the 1915 incident where Austria-Hungary launched incendiary balloons over Venice. This is a real, though minor, event and occurred in 1915.\n\n(5) The Western European Union is established. \n→ The Western European Union (WEU) was established in 1954, as a political and military alliance between Western European countries, later succeeded by the EU. So, this occurred in 1954.\n\nNow, let's assign approximate dates:\n\n- (2) Jiajing Emperor's death and Longqing's ascension: 1567 \n- (4) Air raid on Venice with balloons: 1915 \n- (5) Western European Union established: 1954 \n- (3) Independence of Malta: 1964 \n- (1) First FIFA World Cup in Africa: 2010\n\nSo, chronological order is:\n\n(2) 1567 \n(4) 1915 \n(5) 1954 \n(3) 1964 \n(1) 2010\n\nThus, the correct chronological sequence is: (2), (4), (5), (3), (1)\n\nNow, check the options:\n\n(A) (2), (4), (5), (3), (1) → ✅ Matches our order \n(B) (1), (5), (3), (4), (2) → Incorrect (starts with 2010) \n(C) (1), (3), (2), (4), (5) → Incorrect (starts with 2010)\n\nTherefore, the correct answer is (A).\n\nThe final answer is (A)",
'The instruction presents a scenario about coal lasting about 300 years at current rates of use and asks what people did after knowing that coal will not last forever. It then poses a question about whether "alternative sources of funds meet its obligations" is possible, which appears to be a misaligned or irrelevant question in the context of coal depletion.\n\nThe original scenario is about energy resources (coal), not financial obligations or funding. Therefore, the statement "alternative sources of funds meet its obligations" does not logically follow or apply to the context of coal usage or sustainability.\n\nThus, the question is not relevant or meaningful in this context.\n\nThe final answer is (A) FALSE.',
'To determine if the sequence of events is in correct chronological order, let\'s analyze the events logically:\n\n(1) Leo joined a renowned orchestra. \n(2) Leo started learning the violin. \n(3) Leo released his debut solo album. \n(4) Leo performed at his first public concert.\n\nFirst, one cannot "join a renowned orchestra" before learning to play the violin. So, (2) "Leo started learning the violin" must come before (1) "Leo joined a renowned orchestra."\n\nNext, performing at a public concert (4) typically happens after one has learned the instrument and possibly joined an orchestra. However, a first public performance could occur before releasing a solo album, or it could be part of the process leading to the album.\n\nBut the key issue is the order of (2) and (1). Learning the violin (2) must precede joining an orchestra (1). So (2) should come before (1), not after.\n\nIn the given sequence:\n(1) Leo joined a renowned orchestra — before (2) Leo started learning the violin? That is illogical.\n\nTherefore, the sequence is not in correct chronological order.\n\nThe final answer is (C) FALSE.',
    ]
    references = ["A", "A", "C"]
    accuracy = TRAMAccuracy()._compute(predictions, references, return_average=False)[
        "accuracy"
    ]
    assert accuracy == [1, 1, 1]
    accuracy = TRAMAccuracy()._compute(predictions, references, return_average=True)[
        "accuracy"
    ]
    assert accuracy == 1.0
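The committed test only exercises predictions that all parse successfully. As a hedged complement (illustrative, not part of this commit), a check for the documented fallback, where an unparseable prediction is scored as incorrect, might look like this, assuming the same `TRAMAccuracy._compute` interface:

```python
from tram_accuracy import TRAMAccuracy


def test_unparseable_prediction_scores_zero():
    # The first response never states "The final answer is (X)", so extraction
    # fails and it should count as incorrect; the second parses and matches.
    predictions = ["I believe option B is correct.", "The final answer is (C)."]
    references = ["B", "C"]
    result = TRAMAccuracy()._compute(predictions, references, return_average=False)
    assert result["accuracy"] == [0, 1]
```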
tram_accuracy.py
CHANGED
@@ -13,53 +13,43 @@
 # limitations under the License.
 """TODO: Add a description here."""
 
+import re
 import evaluate
 import datasets
 
 
-# TODO: Add BibTeX citation
 _CITATION = """\
-@InProceedings{
-title = {
-authors={
-year={
+@InProceedings{auss:tram_accuracy,
+title = {TRAM Accuracy},
+authors={Auss Abbood},
+year={2025}
 }
 """
 
-# TODO: Add description of the module here
 _DESCRIPTION = """\
-
+Accuracy metric for the (multiple choice) TRAM datasets by Wang et al. (2024).
 """
 
 
-# TODO: Add description of the arguments of the module here
 _KWARGS_DESCRIPTION = """
-Calculates
+Calculates the accuracy for the TRAM datasets by extracting the final answer from the prediction and comparing it to the reference answer.
 Args:
-    predictions: list of predictions to score. Each
-        should be a string with
+    predictions: list of predictions to score. Each prediction
+        should be a string with the model's response, which contains the final answer.
     references: list of reference for each prediction. Each
-        reference
+        reference should be a single letter representing the correct answer.
+    return_average: whether to return the average accuracy or the accuracy for each prediction.
 Returns:
-    accuracy:
-    another_score: description of the second score,
-Examples:
-    Examples should be written in doctest format, and should illustrate how
-    to use the function.
-
-    >>> my_new_module = evaluate.load("my_new_module")
-    >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-    >>> print(results)
-    {'accuracy': 1.0}
+    accuracy: the accuracy for the TRAM datasets.
 """
 
-
-
+
+TRAM_ANSWER_REGEX = re.compile(r"[Tt]he final answer is .([A-D]).")
 
 
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class TRAMAccuracy(evaluate.Metric):
-    """
+    """Calculates the accuracy for the (multiple choice) TRAM datasets by extracting the final answer from the prediction and comparing it to the reference answer."""
 
     def _info(self):
         # TODO: Specifies the evaluate.EvaluationModuleInfo object
@@ -70,26 +60,33 @@ class TRAMAccuracy(evaluate.Metric):
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
             # This defines the format of each prediction and reference
-            features=datasets.Features(
-
-
-
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Value("string"),
+                    "references": datasets.Value("string"),
+                }
+            ),
             # Homepage of the module for documentation
-            homepage="http://module.homepage",
+            # homepage="http://module.homepage",
             # Additional links to the codebase or references
-            codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-            reference_urls=["http://path.to.reference.url/new_module"]
+            # codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
+            # reference_urls=["http://path.to.reference.url/new_module"],
         )
 
-    def
-        """
-
-
-
-
-
-
-
-
-
+    def _compute(self, predictions, references, return_average=True):
+        """Returns the accuracy for the (multiple choice) TRAM datasets."""
+        predictions_matches = [
+            TRAM_ANSWER_REGEX.search(prediction) for prediction in predictions
+        ]
+        predictions_extracted = [
+            match.group(1) if match is not None else None
+            for match in predictions_matches
+        ]
+        accuracy = [
+            1 if response == label else 0
+            for response, label in zip(predictions_extracted, references)
+        ]
+        if return_average:
+            return {"accuracy": sum(accuracy) / len(accuracy)}
+        else:
+            return {"accuracy": accuracy}
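One behaviour worth calling out from the implementation above: `re.search` returns the first occurrence of the pattern, so a prediction that restates its answer is scored on the earliest statement, as noted in the README's limitations. A small illustrative sketch (assuming the module is importable locally as `tram_accuracy`; the prediction string is made up):

```python
from tram_accuracy import TRAMAccuracy

# The model first commits to (A) and later "corrects" itself to (B);
# only the first match is extracted, so the prediction is scored as "A".
prediction = "The final answer is (A). Wait, on reflection, The final answer is (B)."
result = TRAMAccuracy()._compute([prediction], ["B"], return_average=False)
print(result)  # {'accuracy': [0]}
```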
uv.lock
ADDED
The diff for this file is too large to render.