aauss committed
Commit 1cdae56 · 1 Parent(s): eb96977

First commit.

Files changed (7)
  1. .gitignore +10 -0
  2. .python-version +1 -0
  3. README.md +76 -20
  4. pyproject.toml +14 -0
  5. tests/test_metric.py +18 -0
  6. tram_accuracy.py +41 -44
  7. uv.lock +0 -0
.gitignore ADDED
@@ -0,0 +1,10 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
README.md CHANGED
@@ -1,49 +1,105 @@
  ---
  title: TRAM Accuracy
- datasets: []
  tags:
- - evaluate
- - metric
- description: "TODO: add a description here"
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
  pinned: false
  ---

  # Metric Card for TRAM Accuracy

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

  ## Further References
- *Add any useful further references.*
  ---
  title: TRAM Accuracy
+ datasets:
+ - Warrieryes/TRAM-Temporal
  tags:
+ - evaluate
+ - metric
+ - temporal reasoning
+ - multiple choice
+ description: Accuracy metric for the (multiple choice) TRAM benchmark by Wang et al. (2024).
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
  pinned: false
+ emoji: 🚊
+ colorFrom: red
+ colorTo: white
  ---

  # Metric Card for TRAM Accuracy

  ## Metric Description
+
+ This metric is designed for the **TRAM** benchmark (Wang et al., 2024). It measures the accuracy of model predictions on multiple-choice temporal reasoning tasks. The metric expects model outputs to contain the answer in the format `"The final answer is (X)"` where X is a letter from A to D.
+
+ It performs the following steps:
+
+ 1. Extracts the final answer from the model's prediction string using a regex pattern that matches `"The final answer is (A/B/C/D)"` (sketched below).
+ 2. Compares the extracted letter to the reference answer.
+ 3. Calculates accuracy as the proportion of correct matches.
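+
+ A minimal sketch of the extraction step (illustrative only; it reuses the `TRAM_ANSWER_REGEX` pattern defined in `tram_accuracy.py`):
+
+ ```python
+ import re
+
+ # The pattern the metric uses to pull the answer letter out of a prediction
+ TRAM_ANSWER_REGEX = re.compile(r"[Tt]he final answer is .([A-D]).")
+
+ match = TRAM_ANSWER_REGEX.search("After some thought, the final answer is (B).")
+ print(match.group(1) if match else None)  # prints: B
+ ```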

  ## How to Use

+ You can load the metric using the `evaluate` library:
+
+ ```python
+ import evaluate
+
+ metric = evaluate.load("aauss/tram_accuracy")
+
+ predictions = [
+     "Let me analyze this step by step... The final answer is (A).",
+     "Based on my reasoning, the events should be ordered as shown. The final answer is (B).",
+     "After careful consideration, The final answer is (C).",
+ ]
+
+ references = ["A", "B", "D"]
+
+ # Get average accuracy
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+ )
+ print(result)
+ >>> {"accuracy": 0.6666666666666666}
+
+ # Get per-sample accuracy
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     return_average=False,
+ )
+ print(result)
+ >>> {"accuracy": [1, 1, 0]}
+ ```

  ### Inputs
+
+ - **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `"The final answer is (X)"` where X is A, B, C, or D.
+ - **references** (`list` of `str`): List of reference answers. Each reference should be a single letter (A, B, C, or D) representing the correct answer.
+ - **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.

  ### Output Values

+ The metric returns a dictionary with the following key:
+
+ - **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating correctness per sample if `return_average=False`.

+ This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.

+ #### Reported performance from the original publication

+ Refer to the [original TRAM paper](https://arxiv.org/abs/2310.00835) for baseline performance values across various language models.

  ## Limitations and Bias
+
+ - The metric relies on a regex pattern `[Tt]he final answer is .([A-D]).` to extract the answer. If the model output does not follow this exact format, extraction will fail and the prediction will be marked as incorrect (see the example below).
+ - The metric is case-insensitive for "The/the" but requires the answer letter to be uppercase (A-D).
+ - Only supports multiple-choice questions with options A through D.
+ - If a prediction contains multiple instances of the pattern, only the first match is used.
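+
+ For example, a prediction that names the right option but not in the expected phrasing is scored as incorrect (illustrative call, following the usage shown above):
+
+ ```python
+ result = metric.compute(predictions=["Answer: B"], references=["B"])
+ print(result)
+ >>> {"accuracy": 0.0}
+ ```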

  ## Citation
+
+ ```bibtex
+ @InProceedings{auss:tram_accuracy,
+ title = {TRAM Accuracy},
+ authors = {Auss Abbood},
+ year = {2025}
+ }
+ ```

  ## Further References
+
+ - [TRAM Benchmark Paper](https://arxiv.org/abs/2310.00835)
+ - [TRAM Dataset on Hugging Face](https://huggingface.co/datasets/Warrieryes/TRAM-Temporal)
pyproject.toml ADDED
@@ -0,0 +1,14 @@
+ [project]
+ name = "tram-accuracy"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     "pip>=25.3",
+     "evaluate==0.4.6",
+     "huggingface-hub<0.25",
+     "cookiecutter>=2.6",
+     "ruff>=0.14.8",
+     "pytest>=9.0.2",
+ ]
tests/test_metric.py ADDED
@@ -0,0 +1,18 @@
+ from tram_accuracy import TRAMAccuracy
+
+
+ def test_tram_accuracy():
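+     # Long chain-of-thought responses; each contains a "The final answer is (X)" statement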
+     predictions = [
+ "Let's analyze each event and determine its historical accuracy and approximate date to arrange them in chronological order.\n\n(1) The FIFA World Cup is held in Africa for the first time. \n→ This is historically inaccurate. The FIFA World Cup has never been held in Africa. The first World Cup was in 1930 (Uruguay), and Africa has never hosted it. The closest is the 2010 World Cup in South Africa, which was the first time the tournament was held on the African continent. So, this event is factually correct only if interpreted as \"the first time the World Cup was held in Africa,\" which happened in 2010.\n\n(2) After 45 years' reign, Jiajing Emperor died in the Forbidden City, Longqing Emperor ascended the throne of Ming Dynasty. \n→ The Jiajing Emperor reigned from 1521 to 1567 (46 years). He died in 1567, and the Longqing Emperor succeeded him in 1567. This is a well-documented event in Chinese history. So, this occurred in 1567.\n\n(3) Independence of Malta. \n→ Malta gained independence from British rule in 1964. It was a British colony until then. So, this event occurred in 1964.\n\n(4) Earliest recorded air raid, as Austria launches from land and from the ship SMS Vulcano some 200 incendiary balloons against Venice. \n→ This is a fictional or historically inaccurate event. During World War I, Austria-Hungary did not launch incendiary balloons against Venice. The first recorded use of air raids was during World War I, and the use of balloons for attacks was limited. However, the most plausible reference is the 1915 incident where Austria-Hungary launched incendiary balloons over Venice. This is a real, though minor, event and occurred in 1915.\n\n(5) The Western European Union is established. \n→ The Western European Union (WEU) was established in 1954, as a political and military alliance between Western European countries, later succeeded by the EU. So, this occurred in 1954.\n\nNow, let's assign approximate dates:\n\n- (2) Jiajing Emperor's death and Longqing's ascension: 1567 \n- (4) Air raid on Venice with balloons: 1915 \n- (5) Western European Union established: 1954 \n- (3) Independence of Malta: 1964 \n- (1) First FIFA World Cup in Africa: 2010\n\nSo, chronological order is:\n\n(2) 1567 \n(4) 1915 \n(5) 1954 \n(3) 1964 \n(1) 2010\n\nThus, the correct chronological sequence is: (2), (4), (5), (3), (1)\n\nNow, check the options:\n\n(A) (2), (4), (5), (3), (1) → ✅ Matches our order \n(B) (1), (5), (3), (4), (2) → Incorrect (starts with 2010) \n(C) (1), (3), (2), (4), (5) → Incorrect (starts with 2010)\n\nTherefore, the correct answer is (A).\n\nThe final answer is (A)",
+ 'The instruction presents a scenario about coal lasting about 300 years at current rates of use and asks what people did after knowing that coal will not last forever. It then poses a question about whether "alternative sources of funds meet its obligations" is possible, which appears to be a misaligned or irrelevant question in the context of coal depletion.\n\nThe original scenario is about energy resources (coal), not financial obligations or funding. Therefore, the statement "alternative sources of funds meet its obligations" does not logically follow or apply to the context of coal usage or sustainability.\n\nThus, the question is not relevant or meaningful in this context.\n\nThe final answer is (A) FALSE.',
+ 'To determine if the sequence of events is in correct chronological order, let\'s analyze the events logically:\n\n(1) Leo joined a renowned orchestra. \n(2) Leo started learning the violin. \n(3) Leo released his debut solo album. \n(4) Leo performed at his first public concert.\n\nFirst, one cannot "join a renowned orchestra" before learning to play the violin. So, (2) "Leo started learning the violin" must come before (1) "Leo joined a renowned orchestra."\n\nNext, performing at a public concert (4) typically happens after one has learned the instrument and possibly joined an orchestra. However, a first public performance could occur before releasing a solo album, or it could be part of the process leading to the album.\n\nBut the key issue is the order of (2) and (1). Learning the violin (2) must precede joining an orchestra (1). So (2) should come before (1), not after.\n\nIn the given sequence:\n(1) Leo joined a renowned orchestra — before (2) Leo started learning the violin? That is illogical.\n\nTherefore, the sequence is not in correct chronological order.\n\nThe final answer is (C) FALSE.',
+     ]
+     references = ["A", "A", "C"]
+     accuracy = TRAMAccuracy()._compute(predictions, references, return_average=False)[
+         "accuracy"
+     ]
+     assert accuracy == [1, 1, 1]
+     accuracy = TRAMAccuracy()._compute(predictions, references, return_average=True)[
+         "accuracy"
+     ]
+     assert accuracy == 1.0
tram_accuracy.py CHANGED
@@ -13,53 +13,43 @@
  # limitations under the License.
  """TODO: Add a description here."""

  import evaluate
  import datasets


- # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
- title = {A great new module},
- authors={huggingface, Inc.},
- year={2020}
  }
  """

- # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
  """


- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
  Args:
-     predictions: list of predictions to score. Each predictions
-         should be a string with tokens separated by spaces.
      references: list of reference for each prediction. Each
-         reference should be a string with tokens separated by spaces.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
- Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.
-
-     >>> my_new_module = evaluate.load("my_new_module")
-     >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-     >>> print(results)
-     {'accuracy': 1.0}
  """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class TRAMAccuracy(evaluate.Metric):
-     """TODO: Short description of my evaluation module."""

      def _info(self):
          # TODO: Specifies the evaluate.EvaluationModuleInfo object
@@ -70,26 +60,33 @@ class TRAMAccuracy(evaluate.Metric):
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
              # This defines the format of each prediction and reference
-             features=datasets.Features({
-                 'predictions': datasets.Value('int64'),
-                 'references': datasets.Value('int64'),
-             }),
              # Homepage of the module for documentation
-             homepage="http://module.homepage",
              # Additional links to the codebase or references
-             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-             reference_urls=["http://path.to.reference.url/new_module"]
          )

-     def _download_and_prepare(self, dl_manager):
-         """Optional: download external resources useful to compute the scores"""
-         # TODO: Download external resources if needed
-         pass
-
-     def _compute(self, predictions, references):
-         """Returns the scores"""
-         # TODO: Compute the different scores of the module
-         accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
-         return {
-             "accuracy": accuracy,
-         }
  # limitations under the License.
  """TODO: Add a description here."""

+ import re
  import evaluate
  import datasets


  _CITATION = """\
+ @InProceedings{auss:tram_accuracy,
+ title = {TRAM Accuracy},
+ authors={Auss Abbood},
+ year={2025}
  }
  """

  _DESCRIPTION = """\
+ Accuracy metric for the (multiple choice) TRAM datasets by Wang et al. (2024).
  """


  _KWARGS_DESCRIPTION = """
+ Calculates the accuracy for the TRAM datasets by extracting the final answer from the prediction and comparing it to the reference answer.
  Args:
+     predictions: list of predictions to score. Each prediction
+         should be a string with the model's response, which contains the final answer.
      references: list of reference for each prediction. Each
+         reference should be a single letter representing the correct answer.
+     return_average: whether to return the average accuracy or the accuracy for each prediction.
  Returns:
+     accuracy: the accuracy for the TRAM datasets.
  """

+
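+ # Matches e.g. "The final answer is (B)" or "the final answer is (B)"; group(1) captures the letter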
+ TRAM_ANSWER_REGEX = re.compile(r"[Tt]he final answer is .([A-D]).")


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class TRAMAccuracy(evaluate.Metric):
+     """Calculates the accuracy for the (multiple choice) TRAM datasets by extracting the final answer from the prediction and comparing it to the reference answer."""

      def _info(self):
          # TODO: Specifies the evaluate.EvaluationModuleInfo object
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
              # This defines the format of each prediction and reference
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string"),
+                     "references": datasets.Value("string"),
+                 }
+             ),
              # Homepage of the module for documentation
+             # homepage="http://module.homepage",
              # Additional links to the codebase or references
+             # codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
+             # reference_urls=["http://path.to.reference.url/new_module"],
          )

+     def _compute(self, predictions, references, return_average=True):
+         """Returns the accuracy for the (multiple choice) TRAM datasets."""
+         predictions_matches = [
+             TRAM_ANSWER_REGEX.search(prediction) for prediction in predictions
+         ]
+         predictions_extracted = [
+             match.group(1) if match is not None else None
+             for match in predictions_matches
+         ]
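+         # Failed extractions yield None, which never matches a reference letter and scores 0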
+         accuracy = [
+             1 if response == label else 0
+             for response, label in zip(predictions_extracted, references)
+         ]
+         if return_average:
+             return {"accuracy": sum(accuracy) / len(accuracy)}
+         else:
+             return {"accuracy": accuracy}
uv.lock ADDED
The diff for this file is too large to render. See raw diff