---
title: TRAM Accuracy
datasets:
- Warrieryes/TRAM-Temporal
tags:
- evaluate
- metric
- temporal reasoning
- multiple choice
description: >-
  Accuracy metric for the (multiple choice) TRAM benchmark by Wang et al.
  (2024).
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
emoji: π
colorFrom: red
colorTo: gray
---
# Metric Card for TRAM Accuracy

## Metric Description
This metric is designed for the TRAM benchmark (Wang et al., 2024). It measures the accuracy of model predictions on multiple-choice temporal reasoning tasks. The metric expects model outputs to contain the answer in the format "The final answer is (X)" where X is a letter from A to D.
It performs the following steps:

- Extracts the final answer from the model's prediction string using a regex pattern that matches "The final answer is (A/B/C/D)".
- Compares the extracted letter to the reference answer.
- Calculates accuracy as the proportion of correct matches.
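For illustration, the steps above could be sketched roughly as follows. This is a minimal reimplementation for clarity, not the Space's actual code, and the helper names (`extract_answer`, `score`) are hypothetical:

```python
import re

# Pattern described in the Limitations section: "The/the final answer is",
# a wildcard, the letter A-D, another wildcard (the wildcards cover the
# parentheses around the answer letter).
ANSWER_PATTERN = re.compile(r"[Tt]he final answer is .([A-D]).")

def extract_answer(prediction):
    """Return the first matched answer letter, or None if the pattern is absent."""
    match = ANSWER_PATTERN.search(prediction)
    return match.group(1) if match else None

def score(predictions, references):
    """Binary correctness per sample; a failed extraction counts as incorrect."""
    return [
        1 if extract_answer(pred) == ref else 0
        for pred, ref in zip(predictions, references)
    ]
```

Note that a prediction with no extractable answer is simply scored 0 rather than raising an error.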
## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/tram_accuracy")

predictions = [
    "Let me analyze this step by step... The final answer is (A).",
    "Based on my reasoning, the events should be ordered as shown. The final answer is (B).",
    "After careful consideration, The final answer is (C).",
]
references = ["A", "B", "D"]

# Get average accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
)
print(result)
# {"accuracy": 0.6666666666666666}

# Get per-sample accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    return_average=False,
)
print(result)
# {"accuracy": [1, 1, 0]}
```
## Inputs

- `predictions` (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `"The final answer is (X)"`, where X is A, B, C, or D.
- `references` (`list` of `str`): List of reference answers. Each reference should be a single letter (A, B, C, or D) representing the correct answer.
- `return_average` (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.
## Output Values

The metric returns a dictionary with the following key:

- `accuracy` (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating correctness per sample if `return_average=False`.

The metric can take any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.
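As a quick sanity check, the two output modes are consistent with each other: the value returned with `return_average=True` is the mean of the per-sample list returned with `return_average=False`:

```python
# Per-sample scores, as returned with return_average=False
per_sample = [1, 1, 0]

# The averaged score (return_average=True) is the mean of that list
average = sum(per_sample) / len(per_sample)
print(average)  # 0.6666666666666666
```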
## Reported performance from original publication
Refer to the original TRAM paper for baseline performance values across various language models.
## Limitations and Bias

- The metric relies on the regex pattern `[Tt]he final answer is .([A-D]).` to extract the answer. If the model output does not follow this exact format, extraction will fail and the prediction will be marked as incorrect.
- The metric is case-insensitive for "The/the" but requires the answer letter to be uppercase (A-D).
- Only multiple-choice questions with options A through D are supported.
- If a prediction contains multiple instances of the pattern, only the first match is used.
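The last two regex behaviors above can be checked directly with Python's `re` module (assuming the pattern as documented):

```python
import re

pattern = re.compile(r"[Tt]he final answer is .([A-D]).")

# A lowercase answer letter is not matched, so extraction fails:
print(pattern.search("The final answer is (a)."))  # None

# With two occurrences of the phrase, search() returns only the first:
text = "The final answer is (B). No wait, the final answer is (C)."
print(pattern.search(text).group(1))  # B
```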
## Citation

```bibtex
@InProceedings{auss:tram_accuracy,
  title  = {TRAM Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```