---
title: TRAM Accuracy
datasets:
  - Warrieryes/TRAM-Temporal
tags:
  - evaluate
  - metric
  - temporal reasoning
  - multiple choice
description: >-
  Accuracy metric for the (multiple choice) TRAM benchmark by Wang et al.
  (2024).
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
emoji: 🚊
colorFrom: red
colorTo: gray
---

# Metric Card for TRAM Accuracy

## Metric Description

This metric is designed for the TRAM benchmark (Wang et al., 2024). It measures the accuracy of model predictions on multiple-choice temporal reasoning tasks. The metric expects model outputs to contain the answer in the format "The final answer is (X)" where X is a letter from A to D.

It performs the following steps:

  1. Extracts the final answer from the model's prediction string using a regex pattern that matches "The final answer is (A/B/C/D)".
  2. Compares the extracted letter to the reference answer.
  3. Calculates accuracy as the proportion of correct matches.
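As a minimal sketch of what this extraction and comparison amount to (the regex mirrors the pattern quoted under Limitations and Bias; the helper functions are illustrative, not part of the metric's API):

```python
import re

# Pattern described under "Limitations and Bias": case-insensitive "The",
# an uppercase letter A-D, surrounded by single arbitrary characters
# (typically the parentheses around the answer letter).
ANSWER_PATTERN = re.compile(r"[Tt]he final answer is .([A-D]).")

def extract_answer(prediction: str) -> str | None:
    """Return the first extracted answer letter, or None if the pattern is absent."""
    match = ANSWER_PATTERN.search(prediction)
    return match.group(1) if match else None

def per_sample_scores(predictions: list[str], references: list[str]) -> list[int]:
    """1 if the extracted letter equals the reference, 0 otherwise (including failed extraction)."""
    return [int(extract_answer(p) == r) for p, r in zip(predictions, references)]
```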

## How to Use

You can load the metric with the `evaluate` library and compute scores directly:

```python
import evaluate

metric = evaluate.load("aauss/tram_accuracy")

predictions = [
    "Let me analyze this step by step... The final answer is (A).",
    "Based on my reasoning, the events should be ordered as shown. The final answer is (B).",
    "After careful consideration, The final answer is (C).",
]

references = ["A", "B", "D"]

# Get average accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
)
print(result)
# {'accuracy': 0.6666666666666666}

# Get per-sample accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    return_average=False,
)
print(result)
# {'accuracy': [1, 1, 0]}
```

## Inputs

- `predictions` (list of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format "The final answer is (X)" where X is A, B, C, or D.
- `references` (list of `str`): List of reference answers. Each reference should be a single letter (A, B, C, or D) representing the correct answer.
- `return_average` (`bool`, optional): If True, returns the average accuracy as a float. If False, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to True.

## Output Values

The metric returns a dictionary with the following key:

- `accuracy` (`float` or list of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating correctness per sample if `return_average=False`.

With `return_average=True`, the score can take any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.
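The two output modes are consistent: the averaged score is the mean of the per-sample binary scores. A quick check, reusing the loading pattern from How to Use:

```python
import evaluate

metric = evaluate.load("aauss/tram_accuracy")

predictions = [
    "The final answer is (A).",
    "The final answer is (B).",
    "The final answer is (C).",
]
references = ["A", "B", "D"]

per_sample = metric.compute(
    predictions=predictions, references=references, return_average=False
)["accuracy"]
average = metric.compute(predictions=predictions, references=references)["accuracy"]

print(per_sample)                         # [1, 1, 0]
print(average)                            # 0.666..., i.e. the mean of the scores above
print(sum(per_sample) / len(per_sample))  # computed by hand for comparison
```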

## Reported performance from original publication

Refer to the original TRAM paper for baseline performance values across various language models.

## Limitations and Bias

- The metric relies on the regex pattern `[Tt]he final answer is .([A-D]).` to extract the answer. If the model output does not follow this exact format, extraction will fail and the prediction will be marked as incorrect.
- The metric is case-insensitive for "The"/"the" but requires the answer letter to be uppercase (A-D).
- Only multiple-choice questions with options A through D are supported.
- If a prediction contains multiple instances of the pattern, only the first match is used.
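For illustration, a prediction that names the correct option but breaks the expected format is still scored as incorrect. A sketch assuming the metric is loaded as in How to Use; the expected output follows from the limitations listed above:

```python
import evaluate

metric = evaluate.load("aauss/tram_accuracy")

predictions = [
    "The final answer is (B).",   # matches the expected pattern
    "The final answer is (b).",   # lowercase answer letter: extraction fails
    "I would go with option B.",  # required phrase missing: extraction fails
]
references = ["B", "B", "B"]

result = metric.compute(
    predictions=predictions, references=references, return_average=False
)
print(result)
# Expected, per the limitations above: {'accuracy': [1, 0, 0]}
```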

## Citation

```bibtex
@InProceedings{auss:tram_accuracy,
  title  = {TRAM Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```

## Further References