---
title: TRAM Accuracy
datasets:
- Warrieryes/TRAM-Temporal
tags:
- evaluate
- metric
- temporal reasoning
- multiple choice
description: Accuracy metric for the (multiple choice) TRAM benchmark by Wang et al. (2024).
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
emoji: π
colorFrom: red
colorTo: gray
---
# Metric Card for TRAM Accuracy

## Metric Description
This metric is designed for the **TRAM** benchmark (Wang et al., 2024). It measures the accuracy of model predictions on multiple-choice temporal reasoning tasks. The metric expects model outputs to contain the answer in the format `"The final answer is (X)"`, where X is a letter from A to D.

It performs the following steps:
1. Extracts the final answer from the model's prediction string using a regex pattern that matches `"The final answer is (A/B/C/D)"`.
2. Compares the extracted letter to the reference answer.
3. Calculates accuracy as the proportion of correct matches.
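The three steps above can be sketched as follows. This is a minimal reimplementation for illustration, not the metric's actual source; it assumes the regex pattern documented in the Limitations section below.

```python
import re

# Pattern assumed from the metric's documentation; the "." wildcards
# match the parentheses around the answer letter.
ANSWER_RE = re.compile(r"[Tt]he final answer is .([A-D]).")

def score(predictions, references):
    """Return the proportion of predictions whose extracted letter matches the reference."""
    scores = []
    for pred, ref in zip(predictions, references):
        match = ANSWER_RE.search(pred)
        # A missing or malformed answer counts as incorrect.
        scores.append(1 if match and match.group(1) == ref else 0)
    return sum(scores) / len(scores)

print(score(["The final answer is (A).", "No answer given."], ["A", "B"]))
# 0.5
```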
## How to Use
You can load the metric using the `evaluate` library:
```python
import evaluate

metric = evaluate.load("aauss/tram_accuracy")

predictions = [
    "Let me analyze this step by step... The final answer is (A).",
    "Based on my reasoning, the events should be ordered as shown. The final answer is (B).",
    "After careful consideration, The final answer is (C).",
]
references = ["A", "B", "D"]

# Get average accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
)
print(result)
# {'accuracy': 0.6666666666666666}

# Get per-sample accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    return_average=False,
)
print(result)
# {'accuracy': [1, 1, 0]}
```
### Inputs
- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `"The final answer is (X)"`, where X is A, B, C, or D.
- **references** (`list` of `str`): List of reference answers. Each reference should be a single letter (A, B, C, or D) representing the correct answer.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.
### Output Values
The metric returns a dictionary with the following key:
- **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating per-sample correctness if `return_average=False`.

This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.

#### Reported performance from original publication
Refer to the [original TRAM paper](https://arxiv.org/abs/2310.00835) for baseline performance values across various language models.
## Limitations and Bias
- The metric relies on the regex pattern `[Tt]he final answer is .([A-D]).` to extract the answer. If the model output does not follow this exact format, extraction fails and the prediction is marked as incorrect.
- The metric is case-insensitive for "The/the" but requires the answer letter to be uppercase (A-D).
- Only multiple-choice questions with options A through D are supported.
- If a prediction contains multiple instances of the pattern, only the first match is used.
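These behaviors can be checked directly against the documented pattern (a sketch for illustration; the metric's internal implementation may differ slightly):

```python
import re

pattern = re.compile(r"[Tt]he final answer is .([A-D]).")

# Lowercase answer letters are not matched, so extraction fails.
assert pattern.search("the final answer is (a).") is None

# Uppercase letters are extracted regardless of "The"/"the".
assert pattern.search("the final answer is (C).").group(1) == "C"

# With multiple occurrences, search() returns the leftmost match.
first = pattern.search("The final answer is (A). The final answer is (B).")
assert first.group(1) == "A"

print("all checks passed")
```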
## Citation
```bibtex
@InProceedings{auss:tram_accuracy,
  title  = {TRAM Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```
| ## Further References | |
| - [TRAM Benchmark Paper](https://arxiv.org/abs/2310.00835) | |
| - [TRAM Dataset on Hugging Face](https://huggingface.co/datasets/Warrieryes/TRAM-Temporal) | |