---
title: TRAM Accuracy
datasets:
- Warrieryes/TRAM-Temporal
tags:
- evaluate
- metric
- temporal reasoning
- multiple choice
description: Accuracy metric for the (multiple choice) TRAM benchmark by Wang et al. (2024).
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
emoji: π
colorFrom: red
colorTo: gray
---
# Metric Card for TRAM Accuracy

## Metric Description
This metric is designed for the **TRAM** benchmark (Wang et al., 2024). It measures the accuracy of model predictions on multiple-choice temporal reasoning tasks. The metric expects model outputs to contain the answer in the format `"The final answer is (X)"`, where X is a letter from A to D.

It performs the following steps:
1. Extracts the final answer from the model's prediction string using a regex pattern that matches `"The final answer is (A/B/C/D)"`.
2. Compares the extracted letter to the reference answer.
3. Calculates accuracy as the proportion of correct matches.
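The three steps above can be sketched as follows. This is a minimal reimplementation for illustration, not the metric's actual source; it assumes the regex pattern documented in the Limitations section below.

```python
import re

# Pattern assumed from the metric's documentation; the "." wildcards
# match the parentheses around the answer letter.
ANSWER_RE = re.compile(r"[Tt]he final answer is .([A-D]).")

def score(predictions, references):
    """Return the proportion of predictions whose extracted letter matches the reference."""
    scores = []
    for pred, ref in zip(predictions, references):
        match = ANSWER_RE.search(pred)
        # A missing or malformed answer counts as incorrect.
        scores.append(1 if match and match.group(1) == ref else 0)
    return sum(scores) / len(scores)

print(score(["The final answer is (A).", "No answer given."], ["A", "B"]))
# 0.5
```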
## How to Use
You can load the metric using the `evaluate` library:
```python
import evaluate

metric = evaluate.load("aauss/tram_accuracy")

predictions = [
    "Let me analyze this step by step... The final answer is (A).",
    "Based on my reasoning, the events should be ordered as shown. The final answer is (B).",
    "After careful consideration, The final answer is (C).",
]
references = ["A", "B", "D"]

# Get average accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
)
print(result)
# {'accuracy': 0.6666666666666666}

# Get per-sample accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    return_average=False,
)
print(result)
# {'accuracy': [1, 1, 0]}
```
### Inputs
- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `"The final answer is (X)"`, where X is A, B, C, or D.
- **references** (`list` of `str`): List of reference answers. Each reference should be a single letter (A, B, C, or D) representing the correct answer.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.
### Output Values
The metric returns a dictionary with the following key:
- **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating per-sample correctness if `return_average=False`.

This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.

#### Reported performance from original publication
Refer to the [original TRAM paper](https://arxiv.org/abs/2310.00835) for baseline performance values across various language models.
## Limitations and Bias
- The metric relies on the regex pattern `[Tt]he final answer is .([A-D]).` to extract the answer. If the model output does not follow this exact format, extraction fails and the prediction is marked as incorrect.
- The metric is case-insensitive for "The/the" but requires the answer letter to be uppercase (A-D).
- Only multiple-choice questions with options A through D are supported.
- If a prediction contains multiple instances of the pattern, only the first match is used.
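These behaviors can be checked directly against the documented pattern (a sketch for illustration; the metric's internal implementation may differ slightly):

```python
import re

pattern = re.compile(r"[Tt]he final answer is .([A-D]).")

# Lowercase answer letters are not matched, so extraction fails.
assert pattern.search("the final answer is (a).") is None

# Uppercase letters are extracted regardless of "The"/"the".
assert pattern.search("the final answer is (C).").group(1) == "C"

# With multiple occurrences, search() returns the leftmost match.
first = pattern.search("The final answer is (A). The final answer is (B).")
assert first.group(1) == "A"

print("all checks passed")
```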
## Citation
```bibtex
@InProceedings{auss:tram_accuracy,
  title  = {TRAM Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```
| ## Further References | |
| - [TRAM Benchmark Paper](https://arxiv.org/abs/2310.00835) | |
| - [TRAM Dataset on Hugging Face](https://huggingface.co/datasets/Warrieryes/TRAM-Temporal) | |