---
title: TCP Accuracy
datasets:
- Beanbagdzf/TCP
tags:
- evaluate
- metric
- temporal reasoning
- scheduling
- temporal constraint programming
description: >-
  Accuracy metric for the TCP (Temporal Constraint-Based Planning) benchmark by
  Ding et al. (2025).
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
emoji: ⏰
colorFrom: pink
colorTo: red
---

# Metric Card for TCP Accuracy

## Metric Description

This metric is designed for the **TCP** (Temporal Constraint-Based Planning) benchmark (Ding et al., 2025). It evaluates large language models on complex scheduling and planning tasks that require temporal reasoning, constraint satisfaction, and multi-step logical deduction.

The benchmark includes problems such as:
- Project scheduling with team member availability constraints
- Work duration limits and mandatory break requirements
- Time zone conversions
- Sequential task dependencies

The metric expects model outputs to contain the final answer in LaTeX boxed notation: `\boxed{answer}`.

It performs the following steps:

1. Extracts the answer from the model's prediction string using the regex pattern `\boxed{([^}]*)}`.
2. For the "tcp_short" subset, removes "GMT" from both predictions and references before comparison.
3. Performs exact string matching between the extracted prediction and the reference answer.
4. Returns accuracy as the proportion of correct matches (or per-sample scores).

## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/tcp_accuracy")

predictions = [
    "After analyzing the constraints... \\boxed{2012-11-05}",
    "The project completes on... \\boxed{2021-01-10}",
    "Converting to GMT, the final time is... \\boxed{2020-05-28 16:00}",
]

references = ["2012-11-05", "2012-11-05", "2020-05-28 16:00 GMT"]
subsets = ["tcp_long", "tcp_long", "tcp_short"]

# Get average accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    subset=subsets,
)
print(result)
>>> {"accuracy": 0.6666666666666666}

# Get per-sample accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    subset=subsets,
    return_average=False,
)
print(result)
>>> {"accuracy": [1, 0, 1]}
```

### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `\boxed{answer}`.
- **references** (`list` of `str`): List of reference answers. Each reference should be the expected answer string.
- **subset** (`str` or `list` of `str`): The subset type(s) for each sample. Must be one of:
  - `"tcp_long"`: Longer scheduling problems (exact match)
  - `"tcp_short"`: Shorter problems (GMT is stripped before comparison)
- **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.

### Output Values

The metric returns a dictionary with the following key:

- **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating correctness per sample if `return_average=False`.

This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.

#### Values from Popular Papers

Refer to the [original TCP paper](https://aclanthology.org/2025.emnlp-main.1142/) for baseline performance values across various language models.

## Limitations and Bias

- The metric relies on the regex pattern `\boxed{([^}]*)}` to extract answers. If the model output does not include a boxed answer, extraction will fail and return `None`, resulting in an incorrect prediction.
- For "tcp_short" subset, "GMT" is stripped from both predictions and references. Other timezone formats may not be handled correctly.
- The metric uses exact string matching. Semantically equivalent answers with different formatting (e.g., "Nov 5, 2012" vs "2012-11-05") will be marked as incorrect.
- Nested braces inside `\boxed{}` are not supported by the current regex pattern.

## Citation

```bibtex
@software{abbood2025tcp_accuracy,
  title={TCP Accuracy},
  author={Abbood, Auss},
  year={2025},
  url={https://huggingface.co/spaces/aauss/tcp_accuracy}
}
```

## Further References

- [TCP Paper (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.1142/)
- [TCP Dataset on Hugging Face](https://huggingface.co/datasets/Beanbagdzf/TCP)