aauss committed
Commit a031da6 · 1 Parent(s): 98a831a

Add metric documentation and fix GMT handling.

- Complete README with description, usage examples, inputs/outputs
- Add proper citations and references to TCP paper (EMNLP 2025)
- Fix GMT stripping to only apply to tcp_short subset
- Add type hints and empty predictions validation

Files changed (2)
  1. README.md +89 -17
  2. tcp_accuracy.py +40 -41
README.md CHANGED
@@ -1,50 +1,122 @@
  ---
  title: TCP Accuracy
  datasets:
- - Beanbagdzf/TCP
  tags:
  - evaluate
  - metric
- description: "TODO: add a description here"
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
  pinned: false
  ---

  # Metric Card for TCP Accuracy

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

  #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

  ## Further References
- *Add any useful further references.*
  ---
  title: TCP Accuracy
  datasets:
+ - Beanbagdzf/TCP
  tags:
  - evaluate
  - metric
+ - temporal reasoning
+ - scheduling
+ - temporal constraint programming
+ description: Accuracy metric for the TCP (Temporal Constraint-Based Planning) benchmark by Ding et al. (2025).
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
  pinned: false
+ emoji: ⏰
+ colorFrom: red
+ colorTo: green
  ---

  # Metric Card for TCP Accuracy

  ## Metric Description
+
+ This metric is designed for the **TCP** (Temporal Constraint-Based Planning) benchmark (Ding et al., 2025). It evaluates large language models on complex scheduling and planning tasks that require temporal reasoning, constraint satisfaction, and multi-step logical deduction.
+
+ The benchmark includes problems such as:
+ - Project scheduling with team member availability constraints
+ - Work duration limits and mandatory break requirements
+ - Time zone conversions
+ - Sequential task dependencies
+
+ The metric expects model outputs to contain the final answer in LaTeX boxed notation: `\boxed{answer}`.
+
+ It performs the following steps:
+
+ 1. Extracts the answer from the model's prediction string using the regex pattern `\boxed{([^}]*)}`.
+ 2. For the "tcp_short" subset, removes "GMT" from both predictions and references before comparison.
+ 3. Performs exact string matching between the extracted prediction and the reference answer.
+ 4. Returns accuracy as the proportion of correct matches (or per-sample scores).
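The four steps can be sketched as a minimal standalone function. This is an illustration of the documented behaviour only; `score_one` is a hypothetical helper, not part of the module:

```python
import re

# Same pattern the metric documents: \boxed{...} with no nested braces.
BOXED_ANSWER_PATTERN = r"\\boxed\{([^}]*)\}"

def score_one(prediction: str, reference: str, subset: str) -> int:
    # Step 1: extract the \boxed{...} answer from the model output.
    match = re.search(BOXED_ANSWER_PATTERN, prediction, re.DOTALL)
    answer = match.group(1).strip() if match else None
    # Step 2: for the tcp_short subset, drop "GMT" from both sides.
    if subset == "tcp_short":
        if answer is not None:
            answer = answer.replace("GMT", "").strip()
        reference = reference.replace("GMT", "").strip()
    # Step 3: exact string match; step 4 averages these per-sample scores.
    return int(answer == reference)

print(score_one(r"The meeting ends at \boxed{2020-05-28 16:00}", "2020-05-28 16:00 GMT", "tcp_short"))  # 1
print(score_one("No boxed answer here", "2012-11-05", "tcp_long"))  # 0
```

Note that a missing boxed answer yields `None`, which can never equal a reference string, so such samples simply score 0.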
 
  ## How to Use

+ You can load the metric using the `evaluate` library:
+
+ ```python
+ import evaluate
+
+ metric = evaluate.load("aauss/tcp_accuracy")
+
+ predictions = [
+     "After analyzing the constraints... \\boxed{2012-11-05}",
+     "The project completes on... \\boxed{2021-01-10}",
+     "Converting to GMT, the final time is... \\boxed{2020-05-28 16:00}",
+ ]
+
+ references = ["2012-11-05", "2012-11-05", "2020-05-28 16:00 GMT"]
+ subsets = ["tcp_long", "tcp_long", "tcp_short"]
+
+ # Get average accuracy
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     subset=subsets,
+ )
+ print(result)
+ # {'accuracy': 0.6666666666666666}
+
+ # Get per-sample accuracy
+ result = metric.compute(
+     predictions=predictions,
+     references=references,
+     subset=subsets,
+     return_average=False,
+ )
+ print(result)
+ # {'accuracy': [1, 0, 1]}
+ ```

  ### Inputs
+
+ - **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `\boxed{answer}`.
+ - **references** (`list` of `str`): List of reference answers. Each reference should be the expected answer string.
+ - **subset** (`str` or `list` of `str`): The subset type(s) for each sample. Must be one of:
+   - `"tcp_long"`: Longer scheduling problems (exact match)
+   - `"tcp_short"`: Shorter problems (GMT is stripped before comparison)
+ - **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.

  ### Output Values

+ The metric returns a dictionary with the following key:
+
+ - **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating per-sample correctness if `return_average=False`.
+
+ This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.

  #### Values from Popular Papers

+ Refer to the [original TCP paper](https://aclanthology.org/2025.emnlp-main.1142/) for baseline performance values across various language models.

  ## Limitations and Bias
+
+ - The metric relies on the regex pattern `\boxed{([^}]*)}` to extract answers. If the model output does not include a boxed answer, extraction fails and returns `None`, so the sample is scored as incorrect.
+ - For the "tcp_short" subset, "GMT" is stripped from both predictions and references. Other time zone formats may not be handled correctly.
+ - The metric uses exact string matching. Semantically equivalent answers with different formatting (e.g., "Nov 5, 2012" vs. "2012-11-05") will be marked as incorrect.
+ - Nested braces inside `\boxed{}` are not supported by the current regex pattern.
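The nested-brace limitation is easy to reproduce: because the capture group `[^}]*` stops at the first closing brace, LaTeX answers containing their own braces are truncated. A small illustrative snippet (not part of the module):

```python
import re

# The pattern used by the metric for answer extraction.
pattern = r"\\boxed\{([^}]*)\}"

# [^}]* stops at the first "}", so nested LaTeX is cut short:
m = re.search(pattern, r"The answer is \boxed{\frac{1}{2}}")
print(m.group(1))  # \frac{1  -- not the full \frac{1}{2}
```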
 
  ## Citation
+
+ ```bibtex
+ @software{abbood2025tcp_accuracy,
+   title={TCP Accuracy},
+   author={Abbood, Auss},
+   year={2025},
+   url={https://huggingface.co/spaces/aauss/tcp_accuracy}
+ }
+ ```

  ## Further References
+
+ - [TCP Paper (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.1142/)
+ - [TCP Dataset on Hugging Face](https://huggingface.co/datasets/Beanbagdzf/TCP)
tcp_accuracy.py CHANGED
@@ -11,7 +11,7 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
- """TODO: Add a description here."""

  import re

@@ -19,96 +19,95 @@ import evaluate
  import datasets


- # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
- title = {A great new module},
- authors={huggingface, Inc.},
- year={2020}
  }
  """

- # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
  """


- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
  Args:
-     predictions: list of predictions to score. Each predictions
-         should be a string with tokens separated by spaces.
-     references: list of reference for each prediction. Each
-         reference should be a string with tokens separated by spaces.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
  Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.
-
-     >>> my_new_module = evaluate.load("my_new_module")
-     >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
      >>> print(results)
      {'accuracy': 1.0}
  """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-

  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class TCPAccuracy(evaluate.Metric):
-     """TODO: Short description of my evaluation module."""

      BOXED_ANSWER_PATTERN = r"\\boxed\{([^}]*)\}"

      def _info(self):
-         # TODO: Specifies the evaluate.EvaluationModuleInfo object
          return evaluate.MetricInfo(
-             # This is the description that will appear on the modules page.
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
-             # This defines the format of each prediction and reference
              features=datasets.Features(
                  {
                      "predictions": datasets.Value("string"),
                      "references": datasets.Value("string"),
                  }
              ),
-             # Homepage of the module for documentation
-             homepage="http://module.homepage",
-             # Additional links to the codebase or references
-             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-             reference_urls=["http://path.to.reference.url/new_module"],
          )

-     def extract_boxed_answer(self, predictions):
-         match = re.search(self.BOXED_ANSWER_PATTERN, predictions, re.DOTALL)
          if match:
-             return match.group(1).replace("GMT", "").strip()
          return None

      def _compute(
          self,
-         predictions,
-         references,
          subset: str | list[str],
          return_average: bool = True,
-     ):
          """Returns the scores"""
          if isinstance(subset, str):
              subset = [subset] * len(predictions)
-         predictions = [self.extract_boxed_answer(p) for p in predictions]
          references = [
              r.replace("GMT", "").strip() if s == "tcp_short" else r
              for r, s in zip(references, subset)
          ]
-         accuracy = [int(i == j) for i, j in zip(predictions, references)]
          if return_average:
              return {"accuracy": sum(accuracy) / len(accuracy)}
          return {"accuracy": accuracy}
 
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
+ """TCP Accuracy metric for evaluating temporal constraint-based planning tasks."""

  import re

  import datasets


  _CITATION = """\
+ @software{abbood2025tcp_accuracy,
+ title={TCP Accuracy},
+ author={Abbood, Auss},
+ year={2025},
+ url={https://huggingface.co/spaces/aauss/tcp_accuracy}
  }
  """

  _DESCRIPTION = """\
+ This metric evaluates model predictions on the TCP (Temporal Constraint-Based Planning) benchmark
+ (Ding et al., 2025). It measures accuracy by extracting answers from LaTeX boxed notation
+ (\\boxed{answer}) and comparing them against reference answers using exact string matching.
  """


  _KWARGS_DESCRIPTION = """
+ Calculates accuracy for TCP benchmark predictions.
  Args:
+     predictions: list of prediction strings. Each prediction should contain the
+         final answer in LaTeX boxed notation: \\boxed{answer}.
+     references: list of reference answer strings.
+     subset: either a string or list of strings indicating the subset type
+         ("tcp_short" or "tcp_long"). For "tcp_short", GMT is stripped before comparison.
+     return_average: if True (default), returns average accuracy as a float.
+         If False, returns a list of binary scores (0 or 1) for each sample.
  Returns:
+     accuracy: float (if return_average=True) or list of int (if return_average=False)

  Examples:
+     >>> metric = evaluate.load("aauss/tcp_accuracy")
+     >>> predictions = ["...\\\\boxed{2012-11-05}", "...\\\\boxed{2020-05-28 16:00}"]
+     >>> references = ["2012-11-05", "2020-05-28 16:00 GMT"]
+     >>> results = metric.compute(predictions=predictions, references=references, subset=["tcp_long", "tcp_short"])
      >>> print(results)
      {'accuracy': 1.0}
  """


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class TCPAccuracy(evaluate.Metric):
+     """Accuracy metric for the TCP (Temporal Constraint-Based Planning) benchmark."""

      BOXED_ANSWER_PATTERN = r"\\boxed\{([^}]*)\}"

      def _info(self):
          return evaluate.MetricInfo(
              module_type="metric",
              description=_DESCRIPTION,
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
              features=datasets.Features(
                  {
                      "predictions": datasets.Value("string"),
                      "references": datasets.Value("string"),
                  }
              ),
+             homepage="https://huggingface.co/spaces/aauss/tcp_accuracy",
+             codebase_urls=["https://huggingface.co/spaces/aauss/tcp_accuracy/tree/main"],
+             reference_urls=["https://aclanthology.org/2025.emnlp-main.1142/"],
          )

+     def extract_boxed_answer(self, prediction: str) -> str | None:
+         match = re.search(self.BOXED_ANSWER_PATTERN, prediction, re.DOTALL)
          if match:
+             return match.group(1).strip()
          return None

      def _compute(
          self,
+         predictions: list[str],
+         references: list[str],
          subset: str | list[str],
          return_average: bool = True,
+     ) -> dict[str, float | list[int]]:
          """Returns the scores"""
+         if not predictions:
+             raise ValueError("predictions cannot be empty")
          if isinstance(subset, str):
              subset = [subset] * len(predictions)
+         extracted_predictions = [self.extract_boxed_answer(p) for p in predictions]
+         extracted_predictions = [
+             p.replace("GMT", "").strip() if p and s == "tcp_short" else p
+             for p, s in zip(extracted_predictions, subset)
+         ]
          references = [
              r.replace("GMT", "").strip() if s == "tcp_short" else r
              for r, s in zip(references, subset)
          ]
+         accuracy = [int(i == j) for i, j in zip(extracted_predictions, references)]
          if return_average:
              return {"accuracy": sum(accuracy) / len(accuracy)}
          return {"accuracy": accuracy}
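The `_compute` logic in this diff can be exercised outside the `evaluate.Metric` wrapper with a small standalone sketch. This reimplements only the scoring steps shown above; the function name is illustrative:

```python
import re

BOXED_ANSWER_PATTERN = r"\\boxed\{([^}]*)\}"

def tcp_accuracy(predictions, references, subset, return_average=True):
    # Mirrors _compute: validate, broadcast subset, extract, strip GMT, match.
    if not predictions:
        raise ValueError("predictions cannot be empty")
    if isinstance(subset, str):
        subset = [subset] * len(predictions)  # one subset name applies to all samples
    extracted = []
    for p in predictions:
        m = re.search(BOXED_ANSWER_PATTERN, p, re.DOTALL)
        extracted.append(m.group(1).strip() if m else None)
    extracted = [
        e.replace("GMT", "").strip() if e and s == "tcp_short" else e
        for e, s in zip(extracted, subset)
    ]
    refs = [
        r.replace("GMT", "").strip() if s == "tcp_short" else r
        for r, s in zip(references, subset)
    ]
    accuracy = [int(i == j) for i, j in zip(extracted, refs)]
    if return_average:
        return {"accuracy": sum(accuracy) / len(accuracy)}
    return {"accuracy": accuracy}

# A single subset string is broadcast to every sample:
print(tcp_accuracy(
    [r"\boxed{2012-11-05}", r"\boxed{2020-05-28 16:00 GMT}"],
    ["2012-11-05", "2020-05-28 16:00"],
    "tcp_short",
))  # {'accuracy': 1.0}
```

This also shows why the GMT fix matters: under `tcp_short`, "GMT" is removed from both sides before the exact match, so `\boxed{2020-05-28 16:00 GMT}` still matches the reference `2020-05-28 16:00`.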