Add metric documentation and fix GMT handling.
- Complete README with description, usage examples, inputs/outputs
- Add proper citations and references to TCP paper (EMNLP 2025)
- Fix GMT stripping to only apply to tcp_short subset
- Add type hints and empty predictions validation
- README.md +89 -17
- tcp_accuracy.py +40 -41
README.md
CHANGED

@@ -1,50 +1,122 @@
 ---
 title: TCP Accuracy
 datasets:
-
 tags:
 - evaluate
 - metric
-
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
 pinned: false
 ---

 # Metric Card for TCP Accuracy

-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
-

 ## How to Use
-*Give general statement of how to use the metric*
-

 ### Inputs
-

 ### Output Values
-

 #### Values from Popular Papers
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

 ## Limitations and Bias
-

 ## Citation
-

 ## Further References
-
---
title: TCP Accuracy
datasets:
- Beanbagdzf/TCP
tags:
- evaluate
- metric
- temporal reasoning
- scheduling
- temporal constraint programming
description: Accuracy metric for the TCP (Temporal Constraint-Based Planning) benchmark by Ding et al. (2025).
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
emoji: ⏰
colorFrom: red
colorTo: green
---

# Metric Card for TCP Accuracy

## Metric Description

This metric is designed for the **TCP** (Temporal Constraint-Based Planning) benchmark (Ding et al., 2025). It evaluates large language models on complex scheduling and planning tasks that require temporal reasoning, constraint satisfaction, and multi-step logical deduction.

The benchmark includes problems such as:

- Project scheduling with team member availability constraints
- Work duration limits and mandatory break requirements
- Time zone conversions
- Sequential task dependencies

The metric expects model outputs to contain the final answer in LaTeX boxed notation: `\boxed{answer}`.

It performs the following steps:

1. Extracts the answer from the model's prediction string using the regex pattern `\boxed{([^}]*)}`.
2. For the "tcp_short" subset, removes "GMT" from both predictions and references before comparison.
3. Performs exact string matching between the extracted prediction and the reference answer.
4. Returns accuracy as the proportion of correct matches (or per-sample scores).

## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/tcp_accuracy")

predictions = [
    "After analyzing the constraints... \\boxed{2012-11-05}",
    "The project completes on... \\boxed{2021-01-10}",
    "Converting to GMT, the final time is... \\boxed{2020-05-28 16:00}",
]
references = ["2012-11-05", "2012-11-05", "2020-05-28 16:00 GMT"]
subsets = ["tcp_long", "tcp_long", "tcp_short"]

# Get average accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    subset=subsets,
)
print(result)
>>> {"accuracy": 0.6666666666666666}

# Get per-sample accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    subset=subsets,
    return_average=False,
)
print(result)
>>> {"accuracy": [1, 0, 1]}
```
### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `\boxed{answer}`.
- **references** (`list` of `str`): List of reference answers. Each reference should be the expected answer string.
- **subset** (`str` or `list` of `str`): The subset type(s) for each sample. Must be one of:
  - `"tcp_long"`: Longer scheduling problems (exact match)
  - `"tcp_short"`: Shorter problems (GMT is stripped before comparison)
- **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect) for each sample. Defaults to `True`.

### Output Values

The metric returns a dictionary with the following key:

- **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating per-sample correctness if `return_average=False`.

This metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.

#### Values from Popular Papers

Refer to the [original TCP paper](https://aclanthology.org/2025.emnlp-main.1142/) for baseline performance values across various language models.

## Limitations and Bias

- The metric relies on the regex pattern `\boxed{([^}]*)}` to extract answers. If the model output does not include a boxed answer, extraction fails and returns `None`, and the sample is scored as incorrect.
- For the "tcp_short" subset, "GMT" is stripped from both predictions and references. Other time zone formats may not be handled correctly.
- The metric uses exact string matching. Semantically equivalent answers with different formatting (e.g., "Nov 5, 2012" vs. "2012-11-05") are marked incorrect.
- Nested braces inside `\boxed{}` are not supported by the current regex pattern.
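To make the nested-brace limitation concrete (illustrative snippet, not part of the metric):

```python
import re

pattern = r"\\boxed\{([^}]*)\}"

# A flat answer extracts cleanly:
print(re.search(pattern, "\\boxed{2012-11-05}").group(1))    # 2012-11-05

# With nested braces, [^}]* stops at the first closing brace,
# so the captured answer is cut short:
print(re.search(pattern, "\\boxed{\\frac{1}{2}}").group(1))  # \frac{1
```

Answers containing LaTeX markup with braces would therefore need a different extraction strategy (e.g., balanced-brace parsing).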

## Citation

```bibtex
@software{abbood2025tcp_accuracy,
  title={TCP Accuracy},
  author={Abbood, Auss},
  year={2025},
  url={https://huggingface.co/spaces/aauss/tcp_accuracy}
}
```

## Further References

- [TCP Paper (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.1142/)
- [TCP Dataset on Hugging Face](https://huggingface.co/datasets/Beanbagdzf/TCP)
tcp_accuracy.py
CHANGED

@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""

 import re

@@ -19,96 +19,95 @@ import evaluate
 import datasets


-# TODO: Add BibTeX citation
 _CITATION = """\
-@
-title
-year={
 }
 """

-# TODO: Add description of the module here
 _DESCRIPTION = """\
-This
 """


-# TODO: Add description of the arguments of the module here
 _KWARGS_DESCRIPTION = """
-Calculates
 Args:
-    predictions: list of
-    references: list of reference
 Returns:
-    accuracy:
-    another_score: description of the second score,
 Examples:
-    >>>
-    >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
     >>> print(results)
     {'accuracy': 1.0}
 """

-# TODO: Define external resources urls if needed
-BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class TCPAccuracy(evaluate.Metric):
-    """

     BOXED_ANSWER_PATTERN = r"\\boxed\{([^}]*)\}"

     def _info(self):
-        # TODO: Specifies the evaluate.EvaluationModuleInfo object
         return evaluate.MetricInfo(
-            # This is the description that will appear on the modules page.
             module_type="metric",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            # This defines the format of each prediction and reference
             features=datasets.Features(
                 {
                     "predictions": datasets.Value("string"),
                     "references": datasets.Value("string"),
                 }
             ),
-            codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-            reference_urls=["http://path.to.reference.url/new_module"],
         )

-    def extract_boxed_answer(self,
-        match = re.search(self.BOXED_ANSWER_PATTERN,
         if match:
-            return match.group(1).
         return None

     def _compute(
         self,
-        predictions,
-        references,
         subset: str | list[str],
         return_average: bool = True,
-    ):
         """Returns the scores"""
         if isinstance(subset, str):
             subset = [subset] * len(predictions)
-
         references = [
             r.replace("GMT", "").strip() if s == "tcp_short" else r
             for r, s in zip(references, subset)
         ]
-        accuracy = [int(i == j) for i, j in zip(
         if return_average:
             return {"accuracy": sum(accuracy) / len(accuracy)}
         return {"accuracy": accuracy}
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""TCP Accuracy metric for evaluating temporal constraint-based planning tasks."""

import re

import evaluate
import datasets


_CITATION = """\
@software{abbood2025tcp_accuracy,
  title={TCP Accuracy},
  author={Abbood, Auss},
  year={2025},
  url={https://huggingface.co/spaces/aauss/tcp_accuracy}
}
"""

_DESCRIPTION = """\
This metric evaluates model predictions on the TCP (Temporal Constraint-Based Planning) benchmark
(Ding et al., 2025). It measures accuracy by extracting answers from LaTeX boxed notation
(\\boxed{answer}) and comparing them against reference answers using exact string matching.
"""


_KWARGS_DESCRIPTION = """
Calculates accuracy for TCP benchmark predictions.
Args:
    predictions: list of prediction strings. Each prediction should contain the
        final answer in LaTeX boxed notation: \\boxed{answer}.
    references: list of reference answer strings.
    subset: either a string or list of strings indicating the subset type
        ("tcp_short" or "tcp_long"). For "tcp_short", GMT is stripped before comparison.
    return_average: if True (default), returns average accuracy as a float.
        If False, returns a list of binary scores (0 or 1) for each sample.
Returns:
    accuracy: float (if return_average=True) or list of int (if return_average=False)
Examples:
    >>> metric = evaluate.load("aauss/tcp_accuracy")
    >>> predictions = ["...\\\\boxed{2012-11-05}", "...\\\\boxed{2020-05-28 16:00}"]
    >>> references = ["2012-11-05", "2020-05-28 16:00 GMT"]
    >>> results = metric.compute(predictions=predictions, references=references, subset=["tcp_long", "tcp_short"])
    >>> print(results)
    {'accuracy': 1.0}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class TCPAccuracy(evaluate.Metric):
    """Accuracy metric for the TCP (Temporal Constraint-Based Planning) benchmark."""

    BOXED_ANSWER_PATTERN = r"\\boxed\{([^}]*)\}"

    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                    "references": datasets.Value("string"),
                }
            ),
            homepage="https://huggingface.co/spaces/aauss/tcp_accuracy",
            codebase_urls=["https://huggingface.co/spaces/aauss/tcp_accuracy/tree/main"],
            reference_urls=["https://aclanthology.org/2025.emnlp-main.1142/"],
        )

    def extract_boxed_answer(self, prediction: str) -> str | None:
        match = re.search(self.BOXED_ANSWER_PATTERN, prediction, re.DOTALL)
        if match:
            return match.group(1).strip()
        return None

    def _compute(
        self,
        predictions: list[str],
        references: list[str],
        subset: str | list[str],
        return_average: bool = True,
    ) -> dict[str, float | list[int]]:
        """Returns the scores"""
        if not predictions:
            raise ValueError("predictions cannot be empty")
        if isinstance(subset, str):
            subset = [subset] * len(predictions)
        extracted_predictions = [self.extract_boxed_answer(p) for p in predictions]
        extracted_predictions = [
            p.replace("GMT", "").strip() if p and s == "tcp_short" else p
            for p, s in zip(extracted_predictions, subset)
        ]
        references = [
            r.replace("GMT", "").strip() if s == "tcp_short" else r
            for r, s in zip(references, subset)
        ]
        accuracy = [int(i == j) for i, j in zip(extracted_predictions, references)]
        if return_average:
            return {"accuracy": sum(accuracy) / len(accuracy)}
        return {"accuracy": accuracy}