MathTrap300

A benchmark dataset of 300 insolvable, ill-posed mathematical problems designed to evaluate large language models' ability to recognize mathematical insolvability and fundamental contradictions.

Description

While modern large language models (LLMs) achieve high accuracy on many challenging math benchmarks, they often struggle to recognize the insolvability of ill-posed problems. Existing benchmarks for insolvable problems, however, are either modified from elementary-level math questions or lack rigorous validation of their insolvability. There is still no benchmark featuring inherently insolvable problems that require deep mathematical knowledge to identify.

To fill this gap, we introduce MathTrap300, the first benchmark consisting of 300 insolvable, ill-posed math problems whose fundamental mathematical contradictions or missing conditions demand deep domain knowledge to detect. We manually derived each problem from a well-posed counterpart through careful modifications, and PhD-level experts rigorously verified the ill-posedness of every problem.

We then present a fine-grained, three-stage LLM judge framework, designed based on observations of LLM responses to insolvable problems. This framework captures signals from both final answers and intermediate reasoning, providing richer metrics and enabling a more faithful assessment of insolvability recognition.

Usage

This dataset is designed for evaluating LLM performance on insolvable mathematical problems. Here's how to use it:

Loading the Dataset

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("GYASBGFUHAADSGADF/mathtrap300")

# Access the data
for example in dataset['train']:
    print(f"Original: {example['original']}")
    print(f"Trap: {example['trap']}")
    print(f"Annotation: {example['annotation']}")
    print(f"Trap Type: {example['trap type']}")
    print("---")
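Once loaded, the examples can be aggregated like any iterable of dicts. The sketch below counts problems per trap type; the `'trap type'` field name is taken from the loading example above, and the inline `examples` list is a hypothetical stand-in for `dataset['train']`:

```python
from collections import Counter

# Hypothetical examples mirroring the fields shown in the loading snippet;
# in practice, iterate over dataset['train'] instead.
examples = [
    {"trap": "...", "trap type": "Contradiction"},
    {"trap": "...", "trap type": "Missing Conditions"},
    {"trap": "...", "trap type": "Contradiction"},
]

# Tally how many problems fall under each trap type.
counts = Counter(ex["trap type"] for ex in examples)
print(counts.most_common())  # [('Contradiction', 2), ('Missing Conditions', 1)]
```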

Evaluation Framework

Our three-stage LLM judge framework:

  1. Problem Analysis: Check if the model recognizes the mathematical structure
  2. Contradiction Detection: Evaluate if the model identifies the insolvability
  3. Reasoning Quality: Assess the quality of mathematical reasoning
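The three stages above can be combined into a single per-response score. The skeleton below is a minimal sketch of one way to do that; the `JudgeVerdict` fields and the 0.5/0.5 weighting are illustrative assumptions, not the framework's actual scoring rule (which is driven by an LLM judge):

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    recognizes_structure: bool   # Stage 1: problem analysis
    detects_contradiction: bool  # Stage 2: contradiction detection
    reasoning_score: float       # Stage 3: reasoning quality, in [0, 1]

def score_response(v: JudgeVerdict) -> float:
    """Combine the three stage signals into one score (hypothetical weighting)."""
    if not v.recognizes_structure:
        # A response that misreads the problem structure scores zero outright.
        return 0.0
    base = 0.5 if v.detects_contradiction else 0.0
    return base + 0.5 * v.reasoning_score

print(score_response(JudgeVerdict(True, True, 0.8)))  # 0.9
```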

Key Findings

Our evaluation of recent advanced LLMs on MathTrap300 reveals:

  • Clear Performance Drop: Significant decrease in accuracy from well-posed problems to their insolvable counterparts
  • Common Failure Modes:
    • Hallucination: Models generate plausible-looking but incorrect solutions
    • Guessing: Models provide random answers without proper reasoning
    • Condition Neglect: Models ignore critical mathematical constraints
  • Forced Solutions: Even when models recognize insolvability, they still attempt to force a solution
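As a rough illustration of these failure modes, responses can be bucketed by surface cues. The keyword heuristic below is purely illustrative (the paper's framework uses an LLM judge, not regexes), and the pattern strings are assumptions:

```python
import re

# Rough keyword patterns for the failure modes listed above (illustrative only).
PATTERNS = {
    "recognized": re.compile(r"no solution|insolvable|ill-posed|contradict", re.I),
    "forced": re.compile(r"nevertheless.*answer|assum\w+.*instead", re.I),
}

def tag_response(text: str) -> str:
    """Tag a model response as recognizing insolvability, forcing a solution, or neither."""
    if PATTERNS["recognized"].search(text):
        if PATTERNS["forced"].search(text):
            return "forced solution"  # recognized the trap but answered anyway
        return "recognized"
    return "no recognition"  # hallucination, guessing, or condition neglect

print(tag_response("The constraints contradict each other, so no solution exists."))
```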

Dataset Statistics

  • Total Problems: 300 (currently 151 uploaded)
  • Difficulty Levels: 3.0 - 5.0
  • Trap Types: Contradiction, Missing Conditions, and others
  • Sources: MATH dataset, Original creation
  • Validation: Rigorously verified by PhD-level mathematical experts
  • Split: Mix of train/test examples

Citation

If you use this dataset in your research, please cite our paper:

@inproceedings{mathtrap300,
  title={MathTrap300: Evaluating Large Language Models on Insolvable Mathematical Problems},
  author={[Authors]},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025},
  url={https://huggingface.co/datasets/GYASBGFUHAADSGADF/mathtrap300}
}

License

This dataset is released under the MIT License.
