MathTrap300
A benchmark dataset of 300 insolvable, ill-posed mathematical problems designed to evaluate large language models' ability to recognize mathematical insolvability and fundamental contradictions.
Description
While modern large language models (LLMs) achieve high accuracy on many challenging math benchmarks, they often struggle to recognize the insolvability of ill-posed problems. Existing benchmarks for insolvable problems, however, are either modified from elementary-level math questions or lack rigorous validation of their insolvability. There is still no benchmark featuring inherently insolvable problems that require deep mathematical knowledge to identify.
To fill this gap, we introduce MathTrap300, the first benchmark of 300 insolvable, ill-posed math problems whose fundamental contradictions or missing conditions demand deep domain knowledge to detect. Each problem was manually derived from a well-posed counterpart through careful modification, and its ill-posedness was rigorously verified by PhD-level experts.
We also present a fine-grained, three-stage LLM-judge framework, informed by observed patterns in LLM responses to insolvable problems. The framework captures signals from both final answers and intermediate reasoning, yielding richer metrics and a more faithful assessment of insolvability recognition.
Usage
This dataset is designed for evaluating LLM performance on insolvable mathematical problems. Here's how to use it:
Loading the Dataset
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("GYASBGFUHAADSGADF/mathtrap300")

# Access the data
for example in dataset['train']:
    print(f"Original: {example['original']}")
    print(f"Trap: {example['trap']}")
    print(f"Annotation: {example['annotation']}")
    print(f"Trap Type: {example['trap type']}")
    print("---")
```
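As a sketch of downstream use, you could group problems by their `trap type` field (the field name is taken from the loading example above; the records and label values below are illustrative placeholders, not real dataset entries):

```python
from collections import Counter

# Hypothetical records mimicking the dataset's fields shown above;
# in practice these would come from load_dataset(...)['train'].
examples = [
    {"original": "...", "trap": "...", "annotation": "...", "trap type": "Contradiction"},
    {"original": "...", "trap": "...", "annotation": "...", "trap type": "Missing Conditions"},
    {"original": "...", "trap": "...", "annotation": "...", "trap type": "Contradiction"},
]

# Count how many problems fall under each trap type
counts = Counter(ex["trap type"] for ex in examples)
print(counts)  # Counter({'Contradiction': 2, 'Missing Conditions': 1})
```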
Evaluation Framework
Our three-stage LLM judge framework:
- Problem Analysis: Check if the model recognizes the mathematical structure
- Contradiction Detection: Evaluate if the model identifies the insolvability
- Reasoning Quality: Assess the quality of mathematical reasoning
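The three stages above could be wired together roughly as follows. This is a minimal sketch, not the paper's actual implementation: `ask_judge` is a placeholder callable wrapping an LLM judge, and the prompts are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    problem_analysis: bool        # did the model engage the problem's mathematical structure?
    contradiction_detected: bool  # did the model identify the insolvability?
    reasoning_quality: float      # 0.0-1.0 score for the mathematical reasoning


def judge_response(response: str, ask_judge) -> JudgeResult:
    """Run the three judging stages on a single model response.

    `ask_judge` is a hypothetical callable wrapping an LLM judge; the
    prompt strings below are placeholders, not the framework's prompts.
    """
    analysis = ask_judge(f"Does this response analyze the problem's structure? {response}")
    contradiction = ask_judge(f"Does this response identify the problem as insolvable? {response}")
    quality = ask_judge(f"Rate the mathematical reasoning quality from 0 to 1: {response}")
    return JudgeResult(bool(analysis), bool(contradiction), float(quality))
```

Judging intermediate reasoning as well as the final answer is what lets the framework separate, for example, a model that detects the contradiction but forces an answer anyway from one that never notices it.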
Key Findings
Our evaluation of recent advanced LLMs on MathTrap300 reveals:
- Clear Performance Drop: Significant decrease in accuracy from well-posed problems to their insolvable counterparts
- Common Failure Modes:
- Hallucination: Models generate plausible-looking but incorrect solutions
- Guessing: Models provide random answers without proper reasoning
- Condition Neglect: Models ignore critical mathematical constraints
- Forced Solutions: Even when models recognize insolvability, they still attempt to force a solution
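The performance drop above can be measured as paired accuracy: for each well-posed/insolvable pair, compare accuracy on the original against the rate of recognizing insolvability on the trap counterpart. A sketch with invented numbers (not the paper's results):

```python
# Hypothetical per-pair outcomes: (answered the well-posed problem correctly,
# recognized insolvability of the trap counterpart). Values are illustrative only.
pairs = [(True, False), (True, True), (True, False), (False, False)]

well_posed_acc = sum(w for w, _ in pairs) / len(pairs)
trap_acc = sum(t for _, t in pairs) / len(pairs)
drop = well_posed_acc - trap_acc
print(f"well-posed: {well_posed_acc:.2f}, trap: {trap_acc:.2f}, drop: {drop:.2f}")
# well-posed: 0.75, trap: 0.25, drop: 0.50
```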
Dataset Statistics
- Total Problems: 300 (currently 151 uploaded)
- Difficulty Levels: 3.0 - 5.0
- Trap Types: Contradiction, Missing Conditions, and others
- Sources: MATH dataset, Original creation
- Validation: Rigorously verified by PhD-level mathematical experts
- Split: Mix of train/test examples
Citation
If you use this dataset in your research, please cite our paper:
@inproceedings{mathtrap300,
  title={MathTrap300: Evaluating Large Language Models on Insolvable Mathematical Problems},
  author={[Authors]},
  booktitle={ICLR},
  year={2025},
  url={https://huggingface.co/datasets/GYASBGFUHAADSGADF/mathtrap300}
}
License
This dataset is released under the MIT License.