---
license: mit
tags:
- mathematics
- education
- reasoning
- trap-questions
- math-problems
library_name: datasets
---

# MathTrap300

A benchmark dataset of 300 insolvable, ill-posed mathematical problems designed to evaluate large language models' ability to recognize mathematical insolvability and fundamental contradictions.

## Description

While modern large language models (LLMs) achieve high accuracy on many challenging math benchmarks, they often fail to recognize when an ill-posed problem cannot be solved. Existing benchmarks for insolvable problems are either adapted from elementary-level math questions or lack rigorous validation of their insolvability. No benchmark yet features inherently insolvable problems that require deep mathematical knowledge to identify.

To fill this gap, we introduce **MathTrap300**, the first benchmark consisting of 300 insolvable, ill-posed math problems whose fundamental mathematical contradictions or missing conditions demand deep domain knowledge to detect. We derived these problems by hand from well-posed counterparts through careful modification, and PhD-level experts rigorously verified that each one is ill-posed.

We also present a fine-grained, three-stage LLM judge framework, designed from observations of how LLMs respond to insolvable problems. The framework captures signals from both final answers and intermediate reasoning, providing richer metrics and a more faithful assessment of insolvability recognition.

## Usage

This dataset is designed for evaluating LLM performance on insolvable mathematical problems. Here's how to use it:

### Loading the Dataset

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("GYASBGFUHAADSGADF/mathtrap300")

# Print each problem pair with its annotation and trap type
for example in dataset['train']:
    print(f"Original: {example['original']}")
    print(f"Trap: {example['trap']}")
    print(f"Annotation: {example['annotation']}")
    print(f"Trap Type: {example['trap type']}")
    print("---")
```
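
Once the data is loaded, a quick tally by trap type is a useful sanity check. The following is a minimal sketch, assuming each record is a dict carrying the `trap type` field used in the loop above. The `count_trap_types` helper is hypothetical, and the `sample` records are made-up placeholders rather than rows from the dataset.

```python
from collections import Counter


def count_trap_types(examples):
    """Count how many examples fall under each value of the 'trap type' field."""
    return Counter(example["trap type"] for example in examples)


# Placeholder records that only mimic the dataset's field names (not real rows).
sample = [
    {"trap": "placeholder problem A", "trap type": "Contradiction"},
    {"trap": "placeholder problem B", "trap type": "Missing Conditions"},
    {"trap": "placeholder problem C", "trap type": "Contradiction"},
]

counts = count_trap_types(sample)
for trap_type, n in counts.most_common():
    print(f"{trap_type}: {n}")
```

With the real data, `dataset['train']` can be passed in place of `sample`, since iterating over a split yields one dict per row.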

### Evaluation Framework

Our three-stage LLM judge framework:

1. **Problem Analysis**: Check whether the model recognizes the mathematical structure
2. **Contradiction Detection**: Evaluate whether the model identifies the insolvability
3. **Reasoning Quality**: Assess the quality of the mathematical reasoning
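
The three stages listed above can be pictured as three yes/no questions put to a judge model. The sketch below is only an illustration under stated assumptions: the question wording, the `Verdict` fields, the `ask` and `judge_response` helpers, and the `judge` callable (any function mapping a prompt string to a reply string) are all hypothetical, and are not the prompts or code used by the authors. A toy keyword-based judge stands in for an LLM so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns the judge model's reply.
Judge = Callable[[str], str]

# Illustrative questions, one per stage; the wording is an assumption.
STAGE_QUESTIONS = {
    "problem_analysis": "Does the answer engage with the mathematical structure of the problem?",
    "contradiction_detection": "Does the answer state that the problem cannot be solved as posed?",
    "reasoning_quality": "Is the mathematical reasoning in the answer sound?",
}


@dataclass
class Verdict:
    problem_analysis: bool
    contradiction_detection: bool
    reasoning_quality: bool


def ask(judge: Judge, question: str, problem: str, response: str) -> bool:
    """Put one yes/no question to the judge and parse its reply."""
    prompt = (
        f"{question}\n\nProblem:\n{problem}\n\n"
        f"Response:\n{response}\n\nReply YES or NO."
    )
    return judge(prompt).strip().upper().startswith("YES")


def judge_response(judge: Judge, problem: str, response: str) -> Verdict:
    """Run all three stages and collect the per-stage outcomes."""
    results = {
        stage: ask(judge, question, problem, response)
        for stage, question in STAGE_QUESTIONS.items()
    }
    return Verdict(**results)


def keyword_judge(prompt: str) -> str:
    """Toy stand-in for an LLM judge: YES if the response mentions 'no solution'."""
    response_text = prompt.split("Response:\n", 1)[1]
    return "YES" if "no solution" in response_text.lower() else "NO"


verdict = judge_response(
    keyword_judge,
    "placeholder problem",
    "The conditions conflict, so there is no solution.",
)
print(verdict)
```

Keeping the judge behind a plain callable means a real LLM client can be swapped in later without touching the stage logic.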

### Key Findings

Our evaluation of recent advanced LLMs on MathTrap300 reveals:

- **Clear Performance Drop**: Significant decrease in accuracy from well-posed problems to their insolvable counterparts
- **Common Failure Modes**:
  - Hallucination: Models generate plausible-looking but incorrect solutions
  - Guessing: Models provide random answers without proper reasoning
  - Condition Neglect: Models ignore critical mathematical constraints
- **Forced Solutions**: Even when models recognize insolvability, they still attempt to force a solution
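
One simple way to quantify the performance drop in the first finding is to subtract the rate at which a model correctly flags the trap versions from its accuracy on the well-posed originals. The sketch below assumes per-problem boolean outcomes are already available. The `rate` and `performance_drop` helpers are hypothetical, this definition of the drop is an assumption rather than the paper's metric, and the outcome lists are made-up placeholders, not measured results.

```python
def rate(outcomes):
    """Fraction of True values in a list of per-problem boolean outcomes."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def performance_drop(original_correct, trap_recognized):
    """Accuracy on well-posed problems minus recognition rate on their trap versions."""
    return rate(original_correct) - rate(trap_recognized)


# Made-up outcomes for four problem pairs (illustrative only, not measured results).
original_correct = [True, True, True, False]
trap_recognized = [True, False, False, False]

drop = performance_drop(original_correct, trap_recognized)
print(f"drop = {drop:.2f}")
```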

## Dataset Statistics

- **Total Problems**: 300 (currently 151 uploaded)
- **Difficulty Levels**: 1.0 to 5.0
- **Trap Types**: Contradiction, Missing Conditions, and others
- **Sources**: MATH dataset and original problems
- **Validation**: Rigorously verified by PhD-level mathematical experts
- **Split**: Mix of train/test examples

## Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@article{mathtrap300,
  title={MathTrap300: Evaluating Large Language Models on Insolvable Mathematical Problems},
  author={[Authors]},
  journal={ICLR},
  year={2025},
  url={https://huggingface.co/datasets/GYASBGFUHAADSGADF/mathtrap300}
}
```

## License

This dataset is released under the MIT License.