Commit 41978ad (verified) by GYASBGFUHAADSGADF · Parent: 548c2e2

Upload model_README.md with huggingface_hub

# MathTrap300

A benchmark dataset of 300 insolvable, ill-posed mathematical problems designed to evaluate large language models' ability to recognize mathematical insolvability and fundamental contradictions.

## Description

While modern large language models (LLMs) achieve high accuracy on many challenging math benchmarks, they often struggle to recognize the insolvability of ill-posed problems. Existing benchmarks for insolvable problems, however, are either modified from elementary-level math questions or lack rigorous validation of their insolvability. There is still no benchmark featuring inherently insolvable problems that require deep mathematical knowledge to identify.

To fill this gap, we introduce **MathTrap300**, the first benchmark consisting of 300 insolvable, ill-posed math problems with fundamental mathematical contradictions or missing conditions that demand deep domain knowledge to detect. We manually derived these problems from well-posed counterparts through careful modifications, and the ill-posedness of each problem was rigorously verified by PhD-level experts.

We then present a fine-grained, three-stage LLM judge framework, designed based on observations of LLM responses to insolvable problems. This framework captures signals from both final answers and intermediate reasoning, providing richer metrics and enabling a more faithful assessment of insolvability recognition.

## Usage

This dataset is designed for evaluating LLM performance on insolvable mathematical problems. Here's how to use it:

### Loading the Dataset

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("GYASBGFUHAADSGADF/mathtrap300")

# Access the data
for example in dataset['batch1']:
    print(f"Original: {example['original']}")
    print(f"Trap: {example['trap']}")
    print(f"Annotation: {example['annotation']}")
    print(f"Trap Type: {example['trap type']}")
    print("---")
```
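
Since each row pairs a well-posed `original` with its insolvable `trap` variant and labels it with a `trap type`, a common first step is to group or count rows by trap type. The helpers below are a minimal sketch that works on any iterable of dict-like rows with a `'trap type'` key (the field name follows the example above); they are not part of the dataset itself.

```python
from collections import Counter

def group_by_trap_type(examples):
    """Group dataset rows by their 'trap type' field.

    `examples` is any iterable of dict-like rows with a 'trap type' key,
    e.g. the split loaded above. Returns {trap_type: [rows, ...]}.
    """
    groups = {}
    for ex in examples:
        groups.setdefault(ex["trap type"], []).append(ex)
    return groups

def trap_type_counts(examples):
    """Count how many problems fall under each trap type."""
    return Counter(ex["trap type"] for ex in examples)
```

For example, `trap_type_counts(dataset['batch1'])` gives a quick overview of how the 300 problems are distributed across trap types before running an evaluation.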

### Evaluation Framework

Our three-stage LLM judge framework:

1. **Problem Analysis**: Check if the model recognizes the mathematical structure
2. **Contradiction Detection**: Evaluate if the model identifies the insolvability
3. **Reasoning Quality**: Assess the quality of mathematical reasoning
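
The three stages above could be wired together roughly as follows. This is a minimal sketch, not the paper's implementation: `call_judge` is a hypothetical stand-in for whatever LLM-judge call you use, and the stage prompts are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeResult:
    structure_recognized: bool   # Stage 1: problem analysis
    insolvability_found: bool    # Stage 2: contradiction detection
    reasoning_score: float       # Stage 3: reasoning quality, in [0, 1]

def run_three_stage_judge(
    problem: str,
    response: str,
    call_judge: Callable[[str, str, str], str],
) -> JudgeResult:
    """Run the three judge stages over a model response.

    `call_judge(stage_prompt, problem, response)` stands in for an
    LLM-judge call returning a short verdict string; we only parse it here.
    """
    s1 = call_judge("Does the response identify the mathematical structure? Answer yes/no.",
                    problem, response)
    s2 = call_judge("Does the response recognize that the problem is insolvable? Answer yes/no.",
                    problem, response)
    s3 = call_judge("Rate the mathematical reasoning from 0 to 1.",
                    problem, response)
    return JudgeResult(
        structure_recognized=s1.strip().lower().startswith("yes"),
        insolvability_found=s2.strip().lower().startswith("yes"),
        reasoning_score=max(0.0, min(1.0, float(s3.strip()))),
    )
```

Keeping the stage verdicts separate, rather than collapsing them into a single pass/fail, is what lets the framework distinguish (for example) a model that sees the contradiction but forces a solution anyway from one that never notices it.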

### Key Findings

Our evaluation of recent advanced LLMs on MathTrap300 reveals:

- **Clear Performance Drop**: A significant decrease in accuracy from well-posed problems to their insolvable counterparts
- **Common Failure Modes**:
  - Hallucination: Models generate plausible-looking but incorrect solutions
  - Guessing: Models provide random answers without proper reasoning
  - Condition Neglect: Models ignore critical mathematical constraints
- **Forced Solutions**: Even when models recognize the insolvability, they still attempt to force a solution

## Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@inproceedings{mathtrap300,
  title={MathTrap300: Evaluating Large Language Models on Insolvable Mathematical Problems},
  author={[Authors]},
  booktitle={ICLR},
  year={2025},
  url={https://huggingface.co/datasets/GYASBGFUHAADSGADF/mathtrap300}
}
```

## License

This dataset is released under the MIT License.