Commit 41978ad (verified) by GYASBGFUHAADSGADF · Parent: 548c2e2

Upload model_README.md with huggingface_hub

# MathTrap300

A benchmark dataset of 300 insolvable, ill-posed mathematical problems designed to evaluate large language models' ability to recognize mathematical insolvability and fundamental contradictions.

## Description

While modern large language models (LLMs) achieve high accuracy on many challenging math benchmarks, they often struggle to recognize the insolvability of ill-posed problems. Existing benchmarks for insolvable problems, however, are either modified from elementary-level math questions or lack rigorous validation of their insolvability. There is still no benchmark featuring inherently insolvable problems that require deep mathematical knowledge to identify.

To fill this gap, we introduce **MathTrap300**, the first benchmark consisting of 300 insolvable, ill-posed math problems with fundamental mathematical contradictions or missing conditions that demand deep domain knowledge to detect. We manually derived these problems from well-posed counterparts through careful modifications, and the ill-posedness of each problem was rigorously verified by PhD-level experts.

We then present a fine-grained, three-stage LLM judge framework, designed based on observations of LLM responses to insolvable problems. This framework captures signals from both final answers and intermediate reasoning, providing richer metrics and enabling a more faithful assessment of insolvability recognition.

## Usage

This dataset is designed for evaluating LLM performance on insolvable mathematical problems. Here's how to use it:

### Loading the Dataset

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("GYASBGFUHAADSGADF/mathtrap300")

# Access the data
for example in dataset['batch1']:
    print(f"Original: {example['original']}")
    print(f"Trap: {example['trap']}")
    print(f"Annotation: {example['annotation']}")
    print(f"Trap Type: {example['trap type']}")
    print("---")
```
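
Since each row pairs a well-posed `original` with its insolvable `trap` variant and labels it with a `trap type`, a common first step is to group or count rows by trap type. The helpers below are a minimal sketch that works on any iterable of dict-like rows with a `'trap type'` key (the field name follows the example above); they are not part of the dataset itself.

```python
from collections import Counter

def group_by_trap_type(examples):
    """Group dataset rows by their 'trap type' field.

    `examples` is any iterable of dict-like rows with a 'trap type' key,
    e.g. the split loaded above. Returns {trap_type: [rows, ...]}.
    """
    groups = {}
    for ex in examples:
        groups.setdefault(ex["trap type"], []).append(ex)
    return groups

def trap_type_counts(examples):
    """Count how many problems fall under each trap type."""
    return Counter(ex["trap type"] for ex in examples)
```

For example, `trap_type_counts(dataset['batch1'])` gives a quick overview of how the 300 problems are distributed across trap types before running an evaluation.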

### Evaluation Framework

Our three-stage LLM judge framework:

1. **Problem Analysis**: Check if the model recognizes the mathematical structure
2. **Contradiction Detection**: Evaluate if the model identifies the insolvability
3. **Reasoning Quality**: Assess the quality of mathematical reasoning
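
The three stages above could be wired together roughly as follows. This is a minimal sketch, not the paper's implementation: `call_judge` is a hypothetical stand-in for whatever LLM-judge call you use, and the stage prompts are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeResult:
    structure_recognized: bool   # Stage 1: problem analysis
    insolvability_found: bool    # Stage 2: contradiction detection
    reasoning_score: float       # Stage 3: reasoning quality, in [0, 1]

def run_three_stage_judge(
    problem: str,
    response: str,
    call_judge: Callable[[str, str, str], str],
) -> JudgeResult:
    """Run the three judge stages over a model response.

    `call_judge(stage_prompt, problem, response)` stands in for an
    LLM-judge call returning a short verdict string; we only parse it here.
    """
    s1 = call_judge("Does the response identify the mathematical structure? Answer yes/no.",
                    problem, response)
    s2 = call_judge("Does the response recognize that the problem is insolvable? Answer yes/no.",
                    problem, response)
    s3 = call_judge("Rate the mathematical reasoning from 0 to 1.",
                    problem, response)
    return JudgeResult(
        structure_recognized=s1.strip().lower().startswith("yes"),
        insolvability_found=s2.strip().lower().startswith("yes"),
        reasoning_score=max(0.0, min(1.0, float(s3.strip()))),
    )
```

Keeping the stage verdicts separate, rather than collapsing them into a single pass/fail, is what lets the framework distinguish (for example) a model that sees the contradiction but forces a solution anyway from one that never notices it.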

### Key Findings

Our evaluation of recent advanced LLMs on MathTrap300 reveals:

- **Clear Performance Drop**: A significant decrease in accuracy from well-posed problems to their insolvable counterparts
- **Common Failure Modes**:
  - Hallucination: Models generate plausible-looking but incorrect solutions
  - Guessing: Models provide random answers without proper reasoning
  - Condition Neglect: Models ignore critical mathematical constraints
- **Forced Solutions**: Even when models recognize the insolvability, they still attempt to force a solution

## Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@inproceedings{mathtrap300,
  title={MathTrap300: Evaluating Large Language Models on Insolvable Mathematical Problems},
  author={[Authors]},
  booktitle={ICLR},
  year={2025},
  url={https://huggingface.co/datasets/GYASBGFUHAADSGADF/mathtrap300}
}
```

## License

This dataset is released under the MIT License.