Jasaxion committed on
Commit 4ae0204 · verified · 1 Parent(s): 1a9e2da

Create README.md

Files changed (1)
  1. README.md +64 -0
README.md ADDED
@@ -0,0 +1,64 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - Jasaxion/MathSmith-Hard-Problems
+ language:
+ - en
+ base_model:
+ - Qwen/Qwen3-8B
+ tags:
+ - verl
+ ---
+
+ **MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy**
+
+ [![Paper](https://img.shields.io/badge/arXiv-2508.05592-b31b1b.svg)](https://arxiv.org/abs/2508.05592)
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
+ [![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)]()
+ [![GitHub](https://img.shields.io/badge/-GitHub-181717?logo=github)](https://github.com/Jasaxion/MathSmith)
+
+
+ ## Overview
+
+ The model generates `<rationale>`–`<problem>` pairs, where:
+ - `<rationale>`: structured reasoning describing concept integration and difficulty design.
+ - `<problem>`: a single Olympiad-level mathematical question that admits a verifiable numeric or symbolic answer.
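A minimal sketch of how a generated pair might be split into its two parts. It assumes the model wraps each part in matching opening and closing tags (`<rationale>...</rationale>`, `<problem>...</problem>`); the closing tags, the `parse_pair` helper, and the sample text are illustrative assumptions, not something this README specifies.

```python
import re
from typing import Optional, Tuple

# Assumed layout: <rationale>...</rationale> followed by <problem>...</problem>.
# The closing tags are an assumption; adjust the pattern to the real output format.
_PAIR_RE = re.compile(
    r"<rationale>(?P<rationale>.*?)</rationale>\s*"
    r"<problem>(?P<problem>.*?)</problem>",
    re.DOTALL,
)


def parse_pair(generated: str) -> Optional[Tuple[str, str]]:
    """Return (rationale, problem), or None if the text is malformed."""
    match = _PAIR_RE.search(generated)
    if match is None:
        return None
    rationale = match.group("rationale").strip()
    problem = match.group("problem").strip()
    # Treat empty sections as a structural failure.
    if not rationale or not problem:
        return None
    return rationale, problem


# Hypothetical sample output, written by hand for illustration.
sample = (
    "<rationale>Combine modular arithmetic with an extremal condition.</rationale>\n"
    "<problem>Find the smallest positive integer n such that n^2 + 1 "
    "is divisible by 65.</problem>"
)

if __name__ == "__main__":
    print(parse_pair(sample))
```

Returning `None` for malformed text keeps the structural check separate from any later scoring of the problem itself.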
+
+ Compared with **MathSmith-HC** (complexity + consistency reward), **MathSmith-Hard** removes the consistency term to emphasize *maximum reasoning depth and difficulty*.
+
+ ---
+
+ ## MathSmith Pipeline
+
+ The MathSmith framework consists of four main stages:
+
+ 1. **Concept Collection**: Randomly sample concept–explanation pairs from [PlanetMath](https://planetmath.org/) to ensure data independence.
+
+ 2. **Supervised Fine-tuning (SFT)**: Train the model on collected concept–explanation pairs to establish foundational understanding.
+
+ 3. **Reinforcement Learning (RL)**: Optimize the model using GRPO with rewards based on:
+ - Structural validity
+ - Reasoning complexity
+ - Answer consistency (used by MathSmith-HC; removed for MathSmith-Hard)
+
+ 4. **Weakness-Focused Self-Improvement**: Iteratively identify and address model weaknesses by generating targeted problem variants.
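To make the RL stage more concrete, here is a rough sketch of how the three reward signals could be combined and turned into GRPO-style group-relative advantages. The gating on structural validity, the weights, the score ranges, and both function names are assumptions made for illustration; the actual reward definitions are in the paper and the GitHub repository. Setting `w_consistency=0.0` mirrors the README's statement that MathSmith-Hard removes the consistency term.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(
    structure_ok: bool,
    complexity: float,
    consistency: float = 0.0,
    w_complexity: float = 1.0,
    w_consistency: float = 0.0,  # 0.0 drops the consistency term (assumed weighting)
) -> float:
    """Hypothetical reward: invalid structure scores 0, otherwise a weighted sum."""
    if not structure_ok:
        return 0.0
    return w_complexity * complexity + w_consistency * consistency


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Population standard deviation is used here; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical generations for the same concept prompt.
    group = [
        combined_reward(True, 0.8),
        combined_reward(True, 0.4),
        combined_reward(False, 0.9),  # malformed output, so reward is 0
        combined_reward(True, 0.6),
    ]
    print(group_relative_advantages(group))
```

Because advantages are computed relative to the group mean, a generation is reinforced only when it scores better than its siblings sampled for the same prompt.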
+
+
+ ## Dependencies
+ - Transformers 4.52.4
+ - PyTorch 2.7.0+cu126
+ - Datasets 3.6.0
+ - Tokenizers 0.21.1
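One possible way to check a local environment against the versions listed above, using only the standard library. The distribution names (`torch` for PyTorch in particular) and the `find_mismatches` helper are assumptions for illustration. Other versions may well work; this sketch only reports differences from the listed ones.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, Optional

# Versions listed in this README, keyed by assumed PyPI distribution name.
PINNED = {
    "transformers": "4.52.4",
    "torch": "2.7.0+cu126",
    "datasets": "3.6.0",
    "tokenizers": "0.21.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(
    pinned: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs to what was found."""
    mismatches: Dict[str, Optional[str]] = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches(PINNED).items():
        print(f"{name}: listed {PINNED[name]}, found {found or 'not installed'}")
```

The `lookup` parameter exists so the comparison can be exercised without touching the real environment.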
+
+ ## Citation
+
+ If you find this work useful, please cite:
+
+ ```bibtex
+ @article{zhan2025mathsmith,
+ title={MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy},
+ author={Zhan, Shaoxiong and Lai, Yanlin and Lu, Ziyu and Lin, Dahua and Yang, Ziqing and Tan, Fei},
+ journal={arXiv preprint arXiv:2508.05592},
+ year={2025}
+ }
+ ```