Jasaxion nielsr HF Staff committed on
Commit a716216 · 1 Parent(s): 7c710c4

Add metadata and project page link (#1)

- Add metadata and project page link (8cf3cf78a706e5a3b8ccd317ed415451baea408b)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +18 -18
README.md CHANGED
@@ -1,30 +1,37 @@
 ---
-license: apache-2.0
+base_model: Qwen/Qwen3-8B
 datasets:
 - Jasaxion/MathSmith-HC-Problems
 language:
 - en
-base_model:
-- Qwen/Qwen3-8B
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
 tags:
 - verl
+- math
+- synthetic-data
 ---
 
+# MathSmith-HC-Problem-Synthesizer-Qwen3-8B
+
 **MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy**
 
 [![Paper](https://img.shields.io/badge/arXiv-2508.05592-b31b1b.svg)](https://arxiv.org/abs/2508.05592)
+[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://jasaxion.github.io/MathSmith_ProjectPage/)
 [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
 [![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)]()
 [![GitHub](https://img.shields.io/badge/-GitHub-181717?logo=github)](https://github.com/Jasaxion/MathSmith)
 
-
 ## Overview
 
-The model generates <rationale>–<problem> pairs, where:
-- `<rationale>`: structured reasoning describing concept integration and difficulty design.
+MathSmith is a framework for synthesizing challenging mathematical problems to enhance LLM reasoning. This model is a reinforced policy-based synthesizer optimized to generate novel, Olympiad-level mathematical problems from scratch.
+
+The model generates `<rationale>`–`<problem>` pairs, where:
+- `<rationale>`: structured reasoning describing concept integration and difficulty design strategies.
 - `<problem>`: a single Olympiad-level mathematical question that admits a verifiable numeric or symbolic answer.
 
-**MathSmith-HC** combines *complexity* and *consistency* as difficulty rewards, producing more stable problems than **MathSmith-Hard**.
+**MathSmith-HC** (High Consistency) combines *complexity* and *consistency* as difficulty rewards during reinforcement learning, producing more stable problems than the version optimized solely for complexity.
 
 ---
 
@@ -32,17 +39,10 @@ The model generates <rationale>–<problem> pairs, where:
 
 The MathSmith framework consists of four main stages:
 
-1. **Concept Collection**: Randomly sample concept–explanation pairs from [PlanetMath](https://planetmath.org/) to ensure data independence.
-
-2. **Supervised Fine-tuning (SFT)**: Train the model on collected concept–explanation pairs to establish foundational understanding.
-
-3. **Reinforcement Learning (RL)**: Optimize the model using GRPO with rewards based on:
-   - Structural validity
-   - Reasoning complexity
-   - Answer consistency
-
-4. **Weakness-Focused Self-Improvement**: Iteratively identify and address model weaknesses by generating targeted problem variants.
-
+1. **Concept Collection**: Randomly sample concept–explanation pairs from [PlanetMath](https://planetmath.org/) to ensure data independence and avoid benchmark contamination.
+2. **Supervised Fine-tuning (SFT)**: Train the model on collected concept–explanation pairs to establish foundational understanding of problem generation.
+3. **Reinforcement Learning (RL)**: Optimize the model using GRPO with rewards based on structural validity, reasoning complexity (trace length), and answer consistency.
+4. **Weakness-Focused Self-Improvement**: Iteratively identify and address model weaknesses by generating targeted problem variants for specific mathematical concepts.
 
 ## Dependence
 - Transformers 4.52.4
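
Reviewer note: the README in this diff says the model emits a `<rationale>` and `<problem>` pair and that "structural validity" is one of the RL rewards, but it does not show the raw output layout. The sketch below is one possible way to split such an output and run a toy structural check. It assumes XML-style closing tags (`</rationale>`, `</problem>`), which the README does not confirm; `parse_pair` and `structural_validity` are hypothetical helper names, not part of the MathSmith codebase, and the binary score is an illustration rather than the reward used in training.

```python
import re
from typing import Optional, Tuple

# Assumption: each part is wrapped in XML-style tags, e.g.
# <rationale>...</rationale> followed by <problem>...</problem>.
# The README only names the two tags; the closing tags are hypothetical.
_PAIR_RE = re.compile(
    r"<rationale>(?P<rationale>.*?)</rationale>\s*"
    r"<problem>(?P<problem>.*?)</problem>",
    re.DOTALL,
)


def parse_pair(text: str) -> Optional[Tuple[str, str]]:
    """Return (rationale, problem) if exactly one non-empty pair is present."""
    matches = list(_PAIR_RE.finditer(text))
    if len(matches) != 1:
        # Zero pairs or several pairs: treat the output as malformed.
        return None
    rationale = matches[0].group("rationale").strip()
    problem = matches[0].group("problem").strip()
    if not rationale or not problem:
        return None
    return rationale, problem


def structural_validity(text: str) -> float:
    """Toy binary check: 1.0 for a single well-formed pair, 0.0 otherwise."""
    return 1.0 if parse_pair(text) is not None else 0.0
```

A caller would pass the decoded generation string to `parse_pair` and keep only the `problem` part for downstream solving; outputs that return `None` can be discarded or regenerated.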