Jasaxion/MathSmith-HC-Problem-Synthesizer-Qwen3-8B

@@ -1,30 +1,37 @@
 ---
-license: apache-2.0
 datasets:
 - Jasaxion/MathSmith-HC-Problems
 language:
 - en
-base_model:
-- Qwen/Qwen3-8B
 tags:
 - verl
 ---
 **MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy**
 [![Paper](https://img.shields.io/badge/arXiv-2508.05592-b31b1b.svg)](https://arxiv.org/abs/2508.05592)
 [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
 [![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)]()
 [![GitHub](https://img.shields.io/badge/-GitHub-181717?logo=github)](https://github.com/Jasaxion/MathSmith)
 ## Overview
-The model generates <rationale>–<problem> pairs, where:
-- `<rationale>`: structured reasoning describing concept integration and difficulty design.
 - `<problem>`: a single Olympiad-level mathematical question that admits a verifiable numeric or symbolic answer.
-**MathSmith-HC** combines *complexity* and *consistency* as difficulty rewards, producing more stable problems than **MathSmith-Hard**.
 ---
@@ -32,17 +39,10 @@ The model generates <rationale>–<problem> pairs, where:
 The MathSmith framework consists of four main stages:
-1. **Concept Collection**: Randomly sample concept–explanation pairs from [PlanetMath](https://planetmath.org/) to ensure data independence.
-2. **Supervised Fine-tuning (SFT)**: Train the model on collected concept–explanation pairs to establish foundational understanding.
-3. **Reinforcement Learning (RL)**: Optimize the model using GRPO with rewards based on:
-   - Structural validity
-   - Reasoning complexity
-   - Answer consistency
-4. **Weakness-Focused Self-Improvement**: Iteratively identify and address model weaknesses by generating targeted problem variants.
 ## Dependence
 - Transformers 4.52.4

 ---
+base_model: Qwen/Qwen3-8B
 datasets:
 - Jasaxion/MathSmith-HC-Problems
 language:
 - en
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
 tags:
 - verl
+- math
+- synthetic-data
 ---
+# MathSmith-HC-Problem-Synthesizer-Qwen3-8B
 **MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy**
 [![Paper](https://img.shields.io/badge/arXiv-2508.05592-b31b1b.svg)](https://arxiv.org/abs/2508.05592)
+[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://jasaxion.github.io/MathSmith_ProjectPage/)
 [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
 [![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)]()
 [![GitHub](https://img.shields.io/badge/-GitHub-181717?logo=github)](https://github.com/Jasaxion/MathSmith)
 ## Overview
+MathSmith is a framework for synthesizing challenging mathematical problems to enhance LLM reasoning. This model is a reinforced policy-based synthesizer optimized to generate novel, Olympiad-level mathematical problems from scratch.
+The model generates `<rationale>`–`<problem>` pairs, where:
+- `<rationale>`: structured reasoning describing concept integration and difficulty design strategies.
 - `<problem>`: a single Olympiad-level mathematical question that admits a verifiable numeric or symbolic answer.
+**MathSmith-HC** (High Consistency) combines *complexity* and *consistency* as difficulty rewards during reinforcement learning, producing more stable problems than the version optimized solely for complexity.
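Downstream code that consumes the `<rationale>` and `<problem>` output described above will usually need to split the two parts. The following is a minimal sketch of one way to do that. It assumes each section is closed with a matching `</rationale>` or `</problem>` tag, which the card does not state explicitly; `SynthesizedProblem` and `parse_pair` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple, Optional


class SynthesizedProblem(NamedTuple):
    """One generated pair (hypothetical container, not part of the model's API)."""
    rationale: str
    problem: str


# Assumption: sections are wrapped as <rationale>...</rationale> and
# <problem>...</problem>, in that order. Adjust if the raw output differs.
_PAIR = re.compile(
    r"<rationale>(.*?)</rationale>\s*<problem>(.*?)</problem>",
    re.DOTALL,
)


def parse_pair(text: str) -> Optional[SynthesizedProblem]:
    """Return the first rationale/problem pair in `text`, or None if malformed."""
    match = _PAIR.search(text)
    if match is None:
        return None
    rationale, problem = (part.strip() for part in match.groups())
    # Treat an empty section as a structurally invalid generation.
    if not rationale or not problem:
        return None
    return SynthesizedProblem(rationale, problem)
```

Returning `None` for malformed output, rather than raising, makes it easy to filter a batch of generations and keep only the well-formed pairs.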
 ---
 The MathSmith framework consists of four main stages:
+1. **Concept Collection**: Randomly sample concept–explanation pairs from [PlanetMath](https://planetmath.org/) to ensure data independence and avoid benchmark contamination.
+2. **Supervised Fine-tuning (SFT)**: Train the model on collected concept–explanation pairs to establish foundational understanding of problem generation.
+3. **Reinforcement Learning (RL)**: Optimize the model using GRPO with rewards based on structural validity, reasoning complexity (trace length), and answer consistency.
+4. **Weakness-Focused Self-Improvement**: Iteratively identify and address model weaknesses by generating targeted problem variants for specific mathematical concepts.
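To make the three reward signals in stage 3 concrete, here is a toy sketch of how structural validity, trace-length complexity, and answer consistency might be combined into one scalar. This is not the reward used in the paper: the weighted sum, the `weight` parameter, the `max_len` cap, and all function names are assumptions chosen only for illustration.

```python
from collections import Counter
from typing import Sequence


def consistency(answers: Sequence[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        return 0.0
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)


def complexity(trace_lengths: Sequence[int], max_len: int = 8192) -> float:
    """Mean reasoning-trace length, scaled into [0, 1] by a hypothetical cap."""
    if not trace_lengths:
        return 0.0
    mean_len = sum(trace_lengths) / len(trace_lengths)
    return min(mean_len / max_len, 1.0)


def toy_reward(
    is_valid: bool,
    answers: Sequence[str],
    trace_lengths: Sequence[int],
    weight: float = 0.5,   # assumed trade-off between complexity and consistency
    max_len: int = 8192,   # assumed normalization cap for trace length
) -> float:
    """Gate on structural validity, then mix complexity and consistency."""
    if not is_valid:
        return 0.0
    return (
        weight * complexity(trace_lengths, max_len)
        + (1.0 - weight) * consistency(answers)
    )
```

In this sketch a structurally invalid generation earns nothing, while a valid one is rewarded both for eliciting long solver traces and for yielding answers that agree across samples, which mirrors the complexity-plus-consistency idea behind the HC variant.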
 ## Dependence
 - Transformers 4.52.4