Improve model card for Fast-OpenMath-Nemotron-14B
#2
by nielsr (HF Staff) - opened

README.md CHANGED

@@ -1,9 +1,9 @@
 ---
 base_model:
 - nvidia/OpenMath-Nemotron-14B
+library_name: transformers
 license: cc-by-4.0
 pipeline_tag: text-generation
-library_name: transformers
 tags:
 - mathematical-reasoning
 - qwen2

@@ -11,39 +11,105 @@ paper: https://huggingface.co/papers/2507.08267
 ---
 
 # Fast-OpenMath-Nemotron-14B
 This model is based on the paper [A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning](https://huggingface.co/papers/2507.08267).
 
-By applying SFT and GRPO on difficult math problems,
-which achieves approx. 30% faster inference on average, while maintaining accuracy.
-
-Compared to OpenMath-Nemotron-14B, this model enables approx. 30% faster inference on average, with minimal loss in performance.
 
 <img src='https://github.com/analokmaus/kaggle-aimo2-fast-math-r1/blob/master/assets/pass1_aime_all.png?raw=true' max-height='400px'>
 
-| Model
-##
 ```python
 from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
@@ -70,7 +136,7 @@ messages = [
     {
         'role': 'user',
         'content': (
-            'Solve the problem, and put the answer in
+            'Solve the problem, and put the answer in \\boxed{{}}. '
             'Sarah is twice as old as her youngest brother. If the difference between their ages is 15 years, how old is her youngest brother?'
         )
     }
@@ -81,4 +147,7 @@ messages = tokenizer.apply_chat_template(
     add_generation_prompt=True
 )
 response = vllm_engine.generate(messages, sampling_params=sampling_params)
-```

The updated README.md:
---
base_model:
- nvidia/OpenMath-Nemotron-14B
library_name: transformers
license: cc-by-4.0
pipeline_tag: text-generation
tags:
- mathematical-reasoning
- qwen2
paper: https://huggingface.co/papers/2507.08267
---

# Fast-OpenMath-Nemotron-14B

## Paper
[A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning](https://huggingface.co/papers/2507.08267)

### Abstract
Enhancing the mathematical reasoning of Large Language Models (LLMs) is a pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a systematic methodology for combining them to maximize both accuracy and efficiency remains largely unexplored. This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference (GRPO). We posit that these methods play complementary, not competing, roles: a prolonged SFT phase first pushes the model's accuracy to its limits, after which a GRPO phase dramatically improves token efficiency while preserving this peak performance. Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO in this framework is to optimize solution length. The efficacy of our recipe is rigorously validated through top-tier performance on challenging benchmarks, including a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO). This work provides the community with a battle-tested blueprint for developing state-of-the-art mathematical reasoners that are both exceptionally accurate and practically efficient. To ensure full reproducibility and empower future research, we will open-source our entire framework, including all code, model checkpoints, and training configurations at this https URL.

## Model Description
This model is based on the paper [A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning](https://huggingface.co/papers/2507.08267).

By applying Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) to difficult math problems, the authors enhanced `DeepSeek-R1-Distill-Qwen-14B` and developed [`Fast-Math-R1-14B`](https://huggingface.co/RabotniKuma/Fast-Math-R1-14B), which achieves up to 60% faster inference (approx. 30% on average) while maintaining accuracy.

`Fast-OpenMath-Nemotron-14B` is an efficiency-optimized version of NVIDIA's [`OpenMath-Nemotron-14B`](https://huggingface.co/nvidia/OpenMath-Nemotron-14B), trained with the same recipe. Compared to `OpenMath-Nemotron-14B`, it enables approx. 30% faster inference on average with minimal loss in performance.

## Team
- Hiroshi Yoshihara @ [Aillis Inc.](https://aillis.jp/en), [The Univ. of Tokyo](https://publichealth.f.u-tokyo.ac.jp/#page_home)
- Yuichi Inoue @ [Sakana AI](https://sakana.ai)
- Taiki Yamaguchi @ [Rist Inc.](https://www.rist.co.jp/en/)

## Links
- **Paper:** [A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning](https://huggingface.co/papers/2507.08267)
- **GitHub Repository:** [analokmaus/kaggle-aimo2-fast-math-r1](https://github.com/analokmaus/kaggle-aimo2-fast-math-r1/tree/master)
- **Project Page:** [Fast-Math-R1](https://analokmaus.github.io/Fast-Math-R1/)
- **Kaggle Discussion:** [AI Mathematical Olympiad - Progress Prize 2 - 9th Place Solution](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/discussion/571252)

## Performance Comparison
<img src='https://github.com/analokmaus/kaggle-aimo2-fast-math-r1/blob/master/assets/pass1_aime_all.png?raw=true' max-height='400px'>

The table below compares `OpenMath-Nemotron-14B` and `Fast-OpenMath-Nemotron-14B` on the AIME 2024 and AIME 2025 benchmarks.

| Model | Token budget | AIME 2024 Pass@1 (avg. of 64 runs) | AIME 2024 mean output tokens | AIME 2025 Pass@1 (avg. of 64 runs) | AIME 2025 mean output tokens |
|---|---|---|---|---|---|
| OpenMath-Nemotron-14B | 32000 | 76.2 | 11493 | 64.5 | 13414 |
| | 24000 | 75.4 | 11417 | 63.4 | 13046 |
| | 16000 | 66.0 | 10399 | 54.2 | 11422 |
| | 12000 | 55.0 | 9053 | 40.0 | 9609 |
| | 8000 | 36.0 | 6978 | 27.2 | 7083 |
| **Fast-OpenMath-Nemotron-14B** | 32000 | 70.7 | 9603 | 61.4 | 11424 |
| | 24000 | 70.6 | 9567 | 60.9 | 11271 |
| | 16000 | 66.6 | 8954 | 55.3 | 10190 |
| | 12000 | 59.4 | 7927 | 45.6 | 8752 |
| | 8000 | 47.6 | 6282 | 33.8 | 6589 |
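
The token budget corresponds to capping generation length at inference time. As a minimal illustration (values here are an example, not the exact evaluation harness), a 16000-token budget maps to `max_tokens` in vLLM:

```python
from vllm import SamplingParams

# Example: enforce a 16000-token budget and stop early at the </think> tag
sampling_params = SamplingParams(max_tokens=16000, stop='</think>')
```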
## Download
In addition to this model, the following related models and datasets are available; all of them can also be fetched programmatically, as sketched below the list.
- `Fast-Math-R1-14B` model: [Hugging Face](https://huggingface.co/RabotniKuma/Fast-Math-R1-14B) and [Kaggle Models](https://www.kaggle.com/models/analokamus/fast_math_r1_14b/)
- `Fast-Math-Qwen3-14B` model: [Hugging Face](https://huggingface.co/RabotniKuma/Fast-Math-Qwen3-14B)
- First-stage SFT dataset: [RabotniKuma/Fast-Math-R1-SFT](https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT)
- Second-stage GRPO dataset: [RabotniKuma/Fast-Math-R1-GRPO](https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-GRPO)
- (Optional) Token scheduler dataset: [RabotniKuma/Fast-Math-R1-Token-Scheduler](https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-Token-Scheduler)
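
A minimal sketch with `huggingface_hub` and `datasets` (the repo id `RabotniKuma/Fast-OpenMath-Nemotron-14B` for this model is assumed from this card):

```python
from huggingface_hub import snapshot_download
from datasets import load_dataset

# Download this model's weights to a local directory
model_dir = snapshot_download('RabotniKuma/Fast-OpenMath-Nemotron-14B')

# Load the first-stage SFT dataset directly from the Hub
sft_data = load_dataset('RabotniKuma/Fast-Math-R1-SFT', split='train')
```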
## Training Details
The training recipe for `Fast-OpenMath-Nemotron-14B` follows a practical two-stage approach: extended Supervised Fine-Tuning (SFT), then reinforcement learning from online inference with GRPO.

### 1. First Stage: Intensive SFT on a High-Difficulty Dataset
Full-parameter supervised fine-tuning was run on a machine with 8 H200 GPUs, using the `SFTTrainer` from the `trl` library.

#### Dataset for SFT
A high-difficulty dataset of 7,900 (problem, R1 trace, answer) sets was constructed by merging and cleaning examples from:
- [OpenR1 Math](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k): 3,000 randomly sampled examples whose R1 trace exceeded 12,800 tokens with an accuracy above 50%, plus another 3,000 examples with an accuracy between 50% and 75%.
- [openr1_hard](https://huggingface.co/datasets/hoanganhpham/openr1_hard): approximately 2.5k hard samples from `open-r1-math-220k` that `r1-distill-32b` failed to solve in 4 tries.
- [Light-R1-SFTData](https://huggingface.co/datasets/qihoo360/Light-R1-SFTData): the 2nd-stage data from this dataset.

Duplicates were removed, and for samples in the Light-R1 dataset that lacked ground-truth answers, answers were extracted from the R1 traces and substituted. The processed dataset is available at [RabotniKuma/Fast-Math-R1-SFT](https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT).

#### Training Command Example
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml --num_processes 8 \
experiments/train_first_stage.py
```
*(Refer to the GitHub repository for the detailed `accelerate_configs/deepspeed_zero3.yaml` configuration.)*
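
For orientation, a minimal sketch of what a first-stage SFT setup could look like with `trl` (illustrative only; hyperparameters and script structure are assumptions, not the repository's exact `experiments/train_first_stage.py`):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# First-stage SFT dataset published with this work
dataset = load_dataset('RabotniKuma/Fast-Math-R1-SFT', split='train')

trainer = SFTTrainer(
    model='nvidia/OpenMath-Nemotron-14B',  # base model of this card
    train_dataset=dataset,
    args=SFTConfig(
        num_train_epochs=10,  # the paper reports extended SFT (up to ~10 epochs) is crucial
        bf16=True,
        output_dir='fast-openmath-nemotron-14b-sft',
    ),
)
trainer.train()
```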
### 2. Second Stage: GRPO for More Efficient Reasoning
This stage focuses on dramatically improving token efficiency while preserving peak performance, leveraging a [faster implementation of the `trl` GRPOTrainer](https://github.com/nhannguyen2709/open-r1).

#### Dataset for GRPO
- [Light-R1-SFTData](https://huggingface.co/datasets/qihoo360/Light-R1-SFTData): answers extracted from the 2nd-stage SFT data.

The GRPO dataset used in this stage is available at [RabotniKuma/Fast-Math-R1-GRPO](https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-GRPO).

#### Reward Functions
The GRPO phase uses several reward functions to optimize the model's behavior (a sketch of the first two follows the list):
1. **Format Reward**: To save output tokens, the model is rewarded for giving an answer before the `</think>` tag. This is achieved by rewarding the pattern `r"^.*?\\boxed\{(.*?)\}.*?</think>.*?$"`. During inference, generation is stopped at `</think>`.
2. **Cosine Reward**: Compared to a plain accuracy-based reward, the cosine reward applies a continuous penalty to longer correct reasoning traces and to shorter incorrect ones.
3. **Length Reward**: Length-based rewards are applied to discourage overthinking and promote token efficiency. For details, see the paper [https://arxiv.org/abs/2501.12599](https://arxiv.org/abs/2501.12599).
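
A hedged sketch of the format and cosine rewards as described above (illustrative, not the repository's exact implementation; the numeric reward bounds are assumptions):

```python
import math
import re

# Reward completions that state a \boxed{...} answer before closing </think>
BOXED_BEFORE_THINK = re.compile(r"^.*?\\boxed\{(.*?)\}.*?</think>.*?$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if a boxed answer appears before the </think> tag, else 0.0."""
    return 1.0 if BOXED_BEFORE_THINK.match(completion) else 0.0

def cosine_reward(correct: bool, gen_len: int, max_len: int) -> float:
    """Interpolate the reward along a cosine curve in the generation length.

    Long correct traces earn less than short ones; short incorrect traces
    are penalized harder than long ones.
    """
    t = min(gen_len / max_len, 1.0)
    start, end = (1.0, 0.5) if correct else (-1.0, -0.5)
    # Slides from `start` at t=0 to `end` at t=1
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t))
```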
#### Training Command Example
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
accelerate launch --config_file accelerate_configs/deepspeed_zero2.yaml --num_processes 8 \
experiments/train_second_stage.py
```
*(Refer to the GitHub repository for the detailed `accelerate_configs/deepspeed_zero2.yaml` configuration.)*
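
A minimal sketch of how such reward functions could be wired into `trl`'s `GRPOTrainer` (illustrative only; argument values and dataset layout are assumptions, and the actual run uses the faster fork linked above):

```python
import re

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

BOXED_BEFORE_THINK = re.compile(r"^.*?\\boxed\{(.*?)\}.*?</think>.*?$", re.DOTALL)

def format_reward(completions, **kwargs):
    # One scalar reward per completion: the answer must be boxed before </think>
    return [1.0 if BOXED_BEFORE_THINK.match(c) else 0.0 for c in completions]

dataset = load_dataset('RabotniKuma/Fast-Math-R1-GRPO', split='train')

trainer = GRPOTrainer(
    model='nvidia/OpenMath-Nemotron-14B',
    reward_funcs=[format_reward],
    train_dataset=dataset,
    args=GRPOConfig(output_dir='fast-openmath-nemotron-14b-grpo'),
)
trainer.train()
```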
## How to Use
The model can be used for inference with `vLLM`, as shown in the example below.

### vLLM
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


model_path = 'RabotniKuma/Fast-OpenMath-Nemotron-14B'
vllm_engine = LLM(
    model=model_path,
    max_model_len=8192,
    gpu_memory_utilization=0.9,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)


sampling_params = SamplingParams(
    temperature=1.0,
    top_p=0.90,
    min_p=0.05,
    max_tokens=8192,
    stop='</think>',  # generation is stopped at </think> to save output tokens
)
messages = [
    {
        'role': 'user',
        'content': (
            'Solve the problem, and put the answer in \\boxed{{}}. '
            'Sarah is twice as old as her youngest brother. If the difference between their ages is 15 years, how old is her youngest brother?'
        )
    }
]
messages = tokenizer.apply_chat_template(
    conversation=messages,
    tokenize=False,
    add_generation_prompt=True
)
response = vllm_engine.generate(messages, sampling_params=sampling_params)
```
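
Since generation stops at the `</think>` tag, the final answer can be pulled from the last `\boxed{...}` in the output (a small helper, assumed rather than part of the original card):

```python
import re

text = response[0].outputs[0].text
boxed = re.findall(r'\\boxed\{(.*?)\}', text)
print(boxed[-1] if boxed else None)
```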

## Note
This model likely inherits the ability to perform inference in TIR (tool-integrated reasoning) mode from the original model. However, all of our experiments were conducted in CoT (chain-of-thought) mode, and its performance in TIR mode has not been evaluated.