Miaow-Lab/RLVR-Linearity-Checkpoints

Text Generation


louiswng committed on Jan 26

Commit 4b925a1 · verified · 1 Parent(s): d143c06

Update README.md

Files changed (1)

README.md +10 -11


@@ -10,33 +10,28 @@ pipeline_tag: text-generation
 # Model Card
 ## 1. Model Details
-This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training"**. It was trained using Reinforcement Learning (GRPO) to enhance mathematical reasoning capabilities.
 - **Paper:** [ArXiv](https://arxiv.org/pdf/2601.04537v1)
 - **Code:** [Github](https://github.com/Miaow-Lab/RLVR-Linearity)
 - **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
 - **Training Method:** GRPO
-## 2. Performance
-We evaluated the model on standard math benchmarks. Key results include:
-| Benchmark | Avg@64 |
-| :--- | :--- |
-| AIME 2024 | **41.93%** |
-## 3. Training Details
 - **Hyperparameters:**
   - Learning Rate: `1e-6`
   - Train Batch Size: `128`
   - PPO Mini Batch Size: `64`
   - RL Algorithm: `GRPO`
 - **Compute:** Trained on `32 x H100` GPUs for about `150` hours.
 For full training configurations, please refer to the `config.json` or the training scripts in our [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity).
-## 4. Citation
 If you use this model in your research, please cite our paper:
@@ -50,4 +45,8 @@ If you use this model in your research, please cite our paper:
       primaryClass={cs.LG},
       url={https://arxiv.org/abs/2601.04537},
 }
-```

 # Model Card
 ## 1. Model Details
+This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training"**. It was trained using Reinforcement Learning (RL) to enhance reasoning capabilities.
 - **Paper:** [ArXiv](https://arxiv.org/pdf/2601.04537v1)
 - **Code:** [Github](https://github.com/Miaow-Lab/RLVR-Linearity)
 - **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
 - **Training Method:** GRPO
+## 2. Training Details
 - **Hyperparameters:**
   - Learning Rate: `1e-6`
   - Train Batch Size: `128`
   - PPO Mini Batch Size: `64`
   - RL Algorithm: `GRPO`
+  - Rollout Temperature: 1.0
+  - Group Size: 16
 - **Compute:** Trained on `32 x H100` GPUs for about `150` hours.
 For full training configurations, please refer to the `config.json` or the training scripts in our [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity).
+## 3. Citation
 If you use this model in your research, please cite our paper:
       primaryClass={cs.LG},
       url={https://arxiv.org/abs/2601.04537},
 }
+```
+> [!TIP]
+> **Motivation for this Model**
+> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine‑tuning.
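
The hyperparameter list above names GRPO with a group size of 16. As a rough illustration of what the group size controls, the sketch below standardizes rewards within one group of rollouts for a single prompt, which is the usual group-relative advantage in GRPO. It is not taken from the repository's training code: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev

GROUP_SIZE = 16  # "Group Size" from the model card's hyperparameter list


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for a single prompt.

    Hypothetical helper: subtracts the group mean and divides by the group
    standard deviation (population form, plus eps to avoid division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up binary verifier rewards: 4 correct and 12 incorrect rollouts.
rewards = [1.0] * 4 + [0.0] * (GROUP_SIZE - 4)
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.0f} advantage={a:+.3f}")
```

Under this formulation, a group in which every rollout receives the same reward yields all-zero advantages, so such a prompt contributes no gradient signal for that step.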