---
license: apache-2.0
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
---
# Model Card
## 1. Model Details
This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs' RLVR Training"**. It was trained with reinforcement learning from verifiable rewards (RLVR) to enhance its reasoning capabilities.
- **Paper:** [arXiv](https://arxiv.org/pdf/2601.04537v1)
- **Code:** [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity)
- **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Training Method:** GRPO
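The checkpoint can be loaded like any Hugging Face causal LM. Below is a minimal inference sketch; note that `MODEL_ID` is a placeholder for this repository's Hugging Face id (not stated in this card), and the prompt and sampling settings are illustrative only.

```python
# Minimal inference sketch. MODEL_ID is a placeholder for this repository's
# Hugging Face id — it is NOT confirmed by the card; substitute the real id.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Miaow-Lab/RLVR-Linearity-Checkpoint"  # placeholder id


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    # Format the prompt with the model's chat template, as expected by
    # DeepSeek-R1-Distill-style models.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("What is 12 * 7? Reason step by step."))
```

Reasoning-distilled base models of this family tend to emit long chains of thought, so a generous `max_new_tokens` is usually needed for complete answers.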
## 2. Training Details
- **Hyperparameters:**
  - Learning Rate: `1e-6`
  - Train Batch Size: `128`
  - PPO Mini Batch Size: `64`
  - RL Algorithm: `GRPO`
  - Rollout Temperature: `1.0`
  - Group Size: `16`
- **Compute:** Trained on `32 x H100` GPUs for about `150` hours.
For full training configurations, please refer to the `config.json` or the training scripts in our [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity).
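For reference, the hyperparameters listed above can be collected into a single config mapping. The sketch below simply restates the card's values; the key names are illustrative and not taken from the authors' actual training scripts.

```python
# Illustrative GRPO training config restating the values listed in this card.
# Key names are hypothetical — consult the authors' GitHub repo for the real config.
grpo_config = {
    "learning_rate": 1e-6,
    "train_batch_size": 128,
    "ppo_mini_batch_size": 64,
    "rl_algorithm": "GRPO",
    "rollout_temperature": 1.0,
    "group_size": 16,  # rollouts sampled per prompt for the GRPO group baseline
}

# Each training batch of 128 prompts is split into mini-batches of 64,
# giving two gradient updates per batch.
updates_per_batch = grpo_config["train_batch_size"] // grpo_config["ppo_mini_batch_size"]
print(updates_per_batch)  # 2
```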
## 3. Citation
If you use this model in your research, please cite our paper:
```bibtex
@misc{wang2026stepsinformativelinearityllms,
title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
year={2026},
eprint={2601.04537},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.04537},
}
```
> [!TIP]
> **Motivation for this Model**
> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine‑tuning.