---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- reasoning
- grpo
- reinforcement-learning
---

# Model Card

## 1. Model Details

This model is a fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs' RLVR Training"**. It was trained with reinforcement learning from verifiable rewards (RLVR) to investigate the phenomenon of linear evolution in model weights and output log-probabilities during RLVR training.

- **Paper:** [Not All Steps are Informative: On the Linearity of LLMs' RLVR Training](https://huggingface.co/papers/2601.04537)
- **Code:** [GitHub Repository](https://github.com/Miaow-Lab/RLVR-Linearity)
- **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Training Method:** GRPO (using the `verl` framework)

## 2. Training Details

- **Hyperparameters:**
  - Learning Rate: `1e-6`
  - Train Batch Size: `128`
  - PPO Mini Batch Size: `64`
  - RL Algorithm: `GRPO`
  - Rollout Temperature: `1.0`
  - Group Size: `16`
- **Compute:** Trained on `32 x H100` GPUs for about `150` hours.

For full training configurations, please refer to the `config.json` or the training scripts in the official [GitHub repository](https://github.com/Miaow-Lab/RLVR-Linearity).

## 3. Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537},
}
```

> [!TIP]
> **Motivation for this Model**
> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine-tuning.
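The RL algorithm used here, GRPO, scores each rollout relative to the other rollouts sampled for the same prompt rather than relying on a learned value function. The sketch below illustrates this group-relative advantage with the group size from this card (16); the `group_advantages` helper and the toy reward values are illustrative, not taken from the actual training code:

```python
# Sketch of GRPO's group-relative advantage. Rewards here are toy
# values; in RLVR training they come from the verifier (e.g. 1.0 for
# a verified-correct answer, 0.0 otherwise).
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, group size 16: suppose 4 of the 16 rollouts are correct.
rewards = [1.0] * 4 + [0.0] * 12
adv = group_advantages(rewards)
```

Correct rollouts receive positive advantages and incorrect ones negative, so a prompt where every rollout succeeds (or every rollout fails) contributes no learning signal — which is the sense in which "not all steps are informative".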
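Since this checkpoint is released to support analysis of linearity in weights and log-probabilities, here is a minimal sketch of one way such an analysis could be set up: fit a least-squares line to a scalar quantity (a weight coordinate or an output log-probability) recorded across training steps and report the R² of the fit as a linearity score. The `linearity_r2` helper and the synthetic trajectory are illustrative assumptions, not the paper's actual methodology:

```python
# Quantify how linear a recorded trajectory is via the R^2 of its
# best-fit line. The trajectory below is synthetic.

def linearity_r2(steps, values):
    """R^2 of the least-squares fit values ~ a * steps + b."""
    n = len(steps)
    mx = sum(steps) / n
    my = sum(values) / n
    sxx = sum((x - mx) ** 2 for x in steps)
    sxy = sum((x - mx) * (y - my) for x, y in zip(steps, values))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(steps, values))
    ss_tot = sum((y - my) ** 2 for y in values)
    return 1.0 - ss_res / ss_tot

# A perfectly linear synthetic trajectory of a log-probability:
steps = list(range(0, 100, 10))
values = [0.5 - 0.001 * s for s in steps]
score = linearity_r2(steps, values)
```

A score near 1.0 indicates the quantity evolved almost exactly along a straight line over the recorded steps, while curvature in the trajectory pushes the score below 1.0.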