---
license: apache-2.0
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
---
# Model Card
## 1. Model Details
This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs' RLVR Training"**. It was trained with reinforcement learning from verifiable rewards (RLVR) to enhance its reasoning capabilities.
- **Paper:** [arXiv](https://arxiv.org/pdf/2601.04537v1)
- **Code:** [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity)
- **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Training Method:** GRPO
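Since this is a standard text-generation checkpoint, it should load with the usual `transformers` APIs. The sketch below is illustrative: `MODEL_ID` is a placeholder (this card does not state the Hub repository ID), and the sampling settings are generic defaults, not values prescribed by the paper.

```python
# Minimal inference sketch with Hugging Face transformers.
# MODEL_ID is a placeholder -- replace it with this checkpoint's
# actual Hub repository ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Miaow-Lab/your-checkpoint-id"  # hypothetical, replace

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    # DeepSeek-R1-Distill models ship a chat template; apply it for best results.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(
        inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.6
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("Solve: what is 12 * 13?"))
```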
## 2. Training Details
- **Hyperparameters:**
  - Learning Rate: `1e-6`
  - Train Batch Size: `128`
  - PPO Mini-Batch Size: `64`
  - RL Algorithm: `GRPO`
  - Rollout Temperature: `1.0`
  - Group Size: `16`
- **Compute:** Trained on `32 x H100` GPUs for about `150` hours.
For the full training configuration, please refer to the `config.json` or the training scripts in our [GitHub repository](https://github.com/Miaow-Lab/RLVR-Linearity).
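To make the group-size hyperparameter concrete: GRPO scores each of the 16 rollouts sampled per prompt against its group's statistics. The following is a minimal sketch of that group-relative advantage computation, assuming the standard mean/std normalization from the GRPO literature (not code from this repository):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: z-score each reward within its group.

    `rewards` holds the verifiable rewards of all responses sampled for
    one prompt (group size 16 in this card's setup).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts got the same reward: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: 16 rollouts, 4 of which solved the problem (reward 1.0).
rewards = [1.0] * 4 + [0.0] * 12
advs = grpo_advantages(rewards)
```

Correct rollouts receive positive advantages and incorrect ones negative, and the advantages of a group sum to zero.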
## 3. Citation
If you use this model in your research, please cite our paper:
```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537},
}
```
> [!TIP]
> **Motivation for this Model**
> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine‑tuning.