---
license: apache-2.0
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
---

# Model Card

## 1. Model Details

This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training"**. It was trained with Reinforcement Learning with Verifiable Rewards (RLVR) to strengthen the base model's reasoning capabilities.

- **Paper:** [arXiv](https://arxiv.org/pdf/2601.04537v1)
- **Code:** [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity)
- **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Training Method:** GRPO
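
Since the checkpoint shares its architecture and tokenizer with the base model, it can be loaded with the standard `transformers` API. Below is a minimal inference sketch; the repo id `Miaow-Lab/RLVR-Linearity-Checkpoint` is a placeholder (this card does not state the repository's id), and the sampling settings are illustrative rather than prescribed by the paper.

```python
# Minimal inference sketch using Hugging Face transformers.
# NOTE: the repo id below is a placeholder; replace it with this model's actual id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Miaow-Lab/RLVR-Linearity-Checkpoint"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision inference; adjust to your hardware
    device_map="auto",
)

# DeepSeek-R1-distill models use a chat template; the reasoning trace appears
# before the final answer in the generated text.
messages = [{"role": "user", "content": "What is 17 * 24? Reason step by step."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```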

## 2. Training Details

- **Hyperparameters:**
  - Learning Rate: `1e-6`
  - Train Batch Size: `128`
  - PPO Mini Batch Size: `64`
  - RL Algorithm: `GRPO`
  - Rollout Temperature: `1.0`
  - Group Size: `16`
- **Compute:** Trained on `32 x H100` GPUs for about `150` hours.

For the full training configuration, please refer to `config.json` or the training scripts in our [GitHub repository](https://github.com/Miaow-Lab/RLVR-Linearity).
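
For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage computation at the core of the algorithm, using the group size of `16` listed above. It is a simplified illustration of the standard GRPO formulation, not the released training code; see the GitHub repository for the actual implementation.

```python
# Illustrative sketch of GRPO's group-relative advantage computation.
# A simplification for exposition, not the released training code.
import torch

GROUP_SIZE = 16  # rollouts sampled per prompt, as listed above

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each rollout's reward against its own group.

    rewards: shape (num_prompts, GROUP_SIZE), one verifiable reward per
             rollout (e.g. 1.0 for a correct final answer, 0.0 otherwise).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 16 rollouts each, with binary correctness rewards.
rewards = torch.randint(0, 2, (2, GROUP_SIZE)).float()
print(grpo_advantages(rewards))
```

Note that when every rollout in a group receives the same reward, the normalized advantages are all zero, so such prompts contribute no gradient signal for that step.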

## 3. Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537},
}
```

> [!TIP]
> **Motivation for this Model**
>
> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine-tuning.