---
license: apache-2.0
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
---

# Model Card

## 1. Model Details
This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training"**. It was trained with reinforcement learning from verifiable rewards (RLVR) to enhance the base model's reasoning capabilities.

- **Paper:** [arXiv](https://arxiv.org/pdf/2601.04537v1)
- **Code:** [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity)
- **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Training Method:** GRPO
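
The checkpoint can be loaded like any causal LM on the Hub. A minimal quick-start sketch is below; note that the repo id is a **placeholder** (this card does not state the checkpoint's Hub path), and the sampling settings are illustrative, not the paper's evaluation configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: substitute this checkpoint's actual Hub repo id.
model_id = "<this-checkpoint-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 12 * 13? Reason step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```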


## 2. Training Details

- **Hyperparameters:**
  - Learning Rate: `1e-6`
  - Train Batch Size: `128`
  - PPO Mini Batch Size: `64`
  - RL Algorithm: `GRPO`
  - Rollout Temperature: `1.0`
  - Group Size: `16`
- **Compute:** Trained on `32 x H100` GPUs for about `150` hours.

For full training configurations, please refer to the `config.json` or the training scripts in our [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity).
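For intuition, the core of GRPO is its group-relative advantage: each prompt is rolled out `Group Size` (here 16) times, and each rollout's reward is standardized against its own group's mean and standard deviation, with no learned value model. A minimal sketch of that normalization step (not the authors' implementation; `grpo_advantages` is an illustrative helper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each rollout's reward
    against the mean and (population) std of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts tied (all correct or all wrong): no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# One group of 16 rollouts for a single prompt, with binary verifiable rewards.
rewards = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
advantages = grpo_advantages(rewards)
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, and the advantages of a group always sum to zero.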

## 3. Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training}, 
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537}, 
}
```

> [!TIP]
> **Motivation for this Model**
> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine‑tuning.