---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- reasoning
- grpo
- reinforcement-learning
---

# Model Card

## 1. Model Details
This model is a fine-tuned checkpoint released with the paper **"Not All Steps are Informative: On the Linearity of LLMs' RLVR Training"**. It was trained with Reinforcement Learning from Verifiable Rewards (RLVR) to study the phenomenon of linear evolution in model weights and output log-probabilities during RLVR training.

- **Paper:** [Not All Steps are Informative: On the Linearity of LLMs' RLVR Training](https://huggingface.co/papers/2601.04537)
- **Code:** [GitHub Repository](https://github.com/Miaow-Lab/RLVR-Linearity)
- **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Training Method:** GRPO (using the `verl` framework)

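The linearity phenomenon can be illustrated with a toy diagnostic: if weights evolve linearly across checkpoints, the update direction between consecutive checkpoints stays nearly constant. The sketch below uses synthetic weight vectors (not real checkpoints) and a cosine-similarity measure chosen for illustration; the paper's actual analysis may use a different metric.

```python
import numpy as np

# Toy sketch of a linearity diagnostic on synthetic "checkpoints".
# We place three flattened weight vectors approximately on a line in
# parameter space and check that consecutive updates point the same way.
rng = np.random.default_rng(0)
w0 = rng.normal(size=1024)                            # initial weights
d = rng.normal(size=1024)                             # fixed update direction
w1 = w0 + 1.0 * d + 0.01 * rng.normal(size=1024)      # checkpoint after step 1
w2 = w0 + 2.0 * d + 0.01 * rng.normal(size=1024)      # checkpoint after step 2

def update_cosine(a, b, c):
    """Cosine similarity between consecutive updates (b - a) and (c - b)."""
    u, v = b - a, c - b
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cos = update_cosine(w0, w1, w2)
print(f"cosine between consecutive updates: {cos:.3f}")
```

On real RLVR checkpoints one would flatten and concatenate the parameter tensors of successive saved models and apply the same comparison.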

## 2. Training Details

- **Hyperparameters:**
  - Learning Rate: `1e-6`
  - Train Batch Size: `128`
  - PPO Mini Batch Size: `64`
  - RL Algorithm: `GRPO`
  - Rollout Temperature: `1.0`
  - Group Size: `16`
- **Compute:** Trained on `32 x H100` GPUs for about `150` hours.

For full training configurations, please refer to the `config.json` or the training scripts in the official [GitHub repository](https://github.com/Miaow-Lab/RLVR-Linearity).
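The listed hyperparameters map onto `verl`'s Hydra-style command-line overrides roughly as follows. This is an illustrative reconstruction, not the released launch script: the override keys follow `verl`'s standard config layout, but the exact values and any additional options should be taken from the official training scripts.

```shell
# Hypothetical verl GRPO launch reconstructing the hyperparameters above
# (illustrative only; see the official repository for the real scripts).
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=128 \
    actor_rollout_ref.model.path=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.n=16
```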

## 3. Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training}, 
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537}, 
}
```

> [!TIP]
> **Motivation for this Model**
> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine-tuning.