---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- reasoning
- grpo
- reinforcement-learning
---
# Model Card

## 1. Model Details
This model is a fine-tuned checkpoint described in the paper "Not All Steps are Informative: On the Linearity of LLMs' RLVR Training". It was trained with Reinforcement Learning with Verifiable Rewards (RLVR) to investigate the phenomenon of linear evolution in model weights and output log-probabilities during RLVR training.
- Paper: Not All Steps are Informative: On the Linearity of LLMs' RLVR Training
- Code: GitHub Repository
- Base Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- Training Method: GRPO (using the `verl` framework)
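
The checkpoint can be loaded with `transformers` like any causal LM. Below is a minimal generation sketch; the model ID is a placeholder for this checkpoint's repository ID, and the sampling settings are illustrative:

```python
# Minimal generation sketch. The model ID below is a PLACEHOLDER;
# substitute this checkpoint's actual Hugging Face repository ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-rlvr-checkpoint"  # placeholder, not the real ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```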
## 2. Training Details
- Hyperparameters:
  - Learning Rate: 1e-6
  - Train Batch Size: 128
  - PPO Mini Batch Size: 64
  - RL Algorithm: GRPO
  - Rollout Temperature: 1.0
  - Group Size: 16
- Compute: Trained on 32 x H100 GPUs for about 150 hours.
For full training configurations, please refer to the `config.json` or the training scripts in the official GitHub repository.
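
For orientation, here is a hedged sketch of how the hyperparameters above would typically map onto verl command-line overrides. This is illustrative only, not the official launch command: the dataset preparation and remaining settings are omitted, and the node layout (8 GPUs per node across 4 nodes) is an assumption consistent with the 32-GPU total.

```bash
# Illustrative verl GRPO launch; NOT the official training script.
# See the GitHub repository for the actual configuration.
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_batch_size=128 \
  actor_rollout_ref.model.path=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.actor.ppo_mini_batch_size=64 \
  actor_rollout_ref.rollout.temperature=1.0 \
  actor_rollout_ref.rollout.n=16 \
  trainer.n_gpus_per_node=8 \
  trainer.nnodes=4
```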
## 3. Citation
If you use this model in your research, please cite our paper:
```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537},
}
```
## 4. Motivation for this Model

This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine-tuning.
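
Because the intended use is linearity analysis, a minimal sketch of one such check follows: it tests whether an intermediate checkpoint lies on the straight line between an earlier and a later checkpoint in weight space. The checkpoint paths are hypothetical, and this is a generic diagnostic, not the paper's exact metric.

```python
# Sketch of a weight-space linearity check across three RLVR checkpoints.
# Paths are HYPOTHETICAL placeholders; this is not the paper's exact metric.
import torch
from transformers import AutoModelForCausalLM

def flat_weights(path: str) -> torch.Tensor:
    """Load a checkpoint and flatten all parameters into one vector."""
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
    return torch.cat([p.detach().flatten() for p in model.parameters()])

w0 = flat_weights("ckpts/step_0")    # placeholder paths
w1 = flat_weights("ckpts/step_100")  # assumed midway between step 0 and step 200
w2 = flat_weights("ckpts/step_200")

d1, d2 = w1 - w0, w2 - w0
# Cosine similarity near 1.0 means the two update directions nearly coincide.
cos = torch.nn.functional.cosine_similarity(d1, d2, dim=0).item()
# Relative error of the linear prediction for the intermediate checkpoint.
pred = w0 + 0.5 * d2
rel_err = ((w1 - pred).norm() / d1.norm()).item()
print(f"cosine similarity: {cos:.4f}, relative prediction error: {rel_err:.4f}")
```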