---
license: apache-2.0
datasets:
- Miaow-Lab/RLVR-Linearity-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
---

# Model Card

## 1. Model Details

This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs' RLVR Training"**. It was trained with reinforcement learning (GRPO) to enhance mathematical reasoning capabilities.

- **Paper:** [arXiv](https://arxiv.org/pdf/2601.04537v1)
- **Code:** [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity)
- **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- **Training Method:** GRPO
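A minimal inference sketch with `transformers`. The `MODEL_ID` below is a placeholder (shown as the base-model id; substitute this repository's Hub id), and the sampling settings are illustrative assumptions, not the paper's evaluation settings:

```python
# Hypothetical usage sketch. MODEL_ID is a placeholder for this checkpoint's
# Hub repo id; sampling settings below are illustrative assumptions.
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # substitute this repo's id

def generate(question: str, max_new_tokens: int = 2048) -> str:
    # Deferred imports so the file can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": question}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
    )
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Example call (downloads the checkpoint on first use):
# print(generate("What is 17 * 24? Please reason step by step."))
```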
## 2. Performance

We evaluated the model on standard math benchmarks. Key results include:

| Benchmark | Avg@64 |
| :--- | :--- |
| AIME 2024 | **41.93%** |
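Avg@64 here denotes accuracy averaged over 64 sampled completions per problem (our reading of the convention; a smaller k is used below for illustration). A minimal sketch of computing such a metric from per-sample correctness flags:

```python
from statistics import mean

def avg_at_k(correct_flags: list[list[bool]]) -> float:
    """Accuracy averaged over k sampled completions per problem.

    correct_flags[i][j] is True iff the j-th sampled answer to problem i
    is correct; the score is the mean of per-problem pass rates.
    """
    per_problem = [mean(1.0 if c else 0.0 for c in flags) for flags in correct_flags]
    return mean(per_problem)

# Toy example: 2 problems, k=4 samples each.
flags = [
    [True, True, False, False],   # problem 1: 2/4 samples correct
    [True, False, False, False],  # problem 2: 1/4 samples correct
]
score = avg_at_k(flags)  # (0.5 + 0.25) / 2 = 0.375
```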
## 3. Training Details

- **Hyperparameters:**
  - Learning Rate: `1e-6`
  - Train Batch Size: `128`
  - PPO Mini Batch Size: `64`
  - RL Algorithm: `GRPO`
- **Compute:** Trained on `32 x H100` GPUs for about `150` hours.

For the full training configuration, please refer to the `config.json` or the training scripts in our [GitHub repository](https://github.com/Miaow-Lab/RLVR-Linearity).
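The hyperparameters above can be collected into a single mapping (values from this card; the key names are illustrative assumptions, not the authors' actual config schema — see the GitHub repository for the real scripts):

```python
# Illustrative config mirroring the hyperparameters listed above.
# Key names are assumptions, not the authors' actual schema.
TRAIN_CONFIG = {
    "algorithm": "GRPO",
    "learning_rate": 1e-6,
    "train_batch_size": 128,
    "ppo_mini_batch_size": 64,
}

# With a rollout batch of 128 and a mini batch of 64, each rollout batch is
# split into 128 / 64 = 2 gradient-update mini batches.
updates_per_batch = (
    TRAIN_CONFIG["train_batch_size"] // TRAIN_CONFIG["ppo_mini_batch_size"]
)
```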
## 4. Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537},
}
```