Text Generation · Safetensors
louiswng committed · verified · Commit 4b925a1 · Parent(s): d143c06

Update README.md

Files changed (1): README.md (+10 −11)
README.md CHANGED
@@ -10,33 +10,28 @@ pipeline_tag: text-generation
 # Model Card
 
 ## 1. Model Details
-This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training"**. It was trained using Reinforcement Learning (GRPO) to enhance mathematical reasoning capabilities.
+This model is the fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training"**. It was trained using Reinforcement Learning (RL) to enhance reasoning capabilities.
 
 - **Paper:** [ArXiv](https://arxiv.org/pdf/2601.04537v1)
 - **Code:** [Github](https://github.com/Miaow-Lab/RLVR-Linearity)
 - **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
 - **Training Method:** GRPO
 
-## 2. Performance
-We evaluated the model on standard math benchmarks. Key results include:
 
-| Benchmark | Avg@64 |
-| :--- | :--- |
-| AIME 2024 | **41.93%** |
-
-
-## 3. Training Details
+## 2. Training Details
 
 - **Hyperparameters:**
   - Learning Rate: `1e-6`
   - Train Batch Size: `128`
   - PPO Mini Batch Size: `64`
   - RL Algorithm: `GRPO`
+  - Rollout Temperature: `1.0`
+  - Group Size: `16`
 - **Compute:** Trained on `32 x H100` GPUs for about `150` hours.
 
 For full training configurations, please refer to the `config.json` or the training scripts in our [GitHub](https://github.com/Miaow-Lab/RLVR-Linearity).
 
-## 4. Citation
+## 3. Citation
 
 If you use this model in your research, please cite our paper:
 
@@ -50,4 +45,8 @@ If you use this model in your research, please cite our paper:
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2601.04537},
 }
 ```
+
+> [!TIP]
+> **Motivation for this Model**
+> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine-tuning.
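
For readers unfamiliar with GRPO, the training method named in the card: GRPO samples a group of rollouts per prompt (the `Group Size` hyperparameter above) and scores each rollout by normalizing its verifiable reward against the group's mean and standard deviation. A minimal Python sketch of that advantage computation, assuming the standard GRPO formulation rather than this repo's exact training code, and using a group of 4 instead of 16 for brevity:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 verifiable rewards (1 = answer verified correct).
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct rollouts receive positive advantage, incorrect ones negative,
# and the advantages sum to (approximately) zero within the group.
```

Each rollout's token log-probabilities are then weighted by its advantage in the policy update, so correct answers in a mostly-wrong group contribute the strongest positive signal.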