Improve model card: update license, add paper details, usage, results, and citation for HAPO Qwen2.5-Math-7B

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +122 -7
README.md CHANGED
@@ -1,14 +1,129 @@
1
  ---
2
- license: mit
3
  library_name: transformers
 
4
  pipeline_tag: text-generation
5
  ---
6
 
7
- The base Qwen2.5-Math-7B model used by HAPO.
8
- We extend the context window to 32k.
9
 
10
- # Citation
11
- If you find our model, data, or evaluation code useful, please kindly cite our paper:
12
- ```bib
13
 
14
- ```
 
1
  ---
 
2
  library_name: transformers
3
+ license: apache-2.0
4
  pipeline_tag: text-generation
5
  ---
6
 
7
+ # From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature (HAPO)
 
8
 
9
+ The base Qwen2.5-Math-7B model used by HAPO. We extend the context window to 32k.
 
 
10
 
11
+ 📚 [Paper: From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature](https://huggingface.co/papers/2509.16591)
12
+ 💻 [GitHub Repository](https://github.com/starriver030515/HAPO)
13
+
14
+ ## Abstract
15
+ Reinforcement Learning has emerged as the fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average, which normalizes advantages at the token level, jointly accounting for sequence length as in token-mean loss while preserving unbiased treatment. We then develop Differential Advantage Redistribution, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens with clear signals. For clipping loss, we design Asymmetric Adaptive Clipping, allowing aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through a systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales.
16
+
17
+ ## About HAPO
18
+
19
+ Reinforcement Learning has emerged as the fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce **H**eterogeneous **A**daptive **P**olicy **O**ptimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose **Adaptive Temperature Sampling**, which adjusts sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce **Token Level Group Average**, which normalizes advantages at the token level, jointly accounting for sequence length as in token-mean loss while preserving unbiased treatment. We then develop **Differential Advantage Redistribution**, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens with clear signals. For clipping loss, we design **Asymmetric Adaptive Clipping**, allowing aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through a systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales.
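
To make the idea of entropy-dependent sampling concrete, here is a minimal sketch, not the HAPO implementation. It computes the entropy of a token's next-token distribution and picks a higher temperature when that entropy is high. The threshold rule and the constants `threshold`, `t_low`, and `t_high` are hypothetical placeholders chosen for illustration; the paper and repository define the actual schedule.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_temperature(logits, threshold=1.0, t_low=0.8, t_high=1.2):
    """Pick a sampling temperature from the token's entropy.

    Hypothetical rule: high-entropy tokens get a higher temperature
    (more exploration), low-entropy tokens a lower one (more coherence).
    """
    return t_high if token_entropy(logits) > threshold else t_low
```

Under these assumed constants, a uniform distribution over four tokens (entropy ln 4) selects `t_high`, while a sharply peaked distribution selects `t_low`.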
20
+
21
+ <div align="center">
22
+ <img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/framework.png" alt="framework" width="100%" height="auto">
23
+ </div>
24
+
25
+ ## Installation
26
+
27
+ 1. Clone this repository and navigate to the folder
28
+ ```bash
29
+ git clone https://github.com/starriver030515/HAPO
30
+ cd HAPO
31
+ ```
32
+
33
+ 2. Create a conda environment and activate it
34
+ ```bash
35
+ conda create -n hapo python=3.10 -y
36
+ conda activate hapo
37
+ ```
38
+
39
+ 3. Run the verl installation script to install dependencies
40
+ ```bash
41
+ bash scripts/install_vllm_sglang_mcore.sh
42
+ pip install -e .
43
+ ```
44
+
45
+ ## Usage
46
+
47
+ ### Preparation
48
+
49
+ First, download the training and evaluation parquet files from [hapo_data](https://huggingface.co/datasets/starriver030515/hapo_data).
50
+
51
+ If you use Qwen2.5 Math for training, please download [Qwen2.5-Math-1.5B-16k](https://huggingface.co/starriver030515/Qwen2.5-Math-1.5B-16k) and [Qwen2.5-Math-7B-32k](https://huggingface.co/starriver030515/Qwen2.5-Math-7B-32k), whose maximum position length we modified to support longer-context training. For other models, you can download them from their official repositories.
52
+
53
+ To support Adaptive Temperature Sampling, you need to replace the vllm-related files in your corresponding environment with those from HAPO/vllm.
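
The downloads described above can be scripted. The following is a minimal sketch using `snapshot_download` from the `huggingface_hub` package, which is assumed to be installed. The repository ids come from the links above; the local directory names are hypothetical and can be changed freely.

```python
# Repository ids are taken from the links above.
# The local directories are hypothetical placeholders.
DOWNLOADS = [
    ("starriver030515/hapo_data", "dataset", "data/hapo_data"),
    ("starriver030515/Qwen2.5-Math-7B-32k", "model", "models/Qwen2.5-Math-7B-32k"),
]


def download_all(downloads=DOWNLOADS):
    """Download every (repo_id, repo_type, local_dir) entry and return the local paths."""
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id, repo_type, local_dir in downloads:
        path = snapshot_download(
            repo_id=repo_id,
            repo_type=repo_type,
            local_dir=local_dir,
        )
        paths.append(path)
    return paths
```

Calling `download_all()` fetches both repositories. The returned paths can then be used for `MODEL_PATH`, `TRAIN_FILE`, and `TEST_FILE` in the training scripts, once the parquet file names inside the dataset folder are known.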
54
+
55
+ ### Train
56
+
57
+ Our training scripts are located in the [recipe](https://github.com/starriver030515/HAPO/tree/main/recipe) folder. You only need to replace `MODEL_PATH`, `TRAIN_FILE` and `TEST_FILE`. You can see detailed parameter explanations in [train.md](https://github.com/starriver030515/HAPO/blob/main/recipe/train.md).
58
+
59
+ ```bash
60
+ cd recipe
61
+ bash qwen2.5_math_7b.sh
62
+ ```
63
+
64
+ ### Evaluation
65
+
66
+ ```bash
67
+ cd scripts
68
+ bash eval_model.sh
69
+ ```
70
+
71
+ ## Results
72
+
73
+ Comparison between vanilla DAPO (using all tokens), DAPO with forking tokens, Archer, EDGE-GRPO, and HAPO, evaluated on the Qwen-Math-1.5B Base, Qwen-Math-7B Base, and Qwen3-8B Base models. For each question, we generate 8 independent responses under a decoding temperature $T=0.5$, and report the average accuracy.
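
As a rough illustration of the metric described above, the sketch below averages correctness over the sampled responses for each question and then over questions. The function name and input layout are assumptions made for this example; the evaluation script in the repository may aggregate differently.

```python
def average_accuracy(correct_flags_per_question):
    """Mean accuracy over questions, where each question's score is the
    fraction of its sampled responses that are correct.

    correct_flags_per_question: list of lists of booleans, one inner list
    per question (for example, 8 flags when 8 responses are sampled).
    """
    per_question = [
        sum(flags) / len(flags) for flags in correct_flags_per_question
    ]
    return sum(per_question) / len(per_question)


# Two questions, 8 sampled responses each (hypothetical outcomes).
example = [
    [True] * 8,                 # all 8 responses correct
    [True] * 4 + [False] * 4,   # half of the responses correct
]
print(average_accuracy(example))  # 0.75
```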
74
+
75
+ <div align="center">
76
+ <img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/Overall_Results.png" alt="Overall Results" width="100%" height="auto">
77
+ </div>
78
+
79
+ ## Training Dynamics
80
+
81
+ The figures below compare the training dynamics of DAPO and HAPO with respect to four key metrics:
82
+
83
+ - **AIME24 and AIME25 Results**: HAPO consistently achieves higher accuracy across all model sizes (Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Qwen3-8B), demonstrating superior learning efficiency and performance throughout the training process.
84
+ - **Response Length**: HAPO maintains longer response lengths during training compared to DAPO, indicating more comprehensive and detailed solution generation without compromising quality.
85
+ - **Mean Entropy**: HAPO preserves significantly higher entropy throughout training across all model configurations, demonstrating better exploration capabilities and maintaining response diversity, which prevents premature convergence to suboptimal solutions.
86
+ <div align="center">
87
+ <figure>
88
+ <img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/aime24.png" alt="AIME24 Results" width="100%" height="auto">
89
+ <figcaption><em>Figure 1: AIME24 accuracy comparison - HAPO consistently achieves higher accuracy across all model sizes</em></figcaption>
90
+ </figure>
91
+ </div>
92
+ <div align="center">
93
+ <figure>
94
+ <img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/aime25.png" alt="AIME25 Results" width="100%" height="auto">
95
+ <figcaption><em>Figure 2: AIME25 accuracy comparison - HAPO consistently achieves higher accuracy across all model sizes</em></figcaption>
96
+ </figure>
97
+ </div>
98
+
99
+ <div align="center">
100
+ <figure>
101
+ <img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/response_length.png" alt="Response Length" width="100%" height="auto">
102
+ <figcaption><em>Figure 3: Response length over training steps - HAPO maintains longer, more comprehensive responses</em></figcaption>
103
+ </figure>
104
+ </div>
105
+
106
+ <div align="center">
107
+ <figure>
108
+ <img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/entropy.png" alt="Mean Entropy" width="100%" height="auto">
109
+ <figcaption><em>Figure 4: Mean entropy comparison - HAPO preserves higher entropy, indicating better exploration and diversity</em></figcaption>
110
+ </figure>
111
+ </div>
112
+
113
+ ## Citation
114
+ If you find our work interesting and helpful, please consider giving our repo a star. Additionally, if you would like to cite our work, please use the following format:
115
+ ```bibtex
116
+ @article{liu2025hapo,
117
+ title={From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature},
118
+ author={Liu, Zheng and Liu, Mengjie and Wen, Siwei and Cai, Mengzhang and Cui, Bin and He, Conghui and Wu, Lijun and Zhang, Wentao},
119
+ journal={arXiv preprint arXiv:2509.16591},
120
+ year={2025},
121
+ url={https://arxiv.org/abs/2509.16591}
122
+ }
123
+ ```
124
+
125
+ ## Contact
126
+ If you have any questions or suggestions, please feel free to contact us at `2501213330@stu.pku.edu.cn`.
127
+
128
+ ## Community efforts
129
+ This repository is based on the [verl](https://github.com/volcengine/verl/tree/main) project.