Improve model card: update license, add paper details, usage, results, and citation for HAPO Qwen2.5-Math-7B
#1
by nielsr
README.md
CHANGED
---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

# From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature (HAPO)

This is the base Qwen2.5-Math-7B model used by HAPO, with the context window extended to 32k.

📚 [Paper: From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature](https://huggingface.co/papers/2509.16591)

💻 [GitHub Repository](https://github.com/starriver030515/HAPO)

## Abstract

Reinforcement learning has emerged as a fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce **H**eterogeneous **A**daptive **P**olicy **O**ptimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose **Adaptive Temperature Sampling**, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce **Token-Level Group Average**, which normalizes advantages at the token level, jointly accounting for sequence length as in token-mean loss while preserving unbiased treatment. We then develop **Differential Advantage Redistribution**, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens with clear signals. For the clipping loss, we design **Asymmetric Adaptive Clipping**, allowing aggressive probability reduction for noisy low-entropy tokens while enabling exploration at high-entropy tokens. Through a systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales.
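The Token-Level Group Average idea can be illustrated with a small numeric sketch. The numbers and the stand-in per-token loss below are hypothetical (the real objective uses the clipped PPO term); the point is that the group-relative advantage is broadcast to tokens and the loss is averaged over all tokens in the group, so sequence length enters at the token level:

```python
# Toy sketch of token-level group averaging (illustrative only; a constant
# stand-in replaces the actual per-token PPO loss term).
rewards = [1.0, 0.0, 1.0, 0.0]     # scalar reward per response in one rollout group
lengths = [120, 30, 80, 50]        # token count per response

mean_r = sum(rewards) / len(rewards)
advantages = [r - mean_r for r in rewards]   # group-relative baseline

# Broadcast each response's advantage to its tokens.
token_losses = []
for adv, n_tok in zip(advantages, lengths):
    token_losses.extend([-adv] * n_tok)

# Token-mean over the whole group: longer responses contribute more tokens,
# but every token carries equal weight in the average.
loss = sum(token_losses) / len(token_losses)
print(round(loss, 4))  # -0.2143
```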

<div align="center">
<img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/framework.png" alt="framework" width="100%" height="auto">
</div>
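Adaptive Temperature Sampling can be sketched as follows. The linear entropy-to-temperature schedule and the `t_base`/`t_max` bounds are our illustrative assumptions, not the paper's exact formula:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_temperature(logits, t_base=0.7, t_max=1.2):
    """Interpolate between t_base and t_max by normalized entropy:
    confident (low-entropy) tokens sample near t_base, uncertain
    (high-entropy) tokens get a hotter temperature for exploration."""
    h = entropy(softmax(logits))
    h_max = math.log(len(logits))          # entropy of the uniform distribution
    return t_base + (t_max - t_base) * (h / h_max)

confident = [8.0, 1.0, 0.5, 0.2]   # peaked distribution -> temperature near t_base
uncertain = [1.0, 0.9, 1.1, 1.0]   # near-uniform -> temperature near t_max
print(adaptive_temperature(confident) < adaptive_temperature(uncertain))  # True
```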

## Installation

1. Clone this repository and navigate to the folder:
```bash
git clone https://github.com/starriver030515/HAPO
cd HAPO
```

2. Create a conda environment and activate it:
```bash
conda create -n hapo python=3.10 -y
conda activate hapo
```

3. Run the verl installation script to install the dependencies:
```bash
bash scripts/install_vllm_sglang_mcore.sh
pip install -e .
```

## Usage

### Preparation

First, download the training and evaluation parquet files from [hapo_data](https://huggingface.co/datasets/starriver030515/hapo_data).

If you train Qwen2.5-Math models, download [Qwen2.5-Math-1.5B-16k](https://huggingface.co/starriver030515/Qwen2.5-Math-1.5B-16k) and [Qwen2.5-Math-7B-32k](https://huggingface.co/starriver030515/Qwen2.5-Math-7B-32k), whose maximum position length we modified to support longer-context training. Other models can be downloaded from their official repositories.

To support Adaptive Temperature Sampling, replace the vLLM-related files in your environment with those from `HAPO/vllm`.

### Train

Our training scripts are located in the [recipe](https://github.com/starriver030515/HAPO/tree/main/recipe) folder. You only need to set `MODEL_PATH`, `TRAIN_FILE`, and `TEST_FILE`. Detailed parameter explanations are given in [train.md](https://github.com/starriver030515/HAPO/blob/main/recipe/train.md).

```bash
cd recipe
bash qwen2.5_math_7b.sh
```
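As a rough illustration of the Asymmetric Adaptive Clipping used during training, here is a toy entropy-dependent clip range. The schedule and the `eps_low`/`eps_high` values are hypothetical, not the recipe's actual hyper-parameters; the sketch only shows the asymmetry, a wide lower bound for low-entropy tokens (aggressive probability reduction allowed) and a wide upper bound for high-entropy tokens (room to explore):

```python
def clip_range(token_entropy, h_max, eps_low=0.4, eps_high=0.2):
    """Return (lower, upper) clip bounds for the importance ratio,
    modulated by normalized token entropy (0 = confident, 1 = uncertain)."""
    frac = token_entropy / h_max
    lower = 1.0 - eps_low * (1.0 - frac)    # wider below when entropy is low
    upper = 1.0 + eps_high * (1.0 + frac)   # wider above when entropy is high
    return lower, upper

def clipped_objective(ratio, advantage, token_entropy, h_max):
    """Standard PPO pessimistic min over unclipped and clipped terms."""
    lo, hi = clip_range(token_entropy, h_max)
    clipped = min(max(ratio, lo), hi)
    return min(ratio * advantage, clipped * advantage)

# A confident (low-entropy) token may be pushed down hard...
print(clip_range(0.1, 2.0))   # ≈ (0.62, 1.21)
# ...while an uncertain (high-entropy) token keeps room to grow.
print(clip_range(1.9, 2.0))   # ≈ (0.98, 1.39)
```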

### Evaluation

```bash
cd scripts
bash eval_model.sh
```

## Results

Comparison between vanilla DAPO using all tokens, DAPO with forking tokens, Archer, EDGE-GRPO, and HAPO, evaluated on the Qwen-Math-1.5B Base, Qwen-Math-7B Base, and Qwen3-8B Base models. For each question, we generate 8 independent responses at decoding temperature $T=0.5$ and report the average accuracy.
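This averaging protocol can be sketched in a few lines (the grades below are made-up): each of the 8 sampled responses per question is scored 0/1, and per-question accuracies are averaged.

```python
# Hypothetical 0/1 grades for 8 sampled responses per question.
per_question_grades = [
    [1, 1, 0, 1, 1, 0, 1, 1],   # question 1: 6/8 correct
    [0, 0, 1, 0, 0, 1, 0, 0],   # question 2: 2/8 correct
]
# Average accuracy: mean over questions of the per-question sample mean.
avg_accuracy = sum(
    sum(g) / len(g) for g in per_question_grades
) / len(per_question_grades)
print(avg_accuracy)  # 0.5
```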

<div align="center">
<img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/Overall_Results.png" alt="Overall Results" width="100%" height="auto">
</div>

## Training Dynamics

The figures below compare the training dynamics of DAPO and HAPO with respect to four key metrics:

- **AIME24 and AIME25 Results**: HAPO consistently achieves higher accuracy across all model sizes (Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Qwen3-8B), demonstrating superior learning efficiency and performance throughout training.
- **Response Length**: HAPO maintains longer responses during training than DAPO, indicating more comprehensive and detailed solution generation without compromising quality.
- **Mean Entropy**: HAPO preserves significantly higher entropy throughout training across all model configurations, demonstrating better exploration and response diversity, which prevents premature convergence to suboptimal solutions.
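The mean-entropy metric tracked above can be sketched as follows (the per-token distributions are made-up): the policy's next-token entropy is computed at each generated position and averaged over the generation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token distributions for a short generated span.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],   # near-deterministic token (low entropy)
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain token (entropy = ln 4)
    [0.70, 0.20, 0.05, 0.05],
]
mean_entropy = sum(token_entropy(p) for p in step_probs) / len(step_probs)
print(round(mean_entropy, 3))
```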

<div align="center">
<figure>
<img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/aime24.png" alt="AIME24 Results" width="100%" height="auto">
<figcaption><em>Figure 1: AIME24 accuracy comparison. HAPO consistently achieves higher accuracy across all model sizes.</em></figcaption>
</figure>
</div>

<div align="center">
<figure>
<img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/aime25.png" alt="AIME25 Results" width="100%" height="auto">
<figcaption><em>Figure 2: AIME25 accuracy comparison. HAPO consistently achieves higher accuracy across all model sizes.</em></figcaption>
</figure>
</div>

<div align="center">
<figure>
<img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/response_length.png" alt="Response Length" width="100%" height="auto">
<figcaption><em>Figure 3: Response length over training steps. HAPO maintains longer, more comprehensive responses.</em></figcaption>
</figure>
</div>

<div align="center">
<figure>
<img src="https://raw.githubusercontent.com/starriver030515/HAPO/main/figures/entropy.png" alt="Mean Entropy" width="100%" height="auto">
<figcaption><em>Figure 4: Mean entropy comparison. HAPO preserves higher entropy, indicating better exploration and diversity.</em></figcaption>
</figure>
</div>

## Citation

If you find our work interesting and helpful, please consider giving our repo a star. If you would like to cite our work, please use the following format:

```bibtex
@article{liu2025hapo,
  title={From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature},
  author={Liu, Zheng and Liu, Mengjie and Wen, Siwei and Cai, Mengzhang and Cui, Bin and He, Conghui and Wu, Lijun and Zhang, Wentao},
  journal={arXiv preprint arXiv:2509.16591},
  year={2025},
  url={https://arxiv.org/abs/2509.16591}
}
```

## Contact

If you have any questions or suggestions, please feel free to contact us at `2501213330@stu.pku.edu.cn`.

## Community efforts

This repository is built on the [verl](https://github.com/volcengine/verl/tree/main) project.