Improve model card: Add tags, abstract, and paper link
#1 · opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,8 +1,14 @@
 ---
 license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- code-generation
 ---
 
+# ASPO: Asymmetric Importance Sampling Policy Optimization
 
+This model checkpoint is for the paper [ASPO: Asymmetric Importance Sampling Policy Optimization](https://huggingface.co/papers/2510.06062).
 
 <div align="center">
 
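With `library_name: transformers` and `pipeline_tag: text-generation` set in the metadata above, the checkpoint should load through the standard `transformers` text-generation API. A minimal loading sketch, with the Hub repository id left as a placeholder since it is not stated here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"  # placeholder: substitute the actual Hub repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```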
@@ -27,6 +33,10 @@ license: apache-2.0
 
 </div>
 
+## Abstract
+
+Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched with those of negative-advantage tokens, leading to unbalanced token weighting. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. ASPO further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are publicly available.
+
 ## Overview
 
 **Archer2.0** marks a significant evolution from its predecessor through the introduction of **Asymmetric Importance Sampling Policy Optimization (ASPO)**. ASPO is designed to overcome the fundamental limitations of **PPO-Clip**: it mitigates issues like **entropy collapse** and **repetitive outputs**, prevents **premature convergence**, and thereby enables more advanced **reinforcement learning** capabilities.
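As a rough illustration of the mechanism the abstract describes (a flipped IS ratio on positive-advantage tokens plus a soft dual clip), a minimal PyTorch sketch follows. It is one reading of the description above, not the authors' verl implementation, and the `clip_c` value is an assumed placeholder:

```python
import torch

def aspo_token_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                    advantages: torch.Tensor, clip_c: float = 3.0) -> torch.Tensor:
    """Illustrative ASPO-style token loss (a sketch, not the reference code):
    flip the IS ratio for positive-advantage tokens, then soft-dual-clip
    oversized weights so their value is bounded but gradients keep flowing."""
    ratio = torch.exp(logp_new - logp_old)        # standard IS ratio pi/pi_old
    flipped = 1.0 / ratio.clamp(min=1e-8)         # flipped ratio pi_old/pi

    # Asymmetric weighting: flipped ratio where A > 0, plain ratio where A <= 0,
    # so low-probability positive tokens are no longer suppressed.
    weight = torch.where(advantages > 0, flipped, ratio)

    # Soft dual clip: cap the weight's value at clip_c via a detached scale,
    # bounding extreme updates while preserving the gradient path through weight.
    weight = weight * (clip_c / weight.detach()).clamp(max=1.0)

    return -(weight * advantages).mean()          # maximize weighted advantage
```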
@@ -103,3 +113,120 @@ While our mathematical models are still in training and have not converged, we h
 </tr>
 </tbody>
 </table>
+
+
+## Getting Started
+
+### 1 Installation
+
+```bash
+# Install a Python 3.10 environment.
+conda create -n archer python=3.10 -y
+conda activate archer
+
+# Install dependencies.
+pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
+wget -nv https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+pip install --no-cache-dir flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+
+cd Archer2.0
+pip install -e .
+```
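A quick sanity check that the pinned torch build and the flash-attn wheel import correctly; the expected versions simply mirror the pinned installs above:

```python
import torch
import flash_attn

# Expect torch 2.5.1 with CUDA 12.4 and flash-attn 2.7.3, per the pinned installs.
print(torch.__version__, torch.version.cuda, flash_attn.__version__)
assert torch.cuda.is_available(), "a CUDA-capable GPU is required for training"
```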
+
+#### Initialize Ray Cluster
+
+We provide a one-click script that initializes a Ray environment across any number of machines. Run the following command on the head node:
+
+```bash
+bash ./tools/start_ray.sh
+```
+
+Note:
+- Replace `your_wandb_api_key` in `export WANDB_API_KEY=your_wandb_api_key` with your actual key.
+- Hostfile locations vary across operating systems (for example, `/etc/mpi/hostfile` on some setups). Locate the file on your server and modify the path accordingly.
+
+### 2 Training
+
+We currently provide the script and data to reproduce the results of “Archer2.0-Code-1.5B-Preview”:
+
+```bash
+bash ./scripts/train/run_archer2.0_qwen2.5_1.5b_code.sh
+```
+
+### 3 Evaluation
+
+When using the verl framework for RL training, we observed a consistent discrepancy between the evaluation results produced by the in-training weights and those produced by the saved model checkpoints. To ensure accurate checkpoint selection, our evaluation is conducted on the saved checkpoints.
+
+#### 3.1 Automated Evaluation Pipeline
+To automatically scan a specified directory and evaluate all model checkpoints saved during training, run the following script on a GPU-enabled machine:
+
+```bash
+bash ./tools/run_eval_pipeline.sh
+```
+Since code evaluation tasks run on CPU only, we run the LiveCodeBench evaluation separately to optimize GPU utilization. Execute the following script on a CPU machine to automatically evaluate the inference results generated in the previous step:
+
+```bash
+bash ./tools/run_lcb_eval.sh
+```
+
+#### 3.2 Head-On Evaluation
+
+##### Step 1: Convert Model Format
+
+Run the following command to convert the model to Hugging Face format:
+
+```bash
+bash ./tools/model_merge.sh
+```
+
+##### Step 2: Run Inference
+
+Execute the script below to generate inference results for the test data:
+
+```bash
+bash ./scripts/eval/run_eval.sh
+```
+
+##### Step 3: Run Evaluation
+
+Go to line 245 of [compute_code_generation_metrics_v5.py](https://github.com/wizard-III/LiveCodeBench/blob/main/lcb_runner/evaluation/compute_code_generation_metrics_v5.py#L245) and update the `parquet_file` path to point to the result file generated in Step 2 (an illustrative sketch follows).
+
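For illustration only, the edited line might look like the following, with the path left as a placeholder:

```python
# compute_code_generation_metrics_v5.py, around line 245 (placeholder path):
parquet_file = "/path/to/your/step2_inference_results.parquet"
```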
+Execute the following script to evaluate performance on the LiveCodeBench v5 benchmark:
+
+```bash
+python LiveCodeBench/lcb_runner/evaluation/compute_code_generation_metrics_v5.py
+```
+
+Note: Please update the path parameters in the scripts above as needed.
+
+## Technical Report
+[ASPO: Asymmetric Importance Sampling Policy Optimization](https://huggingface.co/papers/2510.06062)
+
+## Acknowledgements
+
+- We build our model upon [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
+- Training was carried out with a modified version of [verl](https://github.com/volcengine/verl).
+
+## Citation
+
+Please cite the following:
+```bibtex
+@misc{wang2025aspoasymmetricimportancesampling,
+  title={ASPO: Asymmetric Importance Sampling Policy Optimization},
+  author={Jiakang Wang and Runze Liu and Lei Lin and Wenping Hu and Xiu Li and Fuzheng Zhang and Guorui Zhou and Kun Gai},
+  year={2025},
+  eprint={2510.06062},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2510.06062},
+}
+```
+
+```bibtex
+@article{wang2025stabilizing,
+  title={Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR},
+  author={Wang, Jiakang and Liu, Runze and Zhang, Fuzheng and Li, Xiu and Zhou, Guorui},
+  journal={arXiv preprint arXiv:2507.15778},
+  year={2025}
+}
+```