---
license: apache-2.0
---

<div align="center">

# ✨ Archer

<div>
🏹️ Reinforcement Learning for Enhanced Reasoning in LLMs 🎯
</div>

</div>

<br>

<div align="center">

[GitHub](https://github.com/wizard-III/ArcherCodeR)
[Model](https://huggingface.co/Fate-Zero/Archer-Code-1.5B)
[Dataset](https://huggingface.co/datasets/Fate-Zero/Archer-Code-1.5B)
[W&B Logs](https://wandb.ai/wangjkpkucs-peking-university/ArcherCodeR?nw=nwuserwangjkpkucs)
[Zhihu](https://zhuanlan.zhihu.com/p/1918765619614057424)

</div>
## Overview

The Archer series focuses on RL algorithms and training for medium- and small-scale models, aiming to deepen the community's understanding of the fundamental principles of reinforcement learning (RL) on large language models (LLMs). All released content is fully open-sourced to advance research in the community.

<div align="center">
<img src="assets/combined_math_code_benchmarks.png" width="100%"/>

<sub>Archer significantly improves reasoning performance over DAPO and outperforms previous 1.5B-level SOTA reasoning models.</sub>
</div>

**Archer** is an open-source initiative for enhancing reasoning in large language models through scalable, rule-governed reinforcement learning. We provide full-stack reproducibility, including:

- Training code and pipelines
- Curated datasets
- Trained models
- Complete training logs

**Current Models**:
- **[Archer-Code-1.5B](https://huggingface.co/Fate-Zero/Archer-Code-1.5B)** - SOTA among similarly sized models.
## Evaluation

We evaluate on both mathematical and coding benchmarks. Because outputs from reasoning models have high variance, we report avg@K (pass@1 performance averaged over K sampled outputs) and pass@K for each benchmark. Detailed results are shown in the tables below.

<div align="center">

<img src="assets/math_benchmark_table.png" width="100%"/>

<img src="assets/code_benchmark_table.png" width="100%"/>

</div>

<!-- Note:
1. Evaluation variance for the same model is typically within ±0.5 across multiple runs.
2. DeepCoder consistently scored around 23 in our tests - lower than its reported performance.
3. NVIDIA's Nemotron-Research-Reasoning-Qwen-1.5B slightly outperformed its reported score, potentially due to different parameter settings in their original evaluation. -->
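For clarity, the two reported metrics can be computed as follows. This is an illustrative sketch of the definitions above (avg@K as pass@1 averaged over K samples, pass@K as solved-by-any-of-K), not the repository's evaluation code:

```python
# Illustrative metric computation; the function and variable names are our own.
import numpy as np

def avg_at_k(correct: np.ndarray) -> float:
    """avg@K: mean per-sample pass@1 over all problems and K samples.
    `correct` is a bool array of shape (num_problems, K)."""
    return float(correct.mean())

def pass_at_k(correct: np.ndarray) -> float:
    """pass@K: fraction of problems solved by at least one of the K samples."""
    return float(correct.any(axis=1).mean())

# Example: 3 problems, K = 4 sampled outputs each
correct = np.array([[1, 0, 1, 0],
                    [0, 0, 0, 0],
                    [1, 1, 1, 1]], dtype=bool)
print(avg_at_k(correct), pass_at_k(correct))  # 0.5  0.666...
```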
## Getting Started

### Installation

```bash
# Create a Python 3.10 environment.
conda create -n archer python=3.10 -y
conda activate archer

# Install dependencies.
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
wget -nv https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install --no-cache-dir flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

cd ArcherCodeR
pip install -e .
```
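After installation, a quick sanity check like the following (a minimal sketch, not part of the repository) can confirm that the CUDA build of PyTorch and the FlashAttention wheel are usable on your node:

```python
# Sanity-check the installed packages; expected values assume the versions pinned above.
import torch
import flash_attn  # should import cleanly if the wheel matches your Python/torch/CUDA build

print(torch.__version__)          # expect 2.5.1
print(torch.cuda.is_available())  # expect True on a GPU machine
print(flash_attn.__version__)     # expect 2.7.3
```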
### Data Preparation

Download the training and test data from Hugging Face:

```bash
python tools/download_datasets.py
```
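For reference, the download step roughly amounts to a snapshot download of the dataset repository linked in the badges above. The sketch below uses `huggingface_hub` and a hypothetical `local_dir`; `tools/download_datasets.py` remains the authoritative version:

```python
# Rough equivalent of the download step (assumption: data lives in the dataset repo above).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Fate-Zero/Archer-Code-1.5B",
    repo_type="dataset",
    local_dir="data",  # hypothetical target directory; match the paths expected by the training scripts
)
```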
#### Initialize Ray Cluster

We provide a one-click script that initializes the Ray environment across any number of machines. Run the following command on the head node:

```bash
bash ./tools/start_ray.sh
```

Note:
- Replace `your_wandb_api_key` in `export WANDB_API_KEY=your_wandb_api_key` with your actual key.
- Hostfile locations vary across systems (e.g., `/etc/mpi/hostfile` on our machines). Locate the file on your server and adjust its contents accordingly.
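Once the script finishes, you can optionally confirm that every node has joined the cluster. The snippet below is a minimal sketch using the standard Ray API, not part of the repository's tooling:

```python
# Attach to the already-running cluster started by start_ray.sh and list its resources.
import ray

ray.init(address="auto")        # connect to the existing cluster instead of starting a new one
print(ray.cluster_resources())  # should report CPUs/GPUs from all nodes listed in the hostfile
ray.shutdown()
```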
### Training

We currently provide only the script and data needed to reproduce the "ArcherCodeR-1.5B-DAPO" results:

```bash
bash ./scripts/train/run_archer_qwen2.5_1.5b_code.sh
```
### Evaluation

#### Step 1: Convert model format

Run the following command to convert the model to Hugging Face format:

```bash
bash ./tools/model_merge.sh
```
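As an optional sanity check (a sketch, not part of the repository's tooling), the merged checkpoint should load with `transformers`. The path below is hypothetical and should match the output directory configured in `model_merge.sh`:

```python
# Load the converted checkpoint to verify the merge produced a valid HF model directory.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "checkpoints/archer-code-1.5b-hf"  # hypothetical path; replace with your merge output
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto")
print(model.config.model_type, model.num_parameters())
```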
#### Step 2: Run evaluation

Execute the script below to evaluate model performance on the LiveCodeBench v5 benchmark:

```bash
bash ./scripts/eval/run_eval.sh
```

Note: Please update the path parameters in the scripts above as needed.
## Technical Report

[Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR](https://arxiv.org/abs/2507.15778)

## Acknowledgements

- We build our model upon [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
- Training was carried out with a modified version of [verl](https://github.com/volcengine/verl).

## Citation

Please cite the following:

```bibtex
@misc{wang2025stabilizingknowledgepromotingreasoning,
      title={Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR},
      author={Jiakang Wang and Runze Liu and Fuzheng Zhang and Xiu Li and Guorui Zhou},
      year={2025},
      eprint={2507.15778},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.15778},
}
```