---
license: apache-2.0
---

<div align="center">

# ✨ Archer

<div>
🏹️ Reinforcement Learning for Enhanced Reasoning in LLMs 🎯
</div>

</div>
<br>

<div align="center">

[![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/wizard-III/ArcherCodeR)
[![Model](https://img.shields.io/badge/Model-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor)](https://huggingface.co/Fate-Zero/Archer-Code-1.5B)
[![Data](https://img.shields.io/badge/Data-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor)](https://huggingface.co/datasets/Fate-Zero/Archer-Code-1.5B)
[![Wandb](https://img.shields.io/badge/Wandb-000000?style=for-the-badge&logo=Wandb&logoColor=000&labelColor)](https://wandb.ai/wangjkpkucs-peking-university/ArcherCodeR?nw=nwuserwangjkpkucs)
[![知乎](https://img.shields.io/badge/知乎-0084FF?style=for-the-badge&logo=zhihu&logoColor=white)](https://zhuanlan.zhihu.com/p/1918765619614057424)

</div>

## Overview

The Archer series focuses on research into RL algorithms and training for small and medium-sized models, aiming to deepen the community's understanding of how reinforcement learning (RL) works on large language models (LLMs). Everything we release is fully open-sourced to support community research.

<div align="center">
<img src="assets/combined_math_code_benchmarks.png" width="100%"/>

<sub>Archer significantly improves reasoning performance over DAPO and outperforms previous 1.5B-level SOTA reasoning models.</sub>
</div>

**Archer** is an open-source initiative for enhancing reasoning in large language models through scalable, rule-governed reinforcement learning. We provide full-stack reproducibility, including:

- Training code and pipelines
- Curated datasets
- Trained models
- Complete training logs

**Current Models**:
- **[Archer-Code-1.5B](https://huggingface.co/Fate-Zero/Archer-Code-1.5B)** - SOTA among similarly-sized models.

## Evaluation
We evaluate on both mathematical and coding benchmarks. Because the outputs of reasoning models have high variance, we report avg@K (pass@1 averaged over K outputs) and pass@K for each benchmark. Detailed results are shown in the tables below.

<div align="center">

<img src="assets/math_benchmark_table.png" width="100%"/>

<img src="assets/code_benchmark_table.png" width="100%"/>

</div>

<!-- Note:
1. Evaluation variance for the same model is typically within ±0.5 across multiple runs.
2. DeepCoder consistently scored around 23 in our tests - lower than its reported performance.
3. NVIDIA's Nemotron-Research-Reasoning-Qwen-1.5B slightly outperformed its reported score, potentially due to different parameter settings in their original evaluation. -->

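For reference, the metrics above can be computed from per-sample correctness. The sketch below is our illustration (not the repository's evaluation code), using the standard unbiased pass@k estimator:

```python
from math import comb

def avg_at_k(correct: list[bool]) -> float:
    """avg@K: pass@1 averaged over the K sampled outputs,
    i.e. the fraction of samples that are correct."""
    return sum(correct) / len(correct)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 generations of which c=2 are correct, pass@2 = 1 − C(2,2)/C(4,2) = 5/6.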
## Getting Started

### Installation

```bash
# Create a Python 3.10 environment.
conda create -n archer python=3.10 -y
conda activate archer

# Install dependencies.
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
wget -nv https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install --no-cache-dir flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Clone and install ArcherCodeR.
git clone https://github.com/wizard-III/ArcherCodeR.git
cd ArcherCodeR
pip install -e .
```
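The flash-attn wheel filename encodes the build it was compiled against; if your environment differs (CUDA version, torch version, Python version, or C++ ABI), pick the matching wheel from the flash-attention release page instead. A small sketch of how to read the tags (PEP 427 wheel naming):

```python
# Decode the wheel filename: name-version-python_tag-abi_tag-platform_tag.whl
wheel = "flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"

name, version, python_tag, abi_tag, platform_tag = wheel[:-len(".whl")].split("-")

assert python_tag == "cp310"   # must match the conda env's python=3.10
assert "cu12" in version       # CUDA 12.x build, matching the cu124 torch index
assert "torch2.5" in version   # matches torch==2.5.1
```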

### Data Preparation

Download the training and test data from Hugging Face:

```bash
python tools/download_datasets.py
```

### Initialize Ray Cluster

We provide a one-click script to initialize the Ray environment across any number of machines. Run the following command on the head node:

```bash
bash ./tools/start_ray.sh
```

Note:
- Replace `your_wandb_api_key` in `export WANDB_API_KEY=your_wandb_api_key` with your actual key.
- Hostfile locations vary across operating systems (on our machines it is `/etc/mpi/hostfile`). Locate the file on your server and modify its contents accordingly.

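For orientation, an MPI-style hostfile typically lists one worker node per line with its slot count. The snippet below is a hypothetical example (hostnames and format depend entirely on your cluster and launcher), not a file shipped with this repository:

```text
# Example hostfile: one node per line, slots = processes per node
node-01 slots=8
node-02 slots=8
```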
### Training

We currently provide only the script and data needed to reproduce the “ArcherCodeR-1.5B-DAPO” results.

```bash
bash ./scripts/train/run_archer_qwen2.5_1.5b_code.sh
```
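DAPO, like GRPO, scores each sampled response against its group: the rewards of a prompt's rollouts are normalized into advantages. As an illustration only (the actual objective lives in the training script and the modified verl, and DAPO adds further components such as asymmetric clipping and dynamic sampling), the core group-normalized advantage can be sketched as:

```python
def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize per-rollout rewards within one prompt's group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]
```

With binary rewards `[1, 1, 0, 0]`, the two correct rollouts get advantage ≈ +1 and the two incorrect ones ≈ −1, so gradient updates push probability mass toward the correct responses within each group.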

### Evaluation

#### Step 1: Convert model format

Run the following command to convert the model checkpoint to Hugging Face format:

```bash
bash ./tools/model_merge.sh
```

#### Step 2: Run evaluation

Execute the script below to evaluate model performance on the LiveCodeBench v5 benchmark:

```bash
bash ./scripts/eval/run_eval.sh
```

Note: Update the path parameters in the scripts above as needed.

## Technical Report

[Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR](https://arxiv.org/abs/2507.15778)

## Acknowledgements

- We build our model upon [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
- Training was carried out with a modified version of [verl](https://github.com/volcengine/verl).

## Citation

Please cite the following:

```bibtex
@misc{wang2025stabilizingknowledgepromotingreasoning,
      title={Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR},
      author={Jiakang Wang and Runze Liu and Fuzheng Zhang and Xiu Li and Guorui Zhou},
      year={2025},
      eprint={2507.15778},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.15778},
}
```