---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

<div align="center">

<h1 style="display: flex; justify-content: center; align-items: center; gap: 10px; margin: 0;">
ExGRPO: Learning to Reason from Experience
</h1>
<p align="center"><em>Unearth and learn high-value experience in RLVR.</em></p>

<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/exgrpo_intro.png" alt="overview" style="width: 88%; height: auto;">
</div>

[![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.02245) [![Github](https://img.shields.io/badge/ExGRPO-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO) [![Hugging Face Collection](https://img.shields.io/badge/ExGRPO_Collection-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/collections/rzzhan/exgrpo-68d8e302efdfe325187d5c96)

</div>

<div align="center" style="font-family: Arial, sans-serif;">
<p>
<a href="#news" style="text-decoration: none; font-weight: bold;">📢 News</a> •
<a href="#introduction" style="text-decoration: none; font-weight: bold;">📖 Introduction</a> •
<a href="#getting-started" style="text-decoration: none; font-weight: bold;">🚀 Getting Started</a>
</p>
<p>
<a href="#usage" style="text-decoration: none; font-weight: bold;">🔧 Usage</a> •
<a href="#evaluation" style="text-decoration: none; font-weight: bold;">📊 Evaluation</a> •
<a href="#acknowledgement" style="text-decoration: none; font-weight: bold;">✨ Acknowledgement</a> •
<a href="#contact" style="text-decoration: none; font-weight: bold;">📬 Contact</a> •
<a href="#citation" style="text-decoration: none; font-weight: bold;">📝 Citation</a>
</p>
</div>

---

# 📢News

- **[2025/10/03]** The ExGRPO paper is available on [arXiv](https://arxiv.org/abs/2510.02245).

---

# 📖Introduction

Existing RLVR methods for reasoning tasks predominantly rely on on-policy optimization, which discards online rollouts after a single update, wasting valuable exploration signals and constraining scalability.
We conduct a systematic analysis of experience utility in RLVR and identify question difficulty and trajectory entropy as effective online proxies for assessing experience quality.
Building on these insights, we propose *ExGRPO*, a novel framework that **strategically manages and replays high-value experiences** through bucketed prioritization and mixed-policy optimization, enabling more efficient and stable RLVR training.

### Key Highlights:
- **Experience Value Modeling**: Introduces two online proxy metrics, rollout correctness and trajectory entropy, for quantifying the value of RLVR experience.
- **ExGRPO Framework**: Built on top of GRPO, ExGRPO introduces a systematic experience-management mechanism and an experience optimization objective to maximize the benefit of past exploration.
- **Generalization and Stability**: Demonstrates broad applicability across different backbone models and mitigates the training collapse of on-policy RLVR in challenging scenarios.

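To make the two proxies concrete, here is a minimal sketch of how rollout correctness and trajectory entropy could be computed for one question. This is illustrative only: the function names are ours, not the repo's API, and the entropy proxy is approximated by the mean negative log-probability of the sampled tokens.

```python
def rollout_correctness(rewards):
    """Fraction of correct rollouts for one question (group accuracy).

    `rewards` is a list of 0/1 verifier outcomes, one per rollout.
    """
    return sum(rewards) / len(rewards)

def trajectory_entropy(token_logprobs):
    """Mean negative log-probability of the generated tokens.

    A common online proxy for trajectory entropy: lower values mean the
    policy is more confident in the trajectory it produced.
    """
    return -sum(token_logprobs) / len(token_logprobs)

# A question solved on 3 of 4 rollouts, with a confident trajectory:
acc = rollout_correctness([1, 0, 1, 1])       # 0.75
ent = trajectory_entropy([-0.1, -0.3, -0.2])  # ~0.2
```

Medium-difficulty questions (neither always failed nor always solved) with low-entropy correct trajectories are the kind of experience the paper identifies as most valuable to replay.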
---

# 🚀Getting Started

## Installation

You can install dependencies by running the following commands:
```bash
conda create -n exgrpo python=3.10
conda activate exgrpo
cd exgrpo
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
```
> **Note**: If you encounter issues caused by the `pyairports` library, please refer to this hot-fix [solution](https://github.com/ElliottYan/LUFFY?tab=readme-ov-file#update-98).

For the `flash-attn` library, we use the `v2.7.4.post1` release and recommend installing it from a pre-built wheel. Adjust the wheel to match your CUDA, PyTorch, and Python versions.
```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```


## ExGRPO Plug-and-Play Modules Structure

**ExGRPO** extends the `verl` framework with plug-and-play experience modules, following a design similar to that of `LUFFY`. The core pieces are the `experience/` submodule and the trainer `mix_trainer_experience.py`, which together enable dynamic integration of on-policy data with collected experiences.
The key modules are structured as follows:

```text
exgrpo/verl/verl/mix_src
├── ...
├── experience
│   ├── experience_bucket_manager.py   # Abstraction of experience buckets; stats & maintenance
│   ├── weighted_bucket_sampler.py     # Probabilistic experience sampler (across/within buckets)
│   ├── experience_collate_fn.py       # Mix fresh on-policy data with experience per batch
│   ├── experience_helpers.py          # Sampling, metric computation, sample builders used by collate_fn
│   ├── experience_trainer_ops.py      # Trainer-side experience management operations
│   └── rl_dataset_with_experience.py  # Dataset class for ExGRPO training
├── ...
├── mix_trainer_experience.py          # ExGRPO Trainer
└── ...
```

The remaining training/runtime modules are largely similar to those in `LUFFY`, with minor modifications to components such as the rollout mechanism, the checkpoint manager, and the FSDP worker to better align with the requirements of ExGRPO.
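The interplay between `experience_bucket_manager.py` and `weighted_bucket_sampler.py` can be pictured with a toy replay store: trajectories are grouped into buckets by rollout correctness, and sampling favors mid-correctness buckets. This is a hypothetical sketch under our own assumptions; the class name and the specific weighting formula are illustrative, and the real prioritization scheme lives in the repo.

```python
import random

class ExperienceBuckets:
    """Toy bucketed experience store keyed by rollout correctness."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def add(self, trajectory, correctness):
        # correctness in [0, 1]; map it to a bucket index
        idx = min(int(correctness * len(self.buckets)), len(self.buckets) - 1)
        self.buckets[idx].append(trajectory)

    def sample(self, rng=random):
        # Weight non-empty buckets, favoring the middle ones (partially
        # solved questions), which the paper finds most valuable to replay.
        mid = (len(self.buckets) - 1) / 2
        weights = [0.0 if not b else 1.0 / (1.0 + abs(i - mid))
                   for i, b in enumerate(self.buckets)]
        if not any(weights):
            raise ValueError("experience pool is empty")
        (idx,) = rng.choices(range(len(self.buckets)), weights=weights)
        return rng.choice(self.buckets[idx])
```

In the actual framework, the bucket manager also maintains per-question statistics (success counts, entropy) that drive eviction and eligibility, which this sketch omits.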

---

# 🔧Usage

## Data Preparation
First, run the data preparation script to convert the training data to Parquet format:
```bash
cd data
python prepare_train.py --dataset_name Elliott/Openr1-Math-46k-8192 --output_file openr1.parquet
```

> **Note**: Although we use the OpenR1 data, only the question field is used in RLVR. The ExGRPO data processing pipeline does not incorporate the external R1 trajectories during training.

## Training

We provide an example script that trains ExGRPO on a 46k subset of OpenR1-Math-220k. Run the following command to train:

```bash
cd exp_scripts
bash run_exgrpo.sh
```

For the Qwen2.5-Math-7B backbone model, we use [this version](https://huggingface.co/Elliott/Qwen2.5-Math-7B-16k-think).
Other Qwen backbone models follow the same prompt template.

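During mixed-policy training, each batch combines replayed experience with fresh on-policy rollouts. The sketch below illustrates the idea behind this per-batch mixing (mirroring the role of `experience_collate_fn.py`); the function name and the simple list-slicing are our own simplification, not the repo's implementation.

```python
def mix_batch(on_policy, experience_pool, batch_size, experience_ratio):
    """Compose one training batch from replayed and fresh samples.

    `experience_ratio` is the fraction of the batch drawn from the
    experience pool; the remainder comes from fresh on-policy rollouts.
    """
    n_exp = min(int(batch_size * experience_ratio), len(experience_pool))
    return experience_pool[:n_exp] + on_policy[:batch_size - n_exp]

# With a ratio of 0.5 and batch size 4: 2 replayed + 2 fresh samples.
batch = mix_batch(on_policy=["o1", "o2", "o3", "o4"],
                  experience_pool=["e1", "e2"],
                  batch_size=4, experience_ratio=0.5)
```

When the pool has fewer eligible trajectories than requested, the shortfall is simply filled with on-policy data, so training degrades gracefully to plain GRPO.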

## Configuration Quick Reference

Key fields read by the ExGRPO components (names reflect their usage in the training scripts):

- `trainer.experience` (bool): Enable ExGRPO training.
- `trainer.experience_ratio` (float): Fraction of each batch drawn from the experience pool in mixed training.
- `trainer.exp_metric` (str): Metric for trajectory selection. Default: `ent`.
- `exp_bucket_manager` (str|bool): Probabilistic bucket sampling method. Default: `normal`.
- `exp_is_correct` (bool): Enable importance-sampling correction for experiential trajectories.
- `experience_lbound` / `experience_rbound` (int): Eligibility bounds on the number of successes recorded per question, i.e. the half-open interval `(lbound, rbound]`.

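The eligibility bounds deserve a concrete reading: a question enters the experience pool only when its recorded success count lies in the half-open interval `(lbound, rbound]`. A minimal sketch of that check (the function name is ours, for illustration):

```python
def is_eligible(success_count, lbound, rbound):
    """True iff the question's success count falls in (lbound, rbound]."""
    return lbound < success_count <= rbound

# With lbound=0, never-solved questions are excluded; a finite rbound
# additionally filters out questions the model has already mastered.
assert is_eligible(3, 0, 7)
assert not is_eligible(0, 0, 7)
```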
---

# 📊Evaluation

## Reproducing the Results
We currently support automated evaluation on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-Pro).

You can reproduce our results by running the following commands:
```bash
ROOT= # Your root path
TEMPLATE=own
MODEL_PATH= # Your checkpoint path
OUTPUT_DIR=results/

DATA=$ROOT/data/valid.id.parquet
MODEL_NAME=exgrpo+testid

mkdir -p $OUTPUT_DIR

python generate_vllm.py \
  --model_path $MODEL_PATH \
  --input_file $DATA \
  --remove_system True \
  --add_oat_evaluate True \
  --output_file $OUTPUT_DIR/$MODEL_NAME.jsonl \
  --template $TEMPLATE > $OUTPUT_DIR/$MODEL_NAME.log
```

## Main Results

### Zero RLVR on Qwen2.5-Math-7B & Continual RLVR on LUFFY
<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/main_result.png" alt="overview" style="width: 95%; height: auto;">
</div>

### Zero RLVR on Llama3.1-8B (Base, Instruct), Qwen2.5-Math-1.5B Base, and Qwen2.5-7B Instruct
<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/model_extensions_bar.png" alt="overview" style="width: 95%; height: auto;">
</div>

<details>
<summary>Click to view the full results of the model extensions</summary>
<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/model_extensions.png" alt="overview" style="width: 95%; height: auto;">
</div>
</details>

## Released Models
| **Model** | **Hugging Face** | **Base Model** |
|-----------------------------------|------------------|------------------|
| ExGRPO-Qwen2.5-Math-7B-Zero | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero | Qwen2.5-Math-7B |
| ExGRPO-LUFFY-7B-Continual | https://huggingface.co/rzzhan/ExGRPO-LUFFY-7B-Continual | LUFFY-Qwen-Math-7B-Zero |
| ExGRPO-Qwen2.5-7B-Instruct | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct |
| ExGRPO-Qwen2.5-Math-1.5B-Zero | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero | Qwen2.5-Math-1.5B |
| ExGRPO-Llama3.1-8B-Zero | https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Zero | Llama3.1-8B |
| ExGRPO-Llama3.1-8B-Instruct | https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Instruct | Llama3.1-8B-Instruct |

# ✨Acknowledgement

ExGRPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [deepscaler](https://github.com/agentica-project/rllm), and uses [vLLM](https://github.com/vllm-project/vllm) for inference. We use [Math-Verify](https://github.com/huggingface/Math-Verify) as the verifier for the RLVR reward.
We thank the open-source community for datasets and backbones, including [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT), [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), [OpenR1-Math-46k](https://huggingface.co/datasets/Elliott/Openr1-Math-46k-8192), and the [Qwen-2.5-Math](https://huggingface.co/collections/Qwen/qwen25-math-66eaa240a1b7d5ee65f1da3e), [Qwen-2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e), and [Llama-3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f) models.

# 📬Contact

For questions, feedback, or collaboration opportunities, feel free to reach out:
- Runzhe Zhan: nlp2ct.runzhe@gmail.com
- Yafu Li: yafuly@gmail.com

# 📝Citation
If you find our model, data, or evaluation code useful, please kindly cite our paper:
```bibtex
@article{zhan2025exgrpo,
  title={ExGRPO: Learning to Reason from Experience},
  author={Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
  year={2025},
  journal={ArXiv preprint},
  volume={2510.02245},
  url={https://arxiv.org/abs/2510.02245},
}
```