Update model card: Add metadata, paper link, and GitHub content (#1)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

<div align="center">

<h1 style="display: flex; justify-content: center; align-items: center; gap: 10px; margin: 0;">
ExGRPO: Learning to Reason from Experience
</h1>
<p align="center"><em>Unearth and learn high-value experience in RLVR.</em></p>

<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/exgrpo_intro.png" alt="overview" style="width: 88%; height: auto;">
</div>

[📃 Paper](https://arxiv.org/abs/2510.02245) • [💻 Code](https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO) • [🤗 Models](https://huggingface.co/collections/rzzhan/exgrpo-68d8e302efdfe325187d5c96)

</div>

<div align="center" style="font-family: Arial, sans-serif;">
<p>
<a href="#news" style="text-decoration: none; font-weight: bold;">📢 News</a> •
<a href="#introduction" style="text-decoration: none; font-weight: bold;">📖 Introduction</a> •
<a href="#getting-started" style="text-decoration: none; font-weight: bold;">🚀 Getting Started</a>
</p>
<p>
<a href="#usage" style="text-decoration: none; font-weight: bold;">🔧 Usage</a> •
<a href="#evaluation" style="text-decoration: none; font-weight: bold;">📊 Evaluation</a> •
<a href="#acknowledgement" style="text-decoration: none; font-weight: bold;">✨ Acknowledgement</a> •
<a href="#contact" style="text-decoration: none; font-weight: bold;">📬 Contact</a> •
<a href="#citation" style="text-decoration: none; font-weight: bold;">📝 Citation</a>
</p>
</div>

---

# 📢 News

- **[2025/10/03]** The ExGRPO paper is available on [arXiv](https://arxiv.org/abs/2510.02245).

---

# 📖 Introduction

Existing RLVR methods for reasoning tasks predominantly rely on on-policy optimization, which discards online rollouts after a single update, wasting valuable exploration signals and constraining scalability.
We conduct a systematic analysis of experience utility in RLVR and identify question difficulty and trajectory entropy as effective online proxies for assessing experience quality.
Building on these insights, we propose *ExGRPO*, a novel framework that **strategically manages and replays high-value experiences** through bucketed prioritization and mixed-policy optimization, enabling more efficient and stable RLVR training.

### Key Highlights
- **Experience Value Modeling**: Introduces two online proxy metrics, rollout correctness and trajectory entropy, for quantifying the value of RLVR experience.
- **ExGRPO Framework**: Built on top of GRPO, ExGRPO adds a systematic experience management mechanism and an experience optimization objective to maximize the benefit of past exploration.
- **Generalization and Stability**: Demonstrates broad applicability across backbone models and mitigates the training collapse of on-policy RLVR in challenging scenarios.

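Both proxies are cheap to compute from quantities already available during rollout. As a rough illustration (a sketch in our own notation, not the repository's implementation; the function names are ours):

```python
def rollout_correctness(rewards):
    # Fraction of a question's rollouts that were verified correct:
    # an online proxy for question difficulty under the current policy.
    return sum(rewards) / len(rewards)

def trajectory_entropy(token_logprobs):
    # Mean negative log-probability of the sampled tokens: a proxy for
    # how uncertain the policy was while producing the trajectory.
    return -sum(token_logprobs) / len(token_logprobs)

# Partially solved questions paired with confident (low-entropy) correct
# trajectories are treated as the most valuable experience.
```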
---

# 🚀 Getting Started

## Installation

You can install the dependencies by running the following commands:
```bash
conda create -n exgrpo python=3.10
conda activate exgrpo
cd exgrpo
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
```
> **Note**: If you encounter issues caused by the `pyairports` library, please refer to this hot-fix [solution](https://github.com/ElliottYan/LUFFY?tab=readme-ov-file#update-98).

For the `flash-attn` library, we use the `v2.7.4.post1` release and recommend installing it from the pre-built wheel. Please adjust the wheel to match your environment (CUDA, PyTorch, and Python versions).
```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

## ExGRPO Plug-and-Play Module Structure

**ExGRPO** extends the `verl` framework with plug-and-play experience modules, following a design similar to that of `LUFFY`. It centers on the `experience/` submodule and the trainer `mix_trainer_experience.py`, which together enable dynamic integration of on-policy data with collected experiences.
The key modules are structured as follows:

```text
exgrpo/verl/verl/mix_src
├── ...
├── experience
│   ├── experience_bucket_manager.py    # Abstraction of the experience bucket; stats & maintenance
│   ├── weighted_bucket_sampler.py      # Probabilistic experience sampler (across/within buckets)
│   ├── experience_collate_fn.py        # Mixes fresh on-policy data with experience per batch
│   ├── experience_helpers.py           # Sampling, metric computation, sample builders used by collate_fn
│   ├── experience_trainer_ops.py       # Trainer-side experience management operations
│   └── rl_dataset_with_experience.py   # Dataset class for ExGRPO training
├── ...
├── mix_trainer_experience.py           # ExGRPO trainer
└── ...
```

The remaining training/runtime modules are largely similar to those in `LUFFY`, with minor modifications to components such as the rollout mechanism, checkpoint manager, and FSDP worker to better align with the requirements of ExGRPO.

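To make the bucket-and-sample flow concrete, here is a hedged sketch of two-stage experience sampling. The function names, the bucket key (rounded rollout correctness), and the Gaussian weighting are illustrative assumptions, not the actual `weighted_bucket_sampler.py` API:

```python
import math
import random
from collections import defaultdict

def bucket_experiences(experiences):
    # Group stored trajectories into buckets keyed by the rollout
    # correctness of their source question (rounded to one decimal).
    buckets = defaultdict(list)
    for exp in experiences:
        buckets[round(exp["correctness"], 1)].append(exp)
    return buckets

def sample_batch(buckets, k, mu=0.5, sigma=0.2):
    # Two-stage sampling: pick buckets with Gaussian weights centred on
    # medium difficulty, then pick a trajectory within each chosen bucket.
    keys = sorted(buckets)
    weights = [math.exp(-((c - mu) ** 2) / (2 * sigma ** 2)) for c in keys]
    chosen = random.choices(keys, weights=weights, k=k)
    return [random.choice(buckets[c]) for c in chosen]

pool = [{"correctness": c, "traj": i} for i, c in enumerate([0.1, 0.5, 0.5, 0.9])]
batch = sample_batch(bucket_experiences(pool), k=2)
```

The sampled experience trajectories are then mixed with fresh on-policy rollouts per batch, as in `experience_collate_fn.py`.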
---

# 🔧 Usage

## Data Preparation
First run the data preparation script to convert the training data to parquet format:
```bash
cd data
python prepare_train.py --dataset_name Elliott/Openr1-Math-46k-8192 --output_file openr1.parquet
```

> **Note**: Although we use the OpenR1 data, only the question field is used in RLVR. The ExGRPO data processing pipeline does not incorporate the external R1 trajectories during training.

## Training

We provide an example script to train ExGRPO on the 46k subset of OpenR1-Math-220k. Run the following to start training:

```bash
cd exp_scripts
bash run_exgrpo.sh
```

For the Qwen2.5-Math-7B backbone model, we use [this version](https://huggingface.co/Elliott/Qwen2.5-Math-7B-16k-think).
Other Qwen backbone models follow the same prompt template.

## Configuration Quick Reference

Key fields read by the ExGRPO components (names reflect their usage in the training scripts):

- `trainer.experience` (bool): Enable ExGRPO training.
- `trainer.experience_ratio` (float): Fraction of each batch drawn from the experience pool in mixed training.
- `trainer.exp_metric` (str): Metric for trajectory selection. Default: `ent`.
- `exp_bucket_manager` (str|bool): Probabilistic bucket sampling method. Default: `normal`.
- `exp_is_correct` (bool): Enable importance sampling correction for experiential trajectories.
- `experience_lbound` / `experience_rbound` (int): Eligibility bounds on the number of successes recorded per question, i.e. the half-open interval (lbound, rbound].

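For instance, the (lbound, rbound] eligibility check can be read as follows (a hypothetical helper for illustration, not code from the repository):

```python
def is_replayable(num_successes: int, lbound: int, rbound: int) -> bool:
    # A question's stored trajectories stay eligible for replay only while
    # its recorded success count lies in the half-open interval (lbound, rbound];
    # never-solved and already-mastered questions are both excluded.
    return lbound < num_successes <= rbound

# With experience_lbound=0 and experience_rbound=7 (example values):
assert is_replayable(3, 0, 7)       # partially solved: replayable
assert not is_replayable(0, 0, 7)   # never solved: excluded
assert not is_replayable(8, 0, 7)   # beyond the upper bound: retired
```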
---

# 📊 Evaluation

## Reproducing the Results
We currently support automated evaluation on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-Pro).

You can reproduce our results by running the following commands:
```bash
ROOT=       # your root path
TEMPLATE=own
MODEL_PATH= # your checkpoint path
OUTPUT_DIR=results/

DATA=$ROOT/data/valid.id.parquet
MODEL_NAME=exgrpo+testid

mkdir -p $OUTPUT_DIR

python generate_vllm.py \
  --model_path $MODEL_PATH \
  --input_file $DATA \
  --remove_system True \
  --add_oat_evaluate True \
  --output_file $OUTPUT_DIR/$MODEL_NAME.jsonl \
  --template $TEMPLATE > $OUTPUT_DIR/$MODEL_NAME.log
```

## Main Results

### Zero RLVR on Qwen2.5-Math-7B & Continual RLVR on LUFFY
<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/main_result.png" alt="main results" style="width: 95%; height: auto;">
</div>

### Zero RLVR on Llama3.1-8B (Base, Instruct), Qwen2.5-Math-1.5B Base, and Qwen2.5-7B Instruct
<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/model_extensions_bar.png" alt="model extensions" style="width: 95%; height: auto;">
</div>

<details>
<summary>Click to view the full results of the model extensions</summary>
<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/model_extensions.png" alt="full model extension results" style="width: 95%; height: auto;">
</div>
</details>

## Released Models

| **Model** | **Hugging Face** | **Base Model** |
|---|---|---|
| ExGRPO-Qwen2.5-Math-7B-Zero | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero | Qwen2.5-Math-7B |
| ExGRPO-LUFFY-7B-Continual | https://huggingface.co/rzzhan/ExGRPO-LUFFY-7B-Continual | LUFFY-Qwen-Math-7B-Zero |
| ExGRPO-Qwen2.5-7B-Instruct | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct |
| ExGRPO-Qwen2.5-Math-1.5B-Zero | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero | Qwen2.5-Math-1.5B |
| ExGRPO-Llama3.1-8B-Zero | https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Zero | Llama3.1-8B |
| ExGRPO-Llama3.1-8B-Instruct | https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Instruct | Llama3.1-8B-Instruct |

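The released checkpoints are standard `transformers` causal LMs (pipeline tag `text-generation`), so a minimal inference sketch looks like the following. This assumes the checkpoint ships a chat template; if the backbone uses a plain prompt template instead, format the prompt accordingly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Solve 2x + 3 = 7 and put the final answer in \\boxed{}."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Decode only the newly generated tokens.
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```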
# ✨ Acknowledgement

ExGRPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [deepscaler](https://github.com/agentica-project/rllm), and uses [vLLM](https://github.com/vllm-project/vllm) for inference. We use [Math-Verify](https://github.com/huggingface/Math-Verify) as the RLVR reward verifier.
We thank the open-source community for datasets and backbones, including [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT), [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), [OpenR1-Math-46k](https://huggingface.co/datasets/Elliott/Openr1-Math-46k-8192), [Qwen-2.5-Math](https://huggingface.co/collections/Qwen/qwen25-math-66eaa240a1b7d5ee65f1da3e), [Qwen-2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e), and the [Llama-3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f) models.

# 📬 Contact

For questions, feedback, or collaboration opportunities, feel free to reach out:
- Runzhe Zhan: nlp2ct.runzhe@gmail.com
- Yafu Li: yafuly@gmail.com

# 📝 Citation
If you find our model, data, or evaluation code useful, please kindly cite our paper:
```bibtex
@article{zhan2025exgrpo,
  title   = {ExGRPO: Learning to Reason from Experience},
  author  = {Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
  journal = {ArXiv preprint},
  volume  = {2510.02245},
  year    = {2025},
  url     = {https://arxiv.org/abs/2510.02245},
}
```