# PRO-STEP: Step-level Process Reward Optimization for RAG (Policy Model)
This is the policy model of PRO-STEP, a self-improving framework for agentic Retrieval-Augmented Generation (RAG). The policy is trained with step-level DPO on preference pairs mined from its own MCTS trajectories, which are scored by an open-source 8B process reward model (PRM).
- Backbone: Qwen2.5-7B-Instruct
- PRM: DORAEMONG/PRO-STEP-PRM-8B
- Preference data: DORAEMONG/PRO-STEP-Preference-Data
- Training: DPO (β=0.1) with document-token masking (a loss sketch follows below), LoRA r=64/α=128, 1 epoch, 5,000 questions yielding 15,877 step-level preference pairs
- MCTS: K=3 branching, depth 7, 64 rollouts per question, node value V(s) = Q̄(s) + α · r̂(s) with α=0.3 (selection sketch immediately below)
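During search, each node's value combines the mean backed-up rollout return Q̄(s) with the PRM's score r̂(s) for that step. A minimal selection sketch under that definition (the `Node` fields and the greedy `select_child` are illustrative assumptions, not the released implementation):

```python
from dataclasses import dataclass, field

ALPHA = 0.3  # weight on the PRM step score, as reported above

@dataclass
class Node:
    """Illustrative MCTS node; field names are assumptions, not the released code."""
    q_sum: float = 0.0      # sum of backed-up rollout returns
    visits: int = 0         # N(s)
    prm_score: float = 0.0  # r_hat(s): PRM score for the step at this node
    children: list = field(default_factory=list)

def value(node: Node) -> float:
    """V(s) = Q_bar(s) + alpha * r_hat(s), with Q_bar the mean rollout return."""
    q_bar = node.q_sum / node.visits if node.visits > 0 else 0.0
    return q_bar + ALPHA * node.prm_score

def select_child(node: Node) -> Node:
    """Greedy selection by V(s); any exploration bonus (e.g. UCT) would be added here."""
    return max(node.children, key=value)
```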
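Document-token masking excludes tokens copied from retrieved passages from the DPO log-probability, so the preference gradient flows only through model-generated tokens. A hedged sketch of such a masked step-level DPO loss (the tensor layout and the -100 masking convention are assumptions):

```python
import torch
import torch.nn.functional as F

def masked_logps(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-sequence sum of log-probs over non-masked tokens.
    labels: token ids with retrieved-document positions set to -100 (masked out)."""
    logits, labels = logits[:, :-1], labels[:, 1:]  # shift for next-token prediction
    mask = labels != -100
    safe = labels.clamp(min=0)                      # avoid indexing with -100
    logps = torch.gather(F.log_softmax(logits, dim=-1), 2, safe.unsqueeze(-1)).squeeze(-1)
    return (logps * mask).sum(-1)

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta: float = 0.1) -> torch.Tensor:
    """Step-level DPO: pol_*/ref_* are masked log-probs of chosen (w) / rejected (l) steps
    under the policy and the frozen reference model."""
    margins = beta * ((pol_w - pol_l) - (ref_w - ref_l))
    return -F.logsigmoid(margins).mean()
```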
## Performance (all methods evaluated on five datasets with the same FlashRAG pipeline)
| Method | Train data (questions) | HotpotQA | PopQA | 2Wiki | Bamboogle | Musique | Avg. |
|---|---|---|---|---|---|---|---|
| Search-R1 | ~90,000 | 37.88 / 49.56 | 40.65 / 46.78 | 34.87 / 42.50 | 33.60 / 43.55 | 12.99 / 21.23 | 32.00 / 40.72 |
| ReasonRAG | ~5,000 | 36.37 / 47.51 | 37.78 / 44.87 | 39.80 / 46.32 | 38.40 / 46.86 | 10.59 / 19.22 | 32.59 / 40.96 |
| StepSearch | ~19,000 | 38.72 / 50.67 | 39.24 / 44.97 | 40.38 / 47.12 | 33.60 / 44.16 | 13.82 / 23.06 | 33.15 / 42.00 |
| PRO-STEP (ours) ★ | 5,000 | 38.73 / 51.63 | 40.47 / 47.37 | 44.07 / 51.43 | 36.80 / 47.63 | 12.49 / 22.41 | 34.51 / 44.09 |
All cells are EM / F1 (strict EM, token-level F1). Bootstrap 95% CIs on EM: +2.51 [+1.01, +4.06] vs Search-R1; +1.93 [+0.46, +3.36] vs ReasonRAG.
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the PRO-STEP policy (Qwen2.5-7B-Instruct backbone)
model = AutoModelForCausalLM.from_pretrained("DORAEMONG/PRO-STEP-Policy-7B", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("DORAEMONG/PRO-STEP-Policy-7B")

# Use with FlashRAG's SearchR1Pipeline or any agentic-RAG inference loop.
# System prompt: see paper Appendix A.
```
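For a quick smoke test outside FlashRAG, a plain chat-template generation call works; the system prompt and question below are placeholders, and the actual prompt is in Appendix A of the paper:

```python
messages = [
    {"role": "system", "content": "<agentic-RAG system prompt from Appendix A>"},
    {"role": "user", "content": "Who directed the film that won Best Picture in 1998?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```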
## Citation

```bibtex
@article{prostep2026,
  title={PRO-STEP: Step-level Process Reward Optimization for Retrieval-Augmented Generation},
  author={...},
  year={2026}
}
```