PRO-STEP: Step-level Process Reward Optimization for RAG (Policy Model)

This is the main policy model for PRO-STEP, a self-improving framework for agentic Retrieval-Augmented Generation. The policy is trained with step-level DPO on its own MCTS trajectories, which are scored by an open-source 8B process reward model (PRM).

  • Backbone: Qwen2.5-7B-Instruct
  • PRM: DORAEMONG/PRO-STEP-PRM-8B
  • Preference data: DORAEMONG/PRO-STEP-Preference-Data
  • Training: DPO (β=0.1) with document-token masking, LoRA r=64/α=128, 1 epoch, 5,000 questions / 15,877 step-level preference pairs
  • MCTS: K=3 branching, depth 7, 64 rollouts/question, value V(s) = Q̄(s) + α · r̂(s) with α=0.3
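The node value from the MCTS bullet above, V(s) = Q̄(s) + α · r̂(s), combines the mean rollout return with the PRM step score. A minimal sketch (the tree search and selection policy, e.g. UCT, are not shown; the toy numbers are illustrative):

```python
def node_value(rollout_returns, prm_score, alpha=0.3):
    """V(s) = Q-bar(s) + alpha * r-hat(s): mean return over rollouts
    through this node, plus the PRM's step score weighted by alpha."""
    q_bar = sum(rollout_returns) / len(rollout_returns)
    return q_bar + alpha * prm_score

# A node with K=3 child rollouts returning 1, 0, 1 and a PRM step score of 0.8:
v = node_value([1, 0, 1], prm_score=0.8)  # 2/3 + 0.3 * 0.8
```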

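The step-level DPO objective with document-token masking (β = 0.1, from the training bullet above) can be sketched as follows. This is an illustration of the loss on one preference pair, not the released training code; the toy log-probabilities are invented:

```python
import math

def masked_logprob(token_logprobs, doc_mask):
    """Sum per-token log-probs, skipping retrieved-document tokens
    (doc_mask[i] == 1 marks a document token excluded from the loss)."""
    return sum(lp for lp, m in zip(token_logprobs, doc_mask) if m == 0)

def step_dpo_loss(pol_w, pol_l, ref_w, ref_l, mask_w, mask_l, beta=0.1):
    """DPO loss for one step-level pair (chosen step w vs. rejected step l):
    -log sigmoid(beta * (policy/reference log-ratio gap))."""
    logratio_w = masked_logprob(pol_w, mask_w) - masked_logprob(ref_w, mask_w)
    logratio_l = masked_logprob(pol_l, mask_l) - masked_logprob(ref_l, mask_l)
    margin = beta * (logratio_w - logratio_l)
    return math.log1p(math.exp(-margin))  # == -log(sigmoid(margin))

# Toy pair: the chosen step is more likely under the policy than the reference;
# the middle token of the chosen step is a masked document token.
loss = step_dpo_loss(
    pol_w=[-0.5, -9.0, -0.4], pol_l=[-1.2, -1.1],
    ref_w=[-0.7, -9.0, -0.6], ref_l=[-1.0, -0.9],
    mask_w=[0, 1, 0], mask_l=[0, 0],
)
```

Masking document tokens keeps the loss focused on the model's own reasoning and search actions rather than on copied retrieval text.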
Performance (five datasets, identical FlashRAG evaluation pipeline)

| Method | Train data | HotpotQA | PopQA | 2Wiki | Bamboogle | Musique | AVG |
|---|---|---|---|---|---|---|---|
| Search-R1 | ~90,000 | 37.88 / 49.56 | 40.65 / 46.78 | 34.87 / 42.50 | 33.60 / 43.55 | 12.99 / 21.23 | 32.00 / 40.72 |
| ReasonRAG | ~5,000 | 36.37 / 47.51 | 37.78 / 44.87 | 39.80 / 46.32 | 38.40 / 46.86 | 10.59 / 19.22 | 32.59 / 40.96 |
| StepSearch | ~19,000 | 38.72 / 50.67 | 39.24 / 44.97 | 40.38 / 47.12 | 33.60 / 44.16 | 13.82 / 23.06 | 33.15 / 42.00 |
| PRO-STEP (ours) | 5,000 | 38.73 / 51.63 | 40.47 / 47.37 | 44.07 / 51.43 | 36.80 / 47.63 | 12.49 / 22.41 | 34.51 / 44.09 |

All scores are EM / F1 (strict exact match, token-level F1). Bootstrap 95% CIs on average EM: +2.51 over Search-R1 [+1.01, +4.06]; +1.93 over ReasonRAG [+0.46, +3.36].
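A percentile-bootstrap CI like the ones quoted above can be computed from paired per-question EM scores roughly as follows (a sketch; the paired toy scores and `n_boot` are illustrative, not the paper's data or exact procedure):

```python
import random

def bootstrap_ci_delta_em(em_ours, em_base, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for the mean EM difference, given paired
    per-question correctness lists (1 = exact match, 0 = miss)."""
    rng = random.Random(seed)
    n = len(em_ours)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample questions
        deltas.append(sum(em_ours[i] - em_base[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# Toy paired scores (illustrative only): ours wins on 2 of every 8 questions.
ours = [1, 1, 0, 1, 0, 1, 1, 0] * 50
base = [1, 0, 0, 1, 0, 1, 0, 0] * 50
lo, hi = bootstrap_ci_delta_em(ours, base)
```

Resampling whole questions (not methods independently) preserves the pairing, which is what makes the interval valid for the *difference* between systems.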

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("DORAEMONG/PRO-STEP-Policy-7B", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("DORAEMONG/PRO-STEP-Policy-7B")

# Use with the FlashRAG SearchR1Pipeline or any agentic-RAG inference loop.
# System prompt: see paper Appendix A.
```
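An agentic-RAG inference loop around the model might look like the sketch below. The `<search>`/`<answer>` tag convention is an assumed Search-R1-style format, and `generate`/`retrieve` are stubbed placeholders, not this repo's API; use the pipeline and system prompt referenced above for faithful reproduction:

```python
import re

def agentic_rag_loop(generate, retrieve, question, max_turns=7):
    """Alternate generation and retrieval until the model emits <answer>...</answer>.
    `generate`: prompt -> model text; `retrieve`: query -> document text."""
    prompt = question
    for _ in range(max_turns):
        out = generate(prompt)
        ans = re.search(r"<answer>(.*?)</answer>", out, re.S)
        if ans:
            return ans.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", out, re.S)
        if not query:
            return out.strip()  # no tool call and no answer tag: return raw text
        docs = retrieve(query.group(1).strip())
        prompt += out + f"\n<information>{docs}</information>\n"
    return None  # turn budget exhausted

# Stubbed model and retriever, purely for illustration:
def fake_generate(prompt):
    if "<information>" in prompt:
        return "<answer>Paris</answer>"
    return "<search>capital of France</search>"

def fake_retrieve(query):
    return "Paris is the capital of France."

answer = agentic_rag_loop(fake_generate, fake_retrieve, "What is the capital of France?")
# → "Paris"
```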

Citation

@article{prostep2026,
  title={PRO-STEP: Step-level Process Reward Optimization for Retrieval-Augmented Generation},
  author={...},
  year={2026}
}