# PRO-STEP: Process Reward Model for Agentic RAG
This is the supervised step-level Process Reward Model (PRM) for PRO-STEP, trained to evaluate agentic-RAG steps along six axes: entity grounding, search query quality, reasoning, answer specificity, recovery, and overconfidence.
- Base model: DeepSeek-R1-0528-Qwen3-8B
- Adapter: LoRA (r=16)
- Training data: DORAEMONG/PRO-STEP-PRM-Data — ~109K step annotations across 31,728 trajectories from 2,000 questions
- Annotation source: QwQ-32B (open-source labeler), 84% human agreement on a 50-trajectory inspection sample
- Output: rationale + binary GOOD/BAD label
## Performance vs. other PRMs (BoN/WMV at K=128, F1)
| Setting | PRO-STEP PRM | VersaPRM | Math-PRM | Majority Voting |
|---|---|---|---|---|
| HotpotQA WMV-min | 59.50 ★ | 53.58 | 53.90 | 54.07 |
| HotpotQA BoN-min | 59.07 ★ | 53.26 | 48.07 | 54.07 |
| PopQA WMV-min | 50.01 ★ | 49.50 | 49.49 | 48.74 |
| PopQA BoN-min | 49.07 ★ | 45.11 | 41.32 | 48.74 |
| 2Wiki WMV-min | 46.01 ★ | 43.40 | 43.49 | 44.00 |
| 2Wiki BoN-min | 43.54 | 27.78 | 34.27 | 44.00 |
PRO-STEP PRM is the only learned PRM that matches or beats the Majority Voting (MV) baseline, doing so in 5 of 6 settings (it trails MV only on 2Wiki BoN-min). VersaPRM and Math-PRM both collapse below MV on the multi-hop BoN settings.
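The aggregation schemes in the table can be sketched as follows. This is an illustrative reading of "-min" aggregation (a trajectory is scored by its worst step) combined with Best-of-N selection and Weighted Majority Voting over final answers; it is not the paper's exact implementation, and `trajectory_score`, `best_of_n`, and `weighted_majority_vote` are hypothetical helper names.

```python
from collections import defaultdict

def trajectory_score(step_scores):
    # "-min" aggregation: a trajectory is only as good as its weakest step.
    return min(step_scores)

def best_of_n(trajectories):
    # trajectories: list of (final_answer, [per-step PRM scores]).
    # BoN returns the answer of the single highest-scoring trajectory.
    return max(trajectories, key=lambda t: trajectory_score(t[1]))[0]

def weighted_majority_vote(trajectories):
    # WMV sums trajectory scores per distinct answer and returns the
    # answer with the largest total weight.
    votes = defaultdict(float)
    for answer, step_scores in trajectories:
        votes[answer] += trajectory_score(step_scores)
    return max(votes, key=votes.get)
```

Note that the two schemes can disagree: a single strong trajectory wins BoN, while WMV can favor an answer supported by several mediocre trajectories.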
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B", torch_dtype="auto"
)
prm = PeftModel.from_pretrained(base_model, "DORAEMONG/PRO-STEP-PRM-8B")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-0528-Qwen3-8B")

# Inference: provide a trajectory step, get [REASONING] + Label: 0/1.
# See the paper's Appendix A for the MCTS scoring prompt.
```
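Once the model has generated, the rationale and binary label still need to be extracted from the raw text. A minimal sketch of that step, assuming the PRM emits a free-text rationale followed by a final `Label: 0` or `Label: 1` (the output format stated above; `parse_prm_output` is a hypothetical helper, not part of the released code):

```python
import re

def parse_prm_output(text):
    """Split a PRM generation into (rationale, label).

    Assumes the generation ends with 'Label: 0' or 'Label: 1';
    everything before that marker is treated as the rationale.
    """
    m = re.search(r"Label:\s*([01])", text)
    if m is None:
        raise ValueError("no 'Label: 0/1' found in PRM output")
    rationale = text[: m.start()].strip()
    return rationale, int(m.group(1))
```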
## Citation
```bibtex
@article{prostep2026,
  title={PRO-STEP: Step-level Process Reward Optimization for Retrieval-Augmented Generation},
  author={...},
  year={2026}
}
```