PRO-STEP: Process Reward Model for Agentic RAG

This is the supervised step-level Process Reward Model (PRM) for PRO-STEP. It is trained to evaluate agentic-RAG trajectory steps along six axes: entity grounding, search-query quality, reasoning, answer specificity, recovery, and overconfidence.

  • Base model: DeepSeek-R1-0528-Qwen3-8B
  • Adapter: LoRA (r=16)
  • Training data: DORAEMONG/PRO-STEP-PRM-Data — ~109K step annotations across 31,728 trajectories from 2,000 questions
  • Annotation source: QwQ-32B (open-source labeler), 84% human agreement on a 50-trajectory inspection sample
  • Output: a free-form rationale followed by a binary GOOD/BAD label (see the parsing sketch below)
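
Downstream scorers need to pull the verdict out of the generated text. A minimal parsing sketch, assuming the verdict appears on a trailing "Label:" line as GOOD/BAD (or 1/0, per the comment in the Usage section below); the exact output template is set by the Appendix A scoring prompt:

import re

def parse_prm_output(text):
    # Assumed format: "<rationale> ... Label: GOOD|BAD" (or "Label: 1|0").
    match = re.search(r"Label:\s*(GOOD|BAD|1|0)\b", text, flags=re.IGNORECASE)
    if match is None:
        return None  # malformed output; caller can re-query or discard the step
    return match.group(1).upper() in ("GOOD", "1")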

Performance vs. other PRMs (Best-of-N / weighted majority voting at K=128; F1)

Setting            PRO-STEP PRM   VersaPRM   Math-PRM   Majority Voting
HotpotQA WMV-min   59.50          53.58      53.90      54.07
HotpotQA BoN-min   59.07          53.26      48.07      54.07
PopQA    WMV-min   50.01          49.50      49.49      48.74
PopQA    BoN-min   49.07          45.11      41.32      48.74
2Wiki    WMV-min   46.01          43.40      43.49      44.00
2Wiki    BoN-min   43.54          27.78      34.27      44.00

PRO-STEP PRM is the only learned PRM that stays at or above the Majority Voting baseline, doing so in 5 of 6 settings (the exception is 2Wiki BoN-min). VersaPRM and Math-PRM both collapse below MV on multi-hop Best-of-N. A sketch of the two selection schemes follows.
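
The two schemes differ only in how per-trajectory scores are used. A minimal sketch, assuming "-min" denotes min-over-steps aggregation of per-step PRM scores and that each candidate is a dict with hypothetical answer and step_scores fields:

from collections import defaultdict

def trajectory_score(step_scores):
    # "-min" aggregation: score a trajectory by its weakest step.
    return min(step_scores)

def best_of_n(candidates):
    # BoN-min: return the answer of the single highest-scoring trajectory.
    best = max(candidates, key=lambda c: trajectory_score(c["step_scores"]))
    return best["answer"]

def weighted_majority_vote(candidates):
    # WMV-min: pool trajectory scores per distinct answer, pick the heaviest.
    weights = defaultdict(float)
    for c in candidates:
        weights[c["answer"]] += trajectory_score(c["step_scores"])
    return max(weights, key=weights.get)

Under min aggregation a single low-scored step sinks an otherwise plausible trajectory, which is why accurate step-level labels matter at K=128.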

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model, then attach the PRO-STEP LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-0528-Qwen3-8B", torch_dtype="auto")
prm = PeftModel.from_pretrained(base_model, "DORAEMONG/PRO-STEP-PRM-8B")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-0528-Qwen3-8B")

# Inference: provide a trajectory step and the model emits [REASONING]
# followed by a binary label, as sketched below.
# See paper Appendix A for the MCTS scoring prompt.
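
End-to-end, scoring one step looks roughly like the sketch below. The prompt string is a placeholder rather than the real template (that is the MCTS scoring prompt from Appendix A), and parse_prm_output is the assumed parser sketched above:

# Placeholder prompt; substitute the Appendix A MCTS scoring prompt.
question = "Who directed the film that won Best Picture in 1998?"
step = "Search query: 'Best Picture 1998 winner director'"
prompt = f"Evaluate this agentic-RAG step.\nQuestion: {question}\nStep: {step}\n"

inputs = tokenizer(prompt, return_tensors="pt").to(prm.device)
out = prm.generate(**inputs, max_new_tokens=512)
completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
is_good = parse_prm_output(completion)  # True = GOOD, False = BAD, None = unparsable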

Citation

@article{prostep2026,
  title={PRO-STEP: Step-level Process Reward Optimization for Retrieval-Augmented Generation},
  author={...},
  year={2026}
}