PRO-STEP: Process Reward Model for Agentic RAG

This is the supervised step-level Process Reward Model (PRM) for PRO-STEP. It is trained to evaluate agentic-RAG trajectory steps along six axes: entity grounding, search-query quality, reasoning, answer specificity, recovery, and overconfidence.

  • Base model: DeepSeek-R1-0528-Qwen3-8B
  • Adapter: LoRA (r=16)
  • Training data: DORAEMONG/PRO-STEP-PRM-Data — ~109K step annotations across 31,728 trajectories from 2,000 questions
  • Annotation source: QwQ-32B (open-source labeler), 84% human agreement on a 50-trajectory inspection sample
  • Output: a free-form rationale followed by a binary GOOD/BAD label (see the parsing sketch below)
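
Downstream scorers need to pull the verdict out of the generated text. A minimal parsing sketch, assuming the verdict appears on a trailing "Label:" line as GOOD/BAD (or 1/0, per the comment in the Usage section below); the exact output template is set by the Appendix A scoring prompt:

import re

def parse_prm_output(text):
    # Assumed format: "<rationale> ... Label: GOOD|BAD" (or "Label: 1|0").
    match = re.search(r"Label:\s*(GOOD|BAD|1|0)\b", text, flags=re.IGNORECASE)
    if match is None:
        return None  # malformed output; caller can re-query or discard the step
    return match.group(1).upper() in ("GOOD", "1")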

Performance vs. other PRMs (Best-of-N / weighted majority voting at K=128; F1)

Setting            PRO-STEP PRM   VersaPRM   Math-PRM   Majority Voting
HotpotQA WMV-min   59.50          53.58      53.90      54.07
HotpotQA BoN-min   59.07          53.26      48.07      54.07
PopQA    WMV-min   50.01          49.50      49.49      48.74
PopQA    BoN-min   49.07          45.11      41.32      48.74
2Wiki    WMV-min   46.01          43.40      43.49      44.00
2Wiki    BoN-min   43.54          27.78      34.27      44.00

PRO-STEP PRM is the only learned PRM that stays at or above the Majority Voting baseline, doing so in 5 of 6 settings (the exception is 2Wiki BoN-min). VersaPRM and Math-PRM both collapse below MV on multi-hop Best-of-N. A sketch of the two selection schemes follows.
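
The two schemes differ only in how per-trajectory scores are used. A minimal sketch, assuming "-min" denotes min-over-steps aggregation of per-step PRM scores and that each candidate is a dict with hypothetical answer and step_scores fields:

from collections import defaultdict

def trajectory_score(step_scores):
    # "-min" aggregation: score a trajectory by its weakest step.
    return min(step_scores)

def best_of_n(candidates):
    # BoN-min: return the answer of the single highest-scoring trajectory.
    best = max(candidates, key=lambda c: trajectory_score(c["step_scores"]))
    return best["answer"]

def weighted_majority_vote(candidates):
    # WMV-min: pool trajectory scores per distinct answer, pick the heaviest.
    weights = defaultdict(float)
    for c in candidates:
        weights[c["answer"]] += trajectory_score(c["step_scores"])
    return max(weights, key=weights.get)

Under min aggregation a single low-scored step sinks an otherwise plausible trajectory, which is why accurate step-level labels matter at K=128.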

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model, then attach the PRO-STEP LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-0528-Qwen3-8B", torch_dtype="auto")
prm = PeftModel.from_pretrained(base_model, "DORAEMONG/PRO-STEP-PRM-8B")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-0528-Qwen3-8B")

# Inference: provide a trajectory step and the model emits [REASONING]
# followed by a binary label, as sketched below.
# See paper Appendix A for the MCTS scoring prompt.
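
End-to-end, scoring one step looks roughly like the sketch below. The prompt string is a placeholder rather than the real template (that is the MCTS scoring prompt from Appendix A), and parse_prm_output is the assumed parser sketched above:

# Placeholder prompt; substitute the Appendix A MCTS scoring prompt.
question = "Who directed the film that won Best Picture in 1998?"
step = "Search query: 'Best Picture 1998 winner director'"
prompt = f"Evaluate this agentic-RAG step.\nQuestion: {question}\nStep: {step}\n"

inputs = tokenizer(prompt, return_tensors="pt").to(prm.device)
out = prm.generate(**inputs, max_new_tokens=512)
completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
is_good = parse_prm_output(completion)  # True = GOOD, False = BAD, None = unparsable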

Citation

@article{prostep2026,
  title={PRO-STEP: Step-level Process Reward Optimization for Retrieval-Augmented Generation},
  author={...},
  year={2026}
}