[[**🤗 Model & Dataset**](https://huggingface.co/collections/gaotang/rm-r1-681128cdab932701cad844c8)]
[[**📊 Code**](https://github.com/RM-R1-UIUC/RM-R1)]
[[**📖 Paper**](https://arxiv.org/abs/2505.02387)]
# 🚀 Can we cast reward modeling as a reasoning task?
**RM-R1** is a training framework for *Reasoning Reward Models* (ReasRMs), which judge two candidate answers by first **thinking out loud** (generating structured rubrics or reasoning traces) and then emitting a preference. Compared with traditional scalar or generative reward models, RM-R1 achieves **state-of-the-art average performance** on public reward-model benchmarks while providing fully interpretable justifications.
## 🧠 TL;DR
* **Two-stage training**
  1. **Distillation** of ~8.7K high-quality reasoning traces (Chain-of-Rubrics).
  2. **Reinforcement Learning with Verifiable Rewards** (RLVR) on ~64K preference pairs.
* **Backbones** released: 7B / 14B / 32B Qwen-2.5-Instruct variants plus DeepSeek-distilled checkpoints (see the loading sketch below).
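
A minimal loading sketch for one of the released checkpoints. The repo id below is an assumption for illustration; check the Hugging Face collection linked above for the exact names.

```python
# Minimal sketch: load an RM-R1 checkpoint with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-Qwen2.5-Instruct-7B"  # assumed repo id; verify in the collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # requires `accelerate` for automatic placement
)
```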
## 💡 Intended uses
* **RLHF / RLAIF**: a plug-and-play reward function for policy optimization (see the sketch after this list).
* **Automated evaluation**: LLM-as-a-judge for open-domain QA, chat, and reasoning.
* **Research**: study process supervision, chain-of-thought verification, or rubric generation.
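
As a rough sketch of the plug-and-play reward use case: RM-R1 emits its preference as text, so a reward function can generate a judgment and parse the verdict into a scalar. The `[[A]]`/`[[B]]` verdict pattern and the helper below are illustrative assumptions, not the official API; see the demo notebook for the exact prompt and output format.

```python
import re

def preference_reward(judgment: str) -> float | None:
    """Parse a generated judgment into a scalar reward for Chatbot A.

    Assumes the judge ends with a [[A]] or [[B]] verdict, a common
    LLM-as-a-judge convention; adapt the regex to RM-R1's actual format.
    """
    match = re.search(r"\[\[([AB])\]\]", judgment)
    if match is None:
        return None  # no parsable verdict; the caller can skip or retry
    return 1.0 if match.group(1) == "A" else 0.0
```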
## 🔍 Demo Code
Try the model with the example below. The full demo notebook is available at:
📎 [Official Demo Link](https://github.com/RM-R1-UIUC/RM-R1/blob/main/demo/demo.ipynb)
### 🧾 Prompt Template
```python
INSTRUCT_SYSTEM_PROMPT = (
"Please act as an impartial judge and evaluate the quality of the responses provided by two AI Chatbots to the Client's question displayed below.\n\n"
"First, classify the task into one of two categories: