CE-RM
Paper: CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria
Introduction
Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and their actual effectiveness in RL practice. We attribute this gap to several limitations of existing studies, including the dominance of pairwise evaluation and the inadequate optimization of evaluation criteria. We therefore propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method and unified query-based criteria. Using only about 5.7K high-quality examples curated from open-source preference data, CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
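Because CE-RM-4B produces pointwise scores, Best-of-N selection reduces to scoring each candidate independently and keeping the highest-scoring one. A minimal sketch, where the `score` callable is a hypothetical stand-in for the full criteria-generation and evaluation pipeline shown under Usage:

```python
def best_of_n(query, candidates, score):
    """Return the candidate with the highest pointwise reward.

    `score(query, response) -> float` is a stand-in for the model's
    two-stage criteria + evaluation pipeline (hypothetical helper).
    """
    return max(candidates, key=lambda c: score(query, c))
```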
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "PKU-ONELab/CE-RM-4B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

criteria_prompt = """Your task is to produce a minimal set of criteria for evaluating the quality of potential responses to the user query given below.
Begin by carefully analyzing the query to fully understand the user's intent and requirements, and then take into account all common and tangible factors that can indicate the response quality.
From these considerations, derive the final evaluation criteria list, which **must adhere to the following requirements:**
- Each criterion should consist of a concise term as well as its unambiguous description.
- The number of criteria is not necessarily the more the better; Fewer yet comprehensive is more desired.
- The criteria should be sufficient and complete, ensuring that no essential aspects or key signals of response quality are omitted.
- The criteria should be necessary and non-overlapping, with each one indispensable, distinct in perspective, and strictly orthogonal to others.
Provide the relevant analysis first, followed by the numbered list of criteria between [Start of Criteria] and [End of Criteria], with one criterion per line and the more important ones coming first.
Below is the user query:
[Start of Query]
{query}
[End of Query]
"""

evaluation_prompt = """Now that you have a response to the previous user query, your new task is to evaluate it using the criteria list you have produced.
For each criterion, focus on its concerns and carefully evaluate the corresponding specific quality of the response, providing the detailed analysis as well as relevant arguments, followed by the corresponding quality score from 0 to 5 within $\\boxed{}$.
Moreover, if the response demonstrates strengths or weaknesses beyond the scope of your criteria list, introduce an additional criterion titled \"Other Point(s),\" discussing them and considering them as bonus points or deductions as appropriate.
Finally, based on the analyses of these criteria, including their relative importance and scores, **conduct a comprehensive evaluation of the response's overall quality with sufficient and explicit evidence**, and then provide a corresponding overall quality score from 0 to 10 within $\\boxed{}$.
Use integers or half-point increments for all scores, with higher numbers representing higher quality.
Below is the response:
[Start of Response]
{response}
[End of Response]
"""

# Fill in the user query and the candidate response to be scored.
query = "..."     # placeholder: your user query
response = "..."  # placeholder: the response to evaluate

# Stage 1: generate query-specific evaluation criteria.
criteria_conversation = [
    {"role": "user", "content": criteria_prompt.replace("{query}", query)}
]
input_ids = tokenizer.apply_chat_template(
    criteria_conversation,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_tensors="pt",
).to(model.device)
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=4096,
    do_sample=False,  # greedy decoding (temperature=0 is not a valid sampling setting)
)
criteria = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(criteria)

# Stage 2: evaluate the response against the generated criteria.
evaluation_conversation = [
    {"role": "user", "content": criteria_prompt.replace("{query}", query)},
    {"role": "assistant", "content": criteria},
    {"role": "user", "content": evaluation_prompt.replace("{response}", response)},
]
input_ids = tokenizer.apply_chat_template(
    evaluation_conversation,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_tensors="pt",
).to(model.device)
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=8192,
    do_sample=False,
)
evaluation = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(evaluation)
```
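To use the evaluation as a scalar reward, the overall 0-10 score must be extracted from the generated text. A minimal sketch, assuming the model follows the prompt format and places the overall score in the final `$\boxed{}$` (`extract_overall_score` is a hypothetical helper, not part of the released code):

```python
import re

def extract_overall_score(evaluation):
    """Return the last \\boxed{...} score in the evaluation text, or None.

    Assumes the model follows the prompt and emits the overall 0-10
    score as the final boxed value; this format is requested by the
    prompt but not guaranteed, so callers should handle None.
    """
    matches = re.findall(r"\\boxed\{\s*([0-9]+(?:\.[0-9]+)?)\s*\}", evaluation)
    return float(matches[-1]) if matches else None
```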
Citation
@article{hu2026rm,
title={CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria},
author={Hu, Xinyu and He, Yancheng and Wang, Weixun and Feng, Tao and Lin, Li and Liu, Jiashun and Su, Wenbo and Zheng, Bo and Wan, Xiaojun},
journal={arXiv preprint arXiv:2601.20327},
year={2026}
}