UXBench-GRM: Pointwise Generative Reward Model for User Experience
UXBench-GRM is a Pointwise Generative Reward Model trained to predict real user satisfaction with AI assistant responses. It is the official evaluation model used in the UXBench benchmark for Task 1 (UX Judge), Task 2 (UX Eval), and Task 3 (UX Recovery).
Model Description
| Attribute | Value |
|---|---|
| Base model | Hunyuan 3 (MoE, 20B active parameters) |
| Task | Pointwise UX quality judgment |
| Output | 好 (good) or 差 (bad) + continuous score via logprobs |
| Language | Chinese (Simplified) |
| Serving | OpenAI-compatible API via vLLM |
What it does
Given a multi-turn dialogue history, a user query, and an AI assistant response, the GRM predicts whether the response would satisfy the user. It outputs:
- Binary verdict: 好 (satisfied, verdict = 1) or 差 (unsatisfied, verdict = −1)
- Continuous score: score = P(好) / (P(好) + P(差)), derived from output token logprobs, where score ∈ [0, 1]
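The continuous score is a two-way normalization over the probabilities of the two verdict tokens. A minimal sketch, assuming the two logprobs have already been read from the API response; the function name `score_from_logprobs` is hypothetical and not part of the released scripts:

```python
import math


def score_from_logprobs(good_logprob: float, bad_logprob: float) -> float:
    """Return P(好) / (P(好) + P(差)) given the logprobs of the two tokens."""
    # exp(g) / (exp(g) + exp(b)) equals 1 / (1 + exp(b - g)).
    # Branch on the sign of the difference so exp() never overflows.
    diff = bad_logprob - good_logprob
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))
```

For example, token probabilities of 0.8 for 好 and 0.2 for 差 give a score of 0.8, and equal logprobs give 0.5.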
Training
- Architecture: Fine-tuned from Hunyuan 3 (MoE, 20B active params)
- Training data: 8,547 positive + 8,559 negative in-the-wild instances extracted from real AI assistant interaction logs
- Ground truth: Real user thumbs-up / thumbs-down signals (no synthetic labels, no LLM filtering)
- Data flywheel: Fully automated pipeline — no manual curation
- Leakage prevention: Training and test instances come from temporally separated windows
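The temporal separation described in the last bullet can be illustrated with a small sketch. The `timestamp` field, the function name `temporal_split`, and the window boundaries are all hypothetical; the actual boundaries used for UXBench are not stated here.

```python
from datetime import datetime


def temporal_split(instances, train_end, test_start):
    """Split logged instances into non-overlapping train and test windows.

    Instances dated between train_end and test_start are dropped, which
    leaves a gap between the two windows.
    """
    if train_end > test_start:
        raise ValueError("train window must end before the test window starts")
    train = [x for x in instances if x["timestamp"] < train_end]
    test = [x for x in instances if x["timestamp"] >= test_start]
    return train, test
```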
Performance on UXBench Task 1
| Model | Good-Acc | Bad-Acc | Avg-Acc |
|---|---|---|---|
| Pointwise GRM (Ours) | 82.1% | 72.4% | 77.2% |
| Claude Opus 4.7 | 89.1% | 61.5% | 75.3% |
| GPT-5.2 | 85.0% | 65.1% | 75.0% |
| Gemini 3.1 Pro | 91.6% | 49.3% | 70.4% |
The GRM achieves the highest Avg-Acc (balanced accuracy across liked and disliked conversations) among all evaluated models. Its Bad-Acc (recall on failure cases) of 72.4% exceeds that of the frontier LLMs (49.3% to 65.1%), which lean toward positive verdicts.
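The three columns can be computed from ground-truth labels and predicted verdicts as sketched below. This assumes Avg-Acc is the unweighted mean of Good-Acc and Bad-Acc, which matches the table up to rounding (for the GRM row, (82.1 + 72.4) / 2 = 77.25). The function name `task1_accuracies` is hypothetical.

```python
def task1_accuracies(labels, verdicts):
    """Compute Good-Acc, Bad-Acc, and Avg-Acc.

    labels and verdicts are equal-length sequences using 1 for good and
    -1 for bad; any other verdict (e.g. 0 for unparsed) counts as a miss.
    """
    pairs = list(zip(labels, verdicts))
    good_total = sum(1 for y, _ in pairs if y == 1)
    bad_total = sum(1 for y, _ in pairs if y == -1)
    good_hits = sum(1 for y, v in pairs if y == 1 and v == 1)
    bad_hits = sum(1 for y, v in pairs if y == -1 and v == -1)
    # Per-class recall; undefined (NaN) if a class has no examples.
    good_acc = good_hits / good_total if good_total else float("nan")
    bad_acc = bad_hits / bad_total if bad_total else float("nan")
    return {
        "good_acc": good_acc,
        "bad_acc": bad_acc,
        "avg_acc": (good_acc + bad_acc) / 2,
    }
```

Averaging the per-class recalls means a judge cannot score well on Avg-Acc by answering 好 for everything.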
Human alignment: In blind pairwise evaluation by 5 human experts, annotators preferred GRM-labeled good responses in 83.3% of cases on average.
Intended Use
This model is intended for academic research on:
- Automated evaluation of AI assistant response quality
- Reward modeling from real user feedback signals
- Studying user experience patterns in Chinese AI assistant dialogues
- Benchmarking LLMs on user-perceived utility
Not intended for: Production deployment, commercial applications, or any use outside the scope described above.
How to Use
Requirements
- vLLM ≥ 0.13.0 (required for HunyuanV3ForCausalLM)
- 8× H20 GPUs (or equivalent)
Step 1: Serve with vLLM
export MODEL_PATH=/path/to/mengze-hong/UXBench-GRM
export GRM_API_KEY=your_token_here # set any token you choose
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export TOKENIZERS_PARALLELISM=false
nohup vllm serve "$MODEL_PATH" \
    --served-model-name pointwise_grm_ux \
    --host 0.0.0.0 \
    --port 8021 \
    --tensor-parallel-size 8 \
    --max-model-len 16384 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager \
    --trust-remote-code \
    --enable-prefix-caching \
    >> vllm.log 2>&1 &
# Health check (wait ~5 min for model to load)
curl http://localhost:8021/v1/models
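Because loading takes several minutes, a script may prefer to poll the health-check URL rather than sleep for a fixed time. A minimal sketch using only the standard library; `server_is_up` and `wait_until_ready` are hypothetical helper names, and the 10-minute default budget is an assumption. If the server enforces an API key, an unauthenticated request returns an HTTP error and is treated here as "not ready".

```python
import time
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:8021/v1/models", timeout_s=5.0):
    """Return True if the models endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error status.
        return False


def wait_until_ready(probe=server_is_up, max_wait_s=600.0, interval_s=10.0,
                     sleep=time.sleep):
    """Call probe() until it returns True or max_wait_s has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` before sending requests avoids connection errors during startup.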
Step 2: Build the prompt
POINTWISE_GRM_TEMPLATE = """# Role
你是一位用户体验评估专家。你的任务是判断AI助手的回复是否能够令用户满意。
# Task
请根据以下对话上下文和AI助手的回复,判断该回复是否能让用户满意。
评估时应考虑:回复是否准确、完整、有帮助,是否正确理解了用户意图,表达是否清晰恰当。
# Input Data
## 历史对话
{context}
## 用户问题
{prompt}
## AI助手回复
{response}
# Evaluation Criteria
1. 正确性:回复是否包含事实性错误
2. 完整性:回复是否充分回答了用户的问题
3. 意图理解:AI是否正确理解了用户的需求
4. 表达质量:回复是否清晰、简洁、格式恰当(不冗余啰嗦)
5. 实用性:回复对用户是否有实际帮助
# Output Format
请仅输出一个字:"好" 或 "差"。
- 输出"好"代表:该回复能够令用户满意
- 输出"差"代表:该回复无法令用户满意
# Constraint
不要输出任何解释、分析或额外的标点符号,只输出最终的判定结果。
# Final Answer"""
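def build_grm_prompt(history: list, query: str, response: str,
                     max_history_turns: int = 6,
                     max_response_chars: int = 4000) -> str:
    """Fill POINTWISE_GRM_TEMPLATE with the dialogue history, query, and response."""
    # Keep only the most recent turns; a non-positive limit disables truncation.
    if max_history_turns > 0:
        history = history[-max_history_turns:]
    # Truncate long responses; a non-positive limit disables truncation.
    if max_response_chars > 0:
        response = response[:max_response_chars]
    ctx_lines = []
    for turn in history:
        role = turn.get("role", "")
        # Accept either a "content" or a "message" key for the turn text.
        msg = turn.get("content", "") or turn.get("message", "")
        # Any role other than "user" is labeled as the assistant.
        label = "用户" if role == "user" else "AI"
        ctx_lines.append(f"[{label}]: {msg}")
    # The placeholder means "(no dialogue history)".
    context = "\n".join(ctx_lines) if ctx_lines else "(无历史对话)"
    return POINTWISE_GRM_TEMPLATE.format(
        context=context, prompt=query, response=response
    )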
Step 3: Call the GRM and extract verdict + score
import math

import requests


def call_grm(prompt_text: str,
             endpoint: str = "http://localhost:8021/v1/chat/completions",
             model: str = "pointwise_grm_ux",
             api_key: str = "your_token_here") -> dict:
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt_text}],
        "max_tokens": 1,       # the verdict is a single token
        "temperature": 0.0,
        "logprobs": True,
        "top_logprobs": 20,
    }
    resp = requests.post(endpoint, headers=headers, json=payload, timeout=60)
    resp.raise_for_status()
    data = resp.json()
    choice = data["choices"][0]
    content = (choice["message"]["content"] or "").strip()

    # Extract logprobs for the continuous score. The "logprobs" field may be
    # null and its "content" list may be empty, so guard each level.
    good_logprob = bad_logprob = None
    token_logprobs = (choice.get("logprobs") or {}).get("content") or []
    top_logprobs = token_logprobs[0].get("top_logprobs", []) if token_logprobs else []
    for item in top_logprobs:
        token, lp = item.get("token", ""), item.get("logprob", -100)
        # Take the first (highest-ranked) candidate containing each character.
        if "好" in token and good_logprob is None:
            good_logprob = lp
        elif "差" in token and bad_logprob is None:
            bad_logprob = lp

    # score stays None unless both tokens appear among the top candidates.
    score = None
    if good_logprob is not None and bad_logprob is not None:
        p_good = math.exp(good_logprob)
        p_bad = math.exp(bad_logprob)
        score = round(p_good / (p_good + p_bad), 4)

    # verdict is 0 when the output contains neither character.
    verdict = 1 if "好" in content else (-1 if "差" in content else 0)
    return {"verdict": verdict, "score": score, "content": content}
# Example
history = [
    {"role": "user", "content": "帮我写一首关于春天的诗"},
    {"role": "assistant", "content": "春风轻抚绿柳梢..."},
]
query = "可以再长一点吗?"
response = "当然,我来为您扩展这首诗..."
prompt = build_grm_prompt(history, query, response)
result = call_grm(prompt)
print(result)
# {'verdict': 1, 'score': 0.8732, 'content': '好'}
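To score many prompts, the single-prompt call can be fanned out over a thread pool, since each request is I/O-bound. A minimal sketch; `score_many` is a hypothetical helper, and `scorer` is expected to be a function such as `call_grm` above. Failed requests are recorded rather than raised so one error does not abort the batch.

```python
from concurrent.futures import ThreadPoolExecutor


def score_many(prompts, scorer, workers=8):
    """Apply scorer to every prompt concurrently, preserving input order."""
    def safe(prompt):
        try:
            return scorer(prompt)
        except Exception as exc:  # keep going if one request fails
            return {"verdict": 0, "score": None, "content": "", "error": str(exc)}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(safe, prompts))
```

A call such as `score_many(prompts, call_grm, workers=32)` would line up with the `--max-num-seqs 32` setting in Step 1; whether more workers help depends on the server configuration.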
Threshold
The UXBench protocol uses score ≥ 0.5 as the binary threshold for "good":
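# score is None when 好 or 差 is missing from the top logprobs, so check it first
is_good = result["score"] is not None and result["score"] >= 0.5  # or result["verdict"] == 1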
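Since `call_grm` returns `score = None` when either verdict token is missing from the returned logprobs, a wrapper can apply the threshold and fall back to the verdict. A minimal sketch; the helper name `is_good_response` and the fallback policy are assumptions, not part of the UXBench protocol.

```python
def is_good_response(result: dict, threshold: float = 0.5) -> bool:
    """Apply the score threshold, falling back to the binary verdict."""
    score = result.get("score")
    if score is not None:
        return score >= threshold
    # No usable score: rely on the single-token verdict instead.
    return result.get("verdict") == 1
```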
Evaluation Scripts
Full evaluation scripts for UXBench Tasks 1/2/3 are available in the GitHub repo:
# Task 2: score model-generated responses
python scripts/grm_judge/run_grm_judge_task2.py \
    --responses-dir experiments/results/task2/responses/ \
    --endpoint http://localhost:8021/v1/chat/completions \
    --workers 80
# Task 3: score recovery responses
python scripts/grm_judge/run_grm_judge_task3.py \
    --responses-dir experiments/results/task3/responses/ \
    --workers 60
See scripts/grm_judge/README.md for full instructions.
Citation
@misc{hong2026uxbench,
  title         = {UXBench: Benchmarking User Experience in AI Assistants},
  author        = {Mengze Hong and Xia Zeng and Zeyang Lei and Sheng Wang and
                   Chen Jason Zhang and Di Jiang and others},
  year          = {2026},
  eprint        = {2606.09570},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.09570}
}
License
UXBench-GRM Research-Only License
Copyright © 2026 Mengze Hong, Tencent Yuanbao Team. All rights reserved.
This model and its weights are made available solely for non-commercial academic research. By downloading or using this model, you agree to the following terms:
- Research use only. You may use this model solely for non-commercial academic research and educational purposes.
- No commercial use. You may not use this model, in whole or in part, for any commercial purpose, including but not limited to: commercial products or services, revenue-generating applications, internal business tools, or production systems.
- No redistribution. You may not redistribute, sublicense, sell, or otherwise transfer the model weights or any derivative thereof to any third party.
- No derivative models for external release. You may not fine-tune, distill, or otherwise modify this model and release the resulting model externally without prior written permission from the authors.
- No use in training data pipelines. You may not use model outputs as training data or labels for training other models intended for commercial use.
- Attribution required. Any publication, presentation, or public use must cite the UXBench paper (arXiv:2606.09570).
- No warranty. This model is provided "as is" without any warranty of any kind. The authors are not liable for any damages arising from its use.
For permissions beyond this license (including commercial licensing), contact: mengze.hong@connect.polyu.hk · zeyanglei@gmail.com