UXBench-GRM: Pointwise Generative Reward Model for User Experience


UXBench-GRM is a Pointwise Generative Reward Model trained to predict real user satisfaction with AI assistant responses. It is the official evaluation model used in the UXBench benchmark for Task 1 (UX Judge), Task 2 (UX Eval), and Task 3 (UX Recovery).

Model Description

Attribute | Value
Base model | Hunyuan 3 (MoE, 20B active parameters)
Task | Pointwise UX quality judgment
Output | 好 (good) or 差 (bad) + continuous score via logprobs
Language | Chinese (Simplified)
Serving | OpenAI-compatible API via vLLM

What it does

Given a multi-turn dialogue history, a user query, and an AI assistant response, the GRM predicts whether the response would satisfy the user. It outputs:

  • Binary verdict: 好 (satisfied, verdict = 1) or 差 (unsatisfied, verdict = −1)
  • Continuous score: score = P(好) / (P(好) + P(差)) derived from output token logprobs, where score ∈ [0, 1]
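
The continuous score is a two-way softmax over the 好 and 差 logprobs, which is algebraically the same as a sigmoid of their difference. A minimal sketch, assuming the two logprobs have already been read from the API response (the name `score_from_logprobs` is illustrative and not part of the released code):

```python
import math


def score_from_logprobs(lp_good: float, lp_bad: float) -> float:
    """Return P(好) / (P(好) + P(差)) from the two token logprobs."""
    # exp(a) / (exp(a) + exp(b)) == 1 / (1 + exp(b - a))
    diff = lp_bad - lp_good
    if diff > 0:
        # Equivalent form that avoids overflow when diff is large.
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


# P(好) = 0.8 and P(差) = 0.1 renormalize to 0.8 / 0.9
print(round(score_from_logprobs(math.log(0.8), math.log(0.1)), 4))
```

Because only the difference of the two logprobs matters, probability mass assigned to any other token does not change the score.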

Training

  • Architecture: Fine-tuned from Hunyuan 3 (MoE, 20B active params)
  • Training data: 8,547 positive + 8,559 negative in-the-wild instances extracted from real AI assistant interaction logs
  • Ground truth: Real user thumbs-up / thumbs-down signals (no synthetic labels, no LLM filtering)
  • Data flywheel: Fully automated pipeline — no manual curation
  • Leakage prevention: Training and test instances come from temporally separated windows
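
One way to picture the temporal separation is a split on a single cutoff date. The sketch below is only an illustration: the `timestamp` field, the record layout, and the cutoff date are all hypothetical, and the actual windows used for UXBench are not specified here.

```python
from datetime import datetime


def temporal_split(instances, cutoff):
    """Split so that every training instance predates every test instance."""
    train = [x for x in instances if x["timestamp"] < cutoff]
    test = [x for x in instances if x["timestamp"] >= cutoff]
    return train, test


# Hypothetical records; real logs would carry the dialogue and the user label.
logs = [
    {"id": 1, "timestamp": datetime(2025, 1, 5)},
    {"id": 2, "timestamp": datetime(2025, 2, 10)},
    {"id": 3, "timestamp": datetime(2025, 4, 1)},
]
cutoff = datetime(2025, 3, 1)  # hypothetical boundary between the two windows

train, test = temporal_split(logs, cutoff)
print([x["id"] for x in train], [x["id"] for x in test])
```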

Performance on UXBench Task 1

Model | Good-Acc | Bad-Acc | Avg-Acc
Pointwise GRM (Ours) | 82.1% | 72.4% | 77.2%
Claude Opus 4.7 | 89.1% | 61.5% | 75.3%
GPT-5.2 | 85.0% | 65.1% | 75.0%
Gemini 3.1 Pro | 91.6% | 49.3% | 70.4%

The GRM achieves the highest Avg-Acc (balanced accuracy across liked and disliked conversations) among the evaluated models. Its Bad-Acc (recall on failure cases) of 72.4% is 7.3 to 23.1 points higher than that of the frontier LLMs (49.3% to 65.1%), which tend toward positive bias and accordingly post higher Good-Acc.
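
The three metrics can be sketched as follows. This assumes Avg-Acc is the unweighted mean of Good-Acc and Bad-Acc, which is consistent with the reported numbers; the function name and the +1 / −1 label encoding are illustrative choices, not taken from the benchmark code.

```python
def ux_judge_accuracy(labels, preds):
    """Compute Good-Acc, Bad-Acc, and Avg-Acc; labels and preds use 1 (good) / -1 (bad)."""
    good = [p for y, p in zip(labels, preds) if y == 1]
    bad = [p for y, p in zip(labels, preds) if y == -1]
    if not good or not bad:
        raise ValueError("need at least one good and one bad instance")
    good_acc = sum(p == 1 for p in good) / len(good)   # recall on liked responses
    bad_acc = sum(p == -1 for p in bad) / len(bad)     # recall on disliked responses
    return {"good_acc": good_acc, "bad_acc": bad_acc,
            "avg_acc": (good_acc + bad_acc) / 2}


labels = [1, 1, 1, 1, -1, -1]
preds = [1, 1, 1, -1, -1, 1]
print(ux_judge_accuracy(labels, preds))
# {'good_acc': 0.75, 'bad_acc': 0.5, 'avg_acc': 0.625}
```

Averaging the two recalls keeps a judge that labels nearly everything "good" from scoring well overall.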

Human alignment: In blind pairwise evaluation by 5 human experts, annotators preferred GRM-labeled good responses in 83.3% of cases on average.


Intended Use

This model is intended for academic research on:

  • Automated evaluation of AI assistant response quality
  • Reward modeling from real user feedback signals
  • Studying user experience patterns in Chinese AI assistant dialogues
  • Benchmarking LLMs on user-perceived utility

Not intended for: Production deployment, commercial applications, or any use outside the scope described above.


How to Use

Requirements

  • vLLM ≥ 0.13.0 (required for HunyuanV3ForCausalLM)
  • 8× H20 GPUs (or equivalent)

Step 1: Serve with vLLM

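export MODEL_PATH=/path/to/mengze-hong/UXBench-GRM
# Token that clients send as a Bearer header. vLLM only enforces it if the
# serve command is also given --api-key "$GRM_API_KEY".
export GRM_API_KEY=your_token_here   # set any token you choose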

export VLLM_WORKER_MULTIPROC_METHOD=spawn
export TOKENIZERS_PARALLELISM=false

nohup vllm serve "$MODEL_PATH" \
    --served-model-name pointwise_grm_ux \
    --host 0.0.0.0 \
    --port 8021 \
    --tensor-parallel-size 8 \
    --max-model-len 16384 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager \
    --trust-remote-code \
    --enable-prefix-caching \
    >> vllm.log 2>&1 &

# Health check (wait ~5 min for model to load)
curl http://localhost:8021/v1/models
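
Instead of re-running the curl check by hand, the same endpoint can be polled until it answers. A minimal sketch using only the standard library; the helper names, timeout, and polling interval are assumptions, and it presumes the server was started without an enforced API key, as in the command above.

```python
import time
import urllib.request


def probe_models(url: str) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError, refused connections, and timeouts are OSError subclasses.
        return False


def wait_until_ready(url: str = "http://localhost:8021/v1/models",
                     timeout_s: float = 600.0,
                     interval_s: float = 10.0,
                     probe=probe_models,
                     sleep=time.sleep) -> bool:
    """Poll the endpoint until it responds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` before sending requests avoids connection errors while the weights are still loading.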

Step 2: Build the prompt

POINTWISE_GRM_TEMPLATE = """# Role
你是一位用户体验评估专家。你的任务是判断AI助手的回复是否能够令用户满意。

# Task
请根据以下对话上下文和AI助手的回复,判断该回复是否能让用户满意。
评估时应考虑:回复是否准确、完整、有帮助,是否正确理解了用户意图,表达是否清晰恰当。

# Input Data
## 历史对话
{context}
## 用户问题
{prompt}
## AI助手回复
{response}

# Evaluation Criteria
1. 正确性:回复是否包含事实性错误
2. 完整性:回复是否充分回答了用户的问题
3. 意图理解:AI是否正确理解了用户的需求
4. 表达质量:回复是否清晰、简洁、格式恰当(不冗余啰嗦)
5. 实用性:回复对用户是否有实际帮助

# Output Format
请仅输出一个字:"好" 或 "差"。
- 输出"好"代表:该回复能够令用户满意
- 输出"差"代表:该回复无法令用户满意

# Constraint
不要输出任何解释、分析或额外的标点符号,只输出最终的判定结果。

# Final Answer"""


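def build_grm_prompt(history: list, query: str, response: str,
                     max_history_turns: int = 6,
                     max_response_chars: int = 4000) -> str:
    """Fill POINTWISE_GRM_TEMPLATE with the dialogue history, query, and response.

    history is a list of dicts with a "role" key and the text under
    "content" (or "message"). A limit of 0 or less disables that truncation.
    """
    # Keep only the most recent entries. The limit counts history entries
    # (messages), so the default keeps the last six messages.
    if max_history_turns > 0:
        history = history[-max_history_turns:]
    # Truncate very long responses by character count.
    if max_response_chars > 0:
        response = response[:max_response_chars]

    ctx_lines = []
    for turn in history:
        role = turn.get("role", "")
        msg = turn.get("content", "") or turn.get("message", "")
        # Any role other than "user" is rendered as the assistant.
        label = "用户" if role == "user" else "AI"
        ctx_lines.append(f"[{label}]: {msg}")

    # The placeholder text means "(no dialogue history)".
    context = "\n".join(ctx_lines) if ctx_lines else "(无历史对话)"
    return POINTWISE_GRM_TEMPLATE.format(
        context=context, prompt=query, response=response
    )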

Step 3: Call the GRM and extract verdict + score

import math

import requests


def call_grm(prompt_text: str,
             endpoint: str = "http://localhost:8021/v1/chat/completions",
             model: str = "pointwise_grm_ux",
             api_key: str = "your_token_here") -> dict:
    """Send one prompt to the GRM and return its verdict, score, and raw output."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt_text}],
        "max_tokens": 1,       # only one token is generated
        "temperature": 0.0,    # greedy decoding
        "logprobs": True,
        "top_logprobs": 20,
    }

    resp = requests.post(endpoint, headers=headers, json=payload, timeout=60)
    resp.raise_for_status()
    data = resp.json()

    choice = data["choices"][0]
    content = (choice["message"].get("content") or "").strip()

    # Top-k alternatives for the single generated token. "logprobs" or its
    # "content" list may be missing, null, or empty, so guard each step.
    token_entries = (choice.get("logprobs") or {}).get("content") or []
    top_logprobs = []
    if token_entries:
        top_logprobs = token_entries[0].get("top_logprobs") or []

    # Take the first (highest-probability) candidate containing each character.
    good_logprob = bad_logprob = None
    for item in top_logprobs:
        token, lp = item.get("token", ""), item.get("logprob", -100)
        if "好" in token and good_logprob is None:
            good_logprob = lp
        elif "差" in token and bad_logprob is None:
            bad_logprob = lp

    # score stays None unless both tokens appear among the top-k candidates.
    score = None
    if good_logprob is not None and bad_logprob is not None:
        p_good = math.exp(good_logprob)
        p_bad = math.exp(bad_logprob)
        score = round(p_good / (p_good + p_bad), 4)

    # verdict is 0 when the output contains neither character.
    verdict = 1 if "好" in content else (-1 if "差" in content else 0)
    return {"verdict": verdict, "score": score, "content": content}


# Example
history = [
    {"role": "user",      "content": "帮我写一首关于春天的诗"},
    {"role": "assistant", "content": "春风轻抚绿柳梢..."},
]
query    = "可以再长一点吗?"
response = "当然,我来为您扩展这首诗..."

prompt = build_grm_prompt(history, query, response)
result = call_grm(prompt)
print(result)
# {'verdict': 1, 'score': 0.8732, 'content': '好'}
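
To score many prompts, the calls can be issued concurrently from a thread pool, since each request spends most of its time waiting on the server. The sketch below is not the implementation used by the released evaluation scripts. `score_batch` and its `workers` default are illustrative, and `score_fn` stands in for a function such as `call_grm`.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(prompts, score_fn, workers: int = 8):
    """Apply score_fn to every prompt concurrently, preserving input order."""
    def safe(prompt):
        try:
            return score_fn(prompt)
        except Exception as exc:
            # Record the failure so one bad request does not abort the batch.
            return {"verdict": 0, "score": None, "content": "", "error": str(exc)}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(safe, prompts))


# Stub scorer for illustration; in practice pass call_grm instead.
def fake_score(prompt):
    if not prompt:
        raise ValueError("empty prompt")
    return {"verdict": 1, "score": 0.9, "content": "好"}


results = score_batch(["a", "", "b"], fake_score, workers=2)
print([r["verdict"] for r in results])
# [1, 0, 1]
```

The worker count should stay in line with what the server can batch; the serve command above caps concurrent sequences with `--max-num-seqs 32`.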

Threshold

The UXBench protocol uses score ≥ 0.5 as the binary threshold for "good":

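# score is None when 好 or 差 is missing from the top logprobs, so check it first
is_good = result["score"] is not None and result["score"] >= 0.5  # or result["verdict"] == 1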
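
Since `score` can be `None` and `verdict` can be 0, a small helper makes the fallback order explicit. This is one possible convention, not part of the UXBench protocol: use the score when it exists, otherwise the verdict, otherwise report that no label could be derived. The name `to_binary_label` is hypothetical.

```python
def to_binary_label(result: dict, threshold: float = 0.5):
    """Return True (good), False (bad), or None if the output is unusable."""
    score = result.get("score")
    if score is not None:
        return score >= threshold
    verdict = result.get("verdict", 0)
    if verdict != 0:
        return verdict == 1
    return None


print(to_binary_label({"verdict": 1, "score": 0.8732, "content": "好"}))  # True
print(to_binary_label({"verdict": -1, "score": None, "content": "差"}))   # False
print(to_binary_label({"verdict": 0, "score": None, "content": ""}))      # None
```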

Evaluation Scripts

Full evaluation scripts for UXBench Tasks 1/2/3 are available in the GitHub repo:

# Task 2: score model-generated responses
python scripts/grm_judge/run_grm_judge_task2.py \
    --responses-dir experiments/results/task2/responses/ \
    --endpoint http://localhost:8021/v1/chat/completions \
    --workers 80

# Task 3: score recovery responses
python scripts/grm_judge/run_grm_judge_task3.py \
    --responses-dir experiments/results/task3/responses/ \
    --workers 60

See scripts/grm_judge/README.md for full instructions.


Citation

@misc{hong2026uxbench,
  title         = {UXBench: Benchmarking User Experience in AI Assistants},
  author        = {Mengze Hong and Xia Zeng and Zeyang Lei and Sheng Wang and
                   Chen Jason Zhang and Di Jiang and others},
  year          = {2026},
  eprint        = {2606.09570},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.09570}
}

License

UXBench-GRM Research-Only License

Copyright © 2026 Mengze Hong, Tencent Yuanbao Team. All rights reserved.

This model and its weights are made available solely for non-commercial academic research. By downloading or using this model, you agree to the following terms:

  1. Research use only. You may use this model solely for non-commercial academic research and educational purposes.
  2. No commercial use. You may not use this model, in whole or in part, for any commercial purpose, including but not limited to: commercial products or services, revenue-generating applications, internal business tools, or production systems.
  3. No redistribution. You may not redistribute, sublicense, sell, or otherwise transfer the model weights or any derivative thereof to any third party.
  4. No derivative models for external release. You may not fine-tune, distill, or otherwise modify this model and release the resulting model externally without prior written permission from the authors.
  5. No use in training data pipelines. You may not use model outputs as training data or labels for training other models intended for commercial use.
  6. Attribution required. Any publication, presentation, or public use must cite the UXBench paper (arXiv:2606.09570).
  7. No warranty. This model is provided "as is" without any warranty of any kind. The authors are not liable for any damages arising from its use.

For permissions beyond this license (including commercial licensing), contact: mengze.hong@connect.polyu.hk · zeyanglei@gmail.com

Model size: 295B params (Safetensors). Tensor types: BF16, F32.