---
license: apache-2.0
language:
- en
- fr
- es
- it
- de
- ru
- tr
- pt
- zh
- pl
- ar
- ko
- ja
- id
- vi
- multilingual
library_name: transformers
pipeline_tag: text-generation
tags:
- reward-model
- reasoning
- multilingual
- evaluation
- generative-reward-model
- rlhf
- grpo
base_model: Qwen/Qwen3-8B
---

# UniRRM-8B: Unified Reasoning Reward Model (8B)

## Overview

**UniRRM-8B** is a unified reasoning reward model that supports **103 languages** and **three evaluation paradigms (pairwise, listwise, and pointwise)** in a single model. It is built on **Qwen3-8B** and trained with a two-stage pipeline (SFT followed by GRPO) on the [MixReward](https://huggingface.co/datasets/SUSTech-NLP/MixReward) dataset.

The model is introduced in the following paper, accepted at **ICML 2026** (the 43rd International Conference on Machine Learning):

> **UniRRM: Unified Reasoning Reward Models Across Languages and Evaluation Paradigms** [[Paper]](https://icml.cc/virtual/2026/poster/61930)

## Key Features

- **103 Languages**: Trained on multilingual data spanning 103 languages across 6 domains
- **Unified Evaluation Paradigms**: Supports pairwise, listwise, and pointwise evaluation in a single model
- **Adaptive Rubric Generation**: Dynamically generates task-generic and instruction-specific evaluation criteria through a staged reasoning chain
- **Structured Reasoning**: Follows a three-stage reasoning pipeline: Deep Analysis → Adaptive Rubric Generation → Detailed Evaluation
- **Efficient**: Strong performance in a compact 8B-parameter model

## Reasoning Workflow

UniRRM follows a structured three-stage reasoning chain:

1. **Deep Analysis**: Identifies task intent, potential risks, core evaluation objectives, and strict constraints
2. **Adaptive Rubric Generation**: Produces both task-generic criteria (broadly applicable) and instruction-specific criteria (tailored to the user query), each scored on a 1-5 scale
3. **Detailed Evaluation**: Applies the generated rubrics to judge candidate responses, with per-criterion scoring and a final judgment

## Quick Start with vLLM

The following example demonstrates **pairwise evaluation** using vLLM offline inference. To switch to other evaluation paradigms, simply adjust the number of `<Response>` blocks in the user prompt:

- **Pairwise**: 2 responses (`<Response1>`, `<Response2>`)
- **Listwise**: 4 responses (`<Response1>` through `<Response4>`)
- **Pointwise**: 1 response (`<Response1>`), optionally with a `<Reference_Answer>` block

```python
import json
import re

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_NAME = "SUSTech-NLP/UniRRM-8B"

# ---------- 1. Load model ----------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = LLM(model=MODEL_NAME, max_model_len=16384)
sampling_params = SamplingParams(temperature=0, max_tokens=4096, repetition_penalty=1.05)

# ---------- 2. Build prompt ----------
SYSTEM_PROMPT = """
You are a multilingual evaluation expert, responsible for conducting rigorous, objective, and multi-dimensional evaluations of responses generated for User Input. Your evaluation must strictly follow the step-by-step process outlined below:
### Phase 1: Deep Analysis
Before evaluating, perform a comprehensive analysis of the User Input to establish a robust baseline:
1. **Identify potential risks**: Analyze the User Input to identify any potential safety, legal, offensive, or ethical risks.
2. **Identify task type**: Identify the primary task type (e.g., chat, reasoning, code generation, translation, or creative writing).
3. **Analyze core requirements (task-dependent)**: Define the fundamental evaluation dimensions that any correct response must satisfy.
4. **Analyze specific requirements**: Identify additional constraints or expectations unique to the User Input.
5. **Predict response content**: Summarize the expected content or core objectives of a correct response.
### Phase 2: Dynamic Rubric Generation
1. Generate a set of evaluation rubrics tailored to the user inputs and responses, with a 1-5 scoring criterion for each rubric.
2. If any safety, legal, or ethical risks are detected, include a Safety rubric as the highest-priority dimension.
3. Ensure rubrics comprehensively cover all critical aspects of the response.
### Phase 3: Detailed Evaluation
For each rubric, evaluate the response:
1. **Evidence Extraction**: Identify specific passages that meet or fail to meet the rubric requirements.
2. **Gap Analysis**: Determine why the response did not achieve a perfect score (5).
3. **Scoring**: Assign a score from 1 to 5.
### OUTPUT FORMAT
{
"Analysis_process": "Concise summary of the analysis.",
"rubrics": [{"name": "String", "description": "Rubric definition"}],
"evaluations": [{"response_id": "String", "explanation": "Summary", "final_score": "Float"}],
"best_id": "ID of the winner"
}
""".strip()

question = "Explain the concept of recursion in programming."
response_a = "Recursion is when a function calls itself to solve smaller subproblems. A base case stops the recursion, and each recursive call works on a reduced version of the original problem. For example, calculating factorial: factorial(n) = n * factorial(n-1), with factorial(0) = 1 as the base case."
response_b = "Recursion means repeating something. In programming, it is used sometimes."

user_prompt = f"""
<User_Input>
{question}
</User_Input>
<Response1>
{response_a}
</Response1>
<Response2>
{response_b}
</Response2>
"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# ---------- 3. Generate ----------
outputs = llm.generate([prompt], sampling_params)
raw_output = outputs[0].outputs[0].text
print(raw_output)


# ---------- 4. Parse output ----------
def parse_unirrm_output(raw_output: str) -> dict:
    """Parse UniRRM's JSON output to extract scores and best_id."""
    text = raw_output
    # Strip the <think>...</think> reasoning block if present
    if "</think>" in text:
        text = text.split("</think>")[-1].strip()
    # Extract JSON from markdown code block or raw text
    code_block = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if code_block:
        json_str = code_block.group(1)
    else:
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end != -1:
            json_str = text[start : end + 1]
        else:
            return {"error": "No JSON found in output"}
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        # Fall back to pulling out the first final_score if the JSON is malformed
        match = re.search(r'"final_score"\s*:\s*"?(\d+(?:\.\d+)?)"?', json_str)
        if match:
            return {"final_score": float(match.group(1))}
        return {"error": "Failed to parse JSON"}


result = parse_unirrm_output(raw_output)
print(f"Best response: {result.get('best_id')}")
for evaluation in result.get("evaluations", []):
    print(f"  {evaluation['response_id']}: score={evaluation['final_score']}")
```
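
Since the evaluation paradigm is selected purely by the number of `<ResponseN>` blocks in the user prompt, prompt construction can be factored into a small helper. The sketch below is illustrative only: `build_user_prompt` is a hypothetical function, not part of the released code, and placing the optional `<Reference_Answer>` block directly after `<User_Input>` is an assumption, because this card does not state where that block should go.

```python
from typing import Optional, Sequence


def build_user_prompt(
    question: str,
    responses: Sequence[str],
    reference_answer: Optional[str] = None,
) -> str:
    """Assemble a UniRRM-style user prompt (hypothetical helper)."""
    if not responses:
        raise ValueError("at least one response is required")
    parts = [f"<User_Input>\n{question}\n</User_Input>"]
    if reference_answer is not None:
        # Assumption: the reference block follows the user input.
        parts.append(f"<Reference_Answer>\n{reference_answer}\n</Reference_Answer>")
    # Number the candidate responses from 1, matching <Response1>, <Response2>, ...
    for i, response in enumerate(responses, start=1):
        parts.append(f"<Response{i}>\n{response}\n</Response{i}>")
    return "\n" + "\n".join(parts) + "\n"


# One builder covers all three paradigms.
pointwise = build_user_prompt("What is 2 + 2?", ["4"], reference_answer="4")
pairwise = build_user_prompt("What is 2 + 2?", ["4", "5"])
listwise = build_user_prompt("What is 2 + 2?", ["4", "5", "22", "four"])
```

Each returned string can be used as the `user` message content in place of `user_prompt` in the example above; the system prompt stays the same.
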
## Training

UniRRM-8B is trained using a two-stage pipeline:

### Stage 1: Supervised Fine-Tuning (SFT)

- **Base model**: Qwen3-8B
- **Training data**: [UniRRM-SFT](https://huggingface.co/datasets/SUSTech-NLP/UniRRM-SFT) (35,749 samples distilled from GPT-OSS-120B)
- **Epochs**: 3
- **Objective**: Initialize structured reasoning capabilities (analysis → rubric generation → evaluation)

### Stage 2: Reinforcement Learning with GRPO

- **Training data**: [UniRRM-RL](https://huggingface.co/datasets/SUSTech-NLP/UniRRM-RL) (32,832 samples)
- **Algorithm**: Group Relative Policy Optimization (GRPO)
- **Composite reward**: `R = 0.8 × r_fmt + 0.15 × r_acc + 0.05 × r_rubric`
  - **Format Reward (r_fmt)**: Ensures structured output compliance
  - **Outcome Consistency Reward (r_acc)**: Binary reward for correct final judgment
  - **Rubric Quality Reward (r_rubric)**: Teacher model (Qwen3-Max) evaluates rubric quality (1-5)
- **Hyperparameters**: lr=1e-6, weight_decay=0.01, batch_size=1024, epochs=2, kl_coef=0.001, rollout=5
- **Hardware**: 8 × NVIDIA H100 80GB GPUs

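
To make the Stage 2 reward concrete, the sketch below combines the three terms with the weights listed above and then standardizes the rewards within one rollout group, which is the core idea of GRPO. It is not the authors' training code. Several details are assumptions: that `r_fmt` and `r_acc` lie in [0, 1], that the teacher's 1-5 rubric score is rescaled linearly to [0, 1], and that the population standard deviation with a small `eps` is used for normalization (implementations differ on this). The rollout values are made up for illustration.

```python
from statistics import mean, pstdev
from typing import List

W_FMT, W_ACC, W_RUBRIC = 0.8, 0.15, 0.05  # weights from the composite reward formula


def composite_reward(r_fmt: float, r_acc: float, rubric_score: float) -> float:
    """Combine the three reward terms into R.

    Assumption: the 1-5 rubric score is mapped linearly onto [0, 1];
    the model card does not say how r_rubric is normalized.
    """
    r_rubric = (rubric_score - 1.0) / 4.0
    return W_FMT * r_fmt + W_ACC * r_acc + W_RUBRIC * r_rubric


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one rollout group (GRPO-style advantages)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of rollout=5 judgments for one prompt:
# (r_fmt, r_acc, teacher rubric score on the 1-5 scale)
rollouts = [(1, 1, 5), (1, 1, 3), (1, 0, 4), (0, 0, 1), (1, 0, 2)]
rewards = [composite_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"R={reward:.4f}  advantage={advantage:+.3f}")
```

Under these assumptions, a rollout that fails the format check receives a much lower reward than any well-formed one, because the format term carries 80% of the weight.
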
## Model Details

| Attribute | Value |
|-----------|-------|
| **Architecture** | Qwen3ForCausalLM |
| **Parameters** | ~8B |
| **Precision** | bfloat16 |
| **Max Position Embeddings** | 40960 |
| **Vocabulary Size** | 151936 |

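
As a rough sizing aid, the parameter count and precision in the table give a back-of-the-envelope estimate of the memory needed for the weights alone. This is only a sketch: it takes "~8B" as exactly 8e9 parameters (an approximation) and ignores the KV cache, activations, and framework overhead, all of which add to the real footprint.

```python
PARAMS = 8e9         # "~8B" from the table; the exact count is not given here
BYTES_PER_PARAM = 2  # bfloat16 stores each value in 16 bits

weight_bytes = PARAMS * BYTES_PER_PARAM
weight_gib = weight_bytes / 2**30
print(f"Approximate weight memory: {weight_gib:.1f} GiB")  # prints 14.9 GiB
```
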
## Related Resources

- **Paper**: [UniRRM: Unified Reasoning Reward Models Across Languages and Evaluation Paradigms](https://openreview.net/forum?id=laiK6TlhL2) (ICML 2026)
- **UniRRM-14B**: [SUSTech-NLP/UniRRM-14B](https://huggingface.co/SUSTech-NLP/UniRRM-14B)
- **MixReward Dataset**: [SUSTech-NLP/MixReward](https://huggingface.co/datasets/SUSTech-NLP/MixReward) (64,528 samples, 103 languages)
- **UniRRM-SFT Dataset**: [SUSTech-NLP/UniRRM-SFT](https://huggingface.co/datasets/SUSTech-NLP/UniRRM-SFT) (35,749 SFT samples)
- **UniRRM-RL Dataset**: [SUSTech-NLP/UniRRM-RL](https://huggingface.co/datasets/SUSTech-NLP/UniRRM-RL) (32,832 RL samples)

## Citation

```bibtex
@inproceedings{
  anonymous2026unirrm,
  title={Uni{RRM}: Unified Reasoning Reward Models Across Languages and Evaluation Paradigms},
  author={Anonymous},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=laiK6TlhL2}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).