---
license: apache-2.0
model_type:
- reward-model
domain:
- nlp
language:
- en
- zh
metrics:
- spearman
tags:
- evaluator
- reward-model
- rebuttal
tools:
- vllm
- openai api
---
# Rebuttal-RM 🏅

## 1 Introduction

Rebuttal-RM is an open-source scorer that emulates human reviewers when judging an author’s reply to a peer-review comment.
### 1.1 Purpose

- Produce four rubric-aligned 0-to-10 scores:
  - Attitude (tone & professionalism)
  - Clarity (logical flow)
  - Persuasiveness (strength of evidence)
  - Constructiveness (actionable improvement)
- Return a short textual explanation for transparency.
- Serve both as (i) a public benchmark and (ii) the reward signal for RebuttalAgent RL.
### 1.2 Model I/O

```
Input : {relevant paper chunks} + {full review R_i} + {target comment} + {candidate response}
Output: { "score": {Attitude, Clarity, Persuasiveness, Constructiveness}, "explanation": "..." }
```
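To make the schema concrete, here is a minimal Python sketch of a request payload and of parsing the model's JSON reply. The field names in `request` are illustrative placeholders, not an API the model card defines; only the output keys (`score`, the four dimensions, `explanation`) come from the schema above.

```python
import json

# Hypothetical request payload mirroring the documented input parts.
request = {
    "paper_chunks": "...relevant paper fragments...",
    "full_review": "...full review R_i...",
    "target_comment": "...the specific reviewer comment...",
    "candidate_response": "...the author's rebuttal text...",
}

# Example of the JSON object the model is expected to return.
raw_output = """{
  "score": {"Attitude": 8, "Clarity": 7, "Persuasiveness": 8, "Constructiveness": 9},
  "explanation": "The response is polite and cites newly added experiments."
}"""

result = json.loads(raw_output)
assert set(result["score"]) == {"Attitude", "Clarity", "Persuasiveness", "Constructiveness"}
print(result["score"]["Attitude"])  # → 8
```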
### 1.3 Model Recipe

- Backbone: Qwen-3-8B-Chat
- Supervised fine-tuning on the RM_Bench dataset

The resulting model offers reproducible, high-fidelity evaluation.
## 2 Performance (agreement with human ratings)

**Table – Consistency between automatic scores and human ratings.** Higher values indicate better agreement.
| Scoring Model | Attitude r | Attitude β | Attitude f | Clarity r | Clarity β | Clarity f | Persuasiveness r | Persuasiveness β | Persuasiveness f | Constructiveness r | Constructiveness β | Constructiveness f | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-3-8B | 0.718 | 0.672 | 0.620 | 0.609 | 0.568 | 0.710 | 0.622 | 0.577 | 0.690 | 0.718 | 0.745 | 0.720 | 0.664 |
| Llama-3-8B | 0.297 | 0.347 | 0.540 | 0.158 | 0.047 | 0.380 | 0.272 | 0.245 | 0.560 | 0.424 | 0.457 | 0.460 | 0.349 |
| GLM-4-9B | 0.420 | 0.475 | 0.460 | 0.467 | 0.436 | 0.730 | 0.369 | 0.361 | 0.700 | 0.561 | 0.519 | 0.570 | 0.506 |
| GPT-4.1 | 0.743 | 0.712 | 0.800 | 0.739 | 0.671 | 0.750 | 0.779 | 0.763 | 0.740 | 0.804 | 0.756 | 0.680 | 0.745 |
| DeepSeek-R1 | 0.646 | 0.633 | 0.790 | 0.708 | 0.615 | 0.760 | 0.710 | 0.664 | 0.720 | 0.742 | 0.701 | 0.620 | 0.705 |
| DeepSeek-V3 | 0.699 | 0.733 | 0.710 | 0.687 | 0.578 | 0.740 | 0.697 | 0.652 | 0.770 | 0.771 | 0.719 | 0.750 | 0.692 |
| Gemini-2.5 | 0.620 | 0.509 | 0.750 | 0.605 | 0.593 | 0.540 | 0.627 | 0.607 | 0.520 | 0.711 | 0.705 | 0.610 | 0.616 |
| Claude-3.5 | 0.569 | 0.635 | 0.720 | 0.704 | 0.670 | 0.680 | 0.706 | 0.686 | 0.670 | 0.753 | 0.738 | 0.630 | 0.680 |
| Rebuttal-RM | 0.839 | 0.828 | 0.910 | 0.753 | 0.677 | 0.790 | 0.821 | 0.801 | 0.820 | 0.839 | 0.835 | 0.810 | 0.812 |
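The `metrics` field of this card lists Spearman correlation, the standard statistic for rank agreement between automatic and human scores. As an illustration of what the agreement numbers measure (not the evaluation script itself), here is a self-contained Spearman implementation using average ranks for ties:

```python
def spearman(xs, ys):
    """Spearman rank correlation with average ranks for tied values."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            # Extend j over a block of tied values.
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank for the tied block
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Perfectly monotone ratings agree with coefficient 1.0.
print(round(spearman([1, 2, 3, 4], [10, 20, 30, 40]), 3))  # → 1.0
```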
## 3 Deployment / Usage

### 3.1 Run with vLLM (OpenAI protocol)

```shell
pip install "vllm>=0.4.2"

python -m vllm.entrypoints.openai.api_server \
  --model Zhitao-He/Rebuttal-RM \
  --dtype auto \
  --port 8000
```
### 3.2 Query via the official `openai` SDK

```python
import openai

# Point the SDK at the local vLLM server; no real key is needed.
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1"

# Fill these in with your own data.
system_prompt = "..."  # default system prompt, see Section 3.3
Full_Review_Content = "..."
Target_Comment = "..."
Relevant_Paper_Fragment = "..."
response = "..."

user_prompt = f"""
The whole review content is: {Full_Review_Content}
The target comment content is: {Target_Comment}.
The best relevant paper fragment is: {Relevant_Paper_Fragment}.
The response is: {response}
"""

resp = openai.chat.completions.create(
    model="Zhitao-He/Rebuttal-RM",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    temperature=0.7,
)
print(resp.choices[0].message.content)
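Because the system prompt instructs the model to emit strict JSON, the reply can be parsed directly with `json.loads`. A small validation helper (the key names come from the prompt's output format; the helper itself, `parse_scores`, is illustrative) might look like:

```python
import json

def parse_scores(reply_text):
    """Parse the model's JSON reply and sanity-check the four rubric scores."""
    data = json.loads(reply_text)
    scores = data["score"]
    for dim in ("Attitude", "Clarity", "Persuasiveness", "Constructiveness"):
        value = scores[dim]
        if not (isinstance(value, int) and 0 <= value <= 10):
            raise ValueError(f"{dim} score out of range: {value!r}")
    return scores

# Example reply in the format the system prompt requires.
example = (
    '{"score": {"Attitude": 9, "Clarity": 8, "Persuasiveness": 8, '
    '"Constructiveness": 7}, '
    '"score_explanation": "Polite tone; cites added ablations."}'
)
print(parse_scores(example)["Clarity"])  # → 8
```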
### 3.3 Default System Prompt (`system_prompt`)

You are a seasoned academic reviewer and response optimization expert. Your task is to evaluate the quality of the response based on the review comments, paper fragments, and the authors' responses. Please strictly follow the requirements below, and output only the score and score explanation.

Input variables:
- review_content: Complete content of the review comments.
- similar_paper_fragments: Best paper fragment most relevant to the comment.
- comment: Specific segment of the review comments.
- original_response: The authors' original response text to the comment.

Your task: Based on the input information, output only a JSON object containing the following two items.

Scoring standard (score range: 0-10):
- 0: Wholly Ineffective
- 1-2: Perfunctory
- 3-4: Unconvincing
- 5-6: Addresses Some Concerns
- 7-8: Exceptional
- 9-10: Outstanding

score: Four-dimensional score breakdown, each ranging from 0-10:
- Attitude: The tone and professionalism of the response.
- Clarity: The logic, structure, and focus of the response.
- Persuasiveness: The effectiveness of the argumentation and evidence support.
- Constructiveness: The commitment to revisions and specific actions taken.

score_explanation: A brief explanation of each score, specifically citing key points from the response text that reflect the scores and any shortcomings.

Output requirements:
- Only output the JSON object; do not include any other characters or explanations.
- The scoring must be reasonable, and the score explanation must clearly reference the original text that reflects the score.
- All output must be in formal, polite academic English.
- Your output must be strictly valid JSON that can be directly loaded by json.loads().

Output format example:

{ "score": { "Attitude": <int>,
             "Clarity": <int>,
             "Persuasiveness": <int>,
             "Constructiveness": <int> },
  "score_explanation": <explanation for your given score> }
## 4 Citation

```bibtex
@inproceedings{he2025rebuttalagent,
  title       = {RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind},
  author      = {Zhitao He and Zongwei Lyu and Wuzhenhai Dai and Yi R. (May) Fung},
  year        = {2025},
  institution = {Hong Kong University of Science and Technology},
  url         = {https://arxiv.org/abs/YYMM.NNNNN}
}
```