Files changed (1)
README.md +5 -24
README.md CHANGED
@@ -19,32 +19,13 @@ tools:
  ---
  # Rebuttal-RM 🏅
  ## Introduction
- Rebuttal-RM is an open-source scorer that emulates human reviewers when judging an author’s reply to a peer-review comment.
-
- #### 1.1 Purpose
- * Produce four rubric-aligned 0-to-10 scores
-   1. Attitude (tone & professionalism)
-   2. Clarity (logical flow)
-   3. Persuasiveness (strength of evidence)
-   4. Constructiveness (actionable improvement)
- * Return a short textual explanation for transparency.
- * Serve both as (i) a public benchmark and (ii) the reward signal for RebuttalAgent RL.
-
- #### 1.2 Model I/O
- Input : {relevant paper chunks} + {full review Ri} + {target comment} + {candidate response}
- Output: {
-   "score": {Attitude, Clarity, Persuasiveness, Constructiveness},
-   "explanation": "..."
- }
-
- #### 1.3 Model Recipe
- * Backbone: **Qwen-3-8B-Chat**
- * Supervised fine-tuning on the RM_Bench dataset
-
- The resulting model offers reproducible, high-fidelity evaluation
-
- ---

+ **Rebuttal-RM** is a scoring model trained to automatically assess author responses in light of the target comment and its supporting context, with the explicit goal of matching human-reviewer preferences. The reward model, denoted **GRM**, receives as input the retrieved evidence chunks **CE**, the current review **R_i**, the target comment **c_target**, and a candidate reply **r_response**; it returns a vector of rubric-aligned scores *s* together with an explanatory rationale *e*. Formally,
+
+ $$
+ (s, e) = \mathrm{GRM}\bigl(\{p_j\}_{p_j \in \text{CE}},\; R_i,\; c_{\text{target}},\; r_{\text{response}}\bigr).
+ $$
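+
+ A minimal sketch of this scoring interface is shown below (the `score_rebuttal` helper, the prompt template, and the JSON output schema are illustrative assumptions, not a released API):
+
+ ```python
+ import json
+
+ def score_rebuttal(model, tokenizer, chunks, review, comment, response):
+     """Query the reward model once and parse (scores, explanation).
+
+     `model` / `tokenizer` are a Hugging Face-style causal LM pair; the
+     prompt below is a simplified stand-in for the actual template.
+     """
+     prompt = (
+         "Paper context:\n" + "\n\n".join(chunks)    # evidence chunks CE
+         + "\n\nFull review:\n" + review             # R_i
+         + "\n\nTarget comment:\n" + comment         # c_target
+         + "\n\nCandidate response:\n" + response    # r_response
+         + "\n\nRate Attitude, Clarity, Persuasiveness, and Constructiveness"
+         + " from 0 to 10, then give a short explanation. Reply as JSON"
+         + " with keys 'score' and 'explanation'."
+     )
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+     out = model.generate(**inputs, max_new_tokens=256)
+     completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
+                                   skip_special_tokens=True)
+     parsed = json.loads(completion)   # {"score": {...}, "explanation": "..."}
+     return parsed["score"], parsed["explanation"]
+ ```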
+
+ To obtain a robust evaluator, we curate **102K** training instances drawn from three sources: (i) 12,000 original author rebuttals that serve as a human baseline, (ii) GPT-4.1-refined answers representing an upper quality bound, and (iii) diverse model-generated replies (e.g., Qwen-2.5-3B, Claude-3.5) to broaden stylistic coverage. Using **Qwen-3-8B** as the backbone, we fine-tune on this corpus to yield the final Rebuttal-RM. As reported in Table 1, Rebuttal-RM achieves the strongest agreement with expert annotators, posting an average correlation of 0.812 and outperforming GPT-4.1 and DeepSeek-R1 by 9.0% and 15.2%, respectively.
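+
+ Each training instance presumably pairs the model input with a target judgment; one hypothetical record is sketched below (field names and score values are illustrative, not the released schema):
+
+ ```python
+ # One assumed supervised fine-tuning record for Rebuttal-RM.
+ example = {
+     "input": {
+         "chunks": ["<retrieved paper excerpt>"],   # evidence CE
+         "review": "<full review R_i>",
+         "comment": "<target comment c_target>",
+         "response": "<candidate reply r_response>",
+     },
+     "target": {
+         "score": {"Attitude": 8, "Clarity": 7,
+                   "Persuasiveness": 6, "Constructiveness": 7},
+         "explanation": "Polite and well organized, but the added "
+                        "experiment only partially addresses the concern.",
+     },
+ }
+ ```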

  ## 2 Performance (agreement with human ratings)