Update README.md (#3), opened by LYUZongwei
- openai api

---

# Rebuttal-RM 🏅

## 1. Introduction
**Rebuttal-RM** is a scoring model trained to automatically assess author responses in light of the target comment and its supporting context, with the explicit goal of matching human-reviewer preferences. The reward model, denoted **GRM**, takes as input the retrieved evidence chunks **CE**, the current review **R_i**, the target comment **c_target**, and a candidate reply **r_response**; it returns a vector of rubric-aligned scores *s* together with an explanatory rationale *e*. Formally,

$$
(s,e)\;=\;\mathrm{GRM}\!\bigl(\,\sum_{p_j\in\text{CE}} p_j,\; R_i,\; c_{\text{target}},\; r_{\text{response}}\bigr).
$$

To obtain a robust evaluator we curate **102K** training instances drawn from three sources: (i) 12,000 original author rebuttals that serve as a human baseline, (ii) GPT-4.1-refined answers representing an upper quality bound, and (iii) diverse model-generated replies (e.g., Qwen-2.5-3B, Claude-3.5) to broaden stylistic coverage. Using **Qwen-3-8B** as the backbone, we fine-tune on this corpus to yield the final Rebuttal-RM. As reported in Table 1, Rebuttal-RM achieves the strongest agreement with expert annotators, posting an average correlation of 0.812 and outperforming GPT-4.1 and DeepSeek-R1 by 9.0% and 15.2%, respectively.
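The scoring interface described above can be sketched in a few lines of Python. Everything below is illustrative: the function name, section labels, and prompt layout are assumptions, since the repository's actual inference code and prompt template are not shown in this excerpt.

```python
# Illustrative sketch only: the real Rebuttal-RM prompt template is not
# shown in this excerpt; all names and section labels here are assumptions.
def build_grm_input(evidence_chunks, review, target_comment, response):
    """Assemble the GRM input: the concatenated evidence chunks CE,
    the current review R_i, the target comment, and the candidate reply."""
    context = "\n\n".join(evidence_chunks)  # the sum over p_j in CE
    return (
        f"[Evidence]\n{context}\n\n"
        f"[Review]\n{review}\n\n"
        f"[Target comment]\n{target_comment}\n\n"
        f"[Author response]\n{response}"
    )

prompt = build_grm_input(
    ["chunk about dataset size", "chunk about baselines"],
    "The paper lacks ablations.",
    "Please justify the choice of backbone.",
    "We chose the 8B backbone because ...",
)
```

The assembled string would then be fed to the fine-tuned backbone, which returns the rubric scores and the rationale.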

## 2. Performance (agreement with human ratings; higher = better)

| Scoring Model | Attitude r | Attitude β | Attitude f | Clarity r | Clarity β | Clarity f | Persuasiveness r | Persuasiveness β | Persuasiveness f | Constructiveness r | Constructiveness β | Constructiveness f | Avg |
|-------------- |:---------:|:----------:|:----------:|:---------:|:----------:|:----------:|:-----------------:|:----------------:|:----------------:|:-------------------:|:-------------------:|:------------------:|:---:|
| Qwen-3-8B | 0.718 | 0.672 | 0.620 | 0.609 | 0.568 | 0.710 | 0.622 | 0.577 | 0.690 | 0.718 | 0.745 | 0.720 | 0.664 |
| **Rebuttal-RM**| **0.839** | **0.828** | **0.910** | **0.753** | **0.677** | **0.790** | **0.821** | **0.801** | **0.820** | **0.839** | **0.835** | **0.810** | **0.812** |
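This excerpt does not define the r, β, and f statistics used in the table. Assuming r is a plain Pearson correlation between automatic scores and human ratings (an assumption, not something the excerpt states), it can be computed with nothing beyond the standard library:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between automatic scores xs and human ratings ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: five rebuttals rated by the model and by a human.
model_scores = [4, 3, 5, 2, 4]
human_scores = [4, 3, 4, 2, 5]
r = pearson_r(model_scores, human_scores)
```

A value near 1 indicates the automatic scorer ranks rebuttals the same way the human annotator does.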

## 3. Deployment / Usage

### 3.1 Run with vLLM (OpenAI protocol)
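Once the model is served, it can be queried over vLLM's OpenAI-compatible chat endpoint. The sketch below only builds the request payload; the base URL, served model name, and prompt text are placeholders, since the repository's actual serving command is elided from this excerpt.

```python
import json

# Placeholder values: adjust to match your own `vllm serve` invocation.
BASE_URL = "http://localhost:8000/v1"   # assumed default vLLM address
MODEL_NAME = "Rebuttal-RM"              # assumed served model name

def build_chat_request(prompt: str) -> dict:
    """Payload for POST {BASE_URL}/chat/completions (OpenAI protocol)."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic scoring
    }

payload = build_chat_request("…evidence, review, target comment, response…")
body = json.dumps(payload)  # send with any HTTP client, e.g. urllib or requests
```

Temperature 0 is a natural choice here because a reward model should score the same rebuttal identically across runs.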

Output format example:

```
"Constructiveness": <int> },
"score_explanation": <explanation for your given score>}"""
```
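The model's reply is a JSON object in the format shown above. Since the outer key holding the four rubric scores is not fully visible in this excerpt, the sketch below finds the dict-valued field rather than hard-coding its name (the `"score"` key in the example reply is an assumption; only `score_explanation` and the four rubric names appear in the excerpt):

```python
import json

def parse_grm_output(text: str):
    """Split a GRM reply into (scores, explanation).
    The rubric scores are the dict-valued field; `score_explanation`
    is named in the output format above."""
    obj = json.loads(text)
    scores = next(v for v in obj.values() if isinstance(v, dict))
    return scores, obj.get("score_explanation", "")

# Example reply; the outer "score" key and all values are invented for illustration.
reply = ('{"score": {"Attitude": 4, "Clarity": 3, "Persuasiveness": 4, '
         '"Constructiveness": 5}, "score_explanation": "Polite and well grounded."}')
scores, why = parse_grm_output(reply)
```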

## 4. Citation

```
@inproceedings{he2025rebuttalagent,
  title = {RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind},
```