<h1 align="center">🚀 Remedy-R: Generative Reasoning Models for MT Evaluation</h1>
<p align="center"><b>Reasoning-driven, reinforcement-trained metrics for machine translation evaluation</b></p>

---

## ✨ What is Remedy-R?

**Remedy-R** is a family of **reasoning-based MT evaluation models** trained with **reinforcement learning with verifiable rewards (RLVR)** on **pairwise human translation preferences**.

Instead of directly regressing a scalar score, Remedy-R:

- Generates **step-by-step analyses** of *accuracy*, *fluency*, and *completeness*.
- Outputs a **final numeric score in [0, 100]** that can be parsed and used like a standard metric.
- Is trained with **PPO + rule-based rewards** that check whether predicted preferences match human rankings and calibrate scores toward human ratings.
- Supports both **reference-based** and **reference-free (QE)** evaluation.
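
As a rough illustration of two of the ideas above (a parseable final score and a rule-based preference check), here is a minimal Python sketch. It is not the training code: `extract_final_score` and `pairwise_reward` are hypothetical helpers, the "last number in the text" convention is an assumption about the output format, and the sketch covers only the preference-agreement part of the reward, not score calibration.

```python
import re
from typing import Optional

# Assumed convention (not confirmed by the project): the last number in the
# generated text is the final score. The real output format may differ.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_score(generation: str) -> Optional[float]:
    """Return the last number in the text if it lies in [0, 100], else None."""
    matches = _NUMBER.findall(generation)
    if not matches:
        return None
    score = float(matches[-1])
    return score if 0.0 <= score <= 100.0 else None


def pairwise_reward(
    score_a: Optional[float],
    score_b: Optional[float],
    human_prefers_a: bool,
) -> float:
    """Return 1.0 if the predicted ranking agrees with the human ranking."""
    if score_a is None or score_b is None or score_a == score_b:
        return 0.0  # unparseable output or a tie earns no reward
    return 1.0 if (score_a > score_b) == human_prefers_a else 0.0
```

A reward of this kind is "verifiable" in the sense that it can be computed from the model output and the human label alone, without a learned reward model.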

On WMT22–24 and MSLC24-style OOD stress tests, Remedy-R:

- **Surpasses** strong LLM-as-judge methods.
- Matches top-performing scalar SOTA metrics.
- Remains **robust under OOD conditions** such as source copy, empty translations, wrong language, and mixed-language outputs.
- Enables **Test-Time Scaling (TTS)** via multiple reasoning passes, improving segment-level meta-evaluation.
- Powers **Remedy-R Agent**, an evaluate–revise pipeline that improves translations for diverse base systems.
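
For intuition, some of the degenerate outputs listed above can be mimicked with a few lines of Python. This is only an illustrative sketch, not the MSLC24 construction: `ood_hypotheses` is a hypothetical helper, and the mixed-language variant simply splices the second half of the source onto the first half of the translation. Wrong-language outputs need text in a third language and are left out.

```python
from typing import Dict


def ood_hypotheses(source: str, hypothesis: str) -> Dict[str, str]:
    """Build degenerate 'translations' for a single segment (illustrative only)."""
    src_tokens = source.split()
    hyp_tokens = hypothesis.split()
    # First half of the translation followed by the second half of the source.
    mixed = hyp_tokens[: len(hyp_tokens) // 2] + src_tokens[len(src_tokens) // 2 :]
    return {
        "source_copy": source,      # the system copied the input verbatim
        "empty": "",                # the system produced nothing
        "mixed_language": " ".join(mixed),
    }
```

A robust metric should assign clearly low scores to all three variants, whatever the quality of the original hypothesis.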

---

## 📚 Contents

- [✨ What is Remedy-R?](#-what-is-remedy-r)
- [📚 Contents](#-contents)
- [📦 Installation](#-installation)
  - [From PyPI (unavailable for now)](#from-pypi-unavailable-for-now)
  - [From source](#from-source)
- [⚙️ Requirements](#️-requirements)
- [🧠 Model Zoo](#-model-zoo)
- [🚀 Quickstart](#-quickstart)
  - [CLI: Local vLLM Inference](#cli-local-vllm-inference)
  - [Reference-Free / QE Mode](#reference-free--qe-mode)
  - [Test-Time Scaling (TTS)](#test-time-scaling-tts)
- [🌐 Optional: vLLM Online Serving](#-optional-vllm-online-serving)
- [📄 Outputs](#-outputs)
- [📚 Citation](#-citation)

---

## 📦 Installation

### From PyPI (unavailable for now)

```bash
pip install --upgrade pip
pip install remedy-r-mt-eval
```

This installs the `remedy_r` package and the CLI entrypoint `remedy-r-score` (plus related tools).

### From source

```bash
git clone https://github.com/Smu-Tan/Remedy-R.git
cd Remedy-R
pip install -e .
```

---

## ⚙️ Requirements

Core runtime dependencies (see `pyproject.toml` for exact versions):

* Python ≥ 3.10 (tested mostly with 3.12)
* [PyTorch](https://pytorch.org/) with GPU support
* [vLLM](https://github.com/vllm-project/vllm) for efficient batched inference
* `transformers`, `numpy`, `pandas`, `tqdm`

You also need:

* At least **1 GPU (16–24 GB)** for 7B models
* More memory/GPUs for 14B/32B models or large batch sizes

---

## 🧠 Model Zoo

Remedy-R models are hosted on HuggingFace under `ShaomuTan/`:

| Model        | Size | Base model  | Mode     | Link                        |
| ------------ | ---- | ----------- | -------- | --------------------------- |
| Remedy-R-7B  | 7B   | Qwen2.5-7B  | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-7B)  |
| Remedy-R-14B | 14B  | Qwen2.5-14B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-14B) |
| Remedy-R-32B | 32B  | Qwen2.5-32B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-32B) |

You can cache them locally:

```bash
HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download ShaomuTan/Remedy-R-14B \
  --local-dir Models/Remedy-R-14B
```

Then point `--model` to either the **HF ID** or the **local path**.

---

## 🚀 Quickstart

### CLI: Local vLLM Inference

The main entrypoint is:

```bash
remedy-r-score \
  --model "$MODEL_CHECKPOINT" \
  --save_metric_name "$METRIC_NAME" \
  --output_dir "$DATA_DIR" \
  --max-tokens "$MAX_TOKENS" \
  --tp_size "$TP_SIZE" \
  --dp_size "$DP_SIZE" \
  --temperature "$DEC_TEMPERATURE" \
  --repetition_penalty "$REPETITION_PENALTY" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --max-model-len "$MAX_MODEL_LEN" \
  --seed "$SEED" \
  --src-file "$SRC_FILE" \
  --mt-file  "$MT_FILE" \
  --lp "$LP"
```

**Key arguments**

* `--model`: HF repo ID or local checkpoint path
* `--src-file`: Source sentences (one per line)
* `--mt-file`: MT outputs (one per line)
* `--ref-file`: Reference translations (optional; enables reference-based mode)
* `--lp`: Language-pair code (e.g., `en-de`)
* `--output_dir`: Output directory
* `--temperature`: Generation temperature
* `--tp_size`: Tensor-parallel size
* `--dp_size`: Data-parallel size
* `--num-seqs`: Maximum number of sequences processed in parallel per step
* `--max-tokens`: Maximum number of generated tokens
* `--gpu-memory-utilization`: Fraction of GPU memory vLLM may use (e.g., `0.9`)
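
Because the source, MT, and reference files are read one segment per line, mismatched line counts are an easy mistake to make. The sketch below is a hypothetical pre-flight check (it is not part of `remedy_r`) that confirms the files are aligned before you launch a scoring run.

```python
from pathlib import Path
from typing import List, Optional


def read_lines(path: str) -> List[str]:
    """Read a one-segment-per-line file; empty lines count as empty segments."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def check_aligned(src_file: str, mt_file: str, ref_file: Optional[str] = None) -> int:
    """Return the shared segment count, or raise ValueError if the files differ."""
    counts = {src_file: len(read_lines(src_file)), mt_file: len(read_lines(mt_file))}
    if ref_file is not None:
        counts[ref_file] = len(read_lines(ref_file))
    if len(set(counts.values())) != 1:
        raise ValueError(f"Line counts differ: {counts}")
    return counts[src_file]
```

For example, `check_aligned("./testcase/en.src", "./testcase/en-de.hyp")` returns the number of segments when both files line up and raises otherwise.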


You can also invoke the CLI as a Python module:

```bash
python -m remedy_r.cli.score \
  --model ShaomuTan/Remedy-R-7B \
  ...
```

---

### Reference-Free / QE Mode

If you don’t have references, just drop `--ref-file` and add `--no-ref`:

```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-7B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --no-ref \
  --src-lang en \
  --tgt-lang de \
  --save-dir ./testcase \
  --cache-dir ./Models
```

The prompt automatically switches to **reference-free quality estimation** while keeping the same [0, 100] score scale.

---

### Test-Time Scaling (TTS)

Remedy-R supports **Test-Time Scaling** by averaging multiple independent evaluation passes with different seeds:

```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-14B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --src-lang en --tgt-lang de \
  --save-dir ./testcase_tts \
  --TTS \
  --best-of-n 4 \
  --seed 42
```

* `--TTS`: Enable multi-pass evaluation
* `--best-of-n`: Number of independent passes (e.g., 2–6)
* Scores are averaged across passes; per-pass scores can optionally be logged.

TTS typically improves **segment-level pairwise accuracy** and stabilizes scores for difficult segments.
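
Conceptually, the aggregation amounts to a per-segment mean over passes. Here is a minimal sketch, assuming each pass yields one score per segment and that an unparseable generation is represented as `None` and skipped. Both are assumptions, and `average_passes` is a hypothetical helper, not the package's implementation.

```python
from statistics import mean
from typing import List, Optional, Sequence


def average_passes(
    pass_scores: Sequence[Sequence[Optional[float]]],
) -> List[Optional[float]]:
    """Average per-segment scores over passes, ignoring None entries."""
    n_segments = len(pass_scores[0])
    if any(len(p) != n_segments for p in pass_scores):
        raise ValueError("every pass must score the same number of segments")
    averaged: List[Optional[float]] = []
    for per_segment in zip(*pass_scores):
        valid = [s for s in per_segment if s is not None]
        # A segment with no valid score in any pass stays None.
        averaged.append(mean(valid) if valid else None)
    return averaged
```

With two passes scoring two segments as `[80.0, 40.0]` and `[90.0, 50.0]`, the averaged scores are `[85.0, 45.0]`.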

---

## 🌐 Optional: vLLM Online Serving

To avoid re-loading the model for every scoring run, you can:

1. **Start a local vLLM server** (OpenAI-compatible):

```bash
remedy-r-serve \
  --model ShaomuTan/Remedy-R-14B \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```

2. **Score via the server**:

```bash
remedy-r-score \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --lp en-de \
  --save_metric_name Remedy-R-14B \
  --save-dir ./testcase_server \
  --server-url http://localhost:8000/v1
```

Internally this reuses the same Remedy-R prompting and scoring logic, but routes generation requests through the running vLLM server instead of instantiating `LLM()` in every process.
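
Before pointing `--server-url` at a server, it can help to confirm that it is reachable. The sketch below assumes `remedy-r-serve` exposes vLLM's standard OpenAI-compatible API, which includes a `/v1/models` listing endpoint; `list_served_models` is a hypothetical helper built on the standard library only.

```python
import json
import urllib.request
from typing import List


def models_url(server_url: str) -> str:
    """Build the model-listing URL from a base URL such as http://localhost:8000/v1."""
    return server_url.rstrip("/") + "/models"


def list_served_models(server_url: str, timeout: float = 5.0) -> List[str]:
    """Return the served model IDs, or [] if the server cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(server_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return []
    if not isinstance(payload, dict):
        return []
    return [m["id"] for m in payload.get("data", []) if isinstance(m, dict) and "id" in m]
```

An empty list means the server is not ready (or not serving the expected API), so a scoring run against it would fail.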

---

## 📄 Outputs

For each language pair `SRC-TGT`, Remedy-R writes:

* `results.jsonl`
* `segment_scores.tsv`
* `system_score.txt`

---

## 📚 Citation

If you use Remedy-R or this codebase, please cite:

arXiv preprint coming soon.