Reasoning-driven, reinforcement-trained metrics for machine translation evaluation
---

## What is Remedy-R?

**Remedy-R** is a family of **reasoning-based MT evaluation models** trained with **reinforcement learning via verifiable rewards (RLVR)** on **pairwise human translation preferences**.

Instead of directly regressing a scalar score, Remedy-R:

- Generates **step-by-step analyses** of *accuracy*, *fluency*, and *completeness*.
- Outputs a **final numeric score in [0, 100]** that can be parsed and used like a standard metric.
- Is trained with **PPO + rule-based rewards** that check whether predicted preferences match human rankings and calibrate scores toward human ratings.
- Supports both **reference-based** and **reference-free (QE)** evaluation.

On WMT22–24 and MSLC24-style OOD stress tests, Remedy-R:

- **Surpasses** strong LLM-as-judge methods.
- Matches top-performing scalar SOTA metrics.
- Remains **robust under OOD conditions** such as source copy, empty translations, wrong language, and mixed-language outputs.
- Enables **Test-Time Scaling (TTS)** via multiple reasoning passes, improving segment-level meta-evaluation.
- Powers **Remedy-R Agent**, an evaluate–revise pipeline that improves translations for diverse base systems.

---

## Contents

- [What is Remedy-R?](#what-is-remedy-r)
- [Contents](#contents)
- [Installation](#installation)
  - [From PyPI (unavailable for now)](#from-pypi-unavailable-for-now)
  - [From source](#from-source)
- [Requirements](#requirements)
- [Model Zoo](#model-zoo)
- [Quickstart](#quickstart)
  - [CLI: Local vLLM Inference](#cli-local-vllm-inference)
  - [Reference-Free / QE Mode](#reference-free--qe-mode)
  - [Test-Time Scaling (TTS)](#test-time-scaling-tts)
- [Optional: vLLM Online Serving](#optional-vllm-online-serving)
- [Outputs](#outputs)
- [Citation](#citation)

---

## Installation

### From PyPI (unavailable for now)

```bash
pip install --upgrade pip
pip install remedy-r-mt-eval
```

This installs the `remedy_r` package and the CLI entrypoint `remedy-r-score` (plus related tools).
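
To confirm that an installation took effect, a short Python check can look for the `remedy_r` package and the `remedy-r-score` entrypoint named above. This is only a sketch: `check_install` is a hypothetical helper, not part of the Remedy-R codebase, and it relies on the standard library alone.

```python
import importlib.util
import shutil


def check_install(package: str = "remedy_r", cli: str = "remedy-r-score") -> dict:
    """Report whether the package is importable and the CLI is on PATH.

    Hypothetical helper for illustration; not shipped with Remedy-R.
    """
    return {
        # find_spec returns None when a top-level package cannot be located
        "package_found": importlib.util.find_spec(package) is not None,
        # shutil.which returns None when no executable of that name is on PATH
        "cli_found": shutil.which(cli) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_install().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If either entry reports `no`, the environment used for `pip install` is probably not the one currently active.
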
### From source

```bash
git clone https://github.com/Smu-Tan/Remedy-R.git
cd Remedy-R
pip install -e .
```

---

## Requirements

Core runtime dependencies (see `pyproject.toml` for exact versions):

* Python ≥ 3.10 (tested mostly with 3.12)
* [PyTorch](https://pytorch.org/) with GPU support
* [vLLM](https://github.com/vllm-project/vllm) for efficient batched inference
* `transformers`, `numpy`, `pandas`, `tqdm`

You also need:

* At least **1 GPU (16–24 GB)** for 7B models
* More memory/GPUs for 14B/32B models or large batch sizes

---

## Model Zoo

Remedy-R models are hosted on HuggingFace under `ShaomuTan/`:

| Model        | Size | Base model  | Mode     | Link                                                            |
| ------------ | ---- | ----------- | -------- | --------------------------------------------------------------- |
| Remedy-R-7B  | 7B   | Qwen2.5-7B  | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-7B)  |
| Remedy-R-14B | 14B  | Qwen2.5-14B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-14B) |
| Remedy-R-32B | 32B  | Qwen2.5-32B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-32B) |

You can cache them locally:

```bash
HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download ShaomuTan/Remedy-R-14B \
  --local-dir Models/Remedy-R-14B
```

Then point `--model` to either the **HF ID** or the **local path**.
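
When scoring is scripted, a small helper can pick between the two automatically. The sketch below assumes checkpoints are cached under `Models/<model name>`, mirroring the download command above; `resolve_model` is a hypothetical name and not part of Remedy-R.

```python
from pathlib import Path


def resolve_model(repo_id: str, models_dir: str = "Models") -> str:
    """Return the cached checkpoint directory if present, else the HF repo ID.

    Assumes the local layout <models_dir>/<model name>, e.g. Models/Remedy-R-14B.
    """
    local_dir = Path(models_dir) / repo_id.split("/")[-1]
    return str(local_dir) if local_dir.is_dir() else repo_id


if __name__ == "__main__":
    # The printed value is what you would pass to --model.
    print(resolve_model("ShaomuTan/Remedy-R-14B"))
```

Falling back to the repo ID means the first run still works without a local copy, at the cost of a download.
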
---

## Quickstart

### CLI: Local vLLM Inference

The main entrypoint is:

```bash
remedy-r-score \
  --model "$MODEL_CHECKPOINT" \
  --save_metric_name "$METRIC_NAME" \
  --output_dir "$DATA_DIR" \
  --max-tokens "$MAX_TOKENS" \
  --tp_size "$TP_SIZE" \
  --dp_size "$DP_SIZE" \
  --temperature "$DEC_TEMPERATURE" \
  --repetition_penalty "$REPETITION_PENALTY" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --max-model-len "$MAX_MODEL_LEN" \
  --seed "$SEED" \
  --src-file "$SRC_FILE" \
  --mt-file "$MT_FILE" \
  --lp "$LP"
```

**Key arguments**

* `--model` : HF repo ID or local checkpoint
* `--src-file` : Source sentences (one per line)
* `--mt-file` : MT outputs (one per line)
* `--ref-file` : Reference translations (optional; enables ref-based mode)
* `--lp` : Language-pair code (e.g., `en-de`)
* `--output_dir` : Output folder
* `--temperature` : Generation temperature
* `--tp_size` : Tensor parallel size
* `--dp_size` : Data parallel size
* `--num-seqs` : Max parallel sequences per step
* `--max-tokens` : Maximum number of generated tokens
* `--gpu-memory-utilization` : Fraction of GPU memory vLLM may use (e.g., 0.9)

You can also call the CLI via Python:

```bash
python -m remedy_r.cli.score \
  --model ShaomuTan/Remedy-R-7B \
  ...
```

---

### Reference-Free / QE Mode

If you don't have references, just drop `--ref-file` and add `--no-ref`:

```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-7B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --no-ref \
  --src-lang en \
  --tgt-lang de \
  --save-dir ./testcase \
  --cache-dir ./Models
```

The prompt automatically switches to **reference-free quality estimation** while keeping the same [0, 100] score scale.
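
Since the source, MT, and optional reference files each hold one segment per line, they are presumably paired by line index, so a mismatch in line counts would pair segments with the wrong translations. A check along the following lines could be run before scoring. It is a sketch with hypothetical helper names (`read_segments`, `check_alignment`) and assumes UTF-8 files; it is not part of the Remedy-R CLI.

```python
from typing import Optional


def read_segments(path: str) -> list[str]:
    """Read one segment per line, keeping empty lines so indices stay aligned."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def check_alignment(src_file: str, mt_file: str, ref_file: Optional[str] = None) -> int:
    """Return the shared segment count, or raise if the files differ in length."""
    counts = {
        "src": len(read_segments(src_file)),
        "mt": len(read_segments(mt_file)),
    }
    if ref_file is not None:  # reference-free (QE) runs have no reference file
        counts["ref"] = len(read_segments(ref_file))
    if len(set(counts.values())) != 1:
        raise ValueError(f"Line counts differ: {counts}")
    return counts["src"]
```

Empty lines are kept on purpose: an empty translation is still a segment, and dropping it would shift every later pair.
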
---

### Test-Time Scaling (TTS)

Remedy-R supports **Test-Time Scaling** by averaging multiple independent evaluation passes with different seeds:

```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-14B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --src-lang en --tgt-lang de \
  --save-dir ./testcase_tts \
  --TTS \
  --best-of-n 4 \
  --seed 42
```

* `--TTS` : Enable multi-pass evaluation
* `--best-of-n` : Number of independent passes (e.g., 2–6)
* Scores are averaged across passes; the per-pass scores can optionally be logged.

TTS typically improves **segment-level pairwise accuracy** and stabilizes scores for difficult segments.

---

## Optional: vLLM Online Serving

To avoid reloading the model for every scoring run, you can:

1. **Start a local vLLM server** (OpenAI-compatible):

   ```bash
   remedy-r-serve \
     --model ShaomuTan/Remedy-R-14B \
     --port 8000 \
     --max-model-len 4096 \
     --gpu-memory-utilization 0.9
   ```

2. **Score via the server**:

   ```bash
   remedy-r-score \
     --src-file ./testcase/en.src \
     --mt-file ./testcase/en-de.hyp \
     --ref-file ./testcase/de.ref \
     --lp en-de \
     --save_metric_name Remedy-R-14B \
     --save-dir ./testcase_server \
     --server-url http://localhost:8000/v1
   ```

Internally this reuses the same Remedy-R prompting and scoring logic, but routes generation requests through the running vLLM server instead of instantiating `LLM()` in every process.

---

## Outputs

For each language pair `SRC-TGT`, Remedy-R writes:

* `results.jsonl`
* `segment_scores.tsv`
* `system_score.txt`

## Citation

If you use Remedy-R or this codebase, please cite:

arXiv preprint coming soon.