---
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- google/gemma-2-9b-it |
|
|
tags: |
|
|
- translation |
|
|
--- |
|
|
|
|
|
|
|
|
# 🏆 ReMedy: Machine Translation Evaluation via Reward Modeling
|
|
|
|
|
<div align="left"> |
|
|
|
|
|
**Learning High-Quality Machine Translation Evaluation from Human Preferences with Reward Modeling** |
|
|
|
|
|
</div> |
|
|
|
|
|
[arXiv Paper](https://arxiv.org/abs/2504.13630) · [PyPI Package](https://pypi.org/project/remedy-mt-eval/) · [GitHub Stars](https://github.com/Smu-Tan/Remedy/stargazers) · [License](./LICENSE)
|
|
|
|
|
--- |
|
|
|
|
|
## ✨ About ReMedy
|
|
|
|
|
**ReMedy** is a new state-of-the-art machine translation (MT) evaluation framework that reframes the task as **reward modeling** rather than direct regression. Instead of relying on noisy human scores, ReMedy learns from **pairwise human preferences**, leading to better alignment with human judgments. |
|
|
|
|
|
- 🏆 **State-of-the-art accuracy** on WMT22–24 (39 language pairs, 111 systems)
- ⚖️ **Segment- and system-level** evaluation, outperforming GPT-4, PaLM-540B, Finetuned-PaLM2, MetricX-13B, and XCOMET
- 🔍 **More robust** on low-quality and out-of-domain translations (ACES, MSLC benchmarks)
- 🔧 Can be used as a **reward model** in RLHF pipelines to improve MT systems
|
|
|
|
|
> ReMedy demonstrates that **reward modeling with pairwise preferences** offers a more reliable and human-aligned approach for MT evaluation. |
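
The training recipe behind this is standard pairwise reward modeling. As a minimal sketch (the paper's exact formulation may differ in details such as margin terms or tie handling): given a source $x$ with a human-preferred translation $y^{+}$ and a dispreferred one $y^{-}$, a scalar reward model $r_\theta$ is trained with the Bradley–Terry logistic preference loss:

```latex
% Pairwise preference (Bradley-Terry) objective: a sketch of the general
% recipe, not the verbatim ReMedy loss.
% r_theta(x, y) is the scalar reward assigned to translation y of source x.
\mathcal{L}(\theta) =
  -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}
  \left[ \log \sigma\big( r_\theta(x, y^{+}) - r_\theta(x, y^{-}) \big) \right]
```

Ranking pairs this way sidesteps the scale inconsistencies of absolute human ratings: the model only has to agree with annotators on which translation is better.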
|
|
|
|
|
--- |
|
|
|
|
|
## 📋 Contents
|
|
|
|
|
- [📦 Quick Installation](#-quick-installation)
- [⚙️ Requirements](#️-requirements)
- [🚀 Usage](#-usage)
- [💾 Download Models](#-download-remedy-models)
- [🔹 Basic Usage](#-basic-usage)
- [🔹 Reference-Free Mode](#-reference-free-mode)
- [📄 Output Files](#-output-files)
- [⚙️ Full Argument List](#️-full-argument-list)
- [🔧 Model Variants](#-model-variants)
- [📊 Reproducing WMT Results](#-reproducing-wmt-results)
- [📖 Citation](#-citation)
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 Quick Installation
|
|
|
|
|
> ReMedy requires **Python ≥ 3.10** and uses **[vLLM](https://github.com/vllm-project/vllm)** for fast inference.
|
|
|
|
|
### ✅ Recommended: Install via pip
|
|
|
|
|
```bash
pip install remedy-mt-eval

# Clone the repo to get the test files and WMT scripts referenced below
git clone https://github.com/Smu-Tan/Remedy
cd Remedy
```
|
|
|
|
|
### 🛠️ Install from Source
|
|
|
|
|
```bash
git clone https://github.com/Smu-Tan/Remedy
cd Remedy
pip install -e .
```
|
|
|
|
|
### 📝 Install via Poetry
|
|
|
|
|
```bash
git clone https://github.com/Smu-Tan/Remedy
cd Remedy
poetry install
```
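
After any of the installation routes above, a quick sanity check can use the `--version` and `--list_languages` flags (both documented in the argument reference below):

```bash
# Print the installed ReMedy version
remedy-score --version

# Show the language codes accepted by --src_lang / --tgt_lang
remedy-score --list_languages
```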
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Requirements
|
|
|
|
|
- `Python` ≥ 3.10
- `transformers` ≥ 4.51.1
- `vllm` ≥ 0.8.5
- `torch` ≥ 2.6.0
|
|
- *(See `pyproject.toml` for full dependencies)* |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Usage
|
|
|
|
|
### 💾 Download ReMedy Models
|
|
|
|
|
Before running evaluation, download a model from the Hugging Face Hub:
|
|
|
|
|
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download ShaomuTan/ReMedy-9B-22 --local-dir Models/remedy-9B-22
```
|
|
|
|
|
You can replace `ReMedy-9B-22` with other variants such as `ReMedy-9B-23` (see the [Model Variants](#-model-variants) table), for example:
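
To fetch several checkpoints in one go, a small loop works (model IDs as listed in the [Model Variants](#-model-variants) table; local directory names follow the convention used above):

```bash
# Download multiple ReMedy checkpoints into Models/
for m in 9B-22 9B-23 9B-24; do
    HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
        "ShaomuTan/ReMedy-${m}" --local-dir "Models/remedy-${m}"
done
```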
|
|
|
|
|
--- |
|
|
|
|
|
### 🔹 Basic Usage
|
|
|
|
|
```bash
remedy-score \
    --model Models/remedy-9B-22 \
    --src_file testcase/en.src \
    --mt_file testcase/en-de.hyp \
    --ref_file testcase/de.ref \
    --src_lang en --tgt_lang de \
    --cache_dir $CACHE_DIR \
    --save_dir testcase \
    --num_gpus 4 \
    --calibrate
```
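
A note on `--calibrate`: the flag additionally produces the calibrated scores described under [Output Files](#-output-files). One consistent reading of the example output below, stated here purely as an assumption rather than a description of the implementation, is per-segment temperature scaling of the raw reward $r_i$, with $T$ reported as `calibration_temp`:

```latex
% Assumed relation between the three reported score types (an assumption,
% not confirmed by the source): raw reward r_i, sigmoid score s_i,
% temperature-calibrated score c_i with temperature T.
s_i = \sigma(r_i), \qquad
c_i = \sigma\!\left( \frac{r_i}{T} \right), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Corpus-level numbers would then be averages of the per-segment values.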
|
|
|
|
|
### 🔹 Reference-Free Mode (Quality Estimation)
|
|
|
|
|
```bash
remedy-score \
    --model Models/remedy-9B-22 \
    --src_file testcase/en.src \
    --mt_file testcase/en-de.hyp \
    --no_ref \
    --src_lang en --tgt_lang de \
    --cache_dir $CACHE_DIR \
    --save_dir testcase \
    --num_gpus 4 \
    --calibrate
```
|
|
|
|
|
--- |
|
|
|
|
|
## 📄 Output Files
|
|
|
|
|
- `src-tgt_raw_scores.txt`
- `src-tgt_sigmoid_scores.txt`
- `src-tgt_calibration_scores.txt`
- `src-tgt_detailed_results.tsv`
- `src-tgt_result.json`
|
|
|
|
|
Inspired by **SacreBLEU**, ReMedy reports a metric signature and JSON-formatted results to ensure transparency and comparability.
|
|
|
|
|
<details> |
|
|
<summary>📄 Example JSON Output</summary>
|
|
|
|
|
```json
{
    "metric_name": "remedy-9B-22",
    "raw_score": 4.502863049214531,
    "sigmoid_score": 0.9613502018042875,
    "calibration_score": 0.9029647169507162,
    "calibration_temp": 1.7999999999999998,
    "signature": "metric_name:remedy-9B-22|lp:en-de|ref:yes|version:0.1.1",
    "language_pair": "en-de",
    "source_language": "en",
    "target_language": "de",
    "segments": 2037,
    "version": "0.1.1",
    "args": {
        "src_file": "testcase/en.src",
        "mt_file": "testcase/en-de.hyp",
        "src_lang": "en",
        "tgt_lang": "de",
        "model": "Models/remedy-9B-22",
        "cache_dir": "Models",
        "save_dir": "testcase",
        "ref_file": "testcase/de.ref",
        "no_ref": false,
        "calibrate": true,
        "num_gpus": 4,
        "num_seqs": 256,
        "max_length": 4096,
        "enable_truncate": false,
        "version": false,
        "list_languages": false
    }
}
```
|
|
|
|
|
</details> |
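
For a quick look at the results from the test case above (filenames instantiate the `src-tgt_*` pattern as `en-de_*` here; this sketch assumes one score per line in the `.txt` files and that `jq` is installed):

```bash
# Pair the first few hypotheses with their per-segment sigmoid scores
paste testcase/en-de.hyp testcase/en-de_sigmoid_scores.txt | head -n 5

# Extract the corpus-level scores and the metric signature from the JSON summary
jq '{sigmoid_score, calibration_score, signature}' testcase/en-de_result.json
```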
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Full Argument List
|
|
|
|
|
<details> |
|
|
<summary>📋 Show CLI Arguments</summary>
|
|
|
|
|
### 🔸 Required
|
|
|
|
|
```text
--src_file    # Path to source file
--mt_file     # Path to MT output file
--src_lang    # Source language code
--tgt_lang    # Target language code
--model       # Model path or HuggingFace ID
--save_dir    # Output directory
```
|
|
|
|
|
### 🔸 Optional
|
|
|
|
|
```text
--ref_file         # Reference file path
--no_ref           # Reference-free mode
--cache_dir        # Cache directory
--calibrate        # Enable calibration
--num_gpus         # Number of GPUs
--num_seqs         # Number of sequences (default: 256)
--max_length       # Max token length (default: 4096)
--enable_truncate  # Truncate sequences
--version          # Print version
--list_languages   # List supported languages
```
|
|
|
|
|
</details> |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Model Variants
|
|
|
|
|
| Model        | Size | Base Model | Ref/QE | Download |
|--------------|------|------------|--------|----------|
| ReMedy-2B    | 2B   | Gemma-2-2B | Both   | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/ReMedy-2B) |
| ReMedy-9B-22 | 9B   | Gemma-2-9B | Both   | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/ReMedy-9B-22) |
| ReMedy-9B-23 | 9B   | Gemma-2-9B | Both   | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/ReMedy-9B-23) |
| ReMedy-9B-24 | 9B   | Gemma-2-9B | Both   | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/ReMedy-9B-24) |
|
|
|
|
|
> More variants coming soon... |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Reproducing WMT Results
|
|
|
|
|
<details> |
|
|
<summary>Click to show instructions for reproducing the WMT22–24 evaluation</summary>
|
|
|
|
|
### 1. Install `mt-metrics-eval` |
|
|
|
|
|
```bash
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
```
|
|
|
|
|
### 2. Download WMT evaluation data |
|
|
|
|
|
```bash
python3 -m mt_metrics_eval.mtme --download
```
|
|
|
|
|
### 3. Run ReMedy on WMT data |
|
|
|
|
|
```bash
bash wmt/wmt22.sh
bash wmt/wmt23.sh
bash wmt/wmt24.sh
```
|
|
|
|
|
> 📌 Results are comparable with the other metrics reported in the WMT shared tasks.
|
|
|
|
|
</details> |
|
|
|
|
|
--- |
|
|
|
|
|
## 📖 Citation
|
|
|
|
|
If you use **ReMedy**, please cite the following paper: |
|
|
|
|
|
```bibtex
@inproceedings{tan-monz-2025-remedy,
    title = "{R}e{M}edy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling",
    author = "Tan, Shaomu and Monz, Christof",
    editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.217/",
    doi = "10.18653/v1/2025.emnlp-main.217",
    pages = "4370--4387",
    ISBN = "979-8-89176-332-6",
    abstract = "A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations."
}
```
|
|
|
|
|
--- |