---
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- google/gemma-2-9b-it |
|
|
tags: |
|
|
- translation |
|
|
--- |
|
|
|
|
|
|
|
|
# 🏆 ReMedy: Machine Translation Evaluation via Reward Modeling
|
|
|
|
|
<div align="left"> |
|
|
|
|
|
**Learning High-Quality Machine Translation Evaluation from Human Preferences with Reward Modeling** |
|
|
|
|
|
</div> |
|
|
|
|
|
[arXiv Paper](https://arxiv.org/abs/2504.13630) · [PyPI Package](https://pypi.org/project/remedy-mt-eval/) · [GitHub Stars](https://github.com/Smu-Tan/Remedy/stargazers) · [License](./LICENSE)
|
|
|
|
|
--- |
|
|
|
|
|
## ✨ About ReMedy
|
|
|
|
|
**ReMedy** is a new state-of-the-art machine translation (MT) evaluation framework that reframes the task as **reward modeling** rather than direct regression. Instead of relying on noisy human scores, ReMedy learns from **pairwise human preferences**, leading to better alignment with human judgments. |
|
|
|
|
|
- 🏆 **State-of-the-art accuracy** on WMT22–24 (39 language pairs, 111 systems)
- ⚖️ **Segment- and system-level** evaluation, outperforming GPT-4, PaLM-540B, Finetuned-PaLM2, MetricX-13B, and XCOMET
- 🔍 **More robust** on low-quality and out-of-domain translations (ACES, MSLC benchmarks)
- 🔧 Can be used as a **reward model** in RLHF pipelines to improve MT systems
|
|
|
|
|
> ReMedy demonstrates that **reward modeling with pairwise preferences** offers a more reliable and human-aligned approach for MT evaluation. |
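
The training recipe behind this is standard pairwise reward modeling. As a minimal sketch (the paper's exact formulation may differ in details such as margin terms or tie handling): given a source $x$ with a human-preferred translation $y^{+}$ and a dispreferred one $y^{-}$, a scalar reward model $r_\theta$ is trained with the Bradley–Terry logistic preference loss:

```latex
% Pairwise preference (Bradley-Terry) objective: a sketch of the general
% recipe, not the verbatim ReMedy loss.
% r_theta(x, y) is the scalar reward assigned to translation y of source x.
\mathcal{L}(\theta) =
  -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}
  \left[ \log \sigma\big( r_\theta(x, y^{+}) - r_\theta(x, y^{-}) \big) \right]
```

Ranking pairs this way sidesteps the scale inconsistencies of absolute human ratings: the model only has to agree with annotators on which translation is better.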
|
|
|
|
|
--- |
|
|
|
|
|
## 📋 Contents
|
|
|
|
|
- [📦 Quick Installation](#-quick-installation)
- [⚙️ Requirements](#️-requirements)
- [🚀 Usage](#-usage)
- [💾 Download Models](#-download-remedy-models)
- [🔹 Basic Usage](#-basic-usage)
- [🔹 Reference-Free Mode](#-reference-free-mode)
- [📄 Output Files](#-output-files)
- [⚙️ Full Argument List](#️-full-argument-list)
- [🔧 Model Variants](#-model-variants)
- [📊 Reproducing WMT Results](#-reproducing-wmt-results)
- [📖 Citation](#-citation)
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 Quick Installation
|
|
|
|
|
> ReMedy requires **Python ≥ 3.10** and uses **[vLLM](https://github.com/vllm-project/vllm)** for fast inference.
|
|
|
|
|
### ✅ Recommended: Install via pip
|
|
|
|
|
```bash
pip install remedy-mt-eval

# Clone the repo to get the test files and WMT scripts referenced below
git clone https://github.com/Smu-Tan/Remedy
cd Remedy
```
|
|
|
|
|
### 🛠️ Install from Source
|
|
|
|
|
```bash
git clone https://github.com/Smu-Tan/Remedy
cd Remedy
pip install -e .
```
|
|
|
|
|
### 📝 Install via Poetry
|
|
|
|
|
```bash
git clone https://github.com/Smu-Tan/Remedy
cd Remedy
poetry install
```
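
After any of the installation routes above, a quick sanity check can use the `--version` and `--list_languages` flags (both documented in the argument reference below):

```bash
# Print the installed ReMedy version
remedy-score --version

# Show the language codes accepted by --src_lang / --tgt_lang
remedy-score --list_languages
```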
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Requirements
|
|
|
|
|
- `Python` ≥ 3.10
- `transformers` ≥ 4.51.1
- `vllm` ≥ 0.8.5
- `torch` ≥ 2.6.0
|
|
- *(See `pyproject.toml` for full dependencies)* |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Usage
|
|
|
|
|
### 💾 Download ReMedy Models
|
|
|
|
|
Before running evaluation, download a model from the Hugging Face Hub:
|
|
|
|
|
```bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download ShaomuTan/ReMedy-9B-22 --local-dir Models/remedy-9B-22
```
|
|
|
|
|
You can replace `ReMedy-9B-22` with other variants such as `ReMedy-9B-23` (see the [Model Variants](#-model-variants) table), for example:
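
To fetch several checkpoints in one go, a small loop works (model IDs as listed in the [Model Variants](#-model-variants) table; local directory names follow the convention used above):

```bash
# Download multiple ReMedy checkpoints into Models/
for m in 9B-22 9B-23 9B-24; do
    HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
        "ShaomuTan/ReMedy-${m}" --local-dir "Models/remedy-${m}"
done
```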
|
|
|
|
|
--- |
|
|
|
|
|
### 🔹 Basic Usage
|
|
|
|
|
```bash
remedy-score \
    --model Models/remedy-9B-22 \
    --src_file testcase/en.src \
    --mt_file testcase/en-de.hyp \
    --ref_file testcase/de.ref \
    --src_lang en --tgt_lang de \
    --cache_dir $CACHE_DIR \
    --save_dir testcase \
    --num_gpus 4 \
    --calibrate
```
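
A note on `--calibrate`: the flag additionally produces the calibrated scores described under [Output Files](#-output-files). One consistent reading of the example output below, stated here purely as an assumption rather than a description of the implementation, is per-segment temperature scaling of the raw reward $r_i$, with $T$ reported as `calibration_temp`:

```latex
% Assumed relation between the three reported score types (an assumption,
% not confirmed by the source): raw reward r_i, sigmoid score s_i,
% temperature-calibrated score c_i with temperature T.
s_i = \sigma(r_i), \qquad
c_i = \sigma\!\left( \frac{r_i}{T} \right), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Corpus-level numbers would then be averages of the per-segment values.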
|
|
|
|
|
### 🔹 Reference-Free Mode (Quality Estimation)
|
|
|
|
|
```bash
remedy-score \
    --model Models/remedy-9B-22 \
    --src_file testcase/en.src \
    --mt_file testcase/en-de.hyp \
    --no_ref \
    --src_lang en --tgt_lang de \
    --cache_dir $CACHE_DIR \
    --save_dir testcase \
    --num_gpus 4 \
    --calibrate
```
|
|
|
|
|
--- |
|
|
|
|
|
## 📄 Output Files
|
|
|
|
|
- `src-tgt_raw_scores.txt`
- `src-tgt_sigmoid_scores.txt`
- `src-tgt_calibration_scores.txt`
- `src-tgt_detailed_results.tsv`
- `src-tgt_result.json`
|
|
|
|
|
Inspired by **SacreBLEU**, ReMedy reports a metric signature and JSON-formatted results to ensure transparency and comparability.
|
|
|
|
|
<details> |
|
|
<summary>📄 Example JSON Output</summary>
|
|
|
|
|
```json
{
    "metric_name": "remedy-9B-22",
    "raw_score": 4.502863049214531,
    "sigmoid_score": 0.9613502018042875,
    "calibration_score": 0.9029647169507162,
    "calibration_temp": 1.7999999999999998,
    "signature": "metric_name:remedy-9B-22|lp:en-de|ref:yes|version:0.1.1",
    "language_pair": "en-de",
    "source_language": "en",
    "target_language": "de",
    "segments": 2037,
    "version": "0.1.1",
    "args": {
        "src_file": "testcase/en.src",
        "mt_file": "testcase/en-de.hyp",
        "src_lang": "en",
        "tgt_lang": "de",
        "model": "Models/remedy-9B-22",
        "cache_dir": "Models",
        "save_dir": "testcase",
        "ref_file": "testcase/de.ref",
        "no_ref": false,
        "calibrate": true,
        "num_gpus": 4,
        "num_seqs": 256,
        "max_length": 4096,
        "enable_truncate": false,
        "version": false,
        "list_languages": false
    }
}
```
|
|
|
|
|
</details> |
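
For a quick look at the results from the test case above (filenames instantiate the `src-tgt_*` pattern as `en-de_*` here; this sketch assumes one score per line in the `.txt` files and that `jq` is installed):

```bash
# Pair the first few hypotheses with their per-segment sigmoid scores
paste testcase/en-de.hyp testcase/en-de_sigmoid_scores.txt | head -n 5

# Extract the corpus-level scores and the metric signature from the JSON summary
jq '{sigmoid_score, calibration_score, signature}' testcase/en-de_result.json
```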
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Full Argument List
|
|
|
|
|
<details> |
|
|
<summary>📋 Show CLI Arguments</summary>
|
|
|
|
|
### 🔸 Required
|
|
|
|
|
```text
--src_file    # Path to source file
--mt_file     # Path to MT output file
--src_lang    # Source language code
--tgt_lang    # Target language code
--model       # Model path or HuggingFace ID
--save_dir    # Output directory
```
|
|
|
|
|
### 🔸 Optional
|
|
|
|
|
```text
--ref_file         # Reference file path
--no_ref           # Reference-free mode
--cache_dir        # Cache directory
--calibrate        # Enable calibration
--num_gpus         # Number of GPUs
--num_seqs         # Number of sequences (default: 256)
--max_length       # Max token length (default: 4096)
--enable_truncate  # Truncate sequences
--version          # Print version
--list_languages   # List supported languages
```
|
|
|
|
|
</details> |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Model Variants
|
|
|
|
|
| Model        | Size | Base Model | Ref/QE | Download |
|--------------|------|------------|--------|----------|
| ReMedy-2B    | 2B   | Gemma-2-2B | Both   | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/ReMedy-2B) |
| ReMedy-9B-22 | 9B   | Gemma-2-9B | Both   | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/ReMedy-9B-22) |
| ReMedy-9B-23 | 9B   | Gemma-2-9B | Both   | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/ReMedy-9B-23) |
| ReMedy-9B-24 | 9B   | Gemma-2-9B | Both   | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/ReMedy-9B-24) |
|
|
|
|
|
> More variants coming soon... |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Reproducing WMT Results
|
|
|
|
|
<details> |
|
|
<summary>Click to show instructions for reproducing the WMT22–24 evaluation</summary>
|
|
|
|
|
### 1. Install `mt-metrics-eval` |
|
|
|
|
|
```bash
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
```
|
|
|
|
|
### 2. Download WMT evaluation data |
|
|
|
|
|
```bash
python3 -m mt_metrics_eval.mtme --download
```
|
|
|
|
|
### 3. Run ReMedy on WMT data |
|
|
|
|
|
```bash
bash wmt/wmt22.sh
bash wmt/wmt23.sh
bash wmt/wmt24.sh
```
|
|
|
|
|
> 📌 Results are comparable with the other metrics reported in the WMT shared tasks.
|
|
|
|
|
</details> |
|
|
|
|
|
--- |
|
|
|
|
|
## 📖 Citation
|
|
|
|
|
If you use **ReMedy**, please cite the following paper: |
|
|
|
|
|
```bibtex
@inproceedings{tan-monz-2025-remedy,
    title = "{R}e{M}edy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling",
    author = "Tan, Shaomu and Monz, Christof",
    editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.217/",
    doi = "10.18653/v1/2025.emnlp-main.217",
    pages = "4370--4387",
    ISBN = "979-8-89176-332-6",
    abstract = "A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations."
}
```
|
|
|
|
|
--- |