xCOMET-XL-TR (v2) — English↔Turkish MT evaluation

A fine-tune of Unbabel's xCOMET-XL (3.5B parameters) specialised for reference-based English↔Turkish machine-translation evaluation, with 5 lexical features fused through a small residual bottleneck (following "BLEU Meets COMET", Glushkova et al. 2023).

Given a (source, machine-translation, reference) triplet it returns a quality score, roughly in [0, 1] — higher is better.

Weights: this repo — xcomet-xl-tr-v2.bf16.ckpt (BF16, ~7 GB).
Code: [code repository — anonymized for review] — you need both this repo and the code repo.
Anonymized repository URL: https://anonymous.4open.science/r/TurCOMET-8A79/README.md

Performance

On a held-out Turkish WMT-DA test split (1,768 rows) it beats baseline xCOMET-XL on every correlation metric against human DA scores:

Metric	Baseline xCOMET-XL	This model
Pearson (regression-only)	0.473	0.547
Spearman (regression-only)	0.531	0.562
Kendall (regression-only)	0.368	0.394
Pearson (full predict_step)	0.479	0.515
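
The three coefficients in the table can be computed from paired lists of model scores and human DA scores. The sketch below is a dependency-free illustration, not the evaluation script behind the numbers above: the function names are hypothetical, and Kendall is computed as tau-b, which is an assumption about the variant reported.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

def kendall_tau_b(xs, ys):
    """Kendall tau-b over all pairs (quadratic, fine for a few thousand rows)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both denominators
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif (dx > 0) == (dy > 0):
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (untied + tied_x_only) * (untied + tied_y_only)
    )

# Made-up scores for illustration only (not taken from the test split).
model_scores = [0.91, 0.40, 0.75, 0.22, 0.63]
human_scores = [0.88, 0.35, 0.60, 0.30, 0.70]
print(pearson(model_scores, human_scores),
      spearman(model_scores, human_scores),
      kendall_tau_b(model_scores, human_scores))
```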

It cleanly ranks hand-crafted PERFECT > GOOD > BAD > TERRIBLE translations (8/8 groups) in both directions, with PERFECT translations scoring ~0.94–0.98.
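
The group-level check described above amounts to a strict ordering test over the four tiers. The sketch below assumes each group is a mapping from tier name to score; the helper names and the example scores are hypothetical and are not the model's actual outputs.

```python
TIERS = ("PERFECT", "GOOD", "BAD", "TERRIBLE")  # best to worst

def is_cleanly_ranked(group):
    """True if the scores strictly decrease from PERFECT down to TERRIBLE."""
    scores = [group[tier] for tier in TIERS]
    return all(better > worse for better, worse in zip(scores, scores[1:]))

def count_clean_groups(groups):
    """Number of groups whose four tiers are in strict order."""
    return sum(is_cleanly_ranked(group) for group in groups)

# Made-up scores for illustration only.
groups = [
    {"PERFECT": 0.96, "GOOD": 0.81, "BAD": 0.42, "TERRIBLE": 0.11},
    {"PERFECT": 0.94, "GOOD": 0.55, "BAD": 0.60, "TERRIBLE": 0.20},  # GOOD < BAD
]
print(f"{count_clean_groups(groups)}/{len(groups)} groups cleanly ranked")
# prints: 1/2 groups cleanly ranked
```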

Quick start

1. Install (Python ≥ 3.10, CUDA GPU recommended). The order matters — unbabel-comet over-pins numpy/protobuf, so they are restored afterwards:

# clone the (anonymized) code repository, then:
cd xcomet-xl-tr
bash install.sh
# install.sh runs:
#   pip install "unbabel-comet==2.2.7" "sentence-transformers>=3.0.0" \
#               "sacrebleu>=2.4.0" "zemberek-python>=0.2.3" "huggingface_hub>=0.23"
#   pip install "numpy==2.0.2" "protobuf>=5.29,<6"

2. Authenticate (this model is private):

huggingface-cli login        # or:  export HF_TOKEN=hf_xxx

3. Score a triplet:

from huggingface_hub import hf_hub_download
from xcomet_tr import load_model, score

ckpt = hf_hub_download("XCOMETTR/XCOMET-XL-TR", "xcomet-xl-tr-v2.bf16.ckpt")
model = load_model(ckpt)          # bf16, GPU if available

# (source, machine_translation, reference, direction)  — direction: "en-tr" | "tr-en"
triplets = [
    ("Istanbul is the largest city in Turkey.",
     "İstanbul, Türkiye'nin en büyük şehridir.",
     "İstanbul, Türkiye'nin en büyük şehridir.", "en-tr"),
    ("Hava bugün çok güzel.",
     "The weather is very nice today.",
     "The weather is very nice today.", "tr-en"),
]
print(score(model, triplets))     # e.g. [0.97, 0.96]

Running python example.py in the code repo executes exactly these steps end to end.
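
If the imports in step 3 fail after installation, one thing worth checking is whether the numpy and protobuf pins from step 1 are still in place. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and only the numeric parts of each version are compared.

```python
import re
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def numeric_parts(version_text):
    """'5.29.3' -> (5, 29, 3); a suffix such as 'rc1' on a component is ignored."""
    parts = []
    for piece in version_text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def pins_ok(numpy_version, protobuf_version):
    """Check numpy==2.0.2 and protobuf>=5.29,<6, the pins listed in step 1."""
    if numpy_version is None or protobuf_version is None:
        return False
    numpy_ok = numeric_parts(numpy_version)[:3] == (2, 0, 2)
    protobuf_ok = (5, 29) <= numeric_parts(protobuf_version)[:2] < (6, 0)
    return numpy_ok and protobuf_ok

if __name__ == "__main__":
    np_v, pb_v = installed_version("numpy"), installed_version("protobuf")
    print(f"numpy={np_v} protobuf={pb_v} pins_ok={pins_ok(np_v, pb_v)}")
```

If pins_ok reports False, re-running the second pip install line from install.sh should restore the expected versions.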

How it works

XCOMETFeatures (an XCOMETMetric subclass) adds one module: a [encoder_dim + 5] → 64 → encoder_dim bottleneck whose output is added residually to the pooled sentence embedding. The bottleneck is zero-initialised, so before fine-tuning the model behaves identically to xCOMET-XL. The 5 features are: chrF++(mt,ref), LaBSE cos(src,mt), length-ratio z-score, lemma-TER (Turkish lemmatised via Zemberek), and a direction flag. Load the model via the code repo's load_model, which passes load_pretrained_weights=False so that the self-contained checkpoint needs no extra base-encoder download.
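
To make the residual fusion concrete, here is a framework-free toy version of the arithmetic. It is a sketch under stated assumptions, not the repository's XCOMETFeatures code: the tanh activation, the absence of bias terms, the choice to zero only the up-projection, the toy embedding width, and all names are illustrative.

```python
import math
import random

ENCODER_DIM = 8    # toy width; the real pooled embedding is much wider
NUM_FEATURES = 5   # chrF++, LaBSE cosine, length z-score, lemma-TER, direction
BOTTLENECK = 64

def matvec(weights, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def init_bottleneck(encoder_dim, seed=0):
    """Random down-projection, all-zero up-projection (assumed zero-init scheme)."""
    rng = random.Random(seed)
    in_dim = encoder_dim + NUM_FEATURES
    down = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(BOTTLENECK)]
    up = [[0.0] * BOTTLENECK for _ in range(encoder_dim)]
    return down, up

def fuse(pooled, features, down, up):
    """Return pooled + up(tanh(down([pooled; features])))."""
    hidden = [math.tanh(x) for x in matvec(down, pooled + features)]
    delta = matvec(up, hidden)
    return [p + d for p, d in zip(pooled, delta)]

# Made-up inputs for illustration only.
pooled = [0.1 * i for i in range(ENCODER_DIM)]
features = [0.62, 0.91, -0.30, 0.25, 1.0]
down, up = init_bottleneck(ENCODER_DIM)

# With the up-projection at zero the residual term vanishes.
print(fuse(pooled, features, down, up) == pooled)  # True
```

Because the residual term starts at zero, fine-tuning begins from the unmodified xCOMET-XL behaviour and the lexical features only influence the score once the up-projection has moved away from zero.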

Notes & limitations

BF16 — published/loaded in bfloat16 (xCOMET-XL was trained bf16-mixed); matches fp32 to 2–3 decimals.
512-token window — XLM-R-XL caps at 512 tokens; xCOMET encodes mt+src, mt+ref, and mt+src+ref, so long documents are truncated. Best used per sentence / short paragraph; for documents, score sentences and average.
Domain — fine-tuned on news-domain WMT-DA (2017–2018); expect some shift elsewhere. Turkish word-level supervision is heuristic (no human MQM spans exist for Turkish).
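
The per-sentence recommendation for long documents can be sketched as follows. This assumes the source, translation, and reference have already been split into aligned sentences, and that score_fn wraps the code repo's score call (for example lambda t: score(model, t)); the helper name score_document is hypothetical.

```python
from statistics import fmean

def score_document(score_fn, src_sents, mt_sents, ref_sents, direction):
    """Score aligned sentences individually; return (mean score, per-sentence scores)."""
    if not (len(src_sents) == len(mt_sents) == len(ref_sents)):
        raise ValueError("source, translation and reference must be sentence-aligned")
    if not src_sents:
        raise ValueError("document has no sentences")
    # Same 4-tuple layout as in the quick start: (source, mt, reference, direction).
    triplets = [
        (src, mt, ref, direction)
        for src, mt, ref in zip(src_sents, mt_sents, ref_sents)
    ]
    sentence_scores = list(score_fn(triplets))
    return fmean(sentence_scores), sentence_scores

# Usage with the real model (not run here):
# doc_score, per_sentence = score_document(
#     lambda t: score(model, t), src_sents, mt_sents, ref_sents, "en-tr")
```

A plain mean weights every sentence equally regardless of length; a length-weighted mean is an alternative if short sentences should count for less.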

License

CC-BY-NC-SA 4.0, inherited from Unbabel/XCOMET-XL. Non-commercial use only; derivatives must use the same license. Built on Unbabel/COMET; lexical fusion from Glushkova et al. 2023; Turkish morphology via Zemberek.
