KinyCOMET — Translation Quality Estimation for Kinyarwanda ↔ English

Model Description

KinyCOMET is a neural translation quality estimation model for Kinyarwanda-English translation pairs. It addresses the poor correlation between BLEU scores and human judgment in Kinyarwanda translation evaluation, achieving a 0.75 Pearson correlation with human assessments.

The model was trained on 4,323 human-annotated translation pairs collected from 15 linguistics students using Direct Assessment scoring aligned with WMT evaluation standards.

Model Variants & Performance

| Variant | Base Model | Pearson | Spearman | Kendall's τ | MAE |
|---|---|---|---|---|---|
| KinyCOMET-Unbabel | Unbabel/wmt22-comet-da | 0.75 | 0.59 | 0.42 | 0.07 |
| KinyCOMET-XLM | XLM-RoBERTa-large | 0.73 | 0.50 | 0.35 | 0.07 |
| Unbabel (baseline) | wmt22-comet-da | 0.54 | 0.55 | 0.39 | 0.17 |
| AfriCOMET STL 1.1 | AfriCOMET base | 0.52 | 0.35 | 0.24 | 0.18 |
| BLEU | N/A | 0.30 | 0.34 | 0.23 | 0.62 |
| chrF | N/A | 0.38 | 0.30 | 0.21 | 0.34 |
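
For reference, the correlations in this table are standard statistics computed between metric scores and human DA scores. A minimal sketch with scipy; the paired score lists here are made-up placeholders, not project data:

from scipy import stats

# Hypothetical paired scores: one metric score and one human DA score per segment
metric_scores = [0.91, 0.42, 0.77, 0.63, 0.88]
human_scores = [0.95, 0.35, 0.70, 0.60, 0.90]

pearson, _ = stats.pearsonr(metric_scores, human_scores)
spearman, _ = stats.spearmanr(metric_scores, human_scores)
kendall, _ = stats.kendalltau(metric_scores, human_scores)
print(f"Pearson={pearson:.2f}  Spearman={spearman:.2f}  Kendall tau={kendall:.2f}")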

Both KinyCOMET variants outperform existing baselines. KinyCOMET-Unbabel shows the strongest overall correlation, while performance varies by translation direction; see the directional analysis below.

Comprehensive Evaluation Results

Overall Performance (Both Directions)

  • Pearson correlation: 0.75 (KinyCOMET-Unbabel) vs. 0.30 (BLEU), 2.5× the correlation
  • Spearman correlation: 0.59 vs. 0.34 (BLEU), a 73% improvement
  • Mean absolute error: 0.07 vs. 0.62 (BLEU), an 89% reduction

Directional Analysis

| Direction | Model | Pearson | Spearman | Kendall's τ |
|---|---|---|---|---|
| English → Kinyarwanda | KinyCOMET-XLM | 0.76 | 0.52 | 0.37 |
| English → Kinyarwanda | KinyCOMET-Unbabel | 0.75 | 0.56 | 0.40 |
| Kinyarwanda → English | KinyCOMET-Unbabel | 0.63 | 0.47 | 0.33 |
| Kinyarwanda → English | KinyCOMET-XLM | 0.37 | 0.29 | 0.21 |

Key Insights:

  • English→Kinyarwanda consistently outperforms Kinyarwanda→English across all metrics
  • Both KinyCOMET variants significantly outperform the AfriCOMET baselines, even though AfriCOMET's training data includes Kinyarwanda
  • Surprisingly, the plain Unbabel baseline, which was not trained on Kinyarwanda, also outperforms the AfriCOMET variants

Installation

Make sure you have Python ≥ 3.8 and install COMET via pip:

pip install unbabel-comet

You can verify the CLI tool is installed:

which comet-score
# should print something like: /usr/local/bin/comet-score

For more details on COMET, see the official documentation.
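
You can also sanity-check the Python API; note that the pip distribution is named unbabel-comet, but the package imports as comet:

# Run in a Python shell: both helpers are part of the public comet API
from comet import download_model, load_from_checkpoint
print("COMET import OK")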

Usage

Load and Use the Model in Python

Here's a simple example to score translations directly in Python:

from comet import download_model, load_from_checkpoint

# Download the public KinyCOMET checkpoint from the Hugging Face Hub, then load it
model_path = download_model("chrismazii/kinycomet_unbabel")
model = load_from_checkpoint(model_path)

# Example translations
samples = [
    {
        "src": "Umugabo ararya.",
        "mt": "The man is eating.",
        "ref": "The man is eating."
    },
    {
        "src": "Umwana arasinzira.",
        "mt": "A dog sleeps.",
        "ref": "The child is sleeping."
    }
]

# Predict scores (gpus=0 runs on CPU; set gpus=1 to use a GPU)
pred = model.predict(samples, gpus=0)
print(pred)

Output Example:

Prediction({
  'scores': [0.9899, 0.8813],
  'system_score': 0.9356
})
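
The returned Prediction object is dict-like, so segment scores can be paired with their inputs. A short follow-up sketch:

# Pair each segment-level score with its source sentence
for sample, score in zip(samples, pred["scores"]):
    print(f"{score:.4f}  {sample['src']}")
print(f"System score: {pred['system_score']:.4f}")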

Using the Command Line Interface (CLI)

You can also evaluate translations directly from the terminal.

Step 1: Create the text files

cat > source.txt <<'SRC'
Umugabo ararya.
Umwana arasinzira.
Uyu mwanya neza cyane.
SRC

cat > reference.txt <<'REF'
The man is eating.
The child is sleeping.
This place is very nice.
REF

cat > hypothesis.txt <<'HYP'
The man is eating.
A dog sleeps.
This place is very nice.
HYP

Step 2: Run KinyCOMET

comet-score -s source.txt -r reference.txt -t hypothesis.txt \
  --model chrismazii/kinycomet_unbabel --gpus 0 --to_json results.json

Step 3: View the results

cat results.json
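
If you prefer post-processing in Python, the JSON output can be loaded directly. This minimal sketch makes no assumption about the exact schema beyond it being valid JSON:

import json

# Load whatever structure comet-score wrote and pretty-print it
with open("results.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2, ensure_ascii=False))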

Score Interpretation

  • Scores range from 0 to 1: Higher scores indicate better translation quality
  • System score: Average quality across all translations
  • Segment scores: Individual quality scores for each translation pair
  • Threshold guidance: Scores above 0.8 typically indicate high-quality translations
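
As a worked example of the guidance above, here is a minimal sketch that buckets segment scores; the 0.8 cut-off comes from the guidance, while the 0.5 lower bound is an illustrative assumption:

def quality_bucket(score: float) -> str:
    # 0.8 threshold from the guidance above; 0.5 is an assumed lower bound
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

for score in [0.9899, 0.8813, 0.41]:
    print(f"{score:.4f} -> {quality_bucket(score)}")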

Training Details

Data

  • 4,323 human-annotated Kinyarwanda-English translation pairs
  • Annotations collected from 15 linguistics students
  • Direct Assessment scoring following WMT standards
  • Split: 80% train (3,497) / 10% validation (404) / 10% test (422)
  • Domains: education and tourism
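
The COMET toolkit conventionally consumes training data as a CSV file with src, mt, ref, and score columns. A hedged illustration of that layout follows; the exact preprocessing used for KinyCOMET is not documented here, and the score value below is made up:

import csv

# Illustrative layout only; field names follow COMET's usual CSV convention
rows = [
    {"src": "Umwana arasinzira.", "mt": "A dog sleeps.",
     "ref": "The child is sleeping.", "score": 0.25},
]
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["src", "mt", "ref", "score"])
    writer.writeheader()
    writer.writerows(rows)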

Model Architecture

  • Base Models: XLM-RoBERTa-large and Unbabel/wmt22-comet-da
  • Framework: COMET quality estimation framework

Training Configuration

  • Methodology: COMET framework with Direct Assessment supervision
  • Evaluation Metrics: Kendall's τ and Spearman ρ correlation with human DA scores

MT System Benchmarking Results

We evaluated several production MT systems using KinyCOMET; each cell reports the mean segment score ± standard deviation:

| MT System | Kinyarwanda→English | English→Kinyarwanda | Overall |
|---|---|---|---|
| GPT-4o | 93.10% ± 7.77 | 87.83% ± 11.15 | 90.69% ± 9.82 |
| GPT-4.1 | 93.08% ± 6.62 | 87.92% ± 10.38 | 90.75% ± 8.90 |
| Gemini Flash 2.0 | 91.46% ± 11.39 | 90.02% ± 8.92 | 90.80% ± 10.35 |
| Claude 3.7 | 92.48% ± 8.32 | 85.75% ± 11.28 | 89.43% ± 10.33 |
| NLLB-1.3B | 89.42% ± 12.04 | 83.96% ± 16.31 | 86.78% ± 14.52 |
| NLLB-600M | 88.87% ± 12.11 | 75.46% ± 28.49 | 82.71% ± 22.27 |

Key Findings:

  • LLM-based systems significantly outperform traditional neural MT
  • All systems perform better on Kinyarwanda→English than English→Kinyarwanda
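
A minimal sketch of how per-system numbers like those above can be produced: score one system's outputs with KinyCOMET and report mean ± standard deviation. The two sample segments are placeholders, not benchmark data:

from statistics import mean, stdev
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("chrismazii/kinycomet_unbabel"))

# Outputs of a single MT system (placeholder examples)
system_outputs = [
    {"src": "Umugabo ararya.", "mt": "The man is eating.", "ref": "The man is eating."},
    {"src": "Umwana arasinzira.", "mt": "The child sleeps.", "ref": "The child is sleeping."},
]
scores = model.predict(system_outputs, gpus=0)["scores"]
print(f"{mean(scores) * 100:.2f}% ± {stdev(scores) * 100:.2f}")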

Dataset Access

The training dataset is available separately. See the KinyCOMET Dataset Card for details on accessing the human-annotated quality estimation data.

Citation & Research

If you use KinyCOMET in your research, please cite:

@misc{kinycomet2025,
    title={KinyCOMET: Translation Quality Estimation for Kinyarwanda-English},
    author={Prince Chris Mazimpaka and Jan Nehring},
    year={2025},
    publisher={Hugging Face},
    howpublished={\url{https://huggingface.co/chrismazii/kinycomet_unbabel}}
}

License

This model is released under the Apache 2.0 License.

Acknowledgments

  • COMET Framework: Built on the excellent COMET quality estimation framework
  • Base Models: Leverages XLM-RoBERTa and Unbabel's WMT22 COMET-DA models
  • African NLP Community: Inspired by ongoing efforts to advance African language technologies
  • Contributors: Thanks to the 15 linguistics students and all researchers who made this work possible


Evaluation Results

Self-reported scores on the Kinyarwanda-English QE dataset:

  • Pearson correlation: 0.751
  • Spearman correlation: 0.593
  • System score: 0.896