## Evaluation Results
Evaluated on a held-out test set of 1,037 pairs (10% of total dataset), stratified by translation direction.
### Overall (mixed directions, 1,037 pairs)
| Metric | Score | Notes |
|---|---|---|
| BLEU | 8.26 | |
| chrF | 30.36 | |
| TER | 89.90 | lower is better |
| BLEU 1-gram | 36.71 | word-level accuracy |
| BLEU 2-gram | 11.50 | |
| BLEU 3-gram | 4.90 | |
| BLEU 4-gram | 2.25 | |
| Brevity penalty | 1.00 | no length penalty |
### Twi → English (384 pairs)
| Metric | Score |
|---|---|
| BLEU | 10.50 |
| chrF | 32.30 |
| TER | 87.64 |
| BLEU 1-gram | 39.32 |
| BLEU 2-gram | 14.07 |
| BLEU 3-gram | 6.61 |
| BLEU 4-gram | 3.33 |
### English → Twi (653 pairs)
| Metric | Score |
|---|---|
| BLEU | 6.32 |
| chrF | 28.74 |
| TER | 91.55 |
| BLEU 1-gram | 34.70 |
| BLEU 2-gram | 9.48 |
| BLEU 3-gram | 3.53 |
| BLEU 4-gram | 1.37 |
**Note on BLEU scores:** BLEU scores in the 8–11 range are expected and well documented for low-resource language pairs such as Twi-English. The 1-gram precision of 37–39% indicates that the model captures individual word meanings reasonably well. chrF is a more reliable metric for morphologically rich languages like Twi; scores of 28–32 reflect a functional baseline. For context, the GhanaNLP team's own Khaya model reports similar ranges on this dataset.
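To make the n-gram breakdown above concrete: BLEU is built from clipped n-gram precisions combined with a brevity penalty. The sketch below illustrates the per-order precision calculation in pure Python on toy sentences (not the actual test set; the reported scores were presumably computed with a standard toolkit such as sacreBLEU).

```python
from collections import Counter

def ngram_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision: the per-order building block of BLEU."""
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matched / total if total else 0.0

hyp = "this has led to a growing interest in snails".split()
ref = "this has led to a rising interest in snails".split()

p1 = ngram_precision(hyp, ref, 1)  # 8/9: only "growing" lacks a match
p2 = ngram_precision(hyp, ref, 2)  # 6/8: the two bigrams touching "growing" miss
```

Corpus BLEU is the brevity penalty times the geometric mean of the 1- through 4-gram precisions. Since the brevity penalty here is 1.00, the low 3-/4-gram precisions (4.90 and 2.25) are what pull the overall score down to single digits despite decent word-level accuracy.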
## Training Details
| Parameter | Value |
|---|---|
| Base model | Helsinki-NLP/opus-mt-mul-en |
| Dataset | Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT |
| Total pairs | 10,370 |
| Train pairs | 8,296 |
| Validation pairs | 1,037 |
| Test pairs | 1,037 |
| Train/Val/Test | 80% / 10% / 10% |
| Epochs | 5 |
| Batch size | 16 |
| Learning rate | 5e-5 |
| Warmup steps | 200 |
| Weight decay | 0.01 |
| Max sequence length | 128 tokens |
| Precision | fp16 (mixed precision) |
| Optimizer | AdamW |
| Hardware | NVIDIA T4 GPU (Google Colab) |
| Training time | ~25.6 minutes (1,534 seconds) |
| Global steps | 2,595 |
| Final train loss | 2.94 |
| Best checkpoint | checkpoint-2595 (epoch 5) |
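The hyperparameters above correspond to a standard Hugging Face `Seq2SeqTrainer` setup. The following is a hedged reconstruction of what the training configuration likely looked like; the authors' actual script is not published, and only the values shown in the table are taken from it. Note the step count is self-consistent: 8,296 pairs / batch size 16 ≈ 519 steps per epoch × 5 epochs = 2,595 global steps, matching the table.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the likely configuration; values mirror the table above,
# but this is a reconstruction, not the authors' actual script.
training_args = Seq2SeqTrainingArguments(
    output_dir="twi-en-marianmt",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    warmup_steps=200,
    weight_decay=0.01,
    fp16=True,                    # mixed precision on the T4 GPU
    evaluation_strategy="epoch",  # BLEU/chrF logged once per epoch
    save_strategy="epoch",
    predict_with_generate=True,   # decode with generate() during eval
    load_best_model_at_end=True,  # yields checkpoint-2595 as best
)
```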
### Training Loss Curve
| Epoch | Training Loss | Validation Loss | BLEU | chrF |
|---|---|---|---|---|
| 1 | 4.32 | — | 13.25 | 15.50 |
| 2 | 3.59 | — | 19.85 | 22.63 |
| 3 | 3.39 | — | 22.68 | 25.93 |
| 4 | 3.01 | — | 32.39 | 32.40 |
| 5 | 2.94 | — | 42.14 | 32.30 |
The validation BLEU of 42.14 at epoch 5 comes from in-training evaluation on the validation split, using the trainer's own evaluation settings. Post-training evaluation on the held-out test set yields BLEU 8.26–10.50 depending on direction, consistent with expected performance for this language pair and dataset size.
## Dataset
The GhanaNLP TWI-ENGLISH Parallel Text contains 14,875 professionally translated sentence pairs covering everyday Ghanaian contexts including civic life, health, agriculture, education and culture. Sources include Wikipedia, novels, news outlets and local linguists. The dataset includes dialectal diversity across Asante Twi, Akwapim Twi and other Akan variants spoken among the Akan people of Ghana.
Translations were produced by independent paid professionals with a focus on contextual accuracy rather than word-for-word equivalence. The training split used here (10,370 pairs) contains both Twi→English and English→Twi direction pairs, which the model handles as a single bidirectional task.
Direction breakdown in test set:
- Twi → English: 384 pairs (37%)
- English → Twi: 653 pairs (63%)
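The split sizes follow from a plain 80/10/10 partition of the 10,370 pairs. A minimal sketch with placeholder data and an assumed seed (the actual split script is not published, and this simplification ignores the direction stratification described above):

```python
import random

pairs = [("twi sentence", "english sentence")] * 10_370  # placeholder data

random.seed(42)  # assumed seed; not documented by the authors
random.shuffle(pairs)

n = len(pairs)
train = pairs[: int(0.8 * n)]             # 8,296 pairs
val = pairs[int(0.8 * n): int(0.9 * n)]   # 1,037 pairs
test = pairs[int(0.9 * n):]               # 1,037 pairs
```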
## Recommended Generation Config

Repetition loops are the most common failure mode on longer inputs; the settings below mitigate them. The loading lines and example sentence are a standard `transformers` usage sketch added for completeness.

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("ninte/twi-en-marianmt")
model = MarianMTModel.from_pretrained("ninte/twi-en-marianmt")

inputs = tokenizer("Me din de Kofi.", return_tensors="pt")

# Use these settings to reduce repetition artifacts
out = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    no_repeat_ngram_size=3,
    early_stopping=True,
    length_penalty=1.0,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Limitations

- **Mixed-direction training:** The model was trained on both translation directions simultaneously without explicit language direction tags (e.g. `>>en<<`). This limits performance compared to direction-specific models and can cause the model to occasionally output the wrong language.
- **Low-resource constraints:** With ~8,300 training pairs, the model may hallucinate on complex or domain-specific sentences, particularly for rare vocabulary or idioms.
- **Repetition artifacts:** Some outputs exhibit token repetition loops on longer inputs. Using `no_repeat_ngram_size=3` in generation significantly mitigates this.
- **Dialectal variation:** Performance may vary across Twi dialects. The dataset skews toward Asante Twi; Akwapim Twi and other variants may produce lower-quality outputs.
- **Special characters:** Twi uses characters such as `ɛ`, `ɔ`, and `ŋ` that are critical for meaning. Inputs missing these characters (e.g. typed on a standard keyboard) may produce degraded translations.
- **Dataset noise:** A subset of the training data contains English text in the Twi column and vice versa. This directional noise contributes to lower performance in the English→Twi direction (BLEU 6.32 vs 10.50 for Twi→English).
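As a practical guard against the special-character issue, a hypothetical pre-flight check (not part of the model) can flag Twi inputs that were likely typed on an ASCII keyboard. Note that short sentences may legitimately contain none of these characters, so this is a heuristic warning, not a hard filter:

```python
# Hypothetical helper: flag Twi inputs typed without the special characters
# (ɛ, ɔ, ŋ) that carry meaning; such inputs tend to translate poorly.
TWI_SPECIAL_CHARS = set("ɛɔŋƐƆŊ")

def looks_ascii_folded(text: str) -> bool:
    """True if the text contains none of the Twi-specific characters."""
    return not any(ch in TWI_SPECIAL_CHARS for ch in text)

looks_ascii_folded("Ete sen?")  # True  -> warn the user
looks_ascii_folded("Ɛte sɛn?")  # False -> looks like properly typed Twi
```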
## Roadmap

This is a v1 baseline model. The following improvements are planned:

- **NLLB-200 fine-tune:** Re-train on `facebook/nllb-200-distilled-600M`, which natively supports Twi (Akan, `twi_Latn`) and is significantly more capable for low-resource translation. Expected to yield substantially higher BLEU and chrF scores.
- **Direction separation:** Train dedicated `twi→en` and `en→twi` models with clean, direction-filtered data to eliminate mixed-direction noise and language confusion.
- **Data cleaning:** Remove noisy pairs where source and target languages are mismatched or swapped, and deduplicate across the full corpus.
- **Data augmentation:** Use back-translation to expand the training corpus beyond 10,370 pairs, particularly for underrepresented domains.
- **Extended training:** The current model shows no clear plateau at epoch 5. Additional epochs with a reduced learning rate (`2e-5`) are expected to improve chrF further.
## Example Translations

### Twi → English (strong examples)
| Twi | Predicted | Reference |
|---|---|---|
| Ne dwonsɔ ani tumi sesa. | Their response is "Obiri-Nana (Obiri-na)" or "Oburu". | Their response is "Obiri-Nana (Obiri-na)" or "Oburu". |
| Yei nti, ama nnipa pii ani abɛgye nwa ho pa ara. | This has led to a growing interest in snails. | This has led to a growing interest in drinking. |
### Known failure modes
| Input | Issue |
|---|---|
| Long complex sentences (30+ words) | Repetition loops, semantic drift |
| Domain-specific medical/legal text | Hallucination, wrong terminology |
| Sentences without Twi special chars | Degraded output quality |
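For the first failure mode, one hedged workaround is to chunk long inputs at clause boundaries and translate each chunk separately. This is a heuristic sketch (the helper name and threshold are illustrative, and clause-level translation can lose cross-clause context):

```python
import re

def split_long_input(text: str, max_words: int = 30):
    """Split a long input at clause boundaries (after . , ; :) so that
    each chunk stays under max_words, mitigating repetition loops."""
    clauses = re.split(r"(?<=[.,;:])\s+", text.strip())
    chunks, current = [], []
    for clause in clauses:
        if current and len(" ".join(current + [clause]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(clause)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be passed through `model.generate` individually and the outputs rejoined.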
## Citation

If you use this model in your research or application, please cite the GhanaNLP dataset it was trained on:

```bibtex
@dataset{ghananlp_twi_english_2023,
  author    = {GhanaNLP},
  title     = {GhanaNLP TWI-ENGLISH Parallel Text},
  year      = {2023},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT}
}
```
## Acknowledgements
Dataset funded by Google LLC. Professional translations by independent paid translators. This model was developed as part of an initiative to improve NLP tooling and language technology access for Ghanaian and West African languages. Trained on Google Colab infrastructure.
## License
Creative Commons Attribution Non-Commercial 4.0 (CC-BY-NC-4.0)
You may share and adapt this model for non-commercial purposes with appropriate attribution. Commercial use requires explicit permission.