Evaluation Results

Evaluated on a held-out test set of 1,037 pairs (10% of the total dataset), stratified by translation direction.

Overall (mixed directions, 1,037 pairs)

| Metric | Score | Notes |
|---|---|---|
| BLEU | 8.26 | |
| chrF | 30.36 | |
| TER | 89.90 | lower is better |
| BLEU 1-gram | 36.71 | word-level accuracy |
| BLEU 2-gram | 11.50 | |
| BLEU 3-gram | 4.90 | |
| BLEU 4-gram | 2.25 | |
| Brevity penalty | 1.00 | no length penalty |

Twi → English (384 pairs)

| Metric | Score |
|---|---|
| BLEU | 10.50 |
| chrF | 32.30 |
| TER | 87.64 |
| BLEU 1-gram | 39.32 |
| BLEU 2-gram | 14.07 |
| BLEU 3-gram | 6.61 |
| BLEU 4-gram | 3.33 |

English → Twi (653 pairs)

| Metric | Score |
|---|---|
| BLEU | 6.32 |
| chrF | 28.74 |
| TER | 91.55 |
| BLEU 1-gram | 34.70 |
| BLEU 2-gram | 9.48 |
| BLEU 3-gram | 3.53 |
| BLEU 4-gram | 1.37 |

Note on BLEU scores: BLEU scores in the 8–11 range are expected and well-documented for low-resource language pairs like Twi-English. The 1-gram precision of 37–39% indicates the model captures individual word meanings reasonably well. chrF is a more reliable metric for morphologically rich languages like Twi — scores of 28–32 reflect a functional baseline. For context, the GhanaNLP team's own Khaya model reports similar ranges on this dataset.

Training Details

| Parameter | Value |
|---|---|
| Base model | Helsinki-NLP/opus-mt-mul-en |
| Dataset | Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT |
| Total pairs | 10,370 |
| Train pairs | 8,296 |
| Validation pairs | 1,037 |
| Test pairs | 1,037 |
| Train/Val/Test | 80% / 10% / 10% |
| Epochs | 5 |
| Batch size | 16 |
| Learning rate | 5e-5 |
| Warmup steps | 200 |
| Weight decay | 0.01 |
| Max sequence length | 128 tokens |
| Precision | fp16 (mixed precision) |
| Optimizer | AdamW |
| Hardware | NVIDIA T4 GPU (Google Colab) |
| Training time | ~25.6 minutes (1,534 seconds) |
| Global steps | 2,595 |
| Final train loss | 2.94 |
| Best checkpoint | checkpoint-2595 (epoch 5) |
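
The hyperparameters above can be expressed as a `Seq2SeqTrainingArguments` sketch; argument names assume a recent `transformers` release, and `output_dir` is illustrative:

```python
# Hyperparameters from the table above as a Seq2SeqTrainingArguments
# sketch; a config fragment, not the exact training script used.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="twi-en-marianmt",   # illustrative
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    warmup_steps=200,
    weight_decay=0.01,
    fp16=True,                      # mixed precision on the T4
    predict_with_generate=True,     # required for BLEU/chrF during eval
)
```

As a sanity check, 8,296 training pairs at batch size 16 is ~519 steps per epoch, which matches the reported 2,595 global steps over 5 epochs.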

Training Loss Curve

| Epoch | Training Loss | BLEU | chrF |
|---|---|---|---|
| 1 | 4.32 | 13.25 | 15.50 |
| 2 | 3.59 | 19.85 | 22.63 |
| 3 | 3.39 | 22.68 | 25.93 |
| 4 | 3.01 | 32.39 | 32.40 |
| 5 | 2.94 | 42.14 | 32.30 |

Validation BLEU of 42.14 at epoch 5 reflects in-training evaluation. Post-training evaluation on the held-out test set yields BLEU 8.26–10.50 depending on direction, consistent with expected performance for this language pair and dataset size.

Dataset

The GhanaNLP TWI-ENGLISH Parallel Text contains 14,875 professionally translated sentence pairs covering everyday Ghanaian contexts including civic life, health, agriculture, education and culture. Sources include Wikipedia, novels, news outlets and local linguists. The dataset includes dialectal diversity across Asante Twi, Akwapim Twi and other Akan variants spoken among the Akan people of Ghana.

Translations were produced by independent paid professionals with a focus on contextual accuracy rather than word-for-word equivalence. The training split used here (10,370 pairs) contains both Twi→English and English→Twi direction pairs, which the model handles as a single bidirectional task.

Direction breakdown in the test set:

  • Twi → English: 384 pairs (37%)
  • English → Twi: 653 pairs (63%)
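
The direction-stratified 80/10/10 split described above can be sketched with scikit-learn; the `direction` field and toy counts (mimicking the ~37/63 mix) are illustrative:

```python
# Sketch: 80/10/10 split stratified by translation direction.
from sklearn.model_selection import train_test_split

pairs = [{"direction": "twi-en"}] * 40 + [{"direction": "en-twi"}] * 60
dirs = [p["direction"] for p in pairs]

# First carve off 20%, then split that half-and-half into val/test.
train, rest, _, rest_dirs = train_test_split(
    pairs, dirs, test_size=0.2, stratify=dirs, random_state=42)
val, test, _, _ = train_test_split(
    rest, rest_dirs, test_size=0.5, stratify=rest_dirs, random_state=42)

print(len(train), len(val), len(test))  # 80 10 10
```

Stratifying both splits keeps the direction ratio consistent across train, validation, and test.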

Recommended Generation Config

```python
# Use these settings to reduce repetition artifacts
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("ninte/twi-en-marianmt")
model = MarianMTModel.from_pretrained("ninte/twi-en-marianmt")

inputs = tokenizer("Yɛfrɛ me Kofi.", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    no_repeat_ngram_size=3,
    early_stopping=True,
    length_penalty=1.0,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Limitations

  • Mixed-direction training: The model was trained on both translation directions simultaneously without explicit language direction tags (e.g. >>en<<). This limits performance compared to direction-specific models and can cause the model to occasionally output the wrong language.

  • Low-resource constraints: With ~8,300 training pairs, the model may hallucinate on complex or domain-specific sentences, particularly for rare vocabulary or idioms.

  • Repetition artifacts: Some outputs exhibit token repetition loops on longer inputs. Using no_repeat_ngram_size=3 in generation significantly mitigates this.

  • Dialectal variation: Performance may vary across Twi dialects. The dataset skews toward Asante Twi. Akwapim Twi and other variants may produce lower quality outputs.

  • Special characters: Twi uses characters such as ɛ, ɔ, and ŋ that are critical for meaning. Inputs missing these characters (e.g. typed on a standard keyboard) may produce degraded translations.

  • Dataset noise: A subset of the training data contains English text in the Twi column and vice versa. This directional noise contributes to lower performance on the English→Twi direction (BLEU 6.32 vs 10.50 for Twi→English).
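
The special-character caveat above can be screened for before inference; a minimal sketch, where the character set is illustrative rather than a complete Twi orthography check:

```python
# Warn when an input intended as Twi contains none of the special
# characters mentioned above; the set is illustrative, not exhaustive.
TWI_SPECIALS = {"ɛ", "ɔ", "ŋ", "Ɛ", "Ɔ", "Ŋ"}

def missing_twi_chars(text: str) -> bool:
    """True if the text contains no Twi-specific characters."""
    return not any(ch in TWI_SPECIALS for ch in text)

print(missing_twi_chars("Yefre me Kofi"))  # True: likely typed on a plain keyboard
print(missing_twi_chars("Yɛfrɛ me Kofi"))  # False
```

A front end could surface a warning (or attempt character restoration) when this check fires, rather than silently returning a degraded translation.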

Roadmap

This is a v1 baseline model. The following improvements are planned:

  1. NLLB-200 fine-tune: Re-train on facebook/nllb-200-distilled-600M, which natively supports Twi (Akan, twi_Latn) and is significantly more capable for low-resource translation. Expected to yield substantially higher BLEU and chrF scores.

  2. Direction separation: Train dedicated twi→en and en→twi models with clean, direction-filtered data to eliminate mixed-direction noise and language confusion.

  3. Data cleaning: Remove noisy pairs where source and target languages are mismatched or swapped, and deduplicate across the full corpus.

  4. Data augmentation: Use back-translation to expand the training corpus beyond 10,370 pairs, particularly for underrepresented domains.

  5. Extended training: The current model shows no clear plateau at epoch 5. Additional epochs with a reduced learning rate (2e-5) are expected to improve chrF further.
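
Roadmap item 3 can be prototyped with a simple pass; a sketch assuming pairs are (twi, en) tuples, where the identical-sides check is a crude placeholder for a real language-identification step:

```python
# Hypothetical cleaning pass: deduplicate and drop pairs whose two
# sides are identical (a common symptom of swapped or copied columns).
def clean_pairs(pairs):
    seen = set()
    kept = []
    for twi, en in pairs:
        key = (twi.strip().lower(), en.strip().lower())
        if twi.strip() == en.strip():
            continue  # identical source/target: likely noise
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        kept.append((twi, en))
    return kept

raw = [
    ("Me din de Kofi.", "My name is Kofi."),
    ("Me din de Kofi.", "My name is Kofi."),  # duplicate
    ("Hello there.", "Hello there."),          # identical: noise
]
print(clean_pairs(raw))  # [('Me din de Kofi.', 'My name is Kofi.')]
```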

Example Translations

Twi → English (strong examples)

| Twi | Predicted | Reference |
|---|---|---|
| Ne dwonsɔ ani tumi sesa. | Their response is "Obiri-Nana (Obiri-na)" or "Oburu". | Their response is "Obiri-Nana (Obiri-na)" or "Oburu". |
| Yei nti, ama nnipa pii ani abɛgye nwa ho pa ara. | This has led to a growing interest in snails. | This has led to a growing interest in drinking. |

Known failure modes

| Input | Issue |
|---|---|
| Long complex sentences (30+ words) | Repetition loops, semantic drift |
| Domain-specific medical/legal text | Hallucination, wrong terminology |
| Sentences without Twi special chars | Degraded output quality |

Citation

If you use this model in your research or application, please cite the GhanaNLP dataset it was trained on:

```bibtex
@dataset{ghananlp_twi_english_2023,
  author    = {GhanaNLP},
  title     = {GhanaNLP TWI-ENGLISH Parallel Text},
  year      = {2023},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT}
}
```

Acknowledgements

Dataset funded by Google LLC. Professional translations by independent paid translators. This model was developed as part of an initiative to improve NLP tooling and language technology access for Ghanaian and West African languages. Trained on Google Colab infrastructure.

License

Creative Commons Attribution Non-Commercial 4.0 (CC-BY-NC-4.0)

You may share and adapt this model for non-commercial purposes with appropriate attribution. Commercial use requires explicit permission.

