## Evaluation Results
Evaluated on a held-out test set of 1,037 pairs (10% of total dataset), stratified by translation direction.
### Overall (mixed directions, 1,037 pairs)
| Metric | Score | Notes |
|---|---|---|
| BLEU | 8.26 | |
| chrF | 30.36 | |
| TER | 89.90 | lower is better |
| BLEU 1-gram | 36.71 | word-level accuracy |
| BLEU 2-gram | 11.50 | |
| BLEU 3-gram | 4.90 | |
| BLEU 4-gram | 2.25 | |
| Brevity penalty | 1.00 | no length penalty |
### Twi → English (384 pairs)
| Metric | Score |
|---|---|
| BLEU | 10.50 |
| chrF | 32.30 |
| TER | 87.64 |
| BLEU 1-gram | 39.32 |
| BLEU 2-gram | 14.07 |
| BLEU 3-gram | 6.61 |
| BLEU 4-gram | 3.33 |
### English → Twi (653 pairs)
| Metric | Score |
|---|---|
| BLEU | 6.32 |
| chrF | 28.74 |
| TER | 91.55 |
| BLEU 1-gram | 34.70 |
| BLEU 2-gram | 9.48 |
| BLEU 3-gram | 3.53 |
| BLEU 4-gram | 1.37 |
**Note on BLEU scores:** BLEU scores in the 8–11 range are expected and well documented for low-resource language pairs such as Twi-English. The 1-gram precision of 37–39% indicates that the model captures individual word meanings reasonably well. chrF is a more reliable metric for morphologically rich languages like Twi; scores of 28–32 reflect a functional baseline. For context, the GhanaNLP team's own Khaya model reports similar ranges on this dataset.
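To make the n-gram breakdown above concrete: BLEU is built from clipped n-gram precisions combined with a brevity penalty. The sketch below illustrates the per-order precision calculation in pure Python on toy sentences (not the actual test set; the reported scores were presumably computed with a standard toolkit such as sacreBLEU).

```python
from collections import Counter

def ngram_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision: the per-order building block of BLEU."""
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matched / total if total else 0.0

hyp = "this has led to a growing interest in snails".split()
ref = "this has led to a rising interest in snails".split()

p1 = ngram_precision(hyp, ref, 1)  # 8/9: only "growing" lacks a match
p2 = ngram_precision(hyp, ref, 2)  # 6/8: the two bigrams touching "growing" miss
```

Corpus BLEU is the brevity penalty times the geometric mean of the 1- through 4-gram precisions. Since the brevity penalty here is 1.00, the low 3-/4-gram precisions (4.90 and 2.25) are what pull the overall score down to single digits despite decent word-level accuracy.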
## Training Details
| Parameter | Value |
|---|---|
| Base model | Helsinki-NLP/opus-mt-mul-en |
| Dataset | Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT |
| Total pairs | 10,370 |
| Train pairs | 8,296 |
| Validation pairs | 1,037 |
| Test pairs | 1,037 |
| Train/Val/Test | 80% / 10% / 10% |
| Epochs | 5 |
| Batch size | 16 |
| Learning rate | 5e-5 |
| Warmup steps | 200 |
| Weight decay | 0.01 |
| Max sequence length | 128 tokens |
| Precision | fp16 (mixed precision) |
| Optimizer | AdamW |
| Hardware | NVIDIA T4 GPU (Google Colab) |
| Training time | ~25.6 minutes (1,534 seconds) |
| Global steps | 2,595 |
| Final train loss | 2.94 |
| Best checkpoint | checkpoint-2595 (epoch 5) |
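The hyperparameters above correspond to a standard Hugging Face `Seq2SeqTrainer` setup. The following is a hedged reconstruction of what the training configuration likely looked like; the authors' actual script is not published, and only the values shown in the table are taken from it. Note the step count is self-consistent: 8,296 pairs / batch size 16 ≈ 519 steps per epoch × 5 epochs = 2,595 global steps, matching the table.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the likely configuration; values mirror the table above,
# but this is a reconstruction, not the authors' actual script.
training_args = Seq2SeqTrainingArguments(
    output_dir="twi-en-marianmt",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    warmup_steps=200,
    weight_decay=0.01,
    fp16=True,                    # mixed precision on the T4 GPU
    evaluation_strategy="epoch",  # BLEU/chrF logged once per epoch
    save_strategy="epoch",
    predict_with_generate=True,   # decode with generate() during eval
    load_best_model_at_end=True,  # yields checkpoint-2595 as best
)
```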
### Training Loss Curve
| Epoch | Training Loss | Validation Loss | BLEU | chrF |
|---|---|---|---|---|
| 1 | 4.32 | — | 13.25 | 15.50 |
| 2 | 3.59 | — | 19.85 | 22.63 |
| 3 | 3.39 | — | 22.68 | 25.93 |
| 4 | 3.01 | — | 32.39 | 32.40 |
| 5 | 2.94 | — | 42.14 | 32.30 |
The validation BLEU of 42.14 at epoch 5 comes from in-training evaluation on the validation split, using the trainer's own evaluation settings. Post-training evaluation on the held-out test set yields BLEU 8.26–10.50 depending on direction, consistent with expected performance for this language pair and dataset size.
## Dataset
The GhanaNLP TWI-ENGLISH Parallel Text contains 14,875 professionally translated sentence pairs covering everyday Ghanaian contexts including civic life, health, agriculture, education and culture. Sources include Wikipedia, novels, news outlets and local linguists. The dataset includes dialectal diversity across Asante Twi, Akwapim Twi and other Akan variants spoken among the Akan people of Ghana.
Translations were produced by independent paid professionals with a focus on contextual accuracy rather than word-for-word equivalence. The training split used here (10,370 pairs) contains both Twi→English and English→Twi direction pairs, which the model handles as a single bidirectional task.
Direction breakdown in test set:
- Twi → English: 384 pairs (37%)
- English → Twi: 653 pairs (63%)
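The split sizes follow from a plain 80/10/10 partition of the 10,370 pairs. A minimal sketch with placeholder data and an assumed seed (the actual split script is not published, and this simplification ignores the direction stratification described above):

```python
import random

pairs = [("twi sentence", "english sentence")] * 10_370  # placeholder data

random.seed(42)  # assumed seed; not documented by the authors
random.shuffle(pairs)

n = len(pairs)
train = pairs[: int(0.8 * n)]             # 8,296 pairs
val = pairs[int(0.8 * n): int(0.9 * n)]   # 1,037 pairs
test = pairs[int(0.9 * n):]               # 1,037 pairs
```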
## Recommended Generation Config

Repetition loops are the most common failure mode on longer inputs; the settings below mitigate them. The loading lines and example sentence are a standard `transformers` usage sketch added for completeness.

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("ninte/twi-en-marianmt")
model = MarianMTModel.from_pretrained("ninte/twi-en-marianmt")

inputs = tokenizer("Me din de Kofi.", return_tensors="pt")

# Use these settings to reduce repetition artifacts
out = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    no_repeat_ngram_size=3,
    early_stopping=True,
    length_penalty=1.0,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Limitations

- **Mixed-direction training:** The model was trained on both translation directions simultaneously without explicit language direction tags (e.g. `>>en<<`). This limits performance compared to direction-specific models and can cause the model to occasionally output the wrong language.
- **Low-resource constraints:** With ~8,300 training pairs, the model may hallucinate on complex or domain-specific sentences, particularly for rare vocabulary or idioms.
- **Repetition artifacts:** Some outputs exhibit token repetition loops on longer inputs. Using `no_repeat_ngram_size=3` in generation significantly mitigates this.
- **Dialectal variation:** Performance may vary across Twi dialects. The dataset skews toward Asante Twi; Akwapim Twi and other variants may produce lower-quality outputs.
- **Special characters:** Twi uses characters such as `ɛ`, `ɔ`, and `ŋ` that are critical for meaning. Inputs missing these characters (e.g. typed on a standard keyboard) may produce degraded translations.
- **Dataset noise:** A subset of the training data contains English text in the Twi column and vice versa. This directional noise contributes to lower performance in the English→Twi direction (BLEU 6.32 vs 10.50 for Twi→English).
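As a practical guard against the special-character issue, a hypothetical pre-flight check (not part of the model) can flag Twi inputs that were likely typed on an ASCII keyboard. Note that short sentences may legitimately contain none of these characters, so this is a heuristic warning, not a hard filter:

```python
# Hypothetical helper: flag Twi inputs typed without the special characters
# (ɛ, ɔ, ŋ) that carry meaning; such inputs tend to translate poorly.
TWI_SPECIAL_CHARS = set("ɛɔŋƐƆŊ")

def looks_ascii_folded(text: str) -> bool:
    """True if the text contains none of the Twi-specific characters."""
    return not any(ch in TWI_SPECIAL_CHARS for ch in text)

looks_ascii_folded("Ete sen?")  # True  -> warn the user
looks_ascii_folded("Ɛte sɛn?")  # False -> looks like properly typed Twi
```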
## Roadmap

This is a v1 baseline model. The following improvements are planned:

- **NLLB-200 fine-tune:** Re-train on `facebook/nllb-200-distilled-600M`, which natively supports Twi (Akan, `twi_Latn`) and is significantly more capable for low-resource translation. Expected to yield substantially higher BLEU and chrF scores.
- **Direction separation:** Train dedicated `twi→en` and `en→twi` models with clean, direction-filtered data to eliminate mixed-direction noise and language confusion.
- **Data cleaning:** Remove noisy pairs where source and target languages are mismatched or swapped, and deduplicate across the full corpus.
- **Data augmentation:** Use back-translation to expand the training corpus beyond 10,370 pairs, particularly for underrepresented domains.
- **Extended training:** The current model shows no clear plateau at epoch 5. Additional epochs with a reduced learning rate (`2e-5`) are expected to improve chrF further.
## Example Translations

### Twi → English (strong examples)
| Twi | Predicted | Reference |
|---|---|---|
| Ne dwonsɔ ani tumi sesa. | Their response is "Obiri-Nana (Obiri-na)" or "Oburu". | Their response is "Obiri-Nana (Obiri-na)" or "Oburu". |
| Yei nti, ama nnipa pii ani abɛgye nwa ho pa ara. | This has led to a growing interest in snails. | This has led to a growing interest in drinking. |
### Known failure modes
| Input | Issue |
|---|---|
| Long complex sentences (30+ words) | Repetition loops, semantic drift |
| Domain-specific medical/legal text | Hallucination, wrong terminology |
| Sentences without Twi special chars | Degraded output quality |
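For the first failure mode, one hedged workaround is to chunk long inputs at clause boundaries and translate each chunk separately. This is a heuristic sketch (the helper name and threshold are illustrative, and clause-level translation can lose cross-clause context):

```python
import re

def split_long_input(text: str, max_words: int = 30):
    """Split a long input at clause boundaries (after . , ; :) so that
    each chunk stays under max_words, mitigating repetition loops."""
    clauses = re.split(r"(?<=[.,;:])\s+", text.strip())
    chunks, current = [], []
    for clause in clauses:
        if current and len(" ".join(current + [clause]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(clause)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be passed through `model.generate` individually and the outputs rejoined.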
## Citation

If you use this model in your research or application, please cite the GhanaNLP dataset it was trained on:

```bibtex
@dataset{ghananlp_twi_english_2023,
  author    = {GhanaNLP},
  title     = {GhanaNLP TWI-ENGLISH Parallel Text},
  year      = {2023},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT}
}
```
## Acknowledgements
Dataset funded by Google LLC. Professional translations by independent paid translators. This model was developed as part of an initiative to improve NLP tooling and language technology access for Ghanaian and West African languages. Trained on Google Colab infrastructure.
## License
Creative Commons Attribution Non-Commercial 4.0 (CC-BY-NC-4.0)
You may share and adapt this model for non-commercial purposes with appropriate attribution. Commercial use requires explicit permission.