# NLLB-200 Fine-tuned for Portuguese ↔ Umbundu Translation
The first publicly available machine translation model for the Portuguese ↔ Umbundu language pair.
Umbundu is a Bantu language spoken by approximately 6 million people in Angola, primarily in the central highlands region.
## Model Details
- Base model: facebook/nllb-200-distilled-600M
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Training data: 10,277 aligned Portuguese-Umbundu sentence pairs (Bible corpus)
- Training: 3 epochs on Google Colab T4 GPU
- BLEU score: 27.48 (epoch 2, validation set)
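The fine-tune used LoRA, which trains small low-rank adapter matrices on top of the frozen base model. As a rough illustration with the `peft` library, a configuration might look like the sketch below; the rank, alpha, dropout, and target modules are illustrative assumptions, since the model card does not document the exact values used for this checkpoint:

```python
from peft import LoraConfig, TaskType

# Hypothetical LoRA configuration: r, lora_alpha, lora_dropout, and
# target_modules are assumptions, not the values behind this checkpoint.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # NLLB attention projections
)
```

The config would then be applied with `peft.get_peft_model(model, lora_config)` before training, leaving only the adapter weights trainable.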
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "robsonrtp/nllb-umbundu-pt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def traduzir(texto, src_lang="por_Latn", tgt_lang="umb_Latn"):
    """Translate `texto` between Portuguese and Umbundu (NLLB language codes)."""
    tokenizer.src_lang = src_lang
    inputs = tokenizer(texto, return_tensors="pt", padding=True)
    # Force the decoder to start generating in the target language.
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            forced_bos_token_id=forced_bos_token_id,
            max_length=128,
            num_beams=4,
        )
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# Portuguese → Umbundu
print(traduzir("Deus criou o céu e a terra."))
# Output: Suku wa lulika ilu longongo.

# Umbundu → Portuguese
print(traduzir("Suku wa lulika ilu longongo.", src_lang="umb_Latn", tgt_lang="por_Latn"))
```
## Training Data
The model was trained on a parallel corpus extracted from the Bible, containing 10,277 sentence pairs.
| Split | Examples |
|---|---|
| Train | 8,221 |
| Validation | 1,027 |
| Test | 1,029 |
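The split sizes above correspond to an 80/10/10 partition of the 10,277 pairs. A minimal sketch of such a split is below; the actual procedure and random seed are not documented, so the shuffle here is an assumption:

```python
import random

TOTAL = 10_277  # aligned Portuguese-Umbundu sentence pairs

# Hypothetical: shuffle indices with a fixed seed, then slice 80/10/10.
indices = list(range(TOTAL))
random.Random(42).shuffle(indices)

n_train = int(TOTAL * 0.8)  # 8,221
n_val = int(TOTAL * 0.1)    # 1,027
train = indices[:n_train]
val = indices[n_train:n_train + n_val]
test = indices[n_train + n_val:]  # remainder: 1,029

print(len(train), len(val), len(test))  # 8221 1027 1029
```

Truncating the fractional parts and giving the remainder to the test split reproduces the exact counts in the table.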
## Performance

| Metric | Score |
|---|---|
| BLEU (validation set, epoch 2) | 27.48 |
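BLEU measures overlap between model output and reference translations via clipped n-gram precision and a brevity penalty. The exact scorer behind the 27.48 figure is not stated; as a rough illustration, a minimal stdlib corpus-level BLEU (uniform 4-gram weights, no smoothing, whitespace tokenization) looks like:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 100]: geometric mean of clipped n-gram
    precisions for n=1..max_n, times a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts = ngram_counts(h, n)
            r_counts = ngram_counts(r, n)
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # no smoothing: an empty n-gram order zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)

print(corpus_bleu(["Suku wa lulika ilu longongo ."],
                  ["Suku wa lulika ilu longongo ."]))  # 100.0
```

Production scorers such as sacreBLEU add smoothing and standardized tokenization, so scores from this sketch are not directly comparable to reported numbers.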
## Limitations
- Trained exclusively on Biblical text — performance on modern/colloquial language is limited
- Native speaker validation still in progress
- Modern vocabulary (technology, medicine) may be translated with approximations
- Looking for native Umbundu speakers to help validate and improve the model
## Intended Use
- Research on low-resource African language NLP
- Baseline model for Portuguese-Umbundu translation
- Foundation for future work on Angolan national languages
## Future Work
- Expand dataset with non-biblical sources
- Add support for other Angolan languages (Kimbundu, Kikongo)
- Native speaker evaluation and correction
- Fine-tune for specific domains (health, education, government)
## Author
Robson — Developer from Angola building AI tools for African languages.
- 🤗 HuggingFace: robsonrtp
- 💼 LinkedIn: Robson Paulo
- 🔬 Project: NganaNLP — Neural Machine Translation for Angolan Languages
Contributions and feedback from Umbundu speakers are welcome!
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{robsonrtp2025umbundu,
  author    = {Robson Paulo},
  title     = {NLLB-200 Fine-tuned for Portuguese-Umbundu Translation},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/robsonrtp/nllb-umbundu-pt}
}
```