# NLLB-200 Fine-tuned for Portuguese ↔ Umbundu Translation
The first publicly available machine translation model for the Portuguese ↔ Umbundu language pair.
Umbundu is a Bantu language spoken by approximately 6 million people in Angola, primarily in the central highlands region.
## Model Details
- Base model: facebook/nllb-200-distilled-600M
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Training data: 10,277 aligned Portuguese-Umbundu sentence pairs (Bible corpus)
- Training: 3 epochs on Google Colab T4 GPU
- BLEU score: 27.48 (epoch 2, validation set)
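The fine-tune used LoRA, which trains small low-rank adapter matrices on top of the frozen base model. As a rough illustration with the `peft` library, a configuration might look like the sketch below; the rank, alpha, dropout, and target modules are illustrative assumptions, since the model card does not document the exact values used for this checkpoint:

```python
from peft import LoraConfig, TaskType

# Hypothetical LoRA configuration: r, lora_alpha, lora_dropout, and
# target_modules are assumptions, not the values behind this checkpoint.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # NLLB attention projections
)
```

The config would then be applied with `peft.get_peft_model(model, lora_config)` before training, leaving only the adapter weights trainable.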
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "robsonrtp/nllb-umbundu-pt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def traduzir(texto, src_lang="por_Latn", tgt_lang="umb_Latn"):
    """Translate `texto` between Portuguese and Umbundu (NLLB language codes)."""
    tokenizer.src_lang = src_lang
    inputs = tokenizer(texto, return_tensors="pt", padding=True)
    # Force the decoder to start generating in the target language.
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            forced_bos_token_id=forced_bos_token_id,
            max_length=128,
            num_beams=4,
        )
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# Portuguese → Umbundu
print(traduzir("Deus criou o céu e a terra."))
# Output: Suku wa lulika ilu longongo.

# Umbundu → Portuguese
print(traduzir("Suku wa lulika ilu longongo.", src_lang="umb_Latn", tgt_lang="por_Latn"))
```
## Training Data
The model was trained on a parallel corpus extracted from the Bible, containing 10,277 sentence pairs.
| Split | Examples |
|---|---|
| Train | 8,221 |
| Validation | 1,027 |
| Test | 1,029 |
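The split sizes above correspond to an 80/10/10 partition of the 10,277 pairs. A minimal sketch of such a split is below; the actual procedure and random seed are not documented, so the shuffle here is an assumption:

```python
import random

TOTAL = 10_277  # aligned Portuguese-Umbundu sentence pairs

# Hypothetical: shuffle indices with a fixed seed, then slice 80/10/10.
indices = list(range(TOTAL))
random.Random(42).shuffle(indices)

n_train = int(TOTAL * 0.8)  # 8,221
n_val = int(TOTAL * 0.1)    # 1,027
train = indices[:n_train]
val = indices[n_train:n_train + n_val]
test = indices[n_train + n_val:]  # remainder: 1,029

print(len(train), len(val), len(test))  # 8221 1027 1029
```

Truncating the fractional parts and giving the remainder to the test split reproduces the exact counts in the table.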
## Performance

| Metric | Score |
|---|---|
| BLEU (validation set, epoch 2) | 27.48 |
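BLEU measures overlap between model output and reference translations via clipped n-gram precision and a brevity penalty. The exact scorer behind the 27.48 figure is not stated; as a rough illustration, a minimal stdlib corpus-level BLEU (uniform 4-gram weights, no smoothing, whitespace tokenization) looks like:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 100]: geometric mean of clipped n-gram
    precisions for n=1..max_n, times a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts = ngram_counts(h, n)
            r_counts = ngram_counts(r, n)
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # no smoothing: an empty n-gram order zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)

print(corpus_bleu(["Suku wa lulika ilu longongo ."],
                  ["Suku wa lulika ilu longongo ."]))  # 100.0
```

Production scorers such as sacreBLEU add smoothing and standardized tokenization, so scores from this sketch are not directly comparable to reported numbers.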
## Limitations
- Trained exclusively on Biblical text — performance on modern/colloquial language is limited
- Native speaker validation still in progress
- Modern vocabulary (technology, medicine) may be translated with approximations
- Looking for native Umbundu speakers to help validate and improve the model
## Intended Use
- Research on low-resource African language NLP
- Baseline model for Portuguese-Umbundu translation
- Foundation for future work on Angolan national languages
## Future Work
- Expand dataset with non-biblical sources
- Add support for other Angolan languages (Kimbundu, Kikongo)
- Native speaker evaluation and correction
- Fine-tune for specific domains (health, education, government)
## Author
Robson — Developer from Angola building AI tools for African languages.
- 🤗 HuggingFace: robsonrtp
- 💼 LinkedIn: Robson Paulo
- 🔬 Project: NganaNLP — Neural Machine Translation for Angolan Languages
Contributions and feedback from Umbundu speakers are welcome!
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{robsonrtp2025umbundu,
  author    = {Robson Paulo},
  title     = {NLLB-200 Fine-tuned for Portuguese-Umbundu Translation},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/robsonrtp/nllb-umbundu-pt}
}
```