NLLB-200 Fine-tuned for Portuguese ↔ Umbundu Translation

The first publicly available machine translation model for the Portuguese ↔ Umbundu language pair.

Umbundu is a Bantu language spoken by approximately 6 million people in Angola, primarily in the central highlands region.

Model Details

  • Base model: facebook/nllb-200-distilled-600M
  • Fine-tuning method: LoRA (Low-Rank Adaptation)
  • Training data: 10,277 aligned Portuguese-Umbundu sentence pairs (Bible corpus)
  • Training: 3 epochs on Google Colab T4 GPU
  • BLEU score: 27.48 (epoch 2, validation set)
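The card does not list the LoRA hyperparameters used. As a hedged sketch only, a fine-tune like this is typically set up with the `peft` library; the rank, alpha, dropout, and target modules below are illustrative assumptions, not the values actually used:

```python
from peft import LoraConfig

# Hypothetical configuration: the actual hyperparameters are not documented.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,                     # dropout on LoRA layers (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="SEQ_2_SEQ_LM",              # seq2seq task for NLLB
)

# The adapted model would then be created with:
#   model = get_peft_model(base_model, lora_config)
```

After training, LoRA adapters are usually merged back into the base weights so the model loads as a plain `AutoModelForSeq2SeqLM`, as in the usage example below.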

Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "robsonrtp/nllb-umbundu-pt"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def traduzir(texto, src_lang="por_Latn", tgt_lang="umb_Latn"):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(texto, return_tensors="pt", padding=True)
    # Force the decoder to start with the target-language token
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            forced_bos_token_id=forced_bos_token_id,
            max_length=128,
            num_beams=4,
        )
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# Portuguese → Umbundu
print(traduzir("Deus criou o céu e a terra."))
# Output: Suku wa lulika ilu longongo.

# Umbundu → Portuguese
print(traduzir("Suku wa lulika ilu longongo.", src_lang="umb_Latn", tgt_lang="por_Latn"))
```

Training Data

The model was trained on a parallel corpus extracted from the Bible, containing 10,277 sentence pairs.

Split        Examples
Train        8,221
Validation   1,027
Test         1,029
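The card does not describe how the split was produced. As a sketch (not the author's actual procedure), a seeded 80/10/10 shuffle-and-slice with these rounding choices reproduces the counts above from 10,277 pairs:

```python
import random

def split_corpus(pairs, seed=42):
    """Shuffle and split sentence pairs into ~80/10/10 train/validation/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle
    n = len(pairs)
    n_train = int(n * 0.8)              # floor of 80%
    n_val = int(n * 0.1)                # floor of 10%
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]      # remainder
    return train, val, test

train, val, test = split_corpus(range(10_277))
print(len(train), len(val), len(test))  # 8221 1027 1029
```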

Performance

Metric             Score
BLEU (validation)  27.48
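The card does not say which BLEU implementation produced the 27.48 figure (implementations such as sacreBLEU differ in tokenization and smoothing, so scores are not directly comparable across tools). For reference, a minimal pure-Python corpus BLEU with uniform 4-gram weights and brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100): clipped n-gram precision + brevity penalty."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts = Counter(ngrams(h, n))
            r_counts = Counter(ngrams(r, n))
            clipped[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

For example, a hypothesis identical to its reference scores 100, and a hypothesis with no overlapping n-grams scores 0.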

Limitations

  • Trained exclusively on Biblical text, so performance on modern or colloquial language is limited
  • Native speaker validation is still in progress; contributions from native Umbundu speakers are welcome
  • Modern vocabulary (technology, medicine) may be translated only approximately

Intended Use

  • Research on low-resource African language NLP
  • Baseline model for Portuguese-Umbundu translation
  • Foundation for future work on Angolan national languages

Future Work

  • Expand dataset with non-biblical sources
  • Add support for other Angolan languages (Kimbundu, Kikongo)
  • Native speaker evaluation and correction
  • Fine-tune for specific domains (health, education, government)

Author

Robson — Developer from Angola building AI tools for African languages.

  • 🤗 HuggingFace: robsonrtp
  • 💼 LinkedIn: Robson Paulo
  • 🔬 Project: NganaNLP — Neural Machine Translation for Angolan Languages

Contributions and feedback from Umbundu speakers are welcome!

Citation

If you use this model in your research, please cite:

```bibtex
@misc{robsonrtp2025umbundu,
  author    = {Robson Paulo},
  title     = {NLLB-200 Fine-tuned for Portuguese-Umbundu Translation},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/robsonrtp/nllb-umbundu-pt}
}
```
