NLLB-200 Fine-Tuned for Kimbundu → Portuguese

Model Description

This model is a fine-tuned version of NLLB-200, adapted specifically for machine translation from Kimbundu (kmb) to Portuguese (pt).
It targets a low-resource African language pair, aiming to improve translation quality and support linguistic research and digital inclusion for Angolan languages.

  • Base model: facebook/nllb-200-distilled-600M
  • Task: Neural Machine Translation
  • Languages: Kimbundu → Portuguese
  • Fine-tuned by: Lirio Ramalheira
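A minimal inference sketch with the Hugging Face Transformers library. The checkpoint identifier is this repository's; the generation settings are illustrative defaults rather than tuned values:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "LirioSandro/NLLB-200-600M-KMBPT"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kmb_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def translate(text: str) -> str:
    """Translate one Kimbundu sentence into Portuguese."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    output_ids = model.generate(
        **inputs,
        # NLLB selects the target language by forcing its code as the first token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("por_Latn"),
        max_length=256,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Call `translate()` with any Kimbundu string; `kmb_Latn` and `por_Latn` are the NLLB-200 language codes enforced during preprocessing and fine-tuning.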

Intended Uses

Primary Intended Uses

  • Automatic translation from Kimbundu to Portuguese
  • Linguistic and NLP research on Bantu and Angolan languages
  • Educational and cultural preservation tools

Out-of-Scope Uses

  • Legal, medical, or governmental translation without human review
  • High-stakes or safety-critical applications
  • Dialects or orthographies not covered in the training data

Training Data

The model was fine-tuned using a manually curated parallel corpus of Kimbundu–Portuguese sentence pairs, created due to the absence of publicly available datasets for this language pair.

  • Dataset type: Parallel corpus
  • Languages: Kimbundu (kmb), Portuguese (pt)
  • Source: Public-domain, educational, and community-translated texts
  • Synthetic data: Not used

Dataset repository:
👉 LirioSandro/KmbPtMT

Data Splits

Split        Sentences
Train        18k
Validation   10%
Test         1k

Preprocessing

  • Sentence-level alignment
  • Unicode normalization
  • Duplicate and noisy pair removal
  • Enforcement of NLLB language codes (kmb_Latn, por_Latn)
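These steps can be sketched in plain Python; the length-ratio threshold used to flag noisy pairs below is an illustrative assumption, not the exact filter applied to the corpus:

```python
import unicodedata

def clean_corpus(pairs, max_len_ratio=3.0):
    """Normalize, deduplicate, and drop noisy sentence pairs.

    `pairs` is an iterable of (kimbundu, portuguese) strings;
    `max_len_ratio` is an illustrative noise heuristic.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Unicode normalization (NFC) plus whitespace trimming.
        src = unicodedata.normalize("NFC", src).strip()
        tgt = unicodedata.normalize("NFC", tgt).strip()
        if not src or not tgt:
            continue  # drop pairs with an empty side
        if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_len_ratio:
            continue  # drop pairs with an implausible length mismatch
        if (src, tgt) in seen:
            continue  # drop exact duplicates
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned

pairs = [
    ("kimbundu text", "texto português"),
    ("kimbundu text", "texto português"),  # exact duplicate, dropped
    ("abc", "x" * 50),                     # implausible length ratio, dropped
]
print(clean_corpus(pairs))  # -> [('kimbundu text', 'texto português')]
```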

Training Procedure

Training Setup

  • Framework: Hugging Face Transformers
  • Platform: Google Colab (GPU)
  • Mixed precision: FP16

Hyperparameters

  • Optimizer: AdamW
  • Learning rate: 1.5 × 10⁻⁴
  • Batch size: 16
  • Epochs: 20
  • Max sequence length: 256
  • Gradient accumulation: Enabled
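Under these settings, a Transformers `Seq2SeqTrainingArguments` configuration might look like the sketch below. The output directory and the gradient-accumulation step count are illustrative assumptions; the card only states that accumulation was enabled:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-kmb-pt",       # illustrative path
    learning_rate=1.5e-4,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    fp16=True,                      # mixed precision on a Colab GPU
    gradient_accumulation_steps=2,  # "enabled" in the card; exact value not stated
    predict_with_generate=True,
    generation_max_length=256,
)
```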

Evaluation

The model was evaluated using a manually curated test set, without synthetic augmentation.

Metrics

  • BLEU
  • chrF++
  • COMET
  • AfriCOMET
  • BERTScore
  • Len. Ratio

Results (Kimbundu → Portuguese)

Metric       Score
BLEU         17.85
chrF++       37.65
COMET        0.6702
AfriCOMET    0.5169
BERTScore    0.8161
Len. Ratio   0.97
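The length ratio of 0.97 indicates that hypotheses are, in aggregate, slightly shorter than the references. A character-level version of this metric (token-level variants are also common) is a few lines of Python:

```python
def length_ratio(hypotheses, references):
    """Total hypothesis length divided by total reference length (characters)."""
    hyp_chars = sum(len(h) for h in hypotheses)
    ref_chars = sum(len(r) for r in references)
    return hyp_chars / ref_chars

hyps = ["uma frase curta", "outra frase"]
refs = ["uma frase um pouco maior", "outra frase"]
print(round(length_ratio(hyps, refs), 2))  # -> 0.74
```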

Comparison with Base Model

Compared to the original NLLB-200 model without fine-tuning, this model shows consistent improvements in:

  • Lexical choice and fluency
  • Morphological agreement
  • Preservation of named entities
  • Translation of culturally specific expressions

Limitations

  • Limited corpus size due to data scarcity
  • Partial coverage of Kimbundu dialectal variation
  • Reduced performance on informal or spoken language

Ethical Considerations

  • The dataset does not contain personal or sensitive data
  • Cultural and topical biases present in source texts may be reflected in outputs
  • Users should validate translations before real-world deployment

Recommendations

  • Apply domain-specific fine-tuning for specialized applications
  • Use human post-editing for high-quality translation
  • Extend the dataset with dialectal and oral data

Reproducibility

  • Training scripts: Google Colab notebooks
  • Dataset: Hugging Face Datasets
  • Base model: facebook/nllb-200-distilled-600M

All experiments are reproducible using the released data and scripts.


Citation

If you use this model, please cite:

