# NLLB-200 Fine-Tuned for Kimbundu → Portuguese

## Model Description
This model is a fine-tuned version of NLLB-200, adapted specifically for machine translation from Kimbundu (kmb) to Portuguese (pt).
It targets a low-resource African language pair, aiming to improve translation quality and support linguistic research and digital inclusion for Angolan languages.
- Base model: facebook/nllb-200-distilled-600M
- Task: Neural Machine Translation
- Languages: Kimbundu → Portuguese
- Fine-tuned by: Lirio Ramalheira
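A minimal inference sketch with the Transformers library, assuming the checkpoint loads by the repo id shown on this card and using the NLLB language codes enforced during preprocessing (`kmb_Latn`, `por_Latn`; see Preprocessing below). The input sentence is a placeholder:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "LirioSandro/NLLB-200-600M-KMBPT"  # repo id from this card
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kmb_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "..."  # replace with a Kimbundu sentence
inputs = tokenizer(text, return_tensors="pt")

# NLLB-200 selects the output language by forcing its language code
# as the first generated token.
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("por_Latn"),
    max_length=256,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```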
## Intended Uses

### Primary Intended Uses
- Automatic translation from Kimbundu to Portuguese
- Linguistic and NLP research on Bantu and Angolan languages
- Educational and cultural preservation tools
### Out-of-Scope Uses
- Legal, medical, or governmental translation without human review
- High-stakes or safety-critical applications
- Dialects or orthographies not covered in the training data
## Training Data

The model was fine-tuned using a manually curated parallel corpus of Kimbundu→Portuguese sentence pairs, created due to the absence of publicly available datasets for this language pair.
- Dataset type: Parallel corpus
- Languages: Kimbundu (kmb), Portuguese (pt)
- Source: Public-domain, educational, and community-translated texts
- Synthetic data: Not used
Dataset repository: `LirioSandro/KmbPtMT`
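A loading sketch, assuming the corpus is published on the Hugging Face Hub under that id; the split and column layout is not confirmed by this card, so inspect the printed structure:

```python
from datasets import load_dataset

# Assumes the parallel corpus is hosted on the Hub under this id;
# the printed dataset shows the actual splits and column names.
ds = load_dataset("LirioSandro/KmbPtMT")
print(ds)
```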
### Data Splits

| Split | Size |
|---|---|
| Train | 18k sentence pairs |
| Validation | 10% |
| Test | 1k sentence pairs |
### Preprocessing
- Sentence-level alignment
- Unicode normalization
- Duplicate and noisy pair removal
- Enforcement of NLLB language codes (`kmb_Latn`, `por_Latn`)
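A sketch of how the normalization, deduplication, and noise-filtering steps might look; the function name and the length-ratio threshold are illustrative assumptions, not taken from the released scripts:

```python
import unicodedata

def clean_pairs(pairs):
    """Normalize, deduplicate, and lightly filter aligned sentence pairs."""
    seen, cleaned = set(), []
    for src, tgt in pairs:
        # Unicode normalization (NFC) plus whitespace cleanup
        src = unicodedata.normalize("NFC", src).strip()
        tgt = unicodedata.normalize("NFC", tgt).strip()
        if not src or not tgt:
            continue
        # Drop exact duplicate pairs
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        # Crude noise filter: extreme length ratios suggest misalignment
        # (0.3-3.0 is an assumed threshold, not from the card)
        ratio = len(src) / max(len(tgt), 1)
        if ratio < 0.3 or ratio > 3.0:
            continue
        cleaned.append((src, tgt))
    return cleaned
```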
## Training Procedure

### Training Setup
- Framework: Hugging Face Transformers
- Platform: Google Colab (GPU)
- Mixed precision: FP16
### Hyperparameters
- Optimizer: AdamW
- Learning rate: 1.5 × 10⁻⁴
- Batch size: 16
- Epochs: 20
- Max sequence length: 256
- Gradient accumulation: Enabled
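A sketch of these settings as Transformers `Seq2SeqTrainingArguments`; the output path and accumulation step count are assumptions, since the card only states that accumulation was enabled:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-kmb-pt",      # assumed path
    learning_rate=1.5e-4,             # from the card
    per_device_train_batch_size=16,   # from the card
    num_train_epochs=20,              # from the card
    fp16=True,                        # mixed precision (card: FP16)
    gradient_accumulation_steps=2,    # card says only "enabled"; 2 is a guess
    predict_with_generate=True,
    generation_max_length=256,        # max sequence length from the card
)
```

The 256-token limit would also apply at tokenization time (e.g. `max_length=256, truncation=True` when preparing inputs).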
## Evaluation
The model was evaluated using a manually curated test set, without synthetic augmentation.
### Metrics
- BLEU
- chrF++
- COMET
- AfriCOMET
- BERTScore
- Length ratio
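The string-based scores (BLEU, chrF++) can be recomputed with sacreBLEU, for example; COMET, AfriCOMET, and BERTScore require their own packages. The placeholder strings stand in for real system outputs and references:

```python
import sacrebleu

hypotheses = ["..."]    # model outputs (Portuguese), one string per sentence
references = [["..."]]  # list of reference streams, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 selects chrF++
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")
```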
### Results (Kimbundu → Portuguese)
| Metric | Score |
|---|---|
| BLEU | 17.85 |
| chrF++ | 37.65 |
| COMET | 0.6702 |
| AfriCOMET | 0.5169 |
| BERTScore | 0.8161 |
| Length ratio | 0.97 |
### Comparison with Base Model
Compared to the original NLLB-200 model without fine-tuning, this model shows consistent improvements in:
- Lexical choice and fluency
- Morphological agreement
- Preservation of named entities
- Translation of culturally specific expressions
## Limitations
- Limited corpus size due to data scarcity
- Partial coverage of Kimbundu dialectal variation
- Reduced performance on informal or spoken language
## Ethical Considerations
- The dataset does not contain personal or sensitive data
- Cultural and topical biases present in source texts may be reflected in outputs
- Users should validate translations before real-world deployment
## Recommendations
- Apply domain-specific fine-tuning for specialized applications
- Use human post-editing for high-quality translation
- Extend the dataset with dialectal and oral data
## Reproducibility
- Training scripts: Google Colab notebooks
- Dataset: Hugging Face Datasets
- Base model: facebook/nllb-200-distilled-600M
All experiments are reproducible using the released data and scripts.
## Citation

If you use this model, please cite the model repository: `LirioSandro/NLLB-200-600M-KMBPT`.