# NLLB-200 Fine-Tuned for Kimbundu → Portuguese

## Model Description
This model is a fine-tuned version of NLLB-200, adapted specifically for machine translation from Kimbundu (kmb) to Portuguese (pt).
It targets a low-resource African language pair, aiming to improve translation quality and support linguistic research and digital inclusion for Angolan languages.
- Base model: facebook/nllb-200-distilled-600M
- Task: Neural Machine Translation
- Languages: Kimbundu → Portuguese
- Fine-tuned by: Lirio Ramalheira
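A minimal inference sketch with the Transformers library, assuming the checkpoint loads by the repo id shown on this card and using the NLLB language codes enforced during preprocessing (`kmb_Latn`, `por_Latn`; see Preprocessing below). The input sentence is a placeholder:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "LirioSandro/NLLB-200-600M-KMBPT"  # repo id from this card
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kmb_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "..."  # replace with a Kimbundu sentence
inputs = tokenizer(text, return_tensors="pt")

# NLLB-200 selects the output language by forcing its language code
# as the first generated token.
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("por_Latn"),
    max_length=256,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```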
## Intended Uses

### Primary Intended Uses
- Automatic translation from Kimbundu to Portuguese
- Linguistic and NLP research on Bantu and Angolan languages
- Educational and cultural preservation tools
### Out-of-Scope Uses
- Legal, medical, or governmental translation without human review
- High-stakes or safety-critical applications
- Dialects or orthographies not covered in the training data
## Training Data

The model was fine-tuned using a manually curated parallel corpus of Kimbundu→Portuguese sentence pairs, created due to the absence of publicly available datasets for this language pair.
- Dataset type: Parallel corpus
- Languages: Kimbundu (kmb), Portuguese (pt)
- Source: Public-domain, educational, and community-translated texts
- Synthetic data: Not used
Dataset repository: `LirioSandro/KmbPtMT`
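A loading sketch, assuming the corpus is published on the Hugging Face Hub under that id; the split and column layout is not confirmed by this card, so inspect the printed structure:

```python
from datasets import load_dataset

# Assumes the parallel corpus is hosted on the Hub under this id;
# the printed dataset shows the actual splits and column names.
ds = load_dataset("LirioSandro/KmbPtMT")
print(ds)
```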
### Data Splits

| Split | Size |
|---|---|
| Train | 18k sentence pairs |
| Validation | 10% |
| Test | 1k sentence pairs |
### Preprocessing
- Sentence-level alignment
- Unicode normalization
- Duplicate and noisy pair removal
- Enforcement of NLLB language codes (`kmb_Latn`, `por_Latn`)
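A sketch of how the normalization, deduplication, and noise-filtering steps might look; the function name and the length-ratio threshold are illustrative assumptions, not taken from the released scripts:

```python
import unicodedata

def clean_pairs(pairs):
    """Normalize, deduplicate, and lightly filter aligned sentence pairs."""
    seen, cleaned = set(), []
    for src, tgt in pairs:
        # Unicode normalization (NFC) plus whitespace cleanup
        src = unicodedata.normalize("NFC", src).strip()
        tgt = unicodedata.normalize("NFC", tgt).strip()
        if not src or not tgt:
            continue
        # Drop exact duplicate pairs
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        # Crude noise filter: extreme length ratios suggest misalignment
        # (0.3-3.0 is an assumed threshold, not from the card)
        ratio = len(src) / max(len(tgt), 1)
        if ratio < 0.3 or ratio > 3.0:
            continue
        cleaned.append((src, tgt))
    return cleaned
```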
## Training Procedure

### Training Setup
- Framework: Hugging Face Transformers
- Platform: Google Colab (GPU)
- Mixed precision: FP16
### Hyperparameters
- Optimizer: AdamW
- Learning rate: 1.5 × 10⁻⁴
- Batch size: 16
- Epochs: 20
- Max sequence length: 256
- Gradient accumulation: Enabled
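A sketch of these settings as Transformers `Seq2SeqTrainingArguments`; the output path and accumulation step count are assumptions, since the card only states that accumulation was enabled:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-kmb-pt",      # assumed path
    learning_rate=1.5e-4,             # from the card
    per_device_train_batch_size=16,   # from the card
    num_train_epochs=20,              # from the card
    fp16=True,                        # mixed precision (card: FP16)
    gradient_accumulation_steps=2,    # card says only "enabled"; 2 is a guess
    predict_with_generate=True,
    generation_max_length=256,        # max sequence length from the card
)
```

The 256-token limit would also apply at tokenization time (e.g. `max_length=256, truncation=True` when preparing inputs).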
## Evaluation
The model was evaluated using a manually curated test set, without synthetic augmentation.
### Metrics
- BLEU
- chrF++
- COMET
- AfriCOMET
- BERTScore
- Length ratio
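The string-based scores (BLEU, chrF++) can be recomputed with sacreBLEU, for example; COMET, AfriCOMET, and BERTScore require their own packages. The placeholder strings stand in for real system outputs and references:

```python
import sacrebleu

hypotheses = ["..."]    # model outputs (Portuguese), one string per sentence
references = [["..."]]  # list of reference streams, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 selects chrF++
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")
```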
### Results (Kimbundu → Portuguese)
| Metric | Score |
|---|---|
| BLEU | 17.85 |
| chrF++ | 37.65 |
| COMET | 0.6702 |
| AfriCOMET | 0.5169 |
| BERTScore | 0.8161 |
| Length ratio | 0.97 |
### Comparison with Base Model
Compared to the original NLLB-200 model without fine-tuning, this model shows consistent improvements in:
- Lexical choice and fluency
- Morphological agreement
- Preservation of named entities
- Translation of culturally specific expressions
## Limitations
- Limited corpus size due to data scarcity
- Partial coverage of Kimbundu dialectal variation
- Reduced performance on informal or spoken language
## Ethical Considerations
- The dataset does not contain personal or sensitive data
- Cultural and topical biases present in source texts may be reflected in outputs
- Users should validate translations before real-world deployment
## Recommendations
- Apply domain-specific fine-tuning for specialized applications
- Use human post-editing for high-quality translation
- Extend the dataset with dialectal and oral data
## Reproducibility
- Training scripts: Google Colab notebooks
- Dataset: Hugging Face Datasets
- Base model: facebook/nllb-200-distilled-600M
All experiments are reproducible using the released data and scripts.
## Citation

If you use this model, please cite the model repository: `LirioSandro/NLLB-200-600M-KMBPT`.