# SangoNMT: Fine-Tuned NLLB-200 for Sango-French Translation
The first dedicated Sango-French neural machine translation model. Fine-tuned from Meta's NLLB-200-distilled-600M using LoRA on a novel Bible parallel corpus of 21,125 verse pairs.
🔗 Try it live: Sango-French Translator Demo
## Results
| Metric | NLLB-200 Baseline | Fine-Tuned | Improvement |
|---|---|---|---|
| BLEU | 2.38 | 22.21 | +832% |
| chrF++ | 12.18 | 43.06 | +254% |
Evaluated on 300 held-out test examples from books not seen during training (Esther, Amos, Galatians, Proverbs, John). All translations verified by a native Sango speaker.
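chrF++ rewards character-level n-gram overlap, which makes it more forgiving of morphological variation than word-level BLEU. As a rough illustration of the character n-gram component, here is a minimal pure-Python F-score sketch; it is not the official sacrebleu implementation, which additionally averages over n-gram orders 1-6 and mixes in word n-grams for the "++" variant:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string (spaces kept, as in chrF)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_fscore(hypothesis, reference, n=3, beta=2.0):
    """F-beta over character n-gram overlap (beta=2 favors recall, as chrF does)."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())
    if not hyp or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# An exact match scores 1.0; a partial match falls between 0 and 1.
print(char_fscore("Nzapa asara yayu na sese.", "Nzapa asara yayu na sese."))  # 1.0
print(round(char_fscore("Nzapa asara yayu na sese.", "Nzapa asara yayu."), 2))
```

For the scores in the table above, use sacrebleu itself rather than a sketch like this.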
## Translation Examples

### French → Sango
| French | Sango |
|---|---|
| Au commencement Dieu créa les cieux et la terre. | Na tongo nda ni, Nzapa asara yayu na sese. |
| Tu aimeras le Seigneur ton Dieu de tout ton cœur. | Mo ye Kota Gbia Nzapa ti mo na be ti mo kue. |
| Je suis satisfait du résultat. | Ye so asi anzere na mbi mingi. |
### Sango → French
| Sango | French |
|---|---|
| Na tongo nda ni, Nzapa asara yayu na sese. | Au commencement Dieu créa les cieux et la terre. |
| Nzapa abaa so ye ni ayeke nzoni. | Et Dieu vit que cela était bon. |
| So zo la! | C'est une personne! |
## Usage

### Python (Transformers)
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "alaminerca/nllb-sango-french"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    """Translate text between French (fra_Latn) and Sango (sag_Latn)."""
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
        num_beams=3,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# French → Sango
print(translate("Je vous salue!", "fra_Latn", "sag_Latn"))

# Sango → French
print(translate("Mbi Bara mo!", "sag_Latn", "fra_Latn"))
```
### Gradio App
```python
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "alaminerca/nllb-sango-french"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, direction):
    if direction == "French → Sango":
        src, tgt = "fra_Latn", "sag_Latn"
    else:
        src, tgt = "sag_Latn", "fra_Latn"
    tokenizer.src_lang = src
    inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt),
        max_new_tokens=256,
        num_beams=3,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

gr.Interface(
    fn=translate,
    inputs=[
        gr.Textbox(label="Input text"),
        gr.Radio(["French → Sango", "Sango → French"], value="French → Sango"),
    ],
    outputs=gr.Textbox(label="Translation"),
    title="Sango-French Translator",
).launch()
```
## About Sango
Sango (ISO 639-3: sag) is the national language of the Central African Republic (CAR), spoken by approximately 5.5 million people. It is one of Africa's few indigenous lingua francas, derived from Ngbandi (Ubangian family), and has served as an official language alongside French since 1991. Sango is a tonal language with minimal morphology and a syntax-driven grammar. Despite its sociolinguistic importance, it remains severely underserved by NLP technology.
## Training Details
| Parameter | Value |
|---|---|
| Base model | facebook/nllb-200-distilled-600M |
| Method | LoRA (rank 16, alpha 32, dropout 0.05) |
| Target modules | q_proj, v_proj |
| Trainable params | 2.36M / 617M total (0.38%) |
| Training data | 36,846 bidirectional examples |
| Hardware | NVIDIA Tesla T4 (16GB VRAM) |
| Training time | ~80 minutes |
| Epochs | 3 |
| Effective batch size | 32 (batch 4 × gradient accumulation 8) |
| Learning rate | 2e-4 (cosine schedule, 5% warmup) |
| Precision | FP16 |
| Max sequence length | 256 tokens |
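The LoRA settings in the table map directly onto a `peft` configuration. A minimal sketch, assuming the standard `peft` API (the actual training script is not shown in this card):

```python
from peft import LoraConfig

# LoRA hyperparameters from the table above; SEQ_2_SEQ_LM matches
# NLLB's encoder-decoder architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_2_SEQ_LM",
)
```

Applied with `get_peft_model(base_model, lora_config)`, this leaves roughly 2.36M of 617M parameters trainable (0.38%), as in the table; restricting the adapters to the query and value projections is a common memory/quality trade-off for LoRA.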
## Dataset
Built by automatically aligning the Sango Bible (Tënë ti Nzapä, 2010) with the French Darby Bible at verse level, followed by quality filtering (length ratio, minimum length).
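The filtering step can be sketched as follows. The exact thresholds used to build the corpus are not stated in the card, so the values below are illustrative assumptions:

```python
def keep_pair(src, tgt, min_chars=15, max_ratio=2.0):
    """Illustrative quality filter: drop pairs that are too short or whose
    lengths differ too much (a sign of verse misalignment)."""
    if len(src) < min_chars or len(tgt) < min_chars:
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio

pairs = [
    ("Au commencement Dieu créa les cieux et la terre.",
     "Na tongo nda ni, Nzapa asara yayu na sese."),                 # kept
    ("Amen.", "Amen."),                                             # dropped: too short
    ("Une phrase française assez longue pour passer.",
     "Oui, c'est vrai."),                                           # dropped: length ratio
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))  # 1
```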
| Split | Verse Pairs | Bidirectional Examples |
|---|---|---|
| Train | 18,423 | 36,846 |
| Validation | 862 | 1,724 |
| Test | 1,840 | 3,680 |
| Total | 21,125 | 42,250 |
Book-level splitting prevents data leakage between sets.
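Splitting by book rather than by verse means no passage from a test book ever appears in training, even in paraphrased form. A sketch of the idea; the held-out test books are the five named above, while the validation book choice here is illustrative, not stated in the card:

```python
# Each verse pair carries the Bible book it came from; whole books go to one split.
TEST_BOOKS = {"Esther", "Amos", "Galatians", "Proverbs", "John"}
VAL_BOOKS = {"Ruth"}  # illustrative assumption

def assign_split(book):
    if book in TEST_BOOKS:
        return "test"
    if book in VAL_BOOKS:
        return "validation"
    return "train"

verses = [
    ("Genesis", "Au commencement...", "Na tongo nda ni..."),
    ("John", "Au commencement était la Parole...", "..."),
]
splits = {book: assign_split(book) for book, *_ in verses}
print(splits)  # {'Genesis': 'train', 'John': 'test'}
```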
**Dataset:** `alaminerca/sango-french-bible-parallel`
## Limitations
- Domain: Trained on Biblical text. Performance on casual or conversational Sango (especially with French code-switching) may be lower, though preliminary tests on everyday sentences show reasonable quality.
- Tone: Sango is tonal, but tone is inconsistently marked in writing. The model does not explicitly handle tonal distinctions.
- TTS: The integrated demo uses Meta's MMS-TTS for Sango, which produces intelligible but formal-sounding speech.
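Because tone marks are written inconsistently, matching at inference time can improve if input is normalized the same way on both sides. A hypothetical preprocessing sketch using Unicode decomposition; this is not part of the released model:

```python
import unicodedata

def strip_tone_marks(text):
    """Remove combining diacritics (e.g. ë → e) so variant spellings match."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_tone_marks("Tënë ti Nzapä"))  # Tene ti Nzapa
```

Note that this would also strip meaningful French accents (é, è, ç), so it should only ever be applied to the Sango side of a pair.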
## Citation
```bibtex
@misc{alamine2026sangonmt,
  title={SangoNMT: Fine-Tuning NLLB-200 for Sango-French Machine Translation with a Novel Bible Parallel Corpus},
  author={Al-Amine, Mouhamad Alim},
  year={2026},
  url={https://huggingface.co/alaminerca/nllb-sango-french}
}
```
## 🇫🇷 French Summary

SangoNMT: NLLB-200 fine-tuned for Sango-French translation. The first dedicated Sango-French machine translation model, fine-tuned from Meta's NLLB-200 with LoRA on a Bible parallel corpus of 21,125 verse pairs.

**Results:** BLEU improved from 2.38 to 22.21 (+832%) and chrF++ from 12.18 to 43.06 (+254%) over the NLLB-200 baseline.

Sango (ISO 639-3: sag) is the national language of the Central African Republic, spoken by approximately 5.5 million people. It is one of the few indigenous African lingua francas and has been an official language alongside French since 1991. Despite its sociolinguistic importance, Sango remains severely underrepresented in language technology.

🔗 Try it live: Sango-French Translator Demo
### Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alaminerca/nllb-sango-french")
model = AutoModelForSeq2SeqLM.from_pretrained("alaminerca/nllb-sango-french")

# French → Sango
tokenizer.src_lang = "fra_Latn"
inputs = tokenizer("Bonjour, comment vas-tu?", return_tensors="pt")
result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("sag_Latn"))
print(tokenizer.decode(result[0], skip_special_tokens=True))
```
### Limitations

- Domain: Trained on Biblical text. Performance on conversational Sango (with French-Sango code-switching) may be lower.
- Tone: Sango is a tonal language, but tones are marked inconsistently in writing.
## Author / Auteur
Mouhamad Alim Al-Amine — Senior CS Undergraduate, Islamic University of Madinah, KSA
## Acknowledgments / Remerciements
Built with Meta's NLLB-200 and MMS-TTS. Sango Bible text from the Société Biblique de Centrafrique (2010).