# SangoNMT: Fine-Tuned NLLB-200 for Sango-French Translation
The first dedicated Sango-French neural machine translation model. Fine-tuned from Meta's NLLB-200-distilled-600M using LoRA on a novel Bible parallel corpus of 21,125 verse pairs.
🔗 Try it live: Sango-French Translator Demo
## Results
| Metric | NLLB-200 Baseline | Fine-Tuned | Improvement |
|---|---|---|---|
| BLEU | 2.38 | 22.21 | +832% |
| chrF++ | 12.18 | 43.06 | +254% |
Evaluated on 300 held-out test examples from books not seen during training (Esther, Amos, Galatians, Proverbs, John). All translations verified by a native Sango speaker.
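chrF++ rewards character-level n-gram overlap, which makes it more forgiving of morphological variation than word-level BLEU. As a rough illustration of the character n-gram component, here is a minimal pure-Python F-score sketch; it is not the official sacrebleu implementation, which additionally averages over n-gram orders 1-6 and mixes in word n-grams for the "++" variant:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string (spaces kept, as in chrF)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_fscore(hypothesis, reference, n=3, beta=2.0):
    """F-beta over character n-gram overlap (beta=2 favors recall, as chrF does)."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())
    if not hyp or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# An exact match scores 1.0; a partial match falls between 0 and 1.
print(char_fscore("Nzapa asara yayu na sese.", "Nzapa asara yayu na sese."))  # 1.0
print(round(char_fscore("Nzapa asara yayu na sese.", "Nzapa asara yayu."), 2))
```

For the scores in the table above, use sacrebleu itself rather than a sketch like this.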
## Translation Examples

### French → Sango
| French | Sango |
|---|---|
| Au commencement Dieu créa les cieux et la terre. | Na tongo nda ni, Nzapa asara yayu na sese. |
| Tu aimeras le Seigneur ton Dieu de tout ton cœur. | Mo ye Kota Gbia Nzapa ti mo na be ti mo kue. |
| Je suis satisfait du résultat. | Ye so asi anzere na mbi mingi. |
### Sango → French
| Sango | French |
|---|---|
| Na tongo nda ni, Nzapa asara yayu na sese. | Au commencement Dieu créa les cieux et la terre. |
| Nzapa abaa so ye ni ayeke nzoni. | Et Dieu vit que cela était bon. |
| So zo la! | C'est une personne! |
## Usage

### Python (Transformers)
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "alaminerca/nllb-sango-french"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    """Translate text between French (fra_Latn) and Sango (sag_Latn)."""
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
        num_beams=3,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# French → Sango
print(translate("Je vous salue!", "fra_Latn", "sag_Latn"))

# Sango → French
print(translate("Mbi Bara mo!", "sag_Latn", "fra_Latn"))
```
### Gradio App
```python
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "alaminerca/nllb-sango-french"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, direction):
    if direction == "French → Sango":
        src, tgt = "fra_Latn", "sag_Latn"
    else:
        src, tgt = "sag_Latn", "fra_Latn"
    tokenizer.src_lang = src
    inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt),
        max_new_tokens=256,
        num_beams=3,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

gr.Interface(
    fn=translate,
    inputs=[
        gr.Textbox(label="Input text"),
        gr.Radio(["French → Sango", "Sango → French"], value="French → Sango"),
    ],
    outputs=gr.Textbox(label="Translation"),
    title="Sango-French Translator",
).launch()
```
## About Sango
Sango (ISO 639-3: sag) is the national language of the Central African Republic (CAR), spoken by approximately 5.5 million people. It is one of Africa's few indigenous lingua francas, derived from Ngbandi (Ubangian family), and has served as an official language alongside French since 1991. Sango is a tonal language with minimal morphology and a syntax-driven grammar. Despite its sociolinguistic importance, it remains severely underserved by NLP technology.
## Training Details
| Parameter | Value |
|---|---|
| Base model | facebook/nllb-200-distilled-600M |
| Method | LoRA (rank 16, alpha 32, dropout 0.05) |
| Target modules | q_proj, v_proj |
| Trainable params | 2.36M / 617M total (0.38%) |
| Training data | 36,846 bidirectional examples |
| Hardware | NVIDIA Tesla T4 (16GB VRAM) |
| Training time | ~80 minutes |
| Epochs | 3 |
| Effective batch size | 32 (batch 4 × gradient accumulation 8) |
| Learning rate | 2e-4 (cosine schedule, 5% warmup) |
| Precision | FP16 |
| Max sequence length | 256 tokens |
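The LoRA settings in the table map directly onto a `peft` configuration. A minimal sketch, assuming the standard `peft` API (the actual training script is not shown in this card):

```python
from peft import LoraConfig

# LoRA hyperparameters from the table above; SEQ_2_SEQ_LM matches
# NLLB's encoder-decoder architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_2_SEQ_LM",
)
```

Applied with `get_peft_model(base_model, lora_config)`, this leaves roughly 2.36M of 617M parameters trainable (0.38%), as in the table; restricting the adapters to the query and value projections is a common memory/quality trade-off for LoRA.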
## Dataset
Built by automatically aligning the Sango Bible (Tënë ti Nzapä, 2010) with the French Darby Bible at verse level, followed by quality filtering (length ratio, minimum length).
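The filtering step can be sketched as follows. The exact thresholds used to build the corpus are not stated in the card, so the values below are illustrative assumptions:

```python
def keep_pair(src, tgt, min_chars=15, max_ratio=2.0):
    """Illustrative quality filter: drop pairs that are too short or whose
    lengths differ too much (a sign of verse misalignment)."""
    if len(src) < min_chars or len(tgt) < min_chars:
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio

pairs = [
    ("Au commencement Dieu créa les cieux et la terre.",
     "Na tongo nda ni, Nzapa asara yayu na sese."),                 # kept
    ("Amen.", "Amen."),                                             # dropped: too short
    ("Une phrase française assez longue pour passer.",
     "Oui, c'est vrai."),                                           # dropped: length ratio
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))  # 1
```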
| Split | Verse Pairs | Bidirectional Examples |
|---|---|---|
| Train | 18,423 | 36,846 |
| Validation | 862 | 1,724 |
| Test | 1,840 | 3,680 |
| Total | 21,125 | 42,250 |
Book-level splitting prevents data leakage between sets.
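Splitting by book rather than by verse means no passage from a test book ever appears in training, even in paraphrased form. A sketch of the idea; the held-out test books are the five named above, while the validation book choice here is illustrative, not stated in the card:

```python
# Each verse pair carries the Bible book it came from; whole books go to one split.
TEST_BOOKS = {"Esther", "Amos", "Galatians", "Proverbs", "John"}
VAL_BOOKS = {"Ruth"}  # illustrative assumption

def assign_split(book):
    if book in TEST_BOOKS:
        return "test"
    if book in VAL_BOOKS:
        return "validation"
    return "train"

verses = [
    ("Genesis", "Au commencement...", "Na tongo nda ni..."),
    ("John", "Au commencement était la Parole...", "..."),
]
splits = {book: assign_split(book) for book, *_ in verses}
print(splits)  # {'Genesis': 'train', 'John': 'test'}
```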
**Dataset:** `alaminerca/sango-french-bible-parallel`
## Limitations
- Domain: Trained on Biblical text. Performance on casual or conversational Sango (especially with French code-switching) may be lower, though preliminary tests on everyday sentences show reasonable quality.
- Tone: Sango is tonal, but tone is inconsistently marked in writing. The model does not explicitly handle tonal distinctions.
- TTS: The integrated demo uses Meta's MMS-TTS for Sango, which produces intelligible but formal-sounding speech.
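Because tone marks are written inconsistently, matching at inference time can improve if input is normalized the same way on both sides. A hypothetical preprocessing sketch using Unicode decomposition; this is not part of the released model:

```python
import unicodedata

def strip_tone_marks(text):
    """Remove combining diacritics (e.g. ë → e) so variant spellings match."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_tone_marks("Tënë ti Nzapä"))  # Tene ti Nzapa
```

Note that this would also strip meaningful French accents (é, è, ç), so it should only ever be applied to the Sango side of a pair.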
## Citation
```bibtex
@misc{alamine2026sangonmt,
  title={SangoNMT: Fine-Tuning NLLB-200 for Sango-French Machine Translation with a Novel Bible Parallel Corpus},
  author={Al-Amine, Mouhamad Alim},
  year={2026},
  url={https://huggingface.co/alaminerca/nllb-sango-french}
}
```
## 🇫🇷 French Summary

SangoNMT: NLLB-200 fine-tuned for Sango-French translation. The first dedicated Sango-French machine translation model, fine-tuned from Meta's NLLB-200 with LoRA on a Bible parallel corpus of 21,125 verse pairs.

**Results:** BLEU improved from 2.38 to 22.21 (+832%) and chrF++ from 12.18 to 43.06 (+254%) over the NLLB-200 baseline.

Sango (ISO 639-3: sag) is the national language of the Central African Republic, spoken by approximately 5.5 million people. It is one of the few indigenous African lingua francas and has been an official language alongside French since 1991. Despite its sociolinguistic importance, Sango remains severely underrepresented in language technology.

🔗 Try it live: Sango-French Translator Demo
### Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alaminerca/nllb-sango-french")
model = AutoModelForSeq2SeqLM.from_pretrained("alaminerca/nllb-sango-french")

# French → Sango
tokenizer.src_lang = "fra_Latn"
inputs = tokenizer("Bonjour, comment vas-tu?", return_tensors="pt")
result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("sag_Latn"))
print(tokenizer.decode(result[0], skip_special_tokens=True))
```
### Limitations

- Domain: Trained on Biblical text. Performance on conversational Sango (with French-Sango code-switching) may be lower.
- Tone: Sango is a tonal language, but tones are marked inconsistently in writing.
## Author / Auteur
Mouhamad Alim Al-Amine — Senior CS Undergraduate, Islamic University of Madinah, KSA
## Acknowledgments / Remerciements
Built with Meta's NLLB-200 and MMS-TTS. Sango Bible text from the Société Biblique de Centrafrique (2010).