SangoNMT: Fine-Tuned NLLB-200 for Sango-French Translation

The first dedicated Sango-French neural machine translation model. Fine-tuned from Meta's NLLB-200-distilled-600M using LoRA on a novel Bible parallel corpus of 21,125 verse pairs.

🔗 Try it live: Sango-French Translator Demo


Results

| Metric | NLLB-200 Baseline | Fine-Tuned | Improvement |
|--------|-------------------|------------|-------------|
| BLEU   | 2.38              | 22.21      | +832%       |
| chrF++ | 12.18             | 43.06      | +254%       |

Evaluated on 300 held-out test examples from books not seen during training (Esther, Amos, Galatians, Proverbs, John). All translations verified by a native Sango speaker.
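
The scores above come from standard MT metrics. As a rough, self-contained illustration of what a character n-gram F-score like chrF measures, here is a toy sketch — this is *not* the sacreBLEU implementation used for the reported numbers, which also handles word n-grams (for chrF++), whitespace, and corpus-level aggregation differently:

```python
from collections import Counter

def char_ngrams(text, n):
    """All character n-grams of a string (spaces kept, as chrF does)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_score(hypothesis, reference, max_order=6, beta=2.0):
    """Toy character n-gram F-score in the spirit of chrF.

    Averages per-order precision and recall over n-gram orders
    1..max_order, then combines them with an F-beta score
    (beta > 1 weights recall higher, as chrF does). Returns 0-100.
    """
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) * 100

# Identical strings score 100; disjoint strings score 0.
print(round(chrf_score("Nzapa asara yayu na sese.", "Nzapa asara yayu na sese."), 1))  # → 100.0
```

A character-level metric like this is a better fit than BLEU for a low-resource, morphologically light language such as Sango, where exact word matches are sparse.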

Translation Examples

French → Sango

| French | Sango |
|--------|-------|
| Au commencement Dieu créa les cieux et la terre. | Na tongo nda ni, Nzapa asara yayu na sese. |
| Tu aimeras le Seigneur ton Dieu de tout ton cœur. | Mo ye Kota Gbia Nzapa ti mo na be ti mo kue. |
| Je suis satisfait du résultat. | Ye so asi anzere na mbi mingi. |

Sango → French

| Sango | French |
|-------|--------|
| Na tongo nda ni, Nzapa asara yayu na sese. | Au commencement Dieu créa les cieux et la terre. |
| Nzapa abaa so ye ni ayeke nzoni. | Et Dieu vit que cela était bon. |
| So zo la! | C'est une personne! |

Usage

Python (Transformers)

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "alaminerca/nllb-sango-french"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    """Translate text between French (fra_Latn) and Sango (sag_Latn)."""
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
        num_beams=3,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# French → Sango
print(translate("Je vous salue!", "fra_Latn", "sag_Latn"))

# Sango → French
print(translate("Mbi bara mo!", "sag_Latn", "fra_Latn"))
```

Gradio App

```python
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "alaminerca/nllb-sango-french"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, direction):
    if direction == "French → Sango":
        src, tgt = "fra_Latn", "sag_Latn"
    else:
        src, tgt = "sag_Latn", "fra_Latn"
    tokenizer.src_lang = src
    inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt),
        max_new_tokens=256,
        num_beams=3,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

gr.Interface(
    fn=translate,
    inputs=[
        gr.Textbox(label="Input text"),
        gr.Radio(["French → Sango", "Sango → French"], value="French → Sango"),
    ],
    outputs=gr.Textbox(label="Translation"),
    title="Sango-French Translator",
).launch()
```

About Sango

Sango (ISO 639-3: sag) is the national language of the Central African Republic (CAR), spoken by approximately 5.5 million people. It is one of Africa's few indigenous lingua francas, derived from Ngbandi (Ubangian family), and has served as an official language alongside French since 1991. Sango is a tonal language with minimal morphology and a syntax-driven grammar. Despite its sociolinguistic importance, it remains severely underserved by NLP technology.

Training Details

| Parameter | Value |
|-----------|-------|
| Base model | facebook/nllb-200-distilled-600M |
| Method | LoRA (rank 16, alpha 32, dropout 0.05) |
| Target modules | q_proj, v_proj |
| Trainable params | 2.36M / 617M total (0.38%) |
| Training data | 36,846 bidirectional examples |
| Hardware | NVIDIA Tesla T4 (16 GB VRAM) |
| Training time | ~80 minutes |
| Epochs | 3 |
| Effective batch size | 32 (batch 4 × gradient accumulation 8) |
| Learning rate | 2e-4 (cosine schedule, 5% warmup) |
| Precision | FP16 |
| Max sequence length | 256 tokens |
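
The LoRA settings in the table map directly onto Hugging Face PEFT. The following is a minimal reconstruction of the adapter setup from those values — illustrative only, not the actual training script:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # ~2.36M trainable of ~617M total
```

Restricting adapters to the query and value projections is what keeps the trainable fraction at 0.38% of the model.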

Dataset

Built by automatically aligning the Sango Bible (Tënë ti Nzapä, 2010) with the French Darby Bible at verse level, followed by quality filtering (length ratio, minimum length).

| Split | Verse Pairs | Bidirectional Examples |
|-------|-------------|------------------------|
| Train | 18,423 | 36,846 |
| Validation | 862 | 1,724 |
| Test | 1,840 | 3,680 |
| Total | 21,125 | 42,250 |

Book-level splitting prevents data leakage between sets.

Dataset: alaminerca/sango-french-bible-parallel
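
The quality filtering described above (length ratio, minimum length) can be sketched as a simple predicate over verse pairs. The thresholds below are illustrative stand-ins, not the values actually used to build the corpus:

```python
def keep_pair(src, tgt, min_chars=10, max_ratio=2.0):
    """Drop pairs that are too short or whose lengths diverge too much.

    min_chars and max_ratio are hypothetical thresholds for illustration.
    """
    if len(src) < min_chars or len(tgt) < min_chars:
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio

pairs = [
    ("Au commencement Dieu créa les cieux et la terre.",
     "Na tongo nda ni, Nzapa asara yayu na sese."),
    ("Amen.", "Amen."),  # too short: likely a misalignment or trivial pair
]
filtered = [p for p in pairs if keep_pair(*p)]
```

Length-ratio filters like this catch the most common failure of automatic verse alignment, where one side silently merges or drops a verse.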

Limitations

  • Domain: Trained on Biblical text. Performance on casual or conversational Sango (especially with French code-switching) may be lower, though preliminary tests on everyday sentences show reasonable quality.
  • Tone: Sango is tonal, but tone is inconsistently marked in writing. The model does not explicitly handle tonal distinctions.
  • TTS: The integrated demo uses Meta's MMS-TTS for Sango, which produces intelligible but formal-sounding speech.

Citation

```bibtex
@misc{alamine2026sangonmt,
  title={SangoNMT: Fine-Tuning NLLB-200 for Sango-French Machine Translation with a Novel Bible Parallel Corpus},
  author={Al-Amine, Mouhamad Alim},
  year={2026},
  url={https://huggingface.co/alaminerca/nllb-sango-french}
}
```

🇫🇷 En Français

SangoNMT : NLLB-200 affiné pour la traduction Sango-Français

Le premier modèle dédié de traduction automatique Sango-Français. Affiné à partir de NLLB-200 de Meta avec LoRA sur un corpus parallèle biblique de 21 125 paires de versets.

Résultats : Amélioration du score BLEU de 2,38 à 22,21 (+832%) et du chrF++ de 12,18 à 43,06 (+254%) par rapport au modèle de base NLLB-200.

Le sango (ISO 639-3 : sag) est la langue nationale de la République Centrafricaine, parlée par environ 5,5 millions de personnes. C'est l'une des rares langues véhiculaires africaines autochtones, langue officielle aux côtés du français depuis 1991. Malgré son importance sociolinguistique, le sango reste très peu représenté dans les technologies de traitement automatique des langues.

Essayez-le en ligne : Démo Sango-Français

Utilisation

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alaminerca/nllb-sango-french")
model = AutoModelForSeq2SeqLM.from_pretrained("alaminerca/nllb-sango-french")

# Français → Sango
tokenizer.src_lang = "fra_Latn"
inputs = tokenizer("Bonjour, comment vas-tu?", return_tensors="pt")
result = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("sag_Latn"))
print(tokenizer.decode(result[0], skip_special_tokens=True))
```

Limites

  • Domaine : Entraîné sur du texte biblique. Les performances sur le sango conversationnel (avec alternance codique français-sango) peuvent être inférieures.
  • Tons : Le sango est une langue tonale, mais les tons sont marqués de manière incohérente à l'écrit.

Author / Auteur

Mouhamad Alim Al-Amine — Senior CS Undergraduate, Islamic University of Madinah, KSA

Acknowledgments / Remerciements

Built with Meta's NLLB-200 and MMS-TTS. Sango Bible text from the Société Biblique de Centrafrique (2010).
