cesear64 
posted an update 2 days ago
Just published: how we built production Sango (Central African Republic) translation without fine-tuning, a parallel corpus, or training compute.

The method — vocabulary-augmented prompting with a 581-entry native-speaker-verified lexicon — generalizes to any of the ~2,000 African languages at the same data-poverty level. Recipe, dataset, and code template all included.
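
For readers who want a feel for the idea before opening the blog, here is a minimal sketch of vocabulary-augmented prompting. It is not the code template from the blog: the function names, the prompt wording, and the French-to-Sango direction are assumptions, and the three lexicon entries are illustrative placeholders rather than rows from MEYNG/sango-vocabulary, so verify them with a native speaker before relying on them.

```python
import re

# Hypothetical mini-lexicon (French -> Sango). Illustrative only; these
# entries are NOT taken from the MEYNG/sango-vocabulary dataset.
LEXICON = {
    "eau": "ngu",
    "merci": "singila",
    "je": "mbi",
}


def select_entries(text, lexicon):
    """Return the lexicon entries whose source word occurs in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return {src: tgt for src, tgt in lexicon.items() if src in words}


def build_prompt(text, lexicon, grammar_notes=""):
    """Assemble a translation prompt that carries the matching vocabulary."""
    entries = select_entries(text, lexicon)
    glossary = "\n".join(
        f"- {src} = {tgt}" for src, tgt in sorted(entries.items())
    )
    return (
        "Translate the French text into Sango.\n"
        f"Grammar and orthography notes:\n{grammar_notes or '(none)'}\n"
        f"Verified vocabulary:\n{glossary or '(no matches)'}\n"
        f"Text: {text}\n"
        "Translation:"
    )


prompt = build_prompt("Merci pour l'eau", LEXICON)
print(prompt)
```

The resulting string would be sent to whichever LLM you use. With a lexicon of only a few hundred entries, including all of it in every prompt instead of filtering by matched words may work just as well.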

📄 Blog: https://huggingface.co/blog/MEYNG/sangoai
📦 Dataset: MEYNG/sango-vocabulary

Would especially value feedback from anyone working on other low-resource African languages — Ewondo, Lingala, Wolof next on our roadmap.

I'm not working on low-resource African languages but this method sounds interesting.

So you put the orthography, grammar, and vocabulary in the prompt and then get the LLM to translate a language that it doesn't know. Clever!

Then once you have enough native speaker-verified Sango-French translations, you can bootstrap it to a full-fledged dataset...
