NLLB-Tarjumi: Fine-tuned Translation for African Languages

Fine-tuned NLLB-200-distilled-600M on parallel corpus data for Kenyan languages.

Training Results

| Language     | Base BLEU | Fine-tuned BLEU | Improvement |
|--------------|-----------|-----------------|-------------|
| Kikuyu (ki)  | 17.00     | 22.24           | +31%        |
| Luo (luo)    | 11.29     | 18.64           | +65%        |
| Swahili (sw) | 16.45     | 19.39           | +18%        |
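The Improvement column is the relative BLEU gain over the base model; a quick sanity check of the arithmetic:

```python
# Recompute the relative BLEU gains reported in the table above.
results = {
    "Kikuyu (ki)": (17.00, 22.24),
    "Luo (luo)": (11.29, 18.64),
    "Swahili (sw)": (16.45, 19.39),
}
for lang, (base, tuned) in results.items():
    pct = round((tuned - base) / base * 100)
    print(f"{lang}: +{pct}%")
# Kikuyu (ki): +31%, Luo (luo): +65%, Swahili (sw): +18%
```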

Training Details

  • Base model: facebook/nllb-200-distilled-600M
  • Method: LoRA (rank-16, alpha-32)
  • Corpus: 124K parallel pairs (Bible + JW300)
  • Hardware: NVIDIA H100 80GB
  • Training time: 13 minutes
  • Cost: ~$0.75
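The adapter settings above can be expressed as a PEFT configuration. A minimal sketch; the `target_modules` choice is an assumption (a common pick for NLLB attention projections), since the card only specifies rank and alpha:

```python
from peft import LoraConfig

# Sketch of the adapter configuration described in Training Details.
lora_config = LoraConfig(
    r=16,                                  # LoRA rank, as listed above
    lora_alpha=32,                         # LoRA scaling factor alpha
    target_modules=["q_proj", "v_proj"],   # assumed, not stated in the card
    task_type="SEQ_2_SEQ_LM",
)
```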

Supported Languages

16 African languages: Kikuyu, Swahili, Luo, Luhya, Kalenjin, Kamba, Meru, Maasai, Somali, Amharic, Yoruba, Zulu, Xhosa, Shona, Kinyarwanda, Luganda.
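NLLB-200 addresses languages by FLORES-200 codes. Below is a partial mapping for the languages listed above, limited to codes that appear in the standard NLLB-200 language list; Luhya, Kalenjin, Meru, and Maasai are omitted here because their codes could not be confirmed:

```python
# Partial mapping from language name to NLLB-200 (FLORES-200) code.
# Luhya, Kalenjin, Meru, and Maasai are omitted (codes unconfirmed).
NLLB_CODES = {
    "Kikuyu": "kik_Latn",
    "Swahili": "swh_Latn",
    "Luo": "luo_Latn",
    "Kamba": "kam_Latn",
    "Somali": "som_Latn",
    "Amharic": "amh_Ethi",
    "Yoruba": "yor_Latn",
    "Zulu": "zul_Latn",
    "Xhosa": "xho_Latn",
    "Shona": "sna_Latn",
    "Kinyarwanda": "kin_Latn",
    "Luganda": "lug_Latn",
}
print(NLLB_CODES["Kikuyu"])  # kik_Latn
```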

Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("cif-ai/nllb-tarjumi")
tokenizer = AutoTokenizer.from_pretrained("cif-ai/nllb-tarjumi")

# The source language must be set before tokenizing.
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Force the decoder to start with the target-language token (here Kikuyu).
tgt_id = tokenizer.convert_tokens_to_ids("kik_Latn")

translated = model.generate(**inputs, forced_bos_token_id=tgt_id, max_new_tokens=128)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```

Product

This model powers Tarjumi — a free translation service for African mother-tongue languages.

Built by CIF AI.
