Overview

THiNK’s Luo–Swahili Translation Model is a fine-tuned T5-small designed to translate between the Luo (luo) and Swahili (swa) languages. It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya, also known as Tech Innovators Network Kenya, is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.

Model Details

Dataset

  • Dataset name: thinkKenya/kenyan-low-resource-language-data
  • Task: Translation (parallel text, Parquet format)
  • Languages: Luo (ISO 639-3 luo) ↔ Swahili (ISO 639-3 swa)
  • Subset used: luo_swa (≈29.3 k total examples; train split: 21.3 k; test split: 5.33 k)
  • License: CC BY 4.0
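Assuming the dataset follows the standard Hugging Face `datasets` layout, with the subset name `luo_swa` listed above as the config, it can be loaded roughly as follows (a sketch; the exact column names are not specified here):

```python
DATASET_NAME = "thinkKenya/kenyan-low-resource-language-data"
SUBSET = "luo_swa"  # Luo-Swahili parallel subset listed above

def load_luo_swa(split=None):
    """Fetch the parallel corpus; downloads the Parquet files on first call."""
    # Lazy import so the snippet can be read without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(DATASET_NAME, SUBSET, split=split)

# Usage (requires network access):
# ds = load_luo_swa()
# print(ds["train"].num_rows, ds["test"].num_rows)
```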

Organization: Tech Innovators Network Kenya (thinkKenya)

  • Website: think.ke
  • Hugging Face Org: thinkKenya
  • Founded: 2019
  • Mission: To accelerate digital transformation and applied open innovation in Kenya, with special emphasis on building AI solutions for African local languages.

Training Configuration

  • Model weights: model.safetensors (242 MB)
  • Tokenizer files: tokenizer.json, special_tokens_map.json, tokenizer_config.json
  • Config file: config.json
  • Training args: training_args.bin
  • Software versions: transformers ≥ 4.x, datasets ≥ 2.x

Example Usage

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("thinkKenya/luo_swa_translation_model")
model = T5ForConditionalGeneration.from_pretrained("thinkKenya/luo_swa_translation_model")

# T5 expects a textual task prefix before the source sentence.
input_text = "translate Luo to Swahili: Wuki ghwa choki"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)  # cap the output length
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
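Because T5 conditions on that textual task prefix, translating a batch of sentences amounts to prepending the prefix to each one before tokenizing. The helper below is hypothetical (not part of the released model), illustrating the convention:

```python
def build_inputs(sentences, direction="Luo to Swahili"):
    """Prepend the T5 task prefix used in the example above to each sentence."""
    return [f"translate {direction}: {s}" for s in sentences]

# Then tokenize with padding and generate as above, e.g.:
# batch = tokenizer(build_inputs(["Wuki ghwa choki"]), return_tensors="pt", padding=True)
```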

Limitations

  • Ongoing fine-tuning: Outputs may still be unstable until training completes.
  • Domain coverage: Trained on conversational and narrative sentences—performance may drop on highly specialized or out-of-domain text.
  • No public benchmarks yet: Users are encouraged to evaluate with their own BLEU/ROUGE metrics.
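For a quick self-evaluation without extra dependencies, a simplified sentence-level BLEU can be computed in pure Python; for reported numbers, prefer a maintained implementation such as sacrebleu. A minimal sketch:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU: uniform n-gram weights, brevity penalty,
    and a tiny floor to smooth zero-count precisions."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_ng, hyp_ng = _ngrams(ref, n), _ngrams(hyp, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped matches
        total = max(sum(hyp_ng.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and a fully disjoint hypothesis scores near 0.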

License

This model and the underlying dataset are released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
  • No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Suggested attribution for this model:

“Luo–Swahili Translation Model, thinkKenya (Tech Innovators Network Kenya), CC BY 4.0, https://huggingface.co/thinkKenya/luo_swa_translation_model”

Citation

@misc{luo_swa_translation_model,
  title        = {Luo–Swahili Translation Model},
  author       = {thinkKenya (Tech Innovators Network Kenya)},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thinkKenya/luo_swa_translation_model}},
}
Base Model

  • Fine-tuned from: google-t5/t5-small
  • Model size: 60.5 M parameters (F32, safetensors)