Overview
THiNK’s Luo–Swahili Translation Model is a fine-tuned T5-small that translates between Luo (dav) and Swahili (swa). It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya (Tech Innovators Network Kenya) is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.
Model Details
- Model name: thinkKenya/luo_swa_translation_model
- Architecture: T5-small (≈60.5 M parameters; base checkpoint: google-t5/t5-small)
- Framework: Hugging Face Transformers (weights in safetensors format)
- Tensor type: fp32
- Training status: In progress (latest reported step: 146,600)
Dataset
- Dataset name: thinkKenya/kenyan-low-resource-language-data
- Task: Translation (parallel text, Parquet format)
- Languages: Luo (ISO dav) ↔ Swahili (ISO swa)
- Subset used: luo_swa (≈29.3 k total examples; train split: 21.3 k; test split: 5.33 k)
- License: CC BY 4.0
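A train/test partition like the one above can be reproduced on any parallel corpus with a deterministic shuffle-and-split. A minimal stdlib sketch (the sentence contents, the 80/20 ratio, and the seed are illustrative assumptions, not the dataset's actual recipe):

```python
import random

def split_pairs(pairs, test_fraction=0.2, seed=42):
    """Deterministically shuffle parallel sentence pairs and split them.

    A fixed seed makes the split reproducible across runs, so the same
    pairs always land in train vs. test.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]  # (train, test)

# Toy parallel corpus standing in for the real Luo-Swahili pairs.
corpus = [(f"luo sentence {i}", f"swa sentence {i}") for i in range(100)]
train, test = split_pairs(corpus)
print(len(train), len(test))  # 80 20
```

The real dataset ships pre-split in Parquet, so in practice you would load the published train/test splits rather than re-splitting.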
Organization
- Name: Tech Innovators Network Kenya (thinkKenya)
- Website: think.ke
- Hugging Face Org: thinkKenya
- Founded: 2019
- Mission: To accelerate digital transformation and applied open innovation in Kenya, with special emphasis on building AI solutions for African local languages.
Training Configuration
| Component | Details |
|---|---|
| Model weights | model.safetensors (242 MB) |
| Tokenizer files | tokenizer.json, special_tokens_map.json, tokenizer_config.json |
| Config file | config.json |
| Training args | training_args.bin |
| Software versions | transformers ≥ 4.x, datasets ≥ 2.x |
Example Usage
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("thinkKenya/luo_swa_translation_model")
model = T5ForConditionalGeneration.from_pretrained("thinkKenya/luo_swa_translation_model")

# The model expects a task prefix naming the translation direction.
input_text = "translate Luo to Swahili: Wuki ghwa choki"
inputs = tokenizer(input_text, return_tensors="pt")
# Without max_new_tokens, generate() falls back to a short default length
# and may truncate longer translations.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Limitations
- Ongoing fine-tuning: Outputs may still be unstable until training completes.
- Domain coverage: Trained on conversational and narrative sentences—performance may drop on highly specialized or out-of-domain text.
- No public benchmarks yet: Users are encouraged to evaluate with their own BLEU/ROUGE metrics.
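Since no official benchmarks are published, below is a self-contained corpus-BLEU sketch for rough self-evaluation (standard BLEU-4 with uniform weights and a brevity penalty, no smoothing; for publishable numbers a maintained implementation such as sacreBLEU is preferable, and the Swahili strings here are placeholder examples, not model outputs):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. One reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # unsmoothed: any empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)

hyps = ["habari ya asubuhi rafiki", "karibu nyumbani tena leo"]
refs = ["habari ya asubuhi rafiki", "karibu nyumbani tena leo"]
print(corpus_bleu(hyps, refs))  # identical output -> 1.0
```

In practice you would fill `hyps` with model translations of the test split and `refs` with the gold Swahili sentences.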
License
This model and the underlying dataset are released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Suggested attribution for this model:
“Luo–Swahili Translation Model, thinkKenya (Tech Innovators Network Kenya), CC BY 4.0, https://huggingface.co/thinkKenya/luo_swa_translation_model”
Citation
```bibtex
@misc{luo_swa_translation_model,
  title        = {Luo–Swahili Translation Model},
  author       = {thinkKenya (Tech Innovators Network Kenya)},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thinkKenya/luo_swa_translation_model}},
}
```