Overview
THiNK’s Luo–Swahili Translation Model is a fine-tuned T5-small that translates between Luo (dav) and Swahili (swa). It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya (Tech Innovators Network Kenya) is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.
Model Details
- Model name: thinkKenya/luo_swa_translation_model
- Architecture: T5-small (≈60.5 M parameters; base checkpoint: google-t5/t5-small)
- Framework: Hugging Face Transformers (weights in safetensors format)
- Tensor type: fp32
- Training status: In progress (latest reported step: 146,600)
Dataset
- Dataset name: thinkKenya/kenyan-low-resource-language-data
- Task: Translation (parallel text, Parquet format)
- Languages: Luo (ISO dav) ↔ Swahili (ISO swa)
- Subset used: luo_swa (≈29.3 k total examples; train split: 21.3 k; test split: 5.33 k)
- License: CC BY 4.0
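A train/test partition like the one above can be reproduced on any parallel corpus with a deterministic shuffle-and-split. A minimal stdlib sketch (the sentence contents, the 80/20 ratio, and the seed are illustrative assumptions, not the dataset's actual recipe):

```python
import random

def split_pairs(pairs, test_fraction=0.2, seed=42):
    """Deterministically shuffle parallel sentence pairs and split them.

    A fixed seed makes the split reproducible across runs, so the same
    pairs always land in train vs. test.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]  # (train, test)

# Toy parallel corpus standing in for the real Luo-Swahili pairs.
corpus = [(f"luo sentence {i}", f"swa sentence {i}") for i in range(100)]
train, test = split_pairs(corpus)
print(len(train), len(test))  # 80 20
```

The real dataset ships pre-split in Parquet, so in practice you would load the published train/test splits rather than re-splitting.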
Organization
- Name: Tech Innovators Network Kenya (thinkKenya)
- Website: think.ke
- Hugging Face Org: thinkKenya
- Founded: 2019
- Mission: To accelerate digital transformation and applied open innovation in Kenya, with special emphasis on building AI solutions for African local languages.
Training Configuration
| Component | Details |
|---|---|
| Model weights | model.safetensors (242 MB) |
| Tokenizer files | tokenizer.json, special_tokens_map.json, tokenizer_config.json |
| Config file | config.json |
| Training args | training_args.bin |
| Software versions | transformers ≥ 4.x, datasets ≥ 2.x |
Example Usage
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("thinkKenya/luo_swa_translation_model")
model = T5ForConditionalGeneration.from_pretrained("thinkKenya/luo_swa_translation_model")

# The model expects a task prefix naming the translation direction.
input_text = "translate Luo to Swahili: Wuki ghwa choki"
inputs = tokenizer(input_text, return_tensors="pt")
# Without max_new_tokens, generate() falls back to a short default length
# and may truncate longer translations.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Limitations
- Ongoing fine-tuning: Outputs may still be unstable until training completes.
- Domain coverage: Trained on conversational and narrative sentences—performance may drop on highly specialized or out-of-domain text.
- No public benchmarks yet: Users are encouraged to evaluate with their own BLEU/ROUGE metrics.
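Since no official benchmarks are published, below is a self-contained corpus-BLEU sketch for rough self-evaluation (standard BLEU-4 with uniform weights and a brevity penalty, no smoothing; for publishable numbers a maintained implementation such as sacreBLEU is preferable, and the Swahili strings here are placeholder examples, not model outputs):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. One reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # unsmoothed: any empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)

hyps = ["habari ya asubuhi rafiki", "karibu nyumbani tena leo"]
refs = ["habari ya asubuhi rafiki", "karibu nyumbani tena leo"]
print(corpus_bleu(hyps, refs))  # identical output -> 1.0
```

In practice you would fill `hyps` with model translations of the test split and `refs` with the gold Swahili sentences.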
License
This model and the underlying dataset are released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Suggested attribution for this model:
“Luo–Swahili Translation Model, thinkKenya (Tech Innovators Network Kenya), CC BY 4.0, https://huggingface.co/thinkKenya/luo_swa_translation_model”
Citation
```bibtex
@misc{luo_swa_translation_model,
  title        = {Luo–Swahili Translation Model},
  author       = {thinkKenya (Tech Innovators Network Kenya)},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thinkKenya/luo_swa_translation_model}},
}
```