---
language:
- sw
base_model:
- google-t5/t5-small
pipeline_tag: translation
---


## Overview

THiNK’s **Luo–Swahili Translation Model** is a fine-tuned T5-small that translates between Luo (ISO 639-3 `luo`) and Swahili (ISO 639-3 `swa`). It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya, also known as Tech Innovators Network Kenya, is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.

## Model Details

* **Model name**: [thinkKenya/luo\_swa\_translation\_model](https://huggingface.co/thinkKenya/luo_swa_translation_model)
* **Architecture**: T5-small (≈60.5 M parameters; base checkpoint: [google-t5/t5-small](https://huggingface.co/google-t5/t5-small))
* **Framework**: Hugging Face Transformers (weights in [safetensors format](https://huggingface.co/thinkKenya/luo_swa_translation_model/tree/main))
* **Tensor type**: fp32
* **Training status**: In progress (latest reported step: 146,600)

## Dataset

* **Dataset name**: [thinkKenya/kenyan-low-resource-language-data](https://huggingface.co/datasets/thinkKenya/kenyan-low-resource-language-data)
* **Task**: Translation (parallel text, Parquet format)
* **Languages**: Luo (ISO 639-3 `luo`) ↔ Swahili (ISO 639-3 `swa`)
* **Subset used**: `luo_swa` (≈29.3 k total examples; train split: 21.3 k; test split: 5.33 k)
* **License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
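
As a quick sanity check on the split sizes quoted above, the short Python sketch below computes each split's share of the subset. It uses the rounded figures from this card, so the results are approximate, and the variable names are illustrative. The train and test splits do not add up to the stated total; this card does not say where the remaining examples sit (a validation split is one possibility, but that is an assumption).

```python
# Rounded split sizes as quoted in this card (approximate).
total_examples = 29_300
train_examples = 21_300
test_examples = 5_330

train_share = train_examples / total_examples
test_share = test_examples / total_examples

# Examples not covered by the two listed splits.
unlisted_examples = total_examples - train_examples - test_examples
unlisted_share = unlisted_examples / total_examples

print(f"train: {train_share:.1%}, test: {test_share:.1%}")
print(f"not in train or test: {unlisted_examples:,} ({unlisted_share:.1%})")
```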

## Organization: Tech Innovators Network Kenya (thinkKenya)

* **Website**: [think.ke](https://think.ke)
* **Hugging Face Org**: [thinkKenya](https://huggingface.co/thinkKenya)
* **Founded**: 2019
* **Mission**: To accelerate digital transformation and applied open innovation in Kenya, with special emphasis on building AI solutions for African local languages.

## Training Configuration

| Component         | Details                                                                                               |
| ----------------- | ----------------------------------------------------------------------------------------------------- |
| Model weights     | [`model.safetensors`](https://huggingface.co/thinkKenya/luo_swa_translation_model/tree/main) (242 MB) |
| Tokenizer files   | `tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json`                                  |
| Config file       | `config.json`                                                                                         |
| Training args     | `training_args.bin`                                                                                   |
| Software versions | `transformers` ≥ 4.x, `datasets` ≥ 2.x                                                                |
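
The 242 MB weight file is consistent with the parameter count and tensor type listed under Model Details, since each fp32 parameter takes 4 bytes. A rough check in Python, ignoring the small safetensors header and using decimal megabytes (variable names are illustrative):

```python
num_parameters = 60_500_000   # ≈60.5 M, as listed under Model Details
bytes_per_fp32 = 4            # one 32-bit float per parameter

size_mb = num_parameters * bytes_per_fp32 / 1_000_000  # decimal megabytes
print(f"estimated fp32 checkpoint size: {size_mb:.0f} MB")

# Half precision would need 2 bytes per parameter, so roughly half the size.
size_fp16_mb = num_parameters * 2 / 1_000_000
print(f"estimated fp16 checkpoint size: {size_fp16_mb:.0f} MB")
```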

## Example Usage

```python
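from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "thinkKenya/luo_swa_translation_model"

# AutoTokenizer loads the fast tokenizer from tokenizer.json; the slow
# T5Tokenizer needs a SentencePiece spiece.model file, which is not listed
# among this repo's tokenizer files.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# T5-style task prefix followed by the source sentence.
input_text = "translate Luo to Swahili: Wuki ghwa choki"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)  # cap the output length
print(tokenizer.decode(outputs[0], skip_special_tokens=True))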
```
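
For more than one sentence, it is usually more convenient to translate a padded batch. The sketch below is one possible way to do that, not the authors' pipeline: `build_prompts` and `translate_batch` are hypothetical helper names, the task prefix is copied from the example above, and the `num_beams` and `max_new_tokens` values are illustrative assumptions rather than tuned settings.

```python
def build_prompts(sentences, prefix="translate Luo to Swahili: "):
    """Prepend the task prefix from this card's example to each sentence."""
    return [prefix + sentence.strip() for sentence in sentences]


def translate_batch(sentences, model_id="thinkKenya/luo_swa_translation_model"):
    """Translate a list of source sentences in one padded batch."""
    # Imported lazily so build_prompts works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Pad to the longest prompt so the batch forms a rectangular tensor.
    inputs = tokenizer(build_prompts(sentences), return_tensors="pt",
                       padding=True, truncation=True)
    # Beam search settings are illustrative, not values from the training run.
    outputs = model.generate(**inputs, num_beams=4, max_new_tokens=64)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `translate_batch(["Wuki ghwa choki"])` downloads the checkpoint on first use and returns a list with one translated string per input sentence.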

## Limitations

* **Ongoing fine-tuning**: Outputs may still be unstable until training completes.
* **Domain coverage**: Trained on conversational and narrative sentences; performance may drop on highly specialized or out-of-domain text.
* **No public benchmarks yet**: Users are encouraged to evaluate with their own BLEU/ROUGE metrics.
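
Since no benchmark scores are published, a self-run BLEU check is one way to get a first impression. The sketch below is a simplified, dependency-free corpus-level BLEU (whitespace tokenization, one reference per hypothesis, no smoothing); the function names are illustrative and the scores will not match those of an established tool, which is preferable for any reported result.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU in [0, 1]: one reference per hypothesis,
    whitespace tokenization, no smoothing."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter & Counter keeps the minimum count per n-gram (clipping).
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any order with no match gives zero

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

To use it, pass the model's outputs for the test split as `hypotheses` and the corresponding Swahili reference sentences as `references`, in the same order.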

## License

This model and the underlying dataset are released under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

You are free to:

* **Share** — copy and redistribute the material in any medium or format
* **Adapt** — remix, transform, and build upon the material for any purpose, even commercially

**Under the following terms:**

* **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
* **No additional restrictions** — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

**Suggested attribution for this model:**

> “Luo–Swahili Translation Model, thinkKenya (Tech Innovators Network Kenya), CC BY 4.0, [https://huggingface.co/thinkKenya/luo_swa_translation_model](https://huggingface.co/thinkKenya/luo_swa_translation_model)”


## Citation

```bibtex
@misc{luo_swa_translation_model,
  title        = {Luo–Swahili Translation Model},
  author       = {thinkKenya (Tech Innovators Network Kenya)},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thinkKenya/luo_swa_translation_model}},
}
```