Update README.md
README.md
@@ -4,4 +4,81 @@ language:
|
|
| 4 |
base_model:
|
| 5 |
- google-t5/t5-small
|
| 6 |
pipeline_tag: translation
|
| 7 |
-
---
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## Overview
|
| 10 |
+
|
| 11 |
+
thinkKenya’s **Luo–Swahili Translation Model** is a fine-tuned `google-t5/t5-small` checkpoint for translating between Luo (ISO 639-3: luo) and Swahili (ISO 639-3: swa). It was fine-tuned on the `luo_swa` subset of the larger “Kenyan Low-Resource Language Data” dataset, which holds approximately 29,300 Luo–Swahili sentence pairs across all splits. thinkKenya, also known as Tech Innovators Network Kenya, is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.
|
| 12 |
+
|
| 13 |
+
## Model Details
|
| 14 |
+
|
| 15 |
+
* **Model name**: `thinkKenya/luo_swa_translation_model`
|
| 16 |
+
* **Architecture**: T5-small (60.5M parameters) ([Hugging Face][1])
|
| 17 |
+
* **Framework**: Hugging Face Transformers (safetensors weights) ([Hugging Face][1])
|
| 18 |
+
* **Base checkpoint**: `google-t5/t5-small` ([Hugging Face][1])
|
| 19 |
+
* **Tensor type**: F32 ([Hugging Face][1])
|
| 20 |
+
* **Training status**: In progress (latest reported step: 146,600) ([Hugging Face][1])
|
| 21 |
+
|
| 22 |
+
## Dataset
|
| 23 |
+
|
| 24 |
+
* **Dataset name**: `thinkKenya/kenyan-low-resource-language-data` ([Hugging Face][2])
|
| 25 |
+
* **Task**: Translation (parallel text) ([Hugging Face][2])
|
| 26 |
+
* **Languages**: Luo and Swahili (ISO: swa) ([Hugging Face][2])
|
| 27 |
+
* **Subset used**: `luo_swa` (29.3k rows total; train split: 21.3k rows; test split: 5.33k rows) ([Hugging Face][2])
|
| 28 |
+
* **Format**: Parquet
|
| 29 |
+
* **License**: CC-BY-4.0 ([Hugging Face][2])
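
As a rough sketch of how the subset might be loaded and sanity-checked: the snippet below assumes that `luo_swa` is exposed as a configuration name of the dataset repository (the card only calls it a "subset", so this is an assumption) and that the `datasets` library is installed. `REPORTED_ROWS`, `unlisted_rows`, and `load_luo_swa` are hypothetical names introduced here. The arithmetic also shows that the train and test splits listed above do not add up to the reported total, so some rows presumably sit in a split this card does not list.

```python
# Row counts as reported on this card (rounded, so treat them as approximate).
REPORTED_ROWS = {"total": 29_300, "train": 21_300, "test": 5_330}


def unlisted_rows(reported: dict) -> int:
    """Rows not accounted for by the train and test splits listed on the card."""
    return reported["total"] - reported["train"] - reported["test"]


def load_luo_swa():
    """Download the subset from the Hub (needs network access and `datasets`).

    Assumption: "luo_swa" is a valid configuration name for this repository.
    """
    # Imported lazily so the arithmetic above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("thinkKenya/kenyan-low-resource-language-data", "luo_swa")


print(unlisted_rows(REPORTED_ROWS))  # 2670
```

Calling `load_luo_swa()` returns a `DatasetDict`; printing it lists the available splits and column names, which is the quickest way to see where the roughly 2,670 unlisted rows belong.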
|
| 30 |
+
|
| 31 |
+
## Organization: Tech Innovators Network Kenya (thinkKenya)
|
| 32 |
+
|
| 33 |
+
### Mission & History
|
| 34 |
+
|
| 35 |
+
Tech Innovators Network Kenya (THiNK) is a community-driven technology initiative, founded in 2019, that helps businesses and citizens with their digital transformation through applied open innovation ([LinkedIn][3]). Its primary objectives include supporting local-language AI, fintech solutions, and ecosystem development for innovators across Kenya ([Hugging Face][4]).
|
| 36 |
+
|
| 37 |
+
### Key Focus Areas
|
| 38 |
+
|
| 39 |
+
* **African local languages**: Building datasets and models for under-resourced languages ([Hugging Face][5])
|
| 40 |
+
* **Digital transformation**: Consulting and technology services for Kenyan businesses ([LinkedIn][3])
|
| 41 |
+
* **Community building**: Convening forums like the AI Community of Practice to foster collaboration ([Community of Practitioners in AI][6])
|
| 42 |
+
|
| 43 |
+
## Training Configuration
|
| 44 |
+
|
| 45 |
+
| Component          | Details                                                              |
| ------------------ | -------------------------------------------------------------------- |
| Model weights      | `model.safetensors` (242 MB)                                         |
| Tokenizer files    | `tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json` |
| Config file        | `config.json`                                                        |
| Training args      | `training_args.bin`                                                  |
| Framework versions | Transformers ≥4.x, Datasets ≥2.x                                     |
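
The reported weight size is consistent with the parameter count under Model Details. A quick back-of-the-envelope check, assuming 4 bytes per F32 parameter, decimal megabytes, and negligible safetensors header overhead (`estimated_size_mb` is a hypothetical helper, not part of any library):

```python
def estimated_size_mb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate checkpoint size in decimal megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000


# 60.5M parameters stored as F32 (4 bytes each).
size_mb = estimated_size_mb(60.5e6)
print(f"{size_mb:.0f} MB")  # 242 MB
```

By the same estimate, storing the weights in a 16-bit format would take roughly half that space.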
|
| 52 |
+
|
| 53 |
+
## Example Usage
|
| 54 |
+
|
| 55 |
+
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "thinkKenya/luo_swa_translation_model"

# AutoTokenizer loads the fast tokenizer from tokenizer.json; the slow
# T5Tokenizer would need a SentencePiece file that the file list does not show.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# T5 is steered by a task prefix at the start of the input text.
input_text = "translate Luo to Swahili: Wuki ghwa choki"
inputs = tokenizer(input_text, return_tensors="pt")

# Cap the output length explicitly instead of relying on the default limit.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
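
For more than one sentence, a batched variant can be sketched as follows. It reuses the task prefix from the example above. The helper names (`build_prompts`, `translate_batch`) and the generation settings (`max_new_tokens=64`, `num_beams=4`) are illustrative assumptions, not values documented for this model.

```python
from typing import List

TASK_PREFIX = "translate Luo to Swahili: "  # prefix used in the example above


def build_prompts(sentences: List[str], prefix: str = TASK_PREFIX) -> List[str]:
    """Prepend the T5 task prefix to each stripped, non-empty sentence."""
    return [prefix + s.strip() for s in sentences if s.strip()]


def translate_batch(sentences: List[str],
                    model_id: str = "thinkKenya/luo_swa_translation_model") -> List[str]:
    """Translate a list of sentences in one padded batch (downloads the model)."""
    # Imported lazily so build_prompts can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Padding aligns sentences of different lengths; truncation guards long inputs.
    inputs = tokenizer(build_prompts(sentences), return_tensors="pt",
                       padding=True, truncation=True)
    # Illustrative generation settings, not tuned for this model.
    outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Beam search usually trades some speed for more stable output; whether it helps this particular checkpoint is something to verify on held-out data.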
|
| 66 |
+
|
| 67 |
+
## Limitations
|
| 68 |
+
|
| 69 |
+
* **Training in progress**: May produce unstable or under-trained outputs due to ongoing fine-tuning ([Hugging Face][1]).
|
| 70 |
+
* **Domain coverage**: Limited to the conversational and narrative sentences present in the corpus; out-of-domain text may yield poor translations.
|
| 71 |
+
* **Evaluation metrics**: No public BLEU/ROUGE scores yet; users should perform their own evaluations.
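
Since no scores are published, one way to run such an evaluation is sketched below. The function names are hypothetical. `corpus_bleu_score` assumes the third-party `sacrebleu` package is installed, and `exact_match_rate` is only a crude sanity check, not a substitute for BLEU or chrF.

```python
from typing import List


def exact_match_rate(hypotheses: List[str], references: List[str]) -> float:
    """Fraction of hypotheses equal to their reference after whitespace normalization."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not hypotheses:
        return 0.0
    matches = sum(" ".join(h.split()) == " ".join(r.split())
                  for h, r in zip(hypotheses, references))
    return matches / len(hypotheses)


def corpus_bleu_score(hypotheses: List[str], references: List[str]) -> float:
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    import sacrebleu  # lazy import: optional third-party dependency

    # sacrebleu expects a list of reference streams, hence the extra list.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```

The hypotheses would come from running the model over the source side of the test split. The column names for source and target text are not stated on this card, so check the dataset before wiring this up.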
|
| 72 |
+
|
| 73 |
+
## Citation
|
| 74 |
+
|
| 75 |
+
```bibtex
@misc{luo_swa_translation_model,
  title        = {Luo–Swahili Translation Model},
  author       = {thinkKenya (Tech Innovators Network Kenya)},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thinkKenya/luo_swa_translation_model}},
}
```
|
| 84 |
+
|