Update README.md
README.md
CHANGED
@@ -6,49 +6,43 @@ base_model:
pipeline_tag: translation
---

## Overview

thinkKenya’s **Luo–Swahili Translation Model** is a fine-tuned T5-small designed to translate between the Luo (luo) and Swahili (swa) languages. It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya, also known as Tech Innovators Network Kenya, is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.

## Model Details

-* **Model name**:
-* **Architecture**: T5-small (60.
-* **Framework**: Hugging Face Transformers (
-* **
-* **
-* **Training status**: In progress (latest reported step: 146,600) ([Hugging Face][1])

## Dataset

-* **Dataset name**:
-* **Task**: Translation (parallel text
-* **Languages**:
-* **Subset used**: `luo_swa` (29.
-* **
-* **License**: CC-BY-4.0 ([Hugging Face][2])

## Organization: Tech Innovators Network Kenya (thinkKenya)

-
-
-
-
-### Key Focus Areas
-
-* **African local languages**: Building datasets and models for under-resourced languages ([Hugging Face][5])
-* **Digital transformation**: Consulting and technology services for Kenyan businesses ([LinkedIn][3])
-* **Community building**: Convening forums like the AI Community of Practice to foster collaboration ([Community of Practitioners in AI][6])

## Training Configuration

-| Component
-| -----------------
-| Model weights
-| Tokenizer files
-| Config file
-| Training args
-|

## Example Usage

@@ -66,9 +60,9 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))

## Limitations

-* **
-* **Domain coverage**:
-* **

## Citation

pipeline_tag: translation
---

+
## Overview

thinkKenya’s **Luo–Swahili Translation Model** is a fine-tuned T5-small designed to translate between the Luo (luo) and Swahili (swa) languages. It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya, also known as Tech Innovators Network Kenya, is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.

## Model Details

+* **Model name**: [thinkKenya/luo\_swa\_translation\_model](https://huggingface.co/thinkKenya/luo_swa_translation_model)
+* **Architecture**: T5-small (≈60.5 M parameters; base checkpoint: [google/t5-small](https://huggingface.co/google/t5-small))
+* **Framework**: Hugging Face Transformers (weights in [safetensors format](https://huggingface.co/thinkKenya/luo_swa_translation_model/tree/main))
+* **Tensor type**: fp32
+* **Training status**: In progress (latest reported step: 146,600)

## Dataset

+* **Dataset name**: [thinkKenya/kenyan-low-resource-language-data](https://huggingface.co/datasets/thinkKenya/kenyan-low-resource-language-data)
+* **Task**: Translation (parallel text, Parquet format)
+* **Languages**: Luo (ISO `luo`) ↔ Swahili (ISO `swa`)
+* **Subset used**: `luo_swa` (≈29.3 k total examples; train split: 21.3 k; test split: 5.33 k)
+* **License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
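
The split sizes quoted in the list above can be cross-checked with a little arithmetic. The sketch below uses only the rounded figures from that list. Reading the remainder as a further held-out split (for example validation) is an assumption; the README does not say where those examples sit.

```python
# Rounded example counts quoted in the Dataset list.
total_examples = 29_300
train_examples = 21_300
test_examples = 5_330

# Examples not covered by the quoted train and test figures.
# Assumption: they belong to another split (e.g. validation); the source does not say.
remaining = total_examples - train_examples - test_examples


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


split_shares = {
    "train": share(train_examples, total_examples),
    "test": share(test_examples, total_examples),
    "unaccounted": share(remaining, total_examples),
}
print(remaining, split_shares)
```

Because the inputs are rounded, the resulting counts and percentages are approximate.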

## Organization: Tech Innovators Network Kenya (thinkKenya)

+* **Website**: [think.ke](https://think.ke)
+* **Hugging Face Org**: [thinkKenya](https://huggingface.co/thinkKenya)
+* **Founded**: 2019
+* **Mission**: To accelerate digital transformation and applied open innovation in Kenya, with special emphasis on building AI solutions for African local languages.

## Training Configuration

+| Component         | Details                                                                                               |
+| ----------------- | ----------------------------------------------------------------------------------------------------- |
+| Model weights     | [`model.safetensors`](https://huggingface.co/thinkKenya/luo_swa_translation_model/tree/main) (242 MB) |
+| Tokenizer files   | `tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json`                                  |
+| Config file       | `config.json`                                                                                         |
+| Training args     | `training_args.bin`                                                                                   |
+| Software versions | `transformers` ≥ 4.x, `datasets` ≥ 2.x                                                                |
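
The 242 MB weight file is consistent with the parameter count and tensor type listed under Model Details. As a rough sanity check, the sketch below multiplies the approximate parameter count by the size of one fp32 value. The helper name `estimated_size_mb` is hypothetical, and the estimate ignores the small metadata header that a safetensors file also carries.

```python
BYTES_PER_FP32 = 4  # one 32-bit float occupies four bytes


def estimated_size_mb(num_params: float, bytes_per_param: int = BYTES_PER_FP32) -> float:
    """Estimate the weight-file size in decimal megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Approximate parameter count from the Model Details list (≈60.5 M).
approx_params = 60.5e6

size_mb = estimated_size_mb(approx_params)
print(f"fp32 estimate: {size_mb:.0f} MB")

# For comparison only: the same weights at 2 bytes per parameter (fp16/bf16).
half_precision_mb = estimated_size_mb(approx_params, bytes_per_param=2)
print(f"half-precision estimate: {half_precision_mb:.0f} MB")
```

The same calculation gives a quick way to judge how much memory the weights need before downloading them.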

## Example Usage

## Limitations

+* **Ongoing fine-tuning**: Outputs may still be unstable until training completes.
+* **Domain coverage**: Trained on conversational and narrative sentences; performance may drop on highly specialized or out-of-domain text.
+* **No public benchmarks yet**: Users are encouraged to evaluate with their own BLEU/ROUGE metrics.
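
Since no benchmark is published, a self-run evaluation is the only way to gauge quality. The following is a minimal sketch of a simplified corpus-level BLEU score, assuming whitespace tokenization, a single reference per sentence, and no smoothing. It is not the project's evaluation script, and its scores are not comparable with those of established tools such as sacreBLEU, which would be the better choice for reporting results. The Swahili strings are placeholder sentences, not model outputs or dataset rows.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (one reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Placeholder data for illustration only.
references = ["habari ya asubuhi rafiki yangu", "ninapenda kusoma vitabu vingi sana"]
hypotheses = ["habari ya asubuhi rafiki yangu", "ninapenda kusoma vitabu vingi"]

score = corpus_bleu(hypotheses, references)
print(f"BLEU: {score:.2f}")
```

In practice, `hypotheses` would hold the decoded model outputs for the test split and `references` the corresponding Swahili targets.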

## Citation