thinkKenya/luo_swa_translation_model

Translation

Safetensors

Swahili

nickdee96 committed on May 6, 2025

Commit 98b47e4 (verified) · 1 Parent(s): 491f573

Update README.md

Files changed (1)

README.md +25 -31

README.md CHANGED

@@ -6,49 +6,43 @@ base_model:
 pipeline_tag: translation
 ---
 ## Overview
 thinkKenya’s **Luo–Swahili Translation Model** is a fine-tuned T5-small designed to translate between the Luo (dav) and Swahili (swa) languages. It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya—also known as Tech Innovators Network Kenya—is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.
 ## Model Details
-* **Model name**: `thinkKenya/luo_swa_translation_model`
-* **Architecture**: T5-small (60.5M parameters) ([Hugging Face][1])
-* **Framework**: Hugging Face Transformers (safetensors weights) ([Hugging Face][1])
-* **Base checkpoint**: `google-t5/t5-small` ([Hugging Face][1])
-* **Tensor type**: F32 ([Hugging Face][1])
-* **Training status**: In progress (latest reported step: 146,600) ([Hugging Face][1])
 ## Dataset
-* **Dataset name**: `thinkKenya/kenyan-low-resource-language-data` ([Hugging Face][2])
-* **Task**: Translation (parallel text) ([Hugging Face][2])
-* **Languages**: Swahili (ISO: swa) ([Hugging Face][2])
-* **Subset used**: `luo_swa` (29.3k rows total; train split: 21.3k rows; test split: 5.33k rows) ([Hugging Face][2], [Hugging Face][2])
-* **Format**: Parquet
-* **License**: CC-BY-4.0 ([Hugging Face][2])
 ## Organization: Tech Innovators Network Kenya (thinkKenya)
-### Mission & History
-Tech Innovators Network Kenya (THiNK) is a community-driven technology initiative founded in 2019, aimed at assisting businesses and citizens in their digital transformation journeys through applied open innovation ([LinkedIn][3]). Their primary objectives include supporting local language AI, fintech solutions, and ecosystem development for innovators across Kenya ([Hugging Face][4]).
-### Key Focus Areas
-* **African local languages**: Building datasets and models for under-resourced languages ([Hugging Face][5])
-* **Digital transformation**: Consulting and technology services for Kenyan businesses ([LinkedIn][3])
-* **Community building**: Convening forums like the AI Community of Practice to foster collaboration ([Community of Practitioners in AI][6])
 ## Training Configuration
-| Component          | Details                                                              |
-| ------------------ | -------------------------------------------------------------------- |
-| Model weights      | `model.safetensors` (242 MB)                                         |
-| Tokenizer files    | `tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json` |
-| Config file        | `config.json`                                                        |
-| Training args      | `training_args.bin`                                                  |
-| Framework versions | Transformers ≥4.x, Datasets ≥2.x                                     |
 ## Example Usage
@@ -66,9 +60,9 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ## Limitations
-* **Training in progress**: May produce unstable or under-trained outputs due to ongoing fine-tuning ([Hugging Face][1]).
-* **Domain coverage**: Limited to conversational and narrative sentences present in the corpus—out-of-domain text may yield poor translations.
-* **Evaluation metrics**: No public BLEU/ROUGE scores yet; users should perform their own evaluations.
 ## Citation

 pipeline_tag: translation
 ---
 ## Overview
 thinkKenya’s **Luo–Swahili Translation Model** is a fine-tuned T5-small designed to translate between the Luo (dav) and Swahili (swa) languages. It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya—also known as Tech Innovators Network Kenya—is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.
 ## Model Details
+* **Model name**: [thinkKenya/luo\_swa\_translation\_model](https://huggingface.co/thinkKenya/luo_swa_translation_model)
+* **Architecture**: T5-small (≈60.5 M parameters; base checkpoint: [google/t5-small](https://huggingface.co/google/t5-small))
+* **Framework**: Hugging Face Transformers (weights in [safetensors format](https://huggingface.co/thinkKenya/luo_swa_translation_model/tree/main))
+* **Tensor type**: fp32
+* **Training status**: In progress (latest reported step: 146,600)
 ## Dataset
+* **Dataset name**: [thinkKenya/kenyan-low-resource-language-data](https://huggingface.co/datasets/thinkKenya/kenyan-low-resource-language-data)
+* **Task**: Translation (parallel text, Parquet format)
+* **Languages**: Luo (ISO `dav`) ↔ Swahili (ISO `swa`)
+* **Subset used**: `luo_swa` (≈29.3 k total examples; train split: 21.3 k; test split: 5.33 k)
+* **License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
 ## Organization: Tech Innovators Network Kenya (thinkKenya)
+* **Website**: [think.ke](https://think.ke)
+* **Hugging Face Org**: [thinkKenya](https://huggingface.co/thinkKenya)
+* **Founded**: 2019
+* **Mission**: To accelerate digital transformation and applied open innovation in Kenya, with special emphasis on building AI solutions for African local languages.
 ## Training Configuration
+| Component         | Details                                                                                               |
+| ----------------- | ----------------------------------------------------------------------------------------------------- |
+| Model weights     | [`model.safetensors`](https://huggingface.co/thinkKenya/luo_swa_translation_model/tree/main) (242 MB) |
+| Tokenizer files   | `tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json`                                  |
+| Config file       | `config.json`                                                                                         |
+| Training args     | `training_args.bin`                                                                                   |
+| Software versions | `transformers` ≥ 4.x, `datasets` ≥ 2.x                                                                |
 ## Example Usage
 ## Limitations
+* **Ongoing fine-tuning**: Outputs may still be unstable until training completes.
+* **Domain coverage**: Trained on conversational and narrative sentences—performance may drop on highly specialized or out-of-domain text.
+* **No public benchmarks yet**: Users are encouraged to evaluate with their own BLEU/ROUGE metrics.
 ## Citation
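
The Limitations entry in the diff above encourages users to evaluate with their own BLEU/ROUGE metrics. As a rough illustration only (this sketch is not part of the commit or the model card), a simplified corpus-level BLEU can be computed in plain Python. It assumes one reference per hypothesis, whitespace tokenization, and no smoothing. The function names `ngram_counts` and `corpus_bleu` and the placeholder strings are hypothetical, chosen for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU, returned as a score in [0, 1].

    Assumptions (not taken from the model card): one reference per
    hypothesis, whitespace tokenization, no smoothing.
    """
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-grams, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Counter intersection keeps the minimum count per n-gram (clipping).
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches drives the score to zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Placeholder tokens stand in for model outputs and reference translations.
    model_outputs = ["w1 w2 w3 w4 w5", "w6 w7 w8 w9 w0"]
    reference_texts = ["w1 w2 w3 w4 w5", "w6 w7 w8 w9 wx"]
    print(f"BLEU: {corpus_bleu(model_outputs, reference_texts):.3f}")
```

Scores from this sketch are not comparable with published BLEU numbers, because tokenization and smoothing choices change the result. An established tool such as sacreBLEU is the safer choice when reporting results.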