nickdee96 committed · Commit 98b47e4 · verified · 1 Parent(s): 491f573

Update README.md

Files changed (1): README.md (+25 -31)
README.md CHANGED
@@ -6,49 +6,43 @@ base_model:
  pipeline_tag: translation
  ---
 
 
  ## Overview
 
  thinkKenya’s **Luo–Swahili Translation Model** is a fine-tuned T5-small designed to translate between the Luo (dav) and Swahili (swa) languages. It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya—also known as Tech Innovators Network Kenya—is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.
 
  ## Model Details
 
- * **Model name**: `thinkKenya/luo_swa_translation_model`
- * **Architecture**: T5-small (60.5M parameters) ([Hugging Face][1])
- * **Framework**: Hugging Face Transformers (safetensors weights) ([Hugging Face][1])
- * **Base checkpoint**: `google-t5/t5-small` ([Hugging Face][1])
- * **Tensor type**: F32 ([Hugging Face][1])
- * **Training status**: In progress (latest reported step: 146,600) ([Hugging Face][1])
 
  ## Dataset
 
- * **Dataset name**: `thinkKenya/kenyan-low-resource-language-data` ([Hugging Face][2])
- * **Task**: Translation (parallel text) ([Hugging Face][2])
- * **Languages**: Swahili (ISO: swa) ([Hugging Face][2])
- * **Subset used**: `luo_swa` (29.3k rows total; train split: 21.3k rows; test split: 5.33k rows) ([Hugging Face][2])
- * **Format**: Parquet
- * **License**: CC-BY-4.0 ([Hugging Face][2])
 
  ## Organization: Tech Innovators Network Kenya (thinkKenya)
 
- ### Mission & History
-
- Tech Innovators Network Kenya (THiNK) is a community-driven technology initiative founded in 2019, aimed at assisting businesses and citizens in their digital transformation journeys through applied open innovation ([LinkedIn][3]). Their primary objectives include supporting local language AI, fintech solutions, and ecosystem development for innovators across Kenya ([Hugging Face][4]).
-
- ### Key Focus Areas
-
- * **African local languages**: Building datasets and models for under-resourced languages ([Hugging Face][5])
- * **Digital transformation**: Consulting and technology services for Kenyan businesses ([LinkedIn][3])
- * **Community building**: Convening forums like the AI Community of Practice to foster collaboration ([Community of Practitioners in AI][6])
 
  ## Training Configuration
 
- | Component | Details |
- | ------------------ | -------------------------------------------------------------------- |
- | Model weights | `model.safetensors` (242 MB) |
- | Tokenizer files | `tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json` |
- | Config file | `config.json` |
- | Training args | `training_args.bin` |
- | Framework versions | Transformers ≥4.x, Datasets ≥2.x |
 
  ## Example Usage
 
@@ -66,9 +60,9 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
  ## Limitations
 
- * **Training in progress**: May produce unstable or under-trained outputs due to ongoing fine-tuning ([Hugging Face][1]).
- * **Domain coverage**: Limited to conversational and narrative sentences present in the corpus—out-of-domain text may yield poor translations.
- * **Evaluation metrics**: No public BLEU/ROUGE scores yet; users should perform their own evaluations.
 
  ## Citation
 
 
  pipeline_tag: translation
  ---
 
+
  ## Overview
 
  thinkKenya’s **Luo–Swahili Translation Model** is a fine-tuned T5-small designed to translate between the Luo (dav) and Swahili (swa) languages. It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya—also known as Tech Innovators Network Kenya—is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.
 
  ## Model Details
 
+ * **Model name**: [thinkKenya/luo\_swa\_translation\_model](https://huggingface.co/thinkKenya/luo_swa_translation_model)
+ * **Architecture**: T5-small (60.5 M parameters; base checkpoint: [google-t5/t5-small](https://huggingface.co/google-t5/t5-small))
+ * **Framework**: Hugging Face Transformers (weights in [safetensors format](https://huggingface.co/thinkKenya/luo_swa_translation_model/tree/main))
+ * **Tensor type**: fp32
+ * **Training status**: In progress (latest reported step: 146,600)
 
 
  ## Dataset
 
+ * **Dataset name**: [thinkKenya/kenyan-low-resource-language-data](https://huggingface.co/datasets/thinkKenya/kenyan-low-resource-language-data)
+ * **Task**: Translation (parallel text, Parquet format)
+ * **Languages**: Luo (ISO `dav`) ↔ Swahili (ISO `swa`)
+ * **Subset used**: `luo_swa` (29.3 k total examples; train split: 21.3 k; test split: 5.33 k)
+ * **License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
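
For readers of the updated card, a minimal sketch of pulling the `luo_swa` subset with the `datasets` library. The config name follows the card's "Subset used" entry; the column names `dav` and `swa` are an assumption drawn from the ISO codes listed above and should be checked against `ds["train"].features` after the first load.

```python
def check_pair(row: dict, src_col: str = "dav", tgt_col: str = "swa") -> bool:
    """True if a row holds a non-empty parallel sentence pair.

    Column names are assumed from the ISO codes on the card; verify
    them against the dataset's features on first load.
    """
    src = (row.get(src_col) or "").strip()
    tgt = (row.get(tgt_col) or "").strip()
    return bool(src) and bool(tgt)


def load_clean_pairs():
    """Download the `luo_swa` subset and drop rows with empty sides (needs network)."""
    from datasets import load_dataset

    ds = load_dataset("thinkKenya/kenyan-low-resource-language-data", "luo_swa")
    # Per the card, expect roughly 21.3k train rows and 5.33k test rows.
    return ds["train"].filter(check_pair), ds["test"].filter(check_pair)
```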
 
  ## Organization: Tech Innovators Network Kenya (thinkKenya)
 
+ * **Website**: [think.ke](https://think.ke)
+ * **Hugging Face Org**: [thinkKenya](https://huggingface.co/thinkKenya)
+ * **Founded**: 2019
+ * **Mission**: To accelerate digital transformation and applied open innovation in Kenya, with special emphasis on building AI solutions for African local languages.
 
  ## Training Configuration
 
+ | Component | Details |
+ | ----------------- | ------- |
+ | Model weights | [`model.safetensors`](https://huggingface.co/thinkKenya/luo_swa_translation_model/tree/main) (242 MB) |
+ | Tokenizer files | `tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json` |
+ | Config file | `config.json` |
+ | Training args | `training_args.bin` |
+ | Software versions | `transformers` 4.x, `datasets` 2.x |
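
The 242 MB weights file is consistent with the card's figures: 60.5 M fp32 parameters × 4 bytes ≈ 242 MB. You can sanity-check this without loading the model, because the safetensors format begins with an 8-byte little-endian header length followed by a JSON header mapping tensor names to dtype/shape/offsets. A small sketch (the local `model.safetensors` path in `inspect_checkpoint` is hypothetical):

```python
import json
import struct


def read_safetensors_header(blob: bytes) -> dict:
    """Parse the JSON header at the start of a safetensors byte stream:
    an 8-byte little-endian length, then that many bytes of JSON."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len].decode("utf-8"))


def total_params(header: dict) -> int:
    """Sum tensor element counts, skipping the optional __metadata__ entry."""
    total = 0
    for name, info in header.items():
        if name == "__metadata__":
            continue
        count = 1
        for dim in info["shape"]:
            count *= dim
        total += count
    return total


def inspect_checkpoint(path: str = "model.safetensors") -> int:
    """Count parameters by reading only the header region of a local file."""
    with open(path, "rb") as f:
        blob = f.read(16 * 1024 * 1024)  # comfortably larger than any header
    return total_params(read_safetensors_header(blob))
```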
 
  ## Example Usage
 
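
The usage snippet itself is unchanged by this commit (only its closing `print(tokenizer.decode(...))` line appears as hunk context above). For reference, a minimal end-to-end sketch assuming the standard `transformers` seq2seq API; the `"translate Luo to Swahili: "` task prefix is an assumption based on T5 convention, not confirmed by the card:

```python
MODEL_ID = "thinkKenya/luo_swa_translation_model"
# Assumed T5-style task prefix; verify against the model's training setup.
PREFIX = "translate Luo to Swahili: "


def build_input(text: str, prefix: str = PREFIX) -> str:
    """Prepend the (assumed) task prefix to a source sentence."""
    return prefix + text


def translate(text: str, max_new_tokens: int = 64) -> str:
    """Load the checkpoint and translate one sentence (downloads ~242 MB)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_input(text), return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling `translate("Ber ahinya.")` fetches the checkpoint on first use; for batch translation, cache the tokenizer and model instead of reloading per call.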
 
  ## Limitations
 
+ * **Ongoing fine-tuning**: Outputs may still be unstable until training completes.
+ * **Domain coverage**: Trained on conversational and narrative sentences—performance may drop on highly specialized or out-of-domain text.
+ * **No public benchmarks yet**: Users are encouraged to evaluate with their own BLEU/ROUGE metrics.
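
Since the card asks users to run their own BLEU evaluations, here is a self-contained sketch of corpus-level BLEU-4 (clipped n-gram precisions, geometric mean, brevity penalty) in pure Python. It assumes one whitespace-tokenized reference per hypothesis and applies no smoothing; for reportable scores, prefer an established tool such as sacreBLEU.

```python
import math
from collections import Counter


def ngrams(tokens: list, n: int) -> list:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def corpus_bleu(references: list, hypotheses: list, max_n: int = 4) -> float:
    """Corpus-level BLEU with uniform weights and brevity penalty.

    One reference string per hypothesis string; tokens split on whitespace.
    Returns 0.0 if any n-gram order has no matches (no smoothing applied).
    """
    clipped = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        ref_len += len(r)
        hyp_len += len(h)
        for n in range(1, max_n + 1):
            ref_counts = Counter(ngrams(r, n))
            hyp_counts = Counter(ngrams(h, n))
            clipped[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if hyp_len == 0 or min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

A perfect hypothesis scores 1.0; any hypothesis with zero 4-gram overlap scores 0.0 under this unsmoothed variant.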
 
  ## Citation