nickdee96 committed on
Commit
491f573
·
verified ·
1 Parent(s): bfca9b7

Update README.md

  1. README.md +78 -1
README.md CHANGED
@@ -4,4 +4,81 @@ language:
4
  base_model:
5
  - google-t5/t5-small
6
  pipeline_tag: translation
7
- ---
4
  base_model:
5
  - google-t5/t5-small
6
  pipeline_tag: translation
7
+ ---
8
+
9
+ ## Overview
10
+
11
+ thinkKenya’s **Luo–Swahili Translation Model** is a fine-tuned T5-small designed to translate between Luo (ISO 639-3: luo) and Swahili (ISO 639-3: swa). It was trained on a parallel corpus of approximately 29,300 Luo–Swahili sentence pairs drawn from the larger “Kenyan Low-Resource Language Data” dataset. thinkKenya, also known as Tech Innovators Network Kenya, is a community-driven technology initiative founded in 2019 to support digital transformation and applied open innovation across Kenya, with a special focus on African local languages.
12
+
13
+ ## Model Details
14
+
15
+ * **Model name**: `thinkKenya/luo_swa_translation_model`
16
+ * **Architecture**: T5-small (60.5M parameters) ([Hugging Face][1])
17
+ * **Framework**: Hugging Face Transformers (safetensors weights) ([Hugging Face][1])
18
+ * **Base checkpoint**: `google-t5/t5-small` ([Hugging Face][1])
19
+ * **Tensor type**: F32 ([Hugging Face][1])
20
+ * **Training status**: In progress (latest reported step: 146,600) ([Hugging Face][1])
21
+
22
+ ## Dataset
23
+
24
+ * **Dataset name**: `thinkKenya/kenyan-low-resource-language-data` ([Hugging Face][2])
25
+ * **Task**: Translation (parallel text) ([Hugging Face][2])
26
+ * **Languages**: Swahili (ISO: swa) ([Hugging Face][2])
27
+ * **Subset used**: `luo_swa` (29.3k rows total; train split: 21.3k rows; test split: 5.33k rows) ([Hugging Face][2])
28
+ * **Format**: Parquet
29
+ * **License**: CC-BY-4.0 ([Hugging Face][2])
30
+
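As a rough sketch of how the subset listed above might be loaded, the snippet below passes `luo_swa` as the configuration name to `datasets.load_dataset`. That mapping from subset to configuration name is an assumption, and the helper names `load_luo_swa` and `unaccounted_rows` are hypothetical. The second helper checks the reported row counts: the train and test figures sum to about 26.6k, so roughly 2.7k of the 29.3k rows are not covered by the two splits named here.

```python
from typing import Dict

DATASET_ID = "thinkKenya/kenyan-low-resource-language-data"
SUBSET = "luo_swa"  # assumption: the subset name doubles as the config name

# Rounded row counts as reported in the Dataset section.
REPORTED_ROWS: Dict[str, int] = {"total": 29_300, "train": 21_300, "test": 5_330}


def unaccounted_rows(rows: Dict[str, int]) -> int:
    """Rows in the total that the listed train and test splits do not cover."""
    return rows["total"] - rows["train"] - rows["test"]


def load_luo_swa():
    """Download the subset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the row-count helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, SUBSET)
```

Printing the object returned by `load_luo_swa()` shows the available splits and their column names, which are not documented here and should be inspected before use.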
31
+ ## Organization: Tech Innovators Network Kenya (thinkKenya)
32
+
33
+ ### Mission & History
34
+
35
+ Tech Innovators Network Kenya (THiNK) is a community-driven technology initiative founded in 2019 to help businesses and citizens through their digital transformation using applied open innovation ([LinkedIn][3]). Its primary objectives include supporting local language AI, fintech solutions, and ecosystem development for innovators across Kenya ([Hugging Face][4]).
36
+
37
+ ### Key Focus Areas
38
+
39
+ * **African local languages**: Building datasets and models for under-resourced languages ([Hugging Face][5])
40
+ * **Digital transformation**: Consulting and technology services for Kenyan businesses ([LinkedIn][3])
41
+ * **Community building**: Convening forums like the AI Community of Practice to foster collaboration ([Community of Practitioners in AI][6])
42
+
43
+ ## Training Configuration
44
+
45
+ | Component | Details |
46
+ | ------------------ | -------------------------------------------------------------------- |
47
+ | Model weights | `model.safetensors` (242 MB) |
48
+ | Tokenizer files | `tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json` |
49
+ | Config file | `config.json` |
50
+ | Training args | `training_args.bin` |
51
+ | Framework versions | Transformers ≥4.x, Datasets ≥2.x |
52
+
53
+ ## Example Usage
54
+
55
+ ```python
56
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
57
+
58
+ tokenizer = AutoTokenizer.from_pretrained("thinkKenya/luo_swa_translation_model")
59
+ model = AutoModelForSeq2SeqLM.from_pretrained("thinkKenya/luo_swa_translation_model")
60
+
61
+ input_text = "translate Luo to Swahili: Wuki ghwa choki"
62
+ inputs = tokenizer(input_text, return_tensors="pt")
63
+ outputs = model.generate(**inputs)
64
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
65
+ ```
66
+
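The example above translates one sentence with default generation settings. A minimal sketch of a batched variant follows, assuming the same `translate Luo to Swahili: ` prefix applies to every input. The helper names `build_prompt` and `translate` and the values `max_new_tokens=64` and `num_beams=4` are illustrative choices, not settings documented for this model.

```python
from typing import List

MODEL_ID = "thinkKenya/luo_swa_translation_model"
TASK_PREFIX = "translate Luo to Swahili: "  # prefix copied from the example above


def build_prompt(sentence: str) -> str:
    """Prepend the task prefix and collapse stray whitespace in the input."""
    return TASK_PREFIX + " ".join(sentence.split())


def translate(sentences: List[str], max_new_tokens: int = 64, num_beams: int = 4) -> List[str]:
    """Translate a batch of sentences (downloads the checkpoint on first use)."""
    # Imported lazily so build_prompt stays usable without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

    prompts = [build_prompt(s) for s in sentences]
    # Pad to the longest prompt so the whole batch fits in one tensor.
    batch = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    # An explicit token budget avoids relying on the default length limit,
    # which can cut longer translations short.
    outputs = model.generate(**batch, max_new_tokens=max_new_tokens, num_beams=num_beams)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `translate(["Wuki ghwa choki"])` returns a list with one decoded string. Beam search is usually slower than greedy decoding, so `num_beams=1` may be preferable for quick checks.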
67
+ ## Limitations
68
+
69
+ * **Training in progress**: May produce unstable or under-trained outputs due to ongoing fine-tuning ([Hugging Face][1]).
70
+ * **Domain coverage**: Limited to conversational and narrative sentences present in the corpus; out-of-domain text may yield poor translations.
71
+ * **Evaluation metrics**: No public BLEU/ROUGE scores yet; users should perform their own evaluations.
72
+
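Since no scores are published, one way to run a first evaluation is to compare model outputs against held-out reference translations. The sketch below is a simplified corpus-level BLEU over whitespace tokens, with no smoothing and one reference per sentence. It is not a replacement for an established tool such as sacreBLEU, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU in [0, 1] with a single reference per hypothesis."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-grams, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(totals) == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

Because the score is unsmoothed, it is only meaningful over a reasonably large test set; short or single-sentence inputs often score zero.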
73
+ ## Citation
74
+
75
+ ```bibtex
76
+ @misc{luo_swa_translation_model,
77
+ title = {Luo–Swahili Translation Model},
78
+ author = {thinkKenya (Tech Innovators Network Kenya)},
79
+ year = {2024},
80
+ publisher = {Hugging Face},
81
+ howpublished = {\url{https://huggingface.co/thinkKenya/luo_swa_translation_model}},
82
+ }
83
+ ```
84
+