elmadany committed on
Commit bae5a66 · verified · 1 Parent(s): f84b035

Initial model upload

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,211 @@
1
+ ---
2
+ language:
3
+ - am # Amharic
4
+ - ar # Arabic
5
+ - tw # Asante Twi
6
+ - bm # Bambara
7
+ - fr # French
8
+ - lg # Ganda
9
+ - ha # Hausa
10
+ - ig # Igbo
11
+ - rw # Kinyarwanda
12
+ - kg # Kongo
13
+ - ln # Lingala
14
+ - lu # Luba-Katanga
15
+ - mg # Malagasy
16
+ - nso # Northern Sotho
17
+ - ny # Nyanja
18
+ - om # Oromo
19
+ - pt # Portuguese
20
+ - sn # Shona
21
+ - so # Somali
22
+ - st # Southern Sotho
23
+ - sw # Swahili
24
+ - ss # Swati
25
+ - ti # Tigrinya
26
+ - ts # Tsonga
27
+ - tn # Tswana
28
+ - ak # Twi
29
+ - ve # Venda
30
+ - wo # Wolof
31
+ - xh # Xhosa
32
+ - yo # Yoruba
33
+ - zu # Zulu
34
+ - tzm # Tamazight
35
+ - sg # Sango
36
+ - din # Dinka
37
+ - ee # Ewe
38
+ - fo # Fon
39
+ - luo # Luo
40
+ - mos # Mossi
41
+ - umb # Umbundu
42
+ license: cc-by-4.0
43
+ tags:
44
+ - automatic-speech-recognition
45
+ - audio
46
+ - speech
47
+ - african-languages
48
+ - multilingual
49
+ - simba
50
+ - low-resource
51
+ - speech-recognition
52
+ - asr
53
+ datasets:
54
+ - UBC-NLP/SimbaBench
55
+ metrics:
56
+ - wer
57
+ - cer
58
+ library_name: transformers
59
+ pipeline_tag: automatic-speech-recognition
60
+ ---
61
+ <div align="center">
62
+
63
+ <img src="https://africa.dlnlp.ai/simba/images/VoC_simba" alt="VoC Simba Models Logo">
64
+
65
+
66
+ [![EMNLP 2025 Paper](https://img.shields.io/badge/EMNLP_2025-Paper-B31B1B?style=for-the-badge&logo=arxiv&logoColor=B31B1B&labelColor=FFCDD2)](https://aclanthology.org/2025.emnlp-main.559/)
67
+ [![Official Website](https://img.shields.io/badge/Official-Website-2EA44F?style=for-the-badge&logo=googlechrome&logoColor=2EA44F&labelColor=C8E6C9)](https://africa.dlnlp.ai/simba/)
68
+ [![SimbaBench](https://img.shields.io/badge/SimbaBench-Benchmark-8A2BE2?style=for-the-badge&logo=googlecharts&logoColor=8A2BE2&labelColor=E1BEE7)](#simbabench)
69
+ [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-FFD21E?style=for-the-badge&logoColor=black&labelColor=FFF9C4)](https://huggingface.co/collections/UBC-NLP/simba-speech-series)
70
+ [![YouTube Video](https://img.shields.io/badge/YouTube-Video-FF0000?style=for-the-badge&logo=youtube&logoColor=FF0000&labelColor=FFCCBC)](#demo)
71
+
72
+ </div>
73
+
74
+ ## *Bridging the Digital Divide for African AI*
75
+
76
+ **Voice of a Continent** is a comprehensive open-source ecosystem designed to bring African languages to the forefront of artificial intelligence. By providing a unified suite of benchmarking tools and state-of-the-art models, we ensure that the future of speech technology is inclusive, representative, and accessible to over a billion people.
77
+
78
+ ## Best-in-Class Multilingual Models
79
+
80
+ Introduced in our EMNLP 2025 paper *[Voice of a Continent](https://aclanthology.org/2025.emnlp-main.559/)*, the **Simba Series** represents the current state-of-the-art for African speech AI.
81
+
82
+ - **Unified Suite:** Models optimized for African languages.
83
+ - **Superior Accuracy:** Outperforms generic multilingual models by leveraging SimbaBench's high-quality, domain-diverse datasets.
84
+ - **Multitask Capability:** Designed for high performance in ASR (Automatic Speech Recognition) and TTS (Text-to-Speech).
85
+ - **Inclusion-First:** Specifically built to mitigate the "digital divide" by empowering speakers of underrepresented languages.
86
+
87
+ The **Simba** family consists of state-of-the-art models fine-tuned using SimbaBench. These models achieve superior performance by leveraging dataset quality, domain diversity, and language family relationships.
88
+
89
+ ### 🗣️✍️ Simba-ASR
90
+ > **The New Standard for African Speech-to-Text**
91
+
92
+ **🎯 Task** `Automatic Speech Recognition` — Powering high-accuracy transcription across the continent.
93
+
94
+ **🌍 Language Coverage (43 African languages)**
95
+ > **Amharic** (`amh`), **Arabic** (`ara`), **Asante Twi** (`asanti`), **Bambara** (`bam`), **Baoulé** (`bau`), **Bemba** (`bem`), **Ewe** (`ewe`), **Fanti** (`fat`), **Fon** (`fon`), **French** (`fra`), **Ganda** (`lug`), **Hausa** (`hau`), **Igbo** (`ibo`), **Kabiye** (`kab`), **Kinyarwanda** (`kin`), **Kongo** (`kon`), **Lingala** (`lin`), **Luba-Katanga** (`lub`), **Luo** (`luo`), **Malagasy** (`mlg`), **Mossi** (`mos`), **Northern Sotho** (`nso`), **Nyanja** (`nya`), **Oromo** (`orm`), **Portuguese** (`por`), **Shona** (`sna`), **Somali** (`som`), **Southern Sotho** (`sot`), **Swahili** (`swa`), **Swati** (`ssw`), **Tigrinya** (`tir`), **Tsonga** (`tso`), **Tswana** (`tsn`), **Twi** (`twi`), **Umbundu** (`umb`), **Venda** (`ven`), **Wolof** (`wol`), **Xhosa** (`xho`), **Yoruba** (`yor`), **Zulu** (`zul`), **Tamazight** (`tzm`), **Sango** (`sag`), **Dinka** (`din`).
96
+
97
+ **🏗️ Base Architectures**
98
+
99
+ - **Simba-S** (SeamlessM4T-v2-MT) — *Top Performer*
100
+ - **Simba-W** (Whisper-v3-large)
101
+ - **Simba-X** (Wav2Vec2-XLS-R-2b)
102
+ - **Simba-M** (MMS-1b-all)
103
+ - **Simba-H** (AfriHuBERT)
104
+
105
+ 🌐 Explore the Frontier
106
+
107
+ | **ASR Models** | **Architecture** | **#Parameters** | **🤗 Hugging Face Model Card** | **Status** |
108
+ |---------|:------------------:| :------------------:| :------------------:|:------------------:|
109
+ | 🔥**Simba-S**🔥| SeamlessM4T-v2 | 2.3B | 🤗 [https://huggingface.co/UBC-NLP/Simba-S](https://huggingface.co/UBC-NLP/Simba-S) | ✅ Released |
110
+ | 🔥**Simba-W**🔥| Whisper | 1.5B | 🤗 [https://huggingface.co/UBC-NLP/Simba-W](https://huggingface.co/UBC-NLP/Simba-W) | ✅ Released |
111
+ | 🔥**Simba-X**🔥| Wav2Vec2 | 1B | 🤗 [https://huggingface.co/UBC-NLP/Simba-X](https://huggingface.co/UBC-NLP/Simba-X) | ✅ Released |
112
+ | 🔥**Simba-M**🔥| MMS | 1B | 🤗 [https://huggingface.co/UBC-NLP/Simba-M](https://huggingface.co/UBC-NLP/Simba-M) | ✅ Released |
113
+ | 🔥**Simba-H**🔥| HuBERT | 94M | 🤗 [https://huggingface.co/UBC-NLP/Simba-H](https://huggingface.co/UBC-NLP/Simba-H) | ✅ Released |
114
+
115
+ * **Simba-S** emerged as the best-performing ASR model overall.
116
+
117
+
118
+ **🧩 Usage Example**
119
+
120
+ You can easily run inference using the Hugging Face `transformers` library.
121
+
122
+ ```python
123
+ from transformers import pipeline
124
+
125
+ # Load Simba-S for ASR
126
+ asr_pipeline = pipeline(
127
+ "automatic-speech-recognition",
128
+ model="UBC-NLP/Simba-S"  # Available Simba models: `UBC-NLP/Simba-S`, `UBC-NLP/Simba-W`, `UBC-NLP/Simba-X`, `UBC-NLP/Simba-H`, `UBC-NLP/Simba-M`
129
+ )
130
+
131
+ # Load the multilingual African adapter. Uncomment the next line only when using `UBC-NLP/Simba-M`.
132
+ # asr_pipeline.model.load_adapter("multilingual_african")
133
+ ###########################
134
+
135
+ # Transcribe audio from file
136
+ result = asr_pipeline("https://africa.dlnlp.ai/simba/audio/afr_Lwazi_afr_test_idx3889.wav")
137
+ print(result["text"])
138
+
139
+
140
+ # Transcribe audio from an in-memory array (`audio_array` is a 1-D float waveform at 16 kHz that you load yourself)
141
+ result = asr_pipeline({
142
+ "array": audio_array,
143
+ "sampling_rate": 16_000
144
+ })
145
+ print(result["text"])
146
+
147
+ ```
148
+
149
+ #### Example Outputs
150
+
151
+ Using the same audio file with different Simba models:
152
+
153
+ ```python
154
+ # Simba-S
155
+ {'text': 'watter verontwaardiging sou daar, in ons binneste gewees het.'}
156
+ ```
157
+
158
+ ```python
159
+ # Simba-W
160
+ {'text': 'watter veronwaardigingsel daar, in ons binneste gewees het.'}
161
+ ```
162
+
163
+ ```python
164
+ # Simba-X
165
+ {'text': 'fator fr on ar taamsodr is'}
166
+ ```
167
+
168
+ ```python
169
+ # Simba-M
170
+ {'text': 'watter veronwaardiging sodaar in ons binniste gewees het'}
171
+ ```
172
+
173
+ ```python
174
+ # Simba-H
175
+ {'text': 'watter vironwaardiging so daar in ons binneste geweeshet'}
176
+ ```
177
+
178
+ Get started with Simba models in minutes using our interactive Colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/UBC-NLP/simba/edit/main/simba_models.ipynb)
179
+
180
+
181
+ ## Citation
182
+
183
+ If you use the Simba models or the SimbaBench benchmark in a scientific publication, or if you find the resources on this website useful, please cite our paper.
184
+
185
+ ```bibtex
186
+
187
+ @inproceedings{elmadany-etal-2025-voice,
188
+ title = "Voice of a Continent: Mapping {A}frica{'}s Speech Technology Frontier",
189
+ author = "Elmadany, AbdelRahim A. and
190
+ Kwon, Sang Yun and
191
+ Toyin, Hawau Olamide and
192
+ Alcoba Inciarte, Alcides and
193
+ Aldarmaki, Hanan and
194
+ Abdul-Mageed, Muhammad",
195
+ editor = "Christodoulopoulos, Christos and
196
+ Chakraborty, Tanmoy and
197
+ Rose, Carolyn and
198
+ Peng, Violet",
199
+ booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
200
+ month = nov,
201
+ year = "2025",
202
+ address = "Suzhou, China",
203
+ publisher = "Association for Computational Linguistics",
204
+ url = "https://aclanthology.org/2025.emnlp-main.559/",
205
+ doi = "10.18653/v1/2025.emnlp-main.559",
206
+ pages = "11039--11061",
207
+ ISBN = "979-8-89176-332-6",
208
+ }
209
+
210
+ ```
211
+
README.md ADDED
@@ -0,0 +1,211 @@
1
+ ---
2
+ language:
3
+ - am # Amharic
4
+ - ar # Arabic
5
+ - tw # Asante Twi
6
+ - bm # Bambara
7
+ - fr # French
8
+ - lg # Ganda
9
+ - ha # Hausa
10
+ - ig # Igbo
11
+ - rw # Kinyarwanda
12
+ - kg # Kongo
13
+ - ln # Lingala
14
+ - lu # Luba-Katanga
15
+ - mg # Malagasy
16
+ - nso # Northern Sotho
17
+ - ny # Nyanja
18
+ - om # Oromo
19
+ - pt # Portuguese
20
+ - sn # Shona
21
+ - so # Somali
22
+ - st # Southern Sotho
23
+ - sw # Swahili
24
+ - ss # Swati
25
+ - ti # Tigrinya
26
+ - ts # Tsonga
27
+ - tn # Tswana
28
+ - ak # Twi
29
+ - ve # Venda
30
+ - wo # Wolof
31
+ - xh # Xhosa
32
+ - yo # Yoruba
33
+ - zu # Zulu
34
+ - tzm # Tamazight
35
+ - sg # Sango
36
+ - din # Dinka
37
+ - ee # Ewe
38
+ - fo # Fon
39
+ - luo # Luo
40
+ - mos # Mossi
41
+ - umb # Umbundu
42
+ license: cc-by-4.0
43
+ tags:
44
+ - automatic-speech-recognition
45
+ - audio
46
+ - speech
47
+ - african-languages
48
+ - multilingual
49
+ - simba
50
+ - low-resource
51
+ - speech-recognition
52
+ - asr
53
+ datasets:
54
+ - UBC-NLP/SimbaBench
55
+ metrics:
56
+ - wer
57
+ - cer
58
+ library_name: transformers
59
+ pipeline_tag: automatic-speech-recognition
60
+ ---
61
+ <div align="center">
62
+
63
+ <img src="https://africa.dlnlp.ai/simba/images/VoC_simba" alt="VoC Simba Models Logo">
64
+
65
+
66
+ [![EMNLP 2025 Paper](https://img.shields.io/badge/EMNLP_2025-Paper-B31B1B?style=for-the-badge&logo=arxiv&logoColor=B31B1B&labelColor=FFCDD2)](https://aclanthology.org/2025.emnlp-main.559/)
67
+ [![Official Website](https://img.shields.io/badge/Official-Website-2EA44F?style=for-the-badge&logo=googlechrome&logoColor=2EA44F&labelColor=C8E6C9)](https://africa.dlnlp.ai/simba/)
68
+ [![SimbaBench](https://img.shields.io/badge/SimbaBench-Benchmark-8A2BE2?style=for-the-badge&logo=googlecharts&logoColor=8A2BE2&labelColor=E1BEE7)](#simbabench)
69
+ [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-FFD21E?style=for-the-badge&logoColor=black&labelColor=FFF9C4)](https://huggingface.co/collections/UBC-NLP/simba-speech-series)
70
+ [![YouTube Video](https://img.shields.io/badge/YouTube-Video-FF0000?style=for-the-badge&logo=youtube&logoColor=FF0000&labelColor=FFCCBC)](#demo)
71
+
72
+ </div>
73
+
74
+ ## *Bridging the Digital Divide for African AI*
75
+
76
+ **Voice of a Continent** is a comprehensive open-source ecosystem designed to bring African languages to the forefront of artificial intelligence. By providing a unified suite of benchmarking tools and state-of-the-art models, we ensure that the future of speech technology is inclusive, representative, and accessible to over a billion people.
77
+
78
+ ## Best-in-Class Multilingual Models
79
+
80
+ Introduced in our EMNLP 2025 paper *[Voice of a Continent](https://aclanthology.org/2025.emnlp-main.559/)*, the **Simba Series** represents the current state-of-the-art for African speech AI.
81
+
82
+ - **Unified Suite:** Models optimized for African languages.
83
+ - **Superior Accuracy:** Outperforms generic multilingual models by leveraging SimbaBench's high-quality, domain-diverse datasets.
84
+ - **Multitask Capability:** Designed for high performance in ASR (Automatic Speech Recognition) and TTS (Text-to-Speech).
85
+ - **Inclusion-First:** Specifically built to mitigate the "digital divide" by empowering speakers of underrepresented languages.
86
+
87
+ The **Simba** family consists of state-of-the-art models fine-tuned using SimbaBench. These models achieve superior performance by leveraging dataset quality, domain diversity, and language family relationships.
88
+
89
+ ### 🗣️✍️ Simba-ASR
90
+ > **The New Standard for African Speech-to-Text**
91
+
92
+ **🎯 Task** `Automatic Speech Recognition` — Powering high-accuracy transcription across the continent.
93
+
94
+ **🌍 Language Coverage (43 African languages)**
95
+ > **Amharic** (`amh`), **Arabic** (`ara`), **Asante Twi** (`asanti`), **Bambara** (`bam`), **Baoulé** (`bau`), **Bemba** (`bem`), **Ewe** (`ewe`), **Fanti** (`fat`), **Fon** (`fon`), **French** (`fra`), **Ganda** (`lug`), **Hausa** (`hau`), **Igbo** (`ibo`), **Kabiye** (`kab`), **Kinyarwanda** (`kin`), **Kongo** (`kon`), **Lingala** (`lin`), **Luba-Katanga** (`lub`), **Luo** (`luo`), **Malagasy** (`mlg`), **Mossi** (`mos`), **Northern Sotho** (`nso`), **Nyanja** (`nya`), **Oromo** (`orm`), **Portuguese** (`por`), **Shona** (`sna`), **Somali** (`som`), **Southern Sotho** (`sot`), **Swahili** (`swa`), **Swati** (`ssw`), **Tigrinya** (`tir`), **Tsonga** (`tso`), **Tswana** (`tsn`), **Twi** (`twi`), **Umbundu** (`umb`), **Venda** (`ven`), **Wolof** (`wol`), **Xhosa** (`xho`), **Yoruba** (`yor`), **Zulu** (`zul`), **Tamazight** (`tzm`), **Sango** (`sag`), **Dinka** (`din`).
96
+
97
+ **🏗️ Base Architectures**
98
+
99
+ - **Simba-S** (SeamlessM4T-v2-MT) — *Top Performer*
100
+ - **Simba-W** (Whisper-v3-large)
101
+ - **Simba-X** (Wav2Vec2-XLS-R-2b)
102
+ - **Simba-M** (MMS-1b-all)
103
+ - **Simba-H** (AfriHuBERT)
104
+
105
+ 🌐 Explore the Frontier
106
+
107
+ | **ASR Models** | **Architecture** | **#Parameters** | **🤗 Hugging Face Model Card** | **Status** |
108
+ |---------|:------------------:| :------------------:| :------------------:|:------------------:|
109
+ | 🔥**Simba-S**🔥| SeamlessM4T-v2 | 2.3B | 🤗 [https://huggingface.co/UBC-NLP/Simba-S](https://huggingface.co/UBC-NLP/Simba-S) | ✅ Released |
110
+ | 🔥**Simba-W**🔥| Whisper | 1.5B | 🤗 [https://huggingface.co/UBC-NLP/Simba-W](https://huggingface.co/UBC-NLP/Simba-W) | ✅ Released |
111
+ | 🔥**Simba-X**🔥| Wav2Vec2 | 1B | 🤗 [https://huggingface.co/UBC-NLP/Simba-X](https://huggingface.co/UBC-NLP/Simba-X) | ✅ Released |
112
+ | 🔥**Simba-M**🔥| MMS | 1B | 🤗 [https://huggingface.co/UBC-NLP/Simba-M](https://huggingface.co/UBC-NLP/Simba-M) | ✅ Released |
113
+ | 🔥**Simba-H**🔥| HuBERT | 94M | 🤗 [https://huggingface.co/UBC-NLP/Simba-H](https://huggingface.co/UBC-NLP/Simba-H) | ✅ Released |
114
+
115
+ * **Simba-S** emerged as the best-performing ASR model overall.
116
+
117
+
118
+ **🧩 Usage Example**
119
+
120
+ You can easily run inference using the Hugging Face `transformers` library.
121
+
122
+ ```python
123
+ from transformers import pipeline
124
+
125
+ # Load Simba-S for ASR
126
+ asr_pipeline = pipeline(
127
+ "automatic-speech-recognition",
128
+ model="UBC-NLP/Simba-S"  # Available Simba models: `UBC-NLP/Simba-S`, `UBC-NLP/Simba-W`, `UBC-NLP/Simba-X`, `UBC-NLP/Simba-H`, `UBC-NLP/Simba-M`
129
+ )
130
+
131
+ # Load the multilingual African adapter. Uncomment the next line only when using `UBC-NLP/Simba-M`.
132
+ # asr_pipeline.model.load_adapter("multilingual_african")
133
+ ###########################
134
+
135
+ # Transcribe audio from file
136
+ result = asr_pipeline("https://africa.dlnlp.ai/simba/audio/afr_Lwazi_afr_test_idx3889.wav")
137
+ print(result["text"])
138
+
139
+
140
+ # Transcribe audio from an in-memory array (`audio_array` is a 1-D float waveform at 16 kHz that you load yourself)
141
+ result = asr_pipeline({
142
+ "array": audio_array,
143
+ "sampling_rate": 16_000
144
+ })
145
+ print(result["text"])
146
+
147
+ ```
148
+
149
+ #### Example Outputs
150
+
151
+ Using the same audio file with different Simba models:
152
+
153
+ ```python
154
+ # Simba-S
155
+ {'text': 'watter verontwaardiging sou daar, in ons binneste gewees het.'}
156
+ ```
157
+
158
+ ```python
159
+ # Simba-W
160
+ {'text': 'watter veronwaardigingsel daar, in ons binneste gewees het.'}
161
+ ```
162
+
163
+ ```python
164
+ # Simba-X
165
+ {'text': 'fator fr on ar taamsodr is'}
166
+ ```
167
+
168
+ ```python
169
+ # Simba-M
170
+ {'text': 'watter veronwaardiging sodaar in ons binniste gewees het'}
171
+ ```
172
+
173
+ ```python
174
+ # Simba-H
175
+ {'text': 'watter vironwaardiging so daar in ons binneste geweeshet'}
176
+ ```
177
+
178
+ Get started with Simba models in minutes using our interactive Colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/UBC-NLP/simba/edit/main/simba_models.ipynb)
179
+
180
+
181
+ ## Citation
182
+
183
+ If you use the Simba models or the SimbaBench benchmark in a scientific publication, or if you find the resources on this website useful, please cite our paper.
184
+
185
+ ```bibtex
186
+
187
+ @inproceedings{elmadany-etal-2025-voice,
188
+ title = "Voice of a Continent: Mapping {A}frica{'}s Speech Technology Frontier",
189
+ author = "Elmadany, AbdelRahim A. and
190
+ Kwon, Sang Yun and
191
+ Toyin, Hawau Olamide and
192
+ Alcoba Inciarte, Alcides and
193
+ Aldarmaki, Hanan and
194
+ Abdul-Mageed, Muhammad",
195
+ editor = "Christodoulopoulos, Christos and
196
+ Chakraborty, Tanmoy and
197
+ Rose, Carolyn and
198
+ Peng, Violet",
199
+ booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
200
+ month = nov,
201
+ year = "2025",
202
+ address = "Suzhou, China",
203
+ publisher = "Association for Computational Linguistics",
204
+ url = "https://aclanthology.org/2025.emnlp-main.559/",
205
+ doi = "10.18653/v1/2025.emnlp-main.559",
206
+ pages = "11039--11061",
207
+ ISBN = "979-8-89176-332-6",
208
+ }
209
+
210
+ ```
211
+
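The model card metadata lists `wer` and `cer` as metrics. As a rough illustration of how such scores are computed, the sketch below implements a word-level and character-level edit distance in plain Python and applies it to two of the example outputs from the README. Treating the Simba-S transcript as the reference is an assumption made purely for illustration (it is not a ground-truth transcription), and the helper names `edit_distance`, `wer`, and `cer` are hypothetical, not part of the Simba release.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Example outputs copied from the README; Simba-S is used as a stand-in reference.
simba_s = "watter verontwaardiging sou daar, in ons binneste gewees het."
simba_w = "watter veronwaardigingsel daar, in ons binneste gewees het."

print(f"WER: {wer(simba_s, simba_w):.3f}")
print(f"CER: {cer(simba_s, simba_w):.3f}")
```

For reported results, scores should be computed against human reference transcripts with the text normalization described in the paper, which this sketch does not attempt to reproduce.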
added_tokens.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "</s>": 427,
3
+ "<s>": 426
4
+ }
config.json ADDED
@@ -0,0 +1,76 @@
1
+ {
2
+ "_name_or_path": "ajesujoba/AfriHuBERT",
3
+ "activation_dropout": 0.0,
4
+ "add_adapter": false,
5
+ "apply_spec_augment": true,
6
+ "architectures": [
7
+ "HubertForCTC"
8
+ ],
9
+ "attention_dropout": 0.0,
10
+ "bos_token_id": 1,
11
+ "classifier_proj_size": 256,
12
+ "conv_bias": false,
13
+ "conv_dim": [
14
+ 512,
15
+ 512,
16
+ 512,
17
+ 512,
18
+ 512,
19
+ 512,
20
+ 512
21
+ ],
22
+ "conv_kernel": [
23
+ 10,
24
+ 3,
25
+ 3,
26
+ 3,
27
+ 3,
28
+ 2,
29
+ 2
30
+ ],
31
+ "conv_stride": [
32
+ 5,
33
+ 2,
34
+ 2,
35
+ 2,
36
+ 2,
37
+ 2,
38
+ 2
39
+ ],
40
+ "ctc_loss_reduction": "mean",
41
+ "ctc_zero_infinity": false,
42
+ "do_stable_layer_norm": false,
43
+ "eos_token_id": 2,
44
+ "feat_extract_activation": "gelu",
45
+ "feat_extract_dropout": 0.0,
46
+ "feat_extract_norm": "group",
47
+ "feat_proj_dropout": 0.0,
48
+ "feat_proj_layer_norm": true,
49
+ "final_dropout": 0.0,
50
+ "hidden_act": "gelu",
51
+ "hidden_dropout": 0.0,
52
+ "hidden_dropout_prob": 0.1,
53
+ "hidden_size": 768,
54
+ "initializer_range": 0.02,
55
+ "intermediate_size": 3072,
56
+ "layer_norm_eps": 1e-05,
57
+ "layerdrop": 0.0,
58
+ "mask_feature_length": 10,
59
+ "mask_feature_min_masks": 0,
60
+ "mask_feature_prob": 0.0,
61
+ "mask_time_length": 10,
62
+ "mask_time_min_masks": 2,
63
+ "mask_time_prob": 0.05,
64
+ "model_type": "hubert",
65
+ "num_attention_heads": 12,
66
+ "num_conv_pos_embedding_groups": 16,
67
+ "num_conv_pos_embeddings": 128,
68
+ "num_feat_extract_layers": 7,
69
+ "num_hidden_layers": 12,
70
+ "pad_token_id": 425,
71
+ "tokenizer_class": "Wav2Vec2CTCTokenizer",
72
+ "torch_dtype": "float32",
73
+ "transformers_version": "4.33.2",
74
+ "use_weighted_layer_sum": false,
75
+ "vocab_size": 428
76
+ }
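The `conv_kernel` and `conv_stride` lists in this config determine how many encoder frames the model produces per second of audio. The sketch below works through that arithmetic, assuming unpadded 1-D convolutions in the feature encoder and the 16 kHz sampling rate from `preprocessor_config.json`; the function name `num_frames` is hypothetical.

```python
from math import prod

# Values copied from config.json.
conv_kernel = [10, 3, 3, 3, 3, 2, 2]
conv_stride = [5, 2, 2, 2, 2, 2, 2]
sampling_rate = 16_000  # from preprocessor_config.json


def num_frames(num_samples):
    """Encoder frames produced for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(conv_kernel, conv_stride):
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


total_stride = prod(conv_stride)                      # samples between frames
frame_shift_ms = 1000 * total_stride / sampling_rate  # time between frames

print(total_stride)        # 320
print(frame_shift_ms)      # 20.0
print(num_frames(16_000))  # 49
```

Under these assumptions, each CTC output step covers roughly 20 ms of audio, so one second of speech yields 49 frames for the decoder to label.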
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "do_normalize": true,
3
+ "feature_extractor_type": "Wav2Vec2FeatureExtractor",
4
+ "feature_size": 1,
5
+ "padding_side": "right",
6
+ "padding_value": 0,
7
+ "return_attention_mask": false,
8
+ "sampling_rate": 16000
9
+ }
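With `do_normalize` set to `true`, each waveform is expected to be scaled to zero mean and unit variance before it reaches the model. The sketch below shows that transformation in plain Python as I understand it; the function name `zero_mean_unit_var` and the `eps` value are assumptions for illustration, and the library's own feature extractor should be used in practice.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Scale one waveform to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    scale = math.sqrt(variance + eps)  # eps guards against silent (all-zero) input
    return [(x - mean) / scale for x in samples]


waveform = [0.0, 0.5, -0.5, 1.0]  # toy samples, not real audio
normalized = zero_mean_unit_var(waveform)

new_mean = sum(normalized) / len(normalized)
new_var = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(round(new_mean, 6), round(new_var, 4))
```

Normalizing per utterance makes the model less sensitive to recording level, which matters when audio comes from many different sources.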
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4bd01b633fc5af37d2d3c9d95ea9275fe8b0fe10d0b04bba04f42520b02d2c35
3
+ size 378876385
special_tokens_map.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": true,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ {
11
+ "content": "</s>",
12
+ "lstrip": false,
13
+ "normalized": true,
14
+ "rstrip": false,
15
+ "single_word": false
16
+ }
17
+ ],
18
+ "bos_token": "<s>",
19
+ "eos_token": "</s>",
20
+ "pad_token": "[PAD]",
21
+ "unk_token": "[UNK]"
22
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "bos_token": "<s>",
3
+ "clean_up_tokenization_spaces": true,
4
+ "do_lower_case": false,
5
+ "eos_token": "</s>",
6
+ "model_max_length": 1000000000000000019884624838656,
7
+ "pad_token": "[PAD]",
8
+ "replace_word_delimiter_char": " ",
9
+ "target_lang": null,
10
+ "tokenizer_class": "Wav2Vec2CTCTokenizer",
11
+ "unk_token": "[UNK]",
12
+ "word_delimiter_token": "|"
13
+ }
vocab.json ADDED
@@ -0,0 +1,428 @@
1
+ {
2
+ "<": 1,
3
+ "=": 2,
4
+ ">": 3,
5
+ "[": 4,
6
+ "[PAD]": 425,
7
+ "[UNK]": 424,
8
+ "\\": 5,
9
+ "]": 6,
10
+ "_": 7,
11
+ "`": 8,
12
+ "a": 9,
13
+ "b": 10,
14
+ "c": 11,
15
+ "d": 12,
16
+ "e": 13,
17
+ "f": 14,
18
+ "g": 15,
19
+ "h": 16,
20
+ "i": 17,
21
+ "j": 18,
22
+ "k": 19,
23
+ "l": 20,
24
+ "m": 21,
25
+ "n": 22,
26
+ "o": 23,
27
+ "p": 24,
28
+ "q": 25,
29
+ "r": 26,
30
+ "s": 27,
31
+ "t": 28,
32
+ "u": 29,
33
+ "v": 30,
34
+ "w": 31,
35
+ "x": 32,
36
+ "y": 33,
37
+ "z": 34,
38
+ "|": 0,
39
+ "~": 35,
40
+ "«": 36,
41
+ "²": 37,
42
+ "µ": 38,
43
+ "·": 39,
44
+ "»": 40,
45
+ "à": 41,
46
+ "á": 42,
47
+ "â": 43,
48
+ "ã": 44,
49
+ "ç": 45,
50
+ "è": 46,
51
+ "é": 47,
52
+ "ê": 48,
53
+ "ë": 49,
54
+ "ì": 50,
55
+ "í": 51,
56
+ "ï": 52,
57
+ "ñ": 53,
58
+ "ò": 54,
59
+ "ó": 55,
60
+ "ô": 56,
61
+ "õ": 57,
62
+ "ö": 58,
63
+ "ù": 59,
64
+ "ú": 60,
65
+ "û": 61,
66
+ "ć": 62,
67
+ "č": 63,
68
+ "ĕ": 64,
69
+ "ğ": 65,
70
+ "ĩ": 66,
71
+ "ī": 67,
72
+ "ĭ": 68,
73
+ "ĺ": 69,
74
+ "ń": 70,
75
+ "ŋ": 71,
76
+ "ŏ": 72,
77
+ "ś": 73,
78
+ "š": 74,
79
+ "ŧ": 75,
80
+ "ũ": 76,
81
+ "ŭ": 77,
82
+ "ŵ": 78,
83
+ "ƙ": 79,
84
+ "ƥ": 80,
85
+ "ƭ": 81,
86
+ "ƴ": 82,
87
+ "ǧ": 83,
88
+ "ǹ": 84,
89
+ "ɓ": 85,
90
+ "ɔ": 86,
91
+ "ɖ": 87,
92
+ "ɗ": 88,
93
+ "ɛ": 89,
94
+ "ɣ": 90,
95
+ "ɲ": 91,
96
+ "ʹ": 92,
97
+ "ʻ": 93,
98
+ "̀": 94,
99
+ "́": 95,
100
+ "̆": 96,
101
+ "̈": 97,
102
+ "̣": 98,
103
+ "έ": 99,
104
+ "γ": 100,
105
+ "ε": 101,
106
+ "ԑ": 102,
107
+ "ሀ": 103,
108
+ "ሁ": 104,
109
+ "ሂ": 105,
110
+ "ሃ": 106,
111
+ "ሄ": 107,
112
+ "ህ": 108,
113
+ "ሆ": 109,
114
+ "ለ": 110,
115
+ "ሉ": 111,
116
+ "ሊ": 112,
117
+ "ላ": 113,
118
+ "ሌ": 114,
119
+ "ል": 115,
120
+ "ሎ": 116,
121
+ "ሏ": 117,
122
+ "ሐ": 118,
123
+ "ሑ": 119,
124
+ "ሒ": 120,
125
+ "ሓ": 121,
126
+ "ሔ": 122,
127
+ "ሕ": 123,
128
+ "ሖ": 124,
129
+ "መ": 125,
130
+ "ሙ": 126,
131
+ "ሚ": 127,
132
+ "ማ": 128,
133
+ "ሜ": 129,
134
+ "ም": 130,
135
+ "ሞ": 131,
136
+ "ሟ": 132,
137
+ "ሠ": 133,
138
+ "ሡ": 134,
139
+ "ሣ": 135,
140
+ "ሥ": 136,
141
+ "ሦ": 137,
142
+ "ረ": 138,
143
+ "ሩ": 139,
144
+ "ሪ": 140,
145
+ "ራ": 141,
146
+ "ሬ": 142,
147
+ "ር": 143,
148
+ "ሮ": 144,
149
+ "ሯ": 145,
150
+ "ሰ": 146,
151
+ "ሱ": 147,
152
+ "ሲ": 148,
153
+ "ሳ": 149,
154
+ "ሴ": 150,
155
+ "ስ": 151,
156
+ "ሶ": 152,
157
+ "ሷ": 153,
158
+ "ሸ": 154,
159
+ "ሹ": 155,
160
+ "ሺ": 156,
161
+ "ሻ": 157,
162
+ "ሼ": 158,
163
+ "ሽ": 159,
164
+ "ሾ": 160,
165
+ "ሿ": 161,
166
+ "ቀ": 162,
167
+ "ቁ": 163,
168
+ "ቂ": 164,
169
+ "ቃ": 165,
170
+ "ቄ": 166,
171
+ "ቅ": 167,
172
+ "ቆ": 168,
173
+ "ቋ": 169,
174
+ "ቐ": 170,
175
+ "ቒ": 171,
176
+ "ቓ": 172,
177
+ "ቕ": 173,
178
+ "ቚ": 174,
179
+ "በ": 175,
180
+ "ቡ": 176,
181
+ "ቢ": 177,
182
+ "ባ": 178,
183
+ "ቤ": 179,
184
+ "ብ": 180,
185
+ "ቦ": 181,
186
+ "ቧ": 182,
187
+ "ቨ": 183,
188
+ "ቩ": 184,
189
+ "ቪ": 185,
190
+ "ቫ": 186,
191
+ "ቬ": 187,
192
+ "ቭ": 188,
193
+ "ቮ": 189,
194
+ "ቯ": 190,
195
+ "ተ": 191,
196
+ "ቱ": 192,
197
+ "ቲ": 193,
198
+ "ታ": 194,
199
+ "ቴ": 195,
200
+ "ት": 196,
201
+ "ቶ": 197,
202
+ "ቷ": 198,
203
+ "ቸ": 199,
204
+ "ቹ": 200,
205
+ "ቺ": 201,
206
+ "ቻ": 202,
207
+ "ቼ": 203,
208
+ "ች": 204,
209
+ "ቾ": 205,
210
+ "ቿ": 206,
211
+ "ኃ": 207,
212
+ "ኅ": 208,
213
+ "ኋ": 209,
214
+ "ነ": 210,
215
+ "ኑ": 211,
216
+ "ኒ": 212,
217
+ "ና": 213,
218
+ "ኔ": 214,
219
+ "ን": 215,
220
+ "ኖ": 216,
221
+ "ኗ": 217,
222
+ "ኘ": 218,
223
+ "ኙ": 219,
224
+ "ኚ": 220,
225
+ "ኛ": 221,
226
+ "ኜ": 222,
227
+ "ኝ": 223,
228
+ "ኞ": 224,
229
+ "ኟ": 225,
230
+ "አ": 226,
231
+ "ኡ": 227,
232
+ "ኢ": 228,
233
+ "ኣ": 229,
234
+ "ኤ": 230,
235
+ "እ": 231,
236
+ "ኦ": 232,
237
+ "ከ": 233,
238
+ "ኩ": 234,
239
+ "ኪ": 235,
240
+ "ካ": 236,
241
+ "ኬ": 237,
242
+ "ክ": 238,
243
+ "ኮ": 239,
244
+ "ኰ": 240,
245
+ "ኲ": 241,
246
+ "ኳ": 242,
247
+ "ኸ": 243,
248
+ "ኻ": 244,
249
+ "ኽ": 245,
250
+ "ኾ": 246,
251
+ "ወ": 247,
252
+ "ዉ": 248,
253
+ "ዊ": 249,
254
+ "ዋ": 250,
255
+ "ዌ": 251,
256
+ "ው": 252,
257
+ "ዎ": 253,
258
+ "ዐ": 254,
259
+ "ዑ": 255,
260
+ "ዒ": 256,
261
+ "ዓ": 257,
262
+ "ዔ": 258,
263
+ "ዕ": 259,
264
+ "ዖ": 260,
265
+ "ዘ": 261,
266
+ "ዙ": 262,
267
+ "ዚ": 263,
268
+ "ዛ": 264,
269
+ "ዜ": 265,
270
+ "ዝ": 266,
271
+ "ዞ": 267,
272
+ "ዟ": 268,
273
+ "ዠ": 269,
274
+ "ዡ": 270,
275
+ "ዢ": 271,
276
+ "ዣ": 272,
277
+ "ዤ": 273,
278
+ "ዥ": 274,
279
+ "ዦ": 275,
280
+ "ዧ": 276,
281
+ "የ": 277,
282
+ "ዩ": 278,
283
+ "ዪ": 279,
284
+ "ያ": 280,
285
+ "ዬ": 281,
286
+ "ይ": 282,
287
+ "ዮ": 283,
288
+ "ደ": 284,
289
+ "ዱ": 285,
290
+ "ዲ": 286,
291
+ "ዳ": 287,
292
+ "ዴ": 288,
293
+ "ድ": 289,
294
+ "ዶ": 290,
295
+ "ዷ": 291,
296
+ "ጀ": 292,
297
+ "ጁ": 293,
298
+ "ጂ": 294,
299
+ "ጃ": 295,
300
+ "ጄ": 296,
301
+ "ጅ": 297,
302
+ "ጆ": 298,
303
+ "ጇ": 299,
304
+ "ገ": 300,
305
+ "ጉ": 301,
306
+ "ጊ": 302,
307
+ "ጋ": 303,
308
+ "ጌ": 304,
309
+ "ግ": 305,
310
+ "ጎ": 306,
311
+ "ጐ": 307,
312
+ "ጓ": 308,
313
+ "ጔ": 309,
314
+ "ጠ": 310,
315
+ "ጡ": 311,
316
+ "ጢ": 312,
317
+ "ጣ": 313,
318
+ "ጤ": 314,
319
+ "ጥ": 315,
320
+ "ጦ": 316,
321
+ "ጧ": 317,
322
+ "ጨ": 318,
323
+ "ጩ": 319,
324
+ "ጪ": 320,
325
+ "ጫ": 321,
326
+ "ጬ": 322,
327
+ "ጭ": 323,
328
+ "ጮ": 324,
329
+ "ጯ": 325,
330
+ "ጰ": 326,
331
+ "ጱ": 327,
332
+ "ጲ": 328,
333
+ "ጳ": 329,
334
+ "ጴ": 330,
335
+ "ጵ": 331,
336
+ "ጶ": 332,
337
+ "ጸ": 333,
338
+ "ጹ": 334,
339
+ "ጺ": 335,
340
+ "ጻ": 336,
341
+ "ጼ": 337,
342
+ "ጽ": 338,
343
+ "ጾ": 339,
344
+ "ጿ": 340,
345
+ "ፀ": 341,
346
+ "ፁ": 342,
347
+ "ፃ": 343,
348
+ "ፅ": 344,
349
+ "ፈ": 345,
350
+ "ፉ": 346,
351
+ "ፊ": 347,
352
+ "ፋ": 348,
353
+ "ፌ": 349,
354
+ "ፍ": 350,
355
+ "ፎ": 351,
356
+ "ፏ": 352,
357
+ "ፐ": 353,
358
+ "ፑ": 354,
359
+ "ፒ": 355,
360
+ "ፓ": 356,
361
+ "ፔ": 357,
362
+ "ፕ": 358,
363
+ "ፖ": 359,
364
+ "፡": 360,
365
+ "።": 361,
366
+ "፣": 362,
367
+ "፤": 363,
368
+ "ḅ": 364,
369
+ "ḍ": 365,
370
+ "ḓ": 366,
371
+ "ḥ": 367,
372
+ "ḷ": 368,
373
+ "ḽ": 369,
374
+ "ṅ": 370,
375
+ "ṋ": 371,
376
+ "ṕ": 372,
377
+ "ṛ": 373,
378
+ "ṣ": 374,
379
+ "ṭ": 375,
380
+ "ṱ": 376,
381
+ "ẃ": 377,
382
+ "ẓ": 378,
383
+ "ạ": 379,
384
+ "ẹ": 380,
385
+ "ị": 381,
386
+ "ọ": 382,
387
+ "ụ": 383,
388
+ "ὲ": 384,
389
+ "–": 385,
390
+ "—": 386,
391
+ "’": 387,
392
+ "‟": 388,
393
+ "•": 389,
394
+ "…": 390,
395
+ "‽": 391,
396
+ "ⴰ": 392,
397
+ "ⴱ": 393,
398
+ "ⴳ": 394,
399
+ "ⴷ": 395,
400
+ "ⴹ": 396,
401
+ "ⴻ": 397,
402
+ "ⴼ": 398,
403
+ "ⴽ": 399,
404
+ "ⵀ": 400,
405
+ "ⵃ": 401,
406
+ "ⵄ": 402,
407
+ "ⵅ": 403,
408
+ "ⵇ": 404,
409
+ "ⵉ": 405,
410
+ "ⵊ": 406,
411
+ "ⵍ": 407,
412
+ "ⵎ": 408,
413
+ "ⵏ": 409,
414
+ "ⵓ": 410,
415
+ "ⵔ": 411,
416
+ "ⵕ": 412,
417
+ "ⵖ": 413,
418
+ "ⵙ": 414,
419
+ "ⵚ": 415,
420
+ "ⵛ": 416,
421
+ "ⵜ": 417,
422
+ "ⵟ": 418,
423
+ "ⵡ": 419,
424
+ "ⵢ": 420,
425
+ "ⵣ": 421,
426
+ "ⵥ": 422,
427
+ "ⵯ": 423
428
+ }
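This character vocabulary is what a CTC head such as `HubertForCTC` predicts over, one id per audio frame. The sketch below shows greedy CTC decoding on a hand-written id sequence, assuming (as is conventional for CTC models in `transformers`, and consistent with `pad_token_id: 425` in `config.json`) that the `[PAD]` id doubles as the CTC blank and that `|` is the word delimiter from `tokenizer_config.json`. The function name `ctc_greedy_decode` and the frame ids are made up for illustration.

```python
# A small excerpt of vocab.json; ids match the file above.
VOCAB_EXCERPT = {
    "|": 0, "d": 12, "e": 13, "h": 16, "l": 20,
    "o": 23, "r": 26, "w": 31, "[PAD]": 425,
}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB_EXCERPT.items()}
BLANK_ID = VOCAB_EXCERPT["[PAD]"]  # assumed to act as the CTC blank
WORD_DELIMITER = "|"               # from tokenizer_config.json


def ctc_greedy_decode(frame_ids):
    """Collapse repeated ids, drop blanks, and turn delimiters into spaces."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Keep an id only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != BLANK_ID:
            tokens.append(ID_TO_TOKEN[token_id])
        previous = token_id
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Hypothetical per-frame argmax ids; the blank between the two "l" frames
# is what allows a doubled letter to survive the collapse step.
frame_ids = [16, 16, 425, 13, 20, 425, 20, 23, 0, 31, 23, 26, 20, 12]
print(ctc_greedy_decode(frame_ids))  # hello world
```

In real use the tokenizer's own decoding should be preferred; this sketch only makes the role of the blank and delimiter tokens concrete.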