---
pretty_name: African Next Voices Kenya
languages:
- kik
- som
- luo
- mas
- kln

extra_gated_prompt: "Please provide your full name and institution to request access."
extra_gated_fields:
  Full name: text
  Institution: text
  I agree to the terms: checkbox

license: cc-by-4.0
task_categories:
- automatic-speech-recognition
---


⚠️ **IMPORTANT**

Please **use the latest version of the dataset** for attribution, benchmarking and publications.

## Overview

**African Next Voices: Pilot Data Collection in Kenya** is part of a larger initiative to support African language speech technology. This project, funded by the **Gates Foundation**, is led by the **KenCorpus Consortium**, a coalition of Kenyan universities and research centers.

The dataset includes **scripted** and **unscripted** speech across multiple domains and five languages, collected through ethical, community-led processes.

## 👥 Contributors

- **Maseno University** – Dholuo & Somali  
- **USIU-Africa** – Maasai  
- **Kabarak University** – Kalenjin  
- **DeKUT and LDRI** – Kikuyu

## 📚 Domains

- Agriculture & Food  
- Everyday Scenarios  
- Financial Transactions  
- Digital Government Services  
- Named Entity Recognition  
- Role Play  
- Extempore Stories  
- Healthcare  
- News & Media  
- Education & Technology  
- Customer Care Scenarios

## 🧑🏿‍💻 Use Cases

- Training ASR models  
- Local language research  
- Benchmarking multilingual and low-resource speech

## 🌍 Languages

| Language | Dialects                                                | ISO   | Target (Hrs) | Scripted (Hrs) | Unscripted (Hrs) | Total (Hrs) | % Updated |
|----------|---------------------------------------------------------|-------|--------------|----------------|------------------|-------------|-----------|
| Dholuo   | Nyandwat, Milambo                                       | `luo` | 750          | 195            | 528              | 723         | 96%       |
| Kikuyu   | Gĩ-Kabete, Ki-Mathira, Ki-Muranga, Ki-Ndia & Gĩ-Gichugu | `kik` | 750          | 183            | 571              | 754         | 100%      |
| Somali   | Maxatire                                                | `som` | 500          | 118            | 384              | 502         | 100%      |
| Kalenjin | Nandi & Kipsigis                                        | `kln` | 500          | 122            | 399              | 521         | 100%      |
| Maasai   | Kimasaai & Kisamburu                                    | `mas` | 500          | 51             | 454              | 505         | 100%      |
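
The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes (the card does not state this) that the total is scripted plus unscripted hours, and that `% Updated` is the total divided by the first hours column (750 or 500 hours per language), rounded and capped at 100%. The `ROWS` layout and the helper names `percent_of_target` and `check_rows` are illustrative, not part of the dataset.

```python
# Rows copied from the table above:
# (language, first hours column, scripted, unscripted, total, % updated)
ROWS = [
    ("Dholuo", 750, 195, 528, 723, 96),
    ("Kikuyu", 750, 183, 571, 754, 100),
    ("Somali", 500, 118, 384, 502, 100),
    ("Kalenjin", 500, 122, 399, 521, 100),
    ("Maasai", 500, 51, 454, 505, 100),
]


def percent_of_target(total_hours, target_hours):
    """Assumed formula: share of the target reached, rounded and capped at 100."""
    return min(100, round(100 * total_hours / target_hours))


def check_rows(rows):
    """Return the languages whose row does not match the assumed arithmetic."""
    mismatches = []
    for language, target, scripted, unscripted, total, pct in rows:
        sums_match = scripted + unscripted == total
        pct_matches = percent_of_target(total, target) == pct
        if not (sums_match and pct_matches):
            mismatches.append(language)
    return mismatches


print(check_rows(ROWS))  # prints [] because every row matches the assumed formula
```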

## Unscripted Data

- Dholuo: 60% of unscripted speech transcribed (duration: 317 hrs)
- Kikuyu: 68% of unscripted speech transcribed (duration: 381 hrs)
- Somali: 100% of unscripted speech transcribed (duration: 502 hrs)
- Kalenjin: 95% of unscripted speech transcribed (duration: 383 hrs)
- Maasai: 100% of unscripted speech transcribed (duration: 505 hrs)


## Dataset Splits


The dataset is divided into the following subsets:

- train (85%)
- dev (5%)
- dev_test (5%)
- test (5%) 🔒 – held out for future public leaderboards and shared tasks

*All subsets are speaker-disjoint to support fair benchmarking and minimise the risk of data leakage.*
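
One way to confirm the speaker-disjoint property on a local copy is to compare the `recorder_uuid` values across splits. This is a minimal sketch, assuming each split is available as an iterable of row dictionaries; the `speaker_overlap` helper and the toy IDs are hypothetical and only illustrate the check.

```python
from itertools import combinations


def speaker_overlap(splits):
    """Map each pair of split names to the speaker IDs they share.

    `splits` maps a split name to an iterable of row dicts that carry the
    `recorder_uuid` column described in this card.
    """
    speakers = {
        name: {row["recorder_uuid"] for row in rows}
        for name, rows in splits.items()
    }
    overlaps = {}
    for first, second in combinations(sorted(speakers), 2):
        shared = speakers[first] & speakers[second]
        if shared:
            overlaps[(first, second)] = shared
    return overlaps


# Toy rows with made-up IDs, only to show the call.
toy_splits = {
    "train": [{"recorder_uuid": "spk-1"}, {"recorder_uuid": "spk-2"}],
    "dev": [{"recorder_uuid": "spk-3"}],
    "dev_test": [{"recorder_uuid": "spk-4"}],
}

# An empty dict means no speaker appears in more than one split.
print(speaker_overlap(toy_splits))
```

A non-empty result lists the offending split pairs together with the shared speaker IDs, which makes a leak easy to trace.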

## Dataset Columns

| Column Name                       | Description                                                          |
|-----------------------------------|----------------------------------------------------------------------|
| `mediaPathId`                     | Unique identifier for the audio file path                            |
| `recorder_uuid`                   | Unique ID assigned to the speaker or contributor                     |
| `domain`                          | The content domain or topic category (e.g., Healthcare, Agriculture) |
| `translatedText` (scripted only)  | English translation of the spoken content                            |
| `actualSentence` (scripted only)  | Original utterance in the native language                            |
| `duration`                        | Length of the audio in seconds                                       |
| `sentenceSource`                  | Indicates the origin of the sentence                                 |
| `language`                        | Language of the utterance (e.g., Dholuo, Kikuyu)                     |
| `sentenceDialect`                 | Specific dialect variation used in the utterance                     |
| `type`                            | Boolean flag indicating if the utterance was scripted or unscripted  |
| `transcript` (unscripted only)    | The text transcript of the spontaneous speech                        |
| `prompt type` (unscripted only)   | Prompt used to elicit the spontaneous speech                         |
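
Because the native-language text lives in a different column for scripted and unscripted rows, a small helper can pick the right field. The sketch below assumes `type` holds the strings `scripted` and `unscripted`; the table calls it a flag, so inspect the actual values before relying on this. `reference_text` and the placeholder rows are hypothetical.

```python
def reference_text(row):
    """Return the native-language text for a row, or None if it is missing.

    Assumption: `type` holds the strings "scripted" / "unscripted".
    """
    kind = str(row.get("type", "")).strip().lower()
    if kind == "scripted":
        return row.get("actualSentence")
    if kind == "unscripted":
        return row.get("transcript")
    raise ValueError(f"unexpected type value: {row.get('type')!r}")


# Placeholder rows, not real dataset content.
scripted_row = {"type": "scripted", "actualSentence": "<native sentence>"}
unscripted_row = {"type": "unscripted", "transcript": "<native transcript>"}

print(reference_text(scripted_row))
print(reference_text(unscripted_row))
```

Raising on an unknown `type` value is a deliberate choice here: it surfaces a wrong assumption about the encoding instead of silently returning empty text.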

> More updates will be added *soon*.
> 
> ⚠️ *Disclaimer:*
> This dataset is provided for research and development of ASR and related technologies.
> Any use that facilitates surveillance, discrimination, exploitation or unethical profiling is strictly prohibited.
> The creators disclaim liability for misuse that violates ethical guidelines, privacy rights or community consent.

## 📎 License

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

## Language Leads and Resource Persons

| Language | Name                  | Role                      |
| -------- | --------------------- | ------------------------- |
| General  | Dr. Lilian Wanzare    | Principal Investigator    |
| Dholuo   | Dr. Vivian Oloo       | Language Lead             |
| Dholuo   | Rennish Mboya         | Language Lead             |
| Dholuo   | Francis Njiri         | Resource Person           |
| Dholuo   | Fidelis Anne Achieng  | Resource Person           |
| Dholuo   | Denis Okoth           | Resource Person           |
| Dholuo   | Kelvin Ndanya         | Resource Person           |
| Kikuyu   | Muchiri Nyaggah       | Language Lead             |
| Kikuyu   | Leonida Mutuku        | Language Lead             |
| Kikuyu   | Dr. Joseph Muguro     | Language Lead             |
| Kikuyu   | Prof. Ciira Maina     | Language Lead             |
| Kikuyu   | Anne Munira           | Resource Person           |
| Kikuyu   | Mary Kariuki          | Resource Person           |
| Kikuyu   | Veronica Kariuki      | Resource Person           |
| Kikuyu   | Suzanne Njuguna       | Resource Person           |
| Maasai   | Prof. Edward Ombui    | Language Lead             |
| Maasai   | David Mbelati         | Resource Person           |
| Maasai   | Peter Tikan           | Resource Person           |
| Maasai   | Justus Somoire        | Resource Person           |
| Kalenjin | Dr. Andrew Kipkebut   | Language Lead             |
| Kalenjin | Emmanuel Chesire      | Resource Person           |
| Kalenjin | Betty Jepkemei        | Resource Person           |
| Kalenjin | Zipporah Chepkemoi    | Resource Person           |
| Somali   | Mohamed Ali Muhumed   | Language Lead             |
| Somali   | Ibrahim Dubow Mohamed | Language Lead             |
| Somali   | Abdisalam Issack Abdi | Resource Person           |
| Somali   | Yaseen                | Resource Person           |


## Citation

```bibtex
@misc{wanzare2026afrivoiceske,
  title         = {AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages},
  author        = {Lilian Wanzare and Cynthia Amol and Ezekiel Maina and
                   Nelson Odhiambo and Hope Kerubo and Leila Misula and
                   Vivian Oloo and Rennish Mboya and Edwin Onkoba and
                   Edward Ombui and Joseph Muguro and Ciira wa Maina and
                   Andrew Kipkebut and Alfred Omondi Otom and
                   Ian Ndung'u Kang'ethe and Angela Wambui Kanyi and
                   Brian Gichana Omwenga},
  year          = {2026},
  eprint        = {2604.08448},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.08448}
}
```

Accepted at the LREC SIGUL 2026 Joint Workshop: https://sites.google.com/view/sigul2026/workshop-programme/day-2