---
pretty_name: African Next Voices – Kenya
languages:
- kik
- som
- luo
- mas
- kln
extra_gated_prompt: "Please provide your full name and institution to request access."
extra_gated_fields:
Full name: text
Institution: text
I agree to the terms: checkbox
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
---
⚠️ **IMPORTANT**
Please **use the latest version of the dataset** for attribution, benchmarking, and publications.
## Overview
**African Next Voices: Pilot Data Collection in Kenya** is part of a larger initiative to support African language speech technology. This project, funded by the **Gates Foundation**, is led by the **KenCorpus Consortium**, a coalition of Kenyan universities and research centers.
It includes **scripted** and **unscripted** speech across multiple domains and five languages, collected through ethical, community-led processes.
## 👥 Contributors
- **Maseno University** – Dholuo & Somali
- **USIU-Africa** – Maasai
- **Kabarak University** – Kalenjin
- **DeKUT and LDRI** – Kikuyu
## 📚 Domains
- Agriculture & Food
- Everyday Scenarios
- Financial Transactions
- Digital Government Services
- Named Entity Recognition
- Role Play
- Extempore Stories
- Healthcare
- News & Media
- Education & Technology
- Customer Care Scenarios
## 🧑🏿💻 Use Cases
- Training ASR models
- Local language research
- Benchmarking multilingual and low-resource speech
## 🌍 Languages
| Language | Dialects | ISO | Target (Hrs) | Scripted (Hrs) | Unscripted (Hrs) | Total (Hrs) | % Updated |
|------------|---------------------------|-----------|--------------|----------|----------|----------|----------|
| Dholuo | Nyandwat, Milambo | `luo` | 750 | 195 | 528 | 723 | 96% |
| Kikuyu | Gĩ-Kabete, Ki-Mathira, Ki-Muranga, Ki-Ndia & Gĩ-Gichugu | `kik` | 750 | 183 | 571 | 754 | 100% |
| Somali | Maxatire | `som` | 500 | 118 | 384 | 502 | 100% |
| Kalenjin | Nandi & Kipsigis | `kln` | 500 | 122 | 399 | 521 | 100% |
| Maasai | Kimasaai & Kisamburu | `mas` | 500 | 51 | 454 | 505 | 100% |
## Unscripted Data
- Dholuo: 60% of unscripted audio transcribed; duration: 317 hrs
- Kikuyu: 68% of unscripted audio transcribed; duration: 381 hrs
- Somali: 100% of unscripted audio transcribed; duration: 502 hrs
- Kalenjin: 95% of unscripted audio transcribed; duration: 383 hrs
- Maasai: 100% of unscripted audio transcribed; duration: 505 hrs
## Dataset Splits
The dataset is divided into the following subsets:
- train (85%)
- dev (5%)
- dev_test (5%)
- test (5%) 🔒 – held out for future public leaderboards and shared tasks
*All subsets are speaker-disjoint to support fair benchmarking and minimise the risk of data leakage.*
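
The speaker-disjoint property can be checked locally once the splits are loaded. The sketch below is illustrative only: it assumes each split is available as a list of row dictionaries carrying the `recorder_uuid` column, and the sample rows and the helper name `overlapping_speakers` are hypothetical.

```python
from itertools import combinations


def overlapping_speakers(splits):
    """Return {(split_a, split_b): shared speaker IDs} for every overlapping pair."""
    # Collect the set of speaker IDs seen in each split.
    speakers = {
        name: {row["recorder_uuid"] for row in rows}
        for name, rows in splits.items()
    }
    overlaps = {}
    # Compare every pair of splits once, in a stable (sorted) order.
    for a, b in combinations(sorted(speakers), 2):
        shared = speakers[a] & speakers[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical rows for illustration; real rows come from the dataset itself.
example_splits = {
    "train": [{"recorder_uuid": "spk-001"}, {"recorder_uuid": "spk-002"}],
    "dev": [{"recorder_uuid": "spk-003"}],
    "dev_test": [{"recorder_uuid": "spk-004"}],
}

print(overlapping_speakers(example_splits))  # prints {}
```

An empty result means no speaker appears in more than one split; any non-empty entry names the pair of splits and the speaker IDs they share.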
## Dataset Columns
| Column Name | Description |
|-------------------|-----------------------------------------------------------------------------|
| `mediaPathId` | Unique identifier for the audio file path |
| `recorder_uuid` | Unique ID assigned to the speaker or contributor |
| `domain` | The content domain or topic category (e.g., Healthcare, Agriculture) |
| `translatedText (for type:scripted)` | English translation of the spoken content |
| `actualSentence (for type:scripted)` | Original utterance in the native language |
| `duration` | Length of the audio in seconds |
| `sentenceSource` | Indicates the origin of the sentence |
| `language` | Language of the utterance (e.g., Dholuo, Kikuyu) |
| `sentenceDialect` | Specific dialect variation used in the utterance |
| `type`            | Indicates whether the utterance is scripted or unscripted                   |
| `transcript (for type:unscripted)`   | The text transcript of spontaneous speech                                   |
| `prompt type (for type:unscripted)`  | Prompt used to elicit spontaneous speech                                    |
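
Because the native-language text lives in a different column depending on `type`, a small helper can normalise it before training or evaluation. This is a minimal sketch, not the project's own tooling: it assumes `type` holds the strings `scripted` or `unscripted` (as the column qualifiers above suggest), that rows are plain dictionaries, and that `duration` is in seconds. The helper names and sample rows are hypothetical.

```python
from collections import defaultdict


def reference_text(row):
    """Pick the native-language text for a row based on its `type` value."""
    kind = str(row.get("type", "")).strip().lower()
    if kind == "scripted":
        return row.get("actualSentence")  # original scripted utterance
    if kind == "unscripted":
        return row.get("transcript")  # transcript of spontaneous speech
    raise ValueError(f"unexpected type value: {row.get('type')!r}")


def hours_by_language(rows):
    """Sum `duration` (seconds) per language and convert to hours."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["language"]] += row["duration"] / 3600.0
    return dict(totals)


# Hypothetical rows with placeholder text, for illustration only.
example_rows = [
    {"language": "Dholuo", "type": "scripted",
     "actualSentence": "<scripted sentence>", "duration": 1800.0},
    {"language": "Dholuo", "type": "unscripted",
     "transcript": "<spontaneous transcript>", "duration": 5400.0},
]

print([reference_text(r) for r in example_rows])
print(hours_by_language(example_rows))  # prints {'Dholuo': 2.0}
```

If `type` turns out to be stored differently (for example as a true Boolean), only the comparison in `reference_text` needs to change.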
> More updates will be added *soon*.
>
> ⚠️ *Disclaimer:*
This dataset is provided for research and development of ASR and related technologies.
Any use that facilitates surveillance, discrimination, exploitation or unethical profiling is strictly prohibited.
The creators disclaim liability for misuse that violates ethical guidelines, privacy rights or community consent.
## 📎 License
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
## Language Leads and Resource Persons
| Language | Name | Role |
| -------- | --------------------- | ------------------------- |
| General | Dr. Lilian Wanzare | Principal Investigator |
| Dholuo | Dr. Vivian Oloo | Language Lead |
| Dholuo | Rennish Mboya | Language Lead |
| Dholuo | Francis Njiri | Resource Person |
| Dholuo | Fidelis Anne Achieng | Resource Person |
| Dholuo | Denis Okoth | Resource Person |
| Dholuo | Kelvin Ndanya | Resource Person |
| Kikuyu | Muchiri Nyaggah | Language Lead |
| Kikuyu | Leonida Mutuku | Language Lead |
| Kikuyu | Dr. Joseph Muguro | Language Lead |
| Kikuyu | Prof. Ciira Maina | Language Lead |
| Kikuyu | Anne Munira | Resource Person |
| Kikuyu | Mary Kariuki | Resource Person |
| Kikuyu | Veronica Kariuki | Resource Person |
| Kikuyu | Suzanne Njuguna | Resource Person |
| Maasai | Prof. Edward Ombui | Language Lead |
| Maasai | David Mbelati | Resource Person |
| Maasai | Peter Tikan | Resource Person |
| Maasai | Justus Somoire | Resource Person |
| Kalenjin | Dr. Andrew Kipkebut | Language Lead |
| Kalenjin | Emmanuel Chesire | Resource Person |
| Kalenjin | Betty Jepkemei | Resource Person |
| Kalenjin | Zipporah Chepkemoi | Resource Person |
| Somali | Mohamed Ali Muhumed | Language Lead |
| Somali | Ibrahim Dubow Mohamed | Language Lead |
| Somali | Abdisalam Issack Abdi | Resource Person |
| Somali | Yaseen | Resource Person |
## Citation
```bibtex
@misc{wanzare2026afrivoiceske,
title = {AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages},
author = {Lilian Wanzare and Cynthia Amol and Ezekiel Maina and
Nelson Odhiambo and Hope Kerubo and Leila Misula and
Vivian Oloo and Rennish Mboya and Edwin Onkoba and
Edward Ombui and Joseph Muguro and Ciira wa Maina and
Andrew Kipkebut and Alfred Omondi Otom and
Ian Ndung'u Kang'ethe and Angela Wambui Kanyi and
Brian Gichana Omwenga},
year = {2026},
eprint = {2604.08448},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2604.08448},
note = {Accepted at the LREC SIGUL 2026 Joint Workshop: https://sites.google.com/view/sigul2026/workshop-programme/day-2}
}
```