|
|
--- |
|
|
language: |
|
|
- ak |
|
|
- tw |
|
|
- aeb |
|
|
- af |
|
|
- am |
|
|
- ar |
|
|
- bas |
|
|
- bem |
|
|
- dav |
|
|
- dyu |
|
|
- en |
|
|
- pcm |
|
|
- ee |
|
|
- fat |
|
|
- fon |
|
|
- fuc |
|
|
- ff |
|
|
- gaa |
|
|
- ha |
|
|
- ig |
|
|
- kab |
|
|
- rw |
|
|
- kln |
|
|
- ln |
|
|
- loz |
|
|
- lg |
|
|
- luo |
|
|
- mlq |
|
|
- nr |
|
|
- nso |
|
|
- ny |
|
|
- st |
|
|
- srr |
|
|
- ss |
|
|
- sus |
|
|
- sw |
|
|
- tig |
|
|
- ti |
|
|
- toi |
|
|
- tn |
|
|
- ts |
|
|
|
|
- ve |
|
|
- wo |
|
|
- xh |
|
|
- yo |
|
|
- zgh |
|
|
- zu |
|
|
|
|
|
license: cc-by-4.0 |
|
|
tags: |
|
|
- automatic-speech-recognition |
|
|
- audio |
|
|
- speech |
|
|
- african-languages |
|
|
- multilingual |
|
|
- simba |
|
|
- low-resource |
|
|
- speech-recognition |
|
|
- asr |
|
|
- spoken-language-identification |
|
|
- language-identification |
|
|
datasets: |
|
|
- UBC-NLP/SimbaBench |
|
|
metrics: |
|
|
- wer |
|
|
- cer |
|
|
- accuracy |
|
|
library_name: transformers |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
<img src="https://africa.dlnlp.ai/simba/images/VoC_logo.png" alt="VoC Logo"> |
|
|
|
|
|
[](https://aclanthology.org/2025.emnlp-main.559/) |
|
|
[](https://africa.dlnlp.ai/simba/) |
|
|
[](https://huggingface.co/spaces/UBC-NLP/SimbaBench) |
|
|
[](https://github.com/UBC-NLP/simba) |
|
|
[](https://huggingface.co/collections/UBC-NLP/simba-speech-series) |
|
|
[](https://huggingface.co/datasets/UBC-NLP/SimbaBench_dataset) |
|
|
|
|
|
</div> |
|
|
|
|
|
## *Bridging the Digital Divide for African AI* |
|
|
|
|
|
**Voice of a Continent** is a comprehensive open-source ecosystem designed to bring African languages to the forefront of artificial intelligence. By providing a unified suite of benchmarking tools and state-of-the-art models, we ensure that the future of speech technology is inclusive, representative, and accessible to over a billion people. |
|
|
|
|
|
## Best-in-Class Multilingual Models |
|
|
|
|
|
<img src="https://africa.dlnlp.ai/simba/images/VoC_simba" alt="VoC Simba Models Logo"> |
|
|
|
|
|
Introduced in our EMNLP 2025 paper *[Voice of a Continent](https://aclanthology.org/2025.emnlp-main.559/)*, the **Simba Series** represents the current state-of-the-art for African speech AI. |
|
|
|
|
|
- **Unified Suite:** Models optimized for African languages. |
|
|
- **Superior Accuracy:** Outperforms generic multilingual models by leveraging SimbaBench's high-quality, domain-diverse datasets. |
|
|
- **Multitask Capability:** Designed for high performance in ASR (Automatic Speech Recognition) and TTS (Text-to-Speech). |
|
|
- **Inclusion-First:** Specifically built to mitigate the "digital divide" by empowering speakers of underrepresented languages. |
|
|
|
|
|
The **Simba** family consists of state-of-the-art models fine-tuned using SimbaBench. These models achieve superior performance by leveraging dataset quality, domain diversity, and language family relationships. |
|
|
|
|
|
|
|
|
### Simba-SLID (Spoken Language Identification)
|
|
* **Task:** `Spoken Language Identification` – intelligent input routing.
|
|
* **Language Coverage (49 African languages)**
|
|
> **Akuapim Twi** (`Akuapim-twi`), **Asante Twi** (`Asante-twi`), **Tunisian Arabic** (`aeb`), **Afrikaans** (`afr`), **Amharic** (`amh`), **Arabic** (`ara`), **Basaa** (`bas`), **Bemba** (`bem`), **Taita** (`dav`), **Dyula** (`dyu`), **English** (`eng`), **Nigerian Pidgin** (`pcm`), **Ewe** (`ewe`), **Fanti** (`fat`), **Fon** (`fon`), **Pulaar** (`fuc`), **Pular** (`fuf`), **Ga** (`gaa`), **Hausa** (`hau`), **Igbo** (`ibo`), **Kabyle** (`kab`), **Kinyarwanda** (`kin`), **Kalenjin** (`kln`), **Lingala** (`lin`), **Lozi** (`loz`), **Luganda** (`lug`), **Luo** (`luo`), **Western Maninkakan** (`mlq`), **South Ndebele** (`nbl`), **Northern Sotho** (`nso`), **Chichewa** (`nya`), **Southern Sotho** (`sot`), **Serer** (`srr`), **Swati** (`ssw`), **Susu** (`sus`), **Kiswahili** (`swa`), **Swahili** (`swh`), **Tigre** (`tig`), **Tigrinya** (`tir`), **Tonga** (`toi`), **Tswana** (`tsn`), **Tsonga** (`tso`), **Twi** (`twi`), **Venda** (`ven`), **Wolof** (`wol`), **Xhosa** (`xho`), **Yoruba** (`yor`), **Standard Moroccan Tamazight** (`zgh`), **Zulu** (`zul`)
|
|
|
|
|
| **SLID Model** | **Architecture** | **Hugging Face Card** | **Status** | |
|
|
| :--- | :--- | :---: | :---: | |
|
|
| **Simba-SLID-49** | HuBERT | 🤗 [https://huggingface.co/UBC-NLP/Simba-SLIS-49](https://huggingface.co/UBC-NLP/Simba-SLIS-49) | ✅ Released |
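The "input routing" role described above can be sketched as a plain lookup that maps a predicted SLID label to a downstream ASR checkpoint. This is only an illustrative sketch: the `ASR_BY_LANG` mapping and all checkpoint names below are hypothetical placeholders, not released models.

```python
# Hypothetical mapping from SLID labels (ISO 639-3 codes) to ASR checkpoints.
# All checkpoint names here are placeholders for illustration only.
ASR_BY_LANG = {
    "swa": "UBC-NLP/asr-swahili-placeholder",
    "yor": "UBC-NLP/asr-yoruba-placeholder",
}

def route(slid_label, fallback="UBC-NLP/asr-multilingual-placeholder"):
    """Pick an ASR checkpoint based on the identified language."""
    return ASR_BY_LANG.get(slid_label, fallback)

print(route("yor"))  # language-specific checkpoint
print(route("zul"))  # no dedicated checkpoint: falls back to the multilingual one
```

In a real pipeline you would also gate on the SLID confidence score, falling back to the multilingual checkpoint when the prediction is uncertain.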
|
|
|
|
|
|
|
|
**Usage Example**
|
|
|
|
|
You can easily run inference using the Hugging Face `transformers` library. |
|
|
|
|
|
```python
import torch
from transformers import (
    AutoFeatureExtractor,
    AutoProcessor,
    HubertForSequenceClassification,
)

model_id = "UBC-NLP/Simba-SLIS_49"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = HubertForSequenceClassification.from_pretrained(model_id).to(device)

# HuBERT checkpoints ship with either a processor or a feature extractor.
try:
    processor = AutoProcessor.from_pretrained(model_id)
    print("Loaded Simba-SLIS_49 with AutoProcessor")
except Exception:
    processor = AutoFeatureExtractor.from_pretrained(model_id)
    print("Loaded Simba-SLIS_49 with AutoFeatureExtractor")

# Put the model in inference mode.
model.eval()

audio_arrays = []  # add your audio arrays (1-D float arrays sampled at 16 kHz)
sample_rate = 16000

inputs = processor(
    audio_arrays, sampling_rate=sample_rate, return_tensors="pt", padding=True
).to(device)

# Some checkpoints expect `input_values` explicitly rather than keyword expansion.
with torch.no_grad():
    try:
        logits = model(**inputs).logits
    except Exception as e:
        if "input_values" in inputs:
            logits = model(input_values=inputs.input_values).logits
        else:
            raise e

# Softmax probabilities over the 49 language labels.
probs = torch.nn.functional.softmax(logits, dim=-1)

# Highest probability (confidence) and its label ID for each clip.
confidence_values, pred_ids = torch.max(probs, dim=-1)

# Convert to Python lists and map label IDs to language labels.
pred_ids = pred_ids.tolist()
confidence_values = confidence_values.cpu().tolist()
pred_labels = [model.config.id2label[i] for i in pred_ids]

print(pred_labels, confidence_values)
```
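The snippet above leaves `audio_arrays` empty. As a minimal, dependency-free sketch of how to fill it, a mono 16-bit PCM WAV file can be decoded into floats in [-1, 1] with the standard-library `wave` module (the file name `sample.wav` is a placeholder; the synthetic clip is only there to make the example self-contained). In practice you would typically use `librosa` or `soundfile` instead, which also handle resampling to 16 kHz:

```python
import struct
import wave

def load_wav_as_floats(path):
    """Decode a mono 16-bit PCM WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        raw = wf.readframes(wf.getnframes())
        samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
        return [s / 32768.0 for s in samples], wf.getframerate()

# Write a tiny synthetic 4-sample clip at 16 kHz, then read it back.
with wave.open("sample.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

audio, sr = load_wav_as_floats("sample.wav")
print(sr, [round(a, 2) for a in audio])  # prints: 16000 [0.0, 0.5, -0.5, 1.0]
```

The resulting float list (or an equivalent NumPy array) can be appended to `audio_arrays`, with `sr` passed as `sampling_rate` to the processor.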
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use the Simba models or the SimbaBench benchmark in a scientific publication, or if you find these resources useful, please cite our paper.
|
|
|
|
|
```bibtex |
|
|
|
|
|
@inproceedings{elmadany-etal-2025-voice, |
|
|
title = "Voice of a Continent: Mapping {A}frica{'}s Speech Technology Frontier", |
|
|
author = "Elmadany, AbdelRahim A. and |
|
|
Kwon, Sang Yun and |
|
|
Toyin, Hawau Olamide and |
|
|
Alcoba Inciarte, Alcides and |
|
|
Aldarmaki, Hanan and |
|
|
Abdul-Mageed, Muhammad", |
|
|
editor = "Christodoulopoulos, Christos and |
|
|
Chakraborty, Tanmoy and |
|
|
Rose, Carolyn and |
|
|
Peng, Violet", |
|
|
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing", |
|
|
month = nov, |
|
|
year = "2025", |
|
|
address = "Suzhou, China", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
url = "https://aclanthology.org/2025.emnlp-main.559/", |
|
|
doi = "10.18653/v1/2025.emnlp-main.559", |
|
|
pages = "11039--11061", |
|
|
ISBN = "979-8-89176-332-6", |
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|