# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">PazaBench Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
The PazaBench Leaderboard is an Automatic Speech Recognition (ASR) benchmark for low-resource languages developed by the **[Microsoft Research Africa, Nairobi Lab](https://www.microsoft.com/en-us/research/lab/microsoft-research-lab-africa-nairobi/)**. Launching with **39 African languages** and **52 state-of-the-art ASR and language models**, PazaBench reports three key metrics: **Character Error Rate (CER)**, **Word Error Rate (WER)**, and **RTFx (Inverse Real-Time Factor)**.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
## PazaBench Inputs

### Evaluation Dataset

The PazaBench evaluation dataset is unified from 7 datasets: [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke), [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices), [ALFFA](https://openslr.org/25/), [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset), [Google FLEURS](https://huggingface.co/datasets/google/fleurs), [Mozilla Common Voice 23.0](https://commonvoice.mozilla.org/), and [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset). It comprises **61 test splits** across the **39 languages**, totaling **204,492 samples** of 16 kHz mono speech with aligned transcriptions and per-split metadata. For each language, the dataset groups covering that language are unified to provide a balanced measure of model performance.
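
As an illustration of how WER and the per-language unification described above can be computed (a minimal sketch, not the exact PazaBench pipeline; the helper names are hypothetical):

```python
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate: Levenshtein distance over word tokens,
    # normalized by the reference length.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

def balanced_language_score(split_scores: dict[tuple[str, str], float]) -> dict[str, float]:
    # split_scores maps (language, dataset_group) -> WER for that split.
    # Each dataset group covering a language contributes equally, so no
    # single large dataset dominates the per-language score.
    per_language = defaultdict(list)
    for (language, _group), score in split_scores.items():
        per_language[language].append(score)
    return {lang: sum(s) / len(s) for lang, s in per_language.items()}
```

For example, a hypothesis that drops one of three reference words scores a WER of 1/3, and a language covered by two dataset groups with WERs 0.2 and 0.4 receives a balanced score of 0.3.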
"""

LLM_BENCHMARKS_DATASETS_TEXT = """
## Datasets

| Dataset | Description | Languages | Total Samples | License |
|---------|-------------|-----------|------------------|---------|
| [Mozilla Common Voice 23.0](https://datacollective.mozillafoundation.org/datasets) | Crowdsourced speech | Afrikaans, Amharic, Arabic, Basaa, Dholuo, Dioula, Ekoti, Hausa, Igbo, Kabyle, Kalenjin, Kidaw'ida, Kinyarwanda, Luganda, Nyungwe, Setswana, Swahili, Tamazight, Tigre, Tigrinya, Twi, Yoruba, Zulu (**23 languages**) | 76,747 | CC0 1.0 |
| [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset) | Conversational speech from Nigeria | Hausa, Igbo, Yoruba (**3 languages**) | 57,529 | CC BY-NC-SA 4.0 |
| [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke) | Conversational speech | Dholuo, Kalenjin, Kikuyu, Maasai, Somali (**5 languages**) | 26,415 | CC BY 4.0 |
| [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices) | Conversational speech | Sesotho, Setswana, Tshivenda, Xhosa, Xitsonga, Zulu (**6 languages**) | 23,701 | CC BY 4.0 |
| [Google FLEURS](https://huggingface.co/datasets/google/fleurs) | Read speech | Afrikaans, Amharic, Dholuo, Fula, Ganda, Hausa, Igbo, Kamba, Lingala, Northern Sotho, Nyanja, Oromo, Shona, Somali, Swahili, Umbundu, Wolof, Xhosa, Yoruba, Zulu (**20 languages**) | 12,418 | CC BY 4.0 |
| [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset) | Agricultural speech from farmers | Kikuyu (**1 language**) | 5,054 | Apache 2.0 |
| [ALFFA](https://openslr.org/25/) | Broadcast news and read speech | Amharic, Swahili, Wolof (**3 languages**) | 4,350 | MIT |

---

## Evaluated Models

PazaBench evaluates **52 individual models spanning 16 SOTA ASR model families**, listed below:

| Model Family | Model |
|--------------|-------|
| Paza by Microsoft Research Africa, Nairobi | [microsoft/paza-Phi-4-multimodal-instruct](https://huggingface.co/microsoft/paza-Phi-4-multimodal-instruct), [microsoft/paza-mms-1b-all](https://huggingface.co/microsoft/paza-mms-1b-all), [microsoft/paza-whisper-large-v3-turbo](https://huggingface.co/microsoft/paza-whisper-large-v3-turbo) |
| Distil Whisper | [distil-whisper/distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2), [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3), [distil-whisper/distil-medium.en](https://huggingface.co/distil-whisper/distil-medium.en) |
| Facebook Data2Vec | [facebook/data2vec-audio-base-960h](https://huggingface.co/facebook/data2vec-audio-base-960h), [facebook/data2vec-audio-large-960h](https://huggingface.co/facebook/data2vec-audio-large-960h) |
| Facebook HuBERT | [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft), [facebook/hubert-xlarge-ls960-ft](https://huggingface.co/facebook/hubert-xlarge-ls960-ft) |
| Facebook MMS | [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all), [facebook/mms-1b-fl102](https://huggingface.co/facebook/mms-1b-fl102) |
| Facebook Omnilingual ASR | [facebook/omniASR-CTC-300M](https://huggingface.co/facebook/omniASR-CTC-300M), [facebook/omniASR-CTC-1B](https://huggingface.co/facebook/omniASR-CTC-1B), [facebook/omniASR-CTC-3B](https://huggingface.co/facebook/omniASR-CTC-3B), [facebook/omniASR-CTC-7B](https://huggingface.co/facebook/omniASR-CTC-7B), [facebook/omniASR-LLM-300M](https://huggingface.co/facebook/omniASR-LLM-300M), [facebook/omniASR-LLM-1B](https://huggingface.co/facebook/omniASR-LLM-1B), [facebook/omniASR-LLM-3B](https://huggingface.co/facebook/omniASR-LLM-3B), [facebook/omniASR-LLM-7B](https://huggingface.co/facebook/omniASR-LLM-7B), [facebook/omniASR-LLM-7B-ZS](https://huggingface.co/facebook/omniASR-LLM-7B-ZS) |
| Facebook Wav2Vec2 | [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h), [facebook/wav2vec2-large-960h](https://huggingface.co/facebook/wav2vec2-large-960h), [facebook/wav2vec2-large-960h-lv60-self](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), [facebook/wav2vec2-large-robust-ft-libri-960h](https://huggingface.co/facebook/wav2vec2-large-robust-ft-libri-960h) |
| Facebook Wav2Vec2 Conformer | [facebook/wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), [facebook/wav2vec2-conformer-rope-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rope-large-960h-ft) |
| IBM Granite Speech | [ibm-granite/granite-speech-3.3-2b](https://huggingface.co/ibm-granite/granite-speech-3.3-2b), [ibm-granite/granite-speech-3.3-8b](https://huggingface.co/ibm-granite/granite-speech-3.3-8b) |
| Kyutai STT | [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en) |
| Lite ASR (EfficientSpeech) | [efficient-speech/lite-whisper-large-v3](https://huggingface.co/efficient-speech/lite-whisper-large-v3), [efficient-speech/lite-whisper-large-v3-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-acc), [efficient-speech/lite-whisper-large-v3-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-fast), [efficient-speech/lite-whisper-large-v3-turbo](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo), [efficient-speech/lite-whisper-large-v3-turbo-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-acc), [efficient-speech/lite-whisper-large-v3-turbo-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-fast) |
| Microsoft Phi-4 | [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) |
| Moonshine | [usefulsensors/moonshine-base](https://huggingface.co/usefulsensors/moonshine-base), [usefulsensors/moonshine-tiny](https://huggingface.co/usefulsensors/moonshine-tiny) |
| NVIDIA NeMo ASR | [nvidia/canary-1b-v2](https://huggingface.co/nvidia/canary-1b-v2), [nvidia/canary-qwen-2.5b](https://huggingface.co/nvidia/canary-qwen-2.5b), [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| OpenAI Whisper | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3), [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo), [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), [openai/whisper-large](https://huggingface.co/openai/whisper-large), [openai/whisper-medium.en](https://huggingface.co/openai/whisper-medium.en), [openai/whisper-small.en](https://huggingface.co/openai/whisper-small.en), [openai/whisper-base.en](https://huggingface.co/openai/whisper-base.en), [openai/whisper-tiny.en](https://huggingface.co/openai/whisper-tiny.en) |
| Qwen2 Audio | [Qwen/Qwen2-Audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B), [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |

**Whisper Post-Processing:** Whisper model results include a duration-based truncation step to mitigate hallucination and known over-generation behavior.
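
The idea behind duration-based truncation can be sketched as capping the hypothesis at a word budget proportional to the audio duration, so runaway generations are cut off. This is an illustrative heuristic, not the exact PazaBench step, and the 3-words-per-second default is an assumed rate:

```python
def truncate_by_duration(text: str, audio_seconds: float,
                         max_words_per_second: float = 3.0) -> str:
    # Crude guard against Whisper-style over-generation: natural speech
    # rarely exceeds a few spoken words per second, so any words past
    # the duration-derived budget are dropped.
    budget = max(1, int(audio_seconds * max_words_per_second))
    words = text.split()
    return " ".join(words[:budget])
```

A hallucinated tail on a short clip is removed, while transcripts within the budget pass through unchanged.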

---

## Acknowledgements

We gratefully acknowledge the dataset creators and leaderboard teams whose contributions made PazaBench possible:

**Datasets:** We extend our gratitude to the creators, community contributors, and maintainers of [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke), [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices), [ALFFA](https://openslr.org/25/), [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset), [Google FLEURS](https://huggingface.co/datasets/google/fleurs), [Mozilla Common Voice](https://commonvoice.mozilla.org/), and [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset), whose efforts have been invaluable in advancing speech data for African languages.

**Reference Implementation:** We recognize the foundational work of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) by the Hugging Face Audio team, and we appreciate the contributors of the [open_asr_leaderboard repository](https://github.com/huggingface/open_asr_leaderboard) for creating reproducible evaluation scripts.
"""

# Dataset group metadata with descriptions and language counts
# Used for adding descriptors to dataset filter dropdown
DATASET_GROUP_METADATA = {
    "ALFFA": {
        "description": "Read speech & broadcast news",
        "languages": ["Amharic", "Swahili", "Wolof"],
        "language_count": 3,
    },
    "African Next Voices Kenya": {
        "description": "Conversational speech",
        "languages": ["Dholuo", "Kalenjin", "Kikuyu", "Maasai", "Somali"],
        "language_count": 5,
    },
    "African Next Voices South Africa": {
        "description": "Conversational speech",
        "languages": ["Sesotho", "Setswana", "Xhosa", "Xitsonga", "Tshivenda", "Zulu"],
        "language_count": 6,
    },
    "Google FLEURS": {
        "description": "Read speech",
        "languages": ["Afrikaans", "Amharic", "Fula", "Ganda", "Hausa", "Igbo", "Kamba", "Lingala", "Dholuo", "Northern Sotho", "Nyanja", "Oromo", "Shona", "Somali", "Swahili", "Umbundu", "Wolof", "Xhosa", "Yoruba", "Zulu"],
        "language_count": 20,
    },
    "DigiGreen Kikuyu ASR": {
        "description": "Agricultural speech",
        "languages": ["Kikuyu"],
        "language_count": 1,
    },
    "Mozilla Common Voice 23.0": {
        "description": "Crowdsourced speech",
        "languages": ["Afrikaans", "Amharic", "Arabic", "Basaa", "Dholuo", "Dioula", "Ekoti", "Hausa", "Igbo", "Kabyle", "Kalenjin", "Kinyarwanda", "Kidaw'ida", "Luganda", "Nyungwe", "Setswana", "Swahili", "Tamazight", "Tigre", "Tigrinya", "Twi", "Yoruba", "Zulu"],
        "language_count": 20,
    },
    "Naija Voices": {
        "description": "Conversational speech",
        "languages": ["Hausa", "Igbo", "Yoruba"],
        "language_count": 3,
    },
}

def get_dataset_group_label(dataset_group: str) -> str:
    """Return a formatted label with description and language count."""
    meta = DATASET_GROUP_METADATA.get(dataset_group)
    if meta:
        return f"{dataset_group} ({meta['description']}, {meta['language_count']} languages)"
    return dataset_group

def get_dataset_group_languages(dataset_group: str) -> list[str]:
    """Return list of languages for a dataset group."""
    meta = DATASET_GROUP_METADATA.get(dataset_group)
    if meta:
        return meta['languages']
    return []

EVALUATION_LANGUAGE_TEXT = """
### Request Language Evaluation

Submit a language dataset from any region for evaluation on PazaBench. We'll benchmark it using all supported ASR models. Provide the dataset source in the form below.

**Requirements:**
- Dataset must be publicly accessible on Hugging Face Hub or via a public URL
- Must contain audio samples with text transcriptions
- Audio should be 16 kHz mono WAV format (will be resampled if needed)
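
As a quick local pre-check before submitting (a minimal sketch using Python's standard `wave` module; this helper is not part of the submission pipeline), you can verify a WAV file's sample rate and channel count:

```python
import wave

def check_wav_format(path: str, expected_rate: int = 16000) -> dict:
    # Report a WAV file's sample rate and channel count against the
    # expected 16 kHz mono target.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return {
        "sample_rate": rate,
        "channels": channels,
        "ok": rate == expected_rate and channels == 1,
    }
```

Files that fail this check can still be submitted; audio at other sample rates is resampled during evaluation.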
"""

EVALUATION_MODEL_TEXT = """
### Submit a Model for Evaluation

Add a new ASR model to PazaBench. We'll evaluate it across languages.

**Requirements:**
- Model must be **publicly available** on [Hugging Face Hub](https://huggingface.co/models)
- Must support speech-to-text / ASR tasks
- Should be compatible with `transformers` AutoModel or provide clear loading instructions
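
The submission form expects a Hub identifier of the shape `namespace/model-name`. A hypothetical helper (`is_valid_hub_id` is not part of PazaBench) that pre-checks this shape without contacting the Hub:

```python
def is_valid_hub_id(model_id: str) -> bool:
    # Hub model IDs look like "namespace/model-name": exactly one "/",
    # with non-empty parts made of word characters, dots, or hyphens
    # (uppercase letters are allowed, e.g. "Qwen/Qwen2-Audio-7B").
    allowed = set("abcdefghijklmnopqrstuvwxyz0123456789._-")
    parts = model_id.split("/")
    if len(parts) != 2:
        return False
    return all(p and set(p.lower()) <= allowed for p in parts)
```

This only validates the shape of the identifier; whether the model actually exists and is public is checked at evaluation time.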
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{pazabench2026,
  title={PazaBench: A Benchmark for Automatic Speech Recognition on Low Resource Languages},
  author={Microsoft Research Africa, Nairobi},
  year={2026},
  howpublished={\url{https://www.microsoft.com/en-us/research/project/project-gecko/}},
  note={Alpha version. Part of Project Gecko - Equitable Generative AI for the Global Majority}
}
"""