# Your leaderboard name
TITLE = """

PazaBench Leaderboard

""" # What does your leaderboard evaluate? INTRODUCTION_TEXT = """ The PazaBench Leaderboard is an Automatic Speech Recognition (ASR) benchmark for low-resource languages developed by the **[Microsoft Research Africa, Nairobi Lab](https://www.microsoft.com/en-us/research/lab/microsoft-research-lab-africa-nairobi/)**. Launching with **39 African Languages** across **52 State-of-the-Art ASR** and **Language Models**, PazaBench compares three key metrics: **Character Error Rate (CER)**, **Word Error Rate (WER)**, and **RTFx (Inverse Real-Time Factor)**. """ # Which evaluations are you running? how can people reproduce what you have? LLM_BENCHMARKS_TEXT = """ ## PazaBench Inputs ### Evaluation Dataset The PazaBench evaluation dataset is unified from 7 datasets: [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke), [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices), [ALFFA](https://openslr.org/25/), [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset), [Google FLEURS](https://huggingface.co/datasets/google/fleurs), [Mozilla Common Voice 23.0](https://commonvoice.mozilla.org/), and [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset). It captures **61 test splits** across the listed **39 languages** and adds up to **204,492 samples** with modalities limited to 16 kHz mono speech with aligned transcriptions and per-split metadata. For each language, the dataset groups representing that language are unified to provide a balanced measure of model performance. 
""" LLM_BENCHMARKS_DATASETS_TEXT = """ ## Datasets | Dataset | Description | Languages | Total Samples | License | |---------|-------------|-----------|------------------|---------| | [Mozilla Common Voice 23.0](https://datacollective.mozillafoundation.org/datasets) | Crowdsourced speech | Afrikaans, Amharic, Arabic, Basaa, Dholuo, Dioula, Ekoti, Hausa, Igbo, Kabyle, Kalenjin, Kidaw'ida, Kinyarwanda, Luganda, Nyungwe, Setswana, Swahili, Tamazight, Tigre, Tigrinya, Twi, Yoruba, Zulu (**20 languages**) | 76,747 | CC0 1.0 | | [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset) | Conversational speech from Nigeria | Hausa, Igbo, Yoruba (**3 languages**) | 57,529 | CC BY-NC-SA 4.0 | | [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke) | Conversational speech | Dholuo, Kalenjin, Kikuyu, Maasai, Somali (**5 languages**) | 26,415 | CC BY 4.0 | | [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices) | Conversational speech | Sesotho, Setswana, Tshivenda, Xhosa, Xitsonga, Zulu (**6 languages**) | 23,701 | CC BY 4.0 | | [Google FLEURS](https://huggingface.co/datasets/google/fleurs) | Read speech | Afrikaans, Amharic, Dholuo, Fula, Ganda, Hausa, Igbo, Kamba, Lingala, Northern Sotho, Nyanja, Oromo, Shona, Somali, Swahili, Umbundu, Wolof, Xhosa, Yoruba, Zulu (**20 languages**) | 12,418 | CC BY 4.0 | | [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset) | Agricultural speech from farmers | Kikuyu (**1 language**) | 5,054 | Apache 2.0 | | [ALFFA](https://openslr.org/25/) | Broadcast news and read speech | Amharic, Swahili, Wolof (**3 languages**) | 4,350 | MIT | --- ## Evaluated Models PazaBench evaluates **16 SOTA ASR model families across 52 individual models** listed below: | Model Family | Model | |--------------|-------| | Paza by Microsoft Research Africa, Nairobi | 
[microsoft/paza-Phi-4-multimodal-instruct](https://huggingface.co/microsoft/paza-Phi-4-multimodal-instruct), [microsoft/paza-mms-1b-all](https://huggingface.co/microsoft/paza-mms-1b-all), [microsoft/paza-whisper-large-v3-turbo](https://huggingface.co/microsoft/paza-whisper-large-v3-turbo) | | Distil Whisper | [distil-whisper/distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2), [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3), [distil-whisper/distil-medium.en](https://huggingface.co/distil-whisper/distil-medium.en) | | Facebook Data2Vec | [facebook/data2vec-audio-base-960h](https://huggingface.co/facebook/data2vec-audio-base-960h), [facebook/data2vec-audio-large-960h](https://huggingface.co/facebook/data2vec-audio-large-960h) | | Facebook HuBERT | [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft), [facebook/hubert-xlarge-ls960-ft](https://huggingface.co/facebook/hubert-xlarge-ls960-ft) | | Facebook MMS | [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all), [facebook/mms-1b-fl102](https://huggingface.co/facebook/mms-1b-fl102) | | Facebook Omnilingual ASR | [facebook/omniASR-CTC-300M](https://huggingface.co/facebook/omniASR-CTC-300M), [facebook/omniASR-CTC-1B](https://huggingface.co/facebook/omniASR-CTC-1B), [facebook/omniASR-CTC-3B](https://huggingface.co/facebook/omniASR-CTC-3B), [facebook/omniASR-CTC-7B](https://huggingface.co/facebook/omniASR-CTC-7B), [facebook/omniASR-LLM-300M](https://huggingface.co/facebook/omniASR-LLM-300M), [facebook/omniASR-LLM-1B](https://huggingface.co/facebook/omniASR-LLM-1B), [facebook/omniASR-LLM-3B](https://huggingface.co/facebook/omniASR-LLM-3B), [facebook/omniASR-LLM-7B](https://huggingface.co/facebook/omniASR-LLM-7B), [facebook/omniASR-LLM-7B-ZS](https://huggingface.co/facebook/omniASR-LLM-7B-ZS) | | Facebook Wav2Vec2 | [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h), 
[facebook/wav2vec2-large-960h](https://huggingface.co/facebook/wav2vec2-large-960h), [facebook/wav2vec2-large-960h-lv60-self](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), [facebook/wav2vec2-large-robust-ft-libri-960h](https://huggingface.co/facebook/wav2vec2-large-robust-ft-libri-960h) | | Facebook Wav2Vec2 Conformer | [facebook/wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), [facebook/wav2vec2-conformer-rope-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rope-large-960h-ft) | | IBM Granite Speech | [ibm-granite/granite-speech-3.3-2b](https://huggingface.co/ibm-granite/granite-speech-3.3-2b), [ibm-granite/granite-speech-3.3-8b](https://huggingface.co/ibm-granite/granite-speech-3.3-8b) | | Kyutai STT | [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en) | | Lite ASR (EfficientSpeech) | [efficient-speech/lite-whisper-large-v3](https://huggingface.co/efficient-speech/lite-whisper-large-v3), [efficient-speech/lite-whisper-large-v3-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-acc), [efficient-speech/lite-whisper-large-v3-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-fast), [efficient-speech/lite-whisper-large-v3-turbo](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo), [efficient-speech/lite-whisper-large-v3-turbo-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-acc), [efficient-speech/lite-whisper-large-v3-turbo-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-fast) | | Microsoft Phi-4 | [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | | Moonshine | [usefulsensors/moonshine-base](https://huggingface.co/usefulsensors/moonshine-base), [usefulsensors/moonshine-tiny](https://huggingface.co/usefulsensors/moonshine-tiny) | | NVIDIA NeMo ASR | 
[nvidia/canary-1b-v2](https://huggingface.co/nvidia/canary-1b-v2), [nvidia/canary-qwen-2.5b](https://huggingface.co/nvidia/canary-qwen-2.5b), [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) | | OpenAI Whisper | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3), [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo), [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), [openai/whisper-large](https://huggingface.co/openai/whisper-large), [openai/whisper-medium.en](https://huggingface.co/openai/whisper-medium.en), [openai/whisper-small.en](https://huggingface.co/openai/whisper-small.en), [openai/whisper-base.en](https://huggingface.co/openai/whisper-base.en), [openai/whisper-tiny.en](https://huggingface.co/openai/whisper-tiny.en) | | Qwen2 Audio | [Qwen/Qwen2-Audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B), [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) | **Whisper Post-Processing:** Whisper model results include a duration-based truncation step to mitigate hallucination and known over-generation behavior. 
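
The exact truncation rule is internal to the PazaBench harness; a minimal sketch of the idea, assuming a hypothetical `max_chars_per_second` upper bound on speaking rate, might look like:

```python
# Hedged sketch of duration-based truncation (illustrative only; the exact
# PazaBench post-processing rule may differ). The idea: a transcript longer
# than the audio could plausibly contain is cut short, limiting the impact
# of Whisper hallucination loops on WER/CER.
def truncate_by_duration(text: str, audio_seconds: float,
                         max_chars_per_second: float = 25.0) -> str:
    # max_chars_per_second is an assumed ceiling, not a PazaBench constant.
    max_chars = int(audio_seconds * max_chars_per_second)
    if len(text) <= max_chars:
        return text
    # Cut at the last word boundary within the budget to avoid partial words.
    truncated = text[:max_chars]
    return truncated.rsplit(" ", 1)[0] if " " in truncated else truncated
```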

---

## Acknowledgements

We gratefully acknowledge the dataset creators and leaderboard teams whose contributions made PazaBench possible:

**Datasets:** We extend our gratitude to the creators, community contributors, and maintainers of [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke), [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices), [ALFFA](https://openslr.org/25/), [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset), [Google FLEURS](https://huggingface.co/datasets/google/fleurs), [Mozilla Common Voice](https://commonvoice.mozilla.org/), and [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset), whose efforts have been invaluable in advancing speech data for African languages.

**Reference Implementation:** We recognize the foundational work of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) by the Hugging Face Audio team and thank the contributors to the [open_asr_leaderboard repository](https://github.com/huggingface/open_asr_leaderboard) for their reproducible evaluation scripts.
""" # Dataset group metadata with descriptions and language counts # Used for adding descriptors to dataset filter dropdown DATASET_GROUP_METADATA = { "ALFFA": { "description": "Read speech & broadcast news", "languages": ["Amharic", "Swahili", "Wolof"], "language_count": 3, }, "African Next Voices Kenya": { "description": "Conversational speech", "languages": ["Dholuo", "Kalenjin", "Kikuyu", "Maasai", "Somali"], "language_count": 5, }, "African Next Voices South Africa": { "description": "Conversational speech", "languages": ["Sesotho", "Setswana", "Xhosa", "Xitsonga", "Tshivenda", "Zulu"], "language_count": 6, }, "Google FLEURS": { "description": "Read speech", "languages": ["Afrikaans", "Amharic", "Fula", "Ganda", "Hausa", "Igbo", "Kamba", "Lingala", "Dholuo", "Northern Sotho", "Nyanja", "Oromo", "Shona", "Somali", "Swahili", "Umbundu", "Wolof", "Xhosa", "Yoruba", "Zulu"], "language_count": 20, }, "DigiGreen Kikuyu ASR": { "description": "Agricultural speech", "languages": ["Kikuyu"], "language_count": 1, }, "Mozilla Common Voice 23.0": { "description": "Crowdsourced speech", "languages": ["Afrikaans", "Amharic", "Arabic", "Basaa", "Dholuo", "Dioula", "Ekoti", "Hausa", "Igbo", "Kabyle", "Kalenjin", "Kinyarwanda", "Kidaw'ida", "Luganda", "Nyungwe", "Setswana", "Swahili", "Tamazight", "Tigre", "Tigrinya", "Twi", "Yoruba", "Zulu"], "language_count": 20, }, "Naija Voices": { "description": "Conversational speech", "languages": ["Hausa", "Igbo", "Yoruba"], "language_count": 3, }, } def get_dataset_group_label(dataset_group: str) -> str: """Return a formatted label with description and language count.""" meta = DATASET_GROUP_METADATA.get(dataset_group) if meta: return f"{dataset_group} ({meta['description']}, {meta['language_count']} languages)" return dataset_group def get_dataset_group_languages(dataset_group: str) -> list[str]: """Return list of languages for a dataset group.""" meta = DATASET_GROUP_METADATA.get(dataset_group) if meta: return meta['languages'] 
return [] EVALUATION_LANGUAGE_TEXT = """ ### Request Language Evaluation Submit a language dataset from any region for evaluation on PazaBench. We'll benchmark it using all supported ASR models. Provide the dataset source in the form below. **Requirements:** - Dataset must be publicly accessible on Hugging Face Hub or via a public URL - Must contain audio samples with text transcriptions - Audio should be 16kHz mono WAV format (will be resampled if needed) """ EVALUATION_MODEL_TEXT = """ ### Submit a Model for Evaluation Add a new ASR model to PazaBench. We'll evaluate it across all 39 African languages. **Requirements:** - Model must be **publicly available** on [Hugging Face Hub](https://huggingface.co/models) - Must support speech-to-text / ASR tasks - Should be compatible with `transformers` AutoModel or provide clear loading instructions """ CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" CITATION_BUTTON_TEXT = r""" @misc{pazabench2026, title={PazaBench: A Benchmark for Automatic Speech Recognition on Low Resource Languages}, author={Microsoft Research Africa, Nairobi}, year={2026}, howpublished={\url{https://www.microsoft.com/en-us/research/project/project-gecko/}}, note={Alpha version. Part of Project Gecko - Equitable Generative AI for the Global Majority} } """
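

# Illustrative helper (a sketch, not used by the leaderboard UI): checks
# whether a submitted WAV file already meets the 16 kHz mono requirement
# described in EVALUATION_LANGUAGE_TEXT; non-matching audio is resampled
# during evaluation. Uses only the standard-library `wave` module.
import wave


def meets_audio_requirements(wav_file) -> bool:
    """Return True if `wav_file` (a path or file-like object) is 16 kHz mono."""
    with wave.open(wav_file, "rb") as f:
        return f.getnchannels() == 1 and f.getframerate() == 16000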