# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">PazaBench Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
The PazaBench Leaderboard is an Automatic Speech Recognition (ASR) benchmark for low-resource languages developed by the **[Microsoft Research Africa, Nairobi Lab](https://www.microsoft.com/en-us/research/lab/microsoft-research-lab-africa-nairobi/)**. Launching with **39 African languages** and **52 state-of-the-art ASR and language models**, PazaBench reports three key metrics: **Character Error Rate (CER)**, **Word Error Rate (WER)**, and **RTFx (Inverse Real-Time Factor)**.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
## PazaBench Inputs

### Evaluation Dataset

The PazaBench evaluation dataset is unified from 7 datasets: [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke), [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices), [ALFFA](https://openslr.org/25/), [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset), [Google FLEURS](https://huggingface.co/datasets/google/fleurs), [Mozilla Common Voice 23.0](https://commonvoice.mozilla.org/), and [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset). It comprises **61 test splits** across the **39 languages**, totaling **204,492 samples** of 16 kHz mono speech with aligned transcriptions and per-split metadata. For each language, the dataset groups covering that language are unified to provide a balanced measure of model performance.
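
As an illustration of how WER and the per-language unification described above can be computed (a minimal sketch, not the exact PazaBench pipeline; the helper names are hypothetical):

```python
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate: Levenshtein distance over word tokens,
    # normalized by the reference length.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

def balanced_language_score(split_scores: dict[tuple[str, str], float]) -> dict[str, float]:
    # split_scores maps (language, dataset_group) -> WER for that split.
    # Each dataset group covering a language contributes equally, so no
    # single large dataset dominates the per-language score.
    per_language = defaultdict(list)
    for (language, _group), score in split_scores.items():
        per_language[language].append(score)
    return {lang: sum(s) / len(s) for lang, s in per_language.items()}
```

For example, a hypothesis that drops one of three reference words scores a WER of 1/3, and a language covered by two dataset groups with WERs 0.2 and 0.4 receives a balanced score of 0.3.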
"""

LLM_BENCHMARKS_DATASETS_TEXT = """
## Datasets

| Dataset | Description | Languages | Total Samples | License |
|---------|-------------|-----------|------------------|---------|
| [Mozilla Common Voice 23.0](https://datacollective.mozillafoundation.org/datasets) | Crowdsourced speech | Afrikaans, Amharic, Arabic, Basaa, Dholuo, Dioula, Ekoti, Hausa, Igbo, Kabyle, Kalenjin, Kidaw'ida, Kinyarwanda, Luganda, Nyungwe, Setswana, Swahili, Tamazight, Tigre, Tigrinya, Twi, Yoruba, Zulu (**23 languages**) | 76,747 | CC0 1.0 |
| [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset) | Conversational speech from Nigeria | Hausa, Igbo, Yoruba (**3 languages**) | 57,529 | CC BY-NC-SA 4.0 |
| [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke) | Conversational speech | Dholuo, Kalenjin, Kikuyu, Maasai, Somali (**5 languages**) | 26,415 | CC BY 4.0 |
| [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices) | Conversational speech | Sesotho, Setswana, Tshivenda, Xhosa, Xitsonga, Zulu (**6 languages**) | 23,701 | CC BY 4.0 |
| [Google FLEURS](https://huggingface.co/datasets/google/fleurs) | Read speech | Afrikaans, Amharic, Dholuo, Fula, Ganda, Hausa, Igbo, Kamba, Lingala, Northern Sotho, Nyanja, Oromo, Shona, Somali, Swahili, Umbundu, Wolof, Xhosa, Yoruba, Zulu (**20 languages**) | 12,418 | CC BY 4.0 |
| [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset) | Agricultural speech from farmers | Kikuyu (**1 language**) | 5,054 | Apache 2.0 |
| [ALFFA](https://openslr.org/25/) | Broadcast news and read speech | Amharic, Swahili, Wolof (**3 languages**) | 4,350 | MIT |

---

## Evaluated Models

PazaBench evaluates **52 individual models spanning 16 SOTA ASR model families**, listed below:

| Model Family | Model |
|--------------|-------|
| Paza by Microsoft Research Africa, Nairobi | [microsoft/paza-Phi-4-multimodal-instruct](https://huggingface.co/microsoft/paza-Phi-4-multimodal-instruct), [microsoft/paza-mms-1b-all](https://huggingface.co/microsoft/paza-mms-1b-all), [microsoft/paza-whisper-large-v3-turbo](https://huggingface.co/microsoft/paza-whisper-large-v3-turbo) |
| Distil Whisper | [distil-whisper/distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2), [distil-whisper/distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3), [distil-whisper/distil-medium.en](https://huggingface.co/distil-whisper/distil-medium.en) |
| Facebook Data2Vec | [facebook/data2vec-audio-base-960h](https://huggingface.co/facebook/data2vec-audio-base-960h), [facebook/data2vec-audio-large-960h](https://huggingface.co/facebook/data2vec-audio-large-960h) |
| Facebook HuBERT | [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft), [facebook/hubert-xlarge-ls960-ft](https://huggingface.co/facebook/hubert-xlarge-ls960-ft) |
| Facebook MMS | [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all), [facebook/mms-1b-fl102](https://huggingface.co/facebook/mms-1b-fl102) |
| Facebook Omnilingual ASR | [facebook/omniASR-CTC-300M](https://huggingface.co/facebook/omniASR-CTC-300M), [facebook/omniASR-CTC-1B](https://huggingface.co/facebook/omniASR-CTC-1B), [facebook/omniASR-CTC-3B](https://huggingface.co/facebook/omniASR-CTC-3B), [facebook/omniASR-CTC-7B](https://huggingface.co/facebook/omniASR-CTC-7B), [facebook/omniASR-LLM-300M](https://huggingface.co/facebook/omniASR-LLM-300M), [facebook/omniASR-LLM-1B](https://huggingface.co/facebook/omniASR-LLM-1B), [facebook/omniASR-LLM-3B](https://huggingface.co/facebook/omniASR-LLM-3B), [facebook/omniASR-LLM-7B](https://huggingface.co/facebook/omniASR-LLM-7B), [facebook/omniASR-LLM-7B-ZS](https://huggingface.co/facebook/omniASR-LLM-7B-ZS) |
| Facebook Wav2Vec2 | [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h), [facebook/wav2vec2-large-960h](https://huggingface.co/facebook/wav2vec2-large-960h), [facebook/wav2vec2-large-960h-lv60-self](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), [facebook/wav2vec2-large-robust-ft-libri-960h](https://huggingface.co/facebook/wav2vec2-large-robust-ft-libri-960h) |
| Facebook Wav2Vec2 Conformer | [facebook/wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), [facebook/wav2vec2-conformer-rope-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rope-large-960h-ft) |
| IBM Granite Speech | [ibm-granite/granite-speech-3.3-2b](https://huggingface.co/ibm-granite/granite-speech-3.3-2b), [ibm-granite/granite-speech-3.3-8b](https://huggingface.co/ibm-granite/granite-speech-3.3-8b) |
| Kyutai STT | [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en) |
| Lite ASR (EfficientSpeech) | [efficient-speech/lite-whisper-large-v3](https://huggingface.co/efficient-speech/lite-whisper-large-v3), [efficient-speech/lite-whisper-large-v3-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-acc), [efficient-speech/lite-whisper-large-v3-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-fast), [efficient-speech/lite-whisper-large-v3-turbo](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo), [efficient-speech/lite-whisper-large-v3-turbo-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-acc), [efficient-speech/lite-whisper-large-v3-turbo-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-fast) |
| Microsoft Phi-4 | [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) |
| Moonshine | [usefulsensors/moonshine-base](https://huggingface.co/usefulsensors/moonshine-base), [usefulsensors/moonshine-tiny](https://huggingface.co/usefulsensors/moonshine-tiny) |
| NVIDIA NeMo ASR | [nvidia/canary-1b-v2](https://huggingface.co/nvidia/canary-1b-v2), [nvidia/canary-qwen-2.5b](https://huggingface.co/nvidia/canary-qwen-2.5b), [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| OpenAI Whisper | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3), [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo), [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), [openai/whisper-large](https://huggingface.co/openai/whisper-large), [openai/whisper-medium.en](https://huggingface.co/openai/whisper-medium.en), [openai/whisper-small.en](https://huggingface.co/openai/whisper-small.en), [openai/whisper-base.en](https://huggingface.co/openai/whisper-base.en), [openai/whisper-tiny.en](https://huggingface.co/openai/whisper-tiny.en) |
| Qwen2 Audio | [Qwen/Qwen2-Audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B), [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |

**Whisper Post-Processing:** Whisper model results include a duration-based truncation step to mitigate hallucination and known over-generation behavior.
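
The idea behind duration-based truncation can be sketched as capping the hypothesis at a word budget proportional to the audio duration, so runaway generations are cut off. This is an illustrative heuristic, not the exact PazaBench step, and the 3-words-per-second default is an assumed rate:

```python
def truncate_by_duration(text: str, audio_seconds: float,
                         max_words_per_second: float = 3.0) -> str:
    # Crude guard against Whisper-style over-generation: natural speech
    # rarely exceeds a few spoken words per second, so any words past
    # the duration-derived budget are dropped.
    budget = max(1, int(audio_seconds * max_words_per_second))
    words = text.split()
    return " ".join(words[:budget])
```

A hallucinated tail on a short clip is removed, while transcripts within the budget pass through unchanged.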

---

## Acknowledgements

We gratefully acknowledge the dataset creators and leaderboard teams whose contributions made PazaBench possible:

**Datasets:** We extend our gratitude to the creators, community contributors, and maintainers of [African Next Voices Kenya](https://huggingface.co/datasets/MCAA1-MSU/anv_data_ke), [African Next Voices South Africa](https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices), [ALFFA](https://openslr.org/25/), [DigiGreen Kikuyu ASR](https://huggingface.co/datasets/DigiGreen/KikuyuASR_trainingdataset), [Google FLEURS](https://huggingface.co/datasets/google/fleurs), [Mozilla Common Voice](https://commonvoice.mozilla.org/), and [Naija Voices](https://huggingface.co/datasets/naijavoices/naijavoices-dataset), whose efforts have been invaluable in advancing speech data for African languages.

**Reference Implementation:** We recognize the foundational work of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) by the Hugging Face Audio team, and we appreciate the contributors of the [open_asr_leaderboard repository](https://github.com/huggingface/open_asr_leaderboard) for creating reproducible evaluation scripts.
"""

# Dataset group metadata with descriptions and language counts
# Used for adding descriptors to dataset filter dropdown
DATASET_GROUP_METADATA = {
    "ALFFA": {
        "description": "Read speech & broadcast news",
        "languages": ["Amharic", "Swahili", "Wolof"],
        "language_count": 3,
    },
    "African Next Voices Kenya": {
        "description": "Conversational speech",
        "languages": ["Dholuo", "Kalenjin", "Kikuyu", "Maasai", "Somali"],
        "language_count": 5,
    },
    "African Next Voices South Africa": {
        "description": "Conversational speech",
        "languages": ["Sesotho", "Setswana", "Xhosa", "Xitsonga", "Tshivenda", "Zulu"],
        "language_count": 6,
    },
    "Google FLEURS": {
        "description": "Read speech",
        "languages": ["Afrikaans", "Amharic", "Fula", "Ganda", "Hausa", "Igbo", "Kamba", "Lingala", "Dholuo", "Northern Sotho", "Nyanja", "Oromo", "Shona", "Somali", "Swahili", "Umbundu", "Wolof", "Xhosa", "Yoruba", "Zulu"],
        "language_count": 20,
    },
    "DigiGreen Kikuyu ASR": {
        "description": "Agricultural speech",
        "languages": ["Kikuyu"],
        "language_count": 1,
    },
    "Mozilla Common Voice 23.0": {
        "description": "Crowdsourced speech",
        "languages": ["Afrikaans", "Amharic", "Arabic", "Basaa", "Dholuo", "Dioula", "Ekoti", "Hausa", "Igbo", "Kabyle", "Kalenjin", "Kinyarwanda", "Kidaw'ida", "Luganda", "Nyungwe", "Setswana", "Swahili", "Tamazight", "Tigre", "Tigrinya", "Twi", "Yoruba", "Zulu"],
        "language_count": 20,
    },
    "Naija Voices": {
        "description": "Conversational speech",
        "languages": ["Hausa", "Igbo", "Yoruba"],
        "language_count": 3,
    },
}

def get_dataset_group_label(dataset_group: str) -> str:
    """Return a formatted label with description and language count."""
    meta = DATASET_GROUP_METADATA.get(dataset_group)
    if meta:
        return f"{dataset_group} ({meta['description']}, {meta['language_count']} languages)"
    return dataset_group

def get_dataset_group_languages(dataset_group: str) -> list[str]:
    """Return list of languages for a dataset group."""
    meta = DATASET_GROUP_METADATA.get(dataset_group)
    if meta:
        return meta['languages']
    return []

EVALUATION_LANGUAGE_TEXT = """
### Request Language Evaluation

Submit a language dataset from any region for evaluation on PazaBench. We'll benchmark it using all supported ASR models. Provide the dataset source in the form below.

**Requirements:**
- Dataset must be publicly accessible on Hugging Face Hub or via a public URL
- Must contain audio samples with text transcriptions
- Audio should be 16 kHz mono WAV format (will be resampled if needed)
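
As a quick local pre-check before submitting (a minimal sketch using Python's standard `wave` module; this helper is not part of the submission pipeline), you can verify a WAV file's sample rate and channel count:

```python
import wave

def check_wav_format(path: str, expected_rate: int = 16000) -> dict:
    # Report a WAV file's sample rate and channel count against the
    # expected 16 kHz mono target.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return {
        "sample_rate": rate,
        "channels": channels,
        "ok": rate == expected_rate and channels == 1,
    }
```

Files that fail this check can still be submitted; audio at other sample rates is resampled during evaluation.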
"""

EVALUATION_MODEL_TEXT = """
### Submit a Model for Evaluation

Add a new ASR model to PazaBench. We'll evaluate it across languages.

**Requirements:**
- Model must be **publicly available** on [Hugging Face Hub](https://huggingface.co/models)
- Must support speech-to-text / ASR tasks
- Should be compatible with `transformers` AutoModel or provide clear loading instructions
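
The submission form expects a Hub identifier of the shape `namespace/model-name`. A hypothetical helper (`is_valid_hub_id` is not part of PazaBench) that pre-checks this shape without contacting the Hub:

```python
def is_valid_hub_id(model_id: str) -> bool:
    # Hub model IDs look like "namespace/model-name": exactly one "/",
    # with non-empty parts made of word characters, dots, or hyphens
    # (uppercase letters are allowed, e.g. "Qwen/Qwen2-Audio-7B").
    allowed = set("abcdefghijklmnopqrstuvwxyz0123456789._-")
    parts = model_id.split("/")
    if len(parts) != 2:
        return False
    return all(p and set(p.lower()) <= allowed for p in parts)
```

This only validates the shape of the identifier; whether the model actually exists and is public is checked at evaluation time.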
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{pazabench2026,
  title={PazaBench: A Benchmark for Automatic Speech Recognition on Low Resource Languages},
  author={Microsoft Research Africa, Nairobi},
  year={2026},
  howpublished={\url{https://www.microsoft.com/en-us/research/project/project-gecko/}},
  note={Alpha version. Part of Project Gecko - Equitable Generative AI for the Global Majority}
}
"""