Shared Task: Mozilla Common Voice Spontaneous Speech ASR

https://www.codabench.org/competitions/10820/

Training code of the 1st place solution in 3 of 4 subtasks:

Multilingual General
Best small model
Unseen Languages

Please see paper.pdf and training.ipynb for the details and code entry point.

Inference code and models are available in the following repository:
https://huggingface.co/vecxoz/mozilla-shared-task-1st-place-mms-inference

Author: Igor Ivanov (team "vecxoz")

License: MIT
The license for the KenLM distribution can be found in the kenlm_dist subdirectory.

Training datasets are not included, in accordance with the Common Voice requirements.

The official bundle of Common Voice spontaneous speech datasets for 21 languages is available at:
https://datacollective.mozillafoundation.org/datasets/cmfzu8u8wa555eq8onrk334h4
For the 5 languages without spontaneous data we used the Common Voice scripted datasets, version 23. At the time of writing the current version is 24, and direct links to version 23 no longer work; any current version will probably be suitable. Please search for the 5 corresponding languages at:
https://datacollective.mozillafoundation.org/datasets

The directory structure of the datasets is as follows:

mozilla-shared-task-1st-place-mms-training
|
|-- cv-corpus-23.0-2025-09-05
|   |
|   |-- ady
|   |   |
|   |   |-- clips
|   |   |   |-- common_voice_ady_41822822.mp3
|   |   |   |-- ...
|   |   |   |-- common_voice_ady_43592770.mp3
|   |   |-- clip_durations.tsv
|   |   |-- ...
|   |   |-- validated.tsv
|   |-- ...
|   |-- ush
|
|-- mcv-sps-st-09-2025
    |
    |-- sps-corpus-1.0-2025-09-05-aln
    |   |
    |   |-- audios
    |   |   |-- spontaneous-speech-aln-29890.mp3
    |   |   |-- ...
    |   |   |-- spontaneous-speech-aln-63486.mp3
    |   |-- ss-corpus-aln.tsv
    |   |-- ss-reported-audios-aln.tsv
    |-- ...
    |-- sps-corpus-1.0-2025-09-05-ukv
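
After downloading, it can help to confirm that the files landed where the tree above shows them. The following is a minimal sketch, not part of the training code: the function name summarize_layout and the key prefixes "scripted/" and "spontaneous/" are hypothetical, and the glob patterns assume the directory names shown above (with "cv-corpus-*" matched loosely, since a scripted release other than version 23 may be used).

```python
from pathlib import Path


def count_mp3(directory):
    """Count .mp3 files directly inside a directory (0 if it does not exist)."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for _ in directory.glob("*.mp3"))


def summarize_layout(root):
    """Return a dict mapping each dataset directory to its number of mp3 clips.

    Assumed layout (taken from the tree above):
      <root>/cv-corpus-*/<lang>/clips/*.mp3
      <root>/mcv-sps-st-09-2025/sps-corpus-*/audios/*.mp3
    """
    root = Path(root)
    summary = {}

    # Scripted corpora: one subdirectory per language, clips in "clips".
    for corpus_dir in sorted(root.glob("cv-corpus-*")):
        if not corpus_dir.is_dir():
            continue
        for lang_dir in sorted(p for p in corpus_dir.iterdir() if p.is_dir()):
            summary[f"scripted/{lang_dir.name}"] = count_mp3(lang_dir / "clips")

    # Spontaneous corpora: one directory per language, clips in "audios".
    spontaneous_root = root / "mcv-sps-st-09-2025"
    if spontaneous_root.is_dir():
        for corpus_dir in sorted(spontaneous_root.glob("sps-corpus-*")):
            if corpus_dir.is_dir():
                summary[f"spontaneous/{corpus_dir.name}"] = count_mp3(corpus_dir / "audios")

    return summary


if __name__ == "__main__":
    # Root directory name as shown in the tree above.
    for name, n_clips in summarize_layout("mozilla-shared-task-1st-place-mms-training").items():
        print(f"{name}: {n_clips} clips")
```

A dataset reported with 0 clips usually means that an archive was extracted one level too deep or too shallow relative to the tree above.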
