Shared Task: Mozilla Common Voice Spontaneous Speech ASR
https://www.codabench.org/competitions/10820/
Training code of the 1st place solution in 3 of 4 subtasks:
- Multilingual General
- Best small model
- Unseen Languages
Please see paper.pdf and training.ipynb for the details and code entry point.
Inference code and models are available in the following repository:
https://huggingface.co/vecxoz/mozilla-shared-task-1st-place-mms-inference
Author: Igor Ivanov (team "vecxoz")
License: MIT
The license for the KenLM distribution can be found in the kenlm_dist subdirectory.
Training datasets are not included, in accordance with the Common Voice requirements.
- The official bundle of Common Voice spontaneous speech datasets for 21 languages is available via the link:
https://datacollective.mozillafoundation.org/datasets/cmfzu8u8wa555eq8onrk334h4
- For the 5 languages without spontaneous data we used the Common Voice scripted datasets, version 23.
At the time of writing the current version is 24, and direct links to version 23 no longer work.
Any current version will probably work. Please search for the 5 corresponding languages via the link:
https://datacollective.mozillafoundation.org/datasets
The directory structure of the datasets is as follows:
mozilla-shared-task-1st-place-mms-training
|
|-- cv-corpus-23.0-2025-09-05
| |
| |-- ady
| | |
| | |-- clips
| | | |-- common_voice_ady_41822822.mp3
| | | |-- ...
| | | |-- common_voice_ady_43592770.mp3
| | |-- clip_durations.tsv
| | |-- ...
| | |-- validated.tsv
| |-- ...
| |-- ush
|
|-- mcv-sps-st-09-2025
  |
  |-- sps-corpus-1.0-2025-09-05-aln
  | |
  | |-- audios
  | | |-- spontaneous-speech-aln-29890.mp3
  | | |-- ...
  | | |-- spontaneous-speech-aln-63486.mp3
  | |-- ss-corpus-aln.tsv
  | |-- ss-reported-audios-aln.tsv
  |-- ...
  |-- sps-corpus-1.0-2025-09-05-ukv
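
Before training, it may help to confirm that the downloaded datasets match the tree above. The following is a minimal sketch, not part of the original training code. It assumes that the per-language `sps-corpus-1.0-2025-09-05-<lang>` directories are nested inside `mcv-sps-st-09-2025`, that every scripted language directory contains `clips` and `validated.tsv`, and that every spontaneous corpus contains `audios` and `ss-corpus-<lang>.tsv`. The function name `check_layout` and the constants are hypothetical and introduced only for illustration.

```python
from pathlib import Path

# Directory names copied from the tree above. If a different scripted
# Common Voice version is used, SCRIPTED_DIR must be changed accordingly.
SCRIPTED_DIR = "cv-corpus-23.0-2025-09-05"
SPONTANEOUS_DIR = "mcv-sps-st-09-2025"
SPS_PREFIX = "sps-corpus-1.0-2025-09-05-"  # followed by the language code


def check_layout(root):
    """Return a list of human-readable problems found under `root`.

    Hypothetical helper: an empty list means the assumed layout was found.
    """
    root = Path(root)
    problems = []

    # Scripted datasets: one subdirectory per language.
    scripted = root / SCRIPTED_DIR
    if not scripted.is_dir():
        problems.append(f"missing directory: {scripted}")
    else:
        for lang_dir in sorted(p for p in scripted.iterdir() if p.is_dir()):
            if not (lang_dir / "clips").is_dir():
                problems.append(f"missing directory: {lang_dir / 'clips'}")
            if not (lang_dir / "validated.tsv").is_file():
                problems.append(f"missing file: {lang_dir / 'validated.tsv'}")

    # Spontaneous datasets: one corpus directory per language (assumed nesting).
    spontaneous = root / SPONTANEOUS_DIR
    if not spontaneous.is_dir():
        problems.append(f"missing directory: {spontaneous}")
    else:
        corpus_dirs = sorted(
            p for p in spontaneous.iterdir()
            if p.is_dir() and p.name.startswith(SPS_PREFIX)
        )
        for corpus_dir in corpus_dirs:
            lang = corpus_dir.name[len(SPS_PREFIX):]
            if not (corpus_dir / "audios").is_dir():
                problems.append(f"missing directory: {corpus_dir / 'audios'}")
            tsv = corpus_dir / f"ss-corpus-{lang}.tsv"
            if not tsv.is_file():
                problems.append(f"missing file: {tsv}")

    return problems


if __name__ == "__main__":
    # Run from inside the repository root, or pass another path instead of ".".
    for problem in check_layout("."):
        print(problem)
```

The check only looks at directory and file names. It does not open audio files or validate the contents of the TSV files.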