{ "cells": [ { "cell_type": "markdown", "id": "7fa2feb8", "metadata": {}, "source": [ "# Shared Task: Mozilla Common Voice Spontaneous Speech ASR\n", "*https://www.codabench.org/competitions/10820/* \n", "*1st place solution* \n", "*Copyright (c) 2025 Igor Ivanov* \n", "*Email: vecxoz@gmail.com* \n", "*License: MIT* \n", "*I will be happy to answer any questions.*" ] }, { "cell_type": "markdown", "id": "6dfa145d", "metadata": {}, "source": [ "# Contents\n", "\n", "1. Summary \n", "2. Installation \n", "3. Inference" ] }, { "cell_type": "markdown", "id": "43ff13da-4175-4345-a880-d59e26bbb8d7", "metadata": {}, "source": [ "# 1. Summary\n", "\n", "This notebook presents the inference code and models for all 4 tasks of the Shared Task: Mozilla Common Voice Spontaneous Speech ASR. We did not use external data. Only Common Voice datasets were used: spontaneous speech for 21 languages, and scripted speech for 5 unseen languages. We fine-tuned the MMS model with adapter layers per language. \n", "\n", "Our best single model features the following improvements over the baseline. (1) More data: we used both training and validation subsets for fine-tuning. (2) A different pretrained checkpoint, `facebook/mms-1b-l1107`. (3) Longer training, for 30 epochs. (4) Learning rate schedules tailored to each language. (5) Beam search decoding with a KenLM language model. For the small model subtask we used the same MMS models with pruning and 4-bit quantization. \n", "\n", "Our overall best submission is an ensemble of 4 models. (1) `facebook/mms-1b-l1107` fine-tuned using training data only. (2) `facebook/mms-1b-l1107` fine-tuned using all data (training and validation subsets). (3) `facebook/mms-1b-all` fine-tuned using all data. (4) `facebook/mms-1b-fl102` fine-tuned using all data. We applied the [ROVER](https://github.com/usnistgov/SCTK) ensembling method, which outperformed each single model. \n", "\n", "Please find all details in the paper." 
] }, { "cell_type": "markdown", "id": "0ddd35e1-bbd4-4f56-a143-e8430cd7238c", "metadata": {}, "source": [ "# 2. Installation\n", "\n", "## Directory structure\n", "```\n", "solution\n", "|-- kenlm_models_order_3 # KenLM models per language\n", "|-- mdc_asr_shared_task_test_data # Official test dataset\n", "|-- models-01-mms-1b-l1107-tuned-commonvoice-train-data # `mms-1b-l1107` fine-tuned using training subsets only\n", "|-- models-02-mms-1b-l1107-tuned-commonvoice-all-data # `mms-1b-l1107` fine-tuned using all data (best single model)\n", "|-- models-03-mms-1b-all-tuned-commonvoice-all-data # `mms-1b-all` fine-tuned using all data\n", "|-- models-04-mms-1b-fl102-tuned-commonvoice-all-data # `mms-1b-fl102` fine-tuned using all data\n", "|-- models-05-mms-1b-l1107-tuned-commonvoice-all-data-pruned-quant # Pruned and quantized version of `models-02`\n", "|-- SCTK # ROVER ensemble (source and binaries)\n", "|-- ensemble.py\n", "|-- infer.py\n", "|-- LICENSE.txt\n", "|-- README.md\n", "|-- requirements.txt\n", "|-- solution.ipynb\n", "```" ] }, { "cell_type": "markdown", "id": "f734ecae-d352-439a-aa18-96fd2546702e", "metadata": {}, "source": [ "## ROVER check\n", "\n", "We built a binary from source on Ubuntu 22.04. If you are using a similar system it should work out of the box. \n", "Please run the command below. If a help message is displayed, then everything works. \n", "If not, you have to build it from source, as shown below or in official documentation: https://github.com/usnistgov/SCTK" ] }, { "cell_type": "code", "execution_count": 13, "id": "b3cf2628-be6c-48cb-9ebf-7d2955be1e5c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rover: \n", "rover Version: 0.1, SCTK Version: 1.3\n", "Description: rover takes N input files and does an N-way DP alignment\n", " on those files. 
The output is either a set of minimal cost\n", " alignments, or a Voted output depending the -m option.\n", "Input Options:\n", " -h hypfile ctm\n", " Define the hypothesis file and it's format. This option\n", " must be used more than once.\n", "Output Options:\n", " -o outfile Define the output file. (Will be same format as hyps)\n", " -f level Defines feedback mode, default is 1\n", " -l width Defines the line width.\n", "Alignment Options:\n", " -m meth Execute method:\n", " oracle -> output the fully alternated transcript\n", " meth1 -> alpha = -a , conf = -c, choose highest avg\n", " maxconf -> Same as meth1, but use the maximum conf\n", " score for a CS set as the metric\n", " avgconf -> alpha*Sum(WO) + (1-alpha)Sum(Conf(W))\n", " maxconfa -> Same as maxconf, but conf=N(@)/S\n", " putat -> Output the putative hit format\n", " -s Do Case-sensitive alignments.\n", " -T Use time information, (if available), to calculated word-to-\n", " word distances based on times.\n", "Processing Options:\n", " -a alpha Default: 1.0\n", " -c Null_conf\n", " Default: 0.0\n", "rover: Req'd Hyp File names, 2 or more\n", "\n" ] } ], "source": [ "!./SCTK/bin/rover" ] }, { "cell_type": "markdown", "id": "35b537f7-36a3-40cd-9d79-892b150cd925", "metadata": {}, "source": [ "## ROVER installation (if needed)\n", "\n", "Compilation takes about 2 minutes." 
] }, { "cell_type": "markdown", "id": "cdaba278-3472-4289-8dfb-b1017310064e", "metadata": {}, "source": [ "```\n", "!mv SCTK SCTK_prebuilt\n", "!git clone https://github.com/usnistgov/SCTK\n", "%cd SCTK\n", "\n", "!make config\n", "!make all\n", "!make check\n", "!make install\n", "!make doc\n", " \n", "%cd ..\n", "```" ] }, { "cell_type": "markdown", "id": "8108da48-12aa-4663-9c20-3028b9b8117e", "metadata": {}, "source": [ "## Package installation\n", "\n", "**Hardware:**\n", "* Core-i5 CPU\n", "* 32 GB RAM\n", "* 500 GB SSD\n", "* RTX-3090-24GB GPU\n", "\n", "**System:**\n", "* Ubuntu 22.04\n", "* Python 3.12\n", "* CUDA 12.8\n", "* PyTorch 2.9.0\n", "\n", "**Note.** We used Flash Attention 2 (`flash_attn==2.7.4.post1`) during training but not for inference, so it is not included in `requirements.txt`; compiling this version takes about 2 hours on a 12th Gen Core-i5." ] }, { "cell_type": "code", "execution_count": null, "id": "b5444ca5-e719-4e40-a851-1a2fc4900542", "metadata": {}, "outputs": [], "source": [ "!pip install -r requirements.txt" ] }, { "cell_type": "markdown", "id": "eda488d4-8e09-4c22-9ff4-3675ed3bc9dc", "metadata": {}, "source": [ "# 3. Inference\n", "\n", "The distribution archive already contains the official test dataset `mdc_asr_shared_task_test_data`, so inference should run out of the box. \n", "To run inference on a different dataset, set the `input_dir` argument of both `infer.py` and `ensemble.py`. \n", "\n", "Total inference time is about 3 hours on an `RTX-3090-24GB` GPU.\n", "\n", "Script `infer.py` will create the following directories. \n", "Directories with the `_ctm` suffix contain intermediate `.ctm` files which will be used for the ensemble. \n", "Directories with the `_submission` suffix contain `.tsv` files ready for scoring. Specifically, `output_2_submission` holds the results from the best single model, and `output_5_submission` holds the results from the small model. 
\n", "Note that all directories with the `_submission` suffix contain the standard subdirectories `multilingual-general` and `unseen-langs`. \n", "```\n", "|-- output_1_ctm\n", "|-- output_1_submission\n", "|-- output_2_ctm\n", "|-- output_2_submission\n", "|-- output_3_ctm\n", "|-- output_3_submission\n", "|-- output_4_ctm\n", "|-- output_4_submission\n", "|-- output_5_ctm\n", "|-- output_5_submission\n", "```\n", "\n", "Script `ensemble.py` will create the following. \n", "`submission_single_model` is the final submission from the best single model, corresponding to Codabench submission `ID 452610`. \n", "`submission_ensemble` is the final submission from the ensemble, corresponding to Codabench submission `ID 456691`. \n", "\n", "```\n", "|-- submission_ensemble\n", "|-- submission_single_model\n", "|-- submission_ensemble.zip\n", "|-- submission_single_model.zip\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "79e8f7bc-d529-411f-a09f-08e8ca210a0f", "metadata": {}, "outputs": [], "source": [ "!python infer.py \\\n", "--input_dir=mdc_asr_shared_task_test_data \\\n", "--kenlm_model=kenlm_models_order_3 \\\n", "--beam_width=100 \\\n", "--attn_implementation=sdpa \\\n", "--use_amp=0 \\\n", "--device=cuda" ] }, { "cell_type": "code", "execution_count": null, "id": "65044956-7435-4bbf-b419-7ac437de9e00", "metadata": {}, "outputs": [], "source": [ "!python ensemble.py \\\n", "--input_dir=mdc_asr_shared_task_test_data \\\n", "--output_dir_single=submission_single_model \\\n", "--output_dir_ensemble=submission_ensemble \\\n", "--rover_path=./SCTK/bin/rover" ] }, { "cell_type": "code", "execution_count": 5, "id": "f3efa9c1", "metadata": {}, "outputs": [], "source": [ "# END" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 5 }