{
"cells": [
{
"cell_type": "markdown",
"id": "sHAHlxz1HYc0",
"metadata": {
"id": "sHAHlxz1HYc0"
},
"source": [
"# Introduction"
]
},
{
"cell_type": "markdown",
"id": "9BVTGynbHSoy",
"metadata": {
"id": "9BVTGynbHSoy"
},
"source": [
"[Speech Data Explorer](https://github.com/NVIDIA/NeMo/tree/main/tools/speech_data_explorer) (SDE) is a visual tool for interactive exploration of speech datasets and error analysis of Automatic Speech Recognition (ASR) models. This tutorial demonstrates how to use SDE in Comparison mode to evaluate two ASR models on a given test set and identify differences in their predictions."
]
},
{
"cell_type": "markdown",
"id": "57pDMtWtHdxv",
"metadata": {
"id": "57pDMtWtHdxv"
},
"source": [
"# Installation"
]
},
{
"cell_type": "markdown",
"id": "JXu9TViTuyVy",
"metadata": {
"id": "JXu9TViTuyVy"
},
"source": [
"During installation, Colab may show pop-up notifications asking you to restart the runtime. **Decline** them: they appear because the versions of Colab's default core libraries differ from those installed with NeMo."
]
},
{
"cell_type": "markdown",
"id": "u_sKMlVpcAvO",
"metadata": {
"id": "u_sKMlVpcAvO"
},
"source": [
"First, let's install NeMo:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3919489",
"metadata": {
"id": "c3919489"
},
"outputs": [],
"source": [
"BRANCH = 'main'\n",
"\n",
"!git clone -b $BRANCH https://github.com/NVIDIA/NeMo\n",
"\n",
"!apt-get update && apt-get install -y libsndfile1 ffmpeg sox\n",
"\n",
"!cd ./NeMo/ && pip install -e \".[asr]\"\n",
"!pip3 install -r ./NeMo/tools/speech_data_explorer/requirements.txt\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "TOOhimHTrL9r",
"metadata": {
"id": "TOOhimHTrL9r"
},
"source": [
"# Dataset"
]
},
{
"cell_type": "markdown",
"id": "Eg2mqxC0XrOV",
"metadata": {
"id": "Eg2mqxC0XrOV"
},
"source": [
"In this tutorial, we use the LibriSpeech test-other dataset to evaluate ASR models. Let's download and prepare the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "LQYnm_hsYqyt",
"metadata": {
"id": "LQYnm_hsYqyt"
},
"outputs": [],
"source": [
"!python3 ./NeMo/scripts/dataset_processing/get_librispeech_data.py --data_sets test_other --data_root ."
]
},
{
"cell_type": "markdown",
"id": "9737e0f1",
"metadata": {
"id": "9737e0f1"
},
"source": [
"Before starting SDE, we need to run ASR inference on a given dataset and save predictions into a JSON manifest.\n"
]
},
{
"cell_type": "markdown",
"id": "t5ruQ3x84Yiq",
"metadata": {
"id": "t5ruQ3x84Yiq"
},
"source": [
"It is assumed that you already have a JSON manifest containing the `audio_filepath` and reference `text` fields. Here is a minimal example:"
]
},
{
"cell_type": "markdown",
"id": "ic-QvihbHMQa",
"metadata": {
"id": "ic-QvihbHMQa"
},
"source": [
"```\n",
"{\n",
" \"audio_filepath\": \"/LibriSpeech/dev-clean-processed/2902-9008-0000.wav\",\n",
" \"text\": \"the place seemed fragrant with all the riches of greek\"\n",
"}\n",
"```"
]
},
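{
"cell_type": "markdown",
"id": "sdeManifestMd1",
"metadata": {
"id": "sdeManifestMd1"
},
"source": [
"NeMo manifests store one JSON object per line. As a rough sanity check before running inference, the sketch below reads a manifest and reports the lines that lack the `audio_filepath` or `text` field. The helper name `find_incomplete_entries` is hypothetical (it is not part of NeMo), and the path `test_other.json` assumes the manifest produced by the download step above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sdeManifestPy1",
"metadata": {
"id": "sdeManifestPy1"
},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"\n",
"REQUIRED_FIELDS = (\"audio_filepath\", \"text\")\n",
"\n",
"def find_incomplete_entries(manifest_path):\n",
"    # Hypothetical helper (not part of NeMo): returns (line_number, missing_fields) pairs\n",
"    problems = []\n",
"    with open(manifest_path, encoding=\"utf-8\") as f:\n",
"        for line_number, line in enumerate(f, start=1):\n",
"            if not line.strip():\n",
"                continue  # skip blank lines\n",
"            entry = json.loads(line)  # one JSON object per line\n",
"            missing = [field for field in REQUIRED_FIELDS if field not in entry]\n",
"            if missing:\n",
"                problems.append((line_number, missing))\n",
"    return problems\n",
"\n",
"# Assumes the manifest path used in this tutorial; skipped if the file is absent\n",
"if os.path.exists(\"test_other.json\"):\n",
"    print(find_incomplete_entries(\"test_other.json\"))"
]
},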
{
"cell_type": "markdown",
"id": "O7XnVkY0rXka",
"metadata": {
"id": "O7XnVkY0rXka"
},
"source": [
"# Transcription"
]
},
{
"cell_type": "markdown",
"id": "rYe0xVTzjbhQ",
"metadata": {
"id": "rYe0xVTzjbhQ"
},
"source": [
"To compare two models, the JSON manifest should contain predictions from both the first (e.g., `QuartzNet15x5`) and the second (e.g., `Conformer-CTC Small`) model.\n",
"\n",
"NeMo includes a Python script for ASR inference: [`transcribe_speech.py`](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/transcribe_speech.py).\n",
"\n",
"`transcribe_speech.py` accepts the `append_pred` flag, which saves an ASR transcript in the JSON file under a custom field (like `pred_text_QN`). The `pred_name_postfix` parameter defines the custom field's name. In this example, it is set to the abbreviated model name `QN`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "928a9127",
"metadata": {
"id": "928a9127"
},
"outputs": [],
"source": [
"!python3 ./NeMo/examples/asr/transcribe_speech.py \\\n",
"pretrained_name=\"QuartzNet15x5Base-En\" \\\n",
"dataset_manifest=./test_other.json \\\n",
"output_filename=\"test_other_QN.json\" \\\n",
"batch_size=8 \\\n",
"cuda=0 amp=True \\\n",
"append_pred=True \\\n",
"pred_name_postfix=\"QN\""
]
},
{
"cell_type": "markdown",
"id": "OfhJrSkvqrs6",
"metadata": {
"id": "OfhJrSkvqrs6"
},
"source": [
"`transcribe_speech.py` also reports the overall WER for the given dataset. In our case, QuartzNet's WER is 10.0%.\n",
"\n",
"Here is the first line of the newly created `test_other_QN.json` manifest file with QuartzNet's predictions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e274c99",
"metadata": {
"id": "5e274c99"
},
"outputs": [],
"source": [
"!head -n 1 test_other_QN.json"
]
},
{
"cell_type": "markdown",
"id": "21761c16",
"metadata": {
"id": "21761c16"
},
"source": [
"Let's run inference with the second ASR model (`Conformer-CTC Small`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e88e4ed3",
"metadata": {
"id": "e88e4ed3"
},
"outputs": [],
"source": [
"!python ./NeMo/examples/asr/transcribe_speech.py \\\n",
"pretrained_name=\"stt_en_conformer_ctc_small\" \\\n",
"dataset_manifest=./test_other_QN.json \\\n",
"output_filename=\"test_other_QN_Conf_small.json\" \\\n",
"batch_size=8 \\\n",
"cuda=0 amp=True \\\n",
"append_pred=True \\\n",
"pred_name_postfix=\"conf_s\""
]
},
{
"cell_type": "markdown",
"id": "MyvMpkcGqhff",
"metadata": {
"id": "MyvMpkcGqhff"
},
"source": [
"Conformer's WER is 8.1%.\n",
"\n",
"Here is the first line of the `test_other_QN_Conf_small.json` manifest file, which contains transcripts from both ASR models:\n",
"- `pred_text_QN` by QuartzNet\n",
"- `pred_text_conf_s` by Conformer-Small"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "551a2b33",
"metadata": {
"id": "551a2b33"
},
"outputs": [],
"source": [
"!head -n 1 test_other_QN_Conf_small.json"
]
},
{
"cell_type": "markdown",
"id": "wYQAP60iretG",
"metadata": {
"id": "wYQAP60iretG"
},
"source": [
"# Launching SDE"
]
},
{
"cell_type": "markdown",
"id": "67a15347",
"metadata": {
"id": "67a15347"
},
"source": [
"Now it's time to start SDE!\n",
"\n",
"SDE is a Python web application that can be run from the command line, either locally or on a remote server (a cloud instance).\n",
"\n",
"In this tutorial, we use Google Colab, so we need to get a proxy link to the instance.\n",
"\n",
"When running locally, just run the command `python3 ./NeMo/tools/speech_data_explorer/data_explorer.py test_other_QN_Conf_small.json --port=8050 -nc pred_text_QN pred_text_conf_s` and open the link it provides in your browser."
]
},
{
"cell_type": "markdown",
"id": "hABptmgK1amL",
"metadata": {
"id": "hABptmgK1amL"
},
"source": [
"Now let's start SDE from the following cell. The `-nc` parameter provides the names of the ASR transcript fields for Comparison mode."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "Fmh2uHfLgTrL",
"metadata": {
"id": "Fmh2uHfLgTrL"
},
"outputs": [],
"source": [
"import threading\n",
"import subprocess\n",
"import time\n",
"from IPython.display import HTML, display\n",
"from google.colab.output import eval_js\n",
"import os\n",
"\n",
"PORT = 8050\n",
"\n",
"def run_dash_app():\n",
" print(f\"Starting Dash app on port {PORT} in a background thread...\")\n",
" # Command to run your Dash application\n",
" command = [\n",
" \"python3\",\n",
" \"./NeMo/tools/speech_data_explorer/data_explorer.py\",\n",
" \"test_other_QN_Conf_small.json\",\n",
" f\"--port={PORT}\",\n",
" \"-nc\",\n",
" \"pred_text_QN\",\n",
" \"pred_text_conf_s\"\n",
" ]\n",
" try:\n",
" # Start the subprocess and redirect output to /dev/null\n",
" # This prevents the notebook from being flooded with server logs\n",
" process = subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)\n",
" print(\"Dash app process started.\")\n",
" # Keep the thread alive as long as the process is running\n",
" process.wait()\n",
" except Exception as e:\n",
" print(f\"Error starting Dash app: {e}\")\n",
"\n",
"# Start the Dash app in a separate thread\n",
"# Set daemon=True so the thread automatically exits when the main program exits\n",
"dash_thread = threading.Thread(target=run_dash_app, daemon=True)\n",
"dash_thread.start()\n",
"\n",
"# Give the server a moment to start up\n",
"time.sleep(5)\n",
"\n",
"# Get the Colab proxy URL\n",
"proxy_url = eval_js(f\"google.colab.kernel.proxyPort({PORT})\")\n",
"\n",
"# Display link\n",
"print(f\"\\nDash app should be accessible at:\")\n",
"display(HTML(f'Click to open Dash app'))\n",
"print(\"\\nIf the link doesn't work, wait a few more seconds and try refreshing the page, or re-run this cell.\")\n",
"print(\"To stop the server, you may need to interrupt the Colab runtime (Runtime -> Interrupt execution).\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e69fc876",
"metadata": {
"id": "e69fc876"
},
"outputs": [],
"source": [
"# !python3 ./NeMo/tools/speech_data_explorer/data_explorer.py test_other_QN_Conf_small.json --port=8050 \\\n",
"# -nc pred_text_QN pred_text_conf_s"
]
},
{
"cell_type": "markdown",
"id": "710ea9c4",
"metadata": {
"id": "710ea9c4"
},
"source": [
"Click the proxy link displayed above to open the SDE application in another browser tab."
]
},
{
"cell_type": "markdown",
"id": "rLNS2kFhr-GL",
"metadata": {
"id": "rLNS2kFhr-GL"
},
"source": [
"# Quick guide to the SDE interface"
]
},
{
"cell_type": "markdown",
"id": "AI1Wkahb2w2-",
"metadata": {
"id": "AI1Wkahb2w2-"
},
"source": [
"SDE provides general dataset statistics (number of utterances, hours, character set size, histograms, etc.) on the `Statistics` page. The `Samples` page supports interactive exploration and analysis of the dataset. All of these functions are described in detail in the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tools/speech_data_explorer.html)."
]
},
{
"cell_type": "markdown",
"id": "IsNRQkg9CQ6n",
"metadata": {
"id": "IsNRQkg9CQ6n"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "67693388",
"metadata": {
"id": "67693388"
},
"source": [
"In this tutorial, we are interested in the `Comparison tool` page.\n",
"\n",
"The `Comparison tool` page has two main modes (levels) of comparison:\n",
"- word level (metrics for individual vocabulary words are compared)\n",
"- utterance level (metrics for individual dataset utterances are compared)\n",
"\n",
"In each mode, the dataset is visualized as an interactive scatterplot. Each marker represents a data unit (either a word or an utterance). The x-coordinate encodes a selected metric for one model, and the y-coordinate does the same for the other model.\n",
"\n",
"By default, word-level comparison is selected on the `Comparison tool` page. In the second (2) and third (3) boxes, you can choose what is shown on the axes of the scatterplot (that is, the metrics for the first and second models). In our example, it is word-level accuracy for QuartzNet and Conformer-Small: `accuracy_model_pred_text_{model_name}`."
]
},
{
"cell_type": "markdown",
"id": "dUXwvaU8ASxv",
"metadata": {
"id": "dUXwvaU8ASxv"
},
"source": [
"Depending on the comparison level, these fields provide the following options:\n",
"\n",
"\n",
"* word accuracy (the ratio of correctly recognized occurrences of a word to the total number of occurrences of that word in the entire dataset)\n",
"* utterance WER\n",
"* utterance CER"
]
},
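{
"cell_type": "markdown",
"id": "sdeWordAccMd1",
"metadata": {
"id": "sdeWordAccMd1"
},
"source": [
"To make the word accuracy definition concrete, here is a minimal sketch that computes it for a handful of (reference, hypothesis) pairs. It aligns words with `difflib.SequenceMatcher` from the Python standard library, which is an assumption made for illustration: SDE computes this metric internally and may align words differently. The function name `word_accuracy` is hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sdeWordAccPy1",
"metadata": {
"id": "sdeWordAccPy1"
},
"outputs": [],
"source": [
"from collections import Counter\n",
"from difflib import SequenceMatcher\n",
"\n",
"def word_accuracy(pairs):\n",
"    # pairs: iterable of (reference, hypothesis) transcript strings\n",
"    total = Counter()    # occurrences of each word in the references\n",
"    correct = Counter()  # occurrences aligned to an identical hypothesis word\n",
"    for reference, hypothesis in pairs:\n",
"        ref_words = reference.split()\n",
"        hyp_words = hypothesis.split()\n",
"        total.update(ref_words)\n",
"        matcher = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)\n",
"        for block in matcher.get_matching_blocks():\n",
"            correct.update(ref_words[block.a:block.a + block.size])\n",
"    # Accuracy in percent, per reference word\n",
"    return {word: 100.0 * correct[word] / total[word] for word in total}\n",
"\n",
"example_pairs = [\n",
"    (\"the school of the wilderness\", \"the school of the wearerness\"),\n",
"    (\"the school of the wilderness\", \"the score of the word in its\"),\n",
"]\n",
"print(word_accuracy(example_pairs))"
]
},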
{
"cell_type": "markdown",
"id": "xHfBizBuv5Pg",
"metadata": {
"id": "xHfBizBuv5Pg"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "WhRiRrVCDUVO",
"metadata": {
"id": "WhRiRrVCDUVO"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "1c0c797c",
"metadata": {
"id": "1c0c797c"
},
"source": [
"By default, the scatterplot displays all data units (either words or utterances). To let users select more interesting subsets, SDE supports flexible filtering queries, as in the following example (for the word level):\n"
]
},
{
"cell_type": "markdown",
"id": "BtabXT7PEGe-",
"metadata": {
"id": "BtabXT7PEGe-"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "1f8cadb0",
"metadata": {
"id": "1f8cadb0"
},
"source": [
"You can type the filtering queries either in a specific column header, or in a custom filter expression textbox (putting lowercased column names in curly brackets).\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "9COXr5cXsp8X",
"metadata": {
"id": "9COXr5cXsp8X"
},
"source": [
"Below is an example of a more complex query that displays utterances for which the two ASR models yield different WERs (at the utterance level):"
]
},
{
"cell_type": "markdown",
"id": "5OQDmR00sSub",
"metadata": {
"id": "5OQDmR00sSub"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "f7f64b2d",
"metadata": {
"id": "f7f64b2d"
},
"source": [
"The scatterplot is fully interactive: you can zoom, pan, and view extended information about a data point by hovering over it."
]
},
{
"cell_type": "markdown",
"id": "oalX8opDEdMX",
"metadata": {
"id": "oalX8opDEdMX"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "7f557223",
"metadata": {
"id": "7f557223"
},
"source": [
"Let's examine the utterance-level mode.\n",
"\n",
"You can choose either WER or CER as a metric:"
]
},
{
"cell_type": "markdown",
"id": "WdYYdwMdFUHo",
"metadata": {
"id": "WdYYdwMdFUHo"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "ee18d226",
"metadata": {
"id": "ee18d226"
},
"source": [
"It is easy to add complex filtering expressions:"
]
},
{
"cell_type": "markdown",
"id": "I3JT9-afEgAY",
"metadata": {
"id": "I3JT9-afEgAY"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "3aefad51",
"metadata": {
"id": "3aefad51"
},
"source": [
"The scatterplot is interactive. Clicking on any data point automatically selects the corresponding row in the dataset table."
]
},
{
"cell_type": "markdown",
"id": "64rWT4JeFy93",
"metadata": {
"id": "64rWT4JeFy93"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "e3441c51",
"metadata": {
"id": "e3441c51"
},
"source": [
"After selecting a data point, you can listen to the utterance by clicking on the player icon.\n",
"\n",
"You can also analyze the audio signal in the time and frequency domains using interactive plots of the signal and its spectrogram.\n"
]
},
{
"cell_type": "markdown",
"id": "a41728b3",
"metadata": {
"id": "a41728b3"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "MiycGa3NGQQT",
"metadata": {
"id": "MiycGa3NGQQT"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "EVn6Bu8fsEWu",
"metadata": {
"id": "EVn6Bu8fsEWu"
},
"source": [
"# Interesting examples and use cases"
]
},
{
"cell_type": "markdown",
"id": "J8ko_-Q1tX6v",
"metadata": {
"id": "J8ko_-Q1tX6v"
},
"source": [
"Using the `{wer_model_1} != {wer_model_2}` query, you can filter out all utterances for which both models have the same WER. This is very convenient when the models you are comparing are similar.\n",
"\n",
"In the following example, you can see how QuartzNet and Conformer-Small transcribe the same utterance:\n",
"\n",
"* reference transcript is `the school of the wilderness`\n",
"* QuartzNet's transcript is `the school of the wearerness` (an error in the last word)\n",
"* Conformer-Small's transcript is `the score of the word in its` (almost completely wrong prediction)\n",
"\n"
]
},
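{
"cell_type": "markdown",
"id": "sdeWerExampleMd1",
"metadata": {
"id": "sdeWerExampleMd1"
},
"source": [
"As a worked example, the sketch below computes the utterance WER of both transcripts from the example above with a plain word-level edit distance and then evaluates the same inequality as the filtering query. The function `word_error_rate` is a hypothetical helper written for illustration and assumes a non-empty reference; SDE computes WER itself, so this only shows where the numbers come from."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sdeWerExamplePy1",
"metadata": {
"id": "sdeWerExamplePy1"
},
"outputs": [],
"source": [
"def word_error_rate(reference, hypothesis):\n",
"    # Word-level Levenshtein distance divided by the reference length, in percent\n",
"    ref = reference.split()\n",
"    hyp = hypothesis.split()\n",
"    # previous[j] = edit distance between the processed reference prefix and hyp[:j]\n",
"    previous = list(range(len(hyp) + 1))\n",
"    for i, ref_word in enumerate(ref, start=1):\n",
"        current = [i]\n",
"        for j, hyp_word in enumerate(hyp, start=1):\n",
"            substitution = previous[j - 1] + (ref_word != hyp_word)\n",
"            deletion = previous[j] + 1\n",
"            insertion = current[j - 1] + 1\n",
"            current.append(min(substitution, deletion, insertion))\n",
"        previous = current\n",
"    return 100.0 * previous[-1] / len(ref)\n",
"\n",
"reference = \"the school of the wilderness\"\n",
"wer_qn = word_error_rate(reference, \"the school of the wearerness\")\n",
"wer_conf_s = word_error_rate(reference, \"the score of the word in its\")\n",
"\n",
"print(wer_qn, wer_conf_s)    # 20.0 80.0\n",
"print(wer_qn != wer_conf_s)  # True: this utterance passes the filter"
]
},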
{
"cell_type": "markdown",
"id": "U1qKT9tEWk0j",
"metadata": {
"id": "U1qKT9tEWk0j"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "MK7B7xniCPky",
"metadata": {
"id": "MK7B7xniCPky"
},
"source": [
"General-purpose ASR models may make mistakes on proper nouns and narrow domain-specific terms, since these are rarely found in training datasets."
]
},
{
"cell_type": "markdown",
"id": "sAP9F-bHtpVi",
"metadata": {
"id": "sAP9F-bHtpVi"
},
"source": [
"1) Name: `Shiloh`"
]
},
{
"cell_type": "markdown",
"id": "e6LCtKS0s9LE",
"metadata": {
"id": "e6LCtKS0s9LE"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "vXf2v13ys84x",
"metadata": {
"id": "vXf2v13ys84x"
},
"source": [
"2) Domain-specific term: `rheumatic`"
]
},
{
"cell_type": "markdown",
"id": "BI7sIWh7o8zj",
"metadata": {
"id": "BI7sIWh7o8zj"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "1D7a1_p5svWu",
"metadata": {
"id": "1D7a1_p5svWu"
},
"source": [
"In word-level mode, many words are likely to receive the same accuracy and therefore overlap on the scatterplot, so the graph displays only one point at a given location. To resolve this, there is an option to spread such points out within a given radius:"
]
},
{
"cell_type": "markdown",
"id": "VUL1xTrXpHEL",
"metadata": {
"id": "VUL1xTrXpHEL"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "2wOPEk4ktKrb",
"metadata": {
"id": "2wOPEk4ktKrb"
},
"source": [
"This part of the graph shows an enlarged view around the coordinates (0, 100), that is, words correctly recognized by Conformer-Small but not by QuartzNet."
]
},
{
"cell_type": "markdown",
"id": "J1jw_1c5pMRc",
"metadata": {
"id": "J1jw_1c5pMRc"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "IjXTQJZuC3Z-",
"metadata": {
"id": "IjXTQJZuC3Z-"
},
"source": [
"Let's write down words recognized only by QuartzNet and words recognized only by Conformer-Small:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "57bPPFyxSPiJ",
"metadata": {
"id": "57bPPFyxSPiJ"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"data = {\n",
" \"QuartzNet_words\": [\"allspice\", \"southwark\", \"dante\", \"panada\", \"favour\", \"vapours\", \"fuchs\", \"battlefields\", \"morrel\", \"postscript\"],\n",
" \"Conformer_words\": [\"gothic\", \"tablespoons\", \"heidelberg\", \"bough\", \"nocturnal\", \"meekin\", \"-\", \"-\", \"-\", \"-\"]\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"df\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "UghOVs2xUG1b",
"metadata": {
"id": "UghOVs2xUG1b"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "5_rKgNoovt1k",
"metadata": {
"id": "5_rKgNoovt1k"
},
"source": [
"Words in the first group have varied origins, with a mix of English, Germanic, and potentially Latin-derived terms.\n",
"The second group also contains words of varied linguistic origins, but has a slightly more European or Old World feel, especially with words like `gothic` and `heidelberg`.\n",
"\n",
"**Word Length**\n",
"\n",
"The two groups have similar average word lengths: about 7.4 characters in the first group and about 7.8 in the second.\n",
"Both contain long words (`battlefields` and `postscript` in the first, `tablespoons` and `heidelberg` in the second) alongside short ones such as `dante` or `bough`.\n",
"\n",
"**Functional vs. Descriptive**\n",
"\n",
"The first group consists mostly of nouns, both common (like `allspice` or `panada`) and proper (like `dante` or `southwark`).\n",
"The second group also contains nouns, but it includes more descriptive words such as the adjectives `gothic` and `nocturnal`."
]
},
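{
"cell_type": "markdown",
"id": "sdeWordLenMd1",
"metadata": {
"id": "sdeWordLenMd1"
},
"source": [
"The word-length comparison can be checked directly. The short sketch below computes the average length of the words in each group listed in the table above; the helper name `average_length` is introduced here only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sdeWordLenPy1",
"metadata": {
"id": "sdeWordLenPy1"
},
"outputs": [],
"source": [
"quartznet_words = [\"allspice\", \"southwark\", \"dante\", \"panada\", \"favour\", \"vapours\", \"fuchs\", \"battlefields\", \"morrel\", \"postscript\"]\n",
"conformer_words = [\"gothic\", \"tablespoons\", \"heidelberg\", \"bough\", \"nocturnal\", \"meekin\"]\n",
"\n",
"def average_length(words):\n",
"    # Mean number of characters per word\n",
"    return sum(len(word) for word in words) / len(words)\n",
"\n",
"print(round(average_length(quartznet_words), 2))  # 7.4\n",
"print(round(average_length(conformer_words), 2))  # 7.83"
]
},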
{
"cell_type": "markdown",
"id": "C0hcCrZI98fF",
"metadata": {
"id": "C0hcCrZI98fF"
},
"source": [
"Next, we will look at several interesting examples that were easily discovered using this tool."
]
},
{
"cell_type": "markdown",
"id": "FV8rLFfr9KVv",
"metadata": {
"id": "FV8rLFfr9KVv"
},
"source": [
"# Examples with audio"
]
},
{
"cell_type": "markdown",
"id": "W35kfqsYkVlf",
"metadata": {
"id": "W35kfqsYkVlf"
},
"source": [
"Note that the examples below are taken from the LibriSpeech dev-clean set, while the rest of the tutorial is based on LibriSpeech test-other."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9Q1aCuBC8QZm",
"metadata": {
"id": "9Q1aCuBC8QZm"
},
"outputs": [],
"source": [
"#This cell is made so that you can quickly listen to those utterances.\n",
"from IPython.display import Audio, display\n",
"!python3 ./NeMo/scripts/dataset_processing/get_librispeech_data.py --data_sets dev_clean --data_root ."
]
},
{
"cell_type": "markdown",
"id": "BjgoarWOrntG",
"metadata": {
"id": "BjgoarWOrntG"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "QogvdaUzrq2L",
"metadata": {
"id": "QogvdaUzrq2L"
},
"source": [
"Above is an example of how Conformer-Small, favoring more common phrases, fails to recognize the word `southwark` (Southwark is a district of Central London). It illustrates how Conformer-Small can fail on proper nouns and names in general."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "iA-wfLFE-Lke",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 75
},
"id": "iA-wfLFE-Lke",
"outputId": "f4024048-91fb-4043-97d3-bb7d702a3ef1"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sound_file = \"/content/LibriSpeech/dev-clean-processed/5895-34629-0012.wav\"\n",
"display(Audio(sound_file, autoplay=True))"
]
},
{
"cell_type": "markdown",
"id": "jyX2PVgovdjj",
"metadata": {
"id": "jyX2PVgovdjj"
},
"source": [
"In this case, the speaker does not really pause between `over` and `all`, and QuartzNet transcribes it as a single word."
]
},
{
"cell_type": "markdown",
"id": "z3jISeCTrRVI",
"metadata": {
"id": "z3jISeCTrRVI"
},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5aYtqrTG8Xpa",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 75
},
"id": "5aYtqrTG8Xpa",
"outputId": "fd73db6e-0c95-4796-ef7c-8feb7cf21a9b"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sound_file = \"/content/LibriSpeech/dev-clean-processed/652-129742-0008.wav\"\n",
"display(Audio(sound_file, autoplay=True))"
]
},
{
"cell_type": "markdown",
"id": "jz9qolfuvdFj",
"metadata": {
"id": "jz9qolfuvdFj"
},
"source": [
"This is an example of an error in the dataset.\n",
"The audio is challenging (the quality is not very good, and the speaker is singing this phrase), but the reference transcript should clearly be `grub pile grub pile`. The extra space in the word `pile` was likely introduced by replacing a hyphen character with a space in the original dataset:"
]
},
{
"cell_type": "markdown",
"id": "FrjjbUv0A3g9",
"metadata": {
"id": "FrjjbUv0A3g9"
},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sdKQ2hvB7bIa",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 75
},
"id": "sdKQ2hvB7bIa",
"outputId": "afdbff46-d50d-4465-f8da-b46547993ad1"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sound_file = \"/content/LibriSpeech/dev-clean-processed/6313-76958-0023.wav\"\n",
"display(Audio(sound_file, autoplay=True))"
]
},
{
"cell_type": "markdown",
"id": "uXbTGzVCvd8K",
"metadata": {
"id": "uXbTGzVCvd8K"
},
"source": [
"Here is another example. The models are not necessarily wrong here: when listening to the audio, you might actually think that the speaker is saying “guessed”.\n",
"\n",
"Therefore, if both models make the same error, it is a good reason to check the dataset."
]
},
{
"cell_type": "markdown",
"id": "D4xWxRk8BLtb",
"metadata": {
"id": "D4xWxRk8BLtb"
},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "MsOJGMOn8ERg",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 75
},
"id": "MsOJGMOn8ERg",
"outputId": "af69e3ba-7989-45c9-f529-b8a883d3664b"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sound_file = \"/content/LibriSpeech/dev-clean-processed/2428-83705-0008.wav\"\n",
"display(Audio(sound_file, autoplay=True))"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}