	Speech Data Explorer
	====================

	Speech Data Explorer (SDE) is a `Dash <https://plotly.com/dash/>`__-based web application for interactive exploration and analysis of speech datasets.

	+--------------------------------------------------------------------------------------------------------------------------+
	| SDE Features:                                                                                                            |
	+--------------------------------------------------------------------------------------------------------------------------+
	| global dataset statistics [alphabet, vocabulary, duration-based histograms, number of hours, number of utterances, etc.] |
	+--------------------------------------------------------------------------------------------------------------------------+
	| navigation across the dataset using an interactive datatable that supports sorting and filtering                         |
	+--------------------------------------------------------------------------------------------------------------------------+
	| inspection of individual utterances [plotting waveforms, spectrograms, custom attributes, and playing audio]             |
	+--------------------------------------------------------------------------------------------------------------------------+
	| error analysis [word error rate (WER), character error rate (CER), word match rate (WMR), word accuracy,                 |
	| highlighted display of differences between the reference text and the ASR model prediction]                              |
	+--------------------------------------------------------------------------------------------------------------------------+
	| estimation of audio signal parameters [peak level, frequency bandwidth]                                                  |
	+--------------------------------------------------------------------------------------------------------------------------+

	Getting Started
	---------------
	SDE can be found in `NeMo/tools/speech_data_explorer <https://github.com/NVIDIA/NeMo/tree/stable/tools/speech_data_explorer>`__.

	Please install the SDE requirements:

	.. code-block:: bash

	    pip install -r tools/speech_data_explorer/requirements.txt

	Then run:

	.. code-block:: bash

	    python tools/speech_data_explorer/data_explorer.py -h

	    usage: data_explorer.py [-h] [--vocab VOCAB] [--port PORT] [--disable-caching-metrics] [--estimate-audio-metrics] [--debug] manifest

	    Speech Data Explorer

	    positional arguments:
	      manifest              path to JSON manifest file

	    optional arguments:
	      -h, --help            show this help message and exit
	      --vocab VOCAB         optional vocabulary to highlight OOV words
	      --port PORT           serving port for establishing connection
	      --disable-caching-metrics
	                            disable caching metrics for errors analysis
	      --estimate-audio-metrics, -a
	                            estimate frequency bandwidth and signal level of audio recordings
	      --debug, -d           enable debug mode


	SDE takes a JSON manifest file as input (the format NeMo uses to describe speech datasets). Each entry should contain the following fields:

	* `audio_filepath` (path to audio file)
	* `duration` (duration of the audio file in seconds)
	* `text` (reference transcript)

	SDE supports any extra custom fields in the JSON manifest. If the field is numeric, then SDE can visualize its distribution across utterances.
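As a minimal sketch, assuming the usual NeMo convention of one JSON object per line, a manifest with the required fields could be written as shown here. The file paths, transcripts, and the extra ``snr_db`` field are hypothetical and only illustrate a numeric custom field; the helper names are not part of SDE.

```python
import json
import os
import tempfile

# Hypothetical entries: paths, transcripts, and `snr_db` values are made up.
entries = [
    {"audio_filepath": "audio/utt_0001.wav", "duration": 3.42, "text": "hello world", "snr_db": 21.5},
    {"audio_filepath": "audio/utt_0002.wav", "duration": 5.10, "text": "speech data explorer", "snr_db": 17.0},
]


def write_manifest(path, entries):
    """Write one JSON object per line (assumed manifest layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Read the manifest back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write the example manifest to a temporary location.
manifest_path = os.path.join(tempfile.gettempdir(), "sde_example_manifest.json")
write_manifest(manifest_path, entries)
```

The resulting file could then be passed to ``data_explorer.py`` as the ``manifest`` argument.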

	If the JSON manifest has a `pred_text` attribute, SDE interprets it as a predicted ASR transcript and computes error analysis metrics.
	The command line option ``--estimate-audio-metrics`` makes SDE estimate the signal's peak level and frequency bandwidth for each utterance.
	By default, SDE caches all computed metrics to a pickle file. Caching can be disabled with the ``--disable-caching-metrics`` option.
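To make the error metrics concrete, here is a rough sketch of how WER and CER are commonly defined: the edit distance between reference and prediction, divided by the reference length. It is an illustration only. SDE's own implementation and text normalization may differ, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(text, pred_text):
    """Word error rate in percent, relative to the reference length."""
    ref = text.split()
    return 100.0 * edit_distance(ref, pred_text.split()) / max(len(ref), 1)


def cer(text, pred_text):
    """Character error rate in percent (spaces count as characters here)."""
    return 100.0 * edit_distance(list(text), list(pred_text)) / max(len(text), 1)


# Hypothetical manifest entry with a reference and a predicted transcript.
entry = {"text": "the cat sat on the mat", "pred_text": "the cat sit on mat"}
print(f"WER: {wer(entry['text'], entry['pred_text']):.2f}%")
print(f"CER: {cer(entry['text'], entry['pred_text']):.2f}%")
```

In this example the prediction has one substituted word and one deleted word out of six reference words, so the WER is 2/6.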

	User Interface
	--------------

	The SDE application has two pages:

	* `Statistics` (to display global statistics and aggregated error metrics)

	.. image:: images/sde_base_stats.png
	    :align: center
	    :width: 800px
	    :alt: SDE Statistics


	* `Samples` (to allow navigation across the entire dataset and exploration of individual utterances)

	.. image:: images/sde_player.png
	    :align: center
	    :width: 800px
	    :alt: SDE Samples


	The Plotly Dash DataTable component provides SDE's core interactive features (navigation, filtering, and sorting).
	SDE has two datatables:

	* Vocabulary (shows all words from the dataset's reference texts on the `Statistics` page)

	.. image:: images/sde_words.png
	    :align: center
	    :width: 800px
	    :alt: Vocabulary


	* Data (shows all of the dataset's utterances on the `Samples` page)

	.. image:: images/sde_utterances.png
	    :align: center
	    :width: 800px
	    :alt: Data


	Every column of the DataTable has the following interactive features:

	* toggling off (by clicking on the `eye` icon in the column's header cell) or on (by clicking on the `Toggle Columns` button below the table)

	.. image:: images/datatable_toggle.png
	    :align: center
	    :width: 800px
	    :alt: Toggling


	* sorting (by clicking on small triangle icons in the column's header cell): unordered (two triangles point up and down), ascending (a triangle points up), descending (a triangle points down)

	.. image:: images/datatable_sort.png
	    :align: center
	    :width: 800px
	    :alt: Sorting


	* filtering (by entering a filter expression in the cell below the column's header cell): SDE supports the ``<``, ``>``, ``<=``, ``>=``, ``=``, ``!=``, and ``contains`` operators; to match a specific substring, enter the substring in quotes as the filter expression

	.. image:: images/datatable_filter.png
	    :align: center
	    :width: 800px
	    :alt: Filtering



	Analysis of Speech Datasets
	---------------------------

	In the simplest use case, SDE helps to explore a speech dataset interactively and get basic statistics.
	If no pre-trained ASR model is available to produce predicted transcripts, a few heuristics can still help spot potential issues in a dataset:

	1. Check the dataset alphabet (it should contain only target characters).
	2. Check the vocabulary for uncommon words (e.g., foreign words, typos). SDE can take an external vocabulary file passed with the ``--vocab`` option. It is then easy to filter out-of-vocabulary (OOV) words in the dataset and sort them by their number of occurrences (count).
	3. Check utterances with a high character rate. A high character rate might indicate that the reference transcript contains more words than are actually spoken in the corresponding audio recording.
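The three checks above can be approximated outside of SDE with a few lines of Python. This is a sketch under stated assumptions: character rate is taken to be characters per second of audio (spaces included), which may not match SDE's exact definition, and the entries, vocabulary, and threshold are hypothetical.

```python
from collections import Counter

# Hypothetical manifest entries and reference vocabulary.
entries = [
    {"text": "hello world", "duration": 2.0},
    {"text": "hello wrold again and again and again", "duration": 1.0},
]
vocab = {"hello", "world", "again", "and"}


def alphabet(entries):
    """Set of all characters used in the reference transcripts."""
    return set("".join(e["text"] for e in entries))


def oov_counts(entries, vocab):
    """Out-of-vocabulary words with their counts, most frequent first."""
    counts = Counter(word for e in entries for word in e["text"].split())
    return {w: c for w, c in counts.most_common() if w not in vocab}


def char_rate(entry):
    """Characters per second of audio (assumed definition)."""
    return len(entry["text"]) / entry["duration"]


# Flag utterances whose character rate exceeds an arbitrary threshold.
MAX_CHAR_RATE = 25.0  # hypothetical value; tune per language and dataset
suspicious = [e for e in entries if char_rate(e) > MAX_CHAR_RATE]

print(sorted(alphabet(entries)))
print(oov_counts(entries, vocab))
print([e["text"] for e in suspicious])
```

Flagged entries are candidates for manual review in the `Samples` page, not confirmed errors.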

	If there is a pre-trained ASR model, then the JSON manifest file can be extended with ASR predicted transcripts:

	.. code-block:: bash

	    python examples/asr/transcribe_speech.py pretrained_name=<ASR_MODEL_NAME> dataset_manifest=<JSON_FILENAME>

	After that, it is worth checking words with zero accuracy.

	.. image:: images/sde_mls_words.png
	    :align: center
	    :width: 800px
	    :alt: MLS Words


	Then look at utterances with a high CER.

	.. image:: images/sde_mls_cer.png
	    :align: center
	    :width: 800px
	    :alt: MLS CER


	Listening to the audio recording helps to validate the corresponding reference transcript.

	.. image:: images/sde_mls_player.png
	    :align: center
	    :width: 800px
	    :alt: MLS Player