thanks to NVIDIA ❤

7934b29 almost 3 years ago

7.78 kB

	Checkpoints
	===========

	There are two main ways to load pretrained checkpoints in NeMo:

	* Using the :code:`restore_from()` method to load a local checkpoint file (``.nemo``), or
	* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC.

	Refer to the following sections for instructions and examples for each.

	Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. For resuming an unfinished
	training experiment, use the Experiment Manager to do so by setting the ``resume_if_exists`` flag to ``True``.

	Loading Local Checkpoints
	-------------------------

	NeMo automatically saves checkpoints of a model that is trained in a ``.nemo`` format. Alternatively, to manually save the model at any
	point, issue :code:`model.save_to(<checkpoint_path>.nemo)`.

	If there is a local ``.nemo`` checkpoint that you'd like to load, use the :code:`restore_from()` method:

	.. code-block:: python

	import nemo.collections.asr as nemo_asr
	model = nemo_asr.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")

	Where the model base class is the ASR model class of the original checkpoint, or the general ``ASRModel`` class.

	NGC Pretrained Checkpoints
	--------------------------

	The ASR collection has checkpoints of several models trained on various datasets for a variety of tasks. These checkpoints are
	obtainable via NGC `NeMo Automatic Speech Recognition collection <https://catalog.ngc.nvidia.com/orgs/nvidia/collections/nemo_asr>`_.
	The model cards on NGC contain more information about each of the checkpoints available.

	The tables below list the ASR models available from NGC. The models can be accessed via the :code:`from_pretrained()` method inside
	the ASR Model class. In general, you can load any of these models with code in the following format:

	.. code-block:: python

	import nemo.collections.asr as nemo_asr
	model = nemo_asr.models.ASRModel.from_pretrained(model_name="<MODEL_NAME>")

	Where the model name is the value under "Model Name" entry in the tables below.

	For example, to load the base English QuartzNet model for speech recognition, run:

	.. code-block:: python

	model = nemo_asr.models.ASRModel.from_pretrained(model_name="QuartzNet15x5Base-En")

	You can also call :code:`from_pretrained()` from the specific model class (such as :code:`EncDecCTCModel`
	for QuartzNet) if you need to access a specific model functionality.

	If you would like to programmatically list the models available for a particular base class, you can use the
	:code:`list_available_models()` method.

	.. code-block:: python

	nemo_asr.models.<MODEL_BASE_CLASS>.list_available_models()

	Transcribing/Inference
	^^^^^^^^^^^^^^^^^^^^^^

	To perform inference and transcribe a sample of speech after loading the model, use the ``transcribe()`` method:

	.. code-block:: python

	model.transcribe(paths2audio_files=[list of audio files], batch_size=BATCH_SIZE, logprobs=False)

	Setting the argument ``logprobs`` to ``True`` returns the log probabilities instead of transcriptions. For more information, see `nemo.collections.asr.modules <./api.html#modules>`__.
	The audio files should be 16KHz mono-channel wav files.

	Inference on long audio
	^^^^^^^^^^^^^^^^^^^^^^

	In some cases the audio is too long for standard inference, especially if you're using a model such as Conformer, where the time and memory costs of the attention layers scale quadratically with the duration.

	There are two main ways of performing inference on long audio files in NeMo:

	The first way is to use buffered inference, where the audio is divided into chunks to run on, and the output is merged afterwards.
	The relevant scripts for this are contained in `this folder <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_chunked_inference>`_.

	The second way, specifically for models with the Conformer encoder, is to convert to local attention, which changes the costs to be linear.
	This can be done even for models trained with full attention, though may result in lower WER in some cases. You can switch to local attention when running the
	`transcribe <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/transcribe_speech.py>`_ or `evaluation <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/transcribe_speech.py>`_
	scripts in the following way:

	.. code-block:: python

	python speech_to_text_eval.py \
	(...other parameters...) \
	++model_change.conformer.self_attention_model="rel_pos_local_attn" \
	++model_change.conformer.att_context_size=[64, 64]

	Alternatively, you can change the attention model after loading a checkpoint:

	.. code-block:: python

	asr_model = ASRModel.from_pretrained('stt_en_conformer_ctc_large')
	asr_model.change_attention_model(
	self_attention_model="rel_pos_local_attn",
	att_context_size=[64, 64]
	)

	Fine-tuning on Different Datasets
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	There are multiple ASR tutorials provided in the :ref:`Tutorials <tutorials>` section. Most of these tutorials explain how to instantiate a pre-trained model, prepare the model for fine-tuning on some dataset (in the same language) as a demonstration.

	Inference Execution Flow Diagram
	--------------------------------

	When preparing your own inference scripts, please follow the execution flow diagram order for correct inference, found at the `examples directory for ASR collection <https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/README.md>`_.

	Automatic Speech Recognition Models
	-----------------------------------

	Below is a list of all the ASR models that are available in NeMo for specific languages, as well as auxiliary language models for certain languages.

	Language Models for ASR
	^^^^^^^^^^^^^^^^^^^^^^^

	.. csv-table::
	:file: data/asrlm_results.csv
	:align: left
	:widths: 30, 30, 40
	:header-rows: 1

	\|

	Speech Recognition (Languages)
	------------------------------

	English
	^^^^^^^
	.. csv-table::
	:file: data/benchmark_en.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	Mandarin
	^^^^^^^^
	.. csv-table::
	:file: data/benchmark_zh.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	German
	^^^^^^
	.. csv-table::
	:file: data/benchmark_de.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	French
	^^^^^^
	.. csv-table::
	:file: data/benchmark_fr.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	Polish
	^^^^^^
	.. csv-table::
	:file: data/benchmark_pl.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	Italian
	^^^^^^^
	.. csv-table::
	:file: data/benchmark_it.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	Russian
	^^^^^^^
	.. csv-table::
	:file: data/benchmark_ru.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	Spanish
	^^^^^^^
	.. csv-table::
	:file: data/benchmark_es.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1


	-----------------------------

	Catalan
	^^^^^^^
	.. csv-table::
	:file: data/benchmark_ca.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	Hindi
	^^^^^^^
	.. csv-table::
	:file: data/benchmark_hi.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	Marathi
	^^^^^^^
	.. csv-table::
	:file: data/benchmark_mr.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1

	-----------------------------

	Kinyarwanda
	^^^^^^^^^^^
	.. csv-table::
	:file: data/benchmark_rw.csv
	:align: left
	:widths: 40, 10, 50
	:header-rows: 1