NeMo Audio Configuration Files
==============================

This section describes the NeMo configuration file setup that is specific to models in the audio collection.
For general information about how to set up and run experiments that is common to all NeMo models (e.g., Experiment Manager and PyTorch Lightning trainer parameters), see the :doc:`../core/core` section.

The model section of the NeMo audio configuration files generally requires information about the dataset(s) being used, parameters for any augmentation being performed, as well as the model architecture specification.

Example configuration files for all of the NeMo audio models can be found in the
`config directory of the examples <https://github.com/NVIDIA/NeMo/tree/main/examples/audio/conf>`_.
.. _audio-configs-nemo-dataset-configuration:

NeMo Dataset Configuration
--------------------------

Training, validation, and test parameters are specified using the ``model.train_ds``, ``model.validation_ds``, and ``model.test_ds`` sections in the configuration file, respectively.
Depending on the task, there may be arguments specifying the sample rate or duration of the loaded audio examples. Some fields can be left out and specified via the command line at runtime.

Refer to the `Dataset Processing Classes <./api.html#Datasets>`__ section of the API for a list of dataset classes and their respective parameters.

Example train, validation, and test datasets can be configured as follows:
.. code-block:: yaml

    model:
      sample_rate: 16000
      skip_nan_grad: false

      train_ds:
        manifest_filepath: ???
        input_key: audio_filepath # key of the input signal path in the manifest
        target_key: target_filepath # key of the target signal path in the manifest
        target_channel_selector: 0 # target signal is the first channel from files in target_key
        audio_duration: 4.0 # in seconds, audio segment duration for training
        random_offset: true # if the file is longer than audio_duration, use random offset to select a subsegment
        min_duration: ${model.train_ds.audio_duration}
        batch_size: 64 # batch size may be increased based on the available memory
        shuffle: true
        num_workers: 8
        pin_memory: true

      validation_ds:
        manifest_filepath: ???
        input_key: audio_filepath # key of the input signal path in the manifest
        target_key: target_filepath # key of the target signal path in the manifest
        target_channel_selector: 0 # target signal is the first channel from files in target_key
        batch_size: 64 # batch size may be increased based on the available memory
        shuffle: false
        num_workers: 4
        pin_memory: true

      test_ds:
        manifest_filepath: ???
        input_key: audio_filepath # key of the input signal path in the manifest
        target_key: target_filepath # key of the target signal path in the manifest
        target_channel_selector: 0 # target signal is the first channel from files in target_key
        batch_size: 1 # batch size may be increased based on the available memory
        shuffle: false
        num_workers: 4
        pin_memory: true
More information about online augmentation can be found in the `masking example configuration <https://github.com/NVIDIA/NeMo/blob/main/examples/audio/conf/masking.yaml>`_.
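The file referenced by ``manifest_filepath`` is a JSON-lines manifest, with one JSON object per example. As a minimal sketch (the key names mirror ``input_key`` and ``target_key`` from the configuration above; the audio paths and durations are placeholders for illustration), such a manifest can be written with the standard library:

```python
import json

# One JSON object per line; key names match input_key/target_key above.
# Paths and durations here are placeholders, not real files.
entries = [
    {"audio_filepath": "audio/noisy_0001.wav", "target_filepath": "audio/clean_0001.wav", "duration": 5.2},
    {"audio_filepath": "audio/noisy_0002.wav", "target_filepath": "audio/clean_0002.wav", "duration": 3.7},
]

with open("train_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```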
.. _audio-configs-lhotse-dataset-configuration:

Lhotse Dataset Configuration
----------------------------

Lhotse CutSet
~~~~~~~~~~~~~

An example train dataset in Lhotse CutSet format can be configured as follows:
.. code-block:: yaml

    train_ds:
      use_lhotse: true # enable Lhotse data loader
      cuts_path: ??? # path to Lhotse cuts manifest with input signals and the corresponding target signals (target signals should be in the custom "target_recording" field)
      truncate_duration: 4.00 # truncate audio to 4 seconds
      truncate_offset_type: random # if the file is longer than truncate_duration, use random offset to select a subsegment
      batch_size: 64 # batch size may be increased based on the available memory
      shuffle: true
      num_workers: 8
      pin_memory: true
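The ``truncate_duration`` and ``truncate_offset_type: random`` options above amount to selecting a random fixed-length window from a longer recording. A rough sketch of that selection logic in plain Python (sample counts stand in for actual audio loading; this illustrates the idea, not the Lhotse implementation):

```python
import random

def random_subsegment(num_samples, segment_samples, rng):
    """Pick a random window [start, end) of segment_samples from a longer signal.

    If the signal is shorter than the requested segment, keep the whole signal.
    """
    if num_samples <= segment_samples:
        return 0, num_samples
    start = rng.randrange(num_samples - segment_samples + 1)
    return start, start + segment_samples

# Example: 4-second segments at 16 kHz drawn from a 10-second file.
rng = random.Random(0)
start, end = random_subsegment(10 * 16000, 4 * 16000, rng)
```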
Lhotse CutSet with Online Augmentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An example train dataset in Lhotse CutSet format using online augmentation with room impulse response (RIR) convolution and additive noise can be configured as follows:
.. code-block:: yaml

    train_ds:
      use_lhotse: true # enable Lhotse data loader
      cuts_path: ??? # path to Lhotse cuts manifest with speech signals for augmentation (including custom "target_recording" field with the same signals)
      truncate_duration: 4.00 # truncate audio to 4 seconds
      truncate_offset_type: random # if the file is longer than truncate_duration, use random offset to select a subsegment
      batch_size: 64 # batch size may be increased based on the available memory
      shuffle: true
      num_workers: 8
      pin_memory: true
      rir_enabled: true # enable room impulse response augmentation
      rir_path: ??? # path to Lhotse recordings manifest with room impulse response signals
      noise_path: ??? # path to Lhotse cuts manifest with noise signals
A configuration file with Lhotse online augmentation can be found in the `online augmentation example configuration <https://github.com/NVIDIA/NeMo/blob/main/examples/audio/conf/masking_with_online_augmentation.yaml>`_.
More information about the online augmentation can be found in the `tutorial notebook <https://github.com/NVIDIA/NeMo/blob/main/tutorials/audio/speech_enhancement/Speech_Enhancement_with_Online_Augmentation.ipynb>`_.
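As a rough illustration of what additive-noise augmentation does, the snippet below mixes a noise signal into a speech signal at a chosen signal-to-noise ratio. This is a generic sketch of the underlying idea, not the Lhotse or NeMo implementation:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Gain that brings the noise to the desired power relative to speech.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy signals: speech power 0.25, noise power 0.01, mixed at 10 dB SNR.
noisy = mix_at_snr([0.5, -0.5, 0.5, -0.5], [0.1, 0.1, -0.1, -0.1], snr_db=10.0)
```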
Lhotse Shar
~~~~~~~~~~~

An example train dataset in Lhotse shar format can be configured as follows:
.. code-block:: yaml

    train_ds:
      shar_path: ???
      use_lhotse: true
      truncate_duration: 4.00 # truncate audio to 4 seconds
      truncate_offset_type: random
      batch_size: 8 # batch size may be increased based on the available memory
      shuffle: true
      num_workers: 8
      pin_memory: true
A configuration file with Lhotse shar format can be found in the `SSL pretraining example configuration <https://github.com/NVIDIA/NeMo/blob/main/examples/audio/conf/flow_matching_generative_ssl_pretraining.yaml>`_.
Model Architecture Configuration
--------------------------------

Each configuration file should describe the model architecture being used for the experiment.
An example of a simple predictive model configuration is shown below:
.. code-block:: yaml

    model:
      type: predictive
      sample_rate: 16000
      skip_nan_grad: false
      num_outputs: 1
      normalize_input: true # normalize the input signal to 0dBFS

      train_ds:
        manifest_filepath: ???
        input_key: noisy_filepath
        target_key: clean_filepath
        audio_duration: 2.00 # trim audio to 2 seconds
        random_offset: true
        normalization_signal: input_signal
        batch_size: 8 # batch size may be increased based on the available memory
        shuffle: true
        num_workers: 8
        pin_memory: true

      validation_ds:
        manifest_filepath: ???
        input_key: noisy_filepath
        target_key: clean_filepath
        batch_size: 8
        shuffle: false
        num_workers: 4
        pin_memory: true

      encoder:
        _target_: nemo.collections.audio.modules.transforms.AudioToSpectrogram
        fft_length: 510 # number of subbands in the STFT = fft_length // 2 + 1 = 256
        hop_length: 128
        magnitude_power: 0.5
        scale: 0.33

      decoder:
        _target_: nemo.collections.audio.modules.transforms.SpectrogramToAudio
        fft_length: ${model.encoder.fft_length}
        hop_length: ${model.encoder.hop_length}
        magnitude_power: ${model.encoder.magnitude_power}
        scale: ${model.encoder.scale}

      estimator:
        _target_: nemo.collections.audio.parts.submodules.ncsnpp.SpectrogramNoiseConditionalScoreNetworkPlusPlus
        in_channels: 1 # single-channel noisy input
        out_channels: 1 # single-channel estimate
        num_res_blocks: 3 # increased number of res blocks
        pad_time_to: 64 # pad to 64 frames for the time dimension
        pad_dimension_to: 0 # no padding in the frequency dimension

      loss:
        _target_: nemo.collections.audio.losses.MSELoss # computed in the time domain

      metrics:
        val:
          sisdr: # output SI-SDR
            _target_: torchmetrics.audio.ScaleInvariantSignalDistortionRatio

      optim:
        name: adam
        lr: 1e-4
        # optimizer arguments
        betas: [0.9, 0.999]
        weight_decay: 0.0
The complete configuration file can be found in the `example configuration <https://github.com/NVIDIA/NeMo/blob/main/examples/audio/conf/predictive.yaml>`_.
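The relationship noted in the ``fft_length`` comment above can be checked directly: a one-sided (real-input) STFT with an ``fft_length`` of 510 produces ``fft_length // 2 + 1`` frequency subbands, and ``hop_length`` determines how many frames cover one second of audio at the configured 16 kHz sample rate:

```python
fft_length = 510
hop_length = 128
sample_rate = 16000

# Number of frequency subbands of a one-sided (real-input) STFT.
num_subbands = fft_length // 2 + 1

# Approximate number of STFT frames per second of audio.
frames_per_second = sample_rate // hop_length

print(num_subbands, frames_per_second)  # prints: 256 125
```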
Fine-tuning Configuration
-------------------------

All scripts support easy fine-tuning by partially or fully loading the pretrained weights from a checkpoint into the currently instantiated model.
Note that the currently instantiated model should have parameters that match the pre-trained checkpoint so that the weights can be loaded properly.

Pre-trained weights can be provided by:

* Providing a path to a NeMo model (via ``init_from_nemo_model``)
* Providing the name of a pretrained NeMo model, which will be downloaded from the cloud (via ``init_from_pretrained_model``)
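The matching requirement can be illustrated with a small sketch: when weights are loaded partially, only checkpoint entries whose name and shape match the current model can be carried over. Plain dictionaries of shapes stand in for real state dicts here; this illustrates the idea, not the NeMo loading code:

```python
# Shapes stand in for real tensors; this only illustrates the matching rule.
checkpoint = {"encoder.weight": (256, 128), "decoder.weight": (128, 256), "head.weight": (10, 128)}
model = {"encoder.weight": (256, 128), "decoder.weight": (128, 256), "head.weight": (4, 128)}

# Keep only entries whose key exists in the model with the same shape;
# head.weight is skipped because its shape differs.
loadable = {k: v for k, v in checkpoint.items() if model.get(k) == v}
```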
Training from scratch
~~~~~~~~~~~~~~~~~~~~~

A model can be trained from scratch using the following command:

.. code-block:: shell

    python examples/audio/audio_to_audio_train.py \
        --config-path=<path to dir of configs> \
        --config-name=<name of config without .yaml> \
        model.train_ds.manifest_filepath="<path to manifest file>" \
        model.validation_ds.manifest_filepath="<path to manifest file>" \
        trainer.devices=1 \
        trainer.accelerator='gpu' \
        trainer.max_epochs=50
Fine-tuning via a NeMo model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A model can be fine-tuned from an existing NeMo model using the following command:

.. code-block:: shell
    :emphasize-lines: 9

    python examples/audio/audio_to_audio_train.py \
        --config-path=<path to dir of configs> \
        --config-name=<name of config without .yaml> \
        model.train_ds.manifest_filepath="<path to manifest file>" \
        model.validation_ds.manifest_filepath="<path to manifest file>" \
        trainer.devices=1 \
        trainer.accelerator='gpu' \
        trainer.max_epochs=50 \
        +init_from_nemo_model="<path to .nemo model file>"
Fine-tuning via a NeMo pretrained model name
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A model can be fine-tuned from a pre-trained NeMo model using the following command:

.. code-block:: shell
    :emphasize-lines: 9

    python examples/audio/audio_to_audio_train.py \
        --config-path=<path to dir of configs> \
        --config-name=<name of config without .yaml> \
        model.train_ds.manifest_filepath="<path to manifest file>" \
        model.validation_ds.manifest_filepath="<path to manifest file>" \
        trainer.devices=1 \
        trainer.accelerator='gpu' \
        trainer.max_epochs=50 \
        +init_from_pretrained_model="<name of pretrained checkpoint>"