NeMo ASR+VAD Inference
This example provides the ASR+VAD inference pipeline, with the option to perform only ASR or VAD alone.
Input
There are two types of input:
- A manifest passed to manifest_filepath
- A directory containing audio files passed to audio_dir, together with audio_type (defaults to wav)
The input manifest must be a JSON manifest file in which each line is a separate JSON object (equivalent to a Python dictionary). The fields ["audio_filepath", "offset", "duration", "text"] are required. An example manifest file:
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "a b c d e"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "f g h i j"}
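The manifest format above can be checked before running inference. The following is a minimal sketch, not part of the NeMo script: the helper names parse_manifest and dump_manifest are hypothetical, and the only assumption taken from the text is that every line is one JSON object containing the four required fields.

```python
import json

# Required fields as listed in this README.
REQUIRED_FIELDS = ("audio_filepath", "offset", "duration", "text")


def parse_manifest(text):
    """Parse manifest text (one JSON object per line) and check required fields.

    Hypothetical helper for illustration; it is not part of NeMo.
    """
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        entries.append(entry)
    return entries


def dump_manifest(entries):
    """Serialize entries back to manifest text, one JSON object per line."""
    return "".join(json.dumps(entry) + "\n" for entry in entries)


# Round-trip the two example entries from this README.
example_text = dump_manifest(
    [
        {"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "a b c d e"},
        {"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "f g h i j"},
    ]
)
entries = parse_manifest(example_text)
```

To check a file on disk, read its contents and pass the resulting string to parse_manifest; a ValueError names the first line that lacks a required field.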
Output
The output is a folder storing the VAD predictions and/or a manifest containing the audio transcriptions, depending on which models are enabled. Some temporary data is also stored.
Usage
To run the code with ASR+VAD default settings:
python speech_to_text_with_vad.py \
manifest_filepath=/PATH/TO/MANIFEST.json \
vad_model=vad_multilingual_marblenet \
asr_model=stt_en_conformer_ctc_large \
vad_config=../conf/vad/vad_inference_postprocess.yaml
To use only ASR and disable VAD, set vad_model=None and use_rttm=False.
To use only VAD, set asr_model=None and specify both vad_model and vad_config.
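The three modes (ASR+VAD, ASR only, VAD only) differ only in the key=value overrides passed to the script. The sketch below assembles those overrides in Python; build_command is a hypothetical helper, it uses only the options named in this README, and requiring vad_config whenever vad_model is set is an assumption based on the commands shown here.

```python
def build_command(manifest_filepath, asr_model=None, vad_model=None, vad_config=None):
    """Assemble the argument list for speech_to_text_with_vad.py.

    Hypothetical helper for illustration; it is not part of NeMo.
    """
    if asr_model is None and vad_model is None:
        raise ValueError("specify at least one of asr_model or vad_model")

    overrides = [f"manifest_filepath={manifest_filepath}"]
    if vad_model is None:
        # ASR only: disable VAD and RTTM-based feature masking.
        overrides += [f"asr_model={asr_model}", "vad_model=None", "use_rttm=False"]
    else:
        # Assumption: a VAD config is supplied whenever a VAD model is used.
        if vad_config is None:
            raise ValueError("vad_config is required when vad_model is set")
        asr_value = asr_model if asr_model is not None else "None"
        overrides += [
            f"asr_model={asr_value}",
            f"vad_model={vad_model}",
            f"vad_config={vad_config}",
        ]
    return ["python", "speech_to_text_with_vad.py"] + overrides


asr_only = build_command(
    "/PATH/TO/MANIFEST.json",
    asr_model="stt_en_conformer_ctc_large",
)
vad_only = build_command(
    "/PATH/TO/MANIFEST.json",
    vad_model="vad_multilingual_marblenet",
    vad_config="../conf/vad/vad_inference_postprocess.yaml",
)
print(" ".join(asr_only))
print(" ".join(vad_only))
```

The returned list can be passed to subprocess.run from the directory containing the script.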
To enable profiling, set profiling=True; note that this significantly slows down the program.
To use or disable feature masking, set use_rttm to True or False.
To normalize features before masking, set normalize=pre_norm;
to mask before normalization, set normalize=post_norm.
To use a specific value for feature masking, set feat_mask_val to the desired value.
The default is feat_mask_val=None, in which case -16.530 (the zero log mel-spectrogram value) is used for post_norm and 0 (same as SpecAugment) is used for pre_norm.
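The default rule for the mask value can be sketched as follows. This is an illustration of the behavior described above, not the script's actual code: resolve_feat_mask_val and mask_frames are hypothetical names, and representing features as a list of frames with one speech/non-speech flag per frame is an assumption made to keep the example self-contained.

```python
LOG_MEL_ZERO = -16.530  # zero log mel-spectrogram value quoted in this README


def resolve_feat_mask_val(normalize, feat_mask_val=None):
    """Return the value used to overwrite masked feature frames.

    Hypothetical helper mirroring the defaults described in this README.
    """
    if feat_mask_val is not None:
        return float(feat_mask_val)  # an explicit value always wins
    if normalize == "pre_norm":
        return 0.0  # features are already normalized when masked
    if normalize == "post_norm":
        return LOG_MEL_ZERO  # masking happens before normalization
    raise ValueError(f"unknown normalize mode: {normalize!r}")


def mask_frames(features, is_speech, mask_val):
    """Overwrite every non-speech frame with mask_val.

    features: list of frames, each a list of floats (assumed layout).
    is_speech: one boolean per frame, True where speech was detected.
    """
    if len(features) != len(is_speech):
        raise ValueError("features and is_speech must have the same length")
    return [
        list(frame) if keep else [mask_val] * len(frame)
        for frame, keep in zip(features, is_speech)
    ]


mask_val = resolve_feat_mask_val("post_norm")
masked = mask_frames([[1.0, 2.0], [3.0, 4.0]], [True, False], mask_val)
```

Here the second frame is flagged as non-speech, so both of its values are replaced by the post_norm default.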
See more options in the InferenceConfig class.