# Speech Classification

This directory contains example scripts to train speech classification and voice activity detection (VAD) models. There are two types of VAD models: Frame-VAD and Segment-VAD.

## Frame-VAD

The frame-level VAD model predicts, for each frame of the input audio, whether it contains speech. For example, with the default config file (`../conf/marblenet/marblenet_3x2x64_20ms.yaml`), the model outputs a speech probability for every 20 ms frame.
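
As a quick illustration of what this output looks like downstream, the snippet below thresholds per-frame probabilities into binary speech labels. It is not part of the NeMo scripts; the probabilities and threshold are made-up example values:

```python
# Illustration only (not a NeMo script): turn per-frame speech probabilities
# into binary decisions. The 20 ms frame length matches
# marblenet_3x2x64_20ms.yaml; probabilities and threshold are made up.
frame_len = 0.02  # seconds per frame
probs = [0.03, 0.10, 0.85, 0.92, 0.88, 0.20]  # hypothetical model outputs
threshold = 0.5
labels = [1 if p >= threshold else 0 for p in probs]
print(labels)                                 # [0, 0, 1, 1, 1, 0]
print(round(len(probs) * frame_len, 2), "s")  # 0.12 s of audio covered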

### Training

```sh
python speech_to_frame_label.py \
--config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g. "marblenet_3x2x64_20ms"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath="[<path to val manifest1>,<path to val manifest2>]" \
trainer.devices=-1 \
trainer.accelerator="gpu" \
trainer.strategy="ddp" \
trainer.max_epochs=100
```

The input manifest must be a JSON manifest file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, a 1 s audio file needs 50 frame labels in its manifest entry, like "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported, to keep manifest files small. For example, you can prepare the `label` at a 40 ms frame length, and the model will repeat each label to cover the corresponding 20 ms frames.
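
If your annotations are stored as speech segments with start and end times, a small helper along the following lines can generate the label string. `segments_to_frame_labels` is a hypothetical function written for this example, not a NeMo API:

```python
# Hypothetical helper (not a NeMo function): build the "label" string for a
# manifest entry from a list of (start, end) speech segments, in seconds.
def segments_to_frame_labels(segments, duration, frame_len=0.02):
    num_frames = int(round(duration / frame_len))
    labels = [0] * num_frames
    for start, end in segments:
        first = int(round(start / frame_len))
        last = min(num_frames, int(round(end / frame_len)))
        for i in range(first, last):
            labels[i] = 1
    return " ".join(str(x) for x in labels)

# A 1 s file with speech from 0.24 s to 0.40 s -> 50 labels, frames 12-19 set to 1.
print(segments_to_frame_labels([(0.24, 0.40)], duration=1.0))
```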

### Inference

```sh
python frame_vad_infer.py \
--config-path="../conf/vad" --config-name="frame_vad_infer_postprocess" \
dataset=<Path of manifest file containing evaluation data. Audio files should have unique names>
```

The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
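
Such a manifest is plain JSON lines, so it can be written with a few lines of Python. The output file name below is arbitrary; the entries mirror the format shown above:

```python
# Sketch: write an inference manifest as JSON lines. The file name
# "vad_infer_manifest.json" is just an example.
import json

entries = [
    {"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000},
    {"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000},
]
with open("vad_infer_manifest.json", "w") as fout:
    for entry in entries:
        fout.write(json.dumps(entry) + "\n")
```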

#### Evaluation

If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate: True` in the config YAML (e.g., `../conf/vad/frame_vad_infer_postprocess.yaml`), and also provide the ground truth as label strings:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
```
or as RTTM files:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
```
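
For reference, an RTTM file stores speech regions as `SPEAKER` lines carrying an onset and a duration in seconds. The sketch below shows how such segments could be read; it illustrates the format only and is not the parser NeMo uses:

```python
# Illustration of the RTTM format (NeMo has its own RTTM utilities):
# collect (start, end) speech segments from standard "SPEAKER" lines, i.e.
# "SPEAKER <file-id> <channel> <onset> <duration> ...".
def read_rttm_segments(rttm_path):
    segments = []
    with open(rttm_path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == "SPEAKER":
                onset, dur = float(fields[3]), float(fields[4])
                segments.append((onset, onset + dur))
    return sorted(segments)
```

Combined with a frame-labeling helper like the one sketched in the Training section, these segments yield the ground-truth label string.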

## Segment-VAD

Segment-level VAD predicts a single label for each segment of audio (0.63 s long by default).

### Training

```sh
python speech_to_label.py \
--config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g. "marblenet_3x2x64"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath="[<path to val manifest1>,<path to val manifest2>]" \
trainer.devices=-1 \
trainer.accelerator="gpu" \
trainer.strategy="ddp" \
trainer.max_epochs=100
```

The input manifest must be a JSON manifest file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63, "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63, "label": "1"}
```
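
One common way to produce such entries is to slice a longer annotated recording into fixed-length segments. The sketch below assumes a hypothetical `has_speech(start, end)` annotation lookup; neither it nor `make_segment_entries` is a NeMo function:

```python
# Hypothetical preprocessing: split one annotated recording into 0.63 s
# segment-VAD manifest entries. has_speech() stands in for your own
# annotation lookup.
import json

def make_segment_entries(audio_filepath, total_duration, has_speech, seg_dur=0.63):
    entries, offset = [], 0.0
    while offset + seg_dur <= total_duration:
        entries.append({
            "audio_filepath": audio_filepath,
            "offset": round(offset, 2),
            "duration": seg_dur,
            "label": "1" if has_speech(offset, offset + seg_dur) else "0",
        })
        offset += seg_dur
    return entries

# Toy rule: the first 5 s of the file contain speech.
for e in make_segment_entries("/path/to/audio_file1", 2.0, lambda s, t: s < 5.0):
    print(json.dumps(e))
```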

### Inference

```sh
python vad_infer.py \
--config-path="../conf/vad" \
--config-name="vad_inference_postprocessing" \
dataset=<Path of json file of evaluation data. Audio files should have unique names>
```

The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
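
Because the inference scripts expect unique audio file names, a quick sanity check over the manifest can catch collisions early. The manifest path below is only an example:

```python
# Sketch: check that audio file names are unique across a manifest, as the
# inference scripts above expect.
import json, os
from collections import Counter

with open("vad_infer_manifest.json") as f:  # example path
    names = [os.path.basename(json.loads(line)["audio_filepath"]) for line in f]
dups = [name for name, count in Counter(names).items() if count > 1]
assert not dups, f"duplicate audio file names: {dups}"
```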

## Visualization

To visualize the VAD outputs, you can use the `nemo.collections.asr.parts.utils.vad_utils.plot_sample_from_rttm` function, which takes an audio file and an RTTM file as input and plots the audio waveform together with the VAD labels. Since the VAD inference script writes a JSON manifest (`manifest_vad_out.json` by default), you can create a Jupyter Notebook with the following script and fill in the paths from the output manifest:
```python
from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

plot_sample_from_rttm(
    audio_file="/path/to/audio_file.wav",
    rttm_file="/path/to/rttm_file.rttm",
    offset=0.0,
    duration=1000,
    save_path="vad_pred.png"
)
```
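
To plot every file from the output manifest in one pass, you can loop over its entries. The field names assumed below (`audio_filepath`, `rttm_filepath`) should be checked against the manifest your NeMo version actually produces:

```python
import json
from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

# Assumption: each manifest entry has "audio_filepath" and "rttm_filepath";
# verify the exact field names in your manifest_vad_out.json.
with open("manifest_vad_out.json") as f:
    for i, line in enumerate(f):
        entry = json.loads(line)
        plot_sample_from_rttm(
            audio_file=entry["audio_filepath"],
            rttm_file=entry["rttm_filepath"],
            offset=0.0,
            duration=1000,
            save_path=f"vad_pred_{i}.png",
        )
```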