---
datasets:
- hypersunflower/ava_speech_data_log_mel_spec
- nccratliri/vad-human-ava-speech
language:
- en
pipeline_tag: voice-activity-detection
---

As part of a pet project, I created this SAD[1] model. It takes a log mel-spectrogram as input and outputs a concatenated array of onsets and offsets.

- Loss: BCEWithLogitsLoss
- Optimizer: Adam
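
The training setup above can be sketched as follows. The tiny stand-in network, batch shapes, and learning rate are illustrative assumptions, not the actual training code (that lives in the repo linked below):

```python
import torch
import torch.nn as nn

# stand-in for the SAD network: 64 spectrogram features -> onset/offset logits
model = nn.Linear(64, 2)
criterion = nn.BCEWithLogitsLoss()  # applies sigmoid internally, so the model outputs raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 64)                    # a batch of flattened spectrogram frames
y = torch.randint(0, 2, (8, 2)).float()   # binary onset/offset targets

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())
```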

Here are the metrics on the test set:

| Metric    | Value                  |
| --------- | ---------------------- |
| Accuracy  | 0.9998331655911613     |
| Hamming   | 0.00016682081819592185 |
| Precision | 0.9327198181417427     |
| Recall    | 0.9306135245038709     |
| F1        | 0.9296357635399213     |
| Loss      | 0.0008604296028513627  |
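
For reference, metrics of this kind can be computed with scikit-learn on binary onset/offset labels. The arrays and the macro averaging below are illustrative assumptions, not the card's actual evaluation code:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, hamming_loss,
                             precision_score, recall_score, f1_score)

# hypothetical binary onset/offset targets and predictions
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])

print("Accuracy :", accuracy_score(y_true.ravel(), y_pred.ravel()))   # fraction of matching labels
print("Hamming  :", hamming_loss(y_true, y_pred))                     # fraction of wrong labels
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1       :", f1_score(y_true, y_pred, average="macro"))
```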
To download the model and the necessary code, use the following snippet:

```python
from huggingface_hub import snapshot_download

snapshot_download("hypersunflower/a_sad_model", local_dir="model/", repo_type="model")
```

To use the model for inference[2]:

```python
# load the scripts
from model.speech_detection import detectSpeech
from model.sadModel import sadModel
from model.logMelSpectrogram import logMelSpectrogram

# load the model
detector = detectSpeech(
    model_path="model/a_sad_model.pth",
    model_class=sadModel(),
    logMelSpectrogram=logMelSpectrogram()
)

# inference
onset, offset = detector.detect("path_to_the_audio")
```
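
The returned onset and offset arrays can be paired into (start, end) speech segments. This helper and the example times are my own illustration, assuming equal-length, time-ordered arrays:

```python
def to_segments(onsets, offsets):
    # pair each onset with its offset, dropping degenerate segments
    return [(on, off) for on, off in zip(onsets, offsets) if off > on]

# made-up example times, in seconds
print(to_segments([0.5, 3.2], [1.8, 4.0]))  # → [(0.5, 1.8), (3.2, 4.0)]
```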
Note: the code uses pydub.AudioSegment to process the audio, which requires ffmpeg. You can install it the following way:

```shell
!apt update &> /dev/null
!apt install ffmpeg -y &> /dev/null
```

This works on Debian/Ubuntu-based Linux; the `!` prefix is for running shell commands from a notebook.

Training code can be found here: https://github.com/ertan-somundzhu/sad-model

[1] short for Speech Activity Detection

[2] Though the model shows good performance on the nccratliri/vad-human-ava-speech dataset (from which I took 25% of the original data), it will most likely fail on real-world noisy data.