---
datasets:
- hypersunflower/ava_speech_data_log_mel_spec
- nccratliri/vad-human-ava-speech
language:
- en
pipeline_tag: voice-activity-detection
---

As part of a pet project, I created this SAD[1] model. It takes a log mel-spectrogram as input and outputs a concatenated array of onsets and offsets.

- Loss: BCEWithLogitsLoss
- Optimizer: Adam
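
The training setup above can be sketched as follows. The tiny stand-in network, batch shapes, and learning rate are illustrative assumptions, not the actual training code (that lives in the repo linked below):

```python
import torch
import torch.nn as nn

# stand-in for the SAD network: 64 spectrogram features -> onset/offset logits
model = nn.Linear(64, 2)
criterion = nn.BCEWithLogitsLoss()  # applies sigmoid internally, so the model outputs raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 64)                    # a batch of flattened spectrogram frames
y = torch.randint(0, 2, (8, 2)).float()   # binary onset/offset targets

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())
```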

Here are the metrics on the test set:

| Metric    | Value                  |
| --------- | ---------------------- |
| Accuracy  | 0.9998331655911613     |
| Hamming   | 0.00016682081819592185 |
| Precision | 0.9327198181417427     |
| Recall    | 0.9306135245038709     |
| F1        | 0.9296357635399213     |
| Loss      | 0.0008604296028513627  |
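
For reference, metrics of this kind can be computed with scikit-learn on binary onset/offset labels. The arrays and the macro averaging below are illustrative assumptions, not the card's actual evaluation code:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, hamming_loss,
                             precision_score, recall_score, f1_score)

# hypothetical binary onset/offset targets and predictions
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])

print("Accuracy :", accuracy_score(y_true.ravel(), y_pred.ravel()))   # fraction of matching labels
print("Hamming  :", hamming_loss(y_true, y_pred))                     # fraction of wrong labels
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1       :", f1_score(y_true, y_pred, average="macro"))
```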
To download the model and the necessary code, use the following snippet:

```python
from huggingface_hub import snapshot_download

snapshot_download("hypersunflower/a_sad_model", local_dir="model/", repo_type="model")
```

To use the model for inference[2]:

```python
# load the scripts
from model.speech_detection import detectSpeech
from model.sadModel import sadModel
from model.logMelSpectrogram import logMelSpectrogram

# load the model
detector = detectSpeech(
    model_path="model/a_sad_model.pth",
    model_class=sadModel(),
    logMelSpectrogram=logMelSpectrogram()
)

# inference
onset, offset = detector.detect("path_to_the_audio")
```
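
The returned onset and offset arrays can be paired into (start, end) speech segments. This helper and the example times are my own illustration, assuming equal-length, time-ordered arrays:

```python
def to_segments(onsets, offsets):
    # pair each onset with its offset, dropping degenerate segments
    return [(on, off) for on, off in zip(onsets, offsets) if off > on]

# made-up example times, in seconds
print(to_segments([0.5, 3.2], [1.8, 4.0]))  # → [(0.5, 1.8), (3.2, 4.0)]
```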
Note: the code uses pydub.AudioSegment to process the audio, which requires ffmpeg. You can install it the following way:

```shell
!apt update &> /dev/null
!apt install ffmpeg -y &> /dev/null
```

This works on Debian/Ubuntu-based Linux; the `!` prefix is for running shell commands from a notebook.

Training code can be found here: https://github.com/ertan-somundzhu/sad-model

[1] short for Speech Activity Detection

[2] Though the model shows good performance on the nccratliri/vad-human-ava-speech dataset (from which I took 25% of the original data), it will most likely fail on real-world noisy data.