---
datasets:
- hypersunflower/ava_speech_data_log_mel_spec
- nccratliri/vad-human-ava-speech
language:
- en
pipeline_tag: voice-activity-detection
---

As part of a pet project I created this SAD[1] model. It takes a log mel-spectrogram as input and outputs a concatenated array of per-frame onset and offset predictions.

- Loss: BCEWithLogitsLoss
- Optimizer: Adam

Metrics on the test set:

| Metric    | Value                  |
| --------- | ---------------------- |
| Accuracy  | 0.9998331655911613     |
| Hamming   | 0.00016682081819592185 |
| Precision | 0.9327198181417427     |
| Recall    | 0.9306135245038709     |
| F1        | 0.9296357635399213     |
| Loss      | 0.0008604296028513627  |

To download the model and the necessary code, use the following snippet:

```
from huggingface_hub import snapshot_download

snapshot_download("hypersunflower/a_sad_model", local_dir="model/", repo_type="model")
```

To use the model for inference[2]:

```
# load the scripts
from model.speech_detection import detectSpeech
from model.sadModel import sadModel
from model.logMelSpectrogram import logMelSpectrogram

# load the models
detector = detectSpeech(
    model_path="model/a_sad_model.pth",
    model_class=sadModel(),
    logMelSpectrogram=logMelSpectrogram()
)

# inference
onset, offset = detector.detect("path_to_the_audio")
```

Note: the code uses pydub.AudioSegment to process the audio, which requires ffmpeg. On Linux (the `!` prefix is for notebook environments) you can install it as follows:

```
!apt update &> /dev/null
!apt install ffmpeg -y &> /dev/null
```

Training code can be found here: https://github.com/ertan-somundzhu/sad-model

[1] Short for Speech Activity Detection.

[2] Though the model shows good performance on the nccratliri/vad-human-ava-speech dataset (of which I used 25% of the original data), it will most likely fail on real-world noisy data.
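For intuition, the concatenated onset/offset output can be turned into speech segments by applying a sigmoid (the model is trained with BCEWithLogitsLoss, so it emits raw logits) and pairing each onset frame with the next offset frame. The layout assumed below — first half of the array holds onset logits, second half holds offset logits, one value per frame — is an illustrative guess, not the repo's actual post-processing:

```python
import math

def sigmoid(x):
    """Numerically plain logistic function for converting a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def segments_from_logits(logits, threshold=0.5):
    """Convert a concatenated [onset; offset] logit array into (start, end) frame pairs.

    Assumes the first half of `logits` holds per-frame onset logits and the
    second half per-frame offset logits; the model card does not document the
    exact layout, so treat this as a sketch.
    """
    half = len(logits) // 2
    onset = [sigmoid(x) > threshold for x in logits[:half]]
    offset = [sigmoid(x) > threshold for x in logits[half:]]

    segments, start = [], None
    for i in range(half):
        if onset[i] and start is None:
            start = i                      # open a segment at the first onset frame
        if offset[i] and start is not None:
            segments.append((start, i))    # close it at the next offset frame
            start = None
    return segments

# five frames: onset fires at frame 1, offset fires at frame 3
print(segments_from_logits([-5, 5, -5, -5, -5,  -5, -5, -5, 5, -5]))  # → [(1, 3)]
```

Frame indices would still need to be scaled by the spectrogram hop size to obtain timestamps in seconds.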