---
datasets:
- hypersunflower/ava_speech_data_log_mel_spec
- nccratliri/vad-human-ava-speech
language:
- en
pipeline_tag: voice-activity-detection
---

As part of a pet project I created this SAD[1] model. It takes a log mel-spectrogram as input and outputs a concatenated array of per-frame onset and offset predictions.

- Loss: BCEWithLogitsLoss
- Optimizer: Adam

Metrics on the test set:

| Metric    | Value                  |
| --------- | ---------------------- |
| Accuracy  | 0.9998331655911613     |
| Hamming   | 0.00016682081819592185 |
| Precision | 0.9327198181417427     |
| Recall    | 0.9306135245038709     |
| F1        | 0.9296357635399213     |
| Loss      | 0.0008604296028513627  |

To download the model and the necessary code, use the following snippet:

```
from huggingface_hub import snapshot_download

snapshot_download("hypersunflower/a_sad_model", local_dir="model/", repo_type="model")
```

To use the model for inference[2]:

```
# load the scripts
from model.speech_detection import detectSpeech
from model.sadModel import sadModel
from model.logMelSpectrogram import logMelSpectrogram

# load the models
detector = detectSpeech(
    model_path="model/a_sad_model.pth",
    model_class=sadModel(),
    logMelSpectrogram=logMelSpectrogram()
)

# inference
onset, offset = detector.detect("path_to_the_audio")
```

Note: the code uses pydub.AudioSegment to process the audio, which requires ffmpeg. On Linux (the `!` prefix is for notebook environments) you can install it as follows:

```
!apt update &> /dev/null
!apt install ffmpeg -y &> /dev/null
```

Training code can be found here: https://github.com/ertan-somundzhu/sad-model

[1] Short for Speech Activity Detection.

[2] Though the model shows good performance on the nccratliri/vad-human-ava-speech dataset (of which I used 25% of the original data), it will most likely fail on real-world noisy data.
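For intuition, the concatenated onset/offset output can be turned into speech segments by applying a sigmoid (the model is trained with BCEWithLogitsLoss, so it emits raw logits) and pairing each onset frame with the next offset frame. The layout assumed below — first half of the array holds onset logits, second half holds offset logits, one value per frame — is an illustrative guess, not the repo's actual post-processing:

```python
import math

def sigmoid(x):
    """Numerically plain logistic function for converting a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def segments_from_logits(logits, threshold=0.5):
    """Convert a concatenated [onset; offset] logit array into (start, end) frame pairs.

    Assumes the first half of `logits` holds per-frame onset logits and the
    second half per-frame offset logits; the model card does not document the
    exact layout, so treat this as a sketch.
    """
    half = len(logits) // 2
    onset = [sigmoid(x) > threshold for x in logits[:half]]
    offset = [sigmoid(x) > threshold for x in logits[half:]]

    segments, start = [], None
    for i in range(half):
        if onset[i] and start is None:
            start = i                      # open a segment at the first onset frame
        if offset[i] and start is not None:
            segments.append((start, i))    # close it at the next offset frame
            start = None
    return segments

# five frames: onset fires at frame 1, offset fires at frame 3
print(segments_from_logits([-5, 5, -5, -5, -5,  -5, -5, -5, 5, -5]))  # → [(1, 3)]
```

Frame indices would still need to be scaled by the spectrogram hop size to obtain timestamps in seconds.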