---
license: other
license_name: pantagruel-research-license
license_link: https://huggingface.co/ina-foss/ssl-vad-base/blob/main/LICENSE.md
extra_gated_prompt: >-
  You agree to use the model according to its
  [license](https://huggingface.co/ina-foss/ssl-vad-base/blob/main/LICENSE.md).
extra_gated_fields:
  Research institute: text
  Institutional email address: text
  Country: country
  I agree to use this model for non-commercial research use only: checkbox
  I have read, and I agree to use the model according to its license: checkbox
pipeline_tag: voice-activity-detection
library_name: transformers
tags:
  - speech
  - voice
  - voice activity detection
  - VAD
base_model:
  - ina-foss/ssl-audio-1k-base
---

# Voice Activity Detection using SSL features

This model is trained to detect and segment voice activity in a given audio file. It uses SSL features from the [ina-foss/ssl-audio-1k-base](https://huggingface.co/ina-foss/ssl-audio-1k-base) model.

You can find the full list of other Voice Activity Detection (VAD) models using SSL features here. Their global results on the InaGVAD test set are the following (see the paper for more details):

| Model | Global Acc. (%) | Global F1 (%) |
|---|---|---|
| ssl-vad-music2vec | 96.4 | 97.0 |
| **ssl-vad-base** (this model) | **96.8** | **97.3** |
| ssl-vad-no_music | 96.0 | 96.6 |
| ssl-vad-only_speech | 96.2 | 96.8 |
| ssl-vad-only_fr | 96.3 | 96.9 |
| ssl-vad-gender | 96.5 | 97.0 |

Music detection models using SSL features can be found here.

## Architecture

The model first extracts features from the CNN and the first transformer layer of the [ina-foss/ssl-audio-1k-base](https://huggingface.co/ina-foss/ssl-audio-1k-base) SSL encoder. These features are then fed to a downstream MLP, trained to predict binary voice activity for each frame. During inference, the frame-level predictions are decoded with a Viterbi decoder (from librosa), as sketched below.
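
To make the decoding step concrete, here is a minimal sketch of Viterbi smoothing with `librosa.sequence.viterbi`. The probabilities and transition matrix below are illustrative assumptions, not the values used by this model:

```python
import numpy as np
import librosa

# Hypothetical frame-level speech probabilities, as an MLP head might output.
speech_prob = np.array([0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1])

# Two states: 0 = non-speech, 1 = speech; shape (n_states, n_frames).
prob = np.vstack([1.0 - speech_prob, speech_prob])

# Transition matrix favouring staying in the same state (assumed values);
# each row must sum to 1.
transition = np.array([[0.99, 0.01],
                       [0.01, 0.99]])

# Viterbi decoding turns noisy frame-wise probabilities into a stable state path.
states = librosa.sequence.viterbi(prob, transition)
print(states)  # e.g. [0 0 1 1 1 0 0]
```

Runs of identical decoded states can then be merged into start/stop segments like those returned in the Usage section below.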

## Data and training

The model has been trained on the validation subset of the InaGVAD dataset (Doukhan et al., 2024).

For detailed information about the training and the results associated with this model, please refer to our publication. The training hyperparameters, original checkpoint, and TensorBoard event files are available in the `training` directory of this repository.

## Usage

To use this model, you need the packages listed in the `requirements.txt` file of this repository. Then:

```python
import librosa
from transformers import AutoModel

# load the audio file; it must be sampled at 16 kHz
audio, sr = librosa.load('/path/to/your/audio/file.wav', sr=16000)

# load the voice activity detection model
model = AutoModel.from_pretrained(
    'ina-foss/ssl-vad-base',
    trust_remote_code=True
)

# run the inference
output = model(
    audio=audio,
    sampling_rate=sr
)

print(output)
```

The output is a list of segments, each with `start` and `stop` times in seconds and a boolean `label` (`True` marks speech):

```
[{'start': 0.0, 'stop': 56.58943157192866, 'label': False},
 {'start': 56.58943157192866, 'stop': 60.45007501250208, 'label': True},
 {'start': 60.45007501250208, 'stop': 62.870478413068845, 'label': False},
 [...]
 {'start': 117.03950658443074, 'stop': 119.21986997832973, 'label': True},
 {'start': 119.21986997832973, 'stop': 119.97999666611102, 'label': False}]
```
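
As a follow-up, here is a minimal, hypothetical post-processing snippet (it only assumes the output format shown above) that keeps the speech segments and totals their duration:

```python
# Keep only the segments labelled as speech.
speech_segments = [seg for seg in output if seg['label']]

# Sum the duration (stop - start, in seconds) over the speech segments.
total_speech = sum(seg['stop'] - seg['start'] for seg in speech_segments)

print(f"{len(speech_segments)} speech segments, {total_speech:.1f} s of speech")
```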

## License and citation

The model is distributed under the [pantagruel-research-license](https://huggingface.co/ina-foss/ssl-vad-base/blob/main/LICENSE.md).

If you use this model or find it useful in your research, publications, or applications, please cite the following work:

```bibtex
@inproceedings{pelloin2026lrec,
  author    = {Pelloin, Valentin and Bekkali, Lina and Dehak, Reda and Doukhan, David},
  title     = {Data Selection Effects on Self-Supervised Learning of Audio Representations for French Audiovisual Broadcasts},
  booktitle = {Fifteenth International Conference on Language Resources and Evaluation (LREC 2026)},
  year      = {2026},
  address   = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association},
}
```