---
license: apache-2.0
tags:
- non-verbal-vocalization
- audio-classification
- baby-crying
model-index:
- name: voc2vec
  results: []
language:
- en
pipeline_tag: audio-classification
library_name: transformers
---

# voc2vec

voc2vec is a foundation model specifically designed for non-verbal human vocalizations.

We pre-trained a [Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)-like model on a collection of 10 datasets covering around 125 hours of non-verbal audio.

## Model description

Voc2vec is built upon the wav2vec 2.0 framework and follows its pre-training setup.
The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, VocalSound.

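As a rough illustration of what building on wav2vec 2.0 implies for input handling, the sketch below counts how many latent frames the convolutional feature encoder produces for a raw waveform. It assumes voc2vec keeps the standard wav2vec 2.0 base encoder (seven unpadded 1-D convolutions with the kernel sizes and strides listed in `CONV_LAYERS`); that is an assumption, so check the checkpoint's configuration before relying on the numbers. The name `num_output_frames` is introduced here for illustration only.

```python
# (kernel_size, stride) of the 7 convolutional layers in the standard
# wav2vec 2.0 base feature encoder. ASSUMPTION: voc2vec uses the same values.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_output_frames(num_samples: int) -> int:
    """Return the number of latent frames for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


# One second of audio at 16 kHz
print(num_output_frames(16_000))  # 49
```

Under these assumptions the encoder emits roughly one frame every 20 ms (a total stride of 320 samples at 16 kHz).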
## Task and datasets description

We evaluate voc2vec on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, VIVAE.

The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets listed above.

| Model | Architecture | Pre-training data | UAR | F1 Macro |
|--------|-------------|-------------|-----------|-----------|
| **voc2vec** | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
| **voc2vec-as-pt** | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
| **voc2vec-ls-pt** | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
| **voc2vec-hubert-ls-pt** | HuBERT | LibriSpeech + Voc125 | **.696±.189** | **.678±.200** |

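For readers unfamiliar with the two metrics, here is a minimal sketch of how UAR (the mean of per-class recalls) and F1 Macro (the mean of per-class F1 scores) can be computed from reference and predicted labels. It is not the evaluation code behind the reported numbers; the function name `uar_and_macro_f1` and the toy labels are hypothetical, and the set of classes is assumed to be the one found in the reference labels.

```python
from collections import Counter


def uar_and_macro_f1(y_true, y_pred):
    """Return (UAR, macro F1) for two equally long label sequences."""
    classes = sorted(set(y_true))  # ASSUMPTION: classes come from the references
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1  # missed an instance of the true class
            fp[pred] += 1   # wrongly predicted this class

    recalls, f1_scores = [], []
    for c in classes:
        recalls.append(tp[c] / (tp[c] + fn[c]))
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)

    # Both metrics weight every class equally, regardless of its frequency.
    return sum(recalls) / len(classes), sum(f1_scores) / len(classes)


# Toy labels for illustration only
y_true = ["cry", "cry", "cry", "laugh"]
y_pred = ["cry", "cry", "laugh", "laugh"]
uar, macro_f1 = uar_and_macro_f1(y_true, y_pred)
print(f"UAR={uar:.3f}  F1 Macro={macro_f1:.3f}")
```

Because every class counts equally, both metrics stay informative on imbalanced datasets where plain accuracy would be dominated by the majority class.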
## Available Models

| Model | Description | Link |
|--------|-------------|------|
| **voc2vec** | Pre-trained model on **125 hours of non-verbal audio**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec) |
| **voc2vec-as-pt** | Continues pre-training from a wav2vec2-like model that was **initially trained on the AudioSet dataset**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-as-pt) |
| **voc2vec-ls-pt** | Continues pre-training from a wav2vec2-like model that was **initially trained on the LibriSpeech dataset**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-ls-pt) |
| **voc2vec-hubert-ls-pt** | Continues pre-training from a HuBERT-like model that was **initially trained on the LibriSpeech dataset**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-hubert-ls-pt) |

## Usage examples

You can use the model directly in the following manner:

```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load an audio file, resampled to 16 kHz
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec")

# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits (inference only, so no gradients are needed)
with torch.no_grad():
    logits = model(**inputs).logits
```

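The logits only become meaningful class scores once the classification head has been fine-tuned on a labelled dataset. The sketch below shows one way to turn a row of logits into a label and a probability, assuming such a fine-tuned checkpoint. The helper names `softmax` and `top_prediction`, the label mapping, and the logit values are all hypothetical and written in plain Python so the example runs without downloading a model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, id2label):
    """Return (label, probability) of the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical label mapping and logits, for illustration only
id2label = {0: "cry", 1: "laugh", 2: "scream"}
label, prob = top_prediction([0.2, 2.1, -1.0], id2label)
print(label, round(prob, 3))
```

With the usage snippet above, the inputs would typically be `logits[0].tolist()` and `model.config.id2label`. If the checkpoint ships without a fine-tuned head (likely for a pre-trained-only model; check the warnings printed at load time), the head is randomly initialised and its predictions carry no meaning until you fine-tune it.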
## BibTeX entry and citation info

```bibtex
@INPROCEEDINGS{koudounas2025icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
  doi={10.1109/ICASSP49660.2025.10890672}}
```