---
license: apache-2.0
tags:
- non-verbal-vocalization
- audio-classification
- baby-crying
model-index:
- name: voc2vec
  results: []
language:
- en
pipeline_tag: audio-classification
library_name: transformers
---

# voc2vec

voc2vec is a foundation model specifically designed for non-verbal human vocalizations.

We pre-trained a [Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)-like model on a collection of 10 datasets covering around 125 hours of non-verbal audio.

## Model description

voc2vec is built upon the wav2vec 2.0 framework and follows its pre-training setup.
The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, VocalSound.
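
Because the checkpoint follows the standard wav2vec 2.0 architecture, it can also serve as a frozen feature extractor. Below is a minimal sketch (the audio path is a placeholder) that mean-pools the encoder's frame-level representations into a clip-level embedding:

```python
import torch
import librosa
from transformers import AutoModel, AutoFeatureExtractor

# Load the pre-trained encoder (no classification head)
model = AutoModel.from_pretrained("alkiskoudounas/voc2vec")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec")
model.eval()

# "path_to_audio.wav" is a placeholder; resample to the model's 16 kHz rate
audio_array, _ = librosa.load("path_to_audio.wav", sr=16000)
inputs = feature_extractor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state: (batch, frames, hidden_size) frame-level representations
    frame_embeddings = model(**inputs).last_hidden_state

# Mean-pool over time to get one embedding per clip
clip_embedding = frame_embeddings.mean(dim=1)
```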

## Task and datasets description

We evaluate voc2vec on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, VIVAE.

The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.
 
| Model | Architecture | Pre-training DS | UAR | F1 Macro |
|--------|-------------|-------------|-----------|-----------|
| **voc2vec** | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
| **voc2vec-as-pt** | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
| **voc2vec-ls-pt** | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
| **voc2vec-hubert-ls-pt** | HuBERT | LibriSpeech + Voc125 | **.696±.189** | **.678±.200** |
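
Here, UAR is the macro-averaged (unweighted) recall over classes, and F1 Macro is the unweighted mean of per-class F1 scores. As a quick reference, both can be computed with scikit-learn; the label arrays below are illustrative placeholders:

```python
from sklearn.metrics import recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2]  # ground-truth class ids (placeholder data)
y_pred = [0, 1, 1, 1, 2, 0]  # model predictions (placeholder data)

uar = recall_score(y_true, y_pred, average="macro")   # Unweighted Average Recall
f1_macro = f1_score(y_true, y_pred, average="macro")  # F1 Macro
print(f"UAR: {uar:.3f} | F1 Macro: {f1_macro:.3f}")
```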

## Available Models

| Model | Description | Link |
|--------|-------------|------|
| **voc2vec** | Pre-trained model on **125 hours of non-verbal audio**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec) |
| **voc2vec-as-pt** | Continues pre-training from a wav2vec2-like model that was **initially trained on the AudioSet dataset**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-as-pt) |
| **voc2vec-ls-pt** | Continues pre-training from a wav2vec2-like model that was **initially trained on the LibriSpeech dataset**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-ls-pt) |
| **voc2vec-hubert-ls-pt** | Continues pre-training from a hubert-like model that was **initially trained on the LibriSpeech dataset**. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-hubert-ls-pt) |

## Usage examples

You can use the model directly as follows:
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load an audio file, resampled to the 16 kHz rate the model expects
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec")
model.eval()

# Extract features (librosa already returns a 1-D array)
inputs = feature_extractor(audio_array, sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits
with torch.no_grad():
    logits = model(**inputs).logits
```
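
To adapt voc2vec to your own downstream task, the classification head can be resized and fine-tuned on labeled data. A minimal sketch, assuming a hypothetical 6-class task (`num_labels` is a placeholder):

```python
from transformers import AutoModelForAudioClassification

# Placeholder: set num_labels to the number of classes in your task.
model = AutoModelForAudioClassification.from_pretrained(
    "alkiskoudounas/voc2vec",
    num_labels=6,
    ignore_mismatched_sizes=True,  # reinitialize the head if its size differs
)

# The encoder weights come from voc2vec; train the new head (and optionally
# unfreeze the encoder) on your labeled clips, e.g. with transformers.Trainer.
```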

## BibTeX entry and citation info

```bibtex
@INPROCEEDINGS{koudounas2025icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025},
  pages={1-5},
  keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
  doi={10.1109/ICASSP49660.2025.10890672}}
```