	---
	license: apache-2.0
	tags:
	- non-verbal-vocalization
	- audio-classification
	- baby-crying
	model-index:
	- name: voc2vec
	  results: []
	language:
	- en
	pipeline_tag: audio-classification
	library_name: transformers
	---

	# voc2vec

	voc2vec is a foundation model designed specifically for non-verbal human vocalizations.

	We pre-trained a [Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)-like model on a collection of 10 datasets covering around 125 hours of non-verbal audio.

	## Model description

	voc2vec builds on the wav2vec 2.0 framework and follows its pre-training setup.
	The pre-training datasets are: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, VocalSound.

	## Task and datasets description

	We evaluate voc2vec on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, VIVAE.

	The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.

	| Model | Architecture | Pre-training datasets | UAR | F1 Macro |
	|-------|--------------|-----------------------|-----|----------|
	| voc2vec | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
	| voc2vec-as-pt | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
	| voc2vec-ls-pt | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
	| voc2vec-hubert-ls-pt | HuBERT | LibriSpeech + Voc125 | .696±.189 | .678±.200 |
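
For reference, UAR is the unweighted mean of per-class recall, and F1 Macro is the unweighted mean of per-class F1 scores. The sketch below computes both in plain Python. The function name `uar_and_macro_f1` and the toy labels are illustrative assumptions and are not taken from the voc2vec evaluation code.

```python
from collections import Counter


def uar_and_macro_f1(y_true, y_pred):
    """Return (UAR, macro F1) for two equal-length label sequences.

    Illustrative helper, not part of the voc2vec code base.
    Only classes that occur in y_true are averaged.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    recalls, f1_scores = [], []
    for label in sorted(support):
        tp = correct[label]
        recall = tp / support[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)

    n_classes = len(support)
    return sum(recalls) / n_classes, sum(f1_scores) / n_classes


# Toy example with hypothetical labels
y_true = ["cry", "cry", "laugh", "laugh", "scream"]
y_pred = ["cry", "laugh", "laugh", "laugh", "scream"]
uar, macro_f1 = uar_and_macro_f1(y_true, y_pred)
print(f"UAR={uar:.3f}  F1 Macro={macro_f1:.3f}")
```

Because every class contributes equally to both averages, a rare class counts as much as a frequent one, which matters for imbalanced vocalization datasets.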

	## Available Models

	| Model | Description | Link |
	|-------|-------------|------|
	| voc2vec | Pre-trained model on 125 hours of non-verbal audio. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec) |
	| voc2vec-as-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the AudioSet dataset. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-as-pt) |
	| voc2vec-ls-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the LibriSpeech dataset. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-ls-pt) |
	| voc2vec-hubert-ls-pt | Continues pre-training from a HuBERT-like model that was initially trained on the LibriSpeech dataset. | [🔗 Model](https://huggingface.co/alkiskoudounas/voc2vec-hubert-ls-pt) |

	## Usage examples

	You can load the model and run it on an audio file as follows:
	```python
	import torch
	import librosa
	from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

	# Load an audio file, resampled to 16 kHz
	audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

	# Load model and feature extractor
	# Note: if the checkpoint has no fine-tuned classification head, transformers
	# initializes one randomly, so fine-tune before interpreting the logits.
	model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec")
	feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec")
	model.eval()

	# Extract features
	inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

	# Compute logits without tracking gradients
	with torch.no_grad():
	    logits = model(**inputs).logits
	```
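
Once the model has a fine-tuned classification head, the logits can be turned into a predicted label. The following is a minimal sketch in plain Python. The helper `top_prediction`, the example logits, and the example `id2label` mapping are hypothetical; in practice the mapping would typically come from `model.config.id2label`.

```python
import math


def top_prediction(logits, id2label):
    """Apply softmax to a 1-D list of logits and return (label, probability)."""
    if not logits:
        raise ValueError("logits must not be empty")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical label mapping and logits for a single clip. With a real model,
# one could pass logits[0].tolist() and model.config.id2label instead.
id2label = {0: "cry", 1: "laugh", 2: "scream"}
label, prob = top_prediction([0.2, 2.1, -1.0], id2label)
print(label, round(prob, 3))
```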

	## BibTeX entry and citation info

	```bibtex
	@INPROCEEDINGS{koudounas2025icassp,
	author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
	booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
	title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
	year={2025},
	volume={},
	number={},
	pages={1-5},
	keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
	doi={10.1109/ICASSP49660.2025.10890672}}
	```