	---
	library_name: transformers
	tags:
	- audio
	- speech
	- waveform
	license: mit
	datasets:
	- agkphysics/AudioSet
	metrics:
	- accuracy
	pipeline_tag: feature-extraction
	---

	# Model Card for WavJEPA

	WavJEPA is a waveform-based version of the Joint-Embedding Predictive Architecture (JEPA). It leverages high-level semantic representation learning to tackle the shortcomings of representation learning at the speech unit or token level. This approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat, a multi-channel extension of WavJEPA trained on simulated real-world sound scenes.

	## Model Details

	The WavJEPA framework comprises a waveform encoder, a context encoder, a target encoder and a predictor. WavJEPA's objective is to predict the latent representations of several target blocks from a single context block extracted from the same sound wave. As the waveform encoder, we use the feature encoder of Wav2Vec 2.0, which is composed of stacked temporal convolution layers (Baevski et al., 2020). Similar to the original I-JEPA architecture (Assran et al., 2023), a Vision Transformer (ViT) (Dosovitskiy et al., 2021) is used for the target encoder, the context encoder and the predictor.

	### Model Description

	WavJEPA is the first framework applying semantic learning to general-purpose audio representations in the time domain, surpassing state-of-the-art time-domain approaches on the HEAR (Turian et al., 2022) and ARCH (La Quatra et al., 2024) benchmark suites while requiring only a fraction of the computational resources. Additionally, we address the degraded performance of time-domain models in real-world sound scenes with WavJEPA-Nat, a multi-channel extension of the WavJEPA framework trained on simulated real-world sound scenes. Evaluation on Nat-HEAR (Yuksel et al., 2025), a naturalistic version of the HEAR benchmark suite, demonstrates that WavJEPA-Nat exceeds the robustness of other time-domain foundation models to noise and reverberation.


	- Developed by: Goksenin Yuksel, goksenin.yuksel@ru.nl
	- Model type: Transformers, Audio Foundation Models, Raw Waveform Models
	- Language(s) (NLP): WavJEPA and WavJEPA-Nat support all languages, but mainly English.
	- License: MIT

	### Model Sources

	- Repository: https://github.com/labhamlet/wavjepa
	- Paper: https://arxiv.org/abs/2509.23238

	## Uses

	WavJEPA can be used as a feature extractor for downstream tasks such as environmental sound classification, speech recognition, and speaker counting.
	Training a linear head on top of the extracted features then yields a task-specific audio scene analysis model.
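
The linear-head idea can be sketched as follows. This is a minimal illustration, not the evaluation code used for the reported results: the features are random tensors standing in for frozen WavJEPA embeddings, the `(clips, frames, embedding_dim)` layout and mean-pooling over time are assumptions, and the labels are synthetic so that the toy probe has something to learn.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for frozen WavJEPA features with an ASSUMED layout of
# (clips, frames, embedding_dim). Sizes are deliberately small.
num_clips, num_frames, embed_dim, num_classes = 64, 20, 32, 2
features = torch.randn(num_clips, num_frames, embed_dim)

# Mean-pool over time to get one vector per clip (one common choice).
pooled = features.mean(dim=1)

# Synthetic, linearly separable labels for illustration only.
labels = (pooled[:, 0] > 0).long()

# The linear head is the only trainable module; the features stay fixed.
head = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.Adam(head.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(head(pooled), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

accuracy = (head(pooled).argmax(dim=1) == labels).float().mean().item()
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, train acc {accuracy:.2f}")
```

In practice, `features` would be replaced by embeddings extracted once from the pre-trained model, and accuracy would be measured on a held-out split rather than on the training data.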


	## How to Get Started with the Model



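	~~~python
	import torch
	from transformers import AutoModel, AutoFeatureExtractor

	# trust_remote_code=True loads the custom model code shipped with the repository.
	model = AutoModel.from_pretrained("labhamlet/wavjepa-base", trust_remote_code=True)
	extractor = AutoFeatureExtractor.from_pretrained("labhamlet/wavjepa-base", trust_remote_code=True)

	# Dummy input: 10 s of silence at 16 kHz (1 x 160000 samples).
	audio = torch.zeros([1, 160000])
	extracted = extractor(audio, return_tensors="pt")
	audio_feature = extracted["input_values"]

	# Run the model and print the shape of the extracted features.
	print(model(audio_feature).shape)
	~~~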

	## Training Details


	### Training Data

	We train WavJEPA on the unbalanced training set of AudioSet, which consists of 1.74 million 10-second sound clips scraped from YouTube (Gemmeke
	et al., 2017).

	### Training Procedure

	Each sound clip was resampled to 16 kHz and mean centered to enforce equal loudness across sound clips. We then randomly sampled 8 sections of 2 s from each sound clip, effectively increasing the batch size by a factor of 8 in a computationally efficient manner. Finally, each instance is instance normalized (Ulyanov et al., 2017). The waveform encoder converts each 2 s instance into an embedding w ∈ ℝ^{200×768}, effectively resampling the audio to 100 Hz with a stride of 10 ms and a receptive field size of 12.5 ms.
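
The numbers in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the Python standard library. The helper `sample_crop_starts` and the choice of uniformly random start offsets are assumptions for illustration, not the actual training code.

```python
import random

SAMPLE_RATE = 16_000   # Hz, after resampling
CLIP_SECONDS = 10      # AudioSet clip length
CROP_SECONDS = 2       # length of each sampled section
CROPS_PER_CLIP = 8     # sections drawn from each clip
STRIDE_MS = 10         # waveform-encoder stride stated in the text

clip_samples = SAMPLE_RATE * CLIP_SECONDS          # 160000 samples per clip
crop_samples = SAMPLE_RATE * CROP_SECONDS          # 32000 samples per section
stride_samples = SAMPLE_RATE * STRIDE_MS // 1000   # 160 samples per frame step
frames_per_crop = crop_samples // stride_samples   # 200 frames, matching 200x768
frame_rate_hz = 1000 // STRIDE_MS                  # 100 frames per second


def sample_crop_starts(rng, n_crops=CROPS_PER_CLIP):
    """Hypothetical helper: draw random start offsets (in samples) for the sections."""
    last_valid_start = clip_samples - crop_samples
    return [rng.randint(0, last_valid_start) for _ in range(n_crops)]


rng = random.Random(0)
starts = sample_crop_starts(rng)
crops = [(start, start + crop_samples) for start in starts]

print(frames_per_crop, frame_rate_hz)  # 200 100
print(len(crops))                      # 8
```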

	We sampled starting indices for the context block with p = 0.065 and for target blocks with p = 0.025. We set M to 10 for both the context block and the target blocks. To update the target encoder parameters ∆, we linearly increased τ from τ0 = 0.999 to τe = 0.99999 over the first 100,000 steps, after which τ was kept constant. We used K = 8 for the top K averaging.

	We trained WavJEPA for 375,000 steps using a batch size of 32 on two NVIDIA H100 94 GB GPUs. Given our in-batch sampling factor of 8, we boost our effective batch size to 256. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a weight decay coefficient λw = 0.04. The learning rate schedule follows a cosine decay with linear warm-up over 100,000 steps, reaching a peak learning rate of 2 × 10⁻⁴ before decaying to zero.
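
The schedules described above can be written down compactly. The following is a sketch under stated assumptions, not the training code: it assumes the learning rate warms up from zero and reaches zero exactly at the final step, and that the target encoder is updated with the usual exponential moving average of the context encoder weights. All function names are illustrative.

```python
import math

TOTAL_STEPS = 375_000
WARMUP_STEPS = 100_000       # linear learning-rate warm-up
PEAK_LR = 2e-4
TAU_START, TAU_END = 0.999, 0.99999
TAU_RAMP_STEPS = 100_000     # tau is constant after this many steps
BATCH_SIZE, CROPS_PER_CLIP = 32, 8


def ema_momentum(step):
    """Linear ramp of tau from TAU_START to TAU_END, then constant."""
    progress = min(step / TAU_RAMP_STEPS, 1.0)
    return TAU_START + (TAU_END - TAU_START) * progress


def learning_rate(step):
    """Linear warm-up to PEAK_LR, then cosine decay to zero (assumed end: TOTAL_STEPS)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


def ema_update(target, context, tau):
    """ASSUMED update rule: target <- tau * target + (1 - tau) * context."""
    return [tau * t + (1.0 - tau) * c for t, c in zip(target, context)]


# 32 clips per batch, 8 sections per clip.
effective_batch = BATCH_SIZE * CROPS_PER_CLIP

for step in (0, 50_000, 100_000, 375_000):
    print(step, ema_momentum(step), learning_rate(step))
```

With τ this close to 1, the target encoder changes very slowly: at τ = 0.99999, each step moves the target weights only 0.001% of the way toward the context encoder weights.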

	#### Preprocessing

	RMS normalization was applied to bring all audio clips to the same loudness level, followed by instance normalization.
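
As a rough illustration of these two steps, the sketch below applies them to a toy waveform using only the standard library. The target RMS level of 0.1, the `eps` constants and both function names are assumptions chosen for the example; the values used in training are not stated here.

```python
import math


def rms_normalize(wave, target_rms=0.1, eps=1e-8):
    """Scale a clip so its root-mean-square level matches target_rms (assumed value)."""
    rms = math.sqrt(sum(x * x for x in wave) / len(wave))
    gain = target_rms / (rms + eps)
    return [x * gain for x in wave]


def instance_normalize(wave, eps=1e-8):
    """Shift and scale a single instance to zero mean and unit variance."""
    mean = sum(wave) / len(wave)
    var = sum((x - mean) ** 2 for x in wave) / len(wave)
    std = math.sqrt(var + eps)
    return [(x - mean) / std for x in wave]


# Toy input: one second of a 440 Hz sine at 16 kHz with a small DC offset.
wave = [0.5 * math.sin(2 * math.pi * 440 * n / 16_000) + 0.05 for n in range(16_000)]

leveled = rms_normalize(wave)             # same loudness level across clips
normalized = instance_normalize(leveled)  # per-instance zero mean, unit variance
```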


	#### Training Hyperparameters

	- Training regime: WavJEPA and WavJEPA-Nat were trained with mixed precision, `torch.compile` and flash attention.

	## Evaluation

	We evaluate WavJEPA and other state-of-the-art models on the HEAR (Turian et al., 2022) and ARCH (La Quatra et al., 2024) benchmark suites, which present a wide range of tasks for evaluating the downstream performance of audio representation models.

	### Testing Data, Factors & Metrics

	#### Testing Data

	HEAR: The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios.
	HEAR evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music.
	HEAR was launched as a NeurIPS 2021 shared challenge. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.

	ARCH: ARCH is a comprehensive benchmark for evaluating audio representation learning (ARL) methods on diverse audio classification domains, covering acoustic events, music, and speech. It comprises 12 datasets that allow a thorough assessment of pre-trained self-supervised learning (SSL) models of different sizes.
	ARCH streamlines benchmarking of ARL techniques through its unified access to a wide range of domains and its ability to readily incorporate new datasets and models.

	### Results


	HEAR

	| Model | Size | DCASE | FSD50K | LC | ESC-50 | CD | VL | SC-5 | NS | BO | Mri-S | Mri-T | s(m) |
	|-------|------|-------|--------|----|--------|----|----|------|----|----|-------|-------|------|
	| **Baseline** | | | | | | | | | | | | | |
	| HEAR-Naive | N/A | 7.6 | 12.5 | 40.3 ± 1.2 | 27.4 ± 3.3 | 36.7 ± 2.5 | 16.0 ± 3.4 | 13.3 | 89.2 | 97.1 ± 3.2 | 94.2 ± 1.1 | 93.7 ± 0.3 | 0.0 |
	| **Speech pre-training** | | | | | | | | | | | | | |
	| Wav2Vec2.0 | B | 23.5 | 29.4 | 69.9 ± 2.1 | 46.4 ± 1.8 | 57.3 ± 1.1 | 34.9 ± 2.4 | 85.3 | 17.4 | 81.4 ± 4.8 | 90.7 ± 0.8 | 77.0 ± 0.9 | 30.9 |
	| HuBERT | B | 78.0 | 32.8 | 63.3 ± 1.2 | 58.6 ± 2.8 | 71.2 ± 1.2 | 65.2 ± 2.9 | 94.0 | 19.8 | 93.2 ± 5.9 | 94.6 ± 0.4 | 85.0 ± 2.5 | 47.3 |
	| WavLM | B | 27.0 | 25.7 | 61.3 ± 2.3 | 49.5 ± 3.8 | 64.3 ± 1.3 | 60.1 ± 3.2 | 93.6 | 16.0 | 84.3 ± 6.3 | 88.8 ± 1.0 | 76.8 ± 0.5 | 35.1 |
	| Data2Vec | B | 46.5 | 15.2 | 47.9 ± 1.2 | 28.0 ± 2.8 | 55.7 ± 1.0 | 44.9 ± 3.1 | 88.5 | 14.0 | 78.4 ± 4.1 | 85.1 ± 0.7 | 70.5 ± 3.3 | 23.6 |
	| Wav2Vec2.0 | L | 66.0 | 34.8 | 64.6 ± 1.9 | 59.8 ± 1.5 | 65.7 ± 0.8 | 53.3 ± 6.3 | 75.8 | 40.6 | 93.6 ± 2.6 | 94.8 ± 0.5 | 82.4 ± 3.0 | 42.5 |
	| HuBERT | L | 34.8 | 31.4 | 63.8 ± 1.3 | 60.4 ± 3.0 | 71.0 ± 1.2 | 69.0 ± 2.8 | 84.8 | 20.4 | 93.6 ± 3.0 | 95.3 ± 0.8 | 82.5 ± 2.0 | 44.3 |
	| WavLM | L | 77.4 | 40.1 | 69.4 ± 2.1 | 66.6 ± 2.5 | 76.3 ± 2.2 | 79.2 ± 3.9 | 93.8 | 18.2 | 93.6 ± 5.4 | 95.8 ± 0.8 | 90.1 ± 1.0 | 58.1 |
	| Data2Vec | L | 40.8 | 18.7 | 50.9 ± 1.7 | 34.4 ± 2.5 | 62.8 ± 1.6 | 60.0 ± 4.9 | 86.1 | 14.4 | 80.1 ± 8.5 | 84.7 ± 2.6 | 65.6 ± 3.1 | 29.0 |
	| **AudioSet pre-training** | | | | | | | | | | | | | |
	| Wav2Vec2.0 | B | 52.0 | 34.7 | 60.4 ± 1.7 | 58.9 ± 1.9 | 56.3 ± 1.3 | 27.9 ± 4.6 | 72.1 | 42.0 | 86.0 ± 9.6 | 92.9 ± 1.4 | 77.3 ± 0.5 | 31.9 |
	| HuBERT | B | 86.2 | 41.1 | 63.5 ± 3.4 | 69.1 ± 1.6 | 69.5 ± 1.2 | 53.3 ± 3.1 | 83.5 | 38.8 | 91.5 ± 8.8 | 95.6 ± 0.5 | 90.4 ± 0.8 | 51.1 |
	| Wav2Vec2.0 | L | 82.6 | 47.8 | 73.6 ± 1.2 | 72.6 ± 2.1 | 68.2 ± 1.7 | 42.2 ± 6.0 | 83.9 | 30.8 | 91.5 ± 5.0 | 96.5 ± 0.3 | 88.7 ± 2.5 | 55.9 |
	| HuBERT | L | 86.2 | 45.4 | 75.2 ± 1.4 | 66.3 ± 4.6 | 70.1 ± 0.8 | 39.6 ± 3.6 | 85.7 | 38.6 | 91.6 ± 9.6 | 97.3 ± 0.5 | 89.6 ± 2.3 | 57.7 |
	| WavJEPA | B | 93.9 | 54.4 | 76.7 ± 2.4 | 86.5 ± 3.3 | 71.0 ± 0.8 | 49.8 ± 3.4 | 90.0 | 34.4 | 89.4 ± 5.4 | 97.3 ± 0.4 | 88.5 ± 0.5 | 66.0 |

	ARCH

	| Model | Size | ESC-50 | US8K | FSD50K | VIVAE | FMA | MTT | IRMAS | MS-DB | RAVDESS | AM | SLURP | EMOVO | s(m) |
	|-------|------|--------|------|--------|-------|-----|-----|-------|-------|---------|----|-------|-------|------|
	| **Baseline** | | | | | | | | | | | | | | |
	| HEAR-Naive | N/A | 13.0 | 36.0 | 2.2 | 22.0 | 39.0 | 9.9 | 19.9 | 35.2 | 22.6 | 45.7 | 5.4 | 18.4 | 0.0 |
	| **Speech pre-training** | | | | | | | | | | | | | | |
	| Wav2Vec2.0 | B | 45.7 | 55.5 | 19.4 | 31.5 | 50.5 | 37.6 | 35.1 | 66.1 | 55.3 | 86.4 | 14.4 | 31.8 | 49.7 |
	| WavLM | B | 49.9 | 61.8 | 17.6 | 36.3 | 48.7 | 34.9 | 32.6 | 54.2 | 67.9 | 99.5 | 31.0 | 43.1 | 68.0 |
	| HuBERT | B | 58.9 | 67.3 | 24.5 | 40.5 | 54.6 | 38.8 | 36.7 | 58.5 | 65.3 | 99.6 | 33.8 | 40.5 | 59.7 |
	| Data2Vec | B | 23.6 | 45.6 | 10.1 | 30.2 | 40.6 | 27.6 | 25.9 | 50.7 | 48.0 | 99.1 | 43.6 | 27.3 | 38.8 |
	| Wav2Vec2.0 | L | 13.1 | 42.7 | 5.8 | 22.0 | 41.7 | 21.0 | 19.9 | 50.2 | 11.6 | 45.7 | 7.3 | 19.3 | 8.6 |
	| WavLM | L | 67.2 | 70.9 | 32.2 | 42.5 | 61.1 | 41.3 | 42.5 | 68.0 | 71.8 | 99.8 | 42.3 | 45.3 | 75.8 |
	| HuBERT | L | 64.0 | 70.0 | 29.5 | 41.0 | 54.8 | 38.4 | 36.8 | 64.1 | 72.6 | 99.9 | 45.3 | 43.8 | 81.5 |
	| Data2Vec | L | 25.4 | 49.2 | 10.8 | 30.6 | 43.5 | 28.5 | 27.1 | 44.2 | 45.1 | 99.2 | 28.6 | 23.1 | 35.1 |
	| **AudioSet pre-training** | | | | | | | | | | | | | | |
	| Wav2Vec2.0 | B | 52.6 | 70.5 | 21.3 | 31.3 | 59.5 | 37.9 | 35.9 | 64.6 | 45.9 | 88.1 | 11.0 | 30.8 | 53.8 |
	| HuBERT | B | 68.8 | 79.1 | 31.1 | 40.1 | 65.9 | 43.4 | 47.7 | 67.8 | 63.5 | 98.8 | 20.5 | 33.4 | 75.5 |
	| Wav2Vec2.0 | L | 74.4 | 79.0 | 37.6 | 39.7 | 66.6 | 44.5 | 49.9 | 76.9 | 59.5 | 99.4 | 17.7 | 38.2 | 80.0 |
	| HuBERT | L | 71.5 | 75.6 | 37.4 | 44.3 | 67.5 | 43.4 | 50.5 | 77.8 | 73.3 | 99.6 | 20.5 | 38.6 | 83.9 |
	| WavJEPA | B | 83.9 | 83.5 | 48.0 | 44.06 | 68.2 | 46.0 | 59.0 | 79.5 | 62.5 | 99.5 | 23.3 | 46.6 | 92.3 |

	#### Summary

	We presented WavJEPA, a state-of-the-art audio foundation model that leverages self-supervised semantic learning to obtain robust general-purpose audio representations from raw waveforms.
	WavJEPA’s results highlight the superior performance of semantic audio representation learning in comparison with representation learning at the speech unit or token level, as is common in existing
	time-domain speech representation learning approaches.

	## Model Card Contact

	Goksenin Yuksel; goksenin.yuksel@ru.nl