SPEAR Base (speech + general audio)

This is the updated version of the SPEAR Base dual-domain (speech + general audio) model. Compared to the first version, it is trained with token mixing for improved performance on overlapped and noisy speech. The model adopts a 93M-parameter Zipformer backbone consisting of 12 Zipformer stacks, and generates 512-dimensional representations at approximately 50 Hz.
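As a quick sanity check on the stated frame rate, a hypothetical helper (not part of the released model) can estimate how many output frames to expect for a given clip, assuming ~50 Hz output from 16 kHz input (i.e. one frame per 320 input samples):

```python
def num_output_frames(num_samples: int, sample_rate: int = 16000, frame_rate: int = 50) -> int:
    """Approximate number of output frames, assuming a fixed ~50 Hz frame rate."""
    samples_per_frame = sample_rate // frame_rate  # 320 samples per frame at 16 kHz
    return num_samples // samples_per_frame

# A 10-second clip at 16 kHz (160,000 samples) yields roughly 500 frames:
print(num_output_frames(160000))  # → 500
```

The exact frame count from the model may differ slightly at the edges due to padding and subsampling, so treat this as an estimate.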

This model was pre-trained on 97k hours of mixed English speech and general audio data: 84k hours of speech and the remaining 13k hours of general audio. It achieves competitive performance, relative to models of similar size, on the SUPERB and HEAR benchmarks.

The speech data consists of the following datasets:

| Dataset | Duration (hours) |
|---|---|
| Libriheavy | ~50k |
| Gigaspeech | ~10k |
| VoxPopuli (en) | ~24k |

The audio data consists of the following datasets:

| Dataset | Duration (hours) |
|---|---|
| AudioSet | ~5k |
| Freesound | ~2.8k |
| Music4all | ~1k |
| VGGSound | ~0.5k |
| MTG-Jamendo | ~3.8k |

Note: The model is pre-trained on speech/audio data sampled at 16 kHz. When using the model, make sure that your input is also sampled at 16 kHz.
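If your audio is at a different sample rate, resample it before inference. A minimal sketch using SciPy's polyphase resampler (this choice is an assumption for illustration; `torchaudio.functional.resample` works equally well):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a 1-D waveform to 16 kHz using polyphase filtering."""
    if orig_sr == target_sr:
        return audio
    g = gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

# e.g. one second of 44.1 kHz audio becomes 16,000 samples
audio_44k = np.random.randn(44100)
audio_16k = to_16k(audio_44k, 44100)
print(audio_16k.shape)  # (16000,)
```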

Paper

Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland

Abstract: Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.

Usage

This model is pre-trained purely using unlabelled data. Therefore, it requires fine-tuning with labelled data for downstream tasks such as automatic speech recognition (ASR) or audio tagging (AT).

The model achieves the following word error rates (WERs) when fine-tuned on LibriSpeech for ASR:

| Fine-tuning data | test-clean | test-other |
|---|---|---|
| LS960 | 1.9 | 4.2 |

The model achieves the following mean average precision (mAP) when fine-tuned on AudioSet for AT:

| Fine-tuning data | mAP |
|---|---|
| AudioSet Balanced | 39.1 |
| AudioSet Full | 48.4 |

You can extract the top-layer features (and intermediate hidden states) using the following code:

```python
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "marcoyang/spear-base-speech-audio-v2",
    trust_remote_code=True,
)
if torch.cuda.is_available():
    model = model.to("cuda")
model.eval()

device = next(model.parameters()).device
audio = torch.randn(1, 160000).to(device)      # dummy 10-second audio input at 16 kHz
audio_len = torch.tensor([160000]).to(device)  # number of samples per utterance

with torch.no_grad():
    outputs = model(audio, audio_len)

encoder_out = outputs["encoder_out"]            # (N, T, C)
encoder_out_lens = outputs["encoder_out_lens"]  # (N,)
middle_out = outputs["hidden_states"]           # list of (N, T, C)

print(encoder_out)
print(encoder_out_lens)
print(len(middle_out))  # 12 layers
print(middle_out[-1].shape)
print(middle_out[-1])
```
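For downstream probing, SUPERB-style evaluation typically combines the hidden states with a learned weighted sum over layers. A minimal NumPy sketch of that aggregation step, where the layer count and shapes follow the output above but the weighting scheme itself is an illustration rather than part of the released model:

```python
import numpy as np

def weighted_layer_sum(hidden_states, weights):
    """Combine per-layer features (list of (N, T, C) arrays) with softmax weights."""
    w = np.exp(weights - np.max(weights))
    w = w / w.sum()                          # softmax over layers
    stacked = np.stack(hidden_states)        # (L, N, T, C)
    return np.tensordot(w, stacked, axes=1)  # (N, T, C)

# 12 layers of dummy 512-dim features for a 500-frame clip
layers = [np.random.randn(1, 500, 512) for _ in range(12)]
weights = np.zeros(12)  # uniform to start; learned during fine-tuning
features = weighted_layer_sum(layers, weights)
print(features.shape)  # (1, 500, 512)
```

In practice `weights` would be a trainable parameter optimised jointly with the downstream head while the backbone stays frozen.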