# SPEAR Base (speech + audio)
This is the updated version of the SPEAR Base single-domain (speech-only) model. Compared to the first version, this model is trained with token mixing for improved performance on overlapped and noisy speech. The model adopts a Zipformer backbone with 93M parameters, consisting of 12 Zipformer stacks. It generates 512-dimensional representations at approximately 50 Hz.
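The figures above imply a simple relationship between input length and output length. A minimal sketch of that arithmetic, assuming exactly 50 output frames per second (the actual frame rate is approximate):

```python
# Illustrative arithmetic only: relate input samples to output frames
# for a model that emits 512-dim features at roughly 50 Hz.
SAMPLE_RATE = 16_000   # Hz, expected input sampling rate
FRAME_RATE = 50        # Hz, approximate output frame rate
FEATURE_DIM = 512      # output embedding size

def expected_output_shape(num_samples: int) -> tuple:
    """Approximate (frames, dims) for a mono waveform of num_samples."""
    num_frames = num_samples * FRAME_RATE // SAMPLE_RATE
    return num_frames, FEATURE_DIM

# 10 seconds of 16 kHz audio -> about 500 frames of 512-dim features
print(expected_output_shape(160_000))  # (500, 512)
```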
The model was pre-trained on 84k hours of unlabelled English speech data from the following datasets:
| Dataset | Duration (hours) |
|---|---|
| Libriheavy | 50,000 |
| Gigaspeech | 10,000 |
| VoxPopuli (en) | 24,000 |
Note: The model is pre-trained on speech data sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland
## Abstract

Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.
## Usage
This model is pre-trained solely on unlabelled data. It therefore requires fine-tuning with labelled data for downstream tasks such as automatic speech recognition (ASR).
The model achieves the following performance when fine-tuned on LibriSpeech for ASR:
| Fine-tuning data | test-clean (WER %) | test-other (WER %) |
|---|---|---|
| LS960 | 1.9 | 4.0 |
You can, however, extract its top-layer features (and intermediate hidden states) using the following code:
```python
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "marcoyang/spear-base-speech-v2",
    trust_remote_code=True,
)
if torch.cuda.is_available():
    model = model.to("cuda")
model.eval()

device = next(model.parameters()).device
audio = torch.randn(1, 160000).to(device)  # dummy 10-second audio input
audio_len = torch.tensor([160000]).to(device)

with torch.no_grad():
    outputs = model(audio, audio_len)

encoder_out = outputs["encoder_out"]            # (N, T, C)
encoder_out_lens = outputs["encoder_out_lens"]  # (N,)
middle_out = outputs["hidden_states"]           # list of (N, T, C)

print(encoder_out)
print(encoder_out_lens)
print(len(middle_out))  # 12 layers
print(middle_out[-1].shape)
print(middle_out[-1])
```
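For downstream probing, a common recipe (used by SUPERB-style benchmarks) is a learnable weighted sum over the intermediate hidden states rather than using only the top layer. A minimal sketch, using random tensors as stand-ins for the 12 hidden states above (the names `middle_out` and `layer_weights` are illustrative, not part of this model's API):

```python
import torch

# Stand-in for outputs["hidden_states"]: 12 layers of (N, T, C) features.
num_layers, N, T, C = 12, 1, 500, 512
middle_out = [torch.randn(N, T, C) for _ in range(num_layers)]

# Learnable per-layer weights, trained jointly with the downstream head.
layer_weights = torch.nn.Parameter(torch.zeros(num_layers))

stacked = torch.stack(middle_out, dim=0)                   # (L, N, T, C)
weights = torch.softmax(layer_weights, dim=0)              # (L,), sums to 1
pooled = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (N, T, C)
print(pooled.shape)  # torch.Size([1, 500, 512])
```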