# SPEAR Base (speech + audio)
This is the updated version of the SPEAR Base single-domain (speech-only) model. Compared to the first version, this model is trained with token mixing for improved performance on overlapped and noisy speech. The model adopts a Zipformer backbone with 93M parameters, consisting of 12 Zipformer stacks. It generates 512-dimensional representations at approximately 50 Hz.
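The figures above imply a simple relationship between input length and output length. A minimal sketch of that arithmetic, assuming exactly 50 output frames per second (the actual frame rate is approximate):

```python
# Illustrative arithmetic only: relate input samples to output frames
# for a model that emits 512-dim features at roughly 50 Hz.
SAMPLE_RATE = 16_000   # Hz, expected input sampling rate
FRAME_RATE = 50        # Hz, approximate output frame rate
FEATURE_DIM = 512      # output embedding size

def expected_output_shape(num_samples: int) -> tuple:
    """Approximate (frames, dims) for a mono waveform of num_samples."""
    num_frames = num_samples * FRAME_RATE // SAMPLE_RATE
    return num_frames, FEATURE_DIM

# 10 seconds of 16 kHz audio -> about 500 frames of 512-dim features
print(expected_output_shape(160_000))  # (500, 512)
```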
The model was pre-trained on 84k hours of unlabelled English speech data from the following datasets:
| Dataset | Duration (hours) |
|---|---|
| Libriheavy | 50,000 |
| Gigaspeech | 10,000 |
| VoxPopuli (en) | 24,000 |
Note: The model is pre-trained on speech data sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland
## Abstract

Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.
## Usage
This model is pre-trained solely on unlabelled data. It therefore requires fine-tuning with labelled data for downstream tasks such as automatic speech recognition (ASR).
The model achieves the following performance when fine-tuned on LibriSpeech for ASR:
| Fine-tuning data | test-clean (WER %) | test-other (WER %) |
|---|---|---|
| LS960 | 1.9 | 4.0 |
You can, however, extract its top-layer features (and intermediate hidden states) using the following code:
```python
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "marcoyang/spear-base-speech-v2",
    trust_remote_code=True,
)
if torch.cuda.is_available():
    model = model.to("cuda")
model.eval()

device = next(model.parameters()).device
audio = torch.randn(1, 160000).to(device)  # dummy 10-second audio input
audio_len = torch.tensor([160000]).to(device)

with torch.no_grad():
    outputs = model(audio, audio_len)

encoder_out = outputs["encoder_out"]            # (N, T, C)
encoder_out_lens = outputs["encoder_out_lens"]  # (N,)
middle_out = outputs["hidden_states"]           # list of (N, T, C)

print(encoder_out)
print(encoder_out_lens)
print(len(middle_out))  # 12 layers
print(middle_out[-1].shape)
print(middle_out[-1])
```
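For downstream probing, a common recipe (used by SUPERB-style benchmarks) is a learnable weighted sum over the intermediate hidden states rather than using only the top layer. A minimal sketch, using random tensors as stand-ins for the 12 hidden states above (the names `middle_out` and `layer_weights` are illustrative, not part of this model's API):

```python
import torch

# Stand-in for outputs["hidden_states"]: 12 layers of (N, T, C) features.
num_layers, N, T, C = 12, 1, 500, 512
middle_out = [torch.randn(N, T, C) for _ in range(num_layers)]

# Learnable per-layer weights, trained jointly with the downstream head.
layer_weights = torch.nn.Parameter(torch.zeros(num_layers))

stacked = torch.stack(middle_out, dim=0)                   # (L, N, T, C)
weights = torch.softmax(layer_weights, dim=0)              # (L,), sums to 1
pooled = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (N, T, C)
print(pooled.shape)  # torch.Size([1, 500, 512])
```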