SPEAR Base (speech + general audio)
This is the updated version of the SPEAR Base dual-domain (speech + general audio) model. Compared to the first version, this model is trained with token mixing for enhanced performance on overlapped/noisy speech. The model adopts a Zipformer backbone with 93M parameters, consisting of 12 Zipformer stacks. It generates 512-dimensional representations at approximately 50 Hz.
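The relationship between the 16 kHz input rate and the ~50 Hz output rate implies a nominal subsampling factor, which can be used to estimate how many feature frames a clip produces. A small sketch (the exact frame count in practice depends on the encoder's padding and convolution details, so treat this as the nominal figure):

```python
SAMPLE_RATE = 16_000   # Hz, model input sample rate
FRAME_RATE = 50        # Hz, approximate output frame rate

subsampling = SAMPLE_RATE // FRAME_RATE   # 320 input samples per output frame

num_samples = 10 * SAMPLE_RATE            # a 10-second clip
num_frames = num_samples // subsampling   # nominal number of 512-dim frames
print(num_frames)                         # 500
```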
This model was pre-trained on 97k hours of mixed English speech and general audio data, of which 84k hours are speech and the remaining 13k hours are general audio. It achieves competitive performance (compared with models of similar size) on the SUPERB and HEAR benchmarks.
The speech data consists of the following datasets:
| Dataset | Duration (hours) |
|---|---|
| Libriheavy | ~50k |
| Gigaspeech | ~10k |
| VoxPopuli (en) | ~24k |
The audio data consists of the following datasets:
| Dataset | Duration (hours) |
|---|---|
| AudioSet | ~5k |
| Freesound | ~2.8k |
| Music4all | ~1k |
| VGGSound | ~0.5k |
| MTG-Jamendo | ~3.8k |
Note: The model is pre-trained on 16 kHz sampled speech/audio data. When using the model, make sure that your input is also sampled at 16 kHz.
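If your audio is recorded at a different sample rate, it needs to be resampled to 16 kHz before being fed to the model. A minimal sketch using `scipy.signal.resample_poly` (one of several options; `torchaudio` or `librosa` resampling would work equally well — the rates and array below are illustrative):

```python
import numpy as np
from scipy.signal import resample_poly

orig_sr, target_sr = 44100, 16000                      # example source rate -> model rate
waveform = np.random.randn(orig_sr).astype(np.float32)  # dummy 1-second clip at 44.1 kHz

# Polyphase resampling with the reduced up/down ratio
g = np.gcd(orig_sr, target_sr)                          # 100
resampled = resample_poly(waveform, target_sr // g, orig_sr // g)
print(resampled.shape)                                  # (16000,), one second at 16 kHz
```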
Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland
Abstract
Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.
Usage
This model is pre-trained purely using unlabelled data. Therefore, it requires fine-tuning with labelled data for downstream tasks such as automatic speech recognition (ASR) or audio tagging (AT).
The model achieves the following word error rates (WERs) when fine-tuned on LibriSpeech for ASR:
| Fine-tuning data | test-clean | test-other |
|---|---|---|
| LS960 | 1.9 | 4.2 |
The model achieves the following mean average precision (mAP) when fine-tuned on AudioSet for AT:
| Fine-tuning data | mAP |
|---|---|
| AudioSet Balanced | 39.1 |
| AudioSet Full | 48.4 |
You can extract the top-layer features (and intermediate hidden states) using the following code:

```python
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "marcoyang/spear-base-speech-audio-v2",
    trust_remote_code=True,
)
if torch.cuda.is_available():
    model = model.to("cuda")
model.eval()
device = next(model.parameters()).device

audio = torch.randn(1, 160000).to(device)  # dummy 10-second audio input at 16 kHz
audio_len = torch.tensor([160000]).to(device)

with torch.no_grad():
    outputs = model(audio, audio_len)

encoder_out = outputs["encoder_out"]            # (N, T, C)
encoder_out_lens = outputs["encoder_out_lens"]  # (N,)
middle_out = outputs["hidden_states"]           # list of (N, T, C)

print(encoder_out)
print(encoder_out_lens)
print(len(middle_out))       # 12 layers
print(middle_out[-1].shape)
print(middle_out[-1])
```
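For downstream probing, a common way to use the intermediate hidden states (not specific to this model card) is a SUPERB-style learnable weighted sum over layers. A minimal sketch with dummy tensors standing in for `middle_out`; the shapes and the probe setup are illustrative assumptions:

```python
import torch

# Stand-in for middle_out: 12 hidden states of shape (N, T, C)
num_layers, N, T, C = 12, 1, 500, 512
hidden_states = [torch.randn(N, T, C) for _ in range(num_layers)]

# Layer weights would be learned jointly with the downstream probe
layer_weights = torch.nn.Parameter(torch.zeros(num_layers))

stacked = torch.stack(hidden_states, dim=0)               # (L, N, T, C)
weights = torch.softmax(layer_weights, dim=0)             # normalised to sum to 1
pooled = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0) # (N, T, C)
print(pooled.shape)  # torch.Size([1, 500, 512])
```

With zero-initialised weights, this starts as a uniform average over layers and lets the probe learn which layers matter most for its task.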