marcoyang/spear-base-speech-audio

marcoyang committed on Feb 9

Commit 60e4f01 · verified · 1 Parent(s): db89ec8

Update README.md

Files changed (1)

README.md +13 -2

README.md CHANGED

@@ -1,6 +1,17 @@
 # SPEAR Base (speech + general audio)
-This is the [SPEAR](https://arxiv.org/abs/2510.25955) Base dual-domain (speech + general audio) model. The model adopts a [Zipformer](https://arxiv.org/abs/2310.11230) backbone with 327M parameters consisting of 112 Zipformer stacks. It generates 512-dimensional representations at approximately 50~Hz.
 This model was pre-trained on 97k hours of mixture data of English speech and general audio, among which 84k hours are speech data, and the rest 13k hours are general audio data. It achieves competitive performance (compared with models with similar sizes) on [SUPERB](https://arxiv.org/abs/2105.01051) benchmark and on [HEAR](https://arxiv.org/abs/2203.03022) benchmark.
@@ -27,7 +38,7 @@ The audio data consists of the following datasets:
-[Paper](https://arxiv.org/abs/2510.25955)
 Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland

+---
+license: apache-2.0
+---
 # SPEAR Base (speech + general audio)
+## UPDATE (2026.Feb)
+We have an [**updated version**](https://huggingface.co/marcoyang/spear-base-speech-audio-v2) of this model with enhanced capability on overlapped/noisy speech.
+**We recommend using the updated version of the model**. Please refer to our [paper](https://arxiv.org/abs/2510.25955) for more detail.
+---
+This is the first version [SPEAR](https://arxiv.org/abs/2510.25955v1) Base dual-domain (speech + general audio) model. The model adopts a [Zipformer](https://arxiv.org/abs/2310.11230) backbone with 93M parameters consisting of 12 Zipformer stacks. It generates 512-dimensional representations at approximately 50~Hz.
 This model was pre-trained on 97k hours of mixture data of English speech and general audio, among which 84k hours are speech data, and the rest 13k hours are general audio data. It achieves competitive performance (compared with models with similar sizes) on [SUPERB](https://arxiv.org/abs/2105.01051) benchmark and on [HEAR](https://arxiv.org/abs/2203.03022) benchmark.
+[Paper](https://arxiv.org/abs/2510.25955v1)
 Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland
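
For readers sizing downstream buffers, the figures quoted in the README (512-dimensional representations at approximately 50 Hz) imply a rough output shape per clip. The snippet below is only a back-of-the-envelope sketch: the function name and constants are hypothetical, the frame rate is treated as exactly 50 Hz although the README says "approximately", and the real model may pad or trim frames at the clip boundaries.

```python
import math

# Both values are taken from the README text; the names are illustrative only.
FRAME_RATE_HZ = 50  # README says "approximately 50 Hz"; assumed exact here
EMBED_DIM = 512     # representation size stated in the README


def expected_output_shape(duration_s: float) -> tuple[int, int]:
    """Estimate (num_frames, embed_dim) for a clip of the given duration in seconds."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    # Floor the frame count; the actual count may differ by a few frames.
    return (math.floor(duration_s * FRAME_RATE_HZ), EMBED_DIM)


if __name__ == "__main__":
    print(expected_output_shape(10.0))  # (500, 512)
```

Under these assumptions a 10-second clip maps to about 500 frames of 512 values each; check the shape returned by the actual model before relying on it.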