Commit 60e4f01 (verified) by marcoyang · 1 parent: db89ec8

Update README.md

Files changed (1)
  1. README.md +13 -2
README.md CHANGED
@@ -1,6 +1,17 @@
+ ---
+ license: apache-2.0
+ ---
+
  # SPEAR Base (speech + general audio)
 
- This is the [SPEAR](https://arxiv.org/abs/2510.25955) Base dual-domain (speech + general audio) model. The model adopts a [Zipformer](https://arxiv.org/abs/2310.11230) backbone with 327M parameters consisting of 112 Zipformer stacks. It generates 512-dimensional representations at approximately 50~Hz.
+ ## UPDATE (2026.Feb)
+
+ We have an [**updated version**](https://huggingface.co/marcoyang/spear-base-speech-audio-v2) of this model with enhanced capability on overlapped/noisy speech.
+ **We recommend using the updated version of the model**. Please refer to our [paper](https://arxiv.org/abs/2510.25955) for more detail.
+
+ ---
+
+ This is the first version [SPEAR](https://arxiv.org/abs/2510.25955v1) Base dual-domain (speech + general audio) model. The model adopts a [Zipformer](https://arxiv.org/abs/2310.11230) backbone with 93M parameters consisting of 12 Zipformer stacks. It generates 512-dimensional representations at approximately 50~Hz.
 
  This model was pre-trained on 97k hours of mixture data of English speech and general audio, among which 84k hours are speech data, and the rest 13k hours are general audio data. It achieves competitive performance (compared with models with similar sizes) on [SUPERB](https://arxiv.org/abs/2105.01051) benchmark and on [HEAR](https://arxiv.org/abs/2203.03022) benchmark.
 
@@ -27,7 +38,7 @@ The audio data consists of the following datasets:
 
 
 
- [Paper](https://arxiv.org/abs/2510.25955)
+ [Paper](https://arxiv.org/abs/2510.25955v1)
 
  Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland
 
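
The README text in this diff says the model generates 512-dimensional representations at approximately 50 Hz. As a rough illustration of what that implies for downstream code, the Python sketch below estimates the output shape for a clip of a given duration. The function name `expected_output_shape` and the floor rounding are assumptions made for illustration only; the real frame count depends on the model's front-end framing and padding, which this page does not specify, so treat the result as an approximation.

```python
import math
from typing import Tuple

# Figures quoted in the README text; the frame rate is stated as approximate.
FRAME_RATE_HZ = 50
EMBED_DIM = 512


def expected_output_shape(duration_s: float) -> Tuple[int, int]:
    """Estimate (num_frames, embed_dim) for an audio clip.

    Hypothetical helper: it only multiplies duration by the nominal
    frame rate and rounds down. It does not run or load the model.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    num_frames = math.floor(duration_s * FRAME_RATE_HZ)
    return num_frames, EMBED_DIM


if __name__ == "__main__":
    # A 10-second clip yields roughly 10 * 50 = 500 frames of size 512.
    print(expected_output_shape(10.0))
```

An estimate like this can help size buffers or sanity-check the shape of extracted features, but the shape returned by the actual model should be taken as authoritative.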