lhndzn committed on
Commit
c597935
·
verified ·
1 Parent(s): 2df83b3

Upload 8 files

mossformer2/.gitattributes ADDED
@@ -0,0 +1,32 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *.tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.db* filter=lfs diff=lfs merge=lfs -text
+ *.ark* filter=lfs diff=lfs merge=lfs -text
+ **/*ckpt*data* filter=lfs diff=lfs merge=lfs -text
+ **/*ckpt*.meta filter=lfs diff=lfs merge=lfs -text
+ **/*ckpt*.index filter=lfs diff=lfs merge=lfs -text
mossformer2/README.md ADDED
@@ -0,0 +1,117 @@
+ ---
+ tasks:
+   - speech-separation
+ widgets:
+   - task: speech-separation
+     inputs:
+       - type: audio
+         name: input
+         title: Audio recorded by microphone
+         displayProps:
+           sampleRate: 8000
+         validator:
+           max_size: 10M
+     output:
+       displayProps:
+         audio:
+           sampleRate: 8000
+     examples:
+       - name: 1
+         title: Example 1
+         inputs:
+           - name: input
+             data: git://examples/mix_speech1.wav
+       - name: 2
+         title: Example 2
+         inputs:
+           - name: input
+             data: git://examples/mix_speech.wav
+     inferencespec:
+       cpu: 1
+       memory: 1000
+       gpu: 0
+       gpu_memory: 3000
+ model_type:
+   - mossformer
+ domain:
+   - audio
+ frameworks:
+   - pytorch
+ model-backbone:
+   - mossformer2
+ customized-quickstart: True
+ finetune-support: True
+ license: Apache License 2.0
+ tags:
+   - Alibaba
+   - Audio
+   - Speech Separation
+   - 语音分离
+ ---
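+
+ # Introduction to the MossFormer2 speech separation model
+
+ This release upgrades MossFormer, the previous-generation monaural speech separation algorithm, and achieves a significant performance gain over MossFormer on the monaural speech separation task. The MossFormer model relies mainly on self-attention-based MossFormer modules, which tend to emphasize longer-range, coarse-scale dependencies and fall short in modeling finer-scale recurrent patterns effectively. In MossFormer2, we introduce a novel hybrid model that integrates a recurrent module into the MossFormer framework, so the model can capture both longer-range, coarse-scale dependencies and finer-scale recurrent patterns. To mitigate the limitations of non-parallelizable recurrent neural networks (RNNs), we propose an RNN-free recurrent module based on the feedforward sequential memory network (FSMN). The recurrent module contains a dilated FSMN block, which enlarges the receptive field through dilation and improves information flow through dense connections. We also design the recurrent module with gated convolutional units (GCUs), which facilitate gated control of relevant contextual information while reducing the embedding dimension and improving model efficiency. The recurrent module relies on linear projections and convolutions, so the entire sequence can be processed in parallel. MossFormer2 performs strongly on the WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks, surpassing MossFormer and other state-of-the-art methods.
+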
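+ ## How to use the model
+
+ The model pipeline takes a single-channel (mono) WAV file sampled at 8000 Hz that contains the overlapping speech of two speakers, and outputs two separated mono audio signals.
+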
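The input requirement above (mono, 8000 Hz) can be checked before calling the pipeline. The following is a minimal sketch using only the Python standard library; the function name `check_wav_format` is hypothetical and not part of ModelScope, and the `wave` module only reads uncompressed PCM WAV files, so other encodings would raise `wave.Error`.

```python
import wave


def check_wav_format(path, expected_rate=8000, expected_channels=1):
    """Return a list of format problems; an empty list means the file looks usable.

    Hypothetical helper: it only inspects the WAV header, not the audio content.
    """
    with wave.open(path, 'rb') as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()

    problems = []
    if rate != expected_rate:
        problems.append(f'sample rate is {rate} Hz, expected {expected_rate} Hz')
    if channels != expected_channels:
        problems.append(f'file has {channels} channel(s), expected {expected_channels}')
    return problems
```

A file that fails this check would need to be resampled or downmixed with an audio tool of your choice before it is passed to the pipeline.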
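+ ### Environment setup
+
+ * This model supports Linux, Windows, and macOS.
+ * This model uses the third-party library SoundFile to process WAV files. **On Linux, users must manually install libsndfile, the underlying library that SoundFile depends on.** On Windows and macOS it is installed automatically and no user action is needed. For details, see the [SoundFile documentation](https://github.com/bastibe/python-soundfile#installation). On Ubuntu, for example, run:
+
+ ```shell
+ sudo apt-get update
+ sudo apt-get install libsndfile1
+ ```
+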
+ ### Code example
+
+ ```python
+ import numpy
+ import soundfile as sf
+ from modelscope.pipelines import pipeline
+ from modelscope.utils.constant import Tasks
+
+ # input_path can be a URL or a local file path
+ input_path = 'https://modelscope.cn/api/v1/models/damo/speech_mossformer2_separation_temporal_8k/repo?Revision=master&FilePath=examples/mix_speech1.wav'
+ separation = pipeline(
+     Tasks.speech_separation,
+     model='damo/speech_mossformer2_separation_temporal_8k')
+ result = separation(input_path)
+ for i, signal in enumerate(result['output_pcm_list']):
+     save_file = f'output_spk{i}.wav'
+     sf.write(save_file, numpy.frombuffer(signal, dtype=numpy.int16), 8000)
+ ```
+
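The example above reads each entry of `output_pcm_list` with `numpy.frombuffer(..., dtype=numpy.int16)`, which suggests the entries are raw 16-bit PCM bytes. Under that assumption, and assuming little-endian sample order, the separated signals could also be saved with the standard library alone. The helper names below are hypothetical.

```python
import wave


def write_pcm16_wav(path, pcm_bytes, sample_rate=8000, channels=1):
    """Wrap raw 16-bit PCM bytes in a WAV container (assumes little-endian samples)."""
    with wave.open(path, 'wb') as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)


def save_separated(pcm_list, prefix='output_spk', sample_rate=8000):
    """Write one WAV file per separated speaker and return the file paths."""
    paths = []
    for i, pcm_bytes in enumerate(pcm_list):
        path = f'{prefix}{i}.wav'
        write_pcm16_wav(path, pcm_bytes, sample_rate=sample_rate)
        paths.append(path)
    return paths
```

With the pipeline result from the example, this would be called as `save_separated(result['output_pcm_list'])`.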
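+ ### Model limitations
+
+ Although the training data was chosen to cover a wide range of noise and reverberation conditions, and some telephone-channel data was also included, the training data is limited and cannot cover every noise, reverberation, or telephone scenario. Separation quality therefore cannot be guaranteed on every kind of mixture.
+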
+ ## Training data
+ The model was trained on clean speech mixtures from WSJ0-2Mix and Libri2Mix, together with noisy and reverberant data from WHAMR! and the DNS Challenge 2020.
+
+ ## Evaluation and results
+ The tables below compare MossFormer2 with other SOTA models on the public datasets WSJ0-2mix/3mix, Libri2Mix, and WHAM!/WHAMR!. Note that these are reference results for MossFormer2 models trained on each respective dataset, not test results of the model released here.
+
+ <div align=center>
+ <div>Table 1. Comparison on the public datasets WSJ0-2mix/3mix and Libri2Mix</div>
+ <img width="640" src="description/matrix1.png"/>
+ </div>
+
+ <div align=center>
+ <div>Table 2. Comparison on the public datasets WHAM!/WHAMR!</div>
+ <img width="640" src="description/matrix2.png"/>
+ </div>
+
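+ ### Metrics
+
+ * SI-SNR (Scale-Invariant Signal-to-Noise Ratio) is a signal-to-noise ratio that uses normalization to remove the effect of changes in signal scale from the ordinary SNR. It is a standard measure for speech enhancement algorithms that target broadband noise distortion. SI-SNRi (SI-SNR improvement) measures how much the SI-SNR of the separated speech improves over that of the original mixture.
+
+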
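The metric described above can be sketched in a few lines. This follows the commonly used definition of SI-SNR (remove the mean, project the estimate onto the reference, then compare the projected energy with the residual energy); it is an illustration and not necessarily the evaluation script behind the reported tables. The function names and the `eps` stabilizer are assumptions.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    n = len(reference)
    if n == 0 or len(estimate) != n:
        raise ValueError('estimate and reference must be non-empty and equal in length')

    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference: this is the "target" component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Whatever is left after the projection counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

Because the estimate is projected onto the reference, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is the sense in which the metric is scale invariant.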
+ ### Related paper and citation
+
+ Zhao, Shengkui and Ma, Bin et al., “MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation”, submitted to ICASSP 2024.
+
mossformer2/configuration.json ADDED
@@ -0,0 +1,10 @@
+ {
+     "framework": "pytorch",
+     "task": "speech-separation",
+     "model": {
+         "type": "speech_mossformer2_separation_temporal_8k"
+     },
+     "pipeline": {
+         "type": "speech_mossformer2_separation_temporal_8k"
+     }
+ }
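As a quick sanity check on the configuration above, the file can be parsed to confirm that the model and pipeline entries name the same type. This is a sketch with a hypothetical helper name; how ModelScope itself resolves these fields is not shown in this commit.

```python
import json

# Contents of configuration.json as added in this commit.
CONFIG_TEXT = '''
{
    "framework": "pytorch",
    "task": "speech-separation",
    "model": {"type": "speech_mossformer2_separation_temporal_8k"},
    "pipeline": {"type": "speech_mossformer2_separation_temporal_8k"}
}
'''


def summarize_config(text):
    """Parse the JSON config and report whether model and pipeline types agree."""
    cfg = json.loads(text)
    model_type = cfg['model']['type']
    pipeline_type = cfg['pipeline']['type']
    return {
        'framework': cfg['framework'],
        'task': cfg['task'],
        'model_type': model_type,
        'pipeline_type': pipeline_type,
        'types_match': model_type == pipeline_type,
    }
```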
mossformer2/description/matrix1.png ADDED
mossformer2/description/matrix2.png ADDED
mossformer2/examples/mix_speech.wav ADDED
Binary file (141 kB)
mossformer2/examples/mix_speech1.wav ADDED
Binary file (40.8 kB)
pytorch_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4f238ff0ae1409eff9f6caf76502576976bd45188cd73eba19124688bc442b19
+ size 223483621
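The three lines above are a Git LFS pointer: the repository stores this small text file, while the roughly 223 MB checkpoint lives in LFS storage. A downloaded `pytorch_model.pt` can be checked against the pointer's size and SHA-256 digest. The sketch below uses only the standard library; the function names are hypothetical.

```python
import hashlib

# Pointer file contents as added in this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:4f238ff0ae1409eff9f6caf76502576976bd45188cd73eba19124688bc442b19
size 223483621
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines of an LFS pointer into a small dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(' ')
        fields[key] = value
    hash_algo, _, digest = fields['oid'].partition(':')
    return {
        'version': fields['version'],
        'hash_algo': hash_algo,
        'digest': digest,
        'size': int(fields['size']),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at path has the size and digest the pointer records."""
    hasher = hashlib.new(pointer['hash_algo'])
    total = 0
    with open(path, 'rb') as f:
        # Read in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
            total += len(chunk)
    return total == pointer['size'] and hasher.hexdigest() == pointer['digest']
```

For example, `matches_pointer('pytorch_model.pt', parse_lfs_pointer(POINTER_TEXT))` should return `True` for an intact download of the checkpoint.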