Codex Bot committed on
Commit 4298bdc
0 Parent(s):

Mirror speechbrain/sepformer-wsj02mix

Files changed (10)
  1. .gitattributes +26 -0
  2. README.md +123 -0
  3. brain.ckpt +3 -0
  4. config.json +3 -0
  5. decoder.ckpt +3 -0
  6. encoder.ckpt +3 -0
  7. hyperparams.yaml +65 -0
  8. hyperparams_train.yaml +163 -0
  9. masknet.ckpt +3 -0
  10. test_mixture.wav +3 -0
.gitattributes ADDED
@@ -0,0 +1,26 @@
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
masknet.ckpt filter=lfs diff=lfs merge=lfs -text
optimizer.ckpt filter=lfs diff=lfs merge=lfs -text
brain.ckpt filter=lfs diff=lfs merge=lfs -text
counter.ckpt filter=lfs diff=lfs merge=lfs -text
decoder.ckpt filter=lfs diff=lfs merge=lfs -text
encoder.ckpt filter=lfs diff=lfs merge=lfs -text
lr_scheduler.ckpt filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.wav filter=lfs diff=lfs merge=lfs -text
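Each line above tells Git to route files matching the glob through the LFS filter, so the repository stores small pointer files instead of the binary blobs. For the simple `*.ext` patterns used here, matching behaves like shell-style globbing; the following stdlib sketch (a simplified illustration, not Git's actual matcher, and using only a subset of the patterns) checks which files in this commit are LFS-tracked:

```python
from fnmatch import fnmatch

# Subset of the LFS globs from .gitattributes above (illustration only)
PATTERNS = ["*.ckpt", "*.wav", "*.safetensors", "*.pt", "*.pth"]

def lfs_tracked(filename):
    """True if the filename matches any of the simplified LFS globs."""
    return any(fnmatch(filename, pattern) for pattern in PATTERNS)

print(lfs_tracked("masknet.ckpt"))     # True
print(lfs_tracked("config.json"))      # False
```

This is why `config.json` and the YAML files appear as plain text in the diff, while the `.ckpt` and `.wav` entries show up as LFS pointers.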
README.md ADDED
@@ -0,0 +1,123 @@
---
language: "en"
thumbnail:
tags:
- Source Separation
- Speech Separation
- Audio Source Separation
- WSJ02Mix
- SepFormer
- Transformer
- audio-to-audio
- audio-source-separation
- speechbrain
license: "apache-2.0"
datasets:
- WSJ0-2Mix
metrics:
- SI-SNRi
- SDRi

---

<iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
<br/><br/>

# SepFormer trained on WSJ0-2Mix

This repository provides all the necessary tools to perform audio source separation with a [SepFormer](https://arxiv.org/abs/2010.13154v2) model, implemented with SpeechBrain and pretrained on the WSJ0-2Mix dataset. For a better experience, we encourage you to learn more about [SpeechBrain](https://speechbrain.github.io). The model reaches 22.4 dB SI-SNRi on the test set of the WSJ0-2Mix dataset.

| Release | Test-Set SI-SNRi | Test-Set SDRi |
|:--------:|:----------------:|:-------------:|
| 09-03-21 | 22.4 dB | 22.6 dB |

You can listen to example results obtained on the test set of WSJ0-2/3Mix [here](https://sourceseparationresearch.com/static/sepformer_example_results/sepformer_results.html).

## Install SpeechBrain

First of all, please install SpeechBrain with the following command:

```bash
pip install speechbrain
```

Please note that we encourage you to read our tutorials and learn more about [SpeechBrain](https://speechbrain.github.io).

### Perform source separation on your own audio file

```python
from speechbrain.inference.separation import SepformerSeparation as separator
import torchaudio

model = separator.from_hparams(source="speechbrain/sepformer-wsj02mix", savedir='pretrained_models/sepformer-wsj02mix')

# for custom file, change path
est_sources = model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav')

# est_sources has shape [batch, time, num_spks]; save each estimated source at 8 kHz
torchaudio.save("source1hat.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("source2hat.wav", est_sources[:, :, 1].detach().cpu(), 8000)
```

The system expects input recordings sampled at 8 kHz (single channel). If your signal has a different sample rate, resample it (e.g., using torchaudio or sox) before using the interface.
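To illustrate why the rate matters, here is a toy integer-factor downsampler that averages each block of samples; this is only a sketch of the idea, since a real resampler applies a proper anti-aliasing low-pass filter, so in practice prefer `torchaudio.functional.resample` or sox:

```python
def downsample(samples, factor):
    """Naively downsample by an integer factor by averaging each block
    of `factor` samples. Toy illustration only: real resamplers apply
    an anti-aliasing filter before decimating."""
    n = len(samples) // factor
    return [sum(samples[i * factor:(i + 1) * factor]) / factor
            for i in range(n)]

# A 1-second signal at 16 kHz becomes 8000 samples at the model's 8 kHz rate.
sig_16k = [0.0] * 16000
sig_8k = downsample(sig_16k, 2)
print(len(sig_8k))  # 8000
```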

### Inference on GPU
To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.

### Training
The model was trained with SpeechBrain (commit fc2eabb7).
To train it from scratch, follow these steps:
1. Clone SpeechBrain:
```bash
git clone https://github.com/speechbrain/speechbrain/
```
2. Install it:
```bash
cd speechbrain
pip install -r requirements.txt
pip install -e .
```
3. Run training:
```bash
cd recipes/WSJ0Mix/separation
python train.py hparams/sepformer.yaml --data_folder=your_data_folder
```

You can find our training results (models, logs, etc.) [here](https://drive.google.com/drive/folders/1cON-eqtKv_NYnJhaE9VjLT_e2ybn-O7u?usp=sharing).

### Limitations
The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

#### Referencing SpeechBrain

```bibtex
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```

#### Referencing SepFormer
```bibtex
@inproceedings{subakan2021attention,
  title={Attention is All You Need in Speech Separation},
  author={Cem Subakan and Mirco Ravanelli and Samuele Cornell and Mirko Bronzi and Jianyuan Zhong},
  year={2021},
  booktitle={ICASSP 2021}
}
```

# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/
brain.ckpt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d9e24193f36931b7f57932532efbdcf64971f42732383ba6808825f77db258f6
size 28
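The `.ckpt` entries in this commit are not the weights themselves but Git LFS pointer files: three text lines giving the spec version, the SHA-256 digest of the real blob, and its size in bytes. A small stdlib-only sketch (a hypothetical helper, not part of any Git tooling) that parses such a pointer:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict.
    Each line is `key value`; the oid carries an algorithm prefix."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "oid": digest, "size": int(fields["size"])}

# The brain.ckpt pointer from this commit
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:d9e24193f36931b7f57932532efbdcf64971f42732383ba6808825f77db258f6
size 28
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 28
```

Note how small some of these blobs are: `brain.ckpt` is 28 bytes, while `masknet.ckpt` (the actual SepFormer mask network) is about 113 MB.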
config.json ADDED
@@ -0,0 +1,3 @@
{
    "speechbrain_interface": "SepformerSeparation"
}
decoder.ckpt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:abea1a2d41151331b4c36071d1b3205aed940a189721f008b12a703e9c63e7e4
size 17202
encoder.ckpt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3139bb880b29ea77ae8a168b8f2ad6e8eb5c2c0904289676c223d0e93cd2a35d
size 17267
hyperparams.yaml ADDED
@@ -0,0 +1,65 @@
# ################################
# Model: Inference for source separation with SepFormer
# https://arxiv.org/abs/2010.13154
# Generated from speechbrain/recipes/WSJ0Mix/separation/train/hparams/sepformer-wsj02mix.yaml
# Dataset : wsj02mix
# ###############################

# Parameters
sample_rate: 8000
num_spks: 2

# Specifying the network
Encoder: !new:speechbrain.lobes.models.dual_path.Encoder
    kernel_size: 16
    out_channels: 256

SBtfintra: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: 256
    nhead: 8
    d_ffn: 1024
    dropout: 0
    use_positional_encoding: true
    norm_before: true

SBtfinter: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: 256
    nhead: 8
    d_ffn: 1024
    dropout: 0
    use_positional_encoding: true
    norm_before: true

MaskNet: !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
    num_spks: !ref <num_spks>
    in_channels: 256
    out_channels: 256
    num_layers: 2
    K: 250
    intra_model: !ref <SBtfintra>
    inter_model: !ref <SBtfinter>
    norm: ln
    linear_layer_after_inter_intra: false
    skip_around_intra: true

Decoder: !new:speechbrain.lobes.models.dual_path.Decoder
    in_channels: 256
    out_channels: 1
    kernel_size: 16
    stride: 8
    bias: false

modules:
    encoder: !ref <Encoder>
    decoder: !ref <Decoder>
    masknet: !ref <MaskNet>

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        masknet: !ref <MaskNet>
        encoder: !ref <Encoder>
        decoder: !ref <Decoder>
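The `!new:` and `!ref` tags in this file come from SpeechBrain's HyperPyYAML extensions: `!new:pkg.mod.Class` imports the class named by the dotted path and instantiates it with the nested keys as keyword arguments, while `!ref <Name>` points back at an object defined earlier in the same file. A minimal stdlib-only sketch of the `!new:` idea (not SpeechBrain's actual loader) for a single node:

```python
import importlib

def instantiate(dotted_path, **kwargs):
    """Import `pkg.mod.Class` from a dotted string and call it with
    kwargs, mimicking what a `!new:` tag does for one yaml node."""
    module_name, class_name = dotted_path.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**kwargs)

# The moral equivalent of `!new:collections.Counter` with keyword args
counter = instantiate("collections.Counter", a=2, b=1)
print(counter.most_common(1))  # [('a', 2)]
```

This is why the `modules` and `pretrainer.loadables` sections can share the very same `Encoder`, `Decoder`, and `MaskNet` objects via `!ref`: the pretrainer loads the checkpoint tensors directly into the modules the separation interface runs.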
hyperparams_train.yaml ADDED
@@ -0,0 +1,163 @@
# Generated 2021-02-26 from:
# /scratch/csubakan/speechbrain_new/recipes/WSJ2Mix/separation/yamls/dptransformer78.yaml
# yamllint disable
# ################################
# Model: SepFormer for source separation
# https://arxiv.org/abs/2010.13154
#
# Dataset : WSJ0-mix
# ################################
# Basic parameters
# Seed needs to be set at top of yaml, before objects with parameters are made
#
seed: 1234
__set_seed: !apply:torch.manual_seed [1234]

# Data params
data_folder: /localscratch/csubakan.62709298.0/wsj0-mix/2speakers # wsj2mix or wsj3mix
experiment_name: 78-speedchange-dynamicmix-hardcodegaussian
output_folder: results/78-speedchange-dynamicmix-hardcodegaussian/1234
train_log: results/78-speedchange-dynamicmix-hardcodegaussian/1234/train_log.txt
save_folder: results/78-speedchange-dynamicmix-hardcodegaussian/1234/save
train_data: results/78-speedchange-dynamicmix-hardcodegaussian/1234/save/wsj_tr.csv
valid_data: results/78-speedchange-dynamicmix-hardcodegaussian/1234/save/wsj_cv.csv
test_data: results/78-speedchange-dynamicmix-hardcodegaussian/1234/save/wsj_tt.csv
wsj0_tr: /localscratch/csubakan.62709298.0/wsj0-processed/si_tr_s/

# Experiment params
auto_mix_prec: true
test_only: false
num_spks: 2 # set to 3 for wsj0-3mix
progressbar: true
save_audio: false # Save estimated sources on disk
sample_rate: 8000

# Training parameters
N_epochs: 200
batch_size: 1
lr: 0.00015
clip_grad_norm: 5
loss_upper_lim: 999999 # this is the upper limit for an acceptable loss
# if True, the training sequences are cut to a specified length
limit_training_signal_len: false
# this is the length of sequences if we choose to limit
# the signal length of training sequences
training_signal_len: 128000
dynamic_mixing: regular

# Augment parameters
use_wavedrop: false
use_speedperturb: true
use_speedperturb_sameforeachsource: false
use_rand_shift: false
min_shift: -8000
max_shift: 8000

# Neural parameters
N_encoder_out: 256
out_channels: 256
kernel_size: 16
kernel_stride: 8

threshold_byloss: true
threshold: -30

# Dataloader options
dataloader_opts:
    batch_size: 1
    num_workers: 3

speedperturb: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    perturb_prob: 1.0
    drop_freq_prob: 0.0
    drop_chunk_prob: 0.0
    sample_rate: 8000
    speeds: [95, 100, 105]

wavedrop: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    perturb_prob: 0.0
    drop_freq_prob: 1.0
    drop_chunk_prob: 1.0
    sample_rate: 8000

Encoder: &id003 !new:speechbrain.lobes.models.dual_path.Encoder
    kernel_size: 16
    out_channels: 256

SBtfintra: &id001 !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: 256
    nhead: 8
    d_ffn: 1024
    dropout: 0
    use_positional_encoding: true
    norm_before: true

SBtfinter: &id002 !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 8
    d_model: 256
    nhead: 8
    d_ffn: 1024
    dropout: 0
    use_positional_encoding: true
    norm_before: true

MaskNet: &id005 !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
    num_spks: 2
    in_channels: 256
    out_channels: 256
    num_layers: 2
    K: 250
    intra_model: *id001
    inter_model: *id002
    norm: ln
    linear_layer_after_inter_intra: false
    skip_around_intra: true

Decoder: &id004 !new:speechbrain.lobes.models.dual_path.Decoder
    in_channels: 256
    out_channels: 1
    kernel_size: 16
    stride: 8
    bias: false

optimizer: !name:torch.optim.Adam
    lr: 0.00015
    weight_decay: 0

loss: !name:speechbrain.nnet.losses.get_si_snr_with_pitwrapper

lr_scheduler: &id007 !new:speechbrain.nnet.schedulers.ReduceLROnPlateau
    factor: 0.5
    patience: 4
    dont_halve_until_epoch: 100

epoch_counter: &id006 !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: 200

modules:
    encoder: *id003
    decoder: *id004
    masknet: *id005

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: results/78-speedchange-dynamicmix-hardcodegaussian/1234/save
    recoverables:
        encoder: *id003
        decoder: *id004
        masknet: *id005
        counter: *id006
        lr_scheduler: *id007

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: results/78-speedchange-dynamicmix-hardcodegaussian/1234/train_log.txt

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        masknet: !ref <MaskNet>
        encoder: !ref <Encoder>
        decoder: !ref <Decoder>
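The training loss above, `get_si_snr_with_pitwrapper`, optimizes scale-invariant SNR under permutation-invariant training (the metric reported in the README's results table as SI-SNRi). A minimal pure-Python sketch of SI-SNR itself, without the PIT wrapper and not SpeechBrain's implementation: project the estimate onto the target, then compare the target-aligned energy against the residual energy.

```python
import math

def si_snr(est, target):
    """Scale-invariant SNR in dB between an estimated and a target
    signal (plain Python lists). Higher is better."""
    dot = sum(e * t for e, t in zip(est, target))
    t_energy = sum(t * t for t in target)
    s_target = [dot / t_energy * t for t in target]      # projection onto target
    e_noise = [e - s for e, s in zip(est, s_target)]     # residual
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise)
    return 10 * math.log10(num / den)

# Rescaling the estimate does not change the score (scale invariance).
print(si_snr([1.0, 0.0], [1.0, 1.0]))  # 0.0
print(si_snr([0.5, 0.0], [1.0, 1.0]))  # 0.0
```

The PIT wrapper additionally evaluates this loss for every speaker permutation of the estimated sources and backpropagates through the best one, which resolves the label ambiguity inherent to separating two unordered speakers.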
masknet.ckpt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:57dd5f49bf21c5a2101bb4e46648d05d34d517a59e26f0b06646d0bebe8214c7
size 113108458
test_mixture.wav ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:73235884251e3f575adc540597bc704363ffe50bd4fc5164e96bf0d7afd3a371
size 66234