Update README.md
README.md
CHANGED
|
@@ -2,40 +2,34 @@
|
|
| 2 |
library_name: transformers
|
| 3 |
license: mit
|
| 4 |
datasets:
|
| 5 |
-
|
| 6 |
pipeline_tag: audio-to-audio
|
| 7 |
tags:
|
| 8 |
-
|
|
|
|
| 9 |
---
|
| 10 |
|
| 11 |
-
#
|
| 12 |
|
| 13 |
-
Model for the music source separation task.
|
| 14 |
-
Made a bunch of improvements to [the existing BS-RoFormer open-source implementation](https://github.com/lucidrains/BS-RoFormer).
|
| 15 |
|
| 16 |
-
A model for the music audio separation task.
|
| 17 |
|
| 18 |
-
## Model Details
|
| 19 |
|
| 20 |
-
|
| 21 |
|
| 22 |
-
|
| 23 |
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
| bass | **9.68** | 8.31 | 5.66 |
|
| 28 |
-
| drums | **10.58** | 9.55 | 6.77 |
|
| 29 |
-
| other | **8.99** | 8.14 | 6.06 |
|
| 30 |
-
| vocal | **10.86** | 10.03 | 7.44 |
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
| [Mini-BS-RoFormer](https://huggingface.co/HiDolen/Mini-BS-RoFormer) | 3068.64 |
|
| 39 |
|
| 40 |
## Uses
|
| 41 |
|
|
@@ -49,7 +43,7 @@ import soundfile
|
|
| 49 |
import torch
|
| 50 |
import librosa
|
| 51 |
|
| 52 |
-
model_name = "HiDolen/Mini-BS-RoFormer-
|
| 53 |
model = AutoModel.from_pretrained(
|
| 54 |
model_name,
|
| 55 |
trust_remote_code=True,
|
|
@@ -65,6 +59,9 @@ waveform = waveform.to("cuda")
|
|
| 65 |
# Run inference
|
| 66 |
result = model.separate(
|
| 67 |
waveform,
|
| 68 |
batch_size=2,
|
| 69 |
verbose=True,
|
| 70 |
)
|
|
@@ -74,18 +71,37 @@ for i in range(result.shape[0]):
|
|
| 74 |
soundfile.write(f"separated_stem_{i}.wav", result[i].cpu().numpy().T, 44100)
|
| 75 |
```
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
```python
|
| 80 |
-
|
| 81 |
|
| 82 |
result = model.separate(
|
| 83 |
waveform,
|
| 84 |
batch_size=2,
|
| 85 |
verbose=True,
|
| 86 |
)
|
| 87 |
|
| 88 |
-
# Merge bass, drums, and other into the accompaniment
|
| 89 |
instrumental = result[0] + result[1] + result[2]
|
| 90 |
vocals = result[3]
|
| 91 |
result = torch.stack([instrumental, vocals], dim=0)
|
|
@@ -93,27 +109,22 @@ for i in range(result.shape[0]):
|
|
| 93 |
soundfile.write(f"separated_stem_{i}.wav", result[i].cpu().numpy().T, 44100)
|
| 94 |
```
|
| 95 |
|
| 96 |
-
##
|
| 97 |
-
|
| 98 |
-
Main improvements in Mini-BS-RoFormer-V2 over the previous version:
|
| 99 |
|
| 100 |
-
|
| 101 |
-
2. The transformer layers are trained with the Muon optimizer, which speeds up convergence
|
| 102 |
-
3. The STFT applied to the input audio now uses n_fft=4096 instead of 2048. The output audio stays at n_fft=2048
|
| 103 |
-
4. freq_band changes from the original 62-band split to an 80-band split based on the mel scale
|
| 104 |
-
5. Downsampling along the time dimension with a stride of 4, which greatly reduces the compute needed for training and inference
|
| 105 |
-
6. MaskEstimator additionally predicts a gate, so silent passages in the output audio are quieter
|
| 106 |
-
7. Various other code changes
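To make the mel-based band split in item 4 more concrete, here is a minimal sketch of how 80 band edges could be placed on the STFT bins for n_fft=4096 at 44.1 kHz. The helper names, the HTK-style mel formula, and the rounding are assumptions made for illustration; the README does not say how the model actually builds its bands.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel formula (an assumption; other mel variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands=80, n_fft=4096, sample_rate=44100):
    """Return num_bands + 1 STFT bin indices spaced evenly on the mel scale.

    Hypothetical helper, not taken from the model code.
    """
    hz_per_bin = sample_rate / n_fft
    max_mel = hz_to_mel(sample_rate / 2)  # mel value at the Nyquist frequency
    edges = []
    for i in range(num_bands + 1):
        f_hz = mel_to_hz(max_mel * i / num_bands)
        edges.append(round(f_hz / hz_per_bin))
    return edges


edges = mel_band_edges()
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print(f"{len(widths)} bands, narrowest {min(widths)} bins, widest {max(widths)} bins")
```

Bands placed this way are narrow at low frequencies and wide at high frequencies, which is the usual motivation for a mel-based split.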
|
| 107 |
|
| 108 |
-
|
| 109 |
|
| 110 |
-
|
| 111 |
|
| 112 |
-
|
| 113 |
|
| 114 |
-
|
| 115 |
|
| 116 |
## Acknowledgments
|
| 117 |
|
| 118 |
- https://github.com/lucidrains/BS-RoFormer
|
| 119 |
-
- https://arxiv.org/abs/2309.02612 (Music Source Separation with Band-Split RoPE Transformer)
|
|
|
|
| 2 |
library_name: transformers
|
| 3 |
license: mit
|
| 4 |
datasets:
|
| 5 |
+
- CLAPv2/MUSDB18-HQ
|
| 6 |
pipeline_tag: audio-to-audio
|
| 7 |
tags:
|
| 8 |
+
- music
|
| 9 |
+
new_version: HiDolen/Mini-BS-RoFormer-V2-46.8M
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# Model Card for Model ID
|
| 13 |
|
| 14 |
+
Model for the music source separation task. Its implementation is based on [the existing BS-RoFormer code](https://github.com/lucidrains/BS-RoFormer).
|
|
|
|
| 15 |
|
| 16 |
+
A model for the music audio separation task, adapted from [the existing BS-RoFormer model code](https://github.com/lucidrains/BS-RoFormer).
|
| 17 |
|
|
|
|
| 18 |
|
| 19 |
+
## Model Details
|
| 20 |
|
| 21 |
+
Model parameters:
|
| 22 |
|
| 23 |
+
- depth = 8
|
| 24 |
+
- hidden_size = 256
|
| 25 |
+
- intermediate_size = 256 * 3
|
| 26 |
|
| 27 |
+
17.9M parameters in total. On the MUSDB18HQ validation set the model reaches an average SDR of 9.0. Per-stem SDR:
|
| 28 |
|
| 29 |
+
- bass: 8.31
|
| 30 |
+
- drums: 9.55
|
| 31 |
+
- other: 8.14
|
| 32 |
+
- vocal: 10.03
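As a quick sanity check, the reported average of 9.0 can be reproduced from the four per-stem figures above. The snippet assumes the average is a plain unweighted mean over stems, which the README does not state explicitly.

```python
# Per-stem SDR values as listed in the README.
sdr = {"bass": 8.31, "drums": 9.55, "other": 8.14, "vocal": 10.03}

# Unweighted mean over the four stems (assumed averaging scheme).
mean_sdr = sum(sdr.values()) / len(sdr)
print(f"mean SDR: {mean_sdr:.4f}")
```

The mean comes out just above 9.0, consistent with the rounded figure quoted above.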
|
|
|
|
| 33 |
|
| 34 |
## Uses
|
| 35 |
|
|
|
|
| 43 |
import torch
|
| 44 |
import librosa
|
| 45 |
|
| 46 |
+
model_name = "HiDolen/Mini-BS-RoFormer-18M"
|
| 47 |
model = AutoModel.from_pretrained(
|
| 48 |
model_name,
|
| 49 |
trust_remote_code=True,
|
|
|
|
| 59 |
# Run inference
|
| 60 |
result = model.separate(
|
| 61 |
waveform,
|
| 62 |
+
chunk_size=44100 * 6,
|
| 63 |
+
overlap_size=44100 * 3,
|
| 64 |
+
gap_size=0,
|
| 65 |
batch_size=2,
|
| 66 |
verbose=True,
|
| 67 |
)
|
|
|
|
| 71 |
soundfile.write(f"separated_stem_{i}.wav", result[i].cpu().numpy().T, 44100)
|
| 72 |
```
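The `chunk_size` and `overlap_size` arguments in the call above are given in samples, so `44100 * 6` and `44100 * 3` correspond to 6-second chunks that overlap by 3 seconds at 44.1 kHz. The sketch below shows one common way such a pair is turned into chunk start offsets. `chunk_starts` is a hypothetical helper; how `model.separate` chunks internally, and what `gap_size` does, are not described in the README.

```python
def chunk_starts(num_samples, chunk_size, overlap_size):
    """Start offsets of overlapping chunks that together cover num_samples.

    Hypothetical helper: assumes consecutive chunks are spaced by
    chunk_size - overlap_size samples and the last chunk may run past
    the end of the signal (to be zero-padded by the caller).
    """
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")
    hop = chunk_size - overlap_size
    # Stop once the previous chunk already reaches the end of the signal.
    return list(range(0, max(num_samples - overlap_size, 1), hop))


sr = 44100
starts = chunk_starts(num_samples=sr * 15, chunk_size=sr * 6, overlap_size=sr * 3)
print([s / sr for s in starts])  # chunk start times in seconds
```

With these settings every sample away from the track edges is covered by two chunks, so the overlapping predictions can be blended to reduce artifacts at chunk boundaries.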
|
| 73 |
|
| 74 |
+
To separate only 2 stems (accompaniment and vocals) instead of 4:
|
| 75 |
|
| 76 |
```python
|
| 77 |
+
from transformers import AutoModel
|
| 78 |
+
import soundfile
|
| 79 |
+
import torch
|
| 80 |
+
import librosa
|
| 81 |
+
|
| 82 |
+
model_name = "HiDolen/Mini-BS-RoFormer-18M"
|
| 83 |
+
model = AutoModel.from_pretrained(
|
| 84 |
+
model_name,
|
| 85 |
+
trust_remote_code=True,
|
| 86 |
+
)
|
| 87 |
+
model.to("cuda")
|
| 88 |
|
| 89 |
+
# Load audio
|
| 90 |
+
file = "./Bruno Mars - Runaway Baby.mp3"
|
| 91 |
+
waveform, sr = librosa.load(file, sr=44100, mono=False)
|
| 92 |
+
waveform = torch.tensor(waveform).float()
|
| 93 |
+
waveform = waveform.to("cuda")
|
| 94 |
+
|
| 95 |
+
# Run inference
|
| 96 |
result = model.separate(
|
| 97 |
waveform,
|
| 98 |
+
chunk_size=44100 * 6,
|
| 99 |
+
overlap_size=44100 * 3,
|
| 100 |
+
gap_size=0,
|
| 101 |
batch_size=2,
|
| 102 |
verbose=True,
|
| 103 |
)
|
| 104 |
|
|
|
|
| 105 |
instrumental = result[0] + result[1] + result[2]
|
| 106 |
vocals = result[3]
|
| 107 |
result = torch.stack([instrumental, vocals], dim=0)
|
|
|
|
| 109 |
soundfile.write(f"separated_stem_{i}.wav", result[i].cpu().numpy().T, 44100)
|
| 110 |
```
|
| 111 |
|
| 112 |
+
## Training Details
|
| 113 |
|
| 114 |
+
Trained on the MUSDB18HQ dataset.
|
| 115 |
|
| 116 |
+
The Multi-STFT loss term from the original paper is not used, which speeds up training.
|
| 117 |
|
| 118 |
+
Learning rate 5e-4; trained for 200k steps with batch_size=6.
|
| 119 |
|
| 120 |
+
Some tips discovered during training:
|
| 121 |
|
| 122 |
+
- More data augmentation is not always better. Removing pitch-shift and time-stretch augmentation speeds up convergence and uses less CPU
|
| 123 |
+
- The Multi-STFT loss can be dropped: training is unaffected and runs much faster
|
| 124 |
+
- MaskEstimator can be a single linear layer; it still fits the data and saves a large number of parameters
|
| 125 |
+
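- Moderately reducing the number of freq_transformer layers barely affects overall performance. The impact falls mainly on elements with rich frequency content, such as vocals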
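To give a feel for the parameter savings mentioned in the MaskEstimator tip, the rough count below compares a single linear layer with a two-layer MLP for one band. Only `hidden_size = 256` comes from this README; the per-band output width (64) and the 4x MLP expansion are hypothetical values chosen for illustration and are not the model's real dimensions.

```python
def linear_params(in_features, out_features):
    # Weights plus biases of one fully connected layer.
    return in_features * out_features + out_features


hidden_size = 256      # from the model parameters listed in this README
band_out = 64          # hypothetical per-band output width
mlp_hidden = 4 * 256   # hypothetical MLP expansion, for comparison only

single_layer = linear_params(hidden_size, band_out)
two_layer_mlp = linear_params(hidden_size, mlp_hidden) + linear_params(mlp_hidden, band_out)

print(f"single linear layer: {single_layer} parameters")
print(f"two-layer MLP:       {two_layer_mlp} parameters")
print(f"ratio:               {two_layer_mlp / single_layer:.1f}x")
```

Because the estimator is repeated for every frequency band and every stem, a per-band difference of this size adds up quickly across the whole model.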
|
| 126 |
|
| 127 |
## Acknowledgments
|
| 128 |
|
| 129 |
- https://github.com/lucidrains/BS-RoFormer
|
| 130 |
+
- https://arxiv.org/abs/2309.02612 (Music Source Separation with Band-Split RoPE Transformer)
|