liampower committed · Commit 611057d · verified · 1 Parent(s): 0144d8f

Update README.md

Files changed (1): README.md +1 -189

README.md CHANGED
@@ -15,192 +15,4 @@ pipeline_tag: audio-to-audio

# Model Card for SICTO Vocal Separator

This model performs music source separation, using a Hybrid Spectrogram-TasNet architecture (HSTasnet) to separate the individual instruments in a mixed audio track.

## Model Details

### Model Description

HSTasnet is a hybrid model for music source separation that combines time-domain and frequency-domain processing. Parallel time-domain and frequency-domain encoders are followed by RNN-based memory modules that process the audio at multiple scales; the model then merges these complementary representations through a hybrid RNN layer before generating the masks used for source separation.

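The parallel-encoder / memory / merge / mask structure described above can be sketched as a toy PyTorch module. Everything below (module sizes, the GRU memories, the way the two branch estimates are summed) is invented for illustration and is not the released HSTasnet implementation:

```python
import torch
import torch.nn as nn


class ToyHybridSeparator(nn.Module):
    """Toy two-branch (time + frequency) mask-based separator, mono input."""

    def __init__(self, n_sources=4, n_fft=1024, hop=512, hidden=32):
        super().__init__()
        self.n_sources, self.n_fft, self.hop = n_sources, n_fft, hop
        n_bins = n_fft // 2 + 1
        # time-domain branch: learned filterbank encoder + RNN "memory"
        self.time_enc = nn.Conv1d(1, hidden, n_fft, stride=hop, padding=n_fft // 2, bias=False)
        self.time_rnn = nn.GRU(hidden, hidden, batch_first=True)
        # frequency-domain branch: STFT magnitudes + RNN "memory"
        self.freq_rnn = nn.GRU(n_bins, hidden, batch_first=True)
        # hybrid layer merging the two representations, then per-source masks
        self.merge = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.time_mask = nn.Linear(hidden, hidden * n_sources)
        self.freq_mask = nn.Linear(hidden, n_bins * n_sources)
        self.time_dec = nn.ConvTranspose1d(hidden, 1, n_fft, stride=hop, padding=n_fft // 2, bias=False)

    def forward(self, mix):  # mix: [batch, samples]
        b, t = mix.shape
        window = torch.hamming_window(self.n_fft, device=mix.device)
        spec = torch.stft(mix, self.n_fft, self.hop, window=window, return_complex=True)
        mag, phase = spec.abs(), spec.angle()              # [b, bins, frames]
        tf = self.time_enc(mix.unsqueeze(1))               # [b, hidden, frames]
        frames = min(tf.shape[-1], mag.shape[-1])          # align branch frame counts
        tf, mag, phase = tf[..., :frames], mag[..., :frames], phase[..., :frames]
        th, _ = self.time_rnn(tf.transpose(1, 2))          # time memory  [b, frames, hidden]
        fh, _ = self.freq_rnn(mag.transpose(1, 2))         # freq memory  [b, frames, hidden]
        mh, _ = self.merge(torch.cat([th, fh], dim=-1))    # hybrid merge [b, frames, hidden]
        tm = torch.sigmoid(self.time_mask(mh)).view(b, frames, self.n_sources, -1)
        fm = torch.sigmoid(self.freq_mask(mh)).view(b, frames, self.n_sources, -1)
        outs = []
        for s in range(self.n_sources):
            # time branch: mask the learned features, decode back to a waveform
            y_t = self.time_dec((tf.transpose(1, 2) * tm[:, :, s]).transpose(1, 2))[:, 0, :t]
            y_t = nn.functional.pad(y_t, (0, t - y_t.shape[-1]))
            # frequency branch: mask magnitudes, resynthesise with the mixture phase
            masked = (mag.transpose(1, 2) * fm[:, :, s]).transpose(1, 2)
            y_f = torch.istft(torch.polar(masked, phase), self.n_fft, self.hop,
                              window=window, length=t)
            outs.append(y_t + y_f)                         # sum the two branch estimates
        return torch.stack(outs, dim=1)                    # [b, n_sources, samples]
```

In the actual model, chunked stereo input, an RNN hidden size of 768, and skip connections replace these toy choices.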
- **Developed by:** Authors of "Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet"
- **Model type:** Hybrid time/frequency-domain source separation
- **License:** MIT
- **Paper:** [Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet](https://arxiv.org/abs/2402.17701)

### Model Sources

- **Repository:** [burstMembrane/hstasnet](https://github.com/burstMembrane/hstasnet)
- **Paper:** [arXiv:2402.17701](https://arxiv.org/abs/2402.17701)

## Uses

### Direct Use

The model separates music tracks into their constituent stems (vocals, drums, bass, and other). It is particularly useful for:

- Music production and remixing
- Audio analysis and research
- Creating karaoke tracks
- Isolating specific instruments for practice or study
- Isolating instruments for downstream tasks such as transcription and alignment

## How to Get Started with the Model

```bash
# Example usage with the SheetMuse training framework
sm-train --model hstasnet \
    --results_path results \
    --data_path /path/to/training/data \
    --config configs/config_moisesdb_hstasnet.yaml
```

To use the pretrained model, first install the training package:

```bash
pip install "git+ssh://git@bitbucket.org/mattstepincto/sheetmuse-training.git"
```

Then import the pretrained model and call its `separate_file` method. Note that you will need a Hugging Face API token and access to the Bitbucket repository.

```python
import torch

from sheetmuse_training.hf.smsourceseparator import SMSourceSeparator

model = SMSourceSeparator.from_pretrained("sicto/hstasnet", token="sicto/hf/read/token")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

output = model.separate_file(
    # the input file, e.g. mixture.wav
    file_path,
    # the folder to save the output to, e.g. out
    savedir=savedir,
    # a list of instruments used for file naming, e.g. ["drums", "bass", "other", "vocals"]
    instruments=model.instruments,
    # the device to use for inference
    device=device,
)
# output shape will be [batch_size (1), n_instruments, n_channels, n_samples]
print(f"Output shape: {output.shape}")
```

## Training Details

### Training Data

The model is typically trained on the MUSDB18-HQ dataset, which contains:

- 150 songs (86 for training, 14 for validation, 50 for testing)
- High-quality audio at 44.1 kHz
- Separate stems for vocals, drums, bass, and other instruments

### Training Procedure

#### Training Hyperparameters

- **Optimizer:** AdamW
- **Learning Rate:** 1.43e-4
- **Batch Size:** 24
- **Number of Epochs:** 100
- **Patience:** 5 (for learning rate reduction)
- **Reduce Factor:** 0.8
- **Gradient Clipping:** 7.0
- **Mixed Precision Training:** Enabled
- **Gradient Accumulation Steps:** 1

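In plain PyTorch these hyperparameters map roughly to the setup below. This is a sketch, not the SheetMuse training loop; it assumes the "patience"/"reduce factor" pair describes a reduce-on-plateau LR schedule and that mixed precision means `torch.amp` autocast, neither of which the card states explicitly:

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the actual separator

optimizer = torch.optim.AdamW(model.parameters(), lr=1.43e-4)
# patience 5 / factor 0.8: shrink the LR when validation loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.8, patience=5
)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # mixed precision

def training_step(batch, target):
    optimizer.zero_grad()
    with torch.autocast("cuda" if use_cuda else "cpu", enabled=use_cuda):
        loss = torch.nn.functional.l1_loss(model(batch), target)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 7.0)  # gradient clipping at 7.0
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

loss = training_step(torch.randn(24, 16), torch.randn(24, 16))  # batch size 24
scheduler.step(loss)  # called once per validation epoch
```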
### Evaluation

#### Metrics

The model is evaluated using two metrics:

- Signal-to-Distortion Ratio (SDR)
- L1 frequency loss

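A minimal version of both metrics is shown below, assuming SDR is the plain signal-to-error power ratio in dB and the L1 frequency loss is the mean absolute difference between STFT magnitudes; full evaluation toolkits such as museval use more elaborate variants:

```python
import torch

def sdr(reference: torch.Tensor, estimate: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Signal-to-distortion ratio in dB over the last axis."""
    signal = (reference ** 2).sum(dim=-1)
    error = ((reference - estimate) ** 2).sum(dim=-1)
    return 10 * torch.log10((signal + eps) / (error + eps))

def l1_freq_loss(reference: torch.Tensor, estimate: torch.Tensor,
                 n_fft: int = 1024, hop: int = 512) -> torch.Tensor:
    """Mean absolute error between STFT magnitudes."""
    window = torch.hamming_window(n_fft)
    mag = lambda x: torch.stft(x, n_fft, hop, window=window, return_complex=True).abs()
    return (mag(reference) - mag(estimate)).abs().mean()

ref = torch.sin(torch.linspace(0, 400.0, 44100)).unsqueeze(0)
print(sdr(ref, ref))                    # perfect estimate: very large SDR
print(sdr(ref, torch.zeros_like(ref)))  # silent estimate: 0 dB
```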
#### Results

Typical performance on the MUSDB18-HQ test set:

- SDR: ~5.1 dB (averaged across all instruments)

With extra training data:

- SDR: ~5.7 dB (averaged across all instruments)

## Technical Specifications

### Model Architecture

HSTasnet implements a hybrid architecture combining:

1. **Time Domain Processing**:
   - Time encoder with window size 1024 and hop size 512
   - RNN hidden dimension of 768
   - RNN-based memory module for temporal processing
   - Skip connections and mask generation

2. **Frequency Domain Processing**:
   - STFT-based encoder (1024-point FFT, hop size 512, Hamming window)
   - Parallel RNN memory module
   - Complementary mask generation

3. **Audio Processing Parameters**:
   - Sample rate: 44.1 kHz
   - Number of channels: 2 (stereo)
   - Chunk size: 262,144 samples
   - Four sources: drums, bass, other, vocals

4. **Augmentation Strategy**:
   - Channel shuffling (50% probability)
   - Random polarity inversion (50% probability)
   - Source-specific augmentations:
     - Vocals: pitch shifting (±5 semitones), EQ (±9 dB), distortion
     - Bass: pitch shifting (±2 semitones), EQ (-3/+6 dB), distortion
     - Drums: pitch shifting (±5 semitones), EQ (±9 dB), distortion
     - Other: pitch shifting (±4 semitones), noise injection, time stretching (0.8-1.25x)

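The first two stem-level augmentations can be sketched directly in PyTorch; the pitch-shift/EQ/distortion chains would typically come from an audio-effects library and are omitted here:

```python
import torch

def augment_stem(stem: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Channel shuffling and random polarity inversion for a [channels, samples] stem."""
    if stem.shape[0] == 2 and torch.rand(()) < p:
        stem = stem.flip(0)  # swap left/right channels
    if torch.rand(()) < p:
        stem = -stem         # invert polarity
    return stem

stem = torch.randn(2, 44100)  # one second of stereo audio at 44.1 kHz
out = augment_stem(stem)
print(out.shape)              # shape unchanged: torch.Size([2, 44100])
```

Both operations preserve the energy of the stem, which is why they can be applied aggressively without distorting the training distribution.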
### Compute Infrastructure

#### Hardware Requirements

- Minimum 16 GB of GPU memory
- Recommended: NVIDIA RTX 3090 or similar
- CPU and MPS inference are supported but slower

#### Software Requirements

- Python 3.8+
- PyTorch 1.10+
- torchaudio for STFT operations
- pytorch_lightning for training
- Additional dependencies listed in requirements.txt

### Input Requirements

- Audio format: waveform tensor of shape [Batch, Channels, Length]
- Supported sample rate: 44.1 kHz (default)
- Supports both mono and stereo inputs
- Variable-length processing with optional padding

### Output Format

- Separated sources: tensor of shape [Batch, Sources, Channels, Length]
- Maintains the input sample rate and channel configuration
- Optional length matching through zero-padding

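The variable-length handling described above amounts to splitting the waveform into the model's 262,144-sample chunks and zero-padding the tail; a sketch (the function name and batching layout are illustrative, not the library's API):

```python
import torch

CHUNK = 262_144  # the model's chunk size in samples

def chunk_waveform(wav: torch.Tensor, chunk: int = CHUNK) -> torch.Tensor:
    """Split a [channels, samples] waveform into zero-padded [n_chunks, channels, chunk]."""
    channels, samples = wav.shape
    n_chunks = -(-samples // chunk)  # ceiling division
    padded = torch.nn.functional.pad(wav, (0, n_chunks * chunk - samples))
    return padded.view(channels, n_chunks, chunk).transpose(0, 1)

wav = torch.randn(2, 500_000)  # ~11.3 s of stereo audio at 44.1 kHz
chunks = chunk_waveform(wav)
print(chunks.shape)            # torch.Size([2, 2, 262144])
```

After separation, the per-chunk outputs can be concatenated along time and trimmed back to the original length.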
## Citation

**BibTeX:**

```bibtex
@article{hstasnet2024,
  title={Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet},
  author={Venkatesh, Satvik and Benilov, Arthur and Coleman, Philip and Roskam, Frederic},
  journal={arXiv preprint arXiv:2402.17701},
  year={2024}
}
```

## Model Card Contact

For questions about this model card, please open an issue in the repository.
 
 # Model Card for SICTO Vocal Separator

+ This model performs HQ Vocal Separation