Update README.md
README.md
CHANGED
@@ -15,192 +15,4 @@ pipeline_tag: audio-to-audio
# Model Card for SICTO Vocal Separator

This model performs HQ vocal separation.

## Model Details

### Model Description

HSTasnet is a hybrid spectrogram transformer model for music source separation that combines both time and frequency domain processing. It uses parallel time-domain and frequency-domain encoders followed by RNN-based memory modules to process audio at multiple scales. The model merges these complementary representations through a hybrid RNN layer before generating masks for source separation.

- **Developed by:** Authors of "Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet" (Venkatesh et al., 2024)
- **Model type:** Transformer-based Source Separation
- **License:** MIT
- **Paper:** [Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet](https://arxiv.org/abs/2402.17701)

### Model Sources

- **Repository:** [burstMembrane/hstasnet](https://github.com/burstMembrane/hstasnet)
- **Paper:** [arXiv:2402.17701](https://arxiv.org/abs/2402.17701)

## Uses
## Uses

### Direct Use

The model can be used to separate music tracks into their constituent instruments (vocals, drums, bass, and other). It's particularly useful for:

- Music production and remixing
- Audio analysis and research
- Creating karaoke tracks
- Isolating specific instruments for practice or study
- Isolating instruments for downstream tasks such as transcription and alignment

## How to Get Started with the Model

```bash
# Example usage with the SheetMuse training framework
sm-train --model hstasnet \
    --results_path results \
    --data_path /path/to/training/data \
    --config configs/config_moisesdb_hstasnet.yaml
```

To use the pretrained model, first install the package:

```bash
pip install git+ssh://git@bitbucket.org/mattstepincto/sheetmuse-training.git
```

Then load the pretrained model and call its `separate_file` method. Note that you will need a Hugging Face API token and access to the Bitbucket repository.

```python
import torch

from sheetmuse_training.hf.smsourceseparator import SMSourceSeparator

# requires a Hugging Face read token with access to the repository
model = SMSourceSeparator.from_pretrained("sicto/hstasnet", token="sicto/hf/read/token")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# the input file and the folder to save the separated stems to
file_path = "mixture.wav"
savedir = "out"

output = model.separate_file(
    file_path,
    savedir=savedir,
    # a list of instruments used for file naming, e.g. ["drums", "bass", "other", "vocals"]
    instruments=model.instruments,
    # the device to use for inference
    device=device,
)
# output shape will be [batch_size (1), n_instruments, n_channels, n_samples]
print(f"Output shape: {output.shape}")
```
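
Assuming the file-naming comment above holds, each separated stem is also written to `savedir` as an audio file named after the corresponding entry in `instruments`.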
## Training Details

### Training Data

The model is typically trained on the MUSDB18-HQ dataset, which contains:

- 150 songs (86 for training, 14 for validation, 50 for testing)
- High-quality audio at 44.1 kHz
- Separate stems for vocals, drums, bass, and other instruments

### Training Procedure

#### Training Hyperparameters

- **Optimizer:** AdamW
- **Learning Rate:** 1.43e-4
- **Batch Size:** 24
- **Number of Epochs:** 100
- **Patience:** 5 (for learning rate reduction)
- **Reduce Factor:** 0.8
- **Gradient Clipping:** 7.0
- **Mixed Precision Training:** Enabled
- **Gradient Accumulation Steps:** 1

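For orientation, a minimal sketch of how these settings map onto PyTorch; `model`, `train_loader`, `criterion`, and `val_loss` are placeholders, and the actual loop lives in the SheetMuse training framework:

```python
import torch

# placeholders: the real model, data loader, and loss come from the training framework
optimizer = torch.optim.AdamW(model.parameters(), lr=1.43e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.8, patience=5)
scaler = torch.cuda.amp.GradScaler()  # mixed precision training

for epoch in range(100):
    for mixture, targets in train_loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(mixture), targets)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)  # unscale before clipping gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=7.0)
        scaler.step(optimizer)
        scaler.update()
    scheduler.step(val_loss)  # placeholder: validation loss drives LR reduction
```
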
### Evaluation

#### Metrics

The model is evaluated using two metrics:

- Signal-to-Distortion Ratio (SDR)
- L1 Frequency Loss
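
For reference, a minimal sketch of the energy-ratio SDR definition; this is the textbook formula, not necessarily the exact evaluation code behind the numbers below:

```python
import torch

def sdr(reference: torch.Tensor, estimate: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Signal-to-Distortion Ratio in dB: 10 * log10(||ref||^2 / ||ref - est||^2)."""
    signal_power = reference.pow(2).sum()
    distortion_power = (reference - estimate).pow(2).sum()
    return 10 * torch.log10(signal_power / (distortion_power + eps) + eps)
```
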
#### Results

Typical performance metrics on the MUSDB18-HQ test set:

- SDR: ~5.1 dB (average across all instruments)

With extra data:

- SDR: ~5.7 dB (average across all instruments)

## Technical Specifications

### Model Architecture

HSTasnet implements a hybrid architecture combining:

1. **Time Domain Processing**:
   - Time encoder with window size 1024 and hop size 512 (both encoder branches are sketched in code after this list)
   - RNN hidden dimension of 768
   - RNN-based memory module for temporal processing
   - Skip connections and mask generation

2. **Frequency Domain Processing**:
   - STFT-based encoder (1024-point FFT, hop size 512, Hamming window)
   - Parallel RNN memory module
   - Complementary mask generation

3. **Audio Processing Parameters**:
   - Sample rate: 44.1 kHz
   - Number of channels: 2 (stereo)
   - Chunk size: 262,144 samples
   - Processing 4 sources: drums, bass, other, vocals

4. **Augmentation Strategy**:
   - Channel shuffling (50% probability)
   - Random polarity inversion (50% probability)
   - Source-specific augmentations:
     - Vocals: pitch shifting (±5 semitones), EQ (±9 dB), distortion
     - Bass: pitch shifting (±2 semitones), EQ (-3/+6 dB), distortion
     - Drums: pitch shifting (±5 semitones), EQ (±9 dB), distortion
     - Other: pitch shifting (±4 semitones), noise injection, time stretching (0.8-1.25x)
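
A schematic sketch of the two parallel front-end encoders using the parameters above. This only illustrates the documented hyperparameters; `TimeEncoder` and `freq_encode` are names invented for this example, not the repository's actual modules:

```python
import torch
import torch.nn as nn

class TimeEncoder(nn.Module):
    """Illustrative TasNet-style learned filterbank: window 1024, hop 512."""

    def __init__(self, n_filters: int = 768, window: int = 1024, hop: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel_size=window, stride=hop, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch * channels, 1, samples] -> [batch * channels, n_filters, frames]
        return torch.relu(self.conv(x))

def freq_encode(x: torch.Tensor) -> torch.Tensor:
    """Illustrative STFT branch: 1024-point FFT, hop 512, Hamming window."""
    window = torch.hamming_window(1024, device=x.device)
    # x: [batch * channels, samples] -> [batch * channels, 513, frames]
    spec = torch.stft(x, n_fft=1024, hop_length=512, window=window, return_complex=True)
    return spec.abs()  # magnitude spectrogram

# both branches see the same chunk in parallel and are merged downstream
chunk = torch.randn(2, 262144)  # one stereo chunk of 262,144 samples at 44.1 kHz
time_feats = TimeEncoder()(chunk.unsqueeze(1))
freq_feats = freq_encode(chunk)
print(time_feats.shape, freq_feats.shape)
```
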
### Compute Infrastructure

#### Hardware Requirements

- Minimum 16 GB GPU memory
- Recommended: NVIDIA RTX 3090 or similar
- CPU and MPS inference are supported but slower

#### Software Requirements

- Python 3.8+
- PyTorch 1.10+
- torchaudio for STFT operations
- pytorch_lightning for training
- Additional dependencies listed in requirements.txt

### Input Requirements

- Audio format: waveform tensor of shape [Batch, Channels, Length]
- Supported sample rates: 44.1 kHz (default)
- Supports both mono and stereo inputs
- Variable-length processing with optional padding

### Output Format

- Separated sources: tensor of shape [Batch, Sources, Channels, Length]
- Maintains input sample rate and channel configuration
- Optional length matching through zero-padding
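
A quick sanity check of this tensor contract; the direct `model(mixture)` call signature is an assumption, and `model` is the instance loaded in the quick-start example above:

```python
import torch

# hypothetical shape check, assuming forward() follows the documented I/O contract
batch, channels, length = 1, 2, 262144
mixture = torch.randn(batch, channels, length)  # [Batch, Channels, Length] at 44.1 kHz

with torch.no_grad():
    sources = model(mixture)  # model loaded as in the quick-start example

# [Batch, Sources, Channels, Length], sources ordered drums, bass, other, vocals
assert sources.shape == (batch, 4, channels, length)
```
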
## Citation

**BibTeX:**

```bibtex
@article{hstasnet2024,
  title={Real-time Low-latency Music Source Separation using Hybrid Spectrogram-TasNet},
  author={Venkatesh, Satvik and Benilov, Arthur and Coleman, Philip and Roskam, Frederic},
  journal={arXiv preprint arXiv:2402.17701},
  year={2024}
}
```

## Model Card Contact

For questions about the model card, please open an issue in the repository.