Yongyi Zang committed on
Commit 26ab161 · 1 Parent(s): 6bf7e1d
README.md CHANGED
@@ -1,3 +1,88 @@
- # Music Source Restoration Kit
-
- This repository offers a collection of model implementations, training configurations, and evaluation scripts to help you quickly get started with training and evaluating music source restoration models.
+ # Music Source Restoration Kit
+
+ This repository offers a collection of model implementations, training configurations, and evaluation scripts to help you quickly get started with training and evaluating music source restoration models.
+
+ We have designed the repository as a GAN-based framework; to learn more about GANs, you can watch [this video](https://www.youtube.com/watch?v=TpMIssRdhco).
+
+ ## Directory Structure
+
+ The repository is organized to separate concerns, making it easy to extend and maintain. Click on a directory to learn more about its contents.
+
+ ```
+ MSRKit/
+ ├── README.md <- You are here
+ ├── config.yaml <- Main configuration file for experiments
+ ├── train.py <- Main script to start training
+ ├── unwrap.py <- Utility to extract generator weights from a checkpoint
+
+ ├── data/ <- [Data loading and augmentation](./data/README.md)
+
+ ├── evaluation/ <- [Evaluation metrics](./evaluation/README.md)
+
+ ├── losses/ <- [Loss function implementations](./losses/README.md)
+
+ ├── models/ <- [Top-level generator model architectures](./models/README.md)
+
+ └── modules/ <- [Core building blocks for models](./modules/README.md)
+     ├── discriminator/ <- [Discriminator architectures](./modules/discriminator/README.md)
+     └── generator/ <- [Reusable generator components](./modules/generator/README.md)
+ ```
+
+ ## 🚀 Getting Started
+
+ ### 1. Setup
+
+ First, clone the repository and install the required dependencies.
+
+ ```bash
+ git clone https://github.com/yongyizang/MSRKit.git
+ cd MSRKit
+ pip install -r requirements.txt
+ ```
+
+ *Note: The `FAD_CLAP` metric requires `laion-clap`. Please install it via `pip install laion-clap`.*
+
+ ### 2. Configure Your Experiment
+
+ Modify the `config.yaml` file to set up your dataset paths, model hyperparameters, and training settings.
+
+ Key sections to update (a short sketch of reading these keys follows the list):
+
+ - `data.train_dataset.root_directory`: Path to your training data.
+ - `data.train_dataset.file_list`: Path to a `.txt` file listing your training samples.
+ - `data.val_dataset.root_directory`: Path to your validation data.
+ - `data.val_dataset.file_list`: Path to a `.txt` file listing your validation samples.
+ - `model`: Choose the generator model and its parameters.
+ - `discriminators`: Add and configure one or more discriminators.
+ - `trainer`: Set training parameters like `max_steps`, `devices` (GPU IDs), and `precision`.
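+
+ For orientation, a minimal sketch of reading the keys listed above with PyYAML (illustrative only, not the actual loading logic in `train.py`; any key not documented here would be an assumption):
+
+ ```python
+ import yaml  # PyYAML (pip install pyyaml if needed)
+
+ # Load the experiment configuration and pull out the documented fields.
+ with open("config.yaml") as f:
+     cfg = yaml.safe_load(f)
+
+ train_root = cfg["data"]["train_dataset"]["root_directory"]
+ train_list = cfg["data"]["train_dataset"]["file_list"]
+ val_root = cfg["data"]["val_dataset"]["root_directory"]
+ max_steps = cfg["trainer"]["max_steps"]
+ print(train_root, train_list, val_root, max_steps)
+ ```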
+
+ ### 3. Start Training
+
+ Launch the training process using the `train.py` script and your configuration file.
+
+ ```bash
+ python train.py --config config.yaml
+ ```
+
+ Logs, checkpoints, and audio samples will be saved in the `lightning_logs/` directory.
+
+ ### 4. Unwrap Generator Weights
+
+ After training, you may want to use the generator model for inference without the rest of the Lightning module. The `unwrap.py` script extracts the generator's `state_dict` from a checkpoint file.
+
+ ```bash
+ python unwrap.py --ckpt "path/to/your/checkpoint.ckpt" --out "path/to/save/generator.pth"
+ ```
+
+ This creates a clean `.pth` file containing only the generator's weights, which you can load for standalone inference or fine-tune on a different dataset.
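+
+ As a minimal sketch of loading the unwrapped weights (the model class, hyperparameter values, and paths below are placeholders; they must match whatever you trained with in `config.yaml`):
+
+ ```python
+ import torch
+ from models.MelRoFormer import MelRoFormer  # assumed import path; any generator in models/ loads the same way
+
+ # Hypothetical hyperparameters; use the ones from your training config.
+ model = MelRoFormer(hidden_channels=256, num_layers=6, num_heads=8,
+                     window_size=2048, hop_size=512, sample_rate=44100)
+ model.load_state_dict(torch.load("path/to/save/generator.pth", map_location="cpu"))
+ model.eval()
+
+ # Generators expect mono audio shaped [batch, samples] (see models/README.md).
+ mixture = torch.randn(1, 44100 * 3)  # 3 seconds of dummy audio
+ with torch.no_grad():
+     restored = model(mixture)
+ ```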
+
+ ## Building Your First Model
+
+ To build your first model, you can reference the model architectures in the `models/` directory. You can also refer to the `modules/` directory for the building blocks used in those architectures. At a very high level, we have implemented the following processing blocks:
+ - Spectral Operations: `Fourier`, `Band`
+ - Sequence Modeling Blocks: `RoFormerBlock` (and an example of a modified attention pattern, `AttentionRegisterRoFormerBlock`), `RNNBlock`, `ConvNeXt1DBlock`
+ - Convolutional Blocks: `ConvNeXt2DBlock`, `ConvNeXt1DBlock`
+ - Discriminator Architectures: `MultiPeriodDiscriminator`, `MultiScaleDiscriminator`, `MultiResolutionDiscriminator`, `MultiFrequencyDiscriminator`
+
+ ## ⚖️ License
+ This project is licensed under the MIT License.
data/README.md ADDED
@@ -0,0 +1,39 @@
+ # Data Module
+
+ This directory contains all the necessary components for data loading, processing, and augmentation.
+
+ ## Files
+
+ ### `dataset.py`
+
+ This file defines the `RawStems` dataset class, which is the core of the data pipeline. It dynamically creates training examples by mixing a target stem with other stems based on a specified Signal-to-Noise Ratio (SNR).
+
+ #### `RawStems`
+
+ A PyTorch `Dataset` that loads and processes raw audio stems for music source restoration tasks.
+
+ **`__init__` Arguments:**
+
+ - `target_stem` (`str`): The name of the target stem folder (e.g., `"Voc"` or `"Gtr_EG"`).
+ - `root_directory` (`Union[str, Path]`): The root directory containing subfolders for each song.
+ - `file_list` (`Optional[Union[str, Path]]`): Path to a `.txt` file where each line is a path to a song folder, relative to `root_directory`.
+ - `sr` (`int`): The target sample rate to load audio at. Default: `44100`.
+ - `clip_duration` (`float`): The duration of the audio clips to be extracted, in seconds. Default: `3.0`.
+ - `snr_range` (`Tuple[float, float]`): A tuple representing the min and max SNR (in dB) for mixing the target stem with the noise (other stems). Default: `(0.0, 10.0)`.
+ - `apply_augmentation` (`bool`): Whether to apply on-the-fly augmentations to the audio. Default: `True`.
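+
+ A minimal usage sketch (the import path, directory paths, and `DataLoader` settings are placeholders; the constructor arguments are the ones documented above):
+
+ ```python
+ from torch.utils.data import DataLoader
+ from data.dataset import RawStems  # assumed import path for the class described above
+
+ dataset = RawStems(
+     target_stem="Voc",
+     root_directory="path/to/train_data",
+     file_list="path/to/train_list.txt",
+     sr=44100,
+     clip_duration=3.0,
+     snr_range=(0.0, 10.0),
+     apply_augmentation=True,
+ )
+ # Each example pairs a degraded mixture clip with its clean target stem; the exact
+ # return format is defined in data/dataset.py.
+ loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
+ ```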
+
+ ### `augment.py`
+
+ This file implements the audio augmentation pipelines using the `pedalboard` library.
+
+ #### `StemAugmentation`
+
+ Applies a chain of augmentations suitable for the *target* audio source before it's mixed. This simulates variations in recording quality and effects.
+
+ - **Effects include**: Random EQ, Resampling, Compression, Distortion, and Reverb.
+
+ #### `MixtureAugmentation`
+
+ Applies a chain of augmentations to the final *mixture* audio. This simulates artifacts that could occur on a fully mixed track.
+
+ - **Effects include**: Limiting, Resampling, and MP3 (codec) compression.
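+
+ For intuition, a minimal `pedalboard` chain in the same spirit (the specific effects and parameter values below are illustrative, not the exact chains implemented in `augment.py`):
+
+ ```python
+ import numpy as np
+ from pedalboard import (Pedalboard, Compressor, Distortion, Reverb,
+                         Limiter, MP3Compressor, Resample)
+
+ sr = 44100
+ audio = np.random.randn(sr * 3).astype(np.float32)  # 3 s of dummy mono audio
+
+ # Stem-style chain: tone/dynamics effects applied to the target source before mixing.
+ stem_chain = Pedalboard([Compressor(threshold_db=-18.0, ratio=3.0),
+                          Distortion(drive_db=6.0),
+                          Reverb(room_size=0.2)])
+
+ # Mixture-style chain: mastering/codec artifacts applied to the final mix.
+ mix_chain = Pedalboard([Limiter(threshold_db=-1.0),
+                         Resample(target_sample_rate=16000.0),
+                         MP3Compressor(vbr_quality=5.0)])
+
+ augmented_stem = stem_chain(audio, sr)
+ augmented_mix = mix_chain(augmented_stem, sr)
+ ```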
evaluation/README.md ADDED
@@ -0,0 +1,31 @@
+ # Evaluation Module
+
+ This directory contains classes for evaluating model performance during validation. All metrics inherit from a base `Metric` class for a consistent interface.
+
+ ## Files
+
+ ### `metrics.py`
+
+ #### `SI_SNR` (Scale-Invariant Signal-to-Noise Ratio)
+
+ A common metric for audio source separation that measures the quality of the restored signal relative to the original target. It is invariant to the overall scaling of the estimated signal.
+
+ - `update(pred, target)`: Updates the running statistics with a new batch of predicted and target audio tensors.
+ - `compute()`: Calculates the mean and standard deviation of the SI-SNR scores accumulated since the last reset.
+ - `reset()`: Clears the accumulated statistics.
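+
+ A usage sketch of the `update`/`compute`/`reset` interface (the import path and the exact return format of `compute()` are assumptions based on the description above):
+
+ ```python
+ import torch
+ from evaluation.metrics import SI_SNR  # assumed import path
+
+ metric = SI_SNR()
+ metric.reset()
+
+ # Dummy validation batches shaped [batch, samples], standing in for model output and reference.
+ for _ in range(4):
+     pred = torch.randn(2, 44100)
+     target = torch.randn(2, 44100)
+     metric.update(pred, target)
+
+ stats = metric.compute()  # per this README: mean and std of the accumulated SI-SNR scores
+ print(stats)
+ metric.reset()
+ ```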
+
+ #### `FAD_CLAP` (Fréchet Audio Distance using CLAP)
+
+ Measures the Fréchet distance between the distributions of embeddings from the generated audio and the ground truth audio. It uses a pre-trained CLAP (Contrastive Language-Audio Pretraining) model to generate these embeddings, providing a perceptually relevant measure of audio quality and similarity.
+
+ **Note:** This metric requires the `laion-clap` library. If not installed, it will fall back to using random embeddings, which is not meaningful for evaluation.
+
+ - `update(pred, target)`: Extracts CLAP embeddings from the predicted and target audio tensors and stores them.
+ - `compute()`: Calculates the FAD score between the collected sets of embeddings.
+ - `reset()`: Clears the stored embeddings.
+
+ **`__init__` Arguments:**
+
+ - `embedding_dim` (`int`): The dimensionality of the embeddings. Should match the CLAP model. Default: `512`.
+ - `model_name` (`str`): The name of the CLAP model architecture to use. Default: `'HTSAT-base'`.
+ - `ckpt_path` (`Optional[str]`): Optional path to a specific CLAP model checkpoint. If `None`, it uses the default pre-trained weights.
losses/README.md ADDED
@@ -0,0 +1,63 @@
+ # Losses Module
+
+ This directory contains the implementations of various loss functions used for training the generator and discriminators.
+
+ ## Files
+
+ ### `gan_loss.py`
+
+ This file implements adversarial losses for both the generator and discriminator, as well as a feature matching loss.
+
+ We provide both LSGAN and Hinge GAN implementations. They differ primarily in how they penalize mistakes; a short code sketch of both objectives follows the list below.
+
+ - LSGAN takes a "least squares" approach: fake samples are constantly pushed toward looking real, with a penalty that grows quadratically the further off they are. Even terrible fakes therefore receive strong learning signals, which prevents vanishing gradients, but the discriminator never stops pushing, even on samples that are already good enough, which can cause instability.
+ - Hinge GAN instead creates a "satisfaction zone": once the discriminator is confident enough about a sample (real or fake), it stops trying to improve its classification, focusing all the learning on ambiguous samples near the decision boundary. The result: LSGAN provides consistent gradients throughout training but can overshoot and destabilize, while Hinge GAN typically produces sharper outputs by not wasting effort on already-separated samples, though it risks killing gradients entirely if the discriminator becomes too confident too fast.
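+
+ In code form, a minimal sketch of the two objectives on discriminator logits (this mirrors the standard formulations rather than quoting `gan_loss.py`):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def discriminator_loss(d_real, d_fake, gan_type="hinge"):
+     # Train D to score real samples high and fake samples low.
+     if gan_type == "hinge":
+         return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()
+     if gan_type == "lsgan":
+         return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()
+     raise ValueError(gan_type)
+
+ def generator_loss(d_fake, gan_type="hinge"):
+     # Train G to push fake logits toward the "real" side.
+     if gan_type == "hinge":
+         return -d_fake.mean()
+     if gan_type == "lsgan":
+         return ((d_fake - 1.0) ** 2).mean()
+     raise ValueError(gan_type)
+
+ d_real, d_fake = torch.randn(8), torch.randn(8)
+ print(discriminator_loss(d_real, d_fake, "hinge"), generator_loss(d_fake, "lsgan"))
+ ```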
+
+ #### `GeneratorLoss`
+
+ Calculates the adversarial loss for the generator, encouraging it to produce outputs that the discriminator classifies as real.
+
+ **`__init__` Arguments:**
+
+ - `gan_type` (`str`): The type of GAN loss to use. Supports `'hinge'` and `'lsgan'` (Least Squares GAN). Default: `'hinge'`.
+
+ #### `DiscriminatorLoss`
+
+ Calculates the adversarial loss for the discriminator, training it to distinguish between real and fake (generated) inputs.
+
+ **`__init__` Arguments:**
+
+ - `gan_type` (`str`): The type of GAN loss to use. Supports `'hinge'` and `'lsgan'`. Default: `'hinge'`.
+
+ #### `FeatureMatchingLoss`
+
+ Calculates the L1 distance between the feature maps of the real and fake inputs from the intermediate layers of the discriminator. This helps stabilize training by matching the statistical properties of the features.
+
+ -----
+
+ ### `reconstruction_loss.py`
+
+ This file implements reconstruction losses that measure the direct difference between the generated audio and the ground truth target audio in various domains.
+
+ #### `MultiMelSpecReconstructionLoss`
+
+ Calculates the L1 loss between the log-mel spectrograms of the predicted and target audio. It computes this loss using multiple different STFT configurations (FFT size, hop length, mel bands) and averages the results for a more robust, multi-resolution spectral loss.
+
+ **`__init__` Arguments:**
+
+ - `sample_rate` (`int`): The sample rate of the audio.
+ - `n_fft` (`List[int]`): A list of FFT sizes for the different STFT resolutions.
+ - `hop_length` (`List[int]`): A list of hop lengths corresponding to the FFT sizes.
+ - `n_mels` (`List[int]`): A list of the number of mel bands corresponding to the FFT sizes.
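+
+ A minimal sketch of this multi-resolution scheme using `torchaudio` (the resolutions and log epsilon below are illustrative, not the values in `reconstruction_loss.py`):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ import torchaudio
+
+ def multi_mel_l1(pred, target, sample_rate=44100,
+                  n_fft=(512, 1024, 2048), hop_length=(128, 256, 512), n_mels=(40, 80, 128)):
+     # Average the L1 distance between log-mel spectrograms over several STFT resolutions.
+     losses = []
+     for fft, hop, mels in zip(n_fft, hop_length, n_mels):
+         mel = torchaudio.transforms.MelSpectrogram(
+             sample_rate=sample_rate, n_fft=fft, hop_length=hop, n_mels=mels)
+         losses.append(F.l1_loss(torch.log(mel(pred) + 1e-5), torch.log(mel(target) + 1e-5)))
+     return torch.stack(losses).mean()
+
+ pred, target = torch.randn(2, 44100), torch.randn(2, 44100)
+ print(multi_mel_l1(pred, target))
+ ```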
+
+ #### `ComplexSpecReconstructionLoss`
+
+ Calculates the L1 loss on the magnitude of the complex spectrograms.
+
+ #### `MultiComplexSpecReconstructionLoss`
+
+ A multi-resolution version of `ComplexSpecReconstructionLoss`.
+
+ #### `WaveformReconstructionLoss`
+
+ Calculates a simple L1 loss directly on the raw audio waveforms.
models/README.md ADDED
@@ -0,0 +1,58 @@
+ # Models Module
+
+ This directory contains the high-level generator architectures. These models define the main structure for transforming a mixed audio waveform into a restored target stem. They process the audio in the spectral domain and utilize various building blocks from the `modules/` directory.
+
+ All models (currently) first transform the input waveform into a spectrogram, process it in the time-frequency domain, and then convert it back to a waveform using an inverse STFT. They uniformly assume a mono input tensor of shape [batch, samples].
+
+ ## Files
+
+ ### `MelRoFormer.py`
+
+ #### `MelRoFormer`
+
+ A dual-path Transformer-based model that applies attention alternately along the frequency and time axes of the spectrogram. It uses `RoFormerBlock`s, which incorporate Rotary Position Embeddings (RoPE) for effective sequence modeling. This model references https://arxiv.org/abs/2409.04702.
+
+ **`__init__` Arguments:**
+
+ - `hidden_channels` (`int`): The number of channels (embedding dimension) used throughout the model.
+ - `num_layers` (`int`): The number of layers (a time block + a frequency block is one layer).
+ - `num_heads` (`int`): The number of attention heads in each RoFormer block.
+ - `window_size` (`int`): The STFT window size.
+ - `hop_size` (`int`): The STFT hop size.
+ - `sample_rate` (`int`): The sample rate of the input audio.
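+
+ A quick shape and complexity check sketch (the hyperparameter values and import path are placeholders; `thop`, which is pinned in `requirements.txt`, is used here for profiling):
+
+ ```python
+ import torch
+ from thop import profile
+ from models.MelRoFormer import MelRoFormer  # assumed import path
+
+ # Placeholder hyperparameters; all generators take mono audio shaped [batch, samples].
+ model = MelRoFormer(hidden_channels=128, num_layers=4, num_heads=4,
+                     window_size=2048, hop_size=512, sample_rate=44100)
+ x = torch.randn(1, 44100)  # one second of audio
+ with torch.no_grad():
+     y = model(x)
+ print(y.shape)  # expected to mirror the input waveform shape
+
+ macs, params = profile(model, inputs=(x,))
+ print(f"MACs: {macs / 1e9:.2f} G, params: {params / 1e6:.2f} M")
+ ```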
+
+ -----
+
+ ### `MelRNN.py`
+
+ #### `MelRNN`
+
+ A dual-path model similar to `MelRoFormer`, but it uses bidirectional GRUs (`RNNBlock`) instead of Transformers for processing the time and frequency axes. This can be a lighter-weight alternative to the attention-based models. This model is based on (but deviates from) https://arxiv.org/abs/2209.15174.
+
+ **`__init__` Arguments:**
+
+ - `hidden_channels` (`int`): The number of channels (embedding dimension).
+ - `num_layers` (`int`): The number of RNN layers.
+ - `num_groups` (`int`): The number of groups for the `GroupedRNN` within each `RNNBlock`.
+ - `window_size` (`int`): The STFT window size.
+ - `hop_size` (`int`): The STFT hop size.
+ - `sample_rate` (`int`): The sample rate of the input audio.
+
+ -----
+
+ ### `UNet.py`
+
+ #### `MelUNet`
+
+ A U-Net architecture that operates on the 2D spectrogram. It uses a series of downsampling and upsampling blocks (`ConvNeXt2DBlock`) with skip connections to capture multi-scale features in the spectrogram.
+
+ **`__init__` Arguments:**
+
+ - `hidden_channels` (`int`): The initial number of channels in the network. Channel count doubles with each downsampling step.
+ - `num_layers` (`int`): The depth of the U-Net (number of downsampling/upsampling stages).
+ - `upsampling_factor` (`int`): The factor for upsampling/downsampling in each block (typically `2`).
+ - `window_size` (`int`): The STFT window size.
+ - `hop_size` (`int`): The STFT hop size.
+ - `sample_rate` (`int`): The sample rate of the input audio.
+
+ -----
modules/README.md ADDED
@@ -0,0 +1,146 @@
+ # Modules Directory
+
+ This directory contains the fundamental building blocks used to construct the larger models and discriminators. It is divided into subdirectories based on function.
+
+ ## Subdirectories
+
+ - **`discriminator/`**: Contains complete, stand-alone discriminator architectures.
+ - **`generator/`**: Contains reusable neural network layers and blocks (e.g., attention, RNN, ConvNeXt blocks) used in the main generator models.
+ - **`spectral_ops.py`**: Includes modules for spectral processing:
+   - `Fourier`: A wrapper for `torch.stft` and `torch.istft`.
+   - `Band`: A module to split a spectrogram into different frequency bands (e.g., mel scale) for processing and reassemble them.
+
+ # Discriminator Modules
+
+ This directory provides a suite of powerful, multi-component discriminators. The training script combines these into a single ensemble discriminator. Each is designed to analyze audio from a different perspective (time, frequency, scale), making the generator's task more challenging and leading to higher-quality results.
+
+ ## Files
+
+ ### `MultiPeriodDiscriminator.py`
+
+ #### `MultiPeriodDiscriminator`
+
+ This discriminator operates on the raw audio waveform. It consists of several sub-discriminators, each viewing the input signal at a different *period*. For example, a sub-discriminator with `period=2` folds the audio into a 2D representation so that samples two steps apart line up along one axis, allowing it to spot artifacts at that specific periodicity. This is highly effective at detecting periodic artifacts.
+
+ **`__init__` Arguments:**
+
+ - `nch` (`int`): Number of input channels (e.g., `1` for mono). Default: `1`.
+ - `sample_rate` (`int`): Sample rate of the audio. Default: `48000`.
+ - `periods` (`List[int]`): A list of periods for each sub-discriminator. Prime numbers are recommended. Default: `[2, 3, 5, 7, 11]`.
+ - `norm` (`bool`): Whether to use spectral normalization. Default: `True`.
+
32
+ -----
33
+
34
+ ### `MultiScaleDiscriminator.py`
35
+
36
+ #### `MultiScaleDiscriminator`
37
+
38
+ This discriminator also operates on the raw waveform. It contains multiple sub-discriminators that process the audio at different resolutions by downsampling the input. This allows it to identify artifacts at various time scales, from fine-grained details to broader structural issues.
39
+
40
+ **`__init__` Arguments:**
41
+
42
+ - `sample_rate` (`int`): Sample rate of the audio.
43
+ - `downsample_rates` (`List[int]`): A list of factors to downsample the audio for each sub-discriminator. Default: `[2, 4]`.
44
+ - `nch` (`int`): Number of input channels. Default: `1`.
45
+ - `norm` (`bool`): Whether to use spectral normalization. Default: `True`.
46
+
47
+ -----
48
+
49
+ ### `MultiResolutionDiscriminator.py`
50
+
51
+ #### `MultiResolutionDiscriminator`
52
+
53
+ This discriminator operates in the spectral domain. It consists of several sub-discriminators, each analyzing the STFT of the input audio using a different window length. This allows it to detect spectral artifacts across different time-frequency resolutions.
54
+
55
+ **`__init__` Arguments:**
56
+
57
+ - `nch` (`int`): Number of input channels. Default: `1`.
58
+ - `sample_rate` (`int`): Sample rate of the audio. Default: `48000`.
59
+ - `window_lengths` (`List[int]`): A list of STFT window lengths for each sub-discriminator. Default: `[2048, 1024, 512]`.
60
+ - `hop_factor` (`float`): The ratio of hop length to window length. Default: `0.25`.
61
+ - `bands` (`List[Tuple[float, float]]`): Frequency bands to analyze, specified as fractions of the Nyquist frequency.
62
+ - `norm` (`bool`): Whether to use spectral normalization. Default: `True`.
63
+ - `hidden_channels` (`int`): The number of hidden channels in the conv layers. Default: `32`.
64
+
65
+ -----
66
+
67
+ ### `MultiFrequencyDiscriminator.py`
68
+
69
+ #### `MultiFrequencyDiscriminator`
70
+
71
+ This discriminator is similar to `MultiResolutionDiscriminator` but with a different internal architecture focused on capturing features across frequency bands. It also processes the real and imaginary parts of the STFT as separate channels. This discriminator references https://arxiv.org/abs/2210.13438's discriminator architecture.
72
+
73
+ **`__init__` Arguments:**
74
+
75
+ - `nch` (`int`): Number of input channels.
76
+ - `window_sizes` (`List[int]`): A list of STFT window sizes for each sub-discriminator.
77
+ - `hidden_channels` (`int`): The number of base hidden channels. Default: `8`.
78
+ - `sample_rate` (`int`): Sample rate of the audio. Default: `48000`.
79
+ - `norm` (`bool`): Whether to use spectral normalization. Default: `True`.
80
+
81
+ -----
82
+
83
+ # Generator Modules
84
+
85
+ This directory contains reusable building blocks that form the core components of the main generator models in the `/models` directory.
86
+
87
+ ## Files
88
+
89
+ ### `RoFormerBlock.py`
90
+
91
+ #### `RoFormerBlock`
92
+
93
+ A standard Transformer block that uses **Ro**tary **P**osition **E**mbeddings (RoPE) instead of absolute or learned position embeddings. RoPE injects positional information by rotating the query and key vectors, which is particularly effective for sequence modeling. The block consists of a self-attention layer followed by an MLP, with residual connections and RMS normalization.
94
+
95
+ **`__init__` Arguments:**
96
+
97
+ - `n_embd` (`int`): The embedding dimension (number of channels).
98
+ - `n_head` (`int`): The number of attention heads.
99
+ - `max_seq_len` (`int`): The maximum sequence length this block can handle, used to pre-compute the RoPE cache.
100
+ - `rope_base` (`int`): The base value for the rotary position embedding calculation. Default: `10000`.
101
+
102
+ -----
103
+
104
+ ### `AttentionRegisterRoFormerBlock.py`
105
+
106
+ #### `AttentionRegisterRoFormerBlock`
107
+
108
+ An extension of the `RoFormerBlock` that implements **Attention Registers**. This technique adds a small number of learnable "register" tokens to the sequence. These tokens act as a global memory or scratchpad for the attention mechanism, improving its ability to retain and access information across the entire sequence, especially when combined with a sliding window attention mechanism.
109
+
110
+ **`__init__` Arguments:**
111
+
112
+ - *(Inherits from `RoFormerBlock`)*
113
+ - `num_register_tokens` (`int`): The number of register tokens to prepend to the sequence. Default: `0`.
114
+ - `window_size` (`int`): The size of the sliding attention window. If `-1`, full attention is used. Default: `-1`.
115
+
116
+ -----
117
+
118
+ ### `RNNBlock.py`
119
+
120
+ #### `RNNBlock`
121
+
122
+ A block that uses a Recurrent Neural Network (RNN) layer followed by an MLP, with residual connections and RMS normalization. It uses a `GroupedRNN` internally.
123
+
124
+ **`__init__` Arguments:**
125
+
126
+ - `n_embd` (`int`): The embedding dimension.
127
+ - `n_layer` (`int`): The number of layers in the RNN.
128
+ - `n_groups` (`int`): The number of parallel, smaller RNNs to use in the `GroupedRNN`. The embedding dimension is split across these groups.
129
+ - `rnn_type` (`str`): The type of RNN cell to use, either `'gru'` or `'lstm'`. Default: `'gru'`.
130
+ - `bidirectional` (`bool`): Whether to use a bidirectional RNN. Default: `False`.
131
+
132
+ -----
133
+
134
+ ### `ConvNeXt1DBlock.py` & `ConvNeXt2DBlock.py`
135
+
136
+ #### `ConvNeXt1DBlock` / `ConvNeXt2DBlock`
137
+
138
+ Implementations of the ConvNeXt block for 1D and 2D data, respectively. This block is a modern, pure-convolutional architecture that adopts design principles from Vision Transformers. It features a depthwise convolution followed by pointwise convolutions (linear layers) in an inverted bottleneck structure. These blocks can be run in `'normal'` mode (downsampling) or `'transposed'` mode (upsampling).
139
+
140
+ **`__init__` Arguments:**
141
+
142
+ - `kernel_size` (`int` or `tuple`): The kernel size for the depthwise convolution.
143
+ - `stride` (`int` or `tuple`): The stride for the convolution, used for down/up-sampling.
144
+ - `input_dim` (`int`): The number of input channels.
145
+ - `output_dim` (`int`): The number of output channels.
146
+ - `mode` (`str`): Operation mode, either `'normal'` for `ConvNd` or `'transposed'` for `ConvTransposeNd`. Default: `'normal'`.
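+
+ For orientation, a minimal 1D sketch of the inverted-bottleneck pattern described above (channel counts, expansion factor, and normalization choice are illustrative; the shipped blocks also handle striding and the `'transposed'` mode):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class TinyConvNeXt1D(nn.Module):
+     # Depthwise conv -> norm -> pointwise expansion -> GELU -> pointwise projection, with a residual.
+     def __init__(self, dim, kernel_size=7, expansion=4):
+         super().__init__()
+         self.dwconv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
+         self.norm = nn.LayerNorm(dim)
+         self.pw1 = nn.Linear(dim, expansion * dim)
+         self.act = nn.GELU()
+         self.pw2 = nn.Linear(expansion * dim, dim)
+
+     def forward(self, x):            # x: [batch, channels, time]
+         residual = x
+         x = self.dwconv(x)
+         x = x.transpose(1, 2)        # [batch, time, channels] for LayerNorm/Linear
+         x = self.pw2(self.act(self.pw1(self.norm(x))))
+         x = x.transpose(1, 2)
+         return x + residual
+
+ block = TinyConvNeXt1D(dim=64)
+ print(block(torch.randn(2, 64, 256)).shape)  # torch.Size([2, 64, 256])
+ ```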
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ einops==0.8.1
+ importlib-metadata==8.0.0
+ jaraco.collections==5.1.0
+ librosa==0.11.0
+ thop==0.1.1.post2209072238
+ tomli==2.0.1
+ torch==2.8.0
+ torchaudio==2.8.0
+ pytorch-lightning