# Audio Source Separation with Time-Frequency Sequence Attention Res-U-Net (DCASE 2025)
This repository contains an implementation that replicates the architecture described in "TFSWA-ResUNet: music source separation with time-frequency sequence and shifted window attention-based ResUNet".
Instead of music source separation, this implementation adapts the model for Sound Event Separation using a subset of the DCASE 2025 Task 4 dataset. The entire training, validation, and testing pipeline is contained within a single Jupyter notebook.
## Features

### Architecture
- Res-U-Net with integrated Time-Frequency Sequence Attention (TF-SA) and Shifted Window Attention
- Task: Separating overlapping sound events in domestic environments
- Input: Magnitude spectrograms of mixed audio (32 kHz sampling rate)
- Output: Estimated spectrograms of specific sound classes
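As a rough illustration of the input representation, the magnitude spectrogram of a mixture can be computed with a short-time Fourier transform. The frame size and hop length below are illustrative assumptions, not the notebook's actual settings:

```python
import numpy as np

sr = 32000            # project sampling rate
n_fft, hop = 1024, 320  # assumed STFT parameters, for illustration only

# Synthetic 1-second "mixture" standing in for a DCASE clip
mixture = np.random.randn(sr).astype(np.float32)

# Frame the signal, window each frame, and take the magnitude of its FFT
n_frames = 1 + (len(mixture) - n_fft) // hop
frames = np.stack([mixture[i * hop : i * hop + n_fft] for i in range(n_frames)])
window = np.hanning(n_fft)
magnitude = np.abs(np.fft.rfft(frames * window, axis=1)).T  # (freq_bins, n_frames)
print(magnitude.shape)  # (513, 97)
```

The model consumes this (frequency × time) magnitude array; in practice the notebook's STFT settings (and whether it uses librosa's centered framing) determine the exact shape.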
## Project Structure
```
Audio-Separation-ResUNet-TF-Attention/
├── TF_SA_ResUNet.ipynb   # Main notebook containing model, training, and inference
└── README.md             # Project documentation
```
## Dataset
This project uses a custom subset of the DCASE 2025 Task 4 dataset, reduced to facilitate efficient training while maintaining task complexity.
### Dataset Statistics
- Total Samples: 10,000
- Configuration: 3 overlapping events per mixture
- Classes: 5 target sound classes
- Sampling Rate: 32 kHz
### Access the Dataset
- Hugging Face
- Kaggle
## Installation & Usage
### 1. Clone the Repository
```bash
git clone https://github.com/kiuyha/Audio-Separation-ResUNet-TF-Attention.git
cd Audio-Separation-ResUNet-TF-Attention
```
### 2. Open the Notebook
This project is designed to run in Google Colab or a local Jupyter environment. All necessary dependencies are installed directly within the notebook cells.
- Open `TF_SA_ResUNet.ipynb`
- Ensure you have a GPU runtime enabled for training
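A quick way to confirm the runtime actually exposes a GPU before starting training (this is the standard PyTorch check, not code from the notebook):

```python
import torch

# Pick the GPU if the Colab/Jupyter runtime exposes one, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training will run on: {device}")
```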
### 3. Dependencies
The code relies on standard deep learning and audio libraries:
- Python 3.8+
- PyTorch
- Librosa
- NumPy
- Matplotlib
- Soundfile
All dependencies are automatically installed when running the notebook cells.
## Model Weights
Pre-trained model weights are hosted on Hugging Face.
### How to Load Weights
- Download the `.pth` file from the link above
- Place it in the root directory of the project (or upload it to your Colab session)
- Run the inference cell in the notebook to load the state dictionary
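Loading a `.pth` state dictionary follows the standard PyTorch pattern. The model class below is a small stand-in for illustration; the real TF-SA-ResUNet definition and checkpoint filename live in `TF_SA_ResUNet.ipynb`:

```python
import torch
import torch.nn as nn

# Stand-in for the notebook's model class (the actual TF-SA-ResUNet
# architecture is defined inside TF_SA_ResUNet.ipynb).
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

# Save a checkpoint the same way the notebook would...
model = TinyModel()
torch.save(model.state_dict(), "checkpoint.pth")

# ...then restore it into a fresh instance for inference
restored = TinyModel()
restored.load_state_dict(torch.load("checkpoint.pth", map_location="cpu"))
restored.eval()  # switch to inference mode, as in the notebook's inference cell
```

`map_location="cpu"` lets a checkpoint trained on a Colab GPU load on a CPU-only machine.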
## Evaluation
The model is evaluated using the DCASE Task 4 metric CA-SDRi (Class-Aware Signal-to-Distortion Ratio improvement).
### Results
| Model Variant | CA-SDRi (dB) |
|---|---|
| ResUNet (Baseline) | 3.15857 |
| ResUNet + SpecAugment | 2.95301 |
| TF-SA-ResUNet | 5.25322 |
| TF-SA-ResUNet + SpecAugment | 4.66175 |
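For intuition, the SDR improvement compares the SDR of each class estimate against the SDR of the unprocessed mixture, averaged over the target classes. The sketch below is a simplified NumPy illustration of that idea; the official DCASE CA-SDRi implementation additionally handles inactive classes and other edge cases:

```python
import numpy as np

def sdr(ref, est, eps=1e-8):
    """Signal-to-distortion ratio in dB between a reference and an estimate."""
    return 10 * np.log10((np.sum(ref**2) + eps) / (np.sum((ref - est) ** 2) + eps))

def ca_sdri(refs, ests, mixture):
    """Simplified class-aware SDR improvement: mean over target classes of
    SDR(estimate) minus SDR(mixture) against each class reference."""
    return float(np.mean([sdr(r, e) - sdr(r, mixture) for r, e in zip(refs, ests)]))

rng = np.random.default_rng(0)
a, b = rng.standard_normal(32000), rng.standard_normal(32000)
mix = a + b

# A perfect separator recovers each source exactly -> large positive improvement;
# returning the mixture unchanged gives an improvement of exactly 0 dB.
print(ca_sdri([a, b], [a, b], mix))
print(ca_sdri([a, b], [mix, mix], mix))
```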
### Resources
- Read the Report: https://drive.google.com/file/d/1tsKs-xcIF_9E1K_2pLuiPkUknKcop8ik/view
- Code: https://github.com/kiuyha/Audio-Separation-ResUNet-TF-Attention
## Citation
If you use this implementation in your research, please cite the original paper:
```bibtex
@article{kong2024tfswa,
  title={TFSWA-ResUNet: music source separation with time-frequency sequence and shifted window attention-based ResUNet},
  author={Kong, Q. and Cao, Y. and Liu, H. and Doi, K. and Iqbal, T.},
  journal={Complex \& Intelligent Systems},
  volume={10},
  pages={1--17},
  year={2024},
  publisher={Springer}
}
```
Paper Link: TFSWA-ResUNet on Springer
## License
This project is licensed under the MIT License. See the LICENSE file for details.
## Contact
For questions or issues, please open an issue on GitHub or contact the repository maintainer.
## Acknowledgments
- DCASE 2025 Task 4 organizers for providing the dataset framework
- Original authors of the TFSWA-ResUNet architecture
- The open-source audio processing community