---
tags:
- music-structure-annotation
- transformer
---

<p align="center">
  <img src="https://github.com/ASLP-lab/SongFormer/blob/main/figs/logo.png?raw=true" width="50%" />
</p>
<h1 align="center">SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision</h1>

<div align="center">

[Paper](https://arxiv.org/abs/2510.02797) ·
[Code](https://github.com/ASLP-lab/SongFormer) ·
[Demo](https://huggingface.co/spaces/ASLP-lab/SongFormer) ·
[Model](https://huggingface.co/ASLP-lab/SongFormer) ·
[SongFormDB](https://huggingface.co/datasets/ASLP-lab/SongFormDB) ·
[SongFormBench](https://huggingface.co/datasets/ASLP-lab/SongFormBench) ·
[Discord](https://discord.gg/p5uBryC4Zs) ·
[ASLP Lab](http://www.npu-aslp.org/)

</div>

<div align="center">
<h3>
Chunbo Hao<sup>1*</sup>, Ruibin Yuan<sup>2,5*</sup>, Jixun Yao<sup>1</sup>, Qixin Deng<sup>3,5</sup>,<br>Xinyi Bai<sup>4,5</sup>, Wei Xue<sup>2</sup>, Lei Xie<sup>1†</sup>
</h3>

<p>
<sup>*</sup>Equal contribution <sup>†</sup>Corresponding author
</p>

<p>
<sup>1</sup>Audio, Speech and Language Processing Group (ASLP@NPU),<br>Northwestern Polytechnical University<br>
<sup>2</sup>Hong Kong University of Science and Technology<br>
<sup>3</sup>Northwestern University<br>
<sup>4</sup>Cornell University<br>
<sup>5</sup>Multimodal Art Projection (M-A-P)
</p>
</div>

---

SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by SongFormDB, a large-scale multilingual dataset, and SongFormBench, a high-quality benchmark, to foster fair and reproducible research.

For a more detailed deployment guide, please refer to the [GitHub repository](https://github.com/ASLP-lab/SongFormer/).

## 🚀 QuickStart

### Prerequisites

Before running the model, follow the instructions in the [GitHub repository](https://github.com/ASLP-lab/SongFormer/) to set up the required **Python environment**.

---

### Input: Audio File Path

You can perform inference by providing the path to an audio file:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os

# Download the model from Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Add the local directory to the import path and set the environment variable
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Set device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Run inference
result = songformer("path/to/audio/file.wav")
```
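
Continuing from the snippet above, you can iterate over `result` directly; each entry follows the format described under "Output Format" below:

```python
# Print each predicted segment as "start - end  label" (times in seconds)
for segment in result:
    print(f"{segment['start']:7.2f}s - {segment['end']:7.2f}s  {segment['label']}")
```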

---

### Input: Tensor or NumPy Array

Alternatively, you can directly feed a raw audio waveform as a NumPy array or PyTorch tensor:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os
import numpy as np

# Download model
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Setup environment
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Configure device
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Generate dummy audio input (sampling rate: 24,000 Hz, e.g., 60 seconds of audio)
audio = np.random.randn(24000 * 60).astype(np.float32)

# Perform inference
result = songformer(audio)
```
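
The same call also accepts a PyTorch tensor; a minimal sketch reusing the `songformer` model and `audio` array from above:

```python
import torch

# A mono 24 kHz float32 waveform as a tensor works the same way
audio_tensor = torch.from_numpy(audio)
result = songformer(audio_tensor)
```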

> ⚠️ **Note:** The expected sampling rate for input audio is **24,000 Hz**.
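
If your source audio is at a different sampling rate, resample it before inference. A minimal sketch using `librosa` (an illustrative choice; SongFormer does not require librosa, and any resampler works):

```python
import librosa

# librosa resamples on load and returns float32; SongFormer expects 24,000 Hz mono
audio, sr = librosa.load("path/to/audio/file.mp3", sr=24000, mono=True)
result = songformer(audio)
```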

---

### Output Format

The model returns a structured list of segment predictions, with each entry containing timing and label information:

```json
[
  {
    "start": 0.0,     // Start time of segment (in seconds)
    "end": 15.2,      // End time of segment (in seconds)
    "label": "verse"  // Predicted segment label
  },
  ...
]
```
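
Since `result` is a plain Python list in the format shown above (an assumption based on the output description), it can be serialized directly; for example:

```python
import json

# Persist the predicted segments for later evaluation or visualization
with open("segments.json", "w") as f:
    json.dump(result, f, indent=2)
```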

## 🔧 Notes

- The initialization logic of **MusicFM** has been modified to eliminate the need for loading checkpoint files during instantiation, improving both reliability and startup efficiency.

## 📚 Citation

If you use **SongFormer** in your research or application, please cite our work:

```bibtex
@misc{hao2025songformer,
  title         = {SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision},
  author        = {Chunbo Hao and Ruibin Yuan and Jixun Yao and Qixin Deng and Xinyi Bai and Wei Xue and Lei Xie},
  year          = {2025},
  eprint        = {2510.02797},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2510.02797}
}
```