Commit 523abf4 (verified), parent b0d4960, by ASLP-lab: Update README.md
Files changed: README.md (+155, -6)
tags:
- transformer
---

<p align="center">
  <img src="https://github.com/ASLP-lab/SongFormer/blob/main/figs/logo.png?raw=true" width="50%" />
</p>

# SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

![Python](https://img.shields.io/badge/Python-3.10-brightgreen)
![License](https://img.shields.io/badge/License-CC%20BY%204.0-lightblue)
[![arXiv Paper](https://img.shields.io/badge/arXiv-2510.02797-blue)](https://arxiv.org/abs/2510.02797)
[![GitHub](https://img.shields.io/badge/GitHub-SongFormer-black)](https://github.com/ASLP-lab/SongFormer)
[![HuggingFace Space](https://img.shields.io/badge/HuggingFace-space-yellow)](https://huggingface.co/spaces/ASLP-lab/SongFormer)
[![HuggingFace Model](https://img.shields.io/badge/HuggingFace-model-blue)](https://huggingface.co/ASLP-lab/SongFormer)
[![Dataset SongFormDB](https://img.shields.io/badge/HF%20Dataset-SongFormDB-green)](https://huggingface.co/datasets/ASLP-lab/SongFormDB)
[![Dataset SongFormBench](https://img.shields.io/badge/HF%20Dataset-SongFormBench-orange)](https://huggingface.co/datasets/ASLP-lab/SongFormBench)
[![Discord](https://img.shields.io/badge/Discord-join%20us-purple?logo=discord&logoColor=white)](https://discord.gg/p5uBryC4Zs)
[![lab](https://img.shields.io/badge/🏫-ASLP-grey?labelColor=lightgrey)](http://www.npu-aslp.org/)

Chunbo Hao<sup>&ast;</sup>, Ruibin Yuan<sup>&ast;</sup>, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, Lei Xie<sup>&dagger;</sup>

----

SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by the large-scale multilingual dataset SongFormDB and the high-quality benchmark SongFormBench, which together foster fair and reproducible research.

![](https://github.com/ASLP-lab/SongFormer/blob/main/figs/songformer.png?raw=true)

For a more detailed deployment guide, please refer to the [GitHub repository](https://github.com/ASLP-lab/SongFormer/).

## 🚀 QuickStart

### Prerequisites

Before running the model, follow the instructions in the [GitHub repository](https://github.com/ASLP-lab/SongFormer/) to set up the required **Python environment**.

---

### Input: Audio File Path

You can run inference by providing the path to an audio file:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os

# Download the model repository from the Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Make the downloaded code importable and point the model at it
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Move to the target device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Run inference
result = songformer("path/to/audio/file.wav")
```

---
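Once inference completes, `result` can be post-processed however you like. As a small, hypothetical sketch (assuming the list of `{"start", "end", "label"}` dicts described under "Output Format"; the helper names below are illustrative, not part of the SongFormer API), this pretty-prints each predicted segment:

```python
# Hypothetical helpers (not part of the SongFormer API): pretty-print
# segment predictions of the form {"start": float, "end": float, "label": str}.

def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.s."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:04.1f}"

def format_segments(segments: list[dict]) -> list[str]:
    """One human-readable line per predicted segment."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s['label']}"
        for s in segments
    ]

# Dummy predictions standing in for `result`:
demo = [
    {"start": 0.0, "end": 15.2, "label": "intro"},
    {"start": 15.2, "end": 47.8, "label": "verse"},
]
for line in format_segments(demo):
    print(line)  # e.g. [00:00.0 - 00:15.2] intro
```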

### Input: Tensor or NumPy Array

Alternatively, you can feed a raw audio waveform directly as a NumPy array or PyTorch tensor:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os
import numpy as np

# Download the model repository
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Set up the environment
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Configure the device
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Generate a dummy audio input (sampling rate: 24,000 Hz; here, 60 seconds)
audio = np.random.randn(24000 * 60).astype(np.float32)

# Perform inference
result = songformer(audio)
```

> ⚠️ **Note:** The expected sampling rate for input audio is **24,000 Hz**.

---
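Audio recorded at another sampling rate must be resampled to 24,000 Hz before being passed to the model. In practice you would use a proper resampler such as `librosa.resample` or `torchaudio.transforms.Resample`; the NumPy linear-interpolation sketch below is only a dependency-free illustration of the idea, not production-quality resampling:

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int = 24000) -> np.ndarray:
    """Naive linear-interpolation resampler (illustrative only;
    prefer librosa or torchaudio for real audio)."""
    if sr_in == sr_out:
        return audio.astype(np.float32)
    duration = len(audio) / sr_in
    n_out = int(round(duration * sr_out))
    t_in = np.arange(len(audio)) / sr_in    # input sample times (seconds)
    t_out = np.arange(n_out) / sr_out       # output sample times (seconds)
    return np.interp(t_out, t_in, audio).astype(np.float32)

# 1 second of dummy 44.1 kHz audio -> 24,000 samples at 24 kHz
audio_44k = np.random.randn(44100).astype(np.float32)
audio_24k = resample_linear(audio_44k, 44100)
print(audio_24k.shape)  # (24000,)
```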

### Output Format

The model returns a structured list of segment predictions, with each entry containing timing and label information:

```json
[
  {
    "start": 0.0,     // Segment start time (seconds)
    "end": 15.2,      // Segment end time (seconds)
    "label": "verse"  // Predicted segment label
  },
  ...
]
```
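This segment list is straightforward to aggregate. As a hypothetical example (the helper below is not part of the SongFormer API), you could sum the total duration of each predicted label across a song:

```python
from collections import defaultdict

# Hypothetical helper (not part of the SongFormer API):
# total time in seconds spent in each predicted structure label.
def label_durations(segments: list[dict]) -> dict:
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["label"]] += seg["end"] - seg["start"]
    # Round to avoid floating-point noise in the report
    return {label: round(secs, 2) for label, secs in totals.items()}

# Dummy predictions in the documented output format:
demo = [
    {"start": 0.0, "end": 15.2, "label": "verse"},
    {"start": 15.2, "end": 30.0, "label": "chorus"},
    {"start": 30.0, "end": 45.2, "label": "verse"},
]
print(label_durations(demo))  # {'verse': 30.4, 'chorus': 14.8}
```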

## 🔧 Notes

- The initialization logic of **MusicFM** has been modified to eliminate the need for loading checkpoint files during instantiation, improving both reliability and startup efficiency.

## 📚 Citation

If you use **SongFormer** in your research or application, please cite our work:

```bibtex
@misc{hao2025songformer,
  title         = {SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision},
  author        = {Chunbo Hao and Ruibin Yuan and Jixun Yao and Qixin Deng and Xinyi Bai and Wei Xue and Lei Xie},
  year          = {2025},
  eprint        = {2510.02797},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2510.02797}
}
```