# 🎵 SoulX-Singer-Preprocess

This module provides a **singing transcription and editing toolkit** for real-world music audio. It covers the full pipeline from vocal extraction to note-level annotation, optimized for building Singing Voice Synthesis (SVS) datasets. By integrating state-of-the-art models, it transforms raw audio into structured singing data and supports the **customizable creation and editing of lyric-aligned MIDI scores**.


## ✨ Features

The toolkit includes the following core modules:

- 🎤 **Clean Dry Vocal Extraction**  
  Extracts the lead vocal track from polyphonic music audio and applies dereverberation.

- 📝 **Lyrics Transcription**  
  Automatically transcribes lyrics from the clean vocal track.

- 🎶 **Note Transcription**  
  Converts singing voice into note-level representations for SVS.

- 🎼 **MIDI Editor**  
  Supports customizable creation and editing of MIDI scores integrated with lyrics.


## 🔧 Python Environment

Before running the pipeline, set up the Python environment as follows:

1. **Install Conda** (if not already installed): https://docs.conda.io/en/latest/miniconda.html

2. **Activate or create a conda environment** (recommended Python 3.10):

   - If you already have the `soulxsinger` environment:

     ```bash
     conda activate soulxsinger
     ```

   - Otherwise, create it first:

     ```bash
     conda create -n soulxsinger -y python=3.10
     conda activate soulxsinger
     ```

3. **Install dependencies** from the `preprocess` directory:

   ```bash
   cd preprocess
   pip install -r requirements.txt
   ```

## 📁 Data Preparation

Before running the pipeline, prepare the following inputs:

- **Prompt audio**  
  Reference audio that provides timbre and style.

- **Target audio**  
  Original vocal or music audio to be processed and transcribed.

Configure the corresponding parameters in:

```
example/preprocess.sh
```

Typical configuration includes:
- Input / output paths
- Module enable switches

## 🚀 Usage

After configuring `preprocess.sh`, run the transcription pipeline with:

```bash
bash example/preprocess.sh
```

The script will automatically execute the following steps:

1. **Vocal separation and dereverberation**
2. **F0 extraction and voice activity detection (VAD)**
3. **Lyrics transcription**
4. **Note transcription**

---

After the pipeline completes, you will obtain **SoulX-Singer–style metadata** that can be directly used for Singing Voice Synthesis (SVS).

**Output paths:**
- The final metadata (**JSON file**) is written **in the same directory as your input audio**, with the **same base filename** (e.g. `audio.mp3` → `audio.json`).
- All **intermediate results** (separated vocal and accompaniment, F0, VAD outputs, etc.) are saved under the configured **`save_dir`**.
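
As a minimal sketch of the naming rule above, the helper below derives where the metadata file is expected to appear for a given input. The function name `expected_metadata_path` and the example path are hypothetical and not part of the toolkit; the sketch only assumes that the JSON file keeps the audio file's base name and directory, as described above.

```python
from pathlib import Path


def expected_metadata_path(audio_path: str) -> Path:
    """Return the JSON path next to the input audio (same directory, same stem)."""
    # Only the final extension is swapped; the directory and base name are kept.
    return Path(audio_path).with_suffix(".json")


if __name__ == "__main__":
    # Hypothetical input path, used only for illustration.
    meta = expected_metadata_path("example/audio/music.mp3")
    print(meta, "exists" if meta.exists() else "not found yet")
```

This can be handy as a quick check after a batch run to find inputs whose metadata was not produced.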

⚠️ **Important Note**

Transcription errors, especially in **lyrics** and **note annotations**, can significantly degrade the final SVS quality. We **strongly recommend manually reviewing and correcting** the generated metadata before inference.

To support this, we provide a **MIDI Editor** for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is:

**Export metadata to MIDI** → edit in the MIDI Editor → **Import edited MIDI back to metadata** for SVS.
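
Before exporting to MIDI, it can help to get a quick overview of what the generated metadata contains. The sketch below makes no assumption about the metadata schema: it simply walks whatever JSON it finds and lists keys with their value types. The `summarize` helper is hypothetical and not part of the toolkit; the file path is the one used in the export step that follows.

```python
import json
from pathlib import Path


def summarize(node, depth=0, max_depth=3):
    """Return lines describing the keys and value types of a parsed JSON value."""
    indent = "  " * depth
    if isinstance(node, dict):
        lines = []
        for key, value in node.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(summarize(value, depth + 1, max_depth))
        return lines
    if isinstance(node, list) and node:
        # Only the first element is described, to keep the overview short.
        lines = [f"{indent}[{len(node)} items, first shown]"]
        if depth < max_depth:
            lines.extend(summarize(node[0], depth + 1, max_depth))
        return lines
    return []  # scalars and empty lists add no further lines


if __name__ == "__main__":
    meta_path = Path("example/transcriptions/music/metadata.json")
    if meta_path.exists():
        with meta_path.open(encoding="utf-8") as f:
            print("\n".join(summarize(json.load(f))))
    else:
        print(f"{meta_path} not found; run the pipeline first")
```

The overview is only a reading aid; actual corrections are still best made in the MIDI Editor.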

---

#### Step 1: Metadata → MIDI (for editing)

Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor:

```bash
preprocess_root=example/transcriptions/music

python -m preprocess.tools.midi_parser \
    --meta2midi \
    --meta "${preprocess_root}/metadata.json" \
    --midi "${preprocess_root}/vocal.mid"
```

#### Step 2: Edit in the MIDI Editor

Open the MIDI Editor (see [MIDI Editor Tutorial](tools/midi_editor/README.md)), load `vocal.mid`, and correct lyrics, pitches, or durations as needed. Save the result as e.g. `vocal_edited.mid`.

#### Step 3: MIDI → Metadata (for SoulX-Singer inference)

Convert the edited MIDI back into SoulX-Singer-style metadata (and cut wavs) for SVS:

```bash
# Same directory as in Step 1
preprocess_root=example/transcriptions/music

python -m preprocess.tools.midi_parser \
    --midi2meta \
    --midi "${preprocess_root}/vocal_edited.mid" \
    --meta "${preprocess_root}/edit_metadata.json" \
    --vocal "${preprocess_root}/vocal.wav"
```

Use `edit_metadata.json` (and the wavs under `edit_cut_wavs`) as the target metadata in your inference pipeline.
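
If you prefer to drive the round trip from Python (for example, to process several songs), the two commands can be assembled as argument lists. This is a sketch under stated assumptions: the helper names `meta2midi_cmd` and `midi2meta_cmd` are hypothetical, and only the module name, flags, and filenames shown in Steps 1 and 3 are used.

```python
import shlex
import sys
from pathlib import Path

# Module name taken from the commands in Steps 1 and 3.
MODULE = "preprocess.tools.midi_parser"


def meta2midi_cmd(root: Path) -> list[str]:
    """Step 1: metadata.json -> vocal.mid."""
    return [
        sys.executable, "-m", MODULE,
        "--meta2midi",
        "--meta", str(root / "metadata.json"),
        "--midi", str(root / "vocal.mid"),
    ]


def midi2meta_cmd(root: Path) -> list[str]:
    """Step 3: vocal_edited.mid -> edit_metadata.json."""
    return [
        sys.executable, "-m", MODULE,
        "--midi2meta",
        "--midi", str(root / "vocal_edited.mid"),
        "--meta", str(root / "edit_metadata.json"),
        "--vocal", str(root / "vocal.wav"),
    ]


if __name__ == "__main__":
    root = Path("example/transcriptions/music")
    # Print the commands instead of running them; nothing is executed here.
    for cmd in (meta2midi_cmd(root), midi2meta_cmd(root)):
        print(shlex.join(cmd))
```

To actually run a step, pass the list to `subprocess.run(cmd, check=True)` from the same working directory you would use for the shell commands above.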


## 🔗 References & Dependencies

This project builds upon the following excellent open-source works:

### 🎧 Vocal Separation & Dereverberation
- [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)
- [Lead Vocal Separation](https://huggingface.co/becruily/mel-band-roformer-karaoke)
- [Vocal Dereverberation](https://huggingface.co/anvuew/dereverb_mel_band_roformer)

### 🎼 F0 Extraction
- [RMVPE](https://github.com/Dream-High/RMVPE)

### 📝 Lyrics Transcription (ASR)
- [Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
- [Parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)

### 🎶 Note Transcription
- [ROSVOT](https://github.com/RickyL-2000/ROSVOT)

We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit.