# 🎵 SoulX-Singer-Preprocess
SoulX-Singer-Preprocess is a comprehensive **singing transcription and editing toolkit** for real-world music audio. It covers the full pipeline from vocal extraction to high-level annotation and is optimized for singing voice synthesis (SVS) dataset construction. By integrating state-of-the-art models, it turns raw audio into structured singing data and supports the **customizable creation and editing of lyric-aligned MIDI scores**.
## ✨ Features
The toolkit includes the following core modules:
- 🎤 **Clean Dry Vocal Extraction**
Extracts the lead vocal track from polyphonic music audio and removes reverberation.
- 📝 **Lyrics Transcription**
Automatically transcribes lyrics from the clean vocal.
- 🎶 **Note Transcription**
Converts singing voice into note-level representations for SVS.
- 🎼 **MIDI Editor**
Supports customizable creation and editing of MIDI scores integrated with lyrics.
## 🔧 Python Environment
Before running the pipeline, set up the Python environment as follows:
1. **Install Conda** (if not already installed): https://docs.conda.io/en/latest/miniconda.html
2. **Activate or create a conda environment** (recommended Python 3.10):
- If you already have the `soulxsinger` environment:
```bash
conda activate soulxsinger
```
- Otherwise, create it first:
```bash
conda create -n soulxsinger -y python=3.10
conda activate soulxsinger
```
3. **Install dependencies** from the `preprocess` directory:
```bash
cd preprocess
pip install -r requirements.txt
```
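As a quick sanity check after activating the environment, the following minimal sketch compares the running interpreter against the Python 3.10 recommendation above. The helper name `version_warning` is hypothetical and not part of the toolkit; other versions may well work, so the result is only a hint.

```python
import sys

RECOMMENDED = (3, 10)  # the version recommended in this README


def version_warning(version_info=sys.version_info, recommended=RECOMMENDED):
    """Return a warning string if major.minor differs from the recommendation, else None."""
    current = tuple(version_info[:2])
    if current == tuple(recommended):
        return None
    return (
        f"Python {current[0]}.{current[1]} detected; "
        f"this README recommends {recommended[0]}.{recommended[1]}."
    )


# Print a hint only when the active interpreter differs from the recommendation.
message = version_warning()
if message:
    print(message)
```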
## 📁 Data Preparation
Before running the pipeline, prepare the following inputs:
- **Prompt audio**
Reference audio that provides timbre and style.
- **Target audio**
Original vocal or music audio to be processed and transcribed.
Configure the corresponding parameters in:
```
example/preprocess.sh
```
Typical configuration includes:
- Input / output paths
- Module enable switches
## 🚀 Usage
After configuring `preprocess.sh`, run the transcription pipeline with:
```bash
bash example/preprocess.sh
```
The script will automatically execute the following steps:
1. **Vocal separation and dereverberation**
2. **F0 extraction and voice activity detection (VAD)**
3. **Lyrics transcription**
4. **Note transcription**
---
After the pipeline completes, you will obtain **SoulX-Singer–style metadata** that can be directly used for Singing Voice Synthesis (SVS).
**Output paths:**
- The final metadata (**JSON file**) is written **in the same directory as your input audio**, with the **same filename** (e.g. `audio.mp3` → `audio.json`).
- All **intermediate results** (separated vocal and accompaniment, F0, VAD outputs, etc.) are also saved under the configured **`save_dir`**.
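
The naming rule above can be sketched in a few lines of Python, which may help when checking a batch of inputs. This is an illustration only: the helper names `expected_metadata_path` and `find_missing_metadata` are hypothetical, and the sketch assumes the pipeline replaces only the final file extension with `.json`, as the `audio.mp3` example suggests.

```python
from pathlib import Path


def expected_metadata_path(audio_path):
    """Return the JSON path next to the input audio (same name, .json extension)."""
    return Path(audio_path).with_suffix(".json")


def find_missing_metadata(audio_paths):
    """Return the input audio files whose expected metadata JSON does not exist."""
    return [Path(p) for p in audio_paths if not expected_metadata_path(p).is_file()]
```

For example, `expected_metadata_path("songs/audio.mp3")` points at `songs/audio.json`, and `find_missing_metadata([...])` lists the inputs that still need to be processed or whose run did not finish.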
⚠️ **Important Note**
Transcription errors—especially in **lyrics** and **note annotations**—can significantly affect the final SVS quality. We **strongly recommend manually reviewing and correcting** the generated metadata before inference.
To support this, we provide a **MIDI Editor** for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is:
**Export metadata to MIDI** → edit in the MIDI Editor → **Import edited MIDI back to metadata** for SVS.
---
#### Step 1: Metadata → MIDI (for editing)
Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor:
```bash
preprocess_root=example/transcriptions/music
python -m preprocess.tools.midi_parser \
--meta2midi \
--meta "${preprocess_root}/metadata.json" \
--midi "${preprocess_root}/vocal.mid"
```
#### Step 2: Edit in the MIDI Editor
Open the MIDI Editor (see [MIDI Editor Tutorial](tools/midi_editor/README.md)), load `vocal.mid`, and correct lyrics, pitches, or durations as needed. Save the result as e.g. `vocal_edited.mid`.
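
Before converting the edited file back, it can be worth confirming that it was saved as a well-formed Standard MIDI File. The sketch below reads only the 14-byte `MThd` header chunk defined by the Standard MIDI File format, using the Python standard library. The function names are hypothetical and this check is not part of the toolkit; it does not validate notes or lyrics.

```python
import struct


def parse_midi_header(raw):
    """Parse the first 14 bytes of a Standard MIDI File (the MThd chunk)."""
    if len(raw) < 14:
        raise ValueError("too short to contain a MIDI header chunk")
    # Big-endian: 4-byte chunk id, 32-bit chunk length, then three 16-bit fields.
    chunk_id, length, fmt, n_tracks, division = struct.unpack(">4sIHHH", raw[:14])
    if chunk_id != b"MThd" or length < 6:
        raise ValueError("missing or malformed MThd header chunk")
    # When its top bit is clear, 'division' is the number of ticks per quarter note.
    return {"format": fmt, "tracks": n_tracks, "division": division}


def read_midi_header(path):
    """Read and parse the header chunk of the MIDI file at 'path'."""
    with open(path, "rb") as handle:
        return parse_midi_header(handle.read(14))
```

Calling `read_midi_header("example/transcriptions/music/vocal_edited.mid")` raises `ValueError` if the file is not a MIDI file, and otherwise returns the format type, track count, and time division.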
#### Step 3: MIDI → Metadata (for SoulX-Singer inference)
Convert the edited MIDI back into SoulX-Singer-style metadata (and cut wavs) for SVS:
```bash
python -m preprocess.tools.midi_parser \
--midi2meta \
--midi "${preprocess_root}/vocal_edited.mid" \
--meta "${preprocess_root}/edit_metadata.json" \
--vocal "${preprocess_root}/vocal.wav"
```
Use `edit_metadata.json` (and the wavs under `edit_cut_wavs`) as the target metadata in your inference pipeline.
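
If you prefer to drive the two conversions from Python, the commands from Steps 1 and 3 can be assembled as argument lists and run with `subprocess`. This is a minimal sketch: it uses only the flags shown above, the helper names are hypothetical, and it assumes the commands are run from a directory where `preprocess` is importable as a package (inferred from the `python -m preprocess.tools.midi_parser` invocation).

```python
import subprocess
import sys

MODULE = "preprocess.tools.midi_parser"  # module name taken from the commands above


def meta2midi_cmd(meta, midi):
    """Build the Step 1 command: metadata JSON -> MIDI file."""
    return [sys.executable, "-m", MODULE,
            "--meta2midi", "--meta", str(meta), "--midi", str(midi)]


def midi2meta_cmd(midi, meta, vocal):
    """Build the Step 3 command: edited MIDI -> metadata JSON (plus cut wavs)."""
    return [sys.executable, "-m", MODULE,
            "--midi2meta", "--midi", str(midi),
            "--meta", str(meta), "--vocal", str(vocal)]


def run(cmd):
    """Run one command and raise CalledProcessError if it exits with an error."""
    subprocess.run(cmd, check=True)


# Example (not executed here):
# root = "example/transcriptions/music"
# run(meta2midi_cmd(f"{root}/metadata.json", f"{root}/vocal.mid"))
# ... edit vocal.mid in the MIDI Editor and save it as vocal_edited.mid ...
# run(midi2meta_cmd(f"{root}/vocal_edited.mid", f"{root}/edit_metadata.json", f"{root}/vocal.wav"))
```

Passing arguments as a list avoids shell quoting problems when paths contain spaces.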
## 🔗 References & Dependencies
This project builds upon the following excellent open-source works:
### 🎧 Vocal Separation & Dereverberation
- [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)
- [Lead Vocal Separation](https://huggingface.co/becruily/mel-band-roformer-karaoke)
- [Vocal Dereverberation](https://huggingface.co/anvuew/dereverb_mel_band_roformer)
### 🎼 F0 Extraction
- [RMVPE](https://github.com/Dream-High/RMVPE)
### 📝 Lyrics Transcription (ASR)
- [Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
- [Parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
### 🎶 Note Transcription
- [ROSVOT](https://github.com/RickyL-2000/ROSVOT)
We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit.