SoulX-Singer

Running on Zero

File size: 5,304 Bytes

c7f3ffb

# 🎵 SoulX-Singer-Preprocess

This part offers a comprehensive **singing transcription and editing toolkit** for real-world music audio. It provides the pipeline from vocal extraction to high-level annotation optimized for SVS dataset construction. By integrating state-of-the-art models, it transforms raw audio into structured singing data and supports the **customizable creation and editing of lyric-aligned MIDI scores**.


## ✨ Features

The toolkit includes the following core modules:

- 🎤 **Clean Dry Vocal Extraction**  
  Extracts the lead vocal track from polyphonic music audio and dereverberation.

- 📝 **Lyrics Transcription**  
  Automatically transcribes lyrics from clean vocal.

- 🎶 **Note Transcription**  
  Converts singing voice into note-level representations for SVS.

- 🎼 **MIDI Editor**  
  Supports customizable creation and editing of MIDI scores integrated with lyrics.


## 🔧 Python Environment

Before running the pipeline, set up the Python environment as follows:

1. **Install Conda** (if not already installed): https://docs.conda.io/en/latest/miniconda.html

2. **Activate or create a conda environment** (recommended Python 3.10):

   - If you already have the `soulxsinger` environment:

     ```bash
     conda activate soulxsinger
     ```

   - Otherwise, create it first:

     ```bash
     conda create -n soulxsinger -y python=3.10
     conda activate soulxsinger
     ```

3. **Install dependencies** from the `preprocess` directory:

   ```bash
   cd preprocess
   pip install -r requirements.txt
   ```

## 📁 Data Preparation

Before running the pipeline, prepare the following inputs:

- **Prompt audio**  
  Reference audio that provides timbre and style

- **Target audio**  
  Original vocal or music audio to be processed and transcribed.

Configure the corresponding parameters in:

```
example/preprocess.sh
```

Typical configuration includes:
- Input / output paths
- Module enable switches

## 🚀 Usage

After configuring `preprocess.sh`, run the transcription pipeline with:

```bash
bash example/preprocess.sh
```

The script will automatically execute the following steps:

1. **Vocal separation and dereverberation**
2. **F0 extraction and voice activity detection (VAD)**
3. **Lyrics transcription**
4. **Note transcription**

---

After the pipeline completes, you will obtain **SoulX-Singer–style metadata** that can be directly used for Singing Voice Synthesis (SVS).

**Output paths:**
- The final metadata (**JSON file**) is written **in the same directory as your input audio**, with the **same filename** (e.g. `audio.mp3` → `audio.json`)
- All **intermediate results** (separated vocal and accompaniment, F0, VAD outputs, etc.) are also saved under the configured **`save_dir`**.

⚠️ **Important Note**

Transcription errors—especially in **lyrics** and **note annotations**—can significantly affect the final SVS quality. We **strongly recommend manually reviewing and correcting** the generated metadata before inference.

To support this, we provide a **MIDI Editor** for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is:

**Export metadata to MIDI** → edit in the MIDI Editor → **Import edited MIDI back to metadata** for SVS.

---

#### Step 1: Metadata → MIDI (for editing)

Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor:

```bash
preprocess_root=example/transcriptions/music

python -m preprocess.tools.midi_parser \
    --meta2midi \
    --meta "${preprocess_root}/metadata.json" \
    --midi "${preprocess_root}/vocal.mid"
```

#### Step 2: Edit in the MIDI Editor

Open the MIDI Editor (see [MIDI Editor Tutorial](tools/midi_editor/README.md)), load `vocal.mid`, and correct lyrics, pitches, or durations as needed. Save the result as e.g. `vocal_edited.mid`.

#### Step 3: MIDI → Metadata (for SoulX-Singer inference)

Convert the edited MIDI back into SoulX-Singer-style metadata (and cut wavs) for SVS:

```bash
python -m preprocess.tools.midi_parser \
    --midi2meta \
    --midi "${preprocess_root}/vocal_edited.mid" \
    --meta "${preprocess_root}/edit_metadata.json" \
    --vocal "${preprocess_root}/vocal.wav" \
```

Use `edit_metadata.json` (and the wavs under `edit_cut_wavs`) as the target metadata in your inference pipeline.


## 🔗 References & Dependencies

This project builds upon the following excellent open-source works:

### 🎧 Vocal Separation & Dereverberation
- [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)
- [Lead Vocal Separation](https://huggingface.co/becruily/mel-band-roformer-karaoke)
- [Vocal Dereverberation](https://huggingface.co/anvuew/dereverb_mel_band_roformer)

### 🎼 F0 Extraction
- [RMVPE](https://github.com/Dream-High/RMVPE)

### 📝 Lyrics Transcription (ASR)
- [Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
- [Parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)

### 🎶 Note Transcription
- [ROSVOT](https://github.com/RickyL-2000/ROSVOT)

We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit.