🎵 SoulX-Singer-Preprocess

SoulX-Singer-Preprocess is a comprehensive singing transcription and editing toolkit for real-world music audio. It provides a pipeline from vocal extraction to high-level annotation, optimized for Singing Voice Synthesis (SVS) dataset construction. By integrating state-of-the-art models, it transforms raw audio into structured singing data and supports the customizable creation and editing of lyric-aligned MIDI scores.

✨ Features

The toolkit includes the following core modules:

  • 🎤 Clean Dry Vocal Extraction
    Extracts the lead vocal track from polyphonic music audio and applies dereverberation.

  • 📝 Lyrics Transcription
    Automatically transcribes lyrics from the clean vocal track.

  • 🎶 Note Transcription
    Converts the singing voice into note-level representations for SVS.

  • 🎼 MIDI Editor
    Supports customizable creation and editing of MIDI scores integrated with lyrics.

🔧 Python Environment

Before running the pipeline, set up the Python environment as follows:

  1. Install Conda (if not already installed): https://docs.conda.io/en/latest/miniconda.html

  2. Activate or create a conda environment (Python 3.10 recommended):

    • If you already have the soulxsinger environment:

      conda activate soulxsinger
      
    • Otherwise, create it first:

      conda create -n soulxsinger -y python=3.10
      conda activate soulxsinger
      
  3. Install dependencies from the preprocess directory:

    cd preprocess
    pip install -r requirements.txt
    

📁 Data Preparation

Before running the pipeline, prepare the following inputs:

  • Prompt audio
    Reference audio that provides the timbre and style.

  • Target audio
    Original vocal or music audio to be processed and transcribed.

Configure the corresponding parameters in:

example/preprocess.sh

Typical configuration includes:

  • Input / output paths
  • Module enable switches
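The exact settings live in example/preprocess.sh and are project-specific; the sketch below only illustrates the kind of configuration involved. All variable names here are hypothetical, not the script's actual variables:

```shell
# Hypothetical sketch of example/preprocess.sh settings -- check the
# shipped script for the real variable names.
prompt_audio=example/audio/prompt.wav   # timbre/style reference
target_audio=example/audio/target.mp3   # audio to process and transcribe
save_dir=example/transcriptions/music   # where intermediate/final outputs go

# Module enable switches (one per pipeline stage)
do_separation=true    # vocal separation + dereverberation
do_f0_vad=true        # F0 extraction + VAD
do_lyrics=true        # lyrics transcription
do_notes=true         # note transcription
```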

🚀 Usage

After configuring preprocess.sh, run the transcription pipeline with:

bash example/preprocess.sh
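Since the pipeline runs several models in sequence, it can be convenient to keep a log of the run. This is an optional, standard-shell pattern, not something the script requires:

```shell
# Run the pipeline and keep a copy of all console output in a log file.
bash example/preprocess.sh 2>&1 | tee preprocess.log
```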

The script will automatically execute the following steps:

  1. Vocal separation and dereverberation
  2. F0 extraction and voice activity detection (VAD)
  3. Lyrics transcription
  4. Note transcription

After the pipeline completes, you will obtain SoulX-Singer–style metadata that can be directly used for Singing Voice Synthesis (SVS).

Output paths:

  • The final metadata (JSON file) is written in the same directory as your input audio, with the same filename (e.g. audio.mp3 → audio.json)
  • All intermediate results (separated vocal and accompaniment, F0, VAD outputs, etc.) are also saved under the configured save_dir.
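Because the metadata JSON shares the input file's directory and basename, the expected path can be derived with plain shell parameter expansion. A small sketch (the filenames here are just examples):

```shell
# Derive the expected metadata path for a given input audio file:
# same directory, same basename, extension replaced with .json.
input="example/transcriptions/music/audio.mp3"
meta="${input%.*}.json"   # strip the audio extension, append .json
echo "$meta"              # example/transcriptions/music/audio.json
```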

⚠️ Important Note

Transcription errors—especially in lyrics and note annotations—can significantly affect the final SVS quality. We strongly recommend manually reviewing and correcting the generated metadata before inference.

To support this, we provide a MIDI Editor for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is:

Export metadata to MIDI → edit in the MIDI Editor → import the edited MIDI back to metadata for SVS.


Step 1: Metadata → MIDI (for editing)

Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor:

preprocess_root=example/transcriptions/music

python -m preprocess.tools.midi_parser \
    --meta2midi \
    --meta "${preprocess_root}/metadata.json" \
    --midi "${preprocess_root}/vocal.mid"

Step 2: Edit in the MIDI Editor

Open the MIDI Editor (see MIDI Editor Tutorial), load vocal.mid, and correct lyrics, pitches, or durations as needed. Save the result as e.g. vocal_edited.mid.

Step 3: MIDI → Metadata (for SoulX-Singer inference)

Convert the edited MIDI back into SoulX-Singer-style metadata (and cut wavs) for SVS:

python -m preprocess.tools.midi_parser \
    --midi2meta \
    --midi "${preprocess_root}/vocal_edited.mid" \
    --meta "${preprocess_root}/edit_metadata.json" \
    --vocal "${preprocess_root}/vocal.wav"

Use edit_metadata.json (and the wavs under edit_cut_wavs) as the target metadata in your inference pipeline.

🔗 References & Dependencies

This project builds upon the following excellent open-source works:

🎧 Vocal Separation & Dereverberation

🎼 F0 Extraction

📝 Lyrics Transcription (ASR)

🎶 Note Transcription

We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit.