🎵 SoulX-Singer-Preprocess

This component provides a comprehensive singing transcription and editing toolkit for real-world music audio, covering the full pipeline from vocal extraction to high-level annotation, optimized for Singing Voice Synthesis (SVS) dataset construction. By integrating state-of-the-art models, it transforms raw audio into structured singing data and supports the customizable creation and editing of lyric-aligned MIDI scores.

✨ Features

The toolkit includes the following core modules:

  • 🎤 Clean Dry Vocal Extraction
    Extracts the lead vocal track from polyphonic music audio and removes reverberation.

  • πŸ“ Lyrics Transcription
    Automatically transcribes lyrics from clean vocal.

  • 🎶 Note Transcription
    Converts the singing voice into note-level representations for SVS.

  • 🎼 MIDI Editor
    Supports customizable creation and editing of MIDI scores integrated with lyrics.

🔧 Python Environment

Before running the pipeline, set up the Python environment as follows:

  1. Install Conda (if not already installed): https://docs.conda.io/en/latest/miniconda.html

  2. Activate or create a conda environment (Python 3.10 recommended):

    • If you already have the soulxsinger environment:

      conda activate soulxsinger
      
    • Otherwise, create it first:

      conda create -n soulxsinger -y python=3.10
      conda activate soulxsinger
      
  3. Install dependencies from the preprocess directory:

    cd preprocess
    pip install -r requirements.txt
    

πŸ“ Data Preparation

Before running the pipeline, prepare the following inputs:

  • Prompt audio
    Reference audio that provides the timbre and style.

  • Target audio
    Original vocal or music audio to be processed and transcribed.

Configure the corresponding parameters in:

example/preprocess.sh

Typical configuration includes:

  • Input / output paths
  • Module enable switches
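As an illustrative sketch only, a configuration of this shape might look as follows. The variable names below are placeholders, not the actual contents of example/preprocess.sh; check them against the shipped script.

```shell
# Hypothetical sketch of example/preprocess.sh settings; the real
# variable names in the shipped script may differ.
prompt_audio=example/prompt/prompt.wav   # reference audio (timbre / style)
target_audio=example/target/audio.mp3    # audio to be processed and transcribed
save_dir=example/transcriptions/music    # where intermediate results are saved

# Module enable switches (1 = run the step, 0 = skip it)
do_separation=1   # vocal separation and dereverberation
do_f0_vad=1       # F0 extraction and VAD
do_lyrics=1       # lyrics transcription
do_notes=1        # note transcription
```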

🚀 Usage

After configuring preprocess.sh, run the transcription pipeline with:

bash example/preprocess.sh

The script will automatically execute the following steps:

  1. Vocal separation and dereverberation
  2. F0 extraction and voice activity detection (VAD)
  3. Lyrics transcription
  4. Note transcription

After the pipeline completes, you will obtain SoulX-Singer-style metadata that can be used directly for Singing Voice Synthesis (SVS).

Output paths:

  • The final metadata (JSON file) is written to the same directory as your input audio, with the same filename (e.g. audio.mp3 → audio.json).
  • All intermediate results (separated vocal and accompaniment, F0, VAD outputs, etc.) are also saved under the configured save_dir.
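The filename convention above can be sketched in shell: the metadata path is just the input audio path with its extension swapped for .json (the songs/ directory here is illustrative).

```shell
# Derive the metadata path from the input audio path.
audio="songs/audio.mp3"
meta="${audio%.*}.json"   # strip the extension, append .json; dir and stem stay the same
echo "$meta"              # prints songs/audio.json
```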

⚠️ Important Note

Transcription errors, especially in lyrics and note annotations, can significantly affect the final SVS quality. We strongly recommend manually reviewing and correcting the generated metadata before inference.

To support this, we provide a MIDI Editor for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is:

Export metadata to MIDI → edit in the MIDI Editor → import the edited MIDI back to metadata for SVS.


Step 1: Metadata β†’ MIDI (for editing)

Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor:

preprocess_root=example/transcriptions/music

python -m preprocess.tools.midi_parser \
    --meta2midi \
    --meta "${preprocess_root}/metadata.json" \
    --midi "${preprocess_root}/vocal.mid"

Step 2: Edit in the MIDI Editor

Open the MIDI Editor (see MIDI Editor Tutorial), load vocal.mid, and correct lyrics, pitches, or durations as needed. Save the result as e.g. vocal_edited.mid.

Step 3: MIDI β†’ Metadata (for SoulX-Singer inference)

Convert the edited MIDI back into SoulX-Singer-style metadata (and cut wavs) for SVS:

python -m preprocess.tools.midi_parser \
    --midi2meta \
    --midi "${preprocess_root}/vocal_edited.mid" \
    --meta "${preprocess_root}/edit_metadata.json" \
    --vocal "${preprocess_root}/vocal.wav"

Use edit_metadata.json (and the wavs under edit_cut_wavs) as the target metadata in your inference pipeline.

🔗 References & Dependencies

This project builds upon the following excellent open-source works:

🎧 Vocal Separation & Dereverberation

🎼 F0 Extraction

📝 Lyrics Transcription (ASR)

🎶 Note Transcription

We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit.