Spaces:
Running
on
Zero
Running
on
Zero
| # 🎵 SoulX-Singer-Preprocess | |
| This part offers a comprehensive **singing transcription and editing toolkit** for real-world music audio. It provides the pipeline from vocal extraction to high-level annotation optimized for SVS dataset construction. By integrating state-of-the-art models, it transforms raw audio into structured singing data and supports the **customizable creation and editing of lyric-aligned MIDI scores**. | |
| ## ✨ Features | |
| The toolkit includes the following core modules: | |
| - 🎤 **Clean Dry Vocal Extraction** | |
| Extracts the lead vocal track from polyphonic music audio and dereverberation. | |
| - 📝 **Lyrics Transcription** | |
| Automatically transcribes lyrics from clean vocal. | |
| - 🎶 **Note Transcription** | |
| Converts singing voice into note-level representations for SVS. | |
| - 🎼 **MIDI Editor** | |
| Supports customizable creation and editing of MIDI scores integrated with lyrics. | |
| ## 🔧 Python Environment | |
| Before running the pipeline, set up the Python environment as follows: | |
| 1. **Install Conda** (if not already installed): https://docs.conda.io/en/latest/miniconda.html | |
| 2. **Activate or create a conda environment** (recommended Python 3.10): | |
| - If you already have the `soulxsinger` environment: | |
| ```bash | |
| conda activate soulxsinger | |
| ``` | |
| - Otherwise, create it first: | |
| ```bash | |
| conda create -n soulxsinger -y python=3.10 | |
| conda activate soulxsinger | |
| ``` | |
| 3. **Install dependencies** from the `preprocess` directory: | |
| ```bash | |
| cd preprocess | |
| pip install -r requirements.txt | |
| ``` | |
| ## 📁 Data Preparation | |
| Before running the pipeline, prepare the following inputs: | |
| - **Prompt audio** | |
| Reference audio that provides timbre and style | |
| - **Target audio** | |
| Original vocal or music audio to be processed and transcribed. | |
| Configure the corresponding parameters in: | |
| ``` | |
| example/preprocess.sh | |
| ``` | |
| Typical configuration includes: | |
| - Input / output paths | |
| - Module enable switches | |
| ## 🚀 Usage | |
| After configuring `preprocess.sh`, run the transcription pipeline with: | |
| ```bash | |
| bash example/preprocess.sh | |
| ``` | |
| The script will automatically execute the following steps: | |
| 1. **Vocal separation and dereverberation** | |
| 2. **F0 extraction and voice activity detection (VAD)** | |
| 3. **Lyrics transcription** | |
| 4. **Note transcription** | |
| --- | |
| After the pipeline completes, you will obtain **SoulX-Singer–style metadata** that can be directly used for Singing Voice Synthesis (SVS). | |
| **Output paths:** | |
| - The final metadata (**JSON file**) is written **in the same directory as your input audio**, with the **same filename** (e.g. `audio.mp3` → `audio.json`) | |
| - All **intermediate results** (separated vocal and accompaniment, F0, VAD outputs, etc.) are also saved under the configured **`save_dir`**. | |
| ⚠️ **Important Note** | |
| Transcription errors—especially in **lyrics** and **note annotations**—can significantly affect the final SVS quality. We **strongly recommend manually reviewing and correcting** the generated metadata before inference. | |
| To support this, we provide a **MIDI Editor** for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is: | |
| **Export metadata to MIDI** → edit in the MIDI Editor → **Import edited MIDI back to metadata** for SVS. | |
| --- | |
| #### Step 1: Metadata → MIDI (for editing) | |
| Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor: | |
| ```bash | |
| preprocess_root=example/transcriptions/music | |
| python -m preprocess.tools.midi_parser \ | |
| --meta2midi \ | |
| --meta "${preprocess_root}/metadata.json" \ | |
| --midi "${preprocess_root}/vocal.mid" | |
| ``` | |
| #### Step 2: Edit in the MIDI Editor | |
| Open the MIDI Editor (see [MIDI Editor Tutorial](tools/midi_editor/README.md)), load `vocal.mid`, and correct lyrics, pitches, or durations as needed. Save the result as e.g. `vocal_edited.mid`. | |
| #### Step 3: MIDI → Metadata (for SoulX-Singer inference) | |
| Convert the edited MIDI back into SoulX-Singer-style metadata (and cut wavs) for SVS: | |
| ```bash | |
| python -m preprocess.tools.midi_parser \ | |
| --midi2meta \ | |
| --midi "${preprocess_root}/vocal_edited.mid" \ | |
| --meta "${preprocess_root}/edit_metadata.json" \ | |
| --vocal "${preprocess_root}/vocal.wav" \ | |
| ``` | |
| Use `edit_metadata.json` (and the wavs under `edit_cut_wavs`) as the target metadata in your inference pipeline. | |
| ## 🔗 References & Dependencies | |
| This project builds upon the following excellent open-source works: | |
| ### 🎧 Vocal Separation & Dereverberation | |
| - [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training) | |
| - [Lead Vocal Separation](https://huggingface.co/becruily/mel-band-roformer-karaoke) | |
| - [Vocal Dereverberation](https://huggingface.co/anvuew/dereverb_mel_band_roformer) | |
| ### 🎼 F0 Extraction | |
| - [RMVPE](https://github.com/Dream-High/RMVPE) | |
| ### 📝 Lyrics Transcription (ASR) | |
| - [Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch) | |
| - [Parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) | |
| ### 🎶 Note Transcription | |
| - [ROSVOT](https://github.com/RickyL-2000/ROSVOT) | |
| We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit. | |