	# 🎵 SoulX-Singer-Preprocess

	This module provides a singing transcription and editing toolkit for real-world music audio. It covers the full pipeline from vocal extraction to high-level annotation and is optimized for singing voice synthesis (SVS) dataset construction. By integrating state-of-the-art models, it transforms raw audio into structured singing data and supports customizable creation and editing of lyric-aligned MIDI scores.


	## ✨ Features

	The toolkit includes the following core modules:

	- 🎤 Clean Dry Vocal Extraction
	Extracts the lead vocal track from polyphonic music audio and removes reverberation.

	- 📝 Lyrics Transcription
	Automatically transcribes lyrics from the clean vocal track.

	- 🎶 Note Transcription
	Converts singing voice into note-level representations for SVS.

	- 🎼 MIDI Editor
	Supports customizable creation and editing of MIDI scores integrated with lyrics.


	## 🔧 Python Environment

	Before running the pipeline, set up the Python environment as follows:

	1. Install Conda (if not already installed): https://docs.conda.io/en/latest/miniconda.html

	2. Activate or create a conda environment (recommended Python 3.10):

	- If you already have the `soulxsinger` environment:

	```bash
	conda activate soulxsinger
	```

	- Otherwise, create it first:

	```bash
	conda create -n soulxsinger -y python=3.10
	conda activate soulxsinger
	```

	3. Install dependencies from the `preprocess` directory:

	```bash
	cd preprocess
	pip install -r requirements.txt
	```
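
As a quick sanity check after activating the environment, the following minimal sketch reports whether the active interpreter matches the recommended Python 3.10. The helper name `is_recommended_python` is illustrative and not part of the toolkit.

```python
import sys

RECOMMENDED = (3, 10)  # recommended version from the setup steps above


def is_recommended_python(version_info=sys.version_info, recommended=RECOMMENDED):
    """Return True if the major.minor version matches the recommended one."""
    return tuple(version_info[:2]) == tuple(recommended)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if is_recommended_python():
        print(f"Python {current}: matches the recommended 3.10")
    else:
        print(f"Python {current}: differs from the recommended 3.10")
```

Other Python versions may still work; the check only flags a difference from the recommendation.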

	## 📁 Data Preparation

	Before running the pipeline, prepare the following inputs:

	- Prompt audio
	Reference audio that provides timbre and style.

	- Target audio
	Original vocal or music audio to be processed and transcribed.

	Configure the corresponding parameters in:

	```
	example/preprocess.sh
	```

	Typical configuration includes:
	- Input / output paths
	- Module enable switches

	## 🚀 Usage

	After configuring `preprocess.sh`, run the transcription pipeline with:

	```bash
	bash example/preprocess.sh
	```

	The script will automatically execute the following steps:

	1. Vocal separation and dereverberation
	2. F0 extraction and voice activity detection (VAD)
	3. Lyrics transcription
	4. Note transcription
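
When reviewing the F0 output of step 2 next to the notes from step 4, it can help to relate a frequency in Hz to a MIDI pitch. The sketch below uses the standard equal-temperament mapping and assumes A4 = 440 Hz and the convention that MIDI note 60 is named C4. It is only a review aid: the note transcription model is a learned system and does not work this way, and the function names here are hypothetical.

```python
import math

A4_HZ = 440.0   # assumed reference tuning
A4_MIDI = 69    # MIDI note number of A4

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(f0_hz):
    """Map a frequency in Hz to a (possibly fractional) MIDI note number."""
    if f0_hz <= 0:
        raise ValueError("F0 must be positive; unvoiced frames have no pitch")
    return A4_MIDI + 12.0 * math.log2(f0_hz / A4_HZ)


def midi_to_name(midi_number):
    """Name of the nearest MIDI note, using the convention 60 -> 'C4'."""
    nearest = int(round(midi_number))
    return f"{NOTE_NAMES[nearest % 12]}{nearest // 12 - 1}"


if __name__ == "__main__":
    for f0 in (220.0, 261.63, 440.0):
        midi = hz_to_midi(f0)
        print(f"{f0:8.2f} Hz -> MIDI {midi:6.2f} ({midi_to_name(midi)})")
```

A fractional result far from a whole number suggests the sung pitch sits between two notes, which is a reasonable place to look first when correcting note annotations.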

	---

	After the pipeline completes, you will obtain SoulX-Singer–style metadata that can be directly used for Singing Voice Synthesis (SVS).

	Output paths:
	- The final metadata (JSON file) is written to the same directory as your input audio, with the same base filename (e.g. `audio.mp3` → `audio.json`).
	- All intermediate results (separated vocal and accompaniment, F0, VAD outputs, etc.) are saved under the configured `save_dir`.
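
The naming rule for the final metadata can be expressed in a few lines. This is a minimal sketch, assuming only the rule stated above (same directory, same base name, `.json` suffix); the helper names are illustrative and not part of the toolkit.

```python
from pathlib import Path


def expected_metadata_path(audio_path):
    """Same directory and base name as the audio file, with a .json suffix."""
    return Path(audio_path).with_suffix(".json")


def report(audio_path):
    """Return a one-line status for the metadata expected next to the audio."""
    meta = expected_metadata_path(audio_path)
    status = "found" if meta.is_file() else "missing"
    return f"{meta}: {status}"


if __name__ == "__main__":
    # Hypothetical input path, used only for illustration.
    print(report("example/audio/music.mp3"))
```

Running `report` on each input after the pipeline finishes gives a quick way to spot files for which no metadata was produced.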

	⚠️ Important Note

	Transcription errors—especially in lyrics and note annotations—can significantly affect the final SVS quality. We strongly recommend manually reviewing and correcting the generated metadata before inference.

	To support this, we provide a MIDI Editor for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is:

	Export metadata to MIDI → edit in the MIDI Editor → import the edited MIDI back to metadata for SVS.

	---

	#### Step 1: Metadata → MIDI (for editing)

	Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor:

	```bash
	preprocess_root=example/transcriptions/music

	python -m preprocess.tools.midi_parser \
	--meta2midi \
	--meta "${preprocess_root}/metadata.json" \
	--midi "${preprocess_root}/vocal.mid"
	```

	#### Step 2: Edit in the MIDI Editor

	Open the MIDI Editor (see [MIDI Editor Tutorial](tools/midi_editor/README.md)), load `vocal.mid`, and correct lyrics, pitches, or durations as needed. Save the result as e.g. `vocal_edited.mid`.

	#### Step 3: MIDI → Metadata (for SoulX-Singer inference)

	Convert the edited MIDI back into SoulX-Singer-style metadata (and the corresponding cut WAV files) for SVS:

	```bash
	# Same directory as in Step 1
	preprocess_root=example/transcriptions/music

	python -m preprocess.tools.midi_parser \
	--midi2meta \
	--midi "${preprocess_root}/vocal_edited.mid" \
	--meta "${preprocess_root}/edit_metadata.json" \
	--vocal "${preprocess_root}/vocal.wav"
	```

	Use `edit_metadata.json` (and the wavs under `edit_cut_wavs`) as the target metadata in your inference pipeline.
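
For scripting the round trip, the two `midi_parser` invocations can be assembled from Python. The sketch below uses only the flags shown in Steps 1 and 3 and the example paths from this README; the function names and the `dry_run` switch are hypothetical conveniences, and the commands are expected to be run from the repository root.

```python
import subprocess
import sys
from pathlib import Path

MODULE = "preprocess.tools.midi_parser"


def meta2midi_cmd(meta, midi):
    """Command for Step 1: metadata -> MIDI."""
    return [sys.executable, "-m", MODULE,
            "--meta2midi", "--meta", str(meta), "--midi", str(midi)]


def midi2meta_cmd(midi, meta, vocal):
    """Command for Step 3: edited MIDI -> metadata."""
    return [sys.executable, "-m", MODULE,
            "--midi2meta", "--midi", str(midi), "--meta", str(meta),
            "--vocal", str(vocal)]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    root = Path("example/transcriptions/music")
    run(meta2midi_cmd(root / "metadata.json", root / "vocal.mid"))
    # Edit vocal.mid in the MIDI Editor and save it as vocal_edited.mid here.
    run(midi2meta_cmd(root / "vocal_edited.mid",
                      root / "edit_metadata.json",
                      root / "vocal.wav"))
```

With `dry_run=True` the script only prints the commands, which makes it easy to confirm the paths before anything is written.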


	## 🔗 References & Dependencies

	This project builds upon the following excellent open-source works:

	### 🎧 Vocal Separation & Dereverberation
	- [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)
	- [Lead Vocal Separation](https://huggingface.co/becruily/mel-band-roformer-karaoke)
	- [Vocal Dereverberation](https://huggingface.co/anvuew/dereverb_mel_band_roformer)

	### 🎼 F0 Extraction
	- [RMVPE](https://github.com/Dream-High/RMVPE)

	### 📝 Lyrics Transcription (ASR)
	- [Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
	- [Parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)

	### 🎶 Note Transcription
	- [ROSVOT](https://github.com/RickyL-2000/ROSVOT)

	We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit.