---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---
<div align="center">
<h1>🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance</h1>
<p>
<a href="">English</a> | <a href="README_ZH.md">中文</a>
</p>

[Paper](https://arxiv.org/abs/2603.24589)
[Code](https://github.com/ASLP-lab/YingMusic-Singer-Plus)
[Demo Page](https://aslp-lab.github.io/YingMusic-Singer-Plus-Demo/)
[HF Space](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus)
[Model](https://huggingface.co/ASLP-lab/YingMusic-Singer-Plus)
[LyricEditBench](https://huggingface.co/datasets/ASLP-lab/LyricEditBench)
[Discord](https://discord.gg/RXghgWyvrn)
[WeChat](https://github.com/ASLP-lab/YingMusic-Singer-Plus/blob/main/assets/wechat_qr.png)
[ASLP Lab](http://www.npu-aslp.org/)
<p>
<a href="https://orcid.org/0009-0005-5957-8936">Chunbo Hao</a><sup>1,2</sup> ·
<a href="https://orcid.org/0009-0003-2602-2910">Junjie Zheng</a><sup>2</sup> ·
<a href="https://orcid.org/0009-0001-6706-0572">Guobin Ma</a><sup>1</sup> ·
Yuepeng Jiang<sup>1</sup> ·
Huakang Chen<sup>1</sup> ·
Wenjie Tian<sup>1</sup> ·
<a href="https://orcid.org/0009-0003-9258-4006">Gongyu Chen</a><sup>2</sup> ·
<a href="https://orcid.org/0009-0005-5413-6725">Zihao Chen</a><sup>2</sup> ·
Lei Xie<sup>1</sup>
</p>
<p>
<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China<br>
<sup>2</sup> AI Lab, GiantNetwork, China
</p>
</div>
<div align="center">
<img src="./assets/YingMusic-Singer.drawio.svg" alt="YingMusic-Singer Architecture" width="90%">
<p><i>Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.</i></p>
</div>
## 📖 Introduction

**YingMusic-Singer-Plus** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation.

Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer-Plus synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.
## ✨ Key Features

- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE
## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```
### Option 2: Pre-built Conda Environment

1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>`.

| CPU Architecture | GPU | OS | Download |
|------------------|--------|---------|----------|
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |
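Steps 3-4 can also be scripted. The sketch below is a minimal helper assuming a standard `tar` package; the conda root and package name in the usage comment are hypothetical and must be adjusted to your setup:

```python
import os
import tarfile

def extract_env_package(package_path: str, env_dir: str) -> None:
    """Create the env folder, then extract the package into it (mirrors `tar -xvf <package_name>`)."""
    os.makedirs(env_dir, exist_ok=True)
    with tarfile.open(package_path) as tar:
        tar.extractall(env_dir)

# Hypothetical usage -- adjust the conda root and package name to your setup:
# extract_env_package("~/Downloads/<package_name>",
#                     os.path.expanduser("~/miniconda3/envs/YingMusic-Singer-Plus"))
```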
### Option 3: Docker

Build the image (Docker requires lowercase image names):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```
## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as online demo)

```bash
python app_local.py
```
### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "该体谅的不执着|如果那天我" \
    --target_text "好多天|看不完你" \
    --output output.wav
```
Enable vocal separation and accompaniment mixing. `--separate_vocals` separates vocals from the input before processing; `--mix_accompaniment` mixes the synthesized vocal back with the accompaniment. (Note: inline `#` comments cannot follow a `\` line continuation in bash, so the flags are listed bare.)

```bash
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```
### Option 4: Batch Inference

> **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.

The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```
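Such a file can be generated with a few lines of Python — a sketch; the `XXX` paths are placeholders for your own audio files, and the field names follow the schema above:

```python
import json

# Each record mirrors the fields of the JSONL schema shown above.
records = [
    {
        "id": "1",
        "melody_ref_path": "XXX",
        "gen_text": "好多天|看不完你",
        "timbre_ref_path": "XXX",
        "timbre_ref_text": "该体谅的不执着|如果那天我",
    },
]

# ensure_ascii=False keeps the Chinese lyrics readable in the output file.
with open("input.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```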
```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```
Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

Multi-process inference on **LyricEditBench (singing edit)**:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```
## 🏗️ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
|-----------|-------------|
| **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

**Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
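As a quick sanity check on these figures: the 2048× temporal downsampling of the VAE implies a latent frame rate of roughly 21.5 Hz at 44.1 kHz, and the per-component parameter counts sum to the stated total:

```python
# Latent frame rate implied by the VAE's 2048x temporal downsampling.
sample_rate = 44_100  # Hz
downsample_factor = 2048
latent_rate = sample_rate / downsample_factor
print(f"latent frame rate: {latent_rate:.2f} Hz")  # ~21.53 Hz

# Component parameter counts (in millions) from the table above.
cfm, vae, melody_extractor = 453.6, 156.1, 117.6
total = cfm + vae + melody_extractor
print(f"total parameters: {total:.1f}M")  # 727.3M
```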
## 📊 LyricEditBench

We introduce **LyricEditBench**, the first benchmark for melody-preserving lyric modification evaluation, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

<div align="center">
<p><i>Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics — P: PER, S: SIM, F: F0-CORR, V: VS — are detailed in Section 3. Best results in <b>bold</b>.</i></p>
<img src="./assets/results.png" alt="LyricEditBench Results" width="90%">
</div>
## 🙏 Acknowledgements

This work builds upon the following open-source projects:

- [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
- [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
- [SOME](https://github.com/openvpi/SOME) — Melody Extractor
- [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
- [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
- [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data
## 📄 License

The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following:

The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
| <p align="center"> | |
| <img src="https://raw.githubusercontent.com/ASLP-lab/YingMusic-Singer-Plus/main/assets/institutional_logo.svg" alt="Institutional Logo" width="600"> | |
| </p> |