---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: '3.10'
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---
# 🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

Chunbo Hao<sup>1,2</sup> · Junjie Zheng<sup>2</sup> · Guobin Ma<sup>1</sup> · Yuepeng Jiang<sup>1</sup> · Huakang Chen<sup>1</sup> · Wenjie Tian<sup>1</sup> · Gongyu Chen<sup>2</sup> · Zihao Chen<sup>2</sup> · Lei Xie<sup>1</sup>

<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
<sup>2</sup> AI Lab, GiantNetwork, China
*Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.*
## 📖 Introduction

YingMusic-Singer-Plus is a fully diffusion-based singing voice synthesis model that enables melody-controllable singing voice editing with flexible lyric manipulation, requiring no manual alignment or precise phoneme annotation.

Given only three inputs (an optional timbre reference, a melody-providing singing clip, and modified lyrics), YingMusic-Singer-Plus synthesizes high-fidelity singing voices at 44.1 kHz while faithfully preserving the original melody.
## ✨ Key Features

- Annotation-free: No manual lyric-MIDI alignment required at inference
- Flexible lyric manipulation: Supports six editing types: partial and full changes, insertion, deletion, translation (CN⇄EN), and code-switching
- Strong melody preservation: CKA-based melody alignment loss + GRPO-based optimization
- Bilingual: Unified IPA tokenizer for both Chinese and English
- High fidelity: 44.1 kHz stereo output via the Stable Audio 2 VAE
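The CKA-based melody alignment loss mentioned above compares feature representations of generated and reference audio. The paper's exact CKA variant is not given in this README, so the following is a minimal linear-CKA sketch in NumPy; the feature shapes and the linear kernel are assumptions:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two feature matrices.

    X, Y: (n_frames, dim) activations; columns are mean-centered first.
    Returns a similarity in [0, 1]; 1.0 means identical up to rotation/scale.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 64))
print(round(linear_cka(feats, feats), 4))  # identical features -> 1.0
```

A loss would typically be `1 - linear_cka(melody_feats_gen, melody_feats_ref)`, pushing the generated audio's melody features toward the reference.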
## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```
### Option 2: Pre-built Conda Environment

1. Download and install Miniconda from https://repo.anaconda.com/miniconda/ for your platform. Verify the installation with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>`.
| CPU Architecture | GPU | OS | Download |
|---|---|---|---|
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |
### Option 3: Docker

Build the image (Docker repository names must be lowercase):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```
## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as the online demo)

```bash
python app_local.py
```
### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "่ฏฅไฝ่ฐ็ไธๆง็|ๅฆๆ้ฃๅคฉๆ" \
    --target_text "ๅฅฝๅคๅคฉ|็ไธๅฎไฝ " \
    --output output.wav
```
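If you prefer to drive the CLI from Python, a thin wrapper can assemble the same invocation. The flag names come from the example above; the wrapper function and sample paths are illustrative:

```python
import shlex
import subprocess

def build_infer_cmd(ref_audio, melody_audio, ref_text, target_text, output):
    """Assemble the infer_api.py command line shown in the README example."""
    return [
        "python", "infer_api.py",
        "--ref_audio", ref_audio,
        "--melody_audio", melody_audio,
        "--ref_text", ref_text,
        "--target_text", target_text,
        "--output", output,
    ]

cmd = build_infer_cmd("ref.wav", "melody.wav", "la la|la la", "di da|di da", "out.wav")
print(shlex.join(cmd))  # shell-quoted command for inspection
# subprocess.run(cmd, check=True)  # uncomment to actually run inference
```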
Enable vocal separation and accompaniment mixing (note: a comment after a trailing `\` would break the line continuation, so the flag descriptions go above the command):

```bash
# --separate_vocals: separate vocals from the input before processing
# --mix_accompaniment: mix the synthesized vocal back with the accompaniment
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```
### Option 4: Batch Inference

Note: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.

The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "ๅฅฝๅคๅคฉ|็ไธๅฎไฝ ", "timbre_ref_path": "XXX", "timbre_ref_text": "่ฏฅไฝ่ฐ็ไธๆง็|ๅฆๆ้ฃๅคฉๆ"}
```
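A small helper can generate and sanity-check this JSONL before launching the batch job. The key schema is taken from the example above; the helper itself (`write_jsonl`, the sample paths) is hypothetical:

```python
import json
from pathlib import Path

# Keys expected by batch_infer.py, per the README example
REQUIRED = ("id", "melody_ref_path", "gen_text", "timbre_ref_path", "timbre_ref_text")

def write_jsonl(records, path):
    """Write batch-inference records, one JSON object per line, validating keys."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            missing = [k for k in REQUIRED if k not in rec]
            if missing:
                raise ValueError(f"record {rec.get('id')} is missing {missing}")
            # ensure_ascii=False keeps Chinese lyrics readable in the file
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

write_jsonl(
    [{"id": "1", "melody_ref_path": "mel.wav", "gen_text": "new|lyrics",
      "timbre_ref_path": "tim.wav", "timbre_ref_text": "old|lyrics"}],
    "input.jsonl",
)
print(Path("input.jsonl").read_text(encoding="utf-8").count("\n"))  # one record -> 1
```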
```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```
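With `--num_gpus 4`, the batch script presumably splits the JSONL lines across worker processes, one per GPU. A minimal round-robin sharding sketch (the actual strategy inside `batch_infer.py` is an assumption):

```python
def shard_round_robin(lines, num_workers):
    """Deal JSONL lines out to workers round-robin, so shard sizes differ by at most 1."""
    return [lines[i::num_workers] for i in range(num_workers)]

lines = [f'{{"id": "{i}"}}' for i in range(10)]
shards = shard_round_robin(lines, 4)
print([len(s) for s in shards])  # -> [3, 3, 2, 2]
```

Each shard would then be handed to one worker process pinned to its own GPU (e.g. via `CUDA_VISIBLE_DEVICES`).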
Multi-process inference on LyricEditBench (melody control); the test set is downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```
Multi-process inference on LyricEditBench (singing edit):

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```
## 🏗️ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
|---|---|
| VAE | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| Melody Extractor | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| IPA Tokenizer | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| DiT-based CFM | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

Total parameters: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
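Two quick sanity checks on these numbers: the per-component parameter counts sum to the reported total, and the 2048× VAE downsampling implies the latent frame rate the DiT operates at (the rate is derived here, not stated in this README):

```python
# Per-component parameter counts (in millions), from the table above
params_m = {"CFM": 453.6, "VAE": 156.1, "Melody Extractor": 117.6}
total = sum(params_m.values())
print(f"total: {total:.1f}M")  # matches the reported ~727.3M

# Latent frame rate implied by 2048x downsampling of 44.1 kHz audio
latent_hz = 44100 / 2048
print(f"latent rate: {latent_hz:.2f} Hz")  # about 21.53 latent frames per second
```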
## 📊 LyricEditBench

We introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation, built on GTSinger. The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

*Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics (P: PER, S: SIM, F: F0-CORR, V: VS) are detailed in Section 3 of the paper. Best results in bold.*
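F0-CORR is commonly computed as the Pearson correlation between the reference and generated F0 contours over frames that are voiced in both; the sketch below follows that common formulation, though the benchmark's exact definition (voicing handling, alignment) may differ:

```python
import numpy as np

def f0_corr(f0_ref: np.ndarray, f0_gen: np.ndarray) -> float:
    """Pearson correlation of two frame-level F0 contours in Hz.

    Frames with f0 == 0 are treated as unvoiced and excluded; only frames
    voiced in both contours contribute to the correlation.
    """
    voiced = (f0_ref > 0) & (f0_gen > 0)
    return float(np.corrcoef(f0_ref[voiced], f0_gen[voiced])[0, 1])

t = np.linspace(0, 1, 200)
ref = 220 + 20 * np.sin(2 * np.pi * 3 * t)                 # vibrato-like contour
gen = ref + np.random.default_rng(0).normal(0, 1, t.size)  # slightly noisy copy
print(round(f0_corr(ref, gen), 3))
```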
## 🙏 Acknowledgements

This work builds upon the following open-source projects:

- F5-TTS: DiT-based CFM backbone
- Stable Audio 2: VAE architecture
- SOME: Melody Extractor
- DiffRhythm: Sentence-level alignment strategy
- GTSinger: Benchmark base corpus
- Emilia: TTS pretraining data
## 📄 License

The code and model weights in this project are licensed under CC BY 4.0, except for the following:

- The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from Stable Audio Open by Stability AI and are licensed under the Stability AI Community License.