---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---

# 🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance


![Python](https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white) ![License](https://img.shields.io/badge/License-CC--BY--4.0-lightgrey) [![arXiv Paper](https://img.shields.io/badge/arXiv-2603.24589-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.24589) [![GitHub](https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus) [![Demo Page](https://img.shields.io/badge/GitHub-Demo--Page-8A2BE2?logo=github&logoColor=white&labelColor=181717)](https://aslp-lab.github.io/YingMusic-Singer-Plus-Demo/) [![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E)](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus) [![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00)](https://huggingface.co/ASLP-lab/YingMusic-Singer-Plus) [![Dataset LyricEditBench](https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00)](https://huggingface.co/datasets/ASLP-lab/LyricEditBench) [![Discord](https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white)](https://discord.gg/RXghgWyvrn) [![WeChat](https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus/blob/main/assets/wechat_qr.png) [![Lab](https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9)](http://www.npu-aslp.org/)

Chunbo Hao<sup>1,2</sup> · Junjie Zheng<sup>2</sup> · Guobin Ma<sup>1</sup> · Yuepeng Jiang<sup>1</sup> · Huakang Chen<sup>1</sup> · Wenjie Tian<sup>1</sup> · Gongyu Chen<sup>2</sup> · Zihao Chen<sup>2</sup> · Lei Xie<sup>1</sup>

<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
<sup>2</sup> AI Lab, GiantNetwork, China

YingMusic-Singer Architecture

Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.

## 📖 Introduction

**YingMusic-Singer-Plus** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation. Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer-Plus synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.

## ✨ Key Features

- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE

## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```

### Option 2: Pre-built Conda Environment

1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf `.

| CPU Architecture | GPU | OS | Download |
|------------------|--------|---------|----------|
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |

### Option 3: Docker

Build the image (Docker image names must be lowercase):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```

## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as online demo)

```bash
python app_local.py
```

### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "该体谅的不执着|如果那天我" \
    --target_text "好多天|看不完你" \
    --output output.wav
```

Enable vocal separation and accompaniment mixing:

```bash
# --separate_vocals: separate vocals from the input before processing
# --mix_accompaniment: mix the synthesized vocal back with the accompaniment
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```

### Option 4: Batch Inference

> **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.
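Batch inference consumes a JSONL manifest (one record per clip, in the format described next). Such a manifest can be generated with a short script; the sketch below is illustrative only — the audio paths are placeholder values, and only the field names follow the example record in this README:

```python
import json

# One record per clip to synthesize. Paths below are placeholders.
records = [
    {
        "id": "1",
        "melody_ref_path": "data/melody_01.wav",        # melody-providing singing clip
        "gen_text": "好多天|看不完你",                    # edited (target) lyrics
        "timbre_ref_path": "data/timbre_01.wav",        # optional timbre reference
        "timbre_ref_text": "该体谅的不执着|如果那天我",   # lyrics of the timbre reference
    },
]

with open("input.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # ensure_ascii=False keeps the Chinese lyrics human-readable in the file
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```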
The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```

```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```

Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

Multi-process inference on **LyricEditBench (singing edit)**:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

## 🏗️ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
|-----------|-------------|
| **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

**Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)

## 📊 LyricEditBench

We introduce **LyricEditBench**, the first benchmark for melody-preserving lyric modification evaluation, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. The metrics (P: PER, S: SIM, F: F0-CORR, V: VS) are detailed in Section 3 of the paper. Best results are in bold.

LyricEditBench Results
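As a rough illustration of the F0-CORR metric listed above, the correlation between a reference and a generated F0 contour can be computed as a Pearson correlation over frames voiced in both. This is a hedged sketch, not the paper's evaluation code; the contours are made-up arrays, and treating F0 = 0 as "unvoiced" is an assumption:

```python
import math

def f0_correlation(f0_ref, f0_gen):
    """Pearson correlation of two equal-length F0 contours (Hz),
    restricted to frames voiced in both (F0 > 0)."""
    pairs = [(r, g) for r, g in zip(f0_ref, f0_gen) if r > 0 and g > 0]
    if len(pairs) < 2:
        return 0.0
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

# Identical contours (with one unvoiced frame skipped) correlate perfectly.
ref = [220.0, 0.0, 246.9, 261.6, 293.7]
print(round(f0_correlation(ref, ref), 3))  # 1.0
```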
## 🙏 Acknowledgements

This work builds upon the following open-source projects:

- [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
- [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
- [SOME](https://github.com/openvpi/SOME) — Melody Extractor
- [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
- [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
- [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data

## 📄 License

The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following: the VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
