---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---

# 🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance


![Python](https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white) ![License](https://img.shields.io/badge/License-CC--BY--4.0-lightgrey) [![arXiv Paper](https://img.shields.io/badge/arXiv-2603.24589-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.24589) [![GitHub](https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus) [![Demo Page](https://img.shields.io/badge/GitHub-Demo--Page-8A2BE2?logo=github&logoColor=white&labelColor=181717)](https://aslp-lab.github.io/YingMusic-Singer-Plus-Demo/) [![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E)](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus) [![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00)](https://huggingface.co/ASLP-lab/YingMusic-Singer-Plus) [![Dataset LyricEditBench](https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00)](https://huggingface.co/datasets/ASLP-lab/LyricEditBench) [![Discord](https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white)](https://discord.gg/RXghgWyvrn) [![WeChat](https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus/blob/main/assets/wechat_qr.png) [![Lab](https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9)](http://www.npu-aslp.org/)

Chunbo Hao<sup>1,2</sup> · Junjie Zheng<sup>2</sup> · Guobin Ma<sup>1</sup> · Yuepeng Jiang<sup>1</sup> · Huakang Chen<sup>1</sup> · Wenjie Tian<sup>1</sup> · Gongyu Chen<sup>2</sup> · Zihao Chen<sup>2</sup> · Lei Xie<sup>1</sup>

<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
<sup>2</sup> AI Lab, GiantNetwork, China

YingMusic-Singer Architecture

Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.

## 📖 Introduction

**YingMusic-Singer-Plus** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation. Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer-Plus synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.

## ✨ Key Features

- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE

## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```

### Option 2: Pre-built Conda Environment

1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf `.

| CPU Architecture | GPU | OS | Download |
|------------------|--------|---------|----------|
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |

### Option 3: Docker

Build the image (Docker image names must be lowercase):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```

## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as online demo)

```bash
python app_local.py
```

### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "该体谅的不执着|如果那天我" \
    --target_text "好多天|看不完你" \
    --output output.wav
```

Enable vocal separation and accompaniment mixing:

```bash
# --separate_vocals: separate vocals from the input before processing
# --mix_accompaniment: mix the synthesized vocal back with the accompaniment
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```

### Option 4: Batch Inference

> **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.
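Batch inference consumes a JSONL manifest (one record per clip, in the format described next). Such a manifest can be generated with a short script; the sketch below is illustrative only — the audio paths are placeholder values, and only the field names follow the example record in this README:

```python
import json

# One record per clip to synthesize. Paths below are placeholders.
records = [
    {
        "id": "1",
        "melody_ref_path": "data/melody_01.wav",        # melody-providing singing clip
        "gen_text": "好多天|看不完你",                    # edited (target) lyrics
        "timbre_ref_path": "data/timbre_01.wav",        # optional timbre reference
        "timbre_ref_text": "该体谅的不执着|如果那天我",   # lyrics of the timbre reference
    },
]

with open("input.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # ensure_ascii=False keeps the Chinese lyrics human-readable in the file
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```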
The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```

```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```

Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

Multi-process inference on **LyricEditBench (singing edit)**:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

## 🏗️ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
|-----------|-------------|
| **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

**Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)

## 📊 LyricEditBench

We introduce **LyricEditBench**, the first benchmark for melody-preserving lyric modification evaluation, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. The metrics (P: PER, S: SIM, F: F0-CORR, V: VS) are detailed in Section 3 of the paper. Best results are in bold.

LyricEditBench Results
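As a rough illustration of the F0-CORR metric listed above, the correlation between a reference and a generated F0 contour can be computed as a Pearson correlation over frames voiced in both. This is a hedged sketch, not the paper's evaluation code; the contours are made-up arrays, and treating F0 = 0 as "unvoiced" is an assumption:

```python
import math

def f0_correlation(f0_ref, f0_gen):
    """Pearson correlation of two equal-length F0 contours (Hz),
    restricted to frames voiced in both (F0 > 0)."""
    pairs = [(r, g) for r, g in zip(f0_ref, f0_gen) if r > 0 and g > 0]
    if len(pairs) < 2:
        return 0.0
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

# Identical contours (with one unvoiced frame skipped) correlate perfectly.
ref = [220.0, 0.0, 246.9, 261.6, 293.7]
print(round(f0_correlation(ref, ref), 3))  # 1.0
```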
## 🙏 Acknowledgements

This work builds upon the following open-source projects:

- [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
- [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
- [SOME](https://github.com/openvpi/SOME) — Melody Extractor
- [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
- [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
- [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data

## 📄 License

The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following: the VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
