---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
- singing-voice-synthesis
- lyric-editing
- diffusion-model
- reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---
<div align="center">
<h1>🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance</h1>
<p>
<a href="">English</a><a href="README_ZH.md">中文</a>
</p>
![Python](https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white)
![License](https://img.shields.io/badge/License-CC--BY--4.0-lightgrey)
[![arXiv Paper](https://img.shields.io/badge/arXiv-2603.24589-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.24589)
[![GitHub](https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus)
[![Demo Page](https://img.shields.io/badge/GitHub-Demo--Page-8A2BE2?logo=github&logoColor=white&labelColor=181717)](https://aslp-lab.github.io/YingMusic-Singer-Plus-Demo/)
[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E)](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus)
[![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00)](https://huggingface.co/ASLP-lab/YingMusic-Singer-Plus)
[![Dataset LyricEditBench](https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00)](https://huggingface.co/datasets/ASLP-lab/LyricEditBench)
[![Discord](https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white)](https://discord.gg/RXghgWyvrn)
[![WeChat](https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus/blob/main/assets/wechat_qr.png)
[![Lab](https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9)](http://www.npu-aslp.org/)
<p>
<a href="https://orcid.org/0009-0005-5957-8936">Chunbo Hao</a><sup>1,2</sup> ·
<a href="https://orcid.org/0009-0003-2602-2910">Junjie Zheng</a><sup>2</sup> ·
<a href="https://orcid.org/0009-0001-6706-0572">Guobin Ma</a><sup>1</sup> ·
Yuepeng Jiang<sup>1</sup> ·
Huakang Chen<sup>1</sup> ·
Wenjie Tian<sup>1</sup> ·
<a href="https://orcid.org/0009-0003-9258-4006">Gongyu Chen</a><sup>2</sup> ·
<a href="https://orcid.org/0009-0005-5413-6725">Zihao Chen</a><sup>2</sup> ·
Lei Xie<sup>1</sup>
</p>
<p>
<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China<br>
<sup>2</sup> AI Lab, GiantNetwork, China
</p>
</div>
<div align="center">
<img src="./assets/YingMusic-Singer.drawio.svg" alt="YingMusic-Singer Architecture" width="90%">
<p><i>Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.</i></p>
</div>
## 📖 Introduction
**YingMusic-Singer-Plus** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation.
Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer-Plus synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.
## ✨ Key Features
- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE
## 🚀 Quick Start
### Option 1: Install from Scratch
```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```
### Option 2: Pre-built Conda Environment
1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>`.
| CPU Architecture | GPU | OS | Download |
|------------------|--------|---------|----------|
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |
### Option 3: Docker
Build the image (Docker repository names must be lowercase):
```bash
docker build -t yingmusic-singer-plus .
```
Run inference:
```bash
docker run --gpus all -it yingmusic-singer-plus
```
## 🎵 Inference
### Option 1: Online Demo (HuggingFace Space)
Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.
### Option 2: Local Gradio App (same as online demo)
```bash
python app_local.py
```
### Option 3: Command-line Inference
```bash
python infer_api.py \
--ref_audio path/to/ref.wav \
--melody_audio path/to/melody.wav \
--ref_text "该体谅的不执着|如果那天我" \
--target_text "好多天|看不完你" \
--output output.wav
```
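When driving many edits from a script, it can be more robust to build the command above programmatically. The sketch below is a minimal, hypothetical wrapper around the same `infer_api.py` flags shown above; it only assembles the argument list (the actual inference call is left commented out):

```python
import subprocess  # used if you uncomment the run() call below

def build_infer_cmd(ref_audio, melody_audio, ref_text, target_text, output):
    """Assemble the infer_api.py invocation shown above as an argument list.

    Passing a list to subprocess.run avoids shell-quoting issues with
    lyrics that contain spaces or '|' sentence separators.
    """
    return [
        "python", "infer_api.py",
        "--ref_audio", ref_audio,
        "--melody_audio", melody_audio,
        "--ref_text", ref_text,
        "--target_text", target_text,
        "--output", output,
    ]

cmd = build_infer_cmd(
    "ref.wav", "melody.wav",
    "该体谅的不执着|如果那天我", "好多天|看不完你",
    "output.wav",
)
print(cmd)
# subprocess.run(cmd, check=True)  # uncomment to actually run inference
```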
Enable vocal separation and accompaniment mixing:
```bash
# --separate_vocals: separate vocals from the input before processing
# --mix_accompaniment: mix the synthesized vocal back with the accompaniment
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```
### Option 4: Batch Inference
> **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.
The input JSONL file should contain one JSON object per line, formatted as follows:
```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```
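A small helper like the following can generate the JSONL file. It is a sketch using the field names from the schema above; the audio paths are placeholders you should replace with your own files:

```python
import json

# Field names follow the JSONL schema shown above.
# The .wav paths are hypothetical placeholders.
entries = [
    {
        "id": "1",
        "melody_ref_path": "clips/melody_001.wav",
        "gen_text": "好多天|看不完你",
        "timbre_ref_path": "clips/timbre_001.wav",
        "timbre_ref_text": "该体谅的不执着|如果那天我",
    },
]

with open("input.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        # ensure_ascii=False keeps Chinese lyrics human-readable in the file
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```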
```bash
python batch_infer.py \
--input_type jsonl \
--input_path /path/to/input.jsonl \
--output_dir /path/to/output \
--ckpt_path /path/to/ckpts \
--num_gpus 4
```
Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:
```bash
python inference_mp.py \
--input_type lyric_edit_bench_melody_control \
--output_dir path/to/LyricEditBench_melody_control \
--ckpt_path ASLP-lab/YingMusic-Singer-Plus \
--num_gpus 8
```
Multi-process inference on **LyricEditBench (singing edit)**:
```bash
python inference_mp.py \
--input_type lyric_edit_bench_sing_edit \
--output_dir path/to/LyricEditBench_sing_edit \
--ckpt_path ASLP-lab/YingMusic-Singer-Plus \
--num_gpus 8
```
## 🏗️ Model Architecture
YingMusic-Singer-Plus consists of four core components:
| Component | Description |
|-----------|-------------|
| **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |
**Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
## 📊 LyricEditBench
We introduce **LyricEditBench**, the first benchmark for melody-preserving lyric modification evaluation, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.
### Results
<div align="center">
<p><i>Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics — P: PER, S: SIM, F: F0-CORR, V: VS — are detailed in Section 3. Best results in <b>bold</b>.</i></p>
<img src="./assets/results.png" alt="LyricEditBench Results" width="90%">
</div>
## 🙏 Acknowledgements
This work builds upon the following open-source projects:
- [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
- [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
- [SOME](https://github.com/openvpi/SOME) — Melody Extractor
- [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
- [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
- [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data
## 📄 License
The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following:
The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI, and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
<p align="center">
<img src="https://raw.githubusercontent.com/ASLP-lab/YingMusic-Singer-Plus/main/assets/institutional_logo.svg" alt="Institutional Logo" width="600">
</p>