---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
- singing-voice-synthesis
- lyric-editing
- diffusion-model
- reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---
<div align="center">
<h1>🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance</h1>
<p>
<a href="">English</a> | <a href="README_ZH.md">中文</a>
</p>


[Paper](https://arxiv.org/abs/2603.24589)
[Code](https://github.com/ASLP-lab/YingMusic-Singer-Plus)
[Demo Page](https://aslp-lab.github.io/YingMusic-Singer-Plus-Demo/)
[HuggingFace Space](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus)
[Model](https://huggingface.co/ASLP-lab/YingMusic-Singer-Plus)
[Dataset](https://huggingface.co/datasets/ASLP-lab/LyricEditBench)
[Discord](https://discord.gg/RXghgWyvrn)
[WeChat](https://github.com/ASLP-lab/YingMusic-Singer-Plus/blob/main/assets/wechat_qr.png)
[ASLP@NPU](http://www.npu-aslp.org/)
<p>
<a href="https://orcid.org/0009-0005-5957-8936">Chunbo Hao</a><sup>1,2</sup> ·
<a href="https://orcid.org/0009-0003-2602-2910">Junjie Zheng</a><sup>2</sup> ·
<a href="https://orcid.org/0009-0001-6706-0572">Guobin Ma</a><sup>1</sup> ·
Yuepeng Jiang<sup>1</sup> ·
Huakang Chen<sup>1</sup> ·
Wenjie Tian<sup>1</sup> ·
<a href="https://orcid.org/0009-0003-9258-4006">Gongyu Chen</a><sup>2</sup> ·
<a href="https://orcid.org/0009-0005-5413-6725">Zihao Chen</a><sup>2</sup> ·
Lei Xie<sup>1</sup>
</p>
<p>
<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China<br>
<sup>2</sup> AI Lab, GiantNetwork, China
</p>
</div>
<div align="center">
<img src="./assets/YingMusic-Singer.drawio.svg" alt="YingMusic-Singer Architecture" width="90%">
<p><i>Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.</i></p>
</div>
## 📖 Introduction
**YingMusic-Singer-Plus** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation.
Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer-Plus synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.
## ✨ Key Features
- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE
## 🚀 Quick Start
### Option 1: Install from Scratch
```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```
### Option 2: Pre-built Conda Environment
1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>`.
| CPU Architecture | GPU | OS | Download |
|------------------|--------|---------|----------|
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |
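Steps 3–4 can also be scripted. The sketch below is illustrative only (the helper name `install_prebuilt_env` is not part of this repo) and assumes a standard Miniconda layout under `conda_root`:

```python
import tarfile
from pathlib import Path

def install_prebuilt_env(package: Path, conda_root: Path,
                         env_name: str = "YingMusic-Singer-Plus") -> Path:
    """Extract a pre-built environment package into <conda_root>/envs/<env_name>."""
    env_dir = conda_root / "envs" / env_name
    env_dir.mkdir(parents=True, exist_ok=True)
    # Equivalent to running `tar -xvf <package_name>` inside the env folder
    with tarfile.open(package) as tf:
        tf.extractall(env_dir)
    return env_dir
```

After extraction, activate the environment as usual with `conda activate YingMusic-Singer-Plus`.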
### Option 3: Docker
Build the image (Docker image names must be lowercase):
```bash
docker build -t yingmusic-singer-plus .
```
Run inference:
```bash
docker run --gpus all -it yingmusic-singer-plus
```
## 🎵 Inference
### Option 1: Online Demo (HuggingFace Space)
Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.
### Option 2: Local Gradio App (same as online demo)
```bash
python app_local.py
```
### Option 3: Command-line Inference
```bash
python infer_api.py \
--ref_audio path/to/ref.wav \
--melody_audio path/to/melody.wav \
--ref_text "该体谅的不执着|如果那天我" \
--target_text "好多天|看不完你" \
--output output.wav
```
Enable vocal separation and accompaniment mixing:
```bash
# --separate_vocals: separate vocals from the input before processing
# --mix_accompaniment: mix the synthesized vocal back with the accompaniment
python infer_api.py \
--ref_audio ref.wav \
--melody_audio melody.wav \
--ref_text "..." \
--target_text "..." \
--separate_vocals \
--mix_accompaniment \
--output mixed_output.wav
```
### Option 4: Batch Inference
> **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.
The input JSONL file should contain one JSON object per line, formatted as follows:
```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```
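If you build the JSONL programmatically, a minimal sketch is below (the helper `write_batch_jsonl` is illustrative, not part of this repo; the field names match the format above):

```python
import json
from pathlib import Path

def write_batch_jsonl(entries: list[dict], path: Path) -> None:
    """Write one JSON object per line, validating the required fields."""
    required = {"id", "melody_ref_path", "gen_text",
                "timbre_ref_path", "timbre_ref_text"}
    with path.open("w", encoding="utf-8") as f:
        for entry in entries:
            missing = required - entry.keys()
            if missing:
                raise ValueError(f"entry {entry.get('id')} missing fields: {missing}")
            # ensure_ascii=False keeps Chinese lyrics human-readable in the file
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```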
```bash
python batch_infer.py \
--input_type jsonl \
--input_path /path/to/input.jsonl \
--output_dir /path/to/output \
--ckpt_path /path/to/ckpts \
--num_gpus 4
```
Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:
```bash
python inference_mp.py \
--input_type lyric_edit_bench_melody_control \
--output_dir path/to/LyricEditBench_melody_control \
--ckpt_path ASLP-lab/YingMusic-Singer-Plus \
--num_gpus 8
```
Multi-process inference on **LyricEditBench (singing edit)**:
```bash
python inference_mp.py \
--input_type lyric_edit_bench_sing_edit \
--output_dir path/to/LyricEditBench_sing_edit \
--ckpt_path ASLP-lab/YingMusic-Singer-Plus \
--num_gpus 8
```
## 🏗️ Model Architecture
YingMusic-Singer-Plus consists of four core components:
| Component | Description |
|-----------|-------------|
| **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |
**Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
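A quick sanity check on the numbers above (plain arithmetic, not project code): the component sizes sum to the stated total, and the VAE's 2048× downsampling implies roughly 21.5 latent frames per second at 44.1 kHz.

```python
# Component parameter counts in millions, from the table above
components = {"CFM": 453.6, "VAE": 156.1, "Melody Extractor": 117.6}
total_m = sum(components.values())
print(f"total: {total_m:.1f}M")  # 727.3M

# Latent frame rate implied by 2048x downsampling of 44.1 kHz audio
frames_per_second = 44100 / 2048
print(f"latent rate: {frames_per_second:.1f} frames/s")  # 21.5
```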
## 📊 LyricEditBench
We introduce **LyricEditBench**, the first benchmark for melody-preserving lyric modification evaluation, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.
### Results
<div align="center">
<p><i>Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics — P: PER, S: SIM, F: F0-CORR, V: VS — are detailed in Section 3 of the paper. Best results in <b>bold</b>.</i></p>
<img src="./assets/results.png" alt="LyricEditBench Results" width="90%">
</div>
## 🙏 Acknowledgements
This work builds upon the following open-source projects:
- [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
- [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
- [SOME](https://github.com/openvpi/SOME) — Melody Extractor
- [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
- [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
- [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data
## 📄 License
The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following:
The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI, and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
<p align="center">
<img src="https://raw.githubusercontent.com/ASLP-lab/YingMusic-Singer-Plus/main/assets/institutional_logo.svg" alt="Institutional Logo" width="600">
</p>