---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---

<div align="center">

<h1>🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance</h1>

<p>
  <a href="">English</a><a href="README_ZH.md">中文</a>
</p>


![Python](https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white)
![License](https://img.shields.io/badge/License-CC--BY--4.0-lightgrey)

[![arXiv Paper](https://img.shields.io/badge/arXiv-2603.24589-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.24589)
[![GitHub](https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus)
[![Demo Page](https://img.shields.io/badge/GitHub-Demo--Page-8A2BE2?logo=github&logoColor=white&labelColor=181717)](https://aslp-lab.github.io/YingMusic-Singer-Plus-Demo/)
[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E)](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus)
[![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00)](https://huggingface.co/ASLP-lab/YingMusic-Singer-Plus)
[![Dataset LyricEditBench](https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00)](https://huggingface.co/datasets/ASLP-lab/LyricEditBench)
[![Discord](https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white)](https://discord.gg/RXghgWyvrn)
[![WeChat](https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus/blob/main/assets/wechat_qr.png)
[![Lab](https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9)](http://www.npu-aslp.org/)

<p>
        <a href="https://orcid.org/0009-0005-5957-8936">Chunbo Hao</a><sup>1,2</sup> ·
        <a href="https://orcid.org/0009-0003-2602-2910">Junjie Zheng</a><sup>2</sup> ·
        <a href="https://orcid.org/0009-0001-6706-0572">Guobin Ma</a><sup>1</sup> ·
        Yuepeng Jiang<sup>1</sup> ·
        Huakang Chen<sup>1</sup> ·
        Wenjie Tian<sup>1</sup> ·
        <a href="https://orcid.org/0009-0003-9258-4006">Gongyu Chen</a><sup>2</sup> ·
        <a href="https://orcid.org/0009-0005-5413-6725">Zihao Chen</a><sup>2</sup> ·
        Lei Xie<sup>1</sup>
</p>

<p>
        <sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China<br>
        <sup>2</sup> AI Lab, GiantNetwork, China
</p>

</div>

<div align="center">
<img src="./assets/YingMusic-Singer.drawio.svg" alt="YingMusic-Singer Architecture" width="90%">
<p><i>Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.</i></p>
</div>


## 📖 Introduction

**YingMusic-Singer-Plus** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation.

Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer-Plus synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.


## ✨ Key Features

- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE


## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus

# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```

### Option 2: Pre-built Conda Environment

1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>` (a scripted version of these steps follows the table below).

| CPU Architecture | GPU    | OS      | Download |
|------------------|--------|---------|----------|
| ARM              | NVIDIA | Linux   | Coming soon |
| AMD64            | NVIDIA | Linux   | Coming soon |
| AMD64            | NVIDIA | Windows | Coming soon |
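
Assuming a Linux host and a default Miniconda install under `~/miniconda3`, the steps above can be scripted roughly as follows; the archive name `YingMusic-Singer-Plus-env.tar` is a placeholder for whichever package you downloaded:

```bash
# Verify the conda installation (step 1)
conda --version

# Create the environment folder inside conda's envs directory (steps 2-3)
mkdir -p ~/miniconda3/envs/YingMusic-Singer-Plus

# Move the downloaded package there and unpack it in place (step 4)
mv YingMusic-Singer-Plus-env.tar ~/miniconda3/envs/YingMusic-Singer-Plus/
cd ~/miniconda3/envs/YingMusic-Singer-Plus
tar -xvf YingMusic-Singer-Plus-env.tar

# Activate the unpacked environment
conda activate YingMusic-Singer-Plus
```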

### Option 3: Docker

Build the image (note that Docker image names must be lowercase):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```
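
To make local audio files and outputs visible inside the container, you can mount a host directory with `-v`; the host and container paths below are illustrative:

```bash
docker run --gpus all -it \
    -v "$(pwd)/data:/workspace/data" \
    yingmusic-singer-plus
```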


## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as online demo)

```bash
python app_local.py
```
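
Unless `app_local.py` overrides it, Gradio serves the interface at http://127.0.0.1:7860 by default; open that URL in a browser once the model has finished loading.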

### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "该体谅的不执着|如果那天我" \
    --target_text "好多天|看不完你" \
    --output output.wav
```

Enable vocal separation and accompaniment mixing:

```bash
# --separate_vocals: separate vocals from the input audio before synthesis
# --mix_accompaniment: mix the synthesized vocal back with the separated accompaniment
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```

### Option 4: Batch Inference

> **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.

The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```
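
For more than a handful of items, the file can be assembled with a small shell snippet; the paths below are placeholders and the lyric fields follow the schema shown above:

```bash
cat > input.jsonl << 'EOF'
{"id": "1", "melody_ref_path": "/data/melody_1.wav", "gen_text": "好多天|看不完你", "timbre_ref_path": "/data/timbre_1.wav", "timbre_ref_text": "该体谅的不执着|如果那天我"}
{"id": "2", "melody_ref_path": "/data/melody_2.wav", "gen_text": "...", "timbre_ref_path": "/data/timbre_2.wav", "timbre_ref_text": "..."}
EOF
```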

```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```

Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

Multi-process inference on **LyricEditBench (singing edit)**:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

## 🏗️ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
|-----------|-------------|
| **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

**Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
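
For reference, a 2048× temporal downsampling of 44.1 kHz audio corresponds to roughly 44100 / 2048 ≈ 21.5 latent frames per second, which is the sequence resolution the CFM backbone operates on (assuming the downsampling factor applies purely along the time axis).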


## 📊 LyricEditBench

We introduce **LyricEditBench**, the first benchmark for melody-preserving lyric modification evaluation, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

<div align="center">
<p><i>Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics (P: PER, S: SIM, F: F0-CORR, V: VS) are detailed in Section 3 of the paper. Best results in <b>bold</b>.</i></p>
<img src="./assets/results.png" alt="LyricEditBench Results" width="90%">
</div>


## 🙏 Acknowledgements

This work builds upon the following open-source projects:

- [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
- [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
- [SOME](https://github.com/openvpi/SOME) — Melody Extractor
- [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
- [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
- [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data


## 📄 License

The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following:

The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI, and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).


<p align="center">
  <img src="https://raw.githubusercontent.com/ASLP-lab/YingMusic-Singer-Plus/main/assets/institutional_logo.svg" alt="Institutional Logo" width="600">
</p>