---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---

<div align="center">

<h1>🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance</h1>

<p>
  <a href="">English</a><a href="README_ZH.md">中文</a>
</p>


![Python](https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white)
![License](https://img.shields.io/badge/License-CC--BY--4.0-lightgrey)

[![arXiv Paper](https://img.shields.io/badge/arXiv-2603.24589-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2603.24589)
[![GitHub](https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus)
[![Demo Page](https://img.shields.io/badge/GitHub-Demo--Page-8A2BE2?logo=github&logoColor=white&labelColor=181717)](https://aslp-lab.github.io/YingMusic-Singer-Plus-Demo/)
[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E)](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus)
[![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00)](https://huggingface.co/ASLP-lab/YingMusic-Singer-Plus)
[![Dataset LyricEditBench](https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00)](https://huggingface.co/datasets/ASLP-lab/LyricEditBench)
[![Discord](https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white)](https://discord.gg/RXghgWyvrn)
[![WeChat](https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer-Plus/blob/main/assets/wechat_qr.png)
[![Lab](https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9)](http://www.npu-aslp.org/)

<p>
        <a href="https://orcid.org/0009-0005-5957-8936">Chunbo Hao</a><sup>1,2</sup> ·
        <a href="https://orcid.org/0009-0003-2602-2910">Junjie Zheng</a><sup>2</sup> ·
        <a href="https://orcid.org/0009-0001-6706-0572">Guobin Ma</a><sup>1</sup> ·
        Yuepeng Jiang<sup>1</sup> ·
        Huakang Chen<sup>1</sup> ·
        Wenjie Tian<sup>1</sup> ·
        <a href="https://orcid.org/0009-0003-9258-4006">Gongyu Chen</a><sup>2</sup> ·
        <a href="https://orcid.org/0009-0005-5413-6725">Zihao Chen</a><sup>2</sup> ·
        Lei Xie<sup>1</sup>
</p>

<p>
        <sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China<br>
        <sup>2</sup> AI Lab, GiantNetwork, China
</p>

</div>

<div align="center">
<img src="./assets/YingMusic-Singer.drawio.svg" alt="YingMusic-Singer Architecture" width="90%">
<p><i>Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.</i></p>
</div>


## 📖 Introduction

**YingMusic-Singer-Plus** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation.

Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer-Plus synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.


## ✨ Key Features

- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE


## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus

# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```

### Option 2: Pre-built Conda Environment

1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>` (a scripted version of these steps follows the table below).

| CPU Architecture | GPU    | OS      | Download |
|------------------|--------|---------|----------|
| ARM              | NVIDIA | Linux   | Coming soon |
| AMD64            | NVIDIA | Linux   | Coming soon |
| AMD64            | NVIDIA | Windows | Coming soon |
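
Assuming a Linux host and a default Miniconda install under `~/miniconda3`, the steps above can be scripted roughly as follows; the archive name `YingMusic-Singer-Plus-env.tar` is a placeholder for whichever package you downloaded:

```bash
# Verify the conda installation (step 1)
conda --version

# Create the environment folder inside conda's envs directory (steps 2-3)
mkdir -p ~/miniconda3/envs/YingMusic-Singer-Plus

# Move the downloaded package there and unpack it in place (step 4)
mv YingMusic-Singer-Plus-env.tar ~/miniconda3/envs/YingMusic-Singer-Plus/
cd ~/miniconda3/envs/YingMusic-Singer-Plus
tar -xvf YingMusic-Singer-Plus-env.tar

# Activate the unpacked environment
conda activate YingMusic-Singer-Plus
```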

### Option 3: Docker

Build the image (note that Docker image names must be lowercase):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```
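
To make local audio files and outputs visible inside the container, you can mount a host directory with `-v`; the host and container paths below are illustrative:

```bash
docker run --gpus all -it \
    -v "$(pwd)/data:/workspace/data" \
    yingmusic-singer-plus
```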


## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as online demo)

```bash
python app_local.py
```
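
Unless `app_local.py` overrides it, Gradio serves the interface at http://127.0.0.1:7860 by default; open that URL in a browser once the model has finished loading.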

### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "该体谅的不执着|如果那天我" \
    --target_text "好多天|看不完你" \
    --output output.wav
```

Enable vocal separation and accompaniment mixing:

```bash
# --separate_vocals: separate vocals from the input audio before synthesis
# --mix_accompaniment: mix the synthesized vocal back with the separated accompaniment
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```

### Option 4: Batch Inference

> **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.

The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```
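
For more than a handful of items, the file can be assembled with a small shell snippet; the paths below are placeholders and the lyric fields follow the schema shown above:

```bash
cat > input.jsonl << 'EOF'
{"id": "1", "melody_ref_path": "/data/melody_1.wav", "gen_text": "好多天|看不完你", "timbre_ref_path": "/data/timbre_1.wav", "timbre_ref_text": "该体谅的不执着|如果那天我"}
{"id": "2", "melody_ref_path": "/data/melody_2.wav", "gen_text": "...", "timbre_ref_path": "/data/timbre_2.wav", "timbre_ref_text": "..."}
EOF
```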

```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```

Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

Multi-process inference on **LyricEditBench (singing edit)**:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

## 🏗️ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
|-----------|-------------|
| **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

**Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
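
For reference, a 2048× temporal downsampling of 44.1 kHz audio corresponds to roughly 44100 / 2048 ≈ 21.5 latent frames per second, which is the sequence resolution the CFM backbone operates on (assuming the downsampling factor applies purely along the time axis).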


## 📊 LyricEditBench

We introduce **LyricEditBench**, the first benchmark for melody-preserving lyric modification evaluation, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

<div align="center">
<p><i>Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics (P: PER, S: SIM, F: F0-CORR, V: VS) are detailed in Section 3 of the paper. Best results in <b>bold</b>.</i></p>
<img src="./assets/results.png" alt="LyricEditBench Results" width="90%">
</div>


## 🙏 Acknowledgements

This work builds upon the following open-source projects:

- [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
- [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
- [SOME](https://github.com/openvpi/SOME) — Melody Extractor
- [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
- [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
- [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data


## 📄 License

The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following:

The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI, and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).


<p align="center">
  <img src="https://raw.githubusercontent.com/ASLP-lab/YingMusic-Singer-Plus/main/assets/institutional_logo.svg" alt="Institutional Logo" width="600">
</p>