---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: "3.10"
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---
<div align="center">
<h1>🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance</h1>
<p>
<a href="">English</a> | <a href="README_ZH.md">中文</a>
</p>

[Paper](https://arxiv.org/abs/2603.24589)
[Code](https://github.com/ASLP-lab/YingMusic-Singer-Plus)
[Demo Page](https://aslp-lab.github.io/YingMusic-Singer-Plus-Demo/)
[HF Space](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus)
[Model](https://huggingface.co/ASLP-lab/YingMusic-Singer-Plus)
[LyricEditBench](https://huggingface.co/datasets/ASLP-lab/LyricEditBench)
[Discord](https://discord.gg/RXghgWyvrn)
[WeChat](https://github.com/ASLP-lab/YingMusic-Singer-Plus/blob/main/assets/wechat_qr.png)
[ASLP Lab](http://www.npu-aslp.org/)
<p>
<a href="https://orcid.org/0009-0005-5957-8936">Chunbo Hao</a><sup>1,2</sup> ·
<a href="https://orcid.org/0009-0003-2602-2910">Junjie Zheng</a><sup>2</sup> ·
<a href="https://orcid.org/0009-0001-6706-0572">Guobin Ma</a><sup>1</sup> ·
Yuepeng Jiang<sup>1</sup> ·
Huakang Chen<sup>1</sup> ·
Wenjie Tian<sup>1</sup> ·
<a href="https://orcid.org/0009-0003-9258-4006">Gongyu Chen</a><sup>2</sup> ·
<a href="https://orcid.org/0009-0005-5413-6725">Zihao Chen</a><sup>2</sup> ·
Lei Xie<sup>1</sup>
</p>
<p>
<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China<br>
<sup>2</sup> AI Lab, GiantNetwork, China
</p>
</div>
<div align="center">
<img src="./assets/YingMusic-Singer.drawio.svg" alt="YingMusic-Singer Architecture" width="90%">
<p><i>Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.</i></p>
</div>
## 📖 Introduction

**YingMusic-Singer-Plus** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation.

Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer-Plus synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.
## ✨ Key Features

- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE
## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```
### Option 2: Pre-built Conda Environment

1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>`.

| CPU Architecture | GPU | OS | Download |
|------------------|--------|---------|----------|
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |
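Steps 3-4 can also be scripted. The sketch below is a minimal helper assuming a standard `tar` package; the conda root and package name in the usage comment are hypothetical and must be adjusted to your setup:

```python
import os
import tarfile

def extract_env_package(package_path: str, env_dir: str) -> None:
    """Create the env folder, then extract the package into it (mirrors `tar -xvf <package_name>`)."""
    os.makedirs(env_dir, exist_ok=True)
    with tarfile.open(package_path) as tar:
        tar.extractall(env_dir)

# Hypothetical usage -- adjust the conda root and package name to your setup:
# extract_env_package("~/Downloads/<package_name>",
#                     os.path.expanduser("~/miniconda3/envs/YingMusic-Singer-Plus"))
```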
### Option 3: Docker

Build the image (Docker requires lowercase image names):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```
## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as online demo)

```bash
python app_local.py
```
### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "该体谅的不执着|如果那天我" \
    --target_text "好多天|看不完你" \
    --output output.wav
```
Enable vocal separation and accompaniment mixing. `--separate_vocals` separates vocals from the input before processing; `--mix_accompaniment` mixes the synthesized vocal back with the accompaniment. (Note: inline `#` comments cannot follow a `\` line continuation in bash, so the flags are listed bare.)

```bash
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```
### Option 4: Batch Inference

> **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.

The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```
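Such a file can be generated with a few lines of Python — a sketch; the `XXX` paths are placeholders for your own audio files, and the field names follow the schema above:

```python
import json

# Each record mirrors the fields of the JSONL schema shown above.
records = [
    {
        "id": "1",
        "melody_ref_path": "XXX",
        "gen_text": "好多天|看不完你",
        "timbre_ref_path": "XXX",
        "timbre_ref_text": "该体谅的不执着|如果那天我",
    },
]

# ensure_ascii=False keeps the Chinese lyrics readable in the output file.
with open("input.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```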
```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```
Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

Multi-process inference on **LyricEditBench (singing edit)**:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```
## 🏗️ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
|-----------|-------------|
| **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

**Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
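As a quick sanity check on these figures: the 2048× temporal downsampling of the VAE implies a latent frame rate of roughly 21.5 Hz at 44.1 kHz, and the per-component parameter counts sum to the stated total:

```python
# Latent frame rate implied by the VAE's 2048x temporal downsampling.
sample_rate = 44_100  # Hz
downsample_factor = 2048
latent_rate = sample_rate / downsample_factor
print(f"latent frame rate: {latent_rate:.2f} Hz")  # ~21.53 Hz

# Component parameter counts (in millions) from the table above.
cfm, vae, melody_extractor = 453.6, 156.1, 117.6
total = cfm + vae + melody_extractor
print(f"total parameters: {total:.1f}M")  # 727.3M
```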
## 📊 LyricEditBench

We introduce **LyricEditBench**, the first benchmark for melody-preserving lyric modification evaluation, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

<div align="center">
<p><i>Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics — P: PER, S: SIM, F: F0-CORR, V: VS — are detailed in Section 3. Best results in <b>bold</b>.</i></p>
<img src="./assets/results.png" alt="LyricEditBench Results" width="90%">
</div>
## 🙏 Acknowledgements

This work builds upon the following open-source projects:

- [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
- [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
- [SOME](https://github.com/openvpi/SOME) — Melody Extractor
- [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
- [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
- [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data
## 📄 License

The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following:

The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
| <p align="center"> | |
| <img src="https://raw.githubusercontent.com/ASLP-lab/YingMusic-Singer-Plus/main/assets/institutional_logo.svg" alt="Institutional Logo" width="600"> | |
| </p> |