---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: '3.10'
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---
# 🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

Chunbo Hao<sup>1,2</sup> · Junjie Zheng<sup>2</sup> · Guobin Ma<sup>1</sup> · Yuepeng Jiang<sup>1</sup> · Huakang Chen<sup>1</sup> · Wenjie Tian<sup>1</sup> · Gongyu Chen<sup>2</sup> · Zihao Chen<sup>2</sup> · Lei Xie<sup>1</sup>

<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
<sup>2</sup> AI Lab, GiantNetwork, China
*Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.*
## 📖 Introduction

YingMusic-Singer-Plus is a fully diffusion-based singing voice synthesis model that enables melody-controllable singing voice editing with flexible lyric manipulation, requiring no manual alignment or precise phoneme annotation.

Given only three inputs (an optional timbre reference, a melody-providing singing clip, and modified lyrics), YingMusic-Singer-Plus synthesizes high-fidelity singing voices at 44.1 kHz while faithfully preserving the original melody.
## ✨ Key Features

- Annotation-free: No manual lyric-MIDI alignment required at inference
- Flexible lyric manipulation: Supports six editing types: partial and full changes, insertion, deletion, translation (CN⇄EN), and code-switching
- Strong melody preservation: CKA-based melody alignment loss + GRPO-based optimization
- Bilingual: Unified IPA tokenizer for both Chinese and English
- High fidelity: 44.1 kHz stereo output via the Stable Audio 2 VAE
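The CKA-based melody alignment loss mentioned above compares feature representations of generated and reference audio. The paper's exact CKA variant is not given in this README, so the following is a minimal linear-CKA sketch in NumPy; the feature shapes and the linear kernel are assumptions:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two feature matrices.

    X, Y: (n_frames, dim) activations; columns are mean-centered first.
    Returns a similarity in [0, 1]; 1.0 means identical up to rotation/scale.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 64))
print(round(linear_cka(feats, feats), 4))  # identical features -> 1.0
```

A loss would typically be `1 - linear_cka(melody_feats_gen, melody_feats_ref)`, pushing the generated audio's melody features toward the reference.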
## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus
# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```
### Option 2: Pre-built Conda Environment

1. Download and install Miniconda from https://repo.anaconda.com/miniconda/ for your platform. Verify the installation with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>`.
| CPU Architecture | GPU | OS | Download |
|---|---|---|---|
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |
### Option 3: Docker

Build the image (Docker repository names must be lowercase):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```
## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as the online demo)

```bash
python app_local.py
```
### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "่ฏฅไฝ่ฐ็ไธๆง็|ๅฆๆ้ฃๅคฉๆ" \
    --target_text "ๅฅฝๅคๅคฉ|็ไธๅฎไฝ " \
    --output output.wav
```
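If you prefer to drive the CLI from Python, a thin wrapper can assemble the same invocation. The flag names come from the example above; the wrapper function and sample paths are illustrative:

```python
import shlex
import subprocess

def build_infer_cmd(ref_audio, melody_audio, ref_text, target_text, output):
    """Assemble the infer_api.py command line shown in the README example."""
    return [
        "python", "infer_api.py",
        "--ref_audio", ref_audio,
        "--melody_audio", melody_audio,
        "--ref_text", ref_text,
        "--target_text", target_text,
        "--output", output,
    ]

cmd = build_infer_cmd("ref.wav", "melody.wav", "la la|la la", "di da|di da", "out.wav")
print(shlex.join(cmd))  # shell-quoted command for inspection
# subprocess.run(cmd, check=True)  # uncomment to actually run inference
```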
Enable vocal separation and accompaniment mixing (note: a comment after a trailing `\` would break the line continuation, so the flag descriptions go above the command):

```bash
# --separate_vocals: separate vocals from the input before processing
# --mix_accompaniment: mix the synthesized vocal back with the accompaniment
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```
### Option 4: Batch Inference

Note: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.

The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "ๅฅฝๅคๅคฉ|็ไธๅฎไฝ ", "timbre_ref_path": "XXX", "timbre_ref_text": "่ฏฅไฝ่ฐ็ไธๆง็|ๅฆๆ้ฃๅคฉๆ"}
```
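A small helper can generate and sanity-check this JSONL before launching the batch job. The key schema is taken from the example above; the helper itself (`write_jsonl`, the sample paths) is hypothetical:

```python
import json
from pathlib import Path

# Keys expected by batch_infer.py, per the README example
REQUIRED = ("id", "melody_ref_path", "gen_text", "timbre_ref_path", "timbre_ref_text")

def write_jsonl(records, path):
    """Write batch-inference records, one JSON object per line, validating keys."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            missing = [k for k in REQUIRED if k not in rec]
            if missing:
                raise ValueError(f"record {rec.get('id')} is missing {missing}")
            # ensure_ascii=False keeps Chinese lyrics readable in the file
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

write_jsonl(
    [{"id": "1", "melody_ref_path": "mel.wav", "gen_text": "new|lyrics",
      "timbre_ref_path": "tim.wav", "timbre_ref_text": "old|lyrics"}],
    "input.jsonl",
)
print(Path("input.jsonl").read_text(encoding="utf-8").count("\n"))  # one record -> 1
```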
```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```
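With `--num_gpus 4`, the batch script presumably splits the JSONL lines across worker processes, one per GPU. A minimal round-robin sharding sketch (the actual strategy inside `batch_infer.py` is an assumption):

```python
def shard_round_robin(lines, num_workers):
    """Deal JSONL lines out to workers round-robin, so shard sizes differ by at most 1."""
    return [lines[i::num_workers] for i in range(num_workers)]

lines = [f'{{"id": "{i}"}}' for i in range(10)]
shards = shard_round_robin(lines, 4)
print([len(s) for s in shards])  # -> [3, 3, 2, 2]
```

Each shard would then be handed to one worker process pinned to its own GPU (e.g. via `CUDA_VISIBLE_DEVICES`).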
Multi-process inference on LyricEditBench (melody control); the test set is downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```
Multi-process inference on LyricEditBench (singing edit):

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```
## 🏗️ Model Architecture

YingMusic-Singer-Plus consists of four core components:

| Component | Description |
|---|---|
| VAE | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
| Melody Extractor | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
| IPA Tokenizer | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
| DiT-based CFM | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |

Total parameters: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
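Two quick sanity checks on these numbers: the per-component parameter counts sum to the reported total, and the 2048× VAE downsampling implies the latent frame rate the DiT operates at (the rate is derived here, not stated in this README):

```python
# Per-component parameter counts (in millions), from the table above
params_m = {"CFM": 453.6, "VAE": 156.1, "Melody Extractor": 117.6}
total = sum(params_m.values())
print(f"total: {total:.1f}M")  # matches the reported ~727.3M

# Latent frame rate implied by 2048x downsampling of 44.1 kHz audio
latent_hz = 44100 / 2048
print(f"latent rate: {latent_hz:.2f} Hz")  # about 21.53 latent frames per second
```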
## 📊 LyricEditBench

We introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation, built on GTSinger. The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

*Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics (P: PER, S: SIM, F: F0-CORR, V: VS) are detailed in Section 3 of the paper. Best results in bold.*
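F0-CORR is commonly computed as the Pearson correlation between the reference and generated F0 contours over frames that are voiced in both; the sketch below follows that common formulation, though the benchmark's exact definition (voicing handling, alignment) may differ:

```python
import numpy as np

def f0_corr(f0_ref: np.ndarray, f0_gen: np.ndarray) -> float:
    """Pearson correlation of two frame-level F0 contours in Hz.

    Frames with f0 == 0 are treated as unvoiced and excluded; only frames
    voiced in both contours contribute to the correlation.
    """
    voiced = (f0_ref > 0) & (f0_gen > 0)
    return float(np.corrcoef(f0_ref[voiced], f0_gen[voiced])[0, 1])

t = np.linspace(0, 1, 200)
ref = 220 + 20 * np.sin(2 * np.pi * 3 * t)                 # vibrato-like contour
gen = ref + np.random.default_rng(0).normal(0, 1, t.size)  # slightly noisy copy
print(round(f0_corr(ref, gen), 3))
```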
## 🙏 Acknowledgements

This work builds upon the following open-source projects:

- F5-TTS: DiT-based CFM backbone
- Stable Audio 2: VAE architecture
- SOME: Melody Extractor
- DiffRhythm: Sentence-level alignment strategy
- GTSinger: Benchmark base corpus
- Emilia: TTS pretraining data
## 📄 License

The code and model weights in this project are licensed under CC BY 4.0, except for the following:

- The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from Stable Audio Open by Stability AI and are licensed under the Stability AI Community License.