Lyric Rewriting & Singing Voice Synthesis System
A Professional Toolchain for AI-Powered Vocal Editing and Synthesis
0. System Setup Guide
0.1 Environment Preparation
Hardware Requirements:
- NVIDIA GPU (≥16 GB VRAM recommended)
- CUDA 11.7+ and cuDNN 8.7+
Installation Steps:
# Create conda environment
conda create -n songedit python=3.10 -y
conda activate songedit
# Install dependencies (env.sh contents)
pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
pip install -r requirements.txt
# Install audio processing libs
conda install -c conda-forge ffmpeg libsndfile
0.2 Model Checkpoints
Download pretrained models from HuggingFace:
# Install huggingface_hub if needed
pip install huggingface_hub
# Download all checkpoints
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='badd9yang/songedit',
                  local_dir='checkpoints')  # For private repos, also pass token='...'
"
# Expected folder structure:
checkpoints/
├── step1/
│   ├── separate_model.pt
│   ├── whisper/
│   ├── ...
│   └── align.ckpt
└── step2/
    ├── whisper-small/
    ├── model_v1.pt
    └── model_v2.pt
Note: For manual download, get the models from the HuggingFace repo (badd9yang/songedit).
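To confirm the download matches the expected folder structure, a small check along these lines may help. It is a sketch only: the helper name `missing_checkpoints` is hypothetical, and the list covers just the entries named in the tree above (anything hidden behind `...` is not checked).

```python
from pathlib import Path

# Entries taken from the expected folder structure; the "..." items are unknown.
EXPECTED = [
    "step1/separate_model.pt",
    "step1/whisper",
    "step1/align.ckpt",
    "step2/whisper-small",
    "step2/model_v1.pt",
    "step2/model_v2.pt",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected files or folders that are absent under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoints found.")
```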
1. Core Features Overview
1.1 Song Editing Pipeline
Transform raw audio into customizable singing performances with:
- Vocal Separation: Isolate vocals from accompaniment
- Lyric Transcription: Automatic lyric recognition via Whisper ASR
- Time-Alignment: Precise phoneme-level synchronization (MFA-based)
- Singing Voice Synthesis: DiffSinger-powered singing generation
- Voice Conversion: Timbre modification via Seed-VC
2. Technical Implementation
2.1 Audio Preprocessing & Alignment
Workflow Steps
Input Preparation
- Place vocal+accompaniment audio in /data/input_data
- System automatically:
  - Extracts clean vocals
  - Segments into 3-30 s clips (VAD-based)
  - Generates time-aligned lyrics (Whisper + MFA)
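The VAD-based segmentation into 3-30 s clips can be pictured with the sketch below. It is an illustration under stated assumptions, not the project's actual logic: the function `segment_clips` is hypothetical, and it assumes the VAD yields speech intervals as `(start, end)` pairs in seconds. Short neighbouring intervals are merged until a clip reaches the minimum length, long intervals are cut at the maximum length, and leftovers that are still too short are dropped.

```python
def segment_clips(speech, min_len=3.0, max_len=30.0):
    """Group VAD speech intervals into clips of min_len..max_len seconds."""
    clips = []
    for start, end in sorted(speech):
        # Cut any interval longer than max_len into max_len-sized pieces.
        pieces = []
        while end - start > max_len:
            pieces.append((start, start + max_len))
            start += max_len
        pieces.append((start, end))

        for s, e in pieces:
            prev_too_short = bool(clips) and clips[-1][1] - clips[-1][0] < min_len
            if prev_too_short and e - clips[-1][0] <= max_len:
                # Extend the previous short clip (this spans the silence between).
                clips[-1] = (clips[-1][0], e)
            else:
                clips.append((s, e))

    # Drop clips that could not be extended to the minimum length.
    return [(s, e) for s, e in clips if e - s >= min_len]


print(segment_clips([(0.0, 1.0), (1.5, 4.0), (10.0, 45.0), (50.0, 51.0)]))
# -> [(0.0, 4.0), (10.0, 40.0), (40.0, 45.0)]
```

A greedy merge like this keeps the example short; a production segmenter would more likely cut long intervals at the quietest point rather than at a fixed length.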
Feature Extraction
- Outputs DS-format files containing:
  - Phoneme sequences
  - Duration/pitch contours
  - Linguistic features
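For inspecting the generated files, the sketch below summarizes a DS file. It assumes the files follow the DiffSinger `.ds` convention, that is, a JSON list of segments whose `ph_seq` and `ph_dur` fields are space-separated strings, with optional `offset` and `text` fields. Whether this pipeline writes exactly those keys is an assumption, and `summarize_ds` is a hypothetical helper.

```python
import json


def summarize_ds(path):
    """Return per-segment phoneme counts and durations from a DS file."""
    with open(path, encoding="utf-8") as f:
        segments = json.load(f)
    if isinstance(segments, dict):
        # Tolerate a file that holds a single segment instead of a list.
        segments = [segments]

    summary = []
    for seg in segments:
        phonemes = seg["ph_seq"].split()
        durations = [float(d) for d in seg["ph_dur"].split()]
        if len(phonemes) != len(durations):
            raise ValueError("ph_seq and ph_dur differ in length")
        summary.append({
            "offset": float(seg.get("offset", 0.0)),
            "text": seg.get("text", ""),
            "n_phonemes": len(phonemes),
            "duration": sum(durations),  # seconds
        })
    return summary
```

Calling `summarize_ds("song.ds")` on a file of that shape returns one dictionary per segment, which makes it easy to spot segments whose phoneme and duration lists have drifted apart after manual editing.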
User Interaction
# Initialize processing module
import os
from songedit.songedit import *

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model = SongEdit(
    separate_model_path="checkpoints/step1/separate_model.pt",
    asr_model_path="checkpoints/step1/whisper",
    align_model_path="checkpoints/step1/align.ckpt",
    spk_dia_model_path="checkpoints/step1",
    vad_model_path="checkpoints/step1/vad.onnx",
)
2.2 Lyric Editing Interface
Key Functions
Code Implementation:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from songedit.songedit import *
from songedit.svc import ReplaceLyrics
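# Proofread the alignment results; both paths are placeholders for your own folders
proofread = Proofreading("checkpoints/step1/align.ckpt")
proofread.process("data/your_proofreading_path",
                  "data/your_proofreading_temp_save_path")

# Replace the lyrics in the proofread DS file and save a new DS file
lyric_editor = ReplaceLyrics()
lyric_editor.process(
    "your_proofread.ds",
    "your_modified_lyrics.txt",
    "save_modified.ds"
)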
3. Singing Voice Synthesis Engine
3.1 Multi-Stage Synthesis Pipeline
graph LR
A[DS File] --> B(DiffSinger SVS)
B --> C[Raw Vocal]
C --> D{Apply VC?}
D -->|Yes| E[Seed-VC Timbre Transfer]
D -->|No| F[Final Output]
E --> F
Advanced Controls
# Full synthesis+conversion with pitch adaptation
from songedit.svc import *
model = SingingVoiceSynthesis(
    "checkpoints/step2/model_v1.pt",
    "checkpoints/step2/model_v2.pt",
    "checkpoints/step2/whisper-small/")
model(
    ds_file_path="song.ds",
    out_path="result.wav",
    ref_wav_path="target_voice.wav",
    pitch_shift_svs=12,    # +1 octave during synthesis
    pitch_shift_svc=-12,   # Normalize pitch post-VC
    diffusion_steps=100,   # Higher = better quality
    mode="svs_svc"         # Pipeline selection
)
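The comments in the call above describe a shift of 12 as one octave, which suggests that `pitch_shift_svs` and `pitch_shift_svc` are given in semitones (an assumption, since the parameters are not documented here). Under that reading, the effect on pitch can be worked out with the standard equal-temperament relation, ratio = 2^(n/12). The helpers `semitone_ratio` and `shift_f0` below are hypothetical and only illustrate the arithmetic; unvoiced frames are assumed to be marked with 0.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift given in semitones."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz, semitones):
    """Shift an F0 contour in Hz; frames with 0 (unvoiced) stay at 0."""
    ratio = semitone_ratio(semitones)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


print(semitone_ratio(12))             # 2.0, one octave up
print(semitone_ratio(-12))            # 0.5, one octave down
print(shift_f0([220.0, 0.0], 12))     # [440.0, 0.0]
```

Because the two ratios multiply to 1, a +12 shift during synthesis followed by a -12 shift during conversion returns the result to the original key.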
4. Professional Mixing Tools
4.1 Vocal-Accompaniment Blending
Industry-standard processing chain:
- EQ Matching: Reduce frequency clashes
- Sidechain Compression: Dynamic vocal emphasis
- Spatial Enhancement: Convolution reverb
- Loudness Optimization: Mastering-grade limiting
Usage Example:
model.combine(
    gen_vocal="ai_vocal.wav",
    accomp="instrumental.wav",
    out_path="mixed.wav",
    vocal_volume=0.7,           # 70% vocal prominence
    time_stamps=[(1.2, 2.5)],   # Timbre modification regions
)
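To give a feel for what `vocal_volume` might control, here is a deliberately simplified mixing sketch. It is not the processing chain listed above (no EQ, sidechain compression, or reverb). It assumes `vocal_volume` acts as a linear gain on the vocal track, which is a guess, and the function `simple_mix` is hypothetical. Both tracks are assumed to be mono float arrays at the same sample rate.

```python
import numpy as np


def simple_mix(vocal, accomp, vocal_volume=0.7, peak=0.99):
    """Sum a gain-scaled vocal with the accompaniment and avoid clipping."""
    vocal = np.asarray(vocal, dtype=np.float32)
    accomp = np.asarray(accomp, dtype=np.float32)

    # Pad the shorter track with silence by writing into a zero buffer.
    n = max(len(vocal), len(accomp))
    mix = np.zeros(n, dtype=np.float32)
    mix[:len(vocal)] += vocal_volume * vocal
    mix[:len(accomp)] += accomp

    # Scale the whole mix down only if it would exceed the peak ceiling.
    top = float(np.max(np.abs(mix))) if n else 0.0
    if top > peak:
        mix *= peak / top
    return mix
```

Peak scaling is the crudest form of limiting; it is used here only so the example output never clips.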
5. System Capabilities
Development Progress
✅ Vocal Isolation: State-of-the-art separation
✅ Lyric-to-Audio Alignment: <5 ms phoneme precision
✅ Neural Singing Synthesis: 44.1 kHz studio quality
✅ Real-Time Voice Conversion: <500 ms latency
Roadmap
- New DiffSinger acoustic model: Flow Matching architecture (3× faster)
- DiffSinger variance model: Controllable style
- ONNX Export: Cross-platform deployment
6. Acknowledgments
We extend gratitude to the open-source community:
- DiffSinger: Neural singing synthesis
- SOFA: Industrial-grade alignment
- Seed-VC: Zero-shot voice conversion
Last Updated: May 2025
"From raw audio to professional vocal production, all in one pipeline."