Lyric Rewriting & Singing Voice Synthesis System

A Professional Toolchain for AI-Powered Vocal Editing and Synthesis

0. System Setup Guide

0.1 Environment Preparation

Hardware Requirements:

NVIDIA GPU (≥16GB VRAM recommended)
CUDA 11.7+ and cuDNN 8.7+

Installation Steps:

# Create conda environment
conda create -n songedit python=3.10 -y
conda activate songedit

# Install dependencies (env.sh contents)
pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
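# Note: this index serves the CUDA 12 builds of onnxruntime-gpu; check that it matches the CUDA runtime on your machine
pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/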
pip install -r requirements.txt

# Install audio processing libs
conda install -c conda-forge ffmpeg libsndfile
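
After installation, it can be useful to confirm that both PyTorch and ONNX Runtime can see the GPU. The following is a minimal sketch, not part of the project: the helper name check_environment is hypothetical, and a value of None means the package could not be imported.

```python
import importlib


def check_environment():
    """Report GPU visibility for torch and onnxruntime without raising on missing packages."""
    report = {"torch_cuda": None, "onnxruntime_cuda": None}
    try:
        torch = importlib.import_module("torch")
        report["torch_cuda"] = bool(torch.cuda.is_available())
    except ImportError:
        pass  # torch is not installed in this environment
    try:
        ort = importlib.import_module("onnxruntime")
        providers = ort.get_available_providers()
        report["onnxruntime_cuda"] = "CUDAExecutionProvider" in providers
    except ImportError:
        pass  # onnxruntime is not installed in this environment
    return report


if __name__ == "__main__":
    print(check_environment())
```

If either entry is False, the CUDA version expected by that package probably does not match the local driver or runtime.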

0.2 Model Checkpoints

Download pretrained models from HuggingFace:

# Install huggingface_hub if needed
pip install huggingface_hub

# Download all checkpoints
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='badd9yang/songedit',
                  local_dir='checkpoints')  # private repos also need a token= argument
"

# Expected folder structure:
checkpoints/
├── step1/
│   ├── separate_model.pt
│   ├── whisper/
│   ├── ...
│   └── align.ckpt
└── step2/
    ├── whisper-small/
    ├── model_v1.pt
    └── model_v2.pt

Note: To download manually, get the models from the HuggingFace repo (badd9yang/songedit) and place them under checkpoints/ using the folder structure above.
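
A quick way to confirm the download is complete is to check for the entries named in the folder structure above. This is a small sketch under that assumption; the helper missing_checkpoints is hypothetical, and it only checks the paths listed explicitly (anything hidden behind the "..." entry is not covered).

```python
from pathlib import Path

# Paths taken from the expected folder structure; elided entries are not checked
EXPECTED = [
    "step1/separate_model.pt",
    "step1/whisper",
    "step1/align.ckpt",
    "step2/whisper-small",
    "step2/model_v1.pt",
    "step2/model_v2.pt",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("All checkpoints found" if not missing else f"Missing: {missing}")
```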

1. Core Features Overview

1.1 Song Editing Pipeline

Transform raw audio into customizable singing performances with:
✔ Vocal Separation – Isolate vocals from accompaniment
✔ Lyric Transcription – Automatic lyric recognition via Whisper ASR
✔ Time-Alignment – Precise phoneme-level synchronization (MFA-based)
✔ Singing Voice Synthesis – DiffSinger-powered singing generation
✔ Voice Conversion – Timbre modification via Seed-VC

2. Technical Implementation

2.1 Audio Preprocessing & Alignment

Workflow Steps

Input Preparation
- Place the mixed audio (vocals + accompaniment) in /data/input_data
- The system then automatically:
  - Extracts clean vocals
  - Segments them into 3-30s clips (VAD-based)
  - Generates time-aligned lyrics (Whisper + MFA)

Feature Extraction
- Outputs DS-format files containing:
  - Phoneme sequences
  - Duration/pitch contours
  - Linguistic features
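
To illustrate the 3-30s segmentation step, here is one possible way to turn VAD speech intervals into clips. It is a simplified sketch, not the system's actual algorithm: the function segment_clips and its greedy merging strategy are assumptions made for illustration.

```python
def segment_clips(speech_intervals, min_len=3.0, max_len=30.0):
    """Merge (start, end) speech intervals, in seconds, into clips of min_len..max_len."""
    merged = []
    for start, end in sorted(speech_intervals):
        # Extend the current clip while the total span stays within max_len
        if merged and end - merged[-1][0] <= max_len:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    clips = []
    for start, end in merged:
        # A single speech region may still exceed max_len, so cut it into chunks
        while end - start > max_len:
            clips.append((start, start + max_len))
            start += max_len
        # Drop leftovers that are too short to be useful
        if end - start >= min_len:
            clips.append((start, end))
    return clips


if __name__ == "__main__":
    vad = [(0.0, 2.0), (2.5, 10.0), (40.0, 41.0), (50.0, 85.0)]
    print(segment_clips(vad))  # [(0.0, 10.0), (50.0, 80.0), (80.0, 85.0)]
```

The greedy merge bridges pauses of any length as long as the clip fits in 30 seconds; a production version would likely also cap the allowed pause and cut at silence rather than at fixed offsets.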

User Interaction

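# Initialize the processing module
import os

# Select the GPU before importing songedit so the setting is in place when CUDA starts
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from songedit.songedit import SongEdit

model = SongEdit(
    separate_model_path="checkpoints/step1/separate_model.pt",  # vocal separation
    asr_model_path="checkpoints/step1/whisper",                 # lyric transcription (Whisper)
    align_model_path="checkpoints/step1/align.ckpt",            # lyric-to-audio alignment
    spk_dia_model_path="checkpoints/step1",                     # speaker diarization model directory
    vad_model_path="checkpoints/step1/vad.onnx",                # voice activity detection
)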

2.2 Lyric Editing Interface

Key Functions

Code Implementation:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from songedit.songedit import Proofreading
from songedit.svc import ReplaceLyrics

# Step 1: proofread the aligned files using the alignment checkpoint
proofread = Proofreading("checkpoints/step1/align.ckpt")
proofread.process(
    "data/your_proofreading_path",
    "data/your_proofreading_temp_save_path",
)

# Step 2: replace the lyrics in the proofread DS file and save the result
lyric_editor = ReplaceLyrics()
lyric_editor.process(
    "your_proofread.ds",
    "your_modified_lyrics.txt",
    "save_modified.ds",
)
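
Before and after lyric replacement, it can help to sanity-check a DS file. The sketch below assumes the files follow the common DiffSinger convention: a JSON list of segments whose "ph_seq" and "ph_dur" fields are space-separated strings. That layout is an assumption about this project's output, and load_ds and summarize_ds are hypothetical helpers.

```python
import json


def load_ds(path):
    """Load a DS file, assuming it is JSON holding one segment or a list of segments."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]


def summarize_ds(segments):
    """Count phonemes and total duration per segment (assumed keys: ph_seq, ph_dur)."""
    summary = []
    for seg in segments:
        phonemes = seg["ph_seq"].split()
        durations = [float(d) for d in seg["ph_dur"].split()]
        if len(phonemes) != len(durations):
            raise ValueError("ph_seq and ph_dur have different lengths")
        summary.append({"n_phonemes": len(phonemes), "duration_s": sum(durations)})
    return summary


if __name__ == "__main__":
    sample = [{"ph_seq": "SP n i SP", "ph_dur": "0.25 0.5 0.25 1.0"}]
    print(summarize_ds(sample))  # [{'n_phonemes': 4, 'duration_s': 2.0}]
```

A mismatch between phoneme and duration counts after editing usually means the new lyrics were not aligned to the original timing.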

3. Singing Voice Synthesis Engine

3.1 Multi-Stage Synthesis Pipeline

graph LR
    A[DS File] --> B(DiffSinger SVS)
    B --> C[Raw Vocal]
    C --> D{Apply VC?}
    D -->|Yes| E[Seed-VC Timbre Transfer]
    D -->|No| F[Final Output]
    E --> F

Advanced Controls

# Full synthesis + conversion with pitch adaptation
from songedit.svc import SingingVoiceSynthesis

model = SingingVoiceSynthesis(
    "checkpoints/step2/model_v1.pt",
    "checkpoints/step2/model_v2.pt",
    "checkpoints/step2/whisper-small/",
)

model(
    ds_file_path="song.ds",
    out_path="result.wav",
    ref_wav_path="target_voice.wav",  # reference recording of the target timbre
    pitch_shift_svs=12,     # +12 semitones (+1 octave) during synthesis
    pitch_shift_svc=-12,    # -12 semitones during VC, bringing the pitch back down
    diffusion_steps=100,    # more steps generally improve quality but take longer
    mode="svs_svc",         # pipeline selection: synthesis followed by voice conversion
)
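
The two pitch-shift values above are expressed in semitones, where 12 semitones correspond to one octave, i.e. a doubling of frequency. The sketch below shows that relationship; the helper names are hypothetical, and treating 0 Hz as an unvoiced frame is a common convention assumed here rather than something the source states.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz, semitones):
    """Shift a pitch contour; frames at 0 Hz are treated as unvoiced and left at 0."""
    ratio = semitone_ratio(semitones)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


if __name__ == "__main__":
    print(semitone_ratio(12))           # 2.0
    print(semitone_ratio(-12))          # 0.5
    print(shift_f0([220.0, 0.0], 12))   # [440.0, 0.0]
```

Shifting by +12 and then by -12 multiplies the ratio by 2.0 and 0.5, so the net pitch is unchanged.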

4. Professional Mixing Tools

4.1 Vocal-Accompaniment Blending

Industry-standard processing chain:

EQ Matching – Reduce frequency clashes
Sidechain Compression – Dynamic vocal emphasis
Spatial Enhancement – Convolution reverb
Loudness Optimization – Mastering-grade limiting
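
As an illustration of the sidechain compression step, the sketch below lowers the accompaniment wherever the vocal is loud. It is a basic feed-forward design written for this guide, not the project's implementation: the function sidechain_duck, its parameters, and the default values are all assumptions, and it uses NumPy.

```python
import numpy as np


def sidechain_duck(accomp, vocal, sr, threshold_db=-30.0, ratio=4.0, window_s=0.05):
    """Attenuate accomp based on the vocal level (both are 1-D arrays of equal length)."""
    win = max(1, int(sr * window_s))
    # Moving average of the squared vocal is the sidechain (control) signal
    power = np.convolve(np.asarray(vocal, dtype=np.float64) ** 2,
                        np.ones(win) / win, mode="same")
    level_db = 10.0 * np.log10(np.maximum(power, 1e-12))
    # Above the threshold, the level is reduced so that `ratio` dB in gives 1 dB out
    over_db = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)
    return np.asarray(accomp, dtype=np.float64) * 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    sr = 1000
    vocal = np.concatenate([np.zeros(500), np.full(500, 0.5)])
    accomp = np.full(1000, 0.5)
    ducked = sidechain_duck(accomp, vocal, sr)
    print(ducked[100], ducked[750])  # untouched while the vocal is silent, quieter while it sings
```

This version has no separate attack and release times; the RMS window alone smooths the gain, which is enough to show the idea.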

Usage Example:

model.combine(
    gen_vocal="ai_vocal.wav",
    accomp="instrumental.wav",
    out_path="mixed.wav",
    vocal_volume=0.7,              # 70% vocal prominence
    time_stamps=[(1.2, 2.5)],      # Timbre modification regions
)
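
For intuition about what a vocal_volume setting can mean, the sketch below blends two tracks with complementary weights and scales the result to avoid clipping. How model.combine actually interprets vocal_volume is not documented here, so the weighting (vocal_volume for the vocal, 1 - vocal_volume for the accompaniment) and the helper mix_tracks are assumptions.

```python
import numpy as np


def mix_tracks(vocal, accomp, vocal_volume=0.7, peak=0.99):
    """Weighted sum of two mono tracks, zero-padded to equal length and peak-limited."""
    n = max(len(vocal), len(accomp))
    v = np.zeros(n, dtype=np.float32)
    a = np.zeros(n, dtype=np.float32)
    v[:len(vocal)] = vocal
    a[:len(accomp)] = accomp
    mixed = vocal_volume * v + (1.0 - vocal_volume) * a
    # Scale the whole mix down if any sample would exceed the peak
    max_abs = float(np.max(np.abs(mixed))) if n else 0.0
    if max_abs > peak:
        mixed = mixed * (peak / max_abs)
    return mixed


if __name__ == "__main__":
    vocal = np.array([1.0, 1.0])
    accomp = np.array([1.0, 1.0, 1.0])
    print(mix_tracks(vocal, accomp))
```

Scaling the entire mix is the simplest way to stay below full scale; a mastering-style limiter would instead reduce only the loudest passages.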

5. System Capabilities

Development Progress

✅ Vocal Isolation – State-of-the-art separation
✅ Lyric-to-Audio Alignment – <5ms phoneme precision
✅ Neural Singing Synthesis – 44.1kHz studio quality
✅ Real-Time Voice Conversion – <500ms latency

Roadmap

🔜 New DiffSinger Acoustic Model – Flow Matching architecture (3× faster)
🔜 DiffSinger Variance Model – Controllable singing style
🔜 ONNX Export – Cross-platform deployment

6. Acknowledgments

We extend gratitude to the open-source community:

DiffSinger – Neural singing synthesis
SOFA – Industrial-grade alignment
Seed-VC – Zero-shot voice conversion

📌 Last Updated: May 2025

"From raw audio to professional vocal production – all in one pipeline."
