Lyric Rewriting & Singing Voice Synthesis System

A Professional Toolchain for AI-Powered Vocal Editing and Synthesis


0. System Setup Guide

0.1 Environment Preparation

Hardware Requirements:

  • NVIDIA GPU (≥16 GB VRAM recommended)
  • CUDA 11.7+ and cuDNN 8.7+

Installation Steps:

# Create conda environment
conda create -n songedit python=3.10 -y
conda activate songedit

# Install dependencies (env.sh contents)
pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
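# NOTE: this index serves CUDA 12 builds of onnxruntime-gpu, while the torch wheel above targets CUDA 11.7.
# Confirm that the onnxruntime build matches the CUDA runtime installed on your machine.
pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/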
pip install -r requirements.txt

# Install audio processing libs
conda install -c conda-forge ffmpeg libsndfile
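
After installation, a quick sanity check can confirm that the main pieces are visible from the new environment. The following is a minimal sketch, not part of the project: `check_environment` is a hypothetical helper, and the module names it probes (`torch`, `onnxruntime`, `huggingface_hub`) are simply the import names of the packages installed above.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "onnxruntime", "huggingface_hub"),
                      binaries=("ffmpeg",)):
    """Return a dict mapping each requirement name to True (found) or False."""
    report = {}
    for name in modules:
        # find_spec only locates the module; it does not import it.
        report[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        report[name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for requirement, found in check_environment().items():
        print(f"{requirement:<16} {'OK' if found else 'MISSING'}")
```

If `torch` is reported as found, `torch.cuda.is_available()` can then be used to confirm that the GPU is actually usable.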

0.2 Model Checkpoints

Download pretrained models from HuggingFace:

# Install huggingface_hub if needed
pip install huggingface_hub

# Download all checkpoints
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='badd9yang/songedit',
                  local_dir='checkpoints')  # Download target; pass token=... as well for private repos
"

# Expected folder structure:
checkpoints/
├── step1/
│   ├── separate_model.pt
│   ├── whisper/
│   ├── ...
│   └── align.ckpt
└── step2/
    ├── whisper-small/
    ├── model_v1.pt
    └── model_v2.pt
  

Note: To download manually, fetch the files from the Hugging Face repo badd9yang/songedit and place them in the layout shown above.
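
Whichever download route is used, it may be worth checking that the layout matches before running the pipeline. The sketch below is an illustration only: `missing_checkpoints` is a hypothetical helper, and the path list covers just the entries named in the folder structure above (anything hidden behind `...` is not checked).

```python
from pathlib import Path

# Paths named in the expected folder structure; entries elided with "..."
# in that listing are deliberately left out.
EXPECTED = [
    "step1/separate_model.pt",
    "step1/whisper",
    "step1/align.ckpt",
    "step2/whisper-small",
    "step2/model_v1.pt",
    "step2/model_v2.pt",
]


def missing_checkpoints(root="checkpoints", expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:", ", ".join(missing))
    else:
        print("All listed checkpoints found.")
```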

1. Core Features Overview

1.1 Song Editing Pipeline

Transform raw audio into customizable singing performances with:
✔ Vocal Separation – Isolate vocals from accompaniment
✔ Lyric Transcription – Automatic lyric recognition via Whisper ASR
✔ Time-Alignment – Precise phoneme-level synchronization (MFA-based)
✔ Singing Voice Synthesis – DiffSinger-powered singing generation
✔ Voice Conversion – Timbre modification via Seed-VC


2. Technical Implementation

2.1 Audio Preprocessing & Alignment

Workflow Steps

  1. Input Preparation

    • Place vocal+accompaniment audio in /data/input_data
    • System automatically:
      • Extracts clean vocals
      • Segments into 3-30s clips (VAD-based)
      • Generates time-aligned lyrics (Whisper + MFA)
  2. Feature Extraction

    • Outputs DS-format files containing:
      • Phoneme sequences
      • Duration/pitch contours
      • Linguistic features
  3. User Interaction

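    # Initialize processing module
    import os
    
    # Select the GPU before importing songedit, so the setting is in place before CUDA initializes
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    
    from songedit.songedit import SongEdit
    
    model = SongEdit(
        separate_model_path="checkpoints/step1/separate_model.pt",  # vocal separation
        asr_model_path="checkpoints/step1/whisper",                 # Whisper ASR
        align_model_path="checkpoints/step1/align.ckpt",            # lyric alignment
        spk_dia_model_path="checkpoints/step1",                     # speaker diarization (inferred from the parameter name)
        vad_model_path="checkpoints/step1/vad.onnx",                # voice activity detection
    )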
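
To make the "3-30s clips (VAD-based)" step in the workflow more concrete, here is a minimal sketch of how voiced spans from a VAD could be grouped into clips within that length range. This is an assumption for illustration, not the project's actual segmentation logic: `group_into_clips` and its `max_gap` parameter are hypothetical.

```python
def group_into_clips(spans, min_len=3.0, max_len=30.0, max_gap=1.0):
    """Greedily merge (start, end) voiced spans, in seconds, into clips.

    Neighbouring spans are merged while the silence between them is at most
    max_gap and the merged clip stays within max_len. Single spans longer
    than max_len are cut into max_len pieces, and clips shorter than
    min_len are dropped at the end.
    """
    clips = []
    current = None  # (start, end) of the clip being built
    for start, end in sorted(spans):
        # Cut up a span that is too long on its own.
        while end - start > max_len:
            if current is not None:
                clips.append(current)
                current = None
            clips.append((start, start + max_len))
            start += max_len
        if current is None:
            current = (start, end)
        elif start - current[1] <= max_gap and end - current[0] <= max_len:
            current = (current[0], end)  # extend the open clip
        else:
            clips.append(current)
            current = (start, end)
    if current is not None:
        clips.append(current)
    return [(s, e) for s, e in clips if e - s >= min_len]


if __name__ == "__main__":
    voiced = [(0.0, 1.0), (1.5, 4.0), (10.0, 11.0), (40.0, 75.0)]
    print(group_into_clips(voiced))
```

A greedy pass keeps the logic easy to follow; a production segmenter would more likely cut long spans at the quietest point rather than at a fixed length.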
    

2.2 Lyric Editing Interface

Key Functions

Code Implementation:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from songedit.songedit import Proofreading
from songedit.svc import ReplaceLyrics

# Proofreading step (loads the alignment checkpoint)
proofread = Proofreading("checkpoints/step1/align.ckpt")
proofread.process(
    "data/your_proofreading_path",
    "data/your_proofreading_temp_save_path",
)

# Lyric replacement: proofread DS file + new lyrics text -> modified DS file
lyric_editor = ReplaceLyrics()
lyric_editor.process(
    "your_proofread.ds",
    "your_modified_lyrics.txt",
    "save_modified.ds",
)
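
Before and after replacing lyrics, it can help to check that a DS file is internally consistent. The sketch below assumes the DS files follow the DiffSinger convention: a JSON list of segments, each with space-separated `ph_seq` (phonemes) and `ph_dur` (durations in seconds) strings plus optional `offset` and `text` fields. That layout is an assumption about this project's output, and `summarize_ds` is a hypothetical helper.

```python
import json


def summarize_ds(ds_text):
    """Summarise each segment of a DS document passed as a JSON string."""
    segments = json.loads(ds_text)
    if isinstance(segments, dict):  # tolerate a single-segment file
        segments = [segments]
    summary = []
    for seg in segments:
        phonemes = seg["ph_seq"].split()
        durations = [float(d) for d in seg["ph_dur"].split()]
        if len(phonemes) != len(durations):
            raise ValueError("ph_seq and ph_dur have different lengths")
        summary.append({
            "offset": seg.get("offset", 0.0),
            "text": seg.get("text", ""),
            "n_phonemes": len(phonemes),
            "duration": round(sum(durations), 3),
        })
    return summary


# Hypothetical two-phoneme segment, not taken from the project.
example = json.dumps([{
    "offset": 0.0,
    "text": "la",
    "ph_seq": "l a",
    "ph_dur": "0.12 0.38",
}])
print(summarize_ds(example))
```

For a file on disk, pass the result of `open("save_modified.ds", encoding="utf-8").read()` to the same function.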

3. Singing Voice Synthesis Engine

3.1 Multi-Stage Synthesis Pipeline

graph LR
    A[DS File] --> B(DiffSinger SVS)
    B --> C[Raw Vocal]
    C --> D{Apply VC?}
    D -->|Yes| E[Seed-VC Timbre Transfer]
    D -->|No| F[Final Output]
    E --> F

Advanced Controls

# Full synthesis+conversion with pitch adaptation
from songedit.svc import SingingVoiceSynthesis

model = SingingVoiceSynthesis(
    "checkpoints/step2/model_v1.pt",
    "checkpoints/step2/model_v2.pt",
    "checkpoints/step2/whisper-small/",
)

model(
    ds_file_path="song.ds",
    out_path="result.wav",
    ref_wav_path="target_voice.wav",
    pitch_shift_svs=12,    # +12 semitones (+1 octave) during synthesis
    pitch_shift_svc=-12,   # -12 semitones after VC, restoring the original pitch
    diffusion_steps=100,   # More steps: higher quality, slower inference
    mode="svs_svc",        # Pipeline selection: synthesis followed by voice conversion
)
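
The two pitch-shift values above are given in semitones (12 corresponds to one octave). As a worked illustration of what such a shift means for fundamental frequency, the sketch below uses the standard equal-temperament relation, ratio = 2^(n/12). The helper names are hypothetical and the code is independent of the project's internals.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift measured in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz, semitones):
    """Scale a list of F0 values in Hz; unvoiced frames stored as 0 stay 0."""
    ratio = semitones_to_ratio(semitones)
    return [f * ratio for f in f0_hz]


if __name__ == "__main__":
    up = semitones_to_ratio(12)     # one octave up doubles the frequency
    down = semitones_to_ratio(-12)  # one octave down halves it
    print(up, down, up * down)      # the two shifts cancel out
    print(shift_f0([220.0, 0.0, 246.94], 12))
```

This is why `pitch_shift_svs=12` paired with `pitch_shift_svc=-12` leaves the final pitch where it started: the two ratios multiply to 1.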

4. Professional Mixing Tools

4.1 Vocal-Accompaniment Blending

Industry-standard processing chain:

  1. EQ Matching – Reduce frequency clashes
  2. Sidechain Compression – Dynamic vocal emphasis
  3. Spatial Enhancement – Convolution reverb
  4. Loudness Optimization – Mastering-grade limiting

Usage Example:

model.combine(
    gen_vocal="ai_vocal.wav",
    accomp="instrumental.wav",
    out_path="mixed.wav",
    vocal_volume=0.7,              # 70% vocal prominence
    time_stamps=[(1.2, 2.5)],      # Timbre modification regions
)
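
As a rough mental model for `vocal_volume`, the sketch below shows a plain weighted blend of two mono tracks followed by peak scaling. It is an assumption for illustration only: it does not reproduce the EQ, sidechain compression, or reverb stages listed above, and `mix` is a hypothetical function, not the implementation behind `combine`.

```python
def mix(vocal, accomp, vocal_volume=0.7, ceiling=1.0):
    """Blend two mono float sample sequences (nominal range -1..1)."""
    n = max(len(vocal), len(accomp))
    # Zero-pad the shorter track so both have n samples.
    vocal = list(vocal) + [0.0] * (n - len(vocal))
    accomp = list(accomp) + [0.0] * (n - len(accomp))
    # Weighted sum: vocal_volume for the vocal, the remainder for the backing.
    mixed = [vocal_volume * v + (1.0 - vocal_volume) * a
             for v, a in zip(vocal, accomp)]
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > ceiling:
        # Scale everything down so the loudest sample sits at the ceiling.
        mixed = [s * ceiling / peak for s in mixed]
    return mixed


if __name__ == "__main__":
    print(mix([1.0, 0.0], [0.0, 1.0, 1.0], vocal_volume=0.5))
```

In practice the samples would be NumPy arrays read from the WAV files; plain lists are used here only to keep the example dependency-free.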

5. System Capabilities

Development Progress

✅ Vocal Isolation – State-of-the-art separation
✅ Lyric-to-Audio Alignment – <5ms phoneme precision
✅ Neural Singing Synthesis – 44.1kHz studio quality
✅ Real-Time Voice Conversion – <500ms latency

Roadmap

🔜 New DiffSinger Acoustic Model – Flow Matching architecture (3× faster)

🔜 DiffSinger Variance Model – Controllable singing style

🔜 ONNX Export – Cross-platform deployment


6. Acknowledgments

We extend gratitude to the open-source community:

  • DiffSinger – Neural singing synthesis
  • SOFA – Industrial-grade alignment
  • Seed-VC – Zero-shot voice conversion

📌 Last Updated: May 2025

"From raw audio to professional vocal production – all in one pipeline."

Contact Support | GitHub Repository | API Reference
