songedit / README.md
badd9yang's picture
Update README.md
16c8c85 verified
# **Lyric Rewriting & Singing Voice Synthesis System**
*A Professional Toolchain for AI-Powered Vocal Editing and Synthesis*
---
## **0. System Setup Guide**
### **0.1 Environment Preparation**
**Hardware Requirements:**
- NVIDIA GPU (β‰₯16GB VRAM recommended)
- CUDA 11.7+ and cuDNN 8.7+
**Installation Steps:**
```bash
# Create conda environment
conda create -n songedit python=3.10 -y
conda activate songedit
# Install dependencies (env.sh contents)
pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
pip install -r requirements.txt
# Install audio processing libs
conda install -c conda-forge ffmpeg libsndfile
```
### **0.2 Model Checkpoints**
Download pretrained models from HuggingFace:
```bash
# Install huggingface_hub if needed
pip install huggingface_hub
# Download all checkpoints
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='badd9yang/songedit',
local_dir='checkpoints') # Optional for private repos
"
# Expected folder structure:
checkpoints/
β”œβ”€β”€ step1/
β”‚ β”œβ”€β”€ separate_model.pt
β”‚ β”œβ”€β”€ whisper/
β”‚ β”œβ”€β”€ ...
β”‚ └── align.ckpt
└── step2/
β”œβ”€β”€ whisper-small/
β”œβ”€β”€ model_v1.pt
└── model_v2.pt
```
> **Note:** For manual download, get models from [HuggingFace Repo](https://huggingface.co/badd9yang/songedit/tree/main)
## **1. Core Features Overview**
### **1.1 Song Editing Pipeline**
Transform raw audio into customizable singing performances with:
βœ” **Vocal Separation** – Isolate vocals from accompaniment
βœ” **Lyric Transcription** – Automatic lyric recognition via Whisper ASR
βœ” **Time-Alignment** – Precise phoneme-level synchronization (MFA-based)
βœ” **Singing Voice Synthesis** – DiffSinger-powered singing generation
βœ” **Voice Conversion** – Timbre modification via Seed-VC
---
## **2. Technical Implementation**
### **2.1 Audio Preprocessing & Alignment**
#### **Workflow Steps**
1. **Input Preparation**
- Place vocal+accompaniment audio in `/data/input_data`
- System automatically:
- Extracts clean vocals
- Segments into 3-30s clips (VAD-based)
- Generates time-aligned lyrics (Whisper + MFA)
2. **Feature Extraction**
- Outputs DS-format files containing:
- Phoneme sequences
- Duration/pitch contours
- Linguistic features
3. **User Interaction**
```python
# Initialize processing module
import os
from songedit.songedit import *
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model = SongEdit(
separate_model_path= "checkpoints/step1/separate_model.pt",
asr_model_path= "checkpoints/step1/whisper",
align_model_path= "checkpoints/step1/align.ckpt",
spk_dia_model_path= "checkpoints/step1",
vad_model_path= "checkpoints/step1/vad.onnx",
)
```
---
### **2.2 Lyric Editing Interface**
#### **Key Functions**
**Code Implementation:**
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from songedit.songedit import *
from songedit.svc import ReplaceLyrics
proofread = Proofreading("checkpoints/step1/align.ckpt")
proofread.process("data/your_proofreading_path",
"data/your_proofreading_temp_save_path")
lyric_editor = ReplaceLyrics()
lyric_editor.process(
"your_proofread.ds",
"your_modified_lyrics.txt",
"save_modified.ds"
)
```
---
## **3. Singing Voice Synthesis Engine**
### **3.1 Multi-Stage Synthesis Pipeline**
```mermaid
graph LR
A[DS File] --> B(DiffSinger SVS)
B --> C[Raw Vocal]
C --> D{Apply VC?}
D -->|Yes| E[Seed-VC Timbre Transfer]
D -->|No| F[Final Output]
E --> F
```
#### **Advanced Controls**
```python
# Full synthesis+conversion with pitch adaptation
from songedit.svc import *
model = SingingVoiceSynthesis(
"checkpoints/step2/model_v1.pt",
"checkpoints/step2/model_v2.pt",
"checkpoints/step2/whisper-small/")
model(
ds_file_path="song.ds",
out_path="result.wav",
ref_wav_path="target_voice.wav",
pitch_shift_svs=12, # +1 octave during synthesis
pitch_shift_svc=-12, # Normalize pitch post-VC
diffusion_steps=100, # Higher = better quality
mode="svs_svc" # Pipeline selection
)
```
---
## **4. Professional Mixing Tools**
### **4.1 Vocal-Accompaniment Blending**
Industry-standard processing chain:
1. **EQ Matching** – Reduce frequency clashes
2. **Sidechain Compression** – Dynamic vocal emphasis
3. **Spatial Enhancement** – Convolution reverb
4. **Loudness Optimization** – Mastering-grade limiting
**Usage Example:**
```python
model.combine(
gen_vocal="ai_vocal.wav",
accomp="instrumental.wav",
out_path="mixed.wav",
vocal_volume=0.7, # 70% vocal prominence
time_stamps=[(1.2, 2.5)], # Timbre modification regions
)
```
---
## **5. System Capabilities**
### **Development Progress**
βœ… **Vocal Isolation** – State-of-the-art separation
βœ… **Lyric-to-Audio Alignment** – <5ms phoneme precision
βœ… **Neural Singing Synthesis** – 44.1kHz studio quality
βœ… **Real-Time Voice Conversion** – <500ms latency
### **Roadmap**
πŸ”œ **DiffSinger Acoustic new version ** – Flow Matching architecture (3Γ— faster)
πŸ”œ **add DiffSinger Variance Model** – Style Controllable
πŸ”œ **ONNX Export** – Cross-platform deployment
---
## **6. Acknowledgments**
We extend gratitude to the open-source community:
- **DiffSinger** – Neural singing synthesis
- **SOFA** – Industrial-grade alignment
- **Seed-VC** – Zero-shot voice conversion
---
**πŸ“Œ Last Updated: May 2025**
> *"From raw audio to professional vocal production – all in one pipeline."*
[Contact Support](yangchen@hccl.ioa.ac.cn) | [GitHub Repository](github.com/badd9yang) | [API Reference](diffsinger.com)