	# Lyric Rewriting & Singing Voice Synthesis System
	A Professional Toolchain for AI-Powered Vocal Editing and Synthesis

	---

	## 0. System Setup Guide

	### 0.1 Environment Preparation

	Hardware Requirements:

	- NVIDIA GPU (≥16GB VRAM recommended)
	- CUDA 11.7+ and cuDNN 8.7+

	Installation Steps:

	```bash
	# Create conda environment
	conda create -n songedit python=3.10 -y
	conda activate songedit

	# Install dependencies (env.sh contents)
	pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
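	# NOTE: this index serves CUDA 12 builds of onnxruntime-gpu, while the torch wheel above targets CUDA 11.7;
	# make sure the onnxruntime-gpu build you install matches your local CUDA version.
	pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/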
	pip install -r requirements.txt

	# Install audio processing libs
	conda install -c conda-forge ffmpeg libsndfile
	```
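
After installation, a quick check can confirm that the main pieces are visible from the `songedit` environment. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not part of this toolkit, and it assumes `torch` and `onnxruntime` are the import names of the packages installed above.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "onnxruntime"), tools=("ffmpeg",)):
    """Report which Python modules and command-line tools can be found.

    Hypothetical helper for illustration; not part of the songedit package.
    """
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for item, found in check_environment().items():
        print(f"{item}: {'found' if found else 'MISSING'}")
```

If `torch` is reported as found, `torch.cuda.is_available()` returning `True` indicates that the GPU is visible to PyTorch.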

	### 0.2 Model Checkpoints

	Download pretrained models from HuggingFace:

	```bash
	# Install huggingface_hub if needed
	pip install huggingface_hub

	# Download all checkpoints into ./checkpoints
	# (for private repos, authenticate first with: huggingface-cli login)
	python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='badd9yang/songedit', local_dir='checkpoints')"

	# Expected folder structure:
	# checkpoints/
	# ├── step1/
	# │   ├── separate_model.pt
	# │   ├── whisper/
	# │   ├── ...
	# │   └── align.ckpt
	# └── step2/
	#     ├── whisper-small/
	#     ├── model_v1.pt
	#     └── model_v2.pt
	```
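
To confirm that a download matches the layout listed above, a small check like the following can help. This is a sketch, not part of the toolkit: `missing_checkpoints` is a hypothetical helper, and it only covers the entries named explicitly in the listing (anything hidden behind the `...` is not checked).

```python
from pathlib import Path

# Paths taken from the folder structure listed above; entries hidden behind
# the "..." in that listing are not checked here.
EXPECTED = [
    "step1/separate_model.pt",
    "step1/whisper",
    "step1/align.ckpt",
    "step2/whisper-small",
    "step2/model_v1.pt",
    "step2/model_v2.pt",
]


def missing_checkpoints(root="checkpoints", expected=EXPECTED):
    """Return the expected entries that do not exist under `root`.

    Hypothetical helper for illustration; not part of the songedit package.
    """
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All listed checkpoint entries are present.")
```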

	> Note: For manual download, get models from [HuggingFace Repo](https://huggingface.co/badd9yang/songedit/tree/main)



	## 1. Core Features Overview

	### 1.1 Song Editing Pipeline
	Transform raw audio into customizable singing performances with:

	- ✔ Vocal Separation – Isolate vocals from accompaniment
	- ✔ Lyric Transcription – Automatic lyric recognition via Whisper ASR
	- ✔ Time-Alignment – Precise phoneme-level synchronization (MFA-based)
	- ✔ Singing Voice Synthesis – DiffSinger-powered singing generation
	- ✔ Voice Conversion – Timbre modification via Seed-VC

	---

	## 2. Technical Implementation

	### 2.1 Audio Preprocessing & Alignment

	#### Workflow Steps
	1. Input Preparation
	   - Place vocal+accompaniment audio in `/data/input_data`
	   - System automatically:
	     - Extracts clean vocals
	     - Segments into 3-30s clips (VAD-based)
	     - Generates time-aligned lyrics (Whisper + MFA)

	2. Feature Extraction
	   - Outputs DS-format files containing:
	     - Phoneme sequences
	     - Duration/pitch contours
	     - Linguistic features

	3. User Interaction
	```python
	import os

	# Select the GPU before importing the toolkit so the setting takes effect
	os.environ["CUDA_VISIBLE_DEVICES"] = "0"

	from songedit.songedit import SongEdit

	# Initialize the processing module with the step1 checkpoints
	model = SongEdit(
	    separate_model_path="checkpoints/step1/separate_model.pt",
	    asr_model_path="checkpoints/step1/whisper",
	    align_model_path="checkpoints/step1/align.ckpt",
	    spk_dia_model_path="checkpoints/step1",
	    vad_model_path="checkpoints/step1/vad.onnx",
	)
	```
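
To illustrate the 3-30s segmentation mentioned in step 1, the sketch below groups VAD speech regions into clips. It is an assumption about how such grouping can work, not the toolkit's actual implementation; `group_into_clips` and its parameters are hypothetical.

```python
import math


def group_into_clips(regions, min_len=3.0, max_len=30.0):
    """Greedily merge sorted (start, end) speech regions, in seconds, into clips.

    Illustrative only; not the segmentation code used by songedit.
    A clip grows while its total span (including the silence between merged
    regions) stays within max_len. Clips shorter than min_len are dropped.
    """
    # Cut over-long regions into equal parts so every piece fits in max_len.
    pieces = []
    for start, end in regions:
        n_parts = max(1, math.ceil((end - start) / max_len))
        step = (end - start) / n_parts
        for i in range(n_parts):
            pieces.append((start + i * step, start + (i + 1) * step))

    clips = []
    cur_start, cur_end = None, None
    for start, end in pieces:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            cur_end = end  # extend the current clip
        else:
            clips.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        clips.append((cur_start, cur_end))

    # Drop clips that are too short to be useful.
    return [(s, e) for s, e in clips if e - s >= min_len]


regions = [(0.0, 2.0), (2.5, 10.0), (12.0, 29.0), (31.0, 32.0)]
print(group_into_clips(regions))  # [(0.0, 29.0)]
```

The first three regions merge into one 29-second clip; the last region is only 1 second long on its own and is dropped.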

	---

	### 2.2 Lyric Editing Interface

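	#### Key Functions

	- `Proofreading` – loads the alignment checkpoint and processes the data to be proofread, writing intermediate results to a temporary save path
	- `ReplaceLyrics` – takes a proofread DS file and a text file with the modified lyrics, and writes an updated DS file

	Code Implementation: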

	```python
	import os
	os.environ["CUDA_VISIBLE_DEVICES"] = "0"

	from songedit.songedit import Proofreading
	from songedit.svc import ReplaceLyrics

	# Run proofreading with the alignment checkpoint
	proofread = Proofreading("checkpoints/step1/align.ckpt")
	proofread.process(
	    "data/your_proofreading_path",
	    "data/your_proofreading_temp_save_path",
	)

	# Replace the lyrics in the proofread DS file
	lyric_editor = ReplaceLyrics()
	lyric_editor.process(
	    "your_proofread.ds",         # proofread DS file
	    "your_modified_lyrics.txt",  # modified lyrics
	    "save_modified.ds",          # output DS file
	)
	```

	---

	## 3. Singing Voice Synthesis Engine

	### 3.1 Multi-Stage Synthesis Pipeline
	```mermaid
	graph LR
	    A[DS File] --> B(DiffSinger SVS)
	    B --> C[Raw Vocal]
	    C --> D{Apply VC?}
	    D -->|Yes| E[Seed-VC Timbre Transfer]
	    D -->|No| F[Final Output]
	    E --> F
	```

	#### Advanced Controls
	```python
	# Full synthesis + conversion with pitch adaptation
	from songedit.svc import SingingVoiceSynthesis

	model = SingingVoiceSynthesis(
	    "checkpoints/step2/model_v1.pt",
	    "checkpoints/step2/model_v2.pt",
	    "checkpoints/step2/whisper-small/",
	)

	model(
	    ds_file_path="song.ds",
	    out_path="result.wav",
	    ref_wav_path="target_voice.wav",
	    pitch_shift_svs=12,   # +1 octave during synthesis
	    pitch_shift_svc=-12,  # shift back down 1 octave at the VC stage
	    diffusion_steps=100,  # more steps: higher quality, slower inference
	    mode="svs_svc",       # pipeline selection
	)
	```
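
The pitch-shift values above appear to be in semitones, since the comments equate 12 with one octave. Under that assumption, the short sketch below shows the frequency ratio behind a semitone shift in 12-tone equal temperament; `semitone_ratio` is a hypothetical helper, not part of the toolkit.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


up = semitone_ratio(12)     # 2.0: one octave up doubles the frequency
down = semitone_ratio(-12)  # 0.5: one octave down halves it

# Applying both shifts in sequence returns to the original pitch.
print(440.0 * up * down)  # 440.0
```

If the two shifts are applied one after the other, as the parameter names suggest, `+12` followed by `-12` leaves the final output at the original pitch.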

	---

	## 4. Professional Mixing Tools

	### 4.1 Vocal-Accompaniment Blending
	Industry-standard processing chain:
	1. EQ Matching – Reduce frequency clashes
	2. Sidechain Compression – Dynamic vocal emphasis
	3. Spatial Enhancement – Convolution reverb
	4. Loudness Optimization – Mastering-grade limiting

	Usage Example:
	```python
	model.combine(
	    gen_vocal="ai_vocal.wav",
	    accomp="instrumental.wav",
	    out_path="mixed.wav",
	    vocal_volume=0.7,          # 70% vocal prominence
	    time_stamps=[(1.2, 2.5)],  # Timbre modification regions
	)
	```
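
As a rough mental model for `vocal_volume` and `time_stamps`, the sketch below blends two signals with a linear weight and converts time regions to sample indices. It is not how `combine` is implemented: treating `vocal_volume` as a linear mixing weight, reading `time_stamps` as seconds, and the 44.1 kHz sample rate are all assumptions, and `blend` and `to_sample_ranges` are hypothetical helpers. None of the EQ, compression, or reverb stages are modeled.

```python
def blend(vocal, accomp, vocal_volume=0.7):
    """Linear blend of two mono signals given as lists of floats in [-1, 1].

    Illustrative only; assumes vocal_volume is a linear weight for the vocal
    and (1 - vocal_volume) the weight for the accompaniment.
    """
    n = max(len(vocal), len(accomp))
    # Pad the shorter signal with silence so both have n samples.
    vocal = list(vocal) + [0.0] * (n - len(vocal))
    accomp = list(accomp) + [0.0] * (n - len(accomp))
    mixed = [vocal_volume * v + (1.0 - vocal_volume) * a
             for v, a in zip(vocal, accomp)]
    # Hard clip as a crude guard; a real limiter is far more gradual.
    return [max(-1.0, min(1.0, x)) for x in mixed]


def to_sample_ranges(time_stamps, sample_rate=44100):
    """Convert (start, end) times in seconds to (start, end) sample indices."""
    return [(round(s * sample_rate), round(e * sample_rate))
            for s, e in time_stamps]


print(to_sample_ranges([(1.2, 2.5)]))  # [(52920, 110250)]
```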

	---

	## 5. System Capabilities

	### Development Progress
	- ✅ Vocal Isolation – State-of-the-art separation
	- ✅ Lyric-to-Audio Alignment – <5ms phoneme precision
	- ✅ Neural Singing Synthesis – 44.1kHz studio quality
	- ✅ Real-Time Voice Conversion – <500ms latency

	### Roadmap
	- 🔜 New DiffSinger Acoustic Model – Flow Matching architecture (3× faster)
	- 🔜 DiffSinger Variance Model – Controllable singing style
	- 🔜 ONNX Export – Cross-platform deployment

	---

	## 6. Acknowledgments
	We extend gratitude to the open-source community:
	- DiffSinger – Neural singing synthesis
	- SOFA – Industrial-grade alignment
	- Seed-VC – Zero-shot voice conversion

	---

	📌 Last Updated: May 2025

	> "From raw audio to professional vocal production – all in one pipeline."

	[Contact Support](mailto:yangchen@hccl.ioa.ac.cn) | [GitHub Repository](https://github.com/badd9yang) | [API Reference](https://diffsinger.com)