---
title: CFM SVC
emoji: ποΈ
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
short_description: Singing Voice Conversion Based on CFM
---
# CFM-SVC / F5-SVC – Singing Voice Conversion

Singing Voice Conversion (SVC) based on Conditional Flow Matching (CFM). This repository contains two implementations of a flow-matching-based SVC system:
| | V1 (CFM-SVC) | V2 (F5-SVC) |
|---|---|---|
| Backbone | DiT trained from scratch | F5-TTS pretrained (LoRA) |
| Output space | DAC codec latents (1024-dim) | Log-mel spectrogram (100-dim) |
| Vocoder | DAC decoder (frozen) | Vocos (frozen) |
| Params trained | ~82M | ~5M (adapter + LoRA) |
| Training data | Multi-speaker singing | Multi-speaker singing |
| Speaker adaptation | Speaker d-vector | Stage 2: spk_proj on speech clips |
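Both versions train with the same flow-matching objective: the model predicts the velocity along a straight path between Gaussian noise and the target representation (codec latents in V1, log-mels in V2). A minimal sketch of that objective, assuming the standard conditional-flow-matching formulation; the actual loss in `losses/cfm_loss.py` also adds a projector commitment term:

```python
import torch
import torch.nn.functional as F

def cfm_loss(model, x1, cond):
    """Conditional flow matching: regress the predicted velocity onto
    (x1 - x0) along the linear path x_t = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)              # noise endpoint, t = 0
    t = torch.rand(x1.shape[0], 1, 1)      # one random time per example
    xt = (1 - t) * x0 + t * x1             # point on the straight path
    v_target = x1 - x0                     # constant velocity of that path
    v_pred = model(xt, t.squeeze(), cond)  # conditioned velocity prediction
    return F.mse_loss(v_pred, v_target)
```

Here `model` stands in for the DiT backbone and `cond` for the PPG/HuBERT/F0/speaker conditioning; both names are illustrative.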
## Project Structure

```
matcha_svc/
├── models/
│   ├── cfm.py               V1: Diffusion Transformer (DiT)
│   ├── cond_encoder.py      V1: PPG + HuBERT + F0 + speaker → conditioning
│   ├── codec_wrapper.py     V1: DAC codec + projector head
│   ├── svc_cond_adapter.py  V2: PPG + HuBERT + F0 + speaker → F5-TTS text_dim
│   ├── lora_utils.py        V2: LoRALinear, inject_lora(), freeze_non_lora()
│   └── f5_svc.py            V2: F5SVCModel wrapper + build_f5svc() factory
│
├── losses/
│   └── cfm_loss.py          V1: flow matching + projector commitment loss
│
├── svc_data/
│   └── mel_svc_dataset.py   V2: log-mel dataset (same directory layout as V1)
│
├── train_cfm.py             V1 training script
├── train_f5_stage1.py       V2 Stage 1: SVCCondAdapter + LoRA on singing data
├── train_f5_stage2.py       V2 Stage 2: spk_proj on target speaker speech
├── infer_f5_svc.py          V2 inference: Euler sampling → Vocos → .wav
├── submit_train.sh          SLURM job script for V1
│
├── data_svc/                Preprocessed features (generated by svc_preprocessing.py)
│   ├── audio/<spk>/<id>.wav
│   ├── whisper/<spk>/<id>.ppg.npy
│   ├── hubert/<spk>/<id>.vec.npy
│   ├── pitch/<spk>/<id>.pit.npy
│   ├── speaker/<spk>/<id>.spk.npy
│   └── codec_targets/<spk>/<id>.pt   ← V1 only
│
├── chkpt_cfm/               V1 checkpoints
└── chkpt_f5svc/             V2 checkpoints
```
## Prerequisites

```bash
python -m venv .venv
source .venv/bin/activate   # or .venv\Scripts\activate on Windows
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install descript-audio-codec                      # V1
pip install f5-tts vocos safetensors huggingface_hub  # V2
```
Pretrained feature extractors (shared by V1 and V2):

| File | Destination |
|---|---|
| `best_model.pth.tar` (speaker encoder) | `speaker_pretrain/` |
| `large-v2.pt` (Whisper) | `whisper_pretrain/` |
| `hubert-soft-0d54a1f4.pt` (HuBERT) | `hubert_pretrain/` |
| `full.pth` (CREPE) | `crepe/assets/` |
## Data Preparation (shared by V1 and V2)

### 1. Raw audio layout

```
dataset_raw/
├── speaker0/
│   ├── 000001.wav
│   └── ...
└── speaker1/
    └── ...
```

Clips should be clean vocals, under 30 seconds, with no accompaniment. Use UVR for source separation and audio-slicer for cutting.
### 2. Extract features

```bash
python svc_preprocessing.py -t 2
```

This produces, under `data_svc/`:

- `whisper/<spk>/<id>.ppg.npy` → Whisper PPG (1280-dim, 50 Hz)
- `hubert/<spk>/<id>.vec.npy` → HuBERT (256-dim, 50 Hz)
- `pitch/<spk>/<id>.pit.npy` → F0 in Hz (50 Hz, 0 = unvoiced)
- `speaker/<spk>/<id>.spk.npy` → speaker d-vector (256-dim)
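A quick shape check on the extracted features can catch preprocessing mistakes early. A sketch, assuming the dimensions and frame rates listed above (the helper name is illustrative, not part of the repo):

```python
import numpy as np

def check_features(ppg, vec, pit, spk):
    """Sanity-check one clip's features against the documented shapes.
    All frame-level features run at 50 Hz, so lengths should nearly match."""
    assert ppg.ndim == 2 and ppg.shape[1] == 1280  # Whisper PPG
    assert vec.ndim == 2 and vec.shape[1] == 256   # HuBERT vectors
    assert pit.ndim == 1                           # F0 in Hz, 0 = unvoiced
    assert spk.shape == (256,)                     # utterance-level d-vector
    assert abs(ppg.shape[0] - vec.shape[0]) <= 2   # tolerate off-by-a-frame
```

Load each `.npy` with `np.load` and pass the arrays in; a failing assertion usually means a clip was re-extracted with mismatched settings.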
### 3. V1 only: extract codec targets

```bash
python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
```

V2 computes mel spectrograms on-the-fly from the raw audio, so no offline codec step is needed.
## V1: CFM-SVC (Training from Scratch)

### Train

```bash
python train_cfm.py \
    --data_dir ./data_svc/codec_targets \
    --batch_size 64 \
    --lr 2e-5 \
    --epochs 250 \
    --save_interval 1

# or via SLURM:
sbatch submit_train.sh
```

Training automatically resumes from the latest checkpoint in `chkpt_cfm/`.
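Auto-resume usually amounts to scanning the checkpoint directory for the highest epoch number. A sketch of that lookup, assuming the `dit_epoch_N.pt` naming shown in the Checkpoints section (the actual resume logic in `train_cfm.py` may differ):

```python
import re
from pathlib import Path

def latest_checkpoint(ckpt_dir="chkpt_cfm", prefix="dit_epoch_"):
    """Return the Path of the highest-numbered checkpoint, or None."""
    epochs = []
    for p in Path(ckpt_dir).glob(f"{prefix}*.pt"):
        m = re.fullmatch(rf"{re.escape(prefix)}(\d+)\.pt", p.name)
        if m:
            epochs.append((int(m.group(1)), p))
    return max(epochs)[0:2][1] if epochs else None
```

Sorting on the parsed integer (not the filename string) matters: lexicographically, `dit_epoch_9.pt` would sort after `dit_epoch_12.pt`.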
Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--lr` | `1e-4` | Learning rate |
| `--batch_size` | `2` | Batch size |
| `--grad_accum` | `1` | Gradient accumulation steps |
| `--grad_clip` | `1.0` | Gradient clip max norm |
| `--save_interval` | `50` | Save every N epochs |
| `--use_checkpointing` | off | Enable gradient checkpointing (saves VRAM) |
| `--freeze_norm` | off | Freeze latent norm stats (for fine-tuning) |
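`--grad_accum` trades memory for effective batch size: gradients from several micro-batches are summed before a single optimizer step, with `--grad_clip` capping the gradient norm at step time. A sketch of the usual pattern, under the assumption that the script follows it (the real loop in `train_cfm.py` may differ in detail):

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, grad_accum=1, grad_clip=1.0):
    """Accumulate gradients over grad_accum micro-batches per update."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        # Divide so the accumulated gradient matches one big-batch gradient.
        loss = nn.functional.mse_loss(model(x), y) / grad_accum
        loss.backward()
        if (step + 1) % grad_accum == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()
```

With `--batch_size 2 --grad_accum 1` (the defaults), every micro-batch triggers a step; raising `--grad_accum` to 32 would approximate a batch of 64 on the same VRAM budget.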
### Inference (V1)

```bash
python infer.py --wave /path/to/source_singing.wav
```
## V2: F5-SVC (LoRA on F5-TTS)

### Architecture

- F5-TTS's DiT is loaded with pretrained weights and kept mostly frozen.
- `SVCCondAdapter` replaces the text encoder: PPG + HuBERT + F0 + speaker → (B, T, 512).
- LoRA (rank 16) is injected into every DiT attention projection (Q, K, V, Out).
- Vocos decodes mel spectrograms to audio.
- Two-stage training protocol:
  - Stage 1 (singing): `SVCCondAdapter` + LoRA trained on multi-speaker singing data.
  - Stage 2 (per-speaker): only `spk_proj` is trained, on the target speaker's speech clips.
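A LoRA-wrapped linear layer keeps the pretrained projection frozen and learns only a low-rank update; with `lora_B` zero-initialized, the wrapped model starts out exactly equal to the pretrained one. A minimal sketch of what `lora_utils.LoRALinear` might look like; the actual class in this repo may differ:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / rank) * B A x, with W frozen and A, B trained."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained projection
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.t() @ self.lora_B.t())
```

Injecting this into the Q/K/V/Out projections of each attention block is what keeps trainable parameters around ~5M instead of the full backbone.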
### Download the F5-TTS checkpoint

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download("SWivid/F5-TTS", "F5TTS_Base/model_1200000.safetensors")
print(path)
```
### Stage 1 – Singing Adaptation

**Trains:** `SVCCondAdapter` (content projection + speaker projection) + LoRA adapters
**Freezes:** all other DiT weights

```bash
python train_f5_stage1.py \
    --f5tts_ckpt /path/to/model_1200000.safetensors \
    --audio_dir ./data_svc/audio \
    --epochs 200 \
    --batch_size 16 \
    --lr 1e-4

# Checkpoints saved to ./chkpt_f5svc/stage1_epoch_N.pt
```

All PPG/HuBERT/F0/speaker features from V1 preprocessing are reused directly. The only difference is the audio directory name: V1 produces `data_svc/waves-32k/`, while V2 defaults to `data_svc/audio/`. Pass `--audio_dir ./data_svc/waves-32k` to reuse V1 audio (it is resampled to 24 kHz on-the-fly; no re-extraction needed). The codec targets directory (`data_svc/codec_targets/`) is V1-only and not needed here.
### Stage 2 – Per-Speaker Fine-tuning

**Trains:** `svc_adapter.spk_proj` only
**Freezes:** DiT + LoRA (locked in from Stage 1)
**Data:** speech clips of the target speaker (no singing required)

```bash
python train_f5_stage2.py \
    --stage1_ckpt ./chkpt_f5svc/stage1_epoch_200.pt \
    --audio_dir ./data_svc/audio/my_speaker \
    --speaker_id my_speaker \
    --epochs 50

# Saved to ./chkpt_f5svc/stage2_my_speaker.pt
```

The target speaker's speech clips need the same feature extraction as Stage 1: run `svc_preprocessing.py` pointing at the speech audio directory.
### Inference (V2)

```bash
python infer_f5_svc.py \
    --ckpt ./chkpt_f5svc/stage1_epoch_200.pt \
    --source ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio ./data_svc/audio/my_speaker/ref.wav \
    --output ./converted.wav \
    --steps 32
```

For a Stage 2 speaker-adapted checkpoint:

```bash
python infer_f5_svc.py \
    --ckpt ./chkpt_f5svc/stage2_my_speaker.pt \
    --source ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio ./data_svc/audio/my_speaker/ref.wav \
    --output ./converted.wav
```
Inference arguments:

| Argument | Default | Description |
|---|---|---|
| `--ckpt` | required | Stage 1 or Stage 2 checkpoint |
| `--source` | required | Source singing `.wav` |
| `--target_spk` | required | Target speaker `.spk.npy` |
| `--ref_audio` | `None` | Short `.wav` of the target speaker for timbre reference |
| `--ref_sec` | `3.0` | Seconds of `ref_audio` to use |
| `--steps` | `32` | Euler ODE steps (more = higher quality, slower) |
| `--output` | `./converted.wav` | Output path |
The source audio must have pre-extracted features (PPG, HuBERT, F0) in the standard `data_svc/` directory structure. Run `svc_preprocessing.py` on the source if needed.
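`--steps` controls a fixed-step Euler integration of the learned ODE from noise (t = 0) to the target mel (t = 1): more steps means a finer approximation of the flow at the cost of more forward passes. A minimal sketch of that sampler, with `velocity_fn` standing in for the conditioned DiT:

```python
import torch

def euler_sample(velocity_fn, x0, steps=32):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)  # current time, per batch item
        x = x + dt * velocity_fn(x, t)         # one explicit Euler step
    return x
```

At 32 steps the trajectory error of plain Euler is usually acceptable for SVC; fewer steps trade quality for speed.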
## Checkpoints

V1 saves full model state per epoch to `chkpt_cfm/`:

```
chkpt_cfm/
├── dit_epoch_N.pt
├── cond_encoder_epoch_N.pt
├── projector_epoch_N.pt
├── ema_dit_epoch_N.pt
├── optimizer_epoch_N.pt
├── scheduler_epoch_N.pt
└── latent_norm.pt          ← cached normalization stats
```

V2 saves adapter + LoRA state per epoch to `chkpt_f5svc/`:

```
chkpt_f5svc/
├── stage1_epoch_N.pt       ← full model state (adapter + LoRA + frozen DiT);
│                             also contains a lora_only key for lightweight sharing
└── stage2_<speaker_id>.pt  ← speaker-adapted state
```
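The `lora_only` payload can be produced by filtering the state dict on the adapter parameter names. A sketch, assuming LoRA tensors carry a `lora_` prefix as in the `LoRALinear` naming from `lora_utils.py`:

```python
def lora_only_state(model):
    """Return just the LoRA tensors from a model's state dict, so the
    shareable file is a few MB instead of the full backbone."""
    return {k: v for k, v in model.state_dict().items() if "lora_" in k}
```

Saving this dict under a `lora_only` key alongside the full state lets collaborators download only the adapters and merge them into their own copy of the frozen F5-TTS backbone.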
## References

- Rectified Flow / Flow Matching
- F5-TTS: SWivid/F5-TTS
- Vocos vocoder: hubert-whisper/vocos
- DAC: descriptinc/descript-audio-codec
- so-vits-svc-5.0: preprocessing pipeline