---
title: CFM SVC
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
short_description: Singing Voice Conversion Based on CFM
---

# CFM-SVC / F5-SVC: Singing Voice Conversion

Singing voice conversion (SVC) based on Conditional Flow Matching (CFM). This repository contains two implementations of a flow-matching-based SVC system.

| | V1 (CFM-SVC) | V2 (F5-SVC) |
|---|---|---|
| Backbone | DiT trained from scratch | F5-TTS pretrained (LoRA) |
| Output space | DAC codec latents (1024-dim) | Log-mel spectrogram (100-dim) |
| Vocoder | DAC decoder (frozen) | Vocos (frozen) |
| Params trained | ~82M | ~5M (adapter + LoRA) |
| Training data | Multi-speaker singing | Multi-speaker singing |
| Speaker adaptation | Speaker d-vector | Stage 2: `spk_proj` on speech clips |

## Project Structure

```
matcha_svc/
├── models/
│   ├── cfm.py                  V1: Diffusion Transformer (DiT)
│   ├── cond_encoder.py         V1: PPG + HuBERT + F0 + speaker → conditioning
│   ├── codec_wrapper.py        V1: DAC codec + projector head
│   ├── svc_cond_adapter.py     V2: PPG + HuBERT + F0 + speaker → F5-TTS text_dim
│   ├── lora_utils.py           V2: LoRALinear, inject_lora(), freeze_non_lora()
│   └── f5_svc.py               V2: F5SVCModel wrapper + build_f5svc() factory
│
├── losses/
│   └── cfm_loss.py             V1: flow matching + projector commitment loss
│
├── svc_data/
│   └── mel_svc_dataset.py      V2: log-mel dataset (same directory layout as V1)
│
├── train_cfm.py                V1 training script
├── train_f5_stage1.py          V2 Stage 1: SVCCondAdapter + LoRA on singing data
├── train_f5_stage2.py          V2 Stage 2: spk_proj on target speaker speech
├── infer_f5_svc.py             V2 inference: Euler sampling → Vocos → .wav
├── submit_train.sh             SLURM job script for V1
│
├── data_svc/                   Preprocessed features (generated by svc_preprocessing.py)
│   ├── audio/<spk>/<id>.wav
│   ├── whisper/<spk>/<id>.ppg.npy
│   ├── hubert/<spk>/<id>.vec.npy
│   ├── pitch/<spk>/<id>.pit.npy
│   ├── speaker/<spk>/<id>.spk.npy
│   └── codec_targets/<spk>/<id>.pt   ← V1 only
│
├── chkpt_cfm/                  V1 checkpoints
└── chkpt_f5svc/                V2 checkpoints
```

## Prerequisites

```bash
python -m venv .venv
source .venv/bin/activate        # or .venv\Scripts\activate on Windows

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install descript-audio-codec                       # V1 only
pip install f5-tts vocos safetensors huggingface_hub   # V2 only
```

Pretrained feature extractors (shared by V1 and V2):

| File | Destination |
|---|---|
| `best_model.pth.tar` (speaker encoder) | `speaker_pretrain/` |
| `large-v2.pt` (Whisper) | `whisper_pretrain/` |
| `hubert-soft-0d54a1f4.pt` (HuBERT-soft) | `hubert_pretrain/` |
| `full.pth` (CREPE) | `crepe/assets/` |

## Data Preparation (shared by V1 and V2)

### 1. Raw audio layout

```
dataset_raw/
├── speaker0/
│   ├── 000001.wav
│   └── ...
└── speaker1/
    └── ...
```

Clips should be clean, isolated vocals with no accompaniment, shorter than 30 seconds each. Use UVR for source separation and audio-slicer for cutting.

### 2. Extract features

```bash
python svc_preprocessing.py -t 2
```

This produces, under `data_svc/`:

- `whisper/<spk>/<id>.ppg.npy` – Whisper PPG (1280-dim, 50 Hz)
- `hubert/<spk>/<id>.vec.npy` – HuBERT features (256-dim, 50 Hz)
- `pitch/<spk>/<id>.pit.npy` – F0 in Hz (50 Hz, 0 = unvoiced)
- `speaker/<spk>/<id>.spk.npy` – speaker d-vector (256-dim)
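As a quick sanity check, one utterance's features can be loaded and length-aligned like this. This is a minimal sketch: the `load_features` helper and its arguments are illustrative, not part of the repo; only the directory layout and shapes come from the list above.

```python
import numpy as np

def load_features(base: str, spk: str, uid: str):
    """Load one utterance's features from the data_svc/ layout above.

    All frame-level features run at 50 Hz, so they can be trimmed to a
    common length before batching.
    """
    ppg = np.load(f"{base}/whisper/{spk}/{uid}.ppg.npy")      # (T, 1280)
    vec = np.load(f"{base}/hubert/{spk}/{uid}.vec.npy")       # (T, 256)
    pit = np.load(f"{base}/pitch/{spk}/{uid}.pit.npy")        # (T,), Hz, 0 = unvoiced
    spk_emb = np.load(f"{base}/speaker/{spk}/{uid}.spk.npy")  # (256,) d-vector
    T = min(len(ppg), len(vec), len(pit))                     # align frame counts
    return ppg[:T], vec[:T], pit[:T], spk_emb
```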

### 3. V1 only: extract codec targets

```bash
python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
```

V2 computes mel spectrograms on-the-fly from the raw audio; no offline codec step is needed.


## V1: CFM-SVC (Training from Scratch)

### Train

```bash
python train_cfm.py \
    --data_dir      ./data_svc/codec_targets \
    --batch_size    64 \
    --lr            2e-5 \
    --epochs        250 \
    --save_interval 1

# or via SLURM:
sbatch submit_train.sh
```

Training automatically resumes from the latest checkpoint in chkpt_cfm/.
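The resume behavior amounts to scanning `chkpt_cfm/` for the highest saved epoch. A hedged sketch of that idea; the actual logic in `train_cfm.py` may differ:

```python
import os
import re

def latest_epoch(ckpt_dir: str = "chkpt_cfm"):
    """Return the highest epoch N with a dit_epoch_N.pt file, or None."""
    if not os.path.isdir(ckpt_dir):
        return None  # nothing to resume from
    epochs = [
        int(m.group(1))
        for name in os.listdir(ckpt_dir)
        if (m := re.fullmatch(r"dit_epoch_(\d+)\.pt", name))
    ]
    return max(epochs) if epochs else None
```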

Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--lr` | 1e-4 | Learning rate |
| `--batch_size` | 2 | Batch size |
| `--grad_accum` | 1 | Gradient accumulation steps |
| `--grad_clip` | 1.0 | Gradient clipping max norm |
| `--save_interval` | 50 | Save a checkpoint every N epochs |
| `--use_checkpointing` | off | Enable gradient checkpointing (saves VRAM) |
| `--freeze_norm` | off | Freeze latent normalization stats (for fine-tuning) |

### Inference (V1)

```bash
python infer.py --wave /path/to/source_singing.wav
```

## V2: F5-SVC (LoRA on F5-TTS)

### Architecture

- F5-TTS's DiT is loaded with pretrained weights and kept mostly frozen.
- `SVCCondAdapter` replaces the text encoder: PPG + HuBERT + F0 + speaker → (B, T, 512).
- LoRA (rank 16) is injected into every DiT attention projection (Q, K, V, Out).
- Vocos decodes mel spectrograms to audio.
- Two-stage training protocol:
  - Stage 1 (singing): `SVCCondAdapter` + LoRA trained on multi-speaker singing data.
  - Stage 2 (per-speaker): only `spk_proj` trained on the target speaker's speech clips.
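The rank-16 LoRA injection can be illustrated with a minimal `LoRALinear` following the standard recipe; the class in `lora_utils.py` may differ in details such as initialization and scaling.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable rank-r update: W x + scale * B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because `B` starts at zero, the wrapped layer initially reproduces the pretrained projection exactly; training then only moves the small `A`/`B` matrices, which is what keeps V2's trainable parameter count around ~5M.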

### Download the F5-TTS checkpoint

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download("SWivid/F5-TTS", "F5TTS_Base/model_1200000.safetensors")
print(path)
```

### Stage 1: Singing Adaptation

- **Trains:** `SVCCondAdapter` (content projection + speaker projection) + LoRA adapters
- **Freezes:** all other DiT weights

```bash
python train_f5_stage1.py \
    --f5tts_ckpt /path/to/model_1200000.safetensors \
    --audio_dir  ./data_svc/audio \
    --epochs     200 \
    --batch_size 16 \
    --lr         1e-4

# Checkpoints are saved to ./chkpt_f5svc/stage1_epoch_N.pt
```

All PPG/HuBERT/F0/speaker features from V1 preprocessing are reused directly. The only difference is the audio directory name: V1 produces data_svc/waves-32k/ while V2 defaults to data_svc/audio/. Pass --audio_dir ./data_svc/waves-32k to reuse V1 audio (it is resampled to 24 kHz on-the-fly, no re-extraction needed). The codec targets directory (data_svc/codec_targets/) is V1-only and not needed here.

### Stage 2: Per-Speaker Fine-tuning

- **Trains:** `svc_adapter.spk_proj` only
- **Freezes:** DiT + LoRA (locked in from Stage 1)
- **Data:** speech clips of the target speaker (no singing required)

```bash
python train_f5_stage2.py \
    --stage1_ckpt ./chkpt_f5svc/stage1_epoch_200.pt \
    --audio_dir   ./data_svc/audio/my_speaker \
    --speaker_id  my_speaker \
    --epochs      50

# Saved to ./chkpt_f5svc/stage2_my_speaker.pt
```

The target speaker's speech clips need the same feature extraction as Stage 1: run svc_preprocessing.py pointing at the speech audio directory.

### Inference (V2)

```bash
python infer_f5_svc.py \
    --ckpt       ./chkpt_f5svc/stage1_epoch_200.pt \
    --source     ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio  ./data_svc/audio/my_speaker/ref.wav \
    --output     ./converted.wav \
    --steps      32
```

For a Stage 2 speaker-adapted checkpoint:

```bash
python infer_f5_svc.py \
    --ckpt       ./chkpt_f5svc/stage2_my_speaker.pt \
    --source     ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio  ./data_svc/audio/my_speaker/ref.wav \
    --output     ./converted.wav
```

Inference arguments:

| Argument | Default | Description |
|---|---|---|
| `--ckpt` | required | Stage 1 or Stage 2 checkpoint |
| `--source` | required | Source singing `.wav` |
| `--target_spk` | required | Target speaker `.spk.npy` |
| `--ref_audio` | None | Short `.wav` of the target speaker for timbre reference |
| `--ref_sec` | 3.0 | Seconds of `ref_audio` to use |
| `--steps` | 32 | Euler ODE steps (more = higher quality, slower) |
| `--output` | ./converted.wav | Output path |

The source audio must have pre-extracted features (PPG, HuBERT, F0) in the standard data_svc/ directory structure. Run svc_preprocessing.py on the source if needed.
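What `--steps` controls: inference integrates the learned velocity field from noise (t = 0) to the output mel (t = 1) with fixed-step Euler. A generic sketch of that sampler, where `velocity_fn` stands in for the conditioned DiT:

```python
import torch

def euler_sample(velocity_fn, cond, shape, steps: int = 32) -> torch.Tensor:
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (sample)."""
    x = torch.randn(shape)                    # start from a Gaussian prior sample
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)   # per-batch timestep
        x = x + velocity_fn(x, t, cond) * dt  # one explicit Euler step
    return x
```

More steps shrink the ODE discretization error at a linear cost in DiT forward passes, which is why `--steps` trades quality against speed.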


## Checkpoints

V1 saves the full model state each epoch to `chkpt_cfm/`:

```
chkpt_cfm/
├── dit_epoch_N.pt
├── cond_encoder_epoch_N.pt
├── projector_epoch_N.pt
├── ema_dit_epoch_N.pt
├── optimizer_epoch_N.pt
├── scheduler_epoch_N.pt
└── latent_norm.pt          ← cached normalization stats
```

V2 saves adapter + LoRA state each epoch to `chkpt_f5svc/`:

```
chkpt_f5svc/
├── stage1_epoch_N.pt       ← full model state (adapter + LoRA + frozen DiT);
│                             also contains a lora_only key for lightweight sharing
└── stage2_<speaker_id>.pt  ← speaker-adapted state
```
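The `lora_only` key makes it possible to share a few-megabyte adapter instead of the full Stage 1 state. A hedged sketch of exporting it; the key name comes from the description above, and the real checkpoint layout may differ:

```python
import torch

def export_lora_only(stage1_ckpt: str, out_path: str) -> None:
    """Write just the lora_only sub-state of a Stage 1 checkpoint."""
    ckpt = torch.load(stage1_ckpt, map_location="cpu")
    torch.save(ckpt["lora_only"], out_path)  # adapter weights only
```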
