---
title: CFM SVC
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
short_description: Singing Voice Conversion Based on CFM
---

# CFM-SVC / F5-SVC: Singing Voice Conversion

Singing voice conversion (SVC) based on Conditional Flow Matching (CFM). This repository contains two implementations of a flow-matching-based SVC system.

| | V1 (CFM-SVC) | V2 (F5-SVC) |
|---|---|---|
| Backbone | DiT trained from scratch | F5-TTS pretrained (LoRA) |
| Output space | DAC codec latents (1024-dim) | Log-mel spectrogram (100-dim) |
| Vocoder | DAC decoder (frozen) | Vocos (frozen) |
| Params trained | ~82M | ~5M (adapter + LoRA) |
| Training data | Multi-speaker singing | Multi-speaker singing |
| Speaker adaptation | Speaker d-vector | Stage 2: `spk_proj` on speech clips |

## Project Structure

```
matcha_svc/
├── models/
│   ├── cfm.py                  V1: Diffusion Transformer (DiT)
│   ├── cond_encoder.py         V1: PPG + HuBERT + F0 + speaker → conditioning
│   ├── codec_wrapper.py        V1: DAC codec + projector head
│   ├── svc_cond_adapter.py     V2: PPG + HuBERT + F0 + speaker → F5-TTS text_dim
│   ├── lora_utils.py           V2: LoRALinear, inject_lora(), freeze_non_lora()
│   └── f5_svc.py               V2: F5SVCModel wrapper + build_f5svc() factory
│
├── losses/
│   └── cfm_loss.py             V1: flow matching + projector commitment loss
│
├── svc_data/
│   └── mel_svc_dataset.py      V2: log-mel dataset (same directory layout as V1)
│
├── train_cfm.py                V1 training script
├── train_f5_stage1.py          V2 Stage 1: SVCCondAdapter + LoRA on singing data
├── train_f5_stage2.py          V2 Stage 2: spk_proj on target speaker speech
├── infer_f5_svc.py             V2 inference: Euler sampling → Vocos → .wav
├── submit_train.sh             SLURM job script for V1
│
├── data_svc/                   Preprocessed features (generated by svc_preprocessing.py)
│   ├── audio/<spk>/<id>.wav
│   ├── whisper/<spk>/<id>.ppg.npy
│   ├── hubert/<spk>/<id>.vec.npy
│   ├── pitch/<spk>/<id>.pit.npy
│   ├── speaker/<spk>/<id>.spk.npy
│   └── codec_targets/<spk>/<id>.pt   ← V1 only
│
├── chkpt_cfm/                  V1 checkpoints
└── chkpt_f5svc/                V2 checkpoints
```

## Prerequisites

```bash
python -m venv .venv
source .venv/bin/activate        # or .venv\Scripts\activate on Windows

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install descript-audio-codec                       # V1 only
pip install f5-tts vocos safetensors huggingface_hub   # V2 only
```

Pretrained feature extractors (shared by V1 and V2):

| File | Destination |
|---|---|
| `best_model.pth.tar` (speaker encoder) | `speaker_pretrain/` |
| `large-v2.pt` (Whisper) | `whisper_pretrain/` |
| `hubert-soft-0d54a1f4.pt` (HuBERT-soft) | `hubert_pretrain/` |
| `full.pth` (CREPE) | `crepe/assets/` |

## Data Preparation (shared by V1 and V2)

### 1. Raw audio layout

```
dataset_raw/
├── speaker0/
│   ├── 000001.wav
│   └── ...
└── speaker1/
    └── ...
```

Clips should be clean, isolated vocals with no accompaniment, shorter than 30 seconds each. Use UVR for source separation and audio-slicer for cutting.

### 2. Extract features

```bash
python svc_preprocessing.py -t 2
```

This produces, under `data_svc/`:

- `whisper/<spk>/<id>.ppg.npy` – Whisper PPG (1280-dim, 50 Hz)
- `hubert/<spk>/<id>.vec.npy` – HuBERT features (256-dim, 50 Hz)
- `pitch/<spk>/<id>.pit.npy` – F0 in Hz (50 Hz, 0 = unvoiced)
- `speaker/<spk>/<id>.spk.npy` – speaker d-vector (256-dim)
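As a quick sanity check, one utterance's features can be loaded and length-aligned like this. This is a minimal sketch: the `load_features` helper and its arguments are illustrative, not part of the repo; only the directory layout and shapes come from the list above.

```python
import numpy as np

def load_features(base: str, spk: str, uid: str):
    """Load one utterance's features from the data_svc/ layout above.

    All frame-level features run at 50 Hz, so they can be trimmed to a
    common length before batching.
    """
    ppg = np.load(f"{base}/whisper/{spk}/{uid}.ppg.npy")      # (T, 1280)
    vec = np.load(f"{base}/hubert/{spk}/{uid}.vec.npy")       # (T, 256)
    pit = np.load(f"{base}/pitch/{spk}/{uid}.pit.npy")        # (T,), Hz, 0 = unvoiced
    spk_emb = np.load(f"{base}/speaker/{spk}/{uid}.spk.npy")  # (256,) d-vector
    T = min(len(ppg), len(vec), len(pit))                     # align frame counts
    return ppg[:T], vec[:T], pit[:T], spk_emb
```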

### 3. V1 only: extract codec targets

```bash
python data/codec_targets.py -w ./data_svc/waves-32k -o ./data_svc/codec_targets
```

V2 computes mel spectrograms on-the-fly from the raw audio; no offline codec step is needed.


## V1: CFM-SVC (Training from Scratch)

### Train

```bash
python train_cfm.py \
    --data_dir      ./data_svc/codec_targets \
    --batch_size    64 \
    --lr            2e-5 \
    --epochs        250 \
    --save_interval 1

# or via SLURM:
sbatch submit_train.sh
```

Training automatically resumes from the latest checkpoint in chkpt_cfm/.
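The resume behavior amounts to scanning `chkpt_cfm/` for the highest saved epoch. A hedged sketch of that idea; the actual logic in `train_cfm.py` may differ:

```python
import os
import re

def latest_epoch(ckpt_dir: str = "chkpt_cfm"):
    """Return the highest epoch N with a dit_epoch_N.pt file, or None."""
    if not os.path.isdir(ckpt_dir):
        return None  # nothing to resume from
    epochs = [
        int(m.group(1))
        for name in os.listdir(ckpt_dir)
        if (m := re.fullmatch(r"dit_epoch_(\d+)\.pt", name))
    ]
    return max(epochs) if epochs else None
```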

Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--lr` | 1e-4 | Learning rate |
| `--batch_size` | 2 | Batch size |
| `--grad_accum` | 1 | Gradient accumulation steps |
| `--grad_clip` | 1.0 | Gradient clipping max norm |
| `--save_interval` | 50 | Save a checkpoint every N epochs |
| `--use_checkpointing` | off | Enable gradient checkpointing (saves VRAM) |
| `--freeze_norm` | off | Freeze latent normalization stats (for fine-tuning) |

### Inference (V1)

```bash
python infer.py --wave /path/to/source_singing.wav
```

## V2: F5-SVC (LoRA on F5-TTS)

### Architecture

- F5-TTS's DiT is loaded with pretrained weights and kept mostly frozen.
- `SVCCondAdapter` replaces the text encoder: PPG + HuBERT + F0 + speaker → (B, T, 512).
- LoRA (rank 16) is injected into every DiT attention projection (Q, K, V, Out).
- Vocos decodes mel spectrograms to audio.
- Two-stage training protocol:
  - Stage 1 (singing): `SVCCondAdapter` + LoRA trained on multi-speaker singing data.
  - Stage 2 (per-speaker): only `spk_proj` trained on the target speaker's speech clips.
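The rank-16 LoRA injection can be illustrated with a minimal `LoRALinear` following the standard recipe; the class in `lora_utils.py` may differ in details such as initialization and scaling.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable rank-r update: W x + scale * B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because `B` starts at zero, the wrapped layer initially reproduces the pretrained projection exactly; training then only moves the small `A`/`B` matrices, which is what keeps V2's trainable parameter count around ~5M.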

### Download the F5-TTS checkpoint

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download("SWivid/F5-TTS", "F5TTS_Base/model_1200000.safetensors")
print(path)
```

### Stage 1: Singing Adaptation

- **Trains:** `SVCCondAdapter` (content projection + speaker projection) + LoRA adapters
- **Freezes:** all other DiT weights

```bash
python train_f5_stage1.py \
    --f5tts_ckpt /path/to/model_1200000.safetensors \
    --audio_dir  ./data_svc/audio \
    --epochs     200 \
    --batch_size 16 \
    --lr         1e-4

# Checkpoints are saved to ./chkpt_f5svc/stage1_epoch_N.pt
```

All PPG/HuBERT/F0/speaker features from V1 preprocessing are reused directly. The only difference is the audio directory name: V1 produces data_svc/waves-32k/ while V2 defaults to data_svc/audio/. Pass --audio_dir ./data_svc/waves-32k to reuse V1 audio (it is resampled to 24 kHz on-the-fly, no re-extraction needed). The codec targets directory (data_svc/codec_targets/) is V1-only and not needed here.

### Stage 2: Per-Speaker Fine-tuning

- **Trains:** `svc_adapter.spk_proj` only
- **Freezes:** DiT + LoRA (locked in from Stage 1)
- **Data:** speech clips of the target speaker (no singing required)

```bash
python train_f5_stage2.py \
    --stage1_ckpt ./chkpt_f5svc/stage1_epoch_200.pt \
    --audio_dir   ./data_svc/audio/my_speaker \
    --speaker_id  my_speaker \
    --epochs      50

# Saved to ./chkpt_f5svc/stage2_my_speaker.pt
```

The target speaker's speech clips need the same feature extraction as Stage 1: run svc_preprocessing.py pointing at the speech audio directory.

### Inference (V2)

```bash
python infer_f5_svc.py \
    --ckpt       ./chkpt_f5svc/stage1_epoch_200.pt \
    --source     ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio  ./data_svc/audio/my_speaker/ref.wav \
    --output     ./converted.wav \
    --steps      32
```

For a Stage 2 speaker-adapted checkpoint:

```bash
python infer_f5_svc.py \
    --ckpt       ./chkpt_f5svc/stage2_my_speaker.pt \
    --source     ./source_singing.wav \
    --target_spk ./data_svc/speaker/my_speaker/ref.spk.npy \
    --ref_audio  ./data_svc/audio/my_speaker/ref.wav \
    --output     ./converted.wav
```

Inference arguments:

| Argument | Default | Description |
|---|---|---|
| `--ckpt` | required | Stage 1 or Stage 2 checkpoint |
| `--source` | required | Source singing `.wav` |
| `--target_spk` | required | Target speaker `.spk.npy` |
| `--ref_audio` | None | Short `.wav` of the target speaker for timbre reference |
| `--ref_sec` | 3.0 | Seconds of `ref_audio` to use |
| `--steps` | 32 | Euler ODE steps (more = higher quality, slower) |
| `--output` | ./converted.wav | Output path |

The source audio must have pre-extracted features (PPG, HuBERT, F0) in the standard data_svc/ directory structure. Run svc_preprocessing.py on the source if needed.
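What `--steps` controls: inference integrates the learned velocity field from noise (t = 0) to the output mel (t = 1) with fixed-step Euler. A generic sketch of that sampler, where `velocity_fn` stands in for the conditioned DiT:

```python
import torch

def euler_sample(velocity_fn, cond, shape, steps: int = 32) -> torch.Tensor:
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (sample)."""
    x = torch.randn(shape)                    # start from a Gaussian prior sample
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)   # per-batch timestep
        x = x + velocity_fn(x, t, cond) * dt  # one explicit Euler step
    return x
```

More steps shrink the ODE discretization error at a linear cost in DiT forward passes, which is why `--steps` trades quality against speed.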


## Checkpoints

V1 saves the full model state each epoch to `chkpt_cfm/`:

```
chkpt_cfm/
├── dit_epoch_N.pt
├── cond_encoder_epoch_N.pt
├── projector_epoch_N.pt
├── ema_dit_epoch_N.pt
├── optimizer_epoch_N.pt
├── scheduler_epoch_N.pt
└── latent_norm.pt          ← cached normalization stats
```

V2 saves adapter + LoRA state each epoch to `chkpt_f5svc/`:

```
chkpt_f5svc/
├── stage1_epoch_N.pt       ← full model state (adapter + LoRA + frozen DiT);
│                             also contains a lora_only key for lightweight sharing
└── stage2_<speaker_id>.pt  ← speaker-adapted state
```
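The `lora_only` key makes it possible to share a few-megabyte adapter instead of the full Stage 1 state. A hedged sketch of exporting it; the key name comes from the description above, and the real checkpoint layout may differ:

```python
import torch

def export_lora_only(stage1_ckpt: str, out_path: str) -> None:
    """Write just the lora_only sub-state of a Stage 1 checkpoint."""
    ckpt = torch.load(stage1_ckpt, map_location="cpu")
    torch.save(ckpt["lora_only"], out_path)  # adapter weights only
```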
