---
license: apache-2.0
base_model: openbmb/VoxCPM2
library_name: voxcpm
pipeline_tag: text-to-speech
tags:
- VoxCPM2
- text-to-speech
- voice-cloning
- flow-matching
- lora
- srfd
- speech
language:
- en
inference: false
private: true
---
# SRFD-VoxCPM2
SRFD-VoxCPM2 is an adapter-only release for
[openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2). It leaves the VoxCPM2
base model unchanged and provides VoxCPM LoRA weights trained with Speech
Representation Fréchet Distance (SR-FD), a training-time distributional
regularizer for true four-step TTS, that is, synthesis run with four inference
steps.

This repository does not contain the 2B VoxCPM2 base weights. Download
`openbmb/VoxCPM2` separately and load these adapters on top of it.
## Released Adapters
| Adapter | Path | Removed FD target | Step | Seed-TTS EN WER | UTMOS / DNSMOS OVRL / P808 |
|---|---|---|---:|---:|---:|
| Compact 3-target SR-FD | `.` and `adapters/compact3_balanced/` | none | 1600 | `167/11805 = 1.4147%` | `3.7637 / 3.0711 / 3.6507` |
| Remove ASR-good Whisper | `ablations/remove_asr_true4_good_whisper/` | `asr_true4_good_whisper` | 1600 | `182/11805 = 1.5417%` | `3.7650 / 3.0754 / 3.6545` |
| Remove real CTC | `ablations/remove_real_ctc_content/` | `real_ctc_content` | 1000 | `176/11805 = 1.4909%` | `3.7609 / 3.0731 / 3.6535` |
| Remove teacher CTC | `ablations/remove_teacher_t10_ctc_content/` | `teacher_t10_ctc_content` | 900 | `175/11805 = 1.4824%` | `3.7604 / 3.0756 / 3.6541` |

The compact three-target model is the default adapter and is duplicated at the
repository root for convenience.
## Compact SR-FD Targets
The final compact model uses three content-centered FD targets:
1. `asr_true4_good_whisper`: Whisper content statistics from ASR-reranked good
   true-four-step generations.
2. `teacher_t10_ctc_content`: CTC posterior statistics from ten-step VoxCPM2
   teacher generations.
3. `real_ctc_content`: CTC posterior statistics from real LibriTTS
   voice-cloning speech.

The leave-one-out adapters remove one of these targets while keeping the rest of
the compact recipe unchanged. They are intended for ablation and paper
reproducibility, not as recommended deployment checkpoints.
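For intuition about what an FD target measures, the sketch below computes the standard Fréchet distance between two Gaussians fitted to feature sets, the same closed form used by FID-style metrics. It is a generic illustration and not the training code from this release: the assumption that each target stores a mean and covariance, and the names `feature_stats` and `frechet_distance`, are hypothetical.

```python
import numpy as np


def feature_stats(features):
    """Mean and covariance of a (num_samples, dim) feature matrix."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) is the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2. They are real and non-negative for
    # PSD inputs; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


# Toy usage with random stand-ins for "target" and "generated" features.
rng = np.random.default_rng(0)
target = rng.normal(size=(512, 8))
generated = rng.normal(loc=0.5, size=(512, 8))
print(frechet_distance(*feature_stats(target), *feature_stats(generated)))
```

The distance is zero when both distributions share the same mean and covariance, and it grows as either statistic drifts away from the target.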
## Repository Layout
| Path | Description |
|---|---|
| `lora_weights.safetensors` | Default compact 3-target SR-FD adapter |
| `lora_config.json` | Custom VoxCPM LoRA config for the default adapter |
| `training_state.json` | Training step marker for the default adapter |
| `adapters/compact3_balanced/` | Explicit copy of the default adapter |
| `ablations/remove_asr_true4_good_whisper/` | Leave-one-out adapter without the Whisper low-step target |
| `ablations/remove_real_ctc_content/` | Leave-one-out adapter without the real-speech CTC target |
| `ablations/remove_teacher_t10_ctc_content/` | Leave-one-out adapter without the ten-step teacher CTC target |
| `configs/` | Training configs used for the compact model and ablations |
| `reports/` | Upstream WER, UTMOS, DNSMOS, and ablation summaries |
| `metadata/adapter_index.json` | Machine-readable adapter index with hashes and source checkpoints |

`lora_config.json` is a custom VoxCPM LoRA config. It is not a PEFT
`adapter_config.json`.
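Because `metadata/adapter_index.json` records hashes, a downloaded weight file can be checked against it. The sketch below only computes a file digest; it assumes the index uses SHA-256, which is not stated here, and it leaves the lookup inside the JSON to the reader because the index schema is not described in this card. `sha256_of_file` is an illustrative helper name.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large weight files are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (adapter_dir is the folder returned by snapshot_download):
# print(sha256_of_file(os.path.join(adapter_dir, "lora_weights.safetensors")))
```

Compare the printed digest with the entry for the same file in the index; if the index turns out to use a different algorithm, swap `hashlib.sha256` accordingly.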
## Quick Start
Install VoxCPM and helper packages:
```bash
pip install voxcpm huggingface_hub soundfile
```
Load the base model and the default SR-FD adapter:
```python
import json
import os

import soundfile as sf
from huggingface_hub import snapshot_download
from voxcpm import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

base_model = "openbmb/VoxCPM2"
# Download this adapter repository; the default adapter sits at its root.
adapter_dir = snapshot_download("voidful/SRFD-VoxCPM2")

# lora_config.json is a custom VoxCPM file; the LoRA settings live under "lora_config".
with open(os.path.join(adapter_dir, "lora_config.json"), "r", encoding="utf-8") as f:
    adapter_info = json.load(f)
lora_config = LoRAConfig(**adapter_info["lora_config"])

model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    load_denoiser=False,
    optimize=True,
    lora_config=lora_config,
    lora_weights_path=adapter_dir,
)

# Four inference steps match the true four-step setting the adapter targets.
wav = model.generate(
    text="SR-FD improves true four-step VoxCPM2 synthesis.",
    cfg_value=2.35,
    inference_timesteps=4,
    normalize=True,
)
sf.write("srfd_voxcpm2.wav", wav, model.tts_model.sample_rate)
```
Use an ablation adapter by pointing the LoRA loader to an ablation subfolder:
```python
ablation_dir = os.path.join(adapter_dir, "ablations", "remove_asr_true4_good_whisper")
model.load_lora(ablation_dir)
```
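When switching between several adapters, a small lookup keeps the subfolder names in one place. This is a convenience sketch, not part of the `voxcpm` package: `ADAPTER_SUBDIRS` and `resolve_adapter_dir` are hypothetical names, and the paths are copied from the Released Adapters table.

```python
import os

# Adapter subfolders as listed in the Released Adapters table.
ADAPTER_SUBDIRS = {
    "compact3_balanced": "adapters/compact3_balanced",
    "remove_asr_true4_good_whisper": "ablations/remove_asr_true4_good_whisper",
    "remove_real_ctc_content": "ablations/remove_real_ctc_content",
    "remove_teacher_t10_ctc_content": "ablations/remove_teacher_t10_ctc_content",
}


def resolve_adapter_dir(snapshot_root, name="compact3_balanced"):
    """Map an adapter name to its folder inside a downloaded snapshot."""
    try:
        subdir = ADAPTER_SUBDIRS[name]
    except KeyError:
        known = ", ".join(sorted(ADAPTER_SUBDIRS))
        raise ValueError(f"unknown adapter {name!r}; expected one of: {known}") from None
    # Split on "/" so the result uses the local path separator.
    return os.path.join(snapshot_root, *subdir.split("/"))
```

The returned folder can then be passed to the loader, for example `model.load_lora(resolve_adapter_dir(adapter_dir, "remove_real_ctc_content"))`.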
## Evaluation Notes
The headline metric is English WER from the upstream Seed-TTS evaluation,
computed on 1,088 prompts with 11,805 paper-facing reference words. Each WER in
the Released Adapters table is the word-error count divided by that total.
UTMOS and DNSMOS are objective proxies, not human MOS. The compact 3-target
adapter matches the WER frontier of the 9-target SR-FD configuration while
using a simpler set of FD targets that is easier to reproduce.
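The WER percentages can be reproduced from the error counts in the Released Adapters table. The sketch below only redoes that arithmetic; the dictionary keys are shorthand labels chosen for this example.

```python
REFERENCE_WORDS = 11805  # reference word count for the Seed-TTS EN set

# Word-error counts copied from the Released Adapters table.
ERROR_COUNTS = {
    "compact3_balanced": 167,
    "remove_asr_true4_good_whisper": 182,
    "remove_real_ctc_content": 176,
    "remove_teacher_t10_ctc_content": 175,
}


def wer_percent(errors, reference_words=REFERENCE_WORDS):
    """Word error rate as a percentage of the reference word count."""
    return 100.0 * errors / reference_words


for name, errors in ERROR_COUNTS.items():
    print(f"{name}: {errors}/{REFERENCE_WORDS} = {wer_percent(errors):.4f}%")
```

The gap between the default adapter and the weakest ablation is 15 word errors out of 11,805, so differences between adapters are small in absolute terms.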
## License
This adapter release follows the Apache-2.0 license terms of the VoxCPM2 base
model. See `openbmb/VoxCPM2` for the original model card and usage restrictions.