
	---
	license: apache-2.0
	base_model: openbmb/VoxCPM2
	library_name: voxcpm
	pipeline_tag: text-to-speech
	tags:
	- VoxCPM2
	- text-to-speech
	- voice-cloning
	- flow-matching
	- lora
	- srfd
	- speech
	language:
	- en
	inference: false
	private: true
	---

	# SRFD-VoxCPM2

	SRFD-VoxCPM2 is an adapter-only release for
	[openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2). It keeps the VoxCPM2
	base model unchanged and provides VoxCPM LoRA weights trained with Speech
	Representation Fréchet Distance (SR-FD), a training-time distributional
	regularizer for true four-step TTS.

	This repository does not contain the 2B VoxCPM2 base weights. Download
	`openbmb/VoxCPM2` separately and load these adapters on top of it.

	## Released Adapters

	| Adapter | Path | Removed FD target | Step | Seed-TTS EN WER | UTMOS / DNSMOS OVRL / P808 |
	|---|---|---|---:|---:|---:|
	| Compact 3-target SR-FD | `.` and `adapters/compact3_balanced/` | none | 1600 | `167/11805 = 1.4147%` | `3.7637 / 3.0711 / 3.6507` |
	| Remove ASR-good Whisper | `ablations/remove_asr_true4_good_whisper/` | `asr_true4_good_whisper` | 1600 | `182/11805 = 1.5417%` | `3.7650 / 3.0754 / 3.6545` |
	| Remove real CTC | `ablations/remove_real_ctc_content/` | `real_ctc_content` | 1000 | `176/11805 = 1.4909%` | `3.7609 / 3.0731 / 3.6535` |
	| Remove teacher CTC | `ablations/remove_teacher_t10_ctc_content/` | `teacher_t10_ctc_content` | 900 | `175/11805 = 1.4824%` | `3.7604 / 3.0756 / 3.6541` |

	The compact three-target model is the default adapter and is duplicated at the
	repository root for convenience.

	## Compact SR-FD Targets

	The final compact model uses three content-centered FD targets:

	1. `asr_true4_good_whisper`: Whisper content statistics from ASR-reranked good
	true-four-step generations.
	2. `teacher_t10_ctc_content`: CTC posterior statistics from ten-step VoxCPM2
	teacher generations.
	3. `real_ctc_content`: CTC posterior statistics from real LibriTTS
	voice-cloning speech.

	The leave-one-out adapters remove one of these targets while keeping the rest of
	the compact recipe unchanged. They are intended for ablation and paper
	reproducibility, not as recommended deployment checkpoints.
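The leave-one-out structure described above can be sketched in a few lines. This is only an illustration built from the target names and folder names in this card; `leave_one_out` is a hypothetical helper, not part of the training code, and it says nothing about how each target is weighted during training.

```python
# The three compact FD target names, as listed in this model card.
COMPACT_TARGETS = (
    "asr_true4_good_whisper",
    "teacher_t10_ctc_content",
    "real_ctc_content",
)


def leave_one_out(targets):
    """Map each removed target to (ablation subfolder, remaining targets)."""
    plan = {}
    for removed in targets:
        kept = tuple(t for t in targets if t != removed)
        # Subfolder naming follows the pattern used in this repository.
        plan[removed] = (f"ablations/remove_{removed}/", kept)
    return plan


for removed, (subfolder, kept) in leave_one_out(COMPACT_TARGETS).items():
    print(f"{subfolder}: keeps {', '.join(kept)}")
```

Each ablation adapter therefore keeps exactly two of the three targets, which is what makes the WER rows in the adapter table directly comparable to the compact model.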

	## Repository Layout

	| Path | Description |
	|---|---|
	| `lora_weights.safetensors` | Default compact 3-target SR-FD adapter |
	| `lora_config.json` | Custom VoxCPM LoRA config for the default adapter |
	| `training_state.json` | Training step marker for the default adapter |
	| `adapters/compact3_balanced/` | Explicit copy of the default adapter |
	| `ablations/remove_asr_true4_good_whisper/` | Leave-one-out adapter without the Whisper low-step target |
	| `ablations/remove_real_ctc_content/` | Leave-one-out adapter without the real-speech CTC target |
	| `ablations/remove_teacher_t10_ctc_content/` | Leave-one-out adapter without the ten-step teacher CTC target |
	| `configs/` | Training configs used for the compact model and ablations |
	| `reports/` | Upstream WER, UTMOS, DNSMOS, and ablation summaries |
	| `metadata/adapter_index.json` | Machine-readable adapter index with hashes and source checkpoints |

	`lora_config.json` is a custom VoxCPM LoRA config. It is not a PEFT
	`adapter_config.json`.
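Because the config is not in PEFT format, it can help to sanity-check a local adapter folder before loading it. The sketch below is a hypothetical helper (`check_adapter_dir` is not part of VoxCPM). It assumes the two file names from the layout table and a top-level `"lora_config"` object in the JSON, which is inferred from how the Quick Start reads the file. It is also an assumption that the ablation subfolders use the same file names.

```python
import json
from pathlib import Path

# File names taken from the Repository Layout table.
REQUIRED_FILES = ("lora_weights.safetensors", "lora_config.json")


def check_adapter_dir(adapter_dir):
    """Return a list of problems found in a local adapter directory."""
    root = Path(adapter_dir)
    problems = [
        f"missing {name}" for name in REQUIRED_FILES if not (root / name).is_file()
    ]
    config_path = root / "lora_config.json"
    if config_path.is_file():
        with config_path.open(encoding="utf-8") as f:
            info = json.load(f)
        # Assumption: the LoRA settings live under a "lora_config" key.
        if not isinstance(info, dict) or not isinstance(info.get("lora_config"), dict):
            problems.append("lora_config.json has no 'lora_config' object")
    return problems
```

An empty list means the folder looks loadable under these assumptions; for example, `check_adapter_dir(adapter_dir)` can be called on the snapshot path before constructing `LoRAConfig`.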

	## Quick Start

	Install VoxCPM and helper packages:

	```bash
	pip install voxcpm huggingface_hub soundfile
	```

	Load the base model and the default SR-FD adapter:

	```python
	import json
	import os

	import soundfile as sf
	from huggingface_hub import snapshot_download
	from voxcpm import VoxCPM
	from voxcpm.model.voxcpm import LoRAConfig

	base_model = "openbmb/VoxCPM2"
	# Download the adapter repository; the default adapter sits at its root.
	adapter_dir = snapshot_download("voidful/SRFD-VoxCPM2")

	# lora_config.json is the custom VoxCPM config, not a PEFT adapter_config.json.
	with open(os.path.join(adapter_dir, "lora_config.json"), "r", encoding="utf-8") as f:
	    adapter_info = json.load(f)

	lora_config = LoRAConfig(**adapter_info["lora_config"])

	# Load the unchanged base model and apply the LoRA weights on top of it.
	model = VoxCPM.from_pretrained(
	    hf_model_id=base_model,
	    load_denoiser=False,
	    optimize=True,
	    lora_config=lora_config,
	    lora_weights_path=adapter_dir,
	)

	# Generate with four inference timesteps, the setting SR-FD targets.
	wav = model.generate(
	    text="SR-FD improves true four-step VoxCPM2 synthesis.",
	    cfg_value=2.35,
	    inference_timesteps=4,
	    normalize=True,
	)

	sf.write("srfd_voxcpm2.wav", wav, model.tts_model.sample_rate)
	```

	Use an ablation adapter by pointing the LoRA loader to an ablation subfolder:

	```python
	ablation_dir = os.path.join(adapter_dir, "ablations", "remove_asr_true4_good_whisper")
	model.load_lora(ablation_dir)
	```
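If only one adapter is needed, the download can be narrowed to a single folder. The following is a minimal sketch, assuming the `allow_patterns` argument of `huggingface_hub.snapshot_download`. The helper names `adapter_patterns` and `download_adapter` are hypothetical, and the root file list is taken from the Repository Layout table.

```python
import os


def adapter_patterns(subfolder=""):
    """Build allow_patterns that select a single adapter folder."""
    prefix = subfolder.strip("/")
    if not prefix:
        # Default adapter: only the adapter files at the repository root.
        return ["lora_weights.safetensors", "lora_config.json", "training_state.json"]
    return [f"{prefix}/*"]


def download_adapter(repo_id, subfolder=""):
    """Download one adapter folder and return its local path."""
    # Imported lazily so the pattern helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    prefix = subfolder.strip("/")
    local_root = snapshot_download(repo_id, allow_patterns=adapter_patterns(prefix))
    return os.path.join(local_root, prefix) if prefix else local_root
```

For example, `download_adapter("voidful/SRFD-VoxCPM2", "ablations/remove_real_ctc_content")` would fetch only that ablation folder, and the returned path can be passed to `model.load_lora`.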

	## Evaluation Notes

	The headline metric is upstream Seed-TTS English WER, computed on 1,088 prompts
	with 11,805 reference words as counted for the paper. UTMOS and DNSMOS are
	objective proxies, not human MOS. The compact 3-target adapter matches the WER
	frontier of the 9-target SR-FD configuration while using fewer FD targets,
	which makes the recipe simpler and easier to reproduce.
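The WER percentages in the adapter table can be reproduced from the counts shown there. The sketch below assumes that the numerator in each `errors/11805` entry is the total word-error count over all prompts; the dictionary keys are just labels chosen for this example.

```python
REFERENCE_WORDS = 11805  # reference word count stated in this card

# Numerators copied from the Released Adapters table.
ERROR_COUNTS = {
    "compact3_balanced": 167,
    "remove_asr_true4_good_whisper": 182,
    "remove_real_ctc_content": 176,
    "remove_teacher_t10_ctc_content": 175,
}


def wer_percent(errors, reference_words=REFERENCE_WORDS):
    """Word error rate as a percentage of reference words."""
    return 100.0 * errors / reference_words


for name, errors in ERROR_COUNTS.items():
    print(f"{name}: {errors}/{REFERENCE_WORDS} = {wer_percent(errors):.4f}%")
```

Seen this way, the gaps between adapters are small in absolute terms: removing any single target adds between 8 and 15 errors over the 11,805 reference words.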

	## License

	This adapter release follows the Apache-2.0 license terms of the VoxCPM2 base
	model. See `openbmb/VoxCPM2` for the original model card and usage restrictions.