---
license: apache-2.0
base_model: openbmb/VoxCPM2
library_name: voxcpm
pipeline_tag: text-to-speech
tags:
- VoxCPM2
- text-to-speech
- voice-cloning
- flow-matching
- lora
- srfd
- speech
language:
- en
inference: false
private: true
---

# SRFD-VoxCPM2

SRFD-VoxCPM2 is an adapter-only release for
[openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2). It keeps the VoxCPM2
base model unchanged and provides VoxCPM LoRA weights trained with Speech
Representation Fréchet Distance (SR-FD), a training-time distributional
regularizer for true four-step TTS.

This repository does not contain the 2B VoxCPM2 base weights. Download
`openbmb/VoxCPM2` separately and load these adapters on top of it.

## Released Adapters

| Adapter | Path | Removed FD target | Step | Seed-TTS EN WER | UTMOS / DNSMOS OVRL / P808 |
|---|---|---|---:|---:|---:|
| Compact 3-target SR-FD | `.` and `adapters/compact3_balanced/` | none | 1600 | `167/11805 = 1.4147%` | `3.7637 / 3.0711 / 3.6507` |
| Remove ASR-good Whisper | `ablations/remove_asr_true4_good_whisper/` | `asr_true4_good_whisper` | 1600 | `182/11805 = 1.5417%` | `3.7650 / 3.0754 / 3.6545` |
| Remove real CTC | `ablations/remove_real_ctc_content/` | `real_ctc_content` | 1000 | `176/11805 = 1.4909%` | `3.7609 / 3.0731 / 3.6535` |
| Remove teacher CTC | `ablations/remove_teacher_t10_ctc_content/` | `teacher_t10_ctc_content` | 900 | `175/11805 = 1.4824%` | `3.7604 / 3.0756 / 3.6541` |

The compact three-target model is the default adapter and is duplicated at the
repository root for convenience.

## Compact SR-FD Targets

The final compact model uses three content-centered FD targets:

1. `asr_true4_good_whisper`: Whisper content statistics from ASR-reranked good
   true-four-step generations.
2. `teacher_t10_ctc_content`: CTC posterior statistics from ten-step VoxCPM2
   teacher generations.
3. `real_ctc_content`: CTC posterior statistics from real LibriTTS
   voice-cloning speech.

The leave-one-out adapters remove one of these targets while keeping the rest of
the compact recipe unchanged. They are intended for ablation and paper
reproducibility, not as recommended deployment checkpoints.
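
This card does not spell out how the distance to each FD target is computed. As a rough illustration only, the sketch below assumes each target is summarized by a feature mean and covariance and applies the standard Gaussian Fréchet distance (the same form used by FID). The function name `frechet_distance` and the toy statistics are hypothetical and are not part of this repository or of VoxCPM.

```python
import numpy as np


def frechet_distance(mu_a, cov_a, mu_b, cov_b):
    """Squared Frechet distance between Gaussians N(mu_a, cov_a) and N(mu_b, cov_b).

    d^2 = ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 * (cov_a @ cov_b)^(1/2))
    """
    mu_a = np.asarray(mu_a, dtype=float)
    mu_b = np.asarray(mu_b, dtype=float)
    cov_a = np.asarray(cov_a, dtype=float)
    cov_b = np.asarray(cov_b, dtype=float)

    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    sqrt_trace = float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))
    return mean_term + float(np.trace(cov_a) + np.trace(cov_b)) - 2.0 * sqrt_trace


# Toy statistics (assumed, for illustration): same covariance, shifted mean.
target_mu, target_cov = np.zeros(2), np.eye(2)
model_mu, model_cov = np.array([3.0, 4.0]), np.eye(2)
print(frechet_distance(model_mu, model_cov, target_mu, target_cov))
```

With identical covariances the trace term vanishes, so the distance reduces to the squared Euclidean distance between the two means.
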

## Repository Layout

| Path | Description |
|---|---|
| `lora_weights.safetensors` | Default compact 3-target SR-FD adapter |
| `lora_config.json` | Custom VoxCPM LoRA config for the default adapter |
| `training_state.json` | Training step marker for the default adapter |
| `adapters/compact3_balanced/` | Explicit copy of the default adapter |
| `ablations/remove_asr_true4_good_whisper/` | Leave-one-out adapter without the Whisper low-step target |
| `ablations/remove_real_ctc_content/` | Leave-one-out adapter without the real-speech CTC target |
| `ablations/remove_teacher_t10_ctc_content/` | Leave-one-out adapter without the ten-step teacher CTC target |
| `configs/` | Training configs used for the compact model and ablations |
| `reports/` | Upstream WER, UTMOS, DNSMOS, and ablation summaries |
| `metadata/adapter_index.json` | Machine-readable adapter index with hashes and source checkpoints |

`lora_config.json` is a custom VoxCPM LoRA config. It is not a PEFT
`adapter_config.json`.
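
Since the adapter index records hashes, it can be useful to checksum a downloaded weight file before loading it. The sketch below is a minimal example that assumes the hashes are SHA-256 hex digests. Neither the hash algorithm nor the schema of `metadata/adapter_index.json` is documented in this card, so the helper `sha256_of_file` is hypothetical and the comparison against the index is left to the reader.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of_file(os.path.join(adapter_dir, "lora_weights.safetensors"))` yields a digest that can be compared by eye against the corresponding entry in the index, assuming the index uses the same algorithm.
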

## Quick Start

Install VoxCPM and helper packages:

```bash
pip install voxcpm huggingface_hub soundfile
```

Load the base model and the default SR-FD adapter:

```python
import json
import os

import soundfile as sf
from huggingface_hub import snapshot_download
from voxcpm import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

base_model = "openbmb/VoxCPM2"

# Download the adapter-only repository (the base weights are fetched separately).
adapter_dir = snapshot_download("voidful/SRFD-VoxCPM2")

# The custom VoxCPM LoRA config stores its settings under the "lora_config" key.
with open(os.path.join(adapter_dir, "lora_config.json"), "r", encoding="utf-8") as f:
    adapter_info = json.load(f)
lora_config = LoRAConfig(**adapter_info["lora_config"])

model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    load_denoiser=False,
    optimize=True,
    lora_config=lora_config,
    lora_weights_path=adapter_dir,
)

# Four inference timesteps match the true four-step setting the adapter targets.
wav = model.generate(
    text="SR-FD improves true four-step VoxCPM2 synthesis.",
    cfg_value=2.35,
    inference_timesteps=4,
    normalize=True,
)
sf.write("srfd_voxcpm2.wav", wav, model.tts_model.sample_rate)
```

To use an ablation adapter, point the LoRA loader at an ablation subfolder:

```python
ablation_dir = os.path.join(adapter_dir, "ablations", "remove_asr_true4_good_whisper")
model.load_lora(ablation_dir)
```
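
When switching between several adapters, a small lookup keeps the folder names in one place. The sketch below is a hypothetical convenience helper, not part of VoxCPM or of this repository. It only maps the adapter names from the table of released adapters to their subfolders inside a downloaded snapshot.

```python
import os

# Adapter subfolders as listed in the "Released Adapters" table.
ADAPTER_SUBDIRS = {
    "compact3_balanced": os.path.join("adapters", "compact3_balanced"),
    "remove_asr_true4_good_whisper": os.path.join("ablations", "remove_asr_true4_good_whisper"),
    "remove_real_ctc_content": os.path.join("ablations", "remove_real_ctc_content"),
    "remove_teacher_t10_ctc_content": os.path.join("ablations", "remove_teacher_t10_ctc_content"),
}


def resolve_adapter_dir(snapshot_dir, name="compact3_balanced"):
    """Map an adapter name to its folder inside a downloaded snapshot (hypothetical helper)."""
    if name not in ADAPTER_SUBDIRS:
        raise KeyError(f"unknown adapter {name!r}; choose from {sorted(ADAPTER_SUBDIRS)}")
    return os.path.join(snapshot_dir, ADAPTER_SUBDIRS[name])
```

The returned path can then be passed to the loader shown above, for example `model.load_lora(resolve_adapter_dir(adapter_dir, "remove_real_ctc_content"))`.
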

## Evaluation Notes

The headline metric is upstream Seed-TTS English WER on 1,088 prompts with
11,805 paper-facing reference words. UTMOS and DNSMOS are objective proxies, not
human MOS. The compact 3-target adapter matches the 9-target SR-FD WER frontier
while using a simpler, easier-to-reproduce set of FD targets.
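
The WER percentages in the adapter table can be re-derived from the reported counts. The short sketch below assumes the numerator of each fraction is the total number of word errors over the 11,805 reference words. The dictionary keys are labels chosen here for illustration.

```python
REFERENCE_WORDS = 11805  # reference word count reported for the Seed-TTS EN set

# Numerators copied from the "Released Adapters" table (assumed to be word errors).
ERROR_COUNTS = {
    "compact3_balanced": 167,
    "remove_asr_true4_good_whisper": 182,
    "remove_real_ctc_content": 176,
    "remove_teacher_t10_ctc_content": 175,
}


def wer_percent(errors, reference_words=REFERENCE_WORDS):
    """Word error rate as a percentage: errors / reference words * 100."""
    return 100.0 * errors / reference_words


for name, errors in ERROR_COUNTS.items():
    print(f"{name}: {errors}/{REFERENCE_WORDS} = {wer_percent(errors):.4f}%")
```

The gap between the default adapter and the weakest ablation is 15 word errors out of 11,805, roughly 0.13 percentage points, which is worth keeping in mind when reading the ablation rows.
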

## License

This adapter release follows the Apache-2.0 license terms of the VoxCPM2 base
model. See `openbmb/VoxCPM2` for the original model card and usage restrictions.