Jetlink/JetlinkTTS

@@ -1,226 +0,0 @@
----
-language:
-- zh
-- en
-- ar
-- my
-- da
-- nl
-- fi
-- fr
-- de
-- el
-- he
-- hi
-- id
-- it
-- ja
-- km
-- ko
-- lo
-- ms
-- no
-- pl
-- pt
-- ru
-- es
-- sw
-- sv
-- tl
-- th
-- tr
-- vi
-license: apache-2.0
-library_name: voxcpm
-tags:
-- text-to-speech
-- tts
-- multilingual
-- voice-cloning
-- voice-design
-- diffusion
-- audio
-pipeline_tag: text-to-speech
----
-# VoxCPM2
-**VoxCPM2** is a tokenizer-free, diffusion autoregressive Text-to-Speech model — **2B parameters**, **30 languages**, **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.
-[![GitHub](https://img.shields.io/badge/GitHub-VoxCPM-blue?logo=github)](https://github.com/OpenBMB/VoxCPM)
-[![Docs](https://img.shields.io/badge/Docs-ReadTheDocs-8CA1AF)](https://voxcpm.readthedocs.io/en/latest/)
-[![Demo](https://img.shields.io/badge/Live%20Playground-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
-[![Audio Samples](https://img.shields.io/badge/Audio%20Samples-Demo%20Page-green)](https://openbmb.github.io/voxcpm2-demopage)
-[![Discord](https://img.shields.io/badge/Discord-VoxCPM-5865F2?logo=discord&logoColor=white)](https://discord.gg/KZUx7tVNwz)
-[![Lark](https://img.shields.io/badge/飞书群-VoxCPM-00D6B9?logo=lark&logoColor=white)](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=acds0b9d-23d8-4d7e-b696-d200f3e22a7f)
-## Highlights
-- 🌍 **30-Language Multilingual** — No language tag needed; input text in any supported language directly
-- 🎨 **Voice Design** — Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…); no reference audio required
-- 🎛️ **Controllable Cloning** — Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
-- 🎙️ **Ultimate Cloning** — Provide reference audio + its transcript for audio-continuation cloning; every vocal nuance faithfully reproduced
-- 🔊 **48kHz Studio-Quality Output** — Accepts 16kHz reference; outputs 48kHz via AudioVAE V2's built-in super-resolution, no external upsampler needed
-- 🧠 **Context-Aware Synthesis** — Automatically infers appropriate prosody and expressiveness from text content
-- ⚡ **Real-Time Streaming** — RTF as low as ~0.3 on an NVIDIA RTX 4090, or ~0.13 when accelerated by [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
-- 📜 **Fully Open-Source & Commercial-Ready** — Apache-2.0 license, free for commercial use
-<details>
-<summary><b>Supported Languages (30)</b></summary>
-Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
-Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Southern Min (Hokkien)
-</details>
-## Quick Start
-### Installation
-```bash
-pip install voxcpm
-```
-**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)
-### Text-to-Speech
-```python
-from voxcpm import VoxCPM
-import soundfile as sf
-model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
-wav = model.generate(
-    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
-    cfg_value=2.0,
-    inference_timesteps=10,
-)
-sf.write("output.wav", wav, model.tts_model.sample_rate)
-```
-### Voice Design
-Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:
-```python
-wav = model.generate(
-    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
-    cfg_value=2.0,
-    inference_timesteps=10,
-)
-sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
-```
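The prompt convention above (a parenthesized description followed directly by the content) is easy to assemble programmatically. The following is a minimal sketch that assumes only the layout shown in the example; `build_voice_design_text` is a hypothetical helper, not part of the `voxcpm` package, and the parenthesis check is a precaution rather than a documented model requirement.

```python
def build_voice_design_text(description: str, content: str) -> str:
    """Prefix `content` with a parenthesized voice description.

    Hypothetical helper: it only reproduces the "(description)content"
    layout shown above and never calls the model.
    """
    description = description.strip()
    content = content.strip()
    if not description:
        # No description given: return plain text-to-speech input.
        return content
    if "(" in description or ")" in description:
        # Assumption: nested parentheses might confuse the prompt format,
        # so reject them instead of guessing how the model treats them.
        raise ValueError("description must not contain parentheses")
    return f"({description}){content}"


text = build_voice_design_text(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(text)
```

The resulting string can then be passed as `text` to `model.generate(...)` in the same way as the example above.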
-### Controllable Voice Cloning
-```python
-# Basic cloning
-wav = model.generate(
-    text="This is a cloned voice generated by VoxCPM2.",
-    reference_wav_path="speaker.wav",
-)
-sf.write("clone.wav", wav, model.tts_model.sample_rate)
-# Cloning with style control
-wav = model.generate(
-    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
-    reference_wav_path="speaker.wav",
-    cfg_value=2.0,
-    inference_timesteps=10,
-)
-sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
-```
-### Ultimate Cloning
-Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for highest similarity:
-```python
-wav = model.generate(
-    text="This is an ultimate cloning demonstration using VoxCPM2.",
-    prompt_wav_path="speaker_reference.wav",
-    prompt_text="The transcript of the reference audio.",
-    reference_wav_path="speaker_reference.wav",
-)
-sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
-```
-### Streaming
-```python
-import numpy as np
-chunks = []
-for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
-    chunks.append(chunk)
-wav = np.concatenate(chunks)
-sf.write("streaming.wav", wav, model.tts_model.sample_rate)
-```
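For streaming, the figure that usually matters is how soon the first chunk arrives, not only the total synthesis time. The sketch below wraps any chunk iterator and records arrival times. `timed_chunks` is a hypothetical helper, and `fake_stream` is a stand-in for `model.generate_streaming(...)` so that the example runs without the model; the chunk size and sleep interval are arbitrary.

```python
import time
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def timed_chunks(chunks: Iterable[T]) -> Iterator[Tuple[float, T]]:
    """Yield (seconds_since_start, chunk) for each chunk of any iterable."""
    start = time.perf_counter()
    for chunk in chunks:
        yield time.perf_counter() - start, chunk


def fake_stream() -> Iterator[List[float]]:
    """Stand-in generator: three silent chunks with a small delay each."""
    for _ in range(3):
        time.sleep(0.01)
        yield [0.0] * 480


# Record when each chunk arrived and how many samples it carried.
arrivals = [(t, len(chunk)) for t, chunk in timed_chunks(fake_stream())]
first_chunk_latency = arrivals[0][0]
print(f"first chunk after {first_chunk_latency:.3f}s, {len(arrivals)} chunks")
```

With the real model, replacing `fake_stream()` with the `model.generate_streaming(text=...)` call from the example above should work unchanged, since the wrapper only iterates over whatever it is given.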
-## Model Details
-| Property | Value |
-|---|---|
-| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
-| Backbone | Based on MiniCPM-4; 2B parameters in total |
-| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
-| Training Data | 2M+ hours multilingual speech |
-| LM Token Rate | 6.25 Hz |
-| Max Sequence Length | 8192 tokens |
-| dtype | bfloat16 |
-| VRAM | ~8 GB |
-| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
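The token rate and RTF figures in the table translate into rough capacity and latency estimates. The sketch below is back-of-the-envelope arithmetic only. It assumes RTF means synthesis time divided by audio duration (the usual definition), and the upper bound assumes all 8192 positions hold audio tokens, which ignores the text and prompt tokens that share the sequence.

```python
LM_TOKEN_RATE_HZ = 6.25      # from the table: LM tokens per second of audio
MAX_SEQUENCE_TOKENS = 8192   # from the table
RTF_STANDARD = 0.30          # from the table, RTX 4090
RTF_NANO_VLLM = 0.13         # from the table, RTX 4090 with Nano-vLLM


def audio_tokens(duration_s: float) -> float:
    """LM tokens needed to represent `duration_s` seconds of audio."""
    return duration_s * LM_TOKEN_RATE_HZ


def synthesis_time(duration_s: float, rtf: float) -> float:
    """Estimated wall-clock seconds to synthesize `duration_s` of audio."""
    return duration_s * rtf


# Optimistic ceiling: assumes every sequence position is an audio token.
max_audio_seconds = MAX_SEQUENCE_TOKENS / LM_TOKEN_RATE_HZ

print(f"10 s of audio uses {audio_tokens(10.0):.1f} LM tokens")
print(f"standard:  {synthesis_time(10.0, RTF_STANDARD):.2f} s to synthesize")
print(f"Nano-vLLM: {synthesis_time(10.0, RTF_NANO_VLLM):.2f} s to synthesize")
print(f"upper bound on audio length: {max_audio_seconds / 60:.1f} min")
```

Under these assumptions a 10-second utterance takes roughly 3 seconds in the standard setup, and the sequence limit caps a single generation at a little under 22 minutes of audio; real limits will be lower once text and prompt tokens are counted.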
-## Performance
-VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
-See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).
-## Fine-tuning
-VoxCPM2 supports both full supervised fine-tuning (SFT) and LoRA fine-tuning, with as little as 5–10 minutes of audio:
-```bash
-# LoRA fine-tuning (recommended)
-python scripts/train_voxcpm_finetune.py \
-    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml
-# Full fine-tuning
-python scripts/train_voxcpm_finetune.py \
-    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
-```
-See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.
-## Limitations
-- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
-- Performance varies across languages depending on training data availability.
-- Occasional instability may occur with very long or highly expressive inputs.
-- Using the model for impersonation, fraud, or disinformation is **strictly forbidden**. AI-generated content should be clearly labeled.
-## Citation
-```bibtex
-@article{voxcpm2_2026,
-  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
-  author  = {VoxCPM Team},
-  journal = {GitHub},
-  year    = {2026},
-}
-@article{voxcpm2025,
-  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
-  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
-             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
-             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
-  journal = {arXiv preprint arXiv:2509.24650},
-  year    = {2025},
-}
-```
-## License
-Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.