---
license: agpl-3.0
language:
- en
- zh
base_model:
- openbmb/VoxCPM1.5
pipeline_tag: text-to-speech
tags:
- rknn
- rkllm
- text-to-speech
- speech
- speech generation
- voice cloning
---

# VoxCPM1.5-RKNN2

### (English README see below)

> VoxCPM 是一种创新的无分词器文本转语音(TTS)系统,重新定义了语音合成的真实感。通过在连续空间中建模语音,它克服了离散标记化的局限,并实现了两项核心能力:上下文感知的语音生成和逼真的零样本语音克隆。
>
> 不同于将语音转换为离散标记的主流方法,VoxCPM 采用端到端的扩散自回归架构,直接从文本生成连续的语音表示。它基于 MiniCPM-4 主干构建,通过分层语言建模和 FSQ 约束实现了隐式的语义-声学解耦,极大地提升了表现力和生成稳定性。

我们非常激动地推出 VoxCPM 的重大升级版本。此次更新在显著提升音频质量和效率的同时,保留了核心的上下文感知语音生成和零样本(Zero-shot)语音克隆能力。

| 特性 | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE 采样率** | 16kHz | 44.1kHz |
| **LM Token 速率** | 12.5Hz | 6.25Hz |
| **Patch 大小** | 2 | 4 |
| **SFT 支持** | ✅ | ✅ |
| **LoRA 支持** | ✅ | ✅ |

- 推理速度(RKNN2):RK3588 上 RTF 约 4.5(生成 10s 音频需要推理 45s,相对于旧版没有明显提升)
- 大致内存占用(RKNN2):约 3.3GB(相对于旧版同样没有明显提升)

## 使用方法

1. 克隆项目到本地
2. 安装依赖

```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```

3. 运行

```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对,这就是我,万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```

可选参数:

- `--text`: 要生成的文本
- `--prompt-audio`: 参考音频路径(用于语音克隆)
- `--prompt-text`: 参考音频对应的文本(使用参考音频时必填)
- `--cfg-value`: CFG 引导强度,默认 2.0
- `--inference-timesteps`: 扩散步数,默认 10
- `--seed`: 随机种子
- `--output`: 输出音频路径

## 运行效果

```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对,这就是我,万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] vae_encode_211680: 1997.43 ms
[time] locenc_0: 1791.50 ms
[time] locenc_64: 1782.49 ms
[time] base_lm initial: 368.19 ms
[time] fsq_init_0: 5.52 ms
[time] fsq_init_64: 4.20 ms
[time] residual_lm initial: 105.79 ms
gen_loop:   0%|          | 0/2000 [00:00
```

---

# VoxCPM1.5-RKNN2 (English)

> VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.
>
> Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
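The FSQ constraint mentioned above can be illustrated with a minimal sketch. This is not VoxCPM's actual quantizer — the function name, level count, and array shapes are assumptions for illustration — but it shows the core idea of finite scalar quantization: bound each latent dimension, then snap it to a small fixed grid of values.

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels: int = 5) -> np.ndarray:
    """FSQ sketch: bound each dimension with tanh, then round it
    to one of `levels` evenly spaced values in [-1, 1]."""
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half       # each dimension now lies in (-half, half)
    return np.round(bounded) / half   # snap to the grid, rescale to [-1, 1]

# Toy latent vector: large values saturate to the grid edges,
# small values collapse to the nearest level (here, 0).
z = np.array([-3.0, -0.2, 0.0, 0.2, 3.0])
print(fsq_quantize(z))  # quantized to the values -1, 0, 0, 0, 1
```

Because every dimension can only take a finite set of levels, the codebook is implicit — no learned vector-quantization codebook or commitment loss is needed, which is what makes FSQ attractive as a structural constraint.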
We’re thrilled to introduce a major upgrade that significantly improves the audio quality and efficiency of VoxCPM while maintaining its core capabilities of context-aware speech generation and zero-shot voice cloning.

| Feature | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |

- Inference speed (RKNN2): RTF of roughly 4.5 on RK3588 (45 s of inference to generate 10 s of audio; no improvement over the previous version)
- Approximate memory usage (RKNN2): ~3.3 GB (likewise no improvement over the previous version)

## Usage

1. Clone the project locally
2. Install dependencies

```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```

3. Run

```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对,这就是我,万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```

Optional parameters:

- `--text`: Text to generate
- `--prompt-audio`: Reference audio path (for voice cloning)
- `--prompt-text`: Text corresponding to the reference audio (required when using reference audio)
- `--cfg-value`: CFG guidance strength, default 2.0
- `--inference-timesteps`: Number of diffusion steps, default 10
- `--seed`: Random seed
- `--output`: Output audio path

## Performance

```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对,这就是我,万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] vae_encode_211680: 1997.43 ms
[time] locenc_0: 1791.50 ms
[time] locenc_64: 1782.49 ms
[time] base_lm initial: 368.19 ms
[time] fsq_init_0: 5.52 ms
[time] fsq_init_64: 4.20 ms
[time] residual_lm initial: 105.79 ms
gen_loop:   0%|          | 0/2000 [00:00
```
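For reference, the RTF figure quoted above is simply wall-clock inference time divided by the duration of the generated audio, and the `[time]` entries in the log can be totaled the same way. A small hypothetical helper (the function names, regex, and sample values are illustrative, not part of this project):

```python
import re

def sum_stage_ms(log_text: str) -> float:
    """Sum all '[time] <stage>: <value> ms' entries in a run log."""
    return sum(float(v) for v in re.findall(r"\[time\] [^:]+: ([\d.]+) ms", log_text))

def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of inference per second of audio (lower is better)."""
    return inference_seconds / audio_seconds

# Two sample lines copied from the log above
log = """[time] vae_encode_0: 2127.35 ms
[time] locenc_0: 1791.50 ms"""
print(sum_stage_ms(log))  # 3918.85 (ms spent in these two stages)
print(rtf(45.0, 10.0))    # 4.5, matching the figure quoted above
```

An RTF above 1.0 means generation is slower than playback, so at RTF ≈ 4.5 this setup is suited to offline synthesis rather than live streaming.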