VoxCPM1.5-RKNN2

VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.

Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.

We’re thrilled to introduce VoxCPM1.5, a major upgrade to VoxCPM that improves audio quality and efficiency while keeping the core capabilities of context-aware speech generation and zero-shot voice cloning.

| Feature | VoxCPM | VoxCPM1.5 |
|---|---|---|
| Audio VAE Sampling Rate | 16 kHz | 44.1 kHz |
| LM Token Rate | 12.5 Hz | 6.25 Hz |
| Patch Size | 2 | 4 |
| SFT Support | ✅ | ✅ |
| LoRA Support | ✅ | ✅ |
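
To make the numbers in the comparison above more concrete, here is a small arithmetic sketch. It assumes that "patch size" means the number of Audio VAE latent frames consumed per LM token; that interpretation is not stated in this README, so treat the derived values as illustrative only. The dictionary keys and the `derived_rates` helper are hypothetical names.

```python
# Rates copied from the comparison table. Assumption (not stated in this
# README): patch size = number of VAE latent frames per LM token.
CONFIGS = {
    "VoxCPM": {"sample_rate_hz": 16_000, "lm_token_rate_hz": 12.5, "patch_size": 2},
    "VoxCPM1.5": {"sample_rate_hz": 44_100, "lm_token_rate_hz": 6.25, "patch_size": 4},
}


def derived_rates(cfg):
    """Return (latent frame rate in Hz, samples per latent frame, samples per LM token)."""
    latent_rate_hz = cfg["lm_token_rate_hz"] * cfg["patch_size"]
    samples_per_latent = cfg["sample_rate_hz"] / latent_rate_hz
    samples_per_token = cfg["sample_rate_hz"] / cfg["lm_token_rate_hz"]
    return latent_rate_hz, samples_per_latent, samples_per_token


for name, cfg in CONFIGS.items():
    latent_rate, per_latent, per_token = derived_rates(cfg)
    print(f"{name}: {latent_rate} latent frames/s, "
          f"{per_latent} samples/latent frame, {per_token} samples/LM token")
```

Under this assumption both versions run at the same latent frame rate, and each VoxCPM1.5 LM step covers twice as much audio time (160 ms instead of 80 ms) as a VoxCPM step.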

Inference speed (RKNN2): real-time factor (RTF) of about 4.5 on RK3588, i.e. roughly 45 s of inference to generate 10 s of audio. This appears to be no faster than the previous version.
Approximate memory usage (RKNN2): about 3.3 GB, likewise no improvement over the previous version.
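
The RTF figure is simply inference time divided by the duration of the generated audio. A minimal sketch of that calculation, using the numbers quoted above (the function names are hypothetical):

```python
def real_time_factor(inference_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


def estimated_inference_seconds(audio_seconds, rtf=4.5):
    """Rough wall-clock estimate for a clip, assuming the RTF stays constant."""
    return audio_seconds * rtf


rtf = real_time_factor(45.0, 10.0)
print(f"RTF = {rtf:.2f}")
print(f"30 s of audio would take about {estimated_inference_seconds(30.0, rtf):.0f} s")
```

An RTF above 1 means generation is slower than real time, so at about 4.5 this port is suited to offline synthesis rather than streaming.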

Usage

Clone the project locally.
Install dependencies:

pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async

Run inference:

python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

Optional parameters:

--text: Text to generate
--prompt-audio: Reference audio path (for voice cloning)
--prompt-text: Text corresponding to the reference audio (required when using reference audio)
--cfg-value: CFG guidance strength, default 2.0
--inference-timesteps: Number of diffusion steps, default 10
--seed: Random seed
--output: Output audio path
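
If you want to drive the script from Python, the sketch below assembles the command line from the flags documented above. `build_command` and its `model_dir` argument are hypothetical helpers, not part of this repository, and the sketch assumes the script can also be run without a reference audio, which the parameter list implies but does not state outright.

```python
import sys


def build_command(text, output, prompt_audio=None, prompt_text=None,
                  cfg_value=2.0, inference_timesteps=10, seed=None,
                  model_dir="."):
    """Build an argv list for onnx_infer-rknn2.py using only the documented flags."""
    # The README says --prompt-text is required whenever a reference audio is used.
    if prompt_audio is not None and not prompt_text:
        raise ValueError("--prompt-text is required when --prompt-audio is given")
    cmd = [
        sys.executable, "onnx_infer-rknn2.py",
        "--onnx-dir", model_dir,
        "--tokenizer-dir", model_dir,
        "--base-hf-dir", model_dir,
        "--residual-hf-dir", model_dir,
        "--text", text,
        "--output", output,
        "--cfg-value", str(cfg_value),
        "--inference-timesteps", str(inference_timesteps),
    ]
    if prompt_audio is not None:
        cmd += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_command(
    "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!",
    "rknn_output.wav",
    prompt_audio="basic_ref_zh.wav",
    prompt_text="对，这就是我，万人敬仰的太乙真人。",
    seed=1234,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the cloned project directory.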

Performance

> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] vae_encode_211680: 1997.43 ms
[time] locenc_0: 1791.50 ms
[time] locenc_64: 1782.49 ms
[time] base_lm initial: 368.19 ms
[time] fsq_init_0: 5.52 ms
[time] fsq_init_64: 4.20 ms
[time] residual_lm initial: 105.79 ms
gen_loop:   0%|          | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.49 ms
[time] res_to_dit: 1.11 ms
100%|██████████| 10/10 [00:00<00:00, 32.15it/s]
[time] locenc_step: 33.00 ms███████▊  | 8/10 [00:00<00:00, 32.24it/s]
gen_loop:   0%|          | 1/2000 [00:00<15:33,  2.14it/s][time] lm_to_dit: 0.67 ms
[time] res_to_dit: 0.76 ms
100%|██████████| 10/10 [00:00<00:00, 32.86it/s]
[time] locenc_step: 31.85 ms███████▊  | 8/10 [00:00<00:00, 32.99it/s]
gen_loop:   0%|          | 2/2000 [00:00<15:18,  2.18it/s][time] lm_to_dit: 0.61 ms
[time] res_to_dit: 0.65 ms
100%|██████████| 10/10 [00:00<00:00, 32.72it/s]
[time] locenc_step: 32.01 ms███████▊  | 8/10 [00:00<00:00, 32.83it/s]
gen_loop:   2%|▏         | 49/2000 [00:22<14:55,  2.18it/s][time] lm_to_dit: 0.88 ms
[time] res_to_dit: 0.64 ms
100%|██████████| 10/10 [00:00<00:00, 32.72it/s]
[time] locenc_step: 32.16 ms███████▊  | 8/10 [00:00<00:00, 32.88it/s]
gen_loop:   2%|▏         | 49/2000 [00:22<15:05,  2.15it/s]
[time] vae_decode_0: 2438.31 ms
[time] vae_decode_60: 2372.92 ms
[time] vae_decode_120: 2380.40 ms
[time] vae_decode_180: 2344.88 ms
Saved: rknn_output.wav
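
To see where the time goes, the `[time]` lines in a log like the one above can be aggregated per stage. The following is a rough sketch, not a tool shipped with this project: `summarize_timings` is a hypothetical helper, and grouping stages by stripping a trailing `_<number>` suffix is an assumption about how the labels are formed. The sample text is an abridged excerpt of the log.

```python
import re
from collections import defaultdict

# Matches "[time] <stage name>: <milliseconds> ms" anywhere in a line, so it
# also works when a progress bar is printed on the same line.
TIME_RE = re.compile(r"\[time\]\s+([^:]+?):\s+([0-9]+(?:\.[0-9]+)?)\s*ms")


def summarize_timings(log_text):
    """Return {stage: (call count, total milliseconds)} for all [time] entries."""
    totals = defaultdict(lambda: [0, 0.0])
    for name, ms in TIME_RE.findall(log_text):
        # Assumption: a trailing "_<digits>" is a chunk offset, not part of the stage name.
        stage = re.sub(r"_\d+$", "", name.strip())
        totals[stage][0] += 1
        totals[stage][1] += float(ms)
    return {stage: (count, total) for stage, (count, total) in totals.items()}


SAMPLE_LOG = """\
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] base_lm initial: 368.19 ms
gen_loop:   0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.49 ms
[time] locenc_step: 33.00 ms███▊ | 8/10 [00:00<00:00, 32.24it/s]
[time] vae_decode_0: 2438.31 ms
"""

for stage, (count, total_ms) in summarize_timings(SAMPLE_LOG).items():
    print(f"{stage:>16}: {count} call(s), {total_ms:.2f} ms total")
```

Running it over a full log makes it easier to compare the fixed costs (VAE encode/decode, initial LM passes) with the per-step cost of the generation loop.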

Model Conversion

See https://huggingface.co/happyme531/VoxCPM1.5-RKNN2/tree/main/convert

Known Issues

In some cases, speech generation may get stuck in an endless loop. The original project appears to have a mechanism for detecting this, but it is not implemented here.
Due to an internal issue in the RKNN toolchain, a single locenc model cannot be configured with two sets of input shapes for the two input lengths, so two separate models are converted instead.
Due to an internal issue in the RKLLM toolchain/runtime, the output tensors of both LLMs hold only one quarter of the correct values; manually multiplying them by 4 gives the correct result.
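
Two of these issues can be illustrated with a short sketch. The first function applies the factor-of-4 correction described above. The second puts a hard cap on the number of generation steps, which is a cruder safeguard than the loop detection in the original project and is not how that project does it. All names here (`rescale_llm_output`, `generate_with_step_cap`, `step_fn`) are hypothetical, and the default cap of 2000 is taken from the progress bar total in the sample log.

```python
RKLLM_OUTPUT_SCALE = 4.0  # correction factor reported in the known issues


def rescale_llm_output(hidden, scale=RKLLM_OUTPUT_SCALE):
    """Undo the 1/4 scaling of the RKLLM outputs.

    `hidden` can be a float or any array type that supports multiplication
    by a scalar (for example a NumPy array).
    """
    return hidden * scale


def generate_with_step_cap(step_fn, max_steps=2000):
    """Call step_fn() until it signals a stop or max_steps is reached.

    step_fn is a hypothetical callable returning (patch, stop_flag).
    Returns (patches, finished), where finished is False if the cap was hit.
    """
    patches = []
    for _ in range(max_steps):
        patch, stop = step_fn()
        patches.append(patch)
        if stop:
            return patches, True
    return patches, False


# Toy usage: a fake step function that asks to stop on its 5th call.
calls = {"n": 0}


def fake_step():
    calls["n"] += 1
    return rescale_llm_output(0.25), calls["n"] >= 5


patches, finished = generate_with_step_cap(fake_step, max_steps=2000)
print(len(patches), finished, patches[0])
```

A step cap only bounds the damage when generation fails to stop; it does not detect the loop early, so a runaway generation still costs the full `max_steps` iterations.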
