VoxCPM-0.5B-RKNN2


VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.

Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.

Inference speed (RKNN2): RTF of about 4.5 on RK3588 (roughly 45 s of inference to generate 10 s of audio)
Approximate memory usage (RKNN2): about 3.3 GB
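The real-time factor (RTF) quoted above is the inference time divided by the duration of the generated audio. The sketch below only restates that arithmetic; the function names are illustrative and are not part of this repository.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent on inference / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


def estimated_inference_seconds(audio_seconds: float, rtf: float = 4.5) -> float:
    """Estimate wall-clock inference time for a clip, given an RTF (default from this README)."""
    return audio_seconds * rtf


print(real_time_factor(45.0, 10.0))       # 4.5
print(estimated_inference_seconds(30.0))  # 135.0
```

An RTF above 1 means generation is slower than playback, so at 4.5 a 30 s clip would take a little over two minutes on this board.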

Usage

Clone the project locally
Install dependencies

pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async

Run

python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

Optional parameters:

--text: Text to generate
--prompt-audio: Reference audio path (for voice cloning)
--prompt-text: Text corresponding to the reference audio (required when using reference audio)
--cfg-value: CFG guidance strength, default 2.0
--inference-timesteps: Number of diffusion steps, default 10
--seed: Random seed
--output: Output audio path
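If the script is driven from another Python program, the flags listed above can be assembled into an argument list. This is a minimal sketch: `build_command` and `synthesize` are hypothetical helpers, not part of this repository, and they use only the flags and defaults documented in this README.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(
    text: str,
    output: str = "rknn_output.wav",
    prompt_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
    cfg_value: float = 2.0,
    inference_timesteps: int = 10,
    seed: Optional[int] = None,
    model_dir: str = ".",
) -> List[str]:
    """Assemble the argv list for onnx_infer-rknn2.py (hypothetical helper)."""
    # The README states that --prompt-text is required when reference audio is used.
    if prompt_audio is not None and prompt_text is None:
        raise ValueError("--prompt-text is required when --prompt-audio is given")
    cmd = [
        sys.executable, "onnx_infer-rknn2.py",
        "--onnx-dir", model_dir,
        "--tokenizer-dir", model_dir,
        "--base-hf-dir", model_dir,
        "--residual-hf-dir", model_dir,
        "--text", text,
        "--output", output,
        "--cfg-value", str(cfg_value),
        "--inference-timesteps", str(inference_timesteps),
    ]
    if prompt_audio is not None:
        cmd += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def synthesize(**kwargs) -> None:
    """Run the script in the current directory; raises if it exits with an error."""
    subprocess.run(build_command(**kwargs), check=True)


if __name__ == "__main__":
    # Only prints the command; call synthesize(...) on a board with the models present.
    print(build_command("Hello from RK3588.", seed=1234))
```

Passing the arguments as a list avoids shell quoting problems with text that contains quotes or non-ASCII characters.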

Sample Run

> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, 这个模型居然在RK3588这个辣鸡SoC上也能完美运行!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop:   0%|                                                                                 | 0/2000 [00:00<?, ?it/s]
[time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 ms
gen_loop:   0%|                                                                         | 1/2000 [00:00<09:43,  3.42it/s]
[time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 ms
gen_loop:   0%|                                                                         | 2/2000 [00:00<09:25,  3.53it/s]
[time] lm_to_dit: 0.74 ms

...

[time] res_to_dit: 0.59 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 ms
gen_loop:   6%|████▎                                                                  | 123/2000 [00:34<08:47,  3.56it/s]
[time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 ms
gen_loop:   6%|████▎                                                                  | 123/2000 [00:34<08:47,  3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav
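To see where the time goes, the `[time]` entries in a captured log can be summed per stage. The sketch below is not part of this repository. It assumes the `[time] <name>: <value> ms` format shown above and treats a trailing `_<number>` in a name as a chunk offset, which is an assumption drawn from names such as `vae_encode_38400`.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches "[time] <name>: <value> ms", even when a progress bar shares the line.
TIME_RE = re.compile(r"\[time\] ([^:\[\]]+): ([0-9]+(?:\.[0-9]+)?) ms")


def stage_totals(log_text: str) -> Dict[str, float]:
    """Sum the reported milliseconds per stage, merging chunked entries."""
    totals: Dict[str, float] = defaultdict(float)
    for name, ms in TIME_RE.findall(log_text):
        # Assumption: "vae_encode_38400" is chunk 38400 of stage "vae_encode".
        stage = re.sub(r"_\d+$", "", name.strip())
        totals[stage] += float(ms)
    return dict(totals)


# A few lines taken from the log above (the progress bar is shortened here).
SAMPLE = """\
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] base_lm initial: 549.21 ms
gen_loop:   0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
"""

for stage, ms in stage_totals(SAMPLE).items():
    print(f"{stage:>16}: {ms:8.2f} ms")
```

To analyze a full run, redirect the script's output to a file and pass its contents to `stage_totals`.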

Model Conversion

See https://huggingface.co/happyme531/VoxCPM-0.5B-RKNN2/tree/main/convert

Known Issues

In some cases, speech generation may fall into an infinite loop. The original project appears to include a mechanism for detecting such loops, but it is not implemented here.
Due to internal issues with the RKNN toolchain, a single locenc model cannot be configured with two sets of shapes for two different input lengths, so two separate models have to be converted.
Due to internal issues with the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one quarter of the correct values. Multiplying them by 4 manually yields the correct result.
Previously, because the RKNN toolchain did not support data-parallel inference across multiple NPU cores for batched models with non-4D inputs, the script ran CFG as two separate passes, which was relatively slow. This issue has been resolved.
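Two of the issues above come down to simple arithmetic on model outputs: rescaling the RKLLM outputs by 4, and combining the conditional and unconditional predictions for CFG. The sketch below is illustrative only. Plain Python lists stand in for tensors, the function names are hypothetical, and the combination shown is the standard classifier-free guidance formula, which may differ from what the script actually computes.

```python
from typing import List, Sequence

# From the known issue above: raw RKLLM outputs are one quarter of the correct values.
RKLLM_OUTPUT_SCALE = 4.0


def rescale_llm_output(raw: Sequence[float]) -> List[float]:
    """Undo the 1/4 scaling of an RKLLM output vector."""
    return [v * RKLLM_OUTPUT_SCALE for v in raw]


def cfg_combine(
    cond: Sequence[float],
    uncond: Sequence[float],
    cfg_value: float = 2.0,
) -> List[float]:
    """Standard CFG: move from the unconditional prediction toward the conditional one."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + cfg_value * (c - u) for c, u in zip(cond, uncond)]


print(rescale_llm_output([0.25, -0.5]))       # [1.0, -2.0]
print(cfg_combine([1.0, 2.0], [0.0, 1.0]))    # [2.0, 3.0]
```

With `cfg_value` set to 1.0 the result equals the conditional prediction; larger values push further away from the unconditional one. Whether the two predictions come from two passes or from one batched pass does not change this formula.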
