---
license: agpl-3.0
language:
- en
- zh
base_model:
- openbmb/VoxCPM-0.5B
pipeline_tag: text-to-speech
tags:
- rknn
- rkllm
- text-to-speech
- speech
- speech generation
- voice cloning
---

# VoxCPM-0.5B-RKNN2

VoxCPM is an innovative tokenizer-free text-to-speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.

Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that generates continuous speech representations directly from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly improving expressiveness and generation stability.

![Model architecture](model_structure.jpg)

- Inference speed (RKNN2): RTF of about 4.5 on RK3588 (roughly 45 s of inference to generate 10 s of audio)
- Approximate memory usage (RKNN2): about 3.3 GB
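
The real-time factor (RTF) quoted above is inference time divided by the duration of the generated audio. A minimal sketch of that arithmetic, using the figures from the bullet; the function name is illustrative and not part of this repository:

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Figures quoted above: about 45 s of inference for 10 s of audio.
rtf = real_time_factor(45.0, 10.0)
print(f"RTF = {rtf:.1f}")  # RTF = 4.5
```

An RTF above 1 means generation is slower than real time.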

## Usage

1. Clone this repository to your local machine

2. Install dependencies

```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```

3. Run

```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```

Optional parameters:
- `--text`: Text to generate
- `--prompt-audio`: Reference audio path (for voice cloning)
- `--prompt-text`: Text corresponding to the reference audio (required when using reference audio)
- `--cfg-value`: CFG guidance strength, default 2.0
- `--inference-timesteps`: Number of diffusion steps, default 10
- `--seed`: Random seed
- `--output`: Output audio path
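
For readers unfamiliar with `--cfg-value`: classifier-free guidance (CFG) is commonly implemented by running the denoiser with and without conditioning, then extrapolating from the unconditional prediction toward the conditional one. The sketch below shows that common formulation in plain Python. It assumes `onnx_infer-rknn2.py` follows the standard formula, which this README does not confirm, and the function and variable names are hypothetical.

```python
from typing import List, Sequence


def apply_cfg(cond: Sequence[float], uncond: Sequence[float], cfg_value: float) -> List[float]:
    """Blend conditional and unconditional predictions element-wise.

    Standard classifier-free guidance: uncond + cfg_value * (cond - uncond).
    cfg_value = 1.0 reduces to the conditional prediction; larger values
    push the result further toward the conditioning.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + cfg_value * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions standing in for one diffusion step's outputs (made-up numbers).
cond = [1.0, 2.0, 3.0]
uncond = [0.5, 1.0, 1.5]
print(apply_cfg(cond, uncond, 2.0))  # [1.5, 3.0, 4.5]
```

Under this formulation each diffusion step needs two predictions, which is why running the conditional and unconditional passes as one batch matters for speed.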

## Performance


```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, 这个模型居然在RK3588这个辣鸡SoC上也能完美运行!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop:   0%|                                                                                 | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 ms████████████████████████████████████▍                         | 7/10 [00:00<00:00, 61.20it/s]
gen_loop:   0%|                                                                         | 1/2000 [00:00<09:43,  3.42it/s][time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 ms████████████████████████████████████▍                         | 7/10 [00:00<00:00, 67.27it/s]
gen_loop:   0%|                                                                         | 2/2000 [00:00<09:25,  3.53it/s][time] lm_to_dit: 0.74 ms

...

[time] res_to_dit: 0.59 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 ms████████████████████████████████████▍                         | 7/10 [00:00<00:00, 67.34it/s]
gen_loop:   6%|████▎                                                                  | 123/2000 [00:34<08:47,  3.56it/s][time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 ms████████████████████████████████████▍                         | 7/10 [00:00<00:00, 67.43it/s]
gen_loop:   6%|████▎                                                                  | 123/2000 [00:34<08:47,  3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav
```

## Model Conversion

See the `convert` directory of this repository: https://huggingface.co/happyme531/VoxCPM-0.5B-RKNN2/tree/main/convert

## Known Issues

- In some cases, speech generation can get stuck in an infinite loop. The original project appears to have a mechanism for detecting such loops, but it is not implemented here.
- Because of an internal issue in the RKNN toolchain, a single locenc model cannot be configured with two sets of shapes for its two input lengths, so the two lengths are converted as separate models.
- Because of an internal issue in the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one quarter of the correct values. Multiplying them by 4 manually recovers the correct result.
- ~~The RKNN toolchain currently does not support multi-core data-parallel inference across NPU cores for batched models with non-4D inputs, so the script runs CFG as two separate passes, which is relatively slow.~~ (Fixed)
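
Two of the issues above lend themselves to a short illustration. The sketch below shows the ×4 rescaling workaround for the RKLLM outputs, plus one possible guard against runaway generation: a step budget tied to the length of the input text. The guard is a hypothetical heuristic, not the mechanism used by the original project. All names and the `max_steps_per_token` value are assumptions; the hard cap of 2000 mirrors the loop length shown in the log above.

```python
from typing import List, Sequence

# Factor by which the RKLLM outputs are too small, per the known issue above.
RKLLM_OUTPUT_SCALE = 4.0


def rescale_llm_output(hidden: Sequence[float]) -> List[float]:
    """Undo the 1/4 scaling observed on the RKLLM output tensors."""
    return [RKLLM_OUTPUT_SCALE * x for x in hidden]


def should_abort(step: int, num_text_tokens: int,
                 max_steps_per_token: float = 6.0, hard_cap: int = 2000) -> bool:
    """Return True when generation has run suspiciously long.

    Hypothetical heuristic: stop once the step count reaches
    max_steps_per_token * num_text_tokens or the hard cap, whichever is lower.
    """
    budget = min(hard_cap, int(max_steps_per_token * max(num_text_tokens, 1)))
    return step >= budget


print(rescale_llm_output([0.25, -0.5]))             # [1.0, -2.0]
print(should_abort(step=123, num_text_tokens=30))   # False (budget is 180)
print(should_abort(step=180, num_text_tokens=30))   # True
```

Plain lists keep the example dependency-free; in a real inference script the same multiplication would presumably be applied to the output arrays directly.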

## References
- [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)