---
license: agpl-3.0
language:
- en
- zh
base_model:
- openbmb/VoxCPM1.5
pipeline_tag: text-to-speech
tags:
- rknn
- rkllm
- text-to-speech
- speech
- speech generation
- voice cloning
---

# VoxCPM1.5-RKNN2

### (English README see below)

> VoxCPM 是一种创新的无分词器文本转语音(TTS)系统,重新定义了语音合成的真实感。通过在连续空间中建模语音,它克服了离散标记化的局限,并实现了两项核心能力:上下文感知的语音生成和逼真的零样本语音克隆。
>
> 不同于将语音转换为离散标记的主流方法,VoxCPM 采用端到端的扩散自回归架构,直接从文本生成连续的语音表示。它基于 MiniCPM-4 主干构建,通过分层语言建模和 FSQ 约束实现了隐式的语义-声学解耦,极大地提升了表现力和生成稳定性。

我们非常激动地推出 VoxCPM 的重大升级版本。此次更新在显著提升音频质量和效率的同时,保留了核心的上下文感知语音生成和零样本(Zero-shot)语音克隆能力。

| 特性 | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE 采样率** | 16kHz | 44.1kHz |
| **LM Token 速率** | 12.5Hz | 6.25Hz |
| **Patch 大小** | 2 | 4 |
| **SFT 支持** | ✅ | ✅ |
| **LoRA 支持** | ✅ | ✅ |

- 推理速度(RKNN2):RK3588 上 RTF 约 4.5(生成 10s 音频需要推理 45s,相对于旧版没有明显提升)
- 大致内存占用(RKNN2):约 3.3GB(相对于旧版同样没有明显提升)

## 使用方法

1. 克隆项目到本地
2. 安装依赖

```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```

3. 运行

```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对,这就是我,万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```

可选参数:

- `--text`: 要生成的文本
- `--prompt-audio`: 参考音频路径(用于语音克隆)
- `--prompt-text`: 参考音频对应的文本(使用参考音频时必填)
- `--cfg-value`: CFG 引导强度,默认 2.0
- `--inference-timesteps`: 扩散步数,默认 10
- `--seed`: 随机种子
- `--output`: 输出音频路径

## 运行效果

```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对,这就是我,万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] vae_encode_211680: 1997.43 ms
[time] locenc_0: 1791.50 ms
[time] locenc_64: 1782.49 ms
[time] base_lm initial: 368.19 ms
[time] fsq_init_0: 5.52 ms
[time] fsq_init_64: 4.20 ms
[time] residual_lm initial: 105.79 ms
gen_loop:   0%|          | 0/2000 [00:00
```

---

# VoxCPM1.5-RKNN2 (English)

> VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.
>
> Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
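The FSQ constraint mentioned above can be illustrated with a minimal sketch. This is not VoxCPM's actual quantizer — the function name, level count, and array shapes are assumptions for illustration — but it shows the core idea of finite scalar quantization: bound each latent dimension, then snap it to a small fixed grid of values.

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels: int = 5) -> np.ndarray:
    """FSQ sketch: bound each dimension with tanh, then round it
    to one of `levels` evenly spaced values in [-1, 1]."""
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half       # each dimension now lies in (-half, half)
    return np.round(bounded) / half   # snap to the grid, rescale to [-1, 1]

# Toy latent vector: large values saturate to the grid edges,
# small values collapse to the nearest level (here, 0).
z = np.array([-3.0, -0.2, 0.0, 0.2, 3.0])
print(fsq_quantize(z))  # quantized to the values -1, 0, 0, 0, 1
```

Because every dimension can only take a finite set of levels, the codebook is implicit — no learned vector-quantization codebook or commitment loss is needed, which is what makes FSQ attractive as a structural constraint.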
We’re thrilled to introduce a major upgrade that significantly improves the audio quality and efficiency of VoxCPM while maintaining its core capabilities of context-aware speech generation and zero-shot voice cloning.

| Feature | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |

- Inference speed (RKNN2): RTF of roughly 4.5 on RK3588 (45 s of inference to generate 10 s of audio; no improvement over the previous version)
- Approximate memory usage (RKNN2): ~3.3 GB (likewise no improvement over the previous version)

## Usage

1. Clone the project locally
2. Install dependencies

```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```

3. Run

```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对,这就是我,万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```

Optional parameters:

- `--text`: Text to generate
- `--prompt-audio`: Reference audio path (for voice cloning)
- `--prompt-text`: Text corresponding to the reference audio (required when using reference audio)
- `--cfg-value`: CFG guidance strength, default 2.0
- `--inference-timesteps`: Number of diffusion steps, default 10
- `--seed`: Random seed
- `--output`: Output audio path

## Performance

```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对,这就是我,万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] vae_encode_211680: 1997.43 ms
[time] locenc_0: 1791.50 ms
[time] locenc_64: 1782.49 ms
[time] base_lm initial: 368.19 ms
[time] fsq_init_0: 5.52 ms
[time] fsq_init_64: 4.20 ms
[time] residual_lm initial: 105.79 ms
gen_loop:   0%|          | 0/2000 [00:00
```
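For reference, the RTF figure quoted above is simply wall-clock inference time divided by the duration of the generated audio, and the `[time]` entries in the log can be totaled the same way. A small hypothetical helper (the function names, regex, and sample values are illustrative, not part of this project):

```python
import re

def sum_stage_ms(log_text: str) -> float:
    """Sum all '[time] <stage>: <value> ms' entries in a run log."""
    return sum(float(v) for v in re.findall(r"\[time\] [^:]+: ([\d.]+) ms", log_text))

def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of inference per second of audio (lower is better)."""
    return inference_seconds / audio_seconds

# Two sample lines copied from the log above
log = """[time] vae_encode_0: 2127.35 ms
[time] locenc_0: 1791.50 ms"""
print(sum_stage_ms(log))  # 3918.85 (ms spent in these two stages)
print(rtf(45.0, 10.0))    # 4.5, matching the figure quoted above
```

An RTF above 1.0 means generation is slower than playback, so at RTF ≈ 4.5 this setup is suited to offline synthesis rather than live streaming.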