	---
	license: agpl-3.0
	language:
	- en
	- zh
	base_model:
	- openbmb/VoxCPM-0.5B
	pipeline_tag: text-to-speech
	tags:
	- rknn
	- rkllm
	- text-to-speech
	- speech
	- speech generation
	- voice cloning
	---

	# VoxCPM-0.5B-RKNN2

	### (See below for the English README)

	VoxCPM 是一种创新的无分词器文本转语音（TTS）系统，重新定义了语音合成的真实感。通过在连续空间中建模语音，它克服了离散标记化的局限，并实现了两项核心能力：上下文感知的语音生成和逼真的零样本语音克隆。
	不同于将语音转换为离散标记的主流方法，VoxCPM 采用端到端的扩散自回归架构，直接从文本生成连续的语音表示。它基于 MiniCPM-4 主干构建，通过分层语言建模和 FSQ 约束实现了隐式的语义-声学解耦，极大地提升了表现力和生成稳定性。

	![模型架构](model_structure.jpg)


	- 推理速度(RKNN2)：RK3588上RTF约4.5（生成10s音频需要推理45s）
	- 大致内存占用(RKNN2)：约3.3GB

	## 使用方法

	1. 克隆项目到本地

	2. 安装依赖

	```bash
	pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
	```

	3. 运行

	```bash
	python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, 这个模型居然在RK3588这个辣鸡SoC上也能完美运行!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
	```

	可选参数：
	- `--text`: 要生成的文本
	- `--prompt-audio`: 参考音频路径（用于语音克隆）
	- `--prompt-text`: 参考音频对应的文本（使用参考音频时必填）
	- `--cfg-value`: CFG引导强度，默认2.0
	- `--inference-timesteps`: 扩散步数，默认10
	- `--seed`: 随机种子
	- `--output`: 输出音频路径

	## 运行效果


	```log
	> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, 这个模型居然在RK3588这个辣鸡SoC上也能完美运行!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

	I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
	I rkllm: loading rkllm model from ./base_lm.rkllm
	I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
	I rkllm: Enabled cpus: [4, 5, 6, 7]
	I rkllm: Enabled cpus num: 4
	I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
	I rkllm: loading rkllm model from ./residual_lm.rkllm
	I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
	I rkllm: Enabled cpus: [4, 5, 6, 7]
	I rkllm: Enabled cpus num: 4
	[time] vae_encode_0: 1502.91 ms
	[time] vae_encode_38400: 1443.79 ms
	[time] vae_encode_76800: 1418.36 ms
	[time] locenc_0: 820.25 ms
	[time] locenc_64: 814.78 ms
	[time] locenc_128: 815.60 ms
	[time] base_lm initial: 549.21 ms
	[time] fsq_init_0: 5.34 ms
	[time] fsq_init_64: 3.95 ms
	[time] fsq_init_128: 4.17 ms
	[time] residual_lm initial: 131.22 ms
	gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
	[time] res_to_dit: 1.01 ms
	100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s]
	[time] locenc_step: 16.43 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 61.20it/s]
	gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms
	[time] res_to_dit: 0.55 ms
	100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s]
	[time] locenc_step: 15.93 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.27it/s]
	gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms

	...

	[time] res_to_dit: 0.59 ms
	100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s]
	[time] locenc_step: 15.73 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.34it/s]
	gen_loop: 6%|████▎ | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms
	[time] res_to_dit: 0.56 ms
	100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s]
	[time] locenc_step: 15.82 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.43it/s]
	gen_loop: 6%|████▎ | 123/2000 [00:34<08:47, 3.56it/s]
	[time] vae_decode_0: 1153.02 ms
	[time] vae_decode_60: 1102.36 ms
	[time] vae_decode_120: 1105.00 ms
	[time] vae_decode_180: 1105.60 ms
	[time] vae_decode_240: 1082.36 ms
	Saved: rknn_output.wav
	```

	## 模型转换

	#### 懒得写了，待补充

	## 已知问题

	- 某些情况下语音生成可能陷入死循环，原项目似乎有检测死循环的机制，但我这里没有实现。
	- 由于RKNN工具链的内部问题，locenc模型没有办法在一个模型里配置两种输入长度的两组shape，因此只能单独转换两个模型。
	- 由于RKLLM工具链/运行时的内部问题，两个LLM的输出张量的数值都只有正确结果的四分之一，手动乘4之后可以得到正确结果。
	- ~~由于RKNN工具链目前不支持非4维输入模型多batch使用多NPU核的数据并行推理，脚本中CFG是分两次单独进行的，速度较慢。~~(已修复)


	## 参考
	- [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
	- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
	- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)

	# English README

	VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.

	Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.

	- Inference speed (RKNN2): real-time factor (RTF) of about 4.5 on RK3588 (about 45 s of inference to generate 10 s of audio)
	- Approximate memory usage (RKNN2): about 3.3 GB
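
The real-time factor quoted above is the ratio of wall-clock inference time to the duration of the audio produced, so any value above 1 means slower than real time. A minimal sketch of that arithmetic follows; the function name `real_time_factor` is only illustrative and is not part of the script.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return wall-clock inference time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Figures from the bullet above: about 45 s of inference for 10 s of audio.
rtf = real_time_factor(45.0, 10.0)
print(f"RTF = {rtf:.1f}")  # prints "RTF = 4.5"
```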

	## Usage

	1. Clone this repository to your machine

	2. Install dependencies

	```bash
	pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
	```

	3. Run

	```bash
	python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
	```

	Optional parameters:
	- `--text`: Text to generate
	- `--prompt-audio`: Reference audio path (for voice cloning)
	- `--prompt-text`: Transcript of the reference audio (required when `--prompt-audio` is given)
	- `--cfg-value`: CFG guidance strength, default 2.0
	- `--inference-timesteps`: Number of diffusion steps, default 10
	- `--seed`: Random seed
	- `--output`: Output audio path
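
For readers who want to drive the script from their own tooling, the flags listed above can be mirrored with `argparse`. This is a sketch only: the flag names and the two stated defaults come from this README, while the types, the `required` setting on `--text`, the default output path, and the `build_parser` and `parse_args` helpers are assumptions. The real `onnx_infer-rknn2.py` may declare them differently, and it also takes the `--onnx-dir`, `--tokenizer-dir`, `--base-hf-dir`, and `--residual-hf-dir` flags shown in the command, which are omitted here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in this README."""
    p = argparse.ArgumentParser(description="VoxCPM RKNN2 inference (sketch)")
    p.add_argument("--text", required=True, help="Text to generate")
    p.add_argument("--prompt-audio", default=None, help="Reference audio path")
    p.add_argument("--prompt-text", default=None, help="Transcript of the reference audio")
    p.add_argument("--cfg-value", type=float, default=2.0, help="CFG guidance strength")
    p.add_argument("--inference-timesteps", type=int, default=10, help="Diffusion steps")
    p.add_argument("--seed", type=int, default=None, help="Random seed")
    # Assumed default; the README only shows this path being passed explicitly.
    p.add_argument("--output", default="rknn_output.wav", help="Output audio path")
    return p


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # The README states that --prompt-text is required with reference audio.
    if args.prompt_audio and not args.prompt_text:
        parser.error("--prompt-text is required when --prompt-audio is given")
    return args


args = parse_args(["--text", "Hello from RK3588", "--seed", "1234"])
print(args.cfg_value, args.inference_timesteps, args.seed)
```

Checking the `--prompt-audio`/`--prompt-text` pairing up front means a missing transcript fails before any model is loaded.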

	## Performance


	```log
	> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, 这个模型居然在RK3588这个辣鸡SoC上也能完美运行!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

	I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
	I rkllm: loading rkllm model from ./base_lm.rkllm
	I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
	I rkllm: Enabled cpus: [4, 5, 6, 7]
	I rkllm: Enabled cpus num: 4
	I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
	I rkllm: loading rkllm model from ./residual_lm.rkllm
	I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
	I rkllm: Enabled cpus: [4, 5, 6, 7]
	I rkllm: Enabled cpus num: 4
	[time] vae_encode_0: 1502.91 ms
	[time] vae_encode_38400: 1443.79 ms
	[time] vae_encode_76800: 1418.36 ms
	[time] locenc_0: 820.25 ms
	[time] locenc_64: 814.78 ms
	[time] locenc_128: 815.60 ms
	[time] base_lm initial: 549.21 ms
	[time] fsq_init_0: 5.34 ms
	[time] fsq_init_64: 3.95 ms
	[time] fsq_init_128: 4.17 ms
	[time] residual_lm initial: 131.22 ms
	gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
	[time] res_to_dit: 1.01 ms
	100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s]
	[time] locenc_step: 16.43 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 61.20it/s]
	gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms
	[time] res_to_dit: 0.55 ms
	100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s]
	[time] locenc_step: 15.93 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.27it/s]
	gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms

	...

	[time] res_to_dit: 0.59 ms
	100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s]
	[time] locenc_step: 15.73 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.34it/s]
	gen_loop: 6%|████▎ | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms
	[time] res_to_dit: 0.56 ms
	100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s]
	[time] locenc_step: 15.82 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.43it/s]
	gen_loop: 6%|████▎ | 123/2000 [00:34<08:47, 3.56it/s]
	[time] vae_decode_0: 1153.02 ms
	[time] vae_decode_60: 1102.36 ms
	[time] vae_decode_120: 1105.00 ms
	[time] vae_decode_180: 1105.60 ms
	[time] vae_decode_240: 1082.36 ms
	Saved: rknn_output.wav
	```
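
The `[time]` lines in a log like the one above can be tallied to see where the time goes. The sketch below is not part of the repository. It assumes only the `[time] <name>: <value> ms` format visible in the log; the `sum_stage_times` helper and the choice to merge chunked entries by stripping a trailing `_<number>` from the name are illustrative.

```python
import re
from collections import defaultdict

# Matches entries such as "[time] vae_encode_38400: 1443.79 ms", even when
# tqdm progress-bar output shares the same line.
TIME_RE = re.compile(r"\[time\] ([^:]+): ([0-9.]+) ms")


def sum_stage_times(log_text: str) -> dict:
    """Sum milliseconds per stage, merging chunked names like locenc_0/locenc_64."""
    totals = defaultdict(float)
    for name, ms in TIME_RE.findall(log_text):
        stage = re.sub(r"_\d+$", "", name.strip())  # vae_decode_60 -> vae_decode
        totals[stage] += float(ms)
    return dict(totals)


sample = """\
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] locenc_0: 820.25 ms
gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
"""
for stage, total in sum_stage_times(sample).items():
    print(f"{stage}: {total:.2f} ms")
```
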

	## Model Conversion

	#### TODO: Documentation to be added

	## Known Issues

	- Generation can occasionally get stuck in an endless loop. The original project appears to include loop detection, but this port does not implement it.
	- Due to a limitation in the RKNN toolchain, a single locenc model cannot be configured with two sets of input shapes (one per input length), so two separate models are converted instead.
	- Due to an issue in the RKLLM toolchain/runtime, the output tensors of both LLMs come out at one quarter of the correct values. Multiplying them by 4 restores the correct result.
	- ~~Because the RKNN toolchain does not currently support multi-core data-parallel inference over a batch for models with non-4D inputs, the script runs the two CFG passes one after the other, which is slow.~~ (Fixed)
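
The last item above concerns classifier-free guidance (CFG), which needs a conditional and an unconditional model output at every diffusion step. The sketch below shows the commonly used blend `uncond + cfg_value * (cond - uncond)` and checks that one batch-of-two call gives the same result as two separate calls. It is an illustration, not the script's code: `toy_model`, `cfg_two_pass`, and `cfg_batched` are hypothetical names, the toy model stands in for the real network, and whether the script uses exactly this blend is an unverified assumption.

```python
import numpy as np


def toy_model(x: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion network; works on any leading batch size."""
    return np.tanh(x + 0.5 * cond)


def cfg_two_pass(x, cond, cfg_value):
    """Run the conditional and unconditional passes one after the other."""
    out_cond = toy_model(x, cond)
    out_uncond = toy_model(x, np.zeros_like(cond))  # zeroed condition
    return out_uncond + cfg_value * (out_cond - out_uncond)


def cfg_batched(x, cond, cfg_value):
    """Stack both cases into one batch of two so a single call covers them."""
    xb = np.concatenate([x, x], axis=0)
    cb = np.concatenate([cond, np.zeros_like(cond)], axis=0)
    out_cond, out_uncond = np.split(toy_model(xb, cb), 2, axis=0)
    return out_uncond + cfg_value * (out_cond - out_uncond)


rng = np.random.default_rng(1234)
x = rng.standard_normal((1, 64, 4)).astype(np.float32)
cond = rng.standard_normal((1, 64, 4)).astype(np.float32)
same = np.allclose(cfg_two_pass(x, cond, 2.0), cfg_batched(x, cond, 2.0))
print("batched CFG matches two-pass CFG:", same)
```

When a runtime can place each batch element on a different NPU core, the batched form lets the two halves run in parallel instead of back to back.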

	## References
	- [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
	- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
	- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)