	---
	license: agpl-3.0
	language:
	- en
	- zh
	base_model:
	- openbmb/VoxCPM1.5
	pipeline_tag: text-to-speech
	tags:
	- rknn
	- rkllm
	- text-to-speech
	- speech
	- speech generation
	- voice cloning
	---

	# VoxCPM1.5-RKNN2

	### (English README see below)

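	> VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.
	> Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that generates continuous speech representations directly from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly improving expressiveness and generation stability.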

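	We are excited to introduce a major upgrade to VoxCPM. This update significantly improves audio quality and efficiency while retaining the core capabilities of context-aware speech generation and zero-shot voice cloning.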

	| Feature | VoxCPM | VoxCPM1.5 |
	|---------|------------|------------|
	| Audio VAE Sampling Rate | 16 kHz | 44.1 kHz |
	| LM Token Rate | 12.5 Hz | 6.25 Hz |
	| Patch Size | 2 | 4 |
	| SFT Support | ✅ | ✅ |
	| LoRA Support | ✅ | ✅ |


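	- Inference speed (RKNN2): RTF of about 4.5 on RK3588 (generating 10 s of audio takes about 45 s of inference; this does not appear to improve on the previous version)
	- Approximate memory usage (RKNN2): about 3.3 GB (likewise no improvement over the previous version)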

	## Usage

	1. Clone the project locally

	2. Install dependencies

	```bash
	pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
	```

	3. Run

	```bash
	python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
	```

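	Optional parameters:
	- `--text`: Text to generate
	- `--prompt-audio`: Reference audio path (for voice cloning)
	- `--prompt-text`: Text corresponding to the reference audio (required when using reference audio)
	- `--cfg-value`: CFG guidance strength, default 2.0
	- `--inference-timesteps`: Number of diffusion steps, default 10
	- `--seed`: Random seed
	- `--output`: Output audio path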

	## Example Run


	```log
	> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
	I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
	I rkllm: loading rkllm model from ./base_lm.rkllm
	I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
	I rkllm: Enabled cpus: [4, 5, 6, 7]
	I rkllm: Enabled cpus num: 4
	I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
	I rkllm: loading rkllm model from ./residual_lm.rkllm
	I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
	I rkllm: Enabled cpus: [4, 5, 6, 7]
	I rkllm: Enabled cpus num: 4
	[time] vae_encode_0: 2127.35 ms
	[time] vae_encode_105840: 2057.71 ms
	[time] vae_encode_211680: 1997.43 ms
	[time] locenc_0: 1791.50 ms
	[time] locenc_64: 1782.49 ms
	[time] base_lm initial: 368.19 ms
	[time] fsq_init_0: 5.52 ms
	[time] fsq_init_64: 4.20 ms
	[time] residual_lm initial: 105.79 ms
	gen_loop: 0%| | 0/2000 [00:00<?, ?it/s]
	[time] lm_to_dit: 1.49 ms
	[time] res_to_dit: 1.11 ms
	100%|██████████| 10/10 [00:00<00:00, 32.15it/s]
	[time] locenc_step: 33.00 ms
	gen_loop: 0%| | 1/2000 [00:00<15:33, 2.14it/s]
	[time] lm_to_dit: 0.67 ms
	[time] res_to_dit: 0.76 ms
	100%|██████████| 10/10 [00:00<00:00, 32.86it/s]
	[time] locenc_step: 31.85 ms
	gen_loop: 0%|▏ | 2/2000 [00:00<15:18, 2.18it/s]
	[time] lm_to_dit: 0.61 ms
	[time] res_to_dit: 0.65 ms
	100%|██████████| 10/10 [00:00<00:00, 32.72it/s]
	[time] locenc_step: 32.01 ms
	gen_loop: 2%|███▉ | 49/2000 [00:22<14:55, 2.18it/s]
	[time] lm_to_dit: 0.88 ms
	[time] res_to_dit: 0.64 ms
	100%|██████████| 10/10 [00:00<00:00, 32.72it/s]
	[time] locenc_step: 32.16 ms
	gen_loop: 2%|███▉ | 49/2000 [00:22<15:05, 2.15it/s]
	[time] vae_decode_0: 2438.31 ms
	[time] vae_decode_60: 2372.92 ms
	[time] vae_decode_120: 2380.40 ms
	[time] vae_decode_180: 2344.88 ms
	Saved: rknn_output.wav
	```

	## Model Conversion

	See https://huggingface.co/happyme531/VoxCPM1.5-RKNN2/tree/main/convert

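	## Known Issues

	- In some cases speech generation can get stuck in an endless loop. The original project appears to have a mechanism for detecting this, but I have not implemented it here.
	- Because of an internal issue in the RKNN toolchain, a single locenc model cannot be configured with two sets of shapes for the two input lengths, so it has to be converted as two separate models.
	- Because of an internal issue in the RKLLM toolchain/runtime, the output tensors of both LLMs come out at one quarter of the correct values; multiplying them by 4 manually gives the correct result.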


	## References
	- [openbmb/VoxCPM1.5](https://huggingface.co/openbmb/VoxCPM1.5)
	- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
	- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)

	# English README

	> VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.

	> Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.

	We’re thrilled to introduce a major upgrade that improves the audio quality and efficiency of VoxCPM while maintaining the core capabilities of context-aware speech generation and zero-shot voice cloning.

	| Feature | VoxCPM | VoxCPM1.5 |
	|---------|------------|------------|
	| Audio VAE Sampling Rate | 16 kHz | 44.1 kHz |
	| LM Token Rate | 12.5 Hz | 6.25 Hz |
	| Patch Size | 2 | 4 |
	| SFT Support | ✅ | ✅ |
	| LoRA Support | ✅ | ✅ |
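
To make the table concrete, the rates can be converted into sample counts. The sketch below is illustrative only: the `CodecConfig` class and its property names are made up for this example, and it assumes that "patch size" is the number of VAE latent frames grouped into one LM token, so that latent rate = token rate × patch size. That reading is not stated in the table, although it agrees with the chunk offsets in the run log (`vae_encode_105840`, `vae_decode_60`: 60 frames × 1764 samples = 105840 samples).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Rates copied from the comparison table (class name is hypothetical)."""
    sample_rate_hz: float
    lm_token_rate_hz: float
    patch_size: int

    @property
    def samples_per_token(self) -> float:
        # Audio samples covered by one LM token.
        return self.sample_rate_hz / self.lm_token_rate_hz

    @property
    def latent_rate_hz(self) -> float:
        # Assumption: one LM token covers `patch_size` latent frames.
        return self.lm_token_rate_hz * self.patch_size

    @property
    def samples_per_latent_frame(self) -> float:
        return self.sample_rate_hz / self.latent_rate_hz


VOXCPM = CodecConfig(16_000, 12.5, 2)
VOXCPM_1_5 = CodecConfig(44_100, 6.25, 4)

for name, cfg in (("VoxCPM", VOXCPM), ("VoxCPM1.5", VOXCPM_1_5)):
    print(name, cfg.samples_per_token, cfg.latent_rate_hz, cfg.samples_per_latent_frame)
```

Under this assumption both versions run their latents at 25 Hz; VoxCPM1.5 halves the number of LM steps per second of audio by doubling the patch size.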

	- Inference speed (RKNN2): RTF (real-time factor) of about 4.5 on RK3588, i.e. roughly 45 s of inference to generate 10 s of audio. This does not appear to improve on the previous version.
	- Approximate memory usage (RKNN2): about 3.3 GB, likewise no improvement over the previous version.
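
For readers unfamiliar with RTF, it is simply inference time divided by the duration of the generated audio, so values above 1 mean slower than real time. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of this repository.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


def estimated_inference_seconds(audio_seconds: float, rtf: float = 4.5) -> float:
    """Rough wait time for a clip of the given length at a given RTF."""
    return audio_seconds * rtf


# Figures quoted in the bullet: about 45 s of inference for 10 s of audio.
print(real_time_factor(45.0, 10.0))        # 4.5
print(estimated_inference_seconds(10.0))   # 45.0
```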

	## Usage

	1. Clone the project locally

	2. Install dependencies

	```bash
	pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
	```

	3. Run

	```bash
	python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
	```

	Optional parameters:
	- `--text`: Text to generate
	- `--prompt-audio`: Reference audio path (for voice cloning)
	- `--prompt-text`: Text corresponding to the reference audio (required when using reference audio)
	- `--cfg-value`: CFG guidance strength, default 2.0
	- `--inference-timesteps`: Number of diffusion steps, default 10
	- `--seed`: Random seed
	- `--output`: Output audio path
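
When the script is driven from other Python code, it can help to assemble the argument list programmatically. The helper below is a sketch, not part of this repository: `build_command` and its `model_dir` parameter are hypothetical, the flags are the ones shown in the command above, and it assumes (as in that command) that all four `*-dir` flags point at the same directory.

```python
import shlex
from typing import List, Optional

SCRIPT = "onnx_infer-rknn2.py"  # script name taken from the run command


def build_command(
    text: str,
    output: str = "rknn_output.wav",
    prompt_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
    cfg_value: float = 2.0,
    inference_timesteps: int = 10,
    seed: Optional[int] = None,
    model_dir: str = ".",
) -> List[str]:
    """Assemble the argv list; `model_dir` feeds all four *-dir flags."""
    if prompt_audio is not None and prompt_text is None:
        raise ValueError("--prompt-text is required when --prompt-audio is given")
    cmd = ["python", SCRIPT]
    for flag in ("--onnx-dir", "--tokenizer-dir", "--base-hf-dir", "--residual-hf-dir"):
        cmd += [flag, model_dir]
    cmd += ["--text", text, "--output", output]
    cmd += ["--cfg-value", str(cfg_value), "--inference-timesteps", str(inference_timesteps)]
    if prompt_audio is not None:
        # The prompt pair is only added when a reference audio file is given.
        cmd += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_command(
    "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!",
    prompt_audio="basic_ref_zh.wav",
    prompt_text="对，这就是我，万人敬仰的太乙真人。",
    seed=1234,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

On the board, the resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository directory.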

	## Performance


	```log
	> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
	I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
	I rkllm: loading rkllm model from ./base_lm.rkllm
	I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
	I rkllm: Enabled cpus: [4, 5, 6, 7]
	I rkllm: Enabled cpus num: 4
	I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
	I rkllm: loading rkllm model from ./residual_lm.rkllm
	I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
	I rkllm: Enabled cpus: [4, 5, 6, 7]
	I rkllm: Enabled cpus num: 4
	[time] vae_encode_0: 2127.35 ms
	[time] vae_encode_105840: 2057.71 ms
	[time] vae_encode_211680: 1997.43 ms
	[time] locenc_0: 1791.50 ms
	[time] locenc_64: 1782.49 ms
	[time] base_lm initial: 368.19 ms
	[time] fsq_init_0: 5.52 ms
	[time] fsq_init_64: 4.20 ms
	[time] residual_lm initial: 105.79 ms
	gen_loop: 0%| | 0/2000 [00:00<?, ?it/s]
	[time] lm_to_dit: 1.49 ms
	[time] res_to_dit: 1.11 ms
	100%|██████████| 10/10 [00:00<00:00, 32.15it/s]
	[time] locenc_step: 33.00 ms
	gen_loop: 0%| | 1/2000 [00:00<15:33, 2.14it/s]
	[time] lm_to_dit: 0.67 ms
	[time] res_to_dit: 0.76 ms
	100%|██████████| 10/10 [00:00<00:00, 32.86it/s]
	[time] locenc_step: 31.85 ms
	gen_loop: 0%|▏ | 2/2000 [00:00<15:18, 2.18it/s]
	[time] lm_to_dit: 0.61 ms
	[time] res_to_dit: 0.65 ms
	100%|██████████| 10/10 [00:00<00:00, 32.72it/s]
	[time] locenc_step: 32.01 ms
	gen_loop: 2%|███▉ | 49/2000 [00:22<14:55, 2.18it/s]
	[time] lm_to_dit: 0.88 ms
	[time] res_to_dit: 0.64 ms
	100%|██████████| 10/10 [00:00<00:00, 32.72it/s]
	[time] locenc_step: 32.16 ms
	gen_loop: 2%|███▉ | 49/2000 [00:22<15:05, 2.15it/s]
	[time] vae_decode_0: 2438.31 ms
	[time] vae_decode_60: 2372.92 ms
	[time] vae_decode_120: 2380.40 ms
	[time] vae_decode_180: 2344.88 ms
	Saved: rknn_output.wav
	```
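
To see where the time goes, the `[time]` lines in a captured log can be summed per stage. The snippet below is a small sketch under the assumption that every timing line has the form `[time] <name>: <value> ms`, as in the log above; `summarize_timings` is a hypothetical helper, and the sample lines are copied from that log.

```python
import re
from collections import defaultdict

# Matches e.g. "[time] vae_decode_60: 2372.92 ms", also when it follows
# progress-bar text on the same line.
TIME_LINE = re.compile(r"\[time\]\s+(?P<name>.+?):\s+(?P<ms>\d+(?:\.\d+)?)\s*ms")


def summarize_timings(log_text: str) -> dict:
    """Return total milliseconds per stage, merging chunk suffixes like `_60`."""
    totals = defaultdict(float)
    for match in TIME_LINE.finditer(log_text):
        stage = re.sub(r"_\d+$", "", match.group("name"))
        totals[stage] += float(match.group("ms"))
    return dict(totals)


SAMPLE = """\
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] vae_encode_211680: 1997.43 ms
[time] base_lm initial: 368.19 ms
[time] vae_decode_0: 2438.31 ms
[time] vae_decode_60: 2372.92 ms
[time] vae_decode_120: 2380.40 ms
[time] vae_decode_180: 2344.88 ms
"""

for stage, total_ms in summarize_timings(SAMPLE).items():
    print(f"{stage}: {total_ms / 1000:.2f} s")
```

In this run the VAE encode and decode stages alone account for roughly 6 s and 9.5 s respectively, on top of the generation loop.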

	## Model Conversion

	See https://huggingface.co/happyme531/VoxCPM1.5-RKNN2/tree/main/convert

	## Known Issues

	- In some cases speech generation can get stuck in an endless loop. The original project appears to have a mechanism for detecting this, but it is not implemented here.
	- Because of an internal issue in the RKNN toolchain, a single locenc model cannot be configured with two sets of shapes for the two input lengths, so it has to be converted as two separate models.
	- Because of an internal issue in the RKLLM toolchain/runtime, the output tensors of both LLMs come out at one quarter of the correct values; multiplying them by 4 manually gives the correct result.
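
Two of these issues lend themselves to a short illustration. The sketch below is not the code used in this repository and is not the original project's loop detection. It shows the ×4 rescaling as a plain function, and one possible stop-gap for runaway generation: capping the number of steps in proportion to the input length. All names, the `steps_per_token` heuristic, and the `min_steps` slack are assumptions for illustration; the hard cap of 2000 matches the `gen_loop` total shown in the log.

```python
from typing import Callable, List, Sequence, Tuple

RKLLM_OUTPUT_SCALE = 4.0  # factor reported in the known issues


def rescale_llm_output(values: Sequence[float], scale: float = RKLLM_OUTPUT_SCALE) -> List[float]:
    """Undo the reported 1/4 scaling of the RKLLM output tensors."""
    return [v * scale for v in values]


def max_generation_steps(
    num_text_tokens: int,
    steps_per_token: float = 6.0,  # hypothetical heuristic, tune per language
    min_steps: int = 16,           # slack so very short inputs are not cut off
    hard_cap: int = 2000,          # loop length seen in the log
) -> int:
    """Step budget that grows with the text but never exceeds the hard cap."""
    return min(hard_cap, int(num_text_tokens * steps_per_token) + min_steps)


def generate(step_fn: Callable[[], Tuple[object, bool]], num_text_tokens: int) -> Tuple[list, bool]:
    """Run `step_fn` until it signals stop or the budget runs out.

    Returns (patches, finished); finished is False when the budget was hit,
    which the caller can treat as a probable runaway generation.
    """
    patches = []
    for _ in range(max_generation_steps(num_text_tokens)):
        patch, stop = step_fn()
        patches.append(patch)
        if stop:
            return patches, True
    return patches, False


# A step function that never stops stands in for a stuck model.
patches, finished = generate(lambda: (0.0, False), num_text_tokens=10)
print(len(patches), finished)
print(rescale_llm_output([0.25, -0.5]))
```

A caller could retry with a different seed when `finished` is False, rather than waiting for all 2000 steps.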

	## References
	- [openbmb/VoxCPM1.5](https://huggingface.co/openbmb/VoxCPM1.5)
	- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
	- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)