happyme531 committed on
Commit
921aee2
·
verified ·
1 Parent(s): 621e4aa

Add CFG parallel inference with new library

Browse files
Files changed (3) hide show
  1. README.md +72 -191
  2. onnx_infer-rknn2.py +26 -48
  3. onnx_infer.py +14 -22
README.md CHANGED
@@ -25,7 +25,7 @@ VoxCPM 是一种创新的无分词器文本转语音(TTS)系统,重新定
25
  ![模型架构](model_structure.jpg)
26
 
27
 
28
- - 推理速度(RKNN2):RK3588上RTF约8(生成10s音频需要推理80s
29
  - 大致内存占用(RKNN2):约3.3GB
30
 
31
  ## 使用方法
@@ -35,7 +35,7 @@ VoxCPM 是一种创新的无分词器文本转语音(TTS)系统,重新定
35
  2. 安装依赖
36
 
37
  ```bash
38
- pip install "numpy<2" scipy soundfile tqdm transformers sentencepiece rknn-toolkit-lite2
39
  ```
40
 
41
  3. 运行
@@ -69,101 +69,42 @@ I rkllm: loading rkllm model from ./residual_lm.rkllm
69
  I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
70
  I rkllm: Enabled cpus: [4, 5, 6, 7]
71
  I rkllm: Enabled cpus num: 4
72
- W rknn-toolkit-lite2 version: 2.3.2
73
- I RKNN: [18:58:26.264] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
74
- I RKNN: [18:58:26.264] RKNN Driver Information, version: 0.9.8
75
- I RKNN: [18:58:26.265] RKNN Model Information, version: 6, toolkit version: 2.3.0(compiler version: 2.3.0 (@2024-11-07T08:11:34)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
76
- W RKNN: [18:58:26.404] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
77
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
78
- W rknn-toolkit-lite2 version: 2.3.2
79
- I RKNN: [18:58:26.537] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
80
- I RKNN: [18:58:26.537] RKNN Driver Information, version: 0.9.8
81
- I RKNN: [18:58:26.537] RKNN Model Information, version: 6, toolkit version: 2.3.0(compiler version: 2.3.0 (@2024-11-07T08:11:34)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
82
- W RKNN: [18:58:26.616] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
83
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
84
- W rknn-toolkit-lite2 version: 2.3.2
85
- I RKNN: [18:58:26.795] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
86
- I RKNN: [18:58:26.795] RKNN Driver Information, version: 0.9.8
87
- I RKNN: [18:58:26.795] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (e045de294f@2025-04-07T19:48:25)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
88
- W RKNN: [18:58:27.020] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
89
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
90
- W rknn-toolkit-lite2 version: 2.3.2
91
- I RKNN: [18:58:27.194] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
92
- I RKNN: [18:58:27.194] RKNN Driver Information, version: 0.9.8
93
- I RKNN: [18:58:27.194] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (e045de294f@2025-04-07T19:48:25)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
94
- W RKNN: [18:58:27.317] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
95
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
96
- W rknn-toolkit-lite2 version: 2.3.2
97
- I RKNN: [18:58:27.431] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
98
- I RKNN: [18:58:27.431] RKNN Driver Information, version: 0.9.8
99
- I RKNN: [18:58:27.431] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (@2025-04-03T08:26:16)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: dynamic_shape
100
- W rknn-toolkit-lite2 version: 2.3.2
101
- I RKNN: [18:58:27.547] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
102
- I RKNN: [18:58:27.547] RKNN Driver Information, version: 0.9.8
103
- I RKNN: [18:58:27.547] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (e045de294f@2025-04-07T19:48:25)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
104
- W RKNN: [18:58:27.549] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
105
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
106
- W rknn-toolkit-lite2 version: 2.3.2
107
- I RKNN: [18:58:27.728] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
108
- I RKNN: [18:58:27.728] RKNN Driver Information, version: 0.9.8
109
- I RKNN: [18:58:27.728] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (e045de294f@2025-04-07T19:48:25)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
110
- W RKNN: [18:58:27.819] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
111
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
112
- W rknn-toolkit-lite2 version: 2.3.2
113
- I RKNN: [18:58:27.937] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
114
- I RKNN: [18:58:27.937] RKNN Driver Information, version: 0.9.8
115
- I RKNN: [18:58:27.937] RKNN Model Information, version: 6, toolkit version: 2.3.0(compiler version: 2.3.0 (@2024-11-07T08:11:34)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
116
- W RKNN: [18:58:27.940] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
117
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
118
- W rknn-toolkit-lite2 version: 2.3.2
119
- I RKNN: [18:58:28.058] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
120
- I RKNN: [18:58:28.058] RKNN Driver Information, version: 0.9.8
121
- I RKNN: [18:58:28.058] RKNN Model Information, version: 6, toolkit version: 2.3.0(compiler version: 2.3.0 (@2024-11-07T08:11:34)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
122
- W RKNN: [18:58:28.060] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
123
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
124
- [time] vae_encode_0: 1601.56 ms
125
- [time] vae_encode_38400: 1605.46 ms
126
- [time] vae_encode_76800: 1591.07 ms
127
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.
128
- [time] locenc_0: 819.49 ms
129
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.
130
- [time] locenc_64: 818.33 ms
131
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.
132
- [time] locenc_128: 819.09 ms
133
- [time] base_lm initial: 579.08 ms
134
- [time] fsq_init_0: 2.54 ms
135
- [time] fsq_init_64: 1.86 ms
136
- [time] fsq_init_128: 1.79 ms
137
- [time] residual_lm initial: 139.10 ms
138
- gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 0.82 ms
139
- [time] res_to_dit: 0.56 ms
140
- 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 33.32it/s]
141
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.4it/s]
142
- [time] locenc_step: 16.32 ms
143
- gen_loop: 0%| | 1/2000 [00:00<14:30, 2.30it/s][time] lm_to_dit: 0.57 ms
144
- [time] res_to_dit: 0.44 ms
145
- 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 33.10it/s]
146
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.1it/s]
147
- [time] locenc_step: 15.84 ms
148
- gen_loop: 0%| | 2/2000 [00:00<14:27, 2.30it/s][time] lm_to_dit: 0.56 ms
149
- [time] res_to_dit: 0.50 ms
150
- 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 31.93it/s]
151
 
152
  ...
153
 
154
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.5it/s]
155
- [time] locenc_step: 15.88 ms
156
- gen_loop: 6%|███| 123/2000 [00:53<13:35, 2.30it/s][time] lm_to_dit: 0.57 ms
157
- [time] res_to_dit: 0.49 ms
158
- 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 32.94it/s]
159
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.6it/s]
160
- [time] locenc_step: 15.84 ms
161
- gen_loop: 6%|███| 123/2000 [00:54<13:44, 2.28it/s]
162
- [time] vae_decode_0: 1044.00 ms
163
- [time] vae_decode_60: 1018.03 ms
164
- [time] vae_decode_120: 1020.72 ms
165
- [time] vae_decode_180: 1021.19 ms
166
- [time] vae_decode_240: 1006.85 ms
167
  Saved: rknn_output.wav
168
  ```
169
 
@@ -176,7 +117,7 @@ Saved: rknn_output.wav
176
  - 某些情况下语音生成可能陷入死循环,原项目似乎有检测死循环的机制,但我这里没有实现。
177
  - 由于RKNN工具链的内部问题,locenc模型没有办法在一个模型里配置两种输入长度的两组shape,因此只能单独转换两个模型。
178
  - 由于RKLLM工具链/运行时的内部问题,两个LLM的输出张量的数值都只有正确结果的四分之一,手动乘4之后可以得到正确结果。
179
- - 由于RKNN工具链目前不支持非4维输入模型多batch使用多NPU核的数据并行推理,脚本中CFG是分两次单独进行的,速度较慢。
180
 
181
 
182
  ## 参考
@@ -190,7 +131,7 @@ VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefine
190
 
191
  Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
192
 
193
- - Inference speed (RKNN2): RTF approximately 8 on RK3588 (80s inference time to generate 10s audio)
194
  - Approximate memory usage (RKNN2): ~3.3GB
195
 
196
  ## Usage
@@ -200,7 +141,7 @@ Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM ad
200
  2. Install dependencies
201
 
202
  ```bash
203
- pip install "numpy<2" scipy soundfile tqdm transformers sentencepiece rknn-toolkit-lite2
204
  ```
205
 
206
  3. Run
@@ -234,104 +175,44 @@ I rkllm: loading rkllm model from ./residual_lm.rkllm
234
  I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
235
  I rkllm: Enabled cpus: [4, 5, 6, 7]
236
  I rkllm: Enabled cpus num: 4
237
- W rknn-toolkit-lite2 version: 2.3.2
238
- I RKNN: [18:58:26.264] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
239
- I RKNN: [18:58:26.264] RKNN Driver Information, version: 0.9.8
240
- I RKNN: [18:58:26.265] RKNN Model Information, version: 6, toolkit version: 2.3.0(compiler version: 2.3.0 (@2024-11-07T08:11:34)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
241
- W RKNN: [18:58:26.404] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
242
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
243
- W rknn-toolkit-lite2 version: 2.3.2
244
- I RKNN: [18:58:26.537] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
245
- I RKNN: [18:58:26.537] RKNN Driver Information, version: 0.9.8
246
- I RKNN: [18:58:26.537] RKNN Model Information, version: 6, toolkit version: 2.3.0(compiler version: 2.3.0 (@2024-11-07T08:11:34)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
247
- W RKNN: [18:58:26.616] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
248
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
249
- W rknn-toolkit-lite2 version: 2.3.2
250
- I RKNN: [18:58:26.795] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
251
- I RKNN: [18:58:26.795] RKNN Driver Information, version: 0.9.8
252
- I RKNN: [18:58:26.795] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (e045de294f@2025-04-07T19:48:25)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
253
- W RKNN: [18:58:27.020] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
254
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
255
- W rknn-toolkit-lite2 version: 2.3.2
256
- I RKNN: [18:58:27.194] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
257
- I RKNN: [18:58:27.194] RKNN Driver Information, version: 0.9.8
258
- I RKNN: [18:58:27.194] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (e045de294f@2025-04-07T19:48:25)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
259
- W RKNN: [18:58:27.317] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
260
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
261
- W rknn-toolkit-lite2 version: 2.3.2
262
- I RKNN: [18:58:27.431] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
263
- I RKNN: [18:58:27.431] RKNN Driver Information, version: 0.9.8
264
- I RKNN: [18:58:27.431] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (@2025-04-03T08:26:16)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: dynamic_shape
265
- W rknn-toolkit-lite2 version: 2.3.2
266
- I RKNN: [18:58:27.547] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
267
- I RKNN: [18:58:27.547] RKNN Driver Information, version: 0.9.8
268
- I RKNN: [18:58:27.547] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (e045de294f@2025-04-07T19:48:25)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
269
- W RKNN: [18:58:27.549] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
270
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
271
- W rknn-toolkit-lite2 version: 2.3.2
272
- I RKNN: [18:58:27.728] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
273
- I RKNN: [18:58:27.728] RKNN Driver Information, version: 0.9.8
274
- I RKNN: [18:58:27.728] RKNN Model Information, version: 6, toolkit version: 2.3.2(compiler version: 2.3.2 (e045de294f@2025-04-07T19:48:25)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
275
- W RKNN: [18:58:27.819] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
276
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
277
- W rknn-toolkit-lite2 version: 2.3.2
278
- I RKNN: [18:58:27.937] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
279
- I RKNN: [18:58:27.937] RKNN Driver Information, version: 0.9.8
280
- I RKNN: [18:58:27.937] RKNN Model Information, version: 6, toolkit version: 2.3.0(compiler version: 2.3.0 (@2024-11-07T08:11:34)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
281
- W RKNN: [18:58:27.940] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
282
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
283
- W rknn-toolkit-lite2 version: 2.3.2
284
- I RKNN: [18:58:28.058] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27)
285
- I RKNN: [18:58:28.058] RKNN Driver Information, version: 0.9.8
286
- I RKNN: [18:58:28.058] RKNN Model Information, version: 6, toolkit version: 2.3.0(compiler version: 2.3.0 (@2024-11-07T08:11:34)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape
287
- W RKNN: [18:58:28.060] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes
288
- W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
289
- [time] vae_encode_0: 1601.56 ms
290
- [time] vae_encode_38400: 1605.46 ms
291
- [time] vae_encode_76800: 1591.07 ms
292
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.
293
- [time] locenc_0: 819.49 ms
294
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.
295
- [time] locenc_64: 818.33 ms
296
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.
297
- [time] locenc_128: 819.09 ms
298
- [time] base_lm initial: 579.08 ms
299
- [time] fsq_init_0: 2.54 ms
300
- [time] fsq_init_64: 1.86 ms
301
- [time] fsq_init_128: 1.79 ms
302
- [time] residual_lm initial: 139.10 ms
303
- gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 0.82 ms
304
- [time] res_to_dit: 0.56 ms
305
- 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 33.32it/s]
306
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.4it/s]
307
- [time] locenc_step: 16.32 ms
308
- gen_loop: 0%| | 1/2000 [00:00<14:30, 2.30it/s][time] lm_to_dit: 0.57 ms
309
- [time] res_to_dit: 0.44 ms
310
- 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 33.10it/s]
311
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.1it/s]
312
- [time] locenc_step: 15.84 ms
313
- gen_loop: 0%| | 2/2000 [00:00<14:27, 2.30it/s][time] lm_to_dit: 0.56 ms
314
- [time] res_to_dit: 0.50 ms
315
- 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 31.93it/s]
316
 
317
  ...
318
 
319
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.5it/s]
320
- [time] locenc_step: 15.88 ms
321
- gen_loop: 6%|███| 123/2000 [00:53<13:35, 2.30it/s][time] lm_to_dit: 0.57 ms
322
- [time] res_to_dit: 0.49 ms
323
- 100%|█████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 32.94it/s]
324
- W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.6it/s]
325
- [time] locenc_step: 15.84 ms
326
- gen_loop: 6%|███| 123/2000 [00:54<13:44, 2.28it/s]
327
- [time] vae_decode_0: 1044.00 ms
328
- [time] vae_decode_60: 1018.03 ms
329
- [time] vae_decode_120: 1020.72 ms
330
- [time] vae_decode_180: 1021.19 ms
331
- [time] vae_decode_240: 1006.85 ms
332
  Saved: rknn_output.wav
333
  ```
334
-
335
  ## Model Conversion
336
 
337
  #### TODO: Documentation to be added
@@ -341,7 +222,7 @@ Saved: rknn_output.wav
341
  - In some cases, speech generation may fall into an infinite loop. The original project seems to have a mechanism to detect infinite loops, but it is not implemented here.
342
  - Due to internal issues with the RKNN toolchain, the locenc model cannot configure two sets of shapes for two different input lengths in a single model, so two separate models must be converted.
343
  - Due to internal issues with the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one-quarter of the correct result. Multiplying by 4 manually yields the correct result.
344
- - Since the RKNN toolchain currently does not support data-parallel inference using multiple NPU cores for non-4D input models with multiple batches, CFG in the script is performed separately in two passes, which is relatively slow.
345
 
346
  ## References
347
  - [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
 
25
  ![模型架构](model_structure.jpg)
26
 
27
 
28
+ - 推理速度(RKNN2):RK3588上RTF约4.5(生成10s音频需要推理45s)
29
  - 大致内存占用(RKNN2):约3.3GB
30
 
31
  ## 使用方法
 
35
  2. 安装依赖
36
 
37
  ```bash
38
+ pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
39
  ```
40
 
41
  3. 运行
 
69
  I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
70
  I rkllm: Enabled cpus: [4, 5, 6, 7]
71
  I rkllm: Enabled cpus num: 4
72
+ [time] vae_encode_0: 1502.91 ms
73
+ [time] vae_encode_38400: 1443.79 ms
74
+ [time] vae_encode_76800: 1418.36 ms
75
+ [time] locenc_0: 820.25 ms
76
+ [time] locenc_64: 814.78 ms
77
+ [time] locenc_128: 815.60 ms
78
+ [time] base_lm initial: 549.21 ms
79
+ [time] fsq_init_0: 5.34 ms
80
+ [time] fsq_init_64: 3.95 ms
81
+ [time] fsq_init_128: 4.17 ms
82
+ [time] residual_lm initial: 131.22 ms
83
+ gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
84
+ [time] res_to_dit: 1.01 ms
85
+ 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s]
86
+ [time] locenc_step: 16.43 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 61.20it/s]
87
+ gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms
88
+ [time] res_to_dit: 0.55 ms
89
+ 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s]
90
+ [time] locenc_step: 15.93 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.27it/s]
91
+ gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
92
 
93
  ...
94
 
95
+ [time] res_to_dit: 0.59 ms
96
+ 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s]
97
+ [time] locenc_step: 15.73 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.34it/s]
98
+ gen_loop: 6%|████▎ | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms
99
+ [time] res_to_dit: 0.56 ms
100
+ 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s]
101
+ [time] locenc_step: 15.82 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.43it/s]
102
+ gen_loop: 6%|████▎ | 123/2000 [00:34<08:47, 3.56it/s]
103
+ [time] vae_decode_0: 1153.02 ms
104
+ [time] vae_decode_60: 1102.36 ms
105
+ [time] vae_decode_120: 1105.00 ms
106
+ [time] vae_decode_180: 1105.60 ms
107
+ [time] vae_decode_240: 1082.36 ms
108
  Saved: rknn_output.wav
109
  ```
110
 
 
117
  - 某些情况下语音生成可能陷入死循环,原项目似乎有检测死循环的机制,但我这里没有实现。
118
  - 由于RKNN工具链的内部问题,locenc模型没有办法在一个模型里配置两种输入长度的两组shape,因此只能单独转换两个模型。
119
  - 由于RKLLM工具链/运行时的内部问题,两个LLM的输出张量的数值都只有正确结果的四分之一,手动乘4之后可以得到正确结果。
120
+ - ~~由于RKNN工具链目前不支持非4维输入模型多batch使用多NPU核的数据并行推理,脚本中CFG是分两次单独进行的,速度较慢。~~(已修复)
121
 
122
 
123
  ## 参考
 
131
 
132
  Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
133
 
134
+ - Inference speed (RKNN2): RTF approximately 4.5 on RK3588 (45s inference time to generate 10s audio)
135
  - Approximate memory usage (RKNN2): ~3.3GB
136
 
137
  ## Usage
 
141
  2. Install dependencies
142
 
143
  ```bash
144
+ pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
145
  ```
146
 
147
  3. Run
 
175
  I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
176
  I rkllm: Enabled cpus: [4, 5, 6, 7]
177
  I rkllm: Enabled cpus num: 4
178
+ [time] vae_encode_0: 1502.91 ms
179
+ [time] vae_encode_38400: 1443.79 ms
180
+ [time] vae_encode_76800: 1418.36 ms
181
+ [time] locenc_0: 820.25 ms
182
+ [time] locenc_64: 814.78 ms
183
+ [time] locenc_128: 815.60 ms
184
+ [time] base_lm initial: 549.21 ms
185
+ [time] fsq_init_0: 5.34 ms
186
+ [time] fsq_init_64: 3.95 ms
187
+ [time] fsq_init_128: 4.17 ms
188
+ [time] residual_lm initial: 131.22 ms
189
+ gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
190
+ [time] res_to_dit: 1.01 ms
191
+ 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s]
192
+ [time] locenc_step: 16.43 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 61.20it/s]
193
+ gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms
194
+ [time] res_to_dit: 0.55 ms
195
+ 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s]
196
+ [time] locenc_step: 15.93 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.27it/s]
197
+ gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
198
 
199
  ...
200
 
201
+ [time] res_to_dit: 0.59 ms
202
+ 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s]
203
+ [time] locenc_step: 15.73 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.34it/s]
204
+ gen_loop: 6%|████▎ | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms
205
+ [time] res_to_dit: 0.56 ms
206
+ 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s]
207
+ [time] locenc_step: 15.82 ms████████████████████████████████████▍ | 7/10 [00:00<00:00, 67.43it/s]
208
+ gen_loop: 6%|████▎ | 123/2000 [00:34<08:47, 3.56it/s]
209
+ [time] vae_decode_0: 1153.02 ms
210
+ [time] vae_decode_60: 1102.36 ms
211
+ [time] vae_decode_120: 1105.00 ms
212
+ [time] vae_decode_180: 1105.60 ms
213
+ [time] vae_decode_240: 1082.36 ms
214
  Saved: rknn_output.wav
215
  ```
 
216
  ## Model Conversion
217
 
218
  #### TODO: Documentation to be added
 
222
  - In some cases, speech generation may fall into an infinite loop. The original project seems to have a mechanism to detect infinite loops, but it is not implemented here.
223
  - Due to internal issues with the RKNN toolchain, the locenc model cannot configure two sets of shapes for two different input lengths in a single model, so two separate models must be converted.
224
  - Due to internal issues with the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one-quarter of the correct result. Multiplying by 4 manually yields the correct result.
225
+ - ~~Since the RKNN toolchain currently does not support data-parallel inference using multiple NPU cores for non-4D input models with multiple batches, CFG in the script is performed separately in two passes, which is relatively slow.~~ (Solved)
226
 
227
  ## References
228
  - [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
onnx_infer-rknn2.py CHANGED
@@ -12,7 +12,7 @@ from rkllm_binding import *
12
 
13
  from transformers import AutoTokenizer
14
 
15
- import ztu_somemodelruntime_rknnlite2 as ort
16
 
17
  def mask_multichar_chinese_tokens(tokenizer):
18
  # Pre-compute multi-character tokens (length >= 2, pure Chinese characters)
@@ -97,12 +97,6 @@ def mask_multichar_chinese_tokens(tokenizer):
97
 
98
  return CharTokenizerWrapper(tokenizer)
99
 
100
-
101
- def load_rknn(path: str, providers):
102
- if not os.path.exists(path):
103
- raise FileNotFoundError(f"ONNX file not found: {path}")
104
- return ort.InferenceSession(path, providers=providers)
105
-
106
  def ensure_numpy(arr, dtype=None):
107
  np_arr = np.asarray(arr)
108
  if dtype is not None:
@@ -110,10 +104,10 @@ def ensure_numpy(arr, dtype=None):
110
  return np_arr
111
 
112
 
113
- def run_ort(session: ort.InferenceSession, inputs: dict, name: str = None):
114
  start = time.perf_counter()
115
  ort_inputs = {k: ensure_numpy(v) for k, v in inputs.items()}
116
- outputs = session.run(None, ort_inputs) # noqa: SLF001
117
  if name:
118
  elapsed_ms = (time.perf_counter() - start) * 1000
119
  print(f"[time] {name}: {elapsed_ms:.2f} ms")
@@ -249,35 +243,27 @@ def cfm_euler_with_onnx_step(
249
  t_in = np.full((b,), t, dtype=dtype)
250
  dt_in = np.full((b,), dt if mean_mode else 0.0, dtype=dtype)
251
 
252
- # run conditional branch (pos)
253
- dphi_dt_pos = np.asarray(
254
- run_ort(
255
- dit_sess,
256
- {
257
- "x": x,
258
- "mu": mu,
259
- "t": t_in,
260
- "cond": cond,
261
- "dt": dt_in,
262
- },
263
- ),
264
- dtype=dtype,
265
- )
266
 
267
- # run "negative" branch (unconditional: mu=0, cond=0)
268
- dphi_dt_neg = np.asarray(
269
  run_ort(
270
  dit_sess,
271
  {
272
- "x": x,
273
- "mu": np.zeros_like(mu),
274
- "t": t_in,
275
- "cond": np.zeros_like(cond),
276
- "dt": np.zeros_like(dt_in),
277
  },
 
278
  ),
279
  dtype=dtype,
280
  )
 
281
 
282
  if use_cfg_zero_star:
283
  positive_flat = dphi_dt_pos.reshape(b, -1)
@@ -361,16 +347,8 @@ def main():
361
  parser.add_argument("--min-len", type=int, default=2, help="Minimum generated patch count before stop allowed.")
362
  parser.add_argument("--max-len", type=int, default=2000, help="Maximum generated patch count.")
363
  parser.add_argument("--seed", type=int, default=None, help="Random seed for reproducibility.")
364
- parser.add_argument(
365
- "--providers",
366
- nargs="+",
367
- default=None,
368
- help="ONNX Runtime providers (e.g., CUDAExecutionProvider CPUExecutionProvider).",
369
- )
370
- args = parser.parse_args()
371
-
372
- providers = args.providers or ["CUDAExecutionProvider", "CPUExecutionProvider"]
373
 
 
374
  # Seed
375
  if args.seed is not None:
376
  random.seed(args.seed)
@@ -401,15 +379,15 @@ def main():
401
  fixed_seq_len = 64 # target platform prefers fixed seq len for locenc/fsq
402
 
403
  # Load ONNX sessions
404
- vae_encode_sess = load_rknn(os.path.join(args.onnx_dir, "audio_vae_encode.rknn"), providers)
405
- vae_decode_sess = load_rknn(os.path.join(args.onnx_dir, "audio_vae_decode.rknn"), providers)
406
- locenc_64_sess = load_rknn(os.path.join(args.onnx_dir, "locenc_64.rknn"), providers)
407
- locenc_1_sess = load_rknn(os.path.join(args.onnx_dir, "locenc_1.rknn"), providers)
408
- fsq_sess = load_rknn(os.path.join(args.onnx_dir, "fsq_layer.rknn"), providers)
409
- stop_sess = load_rknn(os.path.join(args.onnx_dir, "stop_head.rknn"), providers)
410
- dit_step_sess = load_rknn(os.path.join(args.onnx_dir, "dit_step.rknn"), providers)
411
- lm_to_dit_sess = load_rknn(os.path.join(args.onnx_dir, "lm_to_dit_proj.rknn"), providers)
412
- res_to_dit_sess = load_rknn(os.path.join(args.onnx_dir, "res_to_dit_proj.rknn"), providers)
413
 
414
  # Build text/audio features
415
  if args.prompt_audio:
 
12
 
13
  from transformers import AutoTokenizer
14
 
15
+ import ztu_somemodelruntime_ez_rknn_async as ort
16
 
17
  def mask_multichar_chinese_tokens(tokenizer):
18
  # Pre-compute multi-character tokens (length >= 2, pure Chinese characters)
 
97
 
98
  return CharTokenizerWrapper(tokenizer)
99
 
 
 
 
 
 
 
100
  def ensure_numpy(arr, dtype=None):
101
  np_arr = np.asarray(arr)
102
  if dtype is not None:
 
104
  return np_arr
105
 
106
 
107
+ def run_ort(session: ort.InferenceSession, inputs: dict, name: str = None, run_options=None):
108
  start = time.perf_counter()
109
  ort_inputs = {k: ensure_numpy(v) for k, v in inputs.items()}
110
+ outputs = session.run(None, ort_inputs, run_options=run_options) # noqa: SLF001
111
  if name:
112
  elapsed_ms = (time.perf_counter() - start) * 1000
113
  print(f"[time] {name}: {elapsed_ms:.2f} ms")
 
243
  t_in = np.full((b,), t, dtype=dtype)
244
  dt_in = np.full((b,), dt if mean_mode else 0.0, dtype=dtype)
245
 
246
+ x_batch = np.concatenate([x, x], axis=0)
247
+ mu_batch = np.concatenate([mu, np.zeros_like(mu)], axis=0)
248
+ t_batch = np.concatenate([t_in, t_in], axis=0)
249
+ cond_batch = np.concatenate([cond, np.zeros_like(cond)], axis=0)
250
+ dt_batch = np.concatenate([dt_in, np.zeros_like(dt_in)], axis=0)
 
 
 
 
 
 
 
 
 
251
 
252
+ dphi_dt_batch = np.asarray(
 
253
  run_ort(
254
  dit_sess,
255
  {
256
+ "x": x_batch,
257
+ "mu": mu_batch,
258
+ "t": t_batch,
259
+ "cond": cond_batch,
260
+ "dt": dt_batch,
261
  },
262
+ run_options={"ztu_modelrt_dispatch_batch": True}
263
  ),
264
  dtype=dtype,
265
  )
266
+ dphi_dt_pos, dphi_dt_neg = np.split(dphi_dt_batch, [b], axis=0)
267
 
268
  if use_cfg_zero_star:
269
  positive_flat = dphi_dt_pos.reshape(b, -1)
 
347
  parser.add_argument("--min-len", type=int, default=2, help="Minimum generated patch count before stop allowed.")
348
  parser.add_argument("--max-len", type=int, default=2000, help="Maximum generated patch count.")
349
  parser.add_argument("--seed", type=int, default=None, help="Random seed for reproducibility.")
 
 
 
 
 
 
 
 
 
350
 
351
+ args = parser.parse_args()
352
  # Seed
353
  if args.seed is not None:
354
  random.seed(args.seed)
 
379
  fixed_seq_len = 64 # target platform prefers fixed seq len for locenc/fsq
380
 
381
  # Load ONNX sessions
382
+ vae_encode_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "audio_vae_encode.rknn"))
383
+ vae_decode_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "audio_vae_decode.rknn"))
384
+ locenc_64_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "locenc_64.rknn"))
385
+ locenc_1_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "locenc_1.rknn"))
386
+ fsq_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "fsq_layer.rknn"))
387
+ stop_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "stop_head.rknn"))
388
+ dit_step_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "dit_step.rknn"), provider_options=[{"schedule": [0,1]}])
389
+ lm_to_dit_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "lm_to_dit_proj.rknn"))
390
+ res_to_dit_sess = ort.InferenceSession(os.path.join(args.onnx_dir, "res_to_dit_proj.rknn"))
391
 
392
  # Build text/audio features
393
  if args.prompt_audio:
onnx_infer.py CHANGED
@@ -153,33 +153,25 @@ def cfm_euler_with_onnx_step(
153
  if not mean_mode:
154
  dt_in = torch.zeros_like(dt_in)
155
 
156
- # run conditional branch (pos)
157
- dphi_dt_pos = run_ort(
158
- dit_sess,
159
- {
160
- "x": x,
161
- "mu": mu,
162
- "t": t_in,
163
- "cond": cond,
164
- "dt": dt_in,
165
- },
166
- name=f"dit_step_pos_{step}",
167
- )
168
- dphi_dt_pos = torch.from_numpy(dphi_dt_pos).to(device=device, dtype=dtype)
169
 
170
- # run "negative" branch (unconditional: mu=0, cond=0)
171
- dphi_dt_neg = run_ort(
172
  dit_sess,
173
  {
174
- "x": x,
175
- "mu": torch.zeros_like(mu),
176
- "t": t_in,
177
- "cond": torch.zeros_like(cond),
178
- "dt": torch.zeros_like(dt_in),
179
  },
180
- name=f"dit_step_neg_{step}",
181
  )
182
- dphi_dt_neg = torch.from_numpy(dphi_dt_neg).to(device=device, dtype=dtype)
 
183
 
184
  if use_cfg_zero_star:
185
  positive_flat = dphi_dt_pos.view(b, -1)
 
153
  if not mean_mode:
154
  dt_in = torch.zeros_like(dt_in)
155
 
156
+ x_batch = torch.cat([x, x], dim=0)
157
+ mu_batch = torch.cat([mu, torch.zeros_like(mu)], dim=0)
158
+ t_batch = torch.cat([t_in, t_in], dim=0)
159
+ cond_batch = torch.cat([cond, torch.zeros_like(cond)], dim=0)
160
+ dt_batch = torch.cat([dt_in, torch.zeros_like(dt_in)], dim=0)
 
 
 
 
 
 
 
 
161
 
162
+ dphi_dt_batch = run_ort(
 
163
  dit_sess,
164
  {
165
+ "x": x_batch,
166
+ "mu": mu_batch,
167
+ "t": t_batch,
168
+ "cond": cond_batch,
169
+ "dt": dt_batch,
170
  },
171
+ name=f"dit_step_b2_{step}",
172
  )
173
+ dphi_dt_batch = torch.from_numpy(dphi_dt_batch).to(device=device, dtype=dtype)
174
+ dphi_dt_pos, dphi_dt_neg = torch.split(dphi_dt_batch, [b, b], dim=0)
175
 
176
  if use_cfg_zero_star:
177
  positive_flat = dphi_dt_pos.view(b, -1)