| | --- |
| | license: agpl-3.0 |
| | language: |
| | - en |
| | - zh |
| | base_model: |
| | - openbmb/VoxCPM-0.5B |
| | pipeline_tag: text-to-speech |
| | tags: |
| | - rknn |
| | - rkllm |
| | - text-to-speech |
| | - speech |
| | - speech generation |
| | - voice cloning |
| | --- |
| | |
| | # VoxCPM-0.5B-RKNN2 |
| |
|
| | ### (English README see below) |
| |
|
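| | VoxCPM is an innovative tokenizer-free text-to-speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and provides two core capabilities: context-aware speech generation and realistic zero-shot voice cloning. |
| | Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that generates continuous speech representations directly from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, which greatly improves expressiveness and generation stability. |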
| |
|
| |  |
| |
|
| |
|
| | - Inference speed (RKNN2): RTF of about 4.5 on RK3588 (generating 10 s of audio takes about 45 s of inference) |
| | - Approximate memory usage (RKNN2): about 3.3 GB |
| |
|
| | ## Usage |
| |
|
| | 1. Clone the project locally |
| |
|
| | 2. Install dependencies |
| |
|
| | ```bash |
| | pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async |
| | ``` |
| |
|
| | 3. Run |
| |
|
| | ```bash |
| | python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ, ่ฟไธชๆจกๅๅฑ
็ถๅจRK3588่ฟไธช่พฃ้ธกSoCไธไน่ฝๅฎ็พ่ฟ่ก!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผ่ฟๅฐฑๆฏๆ๏ผไธไบบๆฌไปฐ็ๅคชไน็ไบบใ" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234 |
| | ``` |
| |
|
| | Optional parameters: |
| | - `--text`: Text to synthesize |
| | - `--prompt-audio`: Path to the reference audio (used for voice cloning) |
| | - `--prompt-text`: Transcript of the reference audio (required when a reference audio is given) |
| | - `--cfg-value`: CFG guidance strength, default 2.0 |
| | - `--inference-timesteps`: Number of diffusion steps, default 10 |
| | - `--seed`: Random seed |
| | - `--output`: Output audio path |
| |
|
| | ## Performance |
| |
|
| |
|
| | ```log |
| | > python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ, ่ฟไธชๆจกๅๅฑ
็ถๅจRK3588่ฟไธช่พฃ้ธกSoCไธไน่ฝๅฎ็พ่ฟ่ก!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผ่ฟๅฐฑๆฏๆ๏ผไธไบบๆฌไปฐ็ๅคชไน็ไบบใ" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234 |
| | |
| | I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588 |
| | I rkllm: loading rkllm model from ./base_lm.rkllm |
| | I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16 |
| | I rkllm: Enabled cpus: [4, 5, 6, 7] |
| | I rkllm: Enabled cpus num: 4 |
| | I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588 |
| | I rkllm: loading rkllm model from ./residual_lm.rkllm |
| | I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16 |
| | I rkllm: Enabled cpus: [4, 5, 6, 7] |
| | I rkllm: Enabled cpus num: 4 |
| | [time] vae_encode_0: 1502.91 ms |
| | [time] vae_encode_38400: 1443.79 ms |
| | [time] vae_encode_76800: 1418.36 ms |
| | [time] locenc_0: 820.25 ms |
| | [time] locenc_64: 814.78 ms |
| | [time] locenc_128: 815.60 ms |
| | [time] base_lm initial: 549.21 ms |
| | [time] fsq_init_0: 5.34 ms |
| | [time] fsq_init_64: 3.95 ms |
| | [time] fsq_init_128: 4.17 ms |
| | [time] residual_lm initial: 131.22 ms |
| | gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms |
| | [time] res_to_dit: 1.01 ms |
| | 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s] |
| | [time] locenc_step: 16.43 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 61.20it/s] |
| | gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms |
| | [time] res_to_dit: 0.55 ms |
| | 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s] |
| | [time] locenc_step: 15.93 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.27it/s] |
| | gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms |
| | |
| | ... |
| | |
| | [time] res_to_dit: 0.59 ms |
| | 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s] |
| | [time] locenc_step: 15.73 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.34it/s] |
| | gen_loop: 6%|█████ | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms |
| | [time] res_to_dit: 0.56 ms |
| | 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s] |
| | [time] locenc_step: 15.82 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.43it/s] |
| | gen_loop: 6%|█████ | 123/2000 [00:34<08:47, 3.56it/s] |
| | [time] vae_decode_0: 1153.02 ms |
| | [time] vae_decode_60: 1102.36 ms |
| | [time] vae_decode_120: 1105.00 ms |
| | [time] vae_decode_180: 1105.60 ms |
| | [time] vae_decode_240: 1082.36 ms |
| | Saved: rknn_output.wav |
| | ``` |
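The `[time]` labels in the log (`vae_encode_0`, `vae_encode_38400`, `vae_encode_76800`) suggest that the reference audio is encoded in fixed windows of 38400 samples, which would fit NPU models compiled for static input shapes. That reading is an inference from the log, not confirmed behavior of the script. Under that assumption, the windowing could look like the sketch below. The helper name `fixed_size_chunks` and the zero-padding of the last window are illustrative choices.

```python
from typing import List, Sequence, Tuple


def fixed_size_chunks(
    samples: Sequence[float], chunk_size: int = 38400
) -> List[Tuple[int, List[float]]]:
    """Split samples into (offset, chunk) pairs of equal length.

    The last chunk is zero-padded so every chunk matches the static
    input shape. chunk_size=38400 is read off the log labels and is
    an assumption about the script, not a documented value.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for offset in range(0, len(samples), chunk_size):
        chunk = list(samples[offset:offset + chunk_size])
        chunk.extend([0.0] * (chunk_size - len(chunk)))  # pad the tail
        chunks.append((offset, chunk))
    return chunks


# Dummy audio with 100000 samples, split into three windows.
dummy_audio = [1.0] * 100_000
windows = fixed_size_chunks(dummy_audio)
print([offset for offset, _ in windows])
```

With 100000 samples the offsets match the three labels seen in the log, and only the last window contains padding.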
| |
|
| | ## Model Conversion |
| |
|
| | #### TODO: Documentation to be added |
| |
|
| | ## Known Issues |
| |
|
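| | - In some cases speech generation can get stuck in an infinite loop. The original project appears to have a mechanism for detecting this, but it is not implemented here. |
| | - Due to an internal issue in the RKNN toolchain, the locenc model cannot be configured with two sets of shapes for two input lengths in a single model, so two separate models have to be converted. |
| | - Due to an internal issue in the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one quarter of the correct values. Multiplying them by 4 manually gives the correct result. |
| | - ~~Since the RKNN toolchain currently does not support data-parallel inference across multiple NPU cores for multi-batch models with non-4D inputs, the script runs CFG as two separate passes, which is relatively slow.~~ (Fixed) |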
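Until loop detection is implemented, one cheap safeguard against the first issue is to bound the number of generation steps by the length of the input text. The sketch below is a hypothetical heuristic, not the original project's mechanism: `step_budget`, `generate_with_budget`, `step_fn`, and the `steps_per_char` ratio are all made up for illustration. Only the hard cap of 2000 comes from the `gen_loop` progress bar in the log.

```python
import math
from typing import Any, Callable, List, Tuple


def step_budget(
    text: str,
    steps_per_char: float = 6.0,  # assumed ratio, tune on real data
    min_steps: int = 50,
    hard_cap: int = 2000,         # matches the gen_loop total in the log
) -> int:
    """Return a step limit that grows with text length, clamped to a range."""
    estimate = math.ceil(len(text) * steps_per_char)
    return max(min_steps, min(hard_cap, estimate))


def generate_with_budget(
    step_fn: Callable[[int], Tuple[Any, bool]], text: str
) -> Tuple[List[Any], bool]:
    """Run step_fn until it reports completion or the budget is exhausted.

    step_fn(i) must return (output, done). The second element of the
    result is False when the budget ran out, which hints at a stuck loop.
    """
    outputs = []
    for i in range(step_budget(text)):
        output, done = step_fn(i)
        outputs.append(output)
        if done:
            return outputs, True
    return outputs, False


# A step function that never signals completion is cut off at the budget.
patches, finished = generate_with_budget(lambda i: (i, False), "hello world")
print(len(patches), finished)
```

A caller can treat `finished == False` as a signal to retry with a different seed or to discard the output.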
| |
|
| |
|
| | ## References |
| | - [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B) |
| | - [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE) |
| | - [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX) |
| |
|
| | # English README |
| |
|
| | VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning. |
| |
|
| | Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability. |
| |
|
| | - Inference speed (RKNN2): RTF of about 4.5 on RK3588 (generating 10 s of audio takes about 45 s of inference) |
| | - Approximate memory usage (RKNN2): about 3.3 GB |
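The real-time factor (RTF) is the processing time divided by the duration of the generated audio, so values above 1 mean slower than real time. A minimal sketch of that arithmetic with the figures quoted above (the function name is illustrative):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return RTF = inference time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Figures from the bullet above: 45 s of inference for 10 s of audio.
rtf = real_time_factor(45.0, 10.0)
print(f"RTF = {rtf:.1f}")
```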
| |
|
| | ## Usage |
| |
|
| | 1. Clone the project locally |
| |
|
| | 2. Install dependencies |
| |
|
| | ```bash |
| | pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async |
| | ``` |
| |
|
| | 3. Run |
| |
|
| | ```bash |
| | python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234 |
| | ``` |
| |
|
| | Optional parameters: |
| | - `--text`: Text to generate |
| | - `--prompt-audio`: Reference audio path (for voice cloning) |
| | - `--prompt-text`: Text corresponding to the reference audio (required when using reference audio) |
| | - `--cfg-value`: CFG guidance strength, default 2.0 |
| | - `--inference-timesteps`: Number of diffusion steps, default 10 |
| | - `--seed`: Random seed |
| | - `--output`: Output audio path |
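`--cfg-value` sets the classifier-free guidance (CFG) strength. In the common formulation, the model is evaluated once with the conditioning and once without it, and the two predictions are combined as `uncond + w * (cond - uncond)`. Whether the script combines them in exactly this form is an assumption. `apply_cfg` and the sample values below are illustrative only.

```python
from typing import List, Sequence


def apply_cfg(
    cond: Sequence[float], uncond: Sequence[float], cfg_value: float = 2.0
) -> List[float]:
    """Combine conditional and unconditional predictions element-wise.

    cfg_value = 1.0 returns the conditional prediction unchanged;
    larger values push the result further away from the unconditional one.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + cfg_value * (c - u) for c, u in zip(cond, uncond)]


cond_pred = [1.0, 2.0, 3.0]
uncond_pred = [0.5, 1.0, 1.5]
print(apply_cfg(cond_pred, uncond_pred, cfg_value=2.0))
```

Because both predictions are needed at every diffusion step, running them as one batch of two instead of two separate passes roughly halves the number of model invocations.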
| |
|
| | ## Performance |
| |
|
| |
|
| | ```log |
| | > python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ, ่ฟไธชๆจกๅๅฑ
็ถๅจRK3588่ฟไธช่พฃ้ธกSoCไธไน่ฝๅฎ็พ่ฟ่ก!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผ่ฟๅฐฑๆฏๆ๏ผไธไบบๆฌไปฐ็ๅคชไน็ไบบใ" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234 |
| | |
| | I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588 |
| | I rkllm: loading rkllm model from ./base_lm.rkllm |
| | I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16 |
| | I rkllm: Enabled cpus: [4, 5, 6, 7] |
| | I rkllm: Enabled cpus num: 4 |
| | I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588 |
| | I rkllm: loading rkllm model from ./residual_lm.rkllm |
| | I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16 |
| | I rkllm: Enabled cpus: [4, 5, 6, 7] |
| | I rkllm: Enabled cpus num: 4 |
| | [time] vae_encode_0: 1502.91 ms |
| | [time] vae_encode_38400: 1443.79 ms |
| | [time] vae_encode_76800: 1418.36 ms |
| | [time] locenc_0: 820.25 ms |
| | [time] locenc_64: 814.78 ms |
| | [time] locenc_128: 815.60 ms |
| | [time] base_lm initial: 549.21 ms |
| | [time] fsq_init_0: 5.34 ms |
| | [time] fsq_init_64: 3.95 ms |
| | [time] fsq_init_128: 4.17 ms |
| | [time] residual_lm initial: 131.22 ms |
| | gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms |
| | [time] res_to_dit: 1.01 ms |
| | 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s] |
| | [time] locenc_step: 16.43 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 61.20it/s] |
| | gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms |
| | [time] res_to_dit: 0.55 ms |
| | 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s] |
| | [time] locenc_step: 15.93 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.27it/s] |
| | gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms |
| | |
| | ... |
| | |
| | [time] res_to_dit: 0.59 ms |
| | 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s] |
| | [time] locenc_step: 15.73 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.34it/s] |
| | gen_loop: 6%|█████ | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms |
| | [time] res_to_dit: 0.56 ms |
| | 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s] |
| | [time] locenc_step: 15.82 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.43it/s] |
| | gen_loop: 6%|█████ | 123/2000 [00:34<08:47, 3.56it/s] |
| | [time] vae_decode_0: 1153.02 ms |
| | [time] vae_decode_60: 1102.36 ms |
| | [time] vae_decode_120: 1105.00 ms |
| | [time] vae_decode_180: 1105.60 ms |
| | [time] vae_decode_240: 1082.36 ms |
| | Saved: rknn_output.wav |
| | ``` |
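To see where the time goes, the `[time]` entries in a log like the one above can be summed per stage. The sketch below is a small helper written for this README, not part of the project. It assumes the `[time] <name>: <value> ms` format shown in the log and groups chunked entries such as `vae_encode_0` and `vae_encode_38400` under one stage name.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches "[time] <name>: <milliseconds> ms" anywhere in a line, because
# tqdm output is sometimes printed on the same line as a timing entry.
TIME_ENTRY = re.compile(r"\[time\]\s+([^:\n]+):\s+([0-9]+(?:\.[0-9]+)?)\s*ms")


def summarize_timings(log_text: str) -> Dict[str, float]:
    """Return total milliseconds per stage, merging numbered chunks."""
    totals: Dict[str, float] = defaultdict(float)
    for name, ms in TIME_ENTRY.findall(log_text):
        stage = re.sub(r"_\d+$", "", name.strip())  # vae_encode_0 -> vae_encode
        totals[stage] += float(ms)
    return dict(totals)


sample_log = """\
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] base_lm initial: 549.21 ms
gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
"""

for stage, total_ms in summarize_timings(sample_log).items():
    print(f"{stage}: {total_ms:.2f} ms")
```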
| | ## Model Conversion |
| |
|
| | #### TODO: Documentation to be added |
| |
|
| | ## Known Issues |
| |
|
| | - In some cases speech generation can get stuck in an infinite loop. The original project appears to have a mechanism for detecting this, but it is not implemented here. |
| | - Due to an internal issue in the RKNN toolchain, the locenc model cannot be configured with two sets of shapes for two input lengths in a single model, so two separate models have to be converted. |
| | - Due to an internal issue in the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one quarter of the correct values. Multiplying them by 4 manually gives the correct result. |
| | - ~~Since the RKNN toolchain currently does not support data-parallel inference across multiple NPU cores for multi-batch models with non-4D inputs, the script runs CFG as two separate passes, which is relatively slow.~~ (Fixed) |
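The workaround for the one-quarter output values amounts to a single multiplication after each LLM call. A minimal sketch follows. The names `RKLLM_OUTPUT_SCALE` and `rescale_rkllm_output` are hypothetical, and plain Python lists stand in for the real output tensors.

```python
from typing import List, Sequence

# Factor taken from the known issue above: the runtime returns values
# that are one quarter of the correct ones.
RKLLM_OUTPUT_SCALE = 4.0


def rescale_rkllm_output(values: Sequence[float]) -> List[float]:
    """Multiply raw RKLLM output values by 4 to undo the runtime scaling."""
    return [v * RKLLM_OUTPUT_SCALE for v in values]


raw_output = [0.25, -0.5, 0.125]
print(rescale_rkllm_output(raw_output))
```

Keeping the factor in one named constant makes the workaround easy to remove if a later toolchain or runtime release fixes the scaling.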
| |
|
| | ## References |
| | - [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B) |
| | - [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE) |
| | - [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX) |
| |
|