| --- |
| license: agpl-3.0 |
| language: |
| - en |
| - zh |
| base_model: |
| - openbmb/VoxCPM1.5 |
| pipeline_tag: text-to-speech |
| tags: |
| - rknn |
| - rkllm |
| - text-to-speech |
| - speech |
| - speech generation |
| - voice cloning |
| --- |
| |
| # VoxCPM1.5-RKNN2 |
|
|
| # English README |
|
|
| > VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning. |
|
|
| > Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability. |
|
|
| We're thrilled to introduce a major upgrade that improves audio quality and efficiency of VoxCPM, while maintaining the core capabilities of context-aware speech generation and zero-shot voice cloning. |
|
|
| | Feature | VoxCPM | VoxCPM1.5 | |
| |---------|------------|------------| |
| | **Audio VAE Sampling Rate** | 16kHz | 44.1kHz | |
| | **LM Token Rate** | 12.5Hz | 6.25Hz | |
| | **Patch Size** | 2 | 4 | |
| | **SFT Support** | ✅ | ✅ | |
| | **LoRA Support** | ✅ | ✅ | |
|
|
| - Inference speed (RKNN2): RTF of approximately 4.5 on RK3588 (about 45 s of inference to generate 10 s of audio; no apparent improvement over the previous version) |
| - Approximate memory usage (RKNN2): ~3.3 GB (likewise no improvement over the previous version) |
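For readers unfamiliar with the term, the real-time factor (RTF) quoted above is simply inference time divided by the duration of the generated audio, so values above 1 mean slower than real time. The small sketch below shows the arithmetic; the function name `real_time_factor` is a hypothetical helper, not part of this repository.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return inference time divided by generated audio duration.

    Hypothetical helper for illustration only; it is not part of the
    scripts shipped in this repository.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Figures quoted in the README: about 45 s of inference for 10 s of audio.
rtf = real_time_factor(45.0, 10.0)
print(f"RTF: {rtf:.2f}")
```

An RTF of this size means the port is better suited to offline generation than to interactive use.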
|
|
| ## Usage |
|
|
| 1. Clone the project locally |
|
|
| 2. Install dependencies |
|
|
| ```bash |
| pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async |
| ``` |
|
|
| 3. Run |
|
|
| ```bash |
| python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234 |
| ``` |
|
|
| Optional parameters: |
| - `--text`: Text to generate |
| - `--prompt-audio`: Reference audio path (for voice cloning) |
| - `--prompt-text`: Text corresponding to the reference audio (required when using reference audio) |
| - `--cfg-value`: CFG guidance strength, default 2.0 |
| - `--inference-timesteps`: Number of diffusion steps, default 10 |
| - `--seed`: Random seed |
| - `--output`: Output audio path |
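If you want to drive the script from another Python program, the flags listed above can be assembled into an argument list. The following is a minimal sketch under a few assumptions: `build_command` and its keyword names are hypothetical, the script is assumed to sit in the current directory, and all four directory flags are assumed to point at the same folder, as in the example command.

```python
import shlex
import sys
from typing import List, Optional


def build_command(
    text: str,
    output: str = "rknn_output.wav",
    prompt_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
    cfg_value: float = 2.0,
    inference_timesteps: int = 10,
    seed: Optional[int] = None,
    model_dir: str = ".",
) -> List[str]:
    """Assemble the argument list for onnx_infer-rknn2.py (hypothetical helper)."""
    # The README states that --prompt-text is required with reference audio.
    if prompt_audio is not None and not prompt_text:
        raise ValueError("prompt_text is required when prompt_audio is given")

    cmd = [
        sys.executable, "onnx_infer-rknn2.py",
        "--onnx-dir", model_dir,
        "--tokenizer-dir", model_dir,
        "--base-hf-dir", model_dir,
        "--residual-hf-dir", model_dir,
        "--text", text,
        "--output", output,
        "--cfg-value", str(cfg_value),
        "--inference-timesteps", str(inference_timesteps),
    ]
    if prompt_audio is not None:
        cmd += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


example = build_command(
    "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!",
    prompt_audio="basic_ref_zh.wav",
    prompt_text="对，这就是我，万人敬仰的太乙真人。",
    seed=1234,
)
print(shlex.join(example))
```

Passing the resulting list to `subprocess.run(example, check=True)` on the board would start a generation run; the sketch only prints the command so it can be inspected first.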
|
|
| ## Performance |
|
|
|
|
| ```log |
| > python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234 |
| I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588 |
| I rkllm: loading rkllm model from ./base_lm.rkllm |
| I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16 |
| I rkllm: Enabled cpus: [4, 5, 6, 7] |
| I rkllm: Enabled cpus num: 4 |
| I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588 |
| I rkllm: loading rkllm model from ./residual_lm.rkllm |
| I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16 |
| I rkllm: Enabled cpus: [4, 5, 6, 7] |
| I rkllm: Enabled cpus num: 4 |
| [time] vae_encode_0: 2127.35 ms |
| [time] vae_encode_105840: 2057.71 ms |
| [time] vae_encode_211680: 1997.43 ms |
| [time] locenc_0: 1791.50 ms |
| [time] locenc_64: 1782.49 ms |
| [time] base_lm initial: 368.19 ms |
| [time] fsq_init_0: 5.52 ms |
| [time] fsq_init_64: 4.20 ms |
| [time] residual_lm initial: 105.79 ms |
| gen_loop: 0%| | 0/2000 [00:00<?, ?it/s] |
| [time] lm_to_dit: 1.49 ms |
| [time] res_to_dit: 1.11 ms |
| 100%|██████████| 10/10 [00:00<00:00, 32.15it/s] |
| [time] locenc_step: 33.00 ms |
| gen_loop: 0%| | 1/2000 [00:00<15:33, 2.14it/s] |
| [time] lm_to_dit: 0.67 ms |
| [time] res_to_dit: 0.76 ms |
| 100%|██████████| 10/10 [00:00<00:00, 32.86it/s] |
| [time] locenc_step: 31.85 ms |
| gen_loop: 0%| | 2/2000 [00:00<15:18, 2.18it/s] |
| [time] lm_to_dit: 0.61 ms |
| [time] res_to_dit: 0.65 ms |
| 100%|██████████| 10/10 [00:00<00:00, 32.72it/s] |
| [time] locenc_step: 32.01 ms |
| gen_loop: 2%| | 49/2000 [00:22<14:55, 2.18it/s] |
| [time] lm_to_dit: 0.88 ms |
| [time] res_to_dit: 0.64 ms |
| 100%|██████████| 10/10 [00:00<00:00, 32.72it/s] |
| [time] locenc_step: 32.16 ms |
| gen_loop: 2%| | 49/2000 [00:22<15:05, 2.15it/s] |
| [time] vae_decode_0: 2438.31 ms |
| [time] vae_decode_60: 2372.92 ms |
| [time] vae_decode_120: 2380.40 ms |
| [time] vae_decode_180: 2344.88 ms |
| Saved: rknn_output.wav |
| ``` |
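To see where the time goes, the `[time] <stage>: <value> ms` lines in a log like the one above can be summed per stage. This is a minimal sketch that assumes the log keeps exactly that line format; `collect_timings` and `total_ms_per_stage` are hypothetical helper names, and folding a trailing numeric suffix (for example `vae_encode_0` and `vae_encode_105840`) into one stage is a choice made for this illustration only.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches "[time] <stage name>: <number> ms" anywhere in a line.
TIME_LINE = re.compile(r"\[time\]\s+(?P<name>.+?):\s+(?P<ms>\d+(?:\.\d+)?)\s*ms")


def collect_timings(log_text: str) -> Dict[str, List[float]]:
    """Group timing values by stage, dropping a trailing _<digits> suffix."""
    timings: Dict[str, List[float]] = defaultdict(list)
    for match in TIME_LINE.finditer(log_text):
        stage = re.sub(r"_\d+$", "", match.group("name"))
        timings[stage].append(float(match.group("ms")))
    return dict(timings)


def total_ms_per_stage(log_text: str) -> Dict[str, float]:
    """Sum the collected timings for each stage, in milliseconds."""
    return {stage: sum(values) for stage, values in collect_timings(log_text).items()}


sample_log = """\
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] base_lm initial: 368.19 ms
gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.49 ms
"""

for stage, total in total_ms_per_stage(sample_log).items():
    print(f"{stage}: {total:.2f} ms")
```

Saving the script's console output to a file and passing its contents to `total_ms_per_stage` gives a quick per-stage breakdown of a run.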
| ## Model Conversion |
|
|
| See https://huggingface.co/happyme531/VoxCPM1.5-RKNN2/tree/main/convert |
|
|
| ## Known Issues |
|
|
| - In some cases, speech generation may fall into an infinite loop. The original project seems to have a mechanism to detect infinite loops, but it is not implemented here. |
| - Due to internal issues in the RKNN toolchain, a single locenc model cannot be configured with two shape sets for the two input lengths, so two separate models have to be converted. |
| - Due to internal issues in the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one quarter of the correct values. Multiplying them by 4 manually restores the correct result. |
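The last workaround amounts to a single multiplication on whatever array the RKLLM runtime returns. Here is a minimal NumPy sketch of that step; the function name `rescale_rkllm_output` and the constant name are hypothetical, and where exactly the inference script applies the correction is not shown here.

```python
import numpy as np

# Factor reported in the known issue above: outputs arrive at 1/4 scale.
RKLLM_OUTPUT_SCALE = 4.0


def rescale_rkllm_output(hidden: np.ndarray) -> np.ndarray:
    """Undo the 1/4 scaling of an RKLLM output tensor (hypothetical helper).

    The array is promoted to float32 before scaling so that FP16 outputs
    do not lose further precision or overflow during the multiplication.
    """
    return hidden.astype(np.float32) * RKLLM_OUTPUT_SCALE


raw = np.array([0.25, -0.5, 1.0], dtype=np.float16)
print(rescale_rkllm_output(raw))
```

If a future RKLLM release fixes the scaling, a correction like this would have to be removed, so keeping the factor in one named constant makes that change easy to find.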
|
|
| ## References |
| - [openbmb/VoxCPM1.5](https://huggingface.co/openbmb/VoxCPM1.5) |
| - [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE) |
| - [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX) |
|
|