VoxCPM1.5-RKNN2
VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.
Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
We're thrilled to introduce a major upgrade that improves VoxCPM's audio quality and efficiency while keeping its core capabilities: context-aware speech generation and zero-shot voice cloning.
| Feature | VoxCPM | VoxCPM1.5 |
|---|---|---|
| Audio VAE Sampling Rate | 16kHz | 44.1kHz |
| LM Token Rate | 12.5Hz | 6.25Hz |
| Patch Size | 2 | 4 |
| SFT Support | ✅ | ✅ |
| LoRA Support | ✅ | ✅ |
- Inference speed (RKNN2): RTF is approximately 4.5 on RK3588, i.e. generating 10 s of audio takes about 45 s of inference (no improvement over the previous version).
- Approximate memory usage (RKNN2): about 3.3 GB (likewise no improvement over the previous version).
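The real-time factor quoted above is inference time divided by the duration of the generated audio, so any value above 1 means slower than real time. A minimal sketch of that arithmetic follows; the function name is illustrative and not part of this project.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return RTF = wall-clock inference time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Figures quoted above for RK3588: about 45 s of inference for 10 s of audio.
rtf = real_time_factor(45.0, 10.0)
print(f"RTF = {rtf:.1f}")
```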
Usage
- Clone the project locally
- Install dependencies
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
- Run
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
Optional parameters:
- --text: Text to generate
- --prompt-audio: Reference audio path (for voice cloning)
- --prompt-text: Text corresponding to the reference audio (required when using reference audio)
- --cfg-value: CFG guidance strength, default 2.0
- --inference-timesteps: Number of diffusion steps, default 10
- --seed: Random seed
- --output: Output audio path
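When the script is driven from another Python program, it can help to assemble the argument list in one place. The sketch below uses only the flags shown above. The helper name `build_command` and its defaults are hypothetical, and it assumes all model files sit in a single directory, as in the example command.

```python
import shlex
import sys
from typing import List, Optional


def build_command(text: str,
                  output: str = "rknn_output.wav",
                  prompt_audio: Optional[str] = None,
                  prompt_text: Optional[str] = None,
                  cfg_value: float = 2.0,
                  inference_timesteps: int = 10,
                  seed: Optional[int] = None,
                  model_dir: str = ".") -> List[str]:
    """Build the argv list for onnx_infer-rknn2.py (hypothetical helper)."""
    # The README states that --prompt-text is required with reference audio.
    if prompt_audio is not None and prompt_text is None:
        raise ValueError("prompt_text is required when prompt_audio is given")

    cmd = [
        sys.executable, "onnx_infer-rknn2.py",
        "--onnx-dir", model_dir,
        "--tokenizer-dir", model_dir,
        "--base-hf-dir", model_dir,
        "--residual-hf-dir", model_dir,
        "--text", text,
        "--output", output,
        "--cfg-value", str(cfg_value),
        "--inference-timesteps", str(inference_timesteps),
    ]
    if prompt_audio is not None:
        cmd += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


cmd = build_command(
    "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!",
    prompt_audio="basic_ref_zh.wav",
    prompt_text="对，这就是我，万人敬仰的太乙真人。",
    seed=1234,
)
# Print a shell-quoted version for inspection; nothing is executed here.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would then launch the script without any shell quoting problems around the Chinese prompt text.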
Performance
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "哇, VoxCPM1.5 现在也能在 RK3588 上跑起来了。" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 2127.35 ms
[time] vae_encode_105840: 2057.71 ms
[time] vae_encode_211680: 1997.43 ms
[time] locenc_0: 1791.50 ms
[time] locenc_64: 1782.49 ms
[time] base_lm initial: 368.19 ms
[time] fsq_init_0: 5.52 ms
[time] fsq_init_64: 4.20 ms
[time] residual_lm initial: 105.79 ms
gen_loop:   0%|          | 0/2000 [00:00<?, ?it/s]
[time] lm_to_dit: 1.49 ms
[time] res_to_dit: 1.11 ms
100%|██████████| 10/10 [00:00<00:00, 32.15it/s]
[time] locenc_step: 33.00 ms
gen_loop:   0%|          | 1/2000 [00:00<15:33, 2.14it/s]
[time] lm_to_dit: 0.67 ms
[time] res_to_dit: 0.76 ms
100%|██████████| 10/10 [00:00<00:00, 32.86it/s]
[time] locenc_step: 31.85 ms
gen_loop:   0%|          | 2/2000 [00:00<15:18, 2.18it/s]
[time] lm_to_dit: 0.61 ms
[time] res_to_dit: 0.65 ms
100%|██████████| 10/10 [00:00<00:00, 32.72it/s]
[time] locenc_step: 32.01 ms
gen_loop:   2%|▏         | 49/2000 [00:22<14:55, 2.18it/s]
[time] lm_to_dit: 0.88 ms
[time] res_to_dit: 0.64 ms
100%|██████████| 10/10 [00:00<00:00, 32.72it/s]
[time] locenc_step: 32.16 ms
gen_loop:   2%|▏         | 49/2000 [00:22<15:05, 2.15it/s]
[time] vae_decode_0: 2438.31 ms
[time] vae_decode_60: 2372.92 ms
[time] vae_decode_120: 2380.40 ms
[time] vae_decode_180: 2344.88 ms
Saved: rknn_output.wav
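The `[time]` lines in the log above can be totalled per stage to see where the time goes. The following is a rough sketch, not part of the project: `summarize_timings` is a hypothetical helper, and treating a trailing `_<number>` as a chunk offset to be stripped is an assumption based on names such as `vae_encode_105840`.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches e.g. "[time] vae_encode_105840: 2057.71 ms"
TIME_LINE = re.compile(r"\[time\]\s+(?P<name>.+?):\s+(?P<ms>\d+(?:\.\d+)?)\s*ms")


def summarize_timings(log_text: str) -> Dict[str, float]:
    """Sum the reported milliseconds per stage across a whole log."""
    totals: Dict[str, float] = defaultdict(float)
    for match in TIME_LINE.finditer(log_text):
        # Assumption: a trailing "_<digits>" is a chunk offset, not a stage name.
        stage = re.sub(r"_\d+$", "", match.group("name"))
        totals[stage] += float(match.group("ms"))
    return dict(totals)


sample = """\
[time] vae_decode_0: 2438.31 ms
[time] vae_decode_60: 2372.92 ms
[time] vae_decode_120: 2380.40 ms
[time] vae_decode_180: 2344.88 ms
[time] base_lm initial: 368.19 ms
"""
for stage, total_ms in summarize_timings(sample).items():
    print(f"{stage}: {total_ms:.2f} ms")
```

Because the pattern is searched anywhere in the text, it also picks up `[time]` entries that share a line with a progress bar.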
Model Conversion
See https://huggingface.co/happyme531/VoxCPM1.5-RKNN2/tree/main/convert
Known Issues
- In some cases, speech generation may fall into an infinite loop. The original project seems to have a mechanism to detect infinite loops, but it is not implemented here.
- Due to internal issues with the RKNN toolchain, a single locenc model cannot be configured with two sets of shapes for the two input lengths, so two separate models must be converted.
- Due to internal issues with the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one-quarter of the correct result. Multiplying by 4 manually yields the correct result.
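Until loop detection is implemented, one possible stopgap for the first issue is to cap the number of generation steps according to the input text length. The sketch below is not the original project's mechanism. `max_generation_steps` and `seconds_per_char` are hypothetical, the 6.25 Hz rate is taken from the table above, the cap of 2000 matches the `gen_loop` total in the log, and it assumes one loop step corresponds to one LM token.

```python
import math


def max_generation_steps(text: str,
                         seconds_per_char: float = 0.5,
                         token_rate_hz: float = 6.25,
                         hard_cap: int = 2000) -> int:
    """Estimate a generous upper bound on generation steps for `text`.

    seconds_per_char is a deliberately loose guess at speaking time per
    character (assumption); token_rate_hz is the LM token rate of VoxCPM1.5.
    """
    n_chars = max(1, len(text.strip()))
    budget_seconds = n_chars * seconds_per_char
    steps = math.ceil(budget_seconds * token_rate_hz)
    return min(hard_cap, max(1, steps))


text = "Wow, VoxCPM1.5 actually runs perfectly on the RK3588 SoC!"
limit = max_generation_steps(text)
print(f"stop after at most {limit} steps")
# A generation loop could then iterate over range(limit) instead of range(2000).
```

A tighter `seconds_per_char` cuts runaway generations sooner but risks truncating slow or long utterances, so it is worth tuning per language.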