---
license: agpl-3.0
language:
- en
- zh
base_model:
- openbmb/VoxCPM-0.5B
pipeline_tag: text-to-speech
tags:
- rknn
- rkllm
- text-to-speech
- speech
- speech generation
- voice cloning
---
# VoxCPM-0.5B-RKNN2
### (English README below)
VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.
Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
![Model architecture](model_structure.jpg)
- Inference speed (RKNN2): RTF approximately 4.5 on RK3588 (45 s of inference to generate 10 s of audio)
- Approximate memory usage (RKNN2): ~3.3 GB
## ไฝฟ็”จๆ–นๆณ•
1. ๅ…‹้š†้กน็›ฎๅˆฐๆœฌๅœฐ
2. ๅฎ‰่ฃ…ไพ่ต–
```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```
3. ่ฟ่กŒ
```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```
ๅฏ้€‰ๅ‚ๆ•ฐ๏ผš
- `--text`: ่ฆ็”Ÿๆˆ็š„ๆ–‡ๆœฌ
- `--prompt-audio`: ๅ‚่€ƒ้Ÿณ้ข‘่ทฏๅพ„๏ผˆ็”จไบŽ่ฏญ้Ÿณๅ…‹้š†๏ผ‰
- `--prompt-text`: ๅ‚่€ƒ้Ÿณ้ข‘ๅฏนๅบ”็š„ๆ–‡ๆœฌ๏ผˆไฝฟ็”จๅ‚่€ƒ้Ÿณ้ข‘ๆ—ถๅฟ…ๅกซ๏ผ‰
- `--cfg-value`: CFGๅผ•ๅฏผๅผบๅบฆ๏ผŒ้ป˜่ฎค2.0
- `--inference-timesteps`: ๆ‰ฉๆ•ฃๆญฅๆ•ฐ๏ผŒ้ป˜่ฎค10
- `--seed`: ้šๆœบ็งๅญ
- `--output`: ่พ“ๅ‡บ้Ÿณ้ข‘่ทฏๅพ„
## ่ฟ่กŒๆ•ˆๆžœ
```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 7/10 [00:00<00:00, 61.20it/s]
gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 7/10 [00:00<00:00, 67.27it/s]
gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms
...
[time] res_to_dit: 0.59 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 7/10 [00:00<00:00, 67.34it/s]
gen_loop: 6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 7/10 [00:00<00:00, 67.43it/s]
gen_loop: 6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž | 123/2000 [00:34<08:47, 3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav
```
## ๆจกๅž‹่ฝฌๆข
#### ๆ‡’ๅพ—ๅ†™ไบ†๏ผŒๅพ…่กฅๅ……
## ๅทฒ็Ÿฅ้—ฎ้ข˜
- ๆŸไบ›ๆƒ…ๅ†ตไธ‹่ฏญ้Ÿณ็”Ÿๆˆๅฏ่ƒฝ้™ทๅ…ฅๆญปๅพช็Žฏ๏ผŒๅŽŸ้กน็›ฎไผผไนŽๆœ‰ๆฃ€ๆต‹ๆญปๅพช็Žฏ็š„ๆœบๅˆถ๏ผŒไฝ†ๆˆ‘่ฟ™้‡Œๆฒกๆœ‰ๅฎž็Žฐใ€‚
- ็”ฑไบŽRKNNๅทฅๅ…ท้“พ็š„ๅ†…้ƒจ้—ฎ้ข˜๏ผŒlocencๆจกๅž‹ๆฒกๆœ‰ๅŠžๆณ•ๅœจไธ€ไธชๆจกๅž‹้‡Œ้…็ฝฎไธค็ง่พ“ๅ…ฅ้•ฟๅบฆ็š„ไธค็ป„shape๏ผŒๅ› ๆญคๅช่ƒฝๅ•็‹ฌ่ฝฌๆขไธคไธชๆจกๅž‹ใ€‚
- ็”ฑไบŽRKLLMๅทฅๅ…ท้“พ/่ฟ่กŒๆ—ถ็š„ๅ†…้ƒจ้—ฎ้ข˜๏ผŒไธคไธชLLM็š„่พ“ๅ‡บๅผ ้‡็š„ๆ•ฐๅ€ผ้ƒฝๅชๆœ‰ๆญฃ็กฎ็ป“ๆžœ็š„ๅ››ๅˆ†ไน‹ไธ€๏ผŒๆ‰‹ๅŠจไน˜4ไน‹ๅŽๅฏไปฅๅพ—ๅˆฐๆญฃ็กฎ็ป“ๆžœใ€‚
- ~~็”ฑไบŽRKNNๅทฅๅ…ท้“พ็›ฎๅ‰ไธๆ”ฏๆŒ้ž4็ปด่พ“ๅ…ฅๆจกๅž‹ๅคšbatchไฝฟ็”จๅคšNPUๆ ธ็š„ๆ•ฐๆฎๅนถ่กŒๆŽจ็†๏ผŒ่„šๆœฌไธญCFGๆ˜ฏๅˆ†ไธคๆฌกๅ•็‹ฌ่ฟ›่กŒ็š„๏ผŒ้€Ÿๅบฆ่พƒๆ…ขใ€‚~~(ๅทฒไฟฎๅค)
## ๅ‚่€ƒ
- [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)
# English README
VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.
Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
- Inference speed (RKNN2): RTF approximately 4.5 on RK3588 (45s inference time to generate 10s audio)
- Approximate memory usage (RKNN2): ~3.3GB
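The RTF figure above can be sanity-checked with simple arithmetic (a minimal sketch; the timing numbers are just the ones quoted above):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock inference time divided by audio duration.

    RTF > 1 means generation is slower than real time.
    """
    return inference_seconds / audio_seconds


# 45 s of inference for 10 s of generated audio, as measured on RK3588
print(real_time_factor(45.0, 10.0))  # 4.5
```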
## Usage
1. Clone the project locally
2. Install dependencies
```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```
3. Run
```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```
Optional parameters:
- `--text`: Text to generate
- `--prompt-audio`: Reference audio path (for voice cloning)
- `--prompt-text`: Text corresponding to the reference audio (required when using reference audio)
- `--cfg-value`: CFG guidance strength, default 2.0
- `--inference-timesteps`: Number of diffusion steps, default 10
- `--seed`: Random seed
- `--output`: Output audio path
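`--cfg-value` controls classifier-free guidance strength. A minimal sketch of the standard CFG combination rule (the variable names are illustrative assumptions, not taken from the script's actual code):

```python
def cfg_combine(cond, uncond, cfg_value=2.0):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one. cfg_value = 1.0 disables
    # guidance; larger values follow the text prompt more strongly.
    return [u + cfg_value * (c - u) for c, u in zip(cond, uncond)]


print(cfg_combine([1.0, 2.0], [0.0, 1.0], cfg_value=2.0))  # [2.0, 3.0]
```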
## Performance
```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 7/10 [00:00<00:00, 61.20it/s]
gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 7/10 [00:00<00:00, 67.27it/s]
gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms
...
[time] res_to_dit: 0.59 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 7/10 [00:00<00:00, 67.34it/s]
gen_loop: 6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 msโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ– | 7/10 [00:00<00:00, 67.43it/s]
gen_loop: 6%|โ–ˆโ–ˆโ–ˆโ–ˆโ–Ž | 123/2000 [00:34<08:47, 3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav
```
## Model Conversion
#### TODO: Documentation to be added
## Known Issues
- In some cases, speech generation may fall into an infinite loop. The original project appears to have a mechanism to detect this, but it is not implemented here.
- Due to internal issues with the RKNN toolchain, the locenc model cannot be configured with two shape sets for its two input lengths in a single model, so two separate models must be converted.
- Due to internal issues with the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one-quarter of the correct result; multiplying by 4 manually yields the correct values.
- ~~Since the RKNN toolchain currently does not support data-parallel inference across multiple NPU cores for non-4D-input models with multiple batches, CFG in the script was performed in two separate passes, which was relatively slow.~~ (Fixed)
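The quarter-scale workaround above amounts to a single multiplication on the raw output tensor (a sketch; the function name and flat-list representation are illustrative, not the script's actual API):

```python
# Empirical correction factor: RKLLM output values come back at 1/4
# of the correct magnitude (see Known Issues above).
RKLLM_OUTPUT_SCALE = 4.0


def correct_rkllm_output(raw_tensor):
    # raw_tensor: output values as read back from the RKLLM runtime
    return [v * RKLLM_OUTPUT_SCALE for v in raw_tensor]


print(correct_rkllm_output([0.25, -0.5]))  # [1.0, -2.0]
```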
## References
- [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)