---
license: agpl-3.0
language:
- en
- zh
base_model:
- openbmb/VoxCPM-0.5B
pipeline_tag: text-to-speech
tags:
- rknn
- rkllm
- text-to-speech
- speech
- speech generation
- voice cloning
---
# VoxCPM-0.5B-RKNN2
VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.
Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.
- Inference speed (RKNN2): real-time factor (RTF) of about 4.5 on RK3588, i.e. roughly 45 s of inference to generate 10 s of audio
- Memory usage (RKNN2): about 3.3 GB
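As a quick illustration of how the RTF figure above relates inference time to audio length, here is a minimal sketch. The function name `real_time_factor` is made up for this example and is not part of the repository.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Figures quoted above: 45 s of inference for 10 s of audio.
rtf = real_time_factor(45.0, 10.0)
print(f"RTF = {rtf:.1f}")  # RTF = 4.5

# Rough wait estimate for a 30 s clip at the same RTF.
print(f"{rtf * 30.0:.0f} s")  # 135 s
```

An RTF above 1 means generation is slower than playback, so at about 4.5 this port is suited to offline synthesis rather than streaming.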
## Usage
1. Clone this repository locally
2. Install dependencies
```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```
3. Run
```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```
Optional parameters:
- `--text`: Text to generate
- `--prompt-audio`: Reference audio path (for voice cloning)
- `--prompt-text`: Text corresponding to the reference audio (required when using reference audio)
- `--cfg-value`: CFG guidance strength, default 2.0
- `--inference-timesteps`: Number of diffusion steps, default 10
- `--seed`: Random seed
- `--output`: Output audio path
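If the script is driven from another Python program, the flags listed above can be assembled programmatically. The sketch below is only an illustration: `build_command` and `synthesize` are hypothetical helpers, not part of this repository, and it assumes all four directory flags point at the same folder, as in the example command.

```python
import shlex
import subprocess
import sys

SCRIPT = "onnx_infer-rknn2.py"  # script name taken from the Usage section


def build_command(text, output="rknn_output.wav", prompt_audio=None,
                  prompt_text=None, cfg_value=2.0, inference_timesteps=10,
                  seed=None, model_dir="."):
    """Return the argv list for one synthesis run (hypothetical helper)."""
    # The README states that --prompt-text is required with reference audio.
    if prompt_audio is not None and prompt_text is None:
        raise ValueError("--prompt-text is required when --prompt-audio is given")
    cmd = [sys.executable, SCRIPT,
           "--onnx-dir", model_dir, "--tokenizer-dir", model_dir,
           "--base-hf-dir", model_dir, "--residual-hf-dir", model_dir,
           "--text", text, "--output", output,
           "--cfg-value", str(cfg_value),
           "--inference-timesteps", str(inference_timesteps)]
    if prompt_audio is not None:
        cmd += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def synthesize(**kwargs):
    """Run the script; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(**kwargs), check=True)


# Build (but do not run) a command without voice cloning.
cmd = build_command("Hello from the RK3588.", seed=1234)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with punctuation in `--text` and `--prompt-text`.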
## Performance
```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ, ่ฟไธชๆจกๅๅฑ
็ถๅจRK3588่ฟไธช่พฃ้ธกSoCไธไน่ฝๅฎ็พ่ฟ่ก!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผ่ฟๅฐฑๆฏๆ๏ผไธไบบๆฌไปฐ็ๅคชไน็ไบบใ" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 61.20it/s]
gen_loop: 0%| | 1/2000 [00:00<09:43, 3.42it/s][time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.27it/s]
gen_loop: 0%| | 2/2000 [00:00<09:25, 3.53it/s][time] lm_to_dit: 0.74 ms
...
[time] res_to_dit: 0.59 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.34it/s]
gen_loop: 6%|█████ | 123/2000 [00:34<08:47, 3.56it/s][time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 ms█████████████████████████████████████ | 7/10 [00:00<00:00, 67.43it/s]
gen_loop: 6%|█████ | 123/2000 [00:34<08:47, 3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav
```
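To see where the time goes, the `[time] <name>: <value> ms` lines in a log like the one above can be summed per stage. This is a minimal sketch that assumes only the line format visible in the log. `stage_totals` is a hypothetical helper, and grouping chunked entries by stripping a trailing `_<number>` is a choice made for this example.

```python
import re
from collections import defaultdict

# Matches "[time] vae_encode_0: 1502.91 ms", including when a tqdm bar
# shares the same line.
TIME_LINE = re.compile(r"\[time\]\s+(?P<name>.+?):\s+(?P<ms>\d+(?:\.\d+)?)\s*ms")


def stage_totals(log_text):
    """Sum the reported milliseconds per stage."""
    totals = defaultdict(float)
    for match in TIME_LINE.finditer(log_text):
        # Group chunked entries: vae_encode_38400 -> vae_encode
        stage = re.sub(r"_\d+$", "", match.group("name"))
        totals[stage] += float(match.group("ms"))
    return dict(totals)


sample = """\
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] base_lm initial: 549.21 ms
gen_loop: 0%| | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
"""

for stage, ms in stage_totals(sample).items():
    print(f"{stage}: {ms:.2f} ms")
```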
## Model Conversion
See https://huggingface.co/happyme531/VoxCPM-0.5B-RKNN2/tree/main/convert
## Known Issues
- In some cases, speech generation can get stuck in an endless loop. The original project appears to have a mechanism for detecting this, but it is not implemented here.
- Due to an internal issue in the RKNN toolchain, a single locenc model cannot be configured with two sets of shapes for the two input lengths, so two separate models have to be converted.
- Due to an internal issue in the RKLLM toolchain/runtime, the output tensors of both LLMs come out at one quarter of the correct values. Multiplying them by 4 manually restores the correct result.
- ~~Since the RKNN toolchain currently does not support multi-core data-parallel inference over batches for models with non-4D inputs, the script runs CFG as two separate passes, which is relatively slow.~~ (Solved)
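The first and third issues above could be handled in calling code roughly as follows. This is a sketch under stated assumptions, not the script's actual implementation: `rescale_lm_output` and `exceeds_step_budget` are hypothetical names, the factor 4 comes from the issue description, and the step-budget thresholds are untuned placeholders that do not reproduce the original project's detection logic.

```python
import numpy as np

RKLLM_OUTPUT_SCALE = 4.0  # factor reported in the known issue above


def rescale_lm_output(hidden: np.ndarray) -> np.ndarray:
    """Undo the one-quarter scaling on an RKLLM output tensor."""
    # Cast to float32 first so the multiplication does not overflow FP16.
    return hidden.astype(np.float32) * RKLLM_OUTPUT_SCALE


def exceeds_step_budget(step: int, num_text_tokens: int,
                        steps_per_token: float = 6.0,
                        min_steps: int = 50) -> bool:
    """Hypothetical loop guard: stop once generation runs far past the
    length expected for the input text. Both thresholds are assumptions."""
    budget = max(min_steps, int(num_text_tokens * steps_per_token))
    return step >= budget


raw = np.array([0.25, -0.5], dtype=np.float16)  # stand-in for an LLM output
print(rescale_lm_output(raw))

# Inside a generation loop one could break when the budget is exhausted.
for step in range(2000):
    if exceeds_step_budget(step, num_text_tokens=20):
        print(f"stopping at step {step}")
        break
```

A length-based budget is only one possible heuristic. It trades a small risk of cutting off slow speech for a guaranteed upper bound on runtime.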
## References
- [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)