---
license: agpl-3.0
language:
- en
- zh
base_model:
- openbmb/VoxCPM-0.5B
pipeline_tag: text-to-speech
tags:
- rknn
- rkllm
- text-to-speech
- speech
- speech generation
- voice cloning
---

# VoxCPM-0.5B-RKNN2

![Model architecture](model_structure.jpg)



VoxCPM is an innovative tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.

Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.

- Inference speed (RKNN2): RTF of about 4.5 on RK3588 (about 45 s of inference to generate 10 s of audio)
- Memory usage (RKNN2): about 3.3 GB
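
The real-time factor (RTF) quoted above is inference time divided by the duration of the generated audio. A minimal sketch of that arithmetic, using the figures from this README (`real_time_factor` is an illustrative name, not something from the repository):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Return inference time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Figures from this README: about 45 s of inference for 10 s of audio.
rtf = real_time_factor(45.0, 10.0)
print(f"RTF = {rtf:.1f}")
```

An RTF above 1 means generation is slower than real time, so this port is better suited to offline synthesis than to streaming.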

## Usage

1. Clone the project locally

2. Install dependencies

```bash
pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
```

3. Run

```bash
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "Wow, this model actually runs perfectly on the RK3588 SoC!" --prompt-audio basic_ref_zh.wav --prompt-text "对，这就是我，万人敬仰的太乙真人。" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234
```

Optional parameters:
- `--text`: Text to generate
- `--prompt-audio`: Reference audio path (for voice cloning)
- `--prompt-text`: Text corresponding to the reference audio (required when using reference audio)
- `--cfg-value`: CFG guidance strength, default 2.0
- `--inference-timesteps`: Number of diffusion steps, default 10
- `--seed`: Random seed
- `--output`: Output audio path
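
For orientation, `--cfg-value` is a classifier-free guidance (CFG) scale. The sketch below shows the textbook CFG combination of a conditional and an unconditional prediction. It illustrates the general technique under that assumption and is not taken from `onnx_infer-rknn2.py`, which may combine the branches differently. `apply_cfg` is a hypothetical helper, and plain lists stand in for the real tensors.

```python
def apply_cfg(cond, uncond, cfg_value=2.0):
    """Textbook CFG: move the prediction away from the unconditional branch.

    cfg_value = 1.0 returns the conditional prediction unchanged;
    larger values weight the conditioning more strongly.
    """
    return [u + cfg_value * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]     # prediction with text/prompt conditioning (toy values)
uncond = [0.5, 1.0]   # prediction without conditioning (toy values)
guided = apply_cfg(cond, uncond, cfg_value=2.0)
print(guided)
```

Because both branches are evaluated at every diffusion step, the cost of CFG scales with `--inference-timesteps`.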

## Performance


```log
> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop:   0%|                                                                                 | 0/2000 [00:00<?, ?it/s][time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|█████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 ms█████████████████████████████████████▏                         | 7/10 [00:00<00:00, 61.20it/s]
gen_loop:   0%|                                                                         | 1/2000 [00:00<09:43,  3.42it/s][time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|█████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 ms█████████████████████████████████████▏                         | 7/10 [00:00<00:00, 67.27it/s]
gen_loop:   0%|                                                                         | 2/2000 [00:00<09:25,  3.53it/s][time] lm_to_dit: 0.74 ms

...

[time] res_to_dit: 0.59 ms
100%|█████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 ms█████████████████████████████████████▏                         | 7/10 [00:00<00:00, 67.34it/s]
gen_loop:   6%|████▎                                                                  | 123/2000 [00:34<08:47,  3.56it/s][time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|█████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 ms█████████████████████████████████████▏                         | 7/10 [00:00<00:00, 67.43it/s]
gen_loop:   6%|████▎                                                                  | 123/2000 [00:34<08:47,  3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav
```
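
The `[time]` lines in the log above can be aggregated to see where the time goes. A minimal sketch, assuming the `[time] <stage>: <value> ms` format shown in the log; `sum_stage_times` is a hypothetical helper and not part of the repository:

```python
import re
from collections import defaultdict

# Matches entries such as "[time] vae_encode_0: 1502.91 ms". A trailing
# numeric suffix (presumably a chunk offset) is dropped so that all chunks
# of one stage are pooled under a single name.
TIME_RE = re.compile(r"\[time\] ([A-Za-z_ ]+?)(?:_\d+)?: ([0-9.]+) ms")


def sum_stage_times(log_text: str) -> dict:
    """Return total milliseconds per stage name found in the log text."""
    totals = defaultdict(float)
    for stage, ms in TIME_RE.findall(log_text):
        totals[stage] += float(ms)
    return dict(totals)


# A few lines copied from the log above.
sample = """\
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] base_lm initial: 549.21 ms
"""
for stage, total_ms in sum_stage_times(sample).items():
    print(f"{stage}: {total_ms:.2f} ms")
```

The pattern is searched anywhere in a line, so it also picks up `[time]` entries that share a line with a progress bar.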

## Model Conversion

See https://huggingface.co/happyme531/VoxCPM-0.5B-RKNN2/tree/main/convert 

## Known Issues

- In some cases, speech generation can get stuck in an endless loop. The original project appears to have a mechanism for detecting this, but it is not implemented here.
- Due to an internal issue in the RKNN toolchain, a single locenc model cannot be configured with two sets of input shapes for the two input lengths, so two separate models are converted instead.
- Due to an internal issue in the RKLLM toolchain/runtime, the output tensor values of both LLMs are only one quarter of the correct values. Manually multiplying them by 4 gives the correct result.
- ~~Since the RKNN toolchain currently does not support data-parallel inference across multiple NPU cores for multi-batch models with non-4D inputs, CFG in the script is performed in two separate passes, which is relatively slow.~~ (Fixed)
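
The RKLLM scaling workaround described above amounts to multiplying each LLM output by 4 before it is passed downstream. A minimal sketch of that step; `RKLLM_OUTPUT_SCALE` and `rescale_llm_output` are hypothetical names, and a plain list stands in for the real output tensor:

```python
# Empirical factor from the known issue above: RKLLM outputs come back at
# one quarter of the expected magnitude.
RKLLM_OUTPUT_SCALE = 4.0


def rescale_llm_output(hidden):
    """Undo the 1/4 scaling on one LLM output vector."""
    return [value * RKLLM_OUTPUT_SCALE for value in hidden]


raw = [0.25, -0.5, 1.0]          # toy values as returned by the runtime
fixed = rescale_llm_output(raw)  # values handed to the next stage
print(fixed)
```

Keeping the factor in one named constant makes it easy to remove if a later RKLLM release fixes the scaling.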

## References
- [openbmb/VoxCPM-0.5B](https://huggingface.co/openbmb/VoxCPM-0.5B)
- [0seba/VoxCPMANE](https://github.com/0seba/VoxCPMANE)
- [bluryar/VoxCPM-ONNX](https://github.com/bluryar/VoxCPM-ONNX)