---
license: agpl-3.0
language:
  - en
  - zh
base_model:
  - openbmb/VoxCPM-0.5B
pipeline_tag: text-to-speech
tags:
  - rknn
  - rkllm
  - text-to-speech
  - speech
  - speech generation
  - voice cloning
---

VoxCPM-0.5B-RKNN2

VoxCPM is an innovative tokenizer-free text-to-speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and achieves two core capabilities: context-aware speech generation and realistic zero-shot voice cloning.

Unlike mainstream approaches that convert speech into discrete tokens, VoxCPM adopts an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and FSQ constraints, greatly enhancing expressiveness and generation stability.

Performance

  • ๆŽจ็†้€Ÿๅบฆ(RKNN2)๏ผšRK3588ไธŠRTF็บฆ4.5๏ผˆ็”Ÿๆˆ10s้Ÿณ้ข‘้œ€่ฆๆŽจ็†45s๏ผ‰
  • ๅคง่‡ดๅ†…ๅญ˜ๅ ็”จ(RKNN2)๏ผš็บฆ3.3GB
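The real-time factor (RTF) quoted above is simply inference time divided by the duration of the generated audio; a quick check with the figures from this benchmark:

```python
# RTF = inference time / audio duration.
# Values are the benchmark figures quoted above (RK3588, RKNN2 backend).
inference_seconds = 45.0
audio_seconds = 10.0

rtf = inference_seconds / audio_seconds
print(f"RTF = {rtf:.1f}")  # RTF = 4.5 (values above 1.0 are slower than real time)
```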

Usage

  1. Clone the project locally

  2. Install dependencies

pip install numpy scipy soundfile tqdm transformers sentencepiece ztu-somemodelruntime-ez-rknn-async
  1. ่ฟ่กŒ
python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

ๅฏ้€‰ๅ‚ๆ•ฐ๏ผš

  • --text: Text to synthesize
  • --prompt-audio: Reference audio path (for voice cloning)
  • --prompt-text: Transcript of the reference audio (required when a reference audio is given)
  • --cfg-value: CFG guidance strength, default 2.0
  • --inference-timesteps: Number of diffusion steps, default 10
  • --seed: Random seed
  • --output: Output audio path
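For scripted or batch generation, the documented flags can also be assembled programmatically. A minimal sketch; the flag names and defaults are taken from the command above, while the `build_cmd` helper itself is only illustrative and not part of the project:

```python
import subprocess

def build_cmd(text, prompt_audio, prompt_text, output,
              cfg_value=2.0, timesteps=10, seed=1234):
    """Assemble an onnx_infer-rknn2.py invocation from the documented flags."""
    return [
        "python", "onnx_infer-rknn2.py",
        "--onnx-dir", ".", "--tokenizer-dir", ".",
        "--base-hf-dir", ".", "--residual-hf-dir", ".",
        "--text", text,
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "--output", output,
        "--cfg-value", str(cfg_value),
        "--inference-timesteps", str(timesteps),
        "--seed", str(seed),
    ]

cmd = build_cmd("Hello!", "basic_ref_zh.wav", "reference transcript", "out.wav")
# subprocess.run(cmd, check=True)  # uncomment to actually run on the device
```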

่ฟ่กŒๆ•ˆๆžœ

> python onnx_infer-rknn2.py --onnx-dir . --tokenizer-dir . --base-hf-dir . --residual-hf-dir . --text "ๅ“‡, ่ฟ™ไธชๆจกๅž‹ๅฑ…็„ถๅœจRK3588่ฟ™ไธช่พฃ้ธกSoCไธŠไนŸ่ƒฝๅฎŒ็พŽ่ฟ่กŒ!" --prompt-audio basic_ref_zh.wav --prompt-text "ๅฏน๏ผŒ่ฟ™ๅฐฑๆ˜ฏๆˆ‘๏ผŒไธ‡ไบบๆ•ฌไปฐ็š„ๅคชไน™็œŸไบบใ€‚" --output rknn_output.wav --cfg-value 2.0 --inference-timesteps 10 --seed 1234

I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./base_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 1, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./residual_lm.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: FP16
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
[time] vae_encode_0: 1502.91 ms
[time] vae_encode_38400: 1443.79 ms
[time] vae_encode_76800: 1418.36 ms
[time] locenc_0: 820.25 ms
[time] locenc_64: 814.78 ms
[time] locenc_128: 815.60 ms
[time] base_lm initial: 549.21 ms
[time] fsq_init_0: 5.34 ms
[time] fsq_init_64: 3.95 ms
[time] fsq_init_128: 4.17 ms
[time] residual_lm initial: 131.22 ms
gen_loop:   0%|          | 0/2000 [00:00<?, ?it/s]
[time] lm_to_dit: 1.26 ms
[time] res_to_dit: 1.01 ms
100%|██████████| 10/10 [00:00<00:00, 60.13it/s]
[time] locenc_step: 16.43 ms
gen_loop:   0%|          | 1/2000 [00:00<09:43,  3.42it/s]
[time] lm_to_dit: 0.75 ms
[time] res_to_dit: 0.55 ms
100%|██████████| 10/10 [00:00<00:00, 63.99it/s]
[time] locenc_step: 15.93 ms
gen_loop:   0%|          | 2/2000 [00:00<09:25,  3.53it/s]
[time] lm_to_dit: 0.74 ms

...

[time] res_to_dit: 0.59 ms
100%|██████████| 10/10 [00:00<00:00, 64.19it/s]
[time] locenc_step: 15.73 ms
gen_loop:   6%|▌         | 123/2000 [00:34<08:47,  3.56it/s]
[time] lm_to_dit: 0.76 ms
[time] res_to_dit: 0.56 ms
100%|██████████| 10/10 [00:00<00:00, 64.08it/s]
[time] locenc_step: 15.82 ms
gen_loop:   6%|▌         | 123/2000 [00:34<08:47,  3.56it/s]
[time] vae_decode_0: 1153.02 ms
[time] vae_decode_60: 1102.36 ms
[time] vae_decode_120: 1105.00 ms
[time] vae_decode_180: 1105.60 ms
[time] vae_decode_240: 1082.36 ms
Saved: rknn_output.wav
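Outside the generation loop, most of the fixed wall-clock cost comes from the VAE encode and decode passes. Tallying the per-call timings from the log above:

```python
# Per-call timings (ms) copied from the run log above.
vae_encode_ms = [1502.91, 1443.79, 1418.36]
vae_decode_ms = [1153.02, 1102.36, 1105.00, 1105.60, 1082.36]

encode_total = sum(vae_encode_ms)
decode_total = sum(vae_decode_ms)
print(f"VAE encode total: {encode_total:.2f} ms")  # 4365.06 ms
print(f"VAE decode total: {decode_total:.2f} ms")  # 5548.34 ms
```

Roughly 10 s of fixed VAE overhead per run, so short utterances pay a proportionally larger startup cost than long ones.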

ๆจกๅž‹่ฝฌๆข

TODO: documentation to be added.

Known Issues

  • ๆŸไบ›ๆƒ…ๅ†ตไธ‹่ฏญ้Ÿณ็”Ÿๆˆๅฏ่ƒฝ้™ทๅ…ฅๆญปๅพช็Žฏ๏ผŒๅŽŸ้กน็›ฎไผผไนŽๆœ‰ๆฃ€ๆต‹ๆญปๅพช็Žฏ็š„ๆœบๅˆถ๏ผŒไฝ†ๆˆ‘่ฟ™้‡Œๆฒกๆœ‰ๅฎž็Žฐใ€‚
  • ็”ฑไบŽRKNNๅทฅๅ…ท้“พ็š„ๅ†…้ƒจ้—ฎ้ข˜๏ผŒlocencๆจกๅž‹ๆฒกๆœ‰ๅŠžๆณ•ๅœจไธ€ไธชๆจกๅž‹้‡Œ้…็ฝฎไธค็ง่พ“ๅ…ฅ้•ฟๅบฆ็š„ไธค็ป„shape๏ผŒๅ› ๆญคๅช่ƒฝๅ•็‹ฌ่ฝฌๆขไธคไธชๆจกๅž‹ใ€‚
  • ็”ฑไบŽRKLLMๅทฅๅ…ท้“พ/่ฟ่กŒๆ—ถ็š„ๅ†…้ƒจ้—ฎ้ข˜๏ผŒไธคไธชLLM็š„่พ“ๅ‡บๅผ ้‡็š„ๆ•ฐๅ€ผ้ƒฝๅชๆœ‰ๆญฃ็กฎ็ป“ๆžœ็š„ๅ››ๅˆ†ไน‹ไธ€๏ผŒๆ‰‹ๅŠจไน˜4ไน‹ๅŽๅฏไปฅๅพ—ๅˆฐๆญฃ็กฎ็ป“ๆžœใ€‚
  • ็”ฑไบŽRKNNๅทฅๅ…ท้“พ็›ฎๅ‰ไธๆ”ฏๆŒ้ž4็ปด่พ“ๅ…ฅๆจกๅž‹ๅคšbatchไฝฟ็”จๅคšNPUๆ ธ็š„ๆ•ฐๆฎๅนถ่กŒๆŽจ็†๏ผŒ่„šๆœฌไธญCFGๆ˜ฏๅˆ†ไธคๆฌกๅ•็‹ฌ่ฟ›่กŒ็š„๏ผŒ้€Ÿๅบฆ่พƒๆ…ขใ€‚(ๅทฒไฟฎๅค)
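The quarter-scale RKLLM output mentioned above is compensated for by scaling the tensor back up before it is consumed downstream. An illustrative sketch; the function and variable names are hypothetical, not taken from the project:

```python
import numpy as np

# Empirical correction for the RKLLM FP16 output bug described above:
# the runtime returns values at 1/4 of their true magnitude.
RKLLM_OUTPUT_SCALE = 4.0

def correct_rkllm_output(raw_hidden: np.ndarray) -> np.ndarray:
    """Scale an RKLLM output tensor back to its true magnitude."""
    return raw_hidden * RKLLM_OUTPUT_SCALE

raw = np.array([0.25, -0.5, 1.0], dtype=np.float32)
corrected = correct_rkllm_output(raw)  # each element scaled by 4: [1.0, -2.0, 4.0]
```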

ๅ‚่€ƒ
