---

license: mit
language:
- en
- zh
base_model:
- CosyVoice3
pipeline_tag: text-to-speech
library_name: transformers
tags:
- CosyVoice3
- Speech
---


# CosyVoice3  
This version of CosyVoice3 has been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 4.2

## Conversion tool links
If you are interested in model conversion, you can export the axmodel yourself from the original repo:
[CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) 

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/CosyVoice3.Axera) 

## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Speech Generation**  
| Stage | Time |
|------|------|
| llm prefill ( input_token_num + prompt_token_num in [0, 128] ) | 104 ms |
| llm prefill ( input_token_num + prompt_token_num in [128, 256] ) | 160 ms |
| Decode | 14 token/s |
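As a rough sanity check, the table above lets you estimate end-to-end generation time. The sketch below is a back-of-the-envelope calculation only; treating prefill as a fixed per-bucket cost and decode as a constant 14 token/s is a simplification (real runs can decode faster, as the sample log further down shows).

```python
# Back-of-the-envelope latency estimate from the benchmark table above.
# Assumption: prefill cost is constant within each token bucket.
PREFILL_MS = {"0-128": 104, "128-256": 160}
DECODE_TOKENS_PER_S = 14

def estimate_ms(input_tokens: int, output_tokens: int) -> float:
    """Approximate total generation time in milliseconds."""
    prefill = PREFILL_MS["0-128"] if input_tokens < 128 else PREFILL_MS["128-256"]
    decode = output_tokens / DECODE_TOKENS_PER_S * 1000
    return prefill + decode

# 142 input tokens and 271 decode tokens, as in the sample run below:
print(round(estimate_ms(142, 271)))  # prints 19517 (about 19.5 s)
```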

## How to use

Download all files from this repository to the device.  

### 1. Prepare  

#### 1.1 Copy this project to AX650 Board  

#### 1.2 Prepare Dependencies  

**Running HTTP Tokenizer Server** and **Processing Prompt Speech** require these Python packages. If you run these two steps on a PC, install them on the PC.  
```
pip3 install -r scripts/requirements.txt
```

### 2. Start HTTP Tokenizer Server  
```
cd scripts
python cosyvoice3_tokenizer.py --host {your host} --port {your port}
```
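The device-side runtime connects to this server over HTTP, so it is worth confirming the port is reachable from the board before starting the runtime. A minimal stdlib check (the host and port are placeholders for whatever you passed to `cosyvoice3_tokenizer.py`):

```python
import socket

def server_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the tokenizer server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Substitute the host/port you started the tokenizer server with, e.g.:
# print(server_reachable("10.122.86.184", 12345))
```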


### 3. Run on Axera Device
There are three kinds of devices: the AX650 Board, the AXCL aarch64 Board, and the AXCL x86 Board.

#### 3.1 Run on AX650 Board  
1) Modify the HTTP host in `run_ax650.sh`.  

2) Run `run_ax650.sh`  
```shell

root@ax650 ~/CosyVoice3 # bash run_ax650.sh 
rm: cannot remove 'output*.wav': No such file or directory
[I][                            Init][ 108]: LLM init start
[I][                            Init][  34]: connect http://10.122.86.184:12345 ok
bos_id: 0, eos_id: 1773
  7% | ███                               |   2 /  27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][                            Init][ 138]: attr.axmodel_num:24
100% | █████████████████████████████████ |  27 /  27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok,remain_cmm(7178 MB)
[I][                            Init][ 216]: max_token_len : 1023
[I][                            Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 229]: prefill_token_num : 128
[I][                            Init][ 233]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 233]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 233]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 233]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 233]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 237]: prefill_max_token_num : 512
[I][                            Init][ 249]: LLM init ok
[I][                            Init][ 154]: Token2Wav init ok
[I][                            main][ 273]: 
[I][                             Run][ 388]: input token num : 142, prefill_split_num : 2
[I][                             Run][ 422]: input_num_token:128
[I][                             Run][ 422]: input_num_token:14
[I][                             Run][ 607]: ttft: 236.90 ms
[Main/Token2Wav Thread] Processing batch of 28 tokens...
Successfully saved audio to output_0.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 53 tokens...
Successfully saved audio to output_1.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_2.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_3.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_4.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_5.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_6.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_7.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_8.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_9.wav (32-bit Float PCM).
[I][                             Run][ 723]: hit eos, llm finished
[I][                             Run][ 753]: llm finished
[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.
[I][                             Run][ 758]: total decode tokens:271
[N][                             Run][ 759]: hit eos,avg 21.47 token/s
Successfully saved audio to output_10.wav (32-bit Float PCM).
Successfully saved audio to output.wav (32-bit Float PCM).
Voice generation pipeline completed.
Type "q" to exit, Ctrl+c to stop current running
text >> 
```

Output Speech:
[output.wav](asset/output.wav)
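The runtime writes 32-bit float PCM WAV files, which some players and tools do not handle. If a file is rejected, you can confirm the encoding by reading the `fmt ` chunk of the WAV header; a minimal sketch using only the standard library (format code 3 means IEEE float, 1 means integer PCM):

```python
import struct

def wav_format(path: str) -> tuple[int, int, int]:
    """Return (format_code, channels, sample_rate) from a WAV file header."""
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave_id == b"WAVE", "not a WAV file"
        while True:  # walk chunks until 'fmt ' is found
            chunk_id, size = struct.unpack("<4sI", f.read(8))
            if chunk_id == b"fmt ":
                return struct.unpack("<HHI", f.read(8))
            f.seek(size + (size & 1), 1)  # chunks are word-aligned

# On the device, after generation:
# print(wav_format("output.wav"))  # format_code 3 => 32-bit float PCM
```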


#### Or run on AX650 Board with Gradio GUI 
1) Start server
```
bash run_api_ax650.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```

#### 3.2 Run on AXCL aarch64 Board
```
bash run_axcl_aarch64.sh
```
#### Or run on AXCL aarch64 Board with Gradio GUI 
1) Start server
```
bash run_api_axcl_aarch64.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```
3) Open the page in a browser  
The page URL is `http://{your device ip}:7860`.  

Note that you need to run these two commands in the project root directory.

#### 3.3 Run on AXCL x86 Board
```
bash run_axcl_x86.sh
```
#### Or run on AXCL x86 Board with Gradio GUI 
1) Start server
```
bash run_api_axcl_x86.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```
3) Open the page in a browser  
The page URL is `http://{your device ip}:7860`.  

Note that you need to run these two commands in the project root directory.

![](./gradio.png)

### Optional. Process Prompt Speech    
If you want to clone a specific voice, perform this step.  
You can use the audio files in `asset/`.  

#### (1). Download wetext  
```
pip3 install modelscope
modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext
```

#### (2). Process Prompt Speech  
Example:
```
python3 scripts/process_prompt.py --prompt_text asset/zh_man1.txt --prompt_speech asset/zh_man1.wav --output zh_man1
```

Pass the parameters according to your actual setup.
```
python3 scripts/process_prompt.py -h

usage: process_prompt.py [-h] [--model_dir MODEL_DIR] [--wetext_dir WETEXT_DIR] [--sample_rate SAMPLE_RATE] [--prompt_text PROMPT_TEXT] [--prompt_speech PROMPT_SPEECH]
                         [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --model_dir MODEL_DIR
                        tokenizer configuration directory
  --wetext_dir WETEXT_DIR
                        path to wetext
  --sample_rate SAMPLE_RATE
                        Sampling rate for prompt audio
  --prompt_text PROMPT_TEXT
                        The text content of the prompt (reference) audio. Text or file path.
  --prompt_speech PROMPT_SPEECH
                        The path to the prompt (reference) audio.
  --output OUTPUT       Output data storage directory
```

After executing the above command, files like the following will be generated:
```
flow_embedding.txt
flow_prompt_speech_token.txt
llm_embedding.txt
llm_prompt_speech_token.txt
prompt_speech_feat.txt
prompt_text.txt
```
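If you want to inspect these files from Python, a loader like the one below may help. Note that the file layout (whitespace-separated float values) is an assumption here, not something confirmed by the repo; adjust it to whatever `process_prompt.py` actually writes.

```python
# Hypothetical reader for the generated prompt files.
# Assumption: each file contains whitespace-separated float values.
def load_floats(path: str) -> list[float]:
    with open(path) as f:
        return [float(tok) for tok in f.read().split()]

# e.g. emb = load_floats("zh_man1/llm_embedding.txt")
```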

When you run `run_ax650.sh`, pass the output directory generated here to the `prompt_files` parameter of the script.