---
license: mit
language:
- en
- zh
base_model:
- CosyVoice2
pipeline_tag: text-to-speech
library_name: transformers
tags:
- CosyVoice2
- Speech
---

# CosyVoice2  
This version of CosyVoice2 has been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 4.2

## Conversion tool links
If you are interested in model conversion, you can export the axmodel starting from the original repo:
[CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) 

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Cosyvoice2.Axera) 

## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Speech Generation**  
| Stage | Time |
|------|------|
| llm prefill ( input_token_num + prompt_token_num in [0, 128] ) | 104 ms |
| llm prefill ( input_token_num + prompt_token_num in [128, 256] ) | 234 ms |
| Decode | 21.24 token/s |
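As a rough, back-of-envelope check, the figures above can be combined into an end-to-end LLM latency estimate. This is a sketch: the 271 decoded tokens come from the sample run log further down, and Token2Wav time is not included.

```python
# Rough end-to-end LLM latency estimate from the table above.
# The split into one prefill plus decode is an assumption and
# ignores Token2Wav processing time.
prefill_ms = 234        # prefill bucket for [128, 256] input tokens
decode_rate = 21.24     # tokens per second
decoded_tokens = 271    # total decode tokens from the sample run

total_s = prefill_ms / 1000 + decoded_tokens / decode_rate
print(f"estimated LLM time: ~{total_s:.1f} s")
```

This lands near 13 s, consistent with the throughput the runtime reports at the end of the sample run.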

## How to use

Download all files from this repository to the device  

### 1. Prepare  

#### 1.1 Copy this project to AX650 Board  

#### 1.2 Prepare Dependencies  

**Running HTTP Tokenizer Server** and **Processing Prompt Speech** require these Python packages. If you run these two steps on a PC, install the packages on the PC.  
```
pip3 install -r scripts/requirements.txt
```  

### 2. Start HTTP Tokenizer Server  
```
cd scripts
python cosyvoice2_tokenizer.py --host {your host} --port {your port}   
```
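Before starting the device-side runtime, you may want to confirm that the tokenizer server is reachable from the device. A minimal stdlib sketch (this helper is an illustration, not part of the repo; the host and port are whatever you passed above):

```python
# Hypothetical helper: check that the tokenizer server's TCP port
# is accepting connections from this machine.
import socket

def server_up(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `server_up("10.122.86.184", 12345)` should return `True` once the server from this step is listening, matching the `connect http://10.122.86.184:12345 ok` line in the sample run.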


### 3. Run on Axera Device
There are three kinds of device: AX650 Board, AXCL aarch64 Board, and AXCL x86 Board.

#### 3.1 Run on AX650 Board  
1) Modify the HTTP host in `run_ax650.sh`.  

2) Run `run_ax650.sh`  
```shell
root@ax650 ~/Cosyvoice2 # bash run_ax650.sh 
rm: cannot remove 'output*.wav': No such file or directory
[I][                            Init][ 108]: LLM init start
[I][                            Init][  34]: connect http://10.122.86.184:12345 ok
bos_id: 0, eos_id: 1773
  7% | β–ˆβ–ˆβ–ˆ                               |   2 /  27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][                            Init][ 138]: attr.axmodel_num:24
100% | β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ |  27 /  27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok,remain_cmm(7178 MB)
[I][                            Init][ 216]: max_token_len : 1023
[I][                            Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 229]: prefill_token_num : 128
[I][                            Init][ 233]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 233]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 233]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 233]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 233]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 237]: prefill_max_token_num : 512
[I][                            Init][ 249]: LLM init ok
[I][                            Init][ 154]: Token2Wav init ok
[I][                            main][ 273]: 
[I][                             Run][ 388]: input token num : 142, prefill_split_num : 2
[I][                             Run][ 422]: input_num_token:128
[I][                             Run][ 422]: input_num_token:14
[I][                             Run][ 607]: ttft: 236.90 ms
[Main/Token2Wav Thread] Processing batch of 28 tokens...
Successfully saved audio to output_0.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 53 tokens...
Successfully saved audio to output_1.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_2.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_3.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_4.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_5.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_6.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_7.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_8.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_9.wav (32-bit Float PCM).
[I][                             Run][ 723]: hit eos, llm finished
[I][                             Run][ 753]: llm finished
[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.


[I][                             Run][ 758]: total decode tokens:271
[N][                             Run][ 759]: hit eos,avg 21.47 token/s

Successfully saved audio to output_10.wav (32-bit Float PCM).
Successfully saved audio to output.wav (32-bit Float PCM).

Voice generation pipeline completed.
Type "q" to exit, Ctrl+c to stop current running
text >> 
```

Output Speech:
[output.wav](asset/output.wav)
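The log reports `32-bit Float PCM` output. If you want to verify a generated file's format without extra dependencies, a small header-parsing sketch (an illustration, not part of the repo):

```python
# Hypothetical helper: read the fmt chunk of a WAV file. An audio_format
# of 3 means IEEE float, i.e. the 32-bit Float PCM the runtime reports.
import struct

def wav_format(path):
    """Return (audio_format, channels, sample_rate, bits_per_sample)."""
    with open(path, "rb") as f:
        riff, _, wave_tag = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave_tag == b"WAVE", "not a WAV file"
        while True:  # walk the chunks until we hit "fmt "
            chunk_id, size = struct.unpack("<4sI", f.read(8))
            if chunk_id == b"fmt ":
                fmt = f.read(size)
                audio_format, channels, rate = struct.unpack_from("<HHI", fmt, 0)
                bits = struct.unpack_from("<H", fmt, 14)[0]
                return audio_format, channels, rate, bits
            f.seek(size, 1)
```

`wav_format("output.wav")` should report audio format 3 for the float output files above.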


#### Or run on AX650 Board with Gradio GUI
1) Start server
```
bash run_api_ax650.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```

#### 3.2 Run on AXCL aarch64 Board
```
bash run_axcl_aarch64.sh
```
#### Or run on AXCL aarch64 Board with Gradio GUI 
1) Start server
```
bash run_api_axcl_aarch64.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```
3) Open the page from a browser  
The page URL is `http://{your device ip}:7860`  

Note that you need to run these two commands in the project root directory.

#### 3.3 Run on AXCL x86 Board
```
bash run_axcl_x86.sh
```
#### Or run on AXCL x86 Board with Gradio GUI
1) Start server
```
bash run_api_axcl_x86.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```
3) Open the page from a browser  
The page URL is `http://{your device ip}:7860`  

Note that you need to run these two commands in the project root directory.

![](./gradio.png)

### Optional. Process Prompt Speech  
If you want to replicate a specific voice, perform this step.  
You can use the audio files in `asset/`.  

#### (1) Download wetext  
```
pip3 install modelscope
modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext
```

#### (2) Process Prompt Speech  
Example:
```
python3 scripts/process_prompt.py --prompt_text  asset/zh_man1.txt --prompt_speech asset/zh_man1.wav --output zh_man1
```

Adjust the parameters to your setup:
```
python3 scripts/process_prompt.py -h

usage: process_prompt.py [-h] [--model_dir MODEL_DIR] [--wetext_dir WETEXT_DIR] [--sample_rate SAMPLE_RATE] [--prompt_text PROMPT_TEXT] [--prompt_speech PROMPT_SPEECH]
                         [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --model_dir MODEL_DIR
                        tokenizer configuration directory
  --wetext_dir WETEXT_DIR
                        path to wetext
  --sample_rate SAMPLE_RATE
                        Sampling rate for prompt audio
  --prompt_text PROMPT_TEXT
                        The text content of the prompt(reference) audio. Text or file path.
  --prompt_speech PROMPT_SPEECH
                        The path to prompt(reference) audio.
  --output OUTPUT       Output data storage directory
```

After executing the above command, files like the following will be generated:
```
flow_embedding.txt  
flow_prompt_speech_token.txt  
llm_embedding.txt  
llm_prompt_speech_token.txt  
prompt_speech_feat.txt  
prompt_text.txt
```
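The exact layout of these files is determined by `process_prompt.py`. Assuming whitespace-separated numeric text with one row per line (an assumption to verify against what the script actually writes), a minimal loader sketch for a quick sanity check:

```python
# Hypothetical helper: load one of the generated .txt files as a list of
# float rows. Assumes whitespace-separated numbers, one row per line --
# check this against the output of process_prompt.py.
def load_txt_matrix(path):
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                rows.append([float(x) for x in line.split()])
    return rows
```

For instance, `load_txt_matrix("zh_man1/prompt_speech_feat.txt")` would let you confirm the feature matrix has the shape you expect before passing the directory to the runtime.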

When you run `run_ax650.sh`, pass this output directory to the script's `prompt_files` parameter.