---

license: mit
language:
- en
- zh
base_model:
- CosyVoice3
pipeline_tag: text-to-speech
library_name: transformers
tags:
- CosyVoice3
- Speech
---


# CosyVoice3  
This version of CosyVoice3 has been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 4.2

## Conversion tool links
If you are interested in model conversion, you can export the axmodel yourself from the original repo:
[CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) 

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/CosyVoice3.Axera) 

## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Speech Generation**  
| Stage | Time |
|------|------|
| llm prefill ( input_token_num + prompt_token_num in [0, 128] ) | 104 ms |
| llm prefill ( input_token_num + prompt_token_num in [128, 256] ) | 160 ms |
| Decode | 14 token/s |
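As a rough sanity check, the table above lets you estimate end-to-end generation time. The sketch below is a back-of-the-envelope calculation only; treating prefill as a fixed per-bucket cost and decode as a constant 14 token/s is a simplification (real runs can decode faster, as the sample log further down shows).

```python
# Back-of-the-envelope latency estimate from the benchmark table above.
# Assumption: prefill cost is constant within each token bucket.
PREFILL_MS = {"0-128": 104, "128-256": 160}
DECODE_TOKENS_PER_S = 14

def estimate_ms(input_tokens: int, output_tokens: int) -> float:
    """Approximate total generation time in milliseconds."""
    prefill = PREFILL_MS["0-128"] if input_tokens < 128 else PREFILL_MS["128-256"]
    decode = output_tokens / DECODE_TOKENS_PER_S * 1000
    return prefill + decode

# 142 input tokens and 271 decode tokens, as in the sample run below:
print(round(estimate_ms(142, 271)))  # prints 19517 (about 19.5 s)
```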

## How to use

Download all files from this repository to the device.  

### 1. Prepare  

#### 1.1 Copy this project to AX650 Board  

#### 1.2 Prepare Dependencies  

**Running HTTP Tokenizer Server** and **Processing Prompt Speech** require these Python packages. If you run these two steps on a PC, install them on the PC.  
```
pip3 install -r scripts/requirements.txt
```

### 2. Start HTTP Tokenizer Server  
```
cd scripts
python cosyvoice3_tokenizer.py --host {your host} --port {your port}
```
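The device-side runtime connects to this server over HTTP, so it is worth confirming the port is reachable from the board before starting the runtime. A minimal stdlib check (the host and port are placeholders for whatever you passed to `cosyvoice3_tokenizer.py`):

```python
import socket

def server_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the tokenizer server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Substitute the host/port you started the tokenizer server with, e.g.:
# print(server_reachable("10.122.86.184", 12345))
```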


### 3. Run on Axera Device
There are three kinds of devices: the AX650 Board, the AXCL aarch64 Board, and the AXCL x86 Board.

#### 3.1 Run on AX650 Board  
1) Modify the HTTP host in `run_ax650.sh`.  

2) Run `run_ax650.sh`  
```shell

root@ax650 ~/CosyVoice3 # bash run_ax650.sh 
rm: cannot remove 'output*.wav': No such file or directory
[I][                            Init][ 108]: LLM init start
[I][                            Init][  34]: connect http://10.122.86.184:12345 ok
bos_id: 0, eos_id: 1773
  7% | ███                               |   2 /  27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][                            Init][ 138]: attr.axmodel_num:24
100% | █████████████████████████████████ |  27 /  27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok,remain_cmm(7178 MB)
[I][                            Init][ 216]: max_token_len : 1023
[I][                            Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 229]: prefill_token_num : 128
[I][                            Init][ 233]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 233]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 233]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 233]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 233]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 237]: prefill_max_token_num : 512
[I][                            Init][ 249]: LLM init ok
[I][                            Init][ 154]: Token2Wav init ok
[I][                            main][ 273]: 
[I][                             Run][ 388]: input token num : 142, prefill_split_num : 2
[I][                             Run][ 422]: input_num_token:128
[I][                             Run][ 422]: input_num_token:14
[I][                             Run][ 607]: ttft: 236.90 ms
[Main/Token2Wav Thread] Processing batch of 28 tokens...
Successfully saved audio to output_0.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 53 tokens...
Successfully saved audio to output_1.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_2.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_3.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_4.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_5.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_6.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_7.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_8.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_9.wav (32-bit Float PCM).
[I][                             Run][ 723]: hit eos, llm finished
[I][                             Run][ 753]: llm finished
[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.
[I][                             Run][ 758]: total decode tokens:271
[N][                             Run][ 759]: hit eos,avg 21.47 token/s
Successfully saved audio to output_10.wav (32-bit Float PCM).
Successfully saved audio to output.wav (32-bit Float PCM).
Voice generation pipeline completed.
Type "q" to exit, Ctrl+c to stop current running
text >> 
```

Output Speech:
[output.wav](asset/output.wav)
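The runtime writes 32-bit float PCM WAV files, which some players and tools do not handle. If a file is rejected, you can confirm the encoding by reading the `fmt ` chunk of the WAV header; a minimal sketch using only the standard library (format code 3 means IEEE float, 1 means integer PCM):

```python
import struct

def wav_format(path: str) -> tuple[int, int, int]:
    """Return (format_code, channels, sample_rate) from a WAV file header."""
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave_id == b"WAVE", "not a WAV file"
        while True:  # walk chunks until 'fmt ' is found
            chunk_id, size = struct.unpack("<4sI", f.read(8))
            if chunk_id == b"fmt ":
                return struct.unpack("<HHI", f.read(8))
            f.seek(size + (size & 1), 1)  # chunks are word-aligned

# On the device, after generation:
# print(wav_format("output.wav"))  # format_code 3 => 32-bit float PCM
```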


#### Or run on AX650 Board with Gradio GUI 
1) Start server
```
bash run_api_ax650.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```

#### 3.2 Run on AXCL aarch64 Board
```
bash run_axcl_aarch64.sh
```
#### Or run on AXCL aarch64 Board with Gradio GUI 
1) Start server
```
bash run_api_axcl_aarch64.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```
3) Open the page in a browser  
The page URL is `http://{your device ip}:7860`.  

Note that you need to run these two commands in the project root directory.

#### 3.3 Run on AXCL x86 Board
```
bash run_axcl_x86.sh
```
#### Or run on AXCL x86 Board with Gradio GUI 
1) Start server
```
bash run_api_axcl_x86.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```
3) Open the page in a browser  
The page URL is `http://{your device ip}:7860`.  

Note that you need to run these two commands in the project root directory.

![](./gradio.png)

### Optional. Process Prompt Speech    
If you want to clone a specific voice, perform this step.  
You can use the audio files in `asset/`.  

#### (1). Download wetext  
```
pip3 install modelscope
modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext
```

#### (2). Process Prompt Speech  
Example:
```
python3 scripts/process_prompt.py --prompt_text asset/zh_man1.txt --prompt_speech asset/zh_man1.wav --output zh_man1
```

Pass the parameters according to your actual setup.
```
python3 scripts/process_prompt.py -h

usage: process_prompt.py [-h] [--model_dir MODEL_DIR] [--wetext_dir WETEXT_DIR] [--sample_rate SAMPLE_RATE] [--prompt_text PROMPT_TEXT] [--prompt_speech PROMPT_SPEECH]
                         [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --model_dir MODEL_DIR
                        tokenizer configuration directory
  --wetext_dir WETEXT_DIR
                        path to wetext
  --sample_rate SAMPLE_RATE
                        Sampling rate for prompt audio
  --prompt_text PROMPT_TEXT
                        The text content of the prompt (reference) audio. Text or file path.
  --prompt_speech PROMPT_SPEECH
                        The path to the prompt (reference) audio.
  --output OUTPUT       Output data storage directory
```

After executing the above command, files like the following will be generated:
```
flow_embedding.txt
flow_prompt_speech_token.txt
llm_embedding.txt
llm_prompt_speech_token.txt
prompt_speech_feat.txt
prompt_text.txt
```
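If you want to inspect these files from Python, a loader like the one below may help. Note that the file layout (whitespace-separated float values) is an assumption here, not something confirmed by the repo; adjust it to whatever `process_prompt.py` actually writes.

```python
# Hypothetical reader for the generated prompt files.
# Assumption: each file contains whitespace-separated float values.
def load_floats(path: str) -> list[float]:
    with open(path) as f:
        return [float(tok) for tok in f.read().split()]

# e.g. emb = load_floats("zh_man1/llm_embedding.txt")
```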

When you run `run_ax650.sh`, pass the output directory generated here to the `prompt_files` parameter of the script.