---
license: mit
language:
- en
- zh
base_model:
- CosyVoice2
pipeline_tag: text-to-speech
library_name: transformers
tags:
- CosyVoice2
- Speech
---

# CosyVoice2

This version of CosyVoice2 has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 4.2

## Conversion tool links

If you are interested in model conversion, you can try exporting the axmodel from the original repo:

[CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Cosyvoice2.Axera)

## Supported Platforms

- AX650
- AX650N DEMO Board
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Speech Generation**

| Stage | Latency / Throughput |
|------|------|
| LLM prefill (input_token_num + prompt_token_num in [0, 128]) | 104 ms |
| LLM prefill (input_token_num + prompt_token_num in [128, 256]) | 234 ms |
| Decode | 21.24 token/s |

## How to use

Download all files from this repository to the device.

### 1. Prepare

#### 1.1 Copy this project to the AX650 Board

#### 1.2 Prepare Dependencies

**Running HTTP Tokenizer Server** and **Processing Prompt Speech** require these Python packages. If you run these two steps on a PC, install them on the PC.
```
pip3 install -r scripts/requirements.txt
```

### 2. Start HTTP Tokenizer Server
```
cd scripts
python cosyvoice2_tokenizer.py --host {your host} --port {your port}
```

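
Before launching the runtime on the device, it can help to confirm that the tokenizer server is reachable over the network. The following is a minimal sketch that only tests whether a TCP connection can be opened; it does not call any tokenizer endpoint. The function name `tokenizer_server_reachable` and the host and port values are illustrative placeholders, not part of this project.

```python
import socket


def tokenizer_server_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder values: use the --host and --port you passed to cosyvoice2_tokenizer.py.
    print(tokenizer_server_reachable("127.0.0.1", 12345))
```

If this prints `False`, check the host and port values and any firewall between the device and the machine running the tokenizer server.
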
### 3. Run on Axera Device
There are three kinds of device: AX650 Board, AXCL aarch64 Board, and AXCL x86 Board.

#### 3.1 Run on AX650 Board
1) Modify the HTTP host in `run_ax650.sh`.

2) Run `run_ax650.sh`
```shell
root@ax650 ~/Cosyvoice2 # bash run_ax650.sh
rm: cannot remove 'output*.wav': No such file or directory
[I][ Init][ 108]: LLM init start
[I][ Init][ 34]: connect http://10.122.86.184:12345 ok
bos_id: 0, eos_id: 1773
7% | ███ | 2 / 27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][ Init][ 138]: attr.axmodel_num:24
100% | ████████████████████████████████ | 27 / 27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok,remain_cmm(7178 MB)
[I][ Init][ 216]: max_token_len : 1023
[I][ Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
[I][ Init][ 229]: prefill_token_num : 128
[I][ Init][ 233]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 233]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 233]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 233]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 233]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 237]: prefill_max_token_num : 512
[I][ Init][ 249]: LLM init ok
[I][ Init][ 154]: Token2Wav init ok
[I][ main][ 273]:
[I][ Run][ 388]: input token num : 142, prefill_split_num : 2
[I][ Run][ 422]: input_num_token:128
[I][ Run][ 422]: input_num_token:14
[I][ Run][ 607]: ttft: 236.90 ms
[Main/Token2Wav Thread] Processing batch of 28 tokens...
Successfully saved audio to output_0.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 53 tokens...
Successfully saved audio to output_1.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_2.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_3.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_4.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_5.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_6.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_7.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_8.wav (32-bit Float PCM).
[Main/Token2Wav Thread] Processing batch of 78 tokens...
Successfully saved audio to output_9.wav (32-bit Float PCM).
[I][ Run][ 723]: hit eos, llm finished
[I][ Run][ 753]: llm finished
[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.


[I][ Run][ 758]: total decode tokens:271
[N][ Run][ 759]: hit eos,avg 21.47 token/s

Successfully saved audio to output_10.wav (32-bit Float PCM).
Successfully saved audio to output.wav (32-bit Float PCM).

Voice generation pipeline completed.
Type "q" to exit, Ctrl+c to stop current running
text >>
```

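
The log above reports `input token num : 142, prefill_split_num : 2`, followed by chunks of 128 and 14 tokens, with `prefill_token_num : 128`. The sketch below reproduces that arithmetic to show how a prompt length maps to prefill chunks. It is an illustration inferred from the log only, not the runtime's actual code, and the function name `split_prefill` is hypothetical.

```python
def split_prefill(input_token_num: int, prefill_token_num: int = 128) -> list[int]:
    """Split a prompt into prefill chunks of at most `prefill_token_num` tokens.

    Assumption: chunks are filled greedily, as the sample log suggests.
    """
    if input_token_num < 0 or prefill_token_num <= 0:
        raise ValueError("token counts must be non-negative and chunk size positive")
    full_chunks, remainder = divmod(input_token_num, prefill_token_num)
    return [prefill_token_num] * full_chunks + ([remainder] if remainder else [])


print(split_prefill(142))  # [128, 14], the two chunk sizes shown in the log
```
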
Output Speech:

[output.wav](asset/output.wav)

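
The timing figures printed in the sample log (`ttft`, `total decode tokens`, and the average `token/s`) can be combined into a rough estimate of how long the LLM stage takes for one utterance. This is a back-of-the-envelope sketch: it assumes a constant decode rate and ignores Token2Wav processing time, and the function name `estimate_llm_seconds` is hypothetical.

```python
def estimate_llm_seconds(ttft_ms: float, decode_tokens: int, tokens_per_second: float) -> float:
    """Estimate LLM-stage time as time-to-first-token plus steady-rate decoding."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms / 1000.0 + decode_tokens / tokens_per_second


# Values copied from the sample log: ttft 236.90 ms, 271 decode tokens, 21.47 token/s.
print(f"{estimate_llm_seconds(236.90, 271, 21.47):.1f} s")
```

With the values from the log this comes to roughly 12.9 seconds; actual times will vary with the input text and prompt length.
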
#### Or run on AX650 Board with Gradio GUI
1) Start server
```
bash run_api_ax650.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```

#### 3.2 Run on AXCL aarch64 Board
```
bash run_axcl_aarch64.sh
```
#### Or run on AXCL aarch64 Board with Gradio GUI
1) Start server
```
bash run_api_axcl_aarch64.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```
3) Open the page in a browser

The page URL is: `https://{your device ip}:7860`

Note that you need to run these two commands in the project root directory.

#### 3.3 Run on AXCL x86 Board
```
bash run_axcl_x86.sh
```
#### Or run on AXCL x86 Board with Gradio GUI
1) Start server
```
bash run_api_axcl_x86.sh
```
2) Start Gradio GUI
```
python scripts/gradio_demo.py
```
3) Open the page in a browser

The page URL is: `https://{your device ip}:7860`

Note that you need to run these two commands in the project root directory.

![demo](asset/gradio.png)

### Optional. Process Prompt Speech
If you want to clone a specific voice, perform this step.
You can use the audio files in `asset/`.

#### (1). Download wetext
```
pip3 install modelscope
modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext
```

#### (2). Process Prompt Speech
Example:
```
python3 scripts/process_prompt.py --prompt_text asset/zh_man1.txt --prompt_speech asset/zh_man1.wav --output zh_man1
```

Adjust the parameters to match your setup:
```
python3 scripts/process_prompt.py -h

usage: process_prompt.py [-h] [--model_dir MODEL_DIR] [--wetext_dir WETEXT_DIR] [--sample_rate SAMPLE_RATE] [--prompt_text PROMPT_TEXT] [--prompt_speech PROMPT_SPEECH]
                         [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --model_dir MODEL_DIR
                        tokenizer configuration directionary
  --wetext_dir WETEXT_DIR
                        path to wetext
  --sample_rate SAMPLE_RATE
                        Sampling rate for prompt audio
  --prompt_text PROMPT_TEXT
                        The text content of the prompt(reference) audio. Text or file path.
  --prompt_speech PROMPT_SPEECH
                        The path to prompt(reference) audio.
  --output OUTPUT       Output data storage directory
```

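
When several reference voices need processing, the command line can be assembled programmatically. The sketch below builds the argument list from the three flags used in the example (`--prompt_text`, `--prompt_speech`, `--output`) and prints it. The helper name `build_process_prompt_cmd` and the loop are illustrative, and only `zh_man1` is a file name known from this README.

```python
import shlex


def build_process_prompt_cmd(prompt_text: str, prompt_speech: str, output: str) -> list[str]:
    """Assemble the argument list for scripts/process_prompt.py (run from the project root)."""
    return [
        "python3", "scripts/process_prompt.py",
        "--prompt_text", prompt_text,
        "--prompt_speech", prompt_speech,
        "--output", output,
    ]


# Add further speaker names here if matching .txt/.wav pairs exist in asset/.
for name in ["zh_man1"]:
    cmd = build_process_prompt_cmd(f"asset/{name}.txt", f"asset/{name}.wav", name)
    print(shlex.join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```
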
After executing the above command, files like the following will be generated:
```
flow_embedding.txt
flow_prompt_speech_token.txt
llm_embedding.txt
llm_prompt_speech_token.txt
prompt_speech_feat.txt
prompt_text.txt
```

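
A quick way to confirm that prompt processing finished is to check the output directory for the files listed above. This is a minimal sketch: the expected names are copied from the list in this README (the actual set may differ between versions), and the helper name `missing_prompt_files` is hypothetical.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_PROMPT_FILES = (
    "flow_embedding.txt",
    "flow_prompt_speech_token.txt",
    "llm_embedding.txt",
    "llm_prompt_speech_token.txt",
    "prompt_speech_feat.txt",
    "prompt_text.txt",
)


def missing_prompt_files(output_dir: str) -> list[str]:
    """Return the expected file names that are not present in `output_dir`."""
    root = Path(output_dir)
    return [name for name in EXPECTED_PROMPT_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "zh_man1" is the --output value used in the earlier example.
    print(missing_prompt_files("zh_man1") or "all prompt files present")
```
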
When running `run_ax650.sh`, pass this output directory as the script's `prompt_files` parameter.