	---
	license: mit
	language:
	- en
	- zh
	base_model:
	- CosyVoice3
	pipeline_tag: text-to-speech
	library_name: transformers
	tags:
	- CosyVoice3
	- Speech
	---

	# CosyVoice3
	This version of CosyVoice3 has been converted to run on the Axera NPU using w8a16 quantization.
	Compatible with Pulsar2 version: 4.2

	## Conversion tool links
	If you are interested in model conversion, you can export the axmodel yourself, starting from the original repo:
	[CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

	[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

	[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/CosyVoice3.Axera)

	## Supported Platforms

	- AX650
	- AX650N DEMO Board
	- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
	- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

	Speech generation performance:

	| Stage | Latency / Throughput |
	|------|------|
	| LLM prefill (input_token_num + prompt_token_num in [0, 128]) | 104 ms |
	| LLM prefill (input_token_num + prompt_token_num in [128, 256]) | 160 ms |
	| Decode | 14 tokens/s |
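
As a rough worked example, the figures in the table above can be combined into a back-of-the-envelope latency estimate. This is only a sketch under stated assumptions: the function name `estimate_seconds` is hypothetical, a prompt of exactly 128 tokens is assumed to fall in the first row, and decoding is assumed to run at the constant rate from the table, which real runs may not match.

```python
# Figures copied from the performance table above.
PREFILL_MS = [(128, 104.0), (256, 160.0)]  # (max prompt tokens, prefill time in ms)
DECODE_TOKENS_PER_S = 14.0


def estimate_seconds(prompt_tokens: int, decode_tokens: int) -> float:
    """Estimate total time: one prefill pass plus constant-rate decoding."""
    for max_tokens, prefill_ms in PREFILL_MS:
        if prompt_tokens <= max_tokens:
            return prefill_ms / 1000.0 + decode_tokens / DECODE_TOKENS_PER_S
    raise ValueError("the table only covers prompts of up to 256 tokens")


if __name__ == "__main__":
    # 100 prompt tokens, 140 generated speech tokens.
    print(f"{estimate_seconds(100, 140):.2f} s")
```

With these numbers, decoding dominates: 140 tokens at 14 tokens/s take 10 s, while prefill adds only about 0.1 s.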

	## How to use

	Download all files from this repository to the device.

	### 1. Prepare

	#### 1.1 Copy this project to AX650 Board

	#### 1.2 Prepare Dependencies

	The HTTP tokenizer server and the prompt speech processing step both require these Python packages. If you run these two steps on a PC, install the packages on that PC.
	```
	pip3 install -r scripts/requirements.txt
	```

	### 2. Start HTTP Tokenizer Server
	```
	python scripts/cosyvoice3_tokenizer.py --host {your host} --port {your port}
	```
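
Before starting the runtime on the device, it can help to confirm that the tokenizer server is reachable over the network. The sketch below only checks that a TCP connection can be opened; it assumes nothing about the server's HTTP routes. The function name `tokenizer_reachable` and the address in the usage lines are placeholders, not part of this project.

```python
import socket


def tokenizer_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: use the values you passed to --host and --port.
    print(tokenizer_reachable("127.0.0.1", 12345))
```

Run this from the device that will execute the run script, since that is the machine that has to reach the server.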


	### 3. Run on Axera Device
	There are three kinds of device: AX650 Board, AXCL aarch64 Board, and AXCL x86 Board.

	#### 3.1 Run on AX650 Board
	1) Modify the HTTP host in `run_ax650.sh` so that it points to your tokenizer server.

	2) Run `run_ax650.sh`:
	```shell
	root@ax650 ~/CosyVoice3 # bash run_ax650.sh
	rm: cannot remove 'output*.wav': No such file or directory
	[I][ Init][ 108]: LLM init start
	[I][ Init][ 34]: connect http://10.122.86.184:12345 ok
	bos_id: 0, eos_id: 1773
	7% | ███ | 2 / 27 [3.11s<42.04s, 0.64 count/s] embed_selector init ok[I][ Init][ 138]: attr.axmodel_num:24
	100% | ████████████████████████████████ | 27 / 27 [10.32s<10.32s, 2.62 count/s] init post axmodel ok,remain_cmm(7178 MB)
	[I][ Init][ 216]: max_token_len : 1023
	[I][ Init][ 221]: kv_cache_size : 128, kv_cache_num: 1023
	[I][ Init][ 229]: prefill_token_num : 128
	[I][ Init][ 233]: grp: 1, prefill_max_token_num : 1
	[I][ Init][ 233]: grp: 2, prefill_max_token_num : 128
	[I][ Init][ 233]: grp: 3, prefill_max_token_num : 256
	[I][ Init][ 233]: grp: 4, prefill_max_token_num : 384
	[I][ Init][ 233]: grp: 5, prefill_max_token_num : 512
	[I][ Init][ 237]: prefill_max_token_num : 512
	[I][ Init][ 249]: LLM init ok
	[I][ Init][ 154]: Token2Wav init ok
	[I][ main][ 273]:
	[I][ Run][ 388]: input token num : 142, prefill_split_num : 2
	[I][ Run][ 422]: input_num_token:128
	[I][ Run][ 422]: input_num_token:14
	[I][ Run][ 607]: ttft: 236.90 ms
	[Main/Token2Wav Thread] Processing batch of 28 tokens...
	Successfully saved audio to output_0.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 53 tokens...
	Successfully saved audio to output_1.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 78 tokens...
	Successfully saved audio to output_2.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 78 tokens...
	Successfully saved audio to output_3.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 78 tokens...
	Successfully saved audio to output_4.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 78 tokens...
	Successfully saved audio to output_5.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 78 tokens...
	Successfully saved audio to output_6.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 78 tokens...
	Successfully saved audio to output_7.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 78 tokens...
	Successfully saved audio to output_8.wav (32-bit Float PCM).
	[Main/Token2Wav Thread] Processing batch of 78 tokens...
	Successfully saved audio to output_9.wav (32-bit Float PCM).
	[I][ Run][ 723]: hit eos, llm finished
	[I][ Run][ 753]: llm finished
	[Main/Token2Wav Thread] Buffer is empty and LLM finished. Exiting.


	[I][ Run][ 758]: total decode tokens:271
	[N][ Run][ 759]: hit eos,avg 21.47 token/s

	Successfully saved audio to output_10.wav (32-bit Float PCM).
	Successfully saved audio to output.wav (32-bit Float PCM).

	Voice generation pipeline completed.
	Type "q" to exit, Ctrl+c to stop current running
	text >>
	```

	Output speech:
	[output.wav](asset/output.wav)
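
In the log above, the 142-token input is prefilled in two passes of 128 and 14 tokens, which is consistent with `prefill_token_num : 128`. A minimal sketch of that chunking follows, assuming the runtime simply slices the input into fixed-size chunks. The runtime's actual implementation is not shown in this README, and the name `split_prefill` is hypothetical.

```python
def split_prefill(input_tokens: int, chunk: int = 128) -> list[int]:
    """Split a token count into consecutive chunks of at most `chunk` tokens."""
    if input_tokens < 0 or chunk <= 0:
        raise ValueError("input_tokens must be >= 0 and chunk must be > 0")
    full, rest = divmod(input_tokens, chunk)
    return [chunk] * full + ([rest] if rest else [])


if __name__ == "__main__":
    print(split_prefill(142))  # [128, 14], the two passes seen in the log
```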


	#### Or run on AX650 Board with Gradio GUI
	1) Start server
	```
	bash run_api_ax650.sh
	```
	2) Start Gradio GUI
	```
	python scripts/gradio_demo.py
	```

	#### 3.2 Run on AXCL aarch64 Board
	```
	bash run_axcl_aarch64.sh
	```
	#### Or run on AXCL aarch64 Board with Gradio GUI
	1) Start server
	```
	bash run_api_axcl_aarch64.sh
	```
	2) Start Gradio GUI
	```
	python scripts/gradio_demo.py
	```
	3) Open the page from a browser
	The page URL is: `https://{your device ip}:7860`

	Note that you need to run these two commands in the project root directory.

	#### 3.3 Run on AXCL x86 Board
	```
	bash run_axcl_x86.sh
	```
	#### Or run on AXCL x86 Board with Gradio GUI
	1) Start server
	```
	bash run_api_axcl_x86.sh
	```
	2) Start Gradio GUI
	```
	python scripts/gradio_demo.py
	```
	3) Open the page from a browser
	The page URL is: `https://{your device ip}:7860`

	Note that you need to run these two commands in the project root directory.

	![](./gradio.png)

	### Optional. Process Prompt Speech
	If you want to clone a specific voice, complete this step.
	You can use the sample audio in `asset/`.

	#### (1). Download wetext
	```
	pip3 install modelscope
	modelscope download --model pengzhendong/wetext --local_dir pengzhendong/wetext
	```

	#### (2). Process Prompt Speech
	Example:
	```
	python3 scripts/process_prompt.py --prompt_text asset/zh_man1.txt --prompt_speech asset/zh_man1.wav --output zh_man1
	```

	Adjust the parameters to your setup. The available options are:
	```
	python3 scripts/process_prompt.py -h

	usage: process_prompt.py [-h] [--model_dir MODEL_DIR] [--wetext_dir WETEXT_DIR] [--sample_rate SAMPLE_RATE] [--prompt_text PROMPT_TEXT] [--prompt_speech PROMPT_SPEECH]
	                         [--output OUTPUT]

	options:
	  -h, --help            show this help message and exit
	  --model_dir MODEL_DIR
	                        tokenizer configuration directionary
	  --wetext_dir WETEXT_DIR
	                        path to wetext
	  --sample_rate SAMPLE_RATE
	                        Sampling rate for prompt audio
	  --prompt_text PROMPT_TEXT
	                        The text content of the prompt(reference) audio. Text or file path.
	  --prompt_speech PROMPT_SPEECH
	                        The path to prompt(reference) audio.
	  --output OUTPUT       Output data storage directory
	```
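
To prepare several reference voices in one go, the example command can be generated once per transcript/audio pair. The sketch below assumes each `name.txt` in `asset/` has a matching `name.wav`, as `zh_man1` does. It uses only the flags listed in the help output above; the helper name `build_commands` is hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(asset_dir: str) -> list[list[str]]:
    """Build one process_prompt.py command per matching .txt/.wav pair."""
    commands = []
    for text_path in sorted(Path(asset_dir).glob("*.txt")):
        wav_path = text_path.with_suffix(".wav")
        if not wav_path.exists():
            continue  # skip transcripts that have no reference audio
        commands.append([
            sys.executable, "scripts/process_prompt.py",
            "--prompt_text", str(text_path),
            "--prompt_speech", str(wav_path),
            "--output", text_path.stem,  # e.g. zh_man1
        ])
    return commands


if __name__ == "__main__":
    # Run from the project root so the relative paths resolve.
    for cmd in build_commands("asset"):
        subprocess.run(cmd, check=True)
```

Each voice ends up in its own output directory named after the transcript file.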

	After executing the above command, files like the following will be generated:
	```
	flow_embedding.txt
	flow_prompt_speech_token.txt
	llm_embedding.txt
	llm_prompt_speech_token.txt
	prompt_speech_feat.txt
	prompt_text.txt
	```
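
A quick way to confirm that processing completed is to check the output directory for the six files listed above. This is a minimal sketch: the helper name `missing_prompt_files` is hypothetical, and `zh_man1` is the output directory from the earlier example command.

```python
from pathlib import Path

# File names copied from the listing above.
EXPECTED_FILES = [
    "flow_embedding.txt",
    "flow_prompt_speech_token.txt",
    "llm_embedding.txt",
    "llm_prompt_speech_token.txt",
    "prompt_speech_feat.txt",
    "prompt_text.txt",
]


def missing_prompt_files(output_dir: str) -> list[str]:
    """Return the expected file names that are absent from output_dir."""
    base = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_prompt_files("zh_man1")
    print("all files present" if not missing else f"missing: {missing}")
```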

	When you run `run_ax650.sh`, pass this output directory to the script's `prompt_files` parameter.