	---
	license: bsd-3-clause
	pipeline_tag: automatic-speech-recognition
	---

	# whisper.axera

	- [English](https://huggingface.co/AXERA-TECH/Whisper/blob/main/README_EN.md)
	- [中文](https://huggingface.co/AXERA-TECH/Whisper/blob/main/README.md)

	OpenAI Whisper on Axera Platform

	## Overview

	This project provides an optimized implementation of OpenAI's Whisper speech recognition model for Axera AI processors (AX650N/AX630C). It supports both C++ and Python interfaces for efficient on-device speech-to-text conversion.

	## Features

	- Dual Language Support: Both C++ and Python APIs available
	- Multiple Model Sizes: Support for tiny, base, small, and turbo model variants
	- Multi-language Recognition: Tested with English, Chinese, Japanese, and Korean
	- Optimized Performance: Specially optimized for Axera NPU acceleration
	- Easy Deployment: Pre-built packages and cross-compilation support

	## Update

	- 2026/01/14: The model architecture is now cleaner: it uses `encoder` and `decoder` models instead of `decoder_main` and `decoder_loop`. Exporting models from Hugging Face is also supported.

	## Supported Platforms

	- ✅ AX650N
	- ✅ AX630C

	## Pre-trained Models

	Download pre-compiled models from:
	- [Baidu Cloud](https://pan.baidu.com/s/1tOHVMZCin0A68T5HmKRJyg?pwd=axyz)
	- [Hugging Face](https://huggingface.co/AXERA-TECH/Whisper)

	For custom model conversion, please refer to [Model Conversion Guide](./model_convert/README_EN.md).

	## Model Conversion

	Currently supported model scales:
	- tiny
	- base
	- small
	- medium
	- turbo

	Tested languages:
	- English
	- Chinese
	- Japanese
	- Korean
	- Malaysian

	For other languages or custom model sizes, please refer to the [Model Conversion Guide](./model_convert/README_EN.md).

	## Deployment on Target Devices

	### Prerequisites
	- AX650N/AX630C devices with Ubuntu 22.04 pre-installed
	- Internet connection for `apt install` and `pip install`
	- Verified hardware platforms:
	  - [MaixIV M4nDock (AX650N)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
	  - [M.2 Accelerator Card (AX650N)](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
	  - [Axera Pi 2 (AX630C)](https://axera-pi-2-docs-cn.readthedocs.io/zh-cn/latest/index.html)
	  - [Module-LLM (AX630C)](https://docs.m5stack.com/zh_CN/module/Module-LLM)
	  - [LLM630 Compute Kit (AX630C)](https://docs.m5stack.com/zh_CN/core/LLM630%20Compute%20Kit)

	## Programming Language Support

	### Python

	Tested with Python 3.12. We recommend using [Miniconda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh) for environment management.

	#### Installation

	```bash
	cd python
	pip3 install -r requirements.txt
	```

	#### pyaxengine

	Install the NPU Python API from: https://github.com/AXERA-TECH/pyaxengine

	#### Usage

	##### Command Line Interface

	```
	cd python
	(whisper) root@ax650:/mnt/data/HF/Whisper/python# python whisper_cli.py -w ../demo.wav -t tiny
	[INFO] Available providers: ['AxEngineExecutionProvider']
	{'wav': '../demo.wav', 'model_type': 'tiny', 'model_path': '../models-ax650', 'language': 'zh', 'task': 'transcribe'}
	[INFO] Using provider: AxEngineExecutionProvider
	[INFO] Chip type: ChipType.MC50
	[INFO] VNPU type: VNPUType.DISABLED
	[INFO] Engine version: 2.12.0s
	[INFO] Model type: 2 (triple core)
	[INFO] Compiler version: 5.0 76f70fdc
	[INFO] Using provider: AxEngineExecutionProvider
	[INFO] Model type: 2 (triple core)
	[INFO] Compiler version: 5.0 76f70fdc
	ASR result:
	擅职出现交易几乎停止的情况
	RTF: 0.10313174677896837

	```

	Command line arguments:
	| Argument | Description | Default |
	| --- | --- | --- |
	| --wav/-w | Input audio file | - |
	| --model_type/-t | Model type: tiny/base/small | - |
	| --model_path/-p | Model directory | ../models |
	| --language/-l | Recognition language | zh |
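
The argument table above can be mirrored with a small `argparse` parser. This is a minimal sketch for illustration only, not the actual contents of `whisper_cli.py`: the function name `build_parser` is hypothetical, and treating `--wav` and `--model_type` as required is an assumption based on their missing defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the argument table (not the real whisper_cli.py)."""
    parser = argparse.ArgumentParser(description="Whisper CLI argument sketch")
    # Assumption: arguments without a default in the table are required.
    parser.add_argument("--wav", "-w", required=True, help="Input audio file")
    parser.add_argument("--model_type", "-t", required=True, help="Model type, e.g. tiny/base/small")
    # Defaults below are copied from the table.
    parser.add_argument("--model_path", "-p", default="../models", help="Model directory")
    parser.add_argument("--language", "-l", default="zh", help="Recognition language")
    return parser


def parse_example() -> dict:
    """Parse the same flags used in the command-line example and return them as a dict."""
    args = build_parser().parse_args(["-w", "../demo.wav", "-t", "tiny"])
    return vars(args)
```

Calling `parse_example()` returns the parsed values with `model_path` and `language` filled in from the table defaults, which is a quick way to check which options a given invocation leaves implicit.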


	##### Server Mode

	```
	(whisper) root@ax650:/mnt/data/HF/Whisper/python# python whisper_svr.py
	[INFO] Available providers: ['AxEngineExecutionProvider']
	Server started at http://0.0.0.0:8000

	```

	Test the server:
	```
	python test_svr.py
	```


	<h3 id="CPP">CPP</h3>

	#### Usage on Target Device
	```
	cd cpp/ax650
	./whisper_cli -w ../../demo.wav -t tiny
	```

	or

	```
	cd cpp/ax650
	./whisper_cli --model_type small -w ../../demo.wav
	```

	Example Output:

	```
	(whisper) root@ax650:/mnt/data/HF/Whisper/cpp/ax650# ./whisper_cli -w ../../demo.wav -t tiny
	wav_file: ../../demo.wav
	model_path: ../../models-ax650
	model_type: tiny
	language: zh
	Init whisper success, take 0.3540seconds
	Result: 甚至出现交易几乎停止的情况
	RTF: 0.0968

	```

	#### Server Mode

	```
	cd cpp/ax650
	(whisper) root@ax650:/mnt/data/HF/Whisper/cpp/ax650# ./whisper_svr -t tiny
	port: 8080
	model_path: ../../models-ax650
	model_type: tiny
	language: zh
	[I][ main][ 60]: Initializing server...
	[I][ main][ 65]: Init server success
	[I][ start][ 32]: Start server at port 8080, POST binary stream to IP:8080/asr

	```

	#### Client test using curl

	```
	ffmpeg -i demo.wav -f f32le -c:a pcm_f32le - 2>/dev/null | \
	curl -X POST 10.126.33.192:8080/asr \
	-H "Content-Type: application/octet-stream" \
	--data-binary @-
	```
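
The same request can be made from Python without `ffmpeg` or `curl`. The sketch below is a rough equivalent under stated assumptions: the input WAV is 16-bit PCM, the server accepts the samples at the file's own sample rate and channel layout (as the `ffmpeg` command above leaves them unchanged), and the response body is returned as raw bytes because its format is not documented here. The helper names `pcm16_to_f32le`, `build_asr_request`, and `transcribe` are hypothetical.

```python
import struct
import urllib.request
import wave


def pcm16_to_f32le(pcm16: bytes) -> bytes:
    """Convert little-endian 16-bit PCM to little-endian float32 samples in [-1.0, 1.0)."""
    count = len(pcm16) // 2
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    return struct.pack(f"<{count}f", *(s / 32768.0 for s in samples))


def build_asr_request(host: str, port: int, payload: bytes) -> urllib.request.Request:
    """Build the POST request; path and content type follow the curl example."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/asr",
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def transcribe(wav_path: str, host: str, port: int = 8080) -> bytes:
    """Read a 16-bit PCM WAV file, send it to the server, and return the raw response body."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        frames = wav.readframes(wav.getnframes())
    request = build_asr_request(host, port, pcm16_to_f32le(frames))
    with urllib.request.urlopen(request) as response:
        return response.read()
```

For example, `transcribe("demo.wav", "10.126.33.192")` sends the same payload as the `curl` command, provided `demo.wav` is 16-bit PCM.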

	## Performance Benchmarks

	### Latency

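	RTF (Real-Time Factor) is the processing time divided by the duration of the input audio. Lower is better, and a value below 1 means faster than real time.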
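
As a quick illustration, the real-time factor is the time spent transcribing divided by the length of the audio. The sketch below uses made-up timings (a 10-second clip processed in 0.8 seconds), not measurements from this project, and the function name `real_time_factor` is hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings: a 10 s clip transcribed in 0.8 s.
rtf = real_time_factor(0.8, 10.0)
print(f"RTF: {rtf:.2f}")  # prints "RTF: 0.08"
```

An RTF of 0.08 therefore means one second of audio takes about 0.08 seconds to transcribe.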

	C++:

	| Models | AX650N | AX630C |
	| ------------- | ------ | ------ |
	| Whisper-Tiny | 0.08 | |
	| Whisper-Base | 0.11 | 0.35 |
	| Whisper-Small | 0.24 | |
	| Whisper-Turbo | 0.48 | |

	Python:

	| Models | AX650N | AX630C |
	| ------------- | ------ | ------ |
	| Whisper-Tiny | 0.12 | |
	| Whisper-Base | 0.16 | 0.35 |
	| Whisper-Small | 0.50 | |
	| Whisper-Turbo | 0.60 | |

	### Word Error Rate (tested on the AIShell dataset)

	| Models | AX650N | AX630C |
	| ------------- | ------ | ------ |
	| Whisper-Tiny | 0.24 | |
	| Whisper-Base | 0.18 | |
	| Whisper-Small | 0.11 | |
	| Whisper-Turbo | 0.06 | |

	To reproduce WER test results:

	Download dataset:
	```
	cd model_convert
	bash download_dataset.sh
	```

	Run test script:
	```
	cd python
	conda activate whisper
	python test_wer.py -d aishell --gt_path ../model_convert/datasets/ground_truth.txt --model_type tiny

	```
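
For readers unfamiliar with the metric, an error rate of this kind is commonly computed as the edit distance between the reference and the recognized text, divided by the reference length. The sketch below shows that common definition at character level; it is not taken from `test_wer.py`, whose exact tokenization and scoring may differ. The example treats the C++ transcript shown earlier as the reference purely for illustration, since the true ground truth for `demo.wav` is not given here.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp) -> float:
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustration only: the reference here is an assumption, not dataset ground truth.
reference = "甚至出现交易几乎停止的情况"
hypothesis = "擅职出现交易几乎停止的情况"
print(edit_distance(reference, hypothesis), len(reference))  # prints "2 13"
```

Passing lists of words instead of strings gives a word-level rate with the same functions.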

	### Memory Usage

	* CMM is the physical memory used by Axera modules such as VDEC (video decoder), VENC (video encoder), and the NPU.

	Python:

	| Models | CMM (MB) | OS (MB) |
	| ------------- | ------ | ------ |
	| Whisper-Tiny | 332 | 512 |
	| Whisper-Base | 533 | 644 |
	| Whisper-Small | 1106 | 906 |
	| Whisper-Turbo | 2065 | 2084 |

	C++:

	| Models | CMM (MB) | OS (MB) |
	| ------------- | ------ | ------ |
	| Whisper-Tiny | 332 | 31 |
	| Whisper-Base | 533 | 54 |
	| Whisper-Small | 1106 | 146 |
	| Whisper-Turbo | 2065 | 86 |


	## Technical Discussion

	- GitHub Issues
	- Tencent QQ Group: 139953715