---
license: bsd-3-clause
pipeline_tag: automatic-speech-recognition
---

whisper.axera

OpenAI Whisper on Axera Platform

Overview

This project provides an optimized implementation of OpenAI's Whisper speech recognition model for Axera AI processors (AX650N/AX630C). It supports both C++ and Python interfaces for efficient on-device speech-to-text conversion.

Features

  • Dual Language Support: Both C++ and Python APIs available
  • Multiple Model Sizes: Support for tiny, base, small, and turbo model variants
  • Multi-language Recognition: Tested with English, Chinese, Japanese, and Korean
  • Optimized Performance: Specially optimized for Axera NPU acceleration
  • Easy Deployment: Pre-built packages and cross-compilation support

Update

  • 2026/01/14: Cleaner model architecture (a single encoder and decoder instead of decoder_main and decoder_loop); added support for exporting models from Hugging Face.

Supported Platforms

  • ✅ AX650N
  • ✅ AX630C

Pre-trained Models

Download pre-compiled models from:

For custom model conversion, please refer to Model Conversion Guide.

Model Conversion

Currently supported model scales:

  • tiny
  • base
  • small
  • medium
  • turbo

Tested languages:

  • English
  • Chinese
  • Japanese
  • Korean
  • Malay

For other languages or custom model sizes, please refer to the Model Conversion Guide.

Deployment on Target Devices

Prerequisites

Programming Language Support

Python

Tested with Python 3.12. We recommend using Miniconda for environment management.

Installation

cd python
pip3 install -r requirements.txt

pyaxengine

Install NPU Python API from: https://github.com/AXERA-TECH/pyaxengine

Usage

Command Line Interface
cd python  
(whisper) root@ax650:/mnt/data/HF/Whisper/python# python whisper_cli.py -w ../demo.wav -t tiny
[INFO] Available providers:  ['AxEngineExecutionProvider']
{'wav': '../demo.wav', 'model_type': 'tiny', 'model_path': '../models-ax650', 'language': 'zh', 'task': 'transcribe'}
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.0 76f70fdc
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.0 76f70fdc
ASR result:
擅职出现交易几乎停止的情况
RTF: 0.10313174677896837

Command line arguments:

| Argument          | Description                 | Default     |
|-------------------|-----------------------------|-------------|
| --wav             | Input audio file            | -           |
| --model_type / -t | Model type: tiny/base/small | -           |
| --model_path / -p | Model directory             | ../models   |
| --language / -l   | Recognition language        | zh          |
Server Mode
(whisper) root@ax650:/mnt/data/HF/Whisper/python# python whisper_svr.py
[INFO] Available providers:  ['AxEngineExecutionProvider']
Server started at http://0.0.0.0:8000

Test the server:

python test_svr.py

CPP

Usage on Target Device

cd cpp/ax650
./whisper_cli -w ../demo.wav -t tiny

cd cpp/ax650
./whisper_cli --model_type small -w ../demo.wav

Example Output:

(whisper) root@ax650:/mnt/data/HF/Whisper/cpp/ax650# ./whisper_cli -w ../../demo.wav -t tiny
wav_file: ../../demo.wav
model_path: ../../models-ax650
model_type: tiny
language: zh
Init whisper success, take 0.3540seconds
Result: 甚至出现交易几乎停止的情况
RTF: 0.0968

Server Mode

cd cpp/ax650
(whisper) root@ax650:/mnt/data/HF/Whisper/cpp/ax650# ./whisper_svr -t tiny
port: 8080
model_path: ../../models-ax650
model_type: tiny
language: zh
[I][                            main][  60]: Initializing server...
[I][                            main][  65]: Init server success
[I][                           start][  32]: Start server at port 8080, POST binary stream to IP:8080/asr

Client test using curl:

ffmpeg -i demo.wav -f f32le -c:a pcm_f32le - 2>/dev/null | \
curl -X POST 10.126.33.192:8080/asr \
  -H "Content-Type: application/octet-stream" \
  --data-binary @-
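The same request can be made from Python with only the standard library. This is a sketch assuming a 16-bit mono PCM WAV input; the IP address is the placeholder from the curl example above, and the helper names are illustrative rather than part of this repo.

```python
# Python equivalent of the ffmpeg | curl pipeline above: decode a 16-bit
# PCM WAV to raw little-endian float32 samples and POST them to /asr.
import struct
import urllib.request
import wave


def wav_to_f32le(path: str) -> bytes:
    """Decode a 16-bit PCM WAV file into raw little-endian float32 bytes."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM input")
        frames = w.readframes(w.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    # Scale int16 samples into [-1.0, 1.0) and pack as f32le.
    return struct.pack(f"<{len(samples)}f", *(s / 32768.0 for s in samples))


def post_asr(pcm: bytes, url: str = "http://10.126.33.192:8080/asr") -> str:
    """POST the raw float stream to the whisper_svr /asr endpoint."""
    req = urllib.request.Request(
        url, data=pcm, headers={"Content-Type": "application/octet-stream"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```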

Performance Benchmarks

Latency

RTF: Real-Time Factor (processing time divided by audio duration; lower is better)
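An RTF below 1.0 means the model transcribes faster than real time. A minimal sketch of how the figure can be computed for a WAV input (helper names are illustrative, not from this repo):

```python
# RTF = processing time / audio duration. Below 1.0 means faster than
# real time; e.g. 0.5 s of inference on 5 s of audio gives RTF 0.1.
import wave


def audio_duration_seconds(path: str) -> float:
    """Duration of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds
```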

CPP:

| Models        | AX650N | AX630C |
|---------------|--------|--------|
| Whisper-Tiny  | 0.08   | -      |
| Whisper-Base  | 0.11   | 0.35   |
| Whisper-Small | 0.24   | -      |
| Whisper-Turbo | 0.48   | -      |

Python:

| Models        | AX650N | AX630C |
|---------------|--------|--------|
| Whisper-Tiny  | 0.12   | -      |
| Whisper-Base  | 0.16   | 0.35   |
| Whisper-Small | 0.50   | -      |
| Whisper-Turbo | 0.60   | -      |

Word Error Rate (tested on the AIShell dataset)

| Models        | AX650N | AX630C |
|---------------|--------|--------|
| Whisper-Tiny  | 0.24   | -      |
| Whisper-Base  | 0.18   | -      |
| Whisper-Small | 0.11   | -      |
| Whisper-Turbo | 0.06   | -      |

To reproduce WER test results:

Download dataset:

cd model_convert
bash download_dataset.sh

Run test script:

cd python
conda activate whisper
python test_wer.py -d aishell --gt_path ../model_convert/datasets/ground_truth.txt --model_type tiny
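For reference, WER is the token-level Levenshtein edit distance between the hypothesis and the ground truth, divided by the reference length; for Mandarin sets like AIShell it is commonly computed over individual characters (CER). A minimal sketch, not the implementation in test_wer.py:

```python
# Minimal word/character error rate: Levenshtein edit distance over
# tokens divided by the reference length. For Mandarin (e.g. AIShell)
# the tokens are usually individual characters.
def wer(ref_tokens, hyp_tokens):
    m, n = len(ref_tokens), len(hyp_tokens)
    row = list(range(n + 1))  # distances for the previous reference prefix
    for i in range(1, m + 1):
        prev_diag, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(
                row[j] + 1,      # deletion
                row[j - 1] + 1,  # insertion
                prev_diag + (ref_tokens[i - 1] != hyp_tokens[j - 1]),  # sub
            )
            prev_diag = cur
    return row[n] / max(m, 1)
```

For example, comparing the two sample outputs earlier in this README, "擅职出现交易几乎停止的情况" vs. "甚至出现交易几乎停止的情况", gives two character substitutions over thirteen reference characters.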

MEM Usage

  • CMM: physical memory used by Axera hardware modules such as the VDEC (video decoder), VENC (video encoder), and NPU.

Python:

| Models        | CMM (MB) | OS (MB) |
|---------------|----------|---------|
| Whisper-Tiny  | 332      | 512     |
| Whisper-Base  | 533      | 644     |
| Whisper-Small | 1106     | 906     |
| Whisper-Turbo | 2065     | 2084    |

C++:

| Models        | CMM (MB) | OS (MB) |
|---------------|----------|---------|
| Whisper-Tiny  | 332      | 31      |
| Whisper-Base  | 533      | 54      |
| Whisper-Small | 1106     | 146     |
| Whisper-Turbo | 2065     | 86      |

Technical Discussion

  • GitHub Issues
  • Tencent QQ Group: 139953715