---
base_model: openai/whisper-tiny
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
  - audio
  - automatic-speech-recognition
  - whisper
  - hf-asr-leaderboard
---

# LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

LiteASR is a compression scheme for automatic speech recognition (ASR) models that exploits the low-rank structure of activation values. It compresses the encoder of OpenAI's Whisper models by roughly 50% (over 50% for large-v3) with minimal loss in transcription accuracy.

See our GitHub repository and the [paper](https://arxiv.org/abs/2502.20583) for technical details.

## Abstract

Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency.
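
To make the core idea concrete, here is a minimal, hypothetical sketch (not LiteASR's actual implementation; the function name, the rank `k`, and the calibration tensors are placeholders) of replacing one linear layer with a chain of two thinner layers built from the principal components of its calibration-time input activations:

```python
import torch

def low_rank_factorize(linear: torch.nn.Linear, calib_acts: torch.Tensor, k: int) -> torch.nn.Sequential:
    """Approximate `linear` with two thinner layers using the top-k principal
    directions of its calibration-time input activations.
    Hypothetical sketch: mean-centering and other details are omitted."""
    # calib_acts: (num_samples, in_features) activations collected on a small calibration set
    _, _, vh = torch.linalg.svd(calib_acts, full_matrices=False)
    v_k = vh[:k]  # (k, in_features) top-k principal directions

    # W x ~= (W V_k^T)(V_k x): project inputs into the k-dim subspace, then map back out.
    down = torch.nn.Linear(linear.in_features, k, bias=False)
    up = torch.nn.Linear(k, linear.out_features, bias=linear.bias is not None)
    with torch.no_grad():
        down.weight.copy_(v_k)
        up.weight.copy_(linear.weight @ v_k.T)
        if linear.bias is not None:
            up.bias.copy_(linear.bias)
    return torch.nn.Sequential(down, up)

# Toy example: compress one projection layer with synthetic calibration activations.
layer = torch.nn.Linear(384, 1536)
calib = torch.randn(256, 384)
compressed = low_rank_factorize(layer, calib, k=128)
print(sum(p.numel() for p in layer.parameters()), "->", sum(p.numel() for p in compressed.parameters()))
```

LiteASR applies this kind of factorization throughout the encoder and, as described in the paper, additionally restructures self-attention to operate in the reduced dimensionality.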

## Quick Start

The easiest way to run our model is through our integration with the Hugging Face Transformers library. We provide model weights for compressed versions of the OpenAI Whisper series; this repository hosts `lite-whisper-tiny-fast`.

```python
import librosa
import torch
from transformers import AutoProcessor, AutoModel

device = "cuda:0"
dtype = torch.float16

# load the compressed Whisper model (this repository)
model = AutoModel.from_pretrained(
    "efficient-speech/lite-whisper-tiny-fast",
    trust_remote_code=True,
)
model.to(dtype).to(device)

# we use the same processor as the original base model (whisper-tiny)
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# set the path to your audio file
path = "path/to/audio.wav"
audio, _ = librosa.load(path, sr=16000)

input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(dtype).to(device)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True,
)[0]

print(transcription)
```
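
If you are running on CPU, set `device = "cpu"` and `dtype = torch.float32`. To try a different compression level or base model, swap in one of the other checkpoints listed in the benchmark table below (presumably under the same `efficient-speech/` namespace) together with the processor of the corresponding original Whisper model.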

## Benchmark Results

LiteASR compresses Whisper models with minimal degradation in accuracy (the `lite-whisper` series). We provide three checkpoints per model, `fast`, plain, and `acc`, to be chosen based on resource and accuracy requirements. The table below reports the average word error rate (WER) evaluated on the ESB datasets:

| Model | Average WER (↓) | Encoder Size | Decoder Size |
|---|---|---|---|
| whisper-large-v3 | 10.1 | 635M | 907M |
| lite-whisper-large-v3-acc | 10.1 | 429M | 907M |
| lite-whisper-large-v3 | 10.2 | 377M | 907M |
| lite-whisper-large-v3-fast | 11.3 | 308M | 907M |
| | | | |
| whisper-large-v3-turbo | 10.1 | 635M | 172M |
| lite-whisper-large-v3-turbo-acc | 10.2 | 421M | 172M |
| lite-whisper-large-v3-turbo | 12.6 | 374M | 172M |
| lite-whisper-large-v3-turbo-fast | 20.1 | 313M | 172M |
| | | | |
| whisper-medium | 14.8 | 306M | 457M |
| lite-whisper-medium-acc | 13.46 | 269.93M | 456.64M |
| lite-whisper-medium | 14.50 | 239.99M | 456.64M |
| lite-whisper-medium-fast | 14.52 | 215.31M | 456.64M |
| | | | |
| whisper-small | 15.89 | 87.00M | 153.58M |
| lite-whisper-small-acc | 15.37 | 76.99M | 153.58M |
| lite-whisper-small | 14.96 | 70.16M | 153.58M |
| lite-whisper-small-fast | 14.92 | 63.11M | 153.58M |
| | | | |
| whisper-base | 17.67 | 19.82M | 52.00M |
| lite-whisper-base-acc | 19.07 | 18.64M | 52.00M |
| lite-whisper-base | 19.71 | 17.44M | 52.00M |
| lite-whisper-base-fast | 23.05 | 16.07M | 52.00M |
| | | | |
| whisper-tiny | 22.01 | 7.63M | 29.55M |
| lite-whisper-tiny-acc | 22.97 | 7.41M | 29.55M |
| lite-whisper-tiny | 23.95 | 7.00M | 29.55M |
| lite-whisper-tiny-fast | 27.09 | 6.48M | 29.55M |
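
The numbers above are averages over the ESB datasets; the exact evaluation setup is in the paper. As a rough, self-contained illustration of how a WER score is computed, the sketch below uses the `evaluate` library's `wer` metric with placeholder strings (dataset loading and text normalization are omitted):

```python
import evaluate  # pip install evaluate jiwer

wer_metric = evaluate.load("wer")

# Placeholder strings; in practice, use the model transcription from the Quick Start
# snippet and the dataset's reference transcript, with consistent text normalization.
references = ["the quick brown fox jumps over the lazy dog"]
predictions = ["the quick brown fox jumped over the lazy dog"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}")  # fraction of word-level errors
```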

## Citation

If you use LiteASR in your research, please cite the following paper:

```bibtex
@misc{kamahori2025liteasrefficientautomaticspeech,
      title={LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation},
      author={Keisuke Kamahori and Jungo Kasai and Noriyuki Kojima and Baris Kasikci},
      year={2025},
      eprint={2502.20583},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.20583},
}
```