	---
	base_model: openai/whisper-tiny
	library_name: transformers
	license: apache-2.0
	pipeline_tag: automatic-speech-recognition
	tags:
	- audio
	- automatic-speech-recognition
	- whisper
	- hf-asr-leaderboard
	---

	# LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

	LiteASR is a compression scheme for automatic speech recognition (ASR) models that leverages the _low-rank_ properties of activation values. Our method can shrink the encoder of OpenAI's Whisper models by up to roughly 50%.

	See our [GitHub repository](https://github.com/efeslab/LiteASR) and [paper](https://arxiv.org/abs/2502.20583) for technical details.

	## Abstract

	Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency.
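To make the core idea in the abstract concrete, here is a minimal NumPy sketch of replacing one linear layer with a chain of two low-rank matrix multiplications derived from PCA on calibration activations. It illustrates the general technique under simplifying assumptions and is not the authors' implementation: the function name `low_rank_factorize`, the synthetic data, and the fixed rank `k` are all hypothetical, and the reduced-dimension self-attention optimization is not covered.

```python
import numpy as np


def low_rank_factorize(W, b, X_calib, k):
    """Approximate y = x @ W + b by y ≈ (x @ A) @ B + b_new using rank k.

    W: (d_in, d_out) weight, b: (d_out,) bias,
    X_calib: (n, d_in) calibration inputs (hypothetical stand-in for real activations).
    """
    Y = X_calib @ W + b                  # output activations on the calibration set
    mu = Y.mean(axis=0)                  # PCA operates on centered activations
    # Principal directions are the right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(Y - mu, full_matrices=False)
    V_k = Vt[:k].T                       # (d_out, k): top-k principal components
    A = W @ V_k                          # (d_in, k): project into the k-dim subspace
    B = V_k.T                            # (k, d_out): map back to the output space
    b_new = mu + (b - mu) @ V_k @ V_k.T  # fold the centering step into the bias
    return A, B, b_new


# Synthetic setup: inputs live in an 8-dimensional subspace, so the
# layer's outputs are (exactly) low-rank after centering.
rng = np.random.default_rng(0)
d_in, d_out, rank, n = 64, 48, 8, 256
P = rng.standard_normal((rank, d_in))
X_calib = rng.standard_normal((n, rank)) @ P
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

A, B, b_new = low_rank_factorize(W, b, X_calib, k=rank)

# Compare the dense layer with its low-rank replacement on unseen inputs.
X_new = rng.standard_normal((16, rank)) @ P
exact = X_new @ W + b
approx = (X_new @ A) @ B + b_new
max_error = np.abs(exact - approx).max()

dense_params = W.size                    # d_in * d_out
low_rank_params = A.size + B.size        # k * (d_in + d_out)
print(dense_params, low_rank_params)     # 3072 896
```

The factorization saves parameters and compute whenever `k * (d_in + d_out)` is smaller than `d_in * d_out`. On real activations the low-rank structure is only approximate, so the choice of `k` trades accuracy against size.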

	## Quick Start

	The easiest way to run our model is through our integration with the Hugging Face Transformers library. We provide weights for compressed versions of the OpenAI Whisper series [here](https://huggingface.co/efficient-speech).

	```python
	import librosa
	import torch
	from transformers import AutoProcessor, AutoModel

	device = "cuda:0"
	dtype = torch.float16

	# load the compressed Whisper model
	model = AutoModel.from_pretrained(
	    "efficient-speech/lite-whisper-tiny-fast",  # This is the current model repository
	    trust_remote_code=True,
	)
	model.to(dtype).to(device)

	# we use the same processor as the original base model (whisper-tiny)
	processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

	# set the path to your audio file
	path = "path/to/audio.wav"
	audio, _ = librosa.load(path, sr=16000)

	input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
	input_features = input_features.to(dtype).to(device)

	predicted_ids = model.generate(input_features)
	transcription = processor.batch_decode(
	    predicted_ids,
	    skip_special_tokens=True,
	)[0]

	print(transcription)
	```

	## Benchmark Results

	LiteASR can compress Whisper models with minimal degradation in accuracy (`lite-whisper` series). We provide three checkpoints per model, `acc`, plain (no suffix), and `fast`, in order of decreasing encoder size; choose among them based on your resource and accuracy requirements.
	The table below reports the average word error rate (WER) on the [ESB datasets](https://huggingface.co/datasets/hf-audio/esb-datasets-test-only-sorted):

	| Model | Average WER (↓) | Encoder Size | Decoder Size |
	|-------|----------------|--------------|--------------|
	| [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 10.1 | 635M | 907M |
	| [lite-whisper-large-v3-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-acc) | 10.1 | 429M | 907M |
	| [lite-whisper-large-v3](https://huggingface.co/efficient-speech/lite-whisper-large-v3) | 10.2 | 377M | 907M |
	| [lite-whisper-large-v3-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-fast) | 11.3 | 308M | 907M |
	|   |   |   |   |
	| [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 10.1 | 635M | 172M |
	| [lite-whisper-large-v3-turbo-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-acc) | 10.2 | 421M | 172M |
	| [lite-whisper-large-v3-turbo](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo) | 12.6 | 374M | 172M |
	| [lite-whisper-large-v3-turbo-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-fast) | 20.1 | 313M | 172M |
	|   |   |   |   |
	| [whisper-medium](https://huggingface.co/openai/whisper-medium) | 14.8 | 306M | 457M |
	| [lite-whisper-medium-acc](https://huggingface.co/efficient-speech/lite-whisper-medium-acc) | 13.46 | 269.93M | 456.64M |
	| [lite-whisper-medium](https://huggingface.co/efficient-speech/lite-whisper-medium) | 14.50 | 239.99M | 456.64M |
	| [lite-whisper-medium-fast](https://huggingface.co/efficient-speech/lite-whisper-medium-fast) | 14.52 | 215.31M | 456.64M |
	|   |   |   |   |
	| [whisper-small](https://huggingface.co/openai/whisper-small) | 15.89 | 87.00M | 153.58M |
	| [lite-whisper-small-acc](https://huggingface.co/efficient-speech/lite-whisper-small-acc) | 15.37 | 76.99M | 153.58M |
	| [lite-whisper-small](https://huggingface.co/efficient-speech/lite-whisper-small) | 14.96 | 70.16M | 153.58M |
	| [lite-whisper-small-fast](https://huggingface.co/efficient-speech/lite-whisper-small-fast) | 14.92 | 63.11M | 153.58M |
	|   |   |   |   |
	| [whisper-base](https://huggingface.co/openai/whisper-base) | 17.67 | 19.82M | 52.00M |
	| [lite-whisper-base-acc](https://huggingface.co/efficient-speech/lite-whisper-base-acc) | 19.07 | 18.64M | 52.00M |
	| [lite-whisper-base](https://huggingface.co/efficient-speech/lite-whisper-base) | 19.71 | 17.44M | 52.00M |
	| [lite-whisper-base-fast](https://huggingface.co/efficient-speech/lite-whisper-base-fast) | 23.05 | 16.07M | 52.00M |
	|   |   |   |   |
	| [whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 22.01 | 7.63M | 29.55M |
	| [lite-whisper-tiny-acc](https://huggingface.co/efficient-speech/lite-whisper-tiny-acc) | 22.97 | 7.41M | 29.55M |
	| [lite-whisper-tiny](https://huggingface.co/efficient-speech/lite-whisper-tiny) | 23.95 | 7.00M | 29.55M |
	| [lite-whisper-tiny-fast](https://huggingface.co/efficient-speech/lite-whisper-tiny-fast) | 27.09 | 6.48M | 29.55M |
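For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference and the model's transcription, divided by the number of reference words. The sketch below is a plain-Python illustration, not the evaluation code behind the numbers above: the function name `word_error_rate` is hypothetical, and it skips the text normalization (casing, punctuation) that benchmark evaluations typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * score:.2f}")  # 33.33
```

The table's values appear to be on the same percentage scale, averaged across the ESB datasets.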

	## Citation

	If you use LiteASR in your research, please cite the following paper:

	```
	@misc{kamahori2025liteasrefficientautomaticspeech,
	  title={LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation},
	  author={Keisuke Kamahori and Jungo Kasai and Noriyuki Kojima and Baris Kasikci},
	  year={2025},
	  eprint={2502.20583},
	  archivePrefix={arXiv},
	  primaryClass={cs.LG},
	  url={https://arxiv.org/abs/2502.20583},
	}
	```