---
base_model: openai/whisper-medium
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- whisper
- hf-asr-leaderboard
---

# LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

[📄 Paper](https://huggingface.co/papers/2502.20583) | [💻 Code](https://github.com/efeslab/LiteASR) | [🌐 Project Page](https://efeslab.github.io/LiteASR/)

LiteASR is a compression scheme for automatic speech recognition (ASR) models that leverages the _low-rank_ properties of activation values. Our method can compress OpenAI's Whisper encoder by up to **~50%**.

## Abstract

Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency. The code of LiteASR is available at https://github.com/efeslab/LiteASR.

## Quick Start (Sample Usage)

The easiest way to run our model is through our integration with the Hugging Face Transformers library. We provide weights for the compressed versions of the OpenAI Whisper series [here](https://huggingface.co/efficient-speech).

```python
import librosa
import torch
from transformers import AutoProcessor, AutoModel

device = "cuda:0"
dtype = torch.float16

# load the compressed Whisper model
model = AutoModel.from_pretrained(
    "efficient-speech/lite-whisper-large-v3-turbo",
    trust_remote_code=True,
)
model.to(dtype).to(device)

# we use the same processor as the original model
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

# set the path to your audio file
path = "path/to/audio.wav"
audio, _ = librosa.load(path, sr=16000)

input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(dtype).to(device)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print(transcription)
```

## Benchmark Results

Following is the average word error rate (WER) evaluated on the [ESB datasets](https://huggingface.co/datasets/hf-audio/esb-datasets-test-only-sorted):

| Model | Average WER (↓) | Encoder Size | Decoder Size |
|-------|----------------|--------------|--------------|
| [whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 22.01 | 7.63M | 29.55M |
| [lite-whisper-tiny-acc](https://huggingface.co/efficient-speech/lite-whisper-tiny-acc) | 22.97 | 7.41M | 29.55M |
| [lite-whisper-tiny](https://huggingface.co/efficient-speech/lite-whisper-tiny) | 23.95 | 7.00M | 29.55M |
| [lite-whisper-tiny-fast](https://huggingface.co/efficient-speech/lite-whisper-tiny-fast) | 27.09 | 6.48M | 29.55M |
| | | | |
| [whisper-base](https://huggingface.co/openai/whisper-base) | 17.67 | 19.82M | 52.00M |
| [lite-whisper-base-acc](https://huggingface.co/efficient-speech/lite-whisper-base-acc) | 19.07 | 18.64M | 52.00M |
| [lite-whisper-base](https://huggingface.co/efficient-speech/lite-whisper-base) | 19.71 | 17.44M | 52.00M |
| [lite-whisper-base-fast](https://huggingface.co/efficient-speech/lite-whisper-base-fast) | 23.05 | 16.07M | 52.00M |
| | | | |
| [whisper-small](https://huggingface.co/openai/whisper-small) | 15.89 | 87.00M | 153.58M |
| [lite-whisper-small-acc](https://huggingface.co/efficient-speech/lite-whisper-small-acc) | 15.37 | 76.99M | 153.58M |
| [lite-whisper-small](https://huggingface.co/efficient-speech/lite-whisper-small) | 14.96 | 70.16M | 153.58M |
| [lite-whisper-small-fast](https://huggingface.co/efficient-speech/lite-whisper-small-fast) | 14.92 | 63.11M | 153.58M |
| | | | |
| [whisper-medium](https://huggingface.co/openai/whisper-medium) | 15.12 | 305.68M | 456.64M |
| [lite-whisper-medium-acc](https://huggingface.co/efficient-speech/lite-whisper-medium-acc) | 13.46 | 269.93M | 456.64M |
| [lite-whisper-medium](https://huggingface.co/efficient-speech/lite-whisper-medium) | 14.50 | 239.99M | 456.64M |
| [lite-whisper-medium-fast](https://huggingface.co/efficient-speech/lite-whisper-medium-fast) | 14.52 | 215.31M | 456.64M |

## Citation

If you use LiteASR in your research, please cite the following paper:

```
@misc{kamahori2025liteasrefficientautomaticspeech,
      title={LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation},
      author={Keisuke Kamahori and Jungo Kasai and Noriyuki Kojima and Baris Kasikci},
      year={2025},
      eprint={2502.20583},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.20583},
}
```
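
## Illustrative Sketch: Low-Rank Approximation of a Linear Layer

The abstract describes approximating a linear transformation with a chain of low-rank matrix multiplications, using PCA on activations from a small calibration set. The NumPy snippet below is a minimal sketch of that general idea, not the LiteASR implementation. The function name `low_rank_linear`, the synthetic data, and the layer sizes are assumptions chosen for illustration, and the rank `k` is fixed by hand here, whereas a real pipeline would have to choose it per layer.

```python
import numpy as np


def low_rank_linear(W, b, X_cal, k):
    """Factor y = x @ W + b into two thinner matmuls via PCA of calibration outputs.

    Hypothetical helper for illustration only. Returns (W1, W2, b2) such that
    x @ W1 @ W2 + b2 approximates x @ W + b.
    """
    Y = X_cal @ W + b                  # calibration activations, shape (N, d_out)
    mu = Y.mean(axis=0)                # per-feature mean of the activations
    # Right singular vectors of the centered activations are the principal directions.
    _, _, Vt = np.linalg.svd(Y - mu, full_matrices=False)
    V_k = Vt[:k].T                     # top-k directions, shape (d_out, k)
    W1 = W @ V_k                       # project into the k-dim subspace, (d_in, k)
    W2 = V_k.T                         # map back to d_out, (k, d_out)
    b2 = mu + (b - mu) @ V_k @ V_k.T   # bias folded through the projection
    return W1, W2, b2


rng = np.random.default_rng(0)
d_in, d_out, true_rank, k = 64, 64, 8, 8

# Synthetic inputs confined to an 8-dimensional subspace, so the layer's
# activations are low-rank by construction (an assumption of this toy setup).
basis = rng.standard_normal((true_rank, d_in))
X_cal = rng.standard_normal((256, true_rank)) @ basis
X_test = rng.standard_normal((32, true_rank)) @ basis

W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

W1, W2, b2 = low_rank_linear(W, b, X_cal, k)

y_full = X_test @ W + b
y_low = X_test @ W1 @ W2 + b2
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)

params_full = W.size
params_low = W1.size + W2.size
print(f"relative error: {rel_err:.2e}")
print(f"weight parameters: {params_full} -> {params_low}")
```

The factorization only saves parameters when `k * (d_in + d_out) < d_in * d_out`. With real activations that are only approximately low-rank, a smaller `k` gives a smaller layer at the cost of a larger approximation error.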