nielsr (HF Staff) committed
Commit 474f7d6 · verified · 1 parent: 5c3eb6c

Improve model card with abstract, quick start, and detailed methodology


This PR enhances the model card for `efficient-speech/lite-whisper-base-acc` by:
- Adding a descriptive H1 title for better clarity.
- Including the paper abstract for a more detailed introduction to LiteASR.
- Adding a "Quick Start" code snippet for immediate model usage with the `transformers` library, adapted for this specific `base-acc` model.
- Expanding the "Benchmark Results" to cover larger LiteASR models, providing a comprehensive view of the family's performance.
- Integrating the "Motivations and Methodology" section with explanatory figures to provide deeper technical insights into the compression scheme.

These additions significantly improve the model card's informativeness and user-friendliness, highlighting the efficiency and performance benefits of LiteASR.

Files changed (1): README.md (+81 −2)
README.md CHANGED
@@ -10,14 +10,60 @@ tags:
 - hf-asr-leaderboard
 ---
 
-<!-- Provide a quick summary of what the model is/does. -->
 
-Lite-Whisper is a compressed version of OpenAI Whisper with LiteASR. See our [GitHub repository](https://github.com/efeslab/LiteASR) and [paper](https://arxiv.org/abs/2502.20583) for details.
 
 ## Benchmark Results
 
 Following is the average word error rate (WER) evaluated on the [ESB datasets](https://huggingface.co/datasets/hf-audio/esb-datasets-test-only-sorted):
 
 | Model | Average WER (↓) | Encoder Size | Decoder Size |
 |-------|----------------|--------------|--------------|
 | [whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 22.01 | 7.63M | 29.55M |
@@ -40,6 +86,39 @@ Following is the average word error rate (WER) evaluated on the [ESB datasets](h
 | [lite-whisper-medium](https://huggingface.co/efficient-speech/lite-whisper-medium) | 14.50 | 239.99M | 456.64M |
 | [lite-whisper-medium-fast](https://huggingface.co/efficient-speech/lite-whisper-medium-fast) | 14.52 | 215.31M | 456.64M |
 
 ## Citation
 
 - hf-asr-leaderboard
 ---
 
+# Lite-Whisper-Base-Acc: Efficient Automatic Speech Recognition with Low-Rank Approximation
+
+This repository contains the `lite-whisper-base-acc` model, a compressed version of `openai/whisper-base` built with the LiteASR method.
+
+For more technical details, refer to our [paper](https://arxiv.org/abs/2502.20583) and the [GitHub repository](https://github.com/efeslab/LiteASR).
+
+## Abstract
+
+Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency. The code of LiteASR is available at [this GitHub repository](https://github.com/efeslab/LiteASR).
+
+## Quick Start
+
+The easiest way to run our model is through our integration with the Hugging Face Transformers library. We provide model weights for compressed versions of the OpenAI Whisper series [here](https://huggingface.co/efficient-speech).
+
+```python
+import librosa
+import torch
+from transformers import AutoProcessor, AutoModel
+
+device = "cuda:0"
+dtype = torch.float16
+
+# load the compressed Whisper model
+model = AutoModel.from_pretrained(
+    "efficient-speech/lite-whisper-base-acc",
+    trust_remote_code=True,
+)
+model.to(dtype).to(device)
+
+# we use the same processor as the original model
+processor = AutoProcessor.from_pretrained("openai/whisper-base")
+
+# set the path to your audio file
+path = "path/to/audio.wav"
+audio, _ = librosa.load(path, sr=16000)
+
+input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
+input_features = input_features.to(dtype).to(device)
+
+predicted_ids = model.generate(input_features)
+transcription = processor.batch_decode(
+    predicted_ids,
+    skip_special_tokens=True,
+)[0]
+
+print(transcription)
+```
 
 ## Benchmark Results
 
 Following is the average word error rate (WER) evaluated on the [ESB datasets](https://huggingface.co/datasets/hf-audio/esb-datasets-test-only-sorted):
 
+The table below presents the performance across various LiteASR model sizes:
+
 | Model | Average WER (↓) | Encoder Size | Decoder Size |
 |-------|----------------|--------------|--------------|
 | [whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 22.01 | 7.63M | 29.55M |
@@ -40,6 +86,39 @@ Following is the average word error rate (WER) evaluated on the [ESB datasets](h
 | [lite-whisper-medium](https://huggingface.co/efficient-speech/lite-whisper-medium) | 14.50 | 239.99M | 456.64M |
 | [lite-whisper-medium-fast](https://huggingface.co/efficient-speech/lite-whisper-medium-fast) | 14.52 | 215.31M | 456.64M |
 
+For `large-v3` and `large-v3-turbo` models, the results are:
+
+| Model | Average WER (↓) | Encoder Size | Decoder Size |
+|-------|----------------|--------------|--------------|
+| [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 10.1 | 635M | 907M |
+| [lite-whisper-large-v3-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-acc) | 10.1 | 429M | 907M |
+| [lite-whisper-large-v3](https://huggingface.co/efficient-speech/lite-whisper-large-v3) | 10.2 | 377M | 907M |
+| [lite-whisper-large-v3-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-fast) | 11.3 | 308M | 907M |
+| &nbsp; | &nbsp; | &nbsp; | &nbsp; |
+| [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 10.1 | 635M | 172M |
+| [lite-whisper-large-v3-turbo-acc](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-acc) | 10.2 | 421M | 172M |
+| [lite-whisper-large-v3-turbo](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo) | 12.6 | 374M | 172M |
+| [lite-whisper-large-v3-turbo-fast](https://huggingface.co/efficient-speech/lite-whisper-large-v3-turbo-fast) | 20.1 | 313M | 172M |
+
+## Motivations and Methodology
+
+State-of-the-art ASR models typically employ encoder-decoder architectures, with LiteASR focusing specifically on compressing the encoder part. The encoder has emerged as the primary runtime bottleneck for two key reasons:
+
+1. Recent works like [Whisper-Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) and [Distil-Whisper](https://huggingface.co/distil-whisper/distil-large-v3) demonstrate that the decoder can be aggressively compressed (8x or 16x) through distillation techniques with minimal impact on accuracy.
+
+2. Unlike decoders (which are primarily *memory-bound*), encoders must process longer sequences (e.g., 1500 tokens for Whisper) and are *compute-bound*, resulting in limited [batching efficiency](https://le.qun.ch/en/blog/2023/05/13/transformer-batching/) when handling multiple concurrent requests.
+
+The figure below illustrates the latency breakdown across various hardware configurations, model architectures, and batch sizes. As batch sizes increase and more recent models (i.e., turbo) are used, the encoder's proportion of overall latency grows substantially. This effect is particularly pronounced on devices with less computational power, such as the M1 Pro MacBook, compared to NVIDIA GPUs.
+
+![Latency Breakdown](https://github.com/efeslab/LiteASR/blob/main/assets/latency_breakdown.png?raw=true)
+
+Compounding these challenges, Whisper models require fixed-length inputs (30 seconds = 1500 tokens) for encoders, creating inefficiencies when processing short audio clips or building [streaming applications](https://github.com/ufal/whisper_streaming). Weight quantization techniques, while effective at reducing model size, cannot accelerate the compute-bound encoders.
+
+LiteASR addresses these limitations by fundamentally reducing the computational requirements of ASR encoders. Our approach is based on a key insight: we observed consistent low-rank structures in activation values during inference across diverse inputs. This property lets us approximate activation values as products of low-rank matrices, which in turn allows us to represent weight matrices as chains of computationally efficient low-rank matrix multiplications.
+
+The simplified visualization below illustrates our core concept. For a more comprehensive technical explanation, please refer to our [paper](https://arxiv.org/abs/2502.20583).
+
+![LiteASR Idea](https://github.com/efeslab/LiteASR/blob/main/assets/idea.png?raw=true)
 
 ## Citation