# Optimized Whisper Small (FP16 + CUDA Graphs + Flash Attention)
This is an optimized version of openai/whisper-small designed for NVIDIA GPUs. It achieves up to a 4x speedup over the baseline by leveraging:

- **FP16 precision**: Reduced memory footprint and tensor core acceleration.
- **Flash Attention 2 (SDPA)**: Faster attention with lower memory overhead.
- **CUDA Graphs**: Eliminates CPU launch overhead by capturing the decode graph with a static KV cache.
## Performance

- **Speedup**: ~2-3x faster than the FP32 eager-mode baseline
- **Precision**: FP16
- **Architecture**: Standard Whisper encoder-decoder with static caching enabled.
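A speedup like the one above can be verified with a simple wall-clock benchmark. The sketch below is a hypothetical helper (not part of this repo) that times any callable; for GPU workloads the callable should synchronize internally (e.g. call `torch.cuda.synchronize()`) so the measurement covers kernel execution, and warmup iterations matter because the first `torch.compile` call pays the graph-capture cost.

```python
import time
import statistics

def median_latency_ms(fn, warmup: int = 3, iters: int = 10) -> float:
    """Median wall-clock latency of fn() in milliseconds.

    Warmup iterations are excluded: with torch.compile the first call
    triggers compilation and CUDA graph capture, so steady-state numbers
    should not include it.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

# Usage (hypothetical baseline_fn / optimized_fn wrapping model.generate):
# speedup = median_latency_ms(baseline_fn) / median_latency_ms(optimized_fn)
```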
## GitHub

https://github.com/ItzDEXX/WhisperOptimization
## How to Run (Important!)

To get the speedup, you must enable CUDA Graphs (compilation) at runtime. The model weights are standard FP16; the speed comes from `torch.compile`.
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "YOUR_USERNAME/YOUR_REPO_NAME"  # <--- REPLACE THIS

# 1. Load the model in FP16 with SDPA attention
model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # Flash Attention via PyTorch SDPA
    device_map="cuda",
)

# 2. Enable CUDA Graphs (the magic step): static KV cache + compile
model.generation_config.cache_implementation = "static"
model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

# 3. Run inference
processor = WhisperProcessor.from_pretrained(model_id)
# ... standard generation code ...
```
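The "standard generation code" could look like the sketch below. This is a hedged example, not the repo's exact code: `transcribe` is a hypothetical helper, and it assumes `model` and `processor` were loaded as shown above and that `audio` is a 1-D float waveform sampled at 16 kHz.

```python
SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio

def transcribe(model, processor, audio) -> str:
    """One transcription pass with the compiled model loaded above."""
    inputs = processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    features = inputs.input_features.to("cuda").half()  # match FP16 weights
    # The first call is slow (compilation + CUDA graph capture); later calls
    # replay the captured graph because the static cache keeps shapes fixed.
    predicted_ids = model.generate(features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

Note that the first call to `transcribe` will take noticeably longer than steady-state; benchmark only after a warmup run.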