Optimized Whisper Small (FP16 + CUDA Graphs + Flash Attention)

This is an optimized version of openai/whisper-small designed for NVIDIA GPUs. It achieves up to a 4x speedup over the baseline (typically 2-3x; see Performance below) by leveraging:

  1. FP16 Precision: Halves the memory footprint and enables tensor core acceleration.
  2. SDPA Attention: PyTorch's scaled dot-product attention, which can dispatch to Flash Attention kernels for faster attention with lower memory overhead.
  3. CUDA Graphs: Eliminates per-step CPU launch overhead by capturing the decode loop with a static KV cache.

πŸš€ Performance

  • Speedup: ~2-3x faster than the FP32 eager-mode baseline.
  • Precision: FP16
  • Architecture: Standard Whisper Encoder-Decoder with static caching enabled.
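To sanity-check the speedup numbers above on your own hardware, a minimal timing harness like the following can be used. This is an illustrative sketch, not part of the repository; `time_fn` and the dummy workloads are hypothetical stand-ins for the baseline and compiled models.

```python
import time

def time_fn(fn, warmup=3, iters=10):
    """Median wall-clock seconds per call, after warmup runs.

    When timing CUDA work, call torch.cuda.synchronize() before reading
    the clock so queued kernels are included in the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

# Dummy callables standing in for baseline vs. optimized model.generate().
baseline = lambda: sum(i * i for i in range(100_000))
optimized = lambda: sum(i * i for i in range(50_000))

speedup = time_fn(baseline) / time_fn(optimized)
print(f"speedup: {speedup:.1f}x")
```

Warmup iterations matter here: with `mode="reduce-overhead"`, the first few calls pay compilation and graph-capture cost and would skew the median.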

GitHub

https://github.com/ItzDEXX/WhisperOptimization

πŸ› οΈ How to Run (Important!)

To achieve the speedup, you must enable CUDA Graphs (compilation) at runtime. The model weights are standard FP16; the speed comes from torch.compile and the static cache. Note that the first few generate() calls are slow while the graph is compiled and captured.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "YOUR_USERNAME/YOUR_REPO_NAME" # <--- REPLACE THIS

# 1. Load Model
model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # SDPA (can dispatch to Flash Attention kernels)
    device_map="cuda",
)

# 2. Enable CUDA Graphs (The Magic Step)
model.generation_config.cache_implementation = "static"
model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

# 3. Run Inference
processor = WhisperProcessor.from_pretrained(model_id)
# ... standard generation code ...
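The elided generation step can be sketched as a helper like the one below. This is a hypothetical example, not code from the repository; it assumes a 16 kHz mono waveform as a NumPy array (e.g. loaded with soundfile or librosa) and the `model` and `processor` objects created above.

```python
def transcribe(model, processor, waveform, sampling_rate=16000):
    """Transcribe a 16 kHz mono waveform with the compiled Whisper model."""
    import torch

    # Convert raw audio to the log-mel input features Whisper expects.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    features = inputs.input_features.to("cuda", dtype=torch.float16)

    # The first few calls are slow: torch.compile traces the model and
    # captures CUDA graphs before the speedup kicks in.
    with torch.no_grad():
        predicted_ids = model.generate(features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

Keeping input shapes fixed across calls (the processor pads audio to 30 s by default) avoids recompilation, which is what lets the captured CUDA graph be reused.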