Optimized Whisper Small (FP16 + CUDA Graphs + Flash Attention)

This is an optimized version of openai/whisper-small designed for NVIDIA GPUs. It achieves up to a 4x speedup over the baseline (typically 2-3x; see Performance below) by leveraging:

  1. FP16 Precision: Halves the memory footprint and enables tensor core acceleration.
  2. SDPA Attention: PyTorch's scaled dot-product attention, which can dispatch to Flash Attention kernels for faster attention with lower memory overhead.
  3. CUDA Graphs: Eliminates per-step CPU launch overhead by capturing the decode loop with a static KV cache.

πŸš€ Performance

  • Speedup: ~2-3x faster than the FP32 eager-mode baseline.
  • Precision: FP16
  • Architecture: Standard Whisper Encoder-Decoder with static caching enabled.
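To sanity-check the speedup numbers above on your own hardware, a minimal timing harness like the following can be used. This is an illustrative sketch, not part of the repository; `time_fn` and the dummy workloads are hypothetical stand-ins for the baseline and compiled models.

```python
import time

def time_fn(fn, warmup=3, iters=10):
    """Median wall-clock seconds per call, after warmup runs.

    When timing CUDA work, call torch.cuda.synchronize() before reading
    the clock so queued kernels are included in the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

# Dummy callables standing in for baseline vs. optimized model.generate().
baseline = lambda: sum(i * i for i in range(100_000))
optimized = lambda: sum(i * i for i in range(50_000))

speedup = time_fn(baseline) / time_fn(optimized)
print(f"speedup: {speedup:.1f}x")
```

Warmup iterations matter here: with `mode="reduce-overhead"`, the first few calls pay compilation and graph-capture cost and would skew the median.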

GitHub

https://github.com/ItzDEXX/WhisperOptimization

πŸ› οΈ How to Run (Important!)

To achieve the speedup, you must enable CUDA Graphs (compilation) at runtime. The model weights are standard FP16; the speed comes from torch.compile and the static cache. Note that the first few generate() calls are slow while the graph is compiled and captured.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "YOUR_USERNAME/YOUR_REPO_NAME" # <--- REPLACE THIS

# 1. Load Model
model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # SDPA (can dispatch to Flash Attention kernels)
    device_map="cuda",
)

# 2. Enable CUDA Graphs (The Magic Step)
model.generation_config.cache_implementation = "static"
model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

# 3. Run Inference
processor = WhisperProcessor.from_pretrained(model_id)
# ... standard generation code ...
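The elided generation step can be sketched as a helper like the one below. This is a hypothetical example, not code from the repository; it assumes a 16 kHz mono waveform as a NumPy array (e.g. loaded with soundfile or librosa) and the `model` and `processor` objects created above.

```python
def transcribe(model, processor, waveform, sampling_rate=16000):
    """Transcribe a 16 kHz mono waveform with the compiled Whisper model."""
    import torch

    # Convert raw audio to the log-mel input features Whisper expects.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    features = inputs.input_features.to("cuda", dtype=torch.float16)

    # The first few calls are slow: torch.compile traces the model and
    # captures CUDA graphs before the speedup kicks in.
    with torch.no_grad():
        predicted_ids = model.generate(features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

Keeping input shapes fixed across calls (the processor pads audio to 30 s by default) avoids recompilation, which is what lets the captured CUDA graph be reused.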