# moonshine-tiny-optimized
Optimized version of UsefulSensors/moonshine-tiny for faster GPU inference.
## Optimizations applied
- FP16 weights: halves the model memory footprint
- SDPA (Scaled Dot-Product Attention): uses optimized fused attention kernels
- Static KV cache support: pre-allocates the cache during generation for up to 1.19x speedup
- Updated config: `attn_implementation="sdpa"` set as the default
## Benchmarks (T4 GPU, 5s audio)
| Variant | Median Time | Speedup vs Baseline | Peak Memory |
|---|---|---|---|
| Baseline FP32 | 0.028s | 1.0x | 123.0 MB |
| FP16 + SDPA | 0.028s | 0.98x | 118.6 MB |
| FP16 + SDPA + Static KV | 0.024s | 1.19x | 72.1 MB |
| torch.compile | 0.030s | 0.94x | 126.3 MB |
| 8-bit quantization | 0.108s | 0.26x | 124.0 MB |
Best config: FP16 + SDPA + static KV cache gives a 1.19x speedup and a 41% reduction in peak memory.
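The benchmark script itself isn't published with this card; the sketch below shows one way numbers like these could be measured for the best config. The warmup count, run count, use of silence as input, and peak-memory reporting via `torch.cuda.max_memory_allocated` are all assumptions, not the card author's exact methodology.

```python
import time

import numpy as np
import torch
from transformers import AutoProcessor, MoonshineForConditionalGeneration

processor = AutoProcessor.from_pretrained("felixem/moonshine-tiny-optimized")
model = MoonshineForConditionalGeneration.from_pretrained(
    "felixem/moonshine-tiny-optimized", torch_dtype=torch.float16
).to("cuda")

# 5 s of silence at 16 kHz stands in for a real audio clip.
audio = np.zeros(16000 * 5, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)

def run():
    return model.generate(**inputs, cache_implementation="static", max_new_tokens=50)

# Warmup runs absorb one-time costs (kernel selection, cache allocation).
for _ in range(3):
    run()

torch.cuda.reset_peak_memory_stats()
times = []
for _ in range(10):
    torch.cuda.synchronize()
    start = time.perf_counter()
    run()
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)

print(f"median: {np.median(times):.3f}s  "
      f"peak memory: {torch.cuda.max_memory_allocated() / 2**20:.1f} MB")
```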
## Usage
```python
from transformers import AutoProcessor, MoonshineForConditionalGeneration
import torch

processor = AutoProcessor.from_pretrained("felixem/moonshine-tiny-optimized")
model = MoonshineForConditionalGeneration.from_pretrained(
    "felixem/moonshine-tiny-optimized",
    torch_dtype=torch.float16,
    device_map="auto",
)

# audio_array: 1-D float waveform sampled at 16 kHz
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)

# For maximum speed, use the static KV cache:
generated_ids = model.generate(
    **inputs,
    cache_implementation="static",
    max_new_tokens=50,
)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
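`audio_array` above is a raw 1-D waveform at 16 kHz. One way to produce it from an audio file, assuming `librosa` is installed (the filename is a placeholder):

```python
import librosa

# Decode and resample any audio file to 16 kHz mono, as Moonshine expects.
audio_array, _ = librosa.load("speech.wav", sr=16000, mono=True)
```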
## Model Info
- Architecture: Moonshine (encoder-decoder transformer, RoPE)
- Parameters: 27.1M
- License: MIT
- Original model: https://huggingface.co/UsefulSensors/moonshine-tiny