
# moonshine-tiny-optimized

Optimized version of UsefulSensors/moonshine-tiny for faster GPU inference.

## Optimizations applied

- **FP16 weights**: halves the model's memory footprint
- **SDPA (Scaled Dot-Product Attention)**: uses PyTorch's optimized fused attention kernels
- **Static KV cache support**: pre-allocates the cache during generation, for up to a 1.19x speedup
- **Updated config**: `attn_implementation="sdpa"` set as the default (see the sketch after this list)
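
Because the SDPA default is baked into the shipped config, a plain `from_pretrained` call already uses it. If you want to pin the attention backend explicitly (for example, to compare against eager attention), `transformers` accepts an `attn_implementation` argument; a minimal sketch:

```python
import torch
from transformers import MoonshineForConditionalGeneration

# Explicitly request the fused SDPA kernels; this matches the config default shipped here.
model = MoonshineForConditionalGeneration.from_pretrained(
    "felixem/moonshine-tiny-optimized",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```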

## Benchmarks (T4 GPU, 5s audio)

| Variant | Median time (s) | Speedup vs baseline | Peak memory (MB) |
|---|---|---|---|
| Baseline FP32 | 0.028 | 1.00x | 123.0 |
| FP16 + SDPA | 0.028 | 0.98x | 118.6 |
| FP16 + SDPA + static KV | 0.024 | 1.19x | 72.1 |
| torch.compile | 0.030 | 0.94x | 126.3 |
| 8-bit quantization | 0.108 | 0.26x | 124.0 |

**Best config:** FP16 + SDPA + static KV cache gives a 1.19x speedup and a 41% reduction in peak memory.
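
The card does not include the benchmark harness itself; the following is a minimal sketch of one way to reproduce timings like these, assuming a CUDA device and using 5 seconds of silence at 16 kHz as a stand-in for real speech:

```python
import time
from statistics import median

import numpy as np
import torch
from transformers import AutoProcessor, MoonshineForConditionalGeneration

processor = AutoProcessor.from_pretrained("felixem/moonshine-tiny-optimized")
model = MoonshineForConditionalGeneration.from_pretrained(
    "felixem/moonshine-tiny-optimized", torch_dtype=torch.float16
).to("cuda")

# 5 seconds of silence at 16 kHz stands in for real audio input.
audio = np.zeros(16000 * 5, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)

torch.cuda.reset_peak_memory_stats()
times = []
for _ in range(20):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, cache_implementation="static", max_new_tokens=50)
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)

# Drop the first few iterations: they include static-cache setup and warmup.
print(f"median: {median(times[5:]):.3f}s")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```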

## Usage

```python
from transformers import AutoProcessor, MoonshineForConditionalGeneration
import torch

processor = AutoProcessor.from_pretrained("felixem/moonshine-tiny-optimized")
model = MoonshineForConditionalGeneration.from_pretrained(
    "felixem/moonshine-tiny-optimized",
    torch_dtype=torch.float16,
    device_map="auto",
)

# audio_array: 1-D float array sampled at 16 kHz (see the loading example below)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)

# For maximum speed, use the static KV cache:
generated_ids = model.generate(
    **inputs,
    cache_implementation="static",
    max_new_tokens=50,
)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
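
One hypothetical way to produce `audio_array` from a file is with librosa (an assumption for illustration, not a dependency of this model; any loader that resamples to 16 kHz works):

```python
import librosa

# librosa resamples to the 16 kHz rate Moonshine expects; "speech.wav" is a placeholder path.
audio_array, _ = librosa.load("speech.wav", sr=16000)
```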

## Model Info

- Model size: 27.1M params
- Tensor type: FP16 (Safetensors)