# moonshine-tiny-optimized

Optimized version of [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) for faster GPU inference.

## Optimizations applied

- **FP16 weights** — halves the model's memory footprint
- **SDPA (Scaled Dot-Product Attention)** — uses optimized fused attention kernels
- **Static KV cache support** — pre-allocates the cache during generation for up to 1.19x speedup
- **Updated config** — `attn_implementation="sdpa"` set as the default

## Benchmarks (T4 GPU, 5 s audio)

| Variant | Median Time | Speedup vs Baseline | Peak Memory |
|---------|-------------|---------------------|-------------|
| Baseline FP32 | 0.028s | 1.00x | 123.0 MB |
| FP16 + SDPA | 0.028s | 0.98x | 118.6 MB |
| **FP16 + SDPA + Static KV** | **0.024s** | **1.19x** | **72.1 MB** |
| torch.compile | 0.030s | 0.94x | 126.3 MB |
| 8-bit quantization | 0.108s | 0.26x | 124.0 MB |

**Best config**: `FP16 + SDPA + static KV cache` gives a **1.19x speedup** and a **41% reduction in peak memory**.

## Usage

```python
from transformers import AutoProcessor, MoonshineForConditionalGeneration
import librosa
import torch

processor = AutoProcessor.from_pretrained("felixem/moonshine-tiny-optimized")
model = MoonshineForConditionalGeneration.from_pretrained(
    "felixem/moonshine-tiny-optimized",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load a 16 kHz mono waveform ("audio.wav" is a placeholder; any loader
# that yields a float array works).
audio_array, _ = librosa.load("audio.wav", sr=16000)

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)

# For maximum speed, use the static KV cache:
generated_ids = model.generate(
    **inputs,
    cache_implementation="static",
    max_new_tokens=50,
)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```

## Model Info

- **Architecture**: Moonshine (encoder-decoder transformer, RoPE)
- **Parameters**: 27.1M
- **License**: MIT
- **Original model**: https://huggingface.co/UsefulSensors/moonshine-tiny
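
## Producing an optimized checkpoint (sketch)

The optimizations above (FP16 weights plus SDPA as the default attention backend) can be baked into a re-exported checkpoint roughly as follows. This is a minimal sketch under assumed settings, not the exact script used to build this repo; depending on your `transformers` version, the attention default may need to be set in `config.json` by hand.

```python
from transformers import AutoProcessor, MoonshineForConditionalGeneration
import torch

# Load the original checkpoint in FP16 with the SDPA attention backend.
model = MoonshineForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-tiny",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-tiny")

# Re-save so the FP16 weights become the shipped weights. Note: whether the
# SDPA default is persisted automatically varies by transformers version;
# if not, add "attn_implementation": "sdpa" to config.json manually.
model.save_pretrained("moonshine-tiny-optimized")
processor.save_pretrained("moonshine-tiny-optimized")
```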
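
## Reproducing the benchmark (sketch)

The table above reports median latency and peak GPU memory per variant. The harness below is an illustrative sketch of how such numbers can be measured (warmup runs, median over repeated `generate` calls, `torch.cuda.max_memory_allocated` for peak memory); it is not the exact script behind the table, and the synthetic 5-second input, run counts, and `max_new_tokens` are placeholder choices.

```python
import time

import numpy as np
import torch
from transformers import AutoProcessor, MoonshineForConditionalGeneration

device = "cuda"
processor = AutoProcessor.from_pretrained("felixem/moonshine-tiny-optimized")
model = MoonshineForConditionalGeneration.from_pretrained(
    "felixem/moonshine-tiny-optimized", torch_dtype=torch.float16
).to(device)

# Stand-in for 5 seconds of 16 kHz audio; substitute a real recording.
audio = np.random.randn(16000 * 5).astype(np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(device, dtype=model.dtype)


def bench(cache_impl, runs=20, warmup=3):
    """Median generate() latency and peak memory for a given cache implementation."""
    times = []
    torch.cuda.reset_peak_memory_stats()
    for i in range(warmup + runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(**inputs, cache_implementation=cache_impl, max_new_tokens=50)
        torch.cuda.synchronize()
        if i >= warmup:  # discard warmup iterations
            times.append(time.perf_counter() - start)
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return float(np.median(times)), peak_mb


for impl in (None, "static"):  # None -> default dynamic KV cache
    median_s, peak_mb = bench(impl)
    print(f"cache={impl}: median {median_s:.3f}s, peak {peak_mb:.1f} MB")
```

Absolute timings depend on the GPU, driver, and `transformers` version, so expect the ratios rather than the exact milliseconds to carry over.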