+---
+license: mit
+base_model: UsefulSensors/moonshine-streaming-tiny
+language:
+- en
+tags:
+  - onnx
+  - int8
+  - fp16
+  - quantized
+  - optimized
+  - speech-recognition
+  - asr
+  - streaming
+  - moonshine
+library_name: onnxruntime
+pipeline_tag: automatic-speech-recognition
+---
+# Moonshine Streaming Tiny — Optimized
+Optimized variants of [UsefulSensors/moonshine-streaming-tiny](https://huggingface.co/UsefulSensors/moonshine-streaming-tiny), a 34M-parameter streaming ASR model designed for real-time, on-device English speech recognition.
+Based on: [Moonshine v2: Ergodic Streaming Encoder ASR](https://arxiv.org/abs/2602.12241)
+## Optimized Variants
+| Variant | Total Size | Size Reduction | Best For |
+|---------|-----------|---------------|----------|
+| **Original FP32** | 168.1 MB | — | Reference |
+| **ONNX INT8** | 79.8 MB | **52%** | CPU deployment, edge devices |
+| **FP16 SafeTensors** | 88.1 MB | **48%** | GPU inference |
+| **ONNX FP32** | 297 MB | none (77% larger) | ONNX Runtime without quantization |
+## Benchmark Results
+Measured on 5 seconds of audio, generating up to 64 tokens. RTF (real-time factor) is average latency divided by audio duration, so lower is better:
+| Variant | Avg Latency | RTF | Speedup vs PyTorch FP32 (CPU) |
+|---------|------------|-----|---------------------|
+| **PyTorch FP16 (GPU)** | 47.7 ms | 0.0095 | **1.70x** |
+| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
+| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
+| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
+| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |
+> **Note**: The ONNX figures include session-creation overhead and come from a single run, so they likely understate ONNX Runtime's steady-state speed. With session reuse on real audio, ONNX Runtime typically delivers better throughput, especially in long-running services. The Moonshine team reports 50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
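The RTF and speedup columns follow directly from the latencies: RTF is the average latency divided by the clip duration (5 s here), and speedup is the baseline latency divided by the variant's latency. The sketch below shows that arithmetic together with a small timing helper. The function names are illustrative and not part of this repository, and the warmup and run counts are arbitrary assumptions.

```python
import time

AUDIO_SECONDS = 5.0  # duration of the benchmark clip stated above


def real_time_factor(latency_ms, audio_seconds=AUDIO_SECONDS):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return (latency_ms / 1000.0) / audio_seconds


def speedup(baseline_ms, latency_ms):
    """How many times faster a variant is than the baseline."""
    return baseline_ms / latency_ms


def average_latency_ms(fn, warmup=2, runs=10):
    """Call fn() `warmup` times untimed, then return the mean wall-clock time of `runs` calls in ms."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


# Latencies copied from the table above (milliseconds).
LATENCIES_MS = {
    "PyTorch FP16 (GPU)": 47.7,
    "PyTorch INT8 (CPU)": 78.6,
    "PyTorch FP32 (CPU)": 81.3,
    "ONNX FP32 (CPU)": 115.5,
    "ONNX INT8 (CPU)": 153.2,
}

baseline = LATENCIES_MS["PyTorch FP32 (CPU)"]
for name, ms in LATENCIES_MS.items():
    print(f"{name}: RTF={real_time_factor(ms):.4f}, speedup={speedup(baseline, ms):.2f}x")
```

Passing `average_latency_ms` a closure that reuses an already-created `InferenceSession` keeps session creation out of the measurement.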
+## File Structure
+```
+├── onnx_int8/                          # ONNX INT8 quantized (recommended for CPU)
+│   ├── encoder_model_int8.onnx         # 9.8 MB
+│   ├── decoder_model_int8.onnx         # 36 MB
+│   ├── decoder_with_past_model_int8.onnx # 32 MB
+│   ├── tokenizer.json
+│   ├── config.json
+│   └── quantize_config.json
+├── onnx/                               # ONNX FP32
+│   ├── encoder_model.onnx + .data
+│   ├── decoder_model.onnx + .data
+│   ├── decoder_with_past_model.onnx + .data
+│   └── ...
+└── fp16/                               # FP16 SafeTensors (for GPU)
+    ├── model.safetensors               # 88.1 MB
+    ├── config.json
+    └── tokenizer.json
+```
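To fetch only one of the folders above rather than the whole repository, `huggingface_hub.snapshot_download` accepts an `allow_patterns` filter. The following is a minimal sketch that assumes `huggingface_hub` is installed and that the repository layout matches the tree above. The helper names `variant_patterns` and `download_variant` are hypothetical.

```python
from pathlib import Path

REPO_ID = "felixem/moonshine-streaming-tiny-optimized"
VARIANTS = ("onnx_int8", "onnx", "fp16")  # top-level folders from the tree above


def variant_patterns(variant):
    """Return the glob patterns that select a single variant folder."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return [f"{variant}/*"]


def download_variant(variant="onnx_int8"):
    """Download one variant folder and return its local path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(repo_id=REPO_ID, allow_patterns=variant_patterns(variant))
    return Path(root) / variant
```

The returned path can be used as `MODEL_DIR` in the ONNX example, for instance `MODEL_DIR = str(download_variant("onnx_int8"))`.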
+## Usage
+### ONNX INT8 Inference (CPU — Recommended for Edge)
+```bash
+pip install onnxruntime numpy tokenizers
+```
+```python
+import numpy as np
+import onnxruntime as ort
+from tokenizers import Tokenizer
+MODEL_DIR = "onnx_int8"  # or download from this repo
+BOS, EOS = 1, 2
+# Load models
+opts = ort.SessionOptions()
+opts.intra_op_num_threads = 4
+providers = ["CPUExecutionProvider"]
+encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
+decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
+decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
+tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")
+# Prepare audio (16kHz float32, padded to multiple of 80 samples)
+audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
+remainder = len(audio) % 80
+if remainder:
+    audio = np.pad(audio, (0, 80 - remainder))
+audio_input = audio[np.newaxis, :]
+attention_mask = np.ones_like(audio_input, dtype=np.int64)
+# Encode audio
+(enc_out,) = encoder.run(None, {
+    "input_values": audio_input,
+    "attention_mask": attention_mask,
+})
+# First decode step
+outs = decoder.run(None, {
+    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
+    "encoder_hidden_states": enc_out,
+})
+logits, past_kvs = outs[0], outs[1:]
+token = int(np.argmax(logits[0, -1, :]))
+# Build KV cache mapping
+dec_out_names = [o.name for o in decoder.get_outputs()][1:]
+past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}
+kv_dict = {}
+for name, tensor in zip(dec_out_names, past_kvs):
+    mapped = name.replace("present_", "past_", 1)
+    if mapped in past_in_names:
+        kv_dict[mapped] = tensor
+# Autoregressive decode loop
+past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
+tokens = [token]
+for _ in range(255):
+    if token == EOS:
+        break
+    inputs = {
+        "decoder_input_ids": np.array([[token]], dtype=np.int64),
+        "encoder_hidden_states": enc_out,
+    }
+    inputs.update(kv_dict)
+    outs = decoder_past.run(None, inputs)
+    token = int(np.argmax(outs[0][0, -1, :]))
+    tokens.append(token)
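+    # Refresh the cache with this step's outputs; entries the model does not
+    # re-emit (e.g. a static cross-attention cache) are kept from the previous step
+    for name, tensor in zip(past_out_names, outs[1:]):
+        mapped = name.replace("present_", "past_", 1)
+        if mapped in past_in_names:
+            kv_dict[mapped] = tensor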
+text = tokenizer.decode(tokens)
+print(text)
+```
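The example above feeds random noise to the encoder. One way to substitute real audio is sketched below, using only the standard-library `wave` module and NumPy. It assumes the file is already 16 kHz, mono, 16-bit PCM, and that the model expects samples scaled to roughly [-1, 1); both are assumptions rather than documented requirements. `load_wav_16k_mono` is a hypothetical helper name.

```python
import wave

import numpy as np


def load_wav_16k_mono(path, multiple=80):
    """Read a 16 kHz mono 16-bit PCM WAV file as a (1, N) float32 array.

    N is zero-padded up to the next multiple of `multiple` samples,
    mirroring the padding step in the example above.
    """
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000 or wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected 16 kHz, mono, 16-bit PCM audio")
        # WAV stores little-endian signed 16-bit samples.
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype="<i2")
    audio = pcm.astype(np.float32) / 32768.0  # scale to [-1, 1)
    remainder = len(audio) % multiple
    if remainder:
        audio = np.pad(audio, (0, multiple - remainder))
    return audio[np.newaxis, :]
```

The result can replace `audio_input` directly, with `attention_mask = np.ones_like(audio_input, dtype=np.int64)` built from it as before. Audio at other sample rates would need resampling first, which this sketch does not attempt.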
+### FP16 PyTorch Inference (GPU)
+```python
+from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
+import torch
+model = MoonshineStreamingForConditionalGeneration.from_pretrained(
+    "felixem/moonshine-streaming-tiny-optimized",
+    subfolder="fp16",
+    torch_dtype=torch.float16,
+).to("cuda")
+processor = AutoProcessor.from_pretrained(
+    "felixem/moonshine-streaming-tiny-optimized",
+    subfolder="fp16",
+)
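+# Process audio: audio_array is a 1-D float32 array of 16 kHz mono samples
+inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
+# Cast only floating-point tensors to FP16; integer/bool tensors such as the
+# attention mask keep their dtype
+inputs = {
+    k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
+    for k, v in inputs.items()
+}
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+text = processor.decode(generated_ids[0], skip_special_tokens=True)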
+```
+### PyTorch Dynamic INT8 (CPU — Quick Setup)
+```python
+import torch
+from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
+model = MoonshineStreamingForConditionalGeneration.from_pretrained(
+    "UsefulSensors/moonshine-streaming-tiny"
+).eval()
+# Quantize Linear layers to INT8
+model = torch.quantization.quantize_dynamic(
+    model, {torch.nn.Linear}, dtype=torch.qint8
+)
+processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
+# audio_array is a 1-D float32 array of 16 kHz mono samples
+inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+text = processor.decode(generated_ids[0], skip_special_tokens=True)
+```
+## ONNX Export Details
+- **Encoder**: Exported with `torch.onnx.export(dynamo=True)` to handle vmap-based sliding-window attention masking
+- **Decoder**: Separate models for first step (no KV cache) and autoregressive steps (with KV cache)
+- **Quantization**: `onnxruntime.quantization.quantize_dynamic` with symmetric INT8 weights, `per_channel=True`, and `reduce_range=True`
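The quantization step can be sketched as follows. The settings come from the list above and the file names from the File Structure section, but the script itself is an illustration under those assumptions, not the exact script used to produce this repository. `QuantType.QInt8` is used here as the signed (symmetric) INT8 weight type.

```python
from pathlib import Path

# FP32 file name -> INT8 file name, as listed under File Structure.
FP32_TO_INT8 = {
    "encoder_model.onnx": "encoder_model_int8.onnx",
    "decoder_model.onnx": "decoder_model_int8.onnx",
    "decoder_with_past_model.onnx": "decoder_with_past_model_int8.onnx",
}


def quantization_jobs(src_dir="onnx", dst_dir="onnx_int8"):
    """Return (input_path, output_path) pairs for the three ONNX graphs."""
    return [(Path(src_dir) / src, Path(dst_dir) / dst) for src, dst in FP32_TO_INT8.items()]


def quantize_all(src_dir="onnx", dst_dir="onnx_int8"):
    """Dynamically quantize each FP32 graph to INT8 with the settings listed above."""
    # Imported lazily so quantization_jobs() works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in quantization_jobs(src_dir, dst_dir):
        quantize_dynamic(
            src,
            dst,
            per_channel=True,
            reduce_range=True,
            weight_type=QuantType.QInt8,
        )
```

Dynamic quantization stores INT8 weights and computes activation scales at run time, so no calibration dataset is needed.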
+### KV Cache Structure
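+Each decoder layer produces 4 KV tensors (B = batch size, S = number of tokens decoded so far, T = number of encoder frames; 8 heads of dimension 40):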
+- `present_{layer}_self_key` / `present_{layer}_self_value`: Self-attention cache [B, 8, S, 40]
+- `present_{layer}_cross_key` / `present_{layer}_cross_value`: Cross-attention cache [B, 8, T, 40]
+For `decoder_with_past_model`, feed these back as `past_{layer}_*` inputs.
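The renaming rule and the cache shapes above can be expressed as a few small helpers. This sketch assumes `{layer}` expands to an integer index (for example `present_0_self_key`) and takes the number of decoder layers as a parameter, since it is not stated here. All helper names are hypothetical.

```python
NUM_HEADS, HEAD_DIM = 8, 40  # from the cache shapes listed above


def present_to_past(name):
    """Map a decoder output name to the matching decoder_with_past input name."""
    if not name.startswith("present_"):
        raise ValueError(f"not a present_* tensor name: {name!r}")
    return "past_" + name[len("present_"):]


def layer_cache_names(layer, prefix="present"):
    """List the four cache tensor names for one decoder layer."""
    return [
        f"{prefix}_{layer}_{kind}_{part}"
        for kind in ("self", "cross")
        for part in ("key", "value")
    ]


def cache_bytes(num_layers, batch, self_len, cross_len, bytes_per_value=4):
    """Total KV cache size in bytes for the given sequence lengths (FP32 by default)."""
    per_position = NUM_HEADS * HEAD_DIM * bytes_per_value
    # Factor 2 covers key and value; self and cross caches differ only in length.
    per_layer = 2 * batch * (self_len + cross_len) * per_position
    return num_layers * per_layer
```

Because the self-attention cache grows by one position per generated token while the cross-attention cache stays at length T, `cache_bytes` increases linearly with `self_len` during decoding.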
+## Quantization Impact
+According to the [Edge-ASR paper](https://arxiv.org/abs/2507.07877) (Table 14), INT8 quantization of Moonshine Tiny has a negligible effect on word error rate (WER):
+| Config | Avg WER | vs FP32 |
+|--------|---------|---------|
+| FP32 baseline | 12.72% | — |
+| **W8-A8 (INT8)** | **12.81%** | **+0.09%** |
+| W4-A16 (SpQR) | 13.61% | +0.89% |
+INT8 is the sweet spot for Moonshine Tiny: a 0.09-point WER increase in exchange for a roughly 50% smaller model.
+## Limitations
+- English only
+- Optimized for short utterances (streaming chunks of 1-5 seconds)
+- The ONNX FP32 variant stores its weights in external data files (`.onnx.data`), which must stay in the same directory as their `.onnx` files
+- The decoder uses autoregressive generation, so output latency scales with transcript length
+## Citation
+```bibtex
+@article{kudlur2025moonshine,
+  title={Moonshine v2: Ergodic Streaming Encoder ASR},
+  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
+  journal={arXiv preprint arXiv:2602.12241},
+  year={2026}
+}
+```
+## License
+MIT (same as base model)