---
license: mit
base_model: UsefulSensors/moonshine-streaming-tiny
language:
- en
tags:
- onnx
- int8
- fp16
- quantized
- optimized
- speech-recognition
- asr
- streaming
- moonshine
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---
# Moonshine Streaming Tiny – Optimized
Optimized variants of [UsefulSensors/moonshine-streaming-tiny](https://huggingface.co/UsefulSensors/moonshine-streaming-tiny), a 34M parameter streaming ASR model designed for real-time, on-device English speech recognition.
Based on: [Moonshine v2: Ergodic Streaming Encoder ASR](https://arxiv.org/abs/2602.12241)
## Optimized Variants
| Variant | Total Size | Size Reduction | Best For |
|---------|-----------|---------------|----------|
| **Original FP32** | 168.1 MB | – | Reference |
| **ONNX INT8** | 79.8 MB | **52%** | CPU deployment, edge devices |
| **FP16 SafeTensors** | 88.1 MB | **48%** | GPU inference |
| **ONNX FP32** | 297 MB | – | ONNX Runtime without quantization |
## Benchmark Results
Tested with 5 seconds of audio, generating up to 64 tokens:
| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---------|------------|-----|---------------------|
| **PyTorch FP16 (GPU)** | 47.7 ms | 0.0095 | **1.71x** |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |
> **Note**: The ONNX numbers include session overhead and reflect a single test run. For production deployment with session reuse on real audio, ONNX Runtime typically provides better throughput, especially for long-running services. The Moonshine team reports 50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
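Real-time factor (RTF) is wall-clock inference time divided by audio duration, so an RTF of 0.03 means 5 s of audio is transcribed in about 150 ms. A minimal timing sketch under the same conditions (warm-up pass, session reuse); `transcribe` is a placeholder for the encoder call plus decode loop shown in the Usage section below:
```python
import time
import numpy as np

AUDIO_SECONDS = 5
audio = np.random.randn(16000 * AUDIO_SECONDS).astype(np.float32)

# transcribe() is a placeholder for the encoder + autoregressive decode
# loop from the Usage section; sessions are created once and reused.
transcribe(audio)  # warm-up pass so one-time initialization is excluded

runs = 10
start = time.perf_counter()
for _ in range(runs):
    transcribe(audio)
latency = (time.perf_counter() - start) / runs

rtf = latency / AUDIO_SECONDS  # < 1.0 means faster than real time
print(f"avg latency: {latency * 1e3:.1f} ms  RTF: {rtf:.4f}")
```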
## File Structure
```
├── onnx_int8/                              # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx             # 9.8 MB
│   ├── decoder_model_int8.onnx             # 36 MB
│   ├── decoder_with_past_model_int8.onnx   # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                   # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                   # FP16 SafeTensors (for GPU)
    ├── model.safetensors                   # 88.1 MB
    ├── config.json
    └── tokenizer.json
```
## Usage
### ONNX INT8 Inference (CPU – Recommended for Edge)
```bash
pip install onnxruntime numpy tokenizers
```
```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "onnx_int8"  # or download from this repo
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16 kHz float32, padded to a multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))
audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# First decode step (no KV cache)
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping: present_* outputs become past_* inputs
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}
kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor

# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]
for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor

text = tokenizer.decode(tokens)
print(text)
```
### FP16 PyTorch Inference (GPU)
```python
import numpy as np
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")
processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)

# Process audio (1-D float32 array at 16 kHz)
audio_array = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
# Cast only floating-point tensors to FP16; integer tensors (e.g. masks) keep their dtype
inputs = {
    k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}

generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```
### PyTorch Dynamic INT8 (CPU – Quick Setup)
```python
import numpy as np
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()

# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
audio_array = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```
## ONNX Export Details
- **Encoder**: Exported with `torch.onnx.export(dynamo=True)` to handle vmap-based sliding-window attention masking
- **Decoder**: Separate models for first step (no KV cache) and autoregressive steps (with KV cache)
- **Quantization**: `onnxruntime.quantization.quantize_dynamic` with symmetric INT8, per-channel, reduce_range=True
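A minimal sketch of the quantization step, assuming the FP32 ONNX files have already been exported (paths are illustrative; the same call applies to each of the three models):
```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Illustrative paths; repeat for decoder_model.onnx and decoder_with_past_model.onnx
quantize_dynamic(
    model_input="onnx/encoder_model.onnx",
    model_output="onnx_int8/encoder_model_int8.onnx",
    weight_type=QuantType.QInt8,  # symmetric INT8 weights
    per_channel=True,             # one scale per output channel
    reduce_range=True,            # 7-bit weight range for safer x86 saturation behavior
)
```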
### KV Cache Structure
Each decoder layer produces 4 KV tensors:
- `present_{layer}_self_key` / `present_{layer}_self_value`: Self-attention cache [B, 8, S, 40]
- `present_{layer}_cross_key` / `present_{layer}_cross_value`: Cross-attention cache [B, 8, T, 40]
For `decoder_with_past_model`, feed these back as `past_{layer}_*` inputs.
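The renaming is mechanical, so it can be factored into a small helper; this sketch mirrors the convention above and the mapping already used in the Usage example:
```python
def presents_to_pasts(present_names, present_tensors, past_input_names):
    """Rename present_* KV outputs to the matching past_* inputs for the next step."""
    return {
        name.replace("present_", "past_", 1): tensor
        for name, tensor in zip(present_names, present_tensors)
        if name.replace("present_", "past_", 1) in past_input_names
    }
```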
## Quantization Impact
Based on the [Edge-ASR paper](https://arxiv.org/abs/2507.07877) (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:
| Config | Avg WER | vs FP32 |
|--------|---------|---------|
| FP32 baseline | 12.72% | – |
| **W8-A8 (INT8)** | **12.81%** | **+0.09%** |
| W4-A16 (SpQR) | 13.61% | +0.89% |
INT8 is the sweet spot for Moonshine Tiny: virtually no accuracy loss with ~50% model size reduction.
## Limitations
- English only
- Optimized for short utterances (streaming chunks of 1–5 seconds)
- ONNX models use external data files (`.onnx.data`) for FP32 variant
- The decoder uses autoregressive generation, so output latency scales with transcript length
## Citation
```bibtex
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
```
## License
MIT (same as base model)