Moonshine Streaming Tiny β Optimized
Optimized variants of UsefulSensors/moonshine-streaming-tiny, a 34M parameter streaming ASR model designed for real-time, on-device English speech recognition.
Based on: Moonshine v2: Ergodic Streaming Encoder ASR
Optimized Variants
| Variant | Total Size | Size Reduction | Best For |
|---|---|---|---|
| Original FP32 | 168.1 MB | – | Reference |
| ONNX INT8 | 79.8 MB | 52% | CPU deployment, edge devices |
| FP16 SafeTensors | 88.1 MB | 48% | GPU inference |
| ONNX FP32 | 297 MB | – | ONNX Runtime without quantization |
Benchmark Results
Tested with 5 seconds of audio, generating up to 64 tokens:
| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---|---|---|---|
| PyTorch FP16 (GPU) | 47.7 ms | 0.0095 | 1.71x |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |
Note: The ONNX benchmarks include session-creation overhead and reflect a single test run. For production deployments that reuse a session on real audio, ONNX Runtime typically delivers better throughput, especially in long-running services. The Moonshine team reports 50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
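The RTF column is simply average latency divided by the audio duration (5 s here). A minimal sketch of how such numbers can be reproduced, assuming run_asr wraps any of the pipelines from the Usage section in a zero-argument callable (the helper name and iteration counts are illustrative):
import time

def measure_rtf(run_asr, audio_seconds=5.0, warmup=2, iters=10):
    # run_asr: zero-argument callable that transcribes one fixed audio chunk
    for _ in range(warmup):          # discard warm-up runs (session/allocator overhead)
        run_asr()
    start = time.perf_counter()
    for _ in range(iters):
        run_asr()
    latency = (time.perf_counter() - start) / iters
    return latency, latency / audio_seconds   # (avg latency in s, real-time factor)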
File Structure
├── onnx_int8/                             # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx            # 9.8 MB
│   ├── decoder_model_int8.onnx            # 36 MB
│   ├── decoder_with_past_model_int8.onnx  # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                  # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                  # FP16 SafeTensors (for GPU)
    ├── model.safetensors                  # 88.1 MB
    ├── config.json
    └── tokenizer.json
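Each variant lives in its own subfolder, so you can fetch only the one you need. A minimal sketch using huggingface_hub (the allow_patterns filter is the only assumption beyond the layout above):
from huggingface_hub import snapshot_download

# Download only the INT8 ONNX variant
local_dir = snapshot_download(
    "felixem/moonshine-streaming-tiny-optimized",
    allow_patterns=["onnx_int8/*"],
)
print(local_dir)  # pass f"{local_dir}/onnx_int8" as MODEL_DIR below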
Usage
ONNX INT8 Inference (CPU – Recommended for Edge)
pip install onnxruntime numpy tokenizers
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer
MODEL_DIR = "onnx_int8" # or download from this repo
BOS, EOS = 1, 2
# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")
# Prepare audio (16kHz float32, padded to multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32) # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))
audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)
# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})
# First decode step
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))
# Build KV cache mapping
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}
kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor
# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]
for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    # Overwrite cached entries with the new presents; entries the with-past model
    # does not re-emit (e.g. cross-attention) are carried over from the previous step.
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor
text = tokenizer.decode(tokens)
print(text)
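The snippet above feeds random noise. For real speech, a minimal sketch that loads and resamples any audio file to 16 kHz mono (librosa is an assumption here; any resampling loader works):
import numpy as np
import librosa

# Load as 16 kHz mono float32, then reuse the padding logic above
audio, sr = librosa.load("speech.wav", sr=16000, mono=True)
audio = audio.astype(np.float32)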
FP16 PyTorch Inference (GPU)
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch
model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")
processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)
# Process audio (audio_array: 16 kHz mono float32 NumPy array)
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
inputs = {k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
          for k, v in inputs.items()}
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
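One way to produce an equivalent FP16 SafeTensors folder yourself, a sketch and not necessarily how the published files were generated, is to load the base checkpoint in half precision and save it:
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny", torch_dtype=torch.float16
)
model.save_pretrained("fp16", safe_serialization=True)   # writes model.safetensors
AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny").save_pretrained("fp16")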
PyTorch Dynamic INT8 (CPU – Quick Setup)
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()
# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
# audio_array: 16 kHz mono float32 NumPy array
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
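Dynamic quantization converts nn.Linear weights in memory at load time and writes nothing to disk by default. To see the size effect, you can serialize both state dicts to an in-memory buffer; a sketch that continues the snippet above (fp32_model is loaded fresh just for comparison):
import io

def serialized_mb(m):
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)   # serialize weights to memory
    return buf.getbuffer().nbytes / 1e6

fp32_model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()
print(f"FP32: {serialized_mb(fp32_model):.1f} MB  INT8-dynamic: {serialized_mb(model):.1f} MB")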
ONNX Export Details
- Encoder: exported with torch.onnx.export(dynamo=True) to handle the vmap-based sliding-window attention masking
- Decoder: separate models for the first step (no KV cache) and subsequent autoregressive steps (with KV cache)
- Quantization: onnxruntime.quantization.quantize_dynamic with symmetric INT8, per-channel weights, reduce_range=True
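A sketch of a quantization call matching those settings (the input/output paths are placeholders; repeat for the encoder, decoder, and decoder_with_past graphs):
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="onnx/encoder_model.onnx",
    model_output="onnx_int8/encoder_model_int8.onnx",
    weight_type=QuantType.QInt8,   # symmetric signed INT8 weights
    per_channel=True,
    reduce_range=True,
)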
KV Cache Structure
Each decoder layer produces 4 KV tensors:
- present_{layer}_self_key / present_{layer}_self_value: self-attention cache, shape [B, 8, S, 40]
- present_{layer}_cross_key / present_{layer}_cross_value: cross-attention cache, shape [B, 8, T, 40]
For decoder_with_past_model, feed these back as past_{layer}_* inputs.
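To confirm the exact names and shapes in your own export, you can list them from the sessions created in the ONNX usage snippet above (a short sketch reusing decoder and decoder_past):
for out in decoder.get_outputs():
    print("output:", out.name, out.shape)   # e.g. present_0_self_key
for inp in decoder_past.get_inputs():
    print("input: ", inp.name, inp.shape)   # e.g. past_0_self_key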
Quantization Impact
Based on the Edge-ASR paper (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:
| Config | Avg WER | vs FP32 |
|---|---|---|
| FP32 baseline | 12.72% | – |
| W8-A8 (INT8) | 12.81% | +0.09% |
| W4-A16 (SpQR) | 13.61% | +0.89% |
INT8 is the sweet spot for Moonshine Tiny: virtually no accuracy loss with ~50% model size reduction.
Limitations
- English only
- Optimized for short utterances (streaming chunks of 1-5 seconds)
- The FP32 ONNX models use external data files (.onnx.data)
- The decoder uses autoregressive generation, so output latency scales with transcript length
Citation
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
License
MIT (same as base model)