Moonshine Streaming Tiny β Optimized
Optimized variants of UsefulSensors/moonshine-streaming-tiny, a 34M parameter streaming ASR model designed for real-time, on-device English speech recognition.
Based on: Moonshine v2: Ergodic Streaming Encoder ASR
Optimized Variants
| Variant | Total Size | Size Reduction | Best For |
|---|---|---|---|
| Original FP32 | 168.1 MB | – | Reference |
| ONNX INT8 | 79.8 MB | 52% | CPU deployment, edge devices |
| FP16 SafeTensors | 88.1 MB | 48% | GPU inference |
| ONNX FP32 | 297 MB | – | ONNX Runtime without quantization |
Benchmark Results
Tested with 5 seconds of audio, generating up to 64 tokens:
| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---|---|---|---|
| PyTorch FP16 (GPU) | 47.7 ms | 0.0095 | 1.71x |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |
Note: The ONNX benchmarks include session-creation overhead and reflect a single test run. For production deployments that reuse a session on real audio, ONNX Runtime typically delivers better throughput, especially in long-running services. The Moonshine team reports 50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
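The RTF column is simply average latency divided by the audio duration (5 s here). A minimal sketch of how such numbers can be reproduced, assuming run_asr wraps any of the pipelines from the Usage section in a zero-argument callable (the helper name and iteration counts are illustrative):
import time

def measure_rtf(run_asr, audio_seconds=5.0, warmup=2, iters=10):
    # run_asr: zero-argument callable that transcribes one fixed audio chunk
    for _ in range(warmup):          # discard warm-up runs (session/allocator overhead)
        run_asr()
    start = time.perf_counter()
    for _ in range(iters):
        run_asr()
    latency = (time.perf_counter() - start) / iters
    return latency, latency / audio_seconds   # (avg latency in s, real-time factor)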
File Structure
├── onnx_int8/                             # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx            # 9.8 MB
│   ├── decoder_model_int8.onnx            # 36 MB
│   ├── decoder_with_past_model_int8.onnx  # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                  # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                  # FP16 SafeTensors (for GPU)
    ├── model.safetensors                  # 88.1 MB
    ├── config.json
    └── tokenizer.json
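Each variant lives in its own subfolder, so you can fetch only the one you need. A minimal sketch using huggingface_hub (the allow_patterns filter is the only assumption beyond the layout above):
from huggingface_hub import snapshot_download

# Download only the INT8 ONNX variant
local_dir = snapshot_download(
    "felixem/moonshine-streaming-tiny-optimized",
    allow_patterns=["onnx_int8/*"],
)
print(local_dir)  # pass f"{local_dir}/onnx_int8" as MODEL_DIR below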
Usage
ONNX INT8 Inference (CPU – Recommended for Edge)
pip install onnxruntime numpy tokenizers
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer
MODEL_DIR = "onnx_int8" # or download from this repo
BOS, EOS = 1, 2
# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")
# Prepare audio (16kHz float32, padded to multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32) # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))
audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)
# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})
# First decode step
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))
# Build KV cache mapping
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}
kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor
# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]
for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    # Overwrite cached entries with the new presents; entries the with-past model
    # does not re-emit (e.g. cross-attention) are carried over from the previous step.
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor
text = tokenizer.decode(tokens)
print(text)
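The snippet above feeds random noise. For real speech, a minimal sketch that loads and resamples any audio file to 16 kHz mono (librosa is an assumption here; any resampling loader works):
import numpy as np
import librosa

# Load as 16 kHz mono float32, then reuse the padding logic above
audio, sr = librosa.load("speech.wav", sr=16000, mono=True)
audio = audio.astype(np.float32)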
FP16 PyTorch Inference (GPU)
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch
model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")
processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)
# Process audio (audio_array: 16 kHz mono float32 NumPy array)
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
inputs = {k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
          for k, v in inputs.items()}
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
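One way to produce an equivalent FP16 SafeTensors folder yourself, a sketch and not necessarily how the published files were generated, is to load the base checkpoint in half precision and save it:
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny", torch_dtype=torch.float16
)
model.save_pretrained("fp16", safe_serialization=True)   # writes model.safetensors
AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny").save_pretrained("fp16")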
PyTorch Dynamic INT8 (CPU – Quick Setup)
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()
# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
# audio_array: 16 kHz mono float32 NumPy array
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
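Dynamic quantization converts nn.Linear weights in memory at load time and writes nothing to disk by default. To see the size effect, you can serialize both state dicts to an in-memory buffer; a sketch that continues the snippet above (fp32_model is loaded fresh just for comparison):
import io

def serialized_mb(m):
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)   # serialize weights to memory
    return buf.getbuffer().nbytes / 1e6

fp32_model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()
print(f"FP32: {serialized_mb(fp32_model):.1f} MB  INT8-dynamic: {serialized_mb(model):.1f} MB")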
ONNX Export Details
- Encoder: exported with torch.onnx.export(dynamo=True) to handle the vmap-based sliding-window attention masking
- Decoder: separate models for the first step (no KV cache) and subsequent autoregressive steps (with KV cache)
- Quantization: onnxruntime.quantization.quantize_dynamic with symmetric INT8, per-channel weights, reduce_range=True
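A sketch of a quantization call matching those settings (the input/output paths are placeholders; repeat for the encoder, decoder, and decoder_with_past graphs):
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="onnx/encoder_model.onnx",
    model_output="onnx_int8/encoder_model_int8.onnx",
    weight_type=QuantType.QInt8,   # symmetric signed INT8 weights
    per_channel=True,
    reduce_range=True,
)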
KV Cache Structure
Each decoder layer produces 4 KV tensors:
- present_{layer}_self_key / present_{layer}_self_value: self-attention cache, shape [B, 8, S, 40]
- present_{layer}_cross_key / present_{layer}_cross_value: cross-attention cache, shape [B, 8, T, 40]
For decoder_with_past_model, feed these back as past_{layer}_* inputs.
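To confirm the exact names and shapes in your own export, you can list them from the sessions created in the ONNX usage snippet above (a short sketch reusing decoder and decoder_past):
for out in decoder.get_outputs():
    print("output:", out.name, out.shape)   # e.g. present_0_self_key
for inp in decoder_past.get_inputs():
    print("input: ", inp.name, inp.shape)   # e.g. past_0_self_key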
Quantization Impact
Based on the Edge-ASR paper (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:
| Config | Avg WER | vs FP32 |
|---|---|---|
| FP32 baseline | 12.72% | – |
| W8-A8 (INT8) | 12.81% | +0.09% |
| W4-A16 (SpQR) | 13.61% | +0.89% |
INT8 is the sweet spot for Moonshine Tiny: virtually no accuracy loss with ~50% model size reduction.
Limitations
- English only
- Optimized for short utterances (streaming chunks of 1-5 seconds)
- The FP32 ONNX models use external data files (.onnx.data)
- The decoder uses autoregressive generation, so output latency scales with transcript length
Citation
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
License
MIT (same as base model)