Moonshine Streaming Tiny – Optimized

Optimized variants of UsefulSensors/moonshine-streaming-tiny, a 34M-parameter streaming ASR model designed for real-time, on-device English speech recognition.

Based on: Moonshine v2: Ergodic Streaming Encoder ASR

Optimized Variants

| Variant          | Total Size | Size Reduction | Best For                          |
|------------------|------------|----------------|-----------------------------------|
| Original FP32    | 168.1 MB   | –              | Reference                         |
| ONNX INT8        | 79.8 MB    | 52%            | CPU deployment, edge devices      |
| FP16 SafeTensors | 88.1 MB    | 48%            | GPU inference                     |
| ONNX FP32        | 297 MB     | –              | ONNX Runtime without quantization |
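
If you only need one variant, you can fetch a single subfolder instead of the whole repo; a minimal sketch using huggingface_hub (the allow_patterns filter shown here selects only the INT8 files):

from huggingface_hub import snapshot_download

# Downloads just the onnx_int8/ subfolder from this repo
local_dir = snapshot_download(
    "felixem/moonshine-streaming-tiny-optimized",
    allow_patterns=["onnx_int8/*"],
)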

Benchmark Results

Tested with 5 seconds of audio, generating up to 64 tokens:

| Variant            | Avg Latency | RTF    | Speedup vs FP32 CPU |
|--------------------|-------------|--------|---------------------|
| PyTorch FP16 (GPU) | 47.7 ms     | 0.0095 | 1.71x               |
| PyTorch INT8 (CPU) | 78.6 ms     | 0.0157 | 1.03x               |
| PyTorch FP32 (CPU) | 81.3 ms     | 0.0163 | 1.00x (baseline)    |
| ONNX FP32 (CPU)    | 115.5 ms    | 0.0231 | 0.70x               |
| ONNX INT8 (CPU)    | 153.2 ms    | 0.0306 | 0.53x               |

Note: The ONNX numbers include session-creation overhead and come from a single run. For production deployment with session reuse on real audio, ONNX Runtime typically provides better throughput, especially for long-running services. The Moonshine team reports 50 ms response latency on Apple M3 with their C++ ONNX Runtime backend.
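
To measure steady-state latency rather than first-call overhead, reuse a single session and discard a warm-up run; a minimal sketch (encoder only, synthetic 5 s input):

import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx_int8/encoder_model_int8.onnx",
                            providers=["CPUExecutionProvider"])
audio = np.random.randn(1, 16000 * 5).astype(np.float32)  # 80000 samples, already a multiple of 80
feed = {"input_values": audio,
        "attention_mask": np.ones_like(audio, dtype=np.int64)}

sess.run(None, feed)  # warm-up: the first call pays one-time graph-optimization cost
t0 = time.perf_counter()
for _ in range(20):
    sess.run(None, feed)
print(f"steady-state: {(time.perf_counter() - t0) / 20 * 1e3:.1f} ms/run")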

File Structure

├── onnx_int8/                          # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx         # 9.8 MB
│   ├── decoder_model_int8.onnx         # 36 MB
│   ├── decoder_with_past_model_int8.onnx # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                               # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                               # FP16 SafeTensors (for GPU)
    ├── model.safetensors               # 88.1 MB
    ├── config.json
    └── tokenizer.json

Usage

ONNX INT8 Inference (CPU – Recommended for Edge)

pip install onnxruntime numpy tokenizers

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "onnx_int8"  # or download from this repo
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16 kHz float32, padded to a multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio (loading is sketched after this block)
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))

audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# First decode step
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Map the first step's present_* outputs onto the with-past decoder's past_* inputs
dec_out_names = [o.name for o in decoder.get_outputs()][1:]  # skip logits (output 0)
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}

kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor

# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]

for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    
    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor

text = tokenizer.decode(tokens)
print(text)
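
To feed real speech instead of noise, load a mono 16 kHz waveform; a sketch assuming soundfile is installed and a hypothetical speech.wav:

import soundfile as sf

audio, sr = sf.read("speech.wav", dtype="float32")  # hypothetical path
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix stereo to mono
assert sr == 16000, "resample to 16 kHz first (e.g. with librosa.resample)"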

FP16 PyTorch Inference (GPU)

from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)

# Process audio (audio_array: 1-D float32 waveform sampled at 16 kHz)
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
# Move to GPU, casting only floating-point tensors to FP16 (masks stay integer)
inputs = {
    k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}

generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
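
To reproduce a latency number like the benchmark table above (up to 64 new tokens), time generate with explicit synchronization; a rough sketch continuing the snippet above:

import time

torch.cuda.synchronize()        # ensure prior GPU work has finished
t0 = time.perf_counter()
generated_ids = model.generate(**inputs, max_new_tokens=64)
torch.cuda.synchronize()        # wait for generate to actually complete
print(f"latency: {(time.perf_counter() - t0) * 1e3:.1f} ms")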

PyTorch Dynamic INT8 (CPU – Quick Setup)

import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()

# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)  # audio_array: 16 kHz float32 waveform
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
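
To sanity-check the size reduction from dynamic quantization, serialize the state dict to memory and compare before and after; a small sketch continuing the snippet above (state_dict_mb is a hypothetical helper):

import io

def state_dict_mb(m):
    """Serialized state-dict size in MB (weights only)."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.tell() / 1e6

print(f"quantized: {state_dict_mb(model):.1f} MB")  # compare against the FP32 model before quantize_dynamic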

ONNX Export Details

  • Encoder: Exported with torch.onnx.export(dynamo=True) to handle vmap-based sliding-window attention masking
  • Decoder: Separate models for first step (no KV cache) and autoregressive steps (with KV cache)
  • Quantization: onnxruntime.quantization.quantize_dynamic with symmetric INT8, per-channel, reduce_range=True

KV Cache Structure

Each decoder layer produces 4 KV tensors:

  • present_{layer}_self_key / present_{layer}_self_value: Self-attention cache [B, 8, S, 40]
  • present_{layer}_cross_key / present_{layer}_cross_value: Cross-attention cache [B, 8, T, 40]

For decoder_with_past_model, feed these back as past_{layer}_* inputs.
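
To confirm the exact tensor names on your copy of the models, list the session inputs and outputs; a quick sketch:

import onnxruntime as ort

decoder = ort.InferenceSession("onnx_int8/decoder_model_int8.onnx")
decoder_past = ort.InferenceSession("onnx_int8/decoder_with_past_model_int8.onnx")

print([o.name for o in decoder.get_outputs()])     # logits, then present_{layer}_* tensors
print([i.name for i in decoder_past.get_inputs()])  # ids, encoder states, then past_{layer}_* tensors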

Quantization Impact

Based on the Edge-ASR paper (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:

| Config         | Avg WER | vs FP32 |
|----------------|---------|---------|
| FP32 baseline  | 12.72%  | –       |
| W8-A8 (INT8)   | 12.81%  | +0.09%  |
| W4-A16 (SpQR)  | 13.61%  | +0.89%  |

INT8 is the sweet spot for Moonshine Tiny: virtually no accuracy loss with a ~50% model size reduction.

Limitations

  • English only
  • Optimized for short utterances (streaming chunks of 1-5 seconds)
  • ONNX models use external data files (.onnx.data) for FP32 variant
  • The decoder uses autoregressive generation, so output latency scales with transcript length

Citation

@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}

License

MIT (same as base model)
