# Moonshine Streaming Medium - ONNX INT8

Dynamic INT8 quantized ONNX export of UsefulSensors/moonshine-streaming-medium for fast CPU inference with ONNX Runtime.

## Model Overview

Moonshine v2 Streaming is an encoder-decoder ASR model designed for real-time streaming speech recognition. The encoder uses causal sliding-window attention (no positional embeddings), enabling it to process audio incrementally. The decoder uses RoPE-based causal attention with cross-attention to encoder states.

### Details

| | |
|---|---|
| Architecture | Encoder-decoder Transformer (streaming) |
| Parameters | ~330M (FP32) |
| Encoder | 14 layers, 768-dim, 10 heads, sliding-window attention |
| Decoder | 14 layers, 640-dim, 10 heads, RoPE + cross-attention |
| Vocabulary | 32,768 BPE tokens |
| Audio input | 16 kHz mono, 5 ms frames (80 samples) |
| Quantization | Dynamic INT8 (weight-only, symmetric) |
| Latency | Real-time capable on modern CPUs |
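A quick sanity check on how the timing figures above fit together (plain arithmetic, no model required; the 50 Hz encoder output rate is described under Architecture Details):

```python
SAMPLE_RATE = 16_000   # Hz, mono audio input
FRAME_SAMPLES = 80     # samples per input frame

# 80 samples at 16 kHz is 5 ms per input frame.
frame_ms = FRAME_SAMPLES / SAMPLE_RATE * 1000

# The encoder emits hidden states at 50 Hz, i.e. one state per 20 ms
# of audio, which is 320 input samples or 4 input frames.
ENCODER_RATE_HZ = 50
samples_per_state = SAMPLE_RATE // ENCODER_RATE_HZ
frames_per_state = samples_per_state // FRAME_SAMPLES

print(frame_ms, samples_per_state, frames_per_state)  # 5.0 320 4
```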

## Files

| File | Size | Description |
|---|---|---|
| `encoder_model_int8.onnx` | 135 MB | Audio → 768-dim encoder hidden states |
| `decoder_model_int8.onnx` | 225 MB | First decode step (initializes KV cache) |
| `decoder_with_past_model_int8.onnx` | 202 MB | Subsequent decode steps (streaming KV cache) |
| `config.json` | — | Model architecture configuration |
| `tokenizer.json` | 3.6 MB | BPE tokenizer (32,768 vocab) |
| `processor_config.json` | — | Audio processor settings |
| `tokenizer_config.json` | — | Tokenizer metadata |
| **Total** | ~562 MB | 64% smaller than FP32 (1.57 GB) |

## Quick Start

### Installation

```bash
pip install onnxruntime numpy tokenizers
```

### Basic Inference

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "moonshine-streaming-medium-onnx"

# Load the three ONNX sessions
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Encode audio (16 kHz mono float32; random noise here as a stand-in)
audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second
mask = np.ones((1, 16000), dtype=np.int64)
enc_out = encoder.run(None, {"input_values": audio, "attention_mask": mask})[0]

# Map each step's "present_*" KV outputs onto the "past_*" inputs of the next step
past_input_names = {
    i.name for i in decoder_past.get_inputs()
    if i.name not in ("decoder_input_ids", "encoder_hidden_states")
}

def build_kv(output_names, outputs):
    kv = {}
    for name, tensor in zip(output_names, outputs):
        past_name = name.replace("present_", "past_", 1)
        if past_name in past_input_names:
            kv[past_name] = tensor
        elif name + "_orig" in past_input_names:
            kv[name + "_orig"] = tensor
        else:
            kv[name] = tensor
    return kv

# First decode step (BOS token = 1) produces logits and initializes the KV cache
bos = np.array([[1]], dtype=np.int64)
first_out = decoder.run(None, {"decoder_input_ids": bos, "encoder_hidden_states": enc_out})
token_id = int(np.argmax(first_out[0][0, -1, :]))
kv = build_kv([o.name for o in decoder.get_outputs()][1:], first_out[1:])

# Autoregressive greedy decoding with the cached decoder
EOS = 2
tokens = [token_id]
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
while token_id != EOS and len(tokens) < 256:
    inputs = {
        "decoder_input_ids": np.array([[token_id]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv)
    past_out = decoder_past.run(None, inputs)
    token_id = int(np.argmax(past_out[0][0, -1, :]))
    tokens.append(token_id)
    kv = build_kv(past_out_names, past_out[1:])

text = tokenizer.decode([t for t in tokens if t != EOS])
print(text)
```
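The snippet above feeds random noise into the encoder. Real input must be 16 kHz mono float32, conventionally scaled to [-1, 1]. A minimal loader using only the standard-library `wave` module (it assumes the file is already 16-bit PCM, mono, 16 kHz; resampling is out of scope here):

```python
import wave

import numpy as np

def load_wav_16k_mono(path):
    """Read a 16-bit PCM mono 16 kHz WAV into a (1, n_samples) float32 array."""
    with wave.open(path, "rb") as f:
        assert f.getframerate() == 16000, "expected 16 kHz audio"
        assert f.getnchannels() == 1, "expected mono audio"
        assert f.getsampwidth() == 2, "expected 16-bit PCM"
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    # Scale the int16 range to [-1, 1] and add a batch dimension.
    return (pcm.astype(np.float32) / 32768.0)[None, :]
```

`load_wav_16k_mono("speech.wav")` then drops straight into the `encoder.run(...)` call in place of the random array.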

## Real-Time Microphone Streaming

The companion CLI tool provides real-time streaming ASR with voice activity detection:

```bash
pip install sounddevice
python inference_moonshine.py --model-dir moonshine_streaming_medium
```

## Architecture Details

### Three-Model Design

The model is split into three ONNX graphs for efficient streaming:

1. **Encoder.** Processes raw audio with causal stride-2 convolutions and sliding-window attention. Outputs 768-dim hidden states at 50 Hz (one frame per 20 ms of audio).
2. **Decoder (first step).** Takes the BOS token plus encoder states, produces the first token's logits, and initializes 56 KV-cache tensors (14 layers × 2 attention types × key and value).
3. **Decoder with past.** Takes the previous token, encoder states, and KV cache; produces next-token logits and an updated cache. The self-attention cache grows each step; the cross-attention cache stays constant.
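The cache-tensor count follows directly from the architecture numbers, and the per-tensor shapes are easy to reason about if one assumes the usual `(batch, heads, seq, head_dim)` cache layout (a common convention, not something the export guarantees):

```python
LAYERS = 14      # decoder layers
ATTN_TYPES = 2   # self-attention and cross-attention per layer
KV = 2           # one key tensor and one value tensor each

n_cache_tensors = LAYERS * ATTN_TYPES * KV
print(n_cache_tensors)  # 56

HEADS = 10
DECODER_HIDDEN = 640
head_dim = DECODER_HIDDEN // HEADS  # 64
# Self-attention cache after t decoded tokens: (1, HEADS, t, head_dim) - grows each step.
# Cross-attention cache: (1, HEADS, n_encoder_frames, head_dim) - fixed per utterance.
```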

### Sliding-Window Attention

The encoder uses per-layer sliding-window sizes for streaming efficiency:

| Layers | Window | Lookahead | Purpose |
|---|---|---|---|
| 0–1 | 16 frames | 4 frames | Initial context with lookahead |
| 2–11 | 16 frames | 0 frames | Causal processing (no future) |
| 12–13 | 16 frames | 4 frames | Final refinement with lookahead |
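To picture the window semantics, here is a numpy sketch of one plausible mask convention (an illustration, not the exported graph's exact implementation): query frame `i` attends to key frames `j` with `i - window < j <= i + lookahead`.

```python
import numpy as np

def sliding_window_mask(n_frames, window=16, lookahead=0):
    """Boolean mask: True where query frame i may attend to key frame j."""
    i = np.arange(n_frames)[:, None]
    j = np.arange(n_frames)[None, :]
    return (j > i - window) & (j <= i + lookahead)

# Small example: a window of 3 frames with one frame of lookahead.
m = sliding_window_mask(6, window=3, lookahead=1)
```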

## Quantization Details

- **Method:** dynamic INT8 (weight-only)
- **Target ops:** MatMul, Gemm (transformer compute)
- **Weights:** symmetric INT8
- **Activations:** remain FP32 at runtime
- **Audio frontend:** Conv/ConvTranspose kept at full precision
- **Accuracy impact:** negligible (<0.0001 max absolute encoder difference vs. FP32)
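"Symmetric weight-only INT8" means each weight tensor is stored as `q = round(w / s)` with a zero-point of 0 and a scale `s = max|w| / 127`, then dequantized to `s * q` when the op runs, while activations stay FP32. A numpy sketch of the round trip (per-tensor scale for simplicity; real exports may also quantize per-channel):

```python
import numpy as np

def quantize_symmetric_int8(w):
    """Per-tensor symmetric INT8: zero-point 0, scale chosen so max|w| maps to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)
q, scale = quantize_symmetric_int8(w)
w_hat = q.astype(np.float32) * scale  # what the INT8 MatMul effectively computes with

max_err = np.abs(w - w_hat).max()  # rounding error, at most scale / 2
```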

## Comparison with Small

| | Small | Medium |
|---|---|---|
| Encoder layers | 10 | 14 |
| Encoder hidden | 620 | 768 |
| Decoder layers | 10 | 14 |
| Decoder hidden | 512 | 640 |
| Attention heads | 8 | 10 |
| INT8 size | ~358 MB | ~562 MB |
| KV tensors | 40 | 56 |

## Execution Providers

Works with any ONNX Runtime execution provider:

| Provider | Platform | Notes |
|---|---|---|
| CPUExecutionProvider | All | Default, always available |
| CoreMLExecutionProvider | macOS | Hardware-accelerated on Apple Silicon |
| CUDAExecutionProvider | Linux/Windows | NVIDIA GPU |
| DirectMLExecutionProvider | Windows | DirectX 12 GPU |

## Export Reproduction

To reproduce this export from the original safetensors:

```bash
pip install "transformers>=5.2.0" "huggingface_hub>=0.23" torch onnx onnxruntime
python export_moonshine_streaming_medium.py
```

Export script: onnx-creator

## License

This model inherits the MIT License from the original Moonshine model by Useful Sensors.

## Citation

```bibtex
@article{jeffries2024moonshine,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Jeffries, Nat and King, Evan and Kudlur, Manjunath and Nicholson, Guy and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2410.15608},
  year={2024}
}
```