Moonshine Tiny ONNX (Mobile)

Mobile-compatible Moonshine Tiny for on-device speech recognition on iOS and Android.

This model re-exports the encoder from onnx-community/moonshine-tiny-ONNX using dynamic quantization to eliminate ConvInteger operators that are blocked on ONNX Runtime Mobile. All other files (decoders, tokenizer, config) are unchanged from the upstream repository.

Why This Model Exists

The upstream onnx-community/moonshine-tiny-ONNX int8 encoder uses Optimum's default static quantization, which produces ConvInteger operators (ONNX opset 10). These operators are not registered in ORT Mobile on iOS or Android, causing a crash at session creation:

```
onnxruntime::Model::Model: ... ConvInteger is not a registered function/op
```

This repository replaces the encoder with a dynamically quantized version that uses MatMulInteger and DynamicQuantizeLinear, both fully supported on ORT Mobile, while preserving near-lossless accuracy.

Accuracy

Encoder output comparison on 200 LibriSpeech dev-clean-2 utterances:

| Variant | Cosine Similarity vs FP32 | Encoder Size | ConvInteger Nodes | Mobile Compatible |
|---|---|---|---|---|
| FP32 (baseline) | 1.0000 | 29 MB | 0 | Yes |
| Dynamic INT8 (this repo) | 0.9986 | 12 MB | 0 | Yes |
| Static INT8 (upstream) | 0.6340 | 7.6 MB | 3 | No |

Dynamic quantization computes activation ranges on-the-fly via DynamicQuantizeLinear, making it robust to variable-length speech inputs. Static quantization fixes activation ranges at export time using calibration data, which generalizes poorly to utterances outside the calibration distribution.
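The on-the-fly range computation can be illustrated with a NumPy sketch of what DynamicQuantizeLinear does per the ONNX operator spec (uint8 output, range forced to include zero). This is an illustration of the math, not the ORT kernel; the input arrays are made up.

```python
import numpy as np

def dynamic_quantize(x):
    """NumPy sketch of ONNX DynamicQuantizeLinear: uint8 range from this input."""
    rmin = min(float(x.min()), 0.0)  # range must include zero
    rmax = max(float(x.max()), 0.0)
    scale = (rmax - rmin) / 255.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for all-zero input
    zero_point = int(np.clip(round(-rmin / scale), 0, 255))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

quiet = np.array([-0.1, 0.0, 0.1], dtype=np.float32)
loud = np.array([-4.0, 0.0, 4.0], dtype=np.float32)
# The scale adapts to each call, so neither signal is clipped:
print(dynamic_quantize(quiet)[1], dynamic_quantize(loud)[1])
```

Because the scale is recomputed per call, a quiet utterance and a loud one each use the full uint8 range, which is exactly what a fixed calibration-time scale cannot do.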

Supported Platforms

| Platform | Minimum Version | Execution Provider |
|---|---|---|
| iOS | 17.0+ | CPU, CoreML |
| Android | API 26+ | CPU, XNNPACK |
| macOS | 14.0+ | CPU, CoreML |

Usage

```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# Download the int8 encoder
encoder_path = hf_hub_download(
    "bitsydarel/moonshine-tiny-onnx-mobile",
    "onnx/encoder_model_int8.onnx"
)

# Create session
session = ort.InferenceSession(
    encoder_path,
    providers=["CPUExecutionProvider"]
)

# Verify inputs
for inp in session.get_inputs():
    print(f"{inp.name}: {inp.shape} ({inp.type})")
```

Model Files

This repository contains two complete model configurations, FP32 and INT8, plus shared tokenizer and config files. All decoders are merged variants with KV-cache support via the use_cache_branch input. Basic (non-cached) decoders are not included.

Model Configurations

| Configuration | Encoder | Decoder | Total Size | Recommended EP | Use Case |
|---|---|---|---|---|---|
| FP32 | encoder_model.onnx | decoder_model_merged.onnx | ~104 MB | CPU or XNNPACK | Highest accuracy, development |
| INT8 | encoder_model_int8.onnx | decoder_model_merged_int8.onnx | ~31 MB | CPU | Production mobile (3.4x smaller) |

ONNX Files

| File | Size | Format | KV-Cache | Description |
|---|---|---|---|---|
| onnx/encoder_model.onnx | 29 MB | FP32 | N/A | Full-precision encoder |
| onnx/encoder_model_int8.onnx | 12 MB | Dynamic INT8 | N/A | Re-exported encoder (MatMulInteger + DynamicQuantizeLinear) |
| onnx/decoder_model_merged.onnx | 75 MB | FP32 | Yes | Merged decoder; use_cache_branch=false on step 0, true on step 1+ |
| onnx/decoder_model_merged_int8.onnx | 19 MB | INT8 | Yes | Quantized merged decoder; same KV-cache behavior as FP32 |

Shared Files

| File | Size | Description |
|---|---|---|
| config.json | 1 KB | Model architecture configuration (hidden size, attention heads, vocab size) |
| tokenizer.json | 3.6 MB | Tokenizer vocabulary (shared across all configurations) |
| tokenizer_config.json | 133 KB | Tokenizer settings (BOS/EOS token IDs, special tokens) |

KV-Cache Decoder Usage

Both merged decoders use the use_cache_branch boolean input to control KV-cache behavior:

  • Step 0 (use_cache_branch=false): Computes fresh key-value pairs from encoder hidden states. All past_key_values inputs must still be provided (use zero sequence length).
  • Step 1+ (use_cache_branch=true): Reuses cached key-value pairs for faster autoregressive decoding. Decoder KV-cache grows each step; encoder KV-cache is frozen after step 0.

Important: The merged decoder outputs invalid encoder cache when use_cache_branch=true. Always preserve the encoder cache from step 0 and ignore encoder cache outputs on subsequent steps.
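The rules above can be sketched as a decode loop. Here `run_decoder` is a pure-Python stand-in for `session.run` on the merged decoder; the shapes, token values, and return layout are illustrative, not the model's real I/O. Only the cache bookkeeping is the point.

```python
import numpy as np

def run_decoder(past, use_cache_branch):
    """Stand-in for session.run on decoder_model_merged.onnx (illustrative shapes).
    Returns (logits, new_decoder_cache, encoder_cache_output)."""
    step = past["decoder"].shape[0]
    logits = np.zeros(16, dtype=np.float32)
    logits[step % 16] = 1.0  # fake "most likely token" per step
    new_decoder_cache = np.concatenate([past["decoder"], np.ones((1, 4))])
    # Mirror the real model: encoder cache output is invalid when reusing cache.
    encoder_cache = np.full((2, 4), np.nan) if use_cache_branch else np.ones((2, 4))
    return logits, new_decoder_cache, encoder_cache

past = {"decoder": np.zeros((0, 4)), "encoder": None}  # zero-length cache on step 0
tokens = [1]  # BOS (placeholder id)
for step in range(3):
    use_cache = step > 0
    logits, past["decoder"], enc_cache = run_decoder(past, use_cache)
    if not use_cache:
        past["encoder"] = enc_cache  # freeze encoder cache after step 0
    # on later steps enc_cache is invalid and must be ignored
    tokens.append(int(np.argmax(logits)))
```

The decoder cache grows by one step per iteration while the encoder cache is written exactly once, which is the behavior the merged graph expects.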

Technical Details

Why Dynamic over Static Quantization?

Speech-to-text models process variable-length audio inputs. Static quantization (QDQ format) fixes activation ranges at export time using calibration data, which makes assumptions about the distribution of input values. When inference-time inputs differ from the calibration data (common with speech of varying length, volume, and content), activation clipping causes significant accuracy loss.

Dynamic quantization computes activation ranges on-the-fly for each inference call using DynamicQuantizeLinear. This adds minimal latency overhead (< 1ms per encoder pass on Apple A17) while adapting to every input. For the Moonshine encoder, this preserves 0.9986 cosine similarity vs FP32, compared to 0.634 with static quantization.
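For reference, the similarity metric used throughout this card is plain cosine similarity over flattened encoder outputs. A minimal version looks like the following; the tensors here are synthetic stand-ins for real encoder runs, and the shape is illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened activation tensors."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for FP32 vs INT8 encoder outputs (shape is made up):
rng = np.random.default_rng(0)
fp32_out = rng.standard_normal((1, 100, 288))
int8_out = fp32_out + 0.01 * rng.standard_normal((1, 100, 288))  # small quant noise

print(round(cosine_similarity(fp32_out, int8_out), 4))
```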

Quantization Method

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder_model.onnx",
    model_output="encoder_model_int8.onnx",
    weight_type=QuantType.QInt8,
    op_types_to_quantize=["MatMul"],  # Only quantize MatMul, leave Conv in fp32
)
```

Only MatMul operations are quantized. Convolution layers remain in FP32 to avoid ConvInteger operators entirely.

Operator Compatibility

| Operator | ORT Mobile Support | Used In |
|---|---|---|
| MatMulInteger | Yes | Dynamic INT8 encoder |
| DynamicQuantizeLinear | Yes | Dynamic INT8 encoder |
| ConvInteger | No | Static INT8 encoder (upstream) |
| Conv | Yes | FP32 layers (preserved) |

Limitations

  • English only: Moonshine Tiny is trained on English speech data
  • 27M parameters: the smallest Moonshine variant; for higher accuracy, consider Moonshine Base (61M params)
  • Short-form audio: optimized for utterances up to ~30 seconds; for longer audio, segment before transcribing
  • Encoder-only re-export: only the encoder was re-exported with dynamic quantization; decoder files are unchanged from upstream
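For the long-audio case, a simple fixed-window splitter with a small overlap might look like the sketch below. The window and overlap values are illustrative defaults, not tuned recommendations, and real pipelines often prefer silence-based (VAD) segmentation over fixed windows.

```python
import numpy as np

def segment(audio, sr=16000, window_s=30.0, overlap_s=1.0):
    """Split a 1-D audio array into windows of at most window_s seconds,
    overlapping by overlap_s so words at a boundary appear in both chunks."""
    win = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + win])
        if start + win >= len(audio):
            break
    return chunks

# 70 s of silence at 16 kHz splits into three chunks of at most 30 s:
chunks = segment(np.zeros(70 * 16000, dtype=np.float32))
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 12.0]
```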

Citation

@article{jeffries2024moonshine,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Jeffries, Nat and King, Evan and Kudlur, Manjunath and Nicholson, Guy and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2410.15608},
  year={2024}
}

License

This model is released under the MIT License, consistent with the original Moonshine model license.
