# Moonshine Tiny ONNX (Mobile)

Mobile-compatible Moonshine Tiny for on-device speech recognition on iOS and Android.
This model re-exports the encoder from onnx-community/moonshine-tiny-ONNX using dynamic quantization to eliminate ConvInteger operators that are blocked on ONNX Runtime Mobile. All other files (decoders, tokenizer, config) are unchanged from the upstream repository.
## Why This Model Exists
The upstream onnx-community/moonshine-tiny-ONNX int8 encoder uses Optimum's default static quantization, which produces ConvInteger operators (ONNX opset 10). These operators are not registered in ORT Mobile on iOS or Android, causing a crash at session creation:
```
onnxruntime::Model::Model: ... ConvInteger is not a registered function/op
```
This repository replaces the encoder with a dynamically quantized version that uses `MatMulInteger` + `DynamicQuantizeLinear` (both fully supported on ORT Mobile) while preserving near-lossless accuracy.
## Accuracy
Encoder output comparison on 200 LibriSpeech dev-clean-2 utterances:
| Variant | Cosine Similarity vs FP32 | Encoder Size | ConvInteger Nodes | Mobile Compatible |
|---|---|---|---|---|
| FP32 (baseline) | 1.0000 | 29 MB | 0 | Yes |
| Dynamic INT8 (this repo) | 0.9986 | 12 MB | 0 | Yes |
| Static INT8 (upstream) | 0.6340 | 7.6 MB | 3 | No |
Dynamic quantization computes activation ranges on-the-fly via DynamicQuantizeLinear, making it robust to variable-length speech inputs. Static quantization fixes activation ranges at export time using calibration data, which generalizes poorly to utterances outside the calibration distribution.
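The cosine-similarity figures above can be reproduced with a helper along these lines; this is a generic sketch, not the exact evaluation script, and it compares flattened encoder output tensors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two encoder outputs, flattened to vectors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Run the FP32 and INT8 encoders on the same audio, compare their output tensors with this function, and average across utterances to get per-variant figures like those in the table.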
## Supported Platforms
| Platform | Minimum Version | Execution Provider |
|---|---|---|
| iOS | 17.0+ | CPU, CoreML |
| Android | API 26+ | CPU, XNNPACK |
| macOS | 14.0+ | CPU, CoreML |
## Usage

```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# Download the int8 encoder
encoder_path = hf_hub_download(
    "bitsydarel/moonshine-tiny-onnx-mobile",
    "onnx/encoder_model_int8.onnx",
)

# Create session
session = ort.InferenceSession(
    encoder_path,
    providers=["CPUExecutionProvider"],
)

# Verify inputs
for inp in session.get_inputs():
    print(f"{inp.name}: {inp.shape} ({inp.type})")
```
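A minimal end-to-end call could then look like the sketch below. The 16 kHz mono float32 audio format and the single-input assumption are based on the Moonshine export; verify both against the `get_inputs()` printout above before relying on them.

```python
import numpy as np

# 2 seconds of 16 kHz mono audio (silence here; use real speech in practice)
sample_rate = 16000
audio = np.zeros((1, 2 * sample_rate), dtype=np.float32)

# Feed the tensor under whatever name the session reports for its first input:
# input_name = session.get_inputs()[0].name
# (hidden_states,) = session.run(None, {input_name: audio})
```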
## Model Files

This repository contains two complete model configurations (FP32 and INT8) plus shared tokenizer and config files. All decoders are merged variants with KV-cache support via the `use_cache_branch` input. Basic (non-cached) decoders are not included.
### Model Configurations

| Configuration | Encoder | Decoder | Total Size | Recommended EP | Use Case |
|---|---|---|---|---|---|
| FP32 | `encoder_model.onnx` | `decoder_model_merged.onnx` | ~104 MB | CPU or XNNPACK | Highest accuracy, development |
| INT8 | `encoder_model_int8.onnx` | `decoder_model_merged_int8.onnx` | ~31 MB | CPU | Production mobile (3.4x smaller) |
### ONNX Files

| File | Size | Format | KV-Cache | Description |
|---|---|---|---|---|
| `onnx/encoder_model.onnx` | 29 MB | FP32 | N/A | Full-precision encoder |
| `onnx/encoder_model_int8.onnx` | 12 MB | Dynamic INT8 | N/A | Re-exported encoder (`MatMulInteger` + `DynamicQuantizeLinear`) |
| `onnx/decoder_model_merged.onnx` | 75 MB | FP32 | Yes | Merged decoder: `use_cache_branch=false` on step 0, `true` on step 1+ |
| `onnx/decoder_model_merged_int8.onnx` | 19 MB | INT8 | Yes | Quantized merged decoder with the same KV-cache behavior as FP32 |
### Shared Files

| File | Size | Description |
|---|---|---|
| `config.json` | 1 KB | Model architecture configuration (hidden size, attention heads, vocab size) |
| `tokenizer.json` | 3.6 MB | Tokenizer vocabulary (shared across all configurations) |
| `tokenizer_config.json` | 133 KB | Tokenizer settings (BOS/EOS token IDs, special tokens) |
## KV-Cache Decoder Usage

Both merged decoders use the `use_cache_branch` boolean input to control KV-cache behavior:

- **Step 0** (`use_cache_branch=false`): computes fresh key-value pairs from encoder hidden states. All `past_key_values` inputs must still be provided (use zero sequence length).
- **Step 1+** (`use_cache_branch=true`): reuses cached key-value pairs for faster autoregressive decoding. The decoder KV-cache grows each step; the encoder KV-cache is frozen after step 0.

**Important:** the merged decoder outputs an invalid encoder cache when `use_cache_branch=true`. Always preserve the encoder cache from step 0 and ignore encoder cache outputs on subsequent steps.
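Initializing the zero-length caches for step 0 can be sketched as follows. The layer/head counts and the `past_key_values.{layer}.{decoder|encoder}.{key|value}` input naming are assumptions based on typical Optimum merged exports; read the real values from `config.json` and `session.get_inputs()`.

```python
import numpy as np

def empty_past(num_layers: int, num_heads: int, head_dim: int, batch: int = 1) -> dict:
    """Zero-sequence-length KV-cache tensors for step 0 (use_cache_branch=False)."""
    shape = (batch, num_heads, 0, head_dim)
    past = {}
    for layer in range(num_layers):
        for branch in ("decoder", "encoder"):
            for kind in ("key", "value"):
                past[f"past_key_values.{layer}.{branch}.{kind}"] = np.zeros(shape, dtype=np.float32)
    return past

# Loop sketch:
# step 0:  run with use_cache_branch=[False] and empty_past(...); keep ALL cache outputs
# step 1+: run with use_cache_branch=[True]; update decoder caches from the outputs,
#          but keep reusing the encoder caches captured at step 0
```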
## Technical Details

### Why Dynamic over Static Quantization?

Speech-to-text models process variable-length audio inputs. Static quantization (QDQ format) fixes activation ranges at export time using calibration data, which makes assumptions about the distribution of input values. When inference-time inputs differ from calibration data (common with speech of varying length, volume, and content), activation clipping causes significant accuracy loss.

Dynamic quantization computes activation ranges on-the-fly for each inference call using `DynamicQuantizeLinear`. This adds minimal latency overhead (<1 ms per encoder pass on Apple A17) while adapting to every input. For the Moonshine encoder, this preserves 0.9986 cosine similarity vs FP32, compared to 0.6340 with static quantization.
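To make the "on-the-fly" part concrete, here is the arithmetic `DynamicQuantizeLinear` performs per the ONNX operator spec, as a plain-NumPy sketch (the actual runtime kernel is fused and vectorized):

```python
import numpy as np

def dynamic_quantize_linear(x: np.ndarray):
    """Per-call uint8 quantization: the range comes from THIS input, not from calibration."""
    rmin = min(float(x.min()), 0.0)   # range must include zero
    rmax = max(float(x.max()), 0.0)
    scale = (rmax - rmin) / 255.0 or 1.0
    zero_point = int(np.clip(round(-rmin / scale), 0, 255))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point
```

Because `scale` is recomputed for every input tensor, a quiet utterance and a loud one each get a range tailored to them, which is exactly what a fixed calibration range cannot guarantee.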
### Quantization Method

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder_model.onnx",
    model_output="encoder_model_int8.onnx",
    weight_type=QuantType.QInt8,
    op_types_to_quantize=["MatMul"],  # only quantize MatMul; leave Conv in FP32
)
```
Only MatMul operations are quantized. Convolution layers remain in FP32 to avoid ConvInteger operators entirely.
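After quantizing, it is worth verifying that no mobile-blocked operators remain in the graph. A small check over the graph's op types could look like the helper below; in practice you would pass in `[n.op_type for n in onnx.load(path).graph.node]` from the `onnx` package:

```python
MOBILE_BLOCKED_OPS = {"ConvInteger"}  # not registered in ORT Mobile builds

def find_blocked_ops(op_types) -> list:
    """Return any mobile-blocked operators present in a list of graph op types."""
    return sorted(set(op_types) & MOBILE_BLOCKED_OPS)
```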
### Operator Compatibility

| Operator | ORT Mobile Support | Used In |
|---|---|---|
| `MatMulInteger` | Yes | Dynamic INT8 encoder |
| `DynamicQuantizeLinear` | Yes | Dynamic INT8 encoder |
| `ConvInteger` | No | Static INT8 encoder (upstream) |
| `Conv` | Yes | FP32 layers (preserved) |
## Limitations

- **English only**: Moonshine Tiny is trained on English speech data
- **27M parameters**: the smallest Moonshine variant; for higher accuracy, consider Moonshine Base (61M params)
- **Short-form audio**: optimized for utterances up to ~30 seconds; for longer audio, segment before transcribing
- **Encoder-only re-export**: only the encoder was re-exported with dynamic quantization; decoder files are unchanged from upstream
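For the short-form limitation, a naive fixed-window splitter is enough to get started; real pipelines usually split on silence instead of at arbitrary boundaries. The 30-second window mirrors the limit noted above:

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sample_rate: int = 16000, max_seconds: float = 30.0):
    """Split 1-D audio into consecutive chunks of at most max_seconds (no overlap)."""
    step = int(max_seconds * sample_rate)
    return [audio[i:i + step] for i in range(0, len(audio), step)]
```

Transcribe each chunk independently and concatenate the text; splitting on detected silence avoids cutting words in half.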
## Citation

```bibtex
@article{jeffries2024moonshine,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Jeffries, Nat and King, Evan and Kudlur, Manjunath and Nicholson, Guy and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2410.15608},
  year={2024}
}
```
## License
This model is released under the MIT License, consistent with the original Moonshine model license.
## Acknowledgments
- Useful Sensors for the Moonshine model family
- Hugging Face Optimum for ONNX export tooling
- ONNX Runtime for cross-platform inference