Moonshine Tiny ONNX (Mobile)

Mobile-compatible Moonshine Tiny for on-device speech recognition on iOS and Android.

This model re-exports the encoder from onnx-community/moonshine-tiny-ONNX using dynamic quantization to eliminate ConvInteger operators that are blocked on ONNX Runtime Mobile. All other files (decoders, tokenizer, config) are unchanged from the upstream repository.

Why This Model Exists

The upstream onnx-community/moonshine-tiny-ONNX int8 encoder uses Optimum's default static quantization, which produces ConvInteger operators (ONNX opset 10). These operators are not registered in ORT Mobile on iOS or Android, causing a crash at session creation:

```
onnxruntime::Model::Model: ... ConvInteger is not a registered function/op
```

This repository replaces the encoder with a dynamically quantized version that uses MatMulInteger and DynamicQuantizeLinear, both fully supported on ORT Mobile, while preserving near-lossless accuracy.

Accuracy

Encoder output comparison on 200 LibriSpeech dev-clean-2 utterances:

| Variant | Cosine Similarity vs FP32 | Encoder Size | ConvInteger Nodes | Mobile Compatible |
|---|---|---|---|---|
| FP32 (baseline) | 1.0000 | 29 MB | 0 | Yes |
| Dynamic INT8 (this repo) | 0.9986 | 12 MB | 0 | Yes |
| Static INT8 (upstream) | 0.6340 | 7.6 MB | 3 | No |

Dynamic quantization computes activation ranges on-the-fly via DynamicQuantizeLinear, making it robust to variable-length speech inputs. Static quantization fixes activation ranges at export time using calibration data, which generalizes poorly to utterances outside the calibration distribution.
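The on-the-fly range computation can be illustrated with a NumPy sketch of what DynamicQuantizeLinear does per the ONNX operator spec (uint8 output, range forced to include zero). This is an illustration of the math, not the ORT kernel; the input arrays are made up.

```python
import numpy as np

def dynamic_quantize(x):
    """NumPy sketch of ONNX DynamicQuantizeLinear: uint8 range from this input."""
    rmin = min(float(x.min()), 0.0)  # range must include zero
    rmax = max(float(x.max()), 0.0)
    scale = (rmax - rmin) / 255.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for all-zero input
    zero_point = int(np.clip(round(-rmin / scale), 0, 255))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

quiet = np.array([-0.1, 0.0, 0.1], dtype=np.float32)
loud = np.array([-4.0, 0.0, 4.0], dtype=np.float32)
# The scale adapts to each call, so neither signal is clipped:
print(dynamic_quantize(quiet)[1], dynamic_quantize(loud)[1])
```

Because the scale is recomputed per call, a quiet utterance and a loud one each use the full uint8 range, which is exactly what a fixed calibration-time scale cannot do.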

Supported Platforms

| Platform | Minimum Version | Execution Provider |
|---|---|---|
| iOS | 17.0+ | CPU, CoreML |
| Android | API 26+ | CPU, XNNPACK |
| macOS | 14.0+ | CPU, CoreML |

Usage

```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# Download the int8 encoder
encoder_path = hf_hub_download(
    "bitsydarel/moonshine-tiny-onnx-mobile",
    "onnx/encoder_model_int8.onnx"
)

# Create session
session = ort.InferenceSession(
    encoder_path,
    providers=["CPUExecutionProvider"]
)

# Verify inputs
for inp in session.get_inputs():
    print(f"{inp.name}: {inp.shape} ({inp.type})")
```

Model Files

This repository contains two complete model configurations, FP32 and INT8, plus shared tokenizer and config files. All decoders are merged variants with KV-cache support via the use_cache_branch input. Basic (non-cached) decoders are not included.

Model Configurations

| Configuration | Encoder | Decoder | Total Size | Recommended EP | Use Case |
|---|---|---|---|---|---|
| FP32 | encoder_model.onnx | decoder_model_merged.onnx | ~104 MB | CPU or XNNPACK | Highest accuracy, development |
| INT8 | encoder_model_int8.onnx | decoder_model_merged_int8.onnx | ~31 MB | CPU | Production mobile (3.4x smaller) |

ONNX Files

| File | Size | Format | KV-Cache | Description |
|---|---|---|---|---|
| onnx/encoder_model.onnx | 29 MB | FP32 | N/A | Full-precision encoder |
| onnx/encoder_model_int8.onnx | 12 MB | Dynamic INT8 | N/A | Re-exported encoder (MatMulInteger + DynamicQuantizeLinear) |
| onnx/decoder_model_merged.onnx | 75 MB | FP32 | Yes | Merged decoder; use_cache_branch=false on step 0, true on step 1+ |
| onnx/decoder_model_merged_int8.onnx | 19 MB | INT8 | Yes | Quantized merged decoder; same KV-cache behavior as FP32 |

Shared Files

| File | Size | Description |
|---|---|---|
| config.json | 1 KB | Model architecture configuration (hidden size, attention heads, vocab size) |
| tokenizer.json | 3.6 MB | Tokenizer vocabulary (shared across all configurations) |
| tokenizer_config.json | 133 KB | Tokenizer settings (BOS/EOS token IDs, special tokens) |

KV-Cache Decoder Usage

Both merged decoders use the use_cache_branch boolean input to control KV-cache behavior:

  • Step 0 (use_cache_branch=false): Computes fresh key-value pairs from encoder hidden states. All past_key_values inputs must still be provided (use zero sequence length).
  • Step 1+ (use_cache_branch=true): Reuses cached key-value pairs for faster autoregressive decoding. Decoder KV-cache grows each step; encoder KV-cache is frozen after step 0.

Important: The merged decoder outputs invalid encoder cache when use_cache_branch=true. Always preserve the encoder cache from step 0 and ignore encoder cache outputs on subsequent steps.
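The rules above can be sketched as a decode loop. Here `run_decoder` is a pure-Python stand-in for `session.run` on the merged decoder; the shapes, token values, and return layout are illustrative, not the model's real I/O. Only the cache bookkeeping is the point.

```python
import numpy as np

def run_decoder(past, use_cache_branch):
    """Stand-in for session.run on decoder_model_merged.onnx (illustrative shapes).
    Returns (logits, new_decoder_cache, encoder_cache_output)."""
    step = past["decoder"].shape[0]
    logits = np.zeros(16, dtype=np.float32)
    logits[step % 16] = 1.0  # fake "most likely token" per step
    new_decoder_cache = np.concatenate([past["decoder"], np.ones((1, 4))])
    # Mirror the real model: encoder cache output is invalid when reusing cache.
    encoder_cache = np.full((2, 4), np.nan) if use_cache_branch else np.ones((2, 4))
    return logits, new_decoder_cache, encoder_cache

past = {"decoder": np.zeros((0, 4)), "encoder": None}  # zero-length cache on step 0
tokens = [1]  # BOS (placeholder id)
for step in range(3):
    use_cache = step > 0
    logits, past["decoder"], enc_cache = run_decoder(past, use_cache)
    if not use_cache:
        past["encoder"] = enc_cache  # freeze encoder cache after step 0
    # on later steps enc_cache is invalid and must be ignored
    tokens.append(int(np.argmax(logits)))
```

The decoder cache grows by one step per iteration while the encoder cache is written exactly once, which is the behavior the merged graph expects.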

Technical Details

Why Dynamic over Static Quantization?

Speech-to-text models process variable-length audio inputs. Static quantization (QDQ format) fixes activation ranges at export time using calibration data, which makes assumptions about the distribution of input values. When inference-time inputs differ from the calibration data (common with speech of varying length, volume, and content), activation clipping causes significant accuracy loss.

Dynamic quantization computes activation ranges on-the-fly for each inference call using DynamicQuantizeLinear. This adds minimal latency overhead (< 1ms per encoder pass on Apple A17) while adapting to every input. For the Moonshine encoder, this preserves 0.9986 cosine similarity vs FP32, compared to 0.634 with static quantization.
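For reference, the similarity metric used throughout this card is plain cosine similarity over flattened encoder outputs. A minimal version looks like the following; the tensors here are synthetic stand-ins for real encoder runs, and the shape is illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened activation tensors."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for FP32 vs INT8 encoder outputs (shape is made up):
rng = np.random.default_rng(0)
fp32_out = rng.standard_normal((1, 100, 288))
int8_out = fp32_out + 0.01 * rng.standard_normal((1, 100, 288))  # small quant noise

print(round(cosine_similarity(fp32_out, int8_out), 4))
```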

Quantization Method

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder_model.onnx",
    model_output="encoder_model_int8.onnx",
    weight_type=QuantType.QInt8,
    op_types_to_quantize=["MatMul"],  # Only quantize MatMul, leave Conv in fp32
)
```

Only MatMul operations are quantized. Convolution layers remain in FP32 to avoid ConvInteger operators entirely.

Operator Compatibility

| Operator | ORT Mobile Support | Used In |
|---|---|---|
| MatMulInteger | Yes | Dynamic INT8 encoder |
| DynamicQuantizeLinear | Yes | Dynamic INT8 encoder |
| ConvInteger | No | Static INT8 encoder (upstream) |
| Conv | Yes | FP32 layers (preserved) |

Limitations

  • English only: Moonshine Tiny is trained on English speech data
  • 27M parameters: the smallest Moonshine variant; for higher accuracy, consider Moonshine Base (61M params)
  • Short-form audio: optimized for utterances up to ~30 seconds; for longer audio, segment before transcribing
  • Encoder-only re-export: only the encoder was re-exported with dynamic quantization; decoder files are unchanged from upstream
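For the long-audio case, a simple fixed-window splitter with a small overlap might look like the sketch below. The window and overlap values are illustrative defaults, not tuned recommendations, and real pipelines often prefer silence-based (VAD) segmentation over fixed windows.

```python
import numpy as np

def segment(audio, sr=16000, window_s=30.0, overlap_s=1.0):
    """Split a 1-D audio array into windows of at most window_s seconds,
    overlapping by overlap_s so words at a boundary appear in both chunks."""
    win = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + win])
        if start + win >= len(audio):
            break
    return chunks

# 70 s of silence at 16 kHz splits into three chunks of at most 30 s:
chunks = segment(np.zeros(70 * 16000, dtype=np.float32))
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 12.0]
```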

Citation

@article{jeffries2024moonshine,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Jeffries, Nat and King, Evan and Kudlur, Manjunath and Nicholson, Guy and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2410.15608},
  year={2024}
}

License

This model is released under the MIT License, consistent with the original Moonshine model license.
