---
license: mit
base_model: UsefulSensors/moonshine-streaming-tiny
language:
- en
tags:
- onnx
- int8
- fp16
- quantized
- optimized
- speech-recognition
- asr
- streaming
- moonshine
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---

# Moonshine Streaming Tiny — Optimized

Optimized variants of [UsefulSensors/moonshine-streaming-tiny](https://huggingface.co/UsefulSensors/moonshine-streaming-tiny), a 34M parameter streaming ASR model designed for real-time, on-device English speech recognition.

Based on: [Moonshine v2: Ergodic Streaming Encoder ASR](https://arxiv.org/abs/2602.12241)
## Optimized Variants

| Variant | Total Size | Size Reduction | Best For |
|---------|-----------|---------------|----------|
| **Original FP32** | 168.1 MB | — | Reference |
| **ONNX INT8** | 79.8 MB | **52%** | CPU deployment, edge devices |
| **FP16 SafeTensors** | 88.1 MB | **48%** | GPU inference |
| **ONNX FP32** | 297 MB | — | ONNX Runtime without quantization |

## Benchmark Results

Tested with 5 seconds of audio, generating up to 64 tokens (RTF = real-time factor, i.e. processing time divided by audio duration; lower is better):

| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---------|------------|-----|---------------------|
| **PyTorch FP16 (GPU)** | 47.7 ms | 0.0095 | **1.71x** |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |

> **Note**: The ONNX figures include session overhead and come from a single run. For production deployment with session reuse on real audio, ONNX Runtime typically provides better throughput, especially for long-running services. The Moonshine team reports 50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
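
If you want to reproduce or extend these numbers with a reused session, a minimal timing sketch might look like the following (the file path, input shapes, and run count are illustrative; the input names match the usage example further down):

```python
import time

import numpy as np
import onnxruntime as ort

# Reuse one session for every run; the first run is a warm-up and is not timed.
sess = ort.InferenceSession(
    "onnx_int8/encoder_model_int8.onnx", providers=["CPUExecutionProvider"]
)

audio = np.zeros((1, 16000 * 5), dtype=np.float32)  # 5 s of silence as a stand-in
feed = {"input_values": audio, "attention_mask": np.ones_like(audio, dtype=np.int64)}

sess.run(None, feed)  # warm-up

n_runs = 20
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, feed)
avg_ms = (time.perf_counter() - start) / n_runs * 1000
print(f"Average encoder latency: {avg_ms:.1f} ms")
```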

## File Structure

```
├── onnx_int8/                              # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx             # 9.8 MB
│   ├── decoder_model_int8.onnx             # 36 MB
│   ├── decoder_with_past_model_int8.onnx   # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                   # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                   # FP16 SafeTensors (for GPU)
    ├── model.safetensors                   # 88.1 MB
    ├── config.json
    └── tokenizer.json
```
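
To fetch just one variant instead of the whole repository, `huggingface_hub`'s `snapshot_download` with an `allow_patterns` filter is one option (a sketch; the repo id matches the usage examples below and the pattern selects the INT8 folder):

```python
from huggingface_hub import snapshot_download

# Download only the ONNX INT8 folder from this repository.
local_dir = snapshot_download(
    repo_id="felixem/moonshine-streaming-tiny-optimized",
    allow_patterns=["onnx_int8/*"],
)
print(local_dir)  # set MODEL_DIR in the example below to f"{local_dir}/onnx_int8"
```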

## Usage

### ONNX INT8 Inference (CPU — Recommended for Edge)

```bash
pip install onnxruntime numpy tokenizers
```

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "onnx_int8"  # or download from this repo
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16kHz float32, padded to multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))

audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# First decode step
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}

kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor

# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]

for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)

    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor

text = tokenizer.decode(tokens)
print(text)
```

### FP16 PyTorch Inference (GPU)

```python
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)

# Process audio (cast only floating-point tensors to FP16)
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
inputs = {
    k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}

generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```

### PyTorch Dynamic INT8 (CPU — Quick Setup)

```python
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()

# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```
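
The PyTorch examples above assume `audio_array` is a 16 kHz mono waveform as a 1-D float array (the ONNX example uses random noise as a placeholder). One way to load one, assuming `librosa` is installed and a local `speech.wav` file exists:

```python
import librosa

# Load an audio file and resample it to the 16 kHz mono format the model expects.
audio_array, sr = librosa.load("speech.wav", sr=16000, mono=True)
```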

## ONNX Export Details

- **Encoder**: Exported with `torch.onnx.export(dynamo=True)` to handle vmap-based sliding-window attention masking
- **Decoder**: Separate models for first step (no KV cache) and autoregressive steps (with KV cache)
- **Quantization**: `onnxruntime.quantization.quantize_dynamic` with symmetric INT8, per-channel, reduce_range=True
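
A sketch of how that quantization step might be invoked with `onnxruntime.quantization` (the loop and file names follow the repository layout above; the actual export script is not included in this repo):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic INT8 weight quantization, per-channel, with reduced range.
for name in ["encoder_model", "decoder_model", "decoder_with_past_model"]:
    quantize_dynamic(
        model_input=f"onnx/{name}.onnx",
        model_output=f"onnx_int8/{name}_int8.onnx",
        weight_type=QuantType.QInt8,
        per_channel=True,
        reduce_range=True,
    )
```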

### KV Cache Structure

Each decoder layer produces 4 KV tensors:
- `present_{layer}_self_key` / `present_{layer}_self_value`: Self-attention cache [B, 8, S, 40]
- `present_{layer}_cross_key` / `present_{layer}_cross_value`: Cross-attention cache [B, 8, T, 40]

For `decoder_with_past_model`, feed these back as `past_{layer}_*` inputs.
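
If you need to confirm the exact tensor names and shapes for a given export, inspecting the session is a quick check (path as in the usage example above):

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "onnx_int8/decoder_with_past_model_int8.onnx", providers=["CPUExecutionProvider"]
)

# Every present_* output should map to a past_* input with the same suffix.
for o in sess.get_outputs():
    print("output:", o.name, o.shape)
for i in sess.get_inputs():
    print("input: ", i.name, i.shape)
```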

## Quantization Impact

Based on the [Edge-ASR paper](https://arxiv.org/abs/2507.07877) (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:

| Config | Avg WER | Δ WER vs FP32 (absolute) |
|--------|---------|---------|
| FP32 baseline | 12.72% | — |
| **W8-A8 (INT8)** | **12.81%** | **+0.09%** |
| W4-A16 (SpQR) | 13.61% | +0.89% |

INT8 is the sweet spot for Moonshine Tiny — virtually no accuracy loss with ~50% model size reduction.
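
To spot-check accuracy on your own audio, comparing a reference transcript against the model output with a WER metric such as `jiwer` is straightforward (the strings below are placeholders):

```python
import jiwer

reference = "turn on the kitchen lights"   # ground-truth transcript
hypothesis = "turn on the kitchen light"   # model output for the same clip
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```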

## Limitations

- English only
- Optimized for short utterances (streaming chunks of 1-5 seconds)
- ONNX models use external data files (`.onnx.data`) for the FP32 variant
- The decoder uses autoregressive generation, so output latency scales with transcript length

## Citation

```bibtex
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
```

## License

MIT (same as base model)