---
license: mit
base_model: UsefulSensors/moonshine-streaming-tiny
language:
- en
tags:
- onnx
- int8
- fp16
- quantized
- optimized
- speech-recognition
- asr
- streaming
- moonshine
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---
# Moonshine Streaming Tiny – Optimized
Optimized variants of [UsefulSensors/moonshine-streaming-tiny](https://huggingface.co/UsefulSensors/moonshine-streaming-tiny), a 34M parameter streaming ASR model designed for real-time, on-device English speech recognition.
Based on: [Moonshine v2: Ergodic Streaming Encoder ASR](https://arxiv.org/abs/2602.12241)
## Optimized Variants
| Variant | Total Size | Size Reduction | Best For |
|---------|-----------|---------------|----------|
| **Original FP32** | 168.1 MB | – | Reference |
| **ONNX INT8** | 79.8 MB | **52%** | CPU deployment, edge devices |
| **FP16 SafeTensors** | 88.1 MB | **48%** | GPU inference |
| **ONNX FP32** | 297 MB | – | ONNX Runtime without quantization |
## Benchmark Results
Tested with 5 seconds of audio, generating up to 64 tokens:
| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---------|------------|-----|---------------------|
| **PyTorch FP16 (GPU)** | 47.7 ms | 0.0095 | **1.71x** |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |
> **Note**: The ONNX numbers include session overhead and reflect a single test run. For production deployment with session reuse on real audio, ONNX Runtime typically provides better throughput, especially for long-running services. The Moonshine team reports 50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
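Real-time factor (RTF) is wall-clock inference time divided by audio duration, so an RTF of 0.03 means 5 s of audio is transcribed in about 150 ms. A minimal timing sketch under the same conditions (warm-up pass, session reuse); `transcribe` is a placeholder for the encoder call plus decode loop shown in the Usage section below:
```python
import time
import numpy as np

AUDIO_SECONDS = 5
audio = np.random.randn(16000 * AUDIO_SECONDS).astype(np.float32)

# transcribe() is a placeholder for the encoder + autoregressive decode
# loop from the Usage section; sessions are created once and reused.
transcribe(audio)  # warm-up pass so one-time initialization is excluded

runs = 10
start = time.perf_counter()
for _ in range(runs):
    transcribe(audio)
latency = (time.perf_counter() - start) / runs

rtf = latency / AUDIO_SECONDS  # < 1.0 means faster than real time
print(f"avg latency: {latency * 1e3:.1f} ms  RTF: {rtf:.4f}")
```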
## File Structure
```
├── onnx_int8/                              # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx             # 9.8 MB
│   ├── decoder_model_int8.onnx             # 36 MB
│   ├── decoder_with_past_model_int8.onnx   # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                   # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                   # FP16 SafeTensors (for GPU)
    ├── model.safetensors                   # 88.1 MB
    ├── config.json
    └── tokenizer.json
```
## Usage
### ONNX INT8 Inference (CPU – Recommended for Edge)
```bash
pip install onnxruntime numpy tokenizers
```
```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "onnx_int8"  # or download from this repo
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16 kHz float32, padded to a multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))
audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# First decode step (no KV cache)
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping: present_* outputs become past_* inputs
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}
kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor

# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]
for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor

text = tokenizer.decode(tokens)
print(text)
```
### FP16 PyTorch Inference (GPU)
```python
import numpy as np
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")
processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)

# Process audio (1-D float32 array at 16 kHz)
audio_array = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
# Cast only floating-point tensors to FP16; integer tensors (e.g. masks) keep their dtype
inputs = {
    k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}

generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```
### PyTorch Dynamic INT8 (CPU – Quick Setup)
```python
import numpy as np
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()

# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
audio_array = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```
## ONNX Export Details
- **Encoder**: Exported with `torch.onnx.export(dynamo=True)` to handle vmap-based sliding-window attention masking
- **Decoder**: Separate models for first step (no KV cache) and autoregressive steps (with KV cache)
- **Quantization**: `onnxruntime.quantization.quantize_dynamic` with symmetric INT8, per-channel, reduce_range=True
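A minimal sketch of the quantization step, assuming the FP32 ONNX files have already been exported (paths are illustrative; the same call applies to each of the three models):
```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Illustrative paths; repeat for decoder_model.onnx and decoder_with_past_model.onnx
quantize_dynamic(
    model_input="onnx/encoder_model.onnx",
    model_output="onnx_int8/encoder_model_int8.onnx",
    weight_type=QuantType.QInt8,  # symmetric INT8 weights
    per_channel=True,             # one scale per output channel
    reduce_range=True,            # 7-bit weight range for safer x86 saturation behavior
)
```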
### KV Cache Structure
Each decoder layer produces 4 KV tensors:
- `present_{layer}_self_key` / `present_{layer}_self_value`: Self-attention cache [B, 8, S, 40]
- `present_{layer}_cross_key` / `present_{layer}_cross_value`: Cross-attention cache [B, 8, T, 40]
For `decoder_with_past_model`, feed these back as `past_{layer}_*` inputs.
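The renaming is mechanical, so it can be factored into a small helper; this sketch mirrors the convention above and the mapping already used in the Usage example:
```python
def presents_to_pasts(present_names, present_tensors, past_input_names):
    """Rename present_* KV outputs to the matching past_* inputs for the next step."""
    return {
        name.replace("present_", "past_", 1): tensor
        for name, tensor in zip(present_names, present_tensors)
        if name.replace("present_", "past_", 1) in past_input_names
    }
```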
## Quantization Impact
Based on the [Edge-ASR paper](https://arxiv.org/abs/2507.07877) (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:
| Config | Avg WER | vs FP32 |
|--------|---------|---------|
| FP32 baseline | 12.72% | – |
| **W8-A8 (INT8)** | **12.81%** | **+0.09%** |
| W4-A16 (SpQR) | 13.61% | +0.89% |
INT8 is the sweet spot for Moonshine Tiny: virtually no accuracy loss with ~50% model size reduction.
## Limitations
- English only
- Optimized for short utterances (streaming chunks of 1–5 seconds)
- ONNX models use external data files (`.onnx.data`) for FP32 variant
- The decoder uses autoregressive generation, so output latency scales with transcript length
## Citation
```bibtex
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
```
## License
MIT (same as base model)