---
license: mit
base_model: UsefulSensors/moonshine-streaming-tiny
language:
- en
tags:
  - onnx
  - int8
  - fp16
  - quantized
  - optimized
  - speech-recognition
  - asr
  - streaming
  - moonshine
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---

# Moonshine Streaming Tiny – Optimized

Optimized variants of [UsefulSensors/moonshine-streaming-tiny](https://huggingface.co/UsefulSensors/moonshine-streaming-tiny), a 34M-parameter streaming ASR model designed for real-time, on-device English speech recognition.

Based on: [Moonshine v2: Ergodic Streaming Encoder ASR](https://arxiv.org/abs/2602.12241)

## Optimized Variants

| Variant | Total Size | Size Reduction | Best For |
|---------|-----------|---------------|----------|
| **Original FP32** | 168.1 MB | – | Reference |
| **ONNX INT8** | 79.8 MB | **52%** | CPU deployment, edge devices |
| **FP16 SafeTensors** | 88.1 MB | **48%** | GPU inference |
| **ONNX FP32** | 297 MB | – | ONNX Runtime without quantization |

## Benchmark Results

Tested with 5 seconds of audio, generating up to 64 tokens:

| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---------|------------|-----|---------------------|
| **PyTorch FP16 (GPU)** | 47.7 ms | 0.0095 | **1.71x** |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |

> **Note**: The ONNX numbers above include session-creation overhead and come from a single test run. For production deployment with session reuse on real audio, ONNX Runtime typically provides better throughput, especially for long-running services. The Moonshine team reports ~50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
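
For a number closer to steady state, the sketch below reuses a single session and averages over repeated runs after a warm-up pass (encoder only; input names follow the usage section further down, and the run count is arbitrary):

```python
import time

import numpy as np
import onnxruntime as ort

# Illustrative path; point it at the onnx_int8 directory from this repo.
sess = ort.InferenceSession("onnx_int8/encoder_model_int8.onnx", providers=["CPUExecutionProvider"])

# 5 s of 16 kHz audio (80,000 samples, already a multiple of 80).
audio = np.random.randn(1, 16000 * 5).astype(np.float32)
feed = {"input_values": audio, "attention_mask": np.ones_like(audio, dtype=np.int64)}

sess.run(None, feed)  # warm-up: keeps one-time session/graph costs out of the timing

runs = 20
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, feed)
print(f"avg encoder latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```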

## File Structure

```
├── onnx_int8/                            # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx           # 9.8 MB
│   ├── decoder_model_int8.onnx           # 36 MB
│   ├── decoder_with_past_model_int8.onnx # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                 # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                 # FP16 SafeTensors (for GPU)
    ├── model.safetensors                 # 88.1 MB
    ├── config.json
    └── tokenizer.json
```
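
If you only need one variant, you can fetch just that folder. A minimal sketch with `huggingface_hub` (the repo id is the one used in the usage examples below):

```python
from huggingface_hub import snapshot_download

# Downloads only the INT8 ONNX files and returns the local snapshot path.
model_dir = snapshot_download(
    "felixem/moonshine-streaming-tiny-optimized",
    allow_patterns=["onnx_int8/*"],
)
print(f"{model_dir}/onnx_int8")  # use this as MODEL_DIR in the snippet below
```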

## Usage

### ONNX INT8 Inference (CPU – Recommended for Edge)

```bash
pip install onnxruntime numpy tokenizers
```

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "onnx_int8"  # or download from this repo
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16kHz float32, padded to multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))

audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# First decode step
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}

kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor

# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]

for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    
    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor

text = tokenizer.decode(tokens)
print(text)
```
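
The snippet above feeds random samples as a placeholder. Assuming the `soundfile` package and a mono recording already at 16 kHz, loading real audio could look like this:

```python
import soundfile as sf

# Expects a mono WAV already sampled at 16 kHz; resample beforehand otherwise.
audio, sr = sf.read("speech.wav", dtype="float32")
assert sr == 16000, "Moonshine expects 16 kHz input"
```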

### FP16 PyTorch Inference (GPU)

```python
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)

# Process audio (audio_array: a 1-D float32 waveform sampled at 16 kHz)
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
# Cast floating-point tensors to FP16; keep integer tensors (e.g. attention masks) as-is
inputs = {
    k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}

generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```

### PyTorch Dynamic INT8 (CPU – Quick Setup)

```python
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()

# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)  # audio_array: 1-D float32 waveform at 16 kHz
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```

## ONNX Export Details

- **Encoder**: Exported with `torch.onnx.export(dynamo=True)` to handle vmap-based sliding-window attention masking
- **Decoder**: Separate models for first step (no KV cache) and autoregressive steps (with KV cache)
- **Quantization**: `onnxruntime.quantization.quantize_dynamic` with symmetric INT8 weights, per-channel, `reduce_range=True` (see the sketch below)
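
As a rough sketch of that quantization step (the file paths are illustrative, not a record of the exact command used):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic weight-only INT8 quantization, per-channel, with reduce_range enabled.
quantize_dynamic(
    "onnx/decoder_model.onnx",
    "onnx_int8/decoder_model_int8.onnx",
    per_channel=True,
    reduce_range=True,
    weight_type=QuantType.QInt8,
)
```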

### KV Cache Structure

Each decoder layer produces 4 KV tensors:
- `present_{layer}_self_key` / `present_{layer}_self_value`: Self-attention cache [B, 8, S, 40]
- `present_{layer}_cross_key` / `present_{layer}_cross_value`: Cross-attention cache [B, 8, T, 40]

For `decoder_with_past_model`, feed these back as `past_{layer}_*` inputs.
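
To confirm the exact names and shapes your export expects, you can inspect the with-past decoder directly (path as in the INT8 usage example above):

```python
import onnxruntime as ort

sess = ort.InferenceSession("onnx_int8/decoder_with_past_model_int8.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print(inp.name, inp.shape)  # e.g. past_{layer}_self_key, past_{layer}_cross_value, ...
```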

## Quantization Impact

Based on the [Edge-ASR paper](https://arxiv.org/abs/2507.07877) (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:

| Config | Avg WER | Δ vs FP32 |
|--------|---------|-----------|
| FP32 baseline | 12.72% | – |
| **W8-A8 (INT8)** | **12.81%** | **+0.09 pp** |
| W4-A16 (SpQR) | 13.61% | +0.89 pp |

INT8 is the sweet spot for Moonshine Tiny: virtually no accuracy loss with ~50% model size reduction.

## Limitations

- English only
- Optimized for short utterances (streaming chunks of 1-5 seconds)
- The FP32 ONNX variant uses external data files (`.onnx.data`)
- The decoder uses autoregressive generation, so output latency scales with transcript length

## Citation

```bibtex
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
```

## License

MIT (same as base model)