felixem committed on
Commit 2f5c1d8 · verified · 1 Parent(s): 2e065a7

Add comprehensive README with benchmarks and usage

Files changed (1): README.md +247 -0
README.md ADDED
@@ -0,0 +1,247 @@
---
license: mit
base_model: UsefulSensors/moonshine-streaming-tiny
language:
- en
tags:
- onnx
- int8
- fp16
- quantized
- optimized
- speech-recognition
- asr
- streaming
- moonshine
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---

# Moonshine Streaming Tiny - Optimized

Optimized variants of [UsefulSensors/moonshine-streaming-tiny](https://huggingface.co/UsefulSensors/moonshine-streaming-tiny), a 34M-parameter streaming ASR model designed for real-time, on-device English speech recognition.

Based on: [Moonshine v2: Ergodic Streaming Encoder ASR](https://arxiv.org/abs/2602.12241)

## Optimized Variants

| Variant | Total Size | Size Reduction | Best For |
|---------|-----------|---------------|----------|
| **Original FP32** | 168.1 MB | n/a | Reference |
| **ONNX INT8** | 79.8 MB | **52%** | CPU deployment, edge devices |
| **FP16 SafeTensors** | 88.1 MB | **48%** | GPU inference |
| **ONNX FP32** | 297 MB | n/a | ONNX Runtime without quantization |
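
The size-reduction column can be recomputed from the totals in the table, relative to the 168.1 MB original. A minimal sketch of that arithmetic follows; the helper name `size_reduction_pct` is illustrative and not part of this repo.

```python
def size_reduction_pct(original_mb: float, optimized_mb: float) -> float:
    """Percentage of the original size saved by the optimized variant."""
    return (1.0 - optimized_mb / original_mb) * 100.0


ORIGINAL_MB = 168.1  # "Original FP32" total from the table

variants_mb = {
    "ONNX INT8": 79.8,
    "FP16 SafeTensors": 88.1,
}

for name, size_mb in variants_mb.items():
    print(f"{name}: {size_reduction_pct(ORIGINAL_MB, size_mb):.1f}% smaller")
```

By the same formula, the 297 MB ONNX FP32 export comes out larger than the original checkpoint, which is why that row reports no reduction.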

## Benchmark Results

Tested with 5 seconds of audio, generating up to 64 tokens:

| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---------|------------|-----|---------------------|
| **PyTorch FP16 (GPU)** | 47.7 ms | 0.0095 | **1.71x** |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |

> **Note**: The ONNX benchmarks include session overhead and come from a single test run. For production deployment with session reuse on real audio, ONNX Runtime typically provides better throughput, especially for long-running services. The Moonshine team reports 50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
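
RTF (real-time factor) is processing time divided by audio duration, so values below 1 mean faster than real time. The speedup column divides the PyTorch FP32 CPU latency by each variant's latency. The sketch below recomputes both from the rounded latencies in the table. The function names are illustrative, and small differences from the published figures (about 1.70x rather than 1.71x for the GPU row) are presumably due to rounding of the reported latencies.

```python
AUDIO_SECONDS = 5.0  # clip length used in the benchmark


def rtf(latency_ms: float, audio_seconds: float = AUDIO_SECONDS) -> float:
    """Real-time factor: seconds of compute per second of audio."""
    return (latency_ms / 1000.0) / audio_seconds


def speedup(baseline_ms: float, latency_ms: float) -> float:
    """How many times faster than the baseline a variant is."""
    return baseline_ms / latency_ms


latencies_ms = {
    "PyTorch FP16 (GPU)": 47.7,
    "PyTorch INT8 (CPU)": 78.6,
    "PyTorch FP32 (CPU)": 81.3,
    "ONNX FP32 (CPU)": 115.5,
    "ONNX INT8 (CPU)": 153.2,
}

baseline_ms = latencies_ms["PyTorch FP32 (CPU)"]
for name, ms in latencies_ms.items():
    print(f"{name}: RTF={rtf(ms):.4f}, speedup={speedup(baseline_ms, ms):.2f}x")
```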

## File Structure

```
├── onnx_int8/                              # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx             # 9.8 MB
│   ├── decoder_model_int8.onnx             # 36 MB
│   ├── decoder_with_past_model_int8.onnx   # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                   # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                   # FP16 SafeTensors (for GPU)
    ├── model.safetensors                   # 88.1 MB
    ├── config.json
    └── tokenizer.json
```

## Usage

### ONNX INT8 Inference (CPU - Recommended for Edge)

```bash
pip install onnxruntime numpy tokenizers
```

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "onnx_int8"  # or download from this repo
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16 kHz float32, padded to a multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))

audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# First decode step (no KV cache yet)
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping: present_* outputs become past_* inputs
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}

kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor

# Autoregressive decode loop (greedy)
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]

for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)

    # Refresh the cache in place, so any entry the step model does not
    # re-emit (e.g. fixed cross-attention tensors) keeps its previous value.
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor

text = tokenizer.decode(tokens)
print(text)
```
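
The padding step in the example above rounds the sample count up to the next multiple of 80, which is 5 ms of audio at 16 kHz. A stdlib-only sketch of that calculation follows; the helper name `pad_amount` is hypothetical, and the multiple of 80 is taken from the comment in the example.

```python
def pad_amount(num_samples: int, multiple: int = 80) -> int:
    """Number of zero samples to append so the length is a multiple of `multiple`."""
    return (-num_samples) % multiple


for n in (80000, 16001, 100):
    padded = n + pad_amount(n)
    print(f"{n} samples -> pad {pad_amount(n)} -> {padded} samples")
```

A 5-second clip at 16 kHz (80000 samples) is already aligned, so only clips of arbitrary length need padding.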
154
+
155
+ ### FP16 PyTorch Inference (GPU)
156
+
157
+ ```python
158
+ from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
159
+ import torch
160
+
161
+ model = MoonshineStreamingForConditionalGeneration.from_pretrained(
162
+ "felixem/moonshine-streaming-tiny-optimized",
163
+ subfolder="fp16",
164
+ torch_dtype=torch.float16,
165
+ ).to("cuda")
166
+
167
+ processor = AutoProcessor.from_pretrained(
168
+ "felixem/moonshine-streaming-tiny-optimized",
169
+ subfolder="fp16",
170
+ )
171
+
172
+ # Process audio
173
+ inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
174
+ inputs = {k: v.to("cuda", torch.float16) for k, v in inputs.items()}
175
+
176
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
177
+ text = processor.decode(generated_ids[0], skip_special_tokens=True)
178
+ ```

### PyTorch Dynamic INT8 (CPU - Quick Setup)

```python
import numpy as np
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()

# Quantize Linear layers to INT8 (weights quantized ahead of time, activations at run time)
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Placeholder input: 5 s of 16 kHz mono float32 audio; replace with real audio
audio_array = np.random.randn(16000 * 5).astype(np.float32)

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
print(text)
```

## ONNX Export Details

- **Encoder**: Exported with `torch.onnx.export(dynamo=True)` to handle vmap-based sliding-window attention masking
- **Decoder**: Separate models for the first step (no KV cache) and the autoregressive steps (with KV cache)
- **Quantization**: `onnxruntime.quantization.quantize_dynamic` with symmetric INT8, per-channel, `reduce_range=True`
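
The quantization step listed above could look roughly like the following. This is a sketch under stated assumptions, not the exact script used for this repo: the helper names `int8_path` and `quantize_to_int8` are hypothetical, the output naming simply mirrors the `onnx_int8/` files in the file structure, and symmetric weight quantization is assumed to be the library default for INT8 weights, so it is not set explicitly.

```python
from pathlib import Path


def int8_path(fp32_path: str, out_dir: str = "onnx_int8") -> Path:
    """Map e.g. onnx/encoder_model.onnx to onnx_int8/encoder_model_int8.onnx."""
    src = Path(fp32_path)
    return Path(out_dir) / f"{src.stem}_int8{src.suffix}"


def quantize_to_int8(fp32_path: str, out_dir: str = "onnx_int8") -> Path:
    """Dynamically quantize one FP32 ONNX model to INT8 weights."""
    # Imported lazily so the path helper above works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    dst = int8_path(fp32_path, out_dir)
    dst.parent.mkdir(parents=True, exist_ok=True)
    quantize_dynamic(
        model_input=fp32_path,
        model_output=str(dst),
        per_channel=True,             # one scale per output channel
        reduce_range=True,            # as stated in the export details
        weight_type=QuantType.QInt8,  # signed 8-bit weights
    )
    return dst


# Usage (requires onnxruntime and the FP32 exports on disk):
# for name in ("encoder_model", "decoder_model", "decoder_with_past_model"):
#     quantize_to_int8(f"onnx/{name}.onnx")
```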

### KV Cache Structure

Each decoder layer produces 4 KV tensors:
- `present_{layer}_self_key` / `present_{layer}_self_value`: Self-attention cache [B, 8, S, 40]
- `present_{layer}_cross_key` / `present_{layer}_cross_value`: Cross-attention cache [B, 8, T, 40]

For `decoder_with_past_model`, feed these back as `past_{layer}_*` inputs.
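
The renaming can be sketched without ONNX Runtime. The tensor names below follow the `present_{layer}_*` pattern described above, but the exact names in the exported graphs are an assumption and should be read from the session's `get_outputs()` and `get_inputs()`. `present_to_past` is a hypothetical helper, and plain strings stand in for the real arrays.

```python
def present_to_past(present: dict, past_input_names: set) -> dict:
    """Rename present_* outputs to past_* inputs, keeping only accepted names."""
    feed = {}
    for name, tensor in present.items():
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_input_names:
            feed[mapped] = tensor
    return feed


# Illustrative outputs of one decoder layer (layer index 0).
present = {
    "present_0_self_key": "K_self",
    "present_0_self_value": "V_self",
    "present_0_cross_key": "K_cross",
    "present_0_cross_value": "V_cross",
}
accepted = {
    "past_0_self_key",
    "past_0_self_value",
    "past_0_cross_key",
    "past_0_cross_value",
}

feed = present_to_past(present, accepted)
print(sorted(feed))
```

Filtering against the accepted input names means outputs such as the logits, or cache tensors the step model does not take as input, are dropped rather than passed to `run()`.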

## Quantization Impact

Based on the [Edge-ASR paper](https://arxiv.org/abs/2507.07877) (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:

| Config | Avg WER | vs FP32 (absolute) |
|--------|---------|---------|
| FP32 baseline | 12.72% | n/a |
| **W8-A8 (INT8)** | **12.81%** | **+0.09%** |
| W4-A16 (SpQR) | 13.61% | +0.89% |

INT8 is the sweet spot for Moonshine Tiny: virtually no accuracy loss with ~50% model size reduction.
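
The "vs FP32" column is an absolute difference in WER percentage points; the relative increase is a separate quantity. A short sketch of both, using the figures from the table (the function name `wer_delta` is illustrative):

```python
def wer_delta(baseline_wer: float, quantized_wer: float) -> tuple:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = quantized_wer - baseline_wer
    relative = absolute / baseline_wer * 100.0
    return absolute, relative


FP32_WER = 12.72  # FP32 baseline from the table

for name, wer in [("W8-A8 (INT8)", 12.81), ("W4-A16 (SpQR)", 13.61)]:
    absolute, relative = wer_delta(FP32_WER, wer)
    print(f"{name}: +{absolute:.2f} points absolute, +{relative:.1f}% relative")
```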

## Limitations

- English only
- Optimized for short utterances (streaming chunks of 1-5 seconds)
- The ONNX FP32 variant stores its weights in external data files (`.onnx.data`)
- The decoder uses autoregressive generation, so output latency scales with transcript length

## Citation

```bibtex
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
```

## License

MIT (same as base model)