---
license: mit
base_model: UsefulSensors/moonshine-streaming-tiny
language:
- en
tags:
- onnx
- int8
- fp16
- quantized
- optimized
- speech-recognition
- asr
- streaming
- moonshine
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---

# Moonshine Streaming Tiny — Optimized

Optimized variants of [UsefulSensors/moonshine-streaming-tiny](https://huggingface.co/UsefulSensors/moonshine-streaming-tiny), a 34M parameter streaming ASR model designed for real-time, on-device English speech recognition.

Based on: [Moonshine v2: Ergodic Streaming Encoder ASR](https://arxiv.org/abs/2602.12241)
## Optimized Variants

| Variant | Total Size | Size Reduction | Best For |
|---------|-----------|---------------|----------|
| **Original FP32** | 168.1 MB | — | Reference |
| **ONNX INT8** | 79.8 MB | **52%** | CPU deployment, edge devices |
| **FP16 SafeTensors** | 88.1 MB | **48%** | GPU inference |
| **ONNX FP32** | 297 MB | — | ONNX Runtime without quantization |

## Benchmark Results

Tested with 5 seconds of audio, generating up to 64 tokens (RTF = real-time factor, i.e. processing time divided by audio duration; lower is better):

| Variant | Avg Latency | RTF | Speedup vs FP32 CPU |
|---------|------------|-----|---------------------|
| **PyTorch FP16 (GPU)** | 47.7 ms | 0.0095 | **1.71x** |
| PyTorch INT8 (CPU) | 78.6 ms | 0.0157 | 1.03x |
| PyTorch FP32 (CPU) | 81.3 ms | 0.0163 | 1.00x (baseline) |
| ONNX FP32 (CPU) | 115.5 ms | 0.0231 | 0.70x |
| ONNX INT8 (CPU) | 153.2 ms | 0.0306 | 0.53x |

> **Note**: The ONNX figures include session overhead and come from a single run. For production deployment with session reuse on real audio, ONNX Runtime typically provides better throughput, especially for long-running services. The Moonshine team reports 50 ms response latency on an Apple M3 with their C++ ONNX Runtime backend.
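
If you want to reproduce or extend these numbers with a reused session, a minimal timing sketch might look like the following (the file path, input shapes, and run count are illustrative; the input names match the usage example further down):

```python
import time

import numpy as np
import onnxruntime as ort

# Reuse one session for every run; the first run is a warm-up and is not timed.
sess = ort.InferenceSession(
    "onnx_int8/encoder_model_int8.onnx", providers=["CPUExecutionProvider"]
)

audio = np.zeros((1, 16000 * 5), dtype=np.float32)  # 5 s of silence as a stand-in
feed = {"input_values": audio, "attention_mask": np.ones_like(audio, dtype=np.int64)}

sess.run(None, feed)  # warm-up

n_runs = 20
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, feed)
avg_ms = (time.perf_counter() - start) / n_runs * 1000
print(f"Average encoder latency: {avg_ms:.1f} ms")
```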

## File Structure

```
├── onnx_int8/                              # ONNX INT8 quantized (recommended for CPU)
│   ├── encoder_model_int8.onnx             # 9.8 MB
│   ├── decoder_model_int8.onnx             # 36 MB
│   ├── decoder_with_past_model_int8.onnx   # 32 MB
│   ├── tokenizer.json
│   ├── config.json
│   └── quantize_config.json
├── onnx/                                   # ONNX FP32
│   ├── encoder_model.onnx + .data
│   ├── decoder_model.onnx + .data
│   ├── decoder_with_past_model.onnx + .data
│   └── ...
└── fp16/                                   # FP16 SafeTensors (for GPU)
    ├── model.safetensors                   # 88.1 MB
    ├── config.json
    └── tokenizer.json
```
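
To fetch just one variant instead of the whole repository, `huggingface_hub`'s `snapshot_download` with an `allow_patterns` filter is one option (a sketch; the repo id matches the usage examples below and the pattern selects the INT8 folder):

```python
from huggingface_hub import snapshot_download

# Download only the ONNX INT8 folder from this repository.
local_dir = snapshot_download(
    repo_id="felixem/moonshine-streaming-tiny-optimized",
    allow_patterns=["onnx_int8/*"],
)
print(local_dir)  # set MODEL_DIR in the example below to f"{local_dir}/onnx_int8"
```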

## Usage

### ONNX INT8 Inference (CPU — Recommended for Edge)

```bash
pip install onnxruntime numpy tokenizers
```

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "onnx_int8"  # or download from this repo
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16kHz float32, padded to multiple of 80 samples)
audio = np.random.randn(16000 * 5).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))

audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode audio
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# First decode step
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}

kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor

# Autoregressive decode loop
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
tokens = [token]

for _ in range(255):
    if token == EOS:
        break
    inputs = {
        "decoder_input_ids": np.array([[token]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)

    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor

text = tokenizer.decode(tokens)
print(text)
```

### FP16 PyTorch Inference (GPU)

```python
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
    torch_dtype=torch.float16,
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "felixem/moonshine-streaming-tiny-optimized",
    subfolder="fp16",
)

# Process audio (cast only floating-point tensors to FP16)
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
inputs = {
    k: v.to("cuda", torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}

generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```

### PyTorch Dynamic INT8 (CPU — Quick Setup)

```python
import torch
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-streaming-tiny"
).eval()

# Quantize Linear layers to INT8
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-streaming-tiny")
inputs = processor(audio_array, return_tensors="pt", sampling_rate=16000)
generated_ids = model.generate(**inputs, max_new_tokens=128)
text = processor.decode(generated_ids[0], skip_special_tokens=True)
```
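
The PyTorch examples above assume `audio_array` is a 16 kHz mono waveform as a 1-D float array (the ONNX example uses random noise as a placeholder). One way to load one, assuming `librosa` is installed and a local `speech.wav` file exists:

```python
import librosa

# Load an audio file and resample it to the 16 kHz mono format the model expects.
audio_array, sr = librosa.load("speech.wav", sr=16000, mono=True)
```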

## ONNX Export Details

- **Encoder**: Exported with `torch.onnx.export(dynamo=True)` to handle vmap-based sliding-window attention masking
- **Decoder**: Separate models for first step (no KV cache) and autoregressive steps (with KV cache)
- **Quantization**: `onnxruntime.quantization.quantize_dynamic` with symmetric INT8, per-channel, reduce_range=True
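
A sketch of how that quantization step might be invoked with `onnxruntime.quantization` (the loop and file names follow the repository layout above; the actual export script is not included in this repo):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic INT8 weight quantization, per-channel, with reduced range.
for name in ["encoder_model", "decoder_model", "decoder_with_past_model"]:
    quantize_dynamic(
        model_input=f"onnx/{name}.onnx",
        model_output=f"onnx_int8/{name}_int8.onnx",
        weight_type=QuantType.QInt8,
        per_channel=True,
        reduce_range=True,
    )
```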

### KV Cache Structure

Each decoder layer produces 4 KV tensors:
- `present_{layer}_self_key` / `present_{layer}_self_value`: Self-attention cache [B, 8, S, 40]
- `present_{layer}_cross_key` / `present_{layer}_cross_value`: Cross-attention cache [B, 8, T, 40]

For `decoder_with_past_model`, feed these back as `past_{layer}_*` inputs.
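
If you need to confirm the exact tensor names and shapes for a given export, inspecting the session is a quick check (path as in the usage example above):

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "onnx_int8/decoder_with_past_model_int8.onnx", providers=["CPUExecutionProvider"]
)

# Every present_* output should map to a past_* input with the same suffix.
for o in sess.get_outputs():
    print("output:", o.name, o.shape)
for i in sess.get_inputs():
    print("input: ", i.name, i.shape)
```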

## Quantization Impact

Based on the [Edge-ASR paper](https://arxiv.org/abs/2507.07877) (Table 14), INT8 quantization on Moonshine Tiny has negligible WER impact:

| Config | Avg WER | Δ WER vs FP32 (absolute) |
|--------|---------|---------|
| FP32 baseline | 12.72% | — |
| **W8-A8 (INT8)** | **12.81%** | **+0.09%** |
| W4-A16 (SpQR) | 13.61% | +0.89% |

INT8 is the sweet spot for Moonshine Tiny — virtually no accuracy loss with ~50% model size reduction.
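
To spot-check accuracy on your own audio, comparing a reference transcript against the model output with a WER metric such as `jiwer` is straightforward (the strings below are placeholders):

```python
import jiwer

reference = "turn on the kitchen lights"   # ground-truth transcript
hypothesis = "turn on the kitchen light"   # model output for the same clip
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```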

## Limitations

- English only
- Optimized for short utterances (streaming chunks of 1-5 seconds)
- ONNX models use external data files (`.onnx.data`) for the FP32 variant
- The decoder uses autoregressive generation, so output latency scales with transcript length

## Citation

```bibtex
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
```

## License

MIT (same as base model)