---
license: mit
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
---
# Moonshine Streaming
[[Paper]](https://download.moonshine.ai/docs/moonshine_streaming_paper.pdf)

This is the model card for the Moonshine Streaming automatic speech
recognition (ASR) models trained and released by Useful Sensors. Moonshine Streaming
pairs a lightweight 50 Hz audio frontend with a sliding-window Transformer
encoder to deliver low-latency streaming ASR on edge-class hardware. The encoder
uses bounded local attention and no positional embeddings (an "ergodic"
encoder), while an adapter injects positional information before a standard
autoregressive decoder.

This model card follows the recommendations from Model Cards for Model Reporting
(Mitchell et al.). See the paper draft in this repository for full details.
## Usage
Moonshine Streaming is supported in Hugging Face Transformers. Install
Transformers from source together with `datasets`, then run the example, which
uses the standard seq2seq ASR API with the
`usefulsensors/moonshine-streaming-small` checkpoint:
```bash
pip install --upgrade pip
# Quote the requirements so shells such as zsh do not expand "[audio]" as a glob.
pip install --upgrade "git+https://github.com/huggingface/transformers.git#egg=transformers" "datasets[audio]"
```
```python
import torch
from datasets import Audio, load_dataset
from transformers import AutoProcessor, MoonshineStreamingForConditionalGeneration

model_id = "usefulsensors/moonshine-streaming-small"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Use half precision on GPU and full precision on CPU.
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = MoonshineStreamingForConditionalGeneration.from_pretrained(model_id)
model = model.to(device).to(torch_dtype)
processor = AutoProcessor.from_pretrained(model_id)

# Load a small test set and resample it to the model's sampling rate.
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    return_tensors="pt",
    sampling_rate=processor.feature_extractor.sampling_rate,
)
inputs = inputs.to(device, torch_dtype)

# Limit max output length to avoid hallucination loops:
# allow at most 6.5 tokens per second of input audio.
token_limit_factor = 6.5 / processor.feature_extractor.sampling_rate
seq_lens = inputs.attention_mask.sum(dim=-1)
max_length = int((seq_lens * token_limit_factor).max().item())

generated_ids = model.generate(**inputs, max_length=max_length)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```
Note: the current Transformers code path does not yet implement fully efficient
streaming for these models. It uses the flash-attention backend's sliding-window
attention when available.
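
The length cap in the example above amounts to a budget of 6.5 output tokens per second of audio. As a rough illustration, the standalone sketch below reproduces that arithmetic without loading a model. The helper name `max_tokens_for_audio` is hypothetical, and the 16 kHz default is an assumption made for illustration; in practice, read the rate from `processor.feature_extractor.sampling_rate`.

```python
def max_tokens_for_audio(
    num_samples: int,
    sampling_rate: int = 16000,  # assumed default; use the processor's value in practice
    tokens_per_second: float = 6.5,  # factor taken from the usage example
) -> int:
    """Return the generation length cap for a clip of `num_samples` audio samples."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    duration_s = num_samples / sampling_rate
    # Truncate toward zero, as the int(...) conversion in the usage example does.
    return int(duration_s * tokens_per_second)


if __name__ == "__main__":
    for seconds in (1, 5, 10):
        cap = max_tokens_for_audio(seconds * 16000)
        print(f"{seconds:>2} s of audio -> at most {cap} tokens")
```

Under these assumptions a 10-second clip is capped at 65 tokens. Very short clips get a very small budget, so a deployment might want to enforce a minimum; that choice is not specified by this model card.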
## Model Details
### Model type
Sequence-to-sequence ASR model with a streaming, sliding-window Transformer
encoder and an autoregressive Transformer decoder.
### Supported languages
English (trained and evaluated on English datasets).
### Model sizes
| Size   | Parameters | Encoder / Decoder layers | Encoder dim | Decoder dim |
|:------:|:----------:|:------------------------:|:-----------:|:-----------:|
| Tiny   | 34M        | 6 / 6                    | 320         | 320         |
| Small  | 123M       | 10 / 10                  | 620         | 512         |
| Medium | 245M       | 14 / 14                  | 768         | 640         |
### Architecture summary
- Audio frontend: 50 Hz features using simple time-domain operations, CMVN, and
two causal stride-2 convolutions.
- Encoder: sliding-window self-attention with no positional embeddings (ergodic
encoder). Windowing uses $(16,4)$ for the first two and last two layers and
$(16,0)$ for intermediate layers, so each lookahead layer sees 80 ms of
future audio (4 frames at 50 Hz).
- Adapter: adds learned positional embeddings and aligns dimensions before the
decoder.
- Decoder: causal Transformer with RoPE, autoregressively generating text.
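
The encoder windowing described above can be sketched in plain Python. This is an illustrative reading, not the released implementation: it assumes each pair is (left frames, right frames) and that a frame always attends to itself in addition to its left context. The function names are hypothetical.

```python
FRAME_RATE_HZ = 50  # frontend frame rate stated in this model card


def window_for_layer(layer_idx: int, num_layers: int) -> tuple[int, int]:
    """Return the assumed (left, right) window for one encoder layer."""
    # The first two and last two layers look ahead; intermediate layers do not.
    if layer_idx < 2 or layer_idx >= num_layers - 2:
        return (16, 4)
    return (16, 0)


def sliding_window_mask(num_frames: int, left: int, right: int) -> list[list[bool]]:
    """mask[i][j] is True when query frame i may attend to key frame j."""
    return [
        [-left <= j - i <= right for j in range(num_frames)]
        for i in range(num_frames)
    ]


def lookahead_ms(right: int) -> float:
    """Convert a right-context size in frames to milliseconds."""
    return right * 1000 / FRAME_RATE_HZ


if __name__ == "__main__":
    # Six encoder layers, as in the Tiny model.
    print([window_for_layer(i, 6) for i in range(6)])
    print(lookahead_ms(4))  # 4 frames at 50 Hz
    mask = sliding_window_mask(24, 16, 4)
    print(sum(mask[20]))  # frames visible to query frame 20
```

With a right context of 4 frames at 50 Hz, `lookahead_ms(4)` gives the 80 ms figure quoted above. Because every window is bounded, the attention cost per frame stays constant as the audio stream grows.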
## Model Use
### Intended use
These models are intended for low-latency, on-device English speech
transcription on memory- and compute-constrained platforms (roughly
0.1 to 1 TOPS and memory budgets under 1 GB). Typical applications include live
captioning, voice commands, and real-time transcription.
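
To relate the memory budget to the model sizes listed earlier, a back-of-the-envelope estimate of weight storage is parameters times bytes per parameter. The sketch below covers weights only; activations, decoder caches, and runtime overhead are ignored, so real memory use will be higher. The parameter counts come from the model sizes table, while the dtype list is an assumption for illustration.

```python
# Parameter counts from the "Model sizes" table in this card.
PARAMS = {"Tiny": 34e6, "Small": 123e6, "Medium": 245e6}

# Bytes per parameter for two common storage formats (illustrative choice).
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_megabytes(num_params: float, dtype: str) -> float:
    """Approximate weight storage in MB (10**6 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    for name, count in PARAMS.items():
        sizes = ", ".join(
            f"{dtype}: {weight_megabytes(count, dtype):.0f} MB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{name:<6} {sizes}")
```

By this estimate the Medium model needs about 980 MB for float32 weights and about 490 MB for float16, so half precision or quantization leaves more headroom within a budget under 1 GB.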
### Out-of-scope use
These models are not intended for non-consensual surveillance, speaker
identification, or high-stakes decision-making contexts. They have not been
robustly evaluated for tasks outside English ASR.
## Training Data
Moonshine Streaming was trained on roughly 300K hours of speech data. This includes the
original Moonshine training sources (about 200K hours of public web data and
open datasets) plus an additional 100K hours of internally prepared speech
data. See the paper for details and dataset sources.
## Performance and Limitations
### Open ASR benchmark results (WER %)
| Dataset | Tiny (34M) | Small (123M) | Medium (245M) |
|:----------------------|----------:|-------------:|--------------:|
| AMI | 19.03 | 12.54 | 10.68 |
| Earnings-22 | 20.27 | 13.53 | 11.90 |
| GigaSpeech | 13.90 | 10.41 | 9.46 |
| LibriSpeech (clean) | 4.49 | 2.49 | 2.08 |
| LibriSpeech (other) | 12.09 | 6.78 | 5.00 |
| SPGISpeech | 6.16 | 3.19 | 2.58 |
| TED-LIUM | 6.12 | 3.77 | 2.99 |
| VoxPopuli | 14.02 | 9.98 | 8.54 |
| **Average** | **12.01** | **7.84** | **6.65** |
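
The Average row is consistent with an unweighted mean over the eight datasets, where each dataset counts equally regardless of size. The short check below recomputes it from the values in the table; treating the average as unweighted is an inference from the numbers, not something the card states.

```python
# WER (%) per dataset, copied from the table above: (Tiny, Small, Medium).
WER = {
    "AMI": (19.03, 12.54, 10.68),
    "Earnings-22": (20.27, 13.53, 11.90),
    "GigaSpeech": (13.90, 10.41, 9.46),
    "LibriSpeech (clean)": (4.49, 2.49, 2.08),
    "LibriSpeech (other)": (12.09, 6.78, 5.00),
    "SPGISpeech": (6.16, 3.19, 2.58),
    "TED-LIUM": (6.12, 3.77, 2.99),
    "VoxPopuli": (14.02, 9.98, 8.54),
}
MODELS = ("Tiny", "Small", "Medium")


def macro_average(column: int) -> float:
    """Unweighted mean WER across datasets for one model column."""
    values = [row[column] for row in WER.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    for index, name in enumerate(MODELS):
        print(f"{name:<6} {macro_average(index):.2f}")
```

The recomputed means round to 12.01, 7.84, and 6.65, matching the Average row.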
### Known limitations
- The decoder is autoregressive, so full-output latency grows with transcript
length even when time to first token (TTFT) is low.
- The Transformers implementation does not yet perform fully efficient
streaming; it relies on the flash-attention backend for sliding-window
attention.
- Like other seq2seq ASR models, Moonshine Streaming can hallucinate words that are not
present in the audio, and may repeat phrases, especially on short or noisy
segments.
## Broader Implications
Moonshine Streaming enables low-cost, low-latency transcription, which benefits
accessibility and user interaction on edge devices. At the same time, ASR
capabilities can be misused for surveillance or other harmful purposes. Users
should consider consent, privacy, and domain-specific evaluation before
deployment.
## Citation
**TBD**