Improve model card: Update paper link, add GitHub, abstract, and new tags
This PR improves the model card for the Moonshine models by incorporating updated information from the paper "[Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices](https://huggingface.co/papers/2509.02523)" and the canonical GitHub repository.
Key changes include:
- **Metadata Update**: The `arxiv` tag has been replaced with `paper` pointing to the Hugging Face paper link, and new descriptive tags (`tiny-asr`, `edge-ai`, `monolingual`, `speech-to-text`) have been added for better discoverability.
- **Title and Links**: The main title of the model card has been updated to reflect the full paper title. A direct link to the canonical [GitHub repository](https://github.com/moonshine-ai/moonshine) has been added to the top section, and all internal paper links now point to the new Hugging Face paper page. The redundant `[[Installation]]` link has been removed.
- **Abstract Inclusion**: The paper's abstract has been added as a new section to provide a comprehensive overview of the research.
- **Sample Usage Refinement**: The `Usage` section has been updated with a simpler, directly verifiable code snippet from the official GitHub README, ensuring consistency with the authors' provided examples.
- **Citation Update**: The BibTeX citation block has been fully updated to reflect the new paper's title, publication year (2025), and corresponding arXiv identifier and URL.
These changes ensure the model card is up-to-date, more informative, and aligned with best practices for documentation on the Hugging Face Hub.
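The metadata change can be sanity-checked locally before merging. The following is a minimal sketch, not part of the PR: `parse_front_matter` is a hypothetical helper that handles only the flat `key: value` pairs and `- item` lists this card's front matter uses (it is not a general YAML parser), and `CARD` reproduces the updated front matter from this PR's diff.

```python
def parse_front_matter(card_text):
    """Parse a model card's flat front matter into a dict.

    Supports only `key: value` lines and `- item` list entries,
    which is all the metadata block in this PR contains.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta, current_key = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter of the front matter
            break
        if not stripped:
            continue
        if stripped.startswith("- ") and current_key is not None:
            # A list entry belongs to the most recent key.
            if not isinstance(meta[current_key], list):
                meta[current_key] = []
            meta[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            # Split on the first colon only, so URL values stay intact.
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            meta[current_key] = value.strip()
    return meta


CARD = """---
language:
- en
library_name: transformers
license: mit
pipeline_tag: automatic-speech-recognition
paper: https://huggingface.co/papers/2509.02523
tags:
- tiny-asr
- edge-ai
- monolingual
- speech-to-text
---

# Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices
"""

meta = parse_front_matter(CARD)
print(meta["paper"], meta["tags"])
```

A check like this only confirms that the expected keys and tags are present in the text; it does not confirm which keys the Hub itself recognizes.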
@@ -1,57 +1,49 @@
 ---
-license: mit
 language:
 - en
 library_name: transformers
+license: mit
 pipeline_tag: automatic-speech-recognition
-
+paper: https://huggingface.co/papers/2509.02523
+tags:
+- tiny-asr
+- edge-ai
+- monolingual
+- speech-to-text
 ---
-# Moonshine

-
+# Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices
+
+[[Blog]](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/) [[Paper]](https://huggingface.co/papers/2509.02523) [[Code]](https://github.com/moonshine-ai/moonshine) [[Podcast]](https://notebooklm.google.com/notebook/d787d6c2-7d7b-478c-b7d5-a0be4c74ae19/audio)

 This is the model card for running the automatic speech recognition (ASR) models (Moonshine models) trained and released by Useful Sensors.

-Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://
+Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://huggingface.co/papers/2509.02523). Note, a lot of the text has been copied verbatim from the [model card](https://github.com/openai/whisper/blob/main/model-card.md) for the Whisper model developed by OpenAI, because both models serve identical purposes, and carry identical risks.

-##
+## Abstract
+We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.

-
+## Usage

-``
-pip install --upgrade pip
-pip install --upgrade transformers datasets[audio]
-```
+Moonshine models are available on the Hugging Face hub and can be used with the `transformers` library, as follows:

 ```python
-from transformers import MoonshineForConditionalGeneration, AutoProcessor
-from datasets import load_dataset, Audio
 import torch
+from transformers import AutoProcessor, MoonshineForConditionalGeneration
+from datasets import load_dataset

-
-
+processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-tiny")
+model = MoonshineForConditionalGeneration.from_pretrained("UsefulSensors/moonshine-tiny")

-
-
+ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+audio_array = ds[0]["audio"]["array"]

-
-dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
-sample = dataset[0]["audio"]
+inputs = processor(audio_array, return_tensors="pt")

-
-sample["array"],
-return_tensors="pt",
-sampling_rate=processor.feature_extractor.sampling_rate
-)
-inputs = inputs.to(device, torch_dtype)
+generated_ids = model.generate(**inputs)

-
-
-seq_lens = inputs.attention_mask.sum(dim=-1)
-max_length = int((seq_lens * token_limit_factor).max().item())
-
-generated_ids = model.generate(**inputs, max_length=max_length)
-print(processor.decode(generated_ids[0], skip_special_tokens=True))
+transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(transcription)
 ```

 ## Model Details
@@ -73,7 +65,7 @@ Sequence-to-sequence ASR (automatic speech recognition) and speech translation m

 ### Paper & samples

-[Paper](https://
+[Paper](https://huggingface.co/papers/2509.02523) / [Blog](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/)

 ## Model Use

@@ -87,7 +79,7 @@ In particular, we caution against using Moonshine models to transcribe recording

 ## Training Data

-The models are trained on 200,000 hours of audio and the corresponding transcripts collected from the internet, as well as datasets openly available and accessible on HuggingFace. The open datasets used are listed in
+The models are trained on 200,000 hours of audio and the corresponding transcripts collected from the internet, as well as datasets openly available and accessible on HuggingFace. The open datasets used are listed in [the accompanying paper](https://huggingface.co/papers/2509.02523).

 ## Performance and Limitations

@@ -95,7 +87,7 @@ Our evaluations show that, the models exhibit greater accuracy on standard datas

 However, like any machine learning model, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.

-In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. It is likely that this behavior and hallucinations may be worse for short audio segments, or segments where parts of words are cut off at the beginning or the end of the segment.
+In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. It is likely that this behavior and hallucinations may be worse for short audio segments, or segments where parts of words are cut off at the beginning or at the end of the segment.

 ## Broader Implications

@@ -106,13 +98,13 @@ There are also potential dual-use concerns that come with releasing Moonshine. W
 ## Citation
 If you benefit from our work, please cite us:
 ```
-@misc{
-title={Moonshine:
+@misc{jeffries2025flavorsmoonshine,
+title={Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices},
 author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
-year={
-eprint={
+year={2025},
+eprint={2509.02523},
 archivePrefix={arXiv},
 primaryClass={cs.SD},
-url={https://arxiv.org/abs/
+url={https://arxiv.org/abs/2509.02523},
 }
-```
+```
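The usage snippet removed by this PR bounded generation with a `max_length` derived from the input length and a `token_limit_factor`, whose value is not visible in the diff. As a rough sketch of that idea, assuming a caller-chosen `tokens_per_second` rate (a hypothetical parameter, not a value taken from the model card), the bound can be computed from clip lengths alone. The function name `max_length_for_batch` is likewise illustrative.

```python
def max_length_for_batch(num_samples_per_clip, sampling_rate, tokens_per_second):
    """Return a generation-length cap for a batch of audio clips.

    num_samples_per_clip: number of audio samples in each clip.
    sampling_rate: audio samples per second.
    tokens_per_second: assumed upper bound on tokens per second of speech;
        a caller-supplied assumption, not a documented constant.
    """
    if sampling_rate <= 0 or tokens_per_second <= 0:
        raise ValueError("sampling_rate and tokens_per_second must be positive")
    if not num_samples_per_clip:
        raise ValueError("need at least one clip")
    longest = max(num_samples_per_clip)
    # Scale the longest clip's duration by the token rate and truncate,
    # as the removed snippet did; always allow at least one token.
    return max(1, int(longest * tokens_per_second / sampling_rate))


# Illustrative values only: a 4 s and a 10 s clip at 16 kHz, 5 tokens/s.
cap = max_length_for_batch([64_000, 160_000], 16_000, 5.0)
print(cap)
```

The result could then be passed as `model.generate(**inputs, max_length=cap)`, matching the call in the removed snippet. Whether such a cap is still needed with the simplified usage example is a question for the model authors.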