Improve model card: Update paper link, add GitHub, abstract, and new tags

#5
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +34 -42
README.md CHANGED
@@ -1,57 +1,49 @@
1
  ---
2
- license: mit
3
  language:
4
  - en
5
  library_name: transformers
 
6
  pipeline_tag: automatic-speech-recognition
7
- arxiv: https://arxiv.org/abs/2410.15608
8
  ---
9
- # Moonshine
10
 
11
- [[Blog]](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/) [[Paper]](https://arxiv.org/abs/2410.15608) [[Installation]](https://github.com/usefulsensors/moonshine/blob/main/README.md) [[Podcast]](https://notebooklm.google.com/notebook/d787d6c2-7d7b-478c-b7d5-a0be4c74ae19/audio)
 
 
12
 
13
  This is the model card for running the automatic speech recognition (ASR) models (Moonshine models) trained and released by Useful Sensors.
14
 
15
- Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://arxiv.org/abs/2410.15608). Note, a lot of the text has been copied verbatim from the [model card](https://github.com/openai/whisper/blob/main/model-card.md) for the Whisper model developed by OpenAI, because both models serve identical purposes, and carry identical risks.
16
 
17
- ## Usage
 
18
 
19
- Moonshine is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:
20
 
21
- ```bash
22
- pip install --upgrade pip
23
- pip install --upgrade transformers datasets[audio]
24
- ```
25
 
26
  ```python
27
- from transformers import MoonshineForConditionalGeneration, AutoProcessor
28
- from datasets import load_dataset, Audio
29
  import torch
 
 
30
 
31
- device = "cuda:0" if torch.cuda.is_available() else "cpu"
32
- torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
33
 
34
- model = MoonshineForConditionalGeneration.from_pretrained('UsefulSensors/moonshine-tiny').to(device).to(torch_dtype)
35
- processor = AutoProcessor.from_pretrained('UsefulSensors/moonshine-tiny')
36
 
37
- dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
38
- dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
39
- sample = dataset[0]["audio"]
40
 
41
- inputs = processor(
42
- sample["array"],
43
- return_tensors="pt",
44
- sampling_rate=processor.feature_extractor.sampling_rate
45
- )
46
- inputs = inputs.to(device, torch_dtype)
47
 
48
- # to avoid hallucination loops, we limit the maximum length of the generated text based on the expected number of tokens per second
49
- token_limit_factor = 6.5 / processor.feature_extractor.sampling_rate # Maximum of 6.5 tokens per second
50
- seq_lens = inputs.attention_mask.sum(dim=-1)
51
- max_length = int((seq_lens * token_limit_factor).max().item())
52
-
53
- generated_ids = model.generate(**inputs, max_length=max_length)
54
- print(processor.decode(generated_ids[0], skip_special_tokens=True))
55
  ```
56
 
57
  ## Model Details
@@ -73,7 +65,7 @@ Sequence-to-sequence ASR (automatic speech recognition) and speech translation m
73
 
74
  ### Paper & samples
75
 
76
- [Paper](https://arxiv.org/abs/2410.15608) / [Blog](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/)
77
 
78
  ## Model Use
79
 
@@ -87,7 +79,7 @@ In particular, we caution against using Moonshine models to transcribe recording
87
 
88
  ## Training Data
89
 
90
- The models are trained on 200,000 hours of audio and the corresponding transcripts collected from the internet, as well as datasets openly available and accessible on HuggingFace. The open datasets used are listed in the [the accompanying paper](https://arxiv.org/abs/2410.15608).
91
 
92
  ## Performance and Limitations
93
 
@@ -95,7 +87,7 @@ Our evaluations show that, the models exhibit greater accuracy on standard datas
95
 
96
  However, like any machine learning model, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.
97
 
98
- In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. It is likely that this behavior and hallucinations may be worse for short audio segments, or segments where parts of words are cut off at the beginning or the end of the segment.
99
 
100
  ## Broader Implications
101
 
@@ -106,13 +98,13 @@ There are also potential dual-use concerns that come with releasing Moonshine. W
106
  ## Citation
107
  If you benefit from our work, please cite us:
108
  ```
109
- @misc{jeffries2024moonshinespeechrecognitionlive,
110
- title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
111
  author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
112
- year={2024},
113
- eprint={2410.15608},
114
  archivePrefix={arXiv},
115
  primaryClass={cs.SD},
116
- url={https://arxiv.org/abs/2410.15608},
117
  }
118
- ```
 
1
  ---
 
2
  language:
3
  - en
4
  library_name: transformers
5
+ license: mit
6
  pipeline_tag: automatic-speech-recognition
7
+ paper: https://huggingface.co/papers/2509.02523
8
+ tags:
9
+ - tiny-asr
10
+ - edge-ai
11
+ - monolingual
12
+ - speech-to-text
13
  ---
 
14
 
15
+ # Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices
16
+
17
+ [[Blog]](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/) [[Paper]](https://huggingface.co/papers/2509.02523) [[Code]](https://github.com/moonshine-ai/moonshine) [[Podcast]](https://notebooklm.google.com/notebook/d787d6c2-7d7b-478c-b7d5-a0be4c74ae19/audio)
18
 
19
  This is the model card for running the automatic speech recognition (ASR) models (Moonshine models) trained and released by Useful Sensors.
20
 
21
+ Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://huggingface.co/papers/2509.02523). Note, a lot of the text has been copied verbatim from the [model card](https://github.com/openai/whisper/blob/main/model-card.md) for the Whisper model developed by OpenAI, because both models serve identical purposes, and carry identical risks.
22
 
23
+ ## Abstract
24
+ We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.
25
 
26
+ ## Usage
27
 
28
+ Moonshine models are available on the Hugging Face Hub and can be used with the `transformers` library, as follows:
29
 
30
  ```python
 
 
31
  import torch
32
+ from transformers import AutoProcessor, MoonshineForConditionalGeneration
33
+ from datasets import load_dataset
34
 
35
+ processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-tiny")
36
+ model = MoonshineForConditionalGeneration.from_pretrained("UsefulSensors/moonshine-tiny")
37
 
38
+ ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
39
+ audio_array = ds[0]["audio"]["array"]
40
 
41
+ inputs = processor(audio_array, sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt")
42
 
43
+ generated_ids = model.generate(**inputs)
44
 
45
+ transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
46
+ print(transcription)
47
  ```
48
 
49
  ## Model Details
 
65
 
66
  ### Paper & samples
67
 
68
+ [Paper](https://huggingface.co/papers/2509.02523) / [Blog](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/)
69
 
70
  ## Model Use
71
 
 
79
 
80
  ## Training Data
81
 
82
+ The models are trained on 200,000 hours of audio and the corresponding transcripts collected from the internet, as well as datasets openly available and accessible on HuggingFace. The open datasets used are listed in [the accompanying paper](https://huggingface.co/papers/2509.02523).
83
 
84
  ## Performance and Limitations
85
 
 
87
 
88
  However, like any machine learning model, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.
89
 
90
+ In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. It is likely that this behavior and hallucinations may be worse for short audio segments, or segments where parts of words are cut off at the beginning or at the end of the segment.
91
 
92
  ## Broader Implications
93
 
 
98
  ## Citation
99
  If you benefit from our work, please cite us:
100
  ```
101
+ @misc{jeffries2025flavorsmoonshine,
102
+ title={Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices},
103
  author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
104
+ year={2025},
105
+ eprint={2509.02523},
106
  archivePrefix={arXiv},
107
  primaryClass={cs.SD},
108
+ url={https://arxiv.org/abs/2509.02523},
109
  }
110
+ ```
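
The updated Usage example in this diff drops the `max_length` cap that the previous README used to avoid hallucination loops (at most 6.5 tokens per second of audio). The sketch below restates that cap in plain Python so the arithmetic is easy to check. The function name `max_generation_length` is hypothetical, and the 16 kHz default is an assumption; in practice the rate would come from `processor.feature_extractor.sampling_rate`, as in the removed code.

```python
def max_generation_length(num_samples, sampling_rate=16000, tokens_per_second=6.5):
    """Return a token cap for the longest clip in a batch.

    num_samples: per-clip sample counts, for example the values of
    inputs.attention_mask.sum(dim=-1) converted to a Python list.
    sampling_rate: samples per second (16000 is an assumed default).
    tokens_per_second: expected upper bound on tokens per second of audio
    (6.5 is the figure used in the removed README example).
    """
    longest_seconds = max(num_samples) / sampling_rate
    return int(longest_seconds * tokens_per_second)


# A 10-second clip and a 2.5-second clip, both sampled at 16 kHz.
limit = max_generation_length([160_000, 40_000])
print(limit)  # 65
```

The resulting value would be passed to generation as `model.generate(**inputs, max_length=limit)`, which is how the removed example applied it.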