Improve model card: Update title, intro, links, and add project page (#1)
Commit 38c1883723ed706dad221697352302cb3a929f9a
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
@@ -1,18 +1,19 @@
 ---
-license: other
 language:
-- ko
+- ko
 library_name: transformers
+license: other
 pipeline_tag: automatic-speech-recognition
-arxiv: https://arxiv.org/abs/2509.02523
+arxiv: https://arxiv.org/abs/2509.02523
 ---
-# Moonshine
 
+# Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices
 
-
+[[Paper]](https://huggingface.co/papers/2509.02523) [[Code]](https://github.com/moonshine-ai/moonshine) [[Installation]](https://github.com/usefulsensors/moonshine/blob/main/README.md)
 
-
+This is the model card for running the automatic speech recognition (ASR) models (Moonshine models) trained and released by Moonshine AI (f.k.a Useful Sensors.) This model is part of the **Flavors of Moonshine** suite, tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Moonshine models are optimized for fast and accurate ASR on resource-constrained devices, outperforming comparably sized Whisper Tiny models.
+
+Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://huggingface.co/papers/2509.02523). Note, a lot of the text has been copied verbatim from the [model card](https://github.com/openai/whisper/blob/main/model-card.md) for the Whisper model developed by OpenAI, because both models serve identical purposes, and carry identical risks.
 
 ## Usage
 
@@ -84,7 +85,7 @@ In particular, we caution against using Moonshine models to transcribe recording
 
 ## Training Data
 
-The models are trained on 72,000 hours of audio and the corresponding transcripts collected from the internet, as well as datasets openly available and accessible on HuggingFace. The open datasets used are listed in
+The models are trained on 72,000 hours of audio and the corresponding transcripts collected from the internet, as well as datasets openly available and accessible on HuggingFace. The open datasets used are listed in [the accompanying paper](https://huggingface.co/papers/2509.02523).
 
 ## Performance and Limitations
 
@@ -92,7 +93,7 @@ Our evaluations show that, the models exhibit greater accuracy on standard datas
 
 However, like any machine learning model, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.
 
-In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. It is likely that this behavior and hallucinations may be worse for short audio segments, or segments where parts of words are cut off at the beginning or the end of the segment.
+In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. It is likely that this behavior and hallucinations may be worse for short audio segments, or segments where parts of words are cut off at the beginning or at the end of the segment.
 
 ## Broader Implications
 
@@ -100,6 +101,9 @@ We anticipate that Moonshine models’ transcription capabilities may be used fo
 
 There are also potential dual-use concerns that come with releasing Moonshine. While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capabilities to recognize specific individuals out of the box, which in turn presents safety concerns related both to dual use and disparate performance. In practice, we expect that the cost of transcription is not the limiting factor of scaling up surveillance projects.
 
+## Project Page
+Check out the blog post for more details: https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/
+
 ## Citation
 If you benefit from our work, please cite us:
 
@@ -111,6 +115,6 @@ If you benefit from our work, please cite us:
 eprint={2509.02523},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
-url={https://
+url={https://huggingface.co/papers/2509.02523},
 }
-```
+```
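
The "Performance and Limitations" text touched by this diff says the models are prone to repetitive output, and that beam search and temperature scheduling only partly mitigate it. A caller could therefore add a post-hoc check on the returned transcript. The sketch below is a minimal illustration of such a check, not part of the Moonshine code or the model card; the function name `has_repetition_loop` and both default thresholds are hypothetical choices.

```python
def has_repetition_loop(text: str, ngram_size: int = 3, min_repeats: int = 3) -> bool:
    """Return True if a word n-gram repeats back to back at least min_repeats times.

    Hypothetical helper for illustration only; the thresholds are assumptions,
    not values taken from the Moonshine paper or model card.
    """
    if ngram_size < 1 or min_repeats < 1:
        raise ValueError("ngram_size and min_repeats must be >= 1")

    words = text.split()
    # Try every position where a full n-gram can start.
    for start in range(len(words) - ngram_size + 1):
        ngram = words[start:start + ngram_size]
        repeats = 1
        pos = start + ngram_size
        # Count copies of the same n-gram that follow immediately.
        while words[pos:pos + ngram_size] == ngram:
            repeats += 1
            pos += ngram_size
        if repeats >= min_repeats:
            return True
    return False
```

A transcript that trips this check could be re-decoded with different generation settings or flagged for review. The check only catches exact back-to-back repeats, so it says nothing about other kinds of hallucination.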