---
language:
- en
library_name: whisper
tags:
- translation
- speech
- audio
- automatic-speech-recognition
datasets:
- whisper
metrics:
- WER
license: mit
---
This model was forked from the original [OpenAI whisper model](https://github.com/openai/whisper).

# Whisper

## Model

Whisper is a multilingual speech-to-text model.
It takes raw audio recordings in many languages and outputs transcriptions either in the language of origin or translated into English.
The model first converts the audio to spectrograms, then uses an autoregressive transformer to decode the speech into text.
Here is an overview of the architecture:



For more information on the technical implementation, consult the [paper](https://cdn.openai.com/papers/whisper.pdf).
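As a rough illustration of the first stage of the pipeline, the sketch below computes a simplified log-Mel spectrogram with NumPy. The frame, hop, and filter-bank sizes are assumptions chosen for illustration, not the model's exact preprocessing; see the paper and source repository for the real values.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Toy log-Mel spectrogram: windowed STFT power projected onto a Mel filter bank."""
    # Frame the signal and apply a Hann window before the FFT.
    window = np.hanning(n_fft)
    frames = [audio[i:i + n_fft] * window
              for i in range(0, len(audio) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)

    # Triangular Mel filter bank (simplified construction).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    mel = power @ fbank.T
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 80): 98 frames, 80 Mel bands
```

The transformer then consumes these spectrogram frames as its input sequence and decodes text tokens autoregressively.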

## Training Data

The model was trained on 680,000 hours of audio and associated transcripts collected from the internet.
The majority of the audio is in English (~65%), while the remainder is in other languages.
A total of 98 different languages are represented in the dataset.



## Model Variations

OpenAI has released 9 different versions of the model, trained either on English-only audio or on multilingual data.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~32x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~16x           |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~6x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |
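As a small illustration of reading the table above, the hypothetical helper below picks the largest checkpoint that fits a given VRAM budget. The VRAM figures are copied from the table; the function itself is not part of any Whisper API.

```python
# Approximate VRAM needs in GB per checkpoint, copied from the table above.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
ORDER = ["tiny", "base", "small", "medium", "large"]

def pick_checkpoint(vram_gb, english_only=False):
    """Return the name of the largest checkpoint fitting in `vram_gb` GB of VRAM."""
    fitting = [size for size in ORDER if VRAM_GB[size] <= vram_gb]
    if not fitting:
        raise ValueError(f"no Whisper checkpoint fits in {vram_gb} GB")
    size = fitting[-1]
    # English-only variants carry a `.en` suffix; `large` has no English-only variant.
    if english_only and size != "large":
        return size + ".en"
    return size

print(pick_checkpoint(6))                     # medium
print(pick_checkpoint(6, english_only=True))  # medium.en
```

The usual trade-off applies: smaller checkpoints run faster and need less memory, while larger ones transcribe more accurately.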

## Limitations and bias

In the [paper](https://cdn.openai.com/papers/whisper.pdf), the authors find a direct correlation between performance on a given language and the amount of data for that language in the training set.
As such, languages that are under-represented in the scraped dataset perform less well with Whisper.
Because English is much more prevalent than other languages, the model will likely perform best on English.
This is shown in the following figure, where a lower word error rate (WER) indicates better performance:


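WER is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch (the example sentences are made up):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```

One substitution out of four reference words gives a WER of 0.25; a perfect transcription scores 0.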