Available models and languages

There are six model sizes, four of which also have English-only versions, offering different tradeoffs between speed and accuracy. The table below lists the available models with their approximate memory requirements and inference speed relative to the large model. The relative speeds were measured by transcribing English speech on an A100; real-world speed may vary significantly depending on many factors, including the language, the speaking speed, and the available hardware.

| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~10x |
| base | 74 M | `base.en` | `base` | ~1 GB | ~7x |
| small | 244 M | `small.en` | `small` | ~2 GB | ~4x |
| medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x |
| large | 1550 M | N/A | `large` | ~10 GB | 1x |
| turbo | 809 M | N/A | `turbo` | ~6 GB | ~8x |
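As a rough illustration of how the table can guide model choice, the sketch below picks a model name for a given VRAM budget. The helper `pick_model` is hypothetical and not part of any library, and using parameter count as a stand-in for accuracy is an assumption; the figures are copied from the table and are approximate.

```python
from typing import Optional

# (name, parameters in millions, approx. required VRAM in GB), copied from the table.
MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
    ("turbo", 809, 6),
]

# Sizes that also ship an English-only ".en" variant, per the table.
HAS_ENGLISH_ONLY = {"tiny", "base", "small", "medium"}


def pick_model(vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Hypothetical helper: return the model with the most parameters that
    fits in `vram_gb`, or None if nothing fits.

    Assumption: more parameters roughly means better accuracy. With
    english_only=True, only sizes that have a ".en" variant are considered.
    """
    candidates = [
        (params, name)
        for name, params, vram in MODELS
        if vram <= vram_gb and (not english_only or name in HAS_ENGLISH_ONLY)
    ]
    if not candidates:
        return None
    _, name = max(candidates)  # largest parameter count wins
    return f"{name}.en" if english_only else name
```

With these numbers, a 6 GB budget selects `turbo`, while the same budget restricted to English-only variants selects `medium.en`. Assuming the `openai-whisper` Python package is in use, the returned name could then be passed to `whisper.load_model`.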

Supported Languages

English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Polish, Catalan, Dutch, Arabic, Swedish, Italian, Indonesian, Hindi, Finnish, Vietnamese, Hebrew, Ukrainian, Greek, Malay, Czech, Romanian, Danish, Hungarian, Tamil, Norwegian, Thai, Urdu, Croatian, Bulgarian, Lithuanian, Latin, Māori, Malayalam, Welsh, Slovak, Telugu, Persian, Latvian, Bengali, Serbian, Azerbaijani, Slovenian, Kannada, Estonian, Macedonian, Breton, Basque, Icelandic, Armenian, Nepali, Mongolian, Bosnian, Kazakh, Albanian, Swahili, Galician, Marathi, Panjabi, Sinhala, Khmer, Shona, Yoruba, Somali, Afrikaans, Occitan, Georgian, Belarusian, Tajik, Sindhi, Gujarati, Amharic, Yiddish, Lao, Uzbek, Faroese, Haitian, Pashto, Turkmen, Norwegian Nynorsk, Maltese, Sanskrit, Luxembourgish, Burmese, Tibetan, Tagalog, Malagasy, Assamese, Tatar, Hawaiian, Lingala, Hausa, Bashkir, Javanese, Sundanese

Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of the large-v3 and large-v2 models by language, using WERs (word error rates) or CERs (character error rates, shown in italics) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics for the other models and datasets can be found in Appendices D.1, D.2, and D.4 of the paper, and BLEU (Bilingual Evaluation Understudy) scores for translation are in Appendix D.3.
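For readers unfamiliar with these metrics, here is a minimal sketch of how WER and CER are commonly computed: the edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the reference length. The function names are illustrative only. This is not the evaluation code behind the reported numbers, and it omits the text normalization that published evaluations typically apply before scoring.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

CER is generally preferred for languages where word boundaries are not marked by spaces, since whitespace splitting would not yield meaningful words there.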

Paper for hik63382/Video_Audio_Scribe

Robust Speech Recognition via Large-Scale Weak Supervision

arXiv:2212.04356, published Dec 6, 2022