Source of audio used to train Whisper

#15

by mahelona - opened Dec 14, 2022

Dec 14, 2022

Aloha,

What are the sources for the audio used in training Whisper? In particular, our team are interested to know where the 1381 hours of te reo Māori and 338 hours of ʻōlelo Hawaiʻi are taken from. Were these just scraped from YouTube?

Thank you,
Keoni.

sanchit-gandhi

Dec 15, 2022 • edited Dec 15, 2022

Hey @mahelona ! These are secrets only OpenAI know... All the knowledge about the training data that's in the public domain can be found in the Whisper paper: https://arxiv.org/pdf/2212.04356.pdf

The trained checkpoints were publicly released (and are thus hosted on the HF Hub), however the dataset remains behind closed doors. I would also very much like to know more details about the training data! But the situation is unlikely to change here.

theothertom

Jan 18, 2024

thanks for the link, that helps
