Whisper [[whisper]]

Overview [[overview]]

The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.

The abstract from the paper is the following:

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

Tips:

  • The model usually performs well without any additional fine-tuning.

  • The architecture follows a classic encoder-decoder design, so inference relies on the [~generation.GenerationMixin.generate] function.

  • Inference is currently implemented only for short-form audio: the audio must be pre-segmented into segments of 30 seconds or less. Long-form inference, including timestamps, will be implemented in a future release.

  • You can use [WhisperProcessor] to prepare audio for the model and to decode the predicted IDs back into text.

  • To convert the model and the processor, we recommend using the following:

python src/transformers/models/whisper/convert_openai_to_hf.py --checkpoint_path "" --pytorch_dump_folder_path "Arthur/whisper-3" --convert_preprocessor True

The script automatically determines all required parameters from the OpenAI checkpoint. The tiktoken library must be installed to run the conversion, because it is needed to convert the OpenAI tokenizer to the tokenizers version.

This model was contributed by Arthur Zucker. The Tensorflow version of this model was contributed by amyeroberts. The original code can be found here.
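The tips above (pre-segment the audio, prepare it with [WhisperProcessor], call `generate`, then decode the predicted IDs) can be sketched roughly as follows. This is a minimal illustration and not the library's reference pipeline. It assumes 16 kHz mono input and uses the `openai/whisper-tiny` checkpoint as an example; neither is stated in the text above. `split_into_segments` and `transcribe` are hypothetical helper names introduced here.

```python
from typing import List, Sequence

SAMPLING_RATE = 16_000  # assumption: 16 kHz mono waveform
MAX_SECONDS = 30        # short-form limit mentioned in the tips


def split_into_segments(
    audio: Sequence[float],
    sampling_rate: int = SAMPLING_RATE,
    max_seconds: float = MAX_SECONDS,
) -> List[Sequence[float]]:
    """Cut a waveform into consecutive segments of at most `max_seconds`."""
    step = int(sampling_rate * max_seconds)
    if step <= 0:
        raise ValueError("sampling_rate * max_seconds must be positive")
    return [audio[start:start + step] for start in range(0, len(audio), step)]


def transcribe(audio: Sequence[float], checkpoint: str = "openai/whisper-tiny") -> List[str]:
    """Transcribe a waveform segment by segment (hypothetical helper)."""
    # Imported lazily so the splitting helper works without these packages.
    import numpy as np
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(checkpoint)
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()

    texts: List[str] = []
    for segment in split_into_segments(audio):
        # The processor turns raw audio into log-mel input features.
        inputs = processor(
            np.asarray(segment, dtype=np.float32),
            sampling_rate=SAMPLING_RATE,
            return_tensors="pt",
        )
        with torch.no_grad():
            predicted_ids = model.generate(inputs.input_features)
        # The same processor decodes the predicted IDs back into text.
        texts.extend(processor.batch_decode(predicted_ids, skip_special_tokens=True))
    return texts
```

Calling `transcribe(waveform)` on a one-dimensional float array returns one string per segment. Splitting at fixed boundaries can cut words in half, so a real application may prefer to split on silence.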

WhisperConfig [[whisperconfig]]

[[autodoc]] WhisperConfig

WhisperTokenizer [[whispertokenizer]]

[[autodoc]] WhisperTokenizer - set_prefix_tokens - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

WhisperTokenizerFast [[whispertokenizerfast]]

[[autodoc]] WhisperTokenizerFast - set_prefix_tokens - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

WhisperFeatureExtractor [[whisperfeatureextractor]]

[[autodoc]] WhisperFeatureExtractor - __call__

WhisperProcessor [[whisperprocessor]]

[[autodoc]] WhisperProcessor - __call__ - from_pretrained - save_pretrained - batch_decode - decode

WhisperModel [[whispermodel]]

[[autodoc]] WhisperModel - forward - _mask_input_features

WhisperForConditionalGeneration [[whisperforconditionalgeneration]]

[[autodoc]] WhisperForConditionalGeneration - forward

WhisperForAudioClassification [[whisperforaudioclassification]]

[[autodoc]] WhisperForAudioClassification - forward

TFWhisperModel [[tfwhispermodel]]

[[autodoc]] TFWhisperModel - call

TFWhisperForConditionalGeneration [[tfwhisperforconditionalgeneration]]

[[autodoc]] TFWhisperForConditionalGeneration - call

FlaxWhisperModel [[flaxwhispermodel]]

[[autodoc]] FlaxWhisperModel - __call__

FlaxWhisperForConditionalGeneration [[flaxwhisperforconditionalgeneration]]

[[autodoc]] FlaxWhisperForConditionalGeneration - __call__

FlaxWhisperForAudioClassification [[flaxwhisperforaudioclassification]]

[[autodoc]] FlaxWhisperForAudioClassification - __call__