<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Whisper [[whisper]]

## Overview [[overview]]

The Whisper model was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.

The abstract from the paper is the following:

*We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.*

Tips:
- The model usually performs well without any additional fine-tuning.
- The architecture follows a classic encoder-decoder design, so inference relies on the [`~generation.GenerationMixin.generate`] function.
- Inference is currently implemented only for short-form audio, which means the audio must be pre-segmented into segments of at most 30 seconds. Long-form inference, including timestamps, is planned for a future release.
- You can use [`WhisperProcessor`] to prepare audio for the model and to decode the predicted IDs back into text.
- To convert the model and the processor, we recommend using the following:

```bash
python src/transformers/models/whisper/convert_openai_to_hf.py --checkpoint_path "" --pytorch_dump_folder_path "Arthur/whisper-3" --convert_preprocessor True
```

The script automatically determines all necessary parameters from the OpenAI checkpoint. The `tiktoken` library must be installed so that the OpenAI tokenizer can be converted to its `tokenizers` version.

This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
The original code can be found [here](https://github.com/openai/whisper).
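The tips on short-form inference, [`WhisperProcessor`], and `generate` can be combined into a rough sketch like the one below. It is an illustration under stated assumptions, not the reference pipeline: the helper names `split_audio` and `transcribe` are hypothetical, the `openai/whisper-tiny` checkpoint is only an example choice, the input is assumed to be a mono waveform already sampled at 16 kHz, and cutting at fixed 30-second boundaries is a naive strategy that can split a word in two.

```python
from typing import List, Sequence

SAMPLING_RATE = 16_000    # assumed input rate; Whisper's feature extractor defaults to 16 kHz
MAX_SEGMENT_SECONDS = 30  # short-form limit mentioned in the tips


def split_audio(
    samples: Sequence[float],
    sampling_rate: int = SAMPLING_RATE,
    max_seconds: float = MAX_SEGMENT_SECONDS,
) -> List[Sequence[float]]:
    """Cut a mono waveform into consecutive segments of at most `max_seconds`.

    Hypothetical helper: boundaries are fixed, so a word may be cut in half.
    """
    step = int(sampling_rate * max_seconds)  # number of samples per segment
    if step < 1:
        raise ValueError("sampling_rate * max_seconds must cover at least one sample")
    return [samples[start:start + step] for start in range(0, len(samples), step)]


def transcribe(
    samples: Sequence[float],
    sampling_rate: int = SAMPLING_RATE,
    checkpoint: str = "openai/whisper-tiny",  # example checkpoint, an assumption
) -> str:
    """Transcribe a waveform segment by segment (hypothetical helper)."""
    # Imported lazily so that `split_audio` stays usable without transformers installed.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(checkpoint)
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

    texts = []
    for segment in split_audio(samples, sampling_rate):
        # The processor turns raw audio into the log-mel input features the model expects.
        inputs = processor(segment, sampling_rate=sampling_rate, return_tensors="pt")
        # Encoder-decoder inference goes through `generate`.
        predicted_ids = model.generate(inputs.input_features)
        # The same processor decodes the predicted IDs back into text.
        texts.append(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
    return " ".join(text.strip() for text in texts)
```

Calling `transcribe(waveform)` downloads the checkpoint on first use and requires PyTorch, while `split_audio` has no dependencies and can be tried on its own. Audio recorded at another rate would need to be resampled to 16 kHz first, a step this sketch leaves out.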

## WhisperConfig [[whisperconfig]]

[[autodoc]] WhisperConfig

## WhisperTokenizer [[whispertokenizer]]

[[autodoc]] WhisperTokenizer
    - set_prefix_tokens
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## WhisperTokenizerFast [[whispertokenizerfast]]

[[autodoc]] WhisperTokenizerFast
    - set_prefix_tokens
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## WhisperFeatureExtractor [[whisperfeatureextractor]]

[[autodoc]] WhisperFeatureExtractor
    - __call__

## WhisperProcessor [[whisperprocessor]]

[[autodoc]] WhisperProcessor
    - __call__
    - from_pretrained
    - save_pretrained
    - batch_decode
    - decode

## WhisperModel [[whispermodel]]

[[autodoc]] WhisperModel
    - forward
    - _mask_input_features

## WhisperForConditionalGeneration [[whisperforconditionalgeneration]]

[[autodoc]] WhisperForConditionalGeneration
    - forward

## WhisperForAudioClassification [[whisperforaudioclassification]]

[[autodoc]] WhisperForAudioClassification
    - forward

## TFWhisperModel [[tfwhispermodel]]

[[autodoc]] TFWhisperModel
    - call

## TFWhisperForConditionalGeneration [[tfwhisperforconditionalgeneration]]

[[autodoc]] TFWhisperForConditionalGeneration
    - call

## FlaxWhisperModel [[flaxwhispermodel]]

[[autodoc]] FlaxWhisperModel
    - __call__

## FlaxWhisperForConditionalGeneration [[flaxwhisperforconditionalgeneration]]

[[autodoc]] FlaxWhisperForConditionalGeneration
    - __call__

## FlaxWhisperForAudioClassification [[flaxwhisperforaudioclassification]]

[[autodoc]] FlaxWhisperForAudioClassification
    - __call__