<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Image tasks with IDEFICS[[image-tasks-with-idefics]]

[[open-in-colab]]

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more. This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can solve image-text tasks with a large multimodal model called IDEFICS.

[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198), a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants - [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b) and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, you can also find a fine-tuned instructed version of the model adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether this approach suits your use case better than fine-tuning specialized models for each individual task.

In this guide, you'll learn how to:
- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model)
- Use IDEFICS for:
  - [Image captioning](#image-captioning)
  - [Prompted image captioning](#prompted-image-captioning)
  - [Few-shot prompting](#few-shot-prompting)
  - [Visual question answering](#visual-question-answering)
  - [Image classification](#image-classification)
  - [Image-guided text generation](#image-guided-text-generation)
- [Run inference in batch mode](#running-inference-in-batch-mode)
- [Run IDEFICS instruct for conversational use](#idefics-instruct-for-conversational-use)

Before you begin, make sure you have all the necessary libraries installed.

```bash
pip install -q bitsandbytes sentencepiece accelerate transformers
```

<Tip>
To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory.
</Tip>
## Loading the model[[loading-the-model]]

Let's start by loading the model's 9 billion parameters checkpoint:

```py
>>> checkpoint = "HuggingFaceM4/idefics-9b"
```

Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint.
The IDEFICS processor wraps a [`LlamaTokenizer`] and IDEFICS image processor into a single processor to take care of preparing text and image inputs for the model.

```py
>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")
```

Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized manner given existing devices.
### Quantized model[[quantized-model]]

If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the processor in 4bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed on the fly while loading.

```py
>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )
```
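To see why quantization makes such a difference, a rough back-of-the-envelope estimate of the weight memory can be computed from the parameter count and bytes per parameter. This is a lower bound only (it ignores activations, the KV cache, and CUDA overhead), and the 9-billion-parameter figure is approximate:

```python
# Rough lower-bound estimate of GPU memory needed for the model weights.
# Actual usage is higher due to activations, KV cache, and CUDA overhead.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9

params = 9e9  # approximate size of the "idefics-9b" checkpoint

print(weight_memory_gb(params, 2))    # bfloat16: 2 bytes/param -> 18.0
print(weight_memory_gb(params, 0.5))  # 4-bit: 0.5 bytes/param -> 4.5
```

This matches the tip above: the non-quantized bfloat16 weights alone need roughly 18GB, hence the 20GB recommendation, while 4-bit quantization brings the weights down to around 4.5GB.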
Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for.
## Image captioning[[image-captioning]]

Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired people navigate through different situations, for instance, explore image content online.

To illustrate the task, get an image to be captioned, e.g.:

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-im-captioning.jpg" alt="Image of a puppy in a flower bed"/>
</div>

Photo by [Hendo Wang](https://unsplash.com/@hendoo).

IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the model, only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token thus creating a caption.

As image input to the model, you can use either an image object (`PIL.Image`) or a url from which the image can be retrieved.
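For instance, instead of a URL you can place a `PIL.Image` object directly in the prompt list. The sketch below assumes Pillow is available (it is a dependency of the image processing pipeline) and uses a synthetic in-memory image rather than a downloaded one:

```python
from PIL import Image

# A prompt is simply a list mixing images and strings; here we build one
# from a synthetic in-memory image instead of fetching a URL.
image = Image.new("RGB", (224, 224), color="red")
prompt = [image, "This is an image of "]

# The same processor call then works for both URLs and PIL images:
# inputs = processor(prompt, return_tensors="pt").to(model.device)
print(type(prompt[0]).__name__, repr(prompt[1]))
```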
```py
>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
```

<Tip>

It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing the `max_new_tokens`: the model will want to generate a new `<image>` or `<fake_token_around_image>` token when there is no image being generated by the model.
You can set it on-the-fly as in this guide, or store it in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide.
</Tip>
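As a minimal sketch of the second option, the ids can be stored once in a `GenerationConfig` and reused across `generate` calls. The token ids below are placeholders; in practice you would obtain them from the processor as shown above:

```python
from transformers import GenerationConfig

# Hypothetical token ids standing in for the real ones, which come from:
# processor.tokenizer(["<image>", "<fake_token_around_image>"],
#                     add_special_tokens=False).input_ids
bad_words_ids = [[32000], [32001]]

generation_config = GenerationConfig(max_new_tokens=10, bad_words_ids=bad_words_ids)

# Reuse across calls instead of repeating the arguments each time:
# model.generate(**inputs, generation_config=generation_config)
print(generation_config.bad_words_ids)
```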
## Prompted image captioning[[prompted-image-captioning]]

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-prompted-im-captioning.jpg" alt="Image of the Eiffel Tower at night"/>
</div>

Photo by [Denys Nevozhai](https://unsplash.com/@dnevozhai).

Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs.

```py
>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.
```
## Few-shot prompting[[few-shot-prompting]]

While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning.
By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.

Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model that in addition to learning what the object in an image is, we would also like to get some interesting information about it.
Then, let's see if we can get the same response format for an image of the Statue of Liberty:

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg" alt="Image of the Statue of Liberty"/>
</div>

Photo by [Juan Mayobre](https://unsplash.com/@jmayobres).

```py
>>> prompt = ["User:",
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...     "User:",
...     "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...     "Describe this image.\nAssistant:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.
```

Notice that just from a single example (i.e. 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g. 3-shot, 5-shot, etc.).
## Visual question answering[[visual-question-answering]]

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.

Let's get a new image for this task:

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-vqa.jpg" alt="Image of a couple having a picnic"/>
</div>

Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos).

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:

```py
>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
```
## Image classification[[image-classification]]

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.

Say, we have this image of a vegetable stand:

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-classification.jpg" alt="Image of a vegetable stand"/>
</div>

Photo by [Peter Wendt](https://unsplash.com/@peterwendt).

We can instruct the model to classify the image into one of the categories that we have:

```py
>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
```

In the example above we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification.
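As one illustrative sketch of rank classification, the instruction can ask for an ordering over the categories instead of a single label. The exact wording below is hypothetical (there is no fixed prompt format), and the image URL is a placeholder for your own input:

```python
categories = ["animals", "vegetables", "city landscape", "cars", "office"]

# Illustrative prompt asking the model to rank categories rather than pick one.
prompt = [
    f"Instruction: Rank the following categories from most to least "
    f"relevant to the image: {categories}.\n",
    "https://images.unsplash.com/photo-1471193945509-9ad0617afabf",  # placeholder image
    "Ranking: ",
]

# The prompt is then processed and generated from exactly as before:
# inputs = processor(prompt, return_tensors="pt").to(model.device)
print(prompt[0])
```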
## Image-guided text generation[[image-guided-text-generation]]

For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful to create descriptions of products, ads, descriptions of a scene, etc.

Let's prompt IDEFICS to write a story based on a simple image of a red door:

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-story-generation.jpg" alt="Image of a red door with a pumpkin on the steps"/>
</div>

Photo by [Craig Tidball](https://unsplash.com/@devonshiremedia).

```py
>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Use the image to write a story.
Story:
Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey. He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway. When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran
```

Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

<Tip>

For longer outputs like this you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies) to learn more.
</Tip>
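As one illustrative combination (not the only reasonable choice), beam search with an n-gram repetition constraint can be configured once and passed to `generate`; the values below are assumptions worth tuning for your own use case:

```python
from transformers import GenerationConfig

# One possible configuration for longer generations; values are illustrative.
story_config = GenerationConfig(
    num_beams=2,             # beam search, as in the story example above
    no_repeat_ngram_size=3,  # discourage verbatim loops like the repeated paragraph
    max_new_tokens=200,
)

# model.generate(**inputs, generation_config=story_config, bad_words_ids=bad_words_ids)
print(story_config.num_beams, story_config.no_repeat_ngram_size)
```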
## Running inference in batch mode[[running-inference-in-batch-mode]]

All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:

```py
>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.
```
## IDEFICS instruct for conversational use[[idefics-instruct-for-conversational-use]]

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: `HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.

The use and prompting for the conversational use is very similar to using the base models:

```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",
...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",
...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(model.device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(model.device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
```