<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Image tasks with IDEFICS [[image-tasks-with-idefics]]

[[open-in-colab]]

Individual tasks can be handled by fine-tuning specialized models, but an approach that has recently emerged and gained popularity is to use one large model for a diverse set of tasks without any fine-tuning. For example, large language models can handle natural language processing (NLP) tasks such as summarization, translation, and classification. This approach is no longer limited to a single modality such as text, and this guide shows how to solve image-text tasks with a large multimodal model called IDEFICS.

[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198), a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and more. IDEFICS comes in two variants, [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b) and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, there is also a fine-tuned instructed version adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, because it is a large model, it requires significant computational resources and infrastructure. It is up to you to decide whether using the model as-is suits your use case better than fine-tuning a specialized model for each individual task.

In this guide, you'll learn how to:

- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model)
- Use IDEFICS for:
  - [Image captioning](#image-captioning)
  - [Prompted image captioning](#prompted-image-captioning)
  - [Few-shot prompting](#few-shot-prompting)
  - [Visual question answering](#visual-question-answering)
  - [Image classification](#image-classification)
  - [Image-guided text generation](#image-guided-text-generation)
- [Run inference in batch mode](#running-inference-in-batch-mode)
- [Run IDEFICS instruct for conversational use](#idefics-instruct-for-conversational-use)

Before you begin, make sure you have all the necessary libraries installed.

```bash
pip install -q bitsandbytes sentencepiece accelerate transformers
```

<Tip>
To run the following examples with a non-quantized version of the model checkpoint, you will need at least 20GB of GPU memory.
</Tip>

## Loading the model[[loading-the-model]]

Let's start by loading the 9 billion parameter checkpoint of the model:

```py
>>> checkpoint = "HuggingFaceM4/idefics-9b"
```

Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint.
The IDEFICS processor wraps a [`LlamaTokenizer`] and the IDEFICS image processor into a single processor that prepares text and image inputs for the model.

```py
>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
```

Setting `device_map` to `"auto"` automatically determines how to load and store the model weights in the most optimized manner given the available devices.

### Quantized model[[quantized-model]]

If access to a high-memory GPU is an issue, you can load a quantized version of the model. To load the model in 4-bit precision (the processor is loaded as usual), pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed on the fly while loading.

```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )
```
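
To get a feel for why 4-bit loading helps, here is a rough back-of-the-envelope sketch of the memory taken by the weights alone. It assumes a nominal 9 billion parameters and ignores activations, quantization metadata, and any layers kept in higher precision, so treat the numbers as a lower bound rather than a measurement. The helper name `estimate_weight_memory_gb` is made up for this illustration.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


NUM_PARAMS = 9e9  # assumption: nominal size of the 9 billion parameter checkpoint

for label, bits in [("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{estimate_weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The bfloat16 estimate of roughly 18 GB is in line with the 20GB recommendation for the non-quantized checkpoint, while the 4-bit weights need about a quarter of that.
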

Now that you have the model loaded in one of the suggested ways, let's explore the tasks you can use IDEFICS for.

## Image captioning[[image-captioning]]

Image captioning is the task of predicting a caption for a given image. A common application is helping visually impaired people navigate different situations, for instance by describing image content they encounter online.

To illustrate the task, get an image to be captioned, for example:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-im-captioning.jpg" alt="Image of a puppy in a flower bed"/>
</div>

Photo by [Hendo Wang](https://unsplash.com/@hendoo).

IDEFICS accepts both text and image prompts. However, to caption an image you do not have to provide a text prompt; the preprocessed input image is enough. Without a text prompt, the model starts generating text from the BOS (beginning-of-sequence) token, which produces a caption.

As image input to the model, you can use either an image object (`PIL.Image`) or a URL from which the image can be retrieved.

```py
>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
```

<Tip>

It is a good idea to include `bad_words_ids` in the call to `generate` to avoid errors that can arise when you increase `max_new_tokens`. Without it, the model may try to emit a new `<image>` or `<fake_token_around_image>` token even though it does not generate images.
You can set `bad_words_ids` on the fly as a call argument, as in this guide, or store it in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide.

</Tip>
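
Conceptually, banning a single-token entry means its score is pushed to negative infinity before the next token is chosen, so it can never be selected. The toy sketch below shows that idea in plain Python on a made-up five-token vocabulary. It is an illustration only, not the library's implementation, and `mask_banned_tokens` is a hypothetical helper name.

```python
import math


def mask_banned_tokens(scores, banned_token_ids):
    """Return a copy of `scores` where every banned token id is set to -inf."""
    banned = set(banned_token_ids)
    return [-math.inf if i in banned else s for i, s in enumerate(scores)]


# Toy scores for a vocabulary of five tokens; ids 3 and 4 stand in for
# the two image-related tokens (an assumption made for this example).
scores = [0.1, 2.0, -1.0, 3.5, 0.7]
masked = mask_banned_tokens(scores, banned_token_ids=[3, 4])

# Greedy choice over the masked scores: token 3 had the highest raw score,
# but after masking the best remaining candidate is token 1.
best = max(range(len(masked)), key=masked.__getitem__)
print(best)
```
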

## Prompted image captioning[[prompted-image-captioning]]

You can extend image captioning by providing a text prompt, which the model then continues based on the given image. Let's take another image to illustrate:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-prompted-im-captioning.jpg" alt="Image of the Eiffel Tower at night"/>
</div>

Photo by [Denys Nevozhai](https://unsplash.com/@dnevozhai).

Text and image prompts can be passed to the model's processor as a single list to create appropriate inputs.

```py
>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.
```

## Few-shot prompting[[few-shot-prompting]]

While IDEFICS demonstrates great zero-shot results, your task may require a certain caption format, or come with other restrictions or requirements that increase its complexity. In such cases, few-shot prompting can be used to enable in-context learning.
By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.

Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that shows the model that, in addition to identifying the object in an image, we would also like some interesting information about it.
Then, let's see whether we can get the same response format for an image of the Statue of Liberty:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg" alt="Image of the Statue of Liberty"/>
</div>

Photo by [Juan Mayobre](https://unsplash.com/@jmayobres).

```py
>>> prompt = ["User:",
...           "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...           "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...           "User:",
...           "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...           "Describe this image.\nAssistant:"
...           ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.
```

Notice that the model has learned how to perform the task from just a single example (that is, 1-shot). For more complex tasks, feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, and so on).
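
When you experiment with more demonstrations, assembling the interleaved list by hand gets tedious. The sketch below builds a prompt in the same `User:` / `Assistant:` layout used above from a list of (image, answer) pairs. The helper name `build_few_shot_prompt` and the `example.com` URLs are placeholders invented for this illustration.

```python
def build_few_shot_prompt(examples, query_image, instruction="Describe this image."):
    """Interleave (image, answer) demonstrations, then append the unanswered query."""
    prompt = []
    for image, answer in examples:
        prompt += ["User:", image, f"{instruction}\nAssistant: {answer}\n"]
    # The final turn is left open so the model writes the answer.
    prompt += ["User:", query_image, f"{instruction}\nAssistant:"]
    return prompt


# Hypothetical placeholder URLs and answers; substitute your own.
examples = [
    ("https://example.com/landmark_1.jpg", "An image of a landmark. Fun fact: ..."),
    ("https://example.com/landmark_2.jpg", "An image of another landmark. Fun fact: ..."),
]
prompt = build_few_shot_prompt(examples, "https://example.com/query.jpg")
print(len(prompt))  # 3 entries per demonstration plus 3 for the query
```

The resulting list can be passed to the processor in the same way as the hand-written prompt.
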

## Visual question answering[[visual-question-answering]]

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Like image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.

Let's get a new image for this task:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-vqa.jpg" alt="Image of a couple having a picnic"/>
</div>

Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos).

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:

```py
>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
```

## Image classification[[image-classification]]

IDEFICS can classify images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories, the model uses its image and text understanding capabilities to infer which category the image most likely belongs to.

Say we have this image of a vegetable stand:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-classification.jpg" alt="Image of a vegetable stand"/>
</div>

Photo by [Peter Wendt](https://unsplash.com/@peterwendt).

We can instruct the model to classify the image into one of our categories:

```py
>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
```
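
The decoded text repeats the prompt, and the model may capitalize the label differently from your list (`Vegetables` versus `vegetables` above). One possible way to turn the output into a label from your list is sketched below. `match_category` is a hypothetical helper, and the prefix-matching rule is an assumption that may need adjusting for categories that share a prefix.

```python
def match_category(generated_text, categories, marker="Category:"):
    """Map the text after the last `marker` onto one of `categories`, or None."""
    answer = generated_text.rpartition(marker)[2].strip().lower()
    for category in categories:
        if answer.startswith(category.lower()):
            return category
    return None


categories = ["animals", "vegetables", "city landscape", "cars", "office"]
text = (
    "Instruction: Classify the following image into a single category "
    f"from the following list: {categories}.\nCategory: Vegetables"
)
print(match_category(text, categories))  # vegetables
```

Returning `None` for an unmatched answer makes it easy to spot outputs that fall outside the list.
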

In the example above, we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification.

## Image-guided text generation[[image-guided-text-generation]]

For more creative applications, you can use image-guided text generation to produce text based on an image. This can be useful for creating product descriptions, ad copy, scene descriptions, and similar content.

As a simple example, let's prompt IDEFICS to write a story based on an image of a red door:

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-story-generation.jpg" alt="Image of a red door with a pumpkin on the steps"/>
</div>

Photo by [Craig Tidball](https://unsplash.com/@devonshiremedia).

```py
>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Use the image to write a story.
 Story:
Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey. He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran
```

It looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

<Tip>

For longer outputs like this one, it is worth tuning the text generation strategy, which can significantly improve the quality of the generated output. See [Text generation strategies](../generation_strategies) to learn more.

</Tip>

## Running inference in batch mode[[running-inference-in-batch-mode]]

All of the earlier sections illustrated IDEFICS on a single example. In a very similar fashion, you can run inference on a batch of examples by passing a list of prompts:

```py
>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.
```
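
If you have more images than fit comfortably in one batch, one option is to build the prompts programmatically and process them in smaller chunks. The sketch below shows only the plain-Python bookkeeping. `build_caption_batch` and `chunked` are hypothetical helpers, the `example.com` URLs are placeholders, and `batch_size` is assumed to be a positive integer.

```python
def build_caption_batch(image_urls, text="This is an image of "):
    """One [image, text] prompt per image, mirroring the batch layout above."""
    return [[url, text] for url in image_urls]


def chunked(prompts, batch_size):
    """Yield consecutive slices of at most `batch_size` prompts."""
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]


# Hypothetical placeholder URLs, for illustration only.
urls = [f"https://example.com/image_{i}.jpg" for i in range(5)]
batches = list(chunked(build_caption_batch(urls), batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each chunk can then be passed to the processor and `generate` in turn, exactly as the full `prompts` list is above.
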

## IDEFICS instruct for conversational use[[idefics-instruct-for-conversational-use]]

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: `HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts downstream performance while making the models more usable in conversational settings.

Usage and prompting for conversational use are very similar to those of the base models:

```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",

...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",

...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
```
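
For longer dialogues, it can help to generate the interleaved prompt from a list of turns instead of writing it by hand. The sketch below is one possible way to do that. `build_chat_prompt` is a hypothetical helper, and it makes one simplifying assumption: an image always comes before the text of its turn, whereas the first turn in the example above places the question before the image. Every utterance is closed with `<end_of_utterance>` and the prompt ends with an open `Assistant:` turn.

```python
END = "<end_of_utterance>"


def build_chat_prompt(turns):
    """Build an interleaved prompt from (role, text, image_or_None) turns.

    The result ends with an open Assistant turn for the model to complete.
    """
    prompt = []
    for index, (role, text, image) in enumerate(turns):
        prefix = "" if index == 0 else "\n"
        if image is None:
            prompt.append(f"{prefix}{role}: {text}{END}")
        else:
            prompt += [f"{prefix}{role}:", image, f"{text}{END}"]
    prompt.append("\nAssistant:")
    return prompt


# Hypothetical placeholder URLs; substitute real image URLs or PIL images.
turns = [
    ("User", "What is in this image?", "https://example.com/first.jpg"),
    ("Assistant", "This picture depicts a small white dog.", None),
    ("User", "And who is that?", "https://example.com/second.jpg"),
]
chat_prompt = build_chat_prompt(turns)
print(chat_prompt[-1])
```

Because the end-of-utterance tokens are already written into the prompt, such a list would be passed to the processor with `add_end_of_utterance_token=False`, as in the batched call above.
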