<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
โš ๏ธ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Image tasks with IDEFICS[[image-tasks-with-idefics]]
[[open-in-colab]]
๊ฐœ๋ณ„ ์ž‘์—…์€ ํŠนํ™”๋œ ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ•˜์—ฌ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์ตœ๊ทผ ๋“ฑ์žฅํ•˜์—ฌ ์ธ๊ธฐ๋ฅผ ์–ป๊ณ  ์žˆ๋Š” ๋ฐฉ์‹์€ ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ • ์—†์ด ๋‹ค์–‘ํ•œ ์ž‘์—…์— ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ์€ ์š”์•ฝ, ๋ฒˆ์—ญ, ๋ถ„๋ฅ˜ ๋“ฑ๊ณผ ๊ฐ™์€ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ (NLP) ์ž‘์—…์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ํ…์ŠคํŠธ์™€ ๊ฐ™์€ ๋‹จ์ผ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ์— ๊ตญํ•œ๋˜์ง€ ์•Š์œผ๋ฉฐ, ์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” IDEFICS๋ผ๋Š” ๋Œ€๊ทœ๋ชจ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์ž‘์—…์„ ๋‹ค๋ฃจ๋Š” ๋ฐฉ๋ฒ•์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.
[IDEFICS](../model_doc/idefics)๋Š” [Flamingo](https://huggingface.co/papers/2204.14198)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ์˜คํ”ˆ ์•ก์„ธ์Šค ๋น„์ „ ๋ฐ ์–ธ์–ด ๋ชจ๋ธ๋กœ, DeepMind์—์„œ ์ฒ˜์Œ ๊ฐœ๋ฐœํ•œ ์ตœ์‹  ์‹œ๊ฐ ์–ธ์–ด ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ ์ž„์˜์˜ ์ด๋ฏธ์ง€ ๋ฐ ํ…์ŠคํŠธ ์ž…๋ ฅ ์‹œํ€€์Šค๋ฅผ ๋ฐ›์•„ ์ผ๊ด€์„ฑ ์žˆ๋Š” ํ…์ŠคํŠธ๋ฅผ ์ถœ๋ ฅ์œผ๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์งˆ๋ฌธ์— ๋‹ต๋ณ€ํ•˜๊ณ , ์‹œ๊ฐ์ ์ธ ๋‚ด์šฉ์„ ์„ค๋ช…ํ•˜๋ฉฐ, ์—ฌ๋Ÿฌ ์ด๋ฏธ์ง€์— ๊ธฐ๋ฐ˜ํ•œ ์ด์•ผ๊ธฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋“ฑ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. IDEFICS๋Š” [800์–ต ํŒŒ๋ผ๋ฏธํ„ฐ](https://huggingface.co/HuggingFaceM4/idefics-80b)์™€ [90์–ต ํŒŒ๋ผ๋ฏธํ„ฐ](https://huggingface.co/HuggingFaceM4/idefics-9b) ๋‘ ๊ฐ€์ง€ ๋ฒ„์ „์„ ์ œ๊ณตํ•˜๋ฉฐ, ๋‘ ๋ฒ„์ „ ๋ชจ๋‘ ๐Ÿค— Hub์—์„œ ์ด์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ ๋ฒ„์ „์—๋Š” ๋Œ€ํ™”ํ˜• ์‚ฌ์šฉ ์‚ฌ๋ก€์— ๋งž๊ฒŒ ๋ฏธ์„ธ ์กฐ์ •๋œ ๋ฒ„์ „๋„ ์žˆ์Šต๋‹ˆ๋‹ค.
์ด ๋ชจ๋ธ์€ ๋งค์šฐ ๋‹ค์žฌ๋‹ค๋Šฅํ•˜๋ฉฐ ๊ด‘๋ฒ”์œ„ํ•œ ์ด๋ฏธ์ง€ ๋ฐ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ž‘์—…์— ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์ด๊ธฐ ๋•Œ๋ฌธ์— ์ƒ๋‹นํ•œ ์ปดํ“จํŒ… ์ž์›๊ณผ ์ธํ”„๋ผ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๊ฐœ๋ณ„ ์ž‘์—…์— ํŠนํ™”๋œ ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ๋ชจ๋ธ์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ๋” ์ ํ•ฉํ•œ์ง€๋Š” ์‚ฌ์šฉ์ž๊ฐ€ ํŒ๋‹จํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” ๋‹ค์Œ์„ ๋ฐฐ์šฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:
- [IDEFICS ๋กœ๋“œํ•˜๊ธฐ](#loading-the-model) ๋ฐ [์–‘์žํ™”๋œ ๋ฒ„์ „์˜ ๋ชจ๋ธ ๋กœ๋“œํ•˜๊ธฐ](#quantized-model)
- IDEFICS๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ:
- [์ด๋ฏธ์ง€ ์บก์…”๋‹](#image-captioning)
- [ํ”„๋กฌํ”„ํŠธ ์ด๋ฏธ์ง€ ์บก์…”๋‹](#prompted-image-captioning)
- [ํ“จ์ƒท ํ”„๋กฌํ”„ํŠธ](#few-shot-prompting)
- [์‹œ๊ฐ์  ์งˆ์˜ ์‘๋‹ต](#visual-question-answering)
- [์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜](#image-classification)
- [์ด๋ฏธ์ง€ ๊ธฐ๋ฐ˜ ํ…์ŠคํŠธ ์ƒ์„ฑ](#image-guided-text-generation)
- [๋ฐฐ์น˜ ๋ชจ๋“œ์—์„œ ์ถ”๋ก  ์‹คํ–‰](#running-inference-in-batch-mode)
- [๋Œ€ํ™”ํ˜• ์‚ฌ์šฉ์„ ์œ„ํ•œ IDEFICS ์ธ์ŠคํŠธ๋ŸญํŠธ ์‹คํ–‰](#idefics-instruct-for-conversational-use)
์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ํ•„์š”ํ•œ ๋ชจ๋“  ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์„ค์น˜๋˜์–ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”.
```bash
pip install -q bitsandbytes sentencepiece accelerate transformers
```
<Tip>
๋‹ค์Œ ์˜ˆ์ œ๋ฅผ ๋น„์–‘์žํ™”๋œ ๋ฒ„์ „์˜ ๋ชจ๋ธ ์ฒดํฌํฌ์ธํŠธ๋กœ ์‹คํ–‰ํ•˜๋ ค๋ฉด ์ตœ์†Œ 20GB์˜ GPU ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
</Tip>
## ๋ชจ๋ธ ๋กœ๋“œ[[loading-the-model]]
๋ชจ๋ธ์„ 90์–ต ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฒ„์ „์˜ ์ฒดํฌํฌ์ธํŠธ๋กœ ๋กœ๋“œํ•ด ๋ด…์‹œ๋‹ค:
```py
>>> checkpoint = "HuggingFaceM4/idefics-9b"
```
Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint.
The IDEFICS processor wraps a [`LlamaTokenizer`] and the IDEFICS image processor into a single processor that takes care of preparing text and image inputs for the model.
```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")
```
Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized manner given the devices you have available.
### Quantized model[[quantized-model]]
If high-memory GPU availability is an issue, you can load a quantized version of the model. To load the model and processor in 4-bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method, and the model will be compressed on the fly while loading.
```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig
>>> quantization_config = BitsAndBytesConfig(
... load_in_4bit=True,
... bnb_4bit_compute_dtype=torch.float16,
... )
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> model = IdeficsForVisionText2Text.from_pretrained(
... checkpoint,
... quantization_config=quantization_config,
... device_map="auto"
... )
```
์ด์ œ ๋ชจ๋ธ์„ ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ ๋กœ๋“œํ–ˆ์œผ๋‹ˆ, IDEFICS๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ž‘์—…๋“ค์„ ํƒ๊ตฌํ•ด๋ด…์‹œ๋‹ค.
## ์ด๋ฏธ์ง€ ์บก์…”๋‹[[image-captioning]]
์ด๋ฏธ์ง€ ์บก์…”๋‹์€ ์ฃผ์–ด์ง„ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์บก์…˜์„ ์˜ˆ์ธกํ•˜๋Š” ์ž‘์—…์ž…๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์ธ ์‘์šฉ ๋ถ„์•ผ๋Š” ์‹œ๊ฐ ์žฅ์• ์ธ์ด ๋‹ค์–‘ํ•œ ์ƒํ™ฉ์„ ํƒ์ƒ‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์˜จ๋ผ์ธ์—์„œ ์ด๋ฏธ์ง€ ์ฝ˜ํ…์ธ ๋ฅผ ํƒ์ƒ‰ํ•˜๋Š” ๋ฐ ๋„์›€์„ ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์ž‘์—…์„ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด ์บก์…˜์„ ๋‹ฌ ์ด๋ฏธ์ง€ ์˜ˆ์‹œ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ์˜ˆ์‹œ:
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-im-captioning.jpg" alt="Image of a puppy in a flower bed"/>
</div>
Photo by [Hendo Wang](https://unsplash.com/@hendoo).
IDEFICS accepts both text and image prompts. However, to caption an image you do not have to provide a text prompt to the model; only the preprocessed input image is needed. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.
As image input to the model, you can use either an image object (`PIL.Image`) or a URL from which the image can be retrieved.
```py
>>> prompt = [
... "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]
>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
```
<Tip>
It is a good idea to include `bad_words_ids` in the call to `generate` to avoid errors that can arise when increasing `max_new_tokens`: the model will try to generate a new `<image>` or `<fake_token_around_image>` token when no image is being produced by the model.
You can set it on the fly as in this guide, or store it in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide.
</Tip>
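As mentioned above, these banned tokens can also be stored in a `GenerationConfig` once instead of being passed on every call. A minimal configuration sketch, assuming the `processor` and `model` objects from the loading section (illustrative, not verified against a running model):

```python
from transformers import GenerationConfig

# Assumes `processor` and `model` are already loaded as shown earlier.
bad_words_ids = processor.tokenizer(
    ["<image>", "<fake_token_around_image>"], add_special_tokens=False
).input_ids

# Store the banned tokens (and any defaults) on the model once.
model.generation_config = GenerationConfig(
    max_new_tokens=10,
    bad_words_ids=bad_words_ids,
)

# Subsequent calls can then omit these arguments:
# generated_ids = model.generate(**inputs)
```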
## Prompted image captioning[[prompted-image-captioning]]
You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-prompted-im-captioning.jpg" alt="Image of the Eiffel Tower at night"/>
</div>
Photo by [Denys Nevozhai](https://unsplash.com/@dnevozhai).
Textual and image prompts can be passed to the model's processor as a single list to create the appropriate inputs.
```py
>>> prompt = [
... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
... "This is an image of ",
... ]
>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.
```
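Note that the decoded output echoes the text prompt. A small post-processing helper (hypothetical, not part of the Transformers API) can isolate just the model's continuation:

```python
def strip_prompt(generated: str, prompt_text: str) -> str:
    """Return only the model's continuation, dropping the echoed text prompt."""
    if generated.startswith(prompt_text):
        return generated[len(prompt_text):]
    return generated

# Using the decoded output from the example above:
caption = strip_prompt(
    "This is an image of the Eiffel Tower in Paris, France.",
    "This is an image of ",
)
print(caption)  # the Eiffel Tower in Paris, France.
```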
## Few-shot prompting[[few-shot-prompting]]
While IDEFICS demonstrates great zero-shot results, your task may require a certain format for the caption, or come with other restrictions or requirements that increase the task's complexity. In such cases, few-shot prompting can be used to enable in-context learning.
By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.
Let's use the previous image of the Eiffel Tower as an example for the model, and build a prompt that demonstrates to the model that in addition to learning what the object in an image is, we would also like to get some interesting information about it.
Then, let's see if we can get the same response format for an image of the Statue of Liberty:
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg" alt="Image of the Statue of Liberty"/>
</div>
Photo by [Juan Mayobre](https://unsplash.com/@jmayobres).
```py
>>> prompt = ["User:",
... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
... "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
... "User:",
... "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
... "Describe this image.\nAssistant:"
... ]
>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.
```
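The interleaved prompt above follows a repeating User/image/Assistant pattern, so it can be assembled programmatically. A sketch of a helper for building such N-shot prompts (the function name is illustrative, not part of the Transformers API, and the URLs are placeholders):

```python
def build_few_shot_prompt(examples, query_url, question="Describe this image."):
    """Assemble an interleaved IDEFICS prompt from (image_url, answer) example
    pairs, followed by the query image with an open-ended Assistant turn."""
    prompt = []
    for url, answer in examples:
        prompt += ["User:", url, f"{question}\nAssistant: {answer}\n"]
    prompt += ["User:", query_url, f"{question}\nAssistant:"]
    return prompt

examples = [
    ("https://example.com/eiffel.jpg",
     "An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is "
     "the same height as an 81-storey building."),
]
prompt = build_few_shot_prompt(examples, "https://example.com/liberty.jpg")
print(len(prompt))  # 6
```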
๋‹จ ํ•˜๋‚˜์˜ ์˜ˆ์‹œ๋งŒ์œผ๋กœ๋„(์ฆ‰, 1-shot) ๋ชจ๋ธ์ด ์ž‘์—… ์ˆ˜ํ–‰ ๋ฐฉ๋ฒ•์„ ํ•™์Šตํ–ˆ๋‹ค๋Š” ์ ์ด ์ฃผ๋ชฉํ•  ๋งŒํ•ฉ๋‹ˆ๋‹ค. ๋” ๋ณต์žกํ•œ ์ž‘์—…์˜ ๊ฒฝ์šฐ, ๋” ๋งŽ์€ ์˜ˆ์‹œ(์˜ˆ: 3-shot, 5-shot ๋“ฑ)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹คํ—˜ํ•ด ๋ณด๋Š” ๊ฒƒ๋„ ์ข‹์€ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
## ์‹œ๊ฐ์  ์งˆ์˜ ์‘๋‹ต[[visual-question-answering]]
์‹œ๊ฐ์  ์งˆ์˜ ์‘๋‹ต(VQA)์€ ์ด๋ฏธ์ง€๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐœ๋ฐฉํ˜• ์งˆ๋ฌธ์— ๋‹ตํ•˜๋Š” ์ž‘์—…์ž…๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ์บก์…”๋‹๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์ ‘๊ทผ์„ฑ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ๊ต์œก(์‹œ๊ฐ ์ž๋ฃŒ์— ๋Œ€ํ•œ ์ถ”๋ก ), ๊ณ ๊ฐ ์„œ๋น„์Šค(์ด๋ฏธ์ง€๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์ œํ’ˆ ์งˆ๋ฌธ), ์ด๋ฏธ์ง€ ๊ฒ€์ƒ‰ ๋“ฑ์—์„œ๋„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์ด ์ž‘์—…์„ ์œ„ํ•ด ์ƒˆ๋กœ์šด ์ด๋ฏธ์ง€๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค:
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-vqa.jpg" alt="Image of a couple having a picnic"/>
</div>
Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos).
You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:
```py
>>> prompt = [
... "Instruction: Provide an answer to the question. Use the image to answer.\n",
... "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
... "Question: Where are these people and what's the weather like? Answer:"
... ]
>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
```
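Since the decoded output repeats the instruction and question, a short post-processing step (an illustrative sketch, not part of the Transformers API) can pull out just the answer text after the `Answer:` marker:

```python
def extract_answer(generated: str) -> str:
    """Return only the text after the final 'Answer:' marker."""
    _, _, answer = generated.rpartition("Answer:")
    return answer.strip()

output = (
    "Instruction: Provide an answer to the question. Use the image to answer.\n"
    "Question: Where are these people and what's the weather like? "
    "Answer: They're in a park in New York City, and it's a beautiful day."
)
print(extract_answer(output))
```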
## Image classification[[image-classification]]
IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories, the model can use its image and text understanding capabilities to infer which category the image likely belongs to.
Say we have this image of a vegetable stand:
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-classification.jpg" alt="Image of a vegetable stand"/>
</div>
Photo by [Peter Wendt](https://unsplash.com/@peterwendt).
We can instruct the model to classify the image into one of the categories that we have:
```py
>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
... "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
... "Category: "
... ]
>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
```
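Because the model generates free text, the predicted label may differ from the candidate list in casing (here `Vegetables` vs. `vegetables`). A small sketch (an illustrative post-processing step, not part of the Transformers API) that maps the generated text back onto one of the candidate labels:

```python
def match_category(generated, categories):
    """Map the text after 'Category:' onto a candidate label, ignoring case.
    Returns None if the prediction matches no candidate."""
    _, _, tail = generated.rpartition("Category:")
    prediction = tail.strip().lower()
    for category in categories:
        if category.lower() == prediction:
            return category
    return None

categories = ['animals', 'vegetables', 'city landscape', 'cars', 'office']
print(match_category("...\nCategory: Vegetables", categories))  # vegetables
```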
In the example above we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification.
## Image-guided text generation[[image-guided-text-generation]]
For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful for creating descriptions of products, ads, descriptions of a scene, etc.
Let's prompt IDEFICS to write a story based on a simple image of a red door:
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-story-generation.jpg" alt="Image of a red door with a pumpkin on the steps"/>
</div>
Photo by [Craig Tidball](https://unsplash.com/@devonshiremedia).
```py
>>> prompt = ["Instruction: Use the image to write a story. \n",
... "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
... "Story: \n"]
>>> inputs = processor(prompt, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Use the image to write a story.
Story:
Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world.
One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat.
The little girl ran inside and told her mother about the man.
Her mother said, “Don’t worry, honey. He’s just a friendly ghost.”
The little girl wasn’t sure if she believed her mother, but she went outside anyway.
When she got to the door, the man was gone.
The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.
He was wearing a long black coat and a top hat.
The little girl ran
```
Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.
<Tip>
For longer outputs like this, tweaking the text generation strategy will greatly benefit you. It can significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies) to learn more.
</Tip>
## Running inference in batch mode[[running-inference-in-batch-mode]]
All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:
```py
>>> prompts = [
... [ "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
... "This is an image of ",
... ],
... [ "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
... "This is an image of ",
... ],
... [ "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
... "This is an image of ",
... ],
... ]
>>> inputs = processor(prompts, return_tensors="pt").to(model.device)
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
... print(f"{i}:\n{t}\n")
0:
This is an image of the Eiffel Tower in Paris, France.
1:
This is an image of a couple on a picnic blanket.
2:
This is an image of a vegetable stand.
```
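The three prompts above differ only in their image URL, so when captioning many images with the same text prompt, the list of prompts can be built programmatically. A minimal sketch (the URLs here are placeholders, not the ones used above):

```python
# Build one [image_url, text_prompt] pair per image to caption.
urls = [
    "https://example.com/eiffel.jpg",
    "https://example.com/picnic.jpg",
    "https://example.com/stand.jpg",
]
prompts = [[url, "This is an image of "] for url in urls]
print(len(prompts))  # 3
```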
## IDEFICS instruct for conversational use[[idefics-instruct-for-conversational-use]]
For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: `HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`.
These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts downstream performance while making the models more usable in conversational settings.
The use and prompting for conversational use is very similar to using the base models:
```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor
>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> prompts = [
... [
... "User: What is in this image?",
... "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
... "<end_of_utterance>",
... "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
... "\nUser:",
... "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
... "And who is that?<end_of_utterance>",
... "\nAssistant:",
... ],
... ]
>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(model.device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(model.device)
>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
... print(f"{i}:\n{t}\n")
```