
# Image tasks with IDEFICS[[image-tasks-with-idefics]]

[[open-in-colab]]

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle NLP tasks such as summarization, translation, classification, and more. This approach is no longer limited to a single modality, such as text, and in this guide we will illustrate how to solve image-text tasks with a large multimodal model called IDEFICS.

IDEFICS is an open-access vision and language model based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants - 80 billion parameters and 9 billion parameters - both of which are available on the 🤗 Hub. For each variant, you can also find a fine-tuned instructed version of the model adapted for conversational use cases.

์ด ๋ชจ๋ธ์€ ๋งค์šฐ ๋‹ค์žฌ๋‹ค๋Šฅํ•˜๋ฉฐ ๊ด‘๋ฒ”์œ„ํ•œ ์ด๋ฏธ์ง€ ๋ฐ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ž‘์—…์— ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์ด๊ธฐ ๋•Œ๋ฌธ์— ์ƒ๋‹นํ•œ ์ปดํ“จํŒ… ์ž์›๊ณผ ์ธํ”„๋ผ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๊ฐœ๋ณ„ ์ž‘์—…์— ํŠนํ™”๋œ ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ๋ชจ๋ธ์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ๋” ์ ํ•ฉํ•œ์ง€๋Š” ์‚ฌ์šฉ์ž๊ฐ€ ํŒ๋‹จํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” ๋‹ค์Œ์„ ๋ฐฐ์šฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

Before you begin, make sure you have all the necessary libraries installed.

```bash
pip install -q bitsandbytes sentencepiece accelerate transformers
```

To run the following examples with a non-quantized version of the model checkpoint, you will need at least 20GB of GPU memory.

๋ชจ๋ธ ๋กœ๋“œ[[loading-the-model]]

๋ชจ๋ธ์„ 90์–ต ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฒ„์ „์˜ ์ฒดํฌํฌ์ธํŠธ๋กœ ๋กœ๋“œํ•ด ๋ด…์‹œ๋‹ค:

```py
>>> checkpoint = "HuggingFaceM4/idefics-9b"
```

Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. The IDEFICS processor wraps a [`LlamaTokenizer`] and an IDEFICS image processor into a single processor to take care of preparing text and image inputs for the model.

```py
>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
```

Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized manner given the existing devices.

์–‘์žํ™”๋œ ๋ชจ๋ธ[[quantized-model]]

If high-memory GPU availability is an issue, you can load a quantized version of the model. To load the model and the processor in 4-bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed on the fly while loading.

```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )
```

Now that you have the model loaded in one of the suggested ways, let's move on to exploring the tasks you can use IDEFICS for.

## Image captioning[[image-captioning]]

Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired people navigate through different situations, for instance, by exploring image content online.

์ž‘์—…์„ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด ์บก์…˜์„ ๋‹ฌ ์ด๋ฏธ์ง€ ์˜ˆ์‹œ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ์˜ˆ์‹œ:

Image of a puppy in a flower bed

Photo by Hendo Wang.

IDEFICS accepts both text and image prompts. However, to caption an image, you do not have to provide a text prompt to the model - only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.

๋ชจ๋ธ์— ์ด๋ฏธ์ง€ ์ž…๋ ฅ์œผ๋กœ๋Š” ์ด๋ฏธ์ง€ ๊ฐ์ฒด(PIL.Image) ๋˜๋Š” ์ด๋ฏธ์ง€๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋Š” URL์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

```py
>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
```
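The prompt above passes the image as a URL string. As noted, a `PIL.Image` object can take its place in the prompt list; a minimal sketch, using a locally constructed placeholder image instead of a real photo:

```python
from PIL import Image

# Placeholder image standing in for a real photo; the processor accepts
# PIL.Image objects in the prompt list just like URL strings.
local_image = Image.new("RGB", (224, 224), color="white")

prompt = [local_image]
# inputs = processor(prompt, return_tensors="pt").to("cuda")  # then proceed as above
```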

It is a good idea to include `bad_words_ids` in the call to `generate` to avoid errors that can arise when increasing `max_new_tokens`: the model will want to generate a new `<image>` or `<fake_token_around_image>` token when there is no image being generated by the model. You can set it on the fly as in this guide, or store it in a `GenerationConfig` as described in the text generation strategies guide.
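For example, instead of passing `bad_words_ids` on every call, the ids can be stored once in a `GenerationConfig` and reused. The token ids below are placeholders; in practice they come from `processor.tokenizer` as shown above:

```python
from transformers import GenerationConfig

# Placeholder ids standing in for the output of
# processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
bad_words_ids = [[32000], [32001]]

generation_config = GenerationConfig(
    max_new_tokens=10,
    bad_words_ids=bad_words_ids,
)
# generated_ids = model.generate(**inputs, generation_config=generation_config)
```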

## Prompted image captioning[[prompted-image-captioning]]

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:

Image of the Eiffel Tower at night

Photo by Denys Nevozhai.

Textual and image prompts can be passed to the model's processor as a single list to create the appropriate inputs.

```py
>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.
```

## Few-shot prompting[[few-shot-prompting]]

While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning: by providing examples in the prompt, you can steer the model to generate results that mimic the format of those examples.

์ด์ „์˜ ์—ํŽ ํƒ‘ ์ด๋ฏธ์ง€๋ฅผ ๋ชจ๋ธ์— ์˜ˆ์‹œ๋กœ ์‚ฌ์šฉํ•˜๊ณ , ๋ชจ๋ธ์—๊ฒŒ ์ด๋ฏธ์ง€์˜ ๊ฐ์ฒด๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ ์™ธ์—๋„ ํฅ๋ฏธ๋กœ์šด ์ •๋ณด๋ฅผ ์–ป๊ณ  ์‹ถ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๋Š” ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ž‘์„ฑํ•ด ๋ด…์‹œ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ์ž์œ ์˜ ์—ฌ์‹ ์ƒ ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด ๋™์ผํ•œ ์‘๋‹ต ํ˜•์‹์„ ์–ป์„ ์ˆ˜ ์žˆ๋Š”์ง€ ํ™•์ธํ•ด ๋ด…์‹œ๋‹ค:

Image of the Statue of Liberty

Photo by Juan Mayobre.

```py
>>> prompt = ["User:",
...            "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...            "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...            "User:",
...            "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...            "Describe this image.\nAssistant:"
...            ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. 
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.
```

Notice that from just a single example (i.e. 1-shot), the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g. 3-shot, 5-shot, etc.).

์‹œ๊ฐ์  ์งˆ์˜ ์‘๋‹ต[[visual-question-answering]]

์‹œ๊ฐ์  ์งˆ์˜ ์‘๋‹ต(VQA)์€ ์ด๋ฏธ์ง€๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐœ๋ฐฉํ˜• ์งˆ๋ฌธ์— ๋‹ตํ•˜๋Š” ์ž‘์—…์ž…๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ์บก์…”๋‹๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์ ‘๊ทผ์„ฑ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ๊ต์œก(์‹œ๊ฐ ์ž๋ฃŒ์— ๋Œ€ํ•œ ์ถ”๋ก ), ๊ณ ๊ฐ ์„œ๋น„์Šค(์ด๋ฏธ์ง€๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์ œํ’ˆ ์งˆ๋ฌธ), ์ด๋ฏธ์ง€ ๊ฒ€์ƒ‰ ๋“ฑ์—์„œ๋„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Let's get a new image for this task:

Image of a couple having a picnic

Photo by Jarritos Mexican Soda.

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:

```py
>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
```

## Image classification[[image-classification]]

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories, the model can use its image and text understanding capabilities to infer which category the image most likely belongs to.

Say we have this image of a vegetable stand:

Image of a vegetable stand

Photo by Peter Wendt.

We can instruct the model to classify the image into one of the categories that we have:

```py
>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
```

In the example above we instructed the model to classify the image into a single category; however, you can also prompt the model to do ranked classification.
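A hypothetical ranked-classification prompt could look like the following; only the instruction text changes, while the processor and `generate` call stay the same as in the example above:

```python
categories = ['animals', 'vegetables', 'city landscape', 'cars', 'office']

# Hypothetical prompt asking for a ranking rather than a single category;
# the image URL is the same vegetable stand used above.
prompt = [
    f"Instruction: Rank the categories in the following list from most to least likely to describe the image: {categories}.\n",
    "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
    "Ranking: ",
]
```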

## Image-guided text generation[[image-guided-text-generation]]

For more creative applications, you can use image-guided text generation to produce text based on an image. This can be useful for creating descriptions of products, ads, descriptions of a scene, and so on.

As a simple example, let's prompt IDEFICS with an image of a red door to write a story:

Image of a red door with a pumpkin on the steps

Photo by Craig Tidball.

```py
>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0]) 
Instruction: Use the image to write a story. 
 Story: 
Once upon a time, there was a little girl who lived in a house with a red door.  She loved her red door.  It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep.  He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, "Don't worry, honey.  He's just a friendly ghost."

The little girl wasn't sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran
```

Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out the text generation strategies guide to learn more.
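As an illustration, a few of the knobs from that guide can be bundled into a `GenerationConfig`; the values below are illustrative rather than recommendations:

```python
from transformers import GenerationConfig

generation_config = GenerationConfig(
    num_beams=2,             # beam search, as in the story example above
    max_new_tokens=200,
    no_repeat_ngram_size=4,  # discourages the repetition visible in long stories
)
# generated_ids = model.generate(**inputs, generation_config=generation_config)
```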

## Running inference in batch mode[[running-inference-in-batch-mode]]

All of the earlier sections illustrated IDEFICS on a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts:

```py
>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n") 
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.
```

## IDEFICS instruct for conversational use[[idefics-instruct-for-conversational-use]]

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: `HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts downstream performance while making the models more usable in conversational settings.

The usage and prompting for conversational use is very similar to using the base models:

```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",

...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",

...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
```