
BLIP-2[[blip-2]]

Overview[[overview]]

The BLIP-2 model was proposed in BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2 connects a frozen pre-trained image encoder to a frozen large language model (LLM) by training a lightweight, 12-layer Transformer encoder between them, and achieves state-of-the-art (SOTA) performance on several vision-language tasks. Most notably, BLIP-2 outperforms Flamingo, an 80-billion-parameter model, by 8.7% on zero-shot VQAv2 while using 54x fewer trainable parameters.

The abstract from the paper is the following:

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.


BLIP-2 architecture. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.

Usage tips[[usage-tips]]

  • BLIP-2 can be used for conditional text generation given an image and an optional text prompt. At inference time, using the [generate] method is recommended.
  • You can use [Blip2Processor] to prepare images for the model and to decode the predicted token IDs back to text.
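The two tips above can be sketched roughly as follows. This is a minimal illustration and not an official recipe: the checkpoint name `Salesforce/blip2-opt-2.7b`, the `Question: ... Answer:` prompt format, and the helper names `build_vqa_prompt` and `generate_text` are assumptions that do not come from this page. The heavy imports are deferred into the function so that the helpers can be imported without loading the model.

```python
from typing import Optional

# Assumed checkpoint name; swap in whichever BLIP-2 checkpoint you actually use.
DEFAULT_CHECKPOINT = "Salesforce/blip2-opt-2.7b"


def build_vqa_prompt(question: str) -> str:
    """Format a question as 'Question: ... Answer:' (an assumed prompt style)."""
    return f"Question: {question.strip()} Answer:"


def generate_text(
    image,
    prompt: Optional[str] = None,
    checkpoint: str = DEFAULT_CHECKPOINT,
    max_new_tokens: int = 30,
) -> str:
    """Caption `image` (a PIL image), or continue `prompt` conditioned on it."""
    # Deferred imports: torch and transformers are only needed when generating.
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(checkpoint)
    model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()

    # The processor prepares the image and, if given, tokenizes the prompt.
    if prompt is None:
        inputs = processor(images=image, return_tensors="pt")
    else:
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    # generate() is the recommended entry point at inference time.
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The processor also decodes the predicted token IDs back to text.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

With a PIL image loaded from a hypothetical local file, `generate_text(Image.open("photo.jpg"))` would produce a caption, and `generate_text(Image.open("photo.jpg"), build_vqa_prompt("how many cats are there?"))` would answer a question about it. Loading the model on every call keeps the sketch short; real code would likely load the processor and model once and reuse them.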

Resources[[resources]]

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLIP-2.

  • Demo notebooks for BLIP-2 covering image captioning, visual question answering (VQA), and chat-like conversations can be found here.

If you would like to submit a resource to be included here, feel free to open a Pull Request! The resource should demonstrate something new rather than duplicate an existing resource.

Blip2Config[[transformers.Blip2Config]]

[[autodoc]] Blip2Config - from_vision_qformer_text_configs
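As a rough illustration of how `from_vision_qformer_text_configs` composes the three sub-configurations, the sketch below builds a `Blip2Config` from default sub-configs. Using `OPTConfig` for the language model is an assumption made for the example, and `build_blip2_config` is a hypothetical helper name; the import is deferred so the helper can be defined without `transformers` being loaded.

```python
def build_blip2_config():
    """Compose a Blip2Config from default vision, Q-Former, and text configs."""
    # Deferred import: only needed when the config is actually built.
    from transformers import (
        Blip2Config,
        Blip2QFormerConfig,
        Blip2VisionConfig,
        OPTConfig,
    )

    vision_config = Blip2VisionConfig()    # frozen image encoder settings
    qformer_config = Blip2QFormerConfig()  # lightweight Querying Transformer settings
    text_config = OPTConfig()              # assumed language model family for this sketch

    return Blip2Config.from_vision_qformer_text_configs(
        vision_config=vision_config,
        qformer_config=qformer_config,
        text_config=text_config,
    )
```

The resulting object exposes the sub-configs as attributes, for example `build_blip2_config().qformer_config.num_hidden_layers`, and can be passed to a model class to create a randomly initialized model.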

Blip2VisionConfig[[transformers.Blip2VisionConfig]]

[[autodoc]] Blip2VisionConfig

Blip2QFormerConfig[[transformers.Blip2QFormerConfig]]

[[autodoc]] Blip2QFormerConfig

Blip2Processor[[transformers.Blip2Processor]]

[[autodoc]] Blip2Processor

Blip2VisionModel[[transformers.Blip2VisionModel]]

[[autodoc]] Blip2VisionModel - forward

Blip2QFormerModel[[transformers.Blip2QFormerModel]]

[[autodoc]] Blip2QFormerModel - forward

Blip2Model[[transformers.Blip2Model]]

[[autodoc]] Blip2Model - forward - get_text_features - get_image_features - get_qformer_features

Blip2ForConditionalGeneration[[transformers.Blip2ForConditionalGeneration]]

[[autodoc]] Blip2ForConditionalGeneration - forward - generate

Blip2ForImageTextRetrieval[[transformers.Blip2ForImageTextRetrieval]]

[[autodoc]] Blip2ForImageTextRetrieval - forward

Blip2TextModelWithProjection[[transformers.Blip2TextModelWithProjection]]

[[autodoc]] Blip2TextModelWithProjection

Blip2VisionModelWithProjection[[transformers.Blip2VisionModelWithProjection]]

[[autodoc]] Blip2VisionModelWithProjection