
BLIP[[blip]]

Overview[[overview]]

BLIP ๋ชจ๋ธ์€ Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi์˜ BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

BLIP์€ ์—ฌ๋Ÿฌ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค:

  • ์‹œ๊ฐ ์งˆ๋ฌธ ์‘๋‹ต (Visual Question Answering, VQA)
  • ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๊ฒ€์ƒ‰ (์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๋งค์นญ)
  • ์ด๋ฏธ์ง€ ์บก์…”๋‹
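As an illustration of the image captioning task listed above, the following is a minimal sketch rather than an official recipe. It assumes that `torch`, `transformers`, and `Pillow` are installed, and it uses the checkpoint name `Salesforce/blip-image-captioning-base`, which is not mentioned on this page and is an assumption. The helper name `caption_image` is hypothetical.

```python
def caption_image(image_path, checkpoint="Salesforce/blip-image-captioning-base",
                  max_new_tokens=30):
    """Return a generated caption for the image stored at `image_path`.

    `checkpoint` is an assumed Hub model id; swap in whichever BLIP
    captioning checkpoint you actually use.
    """
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    # The processor resizes and normalizes the image into pixel_values.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # generate() produces token ids; decode() turns them back into text.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Calling `caption_image("photo.jpg")` with a local image file (the path is a placeholder) downloads the checkpoint on first use and returns a short caption string.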

๋…ผ๋ฌธ์˜ ์ดˆ๋ก์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
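The caption bootstrapping idea in the abstract (a captioner proposes synthetic captions, a filter discards noisy ones) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: images are represented as plain strings of content tags, `toy_captioner` and `toy_filter` are hypothetical stand-ins for the learned captioner and filter, and the sketch assumes the filter is applied to both the original web caption and the synthetic caption.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image, caption); the "image" is a toy string of tags


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Keep only (image, caption) pairs that the filter accepts."""
    cleaned: List[Pair] = []
    for image, web_caption in web_pairs:
        # The captioner proposes a synthetic caption for every web image.
        synthetic_caption = captioner(image)
        # The filter judges the web caption and the synthetic caption alike.
        for caption in (web_caption, synthetic_caption):
            if is_match(image, caption):
                cleaned.append((image, caption))
    return cleaned


def toy_captioner(image: str) -> str:
    # Hypothetical stand-in: describe the image by joining its tags.
    return "a photo of " + " and ".join(image.split())


def toy_filter(image: str, caption: str) -> bool:
    # Hypothetical stand-in: accept a caption that mentions any image tag.
    return any(tag in caption.split() for tag in image.split())


web_pairs = [
    ("dog park", "buy cheap shoes online"),    # noisy web caption
    ("cat sofa", "my cat asleep on the sofa"),  # relevant web caption
]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
```

In this toy run the unrelated web caption for the first image is dropped while its synthetic caption is kept, and both captions for the second image survive. In the real framework the captioner and the filter are learned models rather than string heuristics.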

BLIP.gif

์ด ๋ชจ๋ธ์€ ybelkada๊ฐ€ ๊ธฐ์—ฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์›๋ณธ ์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ž๋ฃŒ[[resources]]

  • Jupyter notebook on how to fine-tune BLIP for image captioning on a custom dataset

BlipConfig[[transformers.BlipConfig]]

[[autodoc]] BlipConfig - from_text_vision_configs

BlipTextConfig[[transformers.BlipTextConfig]]

[[autodoc]] BlipTextConfig

BlipVisionConfig[[transformers.BlipVisionConfig]]

[[autodoc]] BlipVisionConfig

BlipProcessor[[transformers.BlipProcessor]]

[[autodoc]] BlipProcessor

BlipImageProcessor[[transformers.BlipImageProcessor]]

[[autodoc]] BlipImageProcessor - preprocess

BlipModel[[transformers.BlipModel]]

BlipModel์€ ํ–ฅํ›„ ๋ฒ„์ „์—์„œ ๋” ์ด์ƒ ์ง€์›๋˜์ง€ ์•Š์„ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค. ๋ชฉ์ ์— ๋”ฐ๋ผ BlipForConditionalGeneration, BlipForImageTextRetrieval ๋˜๋Š” BlipForQuestionAnswering์„ ์‚ฌ์šฉํ•˜์‹ญ์‹œ์˜ค.

[[autodoc]] BlipModel - forward - get_text_features - get_image_features

BlipTextModel[[transformers.BlipTextModel]]

[[autodoc]] BlipTextModel - forward

BlipVisionModel[[transformers.BlipVisionModel]]

[[autodoc]] BlipVisionModel - forward

BlipForConditionalGeneration[[transformers.BlipForConditionalGeneration]]

[[autodoc]] BlipForConditionalGeneration - forward

BlipForImageTextRetrieval[[transformers.BlipForImageTextRetrieval]]

[[autodoc]] BlipForImageTextRetrieval - forward

BlipForQuestionAnswering[[transformers.BlipForQuestionAnswering]]

[[autodoc]] BlipForQuestionAnswering - forward

TFBlipModel[[transformers.TFBlipModel]]

[[autodoc]] TFBlipModel - call - get_text_features - get_image_features

TFBlipTextModel[[transformers.TFBlipTextModel]]

[[autodoc]] TFBlipTextModel - call

TFBlipVisionModel[[transformers.TFBlipVisionModel]]

[[autodoc]] TFBlipVisionModel - call

TFBlipForConditionalGeneration[[transformers.TFBlipForConditionalGeneration]]

[[autodoc]] TFBlipForConditionalGeneration - call

TFBlipForImageTextRetrieval[[transformers.TFBlipForImageTextRetrieval]]

[[autodoc]] TFBlipForImageTextRetrieval - call

TFBlipForQuestionAnswering[[transformers.TFBlipForQuestionAnswering]]

[[autodoc]] TFBlipForQuestionAnswering - call