<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# BLIP[[blip]]

## Overview[[overview]]

The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://huggingface.co/papers/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.

BLIP is a model that is able to perform various multi-modal tasks including:
- Visual Question Answering (VQA)
- Image-Text retrieval (Image-text matching)
- Image Captioning

The abstract from the paper is the following:

*Vision-Language Pre-training (VLP) has greatly advanced performance on many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has largely been achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets have been released.*

This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
The original code can be found [here](https://github.com/salesforce/BLIP).
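As a quick illustration of the image captioning task listed above, the snippet below is a minimal sketch using the `Salesforce/blip-image-captioning-base` checkpoint and a sample COCO image (it downloads both, so network access is required):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and the captioning model from the Hub
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Fetch an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Generate a caption for the image
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```

The same processor also accepts an optional `text` prompt to condition the caption.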
## Resources[[resources]]

- [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset
## BlipConfig[[transformers.BlipConfig]]

[[autodoc]] BlipConfig
- from_text_vision_configs

## BlipTextConfig[[transformers.BlipTextConfig]]

[[autodoc]] BlipTextConfig

## BlipVisionConfig[[transformers.BlipVisionConfig]]

[[autodoc]] BlipVisionConfig

## BlipProcessor[[transformers.BlipProcessor]]

[[autodoc]] BlipProcessor

## BlipImageProcessor[[transformers.BlipImageProcessor]]

[[autodoc]] BlipImageProcessor
- preprocess
<frameworkcontent>
<pt>

## BlipModel[[transformers.BlipModel]]

`BlipModel` is going to be deprecated in future versions. Please use `BlipForConditionalGeneration`, `BlipForImageTextRetrieval` or `BlipForQuestionAnswering` depending on your use case.

[[autodoc]] BlipModel
- forward
- get_text_features
- get_image_features
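Following the recommendation above, visual question answering goes through `BlipForQuestionAnswering` rather than `BlipModel`. A minimal sketch, assuming network access to fetch the `Salesforce/blip-vqa-base` checkpoint and a sample image:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the processor and the VQA model from the Hub
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Fetch an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Ask a free-form question about the image; the answer is generated as text
inputs = processor(images=image, text="how many cats are there?", return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```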
## BlipTextModel[[transformers.BlipTextModel]]

[[autodoc]] BlipTextModel
- forward

## BlipVisionModel[[transformers.BlipVisionModel]]

[[autodoc]] BlipVisionModel
- forward

## BlipForConditionalGeneration[[transformers.BlipForConditionalGeneration]]

[[autodoc]] BlipForConditionalGeneration
- forward

## BlipForImageTextRetrieval[[transformers.BlipForImageTextRetrieval]]

[[autodoc]] BlipForImageTextRetrieval
- forward

## BlipForQuestionAnswering[[transformers.BlipForQuestionAnswering]]

[[autodoc]] BlipForQuestionAnswering
- forward

</pt>
<tf>

## TFBlipModel[[transformers.TFBlipModel]]

[[autodoc]] TFBlipModel
- call
- get_text_features
- get_image_features

## TFBlipTextModel[[transformers.TFBlipTextModel]]

[[autodoc]] TFBlipTextModel
- call

## TFBlipVisionModel[[transformers.TFBlipVisionModel]]

[[autodoc]] TFBlipVisionModel
- call

## TFBlipForConditionalGeneration[[transformers.TFBlipForConditionalGeneration]]

[[autodoc]] TFBlipForConditionalGeneration
- call

## TFBlipForImageTextRetrieval[[transformers.TFBlipForImageTextRetrieval]]

[[autodoc]] TFBlipForImageTextRetrieval
- call

## TFBlipForQuestionAnswering[[transformers.TFBlipForQuestionAnswering]]

[[autodoc]] TFBlipForQuestionAnswering
- call

</tf>
</frameworkcontent>