| # BLIP[[blip]] | |
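| ## Overview[[overview]] | |
| The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://huggingface.co/papers/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. | |
| BLIP is a model that can perform several multimodal tasks: | |
| - Visual Question Answering (VQA) | |
| - Image-text retrieval (image-text matching) | |
| - Image captioning | |
| The abstract from the paper is the following: | |
| *Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.* | |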
|  | |
| This model was contributed by [ybelkada](https://huggingface.co/ybelkada). | |
| The original code can be found [here](https://github.com/salesforce/BLIP). | |
| ## Resources[[resources]] | |
| - [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset | |
| ## BlipConfig[[transformers.BlipConfig]] | |
| #### transformers.BlipConfig[[transformers.BlipConfig]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/configuration_blip.py#L112) | |
| This is the configuration class to store the configuration of a BlipModel. It is used to instantiate a BLIP | |
| model according to the specified arguments, defining the model architecture. Instantiating a configuration with the | |
| defaults will yield a similar configuration to that of the [Salesforce/blip-vqa-base](https://huggingface.co/Salesforce/blip-vqa-base) architecture. | |
| Configuration objects inherit from [PreTrainedConfig](/docs/transformers/main/ko/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the | |
| documentation from [PreTrainedConfig](/docs/transformers/main/ko/main_classes/configuration#transformers.PreTrainedConfig) for more information. | |
| Example: | |
| ```python | |
| >>> from transformers import BlipConfig, BlipModel, BlipTextConfig, BlipVisionConfig | |
| >>> # Initializing a BlipConfig with Salesforce/blip-vqa-base style configuration | |
| >>> configuration = BlipConfig() | |
| >>> # Initializing a BlipModel (with random weights) from the Salesforce/blip-vqa-base style configuration | |
| >>> model = BlipModel(configuration) | |
| >>> # Accessing the model configuration | |
| >>> configuration = model.config | |
| >>> # We can also initialize a BlipConfig from a BlipTextConfig and a BlipVisionConfig | |
| >>> # Initializing a BLIPText and BLIPVision configuration | |
| >>> config_text = BlipTextConfig() | |
| >>> config_vision = BlipVisionConfig() | |
| >>> config = BlipConfig(text_config=config_text, vision_config=config_vision) | |
| ``` | |
| **Parameters:** | |
| text_config (`Union[dict, ~configuration_utils.PreTrainedConfig]`, *optional*) : The config object or dictionary of the text backbone. | |
| vision_config (`Union[dict, ~configuration_utils.PreTrainedConfig]`, *optional*) : The config object or dictionary of the vision backbone. | |
| projection_dim (`int`, *optional*, defaults to `512`) : Dimensionality of text and vision projection layers. | |
| logit_scale_init_value (`float`, *optional*, defaults to `2.6592`) : The initial value of the *logit_scale* parameter. | |
| image_text_hidden_size (`int`, *optional*, defaults to `256`) : Dimensionality of the hidden state of the image-text fusion layer. | |
| label_smoothing (`float`, *optional*, defaults to `0.0`) : A float in [0.0, 1.0]. Specifies the amount of smoothing when computing the loss, where 0.0 means no smoothing. The targets become a mixture of the original ground truth and a uniform distribution as described in [Rethinking the Inception Architecture for Computer Vision](https://huggingface.co/papers/1512.00567). | |
| tie_word_embeddings (`bool`, *optional*, defaults to `True`) : Whether to tie weight embeddings according to model's `tied_weights_keys` mapping. | |
| initializer_factor (`float`, *optional*, defaults to `1.0`) : A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). | |
| initializer_range (`float`, *optional*, defaults to `0.02`) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | |
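The default `logit_scale_init_value` of `2.6592` is easier to interpret once exponentiated. The sketch below assumes the CLIP-style convention, in which the parameter is stored in log space and similarity scores are multiplied by its exponential; under that assumption the default corresponds to a softmax temperature of roughly 0.07. The helper name `temperature_from_logit_scale` is hypothetical and not part of the library.

```python
import math


def temperature_from_logit_scale(logit_scale: float) -> float:
    """Return the softmax temperature implied by a log-space logit scale.

    Assumption: scores are multiplied by exp(logit_scale) (CLIP-style convention).
    """
    return 1.0 / math.exp(logit_scale)


default_logit_scale = 2.6592  # documented default of logit_scale_init_value
multiplier = math.exp(default_logit_scale)  # factor applied to the similarity scores
temperature = temperature_from_logit_scale(default_logit_scale)
print(f"multiplier ~ {multiplier:.2f}, temperature ~ {temperature:.3f}")
```

A larger `logit_scale_init_value` sharpens the softmax over image-text pairs, while a smaller one flattens it.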
| ## BlipTextConfig[[transformers.BlipTextConfig]] | |
| #### transformers.BlipTextConfig[[transformers.BlipTextConfig]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/configuration_blip.py#L27) | |
| This is the configuration class to store the configuration of a BlipTextModel. It is used to instantiate a BLIP text | |
| model according to the specified arguments, defining the model architecture. Instantiating a configuration with the | |
| defaults will yield a similar configuration to that of the [Salesforce/blip-vqa-base](https://huggingface.co/Salesforce/blip-vqa-base) architecture. | |
| Configuration objects inherit from [PreTrainedConfig](/docs/transformers/main/ko/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the | |
| documentation from [PreTrainedConfig](/docs/transformers/main/ko/main_classes/configuration#transformers.PreTrainedConfig) for more information. | |
| Example: | |
| ```python | |
| >>> from transformers import BlipTextConfig, BlipTextModel | |
| >>> # Initializing a BlipTextConfig with Salesforce/blip-vqa-base style configuration | |
| >>> configuration = BlipTextConfig() | |
| >>> # Initializing a BlipTextModel (with random weights) from the Salesforce/blip-vqa-base style configuration | |
| >>> model = BlipTextModel(configuration) | |
| >>> # Accessing the model configuration | |
| >>> configuration = model.config | |
| ``` | |
| **Parameters:** | |
| vocab_size (`int`, *optional*, defaults to `30524`) : Vocabulary size of the model. Defines the number of different tokens that can be represented by the `input_ids`. | |
| hidden_size (`int`, *optional*, defaults to `768`) : Dimension of the hidden representations. | |
| encoder_hidden_size (`int`, *optional*, defaults to `768`) : Dimension of the hidden representations coming from the vision encoder, which the text model attends to through cross-attention. | |
| intermediate_size (`int`, *optional*, defaults to `3072`) : Dimension of the MLP representations. | |
| projection_dim (`int`, *optional*, defaults to `768`) : Dimensionality of text and vision projection layers. | |
| num_hidden_layers (`int`, *optional*, defaults to `12`) : Number of hidden layers in the Transformer decoder. | |
| num_attention_heads (`int`, *optional*, defaults to `8`) : Number of attention heads for each attention layer in the Transformer decoder. | |
| max_position_embeddings (`int`, *optional*, defaults to `512`) : The maximum sequence length that this model might ever be used with. | |
| hidden_act (`str`, *optional*, defaults to `gelu`) : The non-linear activation function (function or string) in the decoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc. | |
| layer_norm_eps (`float`, *optional*, defaults to `1e-12`) : The epsilon used by the layer normalization layers. | |
| hidden_dropout_prob (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. | |
| attention_probs_dropout_prob (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout ratio for the attention probabilities. | |
| initializer_range (`float`, *optional*, defaults to `0.02`) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | |
| bos_token_id (`int`, *optional*, defaults to `30522`) : Token id used for beginning-of-stream in the vocabulary. | |
| eos_token_id (`Union[int, list[int]]`, *optional*, defaults to `2`) : Token id used for end-of-stream in the vocabulary. | |
| pad_token_id (`int`, *optional*, defaults to `0`) : Token id used for padding in the vocabulary. | |
| sep_token_id (`int`, *optional*, defaults to `102`) : Token id used for separator in the vocabulary. | |
| is_decoder (`bool`, *optional*, defaults to `True`) : Whether the model is used as a decoder or not. If `False`, the model is used as an encoder. | |
| use_cache (`bool`, *optional*, defaults to `True`) : Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True` or when the model is a decoder-only generative model. | |
| tie_word_embeddings (`bool`, *optional*, defaults to `True`) : Whether to tie weight embeddings according to model's `tied_weights_keys` mapping. | |
| label_smoothing (`float`, *optional*, defaults to `0.0`) : A float in [0.0, 1.0]. Specifies the amount of smoothing when computing the loss, where 0.0 means no smoothing. The targets become a mixture of the original ground truth and a uniform distribution as described in [Rethinking the Inception Architecture for Computer Vision](https://huggingface.co/papers/1512.00567). | |
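To make the `label_smoothing` description concrete, the sketch below builds the smoothed target distribution for a single token in plain Python. It assumes the standard formulation from the cited paper, `(1 - eps) * one_hot + eps / num_classes`; the function name `smooth_targets` and the tiny vocabulary size are illustrative and not part of the library.

```python
def smooth_targets(true_index: int, num_classes: int, eps: float) -> list[float]:
    """Mix a one-hot target with a uniform distribution over all classes."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0.0, 1.0]")
    uniform = eps / num_classes
    targets = [uniform] * num_classes
    targets[true_index] += 1.0 - eps  # remaining mass stays on the ground truth
    return targets


# Toy vocabulary of 4 tokens; the ground-truth token is index 2.
hard = smooth_targets(true_index=2, num_classes=4, eps=0.0)  # no smoothing
soft = smooth_targets(true_index=2, num_classes=4, eps=0.1)
print(hard)
print(soft)
```

With `eps=0.0` the target is the usual one-hot vector; with `eps=0.1` a small share of the probability mass is spread evenly across every token, including the correct one.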
| ## BlipVisionConfig[[transformers.BlipVisionConfig]] | |
| #### transformers.BlipVisionConfig[[transformers.BlipVisionConfig]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/configuration_blip.py#L77) | |
| This is the configuration class to store the configuration of a BlipVisionModel. It is used to instantiate a BLIP vision | |
| model according to the specified arguments, defining the model architecture. Instantiating a configuration with the | |
| defaults will yield a similar configuration to that of the [Salesforce/blip-vqa-base](https://huggingface.co/Salesforce/blip-vqa-base) architecture. | |
| Configuration objects inherit from [PreTrainedConfig](/docs/transformers/main/ko/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the | |
| documentation from [PreTrainedConfig](/docs/transformers/main/ko/main_classes/configuration#transformers.PreTrainedConfig) for more information. | |
| Example: | |
| ```python | |
| >>> from transformers import BlipVisionConfig, BlipVisionModel | |
| >>> # Initializing a BlipVisionConfig with Salesforce/blip-vqa-base style configuration | |
| >>> configuration = BlipVisionConfig() | |
| >>> # Initializing a BlipVisionModel (with random weights) from the Salesforce/blip-vqa-base style configuration | |
| >>> model = BlipVisionModel(configuration) | |
| >>> # Accessing the model configuration | |
| >>> configuration = model.config | |
| ``` | |
| **Parameters:** | |
| hidden_size (`int`, *optional*, defaults to `768`) : Dimension of the hidden representations. | |
| intermediate_size (`int`, *optional*, defaults to `3072`) : Dimension of the MLP representations. | |
| projection_dim (`int`, *optional*, defaults to `512`) : Dimensionality of text and vision projection layers. | |
| num_hidden_layers (`int`, *optional*, defaults to `12`) : Number of hidden layers in the Transformer encoder. | |
| num_attention_heads (`int`, *optional*, defaults to `12`) : Number of attention heads for each attention layer in the Transformer encoder. | |
| image_size (`Union[int, list[int], tuple[int, int]]`, *optional*, defaults to `384`) : The size (resolution) of each image. | |
| patch_size (`Union[int, list[int], tuple[int, int]]`, *optional*, defaults to `16`) : The size (resolution) of each patch. | |
| hidden_act (`str`, *optional*, defaults to `gelu`) : The non-linear activation function (function or string) in the encoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc. | |
| layer_norm_eps (`float`, *optional*, defaults to `1e-05`) : The epsilon used by the layer normalization layers. | |
| attention_dropout (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout ratio for the attention probabilities. | |
| initializer_range (`float`, *optional*, defaults to `1e-10`) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | |
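The defaults for `image_size` and `patch_size` determine how many tokens the vision encoder processes. The calculation below is a minimal sketch that assumes square images, non-overlapping patches, and one extra class token prepended to the patch sequence (the usual ViT layout, treated here as an assumption). The helper `vision_sequence_length` is hypothetical.

```python
def vision_sequence_length(image_size: int = 384, patch_size: int = 16) -> tuple[int, int]:
    """Return (num_patches, sequence_length) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size should be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches, num_patches + 1  # +1 for the assumed class token


num_patches, seq_len = vision_sequence_length()
print(num_patches, seq_len)
```

Halving `patch_size` quadruples the number of patches, which is worth keeping in mind because self-attention cost grows with the square of the sequence length.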
| ## BlipProcessor[[transformers.BlipProcessor]] | |
| #### transformers.BlipProcessor[[transformers.BlipProcessor]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/processing_blip.py#L39) | |
| Constructs a BlipProcessor which wraps an image processor and a tokenizer into a single processor. | |
| [BlipProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipProcessor) offers all the functionalities of [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor) and [BertTokenizer](/docs/transformers/main/ko/model_doc/bert#transformers.BertTokenizer). See | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor) and [BertTokenizer](/docs/transformers/main/ko/model_doc/bert#transformers.BertTokenizer) for more information. | |
| **Parameters:** | |
| image_processor (`BlipImageProcessor`) : The image processor is a required input. | |
| tokenizer (`BertTokenizer`) : The tokenizer is a required input. | |
| ## BlipImageProcessor[[transformers.BlipImageProcessor]] | |
| #### transformers.BlipImageProcessor[[transformers.BlipImageProcessor]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/image_processing_blip.py#L22) | |
| Constructs a BlipImageProcessor image processor. | |
| #### preprocess[[transformers.BlipImageProcessor.preprocess]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/image_processing_utils.py#L382) | |
| `preprocess(images, *args, **kwargs)` | |
| **Parameters:** | |
| images (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]`) : Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`. | |
| return_tensors (`str` or [TensorType](/docs/transformers/main/ko/internal/file_utils#transformers.TensorType), *optional*) : Returns stacked tensors if set to `'pt'`, otherwise returns a list of tensors. | |
| `**kwargs` (`ImagesKwargs`, *optional*) : Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments. | |
| **Returns:** | |
| `~image_processing_base.BatchFeature` | |
| - **data** (`dict`) -- Dictionary of lists/arrays/tensors returned by the `__call__` method ('pixel_values', etc.). | |
| - **tensor_type** (`Union[None, str, TensorType]`, *optional*) -- You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization. | |
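The note about `do_rescale` matters because rescaling twice squashes pixel values toward zero. The sketch below imitates only the rescaling step in plain Python, assuming the conventional factor of `1/255` for 8-bit images. The `rescale` helper is hypothetical and is not the library implementation.

```python
RESCALE_FACTOR = 1 / 255  # assumed conventional factor for 8-bit images


def rescale(pixels: list[float], do_rescale: bool = True) -> list[float]:
    """Map pixel values into [0, 1] when do_rescale is True; otherwise pass them through."""
    return [p * RESCALE_FACTOR for p in pixels] if do_rescale else list(pixels)


uint8_pixels = [0, 128, 255]    # values in the 0-255 range
float_pixels = [0.0, 0.5, 1.0]  # values already in the 0-1 range

ok = rescale(uint8_pixels)                         # ends up in [0, 1]
also_ok = rescale(float_pixels, do_rescale=False)  # left untouched
too_dark = rescale(float_pixels)                   # rescaled a second time by mistake
print(ok, also_ok, too_dark)
```

In the last call every value is below 0.004, which is the kind of nearly black input that `do_rescale=False` is meant to prevent.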
| ## BlipModel[[transformers.BlipModel]] | |
| `BlipModel` is going to be deprecated in future versions. Please use `BlipForConditionalGeneration`, `BlipForImageTextRetrieval` or `BlipForQuestionAnswering` depending on your use case. | |
| #### transformers.BlipModel[[transformers.BlipModel]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L522) | |
| This model is going to be deprecated in future versions. Please use `BlipForConditionalGeneration`, `BlipForQuestionAnswering` or `BlipForImageTextRetrieval` depending on your use case. | |
| This model inherits from [PreTrainedModel](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, | |
| etc.). | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage | |
| and behavior. | |
| #### forward[[transformers.BlipModel.forward]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L691) | |
| `forward(input_ids=None, pixel_values=None, attention_mask=None, position_ids=None, return_loss=None, interpolate_pos_encoding=False, **kwargs)` | |
| - **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/main/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) -- | |
| The tensors corresponding to the input images. Pixel values can be obtained using | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor). See `BlipImageProcessor.__call__()` for details ([BlipProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipProcessor) uses | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor) for processing images). | |
| - **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **position_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`. | |
| [What are position IDs?](../glossary#position-ids) | |
| - **return_loss** (`bool`, *optional*) -- | |
| Whether or not to return the contrastive loss. | |
| - **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to interpolate the pre-trained position encodings. | |
| Returns `BlipOutput` or `tuple(torch.FloatTensor)`: a `BlipOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
| The [BlipModel](/docs/transformers/main/ko/model_doc/blip#transformers.BlipModel) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| - **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`) -- Contrastive loss for image-text similarity. | |
| - **logits_per_image** (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`) -- The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text | |
| similarity scores. | |
| - **logits_per_text** (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`) -- The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image | |
| similarity scores. | |
| - **text_embeds** (`torch.FloatTensor` of shape `(batch_size, output_dim)`) -- The text embeddings obtained by applying the projection layer to the pooled output of [BlipTextModel](/docs/transformers/main/ko/model_doc/blip#transformers.BlipTextModel). | |
| - **image_embeds** (`torch.FloatTensor` of shape `(batch_size, output_dim)`) -- The image embeddings obtained by applying the projection layer to the pooled output of [BlipVisionModel](/docs/transformers/main/ko/model_doc/blip#transformers.BlipVisionModel). | |
| - **text_model_output** (`~modeling_outputs.BaseModelOutputWithPooling`, defaults to `None`) -- The output of the [BlipTextModel](/docs/transformers/main/ko/model_doc/blip#transformers.BlipTextModel). | |
| - **vision_model_output** (`~modeling_outputs.BaseModelOutputWithPooling`, defaults to `None`) -- The output of the [BlipVisionModel](/docs/transformers/main/ko/model_doc/blip#transformers.BlipVisionModel). | |
| Examples: | |
| ```python | |
| >>> from PIL import Image | |
| >>> import httpx | |
| >>> from io import BytesIO | |
| >>> from transformers import AutoProcessor, BlipModel | |
| >>> model = BlipModel.from_pretrained("Salesforce/blip-image-captioning-base") | |
| >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> with httpx.stream("GET", url) as response: | |
| ...     image = Image.open(BytesIO(response.read())) | |
| >>> inputs = processor( | |
| ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True | |
| ... ) | |
| >>> outputs = model(**inputs) | |
| >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score | |
| >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities | |
| ``` | |
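For intuition about `logits_per_image`, the following plain-Python sketch reproduces the arithmetic on toy vectors: L2-normalize the embeddings, take dot products, scale them, and apply a softmax over the text candidates. The embedding values and the scale factor are made up for illustration, and the helpers `l2_normalize` and `softmax` are hypothetical rather than library code.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores: list[float]) -> list[float]:
    """Convert scores to probabilities; subtract the max for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (made-up numbers): one image and two candidate captions.
image_embed = l2_normalize([0.9, 0.1, 0.4])
text_embeds = [
    l2_normalize([1.0, 0.0, 0.5]),  # stands in for "a photo of a cat"
    l2_normalize([0.0, 1.0, 0.2]),  # stands in for "a photo of a dog"
]

logit_scale = 14.0  # illustrative multiplier, not a value read from a checkpoint
logits_per_image = [
    logit_scale * sum(i * t for i, t in zip(image_embed, text)) for text in text_embeds
]
probs = softmax(logits_per_image)
print(probs)
```

The first caption ends up with almost all of the probability mass because its toy embedding points in nearly the same direction as the image embedding.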
| **Parameters:** | |
| config ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights. | |
| **Returns:** | |
| `BlipOutput` or `tuple(torch.FloatTensor)` | |
| A `BlipOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
| #### get_text_features[[transformers.BlipModel.get_text_features]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L567) | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing | |
| through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns | |
| the classification token after processing through a linear layer and a tanh activation function. The linear | |
| layer weights are trained from the next sentence prediction (classification) objective during pretraining. | |
| - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attention weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
| Examples: | |
| ```python | |
| >>> from transformers import AutoProcessor, BlipModel | |
| >>> model = BlipModel.from_pretrained("Salesforce/blip-image-captioning-base") | |
| >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") | |
| >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt") | |
| >>> text_features = model.get_text_features(**inputs) | |
| ``` | |
| **Parameters:** | |
| input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) : Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using [AutoTokenizer](/docs/transformers/main/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and [PreTrainedTokenizer.__call__()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. [What are input IDs?](../glossary#input-ids) | |
| attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) : Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: - 1 for tokens that are **not masked**, - 0 for tokens that are **masked**. [What are attention masks?](../glossary#attention-mask) | |
| position_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) : Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`. [What are position IDs?](../glossary#position-ids) | |
| **Returns:** | |
| [BaseModelOutputWithPooling](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)` | |
| A [BaseModelOutputWithPooling](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
| #### get_image_features[[transformers.BlipModel.get_image_features]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L600) | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing | |
| through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns | |
| the classification token after processing through a linear layer and a tanh activation function. The linear | |
| layer weights are trained from the next sentence prediction (classification) objective during pretraining. | |
| - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attention weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
| Examples: | |
| ```python | |
| >>> from PIL import Image | |
| >>> import httpx | |
| >>> from io import BytesIO | |
| >>> from transformers import AutoProcessor, BlipModel | |
| >>> model = BlipModel.from_pretrained("Salesforce/blip-image-captioning-base") | |
| >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> with httpx.stream("GET", url) as response: | |
| ...     image = Image.open(BytesIO(response.read())) | |
| >>> inputs = processor(images=image, return_tensors="pt") | |
| >>> image_features = model.get_image_features(**inputs) | |
| ``` | |
| **Parameters:** | |
| pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) : The tensors corresponding to the input images. Pixel values can be obtained using [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor). See `BlipImageProcessor.__call__()` for details ([BlipProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipProcessor) uses [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor) for processing images). | |
| interpolate_pos_encoding (`bool`, *optional*, defaults to `False`) : Whether to interpolate the pre-trained position encodings. | |
| **Returns:** | |
| [BaseModelOutputWithPooling](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)` | |
| A [BaseModelOutputWithPooling](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
| ## BlipTextModel[[transformers.BlipTextModel]] | |
| #### transformers.BlipTextModel[[transformers.BlipTextModel]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip_text.py#L462) | |
| The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of | |
| cross-attention is added between the self-attention layers, following the architecture described in [Attention is | |
| all you need](https://huggingface.co/papers/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, | |
| Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. To behave as a decoder, the model must be called | |
| with `is_decoder` set to `True`; `encoder_hidden_states` is then expected as an input to the forward pass. | |
| #### forward[[transformers.BlipTextModel.forward]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip_text.py#L487) | |
| `forward(input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.Tensor | None = None, inputs_embeds: torch.Tensor | None = None, encoder_embeds: torch.Tensor | None = None, encoder_hidden_states: torch.Tensor | None = None, encoder_attention_mask: torch.Tensor | None = None, past_key_values: transformers.cache_utils.Cache | None = None, use_cache: bool | None = None, is_decoder: bool | None = False, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs])` | |
| encoder_hidden_states (`torch.FloatTensor`, *optional*): | |
| Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if | |
| the model is configured as a decoder. | |
| encoder_attention_mask (`torch.FloatTensor`, *optional*): | |
| Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in | |
| the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| past_key_values (`Cache`, *optional*): | |
| Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. | |
| If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that | |
| don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all | |
| `decoder_input_ids` of shape `(batch_size, sequence_length)`. | |
| use_cache (`bool`, *optional*): | |
| If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see | |
| `past_key_values`). | |
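The 1/0 convention described for `encoder_attention_mask` can be illustrated without the library. The following is a minimal sketch in plain Python, not the model's implementation: the helper `masked_softmax` and the score values are hypothetical, and it assumes that masked positions are excluded by replacing their scores with negative infinity before the softmax, which is one common way to apply such a mask.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores` that gives zero weight where `mask` is 0.

    Hypothetical helper for illustration: 1 marks a token that is
    attended to (not masked), 0 marks a padding token (masked).
    At least one position must be unmasked.
    """
    # Keep the score where mask == 1, replace it with -inf where mask == 0.
    biased = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up cross-attention scores for four encoder positions; the last two are padding.
weights = masked_softmax([2.0, 1.0, 0.5, 3.0], [1, 1, 0, 0])
print(weights)
```

Even though the last position has the highest raw score, it receives no attention weight because its mask value is 0.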
| ## BlipVisionModel[[transformers.BlipVisionModel]][[transformers.BlipVisionModel]] | |
| #### transformers.BlipVisionModel[[transformers.BlipVisionModel]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L463) | |
| #### forward[[transformers.BlipVisionModel.forward]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L483) | |
| `forward(pixel_values: torch.FloatTensor | None = None, interpolate_pos_encoding: bool = False, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs])` | |
| - **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, *optional*) -- | |
| The tensors corresponding to the input images. Pixel values can be obtained using | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor). See `BlipImageProcessor.__call__()` for details ([BlipProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipProcessor) uses | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor) for processing images). | |
| - **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to interpolate the pre-trained position encodings. | |
| **Returns:** | |
| [BaseModelOutputWithPooling](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)` | |
| A [BaseModelOutputWithPooling](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
| The [BlipVisionModel](/docs/transformers/main/ko/model_doc/blip#transformers.BlipVisionModel) forward method overrides the `__call__` special method. | |
| Although the forward pass is defined within this function, call the `Module` instance itself rather than | |
| `forward()` directly: the instance runs the pre- and post-processing steps, while a direct call silently skips them. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing | |
| through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns | |
| the classification token after processing through a linear layer and a tanh activation function. The linear | |
| layer weights are trained from the next sentence prediction (classification) objective during pretraining. | |
| - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attention weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
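The `sequence_length` dimension of `last_hidden_state` follows from the image and patch sizes. Here is a minimal sketch, assuming a ViT-style encoder that splits a square image into non-overlapping patches and prepends one classification token. The helper name `vision_output_shapes` is hypothetical, and the values 384, 16 and 768 are assumed to match the default vision configuration rather than taken from this page.

```python
def vision_output_shapes(batch_size, image_size, patch_size, hidden_size):
    """Return the expected (last_hidden_state, pooler_output) shapes.

    Hypothetical helper: assumes square images, non-overlapping patches,
    and a single classification token prepended to the patch sequence.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    sequence_length = patches_per_side**2 + 1  # +1 for the classification token
    last_hidden_state = (batch_size, sequence_length, hidden_size)
    pooler_output = (batch_size, hidden_size)
    return last_hidden_state, pooler_output


# Assumed values: 384x384 input, 16x16 patches, hidden size 768.
last_hidden, pooled = vision_output_shapes(1, 384, 16, 768)
print(last_hidden)  # (1, 577, 768)
print(pooled)       # (1, 768)
```

With `interpolate_pos_encoding=True` the input resolution may differ from the pre-training resolution, in which case `sequence_length` changes accordingly.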
| ## BlipForConditionalGeneration[[transformers.BlipForConditionalGeneration]][[transformers.BlipForConditionalGeneration]] | |
| #### transformers.BlipForConditionalGeneration[[transformers.BlipForConditionalGeneration]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L782) | |
| BLIP Model for image captioning. The model consists of a vision encoder and a text decoder. One can optionally pass | |
| `input_ids` to the model, which serve as a text prompt, to make the text decoder continue the prompt. If no text | |
| input is provided, the decoder starts generating the caption from the [BOS] (beginning-of-sequence) token only. | |
| This model inherits from [PreTrainedModel](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, | |
| etc.). | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage | |
| and behavior. | |
| #### forward[[transformers.BlipForConditionalGeneration.forward]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L809) | |
| `forward(pixel_values: FloatTensor, input_ids: torch.LongTensor | None = None, attention_mask: torch.LongTensor | None = None, labels: torch.LongTensor | None = None, interpolate_pos_encoding: bool = False, logits_to_keep: int | torch.Tensor = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs])` | |
| - **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) -- | |
| The tensors corresponding to the input images. Pixel values can be obtained using | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor). See `BlipImageProcessor.__call__()` for details ([BlipProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipProcessor) uses | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor) for processing images). | |
| - **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/main/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **attention_mask** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **labels** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., | |
| config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored | |
| (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. | |
| - **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to interpolate the pre-trained position encodings. | |
| - **logits_to_keep** (`Union[int, torch.Tensor]`, *optional*, defaults to `0`) -- | |
| If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all | |
| `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that | |
| token can save memory, which becomes pretty significant for long sequences or large vocabulary size. | |
| If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension. | |
| This is useful when using packed tensor format (single dimension for batch and sequence length). | |
| **Returns:** | |
| `BlipForConditionalGenerationModelOutput` or `tuple(torch.FloatTensor)` | |
| A `BlipForConditionalGenerationModelOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
| The [BlipForConditionalGeneration](/docs/transformers/main/ko/model_doc/blip#transformers.BlipForConditionalGeneration) forward method overrides the `__call__` special method. | |
| Although the forward pass is defined within this function, call the `Module` instance itself rather than | |
| `forward()` directly: the instance runs the pre- and post-processing steps, while a direct call silently skips them. | |
| - **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Language modeling loss from the text decoder. | |
| - **logits** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`, *optional*) -- Prediction scores of the language modeling head of the text decoder model. | |
| - **image_embeds** (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*) -- The image embeddings obtained after applying the Vision Transformer model to the input image. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, defaults to `None`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attention weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
| Examples: | |
| ```python | |
| >>> from PIL import Image | |
| >>> import httpx | |
| >>> from io import BytesIO | |
| >>> from transformers import AutoProcessor, BlipForConditionalGeneration | |
| >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") | |
| >>> model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base") | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> with httpx.stream("GET", url) as response: | |
| ... image = Image.open(BytesIO(response.read())) | |
| >>> text = "A picture of" | |
| >>> inputs = processor(images=image, text=text, return_tensors="pt") | |
| >>> outputs = model(**inputs) | |
| ``` | |
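How `labels` set to `-100` drop out of the loss can be shown with a small stand-alone sketch. It is plain Python rather than the library's loss code: `masked_lm_loss` is a hypothetical helper, the logits are made up, and it assumes a mean cross-entropy over the positions that are not ignored (the actual model may additionally shift the labels or apply label smoothing).

```python
import math

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def masked_lm_loss(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # an ignored position contributes nothing
        peak = max(row)
        # Log of the softmax normalizer, computed stably.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])  # negative log-likelihood
    if not losses:
        raise ValueError("all positions are ignored")
    return sum(losses) / len(losses)


# Made-up logits for three positions over a vocabulary of size 3.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
loss_all = masked_lm_loss(logits, [0, 1, 0])
loss_masked = masked_lm_loss(logits, [0, 1, IGNORE_INDEX])
print(loss_all, loss_masked)
```

The third position is a confident wrong prediction; marking it with `-100` removes its contribution, so `loss_masked` is lower than `loss_all`.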
| **Parameters:** | |
| config ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights. | |
| ## BlipForImageTextRetrieval[[transformers.BlipForImageTextRetrieval]][[transformers.BlipForImageTextRetrieval]] | |
| #### transformers.BlipForImageTextRetrieval[[transformers.BlipForImageTextRetrieval]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L1178) | |
| BLIP Model with a vision and text projector, and a classification head on top. The model is used in the context of | |
| image-text retrieval. Given an image and a text, the model returns the probability of the text being relevant to | |
| the image. | |
| This model inherits from [PreTrainedModel](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, | |
| etc.). | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage | |
| and behavior. | |
| #### forward[[transformers.BlipForImageTextRetrieval.forward]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L1217) | |
| `forward(input_ids: LongTensor, pixel_values: FloatTensor, use_itm_head: bool | None = True, attention_mask: torch.LongTensor | None = None, interpolate_pos_encoding: bool = False, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs])` | |
| - **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/main/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) -- | |
| The tensors corresponding to the input images. Pixel values can be obtained using | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor). See `BlipImageProcessor.__call__()` for details ([BlipProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipProcessor) uses | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor) for processing images). | |
| - **use_itm_head** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to use the image-text matching head. | |
| - **attention_mask** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to interpolate the pre-trained position encodings. | |
| **Returns:** | |
| `BlipTextVisionModelOutput` or `tuple(torch.FloatTensor)` | |
| A `BlipTextVisionModelOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
| The [BlipForImageTextRetrieval](/docs/transformers/main/ko/model_doc/blip#transformers.BlipForImageTextRetrieval) forward method overrides the `__call__` special method. | |
| Although the forward pass is defined within this function, call the `Module` instance itself rather than | |
| `forward()` directly: the instance runs the pre- and post-processing steps, while a direct call silently skips them. | |
| - **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Language modeling loss from the text decoder. | |
| - **image_embeds** (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when the model is initialized with `with_projection=True`) -- The image embeddings obtained by applying the projection layer to the pooler_output. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, defaults to `None`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **hidden_states** (`tuple[torch.FloatTensor, ...]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple[torch.FloatTensor, ...]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attention weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
| Examples: | |
| ```python | |
| >>> from PIL import Image | |
| >>> import httpx | |
| >>> from io import BytesIO | |
| >>> from transformers import AutoProcessor, BlipForImageTextRetrieval | |
| >>> model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco") | |
| >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-itm-base-coco") | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> with httpx.stream("GET", url) as response: | |
| ... image = Image.open(BytesIO(response.read())) | |
| >>> text = "an image of a cat" | |
| >>> inputs = processor(images=image, text=text, return_tensors="pt") | |
| >>> outputs = model(**inputs) | |
| ``` | |
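To read the output of the image-text matching head as a probability, the two logits produced for each image-text pair can be passed through a softmax. This is a minimal sketch in plain Python with made-up logits. It assumes, following the original BLIP demo code, that index 1 is the "match" class, which is not stated on this page. `match_probability` is a hypothetical helper, not part of the library.

```python
import math


def match_probability(itm_logits):
    """Softmax over [no_match, match] logits; returns P(match).

    Assumption: index 1 holds the "match" logit.
    """
    peak = max(itm_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in itm_logits]
    return exps[1] / sum(exps)


# Hypothetical logits for one image-text pair, not real model output.
prob = match_probability([-1.5, 2.0])
print(f"{prob:.3f}")
```

When `use_itm_head=False` is passed, the model is expected to return a similarity score instead of two-class logits, so this softmax step would not apply.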
| **Parameters:** | |
| config ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights. | |
| ## BlipForQuestionAnswering[[transformers.BlipForQuestionAnswering]][[transformers.BlipForQuestionAnswering]] | |
| #### transformers.BlipForQuestionAnswering[[transformers.BlipForQuestionAnswering]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L955) | |
| BLIP Model for visual question answering. The model consists of a vision encoder, a text encoder as well as a text | |
| decoder. The vision encoder will encode the input image, the text encoder will encode the input question together | |
| with the encoding of the image, and the text decoder will output the answer to the question. | |
| This model inherits from [PreTrainedModel](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, | |
| etc.). | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage | |
| and behavior. | |
| #### forward[[transformers.BlipForQuestionAnswering.forward]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/blip/modeling_blip.py#L983) | |
| `forward(input_ids: LongTensor, pixel_values: FloatTensor, decoder_input_ids: torch.LongTensor | None = None, decoder_attention_mask: torch.LongTensor | None = None, attention_mask: torch.LongTensor | None = None, labels: torch.LongTensor | None = None, interpolate_pos_encoding: bool = False, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs])` | |
| - **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/main/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **pixel_values** (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) -- | |
| The tensors corresponding to the input images. Pixel values can be obtained using | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor). See `BlipImageProcessor.__call__()` for details ([BlipProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipProcessor) uses | |
| [BlipImageProcessor](/docs/transformers/main/ko/model_doc/blip#transformers.BlipImageProcessor) for processing images). | |
| - **decoder_input_ids** (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*) -- | |
| Indices of decoder input sequence tokens in the vocabulary. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/main/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are decoder input IDs?](../glossary#decoder-input-ids) | |
| - **decoder_attention_mask** (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to | |
| make sure the model can only look at previous inputs in order to predict the future. | |
| - **attention_mask** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **labels** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., | |
| config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored | |
| (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. | |
| - **interpolate_pos_encoding** (`bool`, *optional*, defaults to `False`) -- | |
| Whether to interpolate the pre-trained position encodings. | |
| **Returns:** | |
| `BlipTextVisionModelOutput` or `tuple(torch.FloatTensor)` | |
| A `BlipTextVisionModelOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
| The [BlipForQuestionAnswering](/docs/transformers/main/ko/model_doc/blip#transformers.BlipForQuestionAnswering) forward method overrides the `__call__` special method. | |
| Although the forward pass is defined within this function, call the `Module` instance itself rather than | |
| `forward()` directly: the instance runs the pre- and post-processing steps, while a direct call silently skips them. | |
| - **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Language modeling loss from the text decoder. | |
| - **image_embeds** (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when the model is initialized with `with_projection=True`) -- The image embeddings obtained by applying the projection layer to the pooler_output. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, defaults to `None`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| - **hidden_states** (`tuple[torch.FloatTensor, ...]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple[torch.FloatTensor, ...]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attention weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
| Examples: | |
| ```python | |
| >>> from PIL import Image | |
| >>> import httpx | |
| >>> from io import BytesIO | |
| >>> from transformers import AutoProcessor, BlipForQuestionAnswering | |
| >>> model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base") | |
| >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base") | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> with httpx.stream("GET", url) as response: | |
| ... image = Image.open(BytesIO(response.read())) | |
| >>> # training | |
| >>> text = "How many cats are in the picture?" | |
| >>> label = "2" | |
| >>> inputs = processor(images=image, text=text, return_tensors="pt") | |
| >>> labels = processor(text=label, return_tensors="pt").input_ids | |
| >>> inputs["labels"] = labels | |
| >>> outputs = model(**inputs) | |
| >>> loss = outputs.loss | |
| >>> loss.backward() | |
| >>> # inference | |
| >>> text = "How many cats are in the picture?" | |
| >>> inputs = processor(images=image, text=text, return_tensors="pt") | |
| >>> outputs = model.generate(**inputs) | |
| >>> print(processor.decode(outputs[0], skip_special_tokens=True)) | |
| 2 | |
| ``` | |
| **Parameters:** | |
| config ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights. | |
| **Returns:** | |
| `BlipTextVisionModelOutput` or `tuple(torch.FloatTensor)` | |
| A `BlipTextVisionModelOutput` or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([BlipConfig](/docs/transformers/main/ko/model_doc/blip#transformers.BlipConfig)) and inputs. | |
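
The note at the top of this method's description, that the `Module` instance should be called rather than `forward` directly, can be illustrated with a small sketch in plain Python. `ToyModule` and `register_post_hook` are hypothetical names made up for this illustration; they are not PyTorch or Transformers APIs. The class only mimics the general idea that `__call__` wraps `forward` with extra steps. In PyTorch itself, the corresponding mechanism is forward hooks registered with `register_forward_hook`.

```python
class ToyModule:
    """Toy stand-in for a module with post-processing hooks (hypothetical, not PyTorch code)."""

    def __init__(self):
        self._post_hooks = []  # callables run after forward()

    def register_post_hook(self, hook):
        # Hypothetical helper: remember a callable to run after every call.
        self._post_hooks.append(hook)

    def forward(self, x):
        # The "recipe" for the computation lives here.
        return x * 2

    def __call__(self, x):
        # Calling the instance runs forward() and then every registered hook.
        output = self.forward(x)
        for hook in self._post_hooks:
            result = hook(self, x, output)
            if result is not None:
                output = result  # a hook may replace the output
        return output


seen = []
module = ToyModule()
module.register_post_hook(lambda mod, inp, out: seen.append(out))

called = module(3)          # goes through __call__, so the hook fires
direct = module.forward(3)  # bypasses __call__, so the hook is skipped
```

Both calls return the same value here, but only the call through the instance is recorded in `seen`. This is why the examples above use `model(**inputs)` instead of `model.forward(**inputs)`.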