| <!--Copyright 2021 The HuggingFace Team. All rights reserved. | |
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | |
| the License. You may obtain a copy of the License at | |
| http://www.apache.org/licenses/LICENSE-2.0 | |
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | |
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | |
| specific language governing permissions and limitations under the License. | |
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
| --> | |
| # CLIP[[clip]] | |
| ## κ°μ[[overview]] | |
The CLIP model was proposed in [Learning Transferable Visual Models From Natural Language Supervision](https://huggingface.co/papers/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet given an image, without directly optimizing for that task, similar to the zero-shot capabilities of GPT-2 and 3.

The abstract from the paper is the following:

*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.*

This model was contributed by [valhalla](https://huggingface.co/valhalla).
The original code can be found [here](https://github.com/openai/CLIP).
| ## μ¬μ© νκ³Ό μμ[[usage-tips-and-example]] | |
CLIP is a multimodal vision and language model. It can be used for image-text similarity and for zero-shot image classification. CLIP uses a ViT-like transformer to extract visual features and a causal language model to extract text features. Both the text and visual features are then projected into a latent space of identical dimension. The dot product between the projected image and text features is then used as the similarity score.

To feed images into the Transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as the representation of the entire image. The authors also add absolute position embeddings and feed the resulting sequence of vectors into a standard Transformer encoder. The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model.

The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps [`CLIPImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images.

The following example shows how to obtain image-text similarity scores using [`CLIPProcessor`] and [`CLIPModel`].
| ```python | |
| >>> from PIL import Image | |
| >>> import requests | |
| >>> from transformers import CLIPProcessor, CLIPModel | |
| >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") | |
| >>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> image = Image.open(requests.get(url, stream=True).raw) | |
| >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) | |
| >>> outputs = model(**inputs) | |
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
| ``` | |
### Combining CLIP and Flash Attention 2[[combining-clip-and-flash-attention-2]]

First, make sure to install the latest version of Flash Attention 2.
| ```bash | |
| pip install -U flash-attn --no-build-isolation | |
| ``` | |
| νλμ μ΄ν μ 2μ νΈνλλ νλμ¨μ΄λ₯Ό κ°μ§κ³ μλμ§ νμΈνμΈμ. μ΄μ λν μμΈν λ΄μ©μ flash-attn 리ν¬μ§ν 리μ 곡μλ¬Έμμμ νμΈν μ μμ΅λλ€. λν λͺ¨λΈμ λ°μ λ°λ(`torch.float16`)λ‘ λ‘λνλ κ²μ μμ§ λ§μΈμ. | |
| <Tip warning={true}> | |
| μμ λ°°μΉ ν¬κΈ°λ₯Ό μ¬μ©ν λ, νλμ μ΄ν μ μ μ¬μ©νλ©΄ λͺ¨λΈμ΄ λλ €μ§λ κ²μ λλ μ μμ΅λλ€.μλμ [νλμ μ΄ν μ κ³Ό SDPAλ₯Ό μ¬μ©ν μμ μλ ν₯μ](#Expected-speedups-with-Flash-Attention-and-SDPA) μΉμ μ μ°Έμ‘°νμ¬ μ μ ν μ΄ν μ ꡬνμ μ ννμΈμ. | |
| </Tip> | |
| νλμ μ΄ν μ 2λ₯Ό μ¬μ©ν΄μ λͺ¨λΈμ λ‘λνκ³ κ΅¬λνκΈ° μν΄μ λ€μ μ€λν«μ μ°Έκ³ νμΈμ: | |
| ```python | |
| >>> import torch | |
| >>> import requests | |
| >>> from PIL import Image | |
| >>> from transformers import CLIPProcessor, CLIPModel | |
| >>> device = "cuda" | |
| >>> torch_dtype = torch.float16 | |
| >>> model = CLIPModel.from_pretrained( | |
| ... "openai/clip-vit-base-patch32", | |
| ... attn_implementation="flash_attention_2", | |
| ... device_map=device, | |
| ... torch_dtype=torch_dtype, | |
| ... ) | |
| >>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> image = Image.open(requests.get(url, stream=True).raw) | |
| >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) | |
| >>> inputs.to(device) | |
| >>> with torch.no_grad(): | |
| ... with torch.autocast(device): | |
| ... outputs = model(**inputs) | |
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
| >>> print(probs) | |
| tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16) | |
| ``` | |
| ### μ€μΌμΌλ λ΄μ μ΄ν μ (Scaled dot-product Attention(SDPA)) μ¬μ©νκΈ°[[using-scaled-dot-product-attention-sdpa]] | |
| νμ΄ν μΉλ `torch.nn.functional`μ μΌλΆλ‘ λ€μ΄ν°λΈ μ€μΌμΌλ λ΄μ μ΄ν μ (SPDA) μ°μ°μλ₯Ό ν¬ν¨νκ³ μμ΅λλ€. μ΄ ν¨μλ μ λ ₯κ³Ό μ¬μ© μ€μΈ νλμ¨μ΄μ λ°λΌ μ μ©λ μ μλ μ¬λ¬ ꡬνμ ν¬ν¨ν©λλ€. μμΈν μ 보λ [곡μλ¬Έμ](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)λ [GPU μΆλ‘ ](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) νμ΄μ§λ₯Ό μ°Έμ‘°νμΈμ. | |
| `torch>=2.1.1`μμλ ꡬνμ΄ κ°λ₯ν λ SDPAκ° κΈ°λ³Έμ μΌλ‘ μ¬μ©λμ§λ§, `from_pretrained()` ν¨μμμ `attn_implementation="sdpa"`λ₯Ό μ€μ νμ¬ SDPAλ₯Ό λͺ μμ μΌλ‘ μ¬μ©νλλ‘ μμ²ν μλ μμ΅λλ€. | |
```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa")
```

For the best speedups, we recommend loading the model in half precision (e.g. `torch.float16` or `torch.bfloat16`).
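For reference, a rough end-to-end sketch of SDPA inference is shown below; it mirrors the Flash Attention 2 example above (same checkpoint and inputs), with `torch.bfloat16` chosen here only as one possible half-precision dtype:

```python
>>> import torch
>>> import requests
>>> from PIL import Image
>>> from transformers import CLIPProcessor, CLIPModel

>>> device = "cuda"

>>> # Explicitly request SDPA and load the weights in half precision.
>>> model = CLIPModel.from_pretrained(
...     "openai/clip-vit-base-patch32",
...     attn_implementation="sdpa",
...     device_map=device,
...     torch_dtype=torch.bfloat16,
... )
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> inputs = inputs.to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> probs = outputs.logits_per_image.softmax(dim=1)  # label probabilities
```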
| ### νλμ μ΄ν μ κ³Ό μ€μΌμΌλ λ΄μ μ΄ν μ (SDPA)μΌλ‘ μΈν΄ μμλλ μλν₯μ[[expected-speedups-with-flash-attention-and-sdpa]] | |
| λ‘컬 λ²€μΉλ§ν¬(NVIDIA A10G, PyTorch 2.3.1+cu121)μμ `float16`μ μ¬μ©νμ¬ `"openai/clip-vit-large-patch14"` 체ν¬ν¬μΈνΈλ‘ μΆλ‘ μ μννμ λ, λ€μκ³Ό κ°μ μλ ν₯μμ νμΈ νμ΅λλ€. | |
| [μ½λ](https://gist.github.com/qubvel/ac691a54e54f9fae8144275f866a7ff8): | |
| #### CLIPTextModel[[cliptextmodel]] | |
| | Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | | |
| |------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| | |
| | 4 | 0.009 | 0.012 | 0.737 | 0.007 | 1.269 | | |
| | 16 | 0.009 | 0.014 | 0.659 | 0.008 | 1.187 | | |
| | 32 | 0.018 | 0.021 | 0.862 | 0.016 | 1.142 | | |
| | 64 | 0.034 | 0.034 | 1.001 | 0.03 | 1.163 | | |
| | 128 | 0.063 | 0.058 | 1.09 | 0.054 | 1.174 | | |
|  | |
| #### CLIPVisionModel[[clipvisionmodel]] | |
| | Image batch size | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | | |
| |-------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| | |
| | 1 | 0.016 | 0.013 | 1.247 | 0.012 | 1.318 | | |
| | 4 | 0.025 | 0.021 | 1.198 | 0.021 | 1.202 | | |
| | 16 | 0.093 | 0.075 | 1.234 | 0.075 | 1.24 | | |
| | 32 | 0.181 | 0.147 | 1.237 | 0.146 | 1.241 | | |
|  | |
| #### CLIPModel[[clipmodel]] | |
| | Image batch size | Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup | | |
| |-------------------:|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:| | |
| | 1 | 4 | 0.025 | 0.026 | 0.954 | 0.02 | 1.217 | | |
| | 1 | 16 | 0.026 | 0.028 | 0.918 | 0.02 | 1.287 | | |
| | 1 | 64 | 0.042 | 0.046 | 0.906 | 0.036 | 1.167 | | |
| | 4 | 4 | 0.028 | 0.033 | 0.849 | 0.024 | 1.189 | | |
| | 4 | 16 | 0.034 | 0.035 | 0.955 | 0.029 | 1.169 | | |
| | 4 | 64 | 0.059 | 0.055 | 1.072 | 0.05 | 1.179 | | |
| | 16 | 4 | 0.096 | 0.088 | 1.091 | 0.078 | 1.234 | | |
| | 16 | 16 | 0.102 | 0.09 | 1.129 | 0.083 | 1.224 | | |
| | 16 | 64 | 0.127 | 0.11 | 1.157 | 0.105 | 1.218 | | |
| | 32 | 4 | 0.185 | 0.159 | 1.157 | 0.149 | 1.238 | | |
| | 32 | 16 | 0.19 | 0.162 | 1.177 | 0.154 | 1.233 | | |
| | 32 | 64 | 0.216 | 0.181 | 1.19 | 0.176 | 1.228 | | |
| ## μλ£[[resources]] | |
| CLIPμ μμνλ λ° λμμ΄ λλ Hugging Faceμ community μλ£ λͺ©λ‘(πλ‘ νμλ¨) μ λλ€. | |
| - [μ격 μΌμ± (μΈκ³΅μμ±) μ΄λ―Έμ§μ μΊ‘μ μ κ°μ§κ³ CLIP λ―ΈμΈμ‘°μ νκΈ°](https://huggingface.co/blog/fine-tune-clip-rsicd): | |
| [RSICD dataset](https://github.com/201528014227051/RSICD_optimal)μ κ°μ§κ³ CLIPμ λ―ΈμΈμ‘°μ νλ λ°©λ²κ³Ό λ°μ΄ν° μ¦κ°μ λν μ±λ₯ λΉκ΅μ λν λΈλ‘κ·Έ ν¬μ€νΈ | |
| - μ΄ [μμ μ€ν¬λ¦½νΈ](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text)λ [COCO dataset](https://cocodataset.org/#home)λ₯Ό μ΄μ©ν μ¬μ νμ΅λ λΉμ κ³Ό ν μ€νΈμ μΈμ½λλ₯Ό μ¬μ©ν΄μ CLIPκ°μ λΉμ -ν μ€νΈ λμΌ λͺ¨λΈμ μ΄λ»κ² νμ΅μν€λμ§ λ³΄μ¬μ€λλ€. | |
| <PipelineTag pipeline="image-to-text"/> | |
| - μ¬μ νμ΅λ CLIPλͺ¨λΈμ μ΄λ―Έμ§ μΊ‘μ λμ μν λΉμμΉ μΆλ‘ μ μ΄λ»κ² νμ©νλμ§μ κ΄ν [λ ΈνΈλΆ](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing) | |
| **μ΄λ―Έμ§ κ²μ** | |
| - μ¬μ νμ΅λ CLIPλͺ¨λΈκ³Ό MRR(Mean Reciprocal Rank) μ μ μ°μ°μ μ¬μ©ν μ΄λ―Έμ§ κ²μμ λν [λ ΈνΈλΆ](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing). π | |
| - μ΄λ―Έμ§ κ²μκ³Ό μ μ¬μ± μ μμ λν΄ λ³΄μ¬μ£Όλ [λ ΈνΈλΆ](https://colab.research.google.com/github/deep-diver/image_search_with_natural_language/blob/main/notebooks/Image_Search_CLIP.ipynb). π | |
| - Multilingual CLIPλ₯Ό μ¬μ©ν΄μ μ΄λ―Έμ§μ ν μ€νΈλ₯Ό μ΄λ»κ² κ°μ λ²‘ν° κ³΅κ°μ λ§€ν μν€λμ§μ λν [λ ΈνΈλΆ](https://colab.research.google.com/drive/1xO-wC_m_GNzgjIBQ4a4znvQkvDoZJvH4?usp=sharing). π | |
| - [Unsplash](https://unsplash.com)μ [TMDB](https://www.themoviedb.org/) λ°μ΄ν°μ μ νμ©ν μλ―Έλ‘ μ (semantic) μ΄λ―Έμ§ κ²μμμ CLIPμ ꡬλνλ λ°©λ²μ λν [λ ΈνΈλΆ](https://colab.research.google.com/github/vivien000/clip-demo/blob/master/clip.ipynb#scrollTo=uzdFhRGqiWkR). π | |
| **μ€λͺ κ°λ₯μ±** | |
| - μ λ ₯ ν ν°κ³Ό μ΄λ―Έμ§ μ‘°κ°(segment) μ¬μ΄μ μ μ¬μ±μ μκ°ν μν€λ λ°©λ²μ λν [λ ΈνΈλΆ](https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/CLIP_explainability.ipynb). π | |
| μ¬κΈ°μ ν¬ν¨λ μλ£λ₯Ό μ μΆνκ³ μΆμΌμλ€λ©΄ PR(Pull Request)λ₯Ό μ΄μ΄μ£ΌμΈμ. 리뷰 ν΄λλ¦¬κ² μ΅λλ€! μλ£λ κΈ°μ‘΄ μλ£λ₯Ό 볡μ νλ λμ μλ‘μ΄ λ΄μ©μ λ΄κ³ μμ΄μΌ ν©λλ€. | |
| ## CLIPConfig[[transformers.CLIPConfig]] | |
| [[autodoc]] CLIPConfig | |
| - from_text_vision_configs | |
| ## CLIPTextConfig[[transformers.CLIPTextConfig]] | |
| [[autodoc]] CLIPTextConfig | |
| ## CLIPVisionConfig[[transformers.CLIPVisionConfig]] | |
| [[autodoc]] CLIPVisionConfig | |
| ## CLIPTokenizer[[transformers.CLIPTokenizer]] | |
| [[autodoc]] CLIPTokenizer | |
| - build_inputs_with_special_tokens | |
| - get_special_tokens_mask | |
| - create_token_type_ids_from_sequences | |
| - save_vocabulary | |
| ## CLIPTokenizerFast[[transformers.CLIPTokenizerFast]] | |
| [[autodoc]] CLIPTokenizerFast | |
| ## CLIPImageProcessor[[transformers.CLIPImageProcessor]] | |
| [[autodoc]] CLIPImageProcessor | |
| - preprocess | |
| ## CLIPFeatureExtractor[[transformers.CLIPFeatureExtractor]] | |
| [[autodoc]] CLIPFeatureExtractor | |
| ## CLIPProcessor[[transformers.CLIPProcessor]] | |
| [[autodoc]] CLIPProcessor | |
| <frameworkcontent> | |
| <pt> | |
| ## CLIPModel[[transformers.CLIPModel]] | |
| [[autodoc]] CLIPModel | |
| - forward | |
| - get_text_features | |
| - get_image_features | |
| ## CLIPTextModel[[transformers.CLIPTextModel]] | |
| [[autodoc]] CLIPTextModel | |
| - forward | |
| ## CLIPTextModelWithProjection[[transformers.CLIPTextModelWithProjection]] | |
| [[autodoc]] CLIPTextModelWithProjection | |
| - forward | |
| ## CLIPVisionModelWithProjection[[transformers.CLIPVisionModelWithProjection]] | |
| [[autodoc]] CLIPVisionModelWithProjection | |
| - forward | |
| ## CLIPVisionModel[[transformers.CLIPVisionModel]] | |
| [[autodoc]] CLIPVisionModel | |
| - forward | |
| ## CLIPForImageClassification[[transformers.CLIPForImageClassification]] | |
| [[autodoc]] CLIPForImageClassification | |
| - forward | |
| </pt> | |
| <tf> | |
| ## TFCLIPModel[[transformers.TFCLIPModel]] | |
| [[autodoc]] TFCLIPModel | |
| - call | |
| - get_text_features | |
| - get_image_features | |
| ## TFCLIPTextModel[[transformers.TFCLIPTextModel]] | |
| [[autodoc]] TFCLIPTextModel | |
| - call | |
| ## TFCLIPVisionModel[[transformers.TFCLIPVisionModel]] | |
| [[autodoc]] TFCLIPVisionModel | |
| - call | |
| </tf> | |
| <jax> | |
| ## FlaxCLIPModel[[transformers.FlaxCLIPModel]] | |
| [[autodoc]] FlaxCLIPModel | |
| - __call__ | |
| - get_text_features | |
| - get_image_features | |
| ## FlaxCLIPTextModel[[transformers.FlaxCLIPTextModel]] | |
| [[autodoc]] FlaxCLIPTextModel | |
| - __call__ | |
| ## FlaxCLIPTextModelWithProjection[[transformers.FlaxCLIPTextModelWithProjection]] | |
| [[autodoc]] FlaxCLIPTextModelWithProjection | |
| - __call__ | |
| ## FlaxCLIPVisionModel[[transformers.FlaxCLIPVisionModel]] | |
| [[autodoc]] FlaxCLIPVisionModel | |
| - __call__ | |
| </jax> | |
| </frameworkcontent> | |