
CLIP[[clip]]

Overview[[overview]]

CLIP λͺ¨λΈμ€ Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskeverκ°€ μ œμ•ˆν•œ μžμ—°μ–΄ 지도(supervision)λ₯Ό ν†΅ν•œ 전이 κ°€λŠ₯ν•œ μ‹œκ° λͺ¨λΈ ν•™μŠ΅λΌλŠ” λ…Όλ¬Έμ—μ„œ μ†Œκ°œλ˜μ—ˆμŠ΅λ‹ˆλ‹€. CLIP(Contrastive Language-Image Pre-Training)은 λ‹€μ–‘ν•œ 이미지와 ν…μŠ€νŠΈ 쌍으둜 ν›ˆλ ¨λœ 신경망 μž…λ‹ˆλ‹€. GPT-2와 3의 μ œλ‘œμƒ· λŠ₯λ ₯κ³Ό μœ μ‚¬ν•˜κ²Œ, ν•΄λ‹Ή μž‘μ—…μ— μ§μ ‘μ μœΌλ‘œ μ΅œμ ν™”ν•˜μ§€ μ•Šκ³ λ„ μ£Όμ–΄μ§„ 이미지에 λŒ€ν•΄ κ°€μž₯ κ΄€λ ¨μ„± μžˆλŠ” ν…μŠ€νŠΈ μŠ€λ‹ˆνŽ«μ„ μ˜ˆμΈ‘ν•˜λ„λ‘ μžμ—°μ–΄λ‘œ μ§€μ‹œν•  수 μžˆμŠ΅λ‹ˆλ‹€.

ν•΄λ‹Ή λ…Όλ¬Έμ˜ μ΄ˆλ‘μž…λ‹ˆλ‹€.

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million image-text pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.

이 λͺ¨λΈμ€ valhalla에 μ˜ν•΄ κΈ°μ—¬λ˜μ—ˆμŠ΅λ‹ˆλ‹€. 원본 μ½”λ“œλŠ” μ΄κ³³μ—μ„œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€.

Usage tips and example[[usage-tips-and-example]]

CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. CLIP uses a ViT-like Transformer to extract visual features and a causal language model to extract text features. The text and visual features are then projected into a latent space of identical dimension, and the dot product between the projected image and text features is used as the similarity score.
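
This projection-and-dot-product logic can also be reproduced by hand with [CLIPModel]'s get_text_features and get_image_features. The sketch below is illustrative (the checkpoint, image URL, and labels match the full example further down):

import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # Encode and project each modality into the shared latent space
    text_inputs = processor(text=["a photo of a cat", "a photo of a dog"], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))

# L2-normalize, then use the dot product as the similarity score
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # [CLIPModel] additionally scales this by logit_scale.exp()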

To feed images into the Transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as the representation of the entire image. The authors also add absolute position embeddings and feed the resulting sequence of vectors into a standard Transformer encoder. The [CLIPImageProcessor] can be used to resize (or rescale) and normalize images for the model.
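
As a concrete check of the patch arithmetic, assuming the openai/clip-vit-base-patch32 checkpoint (224x224 inputs, 32x32 patches), the sequence length seen by the vision encoder can be inspected directly:

import requests
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224]) after resize + normalize

outputs = vision_model(**inputs)
# (224 / 32)**2 = 49 patches, plus the [CLS] token -> 50 positions
print(outputs.last_hidden_state.shape)  # torch.Size([1, 50, 768])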

The [CLIPTokenizer] is used to encode the text. The [CLIPProcessor] wraps [CLIPImageProcessor] and [CLIPTokenizer] into a single instance that both encodes the text and prepares the images.

λ‹€μŒ μ˜ˆμ‹œλŠ” [CLIPProcessor]와 [CLIPModel]을 μ‚¬μš©ν•˜μ—¬ 이미지-ν…μŠ€νŠΈ μœ μ‚¬λ„ 점수λ₯Ό μ–»λŠ” 방법을 λ³΄μ—¬μ€λ‹ˆλ‹€.

>>> from PIL import Image
>>> import requests

>>> from transformers import CLIPProcessor, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
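
For zero-shot image classification specifically, the same checkpoint can also be driven through the zero-shot-image-classification pipeline, which takes care of the prompt formatting and the softmax shown above (the labels here are illustrative):

from transformers import pipeline

classifier = pipeline(task="zero-shot-image-classification", model="openai/clip-vit-base-patch32")
result = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["a photo of a cat", "a photo of a dog"],
)
print(result)  # a list of {"label": ..., "score": ...} dicts sorted by score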

Combining CLIP and Flash Attention 2[[combining-clip-and-flash-attention-2]]

First, install the latest version of Flash Attention 2.

pip install -U flash-attn --no-build-isolation

ν”Œλž˜μ‹œ μ–΄ν…μ…˜2와 ν˜Έν™˜λ˜λŠ” ν•˜λ“œμ›¨μ–΄λ₯Ό κ°€μ§€κ³  μžˆλŠ”μ§€ ν™•μΈν•˜μ„Έμš”. 이에 λŒ€ν•œ μžμ„Έν•œ λ‚΄μš©μ€ flash-attn λ¦¬ν¬μ§€ν† λ¦¬μ˜ κ³΅μ‹λ¬Έμ„œμ—μ„œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€. λ˜ν•œ λͺ¨λΈμ„ λ°˜μ •λ°€λ„(torch.float16)둜 λ‘œλ“œν•˜λŠ” 것을 μžŠμ§€ λ§ˆμ„Έμš”.

μž‘μ€ 배치 크기λ₯Ό μ‚¬μš©ν•  λ•Œ, ν”Œλž˜μ‹œ μ–΄ν…μ…˜μ„ μ‚¬μš©ν•˜λ©΄ λͺ¨λΈμ΄ λŠλ €μ§€λŠ” 것을 λŠλ‚„ 수 μžˆμŠ΅λ‹ˆλ‹€.μ•„λž˜μ˜ ν”Œλž˜μ‹œ μ–΄ν…μ…˜κ³Ό SDPAλ₯Ό μ‚¬μš©ν•œ μ˜ˆμƒ 속도 ν–₯상 μ„Ήμ…˜μ„ μ°Έμ‘°ν•˜μ—¬ μ μ ˆν•œ μ–΄ν…μ…˜ κ΅¬ν˜„μ„ μ„ νƒν•˜μ„Έμš”.

ν”Œλž˜μ‹œ μ–΄ν…μ…˜2λ₯Ό μ‚¬μš©ν•΄μ„œ λͺ¨λΈμ„ λ‘œλ“œν•˜κ³  κ΅¬λ™ν•˜κΈ° μœ„ν•΄μ„œ λ‹€μŒ μŠ€λ‹ˆνŽ«μ„ μ°Έκ³ ν•˜μ„Έμš”:

>>> import torch
>>> import requests
>>> from PIL import Image

>>> from transformers import CLIPProcessor, CLIPModel

>>> device = "cuda"
>>> torch_dtype = torch.float16

>>> model = CLIPModel.from_pretrained(
...     "openai/clip-vit-base-patch32",
...     attn_implementation="flash_attention_2",
...     device_map=device,
...     torch_dtype=torch_dtype,
... )
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> inputs.to(device)

>>> with torch.no_grad():
...     with torch.autocast(device):
...         outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
>>> print(probs)
tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16)

μŠ€μΌ€μΌλœ 내적 μ–΄ν…μ…˜ (Scaled dot-product Attention(SDPA)) μ‚¬μš©ν•˜κΈ°[[using-scaled-dot-product-attention-sdpa]]

νŒŒμ΄ν† μΉ˜λŠ” torch.nn.functional의 μΌλΆ€λ‘œ λ„€μ΄ν‹°λΈŒ μŠ€μΌ€μΌλœ 내적 μ–΄ν…μ…˜(SPDA) μ—°μ‚°μžλ₯Ό ν¬ν•¨ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. 이 ν•¨μˆ˜λŠ” μž…λ ₯κ³Ό μ‚¬μš© 쀑인 ν•˜λ“œμ›¨μ–΄μ— 따라 적용될 수 μžˆλŠ” μ—¬λŸ¬ κ΅¬ν˜„μ„ ν¬ν•¨ν•©λ‹ˆλ‹€. μžμ„Έν•œ μ •λ³΄λŠ” κ³΅μ‹λ¬Έμ„œλ‚˜ GPU μΆ”λ‘  νŽ˜μ΄μ§€λ₯Ό μ°Έμ‘°ν•˜μ„Έμš”.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request that SDPA be used.

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa")

For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).

ν”Œλž˜μ‹œ μ–΄ν…μ…˜κ³Ό μŠ€μΌ€μΌλœ 내적 μ–΄ν…μ…˜(SDPA)으둜 인해 μ˜ˆμƒλ˜λŠ” 속도ν–₯상[[expected-speedups-with-flash-attention-and-sdpa]]

On a local benchmark (NVIDIA A10G, PyTorch 2.3.1+cu121) with float16, we saw the following speedups during inference for the "openai/clip-vit-large-patch14" checkpoint. Code:
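
A minimal sketch of such a timing comparison, assuming a CUDA device and an illustrative workload (the batch size, label count, and iteration count below are arbitrary, not the original benchmark's settings), might look like this:

import time
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-large-patch14"
processor = CLIPProcessor.from_pretrained(checkpoint)

# Illustrative workload: 4 (dummy) images, 16 text labels
images = [Image.new("RGB", (224, 224))] * 4
texts = [f"label {i}" for i in range(16)]
inputs = processor(text=texts, images=images, return_tensors="pt", padding=True).to("cuda")

def seconds_per_iter(attn_implementation, iters=20):
    model = CLIPModel.from_pretrained(
        checkpoint, torch_dtype=torch.float16, attn_implementation=attn_implementation
    ).to("cuda")
    with torch.no_grad(), torch.autocast("cuda"):
        model(**inputs)  # warmup
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# "flash_attention_2" additionally requires the flash-attn package
for impl in ["eager", "sdpa", "flash_attention_2"]:
    print(impl, f"{seconds_per_iter(impl):.4f} s/iter")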

CLIPTextModel[[cliptextmodel]]

| Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|----------------:|---------------:|-------------:|------------:|--------------:|-------------:|
| 4 | 0.009 | 0.012 | 0.737 | 0.007 | 1.269 |
| 16 | 0.009 | 0.014 | 0.659 | 0.008 | 1.187 |
| 32 | 0.018 | 0.021 | 0.862 | 0.016 | 1.142 |
| 64 | 0.034 | 0.034 | 1.001 | 0.03 | 1.163 |
| 128 | 0.063 | 0.058 | 1.09 | 0.054 | 1.174 |

[Figure: clip_text_model_viz_3]

CLIPVisionModel[[clipvisionmodel]]

| Image batch size | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|-----------------:|---------------:|-------------:|------------:|--------------:|-------------:|
| 1 | 0.016 | 0.013 | 1.247 | 0.012 | 1.318 |
| 4 | 0.025 | 0.021 | 1.198 | 0.021 | 1.202 |
| 16 | 0.093 | 0.075 | 1.234 | 0.075 | 1.24 |
| 32 | 0.181 | 0.147 | 1.237 | 0.146 | 1.241 |

[Figure: clip_image_model_viz_3]

CLIPModel[[clipmodel]]

| Image batch size | Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|-----------------:|----------------:|---------------:|-------------:|------------:|--------------:|-------------:|
| 1 | 4 | 0.025 | 0.026 | 0.954 | 0.02 | 1.217 |
| 1 | 16 | 0.026 | 0.028 | 0.918 | 0.02 | 1.287 |
| 1 | 64 | 0.042 | 0.046 | 0.906 | 0.036 | 1.167 |
| 4 | 4 | 0.028 | 0.033 | 0.849 | 0.024 | 1.189 |
| 4 | 16 | 0.034 | 0.035 | 0.955 | 0.029 | 1.169 |
| 4 | 64 | 0.059 | 0.055 | 1.072 | 0.05 | 1.179 |
| 16 | 4 | 0.096 | 0.088 | 1.091 | 0.078 | 1.234 |
| 16 | 16 | 0.102 | 0.09 | 1.129 | 0.083 | 1.224 |
| 16 | 64 | 0.127 | 0.11 | 1.157 | 0.105 | 1.218 |
| 32 | 4 | 0.185 | 0.159 | 1.157 | 0.149 | 1.238 |
| 32 | 16 | 0.19 | 0.162 | 1.177 | 0.154 | 1.233 |
| 32 | 64 | 0.216 | 0.181 | 1.19 | 0.176 | 1.228 |

Resources[[resources]]

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.

  • μ‚¬μ „ν•™μŠ΅λœ CLIPλͺ¨λΈμ„ 이미지 캑셔닝을 μœ„ν•œ λΉ”μ„œμΉ˜ 좔둠에 μ–΄λ–»κ²Œ ν™œμš©ν•˜λŠ”μ§€μ— κ΄€ν•œ λ…ΈνŠΈλΆ

Image retrieval

  • μ‚¬μ „ν•™μŠ΅λœ CLIPλͺ¨λΈκ³Ό MRR(Mean Reciprocal Rank) 점수 연산을 μ‚¬μš©ν•œ 이미지 검색에 λŒ€ν•œ λ…ΈνŠΈλΆ. 🌎
  • 이미지 검색과 μœ μ‚¬μ„± μ μˆ˜μ— λŒ€ν•΄ λ³΄μ—¬μ£ΌλŠ” λ…ΈνŠΈλΆ. 🌎
  • Multilingual CLIPλ₯Ό μ‚¬μš©ν•΄μ„œ 이미지와 ν…μŠ€νŠΈλ₯Ό μ–΄λ–»κ²Œ 같은 벑터 곡간에 λ§€ν•‘ μ‹œν‚€λŠ”μ§€μ— λŒ€ν•œ λ…ΈνŠΈλΆ. 🌎
  • Unsplash와 TMDB 데이터셋을 ν™œμš©ν•œ 의미둠적(semantic) 이미지 κ²€μƒ‰μ—μ„œ CLIP을 κ΅¬λ™ν•˜λŠ” 방법에 λŒ€ν•œ λ…ΈνŠΈλΆ. 🌎

Explainability

  • A notebook on how to visualize the similarity between an input token and image segments. 🌎

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

CLIPConfig[[transformers.CLIPConfig]]

[[autodoc]] CLIPConfig - from_text_vision_configs

CLIPTextConfig[[transformers.CLIPTextConfig]]

[[autodoc]] CLIPTextConfig

CLIPVisionConfig[[transformers.CLIPVisionConfig]]

[[autodoc]] CLIPVisionConfig

CLIPTokenizer[[transformers.CLIPTokenizer]]

[[autodoc]] CLIPTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

CLIPTokenizerFast[[transformers.CLIPTokenizerFast]]

[[autodoc]] CLIPTokenizerFast

CLIPImageProcessor[[transformers.CLIPImageProcessor]]

[[autodoc]] CLIPImageProcessor - preprocess

CLIPFeatureExtractor[[transformers.CLIPFeatureExtractor]]

[[autodoc]] CLIPFeatureExtractor

CLIPProcessor[[transformers.CLIPProcessor]]

[[autodoc]] CLIPProcessor

CLIPModel[[transformers.CLIPModel]]

[[autodoc]] CLIPModel - forward - get_text_features - get_image_features

CLIPTextModel[[transformers.CLIPTextModel]]

[[autodoc]] CLIPTextModel - forward

CLIPTextModelWithProjection[[transformers.CLIPTextModelWithProjection]]

[[autodoc]] CLIPTextModelWithProjection - forward

CLIPVisionModelWithProjection[[transformers.CLIPVisionModelWithProjection]]

[[autodoc]] CLIPVisionModelWithProjection - forward

CLIPVisionModel[[transformers.CLIPVisionModel]]

[[autodoc]] CLIPVisionModel - forward

CLIPForImageClassification[[transformers.CLIPForImageClassification]]

[[autodoc]] CLIPForImageClassification - forward

TFCLIPModel[[transformers.TFCLIPModel]]

[[autodoc]] TFCLIPModel - call - get_text_features - get_image_features

TFCLIPTextModel[[transformers.TFCLIPTextModel]]

[[autodoc]] TFCLIPTextModel - call

TFCLIPVisionModel[[transformers.TFCLIPVisionModel]]

[[autodoc]] TFCLIPVisionModel - call

FlaxCLIPModel[[transformers.FlaxCLIPModel]]

[[autodoc]] FlaxCLIPModel - call - get_text_features - get_image_features

FlaxCLIPTextModel[[transformers.FlaxCLIPTextModel]]

[[autodoc]] FlaxCLIPTextModel - call

FlaxCLIPTextModelWithProjection[[transformers.FlaxCLIPTextModelWithProjection]]

[[autodoc]] FlaxCLIPTextModelWithProjection - call

FlaxCLIPVisionModel[[transformers.FlaxCLIPVisionModel]]

[[autodoc]] FlaxCLIPVisionModel - call