<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# CLIP[[clip]]
## Overview[[overview]]

The CLIP model was proposed in [Learning Transferable Visual Models From Natural Language Supervision](https://huggingface.co/papers/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of image-text pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3.

The abstract from the paper is the following:

*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million image-text pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. Code and pre-trained model weights are released at this https URL.*

This model was contributed by [valhalla](https://huggingface.co/valhalla).
The original code can be found [here](https://github.com/openai/CLIP).
## Usage tips and example[[usage-tips-and-example]]

CLIP is a multimodal vision-and-language model. It can be used for image-text similarity and zero-shot image classification. CLIP uses a ViT-like Transformer to extract visual features and a causal language model to extract text features. Both the text and visual features are then projected into a latent space of identical dimension. The dot product between the projected image and text features is used as the similarity score.
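Concretely, the similarity logits the model returns are just the scaled dot product of these projected embeddings. Below is a minimal sketch of computing them by hand with `get_image_features` and `get_text_features`; the checkpoint and example image URL are only illustrative.

```python
import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # projected embeddings in the shared latent space
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# L2-normalize, then take the dot product scaled by the learned temperature
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = model.logit_scale.exp() * image_embeds @ text_embeds.T  # equivalent to the model's logits_per_image output
```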
To feed images to the Transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as the representation of the whole image. The authors also add absolute position embeddings and feed the resulting sequence of vectors into a standard Transformer encoder. The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model.

The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps [`CLIPImageProcessor`] and [`CLIPTokenizer`] into a single instance that both encodes the text and prepares the images.
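If it helps to see the pieces individually, here is a minimal sketch of using the tokenizer and image processor on their own; [`CLIPProcessor`] simply merges these two outputs (the image URL is only illustrative):

```python
import requests
from PIL import Image
from transformers import CLIPImageProcessor, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# text -> input_ids and attention_mask
text_inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
# image -> resized, rescaled, and normalized pixel_values
image_inputs = image_processor(images=image, return_tensors="pt")
```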
λ‹€μŒ μ˜ˆμ‹œλŠ” [`CLIPProcessor`]와 [`CLIPModel`]을 μ‚¬μš©ν•˜μ—¬ 이미지-ν…μŠ€νŠΈ μœ μ‚¬λ„ 점수λ₯Ό μ–»λŠ” 방법을 λ³΄μ—¬μ€λ‹ˆλ‹€.
```python
>>> from PIL import Image
>>> import requests
>>> from transformers import CLIPProcessor, CLIPModel
>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
### Combining CLIP and Flash Attention 2[[combining-clip-and-flash-attention-2]]

First, make sure to install the latest version of Flash Attention 2.
```bash
pip install -U flash-attn --no-build-isolation
```
ν”Œλž˜μ‹œ μ–΄ν…μ…˜2와 ν˜Έν™˜λ˜λŠ” ν•˜λ“œμ›¨μ–΄λ₯Ό κ°€μ§€κ³  μžˆλŠ”μ§€ ν™•μΈν•˜μ„Έμš”. 이에 λŒ€ν•œ μžμ„Έν•œ λ‚΄μš©μ€ flash-attn λ¦¬ν¬μ§€ν† λ¦¬μ˜ κ³΅μ‹λ¬Έμ„œμ—μ„œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€. λ˜ν•œ λͺ¨λΈμ„ λ°˜μ •λ°€λ„(`torch.float16`)둜 λ‘œλ“œν•˜λŠ” 것을 μžŠμ§€ λ§ˆμ„Έμš”.
<Tip warning={true}>
μž‘μ€ 배치 크기λ₯Ό μ‚¬μš©ν•  λ•Œ, ν”Œλž˜μ‹œ μ–΄ν…μ…˜μ„ μ‚¬μš©ν•˜λ©΄ λͺ¨λΈμ΄ λŠλ €μ§€λŠ” 것을 λŠλ‚„ 수 μžˆμŠ΅λ‹ˆλ‹€.μ•„λž˜μ˜ [ν”Œλž˜μ‹œ μ–΄ν…μ…˜κ³Ό SDPAλ₯Ό μ‚¬μš©ν•œ μ˜ˆμƒ 속도 ν–₯상](#Expected-speedups-with-Flash-Attention-and-SDPA) μ„Ήμ…˜μ„ μ°Έμ‘°ν•˜μ—¬ μ μ ˆν•œ μ–΄ν…μ…˜ κ΅¬ν˜„μ„ μ„ νƒν•˜μ„Έμš”.
</Tip>
ν”Œλž˜μ‹œ μ–΄ν…μ…˜2λ₯Ό μ‚¬μš©ν•΄μ„œ λͺ¨λΈμ„ λ‘œλ“œν•˜κ³  κ΅¬λ™ν•˜κΈ° μœ„ν•΄μ„œ λ‹€μŒ μŠ€λ‹ˆνŽ«μ„ μ°Έκ³ ν•˜μ„Έμš”:
```python
>>> import torch
>>> import requests
>>> from PIL import Image
>>> from transformers import CLIPProcessor, CLIPModel
>>> device = "cuda"
>>> torch_dtype = torch.float16
>>> model = CLIPModel.from_pretrained(
... "openai/clip-vit-base-patch32",
... attn_implementation="flash_attention_2",
... device_map=device,
... torch_dtype=torch_dtype,
... )
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> inputs.to(device)
>>> with torch.no_grad():
... with torch.autocast(device):
... outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
>>> print(probs)
tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16)
```
### μŠ€μΌ€μΌλœ 내적 μ–΄ν…μ…˜ (Scaled dot-product Attention(SDPA)) μ‚¬μš©ν•˜κΈ°[[using-scaled-dot-product-attention-sdpa]]
νŒŒμ΄ν† μΉ˜λŠ” `torch.nn.functional`의 μΌλΆ€λ‘œ λ„€μ΄ν‹°λΈŒ μŠ€μΌ€μΌλœ 내적 μ–΄ν…μ…˜(SPDA) μ—°μ‚°μžλ₯Ό ν¬ν•¨ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. 이 ν•¨μˆ˜λŠ” μž…λ ₯κ³Ό μ‚¬μš© 쀑인 ν•˜λ“œμ›¨μ–΄μ— 따라 적용될 수 μžˆλŠ” μ—¬λŸ¬ κ΅¬ν˜„μ„ ν¬ν•¨ν•©λ‹ˆλ‹€. μžμ„Έν•œ μ •λ³΄λŠ” [κ³΅μ‹λ¬Έμ„œ](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)λ‚˜ [GPU μΆ”λ‘ ](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) νŽ˜μ΄μ§€λ₯Ό μ°Έμ‘°ν•˜μ„Έμš”.
`torch>=2.1.1`μ—μ„œλŠ” κ΅¬ν˜„μ΄ κ°€λŠ₯ν•  λ•Œ SDPAκ°€ 기본적으둜 μ‚¬μš©λ˜μ§€λ§Œ, `from_pretrained()` ν•¨μˆ˜μ—μ„œ `attn_implementation="sdpa"`λ₯Ό μ„€μ •ν•˜μ—¬ SDPAλ₯Ό λͺ…μ‹œμ μœΌλ‘œ μ‚¬μš©ν•˜λ„λ‘ μš”μ²­ν•  μˆ˜λ„ μžˆμŠ΅λ‹ˆλ‹€.
```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa")
```
For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
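If you want a rough sense of the gain on your own hardware before looking at the detailed numbers below, you can time the same forward pass under both attention implementations. This is a minimal sketch assuming a CUDA GPU and the `openai/clip-vit-base-patch32` checkpoint; the absolute numbers will differ from the benchmark in the next section.

```python
import time

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
).to("cuda")

for impl in ["eager", "sdpa"]:
    model = CLIPModel.from_pretrained(
        "openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation=impl
    ).to("cuda")
    with torch.no_grad():
        for _ in range(3):  # warmup iterations
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):  # timed iterations
            model(**inputs)
        torch.cuda.synchronize()
    print(f"{impl}: {(time.perf_counter() - start) / 10:.4f} s/iter")
```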
### ν”Œλž˜μ‹œ μ–΄ν…μ…˜κ³Ό μŠ€μΌ€μΌλœ 내적 μ–΄ν…μ…˜(SDPA)으둜 인해 μ˜ˆμƒλ˜λŠ” 속도ν–₯상[[expected-speedups-with-flash-attention-and-sdpa]]
On a local benchmark (NVIDIA A10G, PyTorch 2.3.1+cu121) with `float16`, we saw the following speedups during inference with the `"openai/clip-vit-large-patch14"` checkpoint.

[Code](https://gist.github.com/qubvel/ac691a54e54f9fae8144275f866a7ff8):
#### CLIPTextModel[[cliptextmodel]]
| Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:|
| 4 | 0.009 | 0.012 | 0.737 | 0.007 | 1.269 |
| 16 | 0.009 | 0.014 | 0.659 | 0.008 | 1.187 |
| 32 | 0.018 | 0.021 | 0.862 | 0.016 | 1.142 |
| 64 | 0.034 | 0.034 | 1.001 | 0.03 | 1.163 |
| 128 | 0.063 | 0.058 | 1.09 | 0.054 | 1.174 |
![clip_text_model_viz_3](https://github.com/user-attachments/assets/e9826b43-4e66-4f4c-952b-af4d90bd38eb)
#### CLIPVisionModel[[clipvisionmodel]]
| Image batch size | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|-------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:|
| 1 | 0.016 | 0.013 | 1.247 | 0.012 | 1.318 |
| 4 | 0.025 | 0.021 | 1.198 | 0.021 | 1.202 |
| 16 | 0.093 | 0.075 | 1.234 | 0.075 | 1.24 |
| 32 | 0.181 | 0.147 | 1.237 | 0.146 | 1.241 |
![clip_image_model_viz_3](https://github.com/user-attachments/assets/50a36206-e3b9-4adc-ac8e-926b8b071d63)
#### CLIPModel[[clipmodel]]
| Image batch size | Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|-------------------:|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:|
| 1 | 4 | 0.025 | 0.026 | 0.954 | 0.02 | 1.217 |
| 1 | 16 | 0.026 | 0.028 | 0.918 | 0.02 | 1.287 |
| 1 | 64 | 0.042 | 0.046 | 0.906 | 0.036 | 1.167 |
| 4 | 4 | 0.028 | 0.033 | 0.849 | 0.024 | 1.189 |
| 4 | 16 | 0.034 | 0.035 | 0.955 | 0.029 | 1.169 |
| 4 | 64 | 0.059 | 0.055 | 1.072 | 0.05 | 1.179 |
| 16 | 4 | 0.096 | 0.088 | 1.091 | 0.078 | 1.234 |
| 16 | 16 | 0.102 | 0.09 | 1.129 | 0.083 | 1.224 |
| 16 | 64 | 0.127 | 0.11 | 1.157 | 0.105 | 1.218 |
| 32 | 4 | 0.185 | 0.159 | 1.157 | 0.149 | 1.238 |
| 32 | 16 | 0.19 | 0.162 | 1.177 | 0.154 | 1.233 |
| 32 | 64 | 0.216 | 0.181 | 1.19 | 0.176 | 1.228 |
## Resources[[resources]]

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.

- [Fine tuning CLIP with Remote Sensing (Satellite) images and captions](https://huggingface.co/blog/fine-tune-clip-rsicd): a blog post about how to fine-tune CLIP with the [RSICD dataset](https://github.com/201528014227051/RSICD_optimal) and compare performance changes from data augmentation.
- This [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) shows how to train a CLIP-like vision-text dual encoder model using a pre-trained vision and text encoder on the [COCO dataset](https://cocodataset.org/#home).
<PipelineTag pipeline="image-to-text"/>
- μ‚¬μ „ν•™μŠ΅λœ CLIPλͺ¨λΈμ„ 이미지 캑셔닝을 μœ„ν•œ λΉ”μ„œμΉ˜ 좔둠에 μ–΄λ–»κ²Œ ν™œμš©ν•˜λŠ”μ§€μ— κ΄€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing)
**이미지 검색**
- μ‚¬μ „ν•™μŠ΅λœ CLIPλͺ¨λΈκ³Ό MRR(Mean Reciprocal Rank) 점수 연산을 μ‚¬μš©ν•œ 이미지 검색에 λŒ€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing). 🌎
- 이미지 검색과 μœ μ‚¬μ„± μ μˆ˜μ— λŒ€ν•΄ λ³΄μ—¬μ£ΌλŠ” [λ…ΈνŠΈλΆ](https://colab.research.google.com/github/deep-diver/image_search_with_natural_language/blob/main/notebooks/Image_Search_CLIP.ipynb). 🌎
- Multilingual CLIPλ₯Ό μ‚¬μš©ν•΄μ„œ 이미지와 ν…μŠ€νŠΈλ₯Ό μ–΄λ–»κ²Œ 같은 벑터 곡간에 λ§€ν•‘ μ‹œν‚€λŠ”μ§€μ— λŒ€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/drive/1xO-wC_m_GNzgjIBQ4a4znvQkvDoZJvH4?usp=sharing). 🌎
- [Unsplash](https://unsplash.com)와 [TMDB](https://www.themoviedb.org/) 데이터셋을 ν™œμš©ν•œ 의미둠적(semantic) 이미지 κ²€μƒ‰μ—μ„œ CLIP을 κ΅¬λ™ν•˜λŠ” 방법에 λŒ€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/github/vivien000/clip-demo/blob/master/clip.ipynb#scrollTo=uzdFhRGqiWkR). 🌎
**μ„€λͺ… κ°€λŠ₯μ„±**
- μž…λ ₯ 토큰과 이미지 쑰각(segment) μ‚¬μ΄μ˜ μœ μ‚¬μ„±μ„ μ‹œκ°ν™” μ‹œν‚€λŠ” 방법에 λŒ€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/CLIP_explainability.ipynb). 🌎
여기에 포함될 자료λ₯Ό μ œμΆœν•˜κ³  μ‹ΆμœΌμ‹œλ‹€λ©΄ PR(Pull Request)λ₯Ό μ—΄μ–΄μ£Όμ„Έμš”. 리뷰 ν•΄λ“œλ¦¬κ² μŠ΅λ‹ˆλ‹€! μžλ£ŒλŠ” κΈ°μ‘΄ 자료λ₯Ό λ³΅μ œν•˜λŠ” λŒ€μ‹  μƒˆλ‘œμš΄ λ‚΄μš©μ„ λ‹΄κ³  μžˆμ–΄μ•Ό ν•©λ‹ˆλ‹€.
## CLIPConfig[[transformers.CLIPConfig]]
[[autodoc]] CLIPConfig
- from_text_vision_configs
## CLIPTextConfig[[transformers.CLIPTextConfig]]
[[autodoc]] CLIPTextConfig
## CLIPVisionConfig[[transformers.CLIPVisionConfig]]
[[autodoc]] CLIPVisionConfig
## CLIPTokenizer[[transformers.CLIPTokenizer]]
[[autodoc]] CLIPTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## CLIPTokenizerFast[[transformers.CLIPTokenizerFast]]
[[autodoc]] CLIPTokenizerFast
## CLIPImageProcessor[[transformers.CLIPImageProcessor]]
[[autodoc]] CLIPImageProcessor
- preprocess
## CLIPFeatureExtractor[[transformers.CLIPFeatureExtractor]]
[[autodoc]] CLIPFeatureExtractor
## CLIPProcessor[[transformers.CLIPProcessor]]
[[autodoc]] CLIPProcessor
<frameworkcontent>
<pt>
## CLIPModel[[transformers.CLIPModel]]
[[autodoc]] CLIPModel
- forward
- get_text_features
- get_image_features
## CLIPTextModel[[transformers.CLIPTextModel]]
[[autodoc]] CLIPTextModel
- forward
## CLIPTextModelWithProjection[[transformers.CLIPTextModelWithProjection]]
[[autodoc]] CLIPTextModelWithProjection
- forward
## CLIPVisionModelWithProjection[[transformers.CLIPVisionModelWithProjection]]
[[autodoc]] CLIPVisionModelWithProjection
- forward
## CLIPVisionModel[[transformers.CLIPVisionModel]]
[[autodoc]] CLIPVisionModel
- forward
## CLIPForImageClassification[[transformers.CLIPForImageClassification]]
[[autodoc]] CLIPForImageClassification
- forward
</pt>
<tf>
## TFCLIPModel[[transformers.TFCLIPModel]]
[[autodoc]] TFCLIPModel
- call
- get_text_features
- get_image_features
## TFCLIPTextModel[[transformers.TFCLIPTextModel]]
[[autodoc]] TFCLIPTextModel
- call
## TFCLIPVisionModel[[transformers.TFCLIPVisionModel]]
[[autodoc]] TFCLIPVisionModel
- call
</tf>
<jax>
## FlaxCLIPModel[[transformers.FlaxCLIPModel]]
[[autodoc]] FlaxCLIPModel
- __call__
- get_text_features
- get_image_features
## FlaxCLIPTextModel[[transformers.FlaxCLIPTextModel]]
[[autodoc]] FlaxCLIPTextModel
- __call__
## FlaxCLIPTextModelWithProjection[[transformers.FlaxCLIPTextModelWithProjection]]
[[autodoc]] FlaxCLIPTextModelWithProjection
- __call__
## FlaxCLIPVisionModel[[transformers.FlaxCLIPVisionModel]]
[[autodoc]] FlaxCLIPVisionModel
- __call__
</jax>
</frameworkcontent>