<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# CLIP[[clip]]

## Overview[[overview]]

CLIP λͺ¨λΈμ€ Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskeverκ°€ μ œμ•ˆν•œ [μžμ—°μ–΄ 지도(supervision)λ₯Ό ν†΅ν•œ 전이 κ°€λŠ₯ν•œ μ‹œκ° λͺ¨λΈ ν•™μŠ΅](https://arxiv.org/abs/2103.00020)λΌλŠ” λ…Όλ¬Έμ—μ„œ μ†Œκ°œλ˜μ—ˆμŠ΅λ‹ˆλ‹€. CLIP(Contrastive Language-Image Pre-Training)은 λ‹€μ–‘ν•œ 이미지와 ν…μŠ€νŠΈ 쌍으둜 ν›ˆλ ¨λœ 신경망 μž…λ‹ˆλ‹€. GPT-2와 3의 μ œλ‘œμƒ· λŠ₯λ ₯κ³Ό μœ μ‚¬ν•˜κ²Œ, ν•΄λ‹Ή μž‘μ—…μ— μ§μ ‘μ μœΌλ‘œ μ΅œμ ν™”ν•˜μ§€ μ•Šκ³ λ„ μ£Όμ–΄μ§„ 이미지에 λŒ€ν•΄ κ°€μž₯ κ΄€λ ¨μ„± μžˆλŠ” ν…μŠ€νŠΈ μŠ€λ‹ˆνŽ«μ„ μ˜ˆμΈ‘ν•˜λ„λ‘ μžμ—°μ–΄λ‘œ μ§€μ‹œν•  수 μžˆμŠ΅λ‹ˆλ‹€.

The abstract from the paper is the following:

*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million image-text pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.*

이 λͺ¨λΈμ€ [valhalla](https://huggingface.co/valhalla)에 μ˜ν•΄ κΈ°μ—¬λ˜μ—ˆμŠ΅λ‹ˆλ‹€. 
원본 μ½”λ“œλŠ” [이곳](https://github.com/openai/CLIP)μ—μ„œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€.

## Usage tips and example[[usage-tips-and-example]]

CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. CLIP uses a ViT-like Transformer to extract visual features and a causal language model to extract text features. Both the text and visual features are then projected into a latent space of identical dimension. The dot product between the projected image and text features is used as the similarity score.
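The similarity computation described above can be sketched in plain Python: both feature vectors are L2-normalized, and the (scaled) dot product of the normalized vectors is the similarity score. The vectors and the scale factor below are illustrative values, not actual CLIP outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as CLIP does before comparing features."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity(image_feat, text_feat, logit_scale=100.0):
    """Scaled dot product of L2-normalized features (cosine similarity x scale)."""
    img = l2_normalize(image_feat)
    txt = l2_normalize(text_feat)
    return logit_scale * sum(a * b for a, b in zip(img, txt))

# Toy 4-dimensional "projected" features (real CLIP features are 512/768-dim).
image = [0.2, -0.1, 0.4, 0.3]
cat_text = [0.19, -0.12, 0.41, 0.28]  # nearly parallel to the image -> high score
dog_text = [-0.3, 0.5, -0.1, 0.2]     # far from the image -> low score

print(similarity(image, cat_text) > similarity(image, dog_text))  # True
```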

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as the representation of the entire image. The authors also add absolute position embeddings and feed the resulting sequence of vectors into a standard Transformer encoder. The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model.
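As a concrete illustration of the patching step, the sequence length fed to the vision Transformer is one token per patch plus the [CLS] token. The numbers below assume a 224x224 input, as used by the `openai/clip-vit-base-patch32` and `openai/clip-vit-large-patch14` checkpoints.

```python
def num_tokens(image_size=224, patch_size=32):
    """Sequence length seen by the vision Transformer: one token per
    non-overlapping patch, plus the [CLS] token."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1  # +1 for [CLS]

print(num_tokens())               # 50  (7x7 patches + [CLS], patch32 variant)
print(num_tokens(patch_size=14))  # 257 (16x16 patches + [CLS], patch14 variant)
```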

The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps [`CLIPImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images.

λ‹€μŒ μ˜ˆμ‹œλŠ” [`CLIPProcessor`]와 [`CLIPModel`]을 μ‚¬μš©ν•˜μ—¬ 이미지-ν…μŠ€νŠΈ μœ μ‚¬λ„ 점수λ₯Ό μ–»λŠ” 방법을 λ³΄μ—¬μ€λ‹ˆλ‹€.


```python
>>> from PIL import Image
>>> import requests

>>> from transformers import CLIPProcessor, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
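The final softmax step can be spelled out in plain Python. The two logits below are illustrative values of roughly the scale CLIP produces, not exact model outputs.

```python
import math

def softmax(logits):
    """Convert similarity logits to probabilities over the candidate texts."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative image-text logits for ["a photo of a cat", "a photo of a dog"]
probs = softmax([24.5, 19.3])
print(probs)  # ~[0.9945, 0.0055]: the "cat" caption wins by a wide margin
```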


### Combining CLIP and Flash Attention 2[[combining-clip-and-flash-attention-2]]

First, install the latest version of Flash Attention 2.

```bash
pip install -U flash-attn --no-build-isolation
```

Make sure you have hardware that is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Also make sure to load your model in half-precision (e.g. `torch.float16`).

<Tip warning={true}>

μž‘μ€ 배치 크기λ₯Ό μ‚¬μš©ν•  λ•Œ, ν”Œλž˜μ‹œ μ–΄ν…μ…˜μ„ μ‚¬μš©ν•˜λ©΄ λͺ¨λΈμ΄ λŠλ €μ§€λŠ” 것을 λŠλ‚„ 수 μžˆμŠ΅λ‹ˆλ‹€.μ•„λž˜μ˜ [ν”Œλž˜μ‹œ μ–΄ν…μ…˜κ³Ό SDPAλ₯Ό μ‚¬μš©ν•œ μ˜ˆμƒ 속도 ν–₯상](#Expected-speedups-with-Flash-Attention-and-SDPA) μ„Ήμ…˜μ„ μ°Έμ‘°ν•˜μ—¬ μ μ ˆν•œ μ–΄ν…μ…˜ κ΅¬ν˜„μ„ μ„ νƒν•˜μ„Έμš”.

</Tip>

To load and run a model using Flash Attention 2, refer to the snippet below:

```python
>>> import torch
>>> import requests
>>> from PIL import Image

>>> from transformers import CLIPProcessor, CLIPModel

>>> device = "cuda"
>>> torch_dtype = torch.float16

>>> model = CLIPModel.from_pretrained(
...     "openai/clip-vit-base-patch32",
...     attn_implementation="flash_attention_2",
...     device_map=device,
...     torch_dtype=torch_dtype,
... )
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> inputs.to(device)

>>> with torch.no_grad():
...     with torch.autocast(device):
...         outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
>>> print(probs)
tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16)
```


### μŠ€μΌ€μΌλœ 내적 μ–΄ν…μ…˜ (Scaled dot-product Attention(SDPA)) μ‚¬μš©ν•˜κΈ°[[using-scaled-dot-product-attention-sdpa]]

νŒŒμ΄ν† μΉ˜λŠ” `torch.nn.functional`의 μΌλΆ€λ‘œ λ„€μ΄ν‹°λΈŒ μŠ€μΌ€μΌλœ 내적 μ–΄ν…μ…˜(SPDA) μ—°μ‚°μžλ₯Ό ν¬ν•¨ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. 이 ν•¨μˆ˜λŠ” μž…λ ₯κ³Ό μ‚¬μš© 쀑인 ν•˜λ“œμ›¨μ–΄μ— 따라 적용될 수 μžˆλŠ” μ—¬λŸ¬ κ΅¬ν˜„μ„ ν¬ν•¨ν•©λ‹ˆλ‹€. μžμ„Έν•œ μ •λ³΄λŠ” [κ³΅μ‹λ¬Έμ„œ](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)λ‚˜ [GPU μΆ”λ‘ ](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) νŽ˜μ΄μ§€λ₯Ό μ°Έμ‘°ν•˜μ„Έμš”.

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request that SDPA be used.

```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa")
```

For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).

### Expected speedups with Flash Attention and SDPA[[expected-speedups-with-flash-attention-and-sdpa]]

On a local benchmark (NVIDIA A10G, PyTorch 2.3.1+cu121), running inference with `float16` and the `"openai/clip-vit-large-patch14"` checkpoint, we saw the following speedups.
[Code](https://gist.github.com/qubvel/ac691a54e54f9fae8144275f866a7ff8):

#### CLIPTextModel[[cliptextmodel]]

|   Num text labels |   Eager (s/iter) |   FA2 (s/iter) |   FA2 speedup |   SDPA (s/iter) |   SDPA speedup |
|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:|
|                 4 |            0.009 |          0.012 |         0.737 |           0.007 |          1.269 |
|                16 |            0.009 |          0.014 |         0.659 |           0.008 |          1.187 |
|                32 |            0.018 |          0.021 |         0.862 |           0.016 |          1.142 |
|                64 |            0.034 |          0.034 |         1.001 |           0.03  |          1.163 |
|               128 |            0.063 |          0.058 |         1.09  |           0.054 |          1.174 |

![clip_text_model_viz_3](https://github.com/user-attachments/assets/e9826b43-4e66-4f4c-952b-af4d90bd38eb)

#### CLIPVisionModel[[clipvisionmodel]]

|   Image batch size |   Eager (s/iter) |   FA2 (s/iter) |   FA2 speedup |   SDPA (s/iter) |   SDPA speedup |
|-------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:|
|                  1 |            0.016 |          0.013 |         1.247 |           0.012 |          1.318 |
|                  4 |            0.025 |          0.021 |         1.198 |           0.021 |          1.202 |
|                 16 |            0.093 |          0.075 |         1.234 |           0.075 |          1.24  |
|                 32 |            0.181 |          0.147 |         1.237 |           0.146 |          1.241 |

![clip_image_model_viz_3](https://github.com/user-attachments/assets/50a36206-e3b9-4adc-ac8e-926b8b071d63)

#### CLIPModel[[clipmodel]]

|   Image batch size |   Num text labels |   Eager (s/iter) |   FA2 (s/iter) |   FA2 speedup |   SDPA (s/iter) |   SDPA speedup |
|-------------------:|------------------:|-----------------:|---------------:|--------------:|----------------:|---------------:|
|                  1 |                 4 |            0.025 |          0.026 |         0.954 |           0.02  |          1.217 |
|                  1 |                16 |            0.026 |          0.028 |         0.918 |           0.02  |          1.287 |
|                  1 |                64 |            0.042 |          0.046 |         0.906 |           0.036 |          1.167 |
|                  4 |                 4 |            0.028 |          0.033 |         0.849 |           0.024 |          1.189 |
|                  4 |                16 |            0.034 |          0.035 |         0.955 |           0.029 |          1.169 |
|                  4 |                64 |            0.059 |          0.055 |         1.072 |           0.05  |          1.179 |
|                 16 |                 4 |            0.096 |          0.088 |         1.091 |           0.078 |          1.234 |
|                 16 |                16 |            0.102 |          0.09  |         1.129 |           0.083 |          1.224 |
|                 16 |                64 |            0.127 |          0.11  |         1.157 |           0.105 |          1.218 |
|                 32 |                 4 |            0.185 |          0.159 |         1.157 |           0.149 |          1.238 |
|                 32 |                16 |            0.19  |          0.162 |         1.177 |           0.154 |          1.233 |
|                 32 |                64 |            0.216 |          0.181 |         1.19  |           0.176 |          1.228 |
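The speedup columns in the tables above are simply the ratio of the eager time per iteration to the optimized implementation's time per iteration, computed before rounding. Recomputing from the rounded `s/iter` values gives slightly different numbers, e.g. for `CLIPVisionModel` at image batch size 1:

```python
def speedup(eager_s_per_iter, optimized_s_per_iter):
    """Speedup factor: how many times faster the optimized attention runs."""
    return eager_s_per_iter / optimized_s_per_iter

# CLIPVisionModel, image batch size 1 (rounded values from the table above)
print(round(speedup(0.016, 0.013), 3))  # 1.231 (table reports 1.247 from unrounded timings)
print(round(speedup(0.016, 0.012), 3))  # 1.333 (table reports 1.318 from unrounded timings)
```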

## Resources[[resources]]

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.

- [Fine tuning CLIP with Remote Sensing (Satellite) images and captions](https://huggingface.co/blog/fine-tune-clip-rsicd): a blog post about how to fine-tune CLIP with the [RSICD dataset](https://github.com/201528014227051/RSICD_optimal) and compare performance changes due to data augmentation.
- This [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) shows how to train a CLIP-like vision-text dual encoder model using a pre-trained vision encoder and text encoder on the [COCO dataset](https://cocodataset.org/#home).

<PipelineTag pipeline="image-to-text"/>

- μ‚¬μ „ν•™μŠ΅λœ CLIPλͺ¨λΈμ„ 이미지 캑셔닝을 μœ„ν•œ λΉ”μ„œμΉ˜ 좔둠에 μ–΄λ–»κ²Œ ν™œμš©ν•˜λŠ”μ§€μ— κ΄€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing)

**Image retrieval**

- μ‚¬μ „ν•™μŠ΅λœ CLIPλͺ¨λΈκ³Ό MRR(Mean Reciprocal Rank) 점수 연산을 μ‚¬μš©ν•œ 이미지 검색에 λŒ€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing). 🌎
- 이미지 검색과 μœ μ‚¬μ„± μ μˆ˜μ— λŒ€ν•΄ λ³΄μ—¬μ£ΌλŠ” [λ…ΈνŠΈλΆ](https://colab.research.google.com/github/deep-diver/image_search_with_natural_language/blob/main/notebooks/Image_Search_CLIP.ipynb). 🌎
- Multilingual CLIPλ₯Ό μ‚¬μš©ν•΄μ„œ 이미지와 ν…μŠ€νŠΈλ₯Ό μ–΄λ–»κ²Œ 같은 벑터 곡간에 λ§€ν•‘ μ‹œν‚€λŠ”μ§€μ— λŒ€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/drive/1xO-wC_m_GNzgjIBQ4a4znvQkvDoZJvH4?usp=sharing). 🌎 
- [Unsplash](https://unsplash.com)와 [TMDB](https://www.themoviedb.org/) 데이터셋을 ν™œμš©ν•œ 의미둠적(semantic) 이미지 κ²€μƒ‰μ—μ„œ CLIP을 κ΅¬λ™ν•˜λŠ” 방법에 λŒ€ν•œ [λ…ΈνŠΈλΆ](https://colab.research.google.com/github/vivien000/clip-demo/blob/master/clip.ipynb#scrollTo=uzdFhRGqiWkR). 🌎

**Explainability**

- A [notebook](https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/CLIP_explainability.ipynb) on how to visualize the similarity between input tokens and image segments. 🌎

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## CLIPConfig[[transformers.CLIPConfig]]

[[autodoc]] CLIPConfig
    - from_text_vision_configs

## CLIPTextConfig[[transformers.CLIPTextConfig]]

[[autodoc]] CLIPTextConfig

## CLIPVisionConfig[[transformers.CLIPVisionConfig]]

[[autodoc]] CLIPVisionConfig

## CLIPTokenizer[[transformers.CLIPTokenizer]]

[[autodoc]] CLIPTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## CLIPTokenizerFast[[transformers.CLIPTokenizerFast]]

[[autodoc]] CLIPTokenizerFast

## CLIPImageProcessor[[transformers.CLIPImageProcessor]]

[[autodoc]] CLIPImageProcessor
    - preprocess

## CLIPFeatureExtractor[[transformers.CLIPFeatureExtractor]]

[[autodoc]] CLIPFeatureExtractor

## CLIPProcessor[[transformers.CLIPProcessor]]

[[autodoc]] CLIPProcessor

<frameworkcontent>
<pt>

## CLIPModel[[transformers.CLIPModel]]

[[autodoc]] CLIPModel
    - forward
    - get_text_features
    - get_image_features

## CLIPTextModel[[transformers.CLIPTextModel]]

[[autodoc]] CLIPTextModel
    - forward

## CLIPTextModelWithProjection[[transformers.CLIPTextModelWithProjection]]

[[autodoc]] CLIPTextModelWithProjection
    - forward

## CLIPVisionModelWithProjection[[transformers.CLIPVisionModelWithProjection]]

[[autodoc]] CLIPVisionModelWithProjection
    - forward

## CLIPVisionModel[[transformers.CLIPVisionModel]]

[[autodoc]] CLIPVisionModel
    - forward

## CLIPForImageClassification[[transformers.CLIPForImageClassification]]

[[autodoc]] CLIPForImageClassification
    - forward

</pt>
<tf>

## TFCLIPModel[[transformers.TFCLIPModel]]

[[autodoc]] TFCLIPModel
    - call
    - get_text_features
    - get_image_features

## TFCLIPTextModel[[transformers.TFCLIPTextModel]]

[[autodoc]] TFCLIPTextModel
    - call

## TFCLIPVisionModel[[transformers.TFCLIPVisionModel]]

[[autodoc]] TFCLIPVisionModel
    - call

</tf>
<jax>

## FlaxCLIPModel[[transformers.FlaxCLIPModel]]

[[autodoc]] FlaxCLIPModel
    - __call__
    - get_text_features
    - get_image_features

## FlaxCLIPTextModel[[transformers.FlaxCLIPTextModel]]

[[autodoc]] FlaxCLIPTextModel
    - __call__

## FlaxCLIPTextModelWithProjection[[transformers.FlaxCLIPTextModelWithProjection]]

[[autodoc]] FlaxCLIPTextModelWithProjection
    - __call__

## FlaxCLIPVisionModel[[transformers.FlaxCLIPVisionModel]]

[[autodoc]] FlaxCLIPVisionModel
    - __call__

</jax>
</frameworkcontent>