# AltCLIP

## Overview[[overview]]

The AltCLIP model was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679v2) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. AltCLIP (Altering the Language Encoder in CLIP) is a neural network trained on a variety of image-text and text-text pairs. By replacing CLIP's text encoder with the pretrained multilingual text encoder XLM-R, the authors obtained performance very close to CLIP on almost all tasks while extending the original CLIP with capabilities such as multilingual understanding.

The abstract from the paper is the following:

*In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*

This model was contributed by [jongjyh](https://huggingface.co/jongjyh).

## Usage tips and example[[usage-tips-and-example]]

The usage of AltCLIP is very similar to that of CLIP; the difference lies in the text encoder. AltCLIP uses bidirectional attention instead of causal attention, and it takes the [CLS] token of XLM-R as the text embedding.

AltCLIP is a multimodal vision and language model. It can be used to compute image-text similarity and for zero-shot image classification. AltCLIP uses a ViT-like transformer to obtain visual features and a bidirectional language model to obtain text features. Both the text and the visual features are then projected into a latent space of the same dimension. The dot product between the projected image and text features is used as the similarity score.
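
The scoring step described above can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the three-dimensional vectors are hypothetical, hand-picked stand-ins for projected encoder outputs, and the helper names (`dot`, `softmax`, `similarity_scores`) are introduced here only for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_scores(image_feature, text_features):
    """Score one projected image feature against each projected text feature."""
    return [dot(image_feature, t) for t in text_features]


# Hypothetical features standing in for the projected outputs of the
# vision and text encoders (real features have many more dimensions).
image_feature = [0.9, 0.1, 0.0]
text_features = [
    [1.0, 0.0, 0.0],  # stand-in for "a photo of a cat"
    [0.0, 1.0, 0.0],  # stand-in for "a photo of a dog"
]

scores = similarity_scores(image_feature, text_features)
probs = softmax(scores)  # one probability per candidate text
```

For zero-shot classification, the candidate text with the highest probability would be taken as the predicted label.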

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to represent the entire image. The authors also add absolute position embeddings and feed the resulting sequence of vectors to a standard Transformer encoder. The [`CLIPImageProcessor`] can be used to resize and normalize images for the model.
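
As a rough illustration of the patching step, the sketch below splits a toy grid into non-overlapping patches and counts the resulting sequence length, including the extra [CLS] token. The function names are hypothetical, and the sizes used (a 4x4 grid with 2x2 patches, or a 224-pixel image with 14-pixel patches) are illustrative assumptions rather than values read from an AltCLIP configuration.

```python
def patch_sequence_length(image_size, patch_size):
    """Number of tokens for a square image: one per patch, plus one [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into non-overlapping square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# A toy 4x4 single-channel "image" whose pixel values are 0..15.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(toy_image, 2)

# Sequence length for an assumed 224-pixel image with 14-pixel patches.
sequence_length = patch_sequence_length(224, 14)
```

In the actual model each patch would then be flattened and linearly embedded before position embeddings are added; the sketch stops at the splitting step.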

The [`AltCLIPProcessor`] wraps a [`CLIPImageProcessor`] and an [`XLMRobertaTokenizer`] into a single instance that both encodes the text and prepares the images. The following example shows how to get image-text similarity scores using [`AltCLIPProcessor`] and [`AltCLIPModel`].
```python
>>> from PIL import Image
>>> import requests

>>> from transformers import AltCLIPModel, AltCLIPProcessor

>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # image-text similarity scores
>>> probs = logits_per_image.softmax(dim=1)  # apply softmax to get a probability per label
```
<Tip>

This model is based on `CLIPModel`, so it can be used the same way as the original CLIP.

</Tip>

## AltCLIPConfig

[[autodoc]] AltCLIPConfig
    - from_text_vision_configs

## AltCLIPTextConfig

[[autodoc]] AltCLIPTextConfig

## AltCLIPVisionConfig

[[autodoc]] AltCLIPVisionConfig

## AltCLIPProcessor

[[autodoc]] AltCLIPProcessor

## AltCLIPModel

[[autodoc]] AltCLIPModel
    - forward
    - get_text_features
    - get_image_features

## AltCLIPTextModel

[[autodoc]] AltCLIPTextModel
    - forward

## AltCLIPVisionModel

[[autodoc]] AltCLIPVisionModel
    - forward