
AltCLIP

Overview[[overview]]

The AltCLIP model was proposed in AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. AltCLIP (altering the language encoder in CLIP to extend its language capabilities) is a neural network trained on a variety of image-text and text-text pairs. By replacing CLIP's text encoder with the pretrained multilingual text encoder XLM-R, the authors obtained performance close to CLIP on almost all tasks while extending the original CLIP with capabilities such as multilingual understanding.

The abstract from the paper is the following:

In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.

This model was contributed by jongjyh.

Usage tips and example[[usage-tips-and-example]]

AltCLIP is used in much the same way as CLIP; the difference lies in the text encoder. It uses bidirectional attention rather than the causal attention of CLIP's text encoder, and it takes the [CLS] token of XLM-R as the text embedding.
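To make the difference between the two attention patterns concrete, here is a minimal sketch in plain Python with no model loaded. The helper names `causal_mask` and `bidirectional_mask` are hypothetical and are not part of transformers; they only illustrate why a leading [CLS] token can summarize the whole text under bidirectional attention but not under causal attention.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


n_tokens = 4  # position 0 plays the role of the [CLS] token (illustrative size)
causal = causal_mask(n_tokens)
bidirectional = bidirectional_mask(n_tokens)

# Count how many tokens the first position can see under each pattern.
cls_sees_causal = sum(causal[0])
cls_sees_bidirectional = sum(bidirectional[0])
print(cls_sees_causal, cls_sees_bidirectional)  # 1 4
```

Under the causal pattern only the last position sees every token, which is why the original CLIP pools its text embedding from the end-of-text token rather than from the first one.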

AltCLIP is a multimodal vision and language model. It can be used to compute image-text similarity and for zero-shot image classification. AltCLIP uses a ViT-like transformer to obtain visual features and a bidirectional language model to obtain text features. Both the text and visual features are then projected into a latent space of the same dimension. The dot product between the projected image and text features is used as the similarity score.
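The scoring step described above can be sketched without loading a model. The vectors below are made-up stand-ins for already-projected features, and the L2 normalization and the `logit_scale` value are assumptions borrowed from how CLIP-style models are commonly implemented, not values read from an AltCLIP checkpoint.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical features, already projected into a shared 3-dimensional space.
image_embed = l2_normalize([0.9, 0.1, 0.2])
text_embeds = [
    l2_normalize([1.0, 0.0, 0.1]),  # stands in for "a photo of a cat"
    l2_normalize([0.0, 1.0, 0.3]),  # stands in for "a photo of a dog"
]

logit_scale = 100.0  # assumed temperature; real checkpoints learn this value
logits_per_image = [logit_scale * dot(image_embed, t) for t in text_embeds]
probs = softmax(logits_per_image)  # one probability per candidate text
```

Because both modalities live in the same space, zero-shot classification reduces to picking the candidate text with the highest score for a given image.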

To feed images to the transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to represent the entire image. The authors also add absolute position embeddings and feed the resulting sequence of vectors to a standard transformer encoder. The [CLIPImageProcessor] can be used to resize and normalize images for the model.
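The patch-splitting step can be illustrated on a toy input. This is a sketch only: `split_into_patches` is a hypothetical helper, the 4x4 single-channel grid and patch size of 2 are illustrative values, and the linear embedding and position embeddings that follow in the real model are left out.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block in row-major order.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy single-channel 4x4 "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)

# One extra position is reserved for the [CLS] token.
sequence_length = len(patches) + 1
print(patches[0], sequence_length)  # [0, 1, 4, 5] 5
```

The same arithmetic applies at full scale: an image of side S with patch side P yields (S / P) squared patches, plus one position for [CLS].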

The [AltCLIPProcessor] wraps a [CLIPImageProcessor] and an [XLMRobertaTokenizer] into a single instance that both encodes the text and prepares the images. The following example shows how to get image-text similarity scores using [AltCLIPProcessor] and [AltCLIPModel].

>>> from PIL import Image
>>> import requests

>>> from transformers import AltCLIPModel, AltCLIPProcessor

>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # image-text similarity scores
>>> probs = logits_per_image.softmax(dim=1)  # apply softmax to get a probability per label

This model is based on CLIPModel, so it can be used in the same way as the original CLIP.

AltCLIPConfig

[[autodoc]] AltCLIPConfig
    - from_text_vision_configs

AltCLIPTextConfig

[[autodoc]] AltCLIPTextConfig

AltCLIPVisionConfig

[[autodoc]] AltCLIPVisionConfig

AltCLIPProcessor

[[autodoc]] AltCLIPProcessor

AltCLIPModel

[[autodoc]] AltCLIPModel
    - forward
    - get_text_features
    - get_image_features

AltCLIPTextModel

[[autodoc]] AltCLIPTextModel
    - forward

AltCLIPVisionModel

[[autodoc]] AltCLIPVisionModel
    - forward