metadata
language:
- ko
- en
license: mit
metrics:
- recall
base_model:
- google/siglip2-base-patch16-224
tags:
- zero-shot-image-classification
siglip2-base-patch16-224-ko
This is a SigLIP2 model that strengthens the Korean understanding of google/siglip2-base-patch16-224 by training it with the method from Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.
Training data: AI Hub English-Korean parallel dataset
Evaluation data: MS-COCO caption English-Korean dataset
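The distillation recipe named above can be sketched as follows. In that approach a frozen teacher encodes the English sentence, and the student is trained so that its embeddings of both the English sentence and its Korean translation match the teacher's embedding under a mean-squared-error loss. The snippet is a minimal, dependency-free sketch of that objective. The function names and the toy vectors are illustrative assumptions, and this is not the training code used for this model.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en, student_en, student_ko):
    """Multilingual distillation objective for one parallel sentence pair.

    teacher_en: frozen teacher embedding of the English sentence
    student_en: student embedding of the same English sentence
    student_ko: student embedding of the Korean translation
    """
    return mse(teacher_en, student_en) + mse(teacher_en, student_ko)


# Toy 4-dimensional embeddings (hypothetical values, for illustration only).
teacher_en = [0.5, -0.25, 1.0, 0.0]
student_en = [0.5, -0.25, 1.0, 0.0]  # already matches the teacher
student_ko = [0.0, -0.25, 1.0, 0.5]  # Korean embedding still off target

loss = distillation_loss(teacher_en, student_en, student_ko)
print(loss)
```

Minimizing this loss pulls the Korean embedding toward the English one, which is why the English retrieval scores can stay close to the base model while the Korean scores improve.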
How to use
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
repo = "hyunlord/siglip2-base-patch16-224-ko"
model = AutoModel.from_pretrained(repo)
processor = AutoProcessor.from_pretrained(repo)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
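texts = ["고양이 한 마리",  # one cat
         "고양이 두 마리",  # two cats
         "분홍색 소파에 드러누운 고양이 친구들",  # cat friends lying on a pink sofa
         "리모컨과 고양이 두마리",  # a remote control and two cats
         "리모컨 두 개와 고양이 두마리",  # two remote controls and two cats
         "분홍색 소파 위에 리모컨 두 개와 드러누운 고양이 두마리"]  # two remote controls and two cats lying on a pink sofa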
inputs = processor(text=texts,
                   images=image,
                   padding="max_length",
                   max_length=64,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
>>> probs
tensor([[0.0038, 0.0429, 0.8294, 0.9787, 0.9816, 0.9990]])
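To read the output above, it can help to pair each caption with its score. The example applies a sigmoid to each image-text logit independently, so the values need not sum to 1 and several captions can score highly at once. The sketch below hard-codes the six printed scores together with English glosses of the six captions (the glosses are translations added for illustration, not model output) and sorts them from best to worst match.

```python
# Scores copied from the printed tensor; captions are English glosses of `texts`.
captions = [
    "one cat",
    "two cats",
    "cat friends lying on a pink sofa",
    "a remote control and two cats",
    "two remote controls and two cats",
    "two remote controls and two cats lying on a pink sofa",
]
scores = [0.0038, 0.0429, 0.8294, 0.9787, 0.9816, 0.9990]

# Sort caption/score pairs from best to worst match.
ranked = sorted(zip(captions, scores), key=lambda pair: pair[1], reverse=True)
for caption, score in ranked:
    print(f"{score:.4f}  {caption}")

# Sigmoid scores are independent per caption, so they do not sum to 1.
total = sum(scores)
```

With the model loaded, the same list of scores could be obtained from `probs[0].tolist()` instead of the hard-coded values.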
MS-COCO Caption Evaluation
| Model | Parameter Size | (En) I-T Recall@1 | (En) T-I Recall@1 | (Ko) I-T Recall@1 | (Ko) T-I Recall@1 |
|---|---|---|---|---|---|
| google/siglip2-base-patch16-224 | 375,187,970 | 65.20% | 48.29% | 45.68% | 25.44% |
| google/siglip2-so400m-patch14-384 | 1,136,008,498 | 67.74% | 52.04% | 52.36% | 31.59% |
| hyunlord/siglip2-base-patch16-224-ko | 375,187,970 | 65.54% | 47.99% | 57.24% | 36.55% |
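The card does not spell out the evaluation protocol, so the following is only a sketch of how Recall@1 is commonly computed for retrieval, assuming "I-T" means image-to-text and "T-I" means text-to-image. For each query, the top-scoring candidate is checked against the set of correct matches. The function name, the toy similarity matrix, and the caption-to-image mapping are all hypothetical.

```python
def recall_at_1(similarity, correct):
    """Fraction of queries whose top-scoring candidate is a correct match.

    similarity[q][c]: score between query q and candidate c
    correct[q]: set of candidate indices that count as a match for query q
    """
    hits = 0
    for q, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)  # argmax over candidates
        if best in correct[q]:
            hits += 1
    return hits / len(similarity)


# Toy example: 2 images (queries) and 4 captions (candidates), 2 captions per image.
sim_image_to_text = [
    [0.9, 0.2, 0.1, 0.3],  # image 0: top caption is 0, which belongs to image 0
    [0.4, 0.8, 0.5, 0.6],  # image 1: top caption is 1, which belongs to image 0
]
captions_of_image = [{0, 1}, {2, 3}]
i2t = recall_at_1(sim_image_to_text, captions_of_image)

# Text-to-image: transpose the matrix; each caption has exactly one correct image.
sim_text_to_image = [list(col) for col in zip(*sim_image_to_text)]
image_of_caption = [{0}, {0}, {1}, {1}]
t2i = recall_at_1(sim_text_to_image, image_of_caption)
print(i2t, t2i)
```

Because each image has several valid captions but each caption has only one valid image, the two directions are scored against different match sets, which is one reason the I-T and T-I columns differ.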