Zero-Shot Image Classification
Transformers
Safetensors
siglip
vision

How do I use this model?

#19
by Adefe - opened

The question might be very stupid, but I wanted to use this model (and the siglip1 model) in a scenario where OpenAI CLIP generates text and image embeddings in the same space. But I get trash results for both SigLIP and SigLIP 2. I just can't get any meaningful similarity at all with SigLIP! Can someone please advise me on how to use it? Is this model even capable of producing text and image embeddings in the same space?

My code:

!pip install -U torch torchvision pillow requests transformers

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "google/siglip2-base-patch16-224"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"{device=}")
model = model.to(device)
model.eval()

image_url = "https://cdn.omlet.com/images/originals/breed_abyssinian_cat.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

text = "An image of a cat"

with torch.no_grad():
    # The processor handles resizing/normalization; padding is not needed for images
    inputs_image = processor(images=image, return_tensors="pt").to(device)
    # get_image_features already returns the pooled embedding tensor,
    # so there is no .pooler_output attribute to access
    image_features = model.get_image_features(**inputs_image)

    # SigLIP's tokenizer expects padding="max_length" (64 tokens)
    inputs_text = processor(text=[text], padding="max_length", return_tensors="pt").to(device)
    text_features = model.get_text_features(**inputs_text)

cosine_similarity = torch.nn.functional.cosine_similarity(image_features, text_features)

print(f"{cosine_similarity=}")

Output:
cosine_similarity=tensor([0.1285], device='cuda:0')
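For what it's worth, a raw cosine of ~0.13 is not necessarily meaningless here: SigLIP is trained with a sigmoid loss, so the model applies a learned scale and bias to the cosine before the sigmoid (in `transformers`, `SiglipModel` exposes these as `model.logit_scale` and `model.logit_bias`). A minimal numeric sketch, using illustrative scale/bias values rather than this checkpoint's actual learned ones, shows how a small cosine can still cleanly separate a matching caption from an unrelated one:

```python
import math

def siglip_prob(cosine, scale=100.0, bias=-12.9):
    """Map a cosine similarity to a match probability the SigLIP way:
    sigmoid(scale * cosine + bias). scale/bias here are illustrative,
    not the values learned by google/siglip2-base-patch16-224."""
    return 1.0 / (1.0 + math.exp(-(scale * cosine + bias)))

# A matching caption with cosine ~0.13 vs. an unrelated one with ~0.02
p_match = siglip_prob(0.13)      # sigmoid(0.1)   ~ 0.52
p_unrelated = siglip_prob(0.02)  # sigmoid(-10.9) ~ 1.8e-5

print(f"{p_match=:.3f} {p_unrelated=:.2e}")
```

So absolute cosine values from SigLIP tend to look low compared to CLIP; it is the scaled logit, or the ranking across several candidate texts, that carries the signal.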

Even when using the example from the model card:

from transformers import pipeline

# load pipeline
ckpt = "google/siglip2-base-patch16-224"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# load image and candidate labels (the pipeline accepts a URL directly,
# so the image does not need to be opened manually)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference
outputs = image_classifier(url, candidate_labels)
print(outputs)

I get:
[{'score': 0.0015877661062404513, 'label': '2 cats'}, {'score': 6.586445670109242e-05, 'label': 'a remote'}, {'score': 3.9615420064365026e-06, 'label': 'a plane'}]

Am I supposed to get a similarity of 0.0016?
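One possible reading of those numbers (a guess, not an official answer): the zero-shot pipeline for SigLIP applies a sigmoid to each label's logit independently, so the scores are not forced to sum to 1 the way CLIP's softmax scores are, and they can all look tiny in absolute terms. If CLIP-style relative scores are wanted, one way is to invert the sigmoid back to logits and re-normalize with a softmax:

```python
import math

def sigmoid_scores_to_relative(scores):
    """Convert independent sigmoid probabilities to CLIP-style relative
    scores: invert the sigmoid to get logits, then softmax over labels."""
    logits = [math.log(p / (1.0 - p)) for p in scores]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The pipeline scores from the model-card example above
# ("2 cats", "a remote", "a plane")
scores = [0.0015877661062404513, 6.586445670109242e-05, 3.9615420064365026e-06]
rel = sigmoid_scores_to_relative(scores)
print(rel)  # "2 cats" dominates once the scores are compared relatively
```

On this reading, the ranking in the output is exactly right; only the absolute scale of the scores differs from what CLIP would produce.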

Adefe changed discussion status to closed
