How do I use this model?
#19 by Adefe - opened
This might be a very basic question, but I wanted to use this model and the siglip1 model the same way OpenAI CLIP is used, with text and image embeddings generated in the same space. However, I get garbage results for both siglip and siglip2: I just can't get any meaningful similarity at all. Can someone please advise me on how to use it? Is this model even capable of producing text and image embeddings in the same space?
My code:
!pip install -U torch torchvision pillow requests transformers
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel
model_name = "google/siglip2-base-patch16-224"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"{device=}")
model = model.to(device)
model.eval()
image_url = "https://cdn.omlet.com/images/originals/breed_abyssinian_cat.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
text = "An image of a cat"
with torch.no_grad():
    inputs_image = processor(images=image, return_tensors="pt", padding="max_length").to(device)
    image_features = model.get_image_features(**inputs_image).pooler_output
    inputs_text = processor(text=[text], padding="max_length", return_tensors="pt").to(device)
    text_features = model.get_text_features(**inputs_text).pooler_output
    cosine_similarity = torch.nn.functional.cosine_similarity(image_features, text_features)
print(f"{cosine_similarity=}")
Output: cosine_similarity=tensor([0.1285], device='cuda:0')
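For context on the number above: as far as I know, SigLIP-style models are trained with a pairwise sigmoid loss, and the Transformers `SiglipModel` forward pass turns a cosine similarity into a logit by multiplying it by `exp(logit_scale)` and adding `logit_bias` before a sigmoid is applied. The sketch below only illustrates that mapping in plain Python. The function name `siglip_pair_probability` is hypothetical, and the scale and bias values are made-up placeholders, not the values stored in this checkpoint (those would have to be read from `model.logit_scale` and `model.logit_bias`).

```python
import math


def siglip_pair_probability(cosine_sim, logit_scale, logit_bias):
    """Map a cosine similarity to a sigmoid probability, SigLIP-style.

    logit_scale is assumed to be stored in log space, so it is
    exponentiated before being applied.
    """
    logit = cosine_sim * math.exp(logit_scale) + logit_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder values chosen only for illustration (assumption):
# they are NOT the learned parameters of any real checkpoint.
ILLUSTRATIVE_LOG_SCALE = math.log(100.0)
ILLUSTRATIVE_BIAS = -12.0

# Show how a narrow band of cosine similarities spreads over probabilities.
for cos in (0.05, 0.1285, 0.20):
    p = siglip_pair_probability(cos, ILLUSTRATIVE_LOG_SCALE, ILLUSTRATIVE_BIAS)
    print(f"cos={cos:.4f} -> p={p:.4f}")
```

Under this assumption, a raw cosine similarity around 0.1 is not necessarily meaningless. What matters is where it falls relative to the learned scale and bias, and how it ranks against the similarities of other candidate texts.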
Even when using the example from the model card:
from transformers import pipeline
# load pipeline
ckpt = "google/siglip2-base-patch16-224"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")
# load image and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]
# run inference
# NOTE: `image` here is the cat photo loaded in the first snippet; `url` above is never used
outputs = image_classifier(image, candidate_labels)
print(outputs)
I get: [{'score': 0.0015877661062404513, 'label': '2 cats'}, {'score': 6.586445670109242e-05, 'label': 'a remote'}, {'score': 3.9615420064365026e-06, 'label': 'a plane'}]
Am I supposed to get a similarity of 0.0016?
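A hedged way to read the scores above: to my understanding, the zero-shot pipeline applies an independent sigmoid per label for SigLIP checkpoints rather than a softmax across labels, so the scores are not expected to sum to 1. The sketch below takes the three reported scores, recovers their logits, and renormalizes them with a softmax purely for comparison with CLIP-style output. The softmax step is an assumption made for illustration, not something the pipeline does, and the helpers `logit` and `softmax` are defined here rather than taken from any library.

```python
import math

# Scores copied from the pipeline output quoted above.
reported = {
    "2 cats": 0.0015877661062404513,
    "a remote": 6.586445670109242e-05,
    "a plane": 3.9615420064365026e-06,
}


def logit(p):
    """Inverse of the sigmoid: recover the logit behind a probability."""
    return math.log(p / (1.0 - p))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Recover per-label logits, then renormalize across labels (illustration only).
logits = {label: logit(p) for label, p in reported.items()}
relative = dict(zip(logits, softmax(list(logits.values()))))

print(f"sum of raw sigmoid scores: {sum(reported.values()):.6f}")
for label, share in relative.items():
    print(f"{label!r}: raw={reported[label]:.3e}, softmax share={share:.4f}")
```

Read this way, the absolute values are small, but the ranking is still informative: "2 cats" scores far above the other two labels.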
Adefe changed discussion status to closed