Google's SigLIP is another alternative to OpenAI's CLIP, it just got merged into 🤗 transformers, and it's super easy to use!
To celebrate this, I have created a repository of notebooks and a bunch of Spaces on various SigLIP-based projects 🥳
Search for art 👉 merve/draw_to_search_art
Compare SigLIP with CLIP 👉 merve/compare_clip_siglip
How does SigLIP work?
SigLIP is a vision-text pre-training technique based on contrastive learning. It jointly trains an image encoder and a text encoder such that the dot product of the embeddings is highest for matching image-text pairs.
The image below is taken from the CLIP paper, where this contrastive pre-training is done with a softmax; SigLIP replaces the softmax with a sigmoid. 📎
Highlights from the paper on why you should use it ✨
🖼️📝 The authors used a medium-sized B/16 ViT for the image encoder and a B-sized transformer for the text encoder
😍 More performant than CLIP on zero-shot classification
🗣️ The authors trained a multilingual model too!
⚡️ Super efficient: the sigmoid loss enables batch sizes of up to 1M items, but the authors chose 32k because performance saturates beyond that (see the sketch below)
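To make the softmax vs. sigmoid point concrete, here is a minimal sketch of the pairwise sigmoid loss described in the paper (my own PyTorch simplification, not the authors' code; shapes and names are illustrative). Every image-text pair in the batch becomes an independent binary classification, so no batch-wide normalization is needed:

import torch
import torch.nn.functional as F

def siglip_loss(image_embeds, text_embeds, temperature, bias):
    # L2-normalize both sets of embeddings so the dot product is a cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # pairwise logits: every image in the batch against every text
    logits = image_embeds @ text_embeds.t() * temperature + bias
    # +1 on the diagonal (matching pairs), -1 everywhere else (non-matching pairs)
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1
    # each pair is an independent binary problem; no softmax over the batch is needed
    return -F.logsigmoid(labels * logits).sum() / n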
It's super easy to use thanks to transformers 👇
from transformers import pipeline
from PIL import Image
import requests
# load the zero-shot image classification pipeline with the multilingual SigLIP checkpoint
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-256-i18n")
# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
# run zero-shot classification with free-form candidate labels
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

For all the SigLIP notebooks on similarity search and indexing, you can check this [repository](https://github.com/merveenoyan/siglip) out. 🤗
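If you need raw embeddings rather than the classification pipeline (for example for the similarity search and indexing notebooks), here is a minimal sketch using AutoModel with the same checkpoint; the example image, texts, and the final dot-product comparison are illustrative, not from the repository:

import torch
import requests
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-256-i18n")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-i18n")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # image embedding
    image_inputs = processor(images=image, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    # text embeddings (SigLIP was trained with padding="max_length")
    text_inputs = processor(text=["2 cats", "a plane"], padding="max_length", return_tensors="pt")
    text_embeds = model.get_text_features(**text_inputs)

# normalize and compare with a dot product, e.g. before putting the vectors in an index
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print(image_embeds @ text_embeds.t())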