clip-ViT-B-32
This is the image and text model CLIP, which maps text and images to a shared vector space. For applications of the model, see our documentation: SBERT.net - Image Search.
Usage
After installing sentence-transformers (pip install sentence-transformers), you can use the model as follows:
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities between the image and each description
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
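
To make the last step concrete, here is a minimal, dependency-free sketch of the arithmetic behind a cosine score. It is not the library's implementation; the helper names `cosine_similarity` and `cosine_scores` and the toy 3-dimensional vectors are assumptions made for illustration, and real embeddings from `model.encode` are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_scores(query, candidates):
    """Score one query vector against every candidate vector."""
    return [cosine_similarity(query, c) for c in candidates]


# Toy vectors standing in for an image embedding and three text embeddings
# (hypothetical values, not produced by the model).
img_vec = [1.0, 2.0, 0.0]
text_vecs = [[2.0, 4.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]]

scores = cosine_scores(img_vec, text_vecs)
print(scores)
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are close to orthogonal. The description with the highest score is the best match for the image.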
See our SBERT.net - Image Search documentation for more examples of how the model can be used for image search, zero-shot image classification, image clustering, and image deduplication.
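
As a rough illustration of zero-shot image classification, the sketch below picks the label whose text embedding is most similar to the image embedding and turns the similarities into probabilities with a softmax. The function name `zero_shot_classify`, the `temperature` value, and the toy vectors are assumptions for this example only. In practice the vectors would come from `model.encode` applied to an image and to one text prompt per label.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_vec, label_vecs, labels, temperature=0.01):
    """Return (best_label, probabilities) for one image embedding.

    temperature is an illustrative scaling value (an assumption), not a
    setting read from the model.
    """
    sims = [_cosine(image_vec, v) for v in label_vecs]
    logits = [s / temperature for s in sims]
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: sims[i])
    return labels[best], probs


# Hypothetical toy embeddings: one image and three label prompts.
labels = ["dog", "cat", "city"]
label_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_vec = [0.9, 0.1, 0.0]

best_label, probs = zero_shot_classify(image_vec, label_vecs, labels)
print(best_label, probs)
```

No classifier is trained here: adding a new class only requires encoding one more text prompt, which is what makes the approach zero-shot.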
Performance
The following table reports zero-shot top-1 accuracy on the ImageNet validation set:
| Model | Top-1 Accuracy (%) |
|---|---|
| clip-ViT-B-32 | 63.3 |
| clip-ViT-B-16 | 68.1 |
| clip-ViT-L-14 | 75.4 |
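
For readers unfamiliar with the metric, top-1 accuracy is the percentage of images for which the highest-scoring label is the correct one. The sketch below shows that computation on made-up scores. The function name `top1_accuracy` and the data are hypothetical and unrelated to the ImageNet results above.

```python
def top1_accuracy(score_rows, true_indices):
    """Percentage of rows whose highest-scoring label index matches the truth."""
    correct = 0
    for scores, truth in zip(score_rows, true_indices):
        # Index of the label with the highest similarity score for this image.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == truth:
            correct += 1
    return 100.0 * correct / len(true_indices)


# Hypothetical similarity scores: 4 images scored against 3 labels each.
score_rows = [
    [0.9, 0.1, 0.0],  # predicted label 0
    [0.2, 0.7, 0.1],  # predicted label 1
    [0.3, 0.4, 0.3],  # predicted label 1 (true label is 0, so this is a miss)
    [0.1, 0.2, 0.7],  # predicted label 2
]
true_indices = [0, 1, 0, 2]

accuracy = top1_accuracy(score_rows, true_indices)
print(accuracy)
```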
For a multilingual version of the CLIP model covering 50+ languages, see clip-ViT-B-32-multilingual-v1.
Paper for ermcy/clip-ViT-B-32
arXiv:2103.00020