How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("zero-shot-image-classification", model="sujitpal/clip-imageclef")
pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png",
    candidate_labels=["animals", "humans", "landscape"],
)
# Load model directly
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

processor = AutoProcessor.from_pretrained("sujitpal/clip-imageclef")
model = AutoModelForZeroShotImageClassification.from_pretrained("sujitpal/clip-imageclef")

Model Card: clip-imageclef

Model Details

OpenAI CLIP model fine-tuned using image-caption pairs from the Caption Prediction dataset provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning; MRR@10 was 0.57 before and 0.88 after.

Model Date

September 6, 2021

Model Type

The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
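
To make the contrastive objective concrete, here is a minimal pure-Python sketch of a symmetric cross-entropy loss over an image-text similarity matrix. It is an illustration under stated assumptions, not the training code used for this model: the function names are hypothetical, and the input is assumed to be a square matrix of already temperature-scaled similarities in which matching (image, text) pairs lie on the diagonal.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of one list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(similarity):
    """Symmetric cross-entropy over an N x N similarity matrix (hypothetical helper).

    similarity[i][j] is the similarity of image i and text j;
    the matching pair for image i is assumed to be text i.
    """
    n = len(similarity)
    # Image-to-text direction: each row should concentrate on its diagonal entry.
    loss_i2t = -sum(_log_softmax(similarity[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation over columns.
    columns = [list(col) for col in zip(*similarity)]
    loss_t2i = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy matrices: one where matching pairs score highest, one where they do not.
aligned = [[10.0, 0.0], [0.0, 10.0]]
misaligned = [[0.0, 10.0], [10.0, 0.0]]
print(clip_contrastive_loss(aligned), clip_contrastive_loss(misaligned))
```

The loss is small when each image is most similar to its own caption and large otherwise, which is the behavior the encoders are trained toward.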

Fine-tuning

The fine-tuning can be reproduced using code from the GitHub repository elsevierlabs-os/clip-image-search.

Usage

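from transformers import CLIPModel, CLIPProcessor

# Fine-tuned weights come from this repository; the processor is the base CLIP one.
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# captions: list of str; images: list of PIL images (both supplied by the caller)
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
output = model(**inputs)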
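
The model output exposes an image-by-caption score matrix as `output.logits_per_image`. As a rough sketch of how those scores might be turned into a ranking, the helper below picks the best-scoring captions for each image. The function name `top_k_captions` and the toy scores and captions are illustrative assumptions; in real use the scores would come from something like `output.logits_per_image.detach().tolist()`.

```python
def top_k_captions(scores, captions, k=3):
    """For each image (one row of scores), return the k best-scoring captions."""
    ranked = []
    for row in scores:
        # Sort caption indices by descending score for this image.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([captions[j] for j in order[:k]])
    return ranked


# Toy values standing in for output.logits_per_image.detach().tolist()
toy_scores = [[24.1, 18.3, 20.7], [17.9, 25.6, 19.2]]
toy_captions = ["chest x-ray", "brain MRI", "histology slide"]
top = top_k_captions(toy_scores, toy_captions, k=2)
print(top)
```

Working on a plain nested list keeps the ranking logic independent of the tensor library; the same idea applies to `output.logits_per_text` when ranking images for each caption.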

Performance

MRR@k by model               k=1    k=3    k=5    k=10   k=20
zero-shot CLIP (baseline)    0.426  0.534  0.558  0.573  0.578
clip-imageclef (this model)  0.802  0.872  0.877  0.879  0.880
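
For readers unfamiliar with the metric, here is a minimal sketch of how MRR@k is commonly computed from the rank of the correct item per query. The card does not describe the exact evaluation protocol (for example, the retrieval direction or candidate pool), so the function `mrr_at_k` and the toy ranks below are assumptions for illustration only and do not reproduce the figures above.

```python
def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting only hits within the top k.

    ranks holds the 1-based rank of the correct item for each query,
    or None when the correct item was not retrieved at all.
    """
    total = 0.0
    for rank in ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(ranks)


# Toy ranks for five queries (illustrative only).
toy_ranks = [1, 3, 2, None, 1]
for k in (1, 3, 5, 10, 20):
    print(f"MRR@{k} = {mrr_at_k(toy_ranks, k):.3f}")
```

Because a larger k can only admit more hits, MRR@k never decreases as k grows, which matches the pattern in the table.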