Instructions for using openai/clip-vit-large-patch14 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use openai/clip-vit-large-patch14 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png",
    candidate_labels=["animals", "humans", "landscape"],
)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-large-patch14")
```

- Notebooks
- Google Colab
- Kaggle
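The "Load model directly" snippet above only instantiates the processor and model. The following is a minimal sketch of running them yourself; the image URL and candidate labels are simply reused from the pipeline example, not a prescribed workflow.

```python
# Sketch: manual zero-shot classification with the directly loaded objects.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-large-patch14")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["animals", "humans", "landscape"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each candidate label
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```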
Fine-tuning CLIP model for image-image search
Hi all, I've been working on image-image search tasks, and CLIP has worked really well for me. Now I want to push the performance of my approach further, and I was thinking of fine-tuning the CLIP model for this task. Currently I just generate the embeddings of the images, store them in a vector index, and then compute the cosine similarity between the embedding of my search image and all the embeddings in the index. I'm not really using any zero-shot application or image-text comparison, yet all the fine-tuning approaches for CLIP models I've read about use text-image pairs. I don't understand how I should fine-tune the model to improve the performance of my application: should I use text-image pairs, or should I only fine-tune the visual encoder of the model? If it's the latter, does anyone have examples of how I can do it?
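For the image-image setup described in this post, a minimal sketch of the embedding-and-cosine-similarity step (not the poster's actual code; the file paths and the two-image comparison are placeholder assumptions) might use only CLIP's vision tower via get_image_features:

```python
# Sketch: image-image similarity with CLIP's vision encoder only.
# File paths are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

def embed_image(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    # L2-normalize so a dot product equals cosine similarity
    return features / features.norm(dim=-1, keepdim=True)

query = embed_image("query.jpg")          # hypothetical file
candidate = embed_image("candidate.jpg")  # hypothetical file
similarity = (query @ candidate.T).item()
print(f"cosine similarity: {similarity:.4f}")
```

In a real index you would store the normalized embeddings for all catalog images and rank them by this dot product against the query embedding.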
Why do you use this model?
Hey AFRF, can you tell me how I can use this model to compare the similarity between an image and a text (sentence), rather than providing a bunch of classes?
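One way to score a single image against a free-form sentence is to compare the projected image and text embeddings directly. The sketch below assumes an example image URL and caption of my own choosing; they are not from this thread.

```python
# Sketch: score one image against one sentence with CLIP.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["two parrots sitting on a branch"],  # any free-form sentence works
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity between the two projected embeddings
score = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(f"image-text similarity: {score:.4f}")
```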