Instructions for using openai/clip-vit-large-patch14 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use openai/clip-vit-large-patch14 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")
pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png",
    candidate_labels=["animals", "humans", "landscape"],
)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-large-patch14")
```

- Notebooks
- Google Colab
- Kaggle
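The "Load model directly" snippet above only instantiates the processor and model. The following is a minimal sketch of running them yourself; the image URL and candidate labels are simply reused from the pipeline example, not a prescribed workflow.

```python
# Sketch: manual zero-shot classification with the directly loaded objects.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-large-patch14")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["animals", "humans", "landscape"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each candidate label
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```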
Fine-tuning CLIP model for image-image search
Hi all, I've been working on image-image search tasks, and CLIP has worked really well for me. Now I want to push the performance of my approach further, and I was thinking of fine-tuning the CLIP model for this task. Currently I just generate the embeddings of the images, store them in a vector index, and then compute the cosine similarity between the embedding of my search image and all the embeddings in the index. I'm not really using any zero-shot application or image-text comparison, yet all the fine-tuning approaches for CLIP models I've read about use text-image pairs. I don't understand how I should fine-tune the model to improve the performance of my application: should I use text-image pairs, or should I only fine-tune the visual encoder of the model? If it's the latter, does anyone have examples of how I can do it?
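For the image-image setup described in this post, a minimal sketch of the embedding-and-cosine-similarity step (not the poster's actual code; the file paths and the two-image comparison are placeholder assumptions) might use only CLIP's vision tower via get_image_features:

```python
# Sketch: image-image similarity with CLIP's vision encoder only.
# File paths are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

def embed_image(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    # L2-normalize so a dot product equals cosine similarity
    return features / features.norm(dim=-1, keepdim=True)

query = embed_image("query.jpg")          # hypothetical file
candidate = embed_image("candidate.jpg")  # hypothetical file
similarity = (query @ candidate.T).item()
print(f"cosine similarity: {similarity:.4f}")
```

In a real index you would store the normalized embeddings for all catalog images and rank them by this dot product against the query embedding.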
Why do you use this model?
Hey AFRF, can you tell me how I can use this model to compare the similarity between an image and a text (sentence), rather than providing a bunch of classes?
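One way to score a single image against a free-form sentence is to compare the projected image and text embeddings directly. The sketch below assumes an example image URL and caption of my own choosing; they are not from this thread.

```python
# Sketch: score one image against one sentence with CLIP.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["two parrots sitting on a branch"],  # any free-form sentence works
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity between the two projected embeddings
score = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(f"image-text similarity: {score:.4f}")
```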