---
language:
- en
tags:
- multimodal
- language
- vision
- image-search
- pytorch
license: mit
metrics:
- MRR
---

### Model Card: clip-imageclef

### Model Details

[OpenAI CLIP model](https://openai.com/blog/clip/) fine-tuned using image-caption pairs from the [Caption Prediction dataset](https://www.imageclef.org/2017/caption) provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning; MRR@10 improved from 0.57 to 0.88.
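
For reference, MRR@k scores each caption query by the reciprocal rank of the first correct image among the top k retrieved results, averaged over all queries. A minimal sketch of the metric; the `mrr_at_k` helper is illustrative, not the evaluation code used here:

```python
def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the first relevant result within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# MRR@10 is the mean of mrr_at_k(..., k=10) over all caption queries
```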

### Model Date

September 6, 2021

### Model Type

The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
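
A minimal sketch of that contrastive objective, assuming batched image and text embeddings; `clip_contrastive_loss` and the fixed temperature are illustrative, not the exact training code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of a batch of pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0))           # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```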

### Fine-tuning

The fine-tuning can be reproduced using code from the GitHub repository [elsevierlabs-os/clip-image-search](https://github.com/elsevierlabs-os/clip-image-search#fine-tuning).
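
For orientation, a minimal sketch of one contrastive fine-tuning step with the Hugging Face `transformers` API; the file names, captions, and learning rate are placeholders, and the repository above is the authoritative recipe:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative setting

# one optimization step on a toy batch of image-caption pairs
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]  # placeholder files
captions = ["first caption", "second caption"]
batch = processor(text=captions, images=images,
                  return_tensors="pt", padding=True)
outputs = model(**batch, return_loss=True)  # built-in symmetric contrastive loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```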

### Usage

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# fine-tuned weights for the model; the matching CLIP processor for preprocessing
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["an example caption"]     # list of strings
images = [Image.open("example.jpg")]  # list of PIL images

inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
```
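
For caption-to-image search, the similarity scores in `outputs` can be ranked directly; a minimal continuation of the snippet above:

```python
import torch

# logits_per_text holds caption-to-image scores, shape (num_captions, num_images)
probs = outputs.logits_per_text.softmax(dim=-1)
best_image = torch.argmax(probs, dim=-1)  # index of the best-matching image per caption
```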

### Performance

| Model                       | MRR@1 | MRR@3 | MRR@5 | MRR@10 | MRR@20 |
| --------------------------- | ----- | ----- | ----- | ------ | ------ |
| zero-shot CLIP (baseline)   | 0.426 | 0.534 | 0.558 | 0.573  | 0.578  |
| clip-imageclef (this model) | 0.802 | 0.872 | 0.877 | 0.879  | 0.880  |