---
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
tags:
- CLIP
- SigLIP
- contrastive-learning
- dual-encoder
- vision-language
- image-text-retrieval
- huggingface
datasets:
- jxie/flickr8k
base_model:
- openai/clip-vit-base-patch16
- google/siglip-base-patch16-224
license: other
license_name: unspecified
license_link: https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
---

# Model Card for amirhossein-yousefi/Image-Contrastive-CLIP

This repository provides a clean, reproducible **training recipe** to fine-tune CLIP and SigLIP image-text encoders for **bidirectional image-text retrieval** on datasets such as Flickr8k and Flickr30k. It includes a custom contrastive `Trainer`, collators that handle the CLIP and SigLIP tokenization differences, and a retrieval evaluator that reports **R@K** and **Median Rank**.

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefi (repo maintainer)
- **Model type:** **Dual-encoder** (vision transformer + text transformer) trained with **contrastive objectives** (CLIP softmax contrastive loss or SigLIP sigmoid loss)
- **Language(s) (NLP):** English captions (Flickr8k/Flickr30k)
- **License:** *No explicit license file in the repo at authoring time; respect base model licenses.*
- **Finetuned from model:** Typical backbones are `openai/clip-vit-base-patch16` and `google/siglip-base-patch16-224`

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
- **Papers:**
  - CLIP (Radford et al., 2021): https://arxiv.org/abs/2103.00020
  - SigLIP (Zhai et al., 2023): https://arxiv.org/abs/2303.15343

## Uses

### Direct Use

- **Task:** Image-text retrieval (image→text and text→image) on English-captioned datasets, using CLIP/SigLIP encoders fine-tuned via this repo.
- **Artifacts:** Training entrypoint (`src/main_training.py`), scripted evaluator (`src/evaluate_.py`), and index/metric utilities (`src/index_utils.py`, `src/retrieval_metrics.py`).

### Downstream Use

- **Semantic search** over image collections (export embeddings and index with FAISS).
- **Zero-shot classification** via text prompts (CLIP-style) as a quick sanity check.
- **Multimodal RAG / search**: retrieve images given queries or find captions matching an image.
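
The semantic-search use case reduces to a nearest-neighbor lookup over L2-normalized embeddings. The sketch below illustrates that idea in plain Python on toy 3-dimensional vectors; the function names `l2_normalize` and `top_k` and the toy data are illustrative assumptions, not part of the repository. At scale, an inner-product FAISS index over normalized vectors would play the same role.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query, gallery, k=3):
    """Return (index, cosine) pairs for the k gallery vectors closest to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(gallery):
        g = l2_normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, g))))
    # Highest cosine first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for exported image embeddings (hypothetical data).
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
text_query = [0.9, 0.1, 0.0]  # stands in for an encoded caption

hits = top_k(text_query, image_embeddings, k=2)
for idx, score in hits:
    print(f"image {idx}: cosine={score:.3f}")
```

Swapping the roles of the two lists gives image→text search: encode one image as the query and rank caption embeddings instead.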

### Out-of-Scope Use

- **Biometric identification** and surveillance.
- **Safety-critical decision-making** (scores are not calibrated probabilities).
- **Non-English** tasks without additional multilingual data/processing (loaders provided here target English Flickr datasets).

## Bias, Risks, and Limitations

- **Dataset bias:** Flickr datasets contain web captions with possible stereotypes and sensitive attributes; models may learn these associations.
- **Domain shift:** Retrieval quality can degrade outside web-style captions (e.g., medical, aerial, industrial domains).
- **Batch sensitivity:** Contrastive learning quality depends on batch composition and size; SigLIP's sigmoid loss is often less dependent on batch size than CLIP's softmax loss.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Consider reporting R@K disaggregated by people, places, and activities, and add counterfactual tests or prompt templating to reduce biased retrieval.

## How to Get Started with the Model

Use the code below to get started with a minimal fine-tune and evaluation.

```bash
# (optional) conda
conda create -n ic-clip python=3.10 -y && conda activate ic-clip

# Core deps
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -U transformers datasets accelerate timm pillow tqdm tensorboard

# (optional) for retrieval indexing
pip install faiss-cpu  # or faiss-gpu if you have CUDA toolchain
```

```bash
# Train CLIP on Flickr8k
python -m src.main_training \
  --model_name openai/clip-vit-base-patch16 \
  --dataset flickr8k \
  --output_dir runs/clip-finetune-flickr8k \
  --epochs 5 --lr 1e-5 \
  --train_bs 64 --eval_bs 128 \
  --grad_accum 4 --warmup_ratio 0.05 \
  --fp16
```

```bash
# Evaluate a checkpoint on Flickr30k
python -m src.evaluate_ \
  --model_name /path/to/checkpoint_or_hub_id \
  --dataset flickr30k \
  --output_dir runs/clip-finetune-flickr30k \
  --eval_bs 128 --fp16
```

The evaluator builds an index and writes retrieval metrics (R@1/5/10, MedR, and average best cosine) to a JSON file under your run directory.

## Training Details

### Training Data

- **Flickr8k** (`jxie/flickr8k`): 8k images with **5 captions per image**.
- **Flickr30k** (`nlphuji/flickr30k`): ~31k images, also with **5 captions per image**.

### Training Procedure

#### Preprocessing

- Uses `AutoProcessor`/`image_processor` + tokenizer.
- For **SigLIP**, text padding is set to `max_length`; **CLIP** can use dynamic padding.
- **Random caption per image** is sampled per step to keep batches well-mixed.
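
As a rough illustration of the two preprocessing choices above, the sketch below picks one random caption per image and selects tokenizer padding options by model family. It is not the repository's collator: `text_padding_kwargs`, `sample_caption_batch`, and the `image`/`captions` field names are hypothetical, and detecting SigLIP from the model name is an assumption made for brevity.

```python
import random


def text_padding_kwargs(model_name):
    """Hypothetical helper: tokenizer padding options per model family."""
    if "siglip" in model_name.lower():
        # SigLIP: pad every caption to the tokenizer's maximum length.
        return {"padding": "max_length", "truncation": True}
    # CLIP: dynamic padding to the longest caption in the batch.
    return {"padding": True, "truncation": True}


def sample_caption_batch(examples, rng):
    """Pick one caption at random for each image in the batch."""
    images, captions = [], []
    for ex in examples:
        images.append(ex["image"])
        captions.append(rng.choice(ex["captions"]))
    return {"images": images, "captions": captions}


# Toy examples with assumed field names; real records hold 5 captions each.
examples = [
    {"image": "img_0.jpg", "captions": ["a dog runs", "a brown dog", "dog on grass"]},
    {"image": "img_1.jpg", "captions": ["two kids play", "children outside"]},
]

rng = random.Random(0)  # seeded only to make the sketch repeatable
batch = sample_caption_batch(examples, rng)
siglip_kwargs = text_padding_kwargs("google/siglip-base-patch16-224")
clip_kwargs = text_padding_kwargs("openai/clip-vit-base-patch16")
```

Because a fresh caption is drawn on every call, each image is paired with a different caption across epochs while still contributing exactly one positive pair per batch.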

#### Training Hyperparameters

- **Training regime:** Typical starting point: `epochs=5`, `lr=1e-5`, `train_bs=64`, `eval_bs=128`, `grad_accum=4`, `warmup_ratio=0.05`, `fp16` mixed precision.

#### Speeds, Sizes, Times

- For **16 GB** GPUs, consider `--image_resize 196`, `--train_bs 32 --grad_accum 8`, and `--grad_ckpt`. TF32 and SDPA attention are enabled where supported for throughput.
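
A quick worked calculation shows how the default and the 16 GB settings relate. Both give the same number of samples per optimizer step, but the smaller per-device batch sees fewer in-batch negatives. The second figure assumes the contrastive loss is computed per micro-batch, as plain gradient accumulation does; the repository's trainer may behave differently. The helper names are illustrative.

```python
def effective_batch(train_bs, grad_accum, num_gpus=1):
    """Samples contributing to one optimizer step."""
    return train_bs * grad_accum * num_gpus


def in_batch_negatives(train_bs):
    """Negatives per positive pair, assuming the loss is computed per micro-batch."""
    return train_bs - 1


default_cfg = {"train_bs": 64, "grad_accum": 4}   # typical starting point
low_mem_cfg = {"train_bs": 32, "grad_accum": 8}   # 16 GB suggestion

default_eff = effective_batch(**default_cfg)
low_mem_eff = effective_batch(**low_mem_cfg)

print("effective batch:", default_eff, low_mem_eff)
print("negatives per pair:",
      in_batch_negatives(default_cfg["train_bs"]),
      in_batch_negatives(low_mem_cfg["train_bs"]))
```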

## Evaluation

### ✨ Results for Flickr8k

> Test set: **1,000 images** × **5,000 texts**

<p align="center">
  <!-- Directional recalls as badges -->
  <img src="https://img.shields.io/badge/i%E2%86%92t_R%401-90.7%25-4c1?style=for-the-badge" alt="i→t R@1 90.7%">
  <img src="https://img.shields.io/badge/i%E2%86%92t_R%405-99.0%25-4c1?style=for-the-badge" alt="i→t R@5 99.0%">
  <img src="https://img.shields.io/badge/i%E2%86%92t_R%4010-99.4%25-4c1?style=for-the-badge" alt="i→t R@10 99.4%">
  <br/>
  <img src="https://img.shields.io/badge/t%E2%86%92i_R%401-77.06%25-9cf?style=for-the-badge" alt="t→i R@1 77.06%">
  <img src="https://img.shields.io/badge/t%E2%86%92i_R%405-93.82%25-9cf?style=for-the-badge" alt="t→i R@5 93.82%">
  <img src="https://img.shields.io/badge/t%E2%86%92i_R%4010-96.94%25-9cf?style=for-the-badge" alt="t→i R@10 96.94%">
  <br/>
  <img src="https://img.shields.io/badge/images-1,000-informational?style=flat-square" alt="n_images">
  <img src="https://img.shields.io/badge/texts-5,000-informational?style=flat-square" alt="n_texts">
  <img src="https://img.shields.io/badge/avg_best_cosine-0.347-lightgrey?style=flat-square" alt="avg_best_cosine">
</p>

### 📊 Metric Table

| Direction        | R@1        | R@5    | R@10   | MedR | MeanR |
|:-----------------|-----------:|-------:|-------:|-----:|------:|
| **Image → Text** | **90.7%**  | 99.0%  | 99.4%  | 1    | 1.261 |
| **Text → Image** | **77.06%** | 93.82% | 96.94% | 1    | 2.557 |

**Bi-directional averages:** mR@1 = **83.88%**, mR@5 = **96.41%**, mR@10 = **98.17%**
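
The bi-directional averages are the arithmetic mean of the two directional recalls at each K. A minimal check against the table values (the variable names are illustrative):

```python
# Directional recalls (percent) copied from the Flickr8k metric table.
i2t = {"R@1": 90.7, "R@5": 99.0, "R@10": 99.4}
t2i = {"R@1": 77.06, "R@5": 93.82, "R@10": 96.94}

# mR@K = (image->text R@K + text->image R@K) / 2
mean_recall = {k: round((i2t[k] + t2i[k]) / 2, 2) for k in i2t}

for k, value in mean_recall.items():
    print(f"m{k} = {value:.2f}%")
```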

<details>
<summary><b>ASCII bars</b></summary>

```
i→t R@1   ███████████████████████████░░░  90.7%
i→t R@5   ██████████████████████████████  99.0%
i→t R@10  ██████████████████████████████  99.4%

t→i R@1   ███████████████████████░░░░░░░  77.06%
t→i R@5   ████████████████████████████░░  93.82%
t→i R@10  █████████████████████████████░  96.94%
```

</details>

---

### ✨ Results for Flickr30k

> Test set: **1,000 images** × **5,000 texts**

<p align="center">
  <!-- Directional recalls as badges -->
  <img src="https://img.shields.io/badge/i%E2%86%92t_R%401-92.3%25-4c1?style=for-the-badge" alt="i→t R@1 92.3%">
  <img src="https://img.shields.io/badge/i%E2%86%92t_R%405-99.1%25-4c1?style=for-the-badge" alt="i→t R@5 99.1%">
  <img src="https://img.shields.io/badge/i%E2%86%92t_R%4010-99.7%25-4c1?style=for-the-badge" alt="i→t R@10 99.7%">
  <br/>
  <img src="https://img.shields.io/badge/t%E2%86%92i_R%401-79.0%25-9cf?style=for-the-badge" alt="t→i R@1 79.0%">
  <img src="https://img.shields.io/badge/t%E2%86%92i_R%405-95.28%25-9cf?style=for-the-badge" alt="t→i R@5 95.28%">
  <img src="https://img.shields.io/badge/t%E2%86%92i_R%4010-97.86%25-9cf?style=for-the-badge" alt="t→i R@10 97.86%">
  <br/>
  <img src="https://img.shields.io/badge/images-1,000-informational?style=flat-square" alt="n_images">
  <img src="https://img.shields.io/badge/texts-5,000-informational?style=flat-square" alt="n_texts">
  <img src="https://img.shields.io/badge/avg_best_cosine-0.337-lightgrey?style=flat-square" alt="avg_best_cosine">
</p>

### 📊 Metric Table

| Direction        | R@1        | R@5    | R@10   | MedR | MeanR |
|:-----------------|-----------:|-------:|-------:|-----:|------:|
| **Image → Text** | **92.3%**  | 99.1%  | 99.7%  | 1    | 1.198 |
| **Text → Image** | **79.00%** | 95.28% | 97.86% | 1    | 2.158 |

**Bi-directional averages:** mR@1 = **85.65%**, mR@5 = **97.19%**, mR@10 = **98.78%**

<details>
<summary><b>ASCII bars (quick visual)</b></summary>

```
i→t R@1   ████████████████████████████░░  92.3%
i→t R@5   ██████████████████████████████  99.1%
i→t R@10  ██████████████████████████████  99.7%

t→i R@1   ████████████████████████░░░░░░  79.0%
t→i R@5   █████████████████████████████░  95.28%
t→i R@10  █████████████████████████████░  97.86%
```

</details>

---

#### Testing Data

- Flickr8k / Flickr30k test splits via the provided loaders.

#### Factors

- Report retrieval performance in both directions: **image→text** and **text→image**; optionally disaggregate by content types (people, places, activities).

#### Metrics

- **Recall@K (R@1/5/10)**, **Median Rank (MedR)**, and **Average best cosine** similarity.
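
For readers unfamiliar with these metrics, the sketch below computes Recall@K, MedR, and MeanR from a similarity matrix in plain Python. It assumes the common convention that a query's rank is the position of its best-ranked ground-truth item, which matters for image→text where each image has several captions. The repository's `src/retrieval_metrics.py` may differ in detail; the function names and toy scores here are illustrative, and average best cosine is not covered.

```python
from statistics import median


def rank_of_first_hit(scores, positives):
    """1-based rank of the best-scoring ground-truth item for one query."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    for rank, j in enumerate(order, start=1):
        if j in positives:
            return rank
    raise ValueError("query has no ground-truth item")


def retrieval_metrics(sim, positives_per_query, ks=(1, 5, 10)):
    """Recall@K (percent), median rank, and mean rank over all queries."""
    ranks = [rank_of_first_hit(row, pos)
             for row, pos in zip(sim, positives_per_query)]
    out = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    out["MedR"] = median(ranks)
    out["MeanR"] = sum(ranks) / len(ranks)
    return out


# Toy image->text scores: 2 images (rows) x 4 captions (columns).
# Captions 0-1 describe image 0; captions 2-3 describe image 1.
sim_i2t = [
    [0.9, 0.2, 0.4, 0.1],
    [0.3, 0.6, 0.5, 0.2],
]
ground_truth = [{0, 1}, {2, 3}]

metrics = retrieval_metrics(sim_i2t, ground_truth, ks=(1, 2))
print(metrics)
```

For text→image, transpose the matrix and give each caption a single-element ground-truth set containing its image index.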

#### Summary

You should observe improvements over zero-shot CLIP/SigLIP on in-domain retrieval; the magnitude depends on data size, training steps, and prompts.

## Model Examination

Inspect nearest-neighbor hits in both directions and manually audit failure modes (near-duplicates, spurious cues, biased descriptions).

## 🖥️ Training Hardware & Environment

- **Device:** Laptop (Windows, WDDM driver model)
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)
- **Driver:** **576.52**
- **CUDA (driver):** **12.9**
- **PyTorch:** **2.8.0+cu129**
- **CUDA available:** ✅

## 📈 Training Logs & Metrics

- **Total FLOPs (training):** `579,250,830,704,640` for Flickr8k and `3,895,219,925,811,200` for Flickr30k
- **Training runtime:** `480.4213` seconds for Flickr8k and `1,601.6088` seconds for Flickr30k

### Model Architecture and Objective

- **Dual-encoder** architecture (vision transformer + text transformer).
- **CLIP** uses a temperature-scaled softmax contrastive loss; **SigLIP** uses a pairwise sigmoid loss that is less coupled to batch size.
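
To make the difference between the two objectives concrete, here is a minimal plain-Python sketch of both losses on a small similarity matrix, following the formulations in the CLIP and SigLIP papers. It is not the repository's `Trainer` code. The `scale` and `bias` values are illustrative constants; in both models these are learned parameters.

```python
import math


def clip_softmax_loss(sim, scale):
    """Symmetric cross-entropy over rows (image->text) and columns (text->image)."""
    n = len(sim)
    logits = [[scale * s for s in row] for row in sim]

    def cross_entropy(mat):
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]  # matching pair sits on the diagonal
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


def siglip_sigmoid_loss(sim, scale, bias):
    """Independent binary loss on every image-text pair, normalized by batch size."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            x = label * (scale * sim[i][j] + bias)
            # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably
            total += max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]      # matching pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # matching pairs score lowest

for name, sim in (("aligned", aligned), ("misaligned", misaligned)):
    print(name,
          round(clip_softmax_loss(sim, scale=10.0), 4),
          round(siglip_sigmoid_loss(sim, scale=10.0, bias=-5.0), 4))
```

The softmax loss normalizes each row and column over the whole batch, so every term depends on all other pairs. The sigmoid loss scores each pair independently, which is why it is less tied to batch size.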

### Compute Infrastructure

- **Hardware:** Runs on a single GPU or multiple GPUs; memory-saving flags are provided.
- **Software:** Python ≥ 3.9, PyTorch, `transformers`, `datasets`, `accelerate`, `timm`, optional FAISS.

**BibTeX (CLIP):**

```
@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
  booktitle={ICML},
  year={2021}
}
```

**BibTeX (SigLIP):**

```
@inproceedings{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  booktitle={ICCV},
  year={2023}
}
```

## Model Card Contact

- Please open a GitHub issue in the repository.