### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Amirhossein Yousefi (repo maintainer)
- **Model type:** **Dual‑encoder** (vision transformer + text transformer) trained with **contrastive objectives** (CLIP softmax contrastive loss or SigLIP sigmoid loss)
- **Language(s) (NLP):** English captions (Flickr8k/Flickr30k)
- **License:** *No explicit license file in the repo at authoring time; respect base model licenses.*
- **Finetuned from model [optional]:** Typical backbones are `openai/clip-vit-base-patch16` and `google/siglip-base-patch16-224`
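The two contrastive objectives named above differ mainly in how they normalize the in-batch similarity matrix: CLIP applies a softmax across each row and column (matching pairs on the diagonal), while SigLIP scores every image–text cell as an independent binary classification. A minimal NumPy sketch for intuition only (the real losses live in the backbone implementations and the repo's training code; the temperature and bias defaults below are the commonly cited initialization values, not values taken from this repo):

```python
import numpy as np

def _normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric softmax (InfoNCE) loss: matching pairs sit on the diagonal."""
    logits = _normalize(img_emb) @ _normalize(txt_emb).T / temperature
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)          # stabilize softmax
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()             # diagonal = true pairs
    return 0.5 * (xent(logits) + xent(logits.T))      # image->text + text->image

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (i, j) cell is an independent binary label."""
    logits = _normalize(img_emb) @ _normalize(txt_emb).T * t + b
    z = 2.0 * np.eye(len(img_emb)) - 1.0              # +1 on diagonal, -1 elsewhere
    return np.mean(np.log1p(np.exp(-z * logits)))     # = -mean log sigmoid(z * logits)
```

Because the sigmoid loss has no cross-example normalization, SigLIP's per-pair terms are independent of batch composition, which is the practical motivation for offering both objectives.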
### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
- **Papers:**
  - CLIP: Radford et al., 2021 – https://arxiv.org/abs/2103.00020
  - SigLIP: Zhai et al., 2023 – https://arxiv.org/abs/2303.15343

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- **Task:** Image–text retrieval (image→text and text→image) on English-captioned datasets, using CLIP/SigLIP encoders fine‑tuned via this repo.
- **Artifacts:** Training entrypoint (`src/main_training.py`), scripted evaluator (`src/evaluate_.py`), and index/metric utilities (`src/index_utils.py`, `src/retrieval_metrics.py`).
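Retrieval quality of this kind is typically reported as Recall@K and median rank. A simplified NumPy sketch of those metrics, for orientation only (the repo's `src/retrieval_metrics.py` is the authoritative implementation and may differ in details such as multi-caption handling):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match ranks in the top-k.

    sim: (N, N) query-by-candidate similarity matrix where sim[i, i]
    is the score of query i's ground-truth candidate.
    """
    order = np.argsort(-sim, axis=1)                  # candidates best-first per query
    hits = (order[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

def median_rank(sim):
    """MedR: median 1-based rank of the ground-truth candidate."""
    order = np.argsort(-sim, axis=1)
    pos = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    return float(np.median(pos))
```

The same matrix serves both directions: pass image-to-text similarities for image→text retrieval and its transpose for text→image.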
### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- **Semantic search** over image collections (export embeddings and index with FAISS).
- **Zero‑shot classification** via text prompts (CLIP‑style) as a quick sanity check.
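The semantic-search path reduces to inner-product search over L2-normalized embeddings, which is exactly what an inner-product FAISS index (e.g. `IndexFlatIP`) computes over normalized vectors. A NumPy stand-in to show the idea, assuming you have already exported the embeddings:

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize once so inner products rank by cosine similarity
    (the same ordering a FAISS IndexFlatIP over normalized vectors gives)."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index, query, k=5):
    """Return indices and cosine scores of the k best-matching images."""
    q = query / np.linalg.norm(query)
    scores = index @ q                    # cosine similarity per indexed image
    top = np.argsort(-scores)[:k]         # best-first
    return top, scores[top]
```

For real collections, swap the matrix product for a FAISS index; the normalization step stays the same.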
### Training Procedure

#### Preprocessing
- Uses `AutoProcessor`/`image_processor` + tokenizer.
- For **SigLIP**, text padding is set to `max_length`; **CLIP** can use dynamic padding.
- **Random caption per image** is sampled per step to keep batches well‑mixed.
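The random-caption step can be pictured as a small collate function; this is an illustrative sketch only (the actual collation in `src/main_training.py` may structure batches differently):

```python
import random

def sample_caption(captions, rng=random):
    """Pick one caption per image each step so successive epochs mix pairs."""
    return rng.choice(captions)

def collate(batch, rng=random):
    """batch: list of (image, captions) pairs, captions a non-empty list."""
    images = [image for image, _ in batch]
    texts = [sample_caption(captions, rng) for _, captions in batch]
    return images, texts
```

Sampling a fresh caption per step means each of Flickr's five captions per image eventually pairs with its image, without inflating the dataset fivefold.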
|