Amirhossein75 committed
Commit a185dd7 · verified · 1 Parent(s): 2c17234

Update README.md

Files changed (1):
  1. README.md +5 -7
README.md CHANGED
@@ -33,17 +33,15 @@ This repository provides a clean, reproducible **training recipe** to fine‑tun
  ### Model Description
  <!-- Provide a longer summary of what this model is. -->
  - **Developed by:** Amirhossein Yousefi (repo maintainer)
- - **Funded by [optional]:** Not specified
- - **Shared by [optional]:** Public, open-source repository
  - **Model type:** **Dual‑encoder** (vision transformer + text transformer) trained with **contrastive objectives** (CLIP softmax contrastive loss or SigLIP sigmoid loss)
  - **Language(s) (NLP):** English captions (Flickr8k/Flickr30k)
  - **License:** *No explicit license file in the repo at authoring time; respect base model licenses.*
  - **Finetuned from model [optional]:** Typical backbones are `openai/clip-vit-base-patch16` and `google/siglip-base-patch16-224`

- ### Model Sources [optional]
+ ### Model Sources
  <!-- Provide the basic links for the model. -->
  - **Repository:** https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
- - **Paper [optional]:**
+ - **Paper:**
    - CLIP: Radford et al., 2021 – https://arxiv.org/abs/2103.00020
    - SigLIP: Zhai et al., 2023 – https://arxiv.org/abs/2303.15343
  - **Demo [optional]:** (add a Colab/Space link if you publish one)
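As a reference for the objectives named under **Model type**, a minimal PyTorch sketch of the two losses, assuming unit-normalized batch embeddings; the function names, tensor names, and the temperature/bias defaults here are illustrative, not taken from this repo:

```python
# Illustrative sketch only: names and defaults are hypothetical, not the repo's code.
import torch
import torch.nn.functional as F

def clip_softmax_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE; matching pairs lie on the diagonal."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # i-th image <-> i-th text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """SigLIP-style pairwise sigmoid loss; every (i, j) pair is a binary decision."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()
```

The practical difference: the softmax loss normalizes over the whole batch, so every pair competes, while the sigmoid loss scores each image-text pair independently, which is what lets SigLIP scale batch size more freely.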
@@ -56,7 +54,7 @@ This repository provides a clean, reproducible **training recipe** to fine‑tun
  - **Task:** Image–text retrieval (image→text and text→image) on English-captioned datasets, using CLIP/SigLIP encoders fine‑tuned via this repo.
  - **Artifacts:** Training entrypoint (`src/main_training.py`), scripted evaluator (`src/evaluate_.py`), and index/metric utilities (`src/index_utils.py`, `src/retrieval_metrics.py`).

- ### Downstream Use [optional]
+ ### Downstream Use
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
  - **Semantic search** over image collections (export embeddings and index with FAISS).
  - **Zero‑shot classification** via text prompts (CLIP‑style) as a quick sanity check.
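To make the **Semantic search** bullet concrete, a minimal sketch of exporting embeddings and indexing them with FAISS; the checkpoint is one of the typical backbones listed above, while the image paths and the query string are hypothetical:

```python
# Hedged sketch: checkpoint from the card above; file names and query are hypothetical.
import faiss
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch16"
model = CLIPModel.from_pretrained(ckpt).eval()
processor = CLIPProcessor.from_pretrained(ckpt)

@torch.no_grad()
def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    return F.normalize(model.get_image_features(**inputs), dim=-1).numpy()

@torch.no_grad()
def embed_texts(texts):
    inputs = processor(text=texts, padding=True, return_tensors="pt")
    return F.normalize(model.get_text_features(**inputs), dim=-1).numpy()

paths = ["dog.jpg", "beach.jpg"]                        # hypothetical image files
index = faiss.IndexFlatIP(model.config.projection_dim)  # inner product = cosine on unit vectors
index.add(embed_images(paths))

scores, ids = index.search(embed_texts(["a dog playing fetch"]), 2)
print([paths[i] for i in ids[0]], scores[0])
```

**Zero‑shot classification** is the same text-side computation with class-name prompts (e.g. "a photo of a dog") ranked against a single image embedding.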
@@ -124,7 +122,7 @@ The evaluator builds an index and writes retrieval metrics (R@1/5/10, MedR, and

  ### Training Procedure

- #### Preprocessing [optional]
+ #### Preprocessing
  - Uses `AutoProcessor`/`image_processor` + tokenizer.
  - For **SigLIP**, text padding is set to `max_length`; **CLIP** can use dynamic padding.
  - **Random caption per image** is sampled per step to keep batches well‑mixed.
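A short sketch of the padding rules in **Preprocessing**, using the Hugging Face `AutoProcessor` API as described; the caption list and sampling are a made-up example of the per-step random-caption behavior:

```python
# Hedged sketch of the padding rules; captions below are a made-up example.
import random
from transformers import AutoProcessor

captions = ["a dog runs", "a brown dog sprints across the grass"]  # several refs per image
caption = random.choice(captions)  # "random caption per image", resampled every step

# SigLIP expects fixed-length text padding:
siglip = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
siglip_batch = siglip(text=[caption], padding="max_length", return_tensors="pt")

# CLIP tolerates dynamic, longest-in-batch padding:
clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch16")
clip_batch = clip(text=[caption], padding=True, return_tensors="pt")
```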
@@ -132,7 +130,7 @@ The evaluator builds an index and writes retrieval metrics (R@1/5/10, MedR, and
  #### Training Hyperparameters
  - **Training regime:** Typical starting point — `epochs=5`, `lr=1e-5`, `train_bs=64`, `eval_bs=128`, `grad_accum=4`, `warmup_ratio=0.05`, `fp16` mixed precision.

- #### Speeds, Sizes, Times [optional]
+ #### Speeds, Sizes, Times
  - For **16 GB** GPUs, consider `--image_resize 196`, `--train_bs 32 --grad_accum 8`, and `--grad_ckpt`. TF32 and SDPA attention are enabled where supported for throughput.

  ## Evaluation
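The hunk context above mentions the evaluator's retrieval metrics (R@1/5/10, MedR). A small self-contained sketch of how such metrics are typically computed; it is not the code in `src/retrieval_metrics.py`:

```python
# Sketch of standard retrieval metrics; not the repo's src/retrieval_metrics.py.
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j] = similarity of query i to candidate j; ground truth is j == i."""
    order = np.argsort(-sim, axis=1)                                      # best candidate first
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1  # 1-based rank of truth
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

sim = np.random.default_rng(0).standard_normal((100, 100))  # random demo matrix
print(retrieval_metrics(sim))
```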
 