### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Amirhossein Yousefi (repo maintainer)
- **Model type:** **Dual‑encoder** (vision transformer + text transformer) trained with **contrastive objectives** (CLIP softmax contrastive loss or SigLIP sigmoid loss)
- **Language(s) (NLP):** English captions (Flickr8k/Flickr30k)
- **License:** *No explicit license file in the repo at authoring time; respect base model licenses.*
- **Finetuned from model [optional]:** Typical backbones are `openai/clip-vit-base-patch16` and `google/siglip-base-patch16-224`
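The two contrastive objectives named above differ mainly in how they normalize the in-batch similarity matrix: CLIP applies a softmax across each row and column (matching pairs on the diagonal), while SigLIP scores every image–text cell as an independent binary classification. A minimal NumPy sketch for intuition only (the real losses live in the backbone implementations and the repo's training code; the temperature and bias defaults below are the commonly cited initialization values, not values taken from this repo):

```python
import numpy as np

def _normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric softmax (InfoNCE) loss: matching pairs sit on the diagonal."""
    logits = _normalize(img_emb) @ _normalize(txt_emb).T / temperature
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)          # stabilize softmax
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()             # diagonal = true pairs
    return 0.5 * (xent(logits) + xent(logits.T))      # image->text + text->image

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (i, j) cell is an independent binary label."""
    logits = _normalize(img_emb) @ _normalize(txt_emb).T * t + b
    z = 2.0 * np.eye(len(img_emb)) - 1.0              # +1 on diagonal, -1 elsewhere
    return np.mean(np.log1p(np.exp(-z * logits)))     # = -mean log sigmoid(z * logits)
```

Because the sigmoid loss has no cross-example normalization, SigLIP's per-pair terms are independent of batch composition, which is the practical motivation for offering both objectives.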
### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
- **Papers:**
  - CLIP: Radford et al., 2021 – https://arxiv.org/abs/2103.00020
  - SigLIP: Zhai et al., 2023 – https://arxiv.org/abs/2303.15343

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- **Task:** Image–text retrieval (image→text and text→image) on English-captioned datasets, using CLIP/SigLIP encoders fine‑tuned via this repo.
- **Artifacts:** Training entrypoint (`src/main_training.py`), scripted evaluator (`src/evaluate_.py`), and index/metric utilities (`src/index_utils.py`, `src/retrieval_metrics.py`).
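Retrieval quality of this kind is typically reported as Recall@K and median rank. A simplified NumPy sketch of those metrics, for orientation only (the repo's `src/retrieval_metrics.py` is the authoritative implementation and may differ in details such as multi-caption handling):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match ranks in the top-k.

    sim: (N, N) query-by-candidate similarity matrix where sim[i, i]
    is the score of query i's ground-truth candidate.
    """
    order = np.argsort(-sim, axis=1)                  # candidates best-first per query
    hits = (order[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

def median_rank(sim):
    """MedR: median 1-based rank of the ground-truth candidate."""
    order = np.argsort(-sim, axis=1)
    pos = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    return float(np.median(pos))
```

The same matrix serves both directions: pass image-to-text similarities for image→text retrieval and its transpose for text→image.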
### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- **Semantic search** over image collections (export embeddings and index with FAISS).
- **Zero‑shot classification** via text prompts (CLIP‑style) as a quick sanity check.
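The semantic-search path reduces to inner-product search over L2-normalized embeddings, which is exactly what an inner-product FAISS index (e.g. `IndexFlatIP`) computes over normalized vectors. A NumPy stand-in to show the idea, assuming you have already exported the embeddings:

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize once so inner products rank by cosine similarity
    (the same ordering a FAISS IndexFlatIP over normalized vectors gives)."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index, query, k=5):
    """Return indices and cosine scores of the k best-matching images."""
    q = query / np.linalg.norm(query)
    scores = index @ q                    # cosine similarity per indexed image
    top = np.argsort(-scores)[:k]         # best-first
    return top, scores[top]
```

For real collections, swap the matrix product for a FAISS index; the normalization step stays the same.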
### Training Procedure

#### Preprocessing
- Uses `AutoProcessor`/`image_processor` + tokenizer.
- For **SigLIP**, text padding is set to `max_length`; **CLIP** can use dynamic padding.
- **Random caption per image** is sampled per step to keep batches well‑mixed.
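The random-caption step can be pictured as a small collate function; this is an illustrative sketch only (the actual collation in `src/main_training.py` may structure batches differently):

```python
import random

def sample_caption(captions, rng=random):
    """Pick one caption per image each step so successive epochs mix pairs."""
    return rng.choice(captions)

def collate(batch, rng=random):
    """batch: list of (image, captions) pairs, captions a non-empty list."""
    images = [image for image, _ in batch]
    texts = [sample_caption(captions, rng) for _, captions in batch]
    return images, texts
```

Sampling a fresh caption per step means each of Flickr's five captions per image eventually pairs with its image, without inflating the dataset fivefold.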
|