Amirhossein75 committed
Commit a185dd7 · verified · 1 Parent(s): 2c17234

Update README.md

Files changed (1):
  1. README.md +5 -7
README.md CHANGED
@@ -33,17 +33,15 @@ This repository provides a clean, reproducible **training recipe** to fine‑tun
  ### Model Description
  <!-- Provide a longer summary of what this model is. -->
  - **Developed by:** Amirhossein Yousefi (repo maintainer)
- - **Funded by [optional]:** Not specified
- - **Shared by [optional]:** Public, open-source repository
  - **Model type:** **Dual‑encoder** (vision transformer + text transformer) trained with **contrastive objectives** (CLIP softmax contrastive loss or SigLIP sigmoid loss)
  - **Language(s) (NLP):** English captions (Flickr8k/Flickr30k)
  - **License:** *No explicit license file in the repo at authoring time; respect base model licenses.*
  - **Finetuned from model [optional]:** Typical backbones are `openai/clip-vit-base-patch16` and `google/siglip-base-patch16-224`

- ### Model Sources [optional]
+ ### Model Sources
  <!-- Provide the basic links for the model. -->
  - **Repository:** https://github.com/amirhossein-yousefi/Image-Contrastive-CLIP
- - **Paper [optional]:**
+ - **Paper:**
    - CLIP: Radford et al., 2021 – https://arxiv.org/abs/2103.00020
    - SigLIP: Zhai et al., 2023 – https://arxiv.org/abs/2303.15343
  - **Demo [optional]:** (add a Colab/Space link if you publish one)
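As a reference for the objectives named under **Model type**, a minimal PyTorch sketch of the two losses, assuming unit-normalized batch embeddings; the function names, tensor names, and the temperature/bias defaults here are illustrative, not taken from this repo:

```python
# Illustrative sketch only: names and defaults are hypothetical, not the repo's code.
import torch
import torch.nn.functional as F

def clip_softmax_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE; matching pairs lie on the diagonal."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # i-th image <-> i-th text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """SigLIP-style pairwise sigmoid loss; every (i, j) pair is a binary decision."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()
```

The practical difference: the softmax loss normalizes over the whole batch, so every pair competes, while the sigmoid loss scores each image-text pair independently, which is what lets SigLIP scale batch size more freely.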
@@ -56,7 +54,7 @@ This repository provides a clean, reproducible **training recipe** to fine‑tun
  - **Task:** Image–text retrieval (image→text and text→image) on English-captioned datasets, using CLIP/SigLIP encoders fine‑tuned via this repo.
  - **Artifacts:** Training entrypoint (`src/main_training.py`), scripted evaluator (`src/evaluate_.py`), and index/metric utilities (`src/index_utils.py`, `src/retrieval_metrics.py`).

- ### Downstream Use [optional]
+ ### Downstream Use
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
  - **Semantic search** over image collections (export embeddings and index with FAISS).
  - **Zero‑shot classification** via text prompts (CLIP‑style) as a quick sanity check.
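To make the **Semantic search** bullet concrete, a minimal sketch of exporting embeddings and indexing them with FAISS; the checkpoint is one of the typical backbones listed above, while the image paths and the query string are hypothetical:

```python
# Hedged sketch: checkpoint from the card above; file names and query are hypothetical.
import faiss
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch16"
model = CLIPModel.from_pretrained(ckpt).eval()
processor = CLIPProcessor.from_pretrained(ckpt)

@torch.no_grad()
def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    return F.normalize(model.get_image_features(**inputs), dim=-1).numpy()

@torch.no_grad()
def embed_texts(texts):
    inputs = processor(text=texts, padding=True, return_tensors="pt")
    return F.normalize(model.get_text_features(**inputs), dim=-1).numpy()

paths = ["dog.jpg", "beach.jpg"]                        # hypothetical image files
index = faiss.IndexFlatIP(model.config.projection_dim)  # inner product = cosine on unit vectors
index.add(embed_images(paths))

scores, ids = index.search(embed_texts(["a dog playing fetch"]), 2)
print([paths[i] for i in ids[0]], scores[0])
```

**Zero‑shot classification** is the same text-side computation with class-name prompts (e.g. "a photo of a dog") ranked against a single image embedding.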
@@ -124,7 +122,7 @@ The evaluator builds an index and writes retrieval metrics (R@1/5/10, MedR, and

  ### Training Procedure

- #### Preprocessing [optional]
+ #### Preprocessing
  - Uses `AutoProcessor`/`image_processor` + tokenizer.
  - For **SigLIP**, text padding is set to `max_length`; **CLIP** can use dynamic padding.
  - **Random caption per image** is sampled per step to keep batches well‑mixed.
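A short sketch of the padding rules in **Preprocessing**, using the Hugging Face `AutoProcessor` API as described; the caption list and sampling are a made-up example of the per-step random-caption behavior:

```python
# Hedged sketch of the padding rules; captions below are a made-up example.
import random
from transformers import AutoProcessor

captions = ["a dog runs", "a brown dog sprints across the grass"]  # several refs per image
caption = random.choice(captions)  # "random caption per image", resampled every step

# SigLIP expects fixed-length text padding:
siglip = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
siglip_batch = siglip(text=[caption], padding="max_length", return_tensors="pt")

# CLIP tolerates dynamic, longest-in-batch padding:
clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch16")
clip_batch = clip(text=[caption], padding=True, return_tensors="pt")
```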
@@ -132,7 +130,7 @@ The evaluator builds an index and writes retrieval metrics (R@1/5/10, MedR, and
  #### Training Hyperparameters
  - **Training regime:** Typical starting point — `epochs=5`, `lr=1e-5`, `train_bs=64`, `eval_bs=128`, `grad_accum=4`, `warmup_ratio=0.05`, `fp16` mixed precision.

- #### Speeds, Sizes, Times [optional]
+ #### Speeds, Sizes, Times
  - For **16 GB** GPUs, consider `--image_resize 196`, `--train_bs 32 --grad_accum 8`, and `--grad_ckpt`. TF32 and SDPA attention are enabled where supported for throughput.

  ## Evaluation
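The hunk context above mentions the evaluator's retrieval metrics (R@1/5/10, MedR). A small self-contained sketch of how such metrics are typically computed; it is not the code in `src/retrieval_metrics.py`:

```python
# Sketch of standard retrieval metrics; not the repo's src/retrieval_metrics.py.
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j] = similarity of query i to candidate j; ground truth is j == i."""
    order = np.argsort(-sim, axis=1)                                      # best candidate first
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1  # 1-based rank of truth
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

sim = np.random.default_rng(0).standard_normal((100, 100))  # random demo matrix
print(retrieval_metrics(sim))
```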
 