Amirhossein75 committed · Commit 4d73dc6 · verified · 1 Parent(s): 5ed07e6

Create README.md

  1. README.md +326 -0
README.md ADDED
@@ -0,0 +1,326 @@
---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{
  "library_name": "transformers",
  "pipeline_tag": "image-to-text",
  "license": "apache-2.0",
  "tags": [
    "vision-language",
    "image-captioning",
    "SmolVLM",
    "LoRA",
    "QLoRA",
    "COCO",
    "peft",
    "accelerate"
  ],
  "base_model": "HuggingFaceTB/SmolVLM-Instruct",
  "datasets": ["jxie/coco_captions"],
  "language": ["en"],
  "widget": [
    {
      "text": "Give a concise caption.",
      "src": "https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg"
    }
  ]
}
---

# Model Card for **Image-Captioning-VLM (SmolVLM + COCO, LoRA/QLoRA)**

This repository provides a compact **vision–language image captioning model** built by fine-tuning **SmolVLM-Instruct** with **LoRA/QLoRA** adapters on the **MS COCO Captions** dataset. The goal is an easy-to-train, memory-efficient captioner for research, data labeling, and diffusion training workflows; the **vision tower stays frozen** and only the language/cross-modal components are adapted.

> **TL;DR**
>
> - Base: `HuggingFaceTB/SmolVLM-Instruct` (Apache-2.0).
> - Training data: `jxie/coco_captions` (English captions).
> - Method: LoRA/QLoRA SFT; **vision encoder frozen**.
> - Intended use: generate concise or descriptive captions for general images.
> - Not intended for high-stakes or safety-critical uses.

---

## Model Details

### Model Description

- **Developed by:** *Amir Hossein Yousefi* (GitHub: `amirhossein-yousefi`)
- **Model type:** Vision–Language (**image → text**) captioning model with LoRA/QLoRA adapters on top of **SmolVLM-Instruct**
- **Language(s):** English
- **License:** **Apache-2.0** for the released model artifacts (inherited from the base model's license); the dataset retains its own license (see *Training Data*)
- **Finetuned from:** `HuggingFaceTB/SmolVLM-Instruct`

SmolVLM couples a **shape-optimized SigLIP** vision tower with a compact **SmolLM2** decoder via a multimodal projector, and loads through `AutoModelForVision2Seq`. This project fine-tunes the language side with LoRA/QLoRA while **freezing the vision tower** to keep memory use low and training simple.

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-VLM
- **Base model card:** https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
- **Base technical report:** https://arxiv.org/abs/2504.05299 (SmolVLM)
- **Dataset (training):** https://huggingface.co/datasets/jxie/coco_captions

---

## Uses

### Direct Use

- Generate **concise** or **descriptive** captions for natural images.
- Provide **alt text**/accessibility descriptions (human review recommended).
- Produce captions for **vision dataset bootstrapping** or **diffusion training** pipelines.

**Quickstart (inference script from this repo):**

```bash
python inference_vlm.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --adapter_dir outputs/smolvlm-coco-lora \
  --image https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg \
  --prompt "Give a concise caption."
```

**Programmatic example (PEFT LoRA):**

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "HuggingFaceTB/SmolVLM-Instruct"
adapter_dir = "outputs/smolvlm-coco-lora"  # path from training

# Load the processor and base model (bf16 on GPU, fp32 on CPU)
processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
).to(device)

# Load LoRA/QLoRA adapter
model = PeftModel.from_pretrained(model, adapter_dir).to(device)
model.eval()

# Build a single-turn chat prompt with one image placeholder
image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Generate and decode the caption
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
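
Because `generate` returns the prompt tokens followed by the newly generated ones, the string printed above typically contains the chat prompt as well as the caption. A minimal sketch for keeping only the caption, assuming the decoded text marks the reply with an `Assistant:` prefix (an assumption about the chat template's decoded form; `extract_caption` is a hypothetical helper, not part of this repo):

```python
def extract_caption(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last reply marker, or the whole string if absent."""
    # rpartition splits on the LAST occurrence of the marker;
    # `found` is an empty string when the marker does not occur.
    _, found, tail = decoded.rpartition(marker)
    text = tail if found else decoded
    return text.strip()


# Illustrative decoded string (made up for this sketch, not real model output).
example = "User:Give a concise caption.\nAssistant: A dog runs across a grassy field."
print(extract_caption(example))
```

If the marker never appears, the helper falls back to returning the stripped input, so it is safe to apply to outputs that were already trimmed.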

### Downstream Use

- As a **captioning stage** within multi-step data pipelines (e.g., labeling, retrieval augmentation, dataset curation).
- As a starting point for **continued fine-tuning** on specialized domains (e.g., medical imagery, artwork) with domain-appropriate data and review.

### Out-of-Scope Use

- **High-stakes** or **safety-critical** settings (medical, legal, surveillance, credit decisions, etc.).
- Automated systems where **factuality, fairness, or safety** must be guaranteed without a **human in the loop**.
- Parsing small text (OCR) or reading sensitive PII from images; this model is not optimized for OCR.

---

## Bias, Risks, and Limitations

- **Data bias:** COCO captions are predominantly English and reflect biases of their sources; generated captions may mirror societal stereotypes.
- **Content coverage:** General-purpose images work best; performance may degrade on domains underrepresented in COCO (e.g., medical scans, satellite imagery).
- **Safety:** Captions may occasionally be **inaccurate**, **overconfident**, or **hallucinated**. Always review before downstream use, especially for accessibility.

### Recommendations

- Keep a **human in the loop** for sensitive or impactful applications.
- When adapting to new domains, curate **diverse, representative** training sets and evaluate with domain-specific metrics and audits.
- Log model outputs and collect review feedback to iteratively improve quality.

---

## How to Get Started with the Model

**Environment setup**

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# For QLoRA on NVIDIA GPUs, make sure bitsandbytes is installed; otherwise pass: --use_qlora false
```

**Fine-tune (LoRA/QLoRA; frozen vision tower)**

```bash
python train_vlm_sft.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --dataset_id jxie/coco_captions \
  --output_dir outputs/smolvlm-coco-lora \
  --epochs 1 --batch_size 2 --grad_accum 8 \
  --max_seq_len 1024 --image_longest_edge 1536
```
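
The `--batch_size 2 --grad_accum 8` pair above implies an effective batch of 16 samples per optimizer step on a single GPU, assuming `--batch_size` is the per-device micro-batch and `--grad_accum` is the number of accumulation steps (their exact semantics in `train_vlm_sft.py` are an assumption here). A small sketch of that arithmetic; the helper names and the 10,000-example figure are hypothetical:

```python
import math


def effective_batch_size(batch_size: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return batch_size * grad_accum * num_devices


def optimizer_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int = 1) -> int:
    """Approximate optimizer steps, rounding partial batches up."""
    micro_batches = math.ceil(num_examples / batch_size)
    return epochs * math.ceil(micro_batches / grad_accum)


print(effective_batch_size(2, 8))      # 2 * 8 = 16
print(optimizer_steps(10_000, 2, 8))   # hypothetical subset of 10,000 examples
```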

---

## Training Details

### Training Data

- **Dataset:** `jxie/coco_captions` (English captions for MS COCO images).
- **Notes:** COCO provides **~617k** caption examples with **5 captions per image**; the images come from Flickr and carry their own terms. Please review the dataset card and the original COCO license/terms before use.

### Training Procedure

#### Preprocessing

- Images are resized with **longest_edge = 1536** (4 × 384, consistent with SmolVLM's 384×384 patching strategy at N=4).
- Text sequences are truncated or padded to **max_seq_len = 1024**.
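
To make the `longest_edge = 1536` setting concrete, the sketch below scales an image so its longest side is 1536 pixels and counts how many 384×384 tiles would cover the result. It assumes aspect-ratio-preserving resizing and a simple ceiling-based tile grid; the actual image processor may round, pad, or add a global thumbnail differently, and the function names are hypothetical.

```python
import math

TILE = 384            # tile size named in the preprocessing note
LONGEST_EDGE = 1536   # 4 * 384


def resized_size(width: int, height: int, longest_edge: int = LONGEST_EDGE) -> tuple[int, int]:
    """Scale (width, height) so the longest side equals longest_edge."""
    scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)


def tile_grid(width: int, height: int, tile: int = TILE) -> tuple[int, int]:
    """Rows and columns of tile-sized patches needed to cover the image."""
    return math.ceil(height / tile), math.ceil(width / tile)


w, h = resized_size(640, 480)      # a typical 4:3 photo
rows, cols = tile_grid(w, h)
print((w, h), rows, cols, rows * cols)
```

For a 640×480 input this gives a 1536×1152 image covered by a 3×4 grid, i.e. 12 tiles under these assumptions, which is why larger `longest_edge` values raise the number of image tokens and the memory needed.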

#### Training Hyperparameters

- **Regime:** Supervised fine-tuning with **LoRA** (or **QLoRA**) on the language-side parameters; **vision tower frozen**.
- **Example CLI:** see above. Mixed precision (`bf16` on CUDA) is recommended if available.

#### Speeds, Sizes, Times

- The base SmolVLM card reports **~5 GB minimum GPU RAM** for inference; fine-tuning requires more VRAM depending on batch size and sequence length. See the base card for details.

---

## Evaluation

### 📊 Score card

**Higher is better for every metric (↑).** For visualization, `CIDEr` is shown ×100 in the chart to match the 0–100 scale of the other metrics.

| Split          | CIDEr | CLIPScore | BLEU-4 | METEOR | ROUGE-L | BERTScore-F1 | Images |
|:---------------|------:|----------:|-------:|-------:|--------:|-------------:|-------:|
| **Test**       | 0.560 |    30.830 |  15.73 |  47.84 |   45.18 |        91.73 |   1000 |
| **Validation** | 0.540 |    31.068 |  16.01 |  48.28 |   45.11 |        91.80 |   1000 |


### Quick read on the metrics

- **CIDEr:** consensus with human captions; higher is better for human-like phrasing (values typically range from 0 to above 1).
- **CLIPScore:** reference-free image–text compatibility via CLIP's cosine similarity (commonly rescaled).
- **BLEU-4:** 4-gram precision with brevity penalty (lexical match).
- **METEOR:** unigram match with stemming/synonyms; emphasizes recall.
- **ROUGE-L:** longest common subsequence overlap (structure/recall-leaning).
- **BERTScore-F1:** semantic similarity using contextual embeddings.
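
As an illustration of one of these metrics, here is a minimal ROUGE-L sketch based on the longest common subsequence (LCS). It uses lowercased whitespace tokens and a plain F1 of LCS precision and recall; this is a simplification for intuition, not the scoring code used for the table above, and library implementations may differ in tokenization, stemming, and weighting.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up caption pair for illustration only.
print(rouge_l_f1("a dog runs on grass", "a dog is running on the grass"))
```

In this example the shared subsequence is "a dog on grass" (4 tokens), giving precision 4/5 and recall 4/7.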


### Testing Data, Factors & Metrics

#### Testing Data

- Hold out a portion of **COCO val** (e.g., `val2014`) or custom images for qualitative/quantitative evaluation.

#### Factors

- **Image domain** (indoor/outdoor), **object density**, **scene complexity**, and **presence of small text** (OCR-like) can affect performance.

#### Metrics

- Strong **semantic alignment** (BERTScore-F1 ≈ **91.8** on *val*) and balanced lexical overlap (BLEU-4 ≈ **16.0**).
- **CIDEr** is slightly higher on *test* (0.560) than on *val* (0.540); other metrics are near parity across splits.
- Trained and evaluated with the minimal pipeline in the repo (LoRA/QLoRA-ready).
- The repo includes `eval_caption_metric.py` scaffolding for evaluation.

### Results

- Quantitative results are reported in the score card above; qualitative examples can be added after running the evaluation script.


#### Summary

- The LoRA/QLoRA approach provides **memory-efficient adaptation** while preserving the strong generalization of SmolVLM on image–text tasks.

---

## Model Examination

- You may inspect token attributions or visualize attention over image regions using third-party tools; no built-in interpretability tooling is shipped here.

---

## 🖥️ Training Hardware & Environment

- **Device:** Laptop (Windows, WDDM driver model)
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)
- **Driver:** **576.52**
- **CUDA (driver):** **12.9**
- **PyTorch:** **2.8.0+cu129**
- **CUDA available:** ✅


## 📊 Training Metrics

- **Total FLOPs (training):** `26,387,224,652,152,830`
- **Training runtime:** `5,664.0825` seconds
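
The two figures above can be combined into an average throughput. The sketch below only divides the reported FLOPs by the reported runtime; how the FLOPs total was estimated is not stated in this card, so the result is a rough average (about 94 minutes of training at roughly 4.7 TFLOP/s), not a measured hardware utilization. The function name is hypothetical.

```python
TOTAL_FLOPS = 26_387_224_652_152_830   # reported total training FLOPs
RUNTIME_SECONDS = 5_664.0825           # reported training runtime


def average_tflops(total_flops: float, seconds: float) -> float:
    """Average throughput in TFLOP/s (1 TFLOP = 1e12 FLOPs)."""
    return total_flops / seconds / 1e12


runtime_minutes = RUNTIME_SECONDS / 60
throughput = average_tflops(TOTAL_FLOPS, RUNTIME_SECONDS)
print(f"runtime: {runtime_minutes:.1f} min")
print(f"average throughput: {throughput:.2f} TFLOP/s")
```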

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** SmolVLM-style VLM with **SigLIP** vision tower, **SmolLM2** decoder, and a **multimodal projector**; trained here via **SFT with LoRA/QLoRA** for **image captioning**.
- **Objective:** Next-token generation conditioned on image tokens + text prompt (image → text).
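
A next-token objective of this kind is usually trained with cross-entropy over the target tokens. The pure-Python sketch below assumes that prompt and image positions are excluded from the loss by giving them the label `-100` (a common convention, but an assumption about this repo's training script), and averages the negative log-likelihood over the remaining caption tokens. The toy probabilities are made up.

```python
import math

IGNORE_INDEX = -100  # assumed label for positions excluded from the loss


def masked_next_token_loss(probs: list[list[float]], labels: list[int],
                           ignore_index: int = IGNORE_INDEX) -> float:
    """Mean negative log-likelihood over positions whose label is not ignored.

    probs[i] is the predicted distribution over the vocabulary at position i,
    labels[i] is the target token id at that position.
    """
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: 3 positions, vocabulary of 3 tokens.
probs = [
    [0.70, 0.20, 0.10],   # prompt position, excluded from the loss
    [0.10, 0.80, 0.10],   # caption token with target id 1
    [0.25, 0.25, 0.50],   # caption token with target id 2
]
labels = [IGNORE_INDEX, 1, 2]
loss = masked_next_token_loss(probs, labels)
print(round(loss, 4))
```

Only the two caption positions contribute, so the loss is the average of `-log(0.8)` and `-log(0.5)`.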

### Compute Infrastructure

#### Hardware

- Works on consumer GPUs for inference; fine-tuning VRAM depends on adapter choice and batch size.

#### Software

- Python, PyTorch, `transformers`, `peft`, `accelerate`, `datasets`, `evaluate`, optional `bitsandbytes` for QLoRA.

---

## Citation

If you use this repository or the resulting model, please cite:

**BibTeX:**

```bibtex
@software{ImageCaptioningVLM2025,
  author = {Yousefi, Amir Hossein},
  title = {Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning},
  year = {2025},
  url = {https://github.com/amirhossein-yousefi/Image-Captioning-VLM}
}
```

Also cite the **base model** and **dataset** as appropriate (see their pages).

**APA:**

Yousefi, A. H. (2025). *Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning* [Computer software]. https://github.com/amirhossein-yousefi/Image-Captioning-VLM

---

## Glossary

- **LoRA/QLoRA:** Low-Rank Adaptation and its quantized variant; both train small low-rank adapter matrices instead of the full weights, and QLoRA additionally keeps the frozen base model in quantized form to save memory.
- **Vision tower:** The vision encoder (SigLIP) that turns image patches into tokens.
- **SFT:** Supervised Fine-Tuning.
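
To show why LoRA is parameter-efficient: for a weight matrix of shape `d_out × d_in`, LoRA learns two factors `B` (`d_out × r`) and `A` (`r × d_in`) and adds their scaled product to the frozen weight, so only `r * (d_in + d_out)` values are trained. The sketch below compares that count with full fine-tuning; the layer size 2048 and rank 8 are hypothetical and not taken from this repo's configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values in the LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


d_out = d_in = 2048   # hypothetical projection size
rank = 8              # hypothetical LoRA rank

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, f"{100 * lora / full:.2f}%")
```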

---

## More Information

- For issues and feature requests, open a GitHub issue on the repository.

---

## Model Card Authors

- Amir Hossein Yousefi (maintainer)
- Contributors welcome (via PRs)

---

## Model Card Contact

- Open an issue: https://github.com/amirhossein-yousefi/Image-Captioning-VLM/issues