---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: transformers
pipeline_tag: image-to-text
license: apache-2.0
tags:
  - vision-language
  - image-captioning
  - SmolVLM
  - LoRA
  - QLoRA
  - COCO
  - peft
  - accelerate
base_model: HuggingFaceTB/SmolVLM-Instruct
datasets:
  - jxie/coco_captions
language:
  - en
widget:
  - text: "Give a concise caption."
    src: https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg
---

# Model Card for **Image-Captioning-VLM (SmolVLM + COCO, LoRA/QLoRA)**

This repository provides a compact **vision–language image captioning model** built by fine-tuning **SmolVLM-Instruct** with **LoRA/QLoRA** adapters on the **MS COCO Captions** dataset. The goal is to offer an easy-to-train, memory‑efficient captioner for research, data labeling, and diffusion training workflows while keeping the **vision tower frozen** and adapting the language/cross‑modal components.

> **TL;DR**
>
> - Base: `HuggingFaceTB/SmolVLM-Instruct` (Apache-2.0).  
> - Training data: `jxie/coco_captions` (English captions).  
> - Method: LoRA/QLoRA SFT; **vision encoder frozen**.  
> - Intended use: generate concise or descriptive captions for general images.  
> - Not intended for high-stakes or safety-critical uses.

---

## Model Details

### Model Description

- **Developed by:** *Amirhossein Yousefi* (GitHub: `amirhossein-yousefi`)
- **Model type:** Vision–Language (**image → text**) captioning model with LoRA/QLoRA adapters on top of **SmolVLM-Instruct**
- **Language(s):** English
- **License:** **Apache-2.0** for the released model artifacts (inherits from the base model’s license); dataset retains its own license (see *Training Data*)
- **Finetuned from:** `HuggingFaceTB/SmolVLM-Instruct`

SmolVLM couples a **shape-optimized SigLIP** vision tower with a compact **SmolLM2** decoder via a multimodal projector and is loaded through `AutoModelForVision2Seq`. This project fine-tunes the language side with LoRA/QLoRA while **freezing the vision tower** to keep memory use low and training simple.
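
The snippet below is a minimal sketch of that setup, assuming the Idefics3-style attribute layout used by SmolVLM (`model.model.vision_model` / `text_model`) and illustrative LoRA hyperparameters; inspect `model.named_modules()` and the training script for the exact configuration.

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Load the base model (bf16 assumed available on the training GPU).
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the SigLIP vision tower; the attribute path follows the Idefics3-style
# layout used by SmolVLM and may differ across library versions.
for param in model.model.vision_model.parameters():
    param.requires_grad = False

# Attach LoRA adapters to the language-side attention projections only.
# The regex and rank below are illustrative; confirm names via model.named_modules().
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*text_model.*\.(q_proj|k_proj|v_proj|o_proj)",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter parameters should report as trainable
```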

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-VLM
- **Base model card:** https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
- **Base technical report:** https://arxiv.org/abs/2504.05299 (SmolVLM)
- **Dataset (training):** https://huggingface.co/datasets/jxie/coco_captions

---

## Uses

### Direct Use

- Generate **concise** or **descriptive** captions for natural images.
- Provide **alt text**/accessibility descriptions (human review recommended).
- Produce captions for **vision dataset bootstrapping** or **diffusion training** pipelines.

**Quickstart (inference script from this repo):**

```bash
python inference_vlm.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --adapter_dir outputs/smolvlm-coco-lora \
  --image https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg \
  --prompt "Give a concise caption."
```

**Programmatic example (PEFT LoRA):**

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "HuggingFaceTB/SmolVLM-Instruct"
adapter_dir = "outputs/smolvlm-coco-lora"  # path from training

processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
).to(device)

# Load LoRA/QLoRA adapter
model = PeftModel.from_pretrained(model, adapter_dir).to(device)
model.eval()

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
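
Continuing from the example above, the adapter can optionally be merged into the base weights for adapter-free deployment (the output path is illustrative):

```python
# Merge the LoRA weights into the base model so inference no longer requires PEFT.
merged = model.merge_and_unload()
merged.save_pretrained("outputs/smolvlm-coco-merged")
processor.save_pretrained("outputs/smolvlm-coco-merged")
```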

### Downstream Use

- As a **captioning stage** within multi-step data pipelines (e.g., labeling, retrieval augmentation, dataset curation).
- As a starting point for **continued fine-tuning** on specialized domains (e.g., medical imagery, artwork) with domain-appropriate data and review.

### Out-of-Scope Use

- **High-stakes** or **safety-critical** settings (medical, legal, surveillance, credit decisions, etc.).
- Automated systems where **factuality, fairness, or safety** must be guaranteed without **human in the loop**.
- Reading small text (OCR) or extracting sensitive PII from images; the model is not optimized for OCR.

---

## Bias, Risks, and Limitations

- **Data bias:** COCO captions are predominantly English and reflect biases of their sources; generated captions may mirror societal stereotypes.
- **Content coverage:** General-purpose images work best; performance may degrade on domains underrepresented in COCO (e.g., medical scans, satellite imagery).
- **Safety:** Captions may occasionally be **inaccurate**, **overconfident**, or **hallucinated**. Always review before downstream use, especially for accessibility.

### Recommendations

- Keep a **human in the loop** for sensitive or impactful applications.
- When adapting to new domains, curate **diverse, representative** training sets and evaluate with domain-specific metrics and audits.
- Log model outputs and collect review feedback to iteratively improve quality.

---

## How to Get Started with the Model

**Environment setup**

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# (If on NVIDIA & want QLoRA) ensure bitsandbytes is installed; or use: --use_qlora false
```
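
If QLoRA is enabled, the base model is typically loaded in 4-bit via `bitsandbytes`. A minimal sketch, assuming an NF4 configuration (the training script's exact flags may differ):

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit NF4 quantization for QLoRA; requires an NVIDIA GPU with bitsandbytes installed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```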

**Fine-tune (LoRA/QLoRA; frozen vision tower)**

```bash
python train_vlm_sft.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --dataset_id jxie/coco_captions \
  --output_dir outputs/smolvlm-coco-lora \
  --epochs 1 --batch_size 2 --grad_accum 8 \
  --max_seq_len 1024 --image_longest_edge 1536
```

---

## Training Details

### Training Data

- **Dataset:** `jxie/coco_captions` (English captions for MS COCO images).  
- **Notes:** COCO provides **~617k** caption examples with **5 captions per image**; images come from Flickr with their own terms. Please review the dataset card and the original COCO license/terms before use.

### Training Procedure

#### Preprocessing

- Images are resized with **longest_edge = 1536** (consistent with SmolVLM’s 384×384 patching strategy at N=4).
- Text sequences truncated/padded to **max_seq_len = 1024**.
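
A minimal sketch of how both settings map onto the SmolVLM processor (kwarg names follow the public SmolVLM examples; the training script may wire them differently):

```python
from PIL import Image
from transformers import AutoProcessor

# Resize images so the longest edge is 1536 px (4 x 384-px patches).
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    size={"longest_edge": 1536},
)

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(
    text=prompt,
    images=[image],
    return_tensors="pt",
    truncation=True,
    max_length=1024,  # max_seq_len used during training
)
```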

#### Training Hyperparameters

- **Regime:** Supervised fine-tuning with **LoRA** (or **QLoRA**) on the language-side parameters; **vision tower frozen**.
- **Example CLI:** see above. Mixed precision (`bf16` on CUDA) recommended if available.
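
An illustrative `TrainingArguments` configuration mirroring the example CLI (learning rate and logging cadence are assumptions, not the script's defaults):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs/smolvlm-coco-lora",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size 16
    learning_rate=1e-4,              # typical LoRA learning rate; assumption
    bf16=True,                       # mixed precision on CUDA
    logging_steps=25,
    save_strategy="epoch",
    remove_unused_columns=False,     # keep image columns for the multimodal collator
)
```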

#### Speeds, Sizes, Times

- The base SmolVLM reports **~5 GB min GPU RAM** for inference; fine-tuning requires more VRAM depending on batch size/sequence length. See the base card for details.

---

## Evaluation
### 📊 Score card (on a subsample of the main data)

**All scores increase with higher values (↑).** For visualization, `CIDEr` is shown ×100 in the chart to match the 0–100 scale of other metrics.

| Split        | CIDEr | CLIPScore | BLEU-4 | METEOR | ROUGE-L | BERTScore-F1 | Images |
|:-------------|------:|----------:|-------:|-------:|--------:|-------------:|------:|
| **Test**      | 0.560 | 30.830    | 15.73  | 47.84  | 45.18   | 91.73        | 1000  |
| **Validation**| 0.540 | 31.068    | 16.01  | 48.28  | 45.11   | 91.80        | 1000  |


### Quick read on the metrics

- **CIDEr** — consensus with human-written captions; higher is better for human-like phrasing (scores typically range from 0 to a little above 1 on this scale).  
- **CLIPScore** — reference-free image–text compatibility via CLIP’s cosine similarity (commonly rescaled).  
- **BLEU‑4** — 4‑gram precision with brevity penalty (lexical match).  
- **METEOR** — unigram match with stemming/synonyms, emphasizes recall.  
- **ROUGE‑L** — longest common subsequence overlap (structure/recall‑leaning).  
- **BERTScore‑F1** — semantic similarity using contextual embeddings.
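
The repo's `eval_caption_metric.py` computes the full set. As a minimal sketch, several of these metrics can be reproduced with the `evaluate` library (CIDEr and CLIPScore require extra packages and are omitted here):

```python
import evaluate

# Toy predictions/references; in practice these come from model generations on COCO val.
predictions = ["a dog runs across a grassy field"]
references = ["a dog running through the grass"]

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=references, max_order=4))
print(meteor.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```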


### Testing Data, Factors & Metrics

#### Testing Data

- Hold out a portion of **COCO val** (e.g., `val2014`) or custom images for qualitative/quantitative evaluation.
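
A minimal sketch of building such a hold-out set with `datasets` (the split name exposed by `jxie/coco_captions` is an assumption; check the dataset card):

```python
from datasets import load_dataset

# Sample 1,000 held-out examples for evaluation (matching the score card size).
ds = load_dataset("jxie/coco_captions", split="validation")  # split name: verify on the dataset card
eval_subset = ds.shuffle(seed=42).select(range(1000))
print(eval_subset[0].keys())  # inspect fields (image, caption, ...) before generating captions
```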

#### Factors

- **Image domain** (indoor/outdoor), **object density**, **scene complexity**, and **presence of small text** (OCR-like) can affect performance.

#### Metrics

- Standard captioning metrics: CIDEr, CLIPScore, BLEU-4, METEOR, ROUGE-L, and BERTScore-F1 (see the quick read above).
- Computed with the `eval_caption_metric.py` scaffolding included in this repo.

### Results

- Strong **semantic alignment** (BERTScore-F1 ≈ **91.8** on *val*) and balanced lexical overlap (BLEU-4 ≈ **16.0**).
- **CIDEr** is slightly higher on *test* (0.560) than on *val* (0.540); the other metrics are near parity across splits.
- Trained and evaluated with the minimal LoRA/QLoRA-ready pipeline in this repo; see the score card above for the full table, and add qualitative examples when re-running the evaluation script.


#### Summary

- The LoRA/QLoRA approach provides **memory‑efficient adaptation** while preserving the strong generalization of SmolVLM on image–text tasks.

---

## Model Examination 

- You may inspect token attributions or visualize attention over image regions using third-party tools; no built‑in interpretability tooling is shipped here.

---

## 🖥️ Training Hardware & Environment

- **Device:** Laptop (Windows, WDDM driver model)  
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)  
- **Driver:** **576.52**  
- **CUDA (driver):** **12.9**  
- **PyTorch:** **2.8.0+cu129**  
- **CUDA available:** ✅ 


## 📊 Training Metrics

- **Total FLOPs (training):** `26,387,224,652,152,830`  
- **Training runtime:** `5,664.0825` seconds

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** SmolVLM-style VLM with **SigLIP** vision tower, **SmolLM2** decoder, and a **multimodal projector**; trained here via **SFT with LoRA/QLoRA** for **image captioning**.
- **Objective:** Next-token generation conditioned on image tokens + text prompt (image → text).
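
A minimal sketch of the caption-only supervision this implies, assuming the prompt/caption token boundary per example is known (the training script's collator may handle this differently):

```python
import torch

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    """Build labels that train only on caption tokens: prompt and padding get -100 (ignored by the loss)."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100           # no loss on the image/prompt prefix
    labels[labels == pad_token_id] = -100   # no loss on padding
    return labels
```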

### Compute Infrastructure

#### Hardware

- Works on consumer GPUs for inference; fine‑tuning VRAM depends on adapter choice and batch size.

#### Software

- Python, PyTorch, `transformers`, `peft`, `accelerate`, `datasets`, `evaluate`, optional `bitsandbytes` for QLoRA.

---

## Citation

If you use this repository or the resulting model, please cite:

**BibTeX:**

```bibtex
@software{ImageCaptioningVLM2025,
  author = {Yousefi, Amir Hossein},
  title = {Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning},
  year = {2025},
  url = {https://github.com/amirhossein-yousefi/Image-Captioning-VLM}
}
```

Also cite the **base model** and **dataset** as appropriate (see their pages).

**APA:**

Yousefi, A. H. (2025). *Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning* [Computer software]. https://github.com/amirhossein-yousefi/Image-Captioning-VLM

---

## Glossary 

- **LoRA/QLoRA:** Low‑Rank (Quantized) Adapters that enable parameter‑efficient fine‑tuning.
- **Vision tower:** The vision encoder (SigLIP) that turns image patches into tokens.
- **SFT:** Supervised Fine‑Tuning.

---

## More Information 

- For issues and feature requests, open a GitHub issue on the repository.

---

## Model Card Authors 

- Amirhossein Yousefi (maintainer)
- Contributors welcome (via PRs)

---

## Model Card Contact

- Open an issue: https://github.com/amirhossein-yousefi/Image-Captioning-VLM/issues