---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: transformers
pipeline_tag: image-to-text
license: apache-2.0
tags:
  - vision-language
  - image-captioning
  - SmolVLM
  - LoRA
  - QLoRA
  - COCO
  - peft
  - accelerate
base_model: HuggingFaceTB/SmolVLM-Instruct
datasets:
  - jxie/coco_captions
language:
  - en
widget:
  - text: Give a concise caption.
    src: https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg
---
# Model Card for **Image-Captioning-VLM (SmolVLM + COCO, LoRA/QLoRA)**
This repository provides a compact **vision–language image captioning model** built by fine-tuning **SmolVLM-Instruct** with **LoRA/QLoRA** adapters on the **MS COCO Captions** dataset. The goal is to offer an easy-to-train, memory‑efficient captioner for research, data labeling, and diffusion training workflows while keeping the **vision tower frozen** and adapting the language/cross‑modal components.
> **TL;DR**
>
> - Base: `HuggingFaceTB/SmolVLM-Instruct` (Apache-2.0).
> - Training data: `jxie/coco_captions` (English captions).
> - Method: LoRA/QLoRA SFT; **vision encoder frozen**.
> - Intended use: generate concise or descriptive captions for general images.
> - Not intended for high-stakes or safety-critical uses.
---
## Model Details
### Model Description
- **Developed by:** *Amirhossein Yousefi* (GitHub: `amirhossein-yousefi`)
- **Model type:** Vision–Language (**image → text**) captioning model with LoRA/QLoRA adapters on top of **SmolVLM-Instruct**
- **Language(s):** English
- **License:** **Apache-2.0** for the released model artifacts (inherits from the base model’s license); dataset retains its own license (see *Training Data*)
- **Finetuned from:** `HuggingFaceTB/SmolVLM-Instruct`
SmolVLM couples a **shape-optimized SigLIP** vision tower with a compact **SmolLM2** decoder via a multimodal projector and runs via `AutoModelForVision2Seq`. This project fine-tunes the language side with LoRA/QLoRA while **freezing the vision tower** to keep memory use low and training simple.
### Model Sources
- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-VLM
- **Base model card:** https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
- **Base technical report:** https://arxiv.org/abs/2504.05299 (SmolVLM)
- **Dataset (training):** https://huggingface.co/datasets/jxie/coco_captions
---
## Uses
### Direct Use
- Generate **concise** or **descriptive** captions for natural images.
- Provide **alt text**/accessibility descriptions (human review recommended).
- Produce captions for **vision dataset bootstrapping** or **diffusion training** pipelines.
**Quickstart (inference script from this repo):**
```bash
python inference_vlm.py \
--base_model_id HuggingFaceTB/SmolVLM-Instruct \
--adapter_dir outputs/smolvlm-coco-lora \
--image https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg \
--prompt "Give a concise caption."
```
**Programmatic example (PEFT LoRA):**
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel
device = "cuda" if torch.cuda.is_available() else "cpu"
base = "HuggingFaceTB/SmolVLM-Instruct"
adapter_dir = "outputs/smolvlm-coco-lora" # path from training
processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(
base, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
).to(device)
# Load LoRA/QLoRA adapter
model = PeftModel.from_pretrained(model, adapter_dir).to(device)
model.eval()
image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
"content": [{"type": "image"},
{"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
### Downstream Use
- As a **captioning stage** within multi-step data pipelines (e.g., labeling, retrieval augmentation, dataset curation).
- As a starting point for **continued fine-tuning** on specialized domains (e.g., medical imagery, artwork) with domain-appropriate data and review.
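When used as a pipeline stage, images are typically captioned in fixed-size batches. A minimal, model-agnostic batching helper (hypothetical, not part of this repo) might look like:

```python
from itertools import islice

def batched(paths, n):
    """Yield successive lists of at most n items from an iterable of image paths."""
    it = iter(paths)
    while chunk := list(islice(it, n)):
        yield chunk

# Each chunk would then go through processor(images=...) and model.generate(...)
# as in the inference example above.
```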
### Out-of-Scope Use
- **High-stakes** or **safety-critical** settings (medical, legal, surveillance, credit decisions, etc.).
- Automated systems where **factuality, fairness, or safety** must be guaranteed without **human in the loop**.
- Parsing small text (OCR) or reading sensitive PII from images; this model is not optimized for OCR.
---
## Bias, Risks, and Limitations
- **Data bias:** COCO captions are predominantly English and reflect biases of their sources; generated captions may mirror societal stereotypes.
- **Content coverage:** General-purpose images work best; performance may degrade on domains underrepresented in COCO (e.g., medical scans, satellite imagery).
- **Safety:** Captions may occasionally be **inaccurate**, **overconfident**, or **hallucinated**. Always review before downstream use, especially for accessibility.
### Recommendations
- Keep a **human in the loop** for sensitive or impactful applications.
- When adapting to new domains, curate **diverse, representative** training sets and evaluate with domain-specific metrics and audits.
- Log model outputs and collect review feedback to iteratively improve quality.
---
## How to Get Started with the Model
**Environment setup**
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# QLoRA on NVIDIA GPUs requires bitsandbytes; without it, pass: --use_qlora false
```
**Fine-tune (LoRA/QLoRA; frozen vision tower)**
```bash
python train_vlm_sft.py \
--base_model_id HuggingFaceTB/SmolVLM-Instruct \
--dataset_id jxie/coco_captions \
--output_dir outputs/smolvlm-coco-lora \
--epochs 1 --batch_size 2 --grad_accum 8 \
--max_seq_len 1024 --image_longest_edge 1536
```
---
## Training Details
### Training Data
- **Dataset:** `jxie/coco_captions` (English captions for MS COCO images).
- **Notes:** COCO provides **~617k** caption examples with **5 captions per image**; images come from Flickr with their own terms. Please review the dataset card and the original COCO license/terms before use.
### Training Procedure
#### Preprocessing
- Images are resized with **longest_edge = 1536** (consistent with SmolVLM’s 384×384 patching strategy at N=4).
- Text sequences truncated/padded to **max_seq_len = 1024**.
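The longest-edge resize preserves aspect ratio; the target dimensions can be computed as follows (the helper name is illustrative, not the repo's API):

```python
def resize_longest_edge(width, height, longest_edge=1536):
    """Return (new_width, new_height) with the longer side scaled to longest_edge."""
    scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)

# A 640x480 landscape photo scales to 1536x1152; a 480x640 portrait to 1152x1536.
```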
#### Training Hyperparameters
- **Regime:** Supervised fine-tuning with **LoRA** (or **QLoRA**) on the language-side parameters; **vision tower frozen**.
- **Example CLI:** see above. Mixed precision (`bf16` on CUDA) recommended if available.
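The LoRA update itself is a scaled low-rank correction added to each frozen weight. A toy numerical sketch of the arithmetic (dimensions chosen for illustration only; real adapters use the hidden sizes of the targeted layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16                    # hidden size, LoRA rank, scaling factor

W = rng.standard_normal((d, d))           # frozen base weight (never updated)
A = rng.standard_normal((r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-initialized

# Effective weight seen by the forward pass: base plus scaled low-rank update.
W_eff = W + (alpha / r) * (B @ A)

# Because B starts at zero, the adapter is an exact no-op at initialization,
# so fine-tuning begins from the base model's behavior.
```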
#### Speeds, Sizes, Times
- The base SmolVLM model card reports a minimum of **~5 GB GPU RAM** for inference; fine-tuning requires more VRAM depending on batch size and sequence length. See the base card for details.
---
## Evaluation
### 📊 Scorecard (on a subsample of the main data)
**All scores increase with higher values (↑).** For visualization, `CIDEr` is shown ×100 in the chart to match the 0–100 scale of other metrics.
| Split | CIDEr | CLIPScore | BLEU-4 | METEOR | ROUGE-L | BERTScore-F1 | Images |
|:-------------|------:|----------:|-------:|-------:|--------:|-------------:|------:|
| **Test** | 0.560 | 30.830 | 15.73 | 47.84 | 45.18 | 91.73 | 1000 |
| **Validation** | 0.540 | 31.068 | 16.01 | 48.28 | 45.11 | 91.80 | 1000 |
### Quick read on the metrics
- **CIDEr** — consensus with human reference captions; higher means more human-like phrasing (typical values range from 0 to above 1).
- **CLIPScore** — reference-free image–text compatibility via CLIP’s cosine similarity (commonly rescaled).
- **BLEU‑4** — 4‑gram precision with brevity penalty (lexical match).
- **METEOR** — unigram match with stemming/synonyms, emphasizes recall.
- **ROUGE‑L** — longest common subsequence overlap (structure/recall‑leaning).
- **BERTScore‑F1** — semantic similarity using contextual embeddings.
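As a concrete reference for one of these, ROUGE-L reduces to a longest-common-subsequence computation over tokens. A minimal pure-Python version (whitespace tokenization only, so it will differ slightly from the stemmed/normalized implementations behind the reported numbers):

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall over tokens."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for longest-common-subsequence length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```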
### Testing Data, Factors & Metrics
#### Testing Data
- A held-out portion of **COCO val** (e.g., `val2014`) or a set of custom images is used for qualitative/quantitative evaluation.
#### Factors
- **Image domain** (indoor/outdoor), **object density**, **scene complexity**, and **presence of small text** (OCR-like) can affect performance.
#### Metrics
- Strong **semantic alignment** (BERTScore-F1 ≈ **91.8** on *val*), and balanced lexical overlap (BLEU-4 ≈ **16.0**).
- **CIDEr** is slightly higher on *test* (0.560) vs. *val* (0.540); other metrics are near parity across splits.
- Trained & evaluated with the minimal pipeline in the repo (LoRA/QLoRA-ready).
- This repo includes `eval_caption_metric.py` scaffolding.
### Results
- Scores on the held-out test and validation subsets are reported in the scorecard above; run the evaluation script (`eval_caption_metric.py`) to reproduce them or to add further metrics and qualitative examples.
#### Summary
- The LoRA/QLoRA approach provides **memory‑efficient adaptation** while preserving the strong generalization of SmolVLM on image–text tasks.
---
## Model Examination
- You may inspect token attributions or visualize attention over image regions using third-party tools; no built‑in interpretability tooling is shipped here.
---
## 🖥️ Training Hardware & Environment
- **Device:** Laptop (Windows, WDDM driver model)
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)
- **Driver:** **576.52**
- **CUDA (driver):** **12.9**
- **PyTorch:** **2.8.0+cu129**
- **CUDA available:** ✅
## 📊 Training Metrics
- **Total FLOPs (training):** `26,387,224,652,152,830`
- **Training runtime:** `5,664.0825` seconds
---
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** SmolVLM-style VLM with **SigLIP** vision tower, **SmolLM2** decoder, and a **multimodal projector**; trained here via **SFT with LoRA/QLoRA** for **image captioning**.
- **Objective:** Next-token generation conditioned on image tokens + text prompt (image → text).
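Under this objective, only the caption tokens are supervised: the image-token and prompt prefix is masked out of the loss using the `-100` ignore index (the convention `transformers` cross-entropy uses). A sketch with an illustrative helper name:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids as labels, masking the image/prompt prefix so the loss
    is computed only on the caption tokens that follow it."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels
```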
### Compute Infrastructure
#### Hardware
- Works on consumer GPUs for inference; fine‑tuning VRAM depends on adapter choice and batch size.
#### Software
- Python, PyTorch, `transformers`, `peft`, `accelerate`, `datasets`, `evaluate`, optional `bitsandbytes` for QLoRA.
---
## Citation
If you use this repository or the resulting model, please cite:
**BibTeX:**
```bibtex
@software{ImageCaptioningVLM2025,
author = {Yousefi, Amir Hossein},
title = {Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning},
year = {2025},
url = {https://github.com/amirhossein-yousefi/Image-Captioning-VLM}
}
```
Also cite the **base model** and **dataset** as appropriate (see their pages).
**APA:**
Yousefi, A. H. (2025). *Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning* [Computer software]. https://github.com/amirhossein-yousefi/Image-Captioning-VLM
---
## Glossary
- **LoRA/QLoRA:** Low‑Rank (Quantized) Adapters that enable parameter‑efficient fine‑tuning.
- **Vision tower:** The vision encoder (SigLIP) that turns image patches into tokens.
- **SFT:** Supervised Fine‑Tuning.
---
## More Information
- For issues and feature requests, open a GitHub issue on the repository.
---
## Model Card Authors
- Amirhossein Yousefi (maintainer)
- Contributors welcome (via PRs)
---
## Model Card Contact
- Open an issue: https://github.com/amirhossein-yousefi/Image-Captioning-VLM/issues