---
library_name: transformers
pipeline_tag: image-to-text
license: apache-2.0
tags:
  - vision-language
  - image-captioning
  - SmolVLM
  - LoRA
  - QLoRA
  - COCO
  - peft
  - accelerate
base_model: HuggingFaceTB/SmolVLM-Instruct
datasets:
  - jxie/coco_captions
language:
  - en
widget:
  - text: "Give a concise caption."
    src: "https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg"
---

# Model Card for **Image-Captioning-VLM (SmolVLM + COCO, LoRA/QLoRA)**

This repository provides a compact **vision–language image captioning model** built by fine-tuning **SmolVLM-Instruct** with **LoRA/QLoRA** adapters on the **MS COCO Captions** dataset. The goal is to offer an easy-to-train, memory‑efficient captioner for research, data labeling, and diffusion training workflows while keeping the **vision tower frozen** and adapting the language/cross‑modal components.

> **TL;DR**
>
> - Base: `HuggingFaceTB/SmolVLM-Instruct` (Apache-2.0).
> - Training data: `jxie/coco_captions` (English captions).
> - Method: LoRA/QLoRA SFT; **vision encoder frozen**.
> - Intended use: generate concise or descriptive captions for general images.
> - Not intended for high-stakes or safety-critical uses.

---

## Model Details

### Model Description

- **Developed by:** *Amirhossein Yousefi* (GitHub: `amirhossein-yousefi`)
- **Model type:** Vision–Language (**image → text**) captioning model with LoRA/QLoRA adapters on top of **SmolVLM-Instruct**
- **Language(s):** English
- **License:** **Apache-2.0** for the released model artifacts (inherited from the base model’s license); the dataset retains its own license (see *Training Data*)
- **Finetuned from:** `HuggingFaceTB/SmolVLM-Instruct`

SmolVLM couples a **shape-optimized SigLIP** vision tower with a compact **SmolLM2** decoder via a multimodal projector and runs via `AutoModelForVision2Seq`. This project fine-tunes the language side with LoRA/QLoRA while **freezing the vision tower** to keep memory use low and training simple.
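
For orientation, here is a minimal, hedged sketch of that setup: it loads the base checkpoint with `AutoModelForVision2Seq` and freezes the vision-tower weights by matching parameter names. The name-matching heuristic is an assumption about the module naming, not the exact logic of `train_vlm_sft.py`.

```python
import torch
from transformers import AutoModelForVision2Seq

# Sketch only: load the base SmolVLM checkpoint.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision tower. Matching on "vision" in the parameter name is an
# assumption about how the encoder modules are named in this architecture.
for name, param in model.named_parameters():
    if "vision" in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} parameters")
```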

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-VLM
- **Base model card:** https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
- **Base technical report:** https://arxiv.org/abs/2504.05299 (SmolVLM)
- **Dataset (training):** https://huggingface.co/datasets/jxie/coco_captions

---

## Uses

### Direct Use

- Generate **concise** or **descriptive** captions for natural images.
- Provide **alt text**/accessibility descriptions (human review recommended).
- Produce captions for **vision dataset bootstrapping** or **diffusion training** pipelines.

**Quickstart (inference script from this repo):**

```bash
python inference_vlm.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --adapter_dir outputs/smolvlm-coco-lora \
  --image https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg \
  --prompt "Give a concise caption."
```

**Programmatic example (PEFT LoRA):**

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "HuggingFaceTB/SmolVLM-Instruct"
adapter_dir = "outputs/smolvlm-coco-lora"  # path from training

processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
).to(device)

# Load the LoRA/QLoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter_dir).to(device)
model.eval()

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```

### Downstream Use

- As a **captioning stage** within multi-step data pipelines (e.g., labeling, retrieval augmentation, dataset curation).
- As a starting point for **continued fine-tuning** on specialized domains (e.g., medical imagery, artwork) with domain-appropriate data and review.

### Out-of-Scope Use

- **High-stakes** or **safety-critical** settings (medical, legal, surveillance, credit decisions, etc.).
- Automated systems where **factuality, fairness, or safety** must be guaranteed without a **human in the loop**.
- Extracting small text (OCR) or sensitive PII from images; the model is not optimized for OCR.

---

## Bias, Risks, and Limitations

- **Data bias:** COCO captions are predominantly English and reflect biases of their sources; generated captions may mirror societal stereotypes.
- **Content coverage:** General-purpose images work best; performance may degrade on domains underrepresented in COCO (e.g., medical scans, satellite imagery).
- **Safety:** Captions may occasionally be **inaccurate**, **overconfident**, or **hallucinated**. Always review before downstream use, especially for accessibility.

### Recommendations

- Keep a **human in the loop** for sensitive or impactful applications.
- When adapting to new domains, curate **diverse, representative** training sets and evaluate with domain-specific metrics and audits.
- Log model outputs and collect review feedback to iteratively improve quality.

---

## How to Get Started with the Model

**Environment setup**

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# For QLoRA on NVIDIA GPUs, make sure bitsandbytes is installed; otherwise pass: --use_qlora false
```

**Fine-tune (LoRA/QLoRA; frozen vision tower)**

```bash
python train_vlm_sft.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --dataset_id jxie/coco_captions \
  --output_dir outputs/smolvlm-coco-lora \
  --epochs 1 --batch_size 2 --grad_accum 8 \
  --max_seq_len 1024 --image_longest_edge 1536
```

---

## Training Details

### Training Data

- **Dataset:** `jxie/coco_captions` (English captions for MS COCO images).
- **Notes:** COCO provides **~617k** caption annotations, with **5 captions per image**; the images come from Flickr and carry their own terms. Please review the dataset card and the original COCO license/terms before use.

### Training Procedure

#### Preprocessing

- Images are resized with **longest_edge = 1536** (consistent with SmolVLM’s 384×384 patching strategy at N=4).
- Text sequences are truncated/padded to **max_seq_len = 1024** (a minimal processor sketch follows below).
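
A minimal sketch of how these settings might be applied at preprocessing time, assuming an Idefics3-style processor: the `size`/`longest_edge` attribute and the forwarding of tokenizer keywords (`truncation`, `max_length`) through the processor call are assumptions about the installed `transformers` version, not a copy of the training script's collator.

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
# Assumption: the image processor exposes a `size` dict with a `longest_edge` key.
processor.image_processor.size = {"longest_edge": 1536}

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Assumption: tokenizer kwargs such as truncation/max_length are forwarded by the processor.
batch = processor(
    text=prompt,
    images=[image],
    truncation=True,
    max_length=1024,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```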

#### Training Hyperparameters

- **Regime:** Supervised fine-tuning with **LoRA** (or **QLoRA**) on the language-side parameters; **vision tower frozen**.
- **Example CLI:** see above. Mixed precision (`bf16` on CUDA) is recommended if available. A sketch of a typical adapter configuration follows below.
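
For reference, a sketch of a typical PEFT adapter configuration under this regime; the rank, alpha, and `target_modules` names below are illustrative assumptions, not necessarily the defaults used by `train_vlm_sft.py`.

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; the target module names are assumptions and should
# be checked against the decoder's attention/MLP projection layer names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)  # `model` as loaded in the earlier sketch
model.print_trainable_parameters()
```

For the QLoRA variant, the base model would typically be loaded in 4-bit first (e.g., via `BitsAndBytesConfig(load_in_4bit=True)`), with the same adapter applied on top.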

#### Speeds, Sizes, Times

- The base SmolVLM reports **~5 GB minimum GPU RAM** for inference; fine-tuning requires more VRAM depending on batch size and sequence length. See the base model card for details.

---

## Evaluation

### 📊 Score card (on a subsample of the main data)

**Higher is better for all metrics (↑).** For visualization, `CIDEr` is shown ×100 in the accompanying chart to match the 0–100 scale of the other metrics.

| Split          | CIDEr | CLIPScore | BLEU-4 | METEOR | ROUGE-L | BERTScore-F1 | Images |
|:---------------|------:|----------:|-------:|-------:|--------:|-------------:|-------:|
| **Test**       | 0.560 |    30.830 |  15.73 |  47.84 |   45.18 |        91.73 |   1000 |
| **Validation** | 0.540 |    31.068 |  16.01 |  48.28 |   45.11 |        91.80 |   1000 |

### Quick read on the metrics

- **CIDEr** — consensus with human captions; higher indicates more human-like phrasing (values of 0–1+ are typical).
- **CLIPScore** — reference-free image–text compatibility via CLIP’s cosine similarity (commonly rescaled).
- **BLEU‑4** — 4‑gram precision with a brevity penalty (lexical match).
- **METEOR** — unigram matching with stemming/synonyms; emphasizes recall.
- **ROUGE‑L** — longest-common-subsequence overlap (structure/recall‑leaning).
- **BERTScore‑F1** — semantic similarity using contextual embeddings.
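
As a rough illustration (not the repo's `eval_caption_metric.py`), several of these metrics can be computed with the Hugging Face `evaluate` library. CIDEr and CLIPScore are omitted here because they typically require extra packages (e.g., `pycocoevalcap` or a CLIP model).

```python
import evaluate

# Toy multi-reference example; in practice these come from model outputs and COCO captions.
predictions = ["a dog runs across a grassy field"]
references = [[
    "a dog running through a field of grass",
    "a brown dog runs in the grass",
]]

bleu = evaluate.load("bleu")        # max_order defaults to 4, i.e. BLEU-4
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print("BLEU-4:", bleu.compute(predictions=predictions, references=references)["bleu"])
print("METEOR:", meteor.compute(predictions=predictions, references=references)["meteor"])
print("ROUGE-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])
print("BERTScore-F1:",
      bertscore.compute(predictions=predictions, references=references, lang="en")["f1"][0])
```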

### Testing Data, Factors & Metrics

#### Testing Data

- A held-out portion of **COCO val** (e.g., `val2014`) or custom images can be used for qualitative and quantitative evaluation.

#### Factors

- **Image domain** (indoor/outdoor), **object density**, **scene complexity**, and the **presence of small text** (OCR-like content) can all affect performance.

#### Metrics

- Strong **semantic alignment** (BERTScore-F1 ≈ **91.8** on *val*) and balanced lexical overlap (BLEU-4 ≈ **16.0**).
- **CIDEr** is slightly higher on *test* (0.560) than on *val* (0.540); the other metrics are near parity across splits.
- Trained and evaluated with the minimal pipeline in the repo (LoRA/QLoRA-ready).
- The repo includes `eval_caption_metric.py` scaffolding for computing these metrics.

### Results

- Headline scores are reported in the score card above; rerun the evaluation script (e.g., for CIDEr and BLEU-4) to reproduce them, and add qualitative examples where useful.

#### Summary

- The LoRA/QLoRA approach provides **memory‑efficient adaptation** while preserving the strong generalization of SmolVLM on image–text tasks.

---

## Model Examination

- You may inspect token attributions or visualize attention over image regions using third-party tools; no built‑in interpretability tooling is shipped here.

---

## 🖥️ Training Hardware & Environment

- **Device:** Laptop (Windows, WDDM driver model)
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)
- **Driver:** **576.52**
- **CUDA (driver):** **12.9**
- **PyTorch:** **2.8.0+cu129**
- **CUDA available:** ✅

## 📊 Training Metrics

- **Total FLOPs (training):** `26,387,224,652,152,830`
- **Training runtime:** `5,664.0825` seconds

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** SmolVLM-style VLM with a **SigLIP** vision tower, a **SmolLM2** decoder, and a **multimodal projector**; trained here via **SFT with LoRA/QLoRA** for **image captioning**.
- **Objective:** Next-token generation conditioned on image tokens plus the text prompt (image → text); see the schematic label-masking sketch below.
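
A schematic of the supervision target (not the repo's exact collator): only caption tokens contribute to the loss, while image and prompt positions are masked with the standard ignore index.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Mask image + prompt tokens so the loss covers only the caption.

    `prompt_len` is a hypothetical per-example count of non-caption positions.
    """
    labels = input_ids.clone()
    labels[:, :prompt_len] = IGNORE_INDEX
    return labels

# Example: 1 sequence of 12 tokens where the first 8 are image/prompt tokens.
labels = build_labels(torch.arange(12).unsqueeze(0), prompt_len=8)
print(labels)
```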

### Compute Infrastructure

#### Hardware

- Works on consumer GPUs for inference; fine‑tuning VRAM depends on adapter choice and batch size.

#### Software

- Python, PyTorch, `transformers`, `peft`, `accelerate`, `datasets`, `evaluate`, optional `bitsandbytes` for QLoRA.

---

## Citation

If you use this repository or the resulting model, please cite:

**BibTeX:**

```bibtex
@software{ImageCaptioningVLM2025,
  author = {Yousefi, Amir Hossein},
  title  = {Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning},
  year   = {2025},
  url    = {https://github.com/amirhossein-yousefi/Image-Captioning-VLM}
}
```

Also cite the **base model** and **dataset** as appropriate (see their pages).

**APA:**

Yousefi, A. H. (2025). *Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning* [Computer software]. https://github.com/amirhossein-yousefi/Image-Captioning-VLM

---

## Glossary

- **LoRA/QLoRA:** Low‑Rank (Quantized) Adapters that enable parameter‑efficient fine‑tuning.
- **Vision tower:** The vision encoder (SigLIP) that turns image patches into tokens.
- **SFT:** Supervised Fine‑Tuning.

---

## More Information

- For issues and feature requests, open a GitHub issue on the repository.

---

## Model Card Authors

- Amirhossein Yousefi (maintainer)
- Contributors welcome (via PRs)

---

## Model Card Contact

- Open an issue: https://github.com/amirhossein-yousefi/Image-Captioning-VLM/issues