---
library_name: transformers
pipeline_tag: image-to-text
license: apache-2.0
tags:
  - vision-language
  - image-captioning
  - SmolVLM
  - LoRA
  - QLoRA
  - COCO
  - peft
  - accelerate
base_model: HuggingFaceTB/SmolVLM-Instruct
datasets:
  - jxie/coco_captions
language:
  - en
widget:
  - text: "Give a concise caption."
    src: "https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg"
---

# Model Card for **Image-Captioning-VLM (SmolVLM + COCO, LoRA/QLoRA)**

This repository provides a compact **vision–language image captioning model** built by fine-tuning **SmolVLM-Instruct** with **LoRA/QLoRA** adapters on the **MS COCO Captions** dataset. The goal is to offer an easy-to-train, memory‑efficient captioner for research, data labeling, and diffusion training workflows while keeping the **vision tower frozen** and adapting the language/cross‑modal components.

> **TL;DR**
>
> - Base: `HuggingFaceTB/SmolVLM-Instruct` (Apache-2.0).
> - Training data: `jxie/coco_captions` (English captions).
> - Method: LoRA/QLoRA SFT; **vision encoder frozen**.
> - Intended use: generate concise or descriptive captions for general images.
> - Not intended for high-stakes or safety-critical uses.

---

## Model Details

### Model Description

- **Developed by:** *Amirhossein Yousefi* (GitHub: `amirhossein-yousefi`)
- **Model type:** Vision–Language (**image → text**) captioning model with LoRA/QLoRA adapters on top of **SmolVLM-Instruct**
- **Language(s):** English
- **License:** **Apache-2.0** for the released model artifacts (inherited from the base model’s license); the dataset retains its own license (see *Training Data*)
- **Finetuned from:** `HuggingFaceTB/SmolVLM-Instruct`

SmolVLM couples a **shape-optimized SigLIP** vision tower with a compact **SmolLM2** decoder via a multimodal projector and runs via `AutoModelForVision2Seq`. This project fine-tunes the language side with LoRA/QLoRA while **freezing the vision tower** to keep memory use low and training simple.
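
For orientation, here is a minimal, hedged sketch of that setup: it loads the base checkpoint with `AutoModelForVision2Seq` and freezes the vision-tower weights by matching parameter names. The name-matching heuristic is an assumption about the module naming, not the exact logic of `train_vlm_sft.py`.

```python
import torch
from transformers import AutoModelForVision2Seq

# Sketch only: load the base SmolVLM checkpoint.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision tower. Matching on "vision" in the parameter name is an
# assumption about how the encoder modules are named in this architecture.
for name, param in model.named_parameters():
    if "vision" in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} parameters")
```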

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-VLM
- **Base model card:** https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
- **Base technical report:** https://arxiv.org/abs/2504.05299 (SmolVLM)
- **Dataset (training):** https://huggingface.co/datasets/jxie/coco_captions

---

## Uses

### Direct Use

- Generate **concise** or **descriptive** captions for natural images.
- Provide **alt text**/accessibility descriptions (human review recommended).
- Produce captions for **vision dataset bootstrapping** or **diffusion training** pipelines.

**Quickstart (inference script from this repo):**

```bash
python inference_vlm.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --adapter_dir outputs/smolvlm-coco-lora \
  --image https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg \
  --prompt "Give a concise caption."
```

**Programmatic example (PEFT LoRA):**

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "HuggingFaceTB/SmolVLM-Instruct"
adapter_dir = "outputs/smolvlm-coco-lora"  # path from training

processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
).to(device)

# Load the LoRA/QLoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter_dir).to(device)
model.eval()

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```

### Downstream Use

- As a **captioning stage** within multi-step data pipelines (e.g., labeling, retrieval augmentation, dataset curation).
- As a starting point for **continued fine-tuning** on specialized domains (e.g., medical imagery, artwork) with domain-appropriate data and review.

### Out-of-Scope Use

- **High-stakes** or **safety-critical** settings (medical, legal, surveillance, credit decisions, etc.).
- Automated systems where **factuality, fairness, or safety** must be guaranteed without a **human in the loop**.
- Extracting small text (OCR) or sensitive PII from images; the model is not optimized for OCR.

---

## Bias, Risks, and Limitations

- **Data bias:** COCO captions are predominantly English and reflect biases of their sources; generated captions may mirror societal stereotypes.
- **Content coverage:** General-purpose images work best; performance may degrade on domains underrepresented in COCO (e.g., medical scans, satellite imagery).
- **Safety:** Captions may occasionally be **inaccurate**, **overconfident**, or **hallucinated**. Always review before downstream use, especially for accessibility.

### Recommendations

- Keep a **human in the loop** for sensitive or impactful applications.
- When adapting to new domains, curate **diverse, representative** training sets and evaluate with domain-specific metrics and audits.
- Log model outputs and collect review feedback to iteratively improve quality.

---

## How to Get Started with the Model

**Environment setup**

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# For QLoRA on NVIDIA GPUs, make sure bitsandbytes is installed; otherwise pass: --use_qlora false
```

**Fine-tune (LoRA/QLoRA; frozen vision tower)**

```bash
python train_vlm_sft.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --dataset_id jxie/coco_captions \
  --output_dir outputs/smolvlm-coco-lora \
  --epochs 1 --batch_size 2 --grad_accum 8 \
  --max_seq_len 1024 --image_longest_edge 1536
```

---

## Training Details

### Training Data

- **Dataset:** `jxie/coco_captions` (English captions for MS COCO images).
- **Notes:** COCO provides **~617k** caption annotations, with **5 captions per image**; the images come from Flickr and carry their own terms. Please review the dataset card and the original COCO license/terms before use.

### Training Procedure

#### Preprocessing

- Images are resized with **longest_edge = 1536** (consistent with SmolVLM’s 384×384 patching strategy at N=4).
- Text sequences are truncated/padded to **max_seq_len = 1024** (a minimal processor sketch follows below).
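
A minimal sketch of how these settings might be applied at preprocessing time, assuming an Idefics3-style processor: the `size`/`longest_edge` attribute and the forwarding of tokenizer keywords (`truncation`, `max_length`) through the processor call are assumptions about the installed `transformers` version, not a copy of the training script's collator.

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
# Assumption: the image processor exposes a `size` dict with a `longest_edge` key.
processor.image_processor.size = {"longest_edge": 1536}

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Assumption: tokenizer kwargs such as truncation/max_length are forwarded by the processor.
batch = processor(
    text=prompt,
    images=[image],
    truncation=True,
    max_length=1024,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```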

#### Training Hyperparameters

- **Regime:** Supervised fine-tuning with **LoRA** (or **QLoRA**) on the language-side parameters; **vision tower frozen**.
- **Example CLI:** see above. Mixed precision (`bf16` on CUDA) is recommended if available. A sketch of a typical adapter configuration follows below.
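
For reference, a sketch of a typical PEFT adapter configuration under this regime; the rank, alpha, and `target_modules` names below are illustrative assumptions, not necessarily the defaults used by `train_vlm_sft.py`.

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; the target module names are assumptions and should
# be checked against the decoder's attention/MLP projection layer names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)  # `model` as loaded in the earlier sketch
model.print_trainable_parameters()
```

For the QLoRA variant, the base model would typically be loaded in 4-bit first (e.g., via `BitsAndBytesConfig(load_in_4bit=True)`), with the same adapter applied on top.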

#### Speeds, Sizes, Times

- The base SmolVLM reports **~5 GB minimum GPU RAM** for inference; fine-tuning requires more VRAM depending on batch size and sequence length. See the base model card for details.

---

## Evaluation

### 📊 Score card (on a subsample of the main data)

**Higher is better for all metrics (↑).** For visualization, `CIDEr` is shown ×100 in the accompanying chart to match the 0–100 scale of the other metrics.

| Split          | CIDEr | CLIPScore | BLEU-4 | METEOR | ROUGE-L | BERTScore-F1 | Images |
|:---------------|------:|----------:|-------:|-------:|--------:|-------------:|-------:|
| **Test**       | 0.560 |    30.830 |  15.73 |  47.84 |   45.18 |        91.73 |   1000 |
| **Validation** | 0.540 |    31.068 |  16.01 |  48.28 |   45.11 |        91.80 |   1000 |

### Quick read on the metrics

- **CIDEr** — consensus with human captions; higher indicates more human-like phrasing (values of 0–1+ are typical).
- **CLIPScore** — reference-free image–text compatibility via CLIP’s cosine similarity (commonly rescaled).
- **BLEU‑4** — 4‑gram precision with a brevity penalty (lexical match).
- **METEOR** — unigram matching with stemming/synonyms; emphasizes recall.
- **ROUGE‑L** — longest-common-subsequence overlap (structure/recall‑leaning).
- **BERTScore‑F1** — semantic similarity using contextual embeddings.
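
As a rough illustration (not the repo's `eval_caption_metric.py`), several of these metrics can be computed with the Hugging Face `evaluate` library. CIDEr and CLIPScore are omitted here because they typically require extra packages (e.g., `pycocoevalcap` or a CLIP model).

```python
import evaluate

# Toy multi-reference example; in practice these come from model outputs and COCO captions.
predictions = ["a dog runs across a grassy field"]
references = [[
    "a dog running through a field of grass",
    "a brown dog runs in the grass",
]]

bleu = evaluate.load("bleu")        # max_order defaults to 4, i.e. BLEU-4
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print("BLEU-4:", bleu.compute(predictions=predictions, references=references)["bleu"])
print("METEOR:", meteor.compute(predictions=predictions, references=references)["meteor"])
print("ROUGE-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])
print("BERTScore-F1:",
      bertscore.compute(predictions=predictions, references=references, lang="en")["f1"][0])
```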

### Testing Data, Factors & Metrics

#### Testing Data

- A held-out portion of **COCO val** (e.g., `val2014`) or custom images can be used for qualitative and quantitative evaluation.

#### Factors

- **Image domain** (indoor/outdoor), **object density**, **scene complexity**, and the **presence of small text** (OCR-like content) can all affect performance.

#### Metrics

- Strong **semantic alignment** (BERTScore-F1 ≈ **91.8** on *val*) and balanced lexical overlap (BLEU-4 ≈ **16.0**).
- **CIDEr** is slightly higher on *test* (0.560) than on *val* (0.540); the other metrics are near parity across splits.
- Trained and evaluated with the minimal pipeline in the repo (LoRA/QLoRA-ready).
- The repo includes `eval_caption_metric.py` scaffolding for computing these metrics.

### Results

- Headline scores are reported in the score card above; rerun the evaluation script (e.g., for CIDEr and BLEU-4) to reproduce them, and add qualitative examples where useful.

#### Summary

- The LoRA/QLoRA approach provides **memory‑efficient adaptation** while preserving the strong generalization of SmolVLM on image–text tasks.

---

## Model Examination

- You may inspect token attributions or visualize attention over image regions using third-party tools; no built‑in interpretability tooling is shipped here.

---

## 🖥️ Training Hardware & Environment

- **Device:** Laptop (Windows, WDDM driver model)
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)
- **Driver:** **576.52**
- **CUDA (driver):** **12.9**
- **PyTorch:** **2.8.0+cu129**
- **CUDA available:** ✅

## 📊 Training Metrics

- **Total FLOPs (training):** `26,387,224,652,152,830`
- **Training runtime:** `5,664.0825` seconds

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** SmolVLM-style VLM with a **SigLIP** vision tower, a **SmolLM2** decoder, and a **multimodal projector**; trained here via **SFT with LoRA/QLoRA** for **image captioning**.
- **Objective:** Next-token generation conditioned on image tokens plus the text prompt (image → text); see the schematic label-masking sketch below.
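
A schematic of the supervision target (not the repo's exact collator): only caption tokens contribute to the loss, while image and prompt positions are masked with the standard ignore index.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Mask image + prompt tokens so the loss covers only the caption.

    `prompt_len` is a hypothetical per-example count of non-caption positions.
    """
    labels = input_ids.clone()
    labels[:, :prompt_len] = IGNORE_INDEX
    return labels

# Example: 1 sequence of 12 tokens where the first 8 are image/prompt tokens.
labels = build_labels(torch.arange(12).unsqueeze(0), prompt_len=8)
print(labels)
```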

### Compute Infrastructure

#### Hardware

- Works on consumer GPUs for inference; fine‑tuning VRAM depends on adapter choice and batch size.

#### Software

- Python, PyTorch, `transformers`, `peft`, `accelerate`, `datasets`, `evaluate`, optional `bitsandbytes` for QLoRA.

---

## Citation

If you use this repository or the resulting model, please cite:

**BibTeX:**

```bibtex
@software{ImageCaptioningVLM2025,
  author = {Yousefi, Amir Hossein},
  title  = {Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning},
  year   = {2025},
  url    = {https://github.com/amirhossein-yousefi/Image-Captioning-VLM}
}
```

Also cite the **base model** and **dataset** as appropriate (see their pages).

**APA:**

Yousefi, A. H. (2025). *Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning* [Computer software]. https://github.com/amirhossein-yousefi/Image-Captioning-VLM

---

## Glossary

- **LoRA/QLoRA:** Low‑Rank (Quantized) Adapters that enable parameter‑efficient fine‑tuning.
- **Vision tower:** The vision encoder (SigLIP) that turns image patches into tokens.
- **SFT:** Supervised Fine‑Tuning.

---

## More Information

- For issues and feature requests, open a GitHub issue on the repository.

---

## Model Card Authors

- Amirhossein Yousefi (maintainer)
- Contributors welcome (via PRs)

---

## Model Card Contact

- Open an issue: https://github.com/amirhossein-yousefi/Image-Captioning-VLM/issues