---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: transformers
pipeline_tag: image-to-text
license: apache-2.0
tags:
  - vision-language
  - image-captioning
  - SmolVLM
  - LoRA
  - QLoRA
  - COCO
  - peft
  - accelerate
base_model: HuggingFaceTB/SmolVLM-Instruct
datasets:
  - jxie/coco_captions
language:
  - en
widget:
  - text: "Give a concise caption."
    src: https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg
---

# Model Card for **Image-Captioning-VLM (SmolVLM + COCO, LoRA/QLoRA)**

This repository provides a compact **vision–language image captioning model** built by fine-tuning **SmolVLM-Instruct** with **LoRA/QLoRA** adapters on the **MS COCO Captions** dataset. The goal is to offer an easy-to-train, memory-efficient captioner for research, data labeling, and diffusion-training workflows while keeping the **vision tower frozen** and adapting the language/cross-modal components.

> **TL;DR**
>
> - Base: `HuggingFaceTB/SmolVLM-Instruct` (Apache-2.0).
> - Training data: `jxie/coco_captions` (English captions).
> - Method: LoRA/QLoRA SFT; **vision encoder frozen**.
> - Intended use: generate concise or descriptive captions for general images.
> - Not intended for high-stakes or safety-critical uses.

---

## Model Details

### Model Description

- **Developed by:** *Amir Hossein Yousefi* (GitHub: `amirhossein-yousefi`)
- **Model type:** Vision–language (**image → text**) captioning model with LoRA/QLoRA adapters on top of **SmolVLM-Instruct**
- **Language(s):** English
- **License:** **Apache-2.0** for the released model artifacts (inherited from the base model's license); the dataset retains its own license (see *Training Data*)
- **Finetuned from:** `HuggingFaceTB/SmolVLM-Instruct`

SmolVLM couples a **shape-optimized SigLIP** vision tower with a compact **SmolLM2** decoder via a multimodal projector and runs via `AutoModelForVision2Seq`. This project fine-tunes the language side with LoRA/QLoRA while **freezing the vision tower** to keep memory use low and training simple.

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-VLM
- **Base model card:** https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct
- **Base technical report:** https://arxiv.org/abs/2504.05299 (SmolVLM)
- **Dataset (training):** https://huggingface.co/datasets/jxie/coco_captions

---

## Uses

### Direct Use

- Generate **concise** or **descriptive** captions for natural images.
- Provide **alt text** / accessibility descriptions (human review recommended).
- Produce captions for **vision dataset bootstrapping** or **diffusion training** pipelines.

**Quickstart (inference script from this repo):**

```bash
python inference_vlm.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --adapter_dir outputs/smolvlm-coco-lora \
  --image https://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg \
  --prompt "Give a concise caption."
```

**Programmatic example (PEFT LoRA):**

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
base = "HuggingFaceTB/SmolVLM-Instruct"
adapter_dir = "outputs/smolvlm-coco-lora"  # path from training

processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(
    base, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
).to(device)

# Load the LoRA/QLoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter_dir).to(device)
model.eval()

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
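
For adapter-free deployment you can optionally fold the LoRA weights into the base model with PEFT's `merge_and_unload`. The sketch below assumes the full-precision base model loaded as in the example above (merging a QLoRA adapter into a 4-bit quantized base needs extra care), and the output path is hypothetical:

```python
# Merge the LoRA weights into the base model so inference no longer needs `peft`.
merged = model.merge_and_unload()                      # returns a plain transformers model
merged.save_pretrained("outputs/smolvlm-coco-merged")  # hypothetical output directory
processor.save_pretrained("outputs/smolvlm-coco-merged")
```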

### Downstream Use

- As a **captioning stage** within multi-step data pipelines (e.g., labeling, retrieval augmentation, dataset curation); a minimal batch-captioning sketch follows below.
- As a starting point for **continued fine-tuning** on specialized domains (e.g., medical imagery, artwork) with domain-appropriate data and review.

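The sketch below illustrates the pipeline use case: it loops over a folder of images and writes one caption per line to a JSONL file. The folder path, output file, and `caption_image` helper are hypothetical, and it reuses the `model`, `processor`, and `device` set up in the programmatic example above:

```python
# Hypothetical bootstrapping loop: caption every .jpg in a folder and save to JSONL.
import json
import pathlib

import torch
from PIL import Image


@torch.no_grad()
def caption_image(path: pathlib.Path) -> str:
    image = Image.open(path).convert("RGB")
    messages = [{"role": "user",
                 "content": [{"type": "image"},
                             {"type": "text", "text": "Give a concise caption."}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)
    ids = model.generate(**inputs, max_new_tokens=64)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]


with open("captions.jsonl", "w") as f:
    for img_path in sorted(pathlib.Path("images/").glob("*.jpg")):
        f.write(json.dumps({"file": img_path.name, "caption": caption_image(img_path)}) + "\n")
```
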
### Out-of-Scope Use

- **High-stakes** or **safety-critical** settings (medical, legal, surveillance, credit decisions, etc.).
- Automated systems where **factuality, fairness, or safety** must be guaranteed without a **human in the loop**.
- Reading small text (OCR) or extracting sensitive PII from images; this model is not optimized for OCR.

---

## Bias, Risks, and Limitations

- **Data bias:** COCO captions are predominantly English and reflect the biases of their sources; generated captions may mirror societal stereotypes.
- **Content coverage:** General-purpose images work best; performance may degrade on domains underrepresented in COCO (e.g., medical scans, satellite imagery).
- **Safety:** Captions may occasionally be **inaccurate**, **overconfident**, or **hallucinated**. Always review them before downstream use, especially for accessibility.

### Recommendations

- Keep a **human in the loop** for sensitive or impactful applications.
- When adapting to new domains, curate **diverse, representative** training sets and evaluate with domain-specific metrics and audits.
- Log model outputs and collect review feedback to iteratively improve quality.

---

## How to Get Started with the Model

**Environment setup**

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# For QLoRA on NVIDIA GPUs, make sure bitsandbytes is installed; otherwise pass --use_qlora false
```

**Fine-tune (LoRA/QLoRA; frozen vision tower)**

```bash
python train_vlm_sft.py \
  --base_model_id HuggingFaceTB/SmolVLM-Instruct \
  --dataset_id jxie/coco_captions \
  --output_dir outputs/smolvlm-coco-lora \
  --epochs 1 --batch_size 2 --grad_accum 8 \
  --max_seq_len 1024 --image_longest_edge 1536
```

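Under the hood, the training script applies a parameter-efficient adapter to the language side. The sketch below shows one plausible LoRA/QLoRA setup with `peft` and `bitsandbytes`; the rank, target modules, and name-based freezing heuristic are illustrative assumptions, not the repo's exact configuration:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# QLoRA: load the frozen base model in 4-bit NF4 (requires bitsandbytes on an NVIDIA GPU).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Keep the vision tower frozen; only language/cross-modal weights receive LoRA updates.
# (Matching parameters by the substring "vision" is an assumption about the module names.)
for name, param in model.named_parameters():
    if "vision" in name:
        param.requires_grad = False

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters should be trainable
```
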
---

## Training Details

### Training Data

- **Dataset:** `jxie/coco_captions` (English captions for MS COCO images); a quick loading snippet follows below.
- **Notes:** COCO provides **~617k** caption examples with **5 captions per image**; the images come from Flickr with their own terms. Please review the dataset card and the original COCO license/terms before use.

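To inspect the data before training, you can load it with `datasets`. This is a sketch; the split names and feature layout should be checked against the dataset card:

```python
from datasets import load_dataset

# Load the COCO captions mirror used for training; the "train" split name is assumed from the dataset card.
ds = load_dataset("jxie/coco_captions", split="train")
print(ds)            # row count and feature names
print(ds[0].keys())  # typically an image plus reference caption field(s)
```
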
### Training Procedure

#### Preprocessing

- Images are resized so the **longest edge = 1536 px** (consistent with SmolVLM's 384×384 patching strategy at N=4); see the processor sketch below.
- Text sequences are truncated/padded to **max_seq_len = 1024**.

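A minimal sketch of those preprocessing settings via the SmolVLM processor; the exact keyword arguments may differ from the repo's training script:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
processor.image_processor.size = {"longest_edge": 1536}  # resize the longest image side to 1536 px

image = Image.open("sample.jpg").convert("RGB")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Give a concise caption."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

batch = processor(
    text=prompt,
    images=[image],
    max_length=1024,   # max_seq_len used during training
    truncation=True,
    return_tensors="pt",
)
print({k: v.shape for k, v in batch.items()})
```
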
#### Training Hyperparameters

- **Regime:** Supervised fine-tuning with **LoRA** (or **QLoRA**) on the language-side parameters; **vision tower frozen**.
- **Example CLI:** see above. Mixed precision (`bf16` on CUDA) is recommended if available.

#### Speeds, Sizes, Times

- The base SmolVLM reports a minimum of **~5 GB GPU RAM** for inference; fine-tuning requires more VRAM depending on batch size and sequence length. See the base model card for details.

---

## Evaluation

### 📊 Score card

**Higher is better for all metrics (↑).** For visualization, `CIDEr` is shown ×100 in the accompanying chart so it matches the 0–100 scale of the other metrics.

| Split          | CIDEr | CLIPScore | BLEU-4 | METEOR | ROUGE-L | BERTScore-F1 | Images |
|:---------------|------:|----------:|-------:|-------:|--------:|-------------:|-------:|
| **Test**       | 0.560 | 30.830    | 15.73  | 47.84  | 45.18   | 91.73        | 1000   |
| **Validation** | 0.540 | 31.068    | 16.01  | 48.28  | 45.11   | 91.80        | 1000   |

### Quick read on the metrics

- **CIDEr** — consensus with human captions; higher indicates more human-like phrasing (values from 0 to above 1 are typical).
- **CLIPScore** — reference-free image–text compatibility via CLIP's cosine similarity (commonly rescaled).
- **BLEU-4** — 4-gram precision with a brevity penalty (lexical match).
- **METEOR** — unigram matching with stemming/synonyms; emphasizes recall.
- **ROUGE-L** — longest common subsequence overlap (structure/recall-leaning).
- **BERTScore-F1** — semantic similarity using contextual embeddings.

### Testing Data, Factors & Metrics

#### Testing Data

- Hold out a portion of **COCO val** (e.g., `val2014`) or custom images for qualitative/quantitative evaluation.

#### Factors

- **Image domain** (indoor/outdoor), **object density**, **scene complexity**, and **presence of small text** (OCR-like content) can affect performance.

#### Metrics

- Strong **semantic alignment** (BERTScore-F1 ≈ **91.8** on *val*) and balanced lexical overlap (BLEU-4 ≈ **16.0**).
- **CIDEr** is slightly higher on *test* (0.560) than on *val* (0.540); the other metrics are near parity across splits.
- Trained and evaluated with the minimal pipeline in the repo (LoRA/QLoRA-ready).
- This repo includes `eval_caption_metric.py` scaffolding; a sketch of metric computation with the `evaluate` library follows below.

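A hedged sketch of how the overlap-based metrics can be computed with the `evaluate` library; `eval_caption_metric.py` may implement this differently, and CIDEr/CLIPScore require additional packages (e.g., a COCO caption evaluator and a CLIP scorer) that are omitted here:

```python
import evaluate

# Toy prediction/reference pair; in practice these come from the fine-tuned model and COCO annotations.
predictions = ["a dog runs across a grassy field"]
references = [["a brown dog running through green grass"]]

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=references, max_order=4))    # BLEU-4
print(meteor.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))                # includes ROUGE-L
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```
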
### Results

- The score card above reports the current results; rerun the evaluation script (e.g., for CIDEr and BLEU-4) to reproduce them, and consider adding qualitative examples.

#### Summary

- The LoRA/QLoRA approach provides **memory-efficient adaptation** while preserving SmolVLM's strong generalization on image–text tasks.

---

## Model Examination

- You may inspect token attributions or visualize attention over image regions using third-party tools; no built-in interpretability tooling ships with this repo.

---

## 🖥️ Training Hardware & Environment

- **Device:** Laptop (Windows, WDDM driver model)
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)
- **Driver:** **576.52**
- **CUDA (driver):** **12.9**
- **PyTorch:** **2.8.0+cu129**
- **CUDA available:** ✅

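To capture the same environment report on your own machine, a quick check like the following works (values will differ):

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA (build):", torch.version.cuda)  # CUDA version PyTorch was built against
```
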
## 📊 Training Metrics

- **Total FLOPs (training):** `26,387,224,652,152,830`
- **Training runtime:** `5,664.0825` seconds (≈ 1 h 34 min)

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** SmolVLM-style VLM with a **SigLIP** vision tower, a **SmolLM2** decoder, and a **multimodal projector**; trained here via **SFT with LoRA/QLoRA** for **image captioning**.
- **Objective:** Next-token generation conditioned on image tokens plus the text prompt (image → text).

### Compute Infrastructure

#### Hardware

- Works on consumer GPUs for inference; fine-tuning VRAM depends on the adapter choice and batch size.

#### Software

- Python, PyTorch, `transformers`, `peft`, `accelerate`, `datasets`, `evaluate`, and optional `bitsandbytes` for QLoRA.

---

## Citation

If you use this repository or the resulting model, please cite:

**BibTeX:**

```bibtex
@software{ImageCaptioningVLM2025,
  author = {Yousefi, Amir Hossein},
  title  = {Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning},
  year   = {2025},
  url    = {https://github.com/amirhossein-yousefi/Image-Captioning-VLM}
}
```

Also cite the **base model** and **dataset** as appropriate (see their pages).

**APA:**

Yousefi, A. H. (2025). *Image-Captioning-VLM: LoRA/QLoRA fine-tuning of SmolVLM for image captioning* [Computer software]. https://github.com/amirhossein-yousefi/Image-Captioning-VLM

---

## Glossary

- **LoRA/QLoRA:** Low-Rank (Quantized) Adapters that enable parameter-efficient fine-tuning; see the formulation below.
- **Vision tower:** The vision encoder (SigLIP) that turns image patches into tokens.
- **SFT:** Supervised Fine-Tuning.

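In the usual LoRA formulation, a frozen weight matrix $W$ is augmented with a trainable low-rank product, so only the two small matrices are updated during fine-tuning:

$$
W' = W + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

QLoRA applies the same update on top of a base model whose frozen weights are stored in 4-bit precision, which is what keeps fine-tuning within consumer-GPU memory budgets.
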
---

## More Information

- For issues and feature requests, open a GitHub issue on the repository.

---

## Model Card Authors

- Amir Hossein Yousefi (maintainer)
- Contributors are welcome (via PRs)

---

## Model Card Contact

- Open an issue: https://github.com/amirhossein-yousefi/Image-Captioning-VLM/issues