---
language:
- en
library_name: transformers
pipeline_tag: image-to-text
tags:
- blip
- image-captioning
- vision-language
- flickr8k
- coco
license: bsd-3-clause
datasets:
- ariG23498/flickr8k
- yerevann/coco-karpathy
base_model: Salesforce/blip-image-captioning-base
---

# Model Card for Image-Captioning-BLIP (Fine‑Tuned BLIP for Image Captioning)

<!-- Provide a quick summary of what the model is/does. -->

This repository provides a lightweight, pragmatic **fine‑tuning and evaluation pipeline around Salesforce BLIP** for image captioning, with sane defaults and a tiny, production‑friendly inference helper. Use it to fine‑tune `Salesforce/blip-image-captioning-base` on **Flickr8k** or **COCO‑Karpathy** and export artifacts you can push to the Hugging Face Hub.

> **TL;DR**: End‑to‑end train → evaluate → export → caption images with a few commands. Defaults: BLIP‑base (ViT‑B/16), Flickr8k, BLEU during training, COCO‑style metrics (CIDEr/METEOR/SPICE) after training.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This project fine‑tunes **BLIP (Bootstrapping Language‑Image Pre‑training)** for the **image‑to‑text** task. BLIP couples a ViT visual encoder with a text decoder for conditional caption generation and, in the original work, uses a bootstrapped captioning strategy (CapFilt) during pretraining. Here, we reuse the open **`BlipForConditionalGeneration`** weights and processor and adapt them to caption everyday photographs from Flickr8k or the COCO Karpathy split.

- **Developed by:** Amirhossein Yousefi
- **Shared by:** Amirhossein Yousefi
- **Model type:** Vision–language encoder–decoder (BLIP base; ViT‑B/16 vision encoder + text decoder)
- **Language(s) (NLP):** English
- **License:** BSD‑3‑Clause (inherited from the base model’s license; ensure your own dataset/weight licensing is compatible)
- **Finetuned from model:** `Salesforce/blip-image-captioning-base`

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/amirhossein-yousefi/Image-Captioning-BLIP
- **Paper:** BLIP — Bootstrapping Language‑Image Pre‑training (arXiv:2201.12086), https://arxiv.org/abs/2201.12086
- **Demo:** See the usage examples in the base model card on the Hub (PyTorch snippets)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- Generate concise alt‑text‑style captions for photos.
- Zero‑shot captioning with the base checkpoint, or improved fidelity after fine‑tuning on your target dataset.
- Batch/offline captioning for indexing, search, and accessibility workflows.

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- Warm‑start other captioners or retrieval models by using generated captions as weak labels.
- Build dataset bootstrapping pipelines (e.g., pseudo‑labels for new domains).
- Use as a component in multi‑modal applications (e.g., visual content tagging, basic scene summaries).

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- High‑stakes or safety‑critical settings (medical, legal, surveillance).
- Factual description of specialized imagery (e.g., diagrams, medical scans) without domain‑specific fine‑tuning.
- Content moderation, protected‑attribute inference, or demographic classification.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- **Data bias:** Flickr8k/COCO contain Western‑centric scenes and captions; captions may reflect annotator bias or stereotypes.
- **Language coverage:** Training here targets English only; captions for non‑English content or localized entities may be poor.
- **Hallucination:** Like most captioners, BLIP can produce plausible but incorrect or over‑confident statements.
- **Privacy:** Avoid using on sensitive images or personally identifiable content without consent.
- **IP & license:** Ensure you have rights to your training/evaluation images and that your dataset use complies with its license.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

- Evaluate on a **domain‑specific validation set** before deployment.
- Use a **safety filter**/keyword blacklist or human review if captions are user‑facing.
- For specialized domains, **continue fine‑tuning** with in‑domain images and style prompts.
- When summarizing scenes, prefer **beam search** with moderate length penalties and enforce max lengths to curb rambling.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Replace with your fine-tuned repo once pushed, e.g. "amirhossein-yousefi/blip-captioning-flickr8k"
MODEL_ID = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, num_beams=5, length_penalty=1.0, early_stopping=True)
print(processor.decode(out[0], skip_special_tokens=True))
```
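
Once you have a fine‑tuned checkpoint, publishing it to the Hub is a short call on the same `model` and `processor` objects. A minimal sketch, assuming you are already logged in to the Hub; the repo id below is a placeholder, not an existing repository:

```python
# Minimal sketch: publish the fine-tuned weights and processor to the Hub.
# "your-username/blip-captioning-flickr8k" is a placeholder repo id.
model.push_to_hub("your-username/blip-captioning-flickr8k")
processor.push_to_hub("your-username/blip-captioning-flickr8k")
```
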
## Training Details

### Training Data

Two common options are wired in:

- **Flickr8k** (`ariG23498/flickr8k`) — 8k images with 5 captions each. Default split in this repo: **90% train / 5% val / 5% test** (deterministic by seed; see the sketch below).
- **COCO‑Karpathy** (`yerevann/coco-karpathy`) — community‑prepared Karpathy splits for COCO captions.

> ⚠️ Always verify dataset licenses and usage terms before training or publishing models derived from them.

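For illustration, a deterministic 90/5/5 split can be produced with `datasets` roughly as follows; the seed value and the assumption that `ariG23498/flickr8k` exposes a single `train` split are mine, not necessarily the repo's exact code:

```python
from datasets import load_dataset

SEED = 42  # assumed seed; the repo's default may differ

# Assumes the dataset ships a single "train" split.
ds = load_dataset("ariG23498/flickr8k", split="train")

# Carve off 10% for validation + test, then split that holdout in half.
first = ds.train_test_split(test_size=0.10, seed=SEED)
holdout = first["test"].train_test_split(test_size=0.50, seed=SEED)

train_ds, val_ds, test_ds = first["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))
```
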
### Training Procedure

This project uses the Hugging Face **Trainer** with a custom collator; `BlipProcessor` handles both image and text preprocessing, and padding positions in the labels are set to `-100` so they are ignored by the loss.

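A collator in this spirit might look like the sketch below; the example field names (`image`, `caption`) and the maximum caption length are assumptions rather than the repo's exact code:

```python
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def collate_fn(batch):
    # batch: list of examples shaped like {"image": PIL.Image, "caption": str} (field names assumed)
    images = [example["image"].convert("RGB") for example in batch]
    texts = [example["caption"] for example in batch]
    enc = processor(
        images=images,
        text=texts,
        padding=True,
        truncation=True,
        max_length=40,  # mirrors max_txt_len below
        return_tensors="pt",
    )
    labels = enc["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc
```
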
#### Preprocessing

- Images and text are preprocessed by `BlipProcessor`, consistent with BLIP defaults (resize/normalize/tokenize).
- Optional **vision encoder freezing** is supported for parameter‑efficient fine‑tuning (see the sketch after this list).

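As a minimal illustration of the low‑VRAM option, the ViT vision tower of `BlipForConditionalGeneration` can be frozen directly on its parameters; the training script may expose this as a flag instead:

```python
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Freeze the ViT backbone; only the text decoder remains trainable.
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```
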
#### Training Hyperparameters (defaults)

- **Epochs:** `4`
- **Learning rate:** `5e-5`
- **Per‑device batch size:** `8` (train & eval)
- **Gradient accumulation:** `2`
- **Gradient checkpointing:** `True`
- **Freeze vision encoder:** `False` (set `True` for low‑VRAM setups)
- **Logging:** every `50` steps; keep `2` checkpoints
- **Model selection:** best checkpoint by `sacrebleu` (see the `TrainingArguments` sketch below)

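These defaults map roughly onto `TrainingArguments` as follows; the output directory, evaluation cadence, and metric key are assumptions, and the actual training script may differ:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="blip-open-out",         # assumed to match the metrics path below
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,                 # keep 2 checkpoints
    eval_strategy="epoch",              # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",  # assumes compute_metrics returns a "sacrebleu" key
    greater_is_better=True,
)
```
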
#### Generation (eval/inference defaults)

- `max_txt_len = 40`, `gen_max_new_tokens = 30`, `num_beams = 5`, `length_penalty = 1.0`, `early_stopping = True`

#### Speeds, Sizes, Times

- A **single 16 GB GPU** is typically sufficient for BLIP‑base with the defaults (gradient checkpointing enabled).
- If VRAM is tight: freeze the vision encoder, lower the batch size, and/or increase gradient accumulation.

## Evaluation

### Testing Data, Factors & Metrics

- **Data:** Validation split of the chosen dataset (Flickr8k or COCO‑Karpathy).
- **Metrics:** BLEU‑4 (during training), and post‑training **COCO‑style metrics**: **CIDEr**, **METEOR**, **SPICE**.
- **Notes:** SPICE requires Java and can be slow; you can disable or subsample via config.

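For reference, the BLEU score used during training can be reproduced with the `evaluate` wrapper around `sacrebleu`; the captions below are made‑up placeholders:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Placeholder predictions and multi-reference ground truth.
predictions = ["a dog runs across a grassy field"]
references = [[
    "a brown dog is running through the grass",
    "a dog runs across the grass",
]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])  # corpus BLEU on a 0-100 scale
```
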
### Results

After training, a compact JSON with COCO metrics is written to:

```
blip-open-out/coco_metrics.json
```
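
Loading that file back, e.g. to compare runs, is a one‑liner (path taken from above):

```python
import json

with open("blip-open-out/coco_metrics.json") as f:
    metrics = json.load(f)
print(metrics)
```
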
#### 🏆 Results (Test Split)

<p align="center">
  <img alt="BLEU4" src="https://img.shields.io/badge/BLEU4-0.9708-2f81f7?style=for-the-badge">
  <img alt="METEOR" src="https://img.shields.io/badge/METEOR-0.7888-8a2be2?style=for-the-badge">
  <img alt="CIDEr" src="https://img.shields.io/badge/CIDEr-9.333-0f766e?style=for-the-badge">
  <img alt="SPICE" src="https://img.shields.io/badge/SPICE-n%2Fa-lightgray?style=for-the-badge">
</p>

| Metric | Score |
|-----------|------:|
| BLEU‑4 | **0.9708** |
| METEOR | **0.7888** |
| CIDEr | **9.3330** |
| SPICE | — |

<details>
<summary>Raw JSON</summary>

```json
{
  "Bleu_4": 0.9707865195383757,
  "METEOR": 0.7887653835397767,
  "CIDEr": 9.332990983959254,
  "SPICE": null
}
```

</details>

---

#### Summary

- Expect the strongest results when fine‑tuning on in‑domain imagery and using beam search at inference time.

## Model Examination

- Inspect failure cases: cluttered scenes, occlusions, specialized objects, or images with embedded text.
- Run **qualitative sweeps** by toggling beam size and length penalty to see how style and verbosity change (see the sketch below).

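A quick way to run such a sweep, reusing the `model`, `processor`, and `inputs` objects from the quickstart snippet above:

```python
# Qualitative sweep over decoding settings (reuses model/processor/inputs from the
# "How to Get Started" snippet above).
for num_beams in (2, 3, 5):
    for length_penalty in (0.8, 1.0, 1.2):
        out = model.generate(
            **inputs,
            max_new_tokens=30,
            num_beams=num_beams,
            length_penalty=length_penalty,
            early_stopping=True,
        )
        caption = processor.decode(out[0], skip_special_tokens=True)
        print(f"beams={num_beams} length_penalty={length_penalty}: {caption}")
```
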
## Environmental Impact

Estimate using the [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute). Fill in the values you observe for your runs:

- **Hardware Type:** (e.g., 1× NVIDIA T4 / A10 / A100)
- **Hours used:** (e.g., 3.2 h for 4 epochs on Flickr8k)
- **Cloud Provider:** (e.g., AWS SageMaker; optional)
- **Compute Region:** (e.g., us‑west‑2)
- **Carbon Emitted:** (estimated grams of CO₂eq)

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** BLIP encoder–decoder; **ViT‑B/16** vision backbone with a text decoder for conditional caption generation.
- **Objective:** Cross‑entropy on tokenized captions, with padding positions masked to `-100`; inputs are packed by `BlipProcessor`.

### Compute Infrastructure

#### Hardware

- Trains comfortably on **one 16 GB GPU** with the defaults.

#### Software

- **Python 3.9+**, **PyTorch**, **Transformers**, **Datasets**, **evaluate**, **sacrebleu**, and optional **pycocotools/pycocoevalcap** (for CIDEr/METEOR/SPICE).
- Optional **AWS SageMaker** entry points are included for managed training and inference.