---
title: BLIP Captioner
emoji: 🖼
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
python_version: "3.11"
app_file: app.py
pinned: false
license: mit
tags:
  - image-captioning
  - vision-language
  - blip
  - multimodal
  - salesforce
short_description: Generate captions for images with BLIP
---

# BLIP Image Captioner

Generate natural-language descriptions for any image using Salesforce's
**BLIP** (Bootstrapping Language-Image Pre-training) model.

## Features

- **Single caption mode** — standard captioning with tunable beam width
- **Conditional captioning** — optional prompt prefix (e.g., "a painting of")
- **Variety comparison** — generate 3 captions with different beam widths
  to see how output changes
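
The three modes above could be wired up roughly as follows. This is a minimal sketch rather than the actual contents of `app.py`: the helper names (`load_blip`, `caption`, `compare_beams`), the default of 30 new tokens, and the beam widths `(1, 3, 5)` are assumptions chosen for illustration. The `transformers` calls (`BlipProcessor`, `BlipForConditionalGeneration`, `generate`, `decode`) follow the usage shown on the model card.

```python
from typing import Any, Dict, Optional, Sequence

MODEL_ID = "Salesforce/blip-image-captioning-base"


def load_blip(model_id: str = MODEL_ID):
    """Load the processor and model (downloads weights on the first call)."""
    # Imported lazily so the helpers below stay importable without the model.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    model.eval()
    return processor, model


def caption(image: Any, processor: Any, model: Any,
            prompt: Optional[str] = None, num_beams: int = 3,
            max_new_tokens: int = 30) -> str:
    """Return one caption; a non-empty prompt switches to conditional mode."""
    if prompt and prompt.strip():
        # Conditional captioning: the prompt becomes the caption prefix.
        inputs = processor(images=image, text=prompt.strip(), return_tensors="pt")
    else:
        inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(
        **inputs, num_beams=num_beams, max_new_tokens=max_new_tokens
    )
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


def compare_beams(image: Any, processor: Any, model: Any,
                  beam_widths: Sequence[int] = (1, 3, 5)) -> Dict[int, str]:
    """Caption the same image once per beam width (widths are illustrative)."""
    return {b: caption(image, processor, model, num_beams=b) for b in beam_widths}
```

A typical call would be `processor, model = load_blip()` followed by `caption(PIL.Image.open(path).convert("RGB"), processor, model, prompt="a painting of")`.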

## Model

- **Name:** [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- **Paper:** [BLIP](https://arxiv.org/abs/2201.12086) (Li et al., 2022)
- **Parameters:** ~250M
- **Architecture:** ViT-Base image encoder + BERT-base text decoder that cross-attends to the image features

## Performance

- **First load:** ~20 seconds (model download + initialization)
- **Subsequent inference:** 2-8 seconds per caption on CPU once the model is loaded, depending on beam width
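
To check these figures on your own hardware, a small timing helper along the following lines may be useful. It is a sketch only: `time_by_beam_width` is a hypothetical name, and `run` stands for whatever callable produces a caption for a given beam width in your setup.

```python
import time
from typing import Callable, Dict, Iterable


def time_by_beam_width(run: Callable[[int], object],
                       beam_widths: Iterable[int] = (1, 3, 5),
                       repeats: int = 3) -> Dict[int, float]:
    """Return the best wall-clock time in seconds per call for each beam width."""
    timings: Dict[int, float] = {}
    for beams in beam_widths:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(beams)  # e.g. a function that captions a fixed test image
            best = min(best, time.perf_counter() - start)
        timings[beams] = best
    return timings
```

Taking the best of several repeats reduces noise from other processes; load the model before timing so the first-load cost is not counted.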

## License

The deployment code in this Space is MIT-licensed. The BLIP model is released by Salesforce under the BSD-3-Clause license.