---
title: BLIP Captioner
emoji: 🖼
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
python_version: "3.11"
app_file: app.py
pinned: false
license: mit
tags:
  - image-captioning
  - vision-language
  - blip
  - multimodal
  - salesforce
short_description: Generate captions for images with BLIP
---

# BLIP Image Captioner

Generate natural-language descriptions for any image using Salesforce's **BLIP** (Bootstrapping Language-Image Pre-training) model.

## Features

- **Single caption mode:** standard captioning with tunable beam width
- **Conditional captioning:** optional prompt prefix (e.g., "a painting of")
- **Variety comparison:** generate 3 captions with different beam widths to see how output changes

## Model

- **Name:** [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- **Paper:** [BLIP](https://arxiv.org/abs/2201.12086) (Li et al., 2022)
- **Parameters:** ~250M
- **Architecture:** ViT-base + BERT-base with cross-attention

## Performance

- **First load:** ~20 seconds (model download + init)
- **Cached inference:** 2-8 seconds per caption (CPU, depends on beam width)

## License

MIT for this deployment code. Model is released by Salesforce under BSD-3.
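
## Usage sketch

The three modes listed under Features can be approximated with the `transformers` BLIP classes. The snippet below is a minimal sketch, not the contents of `app.py` (which is not shown here): the function names, the beam widths `(1, 3, 5)`, and the `max_new_tokens=40` cap are assumptions chosen for illustration.

```python
from typing import Optional

MODEL_ID = "Salesforce/blip-image-captioning-base"


def load_blip(model_id: str = MODEL_ID):
    """Load the BLIP processor and captioning model (downloads on first call)."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    model.eval()
    return processor, model


def caption(image, processor, model, prompt: Optional[str] = None,
            num_beams: int = 3, max_new_tokens: int = 40) -> str:
    """Return one caption; a non-empty prompt switches to conditional captioning."""
    # With text=None the processor only prepares pixel values (unconditional mode).
    inputs = processor(images=image, text=prompt or None, return_tensors="pt")
    output_ids = model.generate(
        **inputs, num_beams=num_beams, max_new_tokens=max_new_tokens
    )
    return processor.decode(output_ids[0], skip_special_tokens=True)


def compare_beam_widths(image, processor, model, beam_widths=(1, 3, 5)) -> dict:
    """Generate one caption per beam width (widths here are assumed, not from app.py)."""
    return {n: caption(image, processor, model, num_beams=n) for n in beam_widths}


# Example (assumes a local file "example.jpg" and that Pillow is installed):
#   from PIL import Image
#   processor, model = load_blip()
#   image = Image.open("example.jpg").convert("RGB")
#   print(caption(image, processor, model))                          # single caption
#   print(caption(image, processor, model, prompt="a painting of"))  # conditional
#   print(compare_beam_widths(image, processor, model))              # variety comparison
```

Wider beams explore more candidate sequences per step, which is why inference time on CPU grows with beam width.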