WolfDavid committed on
Commit
a388160
·
1 Parent(s): 07271cf

Initial deploy: BLIP image captioning

Files changed (3)
  1. README.md +41 -6
  2. app.py +296 -0
  3. requirements.txt +5 -0
README.md CHANGED
@@ -1,12 +1,47 @@
 ---
-title: Blip Captioner
-emoji: 🏃
-colorFrom: pink
-colorTo: pink
+title: BLIP Captioner
+emoji: 🖼
+colorFrom: indigo
+colorTo: purple
 sdk: gradio
-sdk_version: 6.12.0
+sdk_version: 5.9.1
+python_version: "3.11"
 app_file: app.py
 pinned: false
+license: mit
+tags:
+- image-captioning
+- vision-language
+- blip
+- multimodal
+- salesforce
+short_description: Generate captions for images with BLIP
 ---

-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# BLIP Image Captioner
+
+Generate natural-language descriptions for any image using Salesforce's
+**BLIP** (Bootstrapping Language-Image Pre-training) model.
+
+## Features
+
+- **Single caption mode** — standard captioning with a tunable beam width
+- **Conditional captioning** — optional prompt prefix (e.g., "a painting of")
+- **Variety comparison** — generate 3 captions with different beam widths
+  to see how the output changes
+
+## Model
+
+- **Name:** [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
+- **Paper:** [BLIP](https://arxiv.org/abs/2201.12086) (Li et al., 2022)
+- **Parameters:** ~250M
+- **Architecture:** ViT-base + BERT-base with cross-attention
+
+## Performance
+
+- **First load:** ~20 seconds (model download + init)
+- **Cached inference:** 2-8 seconds per caption (CPU; depends on beam width)
+
+## License
+
+MIT for this deployment code. The model is released by Salesforce under BSD-3.
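Outside the Space UI, the core captioning call in app.py (below) distills to a few lines. A minimal standalone sketch, where `photo.jpg` is a placeholder path; the model id and the processor/`generate`/decode calls mirror the ones in this commit:

```python
# Minimal BLIP captioning sketch distilled from app.py; "photo.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

MODEL_NAME = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_NAME)
model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME)
model.eval()

image = Image.open("photo.jpg").convert("RGB")

# An optional text prefix conditions the caption (omit it for unconditional captioning).
inputs = processor(image, "a photograph of", return_tensors="pt")

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=50, num_beams=5)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```

With `num_beams=1` the search degenerates to greedy decoding, which is what the first row of the app's Variety Comparison tab shows.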
app.py ADDED
@@ -0,0 +1,296 @@
+"""
+BLIP Image Captioner — HF Space
+
+Real image-to-text captioning using Salesforce's BLIP model.
+"""
+
+from __future__ import annotations
+
+import time
+from typing import Optional
+
+import gradio as gr
+import torch
+from PIL import Image
+from transformers import BlipForConditionalGeneration, BlipProcessor
+
+# ═══════════════════════════════════════════════════════════════════
+# Model loading
+# ═══════════════════════════════════════════════════════════════════
+
+MODEL_NAME = "Salesforce/blip-image-captioning-base"
+
+_model: Optional[BlipForConditionalGeneration] = None
+_processor: Optional[BlipProcessor] = None
+
+
+def load_model():
+    """Load the BLIP model and processor on first use."""
+    global _model, _processor
+
+    if _model is not None:
+        return
+
+    _processor = BlipProcessor.from_pretrained(MODEL_NAME)
+    _model = BlipForConditionalGeneration.from_pretrained(
+        MODEL_NAME,
+        torch_dtype=torch.float32,
+    )
+    _model.eval()
+
+
+# ═══════════════════════════════════════════════════════════════════
+# Caption generation
+# ═══════════════════════════════════════════════════════════════════
+
+def caption_image(
+    image: Image.Image,
+    prompt: str,
+    max_length: int,
+    num_beams: int,
+):
+    """Generate a caption for an image, optionally conditioned on a prompt."""
+    if image is None:
+        return "_Upload an image to get a caption._", "0 ms"
+
+    load_model()
+
+    image = image.convert("RGB")
+    prompt = (prompt or "").strip()
+
+    start = time.perf_counter()
+
+    if prompt:
+        inputs = _processor(image, prompt, return_tensors="pt")
+    else:
+        inputs = _processor(image, return_tensors="pt")
+
+    with torch.inference_mode():
+        output_ids = _model.generate(
+            **inputs,
+            max_new_tokens=int(max_length),
+            num_beams=int(num_beams),
+            early_stopping=True,
+        )
+
+    latency_ms = (time.perf_counter() - start) * 1000
+    caption = _processor.decode(output_ids[0], skip_special_tokens=True)
+
+    return caption, f"{latency_ms:.0f} ms"
+
+
+# ═══════════════════════════════════════════════════════════════════
+# Multiple captions (variety sampling)
+# ═══════════════════════════════════════════════════════════════════
+
+def generate_multiple_captions(image: Image.Image):
+    """Generate multiple captions with different beam sizes for variety."""
+    if image is None:
+        return "_Upload an image first._"
+
+    load_model()
+    image = image.convert("RGB")
+
+    start = time.perf_counter()
+    inputs = _processor(image, return_tensors="pt")
+
+    captions = []
+    with torch.inference_mode():
+        for beams in (1, 3, 5):
+            output_ids = _model.generate(
+                **inputs,
+                max_new_tokens=50,
+                num_beams=beams,
+                early_stopping=True,
+            )
+            cap = _processor.decode(output_ids[0], skip_special_tokens=True)
+            captions.append((beams, cap))
+
+    latency_ms = (time.perf_counter() - start) * 1000
+
+    lines = [f"**Generated in {latency_ms:.0f} ms:**\n"]
+    for beams, cap in captions:
+        lines.append(f"- **Beams={beams}:** {cap}")
+    return "\n".join(lines)
+
+
+# ═══════════════════════════════════════════════════════════════════
+# Gradio UI
+# ═══════════════════════════════════════════════════════════════════
+
+with gr.Blocks(title="BLIP Image Captioner", theme=gr.themes.Soft()) as demo:
+    gr.Markdown(
+        """
+        # BLIP Image Captioner
+
+        Generate natural-language descriptions for any image using
+        **Salesforce's BLIP** (Bootstrapping Language-Image Pre-training).
+
+        Runs on HF's free CPU tier. The first request loads the model
+        (~20 s); subsequent captions generate in a few seconds.
+
+        > Try uploading a photo of a person, scene, object, or activity.
+        > You can optionally provide a **prompt prefix** to condition
+        > the caption (e.g., "a photograph of" or "a painting of").
+        """
+    )
+
+    with gr.Tabs():
+        # ─────────────────────────────────────────────────────────
+        # Tab 1 — Single Caption
+        # ─────────────────────────────────────────────────────────
+        with gr.Tab("Single Caption"):
+            with gr.Row():
+                with gr.Column(scale=1):
+                    image_input = gr.Image(
+                        type="pil",
+                        label="Upload Image",
+                        height=400,
+                    )
+                    prompt_input = gr.Textbox(
+                        label="Optional Prompt Prefix",
+                        placeholder="e.g., 'a photograph of' (leave blank for unconditional)",
+                    )
+                    with gr.Row():
+                        max_length = gr.Slider(
+                            minimum=20,
+                            maximum=100,
+                            step=5,
+                            value=50,
+                            label="Max Caption Length",
+                        )
+                        num_beams = gr.Slider(
+                            minimum=1,
+                            maximum=8,
+                            step=1,
+                            value=5,
+                            label="Beam Search Width",
+                        )
+                    caption_btn = gr.Button(
+                        "Generate Caption",
+                        variant="primary",
+                        size="lg",
+                    )
+
+                with gr.Column(scale=1):
+                    caption_output = gr.Textbox(
+                        label="Generated Caption",
+                        lines=3,
+                        interactive=False,
+                    )
+                    latency_output = gr.Textbox(
+                        label="Latency",
+                        interactive=False,
+                    )
+
+            caption_btn.click(
+                caption_image,
+                inputs=[image_input, prompt_input, max_length, num_beams],
+                outputs=[caption_output, latency_output],
+            )
+
+            gr.Examples(
+                examples=[
+                    ["https://images.unsplash.com/photo-1574158622682-e40e69881006?w=640", ""],
+                    ["https://images.unsplash.com/photo-1552053831-71594a27632d?w=640", ""],
+                    ["https://images.unsplash.com/photo-1502920917128-1aa500764cbd?w=640", "a photograph of"],
+                ],
+                inputs=[image_input, prompt_input],
+            )
+
+        # ─────────────────────────────────────────────────────────
+        # Tab 2 — Variety Comparison
+        # ─────────────────────────────────────────────────────────
+        with gr.Tab("Variety Comparison"):
+            gr.Markdown(
+                """
+                Generate **multiple captions** with different beam search
+                widths to see how the model's output varies. A higher beam
+                width tends to produce more grammatical but sometimes
+                blander captions.
+                """
+            )
+            with gr.Row():
+                with gr.Column(scale=1):
+                    image_input_var = gr.Image(
+                        type="pil",
+                        label="Upload Image",
+                        height=400,
+                    )
+                    variety_btn = gr.Button(
+                        "Generate 3 Captions",
+                        variant="primary",
+                        size="lg",
+                    )
+                with gr.Column(scale=1):
+                    variety_output = gr.Markdown()
+
+            variety_btn.click(
+                generate_multiple_captions,
+                inputs=[image_input_var],
+                outputs=[variety_output],
+            )
+
+        # ─────────────────────────────────────────────────────────
+        # Tab 3 — About
+        # ─────────────────────────────────────────────────────────
+        with gr.Tab("About"):
+            gr.Markdown(
+                """
+                ## Model
+
+                **Name:** [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
+
+                **Paper:** [BLIP: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2201.12086)
+                (Li et al., 2022)
+
+                **Architecture:** ViT-base vision encoder + BERT-base
+                language decoder with cross-attention. Pre-trained on
+                a large corpus of image-caption pairs from the web with
+                a self-filtering approach (CapFilt) to clean noisy data.
+
+                **Parameters:** ~250M (base variant)
+
+                **Training data:** COCO, Visual Genome, SBU Captions,
+                Conceptual Captions, Conceptual 12M
+
+                ## Why BLIP?
+
+                Pre-BLIP vision-language models typically fell into two
+                camps: **understanding** models (CLIP) or **generation**
+                models (image captioning). BLIP unifies both by training
+                a single model on three objectives:
+
+                1. **Image-text contrastive learning** (like CLIP)
+                2. **Image-text matching** (binary classification)
+                3. **Image-grounded text generation** (captioning)
+
+                The "Bootstrapping" in the name refers to the CapFilt
+                training procedure — using the model itself to filter
+                and generate synthetic captions to improve the training
+                data.
+
+                ## Limitations
+
+                - Base model (not large) — favors speed over quality
+                - Trained on English-language captions only
+                - May miss nuance or details in complex scenes
+                - Can struggle with rare objects or unusual scenes
+
+                ## Tech Stack
+
+                - **transformers** — model loading and inference
+                - **torch** — tensor backend (CPU on HF free tier)
+                - **Pillow** — image processing
+                - **Gradio** — UI
+
+                ---
+                **Source:** [github.com/wolfwdavid/ai-tools-collection](https://github.com/wolfwdavid/ai-tools-collection)
+                |
+                **HF Profile:** [@WolfDavid](https://huggingface.co/WolfDavid)
+                """
+            )
+
+
+if __name__ == "__main__":
+    demo.launch()
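The About tab lists three pre-training objectives, of which this app exercises only the third (captioning). For the second, image-text matching, transformers ships a separate BLIP head and checkpoint; the sketch below is illustrative only and not part of this deployment, with `photo.jpg` and the candidate text as placeholders:

```python
# Sketch of BLIP's image-text matching (ITM) head, objective #2 in the About tab.
# NOT part of this Space: uses the separate Salesforce/blip-itm-base-coco checkpoint.
import torch
from PIL import Image
from transformers import BlipForImageTextRetrieval, BlipProcessor

ITM_MODEL = "Salesforce/blip-itm-base-coco"

processor = BlipProcessor.from_pretrained(ITM_MODEL)
model = BlipForImageTextRetrieval.from_pretrained(ITM_MODEL)
model.eval()

image = Image.open("photo.jpg").convert("RGB")  # placeholder path
inputs = processor(image, "a dog playing in the grass", return_tensors="pt")

with torch.inference_mode():
    itm_logits = model(**inputs).itm_score  # shape (1, 2): [no-match, match]

match_prob = torch.softmax(itm_logits, dim=-1)[0, 1].item()
print(f"match probability: {match_prob:.2f}")
```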
requirements.txt ADDED
@@ -0,0 +1,5 @@
+gradio==5.9.1
+huggingface_hub==0.26.5
+transformers==4.46.3
+torch==2.5.1
+Pillow==11.0.0
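These pins reproduce the Space locally (`pip install -r requirements.txt`, then `python app.py`). A deployed Space can also be driven programmatically with `gradio_client`; in the sketch below the Space id `WolfDavid/blip-captioner` is a guess (the id is not shown in this commit) and `/caption_image` assumes Gradio's default endpoint naming for the bound function:

```python
# Hedged sketch: calling the deployed Space remotely via gradio_client.
# ASSUMPTIONS: the Space id is hypothetical, and "/caption_image" relies on
# Gradio deriving the endpoint name from the caption_image function.
from gradio_client import Client, handle_file

client = Client("WolfDavid/blip-captioner")  # hypothetical Space id

caption, latency = client.predict(
    handle_file("photo.jpg"),   # placeholder local image path
    "a photograph of",          # optional prompt prefix ("" for unconditional)
    50,                         # max caption length
    5,                          # beam search width
    api_name="/caption_image",
)
print(caption, latency)
```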