BLIP Caption Model

This repository contains a BLIP-based image captioning model that generates natural-language captions for input images.

The model is connected to a live Hugging Face Space demo:

👉 Multimodal Image Captioning with BLIP Demo

Model Description

This model is designed for automatic image captioning. Given an input image, it generates a short textual description of the visual content.

The project demonstrates the use of vision-language models for multimodal AI applications, combining computer vision and natural language generation.

Intended Use

This model can be used for:

Image caption generation
Vision-language AI demonstrations
Multimodal learning experiments
Educational and portfolio projects
Prototyping image-to-text applications

How to Use

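from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch

model_id = "YaekobB/blip-caption-model"

# Load the image preprocessor/tokenizer and the captioning model from the Hub.
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Convert to RGB so grayscale or RGBA files become 3-channel input.
image = Image.open("your_image.jpg").convert("RGB")

# Preprocess the image into PyTorch tensors.
inputs = processor(image, return_tensors="pt")

# Generate caption token IDs without tracking gradients (inference only).
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)

# Decode the generated token IDs back into text.
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)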
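
The steps above can be wrapped in a small reusable helper. The sketch below is illustrative only: the function name `generate_caption` and its `prompt` parameter are hypothetical and not part of this repository. It assumes `processor` and `model` are loaded as shown above, and that an optional text prefix is passed through the processor's `text` argument, which BLIP supports for conditional captioning. Whether this particular checkpoint responds well to prompts has not been verified here.

```python
from typing import Any, Optional


def generate_caption(
    image: Any,
    processor: Any,
    model: Any,
    prompt: Optional[str] = None,
    max_new_tokens: int = 50,
) -> str:
    """Caption one RGB image, optionally steering the output with a text prefix.

    Hypothetical helper: `processor` and `model` are expected to behave like
    BlipProcessor and BlipForConditionalGeneration.
    """
    if prompt is None:
        # Unconditional captioning: the model describes the image from scratch.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional captioning: the model continues the given text prefix.
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    # Move the input tensors to the device the model lives on (CPU or GPU).
    inputs = inputs.to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

With the `processor`, `model`, and `image` objects from the snippet above, `generate_caption(image, processor, model)` would return a plain caption, and passing `prompt="a photo of"` would request a caption that continues that prefix.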

Live Demo

A live inference demo is available on Hugging Face Spaces:

https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo

The demo allows users to upload one or more images and generate captions using the model.
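
How the demo processes multiple uploads is not documented here. As a rough local equivalent, the sketch below captions a list of images in fixed-size batches. The function name `caption_batch` and the `batch_size` default are assumptions made for illustration; `processor` and `model` are expected to be a `BlipProcessor` and `BlipForConditionalGeneration` loaded as in the How to Use section, and `images` a list of RGB PIL images.

```python
from typing import Any, List, Sequence


def caption_batch(
    images: Sequence[Any],
    processor: Any,
    model: Any,
    batch_size: int = 8,
    max_new_tokens: int = 50,
) -> List[str]:
    """Caption several images, returning captions in the same order as the input.

    Hypothetical helper, not part of the repository or the demo code.
    """
    captions: List[str] = []
    for start in range(0, len(images), batch_size):
        # Take the next slice of at most `batch_size` images.
        chunk = list(images[start:start + batch_size])

        # The processor accepts a list of images and stacks them into one batch.
        inputs = processor(images=chunk, return_tensors="pt").to(model.device)

        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

        # batch_decode turns each generated sequence back into a string.
        decoded = processor.batch_decode(output_ids, skip_special_tokens=True)
        captions.extend(text.strip() for text in decoded)
    return captions
```

Batching trades memory for throughput, so a smaller `batch_size` may be needed on machines with limited RAM or GPU memory.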

Limitations

This model may generate inaccurate or incomplete captions, especially for:

Complex scenes with many objects or people
Small or unclear objects
Low-quality or blurry images
Culturally specific contexts
Images requiring detailed reasoning or domain expertise

Generated captions should be treated as model-generated descriptions, not guaranteed factual annotations.

Ethical Considerations

This model should not be used as the sole source of truth for safety-critical, medical, legal, or identity-sensitive decisions.

Depending on the input image and the limitations of its training data, it may produce biased, incomplete, or incorrect descriptions.

Author

Yaekob Beyene Yowhanns
M.Sc. Artificial Intelligence and Computer Science
University of Calabria

GitHub: yaekobB
Hugging Face: YaekobB


Model size: 0.2B params (Safetensors)
Tensor type: F32
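
For a rough sense of what these figures imply, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope calculation only: the helper name `weight_memory_gb` is hypothetical, the 0.2B figure is rounded, and activations and framework overhead are not included.

```python
# Bytes used to store one parameter for common tensor types.
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2}


def weight_memory_gb(num_params: float, tensor_type: str = "F32") -> float:
    """Estimate the memory needed for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[tensor_type] / 1e9


# Roughly 0.2 billion parameters stored as 32-bit floats.
print(f"{weight_memory_gb(0.2e9, 'F32'):.1f} GB")  # prints "0.8 GB"
```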
