BLIP Caption Model

This repository contains a BLIP-based image captioning model used to generate natural-language captions from uploaded images.

The model is connected to a live Hugging Face Space demo:

👉 Multimodal Image Captioning with BLIP Demo

Model Description

This model is designed for automatic image captioning. Given an input image, it generates a short textual description of the visual content.

The project demonstrates the use of vision-language models for multimodal AI applications, combining computer vision and natural language generation.

Intended Use

This model can be used for:

  • Image caption generation
  • Vision-language AI demonstrations
  • Multimodal learning experiments
  • Educational and portfolio projects
  • Prototyping image-to-text applications

How to Use

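from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch

# Hub repository that hosts this model's weights and processor config
model_id = "YaekobB/blip-caption-model"

# Load the image/text preprocessor and the captioning model
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Convert to RGB so grayscale or RGBA files become 3-channel input
image = Image.open("your_image.jpg").convert("RGB")

# Preprocess the image into PyTorch tensors
inputs = processor(image, return_tensors="pt")

# Generate caption token IDs without tracking gradients
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)

# Convert the token IDs back to text, dropping special tokens
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)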
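
The steps above can be wrapped in a small reusable helper. The following is a minimal sketch, not code from this repository: `caption_image` is a hypothetical name, and the optional `prompt` argument assumes this checkpoint accepts a BLIP-style text prefix for conditional captioning, which has not been verified for this particular model.

```python
def caption_image(image, processor, model, prompt=None, max_new_tokens=50):
    """Return one caption for a single PIL image.

    `processor` and `model` are the objects created with `from_pretrained`
    in the snippet above. `prompt` is an optional text prefix (assumption:
    the checkpoint supports BLIP-style conditional captioning).
    """
    if prompt is None:
        # Unconditional captioning: the model sees only the image.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional captioning: the model continues the given prefix.
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    # To skip gradient tracking explicitly, wrap the call to this helper
    # in torch.no_grad(), as the snippet above does for generate().
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output[0], skip_special_tokens=True).strip()
```

For example, `caption_image(image, processor, model, prompt="a photo of")` asks the model to continue that prefix, while omitting `prompt` reproduces the behavior of the original snippet.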

Live Demo

A live inference demo is available on Hugging Face Spaces:

https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo

The demo allows users to upload one or more images and generate captions using the model.
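
Captioning several images locally can be sketched as follows. This is not the Space's actual code, which is not shown here. `caption_batch` and `batch_size` are hypothetical names, and the sketch assumes `processor` and `model` were loaded as in the How to Use section.

```python
def caption_batch(images, processor, model, batch_size=4, max_new_tokens=50):
    """Caption a list of PIL images, `batch_size` images per generate() call.

    Returns one caption per input image, in the same order.
    """
    captions = []
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size]
        # Passing a list of images yields one batched tensor input.
        inputs = processor(images=chunk, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # batch_decode returns one string per generated sequence.
        decoded = processor.batch_decode(output, skip_special_tokens=True)
        captions.extend(text.strip() for text in decoded)
    return captions
```

Processing in chunks keeps memory use bounded when many images are uploaded at once; a smaller `batch_size` trades speed for a lower peak memory footprint.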

Limitations

This model may generate inaccurate or incomplete captions, especially for:

  • Complex scenes with many objects or people
  • Small or unclear objects
  • Low-quality or blurry images
  • Culturally specific contexts
  • Images requiring detailed reasoning or domain expertise

Generated captions should be treated as model-generated descriptions, not guaranteed factual annotations.
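
Because captions are not guaranteed to be accurate, one option is to generate several candidates and let a person pick or correct the best one. The sketch below is an illustration under stated assumptions rather than part of this repository: `caption_candidates` is a hypothetical helper, and it assumes this checkpoint works with the standard `num_beams` and `num_return_sequences` generation arguments from transformers.

```python
def caption_candidates(image, processor, model, num_candidates=3,
                       num_beams=5, max_new_tokens=50):
    """Return up to `num_candidates` distinct captions for human review."""
    inputs = processor(images=image, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        # Beam search needs at least as many beams as returned sequences.
        num_beams=max(num_beams, num_candidates),
        num_return_sequences=num_candidates,
    )
    decoded = processor.batch_decode(output, skip_special_tokens=True)

    # Drop duplicates while keeping the order returned by generate().
    unique = []
    for text in decoded:
        text = text.strip()
        if text and text not in unique:
            unique.append(text)
    return unique
```

Showing a reviewer a few alternatives does not make any single caption more reliable, but it makes disagreement between candidates visible, which can hint that the image is hard for the model.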

Ethical Considerations

This model should not be used as the sole source of truth for safety-critical, medical, legal, or identity-sensitive decisions.

It may produce biased, incomplete, or incorrect descriptions depending on the input image and training data limitations.

Author

Yaekob Beyene Yowhanns
M.Sc. Artificial Intelligence and Computer Science
University of Calabria

GitHub: yaekobB
Hugging Face: YaekobB

Model size: 0.2B params (Safetensors, tensor type F32)