---
license: mit
tags:
  - image-captioning
  - blip
  - vision-language-model
  - multimodal-ai
  - computer-vision
  - deep-learning
  - transformers
  - pytorch
pipeline_tag: image-to-text
library_name: transformers
---

# BLIP Caption Model

This repository contains a BLIP-based image captioning model that generates natural-language captions for input images.

The model is connected to a live Hugging Face Space demo:

👉 [Multimodal Image Captioning with BLIP Demo](https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo)

## Model Description

This model is designed for automatic image captioning. Given an input image, it generates a short textual description of the visual content.

The project demonstrates the use of vision-language models for multimodal AI applications, combining computer vision and natural language generation.

## Intended Use

This model can be used for:

- Image caption generation
- Vision-language AI demonstrations
- Multimodal learning experiments
- Educational and portfolio projects
- Prototyping image-to-text applications

## How to Use

```python
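from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch

model_id = "YaekobB/blip-caption-model"

# Load the preprocessing pipeline and the model weights from the Hub.
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Convert to RGB so grayscale or RGBA files become 3-channel input.
image = Image.open("your_image.jpg").convert("RGB")

# Preprocess the image into a batch of PyTorch tensors.
inputs = processor(image, return_tensors="pt")

# Generate caption token IDs; gradients are not needed for inference.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)

# Decode the token IDs of the first (and only) sequence back into text.
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)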
```

## Live Demo

A live inference demo is available on Hugging Face Spaces:

[https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo](https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo)

The demo allows users to upload one or more images and generate captions using the model.
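
Captioning several images outside the demo can be sketched as follows. This is a minimal illustration, not the Space's actual code: the helper names `chunked` and `caption_images` and the default `batch_size` of 4 are assumptions made for this example. The heavy imports are placed inside `caption_images` so the batching helper can be used on its own.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def caption_images(
    paths: Sequence[str],
    batch_size: int = 4,  # assumed default; tune to available memory
    model_id: str = "YaekobB/blip-caption-model",
) -> List[str]:
    """Caption every image in `paths`, running the model one batch at a time."""
    # Deferred imports: only needed when captions are actually generated.
    import torch
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)

    captions: List[str] = []
    for batch in chunked(paths, batch_size):
        images = [Image.open(p).convert("RGB") for p in batch]
        # Passing a list of images produces one batched tensor.
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=50)
        # batch_decode returns one string per generated sequence.
        captions.extend(processor.batch_decode(output, skip_special_tokens=True))
    return captions
```

A call such as `caption_images(["first.jpg", "second.jpg"])` (placeholder file names) would return one caption per path, in the same order as the input.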

## Limitations

This model may generate inaccurate or incomplete captions, especially for:

- Complex scenes with many objects or people
- Small or unclear objects
- Low-quality or blurry images
- Culturally specific contexts
- Images requiring detailed reasoning or domain expertise

Generated captions should be treated as model-generated descriptions, not guaranteed factual annotations.
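
One way to keep that distinction visible downstream is to store each caption with explicit provenance. The sketch below is only an illustration: the `CaptionRecord` class, its field names, and the `make_record` helper are hypothetical and not part of this repository, and the caption string is a made-up example rather than real model output.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CaptionRecord:
    """A caption plus the metadata needed to tell it apart from a label."""
    image_path: str
    caption: str
    model_id: str = "YaekobB/blip-caption-model"
    source: str = "model-generated"  # raw model output is never ground truth
    human_reviewed: bool = False     # flip only after a person has checked it


def make_record(image_path: str, caption: str) -> CaptionRecord:
    """Wrap a raw caption string in a record that is marked as unreviewed."""
    return CaptionRecord(image_path=image_path, caption=caption.strip())


# Illustrative caption text, not actual output from the model.
record = make_record("your_image.jpg", " a dog running on a beach ")
print(json.dumps(asdict(record), indent=2))
```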

## Ethical Considerations

This model should not be used as the sole source of truth for safety-critical, medical, legal, or identity-sensitive decisions.

It may produce biased, incomplete, or incorrect descriptions, depending on the input image and the limitations of its training data.

## Author

**Yaekob Beyene Yowhanns**  
M.Sc. Artificial Intelligence and Computer Science  
University of Calabria  

GitHub: [yaekobB](https://github.com/yaekobB)  
Hugging Face: [YaekobB](https://huggingface.co/YaekobB)