---
license: mit
tags:
- image-captioning
- blip
- vision-language-model
- multimodal-ai
- computer-vision
- deep-learning
- transformers
- pytorch
pipeline_tag: image-to-text
library_name: transformers
---
# BLIP Caption Model
This repository contains a BLIP-based image captioning model used to generate natural-language captions from uploaded images.
The model is connected to a live Hugging Face Space demo:
👉 [Multimodal Image Captioning with BLIP Demo](https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo)
## Model Description
This model is designed for automatic image captioning. Given an input image, it generates a short textual description of the visual content.
The project demonstrates the use of vision-language models for multimodal AI applications, combining computer vision and natural language generation.
## Intended Use
This model can be used for:
- Image caption generation
- Vision-language AI demonstrations
- Multimodal learning experiments
- Educational and portfolio projects
- Prototyping image-to-text applications
## How to Use
```python
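from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch

model_id = "YaekobB/blip-caption-model"

# Load the processor (image preprocessing and tokenizer) and the captioning model.
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Convert to RGB so grayscale or RGBA files are handled consistently.
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)

# Decode the generated token IDs of the first (only) sequence into text.
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)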
```
## Live Demo
A live inference demo is available on Hugging Face Spaces:
[https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo](https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo)
The demo allows users to upload one or more images and generate captions using the model.
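Since the demo accepts more than one image, a similar workflow can be sketched locally. The block below is a minimal sketch, not the demo's actual code: `chunked`, `caption_in_batches`, and `make_blip_captioner` are hypothetical helper names introduced here, and the default batch size of 4 is an arbitrary assumption. The model call is passed in as a callable so the batching logic can be exercised without loading the model.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def caption_in_batches(
    images: Sequence[T],
    caption_batch: Callable[[Sequence[T]], List[str]],
    batch_size: int = 4,  # assumption: tune to available memory
) -> List[str]:
    """Caption `images` batch by batch, preserving input order."""
    captions: List[str] = []
    for batch in chunked(images, batch_size):
        batch_captions = caption_batch(batch)
        if len(batch_captions) != len(batch):
            raise ValueError("caption_batch must return one caption per image")
        captions.extend(batch_captions)
    return captions


def make_blip_captioner(model_id: str = "YaekobB/blip-caption-model"):
    """Build a `caption_batch` callable backed by the BLIP model.

    The heavy imports are local so the helpers above work without torch.
    """
    import torch
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)

    def caption_batch(batch):
        # The processor accepts a list of PIL images and returns one batched tensor.
        inputs = processor(images=list(batch), return_tensors="pt")
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=50)
        # batch_decode returns one string per generated sequence.
        return processor.batch_decode(output, skip_special_tokens=True)

    return caption_batch
```

With a list of RGB `PIL.Image` objects in `images`, calling `caption_in_batches(images, make_blip_captioner())` should return one caption per image in the same order. Building the captioner once and reusing it avoids reloading the model for every batch.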
## Limitations
This model may generate inaccurate or incomplete captions, especially for:
- Complex scenes with many objects or people
- Small or unclear objects
- Low-quality or blurry images
- Culturally specific contexts
- Images requiring detailed reasoning or domain expertise
Generated captions should be treated as model-generated descriptions, not guaranteed factual annotations.
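One way to keep that distinction visible downstream is to store each caption together with its provenance. The following is a minimal sketch under stated assumptions: the `CaptionRecord` type, its field names, and the `"model-generated"` / `"human-reviewed"` labels are hypothetical conventions introduced here, not part of this model or of the Transformers library.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CaptionRecord:
    """A caption plus enough metadata to tell how it was produced."""
    image_path: str
    caption: str
    model_id: str = "YaekobB/blip-caption-model"
    source: str = "model-generated"  # hypothetical label
    verified: bool = False           # no human has checked this caption yet


def mark_verified(record: CaptionRecord,
                  corrected_caption: Optional[str] = None) -> CaptionRecord:
    """Return a copy flagged as human-reviewed, optionally with a corrected caption."""
    return replace(
        record,
        caption=record.caption if corrected_caption is None else corrected_caption,
        source="human-reviewed",
        verified=True,
    )


def to_json_line(record: CaptionRecord) -> str:
    """Serialize a record as one JSON line, e.g. for a JSONL annotation file."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Records stay unverified by default, so a consumer has to opt in explicitly before treating a caption as a reviewed annotation.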
## Ethical Considerations
This model should not be used as the sole source of truth for safety-critical, medical, legal, or identity-sensitive decisions.
It may produce biased, incomplete, or incorrect descriptions depending on the input image and training data limitations.
## Author
**Yaekob Beyene Yowhanns**
M.Sc. Artificial Intelligence and Computer Science
University of Calabria
GitHub: [yaekobB](https://github.com/yaekobB)
Hugging Face: [YaekobB](https://huggingface.co/YaekobB)