---
license: mit
tags:
- image-captioning
- blip
- vision-language-model
- multimodal-ai
- computer-vision
- deep-learning
- transformers
- pytorch
pipeline_tag: image-to-text
library_name: transformers
---

# BLIP Caption Model

This repository contains a BLIP-based image captioning model used to generate natural-language captions from uploaded images.

The model is connected to a live Hugging Face Space demo:

👉 [Multimodal Image Captioning with BLIP Demo](https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo)

## Model Description

This model is designed for automatic image captioning. Given an input image, it generates a short textual description of the visual content.

The project demonstrates the use of vision-language models for multimodal AI applications, combining computer vision and natural language generation.

## Intended Use

This model can be used for:

- Image caption generation
- Vision-language AI demonstrations
- Multimodal learning experiments
- Educational and portfolio projects
- Prototyping image-to-text applications

## How to Use

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch

model_id = "YaekobB/blip-caption-model"

processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("your_image.jpg").convert("RGB")

inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)

caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)
```

## Live Demo

A live inference demo is available on Hugging Face Spaces:

[https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo](https://huggingface.co/spaces/YaekobB/image-captioning-blip-demo)

The demo allows users to upload one or more images and generate captions using the model.

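
The demo accepts one or more images, while the usage snippet in this card captions a single image. The sketch below shows one way several images might be captioned locally in small batches. It is not the demo's actual code: the helper names `chunked`, `caption_batch`, and `caption_files`, and the `batch_size` default, are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def chunked(items: Sequence, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def caption_batch(images, processor, model, max_new_tokens: int = 50) -> List[str]:
    """Caption a list of PIL images in one forward pass.

    `processor` and `model` are expected to behave like BlipProcessor and
    BlipForConditionalGeneration (callable processor, `generate`, `batch_decode`).
    """
    inputs = processor(images=images, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    captions = processor.batch_decode(output_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]


def caption_files(
    paths: Iterable[str],
    model_id: str = "YaekobB/blip-caption-model",
    batch_size: int = 4,  # hypothetical default; tune to available memory
) -> List[Tuple[str, str]]:
    """Load the model once, then caption every image file in `paths`."""
    # Heavy imports are kept local so the helpers above stay importable
    # without torch, Pillow, or transformers installed.
    import torch
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    results: List[Tuple[str, str]] = []
    for batch_paths in chunked(list(paths), batch_size):
        images = [Image.open(path).convert("RGB") for path in batch_paths]
        with torch.no_grad():
            captions = caption_batch(images, processor, model)
        results.extend(zip(batch_paths, captions))
    return results
```

Calling `caption_files(["cat.jpg", "dog.jpg"])` (placeholder file names) would return a list of `(path, caption)` pairs in input order. Loading the processor and model once and reusing them across batches avoids repeated downloads and initialization.
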
## Limitations

This model may generate inaccurate or incomplete captions, especially for:

- Complex scenes with many objects or people
- Small or unclear objects
- Low-quality or blurry images
- Culturally specific contexts
- Images requiring detailed reasoning or domain expertise

Generated captions should be treated as model-generated descriptions, not guaranteed factual annotations.

## Ethical Considerations

This model should not be used as the sole source of truth for safety-critical, medical, legal, or identity-sensitive decisions. It may produce biased, incomplete, or incorrect descriptions depending on the input image and training data limitations.

## Author

**Yaekob Beyene Yowhanns**  
M.Sc. Artificial Intelligence and Computer Science  
University of Calabria

GitHub: [yaekobB](https://github.com/yaekobB)  
Hugging Face: [YaekobB](https://huggingface.co/YaekobB)