BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Paper: arXiv:2201.12086
BLIP (Bootstrapping Language-Image Pre-training) is a unified vision-language model for image captioning, visual question answering, and related understanding and generation tasks. It uses a Vision Transformer (ViT) to extract image features and connects them to transformer-based language components, so a single pre-trained backbone supports both understanding and generation.

This checkpoint starts from BLIP weights pre-trained for image captioning and is fine-tuned on a food-specific dataset, so it is intended for generating captions of food-related images.
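As a minimal sketch of how such a checkpoint is typically used for captioning with the Hugging Face Transformers library. The repo id and image URL below are placeholders, not this model's actual identifiers:

```python
# Minimal BLIP captioning sketch using Hugging Face Transformers.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Placeholder repo id; substitute the actual id of this fine-tuned checkpoint.
model_id = "your-username/blip-food-captioning"

processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Load an example food image (placeholder URL; any RGB image works).
url = "https://example.com/food.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Preprocess the image and generate a caption autoregressively.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```

Passing a short text prompt to the processor (e.g. `processor(images=image, text="a photo of", return_tensors="pt")`) conditions the decoder on a prefix, which can steer the style of the generated caption.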