Model Card for Model ID

This is a fine-tuned version of Salesforce's BLIP-2 model, adapted for the task of image captioning using the QLoRA methodology for parameter-efficient fine-tuning. The model is trained on the Flickr8k dataset to generate descriptive, human-like captions for a wide variety of images.

Model Details

Model Description

This model adapts the BLIP-2 vision-language architecture, specifically the Salesforce/blip2-opt-2.7b variant. It has been fine-tuned to generate accurate, contextually relevant captions for images.

The fine-tuning was performed using QLoRA (Quantized Low-Rank Adaptation), a technique that reduces the compute and memory required for training. The base model is quantized to 4-bit precision and its weights are kept frozen; only small, low-rank adapter matrices are trained on top of it. This makes it possible to adapt large models on consumer-grade hardware while retaining most of the base model's performance.
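To make the "small, low-rank adapter matrices" concrete, here is a minimal sketch of the parameter arithmetic for a single linear layer. The hidden size of 2560 (taken here as the width of the OPT-2.7B language model) and the adapter rank of 16 are illustrative assumptions; this card does not state the rank or the target modules actually used.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one adapted linear layer."""
    # Original weight W (d_out x d_in): frozen, and stored in 4-bit under QLoRA.
    frozen = d_out * d_in
    # Adapter matrices A (rank x d_in) and B (d_out x rank): the only trained weights.
    trainable = rank * (d_in + d_out)
    return frozen, trainable


# Assumed values for illustration only; not taken from this model card.
HIDDEN = 2560  # assumed hidden size of the language model
RANK = 16      # hypothetical LoRA rank

frozen, trainable = lora_param_counts(HIDDEN, HIDDEN, RANK)
ratio = trainable / frozen
print(f"frozen={frozen:,} trainable={trainable:,} ratio={ratio:.2%}")
```

Under these assumptions the adapter adds 81,920 trainable parameters next to 6,553,600 frozen ones, about 1.25% of the layer. The ratio scales linearly with the chosen rank.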


Developed by: Salesforce
Model type: Vision-Language Model (VLM) based on BLIP-2
Language(s) (NLP): English (en)
License: Apache 2.0
Finetuned from model: Salesforce/blip2-opt-2.7b

Format: Safetensors
Model size: 4B params
Tensor type: F32, F16