Model Details

Model Description

An image-to-text model that captions images in Yoda-style English, fine-tuned from BLIP-base with the transformers package.

Developed by: vkao8264
Model type: Image-to-text
Language(s) (NLP): English
License: bsd-3-clause
Finetuned from model: Salesforce/blip-image-captioning-base

Uses

import torch
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("vkao8264/blip-yoda-captioning").to(device)

filepath = "path-to-your-image"
raw_image = Image.open(filepath).convert("RGB")

# The inputs must live on the same device as the model
inputs = processor(raw_image, return_tensors="pt").to(device)
output_tokens = model.generate(**inputs)
caption = processor.decode(output_tokens[0], skip_special_tokens=True)
print(caption)

Training Details

Training Data

The model was fine-tuned on 30,000 image-caption pairs from the COCO Captions dataset, taken from captions_train2014.
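As a rough illustration of how a subset of image-caption pairs could be drawn from a COCO captions annotation file, here is a minimal sketch. The card does not say how the 30,000 pairs were selected, so uniform random sampling, the function name `sample_caption_pairs`, and the seed are all assumptions; only the `images` / `annotations` layout follows the standard COCO captions JSON format.

```python
import random


def sample_caption_pairs(coco, n, seed=0):
    """Return up to n (file_name, caption) pairs drawn without replacement.

    `coco` is a dict in the COCO captions layout:
    {"images": [{"id", "file_name"}, ...],
     "annotations": [{"image_id", "caption"}, ...]}
    """
    # Map image id -> file name so each annotation resolves to an image file.
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    pairs = [
        (id_to_file[ann["image_id"]], ann["caption"].strip())
        for ann in coco["annotations"]
        if ann["image_id"] in id_to_file
    ]
    # Assumption: uniform random sampling with a fixed seed for repeatability.
    rng = random.Random(seed)
    return rng.sample(pairs, min(n, len(pairs)))


# Tiny in-memory stand-in for the annotation file (hypothetical entries).
toy = {
    "images": [
        {"id": 1, "file_name": "img_1.jpg"},
        {"id": 2, "file_name": "img_2.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "caption": "A man riding a wave on a surfboard."},
        {"image_id": 1, "caption": "A surfer on a large wave. "},
        {"image_id": 2, "caption": "Two dogs play in the grass."},
    ],
}
subset = sample_caption_pairs(toy, 2)
```

With the real data, `coco` would come from `json.load` on the downloaded annotation file and `n` would be 30000.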

Before training, the captions were rewritten as Yoda-style captions using Phi-3 with few-shot prompting.

The scripts are available at https://github.com/vincent8264/yoda_captioning
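To make the few-shot conversion step more concrete, the sketch below shows one way a few-shot prompt for rewriting a caption could be assembled. It is not the author's script (those live in the repository linked above): the instruction wording, the example pairs in `FEW_SHOT_EXAMPLES`, and the function name `build_yoda_prompt` are all hypothetical, and the call to the language model itself is left out.

```python
# Hypothetical (plain caption, Yoda-style caption) demonstration pairs.
FEW_SHOT_EXAMPLES = [
    ("A man riding a wave on a surfboard.",
     "Riding a wave on a surfboard, a man is."),
    ("Two dogs play in the grass.",
     "In the grass, two dogs play."),
]


def build_yoda_prompt(caption, examples=FEW_SHOT_EXAMPLES):
    """Build a few-shot prompt that ends where the model should continue."""
    lines = ["Rewrite each caption in the speaking style of Yoda.", ""]
    # Each demonstration shows the model one input/output pair.
    for plain, yoda in examples:
        lines += [f"Caption: {plain}", f"Yoda: {yoda}", ""]
    # The final entry is left open so the model completes the Yoda line.
    lines += [f"Caption: {caption}", "Yoda:"]
    return "\n".join(lines)


prompt = build_yoda_prompt("A cat sitting on a couch.")
print(prompt)
```

The resulting string would be sent to the language model, and the generated continuation after the final `Yoda:` would replace the original caption in the training set.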

Safetensors

Model size: 0.2B params
Tensor type: F32
