Dataset: `jxie/flickr8k`
A multimodal model combining a Vision Transformer (ViT-B/16) and GPT-2 for image captioning, trained on the Flickr8K dataset.

This model generates natural language captions for images by:

- encoding the input image with a frozen ViT-B/16 backbone,
- projecting the visual features into GPT-2's embedding space through a trainable projection layer, and
- decoding the caption autoregressively with a frozen GPT-2 language model.
Install the dependencies:

```bash
pip install torch torchvision transformers pillow huggingface_hub
```
```python
import torch
from transformers import GPT2Tokenizer
from PIL import Image
from torchvision import transforms

# Load the FP32 checkpoint
checkpoint = torch.load("model_fp32/model_checkpoint.pth", map_location="cpu")

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("model_fp32/tokenizer")

# Load your model architecture (you need to define this)
# model = YourVisionGPTModel(config)
# model.load_state_dict(checkpoint['model_state_dict'])
# model.eval()

print("Model loaded successfully!")
```
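The checkpoint stores weights for a wrapper class that is not defined in this repository. A minimal sketch of what such a wrapper could look like, matching the frozen-encoder / trainable-projection / frozen-decoder layout shown in the architecture diagram (all names here, including `VisionGPTModel` and its constructor arguments, are hypothetical, not the repository's actual class):

```python
import torch
import torch.nn as nn

class VisionGPTModel(nn.Module):
    """Hypothetical wrapper: frozen vision encoder -> trainable projection -> frozen GPT-2."""

    def __init__(self, vision_encoder, language_model, vision_dim=768, text_dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a ViT-B/16 backbone
        self.language_model = language_model   # e.g. a GPT-2 LM head model
        self.projection = nn.Linear(vision_dim, text_dim)  # the only trainable part
        # Freeze both backbones; only the projection layer receives gradients.
        for backbone in (self.vision_encoder, self.language_model):
            for p in backbone.parameters():
                p.requires_grad = False

    def encode_image(self, images):
        # Map pooled visual features into GPT-2's embedding space.
        features = self.vision_encoder(images)
        return self.projection(features)
```

With a class of this shape defined, the commented-out `load_state_dict` lines above can be filled in.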
```python
# Load the FP16 checkpoint
checkpoint = torch.load("model_fp16/model_checkpoint.pth", map_location="cpu")

# Load the model and convert it to FP16
# model = YourVisionGPTModel(config)
# model.load_state_dict(checkpoint['model_state_dict'])
# model.half()  # Convert weights to FP16
# model.eval()

# For FP16 inference, also convert the input images to FP16
```
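Concretely, half-precision weights require half-precision inputs, or the forward pass will fail with a dtype mismatch. A small self-contained check of the conversion (the tensor here stands in for a preprocessed image batch):

```python
import torch

# Stand-in for a preprocessed image batch (FP32 by default)
image_tensor = torch.randn(1, 3, 224, 224)

# Match the model's FP16 weights before the forward pass
image_tensor_fp16 = image_tensor.half()

print(image_tensor_fp16.dtype)  # torch.float16
```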
```python
# Standard ImageNet preprocessing for the ViT-B/16 encoder
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Lambda(lambda x: x.convert('RGB')),  # ensure 3-channel input
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],  # ImageNet mean
        std=[0.229, 0.224, 0.225],   # ImageNet std
    ),
])
```
```python
# Load and preprocess the image
image = Image.open("your_image.jpg")
image_tensor = image_transform(image).unsqueeze(0)  # add batch dimension

# Generate a caption
with torch.no_grad():
    # Forward pass
    generated_ids = model.generate(
        image_tensor,
        max_length=50,
        num_beams=5,
        temperature=0.7,
    )

# Decode the caption
caption = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"Generated caption: {caption}")
```
```
┌─────────────────┐
│   Input Image   │
│    (224x224)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ViT-B/16     │
│    (frozen)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Projection    │
│   (trainable)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│      GPT-2      │
│    (frozen)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Caption Output  │
└─────────────────┘
```
If you use this model, please cite:
```bibtex
@misc{vision-gpt-flickr8k,
  author       = {gurumurthy3},
  title        = {Vision-GPT: Image Captioning with ViT and GPT-2},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/gurumurthy3/vision-gpt-flickr8k}}
}
```
MIT License