---
license: mit
---

# Image Captioning with Vision Transformers
This repository contains a transformer-based image captioning model trained on the MS COCO dataset.
## 🚀 Demo
Try the live demo here: https://huggingface.co/spaces/mostafahagali/vit-image-captioning
## 🧠 Architecture
- Vision Transformer (ViT-B/16) image encoder
- Transformer encoder over the extracted image features
- Transformer decoder that generates the caption token by token (a minimal sketch follows below)
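To make the wiring concrete, here is a minimal sketch of how such an encoder-decoder could be assembled in PyTorch. It is illustrative only: it uses torchvision's `vit_b_16` backbone, folds the intermediate Transformer-encoder stage into the decoder's cross-attention for brevity, and the layer sizes are assumptions rather than the repository's actual configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import ViT_B_16_Weights, vit_b_16


class CaptioningSketch(nn.Module):
    """Illustrative wiring only; not the repository's actual implementation."""

    def __init__(self, vocab_size: int, pad_id: int, d_model: int = 768):
        super().__init__()
        # ViT-B/16 backbone as the image encoder (its hidden size is 768).
        self.encoder = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
        self.encoder.heads = nn.Identity()  # drop the classification head
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # Encode each image to a single 768-d vector and treat it as a
        # one-token "memory" sequence that the decoder cross-attends to.
        memory = self.encoder(images).unsqueeze(1)  # (B, 1, 768)
        tgt = self.embed(tokens)                    # (B, T, 768)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                    # (B, T, vocab_size)
```

The `use_vit=True` flag in the Usage section below suggests the ViT backbone is one configurable option rather than hard-wired.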
## 📊 Training
- Dataset: MS COCO 2017
- Loss: cross-entropy (see the training-step sketch below)
- Optimizer: AdamW
- Framework: PyTorch
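For concreteness, here is a minimal sketch of one training step under these settings, using teacher forcing with cross-entropy and AdamW. The learning rate, weight decay, and the `model(images, tokens)` call signature are assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn

# Assumed context: `model` maps (images, tokens) to logits of shape
# (B, T, vocab_size), and `pad_id` is the index of the "<pad>" token.
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)  # padding is not penalized
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

def train_step(images: torch.Tensor, tokens: torch.Tensor) -> float:
    # Teacher forcing: feed tokens[:, :-1], predict tokens[:, 1:].
    logits = model(images, tokens[:, :-1])
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```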
## 📦 Files
- `best_model.pth`: trained model weights
- `vocab.pkl`: vocabulary mapping
## 🛠️ Usage
```python
import torch
import pickle
from image_captioning_model import ImageCaptioningModel

# Load the vocabulary saved during training.
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)

# Rebuild the model with the same configuration used for training.
model = ImageCaptioningModel(
    vocab_size=len(vocab),
    pad_id=vocab.word2idx["<pad>"],
    use_vit=True,
)

# Load the trained weights and switch to inference mode.
state_dict = torch.load("best_model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
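The snippet above only loads the model; the repository's generation API is not shown here. Below is a greedy-decoding sketch, assuming the model can be called as `model(images, tokens)` to produce per-token logits and that the vocabulary defines `<start>`/`<end>` entries and an `idx2word` mapping; these names are assumptions and should be checked against the actual code.

```python
import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet preprocessing for ViT-B/16; the repository's exact
# pipeline may differ.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def caption(image_path: str, max_len: int = 30) -> str:
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    # "<start>"/"<end>" names and `idx2word` are assumptions; check the
    # vocabulary object for the actual attribute and token names.
    tokens = [vocab.word2idx["<start>"]]
    for _ in range(max_len):
        logits = model(image, torch.tensor([tokens]))
        next_id = logits[0, -1].argmax().item()  # greedy choice
        if next_id == vocab.word2idx["<end>"]:
            break
        tokens.append(next_id)
    return " ".join(vocab.idx2word[t] for t in tokens[1:])
```

Called as, for example, `caption("example.jpg")` (the path is hypothetical).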
---

Author: Mostafa Hagali