Image Captioning with Vision Transformers

This repository contains a transformer-based image captioning model trained on the MS COCO dataset.

🚀 Demo

Try the live demo here: https://huggingface.co/spaces/mostafahagali/vit-image-captioning

🧠 Architecture

  • Vision Transformer (ViT-B/16) encoder
  • Transformer encoder
  • Transformer decoder
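
The repository does not show the model definition inline, so here is a minimal sketch of the encoder-decoder wiring described above. Everything in it is illustrative rather than the repository's actual code: the class name `TinyCaptioner`, the dimensions (`d_model=256`, two layers), and the use of a plain `nn.TransformerEncoder` as a stand-in for the ViT-B/16 patch-embedding stack.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Illustrative encoder-decoder captioner (not the repo's ImageCaptioningModel)."""

    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # Stand-in for ViT-B/16: in the real model, patch features come from the ViT.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, tokens):
        memory = self.encoder(patch_feats)           # (B, num_patches, d_model)
        tgt = self.embed(tokens)                     # (B, seq_len, d_model)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                        # (B, seq_len, vocab_size)
```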

📊 Training

  • Dataset: MS COCO 2017
  • Loss: Cross-Entropy
  • Optimizer: AdamW
  • Framework: PyTorch
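
A conventional training step matching the choices above (teacher forcing, cross-entropy with the padding index ignored, AdamW) could look like the following sketch. `train_step`, its argument names, and `PAD_ID` are hypothetical, not code from this repository; the real padding index would come from `vocab.word2idx["<pad>"]`.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed padding index for illustration
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

def train_step(model, optimizer, image_feats, captions):
    """One teacher-forced step: feed caption[:-1], predict caption[1:]."""
    logits = model(image_feats, captions[:, :-1])          # (B, T-1, vocab_size)
    loss = criterion(logits.reshape(-1, logits.size(-1)),  # flatten for cross-entropy
                     captions[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer itself would be created with something like `torch.optim.AdamW(model.parameters(), lr=...)`, with the learning rate unspecified here.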

📦 Files

  • best_model.pth: Trained model weights
  • vocab.pkl: Vocabulary mapping

πŸ› οΈ Usage

```python
import torch
import pickle
from image_captioning_model import ImageCaptioningModel

# Load the vocabulary saved during training
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)

model = ImageCaptioningModel(
    vocab_size=len(vocab),
    pad_id=vocab.word2idx["<pad>"],
    use_vit=True,
)

# Load the trained weights and switch to inference mode
state_dict = torch.load("best_model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
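
The snippet above only loads the model; generating a caption is not shown. Assuming the model maps `(image_features, token_prefix)` to per-token logits, a greedy decoding loop might look like this sketch. `greedy_caption` and its `bos_id`/`eos_id` arguments are illustrative; the real token ids would come from `vocab.word2idx`.

```python
import torch

@torch.no_grad()
def greedy_caption(model, image_feats, bos_id, eos_id, max_len=20):
    """Greedily decode a caption: repeatedly append the most likely next token."""
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(image_feats, tokens)          # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax().item()      # most likely next token
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return tokens[0].tolist()
```

The resulting id list would then be mapped back to words with the vocabulary's reverse mapping, stopping at the end-of-sequence token.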

---
license: mit
---

Author: Mostafa Hagali