Image Captioning with Vision Transformers

This repository contains a transformer-based image captioning model trained on the MS COCO dataset.

🚀 Demo

Try the live demo here: https://huggingface.co/spaces/mostafahagali/vit-image-captioning

🧠 Architecture

  • Vision Transformer (ViT-B/16) encoder
  • Transformer encoder
  • Transformer decoder
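
The repository does not show the model definition inline, so here is a minimal sketch of the encoder-decoder wiring described above. Everything in it is illustrative rather than the repository's actual code: the class name `TinyCaptioner`, the dimensions (`d_model=256`, two layers), and the use of a plain `nn.TransformerEncoder` as a stand-in for the ViT-B/16 patch-embedding stack.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Illustrative encoder-decoder captioner (not the repo's ImageCaptioningModel)."""

    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # Stand-in for ViT-B/16: in the real model, patch features come from the ViT.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, tokens):
        memory = self.encoder(patch_feats)           # (B, num_patches, d_model)
        tgt = self.embed(tokens)                     # (B, seq_len, d_model)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                        # (B, seq_len, vocab_size)
```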

📊 Training

  • Dataset: MS COCO 2017
  • Loss: Cross-Entropy
  • Optimizer: AdamW
  • Framework: PyTorch
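
A conventional training step matching the choices above (teacher forcing, cross-entropy with the padding index ignored, AdamW) could look like the following sketch. `train_step`, its argument names, and `PAD_ID` are hypothetical, not code from this repository; the real padding index would come from `vocab.word2idx["<pad>"]`.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed padding index for illustration
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

def train_step(model, optimizer, image_feats, captions):
    """One teacher-forced step: feed caption[:-1], predict caption[1:]."""
    logits = model(image_feats, captions[:, :-1])          # (B, T-1, vocab_size)
    loss = criterion(logits.reshape(-1, logits.size(-1)),  # flatten for cross-entropy
                     captions[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer itself would be created with something like `torch.optim.AdamW(model.parameters(), lr=...)`, with the learning rate unspecified here.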

📦 Files

  • best_model.pth: Trained model weights
  • vocab.pkl: Vocabulary mapping

πŸ› οΈ Usage

```python
import torch
import pickle
from image_captioning_model import ImageCaptioningModel

# Load the vocabulary saved during training
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)

model = ImageCaptioningModel(
    vocab_size=len(vocab),
    pad_id=vocab.word2idx["<pad>"],
    use_vit=True,
)

# Load the trained weights and switch to inference mode
state_dict = torch.load("best_model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
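
The snippet above only loads the model; generating a caption is not shown. Assuming the model maps `(image_features, token_prefix)` to per-token logits, a greedy decoding loop might look like this sketch. `greedy_caption` and its `bos_id`/`eos_id` arguments are illustrative; the real token ids would come from `vocab.word2idx`.

```python
import torch

@torch.no_grad()
def greedy_caption(model, image_feats, bos_id, eos_id, max_len=20):
    """Greedily decode a caption: repeatedly append the most likely next token."""
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(image_feats, tokens)          # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax().item()      # most likely next token
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return tokens[0].tolist()
```

The resulting id list would then be mapped back to words with the vocabulary's reverse mapping, stopping at the end-of-sequence token.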

---
license: mit
---

Author: Mostafa Hagali