Image Captioning with Vision Transformers
This repository contains a transformer-based image captioning model trained on the MS COCO dataset.
Demo
Try the live demo here: https://huggingface.co/spaces/mostafahagali/vit-image-captioning
Architecture
- Vision Transformer (ViT-B/16) encoder
- Transformer encoder
- Transformer decoder
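To make the encoder stage concrete, the sketch below works out how many tokens a ViT-B/16 encoder hands to the layers listed after it. The 16-pixel patch size comes from the "/16" in the model name and the 768-wide embedding is the standard ViT-Base setting; the 224x224 input resolution is an assumption, since the repository does not state it.

```python
# Sketch: sequence length produced by a ViT-B/16 patch embedding.
# ASSUMPTION: 224x224 input images; the model card does not state the resolution.

VIT_B_HIDDEN_DIM = 768  # embedding width of the standard ViT-Base configuration


def vit_token_count(image_size: int = 224, patch_size: int = 16, cls_token: bool = True) -> int:
    """Return the number of tokens after splitting a square image into patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # The standard ViT prepends one learnable [CLS] token to the patch tokens.
    return num_patches + (1 if cls_token else 0)


if __name__ == "__main__":
    print(f"{vit_token_count()} tokens of width {VIT_B_HIDDEN_DIM}")
```

Under that assumption each image becomes 14 x 14 = 196 patch tokens plus one class token. Whether this model keeps or drops the class token before the Transformer encoder is not documented here.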
Training
- Dataset: MS COCO 2017
- Loss: Cross-Entropy
- Optimizer: AdamW
- Framework: PyTorch
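For captioning, cross-entropy is typically computed per target token and averaged over the positions that are not padding. The following is a dependency-free sketch of that calculation, not the repository's training code; the function name `masked_cross_entropy` is hypothetical, and skipping `<pad>` positions is an assumption suggested by the `pad_id` argument in the usage example.

```python
import math


def masked_cross_entropy(logits, targets, pad_id):
    """Mean negative log-likelihood over non-padding target tokens.

    logits:  list of per-position score lists (one score per vocabulary entry)
    targets: list of target token ids, same length as logits
    pad_id:  id of the padding token; those positions are skipped (assumption)
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == pad_id:
            continue  # padding positions contribute nothing to the loss
        # log-sum-exp with the max subtracted for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / count if count else 0.0
```

In PyTorch the same effect is usually obtained with `torch.nn.CrossEntropyLoss(ignore_index=pad_id)`; whether this model was trained that way is not stated.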
Files
- best_model.pth: Trained model weights
- vocab.pkl: Vocabulary mapping
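The usage code reads `len(vocab)` and `vocab.word2idx["<pad>"]`, so the pickled object is evidently a class instance with a `word2idx` dictionary. A minimal sketch of such a class is shown below. Everything apart from `word2idx`, `__len__`, and the `<pad>` token is hypothetical: the class name, `idx2word`, `encode`, and the other special tokens are illustrative assumptions, not the contents of `vocab.pkl`.

```python
class Vocabulary:
    """Hypothetical stand-in for the object stored in vocab.pkl."""

    # Only "<pad>" is confirmed by the usage code; the rest are assumptions.
    SPECIALS = ("<pad>", "<start>", "<end>", "<unk>")

    def __init__(self, words):
        self.word2idx = {}  # token -> integer id
        self.idx2word = []  # integer id -> token
        for w in (*self.SPECIALS, *words):
            if w not in self.word2idx:
                self.word2idx[w] = len(self.idx2word)
                self.idx2word.append(w)

    def __len__(self):
        return len(self.idx2word)

    def encode(self, caption):
        """Map a whitespace-tokenized caption to ids, using <unk> for unknown words."""
        unk = self.word2idx["<unk>"]
        return [self.word2idx.get(w, unk) for w in caption.lower().split()]
```

Because `pickle` stores a reference to the class rather than its code, the real vocabulary class has to be importable under its original module path before `vocab.pkl` can be loaded.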
Usage
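import pickle

import torch

from image_captioning_model import ImageCaptioningModel

# Load the vocabulary (its class must be importable for unpickling to work)
with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)

# Build the model with the ViT encoder enabled
model = ImageCaptioningModel(
    vocab_size=len(vocab),
    pad_id=vocab.word2idx["<pad>"],
    use_vit=True,
)

# Load the trained weights on CPU and switch to inference mode
state_dict = torch.load("best_model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()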
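Once the model is in evaluation mode, a caption is generated one token at a time. The sketch below shows greedy decoding in a model-agnostic form, because the forward signature of `ImageCaptioningModel` is not documented here. `next_scores` is a hypothetical callable that would wrap the model call, and the `start_id` and `end_id` tokens are likewise assumptions.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_scores: Callable[[List[int]], Sequence[float]],
    start_id: int,
    end_id: int,
    max_len: int = 20,
) -> List[int]:
    """Generate token ids by repeatedly picking the highest-scoring next token.

    next_scores is a hypothetical callable: given the ids generated so far,
    it returns one score per vocabulary entry for the next position.
    """
    tokens = [start_id]
    for _ in range(max_len):
        scores = next_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if best == end_id:
            break  # the end token terminates the caption
        tokens.append(best)
    return tokens[1:]  # drop the start token
```

With the real model, `next_scores` would run a forward pass on the image and the partial caption and return the logits for the last position. The resulting ids would then be mapped back to words through the vocabulary.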
License: MIT
Author: Mostafa Hagali