Image Captioning Model

This repository contains a custom PyTorch image captioning model. The model receives an input image and generates a natural-language caption describing the image.

The architecture is built from two main components:

Image Encoder: EfficientNet-V2-S backbone pretrained on ImageNet.
Text Decoder: Transformer decoder that generates captions token by token.

The model was trained for image caption generation using COCO-style image-caption pairs.

Model Architecture

The model follows an encoder-decoder structure:

Input Image
   ↓
EfficientNet-V2-S Image Encoder
   ↓
Image Feature Tokens
   ↓
Transformer Text Decoder
   ↓
Generated Caption

Image Encoder

The encoder uses EfficientNet_V2_S from torchvision.models.

The image encoder extracts visual features from the input image and projects them into a 256-dimensional embedding space. The spatial feature map (7 x 7 for a 224 x 224 input) is treated as a sequence of 49 visual tokens.

Encoder details:

Backbone: EfficientNet-V2-S
Input image size: 224 x 224
Output visual tokens: 49
Embedding dimension: 256
ImageNet normalization: Yes

Text Decoder

The decoder is a Transformer decoder that generates captions autoregressively.

Decoder details:

Vocabulary size: 9,721
Embedding dimension: 256
Number of Transformer decoder layers: 6
Number of attention heads: 8
Feed-forward dimension: 1024
Maximum caption length: 52
Dropout: 0.1
Decoding methods: Greedy search and beam search
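
A decoder with these hyperparameters could look roughly like the sketch below. It is built on torch.nn.TransformerDecoder; the class name DecoderSketch, the learned positional embedding, and the layer layout are assumptions, since the actual implementation lives in the notebook.

```python
import torch
import torch.nn as nn


class DecoderSketch(nn.Module):
    """Hypothetical caption decoder using the hyperparameters listed above."""

    def __init__(self, vocab_size=9721, embed_dim=256, num_heads=8,
                 num_layers=6, ff_dim=1024, max_len=52, dropout=0.1):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.pos = nn.Embedding(max_len, embed_dim)  # learned positions (assumption)
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True,
        )
        self.layers = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, token_ids, memory):
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)
        # True above the diagonal blocks attention to future tokens.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=token_ids.device),
            diagonal=1,
        )
        hidden = self.layers(x, memory, tgt_mask=causal)
        return self.out(hidden)  # (B, seq_len, vocab_size) logits


decoder = DecoderSketch().eval()
memory = torch.randn(2, 49, 256)             # stand-in for encoder output
token_ids = torch.randint(0, 9721, (2, 5))
with torch.no_grad():
    logits = decoder(token_ids, memory)
print(logits.shape)
```

The causal mask is what makes generation autoregressive: the logits at position t depend only on tokens up to t and on the image tokens.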

Repository Files

This repository contains:

best_phase1.pt       # PyTorch checkpoint
Training-5k.ipynb    # Training and inference notebook

The checkpoint contains:

epoch
model
val_loss
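
For reference, a checkpoint with this layout can be written and read back as in the sketch below. The small nn.Linear module is only a stand-in for the real ImageCaptioningModel, and the temporary file path is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for ImageCaptioningModel

# Same three keys as the checkpoint in this repository.
checkpoint = {"epoch": 8, "model": model.state_dict(), "val_loss": 3.6158}
path = os.path.join(tempfile.gettempdir(), "checkpoint_sketch.pt")
torch.save(checkpoint, path)

restored = torch.load(path, map_location="cpu")
fresh = nn.Linear(4, 2)
fresh.load_state_dict(restored["model"])  # "model" holds the state dict
print(sorted(restored.keys()))
```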

Checkpoint information:

Checkpoint file: best_phase1.pt
Epoch: 8
Validation loss: 3.6158
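
One way to read the validation loss, assuming it is the mean per-token cross-entropy in nats (the README does not state this), is to convert it to perplexity and compare it with the loss of a uniform guess over the vocabulary:

```python
import math

val_loss = 3.6158   # reported validation loss
vocab_size = 9721

perplexity = math.exp(val_loss)        # effective number of equally likely tokens
uniform_loss = math.log(vocab_size)    # loss of a uniform guess over the vocabulary

print(f"perplexity: {perplexity:.1f}")
print(f"uniform baseline loss: {uniform_loss:.2f}")
```

Under that assumption the perplexity is roughly 37, far below the vocabulary size, but it says little about caption quality on its own.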

Important Note About Vocabulary

This model uses a custom word-level vocabulary built from the training captions.

The checkpoint stores the model weights, but it does not store the vocabulary mapping. To run inference correctly, you must use the same vocabulary that was used during training.
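
To illustrate why the mapping matters, here is a minimal sketch of a word-level vocabulary. The function names, whitespace tokenization, lowercasing, and the choice to count the start and end tokens toward the maximum length are assumptions; the special-token ids follow the ones documented for this model (PAD 0, SOS 1, EOS 2, UNK 3).

```python
from collections import Counter

PAD, SOS, EOS, UNK = 0, 1, 2, 3


def build_vocab(captions, min_freq=1):
    """Assign ids to words in alphabetical order, after the special tokens."""
    counts = Counter(word for c in captions for word in c.lower().split())
    stoi = {"<PAD>": PAD, "<SOS>": SOS, "<EOS>": EOS, "<UNK>": UNK}
    for word, n in sorted(counts.items()):
        if n >= min_freq:
            stoi[word] = len(stoi)
    return stoi


def encode(caption, stoi, max_len=52):
    """SOS + word ids + EOS, truncated and padded to max_len."""
    ids = [SOS] + [stoi.get(w, UNK) for w in caption.lower().split()]
    ids = ids[: max_len - 1] + [EOS]
    return ids + [PAD] * (max_len - len(ids))


def decode(ids, itos):
    """Turn ids back into text, stopping at EOS and skipping PAD and SOS."""
    words = []
    for i in ids:
        if i == EOS:
            break
        if i not in (PAD, SOS):
            words.append(itos[i])
    return " ".join(words)


stoi = build_vocab(["a bicycle with a clock", "a clock on a wall"])
itos = {i: w for w, i in stoi.items()}
ids = encode("a bicycle near a wall", stoi, max_len=10)
print(ids)                 # [1, 4, 5, 3, 4, 8, 2, 0, 0, 0]
print(decode(ids, itos))   # a bicycle <UNK> a wall
```

Because ids depend on the exact set and order of training words, a vocabulary rebuilt from different captions would silently map the model's outputs to the wrong words.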

The vocabulary contains 9,721 tokens and uses the following special tokens:

<PAD> = 0
<SOS> = 1
<EOS> = 2
<UNK> = 3

To make the model easier to use, the repository should also include a file such as:

vocab.json

containing the stoi and itos mappings.
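
A possible layout for such a file is sketched below. The helper names and the file path are hypothetical. Storing itos as a list sidesteps the fact that JSON object keys are always strings, so integer keys would not survive a round trip.

```python
import json
import os
import tempfile


def save_vocab(stoi, path):
    """Write stoi as a dict and itos as a list indexed by token id."""
    itos = [None] * len(stoi)
    for token, idx in stoi.items():
        itos[idx] = token
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"stoi": stoi, "itos": itos}, f, ensure_ascii=False)


def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["stoi"], data["itos"]


stoi = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "a": 4, "bicycle": 5}
path = os.path.join(tempfile.gettempdir(), "vocab_sketch.json")
save_vocab(stoi, path)
loaded_stoi, loaded_itos = load_vocab(path)
print(loaded_itos[5])  # bicycle
```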

Training Data

The model was trained using COCO-style image-caption data.

The training notebook is configured to use:

Dataset format: COCO captions
Training annotations: captions_train2014.json
Validation annotations: captions_val2014.json
Image size: 224 x 224
Batch size: 32
Maximum caption length: 52

The notebook version included in this repository was designed for a smaller training experiment using a limited number of samples.
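
COCO caption annotation files are JSON with an "images" list and an "annotations" list linked by image id. The sketch below groups captions by image file name using a tiny in-memory sample in that format; the function name and the sample entries are illustrative, and the notebook's Dataset class may organize this differently.

```python
import json
from collections import defaultdict


def captions_by_file(coco):
    """Group caption strings by image file name."""
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[id_to_file[ann["image_id"]]].append(ann["caption"].strip())
    return dict(grouped)


# Stand-in for: json.load(open("captions_train2014.json"))
sample = json.loads("""
{
  "images": [{"id": 9, "file_name": "COCO_train2014_000000000009.jpg"}],
  "annotations": [
    {"id": 1, "image_id": 9, "caption": "A plate of food on a table. "},
    {"id": 2, "image_id": 9, "caption": "Several dishes with vegetables."}
  ]
}
""")

pairs = captions_by_file(sample)
for file_name, captions in pairs.items():
    print(file_name, len(captions))
```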

Image Preprocessing

Images are resized to 224 x 224 and normalized with ImageNet statistics:

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD  = [0.229, 0.224, 0.225]

Validation and inference transforms:

import torchvision.transforms as T

transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

How to Use

This is a custom PyTorch model. It is not a standard Hugging Face Transformers model, so it cannot be loaded directly with:

AutoModel.from_pretrained(...)

To use the model, open and run the notebook:

Training-5k.ipynb

The notebook contains:

Vocabulary class
Dataset class
EfficientNet encoder
Transformer decoder
ImageCaptioningModel class
Training loop
Checkpoint loading
Greedy decoding
Beam-search decoding
Evaluation code

Loading the Checkpoint

After defining the model architecture and rebuilding or loading the same vocabulary, load the checkpoint as follows:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = ImageCaptioningModel(
    vocab_size=9721,
    embed_dim=256,
    num_heads=8,
    num_layers=6,
    ff_dim=1024,
    max_len=52,
    dropout=0.1
).to(device)

checkpoint = torch.load("best_phase1.pt", map_location=device)
model.load_state_dict(checkpoint["model"])
model.eval()

print("Loaded checkpoint")
print("Epoch:", checkpoint["epoch"])
print("Validation loss:", checkpoint["val_loss"])

Generating a Caption

The notebook includes two caption generation methods:

model.generate_greedy(image_tensor)
model.generate_beam(image_tensor, beam_size=5)

Example:

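from PIL import Image

image = Image.open("example.jpg").convert("RGB")
# Move the tensor to the same device as the model.
# Add .unsqueeze(0) if generate_beam expects a batch dimension.
image_tensor = transform(image).to(device)

caption = model.generate_beam(image_tensor, beam_size=5)
print(caption)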
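
The difference between the two decoding methods can be shown without the model. In the sketch below, step_fn is a hypothetical callback that returns log-probabilities for the next token given the tokens so far; in the real model that role is played by a decoder forward pass. The length-normalized final score is an assumption, not necessarily what the notebook does.

```python
import math

SOS, EOS = 1, 2


def greedy_decode(step_fn, max_len=52):
    """Pick the single most likely token at every step."""
    tokens = [SOS]
    while len(tokens) < max_len:
        log_probs = step_fn(tokens)
        next_id = max(range(len(log_probs)), key=log_probs.__getitem__)
        tokens.append(next_id)
        if next_id == EOS:
            break
    return tokens


def beam_decode(step_fn, beam_size=5, max_len=52):
    """Keep the beam_size best partial captions at every step."""
    beams = [([SOS], 0.0)]
    finished = []
    while beams and len(beams[0][0]) < max_len:
        candidates = []
        for tokens, score in beams:
            for tok_id, lp in enumerate(step_fn(tokens)):
                candidates.append((tokens + [tok_id], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
    finished.extend(beams)  # hypotheses cut off at max_len
    # Length-normalized log-probability (assumption).
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]


def toy_step(tokens):
    """Toy 5-token model where the greedy first choice leads to a worse caption."""
    last = tokens[-1]
    if last == SOS:
        probs = {3: 0.6, 4: 0.4}
    elif last == 3:
        probs = {EOS: 0.4, 3: 0.3, 4: 0.3}
    else:
        probs = {EOS: 0.9, 3: 0.1}
    return [math.log(probs.get(i, 1e-9)) for i in range(5)]


print(greedy_decode(toy_step))             # [1, 3, 2]
print(beam_decode(toy_step, beam_size=2))  # [1, 4, 2]
```

Greedy search commits to token 3 (probability 0.6) and ends with a total probability of 0.24, while beam search keeps token 4 alive and finds the sequence with probability 0.36.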

Example Output

Example caption format:

a bicycle with a clock as the front wheel

Actual output quality depends on the training data size, checkpoint version, and decoding method.

Evaluation

The notebook includes BLEU evaluation code using NLTK:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

You can evaluate the model on validation images using greedy decoding or beam search.
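
A minimal sketch of corpus-level BLEU with NLTK follows. The captions are made up for illustration and are not model outputs; references and hypotheses must be tokenized, and each image can have several reference captions.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One entry per image: a list of tokenized reference captions.
references = [
    [
        "a bicycle with a clock as the front wheel".split(),
        "a bike whose front wheel is a clock".split(),
    ],
]
# One tokenized generated caption per image, in the same order.
hypotheses = ["a bicycle with a clock on the front wheel".split()]

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
bleu1 = corpus_bleu(references, hypotheses,
                    weights=(1.0, 0.0, 0.0, 0.0), smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)

print(f"BLEU-1: {bleu1:.3f}")
print(f"BLEU-4: {bleu4:.3f}")
```

The same preprocessing used for training captions (for example lowercasing) should be applied to both references and hypotheses, otherwise the n-gram overlap is understated.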

Recommended metrics for this task:

BLEU-1
BLEU-4
CIDEr
METEOR
ROUGE-L

Limitations

This model is an experimental image captioning model.

Known limitations:

The model uses a custom word-level tokenizer, not a subword tokenizer.
The vocabulary must match the original training vocabulary.
The checkpoint alone is not enough for fully reproducible inference unless the vocabulary is also available.
Caption quality may be limited if the model was trained on a small subset of the dataset.
The model may generate generic or repetitive captions.
The model may fail on images that are very different from the training distribution.
The model may hallucinate objects that are not present in the image.

Recommended Improvements

To make this repository easier to use, future versions should include:

vocab.json
model.py
requirements.txt
inference.py
example images
evaluation results

A better repository structure would be:

.
├── README.md
├── best_phase1.pt
├── Training-5k.ipynb
├── vocab.json
├── model.py
├── inference.py
└── requirements.txt

Requirements

The notebook uses the following main libraries:

torch
torchvision
Pillow
numpy
matplotlib
nltk
pycocotools
pycocoevalcap
einops

Install dependencies with:

pip install torch torchvision pillow numpy matplotlib nltk pycocotools pycocoevalcap einops

Citation

If you use this model, please cite or mention this repository.

Author

Created as a custom PyTorch image captioning model using an EfficientNet image encoder and a Transformer text decoder.
