Image Captioning Model
This repository contains a custom PyTorch image captioning model. The model receives an input image and generates a natural-language caption describing the image.
The architecture is built from two main components:
- Image Encoder: EfficientNet-V2-S backbone pretrained on ImageNet.
- Text Decoder: Transformer decoder that generates captions token by token.
The model was trained for image caption generation using COCO-style image-caption pairs.
Model Architecture
The model follows an encoder-decoder structure:
Input Image
↓
EfficientNet-V2-S Image Encoder
↓
Image Feature Tokens
↓
Transformer Text Decoder
↓
Generated Caption
Image Encoder
The encoder uses the EfficientNet-V2-S backbone (efficientnet_v2_s) from torchvision.models.
The image encoder extracts visual features from the input image and projects them into a 256-dimensional embedding space. The final image representation is treated as a sequence of visual tokens.
Encoder details:
Backbone: EfficientNet-V2-S
Input image size: 224 x 224
Output visual tokens: 49
Embedding dimension: 256
ImageNet normalization: Yes
Text Decoder
The decoder is a Transformer decoder that generates captions autoregressively.
Decoder details:
Vocabulary size: 9,721
Embedding dimension: 256
Number of Transformer decoder layers: 6
Number of attention heads: 8
Feed-forward dimension: 1024
Maximum caption length: 52
Dropout: 0.1
Decoding methods: Greedy search and beam search
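As a rough illustration of how these hyperparameters fit together, the sketch below builds a decoder from torch.nn.TransformerDecoder. The class name DecoderSketch, the learned positional embedding, and the layer layout are assumptions; only the hyperparameter values come from the list above.

```python
import torch
import torch.nn as nn


class DecoderSketch(nn.Module):
    """Hypothetical decoder using the hyperparameters listed above."""

    def __init__(self, vocab_size=9721, embed_dim=256, num_heads=8,
                 num_layers=6, ff_dim=1024, max_len=52, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Learned positional embeddings are an assumption of this sketch.
        self.pos_emb = nn.Embedding(max_len, embed_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, memory):
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Boolean causal mask: True marks positions a token may NOT attend to.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device),
            diagonal=1)
        x = self.layers(x, memory, tgt_mask=causal)
        return self.out(x)  # (B, seq_len, vocab_size) logits


decoder = DecoderSketch().eval()
tokens = torch.randint(4, 9721, (2, 10))  # two caption prefixes of 10 tokens
memory = torch.randn(2, 49, 256)          # stand-in for the 49 visual tokens
with torch.no_grad():
    logits = decoder(tokens, memory)
print(logits.shape)  # torch.Size([2, 10, 9721])
```

The causal mask is what makes generation autoregressive: the logits at position t depend only on tokens up to t and on the visual tokens.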
Repository Files
This repository contains:
best_phase1.pt # PyTorch checkpoint
Training-5k.ipynb # Training and inference notebook
The checkpoint contains:
epoch
model
val_loss
Checkpoint information:
Checkpoint file: best_phase1.pt
Epoch: 8
Validation loss: 3.6158
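For reference, a checkpoint with these three keys can be written and read back as sketched below. The nn.Linear module is only a stand-in for the real model, and the file name demo_checkpoint.pt is hypothetical; the key names and values mirror the ones listed above.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for ImageCaptioningModel
path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.pt")

# Save a dictionary with the same three keys as best_phase1.pt.
torch.save({"epoch": 8, "model": model.state_dict(), "val_loss": 3.6158}, path)

checkpoint = torch.load(path, map_location="cpu")

# Fail early if the file does not have the expected layout.
missing = {"epoch", "model", "val_loss"} - set(checkpoint)
if missing:
    raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")

restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["model"])
print(checkpoint["epoch"], checkpoint["val_loss"])
```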
Important Note About Vocabulary
This model uses a custom word-level vocabulary built from the training captions.
The checkpoint stores the model weights, but it does not store the vocabulary mapping. To run inference correctly, you must use the same vocabulary that was used during training.
The vocabulary contains 9,721 tokens and uses the following special tokens:
<PAD> = 0
<SOS> = 1
<EOS> = 2
<UNK> = 3
To make the model easier to use, upload an additional file such as:
vocab.json
containing the stoi (token to index) and itos (index to token) mappings.
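A minimal sketch of saving and reloading such a file is shown below. The helper names build_vocab, save_vocab, and load_vocab are hypothetical, and the lowercase whitespace tokenization is an assumption; the notebook's Vocabulary class may tokenize differently, so export the mapping that was actually built during training rather than rebuilding it with this code.

```python
import json
import os
import tempfile

SPECIALS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]  # indices 0, 1, 2, 3


def build_vocab(captions, min_freq=1):
    """Build word-level stoi/itos mappings (assumed tokenization: lower + split)."""
    counts = {}
    for caption in captions:
        for word in caption.lower().split():
            counts[word] = counts.get(word, 0) + 1
    itos = list(SPECIALS)
    itos += sorted(w for w, c in counts.items() if c >= min_freq)
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos


def save_vocab(path, stoi, itos):
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"stoi": stoi, "itos": itos}, f, ensure_ascii=False)


def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["stoi"], data["itos"]


stoi, itos = build_vocab(["a dog on a beach", "a cat on a sofa"])
path = os.path.join(tempfile.mkdtemp(), "vocab.json")
save_vocab(path, stoi, itos)
stoi2, itos2 = load_vocab(path)
print(len(itos2), stoi2["<UNK>"])
```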
Training Data
The model was trained using COCO-style image-caption data.
The training notebook is configured to use:
Dataset format: COCO captions
Training annotations: captions_train2014.json
Validation annotations: captions_val2014.json
Image size: 224 x 224
Batch size: 32
Maximum caption length: 52
The notebook version included in this repository is set up for a small-scale training experiment on a limited number of samples.
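To show how a caption might be turned into a fixed-length sequence of token ids, here is a small sketch. The function encode_caption is hypothetical, and it assumes that the maximum length of 52 includes the <SOS> and <EOS> tokens and that out-of-vocabulary words map to <UNK>; the notebook's Dataset class may handle these details differently.

```python
PAD, SOS, EOS, UNK = 0, 1, 2, 3
MAX_LEN = 52


def encode_caption(caption, stoi, max_len=MAX_LEN):
    """Convert a caption into max_len token ids: <SOS> words <EOS> <PAD>..."""
    words = caption.lower().split()
    # Truncate so that <SOS> and <EOS> still fit within max_len.
    ids = [SOS] + [stoi.get(w, UNK) for w in words][: max_len - 2] + [EOS]
    return ids + [PAD] * (max_len - len(ids))


toy_stoi = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "a": 4, "dog": 5}
encoded = encode_caption("a dog runs", toy_stoi)
print(encoded[:6])  # [1, 4, 5, 3, 2, 0]  ("runs" is not in the toy vocabulary)
```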
Image Preprocessing
Images are resized to 224 x 224 and normalized with ImageNet statistics:
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
Validation and inference transforms:
import torchvision.transforms as T
transform = T.Compose([
T.Resize((224, 224)),
T.ToTensor(),
T.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
),
])
How to Use
This is a custom PyTorch model. It is not a standard Hugging Face Transformers model, so it cannot be loaded directly with:
AutoModel.from_pretrained(...)
To use the model, open and run the notebook:
Training-5k.ipynb
The notebook contains:
Vocabulary class
Dataset class
EfficientNet encoder
Transformer decoder
ImageCaptioningModel class
Training loop
Checkpoint loading
Greedy decoding
Beam-search decoding
Evaluation code
Loading the Checkpoint
After defining the model architecture and rebuilding or loading the same vocabulary that was used in training, load the checkpoint as follows:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ImageCaptioningModel(
vocab_size=9721,
embed_dim=256,
num_heads=8,
num_layers=6,
ff_dim=1024,
max_len=52,
dropout=0.1
).to(device)
checkpoint = torch.load("best_phase1.pt", map_location=device)
model.load_state_dict(checkpoint["model"])
model.eval()
print("Loaded checkpoint")
print("Epoch:", checkpoint["epoch"])
print("Validation loss:", checkpoint["val_loss"])
Generating a Caption
The notebook includes two caption generation methods:
model.generate_greedy(image_tensor)
model.generate_beam(image_tensor, beam_size=5)
Example:
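from PIL import Image
image = Image.open("example.jpg").convert("RGB")
# Move the tensor to the same device as the model. If the notebook's
# generate methods expect a batch dimension, also call .unsqueeze(0).
image_tensor = transform(image).to(device)
caption = model.generate_beam(image_tensor, beam_size=5)
print(caption)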
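The difference between the two decoding methods can be illustrated without the model. The sketch below is a generic beam search over a toy next-token distribution, not the notebook's generate_beam implementation; the names beam_search and toy_step are hypothetical, scores are summed log-probabilities with no length normalization, and beam_size=1 reduces to greedy decoding.

```python
import math

SOS, EOS = 1, 2


def beam_search(step_fn, beam_size=5, max_len=52):
    """step_fn(prefix) returns {token_id: log_prob} for the next token."""
    beams = [([SOS], 0.0)]
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep the best beam_size candidates; those ending in <EOS> are done.
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]


def toy_step(prefix):
    """Toy distribution where the locally best first token leads to a worse caption."""
    table = {
        (SOS,): {4: math.log(0.6), 5: math.log(0.4)},
        (SOS, 4): {EOS: math.log(0.55), 6: math.log(0.45)},
        (SOS, 5): {EOS: math.log(0.9), 6: math.log(0.1)},
    }
    return table.get(tuple(prefix), {EOS: 0.0})


greedy = beam_search(toy_step, beam_size=1)
beam = beam_search(toy_step, beam_size=2)
print(greedy)  # [1, 4, 2], probability 0.6 * 0.55 = 0.33
print(beam)    # [1, 5, 2], probability 0.4 * 0.9 = 0.36
```

Greedy decoding commits to token 4 because it is the most likely first step, while the wider beam keeps token 5 alive long enough to find the more probable full sequence.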
Example Output
Example caption format:
a bicycle with a clock as the front wheel
Actual output quality depends on the training data size, checkpoint version, and decoding method.
Evaluation
The notebook includes BLEU evaluation code using NLTK:
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
You can evaluate the model on validation images using greedy decoding or beam search.
Recommended metrics for this task:
BLEU-1
BLEU-4
CIDEr
METEOR
ROUGE-L
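As a small illustration of the first two metrics, the sketch below computes corpus-level BLEU-1 and BLEU-4 with NLTK on made-up tokenized captions. The example sentences are invented for illustration, and the choice of smoothing method (method1) is an assumption; the notebook may use a different setting.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference captions per image; each caption is a list of tokens.
references = [
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "dog", "running", "on", "a", "beach"]],
    [["a", "cat", "sits", "on", "a", "sofa"]],
]
# One generated caption per image, in the same order as the references.
hypotheses = [
    ["a", "dog", "runs", "on", "the", "beach"],
    ["a", "cat", "on", "a", "sofa"],
]

smooth = SmoothingFunction().method1
bleu1 = corpus_bleu(references, hypotheses,
                    weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```

CIDEr, METEOR, and ROUGE-L are not part of this sketch; the pycocoevalcap package listed in the requirements provides scorers for them.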
Limitations
This model is an experimental image captioning model.
Known limitations:
- The model uses a custom word-level tokenizer, not a subword tokenizer.
- The vocabulary must match the original training vocabulary.
- The checkpoint alone is not enough for fully reproducible inference unless the vocabulary is also available.
- Caption quality may be limited if the model was trained on a small subset of the dataset.
- The model may generate generic or repetitive captions.
- The model may fail on images that are very different from the training distribution.
- The model may hallucinate objects that are not present in the image.
Recommended Improvements
To make this repository easier to use, future versions should include:
vocab.json
model.py
requirements.txt
inference.py
example images
evaluation results
A better repository structure would be:
.
├── README.md
├── best_phase1.pt
├── Training-5k.ipynb
├── vocab.json
├── model.py
├── inference.py
└── requirements.txt
Requirements
The notebook uses the following main libraries:
torch
torchvision
Pillow
numpy
matplotlib
nltk
pycocotools
pycocoevalcap
einops
Install dependencies with:
pip install torch torchvision pillow numpy matplotlib nltk pycocotools pycocoevalcap einops
Citation
If you use this model, please cite or mention this repository.
Author
Created as a custom PyTorch image captioning model using an EfficientNet image encoder and a Transformer text decoder.