ViTPets - Vision Transformer trained from scratch on Oxford Pets 🐶🐱

This model is a Vision Transformer (ViT) trained from scratch on the Oxford Pets dataset. It classifies images of cats and dogs into 37 different breeds.

Model Summary

Architecture: Custom Vision Transformer (ViT)
Input resolution: 128x128
Patch size: 16x16
Embedding dimension: 240
Number of Transformer blocks: 12
Number of heads: 4
MLP ratio: 2.0
Dropout: 10% on attention and MLP
Framework: PyTorch
Dataset: Oxford Pets (via 🤗 cvdl/oxford-pets)
Loss: CrossEntropyLoss
Optimizer: SGD with LR = 0.00257
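
The architecture numbers above fix a few derived sizes that are worth checking when adapting the model. The sketch below works them out with plain arithmetic. It assumes the usual ViT conventions (non-overlapping square patches, embedding split evenly across heads, MLP hidden width equal to embed_dim × mlp_ratio); the variable names are illustrative and are not taken from the author's code. Whether a class token is added to the sequence is not stated in the card, so only the patch count is computed.

```python
# Hyperparameters as listed in the model summary.
img_size = 128
patch_size = 16
in_channels = 3
embed_dim = 240
n_heads = 4
mlp_ratio = 2.0

# Assumption: non-overlapping square patches tile the image exactly.
patches_per_side = img_size // patch_size
n_patches = patches_per_side ** 2

# Raw values in one flattened patch before the linear projection.
patch_values = in_channels * patch_size ** 2

# Assumption: the embedding is split evenly across attention heads.
head_dim = embed_dim // n_heads

# Assumption: the MLP hidden width is embed_dim * mlp_ratio.
mlp_hidden = int(embed_dim * mlp_ratio)

print(f"patches per image: {n_patches}")
print(f"values per patch:  {patch_values}")
print(f"per-head dim:      {head_dim}")
print(f"MLP hidden width:  {mlp_hidden}")
```

With these settings each image becomes 64 patches of 768 raw values, each head works in 60 dimensions, and the MLP hidden layer has 480 units.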

Training Setup

Device: Multi-GPU (4 GPUs)
Batch size: 256 (64 × 4 GPUs)
Early stopping: patience 50, delta 1e-6
Logging: TensorBoard
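
The early-stopping rule above (patience 50, delta 1e-6) can be sketched as a small helper. This is a minimal illustration, not the author's implementation: it assumes the monitored quantity is a validation loss where lower is better, and the class name EarlyStopping and its step method are hypothetical.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=50, delta=1e-6):
        self.patience = patience      # epochs to wait without improvement
        self.delta = delta            # minimum decrease that counts as improvement
        self.best = float("inf")      # best (lowest) loss seen so far
        self.bad_epochs = 0           # consecutive epochs without improvement

    def step(self, val_loss):
        """Record one epoch's loss and return True when training should stop."""
        if val_loss < self.best - self.delta:
            # Improvement larger than delta: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop one would call `stopper.step(val_loss)` once per epoch and break out of the loop when it returns True.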

How to Use

from model import ViT
import torch

model = ViT(
    img_size=(128, 128),
    patch_size=16,
    in_channels=3,
    embed_dim=240,
    n_classes=37,
    n_blocks=12,
    n_heads=4,
    mlp_ratio=2.0,
    qkv_bias=True,
    block_drop_p=0.1,
    attn_drop_p=0.1,
)

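# Load the trained weights; map_location="cpu" lets the checkpoint load on machines without a GPU.
model.load_state_dict(torch.load("ViTPets.pth", map_location="cpu"))
model.eval()  # switch to inference mode (disables dropout)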
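
Once the weights are loaded, a prediction step might look like the sketch below. It assumes the model's forward pass returns raw logits of shape (N, 37) for a float batch of shape (N, 3, 128, 128); the card does not document the preprocessing or normalization used in training, so that step is left to the caller. The helper name predict_top_k is hypothetical.

```python
import torch


def predict_top_k(model, batch, k=5):
    """Return (probabilities, class indices) of the k most likely breeds per image.

    Assumptions: `batch` is a float tensor of shape (N, 3, 128, 128) that has
    already been resized and normalized the way the model expects, and
    `model(batch)` returns logits of shape (N, n_classes).
    """
    model.eval()                               # make sure dropout is disabled
    with torch.no_grad():                      # no gradients needed at inference
        logits = model(batch)
        probs = torch.softmax(logits, dim=1)   # convert logits to probabilities
        top_probs, top_idx = probs.topk(k, dim=1)
    return top_probs, top_idx
```

The returned indices are class ids in the range 0 to 36; mapping them to breed names depends on the label order of the dataset used for training.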


Dataset used to train chitter99/vit_oxford_pets_patch16_128: cvdl/oxford-pets

Evaluation results

Accuracy on Oxford Pets (self-reported): 9.000