---
license: apache-2.0
tags:
- image-classification
- vision-transformer
- pytorch
- oxford-pets
library_name: torch
datasets:
- cvdl/oxford-pets
language: []
model-index:
- name: ViTPets
  results:
  - task:
      type: image-classification
    dataset:
      name: Oxford Pets
      type: cvdl/oxford-pets
    metrics:
    - type: accuracy
      value: 9
---

# ViTPets - Vision Transformer trained from scratch on Oxford Pets 🐢🐱

This model is a Vision Transformer (ViT) trained from scratch on the [Oxford Pets dataset](https://huggingface.co/datasets/cvdl/oxford-pets). It classifies images of cats and dogs into 37 breeds.

## Model Summary

- **Architecture**: Custom Vision Transformer (ViT)
- **Input resolution**: 128x128
- **Patch size**: 16x16
- **Embedding dimension**: 240
- **Number of Transformer blocks**: 12
- **Number of heads**: 4
- **MLP ratio**: 2.0
- **Dropout**: 10% on attention and MLP
- **Framework**: PyTorch
- **Dataset**: Oxford Pets (via 🤗 `cvdl/oxford-pets`)
- **Loss**: CrossEntropyLoss
- **Optimizer**: SGD with LR = 0.00257

## Training Setup

- **Device**: Multi-GPU (4 GPUs)
- **Batch size**: 256 (64 × 4 GPUs)
- **Early stopping**: patience 50, delta 1e-6
- **Logging**: TensorBoard

## How to Use

```python
from model import ViT
import torch

# Rebuild the architecture with the hyperparameters from the Model Summary
model = ViT(
    img_size=(128, 128),
    patch_size=16,
    in_channels=3,
    embed_dim=240,
    n_classes=37,
    n_blocks=12,
    n_heads=4,
    mlp_ratio=2.0,
    qkv_bias=True,
    block_drop_p=0.1,
    attn_drop_p=0.1,
)

# Load the trained weights and switch to inference mode (disables dropout)
model.load_state_dict(torch.load("ViTPets.pth"))
model.eval()
```
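
The hyperparameters in the Model Summary imply a few derived sizes that are useful when checking tensor shapes. The sketch below works them out in plain Python. The helper name `vit_derived_sizes` is hypothetical and not part of this repository. It also assumes the usual ViT conventions, which this card does not confirm: the MLP hidden width is `embed_dim * mlp_ratio`, and the embedding is split evenly across heads. Whether the model prepends a class token is not stated here, so the sequence length is left out.

```python
def vit_derived_sizes(img_size=128, patch_size=16, in_channels=3,
                      embed_dim=240, n_heads=4, mlp_ratio=2.0):
    """Derive patch and layer sizes from the hyperparameters listed in the card.

    Hypothetical helper for illustration; it follows common ViT conventions,
    which may differ from the actual `model.ViT` implementation.
    """
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    if embed_dim % n_heads != 0:
        raise ValueError("embed_dim must be divisible by n_heads")

    patches_per_side = img_size // patch_size
    return {
        # Number of non-overlapping patches along each image side
        "patches_per_side": patches_per_side,
        # Total patches per image (tokens before any class token)
        "n_patches": patches_per_side ** 2,
        # Raw pixel values in one flattened patch
        "patch_dim": in_channels * patch_size * patch_size,
        # Per-head width, assuming an even split of the embedding
        "head_dim": embed_dim // n_heads,
        # MLP hidden width, assuming hidden = embed_dim * mlp_ratio
        "mlp_hidden_dim": int(embed_dim * mlp_ratio),
    }


sizes = vit_derived_sizes()
for name, value in sizes.items():
    print(f"{name}: {value}")
```

Under these assumptions, a 128x128 input is cut into an 8x8 grid of 64 patches, and each patch is projected from 768 raw values to the 240-dimensional embedding.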