---
license: apache-2.0
tags:
- image-classification
- vision-transformer
- pytorch
- oxford-pets
library_name: torch
datasets:
- cvdl/oxford-pets
language: []
model-index:
- name: ViTPets
  results:
  - task:
      type: image-classification
    dataset:
      name: Oxford Pets
      type: cvdl/oxford-pets
    metrics:
    - type: accuracy
      value: 9
---
# ViTPets - Vision Transformer trained from scratch on Oxford Pets 🐶🐱
This model is a Vision Transformer (ViT) trained from scratch on the [Oxford Pets dataset](https://huggingface.co/datasets/cvdl/oxford-pets). It classifies images of cats and dogs into 37 different breeds.
## Model Summary
- **Architecture**: Custom Vision Transformer (ViT)
- **Input resolution**: 128x128
- **Patch size**: 16x16
- **Embedding dimension**: 240
- **Number of Transformer blocks**: 12
- **Number of heads**: 4
- **MLP ratio**: 2.0
- **Dropout**: 10% on attention and MLP
- **Framework**: PyTorch
- **Dataset**: Oxford Pets (via 🤗 `cvdl/oxford-pets`)
- **Loss**: CrossEntropyLoss
- **Optimizer**: SGD with LR = 0.00257
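For intuition, the sketch below (illustrative only, not taken from the model code) works out how these hyperparameters shape the token sequence the Transformer operates on: 128×128 inputs cut into 16×16 patches give 64 patch tokens, each embedded in 240 dimensions.

```python
# Illustrative arithmetic for the hyperparameters listed above.
img_size, patch_size, embed_dim, n_heads, mlp_ratio = 128, 16, 240, 4, 2.0

n_patches = (img_size // patch_size) ** 2  # (128 / 16)^2 = 64 patch tokens
seq_len = n_patches + 1                    # 65 tokens if a [CLS] token is prepended (assumption)
head_dim = embed_dim // n_heads            # 240 / 4 = 60 dims per attention head
mlp_hidden = int(embed_dim * mlp_ratio)    # 240 * 2.0 = 480 hidden units in each MLP

print(n_patches, seq_len, head_dim, mlp_hidden)  # 64 65 60 480
```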
## Training Setup
- **Device**: Multi-GPU (4 GPUs)
- **Batch size**: 256 effective (64 per GPU × 4 GPUs)
- **Early stopping**: patience 50, delta 1e-6
- **Logging**: TensorBoard
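The training loop itself is not published in this card; the snippet below is a minimal early-stopping sketch consistent with the listed settings (patience 50, delta 1e-6), tracking validation loss.

```python
# Minimal early-stopping helper matching the settings above.
# This is an illustrative sketch, not the authors' training code.
class EarlyStopping:
    def __init__(self, patience: int = 50, delta: float = 1e-6):
        self.patience = patience
        self.delta = delta
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best_loss - self.delta:
            self.best_loss = val_loss  # meaningful improvement: reset the counter
            self.counter = 0
        else:
            self.counter += 1          # no improvement beyond delta this epoch
        return self.counter >= self.patience
```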
## How to Use
```python
from model import ViT
import torch

# Rebuild the architecture with the same hyperparameters used for training
model = ViT(
    img_size=(128, 128),
    patch_size=16,
    in_channels=3,
    embed_dim=240,
    n_classes=37,
    n_blocks=12,
    n_heads=4,
    mlp_ratio=2.0,
    qkv_bias=True,
    block_drop_p=0.1,
    attn_drop_p=0.1,
)

# Load the trained weights (map_location keeps this working without a GPU)
model.load_state_dict(torch.load("ViTPets.pth", map_location="cpu"))
model.eval()
```
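A possible inference flow, continuing from the snippet above, is sketched below. The preprocessing used during training is not documented in this card, so the ImageNet-style normalization here is an assumption, and `cat.jpg` is a placeholder filename.

```python
from PIL import Image
import torchvision.transforms as T

# Assumed preprocessing: resize to the model's 128x128 input resolution and
# apply ImageNet normalization (the actual training transforms may differ).
preprocess = T.Compose([
    T.Resize((128, 128)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("cat.jpg").convert("RGB")  # placeholder image path
x = preprocess(img).unsqueeze(0)            # shape: (1, 3, 128, 128)

with torch.no_grad():
    logits = model(x)                        # shape: (1, 37)
    pred = logits.argmax(dim=1).item()       # predicted breed index (0-36)
print(pred)
```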