---
license: apache-2.0
tags:
- image-classification
- vision-transformer
- pytorch
- oxford-pets
library_name: torch
datasets:
- cvdl/oxford-pets
language: []
model-index:
- name: ViTPets
  results:
  - task:
      type: image-classification
    dataset:
      name: Oxford Pets
      type: cvdl/oxford-pets
    metrics:
    - type: accuracy
      value: 9
---

# ViTPets - Vision Transformer trained from scratch on Oxford Pets 🐶🐱

This model is a Vision Transformer (ViT) trained from scratch on the [Oxford Pets dataset](https://huggingface.co/datasets/cvdl/oxford-pets). It classifies images of cats and dogs into 37 breeds.

## Model Summary

- **Architecture**: Custom Vision Transformer (ViT)
- **Input resolution**: 128x128
- **Patch size**: 16x16
- **Embedding dimension**: 240
- **Number of Transformer blocks**: 12
- **Number of heads**: 4
- **MLP ratio**: 2.0
- **Dropout**: 10% on attention and MLP
- **Framework**: PyTorch
- **Dataset**: Oxford Pets (via 🤗 `cvdl/oxford-pets`)
- **Loss**: CrossEntropyLoss
- **Optimizer**: SGD with LR = 0.00257

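
The hyperparameters above fix a few derived sizes. The sketch below works them out with plain arithmetic. The variable names are illustrative, and the extra class token is an assumption (it is common in ViT implementations but is not stated on this card).

```python
# Derived sizes implied by the Model Summary (plain arithmetic, no PyTorch needed).
img_size = 128      # input resolution (square)
patch_size = 16
in_channels = 3     # RGB input
embed_dim = 240
n_heads = 4
mlp_ratio = 2.0

patches_per_side = img_size // patch_size              # 8
n_patches = patches_per_side ** 2                      # 64 patch tokens per image
patch_dim = in_channels * patch_size * patch_size      # 768 raw values per patch
head_dim = embed_dim // n_heads                        # 60 channels per attention head
mlp_hidden = int(embed_dim * mlp_ratio)                # 480 hidden units in each MLP

# Assumption: one learnable class token is prepended, as in the standard ViT.
seq_len_with_cls = n_patches + 1                       # 65

print(n_patches, patch_dim, head_dim, mlp_hidden, seq_len_with_cls)
```

Note that the embedding dimension must be divisible by the number of heads; 240 / 4 = 60 satisfies this.
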
## Training Setup

- **Device**: Multi-GPU (4 GPUs)
- **Batch size**: 256 total (64 per GPU × 4 GPUs)
- **Early stopping**: patience 50, delta 1e-6
- **Logging**: TensorBoard

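
To illustrate how the early-stopping settings above interact, here is a minimal sketch. The `EarlyStopping` class is a hypothetical helper, not the training code used for this model, and it assumes the monitored quantity is a validation loss where lower is better; the card only states the patience and delta values.

```python
class EarlyStopping:
    """Stop training once the monitored loss stops improving.

    Hypothetical helper: the card only states patience=50 and delta=1e-6.
    Monitoring a validation loss (lower is better) is an assumption.
    """

    def __init__(self, patience=50, delta=1e-6):
        self.patience = patience
        self.delta = delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.delta:
            # Improvement larger than delta: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with a short patience so the behaviour is easy to follow.
stopper = EarlyStopping(patience=3, delta=1e-6)
losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With a patience of 50, training would continue for up to 50 consecutive epochs without an improvement of more than 1e-6 before stopping.
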
## How to Use

```python
import torch

from model import ViT  # custom ViT implementation (local `model` module)

# Rebuild the architecture with the same hyperparameters used for training.
model = ViT(
    img_size=(128, 128),
    patch_size=16,
    in_channels=3,
    embed_dim=240,
    n_classes=37,
    n_blocks=12,
    n_heads=4,
    mlp_ratio=2.0,
    qkv_bias=True,
    block_drop_p=0.1,
    attn_drop_p=0.1,
)

# Load the trained weights onto the CPU, then switch to inference mode
# so that dropout is disabled.
model.load_state_dict(torch.load("ViTPets.pth", map_location="cpu"))
model.eval()
```

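
Because the model was trained on multiple GPUs, the checkpoint keys may carry a `module.` prefix if the weights were saved from a `DataParallel` or `DistributedDataParallel` wrapper. Whether `ViTPets.pth` does so is not stated here, so treat the following as an optional sketch; `strip_module_prefix` is a hypothetical helper and the key names in the toy example are placeholders.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Toy example: integers stand in for tensors, key names are placeholders.
wrapped = OrderedDict([("module.head.weight", 1), ("module.head.bias", 2)])
cleaned = strip_module_prefix(wrapped)
print(list(cleaned))
```

If `load_state_dict` reports unexpected keys starting with `module.`, passing the loaded dictionary through this helper before calling `model.load_state_dict(...)` should resolve the mismatch. Keys without the prefix are left unchanged, so the helper is safe to apply either way.
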