chitter99/vit_oxford_pets_patch16_128

	---
	license: apache-2.0
	tags:
	- image-classification
	- vision-transformer
	- pytorch
	- oxford-pets
	library_name: torch
	datasets:
	- cvdl/oxford-pets
	language: []
	model-index:
	- name: ViTPets
	  results:
	  - task:
	      type: image-classification
	    dataset:
	      name: Oxford Pets
	      type: cvdl/oxford-pets
	    metrics:
	    - type: accuracy
	      value: 9
	---

	# ViTPets - Vision Transformer trained from scratch on Oxford Pets 🐶🐱

	This model is a Vision Transformer (ViT) trained from scratch on the [Oxford Pets dataset](https://huggingface.co/datasets/cvdl/oxford-pets). It classifies images of cats and dogs into 37 different breeds.

	## Model Summary

	- Architecture: Custom Vision Transformer (ViT)
	- Input resolution: 128x128
	- Patch size: 16x16
	- Embedding dimension: 240
	- Number of Transformer blocks: 12
	- Number of heads: 4
	- MLP ratio: 2.0
	- Dropout: 10% on attention and MLP
	- Framework: PyTorch
	- Dataset: Oxford Pets (via 🤗 `cvdl/oxford-pets`)
	- Loss: CrossEntropyLoss
	- Optimizer: SGD with LR = 0.00257
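
The hyperparameters above fix a few derived sizes. As a quick sanity check, the sketch below computes them, assuming the usual ViT layout: non-overlapping square patches, an embedding dimension split evenly across heads, and an MLP hidden width of `embed_dim * mlp_ratio`. The variable names are illustrative and are not taken from the model's source.

```python
# Derived sizes for the configuration listed in the Model Summary.
# Assumes a standard ViT layout; names here are illustrative only.
img_size = 128
patch_size = 16
in_channels = 3
embed_dim = 240
n_heads = 4
mlp_ratio = 2.0

patches_per_side = img_size // patch_size            # patches along one image edge
n_patches = patches_per_side ** 2                    # tokens produced from one image
patch_dim = in_channels * patch_size * patch_size    # raw values in one RGB patch
head_dim = embed_dim // n_heads                      # width of each attention head
mlp_hidden = int(embed_dim * mlp_ratio)              # hidden width of each MLP block

print(f"patches per image: {n_patches}")
print(f"values per patch:  {patch_dim}")
print(f"head dimension:    {head_dim}")
print(f"MLP hidden width:  {mlp_hidden}")
```

With these settings each image becomes 64 patch tokens, and each of the 4 heads attends in a 60-dimensional subspace.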

	## Training Setup

	- Device: Multi-GPU (4 GPUs)
	- Batch size: 256 (64 × 4 GPUs)
	- Early stopping: patience 50, delta 1e-6
	- Logging: TensorBoard
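
The early-stopping rule above can be sketched as follows. This is a minimal illustration, not the training code used for this model. It assumes the monitored quantity is a validation loss checked once per epoch, and that an improvement only counts when the loss drops by more than `delta`. The `EarlyStopping` class is hypothetical.

```python
class EarlyStopping:
    """Hypothetical helper: signal a stop after `patience` checks without improvement."""

    def __init__(self, patience=50, delta=1e-6):
        self.patience = patience
        self.delta = delta
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, loss):
        """Record one loss value; return True when training should stop."""
        if loss < self.best - self.delta:
            # Improvement larger than delta: remember it and reset the counter.
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Small demonstration with patience=3 so the stop is easy to follow.
stopper = EarlyStopping(patience=3, delta=1e-6)
losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch index {stopped_at}, best loss {stopper.best}")
```

In the demonstration the loss stalls at 0.8 for three consecutive checks, so training stops before the final 0.7 is ever seen. A large patience such as 50 makes this kind of premature stop less likely.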

	## How to Use

	```python
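	# Assumes model.py (the file that defines ViT) is on the import path.
	import torch
	from model import ViT

	# Hyperparameters must match the checkpoint (see Model Summary).
	model = ViT(
	    img_size=(128, 128),
	    patch_size=16,
	    in_channels=3,
	    embed_dim=240,
	    n_classes=37,
	    n_blocks=12,
	    n_heads=4,
	    mlp_ratio=2.0,
	    qkv_bias=True,
	    block_drop_p=0.1,
	    attn_drop_p=0.1,
	)

	# map_location="cpu" lets the weights load on machines without a GPU.
	state_dict = torch.load("ViTPets.pth", map_location="cpu")
	model.load_state_dict(state_dict)
	model.eval()  # disable dropout for inference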
	```
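
Once the weights are loaded, prediction might look like the sketch below. It is not part of the original repository, and it rests on several assumptions: the `predict` helper is hypothetical, the model is assumed to return raw logits of shape `(N, 37)`, and a small stand-in network replaces the real `ViT` so the snippet runs without the checkpoint. The card does not document the preprocessing, so the random batch stands in for images already resized to 128x128 and normalized the same way as during training.

```python
import torch
from torch import nn


def predict(model, images):
    """Return (class_index, probability) for a batch of shape (N, 3, 128, 128)."""
    model.eval()
    with torch.no_grad():
        logits = model(images)                  # assumed shape: (N, 37)
        probs = torch.softmax(logits, dim=1)    # convert logits to probabilities
    confidence, class_idx = probs.max(dim=1)    # most likely breed per image
    return class_idx, confidence


# Stand-in for the loaded ViT so the sketch runs without the checkpoint.
stand_in = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 37))

# Placeholder batch; real inputs should use the training-time preprocessing.
batch = torch.rand(2, 3, 128, 128)
class_idx, confidence = predict(stand_in, batch)
print(class_idx.shape, confidence.shape)
```

Replace `stand_in` with the loaded `model` to run the real network. Mapping a class index back to a breed name depends on the label order of the dataset used for training.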