---
tags:
- vision-transformer
- image-classification
- simple
- imagenet100
- pytorch
license: apache-2.0
datasets:
- imagenet100
metrics:
- accuracy
---

# Simple ViT - ImageNet-100

This model was trained using the [vit-analysis](https://github.com/your-repo/vit-analysis) framework for analyzing Vision Transformer positional encoding methods.

## Model Details

| Property | Value |
|----------|-------|
| **Model Type** | SIMPLE Vision Transformer |
| **Dataset** | imagenet100 |
| **Best Accuracy** | 71.94% |
| **Image Size** | 224 × 224 |
| **Patch Size** | 16 × 16 |
| **Hidden Dim** | 192 |
| **Depth** | 12 |
| **Num Heads** | 3 |
| **MLP Dim** | 768 |
| **Num Classes** | 100 |

## Model Description

This model is a Vision Transformer with **learnable positional embeddings**: standard absolute positional embeddings that are learned jointly with the rest of the network during training.

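As a rough illustration (not the repo's actual code), learnable absolute positional embeddings are simply a trainable tensor with one vector per token, added to the patch embeddings before the transformer blocks. The names below are hypothetical; the shapes follow the table above (224px images, 16px patches, hidden dim 192):

```python
import torch
import torch.nn as nn

# 224px image with 16px patches -> 14 x 14 = 196 patch tokens
num_patches = (224 // 16) ** 2
hidden_dim = 192

# One learned vector per token (patches plus the class token);
# this parameter is updated by the optimizer like any other weight.
pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))

# After patch projection and class-token prepending, position information
# is injected by a broadcast addition over the batch dimension:
tokens = torch.randn(2, num_patches + 1, hidden_dim)  # [batch, tokens, dim]
tokens = tokens + pos_embedding
```

Because the embedding is indexed purely by token position, it encodes absolute (not relative) location and is tied to the training sequence length.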

## Usage

```python
import torch
from torchvision import transforms
from PIL import Image

from models import SimpleVisionTransformer

# Initialize model
model = SimpleVisionTransformer(
    image_size=224,
    patch_size=16,
    num_layers=12,
    num_heads=3,
    hidden_dim=192,
    mlp_dim=768,
    num_classes=100,
)

# Load checkpoint
checkpoint = torch.load('simple_vit_imagenet100_best.pth', map_location='cpu')
state_dict = checkpoint['state_dict']

# Remove 'module.' prefix if present (from DDP training)
state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(state_dict)
model.eval()

# Inference: standard ImageNet preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open('your_image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)
    prediction = output.argmax(dim=1)
```
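The model returns raw logits. If you want calibrated-looking scores or a top-k list rather than just the argmax, apply a softmax; the sketch below uses a random stand-in for `output` so it runs on its own:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(1, 100)  # stand-in for `output` from the snippet above

# Softmax over the class dimension yields probabilities summing to 1
probs = torch.softmax(logits, dim=1)
top5_prob, top5_idx = probs.topk(5, dim=1)  # five most likely class indices
```

`top5_idx` then maps into whatever class-index-to-label file accompanies the imagenet100 split you trained on.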

## Training

This model was trained with:
- **Framework:** PyTorch
- **Optimizer:** AdamW
- **Mixed Precision:** Enabled


## Citation

If you use this model, please cite:

```bibtex
@misc{vit-analysis,
  title={Vision Transformer Position Encoding Analysis},
  year={2024},
  url={https://github.com/your-repo/vit-analysis}
}
```

## License

Apache 2.0