aadex committed · verified
Commit 49c9a09 · Parent(s): 519bf7f

Add model card

Files changed (1): README.md (+108, -0)
README.md ADDED

---
tags:
- vision-transformer
- image-classification
- rope
- imagenet100
- pytorch
license: apache-2.0
datasets:
- imagenet100
metrics:
- accuracy
---

# RoPE ViT - ImageNet-100

This model was trained using the [vit-analysis](https://github.com/your-repo/vit-analysis) framework for analyzing positional-encoding methods in Vision Transformers.

## Model Details

| Property | Value |
|----------|-------|
| **Model Type** | RoPE Vision Transformer |
| **Dataset** | ImageNet-100 |
| **Best Accuracy** | 77.30% |
| **Image Size** | 224 × 224 |
| **Patch Size** | 16 |
| **Hidden Dim** | 192 |
| **Depth** | 12 |
| **Num Heads** | 3 |
| **MLP Dim** | 768 |
| **Num Classes** | 100 |

## Model Description

This is a Vision Transformer with **Rotary Position Embeddings (RoPE)**. Rather than adding learned position embeddings to the patch tokens, RoPE encodes position directly in the attention mechanism by rotating query and key vectors as a function of token position, so attention scores depend on relative rather than absolute position. This tends to generalize better to sequence lengths (and hence image resolutions) not seen during training.

- **RoPE Theta:** 10.0
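
For intuition, here is a minimal, self-contained sketch of a 1D rotary embedding applied to queries and keys before the attention product. It is illustrative only: `rope_rotate` is a hypothetical helper, and the actual rotation used by `RoPESimpleVisionTransformer` (for images, typically axial 2D frequencies over patch rows and columns) lives in the repo.

```python
import torch

def rope_rotate(x: torch.Tensor, theta: float = 10.0) -> torch.Tensor:
    """Rotate channel pairs of x by position-dependent angles.

    x: (seq_len, head_dim) with head_dim even.
    """
    seq_len, dim = x.shape
    # One frequency per channel pair, geometrically spaced by `theta`
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)  # (seq_len, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    # Standard 2D rotation applied to each channel pair
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Because both sides are rotated, each q·k score depends only on relative position
q, k = torch.randn(196, 64), torch.randn(196, 64)  # 14x14 patches, head_dim 192/3 = 64
attn_scores = rope_rotate(q) @ rope_rotate(k).T
```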

## Usage

```python
import torch
from models import RoPESimpleVisionTransformer

# Initialize the model with the same hyperparameters it was trained with
model = RoPESimpleVisionTransformer(
    image_size=224,
    patch_size=16,
    num_layers=12,
    num_heads=3,
    hidden_dim=192,
    mlp_dim=768,
    num_classes=100,
)

# Load checkpoint; fall back to the raw dict if weights are stored directly
checkpoint = torch.load('rope_vit_imagenet100_best.pth', map_location='cpu')
state_dict = checkpoint.get('state_dict', checkpoint)

# Remove the 'module.' prefix if present (left over from DDP training)
state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(state_dict)
model.eval()

# Inference
from torchvision import transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, center-crop, normalize
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open('your_image.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)  # add batch dimension

with torch.no_grad():
    output = model(input_tensor)       # (1, 100) logits
    prediction = output.argmax(dim=1)  # predicted class index
```
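
To turn the logits into ranked predictions, you can softmax them and take the top-k; the mapping from class index to class name depends on your particular ImageNet-100 split, so it is not shown here:

```python
# Top-5 predictions from the `output` logits above
probs = output.softmax(dim=1)
top5 = probs.topk(5, dim=1)
for p, idx in zip(top5.values[0], top5.indices[0]):
    print(f"class {idx.item()}: {p.item():.3f}")
```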

## Training

This model was trained with:
- **Framework:** PyTorch
- **Optimizer:** AdamW
- **Mixed Precision:** Enabled (see the sketch below)
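
Hyperparameters such as the learning rate and schedule are not recorded in this card. As a rough sketch with placeholder values, one AdamW training step under `torch.cuda.amp` mixed precision looks like this:

```python
import torch
import torch.nn.functional as F

# lr and weight_decay are placeholders, not the values used for this checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scaler = torch.cuda.amp.GradScaler()

def train_step(images, labels):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in reduced precision where safe
    with torch.cuda.amp.autocast():
        logits = model(images)
        loss = F.cross_entropy(logits, labels)
    # Scale the loss to avoid fp16 gradient underflow, then unscale and step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```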

## Citation

If you use this model, please cite:

```bibtex
@misc{vit-analysis,
  title={Vision Transformer Position Encoding Analysis},
  year={2024},
  url={https://github.com/your-repo/vit-analysis}
}
```

## License

Apache 2.0