Vietnamese Scene Classification with Vision Transformer
A fine-tuned google/vit-base-patch16-224 (86M parameters) that classifies Vietnamese scenes into the eight classes listed below.
Model Details
- Architecture: Vision Transformer (ViT-Base, patch size 16, 224×224 input)
- Parameters: 86M
- Base Model: google/vit-base-patch16-224 (pretrained on ImageNet-21k, then fine-tuned on ImageNet-1k at 224×224)
- Task: Multi-class Vietnamese scene classification
- Framework: PyTorch + HuggingFace Transformers
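The 86M figure can be sanity-checked by hand from the standard ViT-Base dimensions (hidden size 768, 12 layers, MLP size 3072). The sketch below is only an arithmetic check, not code from this repository; it assumes an 8-way linear head (one output per class in the table) and no pooling layer, which is how `ViTForImageClassification` is normally built.

```python
def vit_base_shape(image_size=224, patch_size=16, channels=3,
                   hidden=768, layers=12, mlp_dim=3072, num_classes=8):
    """Return (sequence_length, parameter_count) for a ViT classifier.

    num_classes=8 is an assumption based on the eight scene classes listed.
    """
    num_patches = (image_size // patch_size) ** 2      # 14 * 14 = 196 patches
    seq_len = num_patches + 1                          # plus the [CLS] token

    patch_embed = patch_size * patch_size * channels * hidden + hidden
    cls_token = hidden
    pos_embed = seq_len * hidden

    attention = 4 * (hidden * hidden + hidden)         # Q, K, V and output projections
    mlp = (hidden * mlp_dim + mlp_dim) + (mlp_dim * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                     # two LayerNorms per block
    per_layer = attention + mlp + layer_norms

    final_norm = 2 * hidden
    head = hidden * num_classes + num_classes          # linear classification head

    total = (patch_embed + cls_token + pos_embed
             + layers * per_layer + final_norm + head)
    return seq_len, total


seq_len, total_params = vit_base_shape()
print(seq_len, f"{total_params / 1e6:.1f}M")
```

Under these assumptions the model sees 197 tokens per image and has roughly 85.8M parameters, which rounds to the 86M quoted above.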
Training
- Data Augmentation: RandomResizedCrop, RandomHorizontalFlip, ColorJitter
- Optimizer: AdamW (lr=3e-5, weight_decay=0.01)
- Scheduler: OneCycleLR with cosine annealing
- Gradient Clipping: max_norm=1.0
- Validation Accuracy: above 94%
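As an illustration of how the optimizer, scheduler, and clipping settings listed above fit together, here is a minimal sketch of one possible training loop. It is not the original training script: the tiny linear model, the random tensors, and the epoch and step counts are stand-ins chosen so the example runs without downloading ViT, and treating 3e-5 as the `max_lr` of `OneCycleLR` is an assumption.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

torch.manual_seed(0)

# Stand-ins (assumptions): a linear layer instead of ViT, random features and labels.
model = nn.Linear(16, 8)
features = torch.randn(32, 16)
labels = torch.randint(0, 8, (32,))

epochs, steps_per_epoch, batch_size = 2, 4, 8   # hypothetical schedule length

optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = OneCycleLR(
    optimizer,
    max_lr=3e-5,                 # assumption: the listed lr is the peak of the cycle
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    anneal_strategy="cos",       # cosine annealing
)
criterion = nn.CrossEntropyLoss()

learning_rates = []
model.train()
for epoch in range(epochs):
    for step in range(steps_per_epoch):
        batch = slice(step * batch_size, (step + 1) * batch_size)
        optimizer.zero_grad()
        loss = criterion(model(features[batch]), labels[batch])
        loss.backward()
        # Rescale gradients so their global L2 norm is at most 1.0.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()         # OneCycleLR is stepped once per batch
        learning_rates.append(scheduler.get_last_lr()[0])
```

Clipping is applied after `backward()` and before `optimizer.step()`, and the scheduler advances per batch rather than per epoch, which is what `OneCycleLR` expects.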
Scene Classes (Vietnamese)
| English | Vietnamese |
|---|---|
| Beach | Bãi biển |
| City | Thành phố |
| Forest | Rừng |
| Mountain | Núi |
| Rice Field | Ruộng lúa |
| Market | Chợ |
| Temple | Chùa |
| River | Sông |
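The table above maps naturally onto the `id2label` / `label2id` dictionaries that a Transformers config stores. The sketch below shows one way to build them; the index order simply follows the table and is an assumption, so check `model.config.id2label` for the order the published checkpoint actually uses. The `describe` helper is hypothetical.

```python
# English and Vietnamese names copied from the table; index order is an assumption.
SCENE_CLASSES = [
    ("Beach", "Bãi biển"),
    ("City", "Thành phố"),
    ("Forest", "Rừng"),
    ("Mountain", "Núi"),
    ("Rice Field", "Ruộng lúa"),
    ("Market", "Chợ"),
    ("Temple", "Chùa"),
    ("River", "Sông"),
]

id2label = {index: english for index, (english, _) in enumerate(SCENE_CLASSES)}
label2id = {english: index for index, english in id2label.items()}
english_to_vietnamese = dict(SCENE_CLASSES)


def describe(class_id):
    """Return a bilingual display name for a predicted class index."""
    english = id2label[class_id]
    return f"{english} ({english_to_vietnamese[english]})"


print(describe(4))
```

When fine-tuning from the base checkpoint, dictionaries like these can be passed to `ViTForImageClassification.from_pretrained` together with `num_labels=8` and `ignore_mismatched_sizes=True` so the 1000-class ImageNet head is replaced.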
Features
- Attention rollout visualization across 12 transformer layers
- Per-class precision, recall, and F1 metrics
- Formatted confusion matrix output
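The features above can be sketched with a few NumPy helpers. This is an illustrative reimplementation, not the repository's code: `attention_rollout` follows the usual recipe (average heads, add the identity for the residual path, renormalize, multiply across layers), and the metric helpers assume a confusion matrix with true classes on rows and predicted classes on columns. The attention maps and the 3-class confusion matrix are made-up inputs, not results from this model.

```python
import numpy as np


def attention_rollout(attentions):
    """attentions: sequence of arrays shaped (heads, tokens, tokens), one per layer."""
    tokens = attentions[0].shape[-1]
    identity = np.eye(tokens)
    rollout = identity.copy()
    for layer_attention in attentions:
        fused = layer_attention.mean(axis=0)          # average over heads
        fused = 0.5 * fused + 0.5 * identity          # account for the residual connection
        fused /= fused.sum(axis=-1, keepdims=True)    # keep every row a distribution
        rollout = fused @ rollout                     # chain with the earlier layers
    return rollout


def per_class_metrics(confusion):
    """Return (precision, recall, f1) arrays; rows are true classes, columns predictions."""
    confusion = np.asarray(confusion, dtype=float)
    true_positive = np.diag(confusion)
    predicted = confusion.sum(axis=0)
    actual = confusion.sum(axis=1)
    zeros = np.zeros_like(true_positive)
    precision = np.divide(true_positive, predicted, out=zeros.copy(), where=predicted > 0)
    recall = np.divide(true_positive, actual, out=zeros.copy(), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)
    return precision, recall, f1


def format_confusion(confusion, labels):
    """Render the confusion matrix as an aligned text table."""
    width = max(len(name) for name in labels) + 2
    header = " " * width + "".join(f"{name:>{width}}" for name in labels)
    rows = [
        f"{name:<{width}}" + "".join(f"{int(value):>{width}}" for value in row)
        for name, row in zip(labels, confusion)
    ]
    return "\n".join([header] + rows)


# Made-up attention maps: 12 layers, 12 heads, 197 tokens (1 [CLS] + 14 * 14 patches).
rng = np.random.default_rng(0)
raw = rng.random((12, 12, 197, 197))
fake_attentions = raw / raw.sum(axis=-1, keepdims=True)
rollout = attention_rollout(fake_attentions)
cls_heatmap = rollout[0, 1:].reshape(14, 14)          # [CLS] relevance per image patch

# Made-up 3-class confusion matrix, for illustration only.
toy_labels = ["Beach", "City", "Forest"]
toy_confusion = [[8, 1, 1], [2, 7, 1], [0, 2, 8]]
precision, recall, f1 = per_class_metrics(toy_confusion)
print(format_confusion(toy_confusion, toy_labels))
```

With the real model, per-layer attention maps can be requested with `model(**inputs, output_attentions=True)`, which returns one `(batch, heads, tokens, tokens)` tensor per layer; recent Transformers versions may require loading the model with `attn_implementation="eager"` for this to work.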
Usage
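import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Fine-tuned weights; the preprocessing config is taken from the base model.
model = ViTForImageClassification.from_pretrained("sanvo/vietnamese-vit-classification")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Convert to RGB so grayscale or RGBA files match the 3-channel input.
image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])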
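The snippet above reports only the single most likely class. If ranked predictions with probabilities are more useful, something like the following hypothetical helper could be applied to the logits. It is a sketch using NumPy; the example logits and the four-entry label map are invented so the block runs without the model.

```python
import numpy as np


def top_k_predictions(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one image."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                 # stabilise the softmax numerically
    probabilities = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probabilities)[::-1][:k]     # indices sorted by descending probability
    return [(id2label[int(i)], float(probabilities[i])) for i in order]


# Invented logits and label map, for illustration only.
example_id2label = {0: "Beach", 1: "City", 2: "Forest", 3: "Mountain"}
example_logits = [2.0, 0.5, 0.1, -1.0]
ranked = top_k_predictions(example_logits, example_id2label, k=3)
for label, probability in ranked:
    print(f"{label}: {probability:.3f}")
```

With the real model, the same function could be called as `top_k_predictions(outputs.logits[0].detach().numpy(), model.config.id2label)`.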