---
language: vi
license: mit
tags:
- vision-transformer
- image-classification
- vietnamese
- scene-classification
- pytorch
- transformers
base_model: google/vit-base-patch16-224
datasets:
- custom
pipeline_tag: image-classification
---

# Vietnamese Scene Classification with Vision Transformer

Fine-tuned [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) (86M parameters) for Vietnamese scene classification.

## Model Details

- **Architecture:** Vision Transformer (ViT-Base, patch size 16, 224×224 input)
- **Parameters:** 86M
- **Base Model:** google/vit-base-patch16-224 (pretrained on ImageNet-21k, fine-tuned on ImageNet-1k)
- **Task:** Multi-class Vietnamese scene classification (8 classes, listed below)
- **Framework:** PyTorch + HuggingFace Transformers

## Training

- **Data Augmentation:** RandomResizedCrop, RandomHorizontalFlip, ColorJitter
- **Optimizer:** AdamW (lr=3e-5, weight_decay=0.01)
- **Scheduler:** OneCycleLR with cosine annealing
- **Gradient Clipping:** max_norm=1.0
- **Validation Accuracy:** 94%+

## Scene Classes (Vietnamese)

| English | Vietnamese |
|---------|-----------|
| Beach | Bãi biển |
| City | Thành phố |
| Forest | Rừng |
| Mountain | Núi |
| Rice Field | Ruộng lúa |
| Market | Chợ |
| Temple | Chùa |
| River | Sông |

## Features

- Attention rollout visualization across 12 transformer layers
- Per-class precision, recall, and F1 metrics
- Formatted confusion matrix output

## Usage

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Load the fine-tuned weights; the processor comes from the base checkpoint.
model = ViTForImageClassification.from_pretrained("sanvo/vietnamese-vit-classification")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Convert to RGB so grayscale or RGBA files match the 3-channel input.
image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```

## Links

- **GitHub:** [svn05/vietnamese-vit-classification](https://github.com/svn05/vietnamese-vit-classification)
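
## Example: Per-Class Metrics from a Confusion Matrix

The Features section lists per-class precision, recall, and F1 together with a formatted confusion matrix. The sketch below shows one way such numbers can be derived from true and predicted class indices using only the standard library. It is not taken from the repository: the function names (`confusion_matrix`, `per_class_metrics`, `format_confusion_matrix`), the class order in `CLASSES`, and the toy labels are all assumptions made for illustration.

```python
from typing import Dict, List

# Class names come from the table in this card; their index order is an assumption.
CLASSES = ["Beach", "City", "Forest", "Mountain",
           "Rice Field", "Market", "Temple", "River"]


def confusion_matrix(y_true: List[int], y_pred: List[int], num_classes: int) -> List[List[int]]:
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_idx, pred_idx in zip(y_true, y_pred):
        matrix[true_idx][pred_idx] += 1
    return matrix


def per_class_metrics(matrix: List[List[int]]) -> List[Dict[str, float]]:
    """Return precision, recall, and F1 for each class (0.0 when undefined)."""
    n = len(matrix)
    results = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, was another class
        fn = sum(matrix[c]) - tp                       # was c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


def format_confusion_matrix(matrix: List[List[int]], labels: List[str]) -> str:
    """Render the matrix as a fixed-width text table with row and column labels."""
    width = max(len(label) for label in labels) + 2
    header = " " * width + "".join(f"{label:>{width}}" for label in labels)
    rows = [
        f"{labels[i]:<{width}}" + "".join(f"{value:>{width}}" for value in row)
        for i, row in enumerate(matrix)
    ]
    return "\n".join([header] + rows)


# Made-up labels for illustration only; these are not results from this model.
toy_true = [0, 0, 1, 1, 2, 3, 4, 5, 6, 7]
toy_pred = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
toy_matrix = confusion_matrix(toy_true, toy_pred, len(CLASSES))
print(format_confusion_matrix(toy_matrix, CLASSES))
for name, stats in zip(CLASSES, per_class_metrics(toy_matrix)):
    print(f"{name:<12} P={stats['precision']:.2f} R={stats['recall']:.2f} F1={stats['f1']:.2f}")
```

In practice `toy_true` and `toy_pred` would be replaced by validation labels and the `argmax` of the model's logits, with indices mapped through `model.config.id2label`.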