---
language: vi
license: mit
tags:
- vision-transformer
- image-classification
- vietnamese
- scene-classification
- pytorch
- transformers
base_model: google/vit-base-patch16-224
datasets:
- custom
pipeline_tag: image-classification
---

# Vietnamese Scene Classification with Vision Transformer

This model is [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) (86M parameters) fine-tuned to classify images of Vietnamese scenes.

## Model Details

- **Architecture:** Vision Transformer (ViT-Base, patch size 16, 224×224 input)
- **Parameters:** 86M
- **Base Model:** google/vit-base-patch16-224 (pretrained on ImageNet-21k, then fine-tuned on ImageNet-1k)
- **Task:** Multi-class Vietnamese scene classification
- **Framework:** PyTorch + HuggingFace Transformers

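As a quick check of the input geometry implied by the architecture line above, the arithmetic below works out how many tokens the encoder sees per image. It assumes the standard ViT layout of non-overlapping patches plus one prepended `[CLS]` token; the variable names are illustrative only.

```python
# Input geometry for ViT-Base with patch size 16 and 224x224 input.
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size   # 14 patches along each axis
num_patches = patches_per_side ** 2           # 14 * 14 = 196 patches per image
seq_len = num_patches + 1                     # +1 for the [CLS] token = 197

print(f"{patches_per_side}x{patches_per_side} grid, {num_patches} patches, {seq_len} tokens")
```

The 14×14 grid is also the shape an attention map over patches takes when it is reshaped for visualization.
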
## Training

- **Data Augmentation:** RandomResizedCrop, RandomHorizontalFlip, ColorJitter
- **Optimizer:** AdamW (lr=3e-5, weight_decay=0.01)
- **Scheduler:** OneCycleLR with cosine annealing
- **Gradient Clipping:** max_norm=1.0
- **Validation Accuracy:** 94%+

## Scene Classes (Vietnamese)

| English | Vietnamese |
|---------|-----------|
| Beach | Bãi biển |
| City | Thành phố |
| Forest | Rừng |
| Mountain | Núi |
| Rice Field | Ruộng lúa |
| Market | Chợ |
| Temple | Chùa |
| River | Sông |

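If the checkpoint's `id2label` holds the English names, a small lookup built from the table above can display predictions in Vietnamese. That is an assumption: the card does not state which language or index order the labels use. `EN_TO_VI` and `to_vietnamese` are hypothetical names introduced for this sketch.

```python
# English-to-Vietnamese class names, copied from the table above.
EN_TO_VI = {
    "Beach": "Bãi biển",
    "City": "Thành phố",
    "Forest": "Rừng",
    "Mountain": "Núi",
    "Rice Field": "Ruộng lúa",
    "Market": "Chợ",
    "Temple": "Chùa",
    "River": "Sông",
}

def to_vietnamese(label: str) -> str:
    """Return the Vietnamese name for a label, or the label itself if unknown."""
    return EN_TO_VI.get(label, label)

print(to_vietnamese("Rice Field"))
```

Falling back to the input label keeps the helper safe if the checkpoint already stores Vietnamese names.
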
## Features

- Attention rollout visualization across 12 transformer layers
- Per-class precision, recall, and F1 metrics
- Formatted confusion matrix output

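For readers unfamiliar with attention rollout, the sketch below shows the commonly used formulation: average the heads, add the identity for the residual path, re-normalize the rows, and multiply the result across layers. It is not necessarily the repository's implementation. It runs on random attention maps so that it is self-contained. With the real model, the per-layer maps would come from `model(**inputs, output_attentions=True).attentions`, which in recent Transformers versions may require loading the model with `attn_implementation="eager"`.

```python
import torch

def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token relevance map.

    attentions: sequence of tensors shaped (batch, heads, tokens, tokens),
    ordered from the first layer to the last.
    """
    rollout = None
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=1)                      # average over heads
        eye = torch.eye(attn.size(-1), dtype=attn.dtype, device=attn.device)
        attn = 0.5 * attn + 0.5 * eye                      # model the residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)       # keep each row summing to 1
        rollout = attn if rollout is None else attn @ rollout
    return rollout

# Synthetic stand-in: 12 layers, 12 heads, 197 tokens (196 patches + [CLS]).
torch.manual_seed(0)
layers = [torch.softmax(torch.randn(1, 12, 197, 197), dim=-1) for _ in range(12)]

rollout = attention_rollout(layers)
# Relevance of each image patch to the [CLS] token, as a 14x14 grid.
cls_to_patches = rollout[0, 0, 1:].reshape(14, 14)
```

The 14×14 grid can be upsampled to 224×224 and overlaid on the input image as a heatmap.
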
## Usage

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Load the fine-tuned weights; preprocessing comes from the base model's processor
model = ViTForImageClassification.from_pretrained("sanvo/vietnamese-vit-classification")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Convert to RGB so grayscale or RGBA files still yield 3 channels
image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```

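The per-class precision, recall, and F1 metrics and the confusion matrix listed under Features can be computed from predicted class indices like the one produced above. The sketch below uses NumPy only. The function names, the toy labels, and the three-class index order are hypothetical and are not taken from the repository.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_metrics(cm):
    """Return (precision, recall, f1) arrays, using 0 where a value is undefined."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy example with three classes; the index order here is hypothetical.
class_names = ["Beach", "City", "Forest"]
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=len(class_names))
precision, recall, f1 = per_class_metrics(cm)

print(f"{'class':<10}{'precision':>10}{'recall':>10}{'f1':>10}")
for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f"{name:<10}{p:>10.3f}{r:>10.3f}{f:>10.3f}")
```

On a real validation set, `y_pred` would hold the `argmax` of the logits for each image and `y_true` the matching ground-truth indices.
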
## Links

- **GitHub:** [svn05/vietnamese-vit-classification](https://github.com/svn05/vietnamese-vit-classification)