Vietnamese Scene Classification with Vision Transformer

A fine-tuned google/vit-base-patch16-224 (86M parameters) that classifies images of Vietnamese scenes into eight classes.

Model Details

  • Architecture: Vision Transformer (ViT-Base, patch size 16, 224×224 input)
  • Parameters: 86M
  • Base Model: google/vit-base-patch16-224 (pretrained on ImageNet-21k, then fine-tuned on ImageNet-1k)
  • Task: Multi-class classification of Vietnamese scenes (8 classes)
  • Framework: PyTorch + HuggingFace Transformers
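
To make the architecture line concrete, the short sketch below works out how many tokens the encoder sees for a 224×224 input split into 16×16 patches. The function name is illustrative and not part of the model's code; the extra token is the [CLS] token that the standard ViT classifier prepends.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Tokens fed to the encoder: one per image patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 224 // 16 = 14
    num_patches = patches_per_side ** 2          # 14 * 14 = 196
    return num_patches + 1                       # 196 patches + [CLS] = 197


print(vit_sequence_length())  # 197
```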

Training

  • Data Augmentation: RandomResizedCrop, RandomHorizontalFlip, ColorJitter
  • Optimizer: AdamW (lr=3e-5, weight_decay=0.01)
  • Scheduler: OneCycleLR with cosine annealing (anneal_strategy="cos")
  • Gradient Clipping: max_norm=1.0
  • Validation Accuracy: above 94%

Scene Classes (Vietnamese)

English       Vietnamese
Beach         Bãi biển
City          Thành phố
Forest        Rừng
Mountain      Núi
Rice Field    Ruộng lúa
Market        Chợ
Temple        Chùa
River         Sông
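
If Vietnamese display names are needed alongside predictions, the table can be expressed as a small lookup. This is a sketch under one assumption: the class indices below simply follow the table order, which may not match the released checkpoint, so model.config.id2label remains the authoritative mapping. The names SCENE_CLASSES and to_vietnamese are illustrative.

```python
# English/Vietnamese pairs copied from the table above.
SCENE_CLASSES = [
    ("Beach", "Bãi biển"),
    ("City", "Thành phố"),
    ("Forest", "Rừng"),
    ("Mountain", "Núi"),
    ("Rice Field", "Ruộng lúa"),
    ("Market", "Chợ"),
    ("Temple", "Chùa"),
    ("River", "Sông"),
]

# Assumed index order (table order); check against model.config.id2label.
id2label = {i: english for i, (english, _) in enumerate(SCENE_CLASSES)}
label2id = {english: i for i, english in id2label.items()}
en2vi = dict(SCENE_CLASSES)


def to_vietnamese(english_label: str) -> str:
    """Translate an English class name into its Vietnamese display name."""
    return en2vi[english_label]


print(to_vietnamese("Rice Field"))  # Ruộng lúa
```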

Features

  • Attention rollout visualization across 12 transformer layers
  • Per-class precision, recall, and F1 metrics
  • Formatted confusion matrix output
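
The features listed above could be approximated with the NumPy sketch below. It is not the repository's implementation. attention_rollout follows the usual recipe (average the heads, add the identity for the residual path, renormalize, multiply layer by layer), and per_class_metrics derives precision, recall, and F1 from a confusion matrix. All function names are illustrative. In Transformers, per-layer attention weights are typically obtained by passing output_attentions=True to the model.

```python
import numpy as np


def attention_rollout(attentions):
    """attentions: one array per layer, each shaped (heads, tokens, tokens)."""
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attention in attentions:
        fused = layer_attention.mean(axis=0)               # average over heads
        fused = fused + np.eye(num_tokens)                 # residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # rows sum to 1 again
        rollout = fused @ rollout                          # accumulate over layers
    return rollout


def cls_patch_map(rollout, grid: int = 14):
    """Relevance of each patch to the [CLS] token, as a grid x grid map."""
    return rollout[0, 1:].reshape(grid, grid)


def confusion_matrix(y_true, y_pred, num_classes: int):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        cm[true_label, predicted_label] += 1
    return cm


def per_class_metrics(cm):
    """Return (precision, recall, f1) arrays; undefined ratios become 0."""
    tp = np.diag(cm).astype(float)
    predicted, actual = cm.sum(axis=0), cm.sum(axis=1)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1
```

For ViT-Base at 224×224 the rollout is a 197×197 matrix, and cls_patch_map returns the 14×14 map that can be upsampled and overlaid on the input image.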

Usage

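import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Load the fine-tuned classifier and the base model's preprocessing config
model = ViTForImageClassification.from_pretrained("sanvo/vietnamese-vit-classification")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Convert to RGB so grayscale or RGBA files match the 3-channel input
image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])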
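
When a single label is not enough, the logits can be turned into ranked probabilities. The helper below is a hypothetical addition, not part of the model repository; it assumes logits for one image with shape (1, num_classes).

```python
import torch


def top_k_predictions(logits: torch.Tensor, id2label: dict, k: int = 3):
    """Return [(label, probability), ...] for the k most likely classes."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)  # (num_classes,)
    k = min(k, probs.numel())                         # never ask for more than exist
    top = torch.topk(probs, k)
    return [
        (id2label[index], prob)
        for prob, index in zip(top.values.tolist(), top.indices.tolist())
    ]
```

With the objects from the usage snippet, top_k_predictions(outputs.logits, model.config.id2label) would list the three most likely scene classes and their probabilities.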
