---
language: vi
license: mit
tags:
  - vision-transformer
  - image-classification
  - vietnamese
  - scene-classification
  - pytorch
  - transformers
base_model: google/vit-base-patch16-224
datasets:
  - custom
pipeline_tag: image-classification
---

Vietnamese Scene Classification with Vision Transformer

A fine-tuned google/vit-base-patch16-224 (86M parameters) that classifies images of Vietnamese scenes into eight classes.

Model Details

  • Architecture: Vision Transformer (ViT-Base, patch size 16, 224×224 input)
  • Parameters: 86M
  • Base Model: google/vit-base-patch16-224 (pretrained on ImageNet-21k, then fine-tuned on ImageNet-1k)
  • Task: Multi-class classification of Vietnamese scenes (8 classes)
  • Framework: PyTorch + HuggingFace Transformers

Training

  • Data Augmentation: RandomResizedCrop, RandomHorizontalFlip, ColorJitter
  • Optimizer: AdamW (lr=3e-5, weight_decay=0.01)
  • Scheduler: OneCycleLR with cosine annealing
  • Gradient Clipping: max_norm=1.0
  • Validation Accuracy: 94%+
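
The optimizer, scheduler, and clipping settings listed above could be wired together roughly as follows. This is a minimal sketch, not the training script behind this model: the helper names, `total_steps`, and the tiny linear layer standing in for the ViT are all hypothetical, and it assumes lr=3e-5 is the peak (`max_lr`) of the one-cycle schedule, which the card does not state.

```python
import torch
from torch import nn


def build_optimizer_and_scheduler(model, total_steps, lr=3e-5, weight_decay=0.01):
    """AdamW plus OneCycleLR with cosine annealing (hyperparameters from the card)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    # Assumption: the listed lr is used as the peak of the one-cycle schedule.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, total_steps=total_steps, anneal_strategy="cos"
    )
    return optimizer, scheduler


def train_step(model, batch, labels, optimizer, scheduler, max_norm=1.0):
    """One optimization step with gradient clipping at max_norm."""
    model.train()
    optimizer.zero_grad()
    # The toy model returns raw logits; a ViTForImageClassification
    # would return an output object, so use `.logits` there instead.
    loss = nn.functional.cross_entropy(model(batch), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per batch
    return loss.item()


# Hypothetical stand-in: 16 input features, 8 scene classes, 10 training steps.
torch.manual_seed(0)
toy_model = nn.Linear(16, 8)
optimizer, scheduler = build_optimizer_and_scheduler(toy_model, total_steps=10)
losses = [
    train_step(
        toy_model,
        torch.randn(4, 16),
        torch.randint(0, 8, (4,)),
        optimizer,
        scheduler,
    )
    for _ in range(10)
]
```

Note that `OneCycleLR` raises an error if it is stepped more than `total_steps` times, so in a real run `total_steps` would typically be the number of epochs times the number of batches per epoch.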

Scene Classes (Vietnamese)

English       Vietnamese
Beach         Bãi biển
City          Thành phố
Forest        Rừng
Mountain      Núi
Rice Field    Ruộng lúa
Market        Chợ
Temple        Chùa
River         Sông
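
The class table above can be kept in code as a simple lookup. The sketch below is illustrative only: the helper names are hypothetical, and the index order (table order) is an assumption, so check `model.config.id2label` for the order the checkpoint actually uses.

```python
# English-to-Vietnamese names, copied from the class table.
SCENE_LABELS_VI = {
    "Beach": "Bãi biển",
    "City": "Thành phố",
    "Forest": "Rừng",
    "Mountain": "Núi",
    "Rice Field": "Ruộng lúa",
    "Market": "Chợ",
    "Temple": "Chùa",
    "River": "Sông",
}


def build_label_maps(class_names):
    """Build id2label / label2id dicts from an ordered list of class names."""
    id2label = dict(enumerate(class_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def to_vietnamese(english_label):
    """Return the Vietnamese name, falling back to the input if unknown."""
    return SCENE_LABELS_VI.get(english_label, english_label)


# Assumption: class indices follow the table order.
id2label, label2id = build_label_maps(list(SCENE_LABELS_VI))
```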

Features

  • Attention rollout visualization across 12 transformer layers
  • Per-class precision, recall, and F1 metrics
  • Formatted confusion matrix output
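
For readers who want to reproduce these features, here is a hedged NumPy sketch of two of them: attention rollout in a commonly used formulation (average the heads, mix in the identity for the residual connection, re-normalize, multiply across layers) and per-class precision, recall, and F1 derived from a confusion matrix. It is not the code shipped with this model. The attention tensors are random stand-ins; with the real model they would come from a forward pass with `output_attentions=True`, converted to NumPy arrays.

```python
import numpy as np


def attention_rollout(attentions):
    """Roll attention through layers; each item has shape (heads, tokens, tokens)."""
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attention in attentions:
        fused = layer_attention.mean(axis=0)           # average over heads
        fused = 0.5 * fused + 0.5 * identity           # account for the residual path
        fused = fused / fused.sum(axis=-1, keepdims=True)
        rollout = fused @ rollout
    return rollout


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true_idx, pred_idx in zip(y_true, y_pred):
        matrix[true_idx, pred_idx] += 1
    return matrix


def per_class_metrics(matrix):
    """Return (precision, recall, f1) arrays; undefined ratios become 0."""
    tp = np.diag(matrix).astype(float)
    predicted = matrix.sum(axis=0)
    actual = matrix.sum(axis=1)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Synthetic attention: 12 layers, 12 heads, 197 tokens (1 CLS + 14x14 patches).
rng = np.random.default_rng(0)
raw = rng.random((12, 12, 197, 197))
attentions = list(raw / raw.sum(axis=-1, keepdims=True))
rollout = attention_rollout(attentions)
cls_heatmap = rollout[0, 1:].reshape(14, 14)  # CLS attention over the patch grid

# Toy labels for three classes (hypothetical data).
matrix = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
precision, recall, f1 = per_class_metrics(matrix)
```

The `cls_heatmap` array can be upsampled to 224×224 and overlaid on the input image to visualize which patches influenced the prediction.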

Usage

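import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Fine-tuned weights; the processor (resize + normalization) comes from the base model.
model = ViTForImageClassification.from_pretrained("sanvo/vietnamese-vit-classification")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model.eval()

# Convert to RGB so grayscale or RGBA files match the 3-channel input ViT expects.
image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])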
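
The snippet above prints only the top class. If ranked predictions with probabilities are useful, a small helper along these lines may work. It is a sketch under assumptions: `top_k_predictions` is a hypothetical name, and the example logits and three-entry label map are made up so the block runs without downloading the model.

```python
import torch


def top_k_predictions(logits, id2label, k=3):
    """Return the k most likely (label, probability) pairs for one image."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    top = torch.topk(probs, k=min(k, probs.numel()))
    return [
        (id2label[idx], prob)
        for prob, idx in zip(top.values.tolist(), top.indices.tolist())
    ]


# Hypothetical logits and label map; with the real model, pass
# outputs.logits and model.config.id2label instead.
example_id2label = {0: "Beach", 1: "City", 2: "Forest"}
example_logits = torch.tensor([[2.0, 0.5, -1.0]])
predictions = top_k_predictions(example_logits, example_id2label, k=2)
```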

Links