
Motorcycle Brand Classification App

This app compares 3 image classification approaches on motorcycle images:

  • Fine-tuned ViT model (durovali/vit-motorcycle)
  • Zero-shot CLIP (openai/clip-vit-large-patch14)
  • OpenAI vision model (GPT-4.1 Vision)

Dataset Description

  • Custom dataset created manually from web images (Google Images, Unsplash, Pexels)
  • Number of classes: 6 (bmw, honda, kawasaki, suzuki, triumph, yamaha)
  • Total images: 55 (~9 images per class)
  • Split: 80% train (44 images) / 20% validation (11 images)
  • Image formats: mixed JPG and PNG
  • Dataset loaded using the Hugging Face imagefolder builder
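
The loading and splitting described above can be sketched roughly as follows. The directory name `data/motorcycles`, the one-subfolder-per-class layout, and the seed are assumptions for illustration; the original training script may differ.

```python
def expected_split_sizes(total: int, val_fraction: float) -> tuple[int, int]:
    """Return (train, validation) counts for a simple fractional split."""
    n_val = round(total * val_fraction)
    return total - n_val, n_val


def load_splits(data_dir: str = "data/motorcycles",
                val_fraction: float = 0.2,
                seed: int = 42):
    """Load an image folder and split it into train/validation sets.

    Assumes one subfolder per class (bmw/, honda/, ...). The path and seed
    are placeholders, not values taken from the original project.
    """
    from datasets import load_dataset  # lazy import; requires the `datasets` package

    dataset = load_dataset("imagefolder", data_dir=data_dir, split="train")
    splits = dataset.train_test_split(test_size=val_fraction, seed=seed)
    return splits["train"], splits["test"]


print(expected_split_sizes(55, 0.2))  # (44, 11)
```

Calling `load_splits()` on a folder with 55 images should return datasets of 44 and 11 rows, matching the split listed above.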

Preprocessing Steps

  • Images loaded via the Hugging Face imagefolder dataset builder
  • Converted to RGB format
  • Training transforms: RandomResizedCrop (224px), RandomHorizontalFlip, ToTensor, Normalize
  • Validation transforms: Resize (224px), CenterCrop (224px), ToTensor, Normalize
  • Normalized with the mean and std values provided by the ViT image processor
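
A minimal sketch of the two transform pipelines listed above, using torchvision. The mean and std are passed in rather than hard-coded, since they should come from the image processor of the chosen checkpoint (for example `processor.image_mean` and `processor.image_std`). The function names are hypothetical.

```python
def build_transforms(mean, std, size: int = 224):
    """Return (train_transform, val_transform) as torchvision pipelines."""
    from torchvision import transforms  # lazy import; requires torchvision

    normalize = transforms.Normalize(mean=mean, std=std)
    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ])
    val_tf = transforms.Compose([
        transforms.Resize(size),      # shorter side resized to `size`
        transforms.CenterCrop(size),
        transforms.ToTensor(),
        normalize,
    ])
    return train_tf, val_tf


def apply_transform(batch: dict, transform) -> dict:
    """Convert each image in a batch to RGB, then apply the transform."""
    batch["pixel_values"] = [transform(img.convert("RGB")) for img in batch["image"]]
    return batch
```

With a Hugging Face dataset, this might be attached via `train_ds.set_transform(lambda b: apply_transform(b, train_tf))`, so that images are transformed lazily when accessed.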

Model and Evaluation

  • Base model: google/vit-base-patch16-224-in21k (Vision Transformer)
  • Fine-tuned with transfer learning on custom motorcycle dataset
  • Training: 10 epochs, learning rate 2e-5, batch size 16, fp16 on GPU (Tesla T4)
  • Hugging Face model: https://huggingface.co/durovali/vit-motorcycle
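
For orientation, the hyperparameters above could be expressed as `TrainingArguments` roughly like this. The output directory is a placeholder, and the step count assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math


def optimizer_steps(num_train: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps, keeping the last partial batch of each epoch."""
    return math.ceil(num_train / batch_size) * epochs


def make_training_args(output_dir: str = "vit-motorcycle"):
    """Build TrainingArguments matching the settings listed above."""
    from transformers import TrainingArguments  # lazy import; requires transformers and torch

    return TrainingArguments(
        output_dir=output_dir,            # placeholder path
        num_train_epochs=10,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        fp16=True,                        # needs a CUDA GPU such as the T4
    )


print(optimizer_steps(num_train=44, batch_size=16, epochs=10))  # 30
```

Under these assumptions, 44 training images give 3 steps per epoch and 30 weight updates in total, which may help explain why the training loss moves only slightly over 10 epochs.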

Training Performance

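| Training Loss | Epoch | Validation Loss | Accuracy |
|---------------|-------|-----------------|----------|
| 1.8418        | 1     | 1.7780          | 0.1818   |
| 1.7629        | 2     | 1.7664          | 0.4545   |
| 1.7311        | 3     | 1.7577          | 0.3636   |
| 1.7177        | 5     | 1.7402          | 0.4545   |
| 1.6922        | 10    | 1.7145          | 0.4545   |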
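
Because the validation split holds only 11 images, each accuracy value in the table corresponds to a whole number of correct predictions. The short sketch below recovers those counts and prints the chance level for uniform guessing over 6 classes; the variable names are illustrative.

```python
VAL_SIZE = 11      # validation images (20% of 55)
NUM_CLASSES = 6

# (epoch, validation accuracy) as reported in the table above
reported = [(1, 0.1818), (2, 0.4545), (3, 0.3636), (5, 0.4545), (10, 0.4545)]

# Each accuracy is (correct / 11), so rounding recovers the integer count.
correct_counts = {epoch: round(acc * VAL_SIZE) for epoch, acc in reported}
chance = 1 / NUM_CLASSES

for epoch, acc in reported:
    print(f"epoch {epoch:>2}: {correct_counts[epoch]}/{VAL_SIZE} correct "
          f"(accuracy {acc:.4f}, chance {chance:.4f})")
```

A single validation image changes the accuracy by about 9 percentage points, so differences between epochs should be read with caution.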

Example Image Results

The table below reports the true class, the Top-3 predictions from ViT and CLIP, and the label and confidence returned by the OpenAI model.

| Image | True Class | ViT Top-3 (score) | CLIP Top-3 (score) | OpenAI (label, confidence) |
|-------|------------|-------------------|--------------------|----------------------------|
| bmw.jpg | bmw | honda (0.2178), kawasaki (0.1764), suzuki (0.1591) | bmw (0.9804), yamaha (0.0165), triumph (0.0019) | bmw (0.95) |
| honda.jpg | honda | honda (0.2035), kawasaki (0.1863), yamaha (0.1599) | honda (0.4927), yamaha (0.4869), suzuki (0.0100) | honda (0.95) |
| kawasaki.jpg | kawasaki | kawasaki (0.2186), honda (0.2039), bmw (0.1713) | yamaha (0.7077), kawasaki (0.1124), bmw (0.0653) | kawasaki (0.95) |
| triumph.jpg | triumph | honda (0.2249), bmw (0.1748), kawasaki (0.1738) | triumph (0.9904), bmw (0.0071), yamaha (0.0017) | triumph (0.98) |
| yamaha.jpg | yamaha | honda (0.1947), yamaha (0.1914), kawasaki (0.1635) | yamaha (0.9057), bmw (0.0354), honda (0.0213) | yamaha (0.90) |
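
To tally Top-1 correctness from the table above, a small script is enough. The dictionary below simply transcribes the first prediction of each model per image; the structure and names are illustrative.

```python
# Top-1 predictions transcribed from the table: image -> (true, vit, clip, openai)
top1 = {
    "bmw.jpg":      ("bmw",      "honda",    "bmw",     "bmw"),
    "honda.jpg":    ("honda",    "honda",    "honda",   "honda"),
    "kawasaki.jpg": ("kawasaki", "kawasaki", "yamaha",  "kawasaki"),
    "triumph.jpg":  ("triumph",  "honda",    "triumph", "triumph"),
    "yamaha.jpg":   ("yamaha",   "honda",    "yamaha",  "yamaha"),
}


def count_correct(model_index: int) -> int:
    """Count rows where the model's Top-1 label equals the true class."""
    return sum(1 for row in top1.values() if row[model_index] == row[0])


scores = {
    name: count_correct(i)
    for i, name in enumerate(["vit", "clip", "openai"], start=1)
}

for name, n in scores.items():
    print(f"{name}: {n}/{len(top1)} correct ({n / len(top1):.0%})")
```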

Comparison Results

| Model | Type | Correct Predictions | Accuracy | Notes |
|-------|------|---------------------|----------|-------|
| ViT (fine-tuned) | Closed-set | 2/5 (honda, kawasaki) | 40% | Too few training images (~9 per class) |
| CLIP | Zero-shot (open-source) | 4/5 | 80% | Strong without any task-specific training |
| OpenAI GPT-4.1 | Zero-shot (closed-source) | 5/5 | 100% | All 5 correct; recognizes logos and design details |

Key finding: On the 5 example images, the fine-tuned ViT model underperforms, most likely because the dataset is very small (55 images in total). CLIP and the OpenAI model perform much better as zero-shot models because they were pre-trained on massive datasets. With more training data (500+ images per class), the fine-tuned model would likely outperform the zero-shot approaches.
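
Zero-shot classification with CLIP needs no training: each class name is inserted into a text prompt and the image is matched against the prompts. A minimal sketch using the `transformers` pipeline follows. The prompt template and the helper names are assumptions, not the app's actual code.

```python
BRANDS = ["bmw", "honda", "kawasaki", "suzuki", "triumph", "yamaha"]
TEMPLATE = "a photo of a {} motorcycle"  # assumed prompt wording


def top_k(results, k: int = 3):
    """Return the k highest-scoring (label, score) pairs."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["label"], round(r["score"], 4)) for r in ranked[:k]]


def clip_top3(image_path: str):
    """Score one image against all brand prompts with zero-shot CLIP."""
    from transformers import pipeline  # lazy import; downloads the model on first use

    classifier = pipeline(
        "zero-shot-image-classification",
        model="openai/clip-vit-large-patch14",
    )
    results = classifier(
        image_path,
        candidate_labels=BRANDS,
        hypothesis_template=TEMPLATE,
    )
    return top_k(results)
```

Calling `clip_top3("bmw.jpg")` would return three (brand, score) pairs in the same shape as the CLIP column of the results table; the exact scores depend on the prompt wording.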

Links