Motorcycle Brand Classification App
This app compares 3 image classification approaches on motorcycle images:
- Fine-tuned ViT model (`durovali/vit-motorcycle`)
- Zero-shot CLIP (`openai/clip-vit-large-patch14`)
- OpenAI vision model (GPT-4.1 Vision)
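To make the zero-shot approach concrete, here is a minimal sketch of how the CLIP checkpoint could be queried through the `transformers` zero-shot image classification pipeline. The prompt template and the helper names (`top_k`, `classify_zero_shot`) are assumptions for illustration and are not taken from the app's code.

```python
from typing import Dict, List, Tuple

# Class names from the dataset description below.
LABELS: List[str] = ["bmw", "honda", "kawasaki", "suzuki", "triumph", "yamaha"]


def top_k(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def classify_zero_shot(image_path: str) -> List[Tuple[str, float]]:
    """Score one image against all class names with CLIP, without any training."""
    # Imported lazily so top_k() can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline(
        "zero-shot-image-classification",
        model="openai/clip-vit-large-patch14",
    )
    results = classifier(
        image_path,
        candidate_labels=LABELS,
        # Hypothetical prompt wording; the app may use a different template.
        hypothesis_template="a photo of a {} motorcycle",
    )
    scores = {r["label"]: r["score"] for r in results}
    return top_k(scores)
```

Calling `classify_zero_shot("bmw.jpg")` would return a Top-3 list in the same shape as the CLIP column of the results table further down.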
Dataset Description
- Custom dataset created manually from web images (Google Images, Unsplash, Pexels)
- Number of classes: 6 (bmw, honda, kawasaki, suzuki, triumph, yamaha)
- Total images: 55 (~9 images per class)
- Split: 80% train (44 images) / 20% validation (11 images)
- Image formats: mixed JPG and PNG
- Dataset loaded using the Hugging Face `imagefolder` builder
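The loading and splitting described above could look roughly like the sketch below. It assumes one sub-folder per brand under a hypothetical `data_dir`, and it assumes the 80/20 split was made with `Dataset.train_test_split`; the seed is also an assumption.

```python
from typing import Tuple


def split_sizes(n_images: int, val_fraction: float = 0.2) -> Tuple[int, int]:
    """Return (train, validation) counts for a given validation fraction."""
    n_val = round(n_images * val_fraction)
    return n_images - n_val, n_val


def load_motorcycle_dataset(data_dir: str, seed: int = 42):
    """Load class-per-folder images and split them 80/20.

    `data_dir` and `seed` are hypothetical; the folder is assumed to contain
    one sub-directory per brand (bmw/, honda/, ...).
    """
    # Imported lazily so split_sizes() can be used without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("imagefolder", data_dir=data_dir, split="train")
    return dataset.train_test_split(test_size=0.2, seed=seed)


print(split_sizes(55))  # train and validation counts for the 55 images
```

The `imagefolder` builder derives the `label` column from the folder names, so no separate annotation file is needed.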
Preprocessing Steps
- Images loaded via the Hugging Face `imagefolder` dataset builder
- Converted to RGB format
- Training transforms: RandomResizedCrop (224px), RandomHorizontalFlip, ToTensor, Normalize
- Validation transforms: Resize (224px), CenterCrop (224px), ToTensor, Normalize
- Normalized with the mean and std values read from the ViT image processor (0.5 per channel for this checkpoint, rather than the usual ImageNet statistics)
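The steps above can be sketched with `torchvision` as follows. This is an illustration under assumptions, not the app's actual code: the function names are hypothetical, and the normalization values are read from the image processor instead of being hard-coded.

```python
def build_transforms(checkpoint: str = "google/vit-base-patch16-224-in21k"):
    """Return (train_transform, val_transform) following the steps listed above."""
    # Imported lazily: torchvision and transformers are only needed when called.
    from torchvision import transforms
    from transformers import AutoImageProcessor

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    normalize = transforms.Normalize(
        mean=processor.image_mean, std=processor.image_std
    )
    size = 224

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ])
    val_transform = transforms.Compose([
        transforms.Resize(size),
        transforms.CenterCrop(size),
        transforms.ToTensor(),
        normalize,
    ])
    return train_transform, val_transform


def apply_to_batch(batch, transform):
    """Convert each image to RGB, transform it, and drop the raw image column."""
    batch["pixel_values"] = [transform(img.convert("RGB")) for img in batch["image"]]
    del batch["image"]
    return batch
```

With a loaded dataset, the transform could be attached lazily, for example `dataset["train"].with_transform(lambda b: apply_to_batch(b, train_transform))`, so that random augmentation is re-sampled every epoch.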
Model and Evaluation
- Base model: `google/vit-base-patch16-224-in21k` (Vision Transformer)
- Fine-tuned with transfer learning on custom motorcycle dataset
- Training: 10 epochs, learning rate 2e-5, batch size 16, fp16 on GPU (Tesla T4)
- Hugging Face model: https://huggingface.co/durovali/vit-motorcycle
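As a rough sketch, the training settings above map onto `transformers.TrainingArguments` as shown below. The `output_dir` value and the helper names are hypothetical. The step count assumes a single device, no gradient accumulation, and that the last partial batch is kept; under those assumptions, 44 training images at batch size 16 give 3 optimizer steps per epoch, so 30 steps over the whole run.

```python
import math

HYPERPARAMS = {
    "num_train_epochs": 10,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "fp16": True,  # needs a CUDA GPU such as the T4 mentioned above
}


def optimizer_steps(n_train: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs


def build_training_args(output_dir: str = "vit-motorcycle"):
    """Wrap the settings in TrainingArguments (output_dir is hypothetical)."""
    # Imported lazily so optimizer_steps() works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        # Keep the raw "image" column so on-the-fly transforms can read it.
        remove_unused_columns=False,
        **HYPERPARAMS,
    )


print(optimizer_steps(n_train=44, batch_size=16, epochs=10))
```

So few updates at a learning rate of 2e-5 leave the model close to its initialization, which fits the slow loss decrease in the training table.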
Training Performance
| Training Loss | Epoch | Validation Loss | Accuracy |
|---|---|---|---|
| 1.8418 | 1 | 1.7780 | 0.1818 |
| 1.7629 | 2 | 1.7664 | 0.4545 |
| 1.7311 | 3 | 1.7577 | 0.3636 |
| 1.7177 | 5 | 1.7402 | 0.4545 |
| 1.6922 | 10 | 1.7145 | 0.4545 |
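With only 11 validation images, each image is worth about 9 accuracy points, so the accuracy column can only move in coarse steps. The short sketch below converts the reported accuracies back into counts of correctly classified images; the helper name is hypothetical.

```python
N_VAL = 11  # validation images, from the dataset description


def correct_count(accuracy: float, n: int = N_VAL) -> int:
    """Recover the number of correct predictions from a rounded accuracy."""
    return round(accuracy * n)


# Epoch -> accuracy, copied from the table above.
reported = {1: 0.1818, 2: 0.4545, 3: 0.3636, 5: 0.4545, 10: 0.4545}

for epoch, acc in reported.items():
    print(f"epoch {epoch:>2}: {correct_count(acc)}/{N_VAL} correct")
```

The change from epoch 2 to epoch 3 therefore corresponds to a single validation image, which is well within noise for a set this small.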
Example Image Results
The table below reports the true class, the Top-3 predictions from ViT and CLIP, and the single label with confidence returned by the OpenAI model.
| Image | True Class | ViT Top-3 (score) | CLIP Top-3 (score) | OpenAI (label, confidence) |
|---|---|---|---|---|
| bmw.jpg | bmw | honda (0.2178), kawasaki (0.1764), suzuki (0.1591) | bmw (0.9804), yamaha (0.0165), triumph (0.0019) | bmw (0.95) |
| honda.jpg | honda | honda (0.2035), kawasaki (0.1863), yamaha (0.1599) | honda (0.4927), yamaha (0.4869), suzuki (0.0100) | honda (0.95) |
| kawasaki.jpg | kawasaki | kawasaki (0.2186), honda (0.2039), bmw (0.1713) | yamaha (0.7077), kawasaki (0.1124), bmw (0.0653) | kawasaki (0.95) |
| triumph.jpg | triumph | honda (0.2249), bmw (0.1748), kawasaki (0.1738) | triumph (0.9904), bmw (0.0071), yamaha (0.0017) | triumph (0.98) |
| yamaha.jpg | yamaha | honda (0.1947), yamaha (0.1914), kawasaki (0.1635) | yamaha (0.9057), bmw (0.0354), honda (0.0213) | yamaha (0.90) |
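As a cross-check on any summary figures, top-1 hits can be tallied directly from the rows above. The sketch below copies the true class and each model's top-ranked label from the table; the `tally` helper is hypothetical.

```python
# (true class, ViT top-1, CLIP top-1, OpenAI label), copied from the table above.
ROWS = [
    ("bmw", "honda", "bmw", "bmw"),
    ("honda", "honda", "honda", "honda"),
    ("kawasaki", "kawasaki", "yamaha", "kawasaki"),
    ("triumph", "honda", "triumph", "triumph"),
    ("yamaha", "honda", "yamaha", "yamaha"),
]
MODELS = ["ViT", "CLIP", "OpenAI"]


def tally(rows):
    """Count, per model, how often the top-1 label equals the true class."""
    hits = {name: 0 for name in MODELS}
    for true_label, *predictions in rows:
        for name, predicted in zip(MODELS, predictions):
            hits[name] += int(predicted == true_label)
    return hits


for name, n_correct in tally(ROWS).items():
    print(f"{name}: {n_correct}/{len(ROWS)} = {n_correct / len(ROWS):.0%}")
```

With five example images, a single prediction changes a model's accuracy by 20 points, so these figures are indicative only.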
Comparison Results
| Model | Type | Correct Predictions | Accuracy | Notes |
|---|---|---|---|---|
| ViT (fine-tuned) | Closed-set | 2/5 (honda, kawasaki) | 40% | Too few training images (~9 per class); top-1 is honda for 4 of 5 images |
| CLIP | Zero-shot (Open-Source) | 4/5 | 80% | Strong without any task-specific training; misses only kawasaki (predicted yamaha) |
| OpenAI GPT-4.1 | Zero-shot (Closed-Source) | 5/5 | 100% | Perfect – recognizes logos and design details |
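Key finding: The fine-tuned ViT model underperforms, most likely because the dataset is very small (55 images total, about 9 per class). Its validation accuracy plateaus at 0.4545 (5 of 11 images), and its top-1 scores on the example images (0.19 to 0.22) sit barely above the 1/6 ≈ 0.17 that a uniform guess over six classes would give. CLIP and the OpenAI model perform much better as zero-shot models because they were pre-trained on massive datasets. With substantially more training data (on the order of 500+ images per class), the fine-tuned model could plausibly match or outperform the zero-shot approaches, though this remains untested.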