Motorcycle Brand Classification App

This app compares 3 image classification approaches on motorcycle images:

Fine-tuned ViT model (durovali/vit-motorcycle)
Zero-shot CLIP (openai/clip-vit-large-patch14)
OpenAI vision model (GPT-4.1 Vision)
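
As a rough illustration of how the first two approaches can be queried, the sketch below uses the Hugging Face `transformers` pipelines. The helper names `top_k` and `classify` and the CLIP prompt wording are assumptions made for this example, not taken from the app; the OpenAI call is left out because its request format is not described here.

```python
LABELS = ["bmw", "honda", "kawasaki", "suzuki", "triumph", "yamaha"]


def top_k(predictions, k: int = 3):
    """Sort pipeline-style [{'label', 'score'}, ...] output and keep the best k."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["label"], round(p["score"], 4)) for p in ranked[:k]]


def classify(image_path: str):
    """Run the fine-tuned ViT and zero-shot CLIP on one image (downloads both models)."""
    # Imported lazily so that top_k can be used without transformers installed.
    from transformers import pipeline

    vit = pipeline("image-classification", model="durovali/vit-motorcycle")
    clip = pipeline("zero-shot-image-classification",
                    model="openai/clip-vit-large-patch14")
    return {
        "vit": top_k(vit(image_path, top_k=len(LABELS))),
        # Assumed prompt wording; the template the app really uses is not documented here.
        "clip": top_k(clip(image_path, candidate_labels=LABELS,
                           hypothesis_template="a photo of a {} motorcycle")),
    }
```

Calling `classify("bmw.jpg")` would return the three highest-scoring labels per model. The ViT can only answer with one of its six trained classes, whereas CLIP scores whatever candidate labels it is given.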

Dataset Description

Custom dataset created manually from web images (Google Images, Unsplash, Pexels)
Number of classes: 6 (bmw, honda, kawasaki, suzuki, triumph, yamaha)
Total images: 55 (~9 images per class)
Split: 80% train (44 images) / 20% validation (11 images)
Image formats: mixed JPG and PNG
Dataset loaded using HuggingFace imagefolder builder
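
A minimal sketch of how the loading and the 80/20 split described above could look. The directory name `data/motorcycles`, the seed, and both function names are assumptions for illustration. The `datasets` import sits inside the function so that the split arithmetic can be checked on its own.

```python
def split_sizes(total: int, val_fraction: float) -> tuple[int, int]:
    """Return (train, validation) sizes for a simple fractional split."""
    n_val = round(total * val_fraction)
    return total - n_val, n_val


def load_splits(data_dir: str = "data/motorcycles",
                val_fraction: float = 0.2, seed: int = 42):
    """Load class-per-folder images and split them (hypothetical path and seed)."""
    from datasets import load_dataset  # Hugging Face `datasets`

    # "imagefolder" derives each label from the sub-directory name (bmw/, honda/, ...).
    ds = load_dataset("imagefolder", data_dir=data_dir, split="train")
    # Returns a DatasetDict with "train" and "test" keys.
    return ds.train_test_split(test_size=val_fraction, seed=seed)


print(split_sizes(55, 0.2))  # (44, 11)
```

With only 11 validation images, a single image changes the validation accuracy by about 9 percentage points.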

Preprocessing Steps

Images loaded via HuggingFace imagefolder dataset builder
Converted to RGB format
Training transforms: RandomResizedCrop (224px), RandomHorizontalFlip, ToTensor, Normalize
Validation transforms: Resize (224px), CenterCrop (224px), ToTensor, Normalize
Normalized with the mean and std values taken from the ViT image processor (0.5 per channel for this checkpoint)
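
The transform lists above could be assembled roughly as follows. This is a sketch assuming torchvision; `normalize_pixel` and `build_transforms` are illustrative names, and the statistics in the last line are placeholder values. In practice the mean and std would be read from the image processor (`processor.image_mean`, `processor.image_std`), and each image converted with `image.convert("RGB")` before the transforms are applied.

```python
def normalize_pixel(rgb, mean, std):
    """Apply per-channel (x - mean) / std, the operation Normalize performs."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))


def build_transforms(mean, std, size: int = 224):
    """Return (train, validation) pipelines matching the steps listed above."""
    from torchvision import transforms as T

    train_tf = T.Compose([
        T.RandomResizedCrop(size),   # random crop and rescale to size x size
        T.RandomHorizontalFlip(),    # mirror half of the images
        T.ToTensor(),                # PIL image -> float tensor in [0, 1]
        T.Normalize(mean=mean, std=std),
    ])
    val_tf = T.Compose([
        T.Resize(size),              # shorter side -> size
        T.CenterCrop(size),          # deterministic crop for evaluation
        T.ToTensor(),
        T.Normalize(mean=mean, std=std),
    ])
    return train_tf, val_tf


# Placeholder statistics for illustration only.
print(normalize_pixel((1.0, 0.5, 0.0), mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)))
# (1.0, 0.0, -1.0)
```

Only the training pipeline is random; validation uses a fixed resize and center crop so that repeated evaluations give the same result.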

Model and Evaluation

Base model: google/vit-base-patch16-224-in21k (Vision Transformer)
Fine-tuned with transfer learning on custom motorcycle dataset
Training: 10 epochs, learning rate 2e-5, batch size 16, fp16 on GPU (Tesla T4)
Hugging Face model: https://huggingface.co/durovali/vit-motorcycle
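
To put the training settings in perspective, the sketch below counts how many optimizer updates they imply. It assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; none of these is stated above, and `optimizer_steps` is an illustrative helper.

```python
import math


def optimizer_steps(n_train: int, batch_size: int, epochs: int) -> tuple[int, int]:
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    per_epoch = math.ceil(n_train / batch_size)
    return per_epoch, per_epoch * epochs


per_epoch, total = optimizer_steps(n_train=44, batch_size=16, epochs=10)
print(per_epoch, total)  # 3 30
```

Under these assumptions the whole run consists of about 30 weight updates at a learning rate of 2e-5, which may help explain why the loss changes so little over training.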

Training Performance

Training Loss	Epoch	Validation Loss	Accuracy
1.8418	1	1.7780	0.1818
1.7629	2	1.7664	0.4545
1.7311	3	1.7577	0.3636
1.7177	5	1.7402	0.4545
1.6922	10	1.7145	0.4545
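
The figures in this table are easier to interpret against the chance level for six classes. The short calculation below is a sketch using only values from this document (6 classes, 11 validation images); `correct_count` is an illustrative helper.

```python
import math

NUM_CLASSES = 6
NUM_VAL_IMAGES = 11

# Cross-entropy of a uniform guess over six classes: -ln(1/6) = ln(6).
chance_loss = math.log(NUM_CLASSES)
chance_accuracy = 1 / NUM_CLASSES


def correct_count(accuracy: float, n: int = NUM_VAL_IMAGES) -> int:
    """Convert a reported accuracy back to a number of correctly classified images."""
    return round(accuracy * n)


print(f"{chance_loss:.4f}")  # 1.7918
print([correct_count(a) for a in (0.1818, 0.4545, 0.3636)])  # [2, 5, 4]
```

The final validation loss of 1.7145 is only slightly below the uniform-guess value of about 1.7918, and the reported accuracies correspond to 2, 5, and 4 correct images out of 11.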

Example Image Results

The table below reports, for five test images, the true class, the Top-3 predictions of ViT and CLIP, and the label and confidence returned by the OpenAI model.

Image	True Class	ViT Top-3 (score)	CLIP Top-3 (score)	OpenAI (label, confidence)
`bmw.jpg`	`bmw`	`honda` (0.2178) `kawasaki` (0.1764) `suzuki` (0.1591)	`bmw` (0.9804) `yamaha` (0.0165) `triumph` (0.0019)	`bmw` (0.95)
`honda.jpg`	`honda`	`honda` (0.2035) `kawasaki` (0.1863) `yamaha` (0.1599)	`honda` (0.4927) `yamaha` (0.4869) `suzuki` (0.0100)	`honda` (0.95)
`kawasaki.jpg`	`kawasaki`	`kawasaki` (0.2186) `honda` (0.2039) `bmw` (0.1713)	`yamaha` (0.7077) `kawasaki` (0.1124) `bmw` (0.0653)	`kawasaki` (0.95)
`triumph.jpg`	`triumph`	`honda` (0.2249) `bmw` (0.1748) `kawasaki` (0.1738)	`triumph` (0.9904) `bmw` (0.0071) `yamaha` (0.0017)	`triumph` (0.98)
`yamaha.jpg`	`yamaha`	`honda` (0.1947) `yamaha` (0.1914) `kawasaki` (0.1635)	`yamaha` (0.9057) `bmw` (0.0354) `honda` (0.0213)	`yamaha` (0.90)
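
Top-1 accuracy per model can be recomputed directly from the rows above. The snippet is a small sketch: the tuples are copied from the table, and the helper name `top1_hits` is an assumption.

```python
# (true class, ViT top-1, CLIP top-1, OpenAI label), copied from the table above.
ROWS = [
    ("bmw", "honda", "bmw", "bmw"),
    ("honda", "honda", "honda", "honda"),
    ("kawasaki", "kawasaki", "yamaha", "kawasaki"),
    ("triumph", "honda", "triumph", "triumph"),
    ("yamaha", "honda", "yamaha", "yamaha"),
]
MODELS = ("ViT", "CLIP", "OpenAI")


def top1_hits(rows, model_index: int) -> int:
    """Count rows where the model's first prediction equals the true class."""
    return sum(1 for row in rows if row[model_index + 1] == row[0])


hits = {name: top1_hits(ROWS, i) for i, name in enumerate(MODELS)}
for name, n in hits.items():
    print(f"{name}: {n}/{len(ROWS)} = {n / len(ROWS):.0%}")
```

The ViT's first prediction is `honda` for four of the five images, and its top scores stay close to the uniform value of 1/6 (about 0.167).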

Comparison Results

Model	Type	Correct Predictions	Accuracy	Notes
ViT (fine-tuned)	Closed-set	2/5 (honda, kawasaki)	40%	Too few training images (~9 per class)
CLIP	Zero-shot (Open-Source)	4/5	80%	Very strong without any training!
OpenAI GPT-4.1	Zero-shot (Closed-Source)	5/5	100%	Perfect – recognizes logos and design details

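Key finding: The fine-tuned ViT model underperforms, most likely because of the very small dataset (55 images in total). CLIP and the OpenAI model perform much better as zero-shot models because they were pre-trained on massive datasets. With much more training data (on the order of 500+ images per class), the fine-tuned model would likely outperform the zero-shot approaches. Because the comparison rests on only five test images, the accuracy figures should be read as indicative rather than precise.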
