Create readme.md

51d4069 verified 2 months ago

3.72 kB

	# Motorcycle Brand Classification App

	This app compares 3 image classification approaches on motorcycle images:

	- Fine-tuned ViT model [`durovali/vit-motorcycle`](https://huggingface.co/durovali/vit-motorcycle)
	- Zero-shot CLIP (`openai/clip-vit-large-patch14`)
	- OpenAI vision model (GPT-4.1 Vision)

	## Dataset Description

	- Custom dataset created manually from web images (Google Images, Unsplash, Pexels)
	- Number of classes: `6` (bmw, honda, kawasaki, suzuki, triumph, yamaha)
	- Total images: `55` (~9 images per class)
	- Split: 80% train (44 images) / 20% validation (11 images)
	- Image formats: mixed JPG and PNG
	- Dataset loaded using HuggingFace `imagefolder` builder

	## Preprocessing Steps

	- Images loaded via HuggingFace `imagefolder` dataset builder
	- Converted to RGB format
	- Training transforms: RandomResizedCrop (224px), RandomHorizontalFlip, ToTensor, Normalize
	- Validation transforms: Resize (224px), CenterCrop (224px), ToTensor, Normalize
	- Normalized with ImageNet mean and std values from ViT processor

	## Model and Evaluation

	- Base model: `google/vit-base-patch16-224-in21k` (Vision Transformer)
	- Fine-tuned with transfer learning on custom motorcycle dataset
	- Training: 10 epochs, learning rate 2e-5, batch size 16, fp16 on GPU (Tesla T4)
	- Hugging Face model: [https://huggingface.co/durovali/vit-motorcycle](https://huggingface.co/durovali/vit-motorcycle)

	## Training Performance

	\| Training Loss \| Epoch \| Validation Loss \| Accuracy \|
	\|---:\|---:\|---:\|---:\|
	\| 1.8418 \| 1 \| 1.7780 \| 0.1818 \|
	\| 1.7629 \| 2 \| 1.7664 \| 0.4545 \|
	\| 1.7311 \| 3 \| 1.7577 \| 0.3636 \|
	\| 1.7177 \| 5 \| 1.7402 \| 0.4545 \|
	\| 1.6922 \| 10 \| 1.7145 \| 0.4545 \|

	## Example Image Results

	The table below reports the true class and Top-3 predictions for ViT and CLIP.

	\| Image \| True Class \| ViT Top-3 (score) \| CLIP Top-3 (score) \| OpenAI (label, confidence) \|
	\|---\|---\|---\|---\|---\|
	\| `bmw.jpg` \| `bmw` \| `honda` (0.2178)<br>`kawasaki` (0.1764)<br>`suzuki` (0.1591) \| `bmw` (0.9804)<br>`yamaha` (0.0165)<br>`triumph` (0.0019) \| `bmw` (0.95) \|
	\| `honda.jpg` \| `honda` \| `honda` (0.2035)<br>`kawasaki` (0.1863)<br>`yamaha` (0.1599) \| `honda` (0.4927)<br>`yamaha` (0.4869)<br>`suzuki` (0.0100) \| `honda` (0.95) \|
	\| `kawasaki.jpg` \| `kawasaki` \| `kawasaki` (0.2186)<br>`honda` (0.2039)<br>`bmw` (0.1713) \| `yamaha` (0.7077)<br>`kawasaki` (0.1124)<br>`bmw` (0.0653) \| `kawasaki` (0.95) \|
	\| `triumph.jpg` \| `triumph` \| `honda` (0.2249)<br>`bmw` (0.1748)<br>`kawasaki` (0.1738) \| `triumph` (0.9904)<br>`bmw` (0.0071)<br>`yamaha` (0.0017) \| `triumph` (0.98) \|
	\| `yamaha.jpg` \| `yamaha` \| `honda` (0.1947)<br>`yamaha` (0.1914)<br>`kawasaki` (0.1635) \| `yamaha` (0.9057)<br>`bmw` (0.0354)<br>`honda` (0.0213) \| `yamaha` (0.90) \|

	## Comparison Results

	\| Model \| Type \| Correct Predictions \| Accuracy \| Notes \|
	\|---\|---\|---\|---\|---\|
	\| ViT (fine-tuned) \| Closed-set \| 1/5 (kawasaki) \| 20% \| Too few training images (~9 per class) \|
	\| CLIP \| Zero-shot (Open-Source) \| 4/5 \| 80% \| Very strong without any training! \|
	\| OpenAI GPT-4.1 \| Zero-shot (Closed-Source) \| 5/5 \| 100% \| Perfect – recognizes logos and design details \|

	Key finding: The fine-tuned ViT model underperforms due to the very small dataset size (~ 55 images total). CLIP and OpenAI perform much better as zero-shot models because they were pre-trained on massive datasets. With more training data (~ 500+ images per class), the fine-tuned model would likely outperform zero-shot approaches.
	## Links

	- Model: [https://huggingface.co/durovali/vit-motorcycle](https://huggingface.co/durovali/vit-motorcycle)
	- App: [https://huggingface.co/spaces/durovali/motorcycle-classification](https://huggingface.co/spaces/durovali/motorcycle-classification)