# Motorcycle Brand Classification App

This app compares three image classification approaches on motorcycle images:

- Fine-tuned ViT model [`durovali/vit-motorcycle`](https://huggingface.co/durovali/vit-motorcycle)
- Zero-shot CLIP (`openai/clip-vit-large-patch14`)
- OpenAI vision model (GPT-4.1 Vision)

## Dataset Description

- Custom dataset created manually from web images (Google Images, Unsplash, Pexels)
- Number of classes: `6` (bmw, honda, kawasaki, suzuki, triumph, yamaha)
- Total images: `55` (~9 images per class)
- Split: 80% train (44 images) / 20% validation (11 images)
- Image formats: mixed JPG and PNG
- Dataset loaded using the Hugging Face `imagefolder` builder

## Preprocessing Steps

- Images loaded via the Hugging Face `imagefolder` dataset builder
- Converted to RGB format
- **Training transforms:** RandomResizedCrop (224px), RandomHorizontalFlip, ToTensor, Normalize
- **Validation transforms:** Resize (224px), CenterCrop (224px), ToTensor, Normalize
- Normalized with the mean and std values taken from the ViT image processor

## Model and Evaluation

- Base model: `google/vit-base-patch16-224-in21k` (Vision Transformer)
- Fine-tuned with transfer learning on the custom motorcycle dataset
- Training: 10 epochs, learning rate 2e-5, batch size 16, fp16 on GPU (Tesla T4)
- Hugging Face model: [https://huggingface.co/durovali/vit-motorcycle](https://huggingface.co/durovali/vit-motorcycle)

## Training Performance

| Training Loss | Epoch | Validation Loss | Validation Accuracy |
|---:|---:|---:|---:|
| 1.8418 | 1 | 1.7780 | 0.1818 |
| 1.7629 | 2 | 1.7664 | 0.4545 |
| 1.7311 | 3 | 1.7577 | 0.3636 |
| 1.7177 | 5 | 1.7402 | 0.4545 |
| 1.6922 | 10 | 1.7145 | 0.4545 |

## Example Image Results

The table below reports the true class and the Top-3 predictions for ViT and CLIP, plus the label and confidence returned by the OpenAI model.

| Image | True Class | ViT Top-3 (score) | CLIP Top-3 (score) | OpenAI (label, confidence) |
|---|---|---|---|---|
| `bmw.jpg` | `bmw` | `honda` (0.2178)<br>`kawasaki` (0.1764)<br>`suzuki` (0.1591) | `bmw` (0.9804)<br>`yamaha` (0.0165)<br>`triumph` (0.0019) | `bmw` (0.95) |
| `honda.jpg` | `honda` | `honda` (0.2035)<br>`kawasaki` (0.1863)<br>`yamaha` (0.1599) | `honda` (0.4927)<br>`yamaha` (0.4869)<br>`suzuki` (0.0100) | `honda` (0.95) |
| `kawasaki.jpg` | `kawasaki` | `kawasaki` (0.2186)<br>`honda` (0.2039)<br>`bmw` (0.1713) | `yamaha` (0.7077)<br>`kawasaki` (0.1124)<br>`bmw` (0.0653) | `kawasaki` (0.95) |
| `triumph.jpg` | `triumph` | `honda` (0.2249)<br>`bmw` (0.1748)<br>`kawasaki` (0.1738) | `triumph` (0.9904)<br>`bmw` (0.0071)<br>`yamaha` (0.0017) | `triumph` (0.98) |
| `yamaha.jpg` | `yamaha` | `honda` (0.1947)<br>`yamaha` (0.1914)<br>`kawasaki` (0.1635) | `yamaha` (0.9057)<br>`bmw` (0.0354)<br>`honda` (0.0213) | `yamaha` (0.90) |

## Comparison Results

Accuracy is top-1 accuracy on the five example images above.

| Model | Type | Correct Predictions | Accuracy | Notes |
|---|---|---|---|---|
| ViT (fine-tuned) | Closed-set | 2/5 (honda, kawasaki) | 40% | Too few training images (~9 per class) |
| CLIP | Zero-shot (open-source) | 4/5 | 80% | Strong without any task-specific training |
| OpenAI GPT-4.1 | Zero-shot (closed-source) | 5/5 | 100% | Perfect: recognizes logos and design details |

**Key finding:** The fine-tuned ViT model underperforms because the dataset is very small (55 images in total). Its top scores stay near 0.2, barely above the 1/6 ≈ 0.17 that a uniform guess over six classes would give. CLIP and the OpenAI model perform much better as zero-shot models because they were pre-trained on massive datasets. With far more training data (on the order of 500+ images per class), the fine-tuned model would likely outperform the zero-shot approaches.

## Links

- Model: [https://huggingface.co/durovali/vit-motorcycle](https://huggingface.co/durovali/vit-motorcycle)
- App: [https://huggingface.co/spaces/durovali/motorcycle-classification](https://huggingface.co/spaces/durovali/motorcycle-classification)
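
The correct-prediction counts for ViT and CLIP can be rechecked directly from the example image results. The snippet below is a minimal sketch, not part of the app's code: it copies the Top-3 labels from that table (in descending score order) and counts how often the true class appears among the first `k` predictions. The names `RESULTS` and `top_k_hits` are illustrative assumptions.

```python
# Top-3 labels per test image, copied from the example image results table.
# Keys are the true classes; each list is ordered by descending score.
RESULTS = {
    "bmw": {
        "vit": ["honda", "kawasaki", "suzuki"],
        "clip": ["bmw", "yamaha", "triumph"],
    },
    "honda": {
        "vit": ["honda", "kawasaki", "yamaha"],
        "clip": ["honda", "yamaha", "suzuki"],
    },
    "kawasaki": {
        "vit": ["kawasaki", "honda", "bmw"],
        "clip": ["yamaha", "kawasaki", "bmw"],
    },
    "triumph": {
        "vit": ["honda", "bmw", "kawasaki"],
        "clip": ["triumph", "bmw", "yamaha"],
    },
    "yamaha": {
        "vit": ["honda", "yamaha", "kawasaki"],
        "clip": ["yamaha", "bmw", "honda"],
    },
}


def top_k_hits(results, model, k):
    """Count images whose true class is among the model's first k predictions."""
    return sum(true in preds[model][:k] for true, preds in results.items())


total = len(RESULTS)
for model in ("vit", "clip"):
    for k in (1, 3):
        hits = top_k_hits(RESULTS, model, k)
        print(f"{model} top-{k}: {hits}/{total} ({hits / total:.0%})")
```

Top-1 is the strict measure; comparing it with top-3 shows how often the right brand was ranked just below the winning label.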