Create readme.md
readme.md
ADDED
@@ -0,0 +1,67 @@
# Motorcycle Brand Classification App

This app compares three image classification approaches on motorcycle images:

- Fine-tuned ViT model [`durovali/vit-motorcycle`](https://huggingface.co/durovali/vit-motorcycle)
- Zero-shot CLIP (`openai/clip-vit-large-patch14`)
- OpenAI vision model (GPT-4.1 Vision)
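
As a rough sketch of how the two Hugging Face models listed above could be queried for Top-3 scores, the snippet below uses the `transformers` pipelines. The helper names, the prompt template, and the choice of `pipeline` are assumptions for illustration; the app's actual inference code is not shown in this README.

```python
BRANDS = ["bmw", "honda", "kawasaki", "suzuki", "triumph", "yamaha"]


def top_k(predictions, k=3):
    """Return the k highest-scoring {'label', 'score'} dicts, best first."""
    return sorted(predictions, key=lambda p: p["score"], reverse=True)[:k]


def classify(image_path):
    """Run the fine-tuned ViT and zero-shot CLIP on one image (hypothetical helper)."""
    # Imported lazily so that top_k() works without transformers installed.
    from transformers import pipeline

    vit = pipeline("image-classification", model="durovali/vit-motorcycle")
    clip = pipeline(
        "zero-shot-image-classification",
        model="openai/clip-vit-large-patch14",
    )

    # Ask the closed-set model for a score per class.
    vit_scores = vit(image_path, top_k=len(BRANDS))
    # The prompt template is an assumption, not taken from the app.
    clip_scores = clip(
        image_path,
        candidate_labels=BRANDS,
        hypothesis_template="a photo of a {} motorcycle",
    )
    return {"vit": top_k(vit_scores), "clip": top_k(clip_scores)}
```

Calling `classify("bmw.jpg")` would download both checkpoints on first use, so it is best run on a machine with enough disk space and, ideally, a GPU.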

## Dataset Description

- Custom dataset created manually from web images (Google Images, Unsplash, Pexels)
- Number of classes: `6` (bmw, honda, kawasaki, suzuki, triumph, yamaha)
- Total images: `55` (~9 images per class)
- Split: 80% train (44 images) / 20% validation (11 images)
- Image formats: mixed JPG and PNG
- Dataset loaded using the Hugging Face `imagefolder` builder
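
The loading and splitting described above could look roughly like the following. This is a minimal sketch: the `data_dir` layout (one sub-folder per brand), the seed, and the helper names are assumptions, and the validation share is assumed to be rounded up.

```python
def split_sizes(n_images, val_percent=20):
    """Return (train, validation) counts, rounding the validation share up."""
    n_val = (n_images * val_percent + 99) // 100  # integer ceiling
    return n_images - n_val, n_val


def load_splits(data_dir, val_fraction=0.2, seed=42):
    """Load a class-per-folder image directory and split it (hypothetical helper)."""
    # Imported lazily so that split_sizes() works without datasets installed.
    from datasets import load_dataset

    dataset = load_dataset("imagefolder", data_dir=data_dir, split="train")
    return dataset.train_test_split(test_size=val_fraction, seed=seed)


print(split_sizes(55))  # (44, 11)
```

With only 11 validation images, a single image changes the reported accuracy by about 9 percentage points, which is worth keeping in mind when reading the training table.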

## Preprocessing Steps

- Images loaded via the Hugging Face `imagefolder` dataset builder
- Converted to RGB format
- **Training transforms:** RandomResizedCrop (224px), RandomHorizontalFlip, ToTensor, Normalize
- **Validation transforms:** Resize (224px), CenterCrop (224px), ToTensor, Normalize
- Normalized with the `image_mean` and `image_std` values taken from the ViT image processor
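
The two transform pipelines listed above can be sketched with `torchvision` as follows. The function names and the default `mean`/`std` values are placeholders, not taken from the training script; in practice the statistics would be read from the image processor of the chosen checkpoint.

```python
def build_transforms(size=224, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Return (train_transform, val_transform); mean/std are placeholder values."""
    # Imported lazily so that apply_transforms() works without torchvision.
    from torchvision import transforms

    normalize = transforms.Normalize(mean=mean, std=std)
    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(size),   # random crop, resized to size x size
        transforms.RandomHorizontalFlip(),    # mirror half of the images
        transforms.ToTensor(),
        normalize,
    ])
    val_tf = transforms.Compose([
        transforms.Resize(size),              # shorter side becomes `size`
        transforms.CenterCrop(size),          # deterministic central crop
        transforms.ToTensor(),
        normalize,
    ])
    return train_tf, val_tf


def apply_transforms(batch, transform):
    """Convert each image to RGB and store the transformed result."""
    batch["pixel_values"] = [
        transform(image.convert("RGB")) for image in batch["image"]
    ]
    return batch
```

The random augmentations are applied only to the training split, so validation scores stay comparable from epoch to epoch.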

## Model and Evaluation

- Base model: `google/vit-base-patch16-224-in21k` (Vision Transformer)
- Fine-tuned with transfer learning on the custom motorcycle dataset
- Training: 10 epochs, learning rate 2e-5, batch size 16, fp16 on GPU (Tesla T4)
- Hugging Face model: [https://huggingface.co/durovali/vit-motorcycle](https://huggingface.co/durovali/vit-motorcycle)
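
To put these hyperparameters in perspective, the sketch below counts the optimizer updates they imply and shows one way they might be passed to `TrainingArguments`. It assumes a single GPU, no gradient accumulation, and the 44 training images from the dataset split; `output_dir`, `remove_unused_columns=False`, and the helper names are illustrative assumptions.

```python
import math

TRAIN_IMAGES = 44
HYPERPARAMS = {
    "num_train_epochs": 10,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
}


def total_optimizer_steps(n_train, batch_size, epochs):
    """Number of weight updates, assuming one device and no accumulation."""
    return math.ceil(n_train / batch_size) * epochs


def build_training_args(output_dir="vit-motorcycle"):
    """Build TrainingArguments from HYPERPARAMS (hypothetical helper)."""
    # Imported lazily so the arithmetic above runs without torch/transformers.
    import torch
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        fp16=torch.cuda.is_available(),   # mixed precision only when a GPU exists
        remove_unused_columns=False,      # assumption: keep the raw image column
        **HYPERPARAMS,
    )


print(total_optimizer_steps(TRAIN_IMAGES, 16, 10))  # 30
```

Thirty updates at a learning rate of 2e-5 is a very short schedule, which is consistent with the slowly decreasing losses reported for this run.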

## Training Performance

| Training Loss | Epoch | Validation Loss | Accuracy |
|---:|---:|---:|---:|
| 1.8418 | 1 | 1.7780 | 0.1818 |
| 1.7629 | 2 | 1.7664 | 0.4545 |
| 1.7311 | 3 | 1.7577 | 0.3636 |
| 1.7177 | 5 | 1.7402 | 0.4545 |
| 1.6922 | 10 | 1.7145 | 0.4545 |
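
A quick sanity check helps when reading these numbers. A classifier that assigns equal probability to all 6 classes has a cross-entropy loss of ln 6, and each accuracy value corresponds to a whole number of the 11 validation images. The helper names below are illustrative.

```python
import math

NUM_CLASSES = 6
VAL_IMAGES = 11


def chance_level_loss(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def correct_count(accuracy, n_images):
    """Recover the number of correct predictions from a rounded accuracy."""
    return round(accuracy * n_images)


print(f"{chance_level_loss(NUM_CLASSES):.4f}")  # 1.7918
print([correct_count(a, VAL_IMAGES) for a in (0.1818, 0.4545, 0.3636)])  # [2, 5, 4]
```

The final validation loss of 1.7145 is only slightly below the chance level of about 1.7918, and the best accuracy of 0.4545 corresponds to 5 of 11 validation images.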

## Example Image Results

The table below reports, for five test images, the true class, the Top-3 predictions for ViT and CLIP, and the label and confidence returned by the OpenAI model.

| Image | True Class | ViT Top-3 (score) | CLIP Top-3 (score) | OpenAI (label, confidence) |
|---|---|---|---|---|
| `bmw.jpg` | `bmw` | `honda` (0.2178)<br>`kawasaki` (0.1764)<br>`suzuki` (0.1591) | `bmw` (0.9804)<br>`yamaha` (0.0165)<br>`triumph` (0.0019) | `bmw` (0.95) |
| `honda.jpg` | `honda` | `honda` (0.2035)<br>`kawasaki` (0.1863)<br>`yamaha` (0.1599) | `honda` (0.4927)<br>`yamaha` (0.4869)<br>`suzuki` (0.0100) | `honda` (0.95) |
| `kawasaki.jpg` | `kawasaki` | `kawasaki` (0.2186)<br>`honda` (0.2039)<br>`bmw` (0.1713) | `yamaha` (0.7077)<br>`kawasaki` (0.1124)<br>`bmw` (0.0653) | `kawasaki` (0.95) |
| `triumph.jpg` | `triumph` | `honda` (0.2249)<br>`bmw` (0.1748)<br>`kawasaki` (0.1738) | `triumph` (0.9904)<br>`bmw` (0.0071)<br>`yamaha` (0.0017) | `triumph` (0.98) |
| `yamaha.jpg` | `yamaha` | `honda` (0.1947)<br>`yamaha` (0.1914)<br>`kawasaki` (0.1635) | `yamaha` (0.9057)<br>`bmw` (0.0354)<br>`honda` (0.0213) | `yamaha` (0.90) |
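
The Top-1 hit counts can be recomputed directly from the rows above. The sketch below copies only the first-ranked label per model from the table; the dictionary layout and function name are illustrative.

```python
# True class -> top-ranked label per model, copied from the table rows.
TOP1 = {
    "bmw": {"vit": "honda", "clip": "bmw", "openai": "bmw"},
    "honda": {"vit": "honda", "clip": "honda", "openai": "honda"},
    "kawasaki": {"vit": "kawasaki", "clip": "yamaha", "openai": "kawasaki"},
    "triumph": {"vit": "honda", "clip": "triumph", "openai": "triumph"},
    "yamaha": {"vit": "honda", "clip": "yamaha", "openai": "yamaha"},
}


def top1_correct(results, model):
    """Count images whose top-ranked label equals the true class."""
    return sum(preds[model] == true for true, preds in results.items())


for model in ("vit", "clip", "openai"):
    print(model, f"{top1_correct(TOP1, model)}/{len(TOP1)}")
# vit 2/5
# clip 4/5
# openai 5/5
```

Note that the ViT scores all sit close to 1/6 (about 0.167), the value a uniform guess over six classes would produce, and that `honda` is its top label for four of the five images.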

## Comparison Results

| Model | Type | Correct Predictions | Accuracy | Notes |
|---|---|---|---|---|
| ViT (fine-tuned) | Closed-set | 2/5 (honda, kawasaki) | 40% | Too few training images (~9 per class) |
| CLIP | Zero-shot (open-source) | 4/5 | 80% | Very strong without any training! |
| OpenAI GPT-4.1 | Zero-shot (closed-source) | 5/5 | 100% | Perfect: recognizes logos and design details |

**Key finding:** The fine-tuned ViT model underperforms, most likely because the dataset is very small (55 images in total). CLIP and the OpenAI model perform much better as zero-shot models because they were pre-trained on massive datasets. With more training data (on the order of 500+ images per class), the fine-tuned model would likely outperform the zero-shot approaches.

## Links

- Model: [https://huggingface.co/durovali/vit-motorcycle](https://huggingface.co/durovali/vit-motorcycle)
- App: [https://huggingface.co/spaces/durovali/motorcycle-classification](https://huggingface.co/spaces/durovali/motorcycle-classification)