---
title: CV Model Comparison In PyTorch
emoji: 📊
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.8.0
app_file: app.py
pinned: false
license: mit
short_description: PyTorch CV models comparison.
models:
- AIOmarRehan/PyTorch_Unified_CNN_Model
datasets:
- AIOmarRehan/Vehicles
---

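# PyTorch Model Comparison: From Custom CNNs to Advanced Transfer Learning

![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)
![PyTorch](https://img.shields.io/badge/PyTorch-2.x-orange.svg)
![HuggingFace](https://img.shields.io/badge/HuggingFace-Transformers-yellow.svg)
![Gradio](https://img.shields.io/badge/Gradio-Interface-green.svg)
![License](https://img.shields.io/badge/License-MIT-lightgrey.svg)

---
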
## Overview

This project compares **three computer vision approaches in PyTorch** on a vehicle classification task:

1. Custom CNN (trained from scratch)
2. Vision Transformer (DeiT-Tiny)
3. Xception with two-phase transfer learning

The goal is to answer a practical question:

> On small or moderately sized datasets, should you train from scratch or use transfer learning?

In these experiments, **transfer learning clearly improved generalization and reliability**, especially when data and compute are limited.

---

## Architectures Compared

### Custom CNN (From Scratch)

A traditional convolutional network built manually from Conv → ReLU → Pooling blocks followed by fully connected layers.

**Philosophy:** Full architectural control, no pre-training.

Minimal structure:

```python
import torch
import torch.nn as nn


class CustomCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Two Conv -> ReLU -> MaxPool blocks; each pool halves the spatial size.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 64 * 56 * 56 implies 224x224 inputs (224 -> 112 -> 56).
        self.classifier = nn.Sequential(
            nn.Linear(64 * 56 * 56, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        # Not shown in the original snippet: flatten feature maps before the classifier.
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```
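
The `64 * 56 * 56` input width of the first linear layer follows from the layer arithmetic. The sketch below reproduces it, assuming 224×224 inputs (the input resolution is not stated here; 224 is the value consistent with that constant). `conv_out` and `flattened_features` are hypothetical helper names used only for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a conv or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(input_size=224, channels=64):
    """Number of values fed to the first Linear layer of the CNN above."""
    size = input_size
    for _ in range(2):  # two Conv -> ReLU -> MaxPool blocks
        size = conv_out(size, kernel=3, padding=1)  # 3x3 conv with padding=1 keeps the size
        size = conv_out(size, kernel=2, stride=2)   # 2x2 max pool halves it
    return channels * size * size


print(flattened_features())  # 200704 == 64 * 56 * 56
```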

**Reality on small datasets:**

* Slower convergence
* Higher variance
* Larger generalization gap

---

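### Vision Transformer (DeiT-Tiny)

The model starts from Hugging Face's pre-trained DeiT-Tiny checkpoint and is trained with the Hugging Face `Trainer` API:

```python
from transformers import AutoModelForImageClassification

# num_classes: number of vehicle categories (defined elsewhere).
# ignore_mismatched_sizes=True allows the pre-trained classification head
# to be replaced by a freshly initialized one with num_classes outputs.
model = AutoModelForImageClassification.from_pretrained(
    "facebook/deit-tiny-patch16-224",
    num_labels=num_classes,
    ignore_mismatched_sizes=True,
)
```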
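
`Trainer` reports the evaluation loss but no accuracy unless it is given a metrics function. No such function is shown in this README, so the following is only a minimal sketch of a `compute_metrics` callback, assuming predictions arrive as a `(logits, labels)` pair of NumPy arrays. It would be passed as `Trainer(..., compute_metrics=compute_metrics)`. The demo arrays are hand-made values, not results from this project.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from raw logits; eval_pred unpacks to (logits, labels)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per image
    return {"accuracy": float((predictions == labels).mean())}


# Tiny hand-made batch: 4 images, 3 classes (illustrative only).
demo_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.9, 0.2, 0.4],
    [0.1, 0.3, 2.2],
])
demo_labels = np.array([0, 1, 2, 2])
print(compute_metrics((demo_logits, demo_labels)))  # {'accuracy': 0.75}
```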

**Advantages:**

* Stable convergence
* Lightweight
* Easy deployment
* Good performance-to-efficiency ratio

---

### Xception (Two-Phase Transfer Learning)

Implemented using `timm`.

#### Phase 1 - Train Classifier Head Only

```python
import timm
import torch.nn as nn

model = timm.create_model("xception", pretrained=True)

# Freeze the whole pre-trained backbone.
for param in model.parameters():
    param.requires_grad = False

# Replace the head; newly created layers are trainable by default.
# num_classes: number of vehicle categories (defined elsewhere).
in_features = model.fc.in_features  # input width of the original classifier
model.fc = nn.Sequential(
    nn.Linear(in_features, 512),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, num_classes),
)
```

#### Phase 2 - Fine-Tune Selected Layers

The selected layers are unfrozen and training continues at a lower learning rate:

```python
# Parameter names depend on the model implementation; inspect
# model.named_parameters() to confirm these substrings match.
for name, param in model.named_parameters():
    if "block14" in name or "fc" in name:
        param.requires_grad = True
```
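
The optimizer setup for the second phase is not shown in this README. One common pattern, sketched here under that caveat, is to rebuild the optimizer over only the trainable parameters, with a smaller learning rate for the pre-trained layers than for the new head. `TinyBackbone`, its layer sizes, and the learning rates `1e-5` and `1e-4` are illustrative assumptions so the snippet runs without downloading Xception weights; only the name-matching rule comes from the text above.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in whose parameter names mimic the ones filtered above."""

    def __init__(self, num_classes=4):
        super().__init__()
        self.block1 = nn.Conv2d(3, 8, 3, padding=1)
        self.block14 = nn.Conv2d(8, 16, 3, padding=1)
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        x = torch.relu(self.block14(torch.relu(self.block1(x))))
        return self.fc(x.mean(dim=(2, 3)))  # global average pool, then classify


model = TinyBackbone()

# Phase 1 state: everything frozen except the head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# Phase 2: unfreeze by name, using the same substring rule as in the text.
for name, param in model.named_parameters():
    if "block14" in name or "fc" in name:
        param.requires_grad = True

backbone_params = [p for n, p in model.named_parameters() if p.requires_grad and "fc" not in n]
head_params = [p for n, p in model.named_parameters() if p.requires_grad and "fc" in n]

# Assumed learning rates: smaller for pre-trained layers than for the new head.
optimizer = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-5},
    {"params": head_params, "lr": 1e-4},
])

# Listing the trainable names is a quick check that the filter matched anything.
trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)
```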

**Result:**

* Smoothest training curves
* Lowest validation loss
* Highest test accuracy
* Strongest performance on unseen internet images

---

## Comparative Results

| Model      | Validation Performance | Generalization | Stability   |
| ---------- | ---------------------- | -------------- | ----------- |
| Custom CNN | High variance          | Weak           | Unstable    |
| DeiT-Tiny  | Strong                 | Good           | Stable      |
| Xception   | Best                   | Excellent      | Very Stable |

### Key Insight

> High validation accuracy does NOT guarantee real-world reliability.

The custom CNN achieved strong validation scores (~87%) but was less robust to distribution shift.

Xception consistently generalized better.

---

## Experimental Visualizations

### Dataset Distribution Across All Three Models:

![Chart](https://files.catbox.moe/8b7l6o.png)

---

### Xception Model:

![Accuracy](https://files.catbox.moe/wqdfc8.png)

### Custom CNN Model:

![Accuracy](https://files.catbox.moe/24gffn.png)

---

### Confusion Matrices for Both Models:

| **Custom CNN** | **Xception** |
|------------|----------|
| <img src="https://files.catbox.moe/aulaxo.webp" width="100%"> | <img src="https://files.catbox.moe/gy6yno.webp" width="100%"> |

---

## Example Test Results (Custom CNN)

```
Test Accuracy: 87.89%

Macro Avg:
  Precision: 0.8852
  Recall:    0.8794
  F1-Score:  0.8789
```
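
Macro averaging, as in the report above, computes precision, recall, and F1 per class and then takes their unweighted mean, so every vehicle class counts equally regardless of how many images it has. A minimal sketch of that calculation from a confusion matrix follows. The 3×3 matrix is made-up illustrative data, not this project's results, and `macro_scores` is a hypothetical helper name.

```python
def macro_scores(confusion):
    """Macro precision, recall and F1 from a square confusion matrix.

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Illustrative 3-class matrix (rows: true class, columns: predicted class).
demo = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
precision, recall, f1 = macro_scores(demo)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```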

Despite these solid metrics, the custom CNN's performance dropped more noticeably on unseen real-world images than Xception's did.

---

## Deployment

### Run Locally

```bash
pip install -r requirements.txt
python app.py
```

Access at:

```
http://localhost:7860
```

---

## When to Use Each Approach

### Use Custom CNN if:

* Domain is highly specialized
* Pre-trained features don’t apply
* You need full architectural control

### Use Transfer Learning (e.g. DeiT or Xception) if:

* You want fast experimentation
* Efficiency matters
* You prefer high-level APIs
* You want the best accuracy
* You care about generalization
* You need production-grade reliability

---

## Final Conclusion

On small or moderately sized datasets:

> Transfer learning isn’t an optimization - it’s a necessity.

Training from scratch forces the model to learn both general visual features and task-specific knowledge simultaneously.

Pre-trained models already understand edges, textures, and spatial structure.
Your dataset only needs to teach classification boundaries.

For most real-world tasks:

* Start with transfer learning
* Fine-tune carefully
* Only train from scratch if absolutely necessary

---

## Results

<p align="center">
  <a href="https://files.catbox.moe/ss5ohr.mp4">
    <img src="https://files.catbox.moe/3x5mp7.webp" width="400">
  </a>
</p>