---
title: CV Model Comparison In PyTorch
emoji: 📊
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.8.0
app_file: app.py
pinned: false
license: mit
short_description: PyTorch CV models comparison.
models:
- AIOmarRehan/PyTorch_Unified_CNN_Model
datasets:
- AIOmarRehan/Vehicles
---

# PyTorch Model Comparison: From Custom CNNs to Advanced Transfer Learning

![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=flat&logo=python&logoColor=white)
![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-EE4C2C?style=flat&logo=pytorch&logoColor=white)
![Gradio](https://img.shields.io/badge/Gradio-4.0+-FF6F00?style=flat&logo=gradio&logoColor=white)
![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Transformers-FFD21E?style=flat&logo=huggingface&logoColor=white)
![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat&logo=docker&logoColor=white)

---

## Overview

This project compares **three computer vision approaches in PyTorch** on a vehicle classification task:

1. Custom CNN (trained from scratch)
2. Vision Transformer (DeiT-Tiny)
3. Xception with two-phase transfer learning

The goal is to answer a practical question:

> On small or moderately sized datasets, should you train from scratch or use transfer learning?

The results clearly show that **transfer learning dramatically improves generalization and reliability**, especially when data and compute are limited.

---

## Architectures Compared

### Custom CNN (From Scratch)

A traditional convolutional network built manually from Conv → ReLU → Pooling blocks followed by fully connected layers.

**Philosophy:** Full architectural control, no pre-training.
Minimal structure:

```python
import torch
import torch.nn as nn

class CustomCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # With 224x224 inputs, two 2x2 poolings leave 56x56 feature maps
        self.classifier = nn.Sequential(
            nn.Linear(64 * 56 * 56, 256), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```

**Reality on small datasets:**

* Slower convergence
* Higher variance
* Larger generalization gap

---

### Vision Transformer (DeiT-Tiny)

Using Hugging Face's pre-trained Vision Transformer:

```python
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "facebook/deit-tiny-patch16-224",
    num_labels=num_classes,
    ignore_mismatched_sizes=True,
)
```

Trained with the Hugging Face `Trainer` API.

**Advantages:**

* Stable convergence
* Lightweight
* Easy deployment
* Good performance-to-efficiency ratio

---

### Xception (Two-Phase Transfer Learning)

Implemented using `timm`.

### Phase 1 - Train Classifier Head Only

```python
import timm
import torch.nn as nn

model = timm.create_model("xception", pretrained=True)
for param in model.parameters():
    param.requires_grad = False

in_features = model.fc.in_features  # capture before replacing the head
model.fc = nn.Sequential(
    nn.Linear(in_features, 512), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, num_classes),
)
```

### Phase 2 - Fine-Tune Selected Layers

```python
for name, param in model.named_parameters():
    if "block14" in name or "fc" in name:
        param.requires_grad = True
```

A lower learning rate is used during fine-tuning.

**Result:**

- Smoothest training curves
- Lowest validation loss
- Highest test accuracy
- Strongest performance on unseen internet images

---

## Comparative Results

| Model      | Validation Performance | Generalization | Stability   |
| ---------- | ---------------------- | -------------- | ----------- |
| Custom CNN | High variance          | Weak           | Unstable    |
| DeiT-Tiny  | Strong                 | Good           | Stable      |
| Xception   | Best                   | Excellent      | Very Stable |

### Key Insight

> High validation accuracy does NOT guarantee real-world reliability.
The Custom CNN achieved strong validation scores (~87%) but struggled more under distribution shift. Xception consistently generalized better.

---

## Experimental Visualizations

### Dataset Distribution Across All Three Models

![Chart](https://files.catbox.moe/eyuftl.png)

---

### Xception Model

![Accuracy & Loss](https://files.catbox.moe/qv7n6e.png)

### Custom CNN Model

![Accuracy & Loss](https://files.catbox.moe/ch8s5d.png)

---

### Confusion Matrices: Custom CNN vs. Xception

| **Custom CNN** | **Xception** |
|------------|----------|
|            |          |

---

## Example Test Results (Custom CNN)

```
Test Accuracy: 87.89%
Macro Avg:
  Precision: 0.8852
  Recall:    0.8794
  F1-Score:  0.8789
```

Despite solid metrics, performance dropped more noticeably on unseen real-world images compared to Xception.

---

## Deployment

### Run Locally

```bash
pip install -r requirements.txt
python app.py
```

Access at:

```
http://localhost:7860
```

---

## When to Use Each Approach

### Use a Custom CNN if:

* The domain is highly specialized
* Pre-trained features don't apply
* You need full architectural control

### Use Transfer Learning (e.g., DeiT or Xception) if:

* You want fast experimentation
* Efficiency matters
* You prefer high-level APIs
* You want the best accuracy
* You care about generalization
* You need production-grade reliability

---

## Final Conclusion

On small or moderately sized datasets:

> Transfer learning isn't an optimization; it's a necessity.

Training from scratch forces the model to learn both general visual features and task-specific knowledge simultaneously. Pre-trained models already understand edges, textures, and spatial structure, so your dataset only needs to teach the classification boundaries.

For most real-world tasks:

* Start with transfer learning
* Fine-tune carefully
* Only train from scratch if absolutely necessary
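In practice, "fine-tune carefully" usually means discriminative learning rates: the unfrozen backbone layers get a much smaller rate than the freshly initialized head, so pre-trained features are nudged rather than overwritten. A minimal sketch using PyTorch optimizer parameter groups (the tiny stand-in model and the 1e-4 / 1e-3 rates are illustrative assumptions, not the exact values used in this project):

```python
import torch
import torch.nn as nn

# Tiny stand-in for a pre-trained backbone plus a new classifier head
# (illustrative only; the real project fine-tunes Xception from timm).
model = nn.ModuleDict({
    "backbone": nn.Linear(16, 8),
    "head": nn.Linear(8, 4),
})

# Split parameters by name so each group can get its own learning rate.
backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
head_params = [p for n, p in model.named_parameters() if n.startswith("head")]

optimizer = torch.optim.AdamW([
    {"params": backbone_params, "lr": 1e-4},  # gentle updates preserve pre-trained features
    {"params": head_params, "lr": 1e-3},      # the new head still learns from scratch
])
```

The same pattern scales to the Xception setup above: group the `block14` parameters separately from `fc` and hand each group its own `lr`.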