
---
title: CIFAR-100 Image Classifier
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

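# CIFAR-100 ResNet Training from Scratch

A lightweight ResNet (the `ResNet34` class in the code, built with one BasicBlock per stage and about 5M parameters) trained from scratch on the CIFAR-100 dataset, achieving 76.68% top-1 accuracy in 100 epochs with OneCycle learning rate scheduling.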

## Project Overview

This project demonstrates training a ResNet architecture from scratch on the CIFAR-100 dataset without using any pre-trained models. The implementation leverages modern deep learning techniques including data augmentation, OneCycle LR scheduling, and mixed precision training.

## Results Summary

Performance Metrics (100 Epochs)

| Metric | Score |
|--------|-------|
| Top-1 Accuracy | 76.68% ✅ (Target: 73%) |
| Top-3 Accuracy | 90.95% |
| Top-5 Accuracy | 94.07% |
| Best Test Accuracy | 76.79% (Epoch 99) |
| Macro F1-Score | 0.7670 |
| Weighted F1-Score | 0.7668 |
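
For readers unfamiliar with top-k accuracy, here is a minimal pure-Python sketch of how such a metric can be computed. The function name `topk_accuracy` and the toy scores are illustrative assumptions, not code from this repository.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)


# Toy scores for 4 samples over 5 classes (illustrative values only)
scores = [
    [0.10, 0.60, 0.20, 0.05, 0.05],  # true class 1: correct at top-1
    [0.50, 0.10, 0.30, 0.05, 0.05],  # true class 2: correct only within top-3
    [0.20, 0.20, 0.10, 0.40, 0.10],  # true class 3: correct at top-1
    [0.40, 0.30, 0.20, 0.08, 0.02],  # true class 4: missed even within top-3
]
labels = [1, 2, 3, 4]

top1 = topk_accuracy(scores, labels, k=1)
top3 = topk_accuracy(scores, labels, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Top-k accuracy can only grow with k, which is why the reported top-3 and top-5 figures sit above the top-1 figure.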

Averaged Metrics

Macro-Averaged (unweighted):

Precision: 0.7708
Recall: 0.7668
F1-Score: 0.7670

Weighted-Averaged (by class support):

Precision: 0.7708
Recall: 0.7668
F1-Score: 0.7668
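
The difference between the two averaging modes can be sketched as follows. The helper `macro_and_weighted` and the three-class toy data are assumptions for illustration; the project itself lists scikit-learn for metrics calculation.

```python
def macro_and_weighted(per_class_values, supports):
    """Return (macro, weighted) averages of a per-class metric.

    macro:    plain mean over classes, every class counts equally
    weighted: mean weighted by the number of test samples per class
    """
    macro = sum(per_class_values) / len(per_class_values)
    weighted = sum(v * n for v, n in zip(per_class_values, supports)) / sum(supports)
    return macro, weighted


# Toy example: three classes with unequal support (illustrative values only)
f1_per_class = [0.90, 0.50, 0.70]
supports = [100, 50, 50]

macro, weighted = macro_and_weighted(f1_per_class, supports)
print(f"macro={macro:.4f}, weighted={weighted:.4f}")
```

When every class has the same support the two averages coincide, so the small gap between the reported macro and weighted F1 suggests the evaluated samples were not perfectly balanced across classes.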

## Training Configuration

Model Architecture

Custom Lightweight ResNet for CIFAR-100

A specially designed ResNet variant optimized for small image classification:

Model: ResNet34 (class name; CIFAR-optimized with one BasicBlock per stage rather than the standard ResNet-34 [3,4,6,3] layout)
Total Parameters: 4,949,412 (~5M)
Trainable Parameters: 4,949,412
Input Size: 32×32×3 (RGB)
Output Classes: 100
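
The parameter total can be reproduced with plain arithmetic. The sketch below assumes bias-free convolutions (as in the `conv1` definition shown later), a BatchNorm after every convolution including the projection shortcuts, and a biased final linear layer; the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights of a bias-free k×k convolution."""
    return c_in * c_out * k * k


def bn_params(channels):
    """Learnable scale and shift of a BatchNorm2d layer."""
    return 2 * channels


def basic_block_params(c_in, c_out, projection):
    """Two 3×3 conv+BN pairs, plus an optional 1×1 conv+BN projection shortcut."""
    total = conv_params(c_in, c_out, 3) + bn_params(c_out)
    total += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if projection:
        total += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return total


stem = conv_params(3, 64, 3) + bn_params(64)
stages = (
    basic_block_params(64, 64, projection=False)
    + basic_block_params(64, 128, projection=True)
    + basic_block_params(128, 256, projection=True)
    + basic_block_params(256, 512, projection=True)
)
fc = 512 * 100 + 100  # weights + biases of Linear(512 -> 100)

total = stem + stages + fc
print(f"{total:,}")  # 4,949,412
```

Under these assumptions the count matches the reported 4,949,412 exactly, with roughly three quarters of the parameters sitting in the 512-channel stage.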

Architecture Details (from model_cifar.py):

Layer-by-Layer Feature Map Progression

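| Layer | Operation | Kernel | Stride | Padding | Input Size | Output Size | Channels | Receptive Field |
|-------|-----------|--------|--------|---------|------------|-------------|----------|-----------------|
| Input | - | - | - | - | 32×32 | 32×32 | 3 | 1×1 |
| conv1 | Conv2d | 3×3 | 1 | 1 | 32×32×3 | 32×32×64 | 64 | 3×3 |
| bn1+relu | BN+ReLU | - | - | - | 32×32×64 | 32×32×64 | 64 | 3×3 |
| layer1 | BasicBlock | 3×3,3×3 | 1,1 | 1,1 | 32×32×64 | 32×32×64 | 64 | 7×7 |
| layer2 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 32×32×64 | 16×16×128 | 128 | 13×13 |
| layer3 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 16×16×128 | 8×8×256 | 256 | 25×25 |
| layer4 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 8×8×256 | 4×4×512 | 512 | 49×49 |
| avgpool | AdaptiveAvgPool2d | 4×4 | - | - | 4×4×512 | 1×1×512 | 512 | Full image |
| fc | Linear | - | - | - | 512 | 100 | 100 | - |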

Key Observations:

Receptive field at layer4: 49×49 pixels (larger than the 32×32 input)
Spatial downsampling: 3 stride-2 operations reduce 32×32 → 4×4 (8× reduction per side)
Channel expansion: 3 → 64 → 128 → 256 → 512 (progressive feature richness)
Feature map efficiency: No MaxPool layer (used in the stem of ImageNet ResNets), so no early loss of spatial detail
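
Feature-map sizes and theoretical receptive fields can be traced with the standard recurrences: output size `floor((n + 2p - k) / s) + 1`, and receptive field `rf += (k - 1) * jump`, where `jump` is the distance in input pixels between neighbouring outputs. The sketch below applies them to the main (non-shortcut) path; the `trace` helper and the layer names are illustrative assumptions.

```python
def trace(layers, size=32):
    """Track spatial size and receptive field through a chain of convolutions."""
    rf, jump = 1, 1
    rows = []
    for name, kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
        rf += (kernel - 1) * jump  # growth uses the jump *before* this layer's stride
        jump *= stride             # a stride-s layer spreads later taps s times further
        rows.append((name, size, rf))
    return rows


# (name, kernel, stride, padding) for every 3×3 conv on the main path
layers = [
    ("conv1", 3, 1, 1),
    ("layer1.conv1", 3, 1, 1), ("layer1.conv2", 3, 1, 1),
    ("layer2.conv1", 3, 2, 1), ("layer2.conv2", 3, 1, 1),
    ("layer3.conv1", 3, 2, 1), ("layer3.conv2", 3, 1, 1),
    ("layer4.conv1", 3, 2, 1), ("layer4.conv2", 3, 1, 1),
]

rows = trace(layers)
for name, size, rf in rows:
    print(f"{name:14s} size={size:2d}x{size:<2d} rf={rf}x{rf}")
```

With these recurrences the receptive field after the stem and the four stages comes out as 3, 7, 13, 25 and 49 pixels, while the spatial size goes 32 → 32 → 16 → 8 → 4.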

Detailed Architecture Components

Initial Convolution Block
```
Input: 32×32×3 → Conv2d(3→64, k=3×3, s=1, p=1) → BN → ReLU → Output: 32×32×64
Receptive Field: 1×1 → 3×3
```
- CIFAR-optimized: 3×3 conv (not 7×7 like ImageNet ResNets)
- Preserves spatial resolution (no stride-2 or MaxPool)
- Captures fine-grained details essential for small images

Layer 1: Residual Stage 1 (64 channels, no downsampling)

Input: 32×32×64
BasicBlock:
  ├─ Conv(64→64, k=3×3, s=1, p=1) → BN → ReLU → 32×32×64
  ├─ Conv(64→64, k=3×3, s=1, p=1) → BN → 32×32×64
  └─ Add(identity) → ReLU → Output: 32×32×64
Receptive Field: 3×3 → 7×7

No spatial downsampling (stride=1)
Identity skip connection (no projection needed)
RF grows by 4 pixels (2 conv layers × 2 pixels each)

Layer 2: Residual Stage 2 (128 channels, downsample)

Input: 32×32×64
BasicBlock:
  ├─ Conv(64→128, k=3×3, s=2, p=1) → BN → ReLU → 16×16×128
  ├─ Conv(128→128, k=3×3, s=1, p=1) → BN → 16×16×128
  ├─ Skip: Conv(64→128, k=1×1, s=2) → BN → 16×16×128 (projection)
  └─ Add(skip) → ReLU → Output: 16×16×128
Receptive Field: 7×7 → 13×13

Spatial downsampling: 32×32 → 16×16 (stride=2 in first conv)
Channel expansion: 64 → 128
Projection shortcut: 1×1 conv matches dimensions
RF grows by 6 pixels: +2 from the stride-2 conv, then +4 from the second conv, whose taps are 2 input pixels apart

Layer 3: Residual Stage 3 (256 channels, downsample)

Input: 16×16×128
BasicBlock:
  ├─ Conv(128→256, k=3×3, s=2, p=1) → BN → ReLU → 8×8×256
  ├─ Conv(256→256, k=3×3, s=1, p=1) → BN → 8×8×256
  ├─ Skip: Conv(128→256, k=1×1, s=2) → BN → 8×8×256 (projection)
  └─ Add(skip) → ReLU → Output: 8×8×256
Receptive Field: 13×13 → 25×25

Spatial downsampling: 16×16 → 8×8
Channel expansion: 128 → 256
RF now spans most of the input width (25 of 32 pixels)

Layer 4: Residual Stage 4 (512 channels, downsample)

Input: 8×8×256
BasicBlock:
  ├─ Conv(256→512, k=3×3, s=2, p=1) → BN → ReLU → 4×4×512
  ├─ Conv(512→512, k=3×3, s=1, p=1) → BN → 4×4×512
  ├─ Skip: Conv(256→512, k=1×1, s=2) → BN → 4×4×512 (projection)
  └─ Add(skip) → ReLU → Output: 4×4×512
Receptive Field: 25×25 → 49×49

Final spatial downsampling: 8×8 → 4×4
Maximum channels: 512 (highest feature richness)
RF exceeds input size: 49×49 > 32×32

Classification Head

Input: 4×4×512
  ├─ AdaptiveAvgPool2d((1,1)) → 1×1×512 (global spatial pooling)
  ├─ Flatten → 512
  └─ Linear(512 → 100) → 100 class logits

Global Average Pooling: Each of 512 channels → single value
Reduces overfitting vs fully-connected layers
Translation invariant features

Initialization Strategy
- Kaiming (He) Normal for Conv2d weights
  - Optimal for ReLU activations
  - std = sqrt(2 / fan_in)
- Constant initialization for BatchNorm
  - weight = 1, bias = 0
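
As a worked example of the `std = sqrt(2 / fan_in)` rule, the sketch below computes the resulting standard deviation for a few of the 3×3 convolutions. It assumes fan-in mode as stated above; the actual mode used in `model_cifar.py` is not shown here, and `kaiming_std` is a hypothetical helper.

```python
import math


def kaiming_std(c_in, kernel):
    """Std of Kaiming-normal weights for ReLU with fan_in = c_in * kernel * kernel."""
    fan_in = c_in * kernel * kernel
    return math.sqrt(2.0 / fan_in)


# Input channel counts of selected 3×3 convs in the network
for name, c_in in [("conv1", 3), ("layer1 convs", 64), ("layer4.conv2", 512)]:
    print(f"{name:13s} fan_in={c_in * 9:5d} std={kaiming_std(c_in, 3):.4f}")
```

Wider layers get proportionally smaller initial weights, which keeps the variance of the activations roughly constant from layer to layer.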

Architecture Flow Diagram

Input Image (32×32×3, RF=1×1)
    ↓
┌─────────────────────────────────────────────────────────┐
│ STEM: Conv 3×3 → BN → ReLU                             │
│ Output: 32×32×64, RF=3×3                               │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│ STAGE 1: BasicBlock (64 channels, stride=1)           │
│   Conv 3×3 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU   │
│   Output: 32×32×64, RF=7×7                             │
│   Skip: Identity (no projection needed)                │
└─────────────────────────────────────────────────────────┘
    ↓ [Spatial: 32×32, Channels: 64, RF: 7×7]
┌─────────────────────────────────────────────────────────┐
│ STAGE 2: BasicBlock (128 channels, stride=2) ↓↓       │
│   Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU│
│   Output: 16×16×128, RF=13×13                          │
│   Skip: Conv 1×1,s2 (projection: 64→128)               │
└─────────────────────────────────────────────────────────┘
    ↓ [Spatial: 16×16, Channels: 128, RF: 13×13]
┌─────────────────────────────────────────────────────────┐
│ STAGE 3: BasicBlock (256 channels, stride=2) ↓↓       │
│   Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU│
│   Output: 8×8×256, RF=25×25                            │
│   Skip: Conv 1×1,s2 (projection: 128→256)              │
└─────────────────────────────────────────────────────────┘
    ↓ [Spatial: 8×8, Channels: 256, RF: 25×25]
┌─────────────────────────────────────────────────────────┐
│ STAGE 4: BasicBlock (512 channels, stride=2) ↓↓       │
│   Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU│
│   Output: 4×4×512, RF=49×49 (exceeds 32×32!)          │
│   Skip: Conv 1×1,s2 (projection: 256→512)              │
└─────────────────────────────────────────────────────────┘
    ↓ [Spatial: 4×4, Channels: 512, RF: Full Image]
┌─────────────────────────────────────────────────────────┐
│ HEAD: Global Average Pooling → FC                      │
│   AdaptiveAvgPool2d(1,1) → Flatten → Linear(512→100)  │
│   Output: 100 class logits                             │
└─────────────────────────────────────────────────────────┘
    ↓
Predictions (100 classes)

Key Design Choices:

✅ CIFAR-specific stem: 3×3 conv instead of 7×7 (ImageNet-style)
✅ No aggressive downsampling: Preserves spatial information for 32×32 images
✅ Lightweight: 1 block per stage instead of [3,4,6,3] for efficient training
✅ Residual connections: Enable gradient flow for deeper networks
✅ Global Average Pooling: Reduces overfitting vs fully-connected layers
✅ Progressive RF growth: Each layer sees more context (7→13→25→49 pixels)

Training Hyperparameters

Epochs: 100
Batch Size: 512
Optimizer: SGD with Nesterov momentum
Momentum: 0.9
Weight Decay: 1e-4
Label Smoothing: 0.1
Mixed Precision: Enabled (AMP)
Gradient Clipping: 1.0

# OneCycle Learning Rate Schedule
LR Schedule: OneCycle (Custom)
  - Phase 1 (Epochs 0-40): 0.01 → 0.1 (warmup)
  - Phase 2 (Epochs 41-81): 0.1 → 0.01 (cooldown)
  - Phase 3 (Epochs 82-99): 0.01 → 0.001 (annihilation)
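
To make the label smoothing setting concrete, here is a small pure-Python sketch of a smoothed cross-entropy for one sample. It uses the same target distribution as PyTorch's `nn.CrossEntropyLoss(label_smoothing=...)`, namely `1 - smoothing` on the true class plus `smoothing / K` spread over all `K` classes. How the training script applies smoothing is not shown in this README, so treat the function as an assumption-laden illustration.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy of one sample against a label-smoothed target distribution."""
    # Numerically stable log-softmax
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    nll_true = -log_probs[target]                  # loss against the one-hot part
    nll_uniform = -sum(log_probs) / len(logits)    # loss against the uniform part
    return (1.0 - smoothing) * nll_true + smoothing * nll_uniform


confident = [10.0, 0.0, 0.0]  # toy logits, very sure of class 0
print(smoothed_cross_entropy(confident, target=0, smoothing=0.0))
print(smoothed_cross_entropy(confident, target=0, smoothing=0.1))
```

The smoothed loss stays noticeably above zero even for a confident correct prediction, which discourages the logits from growing without bound.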

Data Augmentation

Using Albumentations library:

Training:
- Random padding (32→36) + Random crop (36→32)
- Horizontal flip (p=0.5)
- ShiftScaleRotate (shift=0.05, scale=0.05, rotate=5°, p=0.3)
- CoarseDropout/Cutout (16×16, p=0.4)
- Color jitter (brightness, contrast, saturation, hue, p=0.4)
- Normalization (CIFAR-100 mean/std)
Testing:
- Normalization only
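
The first two training transforms (pad 32→36, random crop back to 32, horizontal flip) can be illustrated without any library. The sketch below works on a single-channel image stored as nested lists and assumes zero padding; the border mode used by the project's Albumentations pipeline is not stated, and both function names are hypothetical.

```python
import random


def pad_and_random_crop(image, pad=2, rng=random):
    """Zero-pad an H×W image by `pad` on each side, then crop H×W at a random offset."""
    h, w = len(image), len(image[0])
    padded = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for y in range(h):
        for x in range(w):
            padded[y + pad][x + pad] = image[y][x]
    top = rng.randint(0, 2 * pad)    # inclusive range: 0 .. 2*pad
    left = rng.randint(0, 2 * pad)
    return [row[left:left + w] for row in padded[top:top + h]]


def random_hflip(image, p=0.5, rng=random):
    """Mirror the image left-right with probability p."""
    return [row[::-1] for row in image] if rng.random() < p else image


rng = random.Random(0)                       # seeded for repeatability
image = [[1] * 32 for _ in range(32)]        # toy 32×32 single-channel image
augmented = random_hflip(pad_and_random_crop(image, pad=2, rng=rng), rng=rng)
print(len(augmented), len(augmented[0]))     # spatial size is preserved
```

The crop shifts the content by up to 2 pixels in each direction, a cheap way to add translation jitter while keeping the 32×32 input size.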

## Training Results

Training Curves

The training curves show:

Steady convergence with minimal overfitting
Effective learning rate schedule with OneCycle policy
Generalization gap maintained below 5% throughout training
Final training accuracy: 80.47%

Learning Rate Schedule

The OneCycle LR schedule implementation:

Warmup Phase (41 epochs): Linear increase from 0.01 to 0.1
Cooldown Phase (41 epochs): Linear decrease from 0.1 to 0.01
Annihilation Phase (18 epochs): Linear decrease from 0.01 to 0.001

This schedule helps the model:

Escape local minima early in training
Find a wide minimum for better generalization
Fine-tune with very small learning rates at the end
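
The three phases can be written as one piecewise-linear function of the epoch index. The sketch below is one interpolation consistent with the phase lengths above and with the final learning rate of 0.001500 reported in the training log; the function name and the exact per-epoch stepping are assumptions, not the code in `main.py`.

```python
def onecycle_lr(epoch, base_lr=0.01, max_lr=0.1, final_lr=0.001,
                warmup=41, cooldown=41, total=100):
    """Learning rate for a 0-based epoch under a three-phase linear schedule."""

    def lerp(start, end, step, steps):
        # Linear interpolation: `start` at step 0, `end` at step == steps
        return start + (end - start) * step / steps

    if epoch < warmup:                       # epochs 0-40: warmup
        return lerp(base_lr, max_lr, epoch, warmup)
    if epoch < warmup + cooldown:            # epochs 41-81: cooldown
        return lerp(max_lr, base_lr, epoch - warmup, cooldown)
    annihilation = total - warmup - cooldown  # epochs 82-99: annihilation
    return lerp(base_lr, final_lr, epoch - warmup - cooldown, annihilation)


for epoch in (0, 20, 41, 60, 82, 99):
    print(f"epoch {epoch:2d}: lr = {onecycle_lr(epoch):.6f}")
```

With this stepping the last epoch (99) runs at 0.0015 rather than 0.001, because the final target would only be reached one step after training ends.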

Per-Class Performance

Top 5 Best Performing Classes:

wardrobe - F1: 0.9458 (Precision: 0.9320, Recall: 0.9600)
sunflower - F1: 0.9381 (Precision: 0.9681, Recall: 0.9100)
poppy - F1: 0.9315 (Precision: 0.9444, Recall: 0.9189)
can - F1: 0.9310 (Precision: 0.9000, Recall: 0.9643)
skyscraper - F1: 0.9100 (Precision: 0.9100, Recall: 0.9100)
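
Each F1 value is the harmonic mean of the listed precision and recall. The short check below recomputes it for the five classes above; the tolerance allows for the precision and recall values themselves being rounded to four decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the list above
reported = {
    "wardrobe": (0.9320, 0.9600, 0.9458),
    "sunflower": (0.9681, 0.9100, 0.9381),
    "poppy": (0.9444, 0.9189, 0.9315),
    "can": (0.9000, 0.9643, 0.9310),
    "skyscraper": (0.9100, 0.9100, 0.9100),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{name:10s} reported={f1:.4f} recomputed={recomputed:.4f}")
```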

Most Challenging Classes:

boy - F1: 0.4286 (Fine-grained human features)
girl - F1: 0.4646 (Similar to boy)
baby - F1: 0.5079 (Fine-grained human features)
man - F1: 0.5758 (Similar to boy)
plate - F1: 0.5797 (Simple objects, easily confused)

The model performs exceptionally well on distinct objects (flowers, buildings, furniture) but struggles with fine-grained human categorization, which is expected for CIFAR-100's 32×32 resolution.

## Model Architecture Summary

From model_cifar.py:

| Component | Specification |
|-----------|---------------|
| Model Name | ResNet34 (CIFAR-optimized) |
| Total Parameters | 4,949,412 (~5M) |
| Architecture Depth | 10 weight layers (1 initial + 8 residual + 1 FC) |
| Residual Blocks | 4 BasicBlocks (1 per stage) |
| Channel Progression | 3 → 64 → 128 → 256 → 512 → 100 |
| Spatial Downsampling | 32×32 → 16×16 → 8×8 → 4×4 → 1×1 |
| Receptive Field Growth | 1×1 → 3×3 → 7×7 → 13×13 → 25×25 → 49×49 |
| Skip Connections | 4 (1 identity + 3 projection shortcuts) |
| Pooling Strategy | Global Average Pooling (4×4 → 1×1) |
| Initialization | Kaiming Normal (He) for Conv, Constant for BN |
| Downsampling Method | Strided convolutions (no MaxPool) |

Why This Architecture Works for CIFAR-100:

Right-sized capacity: 5M parameters balances expressiveness vs overfitting risk
Preserved resolution: No aggressive downsampling maintains spatial detail in 32×32 images
Large receptive field: the 49×49 theoretical RF exceeds the 32×32 input, and global average pooling then covers the full image
Progressive downsampling: 3 stride-2 ops (vs 1 MaxPool + 4 stride-2 in ImageNet ResNet)
Residual learning: Skip connections enable gradient flow through 10 weight layers
Efficient computation: Lightweight design trains in ~2-3 hours on a single GPU

Receptive Field Analysis:

By layer2 (16×16×128): RF = 13×13 → spans ~40% of the image width
By layer3 (8×8×256): RF = 25×25 → spans ~78% of the image width
By layer4 (4×4×512): RF = 49×49 → wider than the image itself
After global average pooling over the 4×4 positions, the classifier draws on the entire input image

## Project Structure

CIFAR100/
├── main.py                 # Main training script with OneCycle LR
├── model_cifar.py         # Custom ResNet architecture (5M params)
│   ├── BasicBlock         # 2-layer residual block with skip connection
│   └── ResNet34           # CIFAR-optimized ResNet variant
├── train.py               # Training and evaluation loops
├── preprocess.py          # Data loading with Albumentations
├── visualization.py       # Metrics calculation and plotting
├── inference.py           # Model inference utilities
├── app.py                 # Gradio web interface for demo
├── run_complete_training.py  # Full training pipeline with logging
├── requirements.txt       # Python dependencies
├── log/                   # Training logs
│   └── training_complete_20251010-103227.log
└── plots_complete/        # Training visualizations
    ├── training_curves.png
    ├── learning_rate_schedule.png
    ├── class_metrics.png
    ├── confusion_matrix.png
    └── classification_report.txt

## Quick Start

Installation

# Clone the repository
git clone <your-repo-url>
cd CIFAR100

# Install dependencies
pip install -r requirements.txt

Training

# Train with OneCycle LR for 100 epochs
python main.py \
    --scheduler onecycle \
    --epochs 100 \
    --batch_size 512 \
    --lr 0.1 \
    --momentum 0.9 \
    --weight_decay 1e-4 \
    --amp \
    --plot_training \
    --plot_evaluation

# Or use the complete training script with logging
python run_complete_training.py

Inference

# Run interactive web demo
python app.py

# Or use inference script
python inference.py --image path/to/image.jpg --model snapshots/best_model.pth

## Key Features

1. OneCycle Learning Rate Policy

Implements a custom three-phase OneCycle LR schedule, following the paper "Super-Convergence: Very Fast Training of Neural Networks":

Achieves faster convergence
Better generalization
Higher final accuracy

2. Comprehensive Metrics Logging

After each training run, the script automatically outputs:

Training and test accuracy/loss curves
Top-1, Top-3, Top-5 accuracies
Precision, Recall, F1-Score (macro and weighted)
Per-class performance breakdown
Confusion matrix and classification report

3. Mixed Precision Training (AMP)

Up to 2-3× faster training on GPUs with Tensor Cores
Reduced memory usage
Maintains accuracy with float16/float32 mixed precision

4. Advanced Data Augmentation

Uses Albumentations for efficient augmentation:

Often faster than torchvision transforms
More augmentation options
Runs on the CPU in the data-loading workers, adding little overhead to GPU training

5. Model Checkpointing

Automatic snapshot saving at specified intervals
Best model tracking based on test accuracy
Resume training from any checkpoint

## Detailed Training Log

Full training logs are available in log/training_complete_20251010-103227.log, including:

Per-epoch train/test loss and accuracy
Learning rate at each epoch
Final comprehensive evaluation with per-class metrics
Training time and resource utilization

Example final output:

```
TRAINING COMPLETED - FINAL EVALUATION

TRAINING SUMMARY
Total Epochs Trained: 100
Final Training Loss: 0.5584
Final Training Accuracy: 80.47%
Best Training Accuracy: 81.05% (Epoch 94)
Final Learning Rate: 0.001500

TEST/VALIDATION SUMMARY
Final Test Loss: 0.8985
Final Test Accuracy: 76.68%
Best Test Accuracy: 76.79% (Epoch 99)

COMPREHENSIVE TEST SET METRICS
Top-1 Accuracy (Test): 76.68%
Top-3 Accuracy (Test): 90.95%
Top-5 Accuracy (Test): 94.07%
```

## Requirements Met

✅ **Training from Scratch**: Custom ResNet (5M params) trained without pre-trained weights  
✅ **CIFAR-100 Dataset**: All 100 classes used (50,000 train / 10,000 test)  
✅ **Target Accuracy**: **76.68% achieved** (target: 73%) - **Exceeded by 3.68 percentage points**  
✅ **Training Duration**: 100 epochs with OneCycle LR schedule  
✅ **Modern Tools**: Extensive use of ChatGPT/Cursor for development  
✅ **Comprehensive Evaluation**: Full metrics, plots, and detailed analysis  
✅ **Model Architecture**: Custom lightweight ResNet optimized for CIFAR-100  
✅ **Reproducibility**: Complete logs, checkpoints, and configuration documented  

## Technologies Used

- **PyTorch** - Deep learning framework
- **Albumentations** - Data augmentation
- **Gradio** - Web interface for inference
- **scikit-learn** - Metrics calculation
- **matplotlib/seaborn** - Visualization
- **numpy** - Numerical operations

## Model Comparison

| Model Variant | Parameters | Expected Accuracy | Notes |
|---------------|------------|-------------------|-------|
| **Our Model** (4 blocks) | **5M** | **76.68%** | Balanced efficiency & accuracy |
| Standard ResNet-18 | 11M | ~76-78% | Good baseline for CIFAR |
| Standard ResNet-34 | 21M | ~78-80% | More capacity, slower training |
| Wide-ResNet-28-10 | 36M | ~80-82% | State-of-art, requires more resources |
| PyramidNet | 26M | ~82-84% | Complex architecture |

**Our lightweight design achieves competitive accuracy with 2-4× fewer parameters than standard ResNets.**

## Future Improvements

Potential enhancements to reach higher accuracy (78%+):
1. **Architecture upgrades**: 
   - Increase blocks per stage: [2, 2, 2, 2] or [3, 3, 3, 3]
   - Try Wide-ResNet with wider channels
   - Add Squeeze-and-Excitation (SE) blocks
2. **Training tricks**: 
   - Mixup (α=0.2) for better generalization
   - CutMix for spatial regularization
   - AutoAugment or RandAugment policies
3. **Regularization**: 
   - Stochastic Depth (survival probability 0.8-0.9)
   - DropBlock for spatial dropout
   - Increased label smoothing (0.2)
4. **Ensemble methods**: 
   - Train 3-5 models with different seeds
   - Snapshot ensembles (save last N checkpoints)
5. **Longer training**: 
   - 200-300 epochs with cosine annealing
   - Multi-step or exponential LR decay
6. **Knowledge distillation**: 
   - Train larger teacher model first
   - Use soft targets for student training

## Technical Implementation Details

### Architecture Design Rationale

**Why a lightweight ResNet variant?**

1. **CIFAR-100 Image Size**: At 32×32 pixels, CIFAR images contain less spatial information than ImageNet (224×224)
   - Standard ResNet-34's [3,4,6,3] block structure is over-parameterized
   - Our [1,1,1,1] structure provides sufficient capacity without overfitting

2. **Parameter Efficiency**:
   - 5M parameters: Sweet spot between underfitting and overfitting
   - Faster training: 100 epochs in ~2-3 hours vs 5-6 hours for ResNet-34
   - Lower memory footprint: Can use larger batch sizes

3. **CIFAR-Specific Modifications**:
   - **3×3 initial conv** (vs 7×7): Preserves fine details in small images
   - **No MaxPool layer**: Maintains spatial resolution (32×32 → 4×4 over 4 stages)
   - **Stride-2 convolutions**: Gradual downsampling for feature hierarchy

### Code Reference

From `model_cifar.py`:
```python
import torch.nn as nn
import torch.nn.functional as F


class ResNet34(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.in_channels = 64

        # CIFAR-specific stem: 3×3 conv, no maxpool
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)

        # 4 stages with 1 BasicBlock each (output shape in the trailing comment)
        self.layer1 = self._make_layer(64, 1)             # 32×32×64
        self.layer2 = self._make_layer(128, 1, stride=2)  # 16×16×128
        self.layer3 = self._make_layer(256, 1, stride=2)  # 8×8×256
        self.layer4 = self._make_layer(512, 1, stride=2)  # 4×4×512

        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    # _make_layer() and forward() are not shown in this excerpt.
```

`BasicBlock` (2 conv layers + skip connection):

```python
class BasicBlock(nn.Module):
    # Excerpt: __init__ (conv1, bn1, conv2, bn2 and the optional projection) is not shown.
    def forward(self, x):
        identity = x                               # stride-2 blocks add the 1×1 projection of x instead
        out = F.relu(self.bn1(self.conv1(x)))      # Conv → BN → ReLU
        out = self.bn2(self.conv2(out))            # Conv → BN
        out += identity                            # Add skip connection
        out = F.relu(out)                          # ReLU after the addition
        return out
```

## References

Papers:

- He et al., "Deep Residual Learning for Image Recognition" (2016) - ResNet architecture
- Smith and Topin, "Super-Convergence: Very Fast Training of Neural Networks" (2018) - OneCycle LR
- Krizhevsky, "Learning Multiple Layers of Features from Tiny Images" (2009) - CIFAR-100

Implementation Resources:

- PyTorch official ResNet implementation
- Albumentations library for efficient augmentation
- torchvision.datasets for CIFAR-100 loading

## License

MIT License

## Acknowledgments

This project was developed with extensive assistance from:

- ChatGPT for architecture design and debugging
- Cursor AI for code completion and refactoring
- PyTorch and torchvision communities for reference implementations

Note: Training logs, model checkpoints, and detailed per-class metrics are available in the log/ and plots_complete/ directories.