title: CIFAR-100 Image Classifier
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit

CIFAR-100 ResNet Training from Scratch

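A lightweight ResNet (named ResNet34 in the code, but built with one BasicBlock per stage and about 5M parameters) trained from scratch on the CIFAR-100 dataset, reaching 76.68% top-1 accuracy in 100 epochs with OneCycle learning rate scheduling.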

Project Overview

This project demonstrates training a ResNet architecture from scratch on the CIFAR-100 dataset without using any pre-trained models. The implementation leverages modern deep learning techniques including data augmentation, OneCycle LR scheduling, and mixed precision training.

Results Summary

Performance Metrics (100 Epochs)

| Metric | Score |
|---|---|
| Top-1 Accuracy | 76.68% ✅ (Target: 73%) |
| Top-3 Accuracy | 90.95% |
| Top-5 Accuracy | 94.07% |
| Best Test Accuracy | 76.79% (Epoch 99) |
| Macro F1-Score | 0.7670 |
| Weighted F1-Score | 0.7668 |
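
For readers unfamiliar with top-k accuracy, here is a minimal sketch of how such a metric can be computed. The function name topk_accuracy and the toy scores are illustrative assumptions, not code or data from this repository.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)


# Toy example: 3 samples, 4 classes (made-up scores, not model output)
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 has the highest score
    [0.5, 0.1, 0.3, 0.1],  # true class 2 has the second-highest score
    [0.4, 0.3, 0.2, 0.1],  # true class 3 has the lowest score
]
labels = [1, 2, 3]

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
```

Top-k accuracy can only stay the same or increase as k grows, which is why the Top-3 and Top-5 figures sit above Top-1.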

Averaged Metrics

Macro-Averaged (unweighted):

  • Precision: 0.7708
  • Recall: 0.7668
  • F1-Score: 0.7670

Weighted-Averaged (by class support):

  • Precision: 0.7708
  • Recall: 0.7668
  • F1-Score: 0.7668
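
The difference between the two averaging modes can be sketched in a few lines. The per-class values and supports below are made up for illustration, and the helper names are assumptions rather than functions from this project.

```python
def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean weighted by the number of test samples (support) in each class."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# Hypothetical per-class F1 scores for a 3-class problem
f1_per_class = [0.9, 0.5, 0.7]
balanced = [100, 100, 100]     # equal support per class
imbalanced = [50, 150, 100]    # the weakest class has the most samples

macro = macro_average(f1_per_class)
weighted_balanced = weighted_average(f1_per_class, balanced)
weighted_imbalanced = weighted_average(f1_per_class, imbalanced)
```

With equal supports the two averages coincide, which is consistent with the macro and weighted figures above being nearly identical.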

Training Configuration

Model Architecture

Custom Lightweight ResNet for CIFAR-100

A specially designed ResNet variant optimized for small image classification:

Model: ResNet34 (CIFAR-optimized)
Total Parameters: 4,949,412 (~5M)
Trainable Parameters: 4,949,412
Input Size: 32×32×3 (RGB)
Output Classes: 100
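
The parameter total can be reproduced by hand. The sketch below assumes bias-free convolutions, BatchNorm with one learnable weight and bias per channel, a 1×1 convolution plus BatchNorm on each projection shortcut, and a Linear head with bias. The helper names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights of a k×k convolution without bias."""
    return c_in * c_out * k * k


def bn_params(channels):
    """BatchNorm: one learnable weight and one bias per channel."""
    return 2 * channels


def basic_block_params(c_in, c_out, stride):
    """Two 3×3 convs with BN, plus a 1×1 projection when the shape changes."""
    n = conv_params(c_in, c_out, 3) + bn_params(c_out)
    n += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if stride != 1 or c_in != c_out:
        n += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return n


total = conv_params(3, 64, 3) + bn_params(64)           # stem
c_in = 64
for c_out, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
    total += basic_block_params(c_in, c_out, stride)    # one block per stage
    c_in = c_out
total += 512 * 100 + 100                                # Linear(512, 100) with bias
```

Under these assumptions the count comes out to 4,949,412, the total reported above.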

Architecture Details (from model_cifar.py):

Layer-by-Layer Feature Map Progression

| Layer | Operation | Kernel | Stride | Padding | Input Size | Output Size | Channels | Receptive Field |
|---|---|---|---|---|---|---|---|---|
| Input | - | - | - | - | 32×32 | 32×32 | 3 | 1×1 |
| conv1 | Conv2d | 3×3 | 1 | 1 | 32×32×3 | 32×32×64 | 64 | 3×3 |
| bn1+relu | BN+ReLU | - | - | - | 32×32×64 | 32×32×64 | 64 | 3×3 |
| layer1 | BasicBlock | 3×3, 3×3 | 1, 1 | 1, 1 | 32×32×64 | 32×32×64 | 64 | 7×7 |
| layer2 | BasicBlock | 3×3, 3×3 | 2, 1 | 1, 1 | 32×32×64 | 16×16×128 | 128 | 13×13 |
| layer3 | BasicBlock | 3×3, 3×3 | 2, 1 | 1, 1 | 16×16×128 | 8×8×256 | 256 | 25×25 |
| layer4 | BasicBlock | 3×3, 3×3 | 2, 1 | 1, 1 | 8×8×256 | 4×4×512 | 512 | 49×49 |
| avgpool | AdaptiveAvgPool2d | 4×4 | - | - | 4×4×512 | 1×1×512 | 512 | Full image |
| fc | Linear | - | - | - | 512 | 100 | 100 | - |

Key Observations:

  • Receptive field at layer4: 49×49 pixels (larger than the 32×32 input, by roughly 1.5× per side)
  • Spatial downsampling: 3 stride-2 operations reduce 32×32 → 4×4 (8× reduction per side)
  • Channel expansion: 3 → 64 → 128 → 256 → 512 (progressive feature richness)
  • Feature map efficiency: No information loss from MaxPooling (common in ImageNet models)
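
Receptive-field figures can be checked with the standard recurrence r_out = r_in + (k - 1) * j and j_out = j * s, where j is the spacing, in input pixels, between adjacent positions of the current feature map. The sketch below applies it to the main path of each block. The layer list is transcribed from the table above, and the function name is an assumption.

```python
def receptive_fields(layers):
    """layers: list of (name, kernel, stride). Returns {name: receptive field}."""
    rf, jump = 1, 1
    result = {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump   # each extra tap reaches `jump` input pixels further
        jump *= stride              # striding widens the spacing for later layers
        result[name] = rf
    return result


# Main-path 3×3 convolutions: stem, then two convs per BasicBlock
layers = [
    ("conv1", 3, 1),
    ("layer1.conv1", 3, 1), ("layer1.conv2", 3, 1),
    ("layer2.conv1", 3, 2), ("layer2.conv2", 3, 1),
    ("layer3.conv1", 3, 2), ("layer3.conv2", 3, 1),
    ("layer4.conv1", 3, 2), ("layer4.conv2", 3, 1),
]
rf = receptive_fields(layers)
```

Under this recurrence the four stages end at 7, 13, 25 and 49 pixels. The stride-2 convolution itself adds only two pixels; its main effect is to double the spacing, so every later 3×3 convolution adds twice as much.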

Detailed Architecture Components

  1. Initial Convolution Block

    Input: 32×32×3 → Conv2d(3→64, k=3×3, s=1, p=1) → BN → ReLU → Output: 32×32×64
    Receptive Field: 1×1 → 3×3
    
    • CIFAR-optimized: 3×3 conv (not 7×7 like ImageNet ResNets)
    • Preserves spatial resolution (no stride-2 or MaxPool)
    • Captures fine-grained details essential for small images
  2. Layer 1: Residual Stage 1 (64 channels, no downsampling)

    Input: 32×32×64
    BasicBlock:
      ├─ Conv(64→64, k=3×3, s=1, p=1) → BN → ReLU → 32×32×64
      ├─ Conv(64→64, k=3×3, s=1, p=1) → BN → 32×32×64
      └─ Add(identity) → ReLU → Output: 32×32×64
    Receptive Field: 3×3 → 7×7
    
    • No spatial downsampling (stride=1)
    • Identity skip connection (no projection needed)
    • RF grows by 4 pixels (2 conv layers × 2 pixels each)
  3. Layer 2: Residual Stage 2 (128 channels, downsample)

    Input: 32×32×64
    BasicBlock:
      ├─ Conv(64→128, k=3×3, s=2, p=1) → BN → ReLU → 16×16×128
      ├─ Conv(128→128, k=3×3, s=1, p=1) → BN → 16×16×128
      ├─ Skip: Conv(64→128, k=1×1, s=2) → BN → 16×16×128 (projection)
      └─ Add(skip) → ReLU → Output: 16×16×128
    Receptive Field: 7×7 → 13×13
    
    • Spatial downsampling: 32×32 → 16×16 (stride=2 in first conv)
    • Channel expansion: 64 → 128
    • Projection shortcut: 1×1 conv matches dimensions
    • RF grows by 6 pixels: +2 from the stride-2 conv, then +4 from the second conv, whose taps are spaced 2 input pixels apart
  4. Layer 3: Residual Stage 3 (256 channels, downsample)

    Input: 16×16×128
    BasicBlock:
      ├─ Conv(128→256, k=3×3, s=2, p=1) → BN → ReLU → 8×8×256
      ├─ Conv(256→256, k=3×3, s=1, p=1) → BN → 8×8×256
      ├─ Skip: Conv(128→256, k=1×1, s=2) → BN → 8×8×256 (projection)
      └─ Add(skip) → ReLU → Output: 8×8×256
    Receptive Field: 13×13 → 25×25
    
    • Spatial downsampling: 16×16 → 8×8
    • Channel expansion: 128 → 256
    • RF now spans most of the input image (25 of 32 pixels per side)
  5. Layer 4: Residual Stage 4 (512 channels, downsample)

    Input: 8×8×256
    BasicBlock:
      ├─ Conv(256→512, k=3×3, s=2, p=1) → BN → ReLU → 4×4×512
      ├─ Conv(512→512, k=3×3, s=1, p=1) → BN → 4×4×512
      ├─ Skip: Conv(256→512, k=1×1, s=2) → BN → 4×4×512 (projection)
      └─ Add(skip) → ReLU → Output: 4×4×512
    Receptive Field: 25×25 → 49×49
    
    • Final spatial downsampling: 8×8 → 4×4
    • Maximum channels: 512 (highest feature richness)
    • RF exceeds input size: 49×49 > 32×32 (near-full image context)
  6. Classification Head

    Input: 4×4×512
      ├─ AdaptiveAvgPool2d((1,1)) → 1×1×512 (global spatial pooling)
      ├─ Flatten → 512
      └─ Linear(512 → 100) → 100 class logits
    
    • Global Average Pooling: Each of 512 channels → single value
    • Reduces overfitting vs fully-connected layers
    • Translation invariant features
  7. Initialization Strategy

    • Kaiming (He) Normal for Conv2d weights
      • Optimal for ReLU activations
      • std = sqrt(2 / fan_in)
    • Constant initialization for BatchNorm
      • weight = 1, bias = 0
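
To make the std = sqrt(2 / fan_in) rule concrete, here is a framework-free sketch that computes the standard deviation for two of the convolutions and draws one weight tensor as a flat list. The function names are assumptions, and fan_in is taken as in_channels × k × k.

```python
import math
import random


def kaiming_normal_std(in_channels, kernel_size):
    """std = sqrt(2 / fan_in), with fan_in = in_channels * k * k."""
    fan_in = in_channels * kernel_size * kernel_size
    return math.sqrt(2.0 / fan_in)


def init_conv_weight(out_channels, in_channels, kernel_size, rng):
    """Draw a flat list of conv weights from N(0, std^2)."""
    std = kaiming_normal_std(in_channels, kernel_size)
    n = out_channels * in_channels * kernel_size * kernel_size
    return [rng.gauss(0.0, std) for _ in range(n)]


rng = random.Random(0)                        # fixed seed for repeatability
stem_std = kaiming_normal_std(3, 3)           # stem conv: fan_in = 27
weights = init_conv_weight(64, 64, 3, rng)    # a 64→64 3×3 conv: fan_in = 576
empirical_std = math.sqrt(sum(w * w for w in weights) / len(weights))
```

Wider layers get a smaller standard deviation, which is intended to keep the variance of activations roughly constant through the ReLU layers.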

Architecture Flow Diagram

Input Image (32×32×3, RF=1×1)
    ↓
┌────────────────────────────────────────────────────────┐
│ STEM: Conv 3×3 → BN → ReLU                             │
│ Output: 32×32×64, RF=3×3                               │
└────────────────────────────────────────────────────────┘
    ↓
┌────────────────────────────────────────────────────────┐
│ STAGE 1: BasicBlock (64 channels, stride=1)            │
│   Conv 3×3 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU    │
│   Output: 32×32×64, RF=7×7                             │
│   Skip: Identity (no projection needed)                │
└────────────────────────────────────────────────────────┘
    ↓ [Spatial: 32×32, Channels: 64, RF: 7×7]
┌────────────────────────────────────────────────────────┐
│ STAGE 2: BasicBlock (128 channels, stride=2) ↓↓        │
│   Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU │
│   Output: 16×16×128, RF=13×13                          │
│   Skip: Conv 1×1,s2 (projection: 64→128)               │
└────────────────────────────────────────────────────────┘
    ↓ [Spatial: 16×16, Channels: 128, RF: 13×13]
┌────────────────────────────────────────────────────────┐
│ STAGE 3: BasicBlock (256 channels, stride=2) ↓↓        │
│   Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU │
│   Output: 8×8×256, RF=25×25                            │
│   Skip: Conv 1×1,s2 (projection: 128→256)              │
└────────────────────────────────────────────────────────┘
    ↓ [Spatial: 8×8, Channels: 256, RF: 25×25]
┌────────────────────────────────────────────────────────┐
│ STAGE 4: BasicBlock (512 channels, stride=2) ↓↓        │
│   Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU │
│   Output: 4×4×512, RF=49×49 (exceeds 32×32!)           │
│   Skip: Conv 1×1,s2 (projection: 256→512)              │
└────────────────────────────────────────────────────────┘
    ↓ [Spatial: 4×4, Channels: 512, RF: 49×49]
┌────────────────────────────────────────────────────────┐
│ HEAD: Global Average Pooling → FC                      │
│   AdaptiveAvgPool2d(1,1) → Flatten → Linear(512→100)   │
│   Output: 100 class logits                             │
└────────────────────────────────────────────────────────┘
    ↓
Predictions (100 classes)

Key Design Choices:

  • ✅ CIFAR-specific stem: 3×3 conv instead of 7×7 (ImageNet-style)
  • ✅ No aggressive downsampling: Preserves spatial information for 32×32 images
  • ✅ Lightweight: 1 block per stage instead of [3,4,6,3] for efficient training
  • ✅ Residual connections: Enable gradient flow for deeper networks
  • ✅ Global Average Pooling: Reduces overfitting vs fully-connected layers
  • ✅ Progressive RF growth: Each layer sees more context (7→13→25→49 pixels)

Training Hyperparameters

Epochs: 100
Batch Size: 512
Optimizer: SGD with Nesterov momentum
Momentum: 0.9
Weight Decay: 1e-4
Label Smoothing: 0.1
Mixed Precision: Enabled (AMP)
Gradient Clipping: 1.0

# OneCycle Learning Rate Schedule
LR Schedule: OneCycle (Custom)
  - Phase 1 (Epochs 0-40): 0.01 → 0.1 (warmup)
  - Phase 2 (Epochs 41-81): 0.1 → 0.01 (cooldown)
  - Phase 3 (Epochs 82-99): 0.01 → 0.001 (annihilation)
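
As an illustration of the Label Smoothing: 0.1 setting, the sketch below builds smoothed targets under one common convention, where the one-hot target is mixed with a uniform distribution over all classes. Whether the training script uses exactly this convention is an assumption, and the function names are hypothetical.

```python
import math


def smoothed_targets(true_class, num_classes, eps=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = eps / num_classes
    return [
        (1.0 - eps) + uniform if c == true_class else uniform
        for c in range(num_classes)
    ]


def cross_entropy(logits, targets):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, targets))


targets = smoothed_targets(true_class=7, num_classes=100, eps=0.1)
loss_uniform = cross_entropy([0.0] * 100, targets)  # an untrained, uniform prediction
```

With 100 classes and eps = 0.1, the true class gets a target of 0.901 and every other class 0.001, so the model is discouraged from driving its logits to extreme values.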

Data Augmentation

Using Albumentations library:

  • Training:

    • Random padding (32→36) + Random crop (36→32)
    • Horizontal flip (p=0.5)
    • ShiftScaleRotate (shift=0.05, scale=0.05, rotate=5°, p=0.3)
    • CoarseDropout/Cutout (16×16, p=0.4)
    • Color jitter (brightness, contrast, saturation, hue, p=0.4)
    • Normalization (CIFAR-100 mean/std)
  • Testing:

    • Normalization only
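
The first two training augmentations (pad to 36, random crop back to 32, then a random horizontal flip) can be sketched without any library. This is an illustration on a single-channel image stored as nested lists, not the project's Albumentations pipeline. Zero padding is an assumption, since the border mode is not stated here, and the function name is hypothetical.

```python
import random


def pad_crop_flip(img, pad=2, p_flip=0.5, rng=random):
    """Zero-pad an H×W image by `pad`, take a random H×W crop, then maybe flip it."""
    h, w = len(img), len(img[0])
    padded_w = w + 2 * pad
    padded = [[0] * padded_w for _ in range(pad)]                    # top border
    padded += [[0] * pad + list(row) + [0] * pad for row in img]     # left/right borders
    padded += [[0] * padded_w for _ in range(pad)]                   # bottom border
    top = rng.randint(0, 2 * pad)     # randint bounds are inclusive
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if rng.random() < p_flip:
        crop = [row[::-1] for row in crop]                           # horizontal flip
    return crop


# A 32×32 test image whose pixel values encode their position
img = [[r * 32 + c for c in range(32)] for r in range(32)]
augmented = pad_crop_flip(img, pad=2, p_flip=0.5, rng=random.Random(0))
unchanged = pad_crop_flip(img, pad=0, p_flip=0.0, rng=random.Random(0))
```

The crop shifts the image by up to 2 pixels in each direction, which gives the model small translations to learn from without changing the 32×32 input size.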

Training Results

Training Curves

[Figure: Training Curves]

The training curves show:

  • Steady convergence with minimal overfitting
  • Effective learning rate schedule with OneCycle policy
  • Generalization gap maintained below 5% throughout training
  • Final training accuracy: 80.47%

Learning Rate Schedule

[Figure: Learning Rate Schedule]

The OneCycle LR schedule implementation:

  1. Warmup Phase (41 epochs): Linear increase from 0.01 to 0.1
  2. Cooldown Phase (41 epochs): Linear decrease from 0.1 to 0.01
  3. Annihilation Phase (18 epochs): Linear decrease from 0.01 to 0.001

This schedule helps the model:

  • Escape local minima early in training
  • Find a wide minimum for better generalization
  • Fine-tune with very small learning rates at the end
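
The three phases can be written as a piecewise-linear function of the epoch index. The sketch below assumes the learning rate is updated once per epoch and interpolated linearly from the start of each phase. The function name and the phase table layout are illustrative, not taken from main.py.

```python
# (first epoch, number of epochs, starting LR, target LR) for each phase
PHASES = [
    (0, 41, 0.01, 0.1),      # warmup
    (41, 41, 0.1, 0.01),     # cooldown
    (82, 18, 0.01, 0.001),   # annihilation
]


def onecycle_lr(epoch):
    """Learning rate for a 0-based epoch index under the three-phase schedule."""
    for start, length, lr_from, lr_to in PHASES:
        if start <= epoch < start + length:
            frac = (epoch - start) / length   # 0 at the first epoch of the phase
            return lr_from + frac * (lr_to - lr_from)
    raise ValueError(f"epoch {epoch} is outside the 100-epoch schedule")


schedule = [onecycle_lr(e) for e in range(100)]
```

With this interpolation the last epoch (99) evaluates to 0.0015, which agrees with the final learning rate of 0.001500 reported in the training log.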

Per-Class Performance

[Figure: Class Metrics]

Top 5 Best Performing Classes:

  1. wardrobe - F1: 0.9458 (Precision: 0.9320, Recall: 0.9600)
  2. sunflower - F1: 0.9381 (Precision: 0.9681, Recall: 0.9100)
  3. poppy - F1: 0.9315 (Precision: 0.9444, Recall: 0.9189)
  4. can - F1: 0.9310 (Precision: 0.9000, Recall: 0.9643)
  5. skyscraper - F1: 0.9100 (Precision: 0.9100, Recall: 0.9100)

Most Challenging Classes:

  • boy - F1: 0.4286 (Fine-grained human features)
  • girl - F1: 0.4646 (Similar to boy)
  • baby - F1: 0.5079 (Fine-grained human features)
  • man - F1: 0.5758 (Similar to boy)
  • plate - F1: 0.5797 (Simple objects, easily confused)

The model performs exceptionally well on distinct objects (flowers, buildings, furniture) but struggles with fine-grained human categorization, which is expected for CIFAR-100's 32×32 resolution.
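
The per-class F1 scores follow from precision and recall as their harmonic mean. A small sketch, using the precision and recall values listed above as inputs (the function name is an assumption):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


wardrobe_f1 = f1_score(0.9320, 0.9600)
can_f1 = f1_score(0.9000, 0.9643)
skyscraper_f1 = f1_score(0.9100, 0.9100)
```

The results agree with the listed F1 values to within rounding, since the inputs are themselves rounded to four decimals. Because the harmonic mean is pulled toward the smaller of the two numbers, a class needs both high precision and high recall to score well.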

Model Architecture Summary

From model_cifar.py:

| Component | Specification |
|---|---|
| Model Name | ResNet34 (CIFAR-optimized) |
| Total Parameters | 4,949,412 (~5M) |
| Architecture Depth | 10 weight layers (1 initial + 8 residual + 1 FC) |
| Residual Blocks | 4 BasicBlocks (1 per stage) |
| Channel Progression | 3 → 64 → 128 → 256 → 512 → 100 |
| Spatial Downsampling | 32×32 → 16×16 → 8×8 → 4×4 → 1×1 |
| Receptive Field Growth | 1×1 → 3×3 → 7×7 → 13×13 → 25×25 → 49×49 |
| Skip Connections | 4 (1 identity + 3 projection shortcuts) |
| Pooling Strategy | Global Average Pooling (4×4 → 1×1) |
| Initialization | Kaiming Normal (He) for Conv, Constant for BN |
| Downsampling Method | Strided convolutions (no MaxPool) |

Why This Architecture Works for CIFAR-100:

  1. Right-sized capacity: 5M parameters balances expressiveness vs overfitting risk
  2. Preserved resolution: No aggressive downsampling maintains spatial detail in 32×32 images
  3. Sufficient receptive field: 49×49 RF exceeds input size (32×32), so the final features draw on nearly the whole image
  4. Progressive downsampling: 3 stride-2 ops (vs 1 MaxPool + 4 stride-2 in ImageNet ResNet)
  5. Residual learning: Skip connections enable gradient flow through 10 weight layers
  6. Efficient computation: Lightweight design trains in ~2-3 hours on single GPU
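
The spatial sizes quoted throughout follow from the usual convolution output-size formula, out = floor((in + 2p - k) / s) + 1. A minimal sketch, with the kernel, stride and padding values taken from the architecture description above and a hypothetical function name:

```python
def conv_out(size, kernel, stride, padding):
    """Output size along one spatial dimension of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# First 3×3 conv (padding 1) of the stem and of each stage, as (name, stride)
size = 32
sizes = [size]
for name, stride in [("conv1", 1), ("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]:
    size = conv_out(size, kernel=3, stride=stride, padding=1)
    sizes.append(size)

# The 1×1 stride-2 projection shortcut must produce the same size as the main path
shortcut_size = conv_out(32, kernel=1, stride=2, padding=0)
```

The trace is 32 → 32 → 32 → 16 → 8 → 4, and the shortcut of layer2 also yields 16, so the two branches can be added element-wise.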

Receptive Field Analysis:

  • By layer2 (16×16×128): RF = 13×13 → spans ~40% of the image width
  • By layer3 (8×8×256): RF = 25×25 → spans ~78% of the image width
  • By layer4 (4×4×512): RF = 49×49 → wider than the full image
  • Units in the final 4×4 feature map "see" all or nearly all of the input image, and global average pooling then combines every position

Project Structure

CIFAR100/
├── main.py                   # Main training script with OneCycle LR
├── model_cifar.py            # Custom ResNet architecture (5M params)
│   ├── BasicBlock            # 2-layer residual block with skip connection
│   └── ResNet34              # CIFAR-optimized ResNet variant
├── train.py                  # Training and evaluation loops
├── preprocess.py             # Data loading with Albumentations
├── visualization.py          # Metrics calculation and plotting
├── inference.py              # Model inference utilities
├── app.py                    # Gradio web interface for demo
├── run_complete_training.py  # Full training pipeline with logging
├── requirements.txt          # Python dependencies
├── log/                      # Training logs
│   └── training_complete_20251010-103227.log
└── plots_complete/           # Training visualizations
    ├── training_curves.png
    ├── learning_rate_schedule.png
    ├── class_metrics.png
    ├── confusion_matrix.png
    └── classification_report.txt

Quick Start

Installation

# Clone the repository
git clone <your-repo-url>
cd CIFAR100

# Install dependencies
pip install -r requirements.txt

Training

# Train with OneCycle LR for 100 epochs
python main.py \
    --scheduler onecycle \
    --epochs 100 \
    --batch_size 512 \
    --lr 0.1 \
    --momentum 0.9 \
    --weight_decay 1e-4 \
    --amp \
    --plot_training \
    --plot_evaluation

# Or use the complete training script with logging
python run_complete_training.py

Inference

# Run interactive web demo
python app.py

# Or use inference script
python inference.py --image path/to/image.jpg --model snapshots/best_model.pth

Key Features

1. OneCycle Learning Rate Policy

Implements a OneCycle LR schedule following the paper "Super-Convergence: Very Fast Training of Neural Networks". The policy is intended to provide:

  • Faster convergence
  • Better generalization
  • Higher final accuracy

2. Comprehensive Metrics Logging

After each training run, the script automatically outputs:

  • Training and test accuracy/loss curves
  • Top-1, Top-3, Top-5 accuracies
  • Precision, Recall, F1-Score (macro and weighted)
  • Per-class performance breakdown
  • Confusion matrix and classification report

3. Mixed Precision Training (AMP)

  • Can train roughly 2-3× faster on GPUs with hardware float16 support (the actual speedup depends on the GPU and the model)
  • Reduced memory usage
  • Maintains accuracy with float16/float32 mixed precision
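
One reason mixed precision needs care is that very small gradient values underflow to zero in float16. The sketch below shows the effect and the usual remedy, loss scaling, using only the standard library. The gradient value and the scale factor are illustrative choices, and this is not the project's AMP code.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # a small gradient value (illustrative)
scale = 65536.0      # loss scale, a hypothetical choice (2**16)

unscaled = to_float16(grad)            # too small for float16: becomes 0.0
scaled = to_float16(grad * scale)      # representable once scaled up
recovered = scaled / scale             # unscale in higher precision
```

Scaling the loss before the backward pass scales every gradient by the same factor, so small gradients survive the float16 round trip and can be unscaled before the optimizer step.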

4. Advanced Data Augmentation

Uses Albumentations for efficient augmentation:

  • Often faster than torchvision transforms for CPU-side augmentation
  • More augmentation options
  • Runs on the CPU inside the data loader, so it adds little overhead to GPU training

5. Model Checkpointing

  • Automatic snapshot saving at specified intervals
  • Best model tracking based on test accuracy
  • Resume training from any checkpoint

Detailed Training Log

Full training logs are available in log/training_complete_20251010-103227.log, including:

  • Per-epoch train/test loss and accuracy
  • Learning rate at each epoch
  • Final comprehensive evaluation with per-class metrics
  • Training time and resource utilization

Example final output:

```
TRAINING COMPLETED - FINAL EVALUATION

TRAINING SUMMARY
Total Epochs Trained: 100
Final Training Loss: 0.5584
Final Training Accuracy: 80.47%
Best Training Accuracy: 81.05% (Epoch 94)
Final Learning Rate: 0.001500

TEST/VALIDATION SUMMARY
Final Test Loss: 0.8985
Final Test Accuracy: 76.68%
Best Test Accuracy: 76.79% (Epoch 99)

COMPREHENSIVE TEST SET METRICS
Top-1 Accuracy (Test): 76.68%
Top-3 Accuracy (Test): 90.95%
Top-5 Accuracy (Test): 94.07%
```

## Requirements Met

✅ **Training from Scratch**: Custom ResNet (5M params) trained without pre-trained weights  
✅ **CIFAR-100 Dataset**: All 100 classes used (50,000 train / 10,000 test)  
✅ **Target Accuracy**: **76.68% achieved** (target: 73%) - **exceeded by 3.68 percentage points**  
✅ **Training Duration**: 100 epochs with OneCycle LR schedule  
✅ **Modern Tools**: Extensive use of ChatGPT/Cursor for development  
✅ **Comprehensive Evaluation**: Full metrics, plots, and detailed analysis  
✅ **Model Architecture**: Custom lightweight ResNet optimized for CIFAR-100  
✅ **Reproducibility**: Complete logs, checkpoints, and configuration documented  

## Technologies Used

- **PyTorch** - Deep learning framework
- **Albumentations** - Data augmentation
- **Gradio** - Web interface for inference
- **scikit-learn** - Metrics calculation
- **matplotlib/seaborn** - Visualization
- **numpy** - Numerical operations

## Model Comparison

| Model Variant | Parameters | Expected Accuracy | Notes |
|---------------|------------|-------------------|-------|
| **Our Model** (4 blocks) | **5M** | **76.68%** | Balanced efficiency & accuracy |
| Standard ResNet-18 | 11M | ~76-78% | Good baseline for CIFAR |
| Standard ResNet-34 | 21M | ~78-80% | More capacity, slower training |
| Wide-ResNet-28-10 | 36M | ~80-82% | State-of-art, requires more resources |
| PyramidNet | 26M | ~82-84% | Complex architecture |

**Our lightweight design achieves competitive accuracy with 2-4× fewer parameters than standard ResNets.**

## Future Improvements

Potential enhancements to reach higher accuracy (78%+):
1. **Architecture upgrades**: 
   - Increase blocks per stage: [2, 2, 2, 2] or [3, 3, 3, 3]
   - Try Wide-ResNet with wider channels
   - Add Squeeze-and-Excitation (SE) blocks
2. **Training tricks**: 
   - Mixup (α=0.2) for better generalization
   - CutMix for spatial regularization
   - AutoAugment or RandAugment policies
3. **Regularization**: 
   - Stochastic Depth (survival probability 0.8-0.9)
   - DropBlock for spatial dropout
   - Increased label smoothing (0.2)
4. **Ensemble methods**: 
   - Train 3-5 models with different seeds
   - Snapshot ensembles (save last N checkpoints)
5. **Longer training**: 
   - 200-300 epochs with cosine annealing
   - Multi-step or exponential LR decay
6. **Knowledge distillation**: 
   - Train larger teacher model first
   - Use soft targets for student training

## Technical Implementation Details

### Architecture Design Rationale

**Why a lightweight ResNet variant?**

1. **CIFAR-100 Image Size**: At 32×32 pixels, CIFAR images contain less spatial information than ImageNet (224×224)
   - Standard ResNet-34's [3,4,6,3] block structure is over-parameterized
   - Our [1,1,1,1] structure provides sufficient capacity without overfitting

2. **Parameter Efficiency**:
   - 5M parameters: Sweet spot between underfitting and overfitting
   - Faster training: 100 epochs in ~2-3 hours vs 5-6 hours for ResNet-34
   - Lower memory footprint: Can use larger batch sizes

3. **CIFAR-Specific Modifications**:
   - **3×3 initial conv** (vs 7×7): Preserves fine details in small images
   - **No MaxPool layer**: Maintains spatial resolution (32×32 → 4×4 over 4 stages)
   - **Stride-2 convolutions**: Gradual downsampling for feature hierarchy

### Code Reference

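Excerpt from `model_cifar.py`. The parts the excerpt does not show (`BasicBlock.__init__`, `_make_layer`, `ResNet34.forward`) are reconstructed here from the architecture description above and may differ from the actual file:
```python
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """2 conv layers + skip connection."""

    def __init__(self, in_ch, out_ch, stride=1):
        # Reconstructed from the architecture description (not in the excerpt)
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()  # identity skip
        if stride != 1 or in_ch != out_ch:  # 1×1 projection when shapes differ
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))  # Conv → BN → ReLU
        out = self.bn2(self.conv2(out))        # Conv → BN
        out = out + self.shortcut(x)           # Add skip (identity or projection)
        return F.relu(out)                     # ReLU


class ResNet34(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.in_channels = 64

        # CIFAR-specific: 3×3 conv, no maxpool
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)

        # 4 stages with 1 BasicBlock each
        self.layer1 = self._make_layer(64, 1)             # 32×32×64
        self.layer2 = self._make_layer(128, 1, stride=2)  # 16×16×128
        self.layer3 = self._make_layer(256, 1, stride=2)  # 8×8×256
        self.layer4 = self._make_layer(512, 1, stride=2)  # 4×4×512

        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, out_channels, num_blocks, stride=1):
        # Reconstructed helper: only the first block of a stage downsamples
        layers = []
        for s in [stride] + [1] * (num_blocks - 1):
            layers.append(BasicBlock(self.in_channels, out_channels, s))
            self.in_channels = out_channels
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        return self.fc(self.avgpool(x).flatten(1))
```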

References

Papers:

  • He et al., "Deep Residual Learning for Image Recognition" (2016) - ResNet architecture
  • Smith and Topin, "Super-Convergence: Very Fast Training of Neural Networks" (2018) - OneCycle LR
  • Krizhevsky, "Learning Multiple Layers of Features from Tiny Images" (2009) - CIFAR-100

Implementation Resources:

  • PyTorch official ResNet implementation
  • Albumentations library for efficient augmentation
  • torchvision.datasets for CIFAR-100 loading

License

MIT License

Acknowledgments

This project was developed with extensive assistance from:

  • ChatGPT for architecture design and debugging
  • Cursor AI for code completion and refactoring
  • PyTorch and torchvision communities for reference implementations

Note: Training logs, model checkpoints, and detailed per-class metrics are available in the log/ and plots_complete/ directories.