title: CIFAR-100 Image Classifier
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
CIFAR-100 ResNet Training from Scratch
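A lightweight ResNet (the `ResNet34` class in model_cifar.py, with one BasicBlock per stage and about 5M parameters) trained from scratch on the CIFAR-100 dataset, achieving 76.68% top-1 accuracy in 100 epochs with OneCycle learning rate scheduling.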
Project Overview
This project demonstrates training a ResNet architecture from scratch on the CIFAR-100 dataset without using any pre-trained models. The implementation leverages modern deep learning techniques including data augmentation, OneCycle LR scheduling, and mixed precision training.
Results Summary
Performance Metrics (100 Epochs)
| Metric | Score |
|---|---|
| Top-1 Accuracy | 76.68% ✓ (Target: 73%) |
| Top-3 Accuracy | 90.95% |
| Top-5 Accuracy | 94.07% |
| Best Test Accuracy | 76.79% (Epoch 99) |
| Macro F1-Score | 0.7670 |
| Weighted F1-Score | 0.7668 |
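Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. The sketch below shows that definition in plain Python; the function name `top_k_accuracy` and the toy scores are illustrative and are not taken from the project code.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k largest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 3 samples, 4 classes (illustrative values only)
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.2, 0.1, 0.5],
]
labels = [1, 2, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
print(f"top-1: {top1:.2%}, top-3: {top3:.2%}")
```

Only the first sample is right at k=1, while all three true labels fall inside the top three, which is the same pattern as the gap between the Top-1 and Top-3 rows in the table.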
Averaged Metrics
Macro-Averaged (unweighted):
- Precision: 0.7708
- Recall: 0.7668
- F1-Score: 0.7670
Weighted-Averaged (by class support):
- Precision: 0.7708
- Recall: 0.7668
- F1-Score: 0.7668
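The two averaging modes differ only in how per-class scores are combined. The sketch below assumes the usual scikit-learn-style definitions (macro is the plain mean of per-class F1, weighted uses class support as weights); the helper names and the three-class example are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_and_weighted_f1(per_class):
    """per_class: list of (precision, recall, support) tuples, one per class."""
    scores = [f1(p, r) for p, r, _ in per_class]
    supports = [s for _, _, s in per_class]
    macro = sum(scores) / len(scores)
    weighted = sum(f * s for f, s in zip(scores, supports)) / sum(supports)
    return macro, weighted


# Illustrative three-class example with unequal support
per_class = [(0.9, 0.8, 100), (0.5, 0.6, 50), (0.7, 0.7, 50)]
macro, weighted = macro_and_weighted_f1(per_class)
print(f"macro F1: {macro:.4f}, weighted F1: {weighted:.4f}")
```

Because macro F1 averages the per-class F1 values, it does not have to equal the harmonic mean of the macro precision and macro recall listed above.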
Training Configuration
Model Architecture
Custom Lightweight ResNet for CIFAR-100
A specially designed ResNet variant optimized for small image classification:
Model: ResNet34 (CIFAR-optimized)
Total Parameters: 4,949,412 (~5M)
Trainable Parameters: 4,949,412
Input Size: 32×32×3 (RGB)
Output Classes: 100
Architecture Details (from model_cifar.py):
Layer-by-Layer Feature Map Progression
| Layer | Operation | Kernel | Stride | Padding | Input Size | Output Size | Channels | Receptive Field |
|---|---|---|---|---|---|---|---|---|
| Input | - | - | - | - | 32×32 | 32×32 | 3 | 1×1 |
| conv1 | Conv2d | 3×3 | 1 | 1 | 32×32×3 | 32×32×64 | 64 | 3×3 |
| bn1+relu | BN+ReLU | - | - | - | 32×32×64 | 32×32×64 | 64 | 3×3 |
| layer1 | BasicBlock | 3×3,3×3 | 1,1 | 1,1 | 32×32×64 | 32×32×64 | 64 | 7×7 |
| layer2 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 32×32×64 | 16×16×128 | 128 | 15×15 |
| layer3 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 16×16×128 | 8×8×256 | 256 | 31×31 |
| layer4 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 8×8×256 | 4×4×512 | 512 | 63×63 |
| avgpool | AdaptiveAvgPool2d | 4×4 | - | - | 4×4×512 | 1×1×512 | 512 | Full image |
| fc | Linear | - | - | - | 512 | 100 | 100 | - |
Key Observations:
- Receptive field at layer4: 63×63 pixels (covers full 32×32 image with 2× margin)
- Spatial downsampling: 3 stride-2 operations reduce 32×32 → 4×4 (8× reduction)
- Channel expansion: 3 → 64 → 128 → 256 → 512 (progressive feature richness)
- Feature map efficiency: No information loss from MaxPooling (common in ImageNet models)
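The spatial sizes and receptive fields in the table can be checked with the standard convolution recurrences: output size `(n + 2p - k) // s + 1`, and receptive field growing by `(k - 1)` times the current input-pixel step. The sketch below traces only the main (non-shortcut) path; the function name `trace_main_path` and the layer labels are illustrative assumptions, not names from model_cifar.py.

```python
def trace_main_path(layers, size=32):
    """Track spatial size and receptive field through a chain of convolutions.

    layers: list of (name, kernel, stride, padding) in forward order.
    """
    rf, jump = 1, 1  # receptive field and input-pixel step between outputs
    trace = []
    for name, kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
        rf += (kernel - 1) * jump
        jump *= stride
        trace.append((name, size, rf))
    return trace


# Main path of the network described in the table above
layers = [
    ("conv1", 3, 1, 1),
    ("layer1.conv1", 3, 1, 1), ("layer1.conv2", 3, 1, 1),
    ("layer2.conv1", 3, 2, 1), ("layer2.conv2", 3, 1, 1),
    ("layer3.conv1", 3, 2, 1), ("layer3.conv2", 3, 1, 1),
    ("layer4.conv1", 3, 2, 1), ("layer4.conv2", 3, 1, 1),
]

for name, size, rf in trace_main_path(layers):
    print(f"{name:13s} {size:2d}x{size:<2d}  RF={rf}x{rf}")
```

Under this exact recurrence the stage outputs have receptive fields of 7, 13, 25 and 49 pixels, slightly below the doubling approximation (7, 15, 31, 63) quoted in the table. Either way, the layer4 receptive field is larger than the 32×32 input.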
Detailed Architecture Components
Initial Convolution Block
Input: 32×32×3 → Conv2d(3→64, k=3×3, s=1, p=1) → BN → ReLU → Output: 32×32×64
Receptive Field: 1×1 → 3×3
- CIFAR-optimized: 3×3 conv (not 7×7 like ImageNet ResNets)
- Preserves spatial resolution (no stride-2 or MaxPool)
- Captures fine-grained details essential for small images
Layer 1: Residual Stage 1 (64 channels, no downsampling)
Input: 32×32×64
BasicBlock:
├─ Conv(64→64, k=3×3, s=1, p=1) → BN → ReLU → 32×32×64
├─ Conv(64→64, k=3×3, s=1, p=1) → BN → 32×32×64
└─ Add(identity) → ReLU → Output: 32×32×64
Receptive Field: 3×3 → 7×7
- No spatial downsampling (stride=1)
- Identity skip connection (no projection needed)
- RF grows by 4 pixels (2 conv layers × 2 pixels each)
Layer 2: Residual Stage 2 (128 channels, downsample)
Input: 32×32×64
BasicBlock:
├─ Conv(64→128, k=3×3, s=2, p=1) → BN → ReLU → 16×16×128
├─ Conv(128→128, k=3×3, s=1, p=1) → BN → 16×16×128
├─ Skip: Conv(64→128, k=1×1, s=2) → BN → 16×16×128 (projection)
└─ Add(skip) → ReLU → Output: 16×16×128
Receptive Field: 7×7 → 15×15
- Spatial downsampling: 32×32 → 16×16 (stride=2 in first conv)
- Channel expansion: 64 → 128
- Projection shortcut: 1×1 conv matches dimensions
- RF roughly doubles because the stride-2 convolution doubles the input-pixel step between neighboring outputs
Layer 3: Residual Stage 3 (256 channels, downsample)
Input: 16×16×128
BasicBlock:
├─ Conv(128→256, k=3×3, s=2, p=1) → BN → ReLU → 8×8×256
├─ Conv(256→256, k=3×3, s=1, p=1) → BN → 8×8×256
├─ Skip: Conv(128→256, k=1×1, s=2) → BN → 8×8×256 (projection)
└─ Add(skip) → ReLU → Output: 8×8×256
Receptive Field: 15×15 → 31×31
- Spatial downsampling: 16×16 → 8×8
- Channel expansion: 128 → 256
- RF now covers most of the input image
Layer 4: Residual Stage 4 (512 channels, downsample)
Input: 8×8×256
BasicBlock:
├─ Conv(256→512, k=3×3, s=2, p=1) → BN → ReLU → 4×4×512
├─ Conv(512→512, k=3×3, s=1, p=1) → BN → 4×4×512
├─ Skip: Conv(256→512, k=1×1, s=2) → BN → 4×4×512 (projection)
└─ Add(skip) → ReLU → Output: 4×4×512
Receptive Field: 31×31 → 63×63
- Final spatial downsampling: 8×8 → 4×4
- Maximum channels: 512 (highest feature richness)
- RF exceeds input size: 63×63 > 32×32 (full image context)
Classification Head
Input: 4×4×512
├─ AdaptiveAvgPool2d((1,1)) → 1×1×512 (global spatial pooling)
├─ Flatten → 512
└─ Linear(512 → 100) → 100 class logits
- Global Average Pooling: Each of 512 channels → single value
- Reduces overfitting vs fully-connected layers
- Translation invariant features
Initialization Strategy
- Kaiming (He) Normal for Conv2d weights
  - Suited to ReLU activations
  - std = sqrt(2 / fan_in)
- Constant initialization for BatchNorm
  - weight = 1, bias = 0
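To make the `std = sqrt(2 / fan_in)` rule concrete, the sketch below evaluates it for the 3×3 convolutions on the main path, where `fan_in` is the input channel count times the kernel area. The helper name `kaiming_normal_std` is hypothetical, and the fan-in convention follows the formula stated above.

```python
import math


def kaiming_normal_std(in_channels, kernel_size):
    """Standard deviation sqrt(2 / fan_in) for a square Conv2d kernel."""
    fan_in = in_channels * kernel_size * kernel_size
    return math.sqrt(2.0 / fan_in)


# Main-path 3×3 convolutions, keyed by their input channel count
for in_channels in (3, 64, 128, 256, 512):
    print(in_channels, round(kaiming_normal_std(in_channels, 3), 4))
```

Wider layers receive proportionally smaller initial weights, which keeps the variance of ReLU activations roughly constant from layer to layer.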
Architecture Flow Diagram
Input Image (32×32×3, RF=1×1)
↓
┌─────────────────────────────────────────────────────────┐
│ STEM: Conv 3×3 → BN → ReLU                              │
│ Output: 32×32×64, RF=3×3                                │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ STAGE 1: BasicBlock (64 channels, stride=1)             │
│ Conv 3×3 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU       │
│ Output: 32×32×64, RF=7×7                                │
│ Skip: Identity (no projection needed)                   │
└─────────────────────────────────────────────────────────┘
↓ [Spatial: 32×32, Channels: 64, RF: 7×7]
┌─────────────────────────────────────────────────────────┐
│ STAGE 2: BasicBlock (128 channels, stride=2)            │
│ Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU    │
│ Output: 16×16×128, RF=15×15                             │
│ Skip: Conv 1×1,s2 (projection: 64→128)                  │
└─────────────────────────────────────────────────────────┘
↓ [Spatial: 16×16, Channels: 128, RF: 15×15]
┌─────────────────────────────────────────────────────────┐
│ STAGE 3: BasicBlock (256 channels, stride=2)            │
│ Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU    │
│ Output: 8×8×256, RF=31×31                               │
│ Skip: Conv 1×1,s2 (projection: 128→256)                 │
└─────────────────────────────────────────────────────────┘
↓ [Spatial: 8×8, Channels: 256, RF: 31×31]
┌─────────────────────────────────────────────────────────┐
│ STAGE 4: BasicBlock (512 channels, stride=2)            │
│ Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU    │
│ Output: 4×4×512, RF=63×63 (exceeds 32×32!)              │
│ Skip: Conv 1×1,s2 (projection: 256→512)                 │
└─────────────────────────────────────────────────────────┘
↓ [Spatial: 4×4, Channels: 512, RF: Full Image]
┌─────────────────────────────────────────────────────────┐
│ HEAD: Global Average Pooling → FC                       │
│ AdaptiveAvgPool2d(1,1) → Flatten → Linear(512→100)      │
│ Output: 100 class logits                                │
└─────────────────────────────────────────────────────────┘
↓
Predictions (100 classes)
Key Design Choices:
- ✓ CIFAR-specific stem: 3×3 conv instead of 7×7 (ImageNet-style)
- ✓ No aggressive downsampling: Preserves spatial information for 32×32 images
- ✓ Lightweight: 1 block per stage instead of [3,4,6,3] for efficient training
- ✓ Residual connections: Enable gradient flow for deeper networks
- ✓ Global Average Pooling: Reduces overfitting vs fully-connected layers
- ✓ Progressive RF growth: Each layer sees more context (7→15→31→63 pixels)
Training Hyperparameters
Epochs: 100
Batch Size: 512
Optimizer: SGD with Nesterov momentum
Momentum: 0.9
Weight Decay: 1e-4
Label Smoothing: 0.1
Mixed Precision: Enabled (AMP)
Gradient Clipping: 1.0
# OneCycle Learning Rate Schedule
LR Schedule: OneCycle (Custom)
- Phase 1 (Epochs 0-40): 0.01 → 0.1 (warmup)
- Phase 2 (Epochs 41-81): 0.1 → 0.01 (cooldown)
- Phase 3 (Epochs 82-99): 0.01 → 0.001 (annihilation)
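The label smoothing value of 0.1 listed among the hyperparameters changes the training target from a one-hot vector to a slightly softened distribution. The sketch below uses the convention of `torch.nn.CrossEntropyLoss(label_smoothing=...)`, where the smoothing mass is spread uniformly over all K classes; whether the project uses exactly that loss is an assumption, and the function name is hypothetical.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    The true class gets probability (1 - smoothing) + smoothing / K and every
    other class gets smoothing / K, where K is the number of classes.
    """
    k = len(logits)
    # Numerically stable log-softmax
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        q = smoothing / k + (1.0 - smoothing if i == target else 0.0)
        loss -= q * log_p
    return loss


logits = [2.0, 1.0, 0.1]  # illustrative values only
print(smoothed_cross_entropy(logits, target=0, smoothing=0.0))
print(smoothed_cross_entropy(logits, target=0, smoothing=0.1))
```

With smoothing enabled, a confident correct prediction still pays a small penalty, which discourages the network from pushing logits to extremes.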
Data Augmentation
Using Albumentations library:
Training:
- Random padding (32→36) + Random crop (36→32)
- Horizontal flip (p=0.5)
- ShiftScaleRotate (shift=0.05, scale=0.05, rotate=5°, p=0.3)
- CoarseDropout/Cutout (16×16, p=0.4)
- Color jitter (brightness, contrast, saturation, hue, p=0.4)
- Normalization (CIFAR-100 mean/std)
Testing:
- Normalization only
Training Results
Training Curves
The training curves show:
- Steady convergence with minimal overfitting
- Effective learning rate schedule with OneCycle policy
- Generalization gap maintained below 5% throughout training
- Final training accuracy: 80.47%
Learning Rate Schedule
The OneCycle LR schedule implementation:
- Warmup Phase (41 epochs): Linear increase from 0.01 to 0.1
- Cooldown Phase (41 epochs): Linear decrease from 0.1 to 0.01
- Annihilation Phase (18 epochs): Linear decrease from 0.01 to 0.001
This schedule helps the model:
- Escape local minima early in training
- Find a wide minimum for better generalization
- Fine-tune with very small learning rates at the end
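The three phases can be written as one piecewise-linear function of the epoch index. The sketch below assumes the learning rate is updated once per epoch and that each phase reaches its end value at the first epoch of the next phase; under that assumption epoch 99 gets 0.0015, which matches the "Final Learning Rate: 0.001500" line in the training log. The function name `one_cycle_lr` and its parameter names are hypothetical.

```python
def one_cycle_lr(epoch, base_lr=0.01, max_lr=0.1, final_lr=0.001,
                 warmup_epochs=41, cooldown_epochs=41, total_epochs=100):
    """Piecewise-linear learning rate for a zero-based epoch index."""
    if epoch < warmup_epochs:
        # Warmup: base_lr -> max_lr
        return base_lr + (max_lr - base_lr) * epoch / warmup_epochs
    if epoch < warmup_epochs + cooldown_epochs:
        # Cooldown: max_lr -> base_lr
        step = epoch - warmup_epochs
        return max_lr - (max_lr - base_lr) * step / cooldown_epochs
    # Annihilation: base_lr -> final_lr
    anneal_epochs = total_epochs - warmup_epochs - cooldown_epochs
    step = epoch - warmup_epochs - cooldown_epochs
    return base_lr - (base_lr - final_lr) * step / anneal_epochs


schedule = [one_cycle_lr(e) for e in range(100)]
for e in (0, 41, 82, 99):
    print(e, round(schedule[e], 6))
```

A per-batch variant would interpolate on the global step instead of the epoch index; the shape of the curve stays the same.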
Per-Class Performance
Top 5 Best Performing Classes:
- wardrobe - F1: 0.9458 (Precision: 0.9320, Recall: 0.9600)
- sunflower - F1: 0.9381 (Precision: 0.9681, Recall: 0.9100)
- poppy - F1: 0.9315 (Precision: 0.9444, Recall: 0.9189)
- can - F1: 0.9310 (Precision: 0.9000, Recall: 0.9643)
- skyscraper - F1: 0.9100 (Precision: 0.9100, Recall: 0.9100)
Most Challenging Classes:
- boy - F1: 0.4286 (Fine-grained human features)
- girl - F1: 0.4646 (Similar to boy)
- baby - F1: 0.5079 (Fine-grained human features)
- man - F1: 0.5758 (Similar to boy)
- plate - F1: 0.5797 (Simple objects, easily confused)
The model performs exceptionally well on distinct objects (flowers, buildings, furniture) but struggles with fine-grained human categorization, which is expected for CIFAR-100's 32×32 resolution.
Model Architecture Summary
From model_cifar.py:
| Component | Specification |
|---|---|
| Model Name | ResNet34 (CIFAR-optimized) |
| Total Parameters | 4,949,412 (~5M) |
| Architecture Depth | 10 weight layers (1 initial + 8 residual + 1 FC) |
| Residual Blocks | 4 BasicBlocks (1 per stage) |
| Channel Progression | 3 → 64 → 128 → 256 → 512 → 100 |
| Spatial Downsampling | 32×32 → 16×16 → 8×8 → 4×4 → 1×1 |
| Receptive Field Growth | 1×1 → 3×3 → 7×7 → 15×15 → 31×31 → 63×63 |
| Skip Connections | 4 (1 identity + 3 projection shortcuts) |
| Pooling Strategy | Global Average Pooling (4×4 → 1×1) |
| Initialization | Kaiming Normal (He) for Conv, Constant for BN |
| Downsampling Method | Strided convolutions (no MaxPool) |
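The parameter total in the table can be reproduced by hand from the layer shapes. The sketch below assumes bias-free convolutions, BatchNorm with a learnable scale and shift, a 1×1 conv + BN projection in the three downsampling stages, and a biased final linear layer; the helper names are hypothetical.

```python
def conv_params(c_in, c_out, kernel):
    """Weights of a bias-free Conv2d."""
    return c_in * c_out * kernel * kernel


def bn_params(channels):
    """Learnable scale and shift of BatchNorm2d."""
    return 2 * channels


def basic_block_params(c_in, c_out, stride):
    total = conv_params(c_in, c_out, 3) + bn_params(c_out)
    total += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if stride != 1 or c_in != c_out:
        # Projection shortcut: 1×1 conv + BN
        total += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return total


def resnet_params(num_classes=100):
    total = conv_params(3, 64, 3) + bn_params(64)  # stem
    c_in = 64
    for c_out, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        total += basic_block_params(c_in, c_out, stride)
        c_in = c_out
    total += 512 * num_classes + num_classes  # Linear weights + bias
    return total


print(f"{resnet_params():,}")  # 4,949,412
```

About three quarters of the parameters sit in the 512-channel final stage, so widening or deepening that stage dominates any change in model size.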
Why This Architecture Works for CIFAR-100:
- Right-sized capacity: 5M parameters balance expressiveness against overfitting risk
- Preserved resolution: No aggressive downsampling maintains spatial detail in 32×32 images
- Optimal receptive field: 63×63 RF exceeds input size (32×32), capturing full image context
- Progressive downsampling: 3 stride-2 ops (vs 1 MaxPool + 4 stride-2 convolutions in ImageNet ResNet)
- Residual learning: Skip connections enable gradient flow through 10 weight layers
- Efficient computation: Lightweight design trains in ~2-3 hours on a single GPU
Receptive Field Analysis:
- By layer2 (16×16×128): RF = 15×15 → covers ~50% of image
- By layer3 (8×8×256): RF = 31×31 → covers ~95% of image
- By layer4 (4×4×512): RF = 63×63 → covers full image + context
- Each neuron in final feature map can "see" the entire input image
Project Structure
CIFAR100/
├── main.py                    # Main training script with OneCycle LR
├── model_cifar.py             # Custom ResNet architecture (5M params)
│   ├── BasicBlock             # 2-layer residual block with skip connection
│   └── ResNet34               # CIFAR-optimized ResNet variant
├── train.py                   # Training and evaluation loops
├── preprocess.py              # Data loading with Albumentations
├── visualization.py           # Metrics calculation and plotting
├── inference.py               # Model inference utilities
├── app.py                     # Gradio web interface for demo
├── run_complete_training.py   # Full training pipeline with logging
├── requirements.txt           # Python dependencies
├── log/                       # Training logs
│   └── training_complete_20251010-103227.log
└── plots_complete/            # Training visualizations
    ├── training_curves.png
    ├── learning_rate_schedule.png
    ├── class_metrics.png
    ├── confusion_matrix.png
    └── classification_report.txt
Quick Start
Installation
# Clone the repository
git clone <your-repo-url>
cd CIFAR100
# Install dependencies
pip install -r requirements.txt
Training
# Train with OneCycle LR for 100 epochs
python main.py \
--scheduler onecycle \
--epochs 100 \
--batch_size 512 \
--lr 0.1 \
--momentum 0.9 \
--weight_decay 1e-4 \
--amp \
--plot_training \
--plot_evaluation
# Or use the complete training script with logging
python run_complete_training.py
Inference
# Run interactive web demo
python app.py
# Or use inference script
python inference.py --image path/to/image.jpg --model snapshots/best_model.pth
Key Features
1. OneCycle Learning Rate Policy
Implements the OneCycle LR schedule from "Super-Convergence: Very Fast Training of Neural Networks" paper:
- Achieves faster convergence
- Better generalization
- Higher final accuracy
2. Comprehensive Metrics Logging
After each training run, the script automatically outputs:
- Training and test accuracy/loss curves
- Top-1, Top-3, Top-5 accuracies
- Precision, Recall, F1-Score (macro and weighted)
- Per-class performance breakdown
- Confusion matrix and classification report
3. Mixed Precision Training (AMP)
- 2-3x faster training on modern GPUs
- Reduced memory usage
- Maintains accuracy with float16/float32 mixed precision
4. Advanced Data Augmentation
Uses Albumentations for efficient augmentation:
- Faster than torchvision transforms
- More augmentation options
- CPU-based (operates on NumPy arrays), so augmentation can run in data-loader workers in parallel with GPU training
5. Model Checkpointing
- Automatic snapshot saving at specified intervals
- Best model tracking based on test accuracy
- Resume training from any checkpoint
Detailed Training Log
Full training logs are available in log/training_complete_20251010-103227.log, including:
- Per-epoch train/test loss and accuracy
- Learning rate at each epoch
- Final comprehensive evaluation with per-class metrics
- Training time and resource utilization
Example final output:
```
TRAINING COMPLETED - FINAL EVALUATION

TRAINING SUMMARY
Total Epochs Trained: 100
Final Training Loss: 0.5584
Final Training Accuracy: 80.47%
Best Training Accuracy: 81.05% (Epoch 94)
Final Learning Rate: 0.001500

TEST/VALIDATION SUMMARY
Final Test Loss: 0.8985
Final Test Accuracy: 76.68%
Best Test Accuracy: 76.79% (Epoch 99)

COMPREHENSIVE TEST SET METRICS
Top-1 Accuracy (Test): 76.68%
Top-3 Accuracy (Test): 90.95%
Top-5 Accuracy (Test): 94.07%
```
## Requirements Met
- ✅ **Training from Scratch**: Custom ResNet (5M params) trained without pre-trained weights
- ✅ **CIFAR-100 Dataset**: All 100 classes used (50,000 train / 10,000 test)
- ✅ **Target Accuracy**: **76.68% achieved** (target: 73%) - **Exceeded by 3.68 percentage points**
- ✅ **Training Duration**: 100 epochs with OneCycle LR schedule
- ✅ **Modern Tools**: Extensive use of ChatGPT/Cursor for development
- ✅ **Comprehensive Evaluation**: Full metrics, plots, and detailed analysis
- ✅ **Model Architecture**: Custom lightweight ResNet optimized for CIFAR-100
- ✅ **Reproducibility**: Complete logs, checkpoints, and configuration documented
## Technologies Used
- **PyTorch** - Deep learning framework
- **Albumentations** - Data augmentation
- **Gradio** - Web interface for inference
- **scikit-learn** - Metrics calculation
- **matplotlib/seaborn** - Visualization
- **numpy** - Numerical operations
## Model Comparison
| Model Variant | Parameters | Expected Accuracy | Notes |
|---------------|------------|-------------------|-------|
| **Our Model** (4 blocks) | **5M** | **76.68%** | Balanced efficiency & accuracy |
| Standard ResNet-18 | 11M | ~76-78% | Good baseline for CIFAR |
| Standard ResNet-34 | 21M | ~78-80% | More capacity, slower training |
| Wide-ResNet-28-10 | 36M | ~80-82% | State-of-art, requires more resources |
| PyramidNet | 26M | ~82-84% | Complex architecture |
**Our lightweight design achieves competitive accuracy with 2-4× fewer parameters than standard ResNets.**
## Future Improvements
Potential enhancements to reach higher accuracy (78%+):
1. **Architecture upgrades**:
- Increase blocks per stage: [2, 2, 2, 2] or [3, 3, 3, 3]
- Try Wide-ResNet with wider channels
- Add Squeeze-and-Excitation (SE) blocks
2. **Training tricks**:
- Mixup (α=0.2) for better generalization
- CutMix for spatial regularization
- AutoAugment or RandAugment policies
3. **Regularization**:
- Stochastic Depth (survival probability 0.8-0.9)
- DropBlock for spatial dropout
- Increased label smoothing (0.2)
4. **Ensemble methods**:
- Train 3-5 models with different seeds
- Snapshot ensembles (save last N checkpoints)
5. **Longer training**:
- 200-300 epochs with cosine annealing
- Multi-step or exponential LR decay
6. **Knowledge distillation**:
- Train larger teacher model first
- Use soft targets for student training
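Of the training tricks listed above, Mixup is compact enough to sketch: two samples and their one-hot labels are blended with a weight drawn from Beta(α, α). The plain-Python version below works on flattened lists for illustration only; the function names are hypothetical, and a real training loop would apply the same blend to whole batches of tensors.

```python
import random


def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two samples and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


# Flattened toy "images" with 3 classes (illustrative values only)
x, y = mixup_pair([1.0, 0.0], [1, 0, 0], [0.0, 1.0], [0, 0, 1], lam=0.75)
print(x, y)
```

With α=0.2 most draws land close to 0 or 1, so the majority of mixed samples stay near one of the two originals and only a few are strongly blended.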
## Technical Implementation Details
### Architecture Design Rationale
**Why a lightweight ResNet variant?**
1. **CIFAR-100 Image Size**: At 32×32 pixels, CIFAR images contain less spatial information than ImageNet (224×224)
- Standard ResNet-34's [3,4,6,3] block structure is over-parameterized
- Our [1,1,1,1] structure provides sufficient capacity without overfitting
2. **Parameter Efficiency**:
- 5M parameters: Sweet spot between underfitting and overfitting
- Faster training: 100 epochs in ~2-3 hours vs 5-6 hours for ResNet-34
- Lower memory footprint: Can use larger batch sizes
3. **CIFAR-Specific Modifications**:
- **3×3 initial conv** (vs 7×7): Preserves fine details in small images
- **No MaxPool layer**: Maintains spatial resolution (32×32 → 4×4 over 4 stages)
- **Stride-2 convolutions**: Gradual downsampling for feature hierarchy
### Code Reference
From `model_cifar.py`:
```python
import torch.nn as nn
import torch.nn.functional as F


class ResNet34(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.in_channels = 64
        # CIFAR-specific: 3×3 conv, no maxpool
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        # 4 stages with 1 BasicBlock each
        self.layer1 = self._make_layer(64, 1)             # 32×32×64
        self.layer2 = self._make_layer(128, 1, stride=2)  # 16×16×128
        self.layer3 = self._make_layer(256, 1, stride=2)  # 8×8×256
        self.layer4 = self._make_layer(512, 1, stride=2)  # 4×4×512
        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)
    # Excerpt: _make_layer() and forward() are omitted here
```
BasicBlock (2 conv layers + skip connection):
```python
class BasicBlock(nn.Module):
    # Excerpt: __init__ (conv1, bn1, conv2, bn2) is omitted here
    def forward(self, x):
        identity = x  # stride-2 stages use the 1×1 projection of x instead
        out = F.relu(self.bn1(self.conv1(x)))  # Conv → BN → ReLU
        out = self.bn2(self.conv2(out))        # Conv → BN
        out += identity                        # Add skip connection
        out = F.relu(out)                      # ReLU
        return out
```
References
Papers:
- He et al., "Deep Residual Learning for Image Recognition" (2016) - ResNet architecture
- Smith, "Super-Convergence: Very Fast Training of Neural Networks" (2018) - OneCycle LR
- Krizhevsky, "Learning Multiple Layers of Features from Tiny Images" (2009) - CIFAR-100
Implementation Resources:
- PyTorch official ResNet implementation
- Albumentations library for efficient augmentation
- torchvision.datasets for CIFAR-100 loading
License
MIT License
Acknowledgments
This project was developed with extensive assistance from:
- ChatGPT for architecture design and debugging
- Cursor AI for code completion and refactoring
- PyTorch and torchvision communities for reference implementations
Note: Training logs, model checkpoints, and detailed per-class metrics are available in the log/ and plots_complete/ directories.


