	---
	title: CIFAR-100 Image Classifier
	emoji: 🎯
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: "5.49.1"
	app_file: app.py
	pinned: false
	license: mit
	---

	# CIFAR-100 ResNet Training from Scratch

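	A lightweight CIFAR-style ResNet (about 5M parameters, one BasicBlock per stage; the class is named `ResNet34` in the code) trained from scratch on the CIFAR-100 dataset, reaching 76.68% top-1 accuracy in 100 epochs with OneCycle learning-rate scheduling.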

	## Project Overview

	This project demonstrates training a ResNet architecture from scratch on the CIFAR-100 dataset without using any pre-trained models. The implementation leverages modern deep learning techniques including data augmentation, OneCycle LR scheduling, and mixed precision training.

	## Results Summary

	### Performance Metrics (100 Epochs)

	| Metric | Score |
	|--------|-------|
	| Top-1 Accuracy | 76.68% ✅ (Target: 73%) |
	| Top-3 Accuracy | 90.95% |
	| Top-5 Accuracy | 94.07% |
	| Best Test Accuracy | 76.79% (Epoch 99) |
	| Macro F1-Score | 0.7670 |
	| Weighted F1-Score | 0.7668 |
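
Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes, which is why Top-1 ≤ Top-3 ≤ Top-5. A minimal, dependency-free sketch of the idea follows; the function name `topk_accuracy` and the toy scores are illustrative and are not taken from the project's `visualization.py`.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k largest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)


# Toy example: 3 samples, 4 classes (hypothetical logits).
scores = [
    [0.1, 2.0, 0.3, 0.5],  # true class 1 is ranked first
    [1.5, 0.2, 1.0, 0.1],  # true class 2 is ranked second
    [0.9, 0.8, 0.1, 0.7],  # true class 2 is ranked last
]
labels = [1, 2, 2]

top1 = topk_accuracy(scores, labels, k=1)  # only the first sample counts
top3 = topk_accuracy(scores, labels, k=3)  # the first two samples count
```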

	### Averaged Metrics

	Macro-Averaged (unweighted):
	- Precision: 0.7708
	- Recall: 0.7668
	- F1-Score: 0.7670

	Weighted-Averaged (by class support):
	- Precision: 0.7708
	- Recall: 0.7668
	- F1-Score: 0.7668
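
The two averages differ only in how per-class scores are combined: macro averaging gives every class the same weight, while weighted averaging weights each class by its number of test samples. The sketch below uses hypothetical per-class F1 values and supports (not the project's data) to show that the two coincide when all classes have equal support.

```python
def macro_and_weighted(per_class_values, supports):
    """Return (macro, weighted) averages of a per-class metric."""
    macro = sum(per_class_values) / len(per_class_values)
    total = sum(supports)
    weighted = sum(v * n for v, n in zip(per_class_values, supports)) / total
    return macro, weighted


# Hypothetical per-class F1 scores for three classes.
f1 = [0.90, 0.60, 0.75]

balanced = macro_and_weighted(f1, [100, 100, 100])  # equal support per class
imbalanced = macro_and_weighted(f1, [100, 50, 50])  # first class dominates
```

With equal supports both averages are 0.75; with the imbalanced supports the weighted average moves toward the score of the largest class.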

	## Training Configuration

	### Model Architecture

	Custom Lightweight ResNet for CIFAR-100

	A specially designed ResNet variant optimized for small image classification:

	```
	Model: ResNet34 (CIFAR-optimized)
	Total Parameters: 4,949,412 (~5M)
	Trainable Parameters: 4,949,412
	Input Size: 32×32×3 (RGB)
	Output Classes: 100
	```
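
The parameter total can be checked by hand. The sketch below assumes bias-free 3×3 convolutions, BatchNorm after every convolution, and a 1×1 conv + BatchNorm projection on each downsampling shortcut; these are assumptions about `model_cifar.py`, chosen because they reproduce the reported total exactly. The helper names are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Weights of a k×k convolution without bias."""
    return c_in * c_out * k * k


def bn_params(channels):
    """BatchNorm2d has one learnable scale and one shift per channel."""
    return 2 * channels


def basic_block_params(c_in, c_out, stride):
    n = conv_params(c_in, c_out, 3) + bn_params(c_out)    # first conv + BN
    n += conv_params(c_out, c_out, 3) + bn_params(c_out)  # second conv + BN
    if stride != 1 or c_in != c_out:
        # 1×1 projection shortcut (conv + BN) on downsampling blocks
        n += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return n


stem = conv_params(3, 64, 3) + bn_params(64)
stages = [(64, 64, 1), (64, 128, 2), (128, 256, 2), (256, 512, 2)]
blocks = sum(basic_block_params(*stage) for stage in stages)
fc = 512 * 100 + 100  # Linear(512 → 100) with bias
total = stem + blocks + fc
print(f"{total:,}")  # 4,949,412
```

Roughly three quarters of the parameters sit in the 512-channel stage, which is typical for this family of networks.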

	Architecture Details (from `model_cifar.py`):

	### Layer-by-Layer Feature Map Progression

	| Layer | Operation | Kernel | Stride | Padding | Input Size | Output Size | Channels | Receptive Field |
	|-------|-----------|--------|--------|---------|------------|-------------|----------|-----------------|
	| Input | - | - | - | - | 32×32 | 32×32 | 3 | 1×1 |
	| conv1 | Conv2d | 3×3 | 1 | 1 | 32×32×3 | 32×32×64 | 64 | 3×3 |
	| bn1+relu | BN+ReLU | - | - | - | 32×32×64 | 32×32×64 | 64 | 3×3 |
	| layer1 | BasicBlock | 3×3,3×3 | 1,1 | 1,1 | 32×32×64 | 32×32×64 | 64 | 7×7 |
	| layer2 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 32×32×64 | 16×16×128 | 128 | 13×13 |
	| layer3 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 16×16×128 | 8×8×256 | 256 | 25×25 |
	| layer4 | BasicBlock | 3×3,3×3 | 2,1 | 1,1 | 8×8×256 | 4×4×512 | 512 | 49×49 |
	| avgpool | AdaptiveAvgPool2d | 4×4 | - | - | 4×4×512 | 1×1×512 | 512 | Full image |
	| fc | Linear | - | - | - | 512 | 100 | 100 | - |

	Key Observations:
	- Receptive field at layer4: 49×49 pixels (larger than the 32×32 input)
	- Spatial downsampling: 3 stride-2 operations reduce 32×32 → 4×4 (8× reduction per side)
	- Channel expansion: 3 → 64 → 128 → 256 → 512 (progressively richer features)
	- No MaxPool layer (standard in ImageNet ResNet stems), so early feature maps keep the full 32×32 resolution
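
Feature-map sizes and theoretical receptive fields can be checked with the standard recurrences `size_out = floor((size_in + 2p - k) / s) + 1`, `rf_out = rf_in + (k - 1) * jump_in` and `jump_out = jump_in * s`, where `jump` is the spacing in input pixels between neighbouring units. The sketch below follows only the 3×3 main path of each block; the helper `trace` and the dotted layer names are illustrative.

```python
def trace(layers, size=32):
    """Track spatial size, jump (input-pixel spacing) and receptive field."""
    rf, jump = 1, 1
    rows = []
    for name, k, s, p in layers:
        size = (size + 2 * p - k) // s + 1  # output spatial size
        rf += (k - 1) * jump                # RF grows by (k - 1) × current jump
        jump *= s                           # a stride widens the spacing
        rows.append((name, size, rf))
    return rows


# (name, kernel, stride, padding) for each convolution on the main path
layers = [("conv1", 3, 1, 1)]
for stage, stride in [("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]:
    layers += [(f"{stage}.conv1", 3, stride, 1), (f"{stage}.conv2", 3, 1, 1)]

rows = trace(layers)
for name, size, rf in rows:
    print(f"{name:13s} {size:2d}×{size:<2d}  RF={rf}")
```

With these recurrences the stage outputs come out at 7, 13, 25 and 49 pixels, and the spatial size goes 32 → 16 → 8 → 4.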

	### Detailed Architecture Components

	1. Initial Convolution Block
	```
	Input: 32×32×3 → Conv2d(3→64, k=3×3, s=1, p=1) → BN → ReLU → Output: 32×32×64
	Receptive Field: 1×1 → 3×3
	```
	- CIFAR-optimized: 3×3 conv (not 7×7 like ImageNet ResNets)
	- Preserves spatial resolution (no stride-2 or MaxPool)
	- Captures fine-grained details essential for small images

	2. Layer 1: Residual Stage 1 (64 channels, no downsampling)
	```
	Input: 32×32×64
	BasicBlock:
	├─ Conv(64→64, k=3×3, s=1, p=1) → BN → ReLU → 32×32×64
	├─ Conv(64→64, k=3×3, s=1, p=1) → BN → 32×32×64
	└─ Add(identity) → ReLU → Output: 32×32×64
	Receptive Field: 3×3 → 7×7
	```
	- No spatial downsampling (stride=1)
	- Identity skip connection (no projection needed)
	- RF grows by 4 pixels (2 conv layers × 2 pixels each)

	3. Layer 2: Residual Stage 2 (128 channels, downsample)
	```
	Input: 32×32×64
	BasicBlock:
	├─ Conv(64→128, k=3×3, s=2, p=1) → BN → ReLU → 16×16×128
	├─ Conv(128→128, k=3×3, s=1, p=1) → BN → 16×16×128
	├─ Skip: Conv(64→128, k=1×1, s=2) → BN → 16×16×128 (projection)
	└─ Add(skip) → ReLU → Output: 16×16×128
	Receptive Field: 7×7 → 13×13
	```
	- Spatial downsampling: 32×32 → 16×16 (stride=2 in first conv)
	- Channel expansion: 64 → 128
	- Projection shortcut: 1×1 conv matches dimensions
	- RF grows by 6 pixels: +2 from the stride-2 conv, then +4 from the second conv, whose inputs are spaced 2 pixels apart

	4. Layer 3: Residual Stage 3 (256 channels, downsample)
	```
	Input: 16×16×128
	BasicBlock:
	├─ Conv(128→256, k=3×3, s=2, p=1) → BN → ReLU → 8×8×256
	├─ Conv(256→256, k=3×3, s=1, p=1) → BN → 8×8×256
	├─ Skip: Conv(128→256, k=1×1, s=2) → BN → 8×8×256 (projection)
	└─ Add(skip) → ReLU → Output: 8×8×256
	Receptive Field: 13×13 → 25×25
	```
	- Spatial downsampling: 16×16 → 8×8
	- Channel expansion: 128 → 256
	- RF now spans 25 of the 32 input pixels per side

	5. Layer 4: Residual Stage 4 (512 channels, downsample)
	```
	Input: 8×8×256
	BasicBlock:
	├─ Conv(256→512, k=3×3, s=2, p=1) → BN → ReLU → 4×4×512
	├─ Conv(512→512, k=3×3, s=1, p=1) → BN → 4×4×512
	├─ Skip: Conv(256→512, k=1×1, s=2) → BN → 4×4×512 (projection)
	└─ Add(skip) → ReLU → Output: 4×4×512
	Receptive Field: 25×25 → 49×49
	```
	- Final spatial downsampling: 8×8 → 4×4
	- Maximum channels: 512 (highest feature richness)
	- RF exceeds input size: 49×49 > 32×32

	6. Classification Head
	```
	Input: 4×4×512
	├─ AdaptiveAvgPool2d((1,1)) → 1×1×512 (global spatial pooling)
	├─ Flatten → 512
	└─ Linear(512 → 100) → 100 class logits
	```
	- Global Average Pooling: Each of 512 channels → single value
	- Reduces overfitting vs fully-connected layers
	- Translation invariant features

	7. Initialization Strategy
	- Kaiming (He) Normal for Conv2d weights
	  - Optimal for ReLU activations
	  - `std = sqrt(2 / fan_in)`
	- Constant initialization for BatchNorm
	  - weight = 1, bias = 0
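
For a convolution, `fan_in` is the number of inputs feeding one output unit, i.e. `in_channels × kernel_height × kernel_width`. The short sketch below evaluates the formula above for two layers of this network; the helper name is illustrative.

```python
import math


def kaiming_normal_std(in_channels, kernel_size):
    """std = sqrt(2 / fan_in), with fan_in = in_channels · k · k."""
    fan_in = in_channels * kernel_size * kernel_size
    return math.sqrt(2.0 / fan_in)


stem_std = kaiming_normal_std(3, 3)    # conv1: fan_in = 27
deep_std = kaiming_normal_std(512, 3)  # second conv of layer4: fan_in = 4608
print(f"{stem_std:.4f} {deep_std:.4f}")  # 0.2722 0.0208
```

Wider layers therefore start with smaller weights, which keeps the variance of activations roughly constant from layer to layer.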

	### Architecture Flow Diagram

	```
	Input Image (32×32×3, RF=1×1)
	↓
	┌─────────────────────────────────────────────────────────┐
	│ STEM: Conv 3×3 → BN → ReLU                              │
	│ Output: 32×32×64, RF=3×3                                │
	└─────────────────────────────────────────────────────────┘
	↓
	┌─────────────────────────────────────────────────────────┐
	│ STAGE 1: BasicBlock (64 channels, stride=1)             │
	│ Conv 3×3 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU       │
	│ Output: 32×32×64, RF=7×7                                │
	│ Skip: Identity (no projection needed)                   │
	└─────────────────────────────────────────────────────────┘
	↓ [Spatial: 32×32, Channels: 64, RF: 7×7]
	┌─────────────────────────────────────────────────────────┐
	│ STAGE 2: BasicBlock (128 channels, stride=2) ↓↓         │
	│ Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU    │
	│ Output: 16×16×128, RF=13×13                             │
	│ Skip: Conv 1×1,s2 (projection: 64→128)                  │
	└─────────────────────────────────────────────────────────┘
	↓ [Spatial: 16×16, Channels: 128, RF: 13×13]
	┌─────────────────────────────────────────────────────────┐
	│ STAGE 3: BasicBlock (256 channels, stride=2) ↓↓         │
	│ Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU    │
	│ Output: 8×8×256, RF=25×25                               │
	│ Skip: Conv 1×1,s2 (projection: 128→256)                 │
	└─────────────────────────────────────────────────────────┘
	↓ [Spatial: 8×8, Channels: 256, RF: 25×25]
	┌─────────────────────────────────────────────────────────┐
	│ STAGE 4: BasicBlock (512 channels, stride=2) ↓↓         │
	│ Conv 3×3,s2 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU    │
	│ Output: 4×4×512, RF=49×49 (exceeds 32×32!)              │
	│ Skip: Conv 1×1,s2 (projection: 256→512)                 │
	└─────────────────────────────────────────────────────────┘
	↓ [Spatial: 4×4, Channels: 512, RF: 49×49]
	┌─────────────────────────────────────────────────────────┐
	│ HEAD: Global Average Pooling → FC                       │
	│ AdaptiveAvgPool2d(1,1) → Flatten → Linear(512→100)      │
	│ Output: 100 class logits                                │
	└─────────────────────────────────────────────────────────┘
	↓
	Predictions (100 classes)
	```

	Key Design Choices:
	- ✅ CIFAR-specific stem: 3×3 conv instead of 7×7 (ImageNet-style)
	- ✅ No aggressive downsampling: Preserves spatial information for 32×32 images
	- ✅ Lightweight: 1 block per stage instead of [3,4,6,3] for efficient training
	- ✅ Residual connections: Enable gradient flow for deeper networks
	- ✅ Global Average Pooling: Reduces overfitting vs fully-connected layers
	- ✅ Progressive RF growth: Each stage sees more context (7→13→25→49 pixels)

	### Training Hyperparameters
	```
	Epochs: 100
	Batch Size: 512
	Optimizer: SGD with Nesterov momentum
	Momentum: 0.9
	Weight Decay: 1e-4
	Label Smoothing: 0.1
	Mixed Precision: Enabled (AMP)
	Gradient Clipping: 1.0

	# OneCycle Learning Rate Schedule
	LR Schedule: OneCycle (Custom)
	- Phase 1 (Epochs 0-40): 0.01 → 0.1 (warmup)
	- Phase 2 (Epochs 41-81): 0.1 → 0.01 (cooldown)
	- Phase 3 (Epochs 82-99): 0.01 → 0.001 (annihilation)
	```
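
The three-phase schedule can be written as a small piecewise-linear function. This sketch assumes that each phase interpolates linearly over its own length and reaches its end value only at the first epoch of the next phase; that is an assumption about `main.py`, chosen because it reproduces the `Final Learning Rate: 0.001500` line in the training log. The function name is illustrative.

```python
def onecycle_lr(epoch):
    """Piecewise-linear learning rate for a 0-indexed epoch in [0, 99]."""
    # (first_epoch, n_epochs, lr_start, lr_end) for each phase
    phases = [
        (0, 41, 0.01, 0.1),     # warmup
        (41, 41, 0.1, 0.01),    # cooldown
        (82, 18, 0.01, 0.001),  # annihilation
    ]
    for first, length, lr_start, lr_end in phases:
        if first <= epoch < first + length:
            frac = (epoch - first) / length  # 0 at phase start, below 1 at its last epoch
            return lr_start + frac * (lr_end - lr_start)
    raise ValueError(f"epoch {epoch} is outside the 100-epoch schedule")


schedule = [onecycle_lr(e) for e in range(100)]
print(f"{schedule[0]:.4f} {max(schedule):.4f} {schedule[99]:.6f}")  # 0.0100 0.1000 0.001500
```

In a training loop, the value for the current epoch would be written into each `param_group["lr"]` of the optimizer before the epoch starts.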

	### Data Augmentation
	Using Albumentations library:
	- Training:
	  - Random padding (32→36) + Random crop (36→32)
	  - Horizontal flip (p=0.5)
	  - ShiftScaleRotate (shift=0.05, scale=0.05, rotate=5°, p=0.3)
	  - CoarseDropout/Cutout (16×16, p=0.4)
	  - Color jitter (brightness, contrast, saturation, hue, p=0.4)
	  - Normalization (CIFAR-100 mean/std)
	- Testing:
	  - Normalization only

	## Training Results

	### Training Curves

	![Training Curves](plots_complete/training_curves.png)

	The training curves show:
	- Steady convergence with minimal overfitting
	- Effective learning rate schedule with OneCycle policy
	- Generalization gap maintained below 5% throughout training
	- Final training accuracy: 80.47%

	### Learning Rate Schedule

	![Learning Rate Schedule](plots_complete/learning_rate_schedule.png)

	The OneCycle LR schedule implementation:
	1. Warmup Phase (41 epochs): Linear increase from 0.01 to 0.1
	2. Cooldown Phase (41 epochs): Linear decrease from 0.1 to 0.01
	3. Annihilation Phase (18 epochs): Linear decrease from 0.01 to 0.001

	This schedule helps the model:
	- Escape local minima early in training
	- Find a wide minimum for better generalization
	- Fine-tune with very small learning rates at the end

	### Per-Class Performance

	![Class Metrics](plots_complete/class_metrics.png)

	Top 5 Best Performing Classes:
	1. wardrobe - F1: 0.9458 (Precision: 0.9320, Recall: 0.9600)
	2. sunflower - F1: 0.9381 (Precision: 0.9681, Recall: 0.9100)
	3. poppy - F1: 0.9315 (Precision: 0.9444, Recall: 0.9189)
	4. can - F1: 0.9310 (Precision: 0.9000, Recall: 0.9643)
	5. skyscraper - F1: 0.9100 (Precision: 0.9100, Recall: 0.9100)

	Most Challenging Classes:
	- boy - F1: 0.4286 (Fine-grained human features)
	- girl - F1: 0.4646 (Similar to boy)
	- baby - F1: 0.5079 (Fine-grained human features)
	- man - F1: 0.5758 (Similar to boy)
	- plate - F1: 0.5797 (Simple objects, easily confused)

	The model performs exceptionally well on distinct objects (flowers, buildings, furniture) but struggles with fine-grained human categorization, which is expected for CIFAR-100's 32×32 resolution.
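
The per-class F1 scores above are the harmonic mean of precision and recall. As a quick check, the sketch below recomputes two of the listed values from their rounded precision and recall; because the inputs are rounded to four decimals, the last digit can differ for other classes.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


wardrobe = f1_score(0.9320, 0.9600)
skyscraper = f1_score(0.9100, 0.9100)
print(f"{wardrobe:.4f} {skyscraper:.4f}")  # 0.9458 0.9100
```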

	## Model Architecture Summary

	From `model_cifar.py`:

	| Component | Specification |
	|-----------|---------------|
	| Model Name | ResNet34 (CIFAR-optimized) |
	| Total Parameters | 4,949,412 (~5M) |
	| Architecture Depth | 10 weight layers (1 initial + 8 residual + 1 FC) |
	| Residual Blocks | 4 BasicBlocks (1 per stage) |
	| Channel Progression | 3 → 64 → 128 → 256 → 512 → 100 |
	| Spatial Downsampling | 32×32 → 16×16 → 8×8 → 4×4 → 1×1 |
	| Receptive Field Growth | 1×1 → 3×3 → 7×7 → 13×13 → 25×25 → 49×49 |
	| Skip Connections | 4 (1 identity + 3 projection shortcuts) |
	| Pooling Strategy | Global Average Pooling (4×4 → 1×1) |
	| Initialization | Kaiming Normal (He) for Conv, Constant for BN |
	| Downsampling Method | Strided convolutions (no MaxPool) |

	Why This Architecture Works for CIFAR-100:

	1. Right-sized capacity: 5M parameters balances expressiveness vs overfitting risk
	2. Preserved resolution: No aggressive downsampling maintains spatial detail in 32×32 images
	3. Large receptive field: the 49×49 RF at layer4 exceeds the 32×32 input
	4. Progressive downsampling: 3 stride-2 ops (vs 1 MaxPool + 4 stride-2 convs in ImageNet ResNet)
	5. Residual learning: Skip connections enable gradient flow through 10 weight layers
	6. Efficient computation: Lightweight design trains in ~2-3 hours on a single GPU

	Receptive Field Analysis:
	- By layer2 (16×16×128): RF = 13×13 → spans ~40% of the image side
	- By layer3 (8×8×256): RF = 25×25 → spans ~78% of the image side
	- By layer4 (4×4×512): RF = 49×49 → larger than the image
	- Global average pooling then combines all 4×4 positions, so the classifier sees the entire input image

	## Project Structure

	```
	CIFAR100/
	├── main.py                     # Main training script with OneCycle LR
	├── model_cifar.py              # Custom ResNet architecture (5M params)
	│   ├── BasicBlock              # 2-layer residual block with skip connection
	│   └── ResNet34                # CIFAR-optimized ResNet variant
	├── train.py                    # Training and evaluation loops
	├── preprocess.py               # Data loading with Albumentations
	├── visualization.py            # Metrics calculation and plotting
	├── inference.py                # Model inference utilities
	├── app.py                      # Gradio web interface for demo
	├── run_complete_training.py    # Full training pipeline with logging
	├── requirements.txt            # Python dependencies
	├── log/                        # Training logs
	│   └── training_complete_20251010-103227.log
	└── plots_complete/             # Training visualizations
	    ├── training_curves.png
	    ├── learning_rate_schedule.png
	    ├── class_metrics.png
	    ├── confusion_matrix.png
	    └── classification_report.txt
	```

	## Quick Start

	### Installation

	```bash
	# Clone the repository
	git clone <your-repo-url>
	cd CIFAR100

	# Install dependencies
	pip install -r requirements.txt
	```

	### Training

	```bash
	# Train with OneCycle LR for 100 epochs
	python main.py \
	--scheduler onecycle \
	--epochs 100 \
	--batch_size 512 \
	--lr 0.1 \
	--momentum 0.9 \
	--weight_decay 1e-4 \
	--amp \
	--plot_training \
	--plot_evaluation

	# Or use the complete training script with logging
	python run_complete_training.py
	```

	### Inference

	```bash
	# Run interactive web demo
	python app.py

	# Or use inference script
	python inference.py --image path/to/image.jpg --model snapshots/best_model.pth
	```

	## Key Features

	### 1. OneCycle Learning Rate Policy
	Implements a OneCycle-style LR schedule based on the paper "Super-Convergence: Very Fast Training of Neural Networks". The paper reports that this kind of schedule can give:
	- Faster convergence
	- Better generalization
	- Higher final accuracy

	### 2. Comprehensive Metrics Logging
	After each training run, the script automatically outputs:
	- Training and test accuracy/loss curves
	- Top-1, Top-3, Top-5 accuracies
	- Precision, Recall, F1-Score (macro and weighted)
	- Per-class performance breakdown
	- Confusion matrix and classification report

	### 3. Mixed Precision Training (AMP)
	- Up to 2-3× faster training on modern GPUs
	- Reduced memory usage
	- Maintains accuracy with float16/float32 mixed precision
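
A single training step with AMP, label smoothing, Nesterov SGD and gradient clipping could look roughly like the sketch below. It is not the project's `train.py`: the linear stand-in model, the `train_step` name and the choice to clip by gradient norm (rather than by value) are assumptions. Mixed precision is only switched on when a CUDA device is present, so the same code also runs on CPU in full precision.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # float16 autocast is only enabled on GPU here

# Stand-in model so the sketch is self-contained (not the project's ResNet).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100)).to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(images, targets):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = criterion(model(images), targets)  # forward pass, mixed precision on GPU
    scaler.scale(loss).backward()                 # scale the loss to avoid fp16 underflow
    scaler.unscale_(optimizer)                    # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                        # skips the update if gradients are inf/NaN
    scaler.update()
    return loss.item()


images = torch.randn(8, 3, 32, 32, device=device)
targets = torch.randint(0, 100, (8,), device=device)
loss_value = train_step(images, targets)
```

Unscaling before clipping matters: without it, the clipping threshold would be compared against gradients that are still multiplied by the loss scale.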

	### 4. Advanced Data Augmentation
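	Uses Albumentations for efficient augmentation:
	- Typically faster than torchvision's PIL-based transforms
	- More augmentation options
	- Runs on the CPU (NumPy/OpenCV) and can run inside the data-loading workers, so it adds little overhead to GPU training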

	### 5. Model Checkpointing
	- Automatic snapshot saving at specified intervals
	- Best model tracking based on test accuracy
	- Resume training from any checkpoint

	## Detailed Training Log

	Full training logs are available in `log/training_complete_20251010-103227.log`, including:
	- Per-epoch train/test loss and accuracy
	- Learning rate at each epoch
	- Final comprehensive evaluation with per-class metrics
	- Training time and resource utilization

	Example final output:
	```
	======================================================================
	TRAINING COMPLETED - FINAL EVALUATION
	======================================================================

	TRAINING SUMMARY
	----------------------------------------------------------------------
	Total Epochs Trained: 100
	Final Training Loss: 0.5584
	Final Training Accuracy: 80.47%
	Best Training Accuracy: 81.05% (Epoch 94)
	Final Learning Rate: 0.001500

	TEST/VALIDATION SUMMARY
	----------------------------------------------------------------------
	Final Test Loss: 0.8985
	Final Test Accuracy: 76.68%
	Best Test Accuracy: 76.79% (Epoch 99)

	COMPREHENSIVE TEST SET METRICS
	----------------------------------------------------------------------
	Top-1 Accuracy (Test): 76.68%
	Top-3 Accuracy (Test): 90.95%
	Top-5 Accuracy (Test): 94.07%
	```

	## Requirements Met

	- ✅ Training from Scratch: Custom ResNet (5M params) trained without pre-trained weights
	- ✅ CIFAR-100 Dataset: All 100 classes used (50,000 train / 10,000 test)
	- ✅ Target Accuracy: 76.68% achieved (target: 73%), exceeding the target by 3.68 percentage points
	- ✅ Training Duration: 100 epochs with OneCycle LR schedule
	- ✅ Modern Tools: Extensive use of ChatGPT/Cursor for development
	- ✅ Comprehensive Evaluation: Full metrics, plots, and detailed analysis
	- ✅ Model Architecture: Custom lightweight ResNet optimized for CIFAR-100
	- ✅ Reproducibility: Complete logs, checkpoints, and configuration documented

	## Technologies Used

	- PyTorch - Deep learning framework
	- Albumentations - Data augmentation
	- Gradio - Web interface for inference
	- scikit-learn - Metrics calculation
	- matplotlib/seaborn - Visualization
	- numpy - Numerical operations

	## Model Comparison

	| Model Variant | Parameters | Expected Accuracy | Notes |
	|---------------|------------|-------------------|-------|
	| Our Model (4 blocks) | 5M | 76.68% (measured) | Balanced efficiency & accuracy |
	| Standard ResNet-18 | 11M | ~76-78% | Good baseline for CIFAR |
	| Standard ResNet-34 | 21M | ~78-80% | More capacity, slower training |
	| Wide-ResNet-28-10 | 36M | ~80-82% | High accuracy, requires more resources |
	| PyramidNet | 26M | ~82-84% | Complex architecture |

	Our lightweight design achieves competitive accuracy with 2-4× fewer parameters than standard ResNets.

	## Future Improvements

	Potential enhancements to reach higher accuracy (78%+):
	1. Architecture upgrades:
	   - Increase blocks per stage: [2, 2, 2, 2] or [3, 3, 3, 3]
	   - Try Wide-ResNet with wider channels
	   - Add Squeeze-and-Excitation (SE) blocks
	2. Training tricks:
	   - Mixup (α=0.2) for better generalization
	   - CutMix for spatial regularization
	   - AutoAugment or RandAugment policies
	3. Regularization:
	   - Stochastic Depth (survival probability 0.8-0.9)
	   - DropBlock for spatial dropout
	   - Increased label smoothing (0.2)
	4. Ensemble methods:
	   - Train 3-5 models with different seeds
	   - Snapshot ensembles (save last N checkpoints)
	5. Longer training:
	   - 200-300 epochs with cosine annealing
	   - Multi-step or exponential LR decay
	6. Knowledge distillation:
	   - Train larger teacher model first
	   - Use soft targets for student training

	## Technical Implementation Details

	### Architecture Design Rationale

	Why a lightweight ResNet variant?

	1. CIFAR-100 Image Size: At 32×32 pixels, CIFAR images contain less spatial information than ImageNet (224×224)
	   - Standard ResNet-34's [3,4,6,3] block structure is over-parameterized
	   - Our [1,1,1,1] structure provides sufficient capacity without overfitting

	2. Parameter Efficiency:
	   - 5M parameters: Sweet spot between underfitting and overfitting
	   - Faster training: 100 epochs in ~2-3 hours vs 5-6 hours for ResNet-34
	   - Lower memory footprint: Can use larger batch sizes

	3. CIFAR-Specific Modifications:
	   - 3×3 initial conv (vs 7×7): Preserves fine details in small images
	   - No MaxPool layer: Maintains spatial resolution (32×32 → 4×4 over 4 stages)
	   - Stride-2 convolutions: Gradual downsampling for feature hierarchy

	### Code Reference

	From `model_cifar.py`:
	```python
	import torch.nn as nn

	class ResNet34(nn.Module):
	    def __init__(self, num_classes=100):
	        super().__init__()
	        self.in_channels = 64

	        # CIFAR-specific: 3×3 conv, no maxpool
	        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
	        self.bn1 = nn.BatchNorm2d(64)
	        self.relu = nn.ReLU(inplace=True)

	        # 4 stages with 1 BasicBlock each
	        self.layer1 = self._make_layer(64, 1)             # 32×32×64
	        self.layer2 = self._make_layer(128, 1, stride=2)  # 16×16×128
	        self.layer3 = self._make_layer(256, 1, stride=2)  # 8×8×256
	        self.layer4 = self._make_layer(512, 1, stride=2)  # 4×4×512

	        # Classification head
	        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
	        self.fc = nn.Linear(512, num_classes)

	    # _make_layer() and forward() are omitted from this excerpt
	```

	BasicBlock (2 conv layers + skip connection):
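	```python
	import torch.nn as nn
	import torch.nn.functional as F

	class BasicBlock(nn.Module):
	    # __init__ (conv1, bn1, conv2, bn2) is omitted from this excerpt
	    def forward(self, x):
	        identity = x  # stride-2 blocks use the 1×1 projection of x here instead
	        out = F.relu(self.bn1(self.conv1(x)))  # Conv → BN → ReLU
	        out = self.bn2(self.conv2(out))        # Conv → BN
	        out += identity                        # Add skip connection
	        out = F.relu(out)                      # ReLU
	        return out
	```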
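
Since the excerpts above leave out parts of the model, here is a self-contained sketch of an equivalent network that can be instantiated and run. It is a reconstruction from the tables in this README rather than the contents of `model_cifar.py`: the class name `TinyResNet`, the `shortcut` attribute and the `nn.Sequential` layout are assumptions. The shortcut design (1×1 conv + BatchNorm, no conv biases) is chosen because it reproduces the reported parameter count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.shortcut = nn.Sequential()  # identity when shapes already match
        if stride != 1 or in_channels != out_channels:
            # Projection shortcut: 1×1 conv + BN to match spatial size and channels
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))


class TinyResNet(nn.Module):  # hypothetical name; the project calls its class ResNet34
    def __init__(self, num_classes=100):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, 1, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU(inplace=True)
        )
        self.stages = nn.Sequential(
            BasicBlock(64, 64, 1), BasicBlock(64, 128, 2),
            BasicBlock(128, 256, 2), BasicBlock(256, 512, 2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_in", nonlinearity="relu")
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.stages(self.stem(x))
        return self.fc(torch.flatten(self.avgpool(x), 1))


model = TinyResNet()
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.randn(2, 3, 32, 32))
print(f"{n_params:,}", tuple(logits.shape))  # 4,949,412 (2, 100)
```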

	## References

	Papers:
	- He et al., "Deep Residual Learning for Image Recognition" (2016) - ResNet architecture
	- Smith, "Super-Convergence: Very Fast Training of Neural Networks" (2018) - OneCycle LR
	- Krizhevsky, "Learning Multiple Layers of Features from Tiny Images" (2009) - CIFAR-100

	Implementation Resources:
	- PyTorch official ResNet implementation
	- Albumentations library for efficient augmentation
	- torchvision.datasets for CIFAR-100 loading

	## License

	MIT License

	## Acknowledgments

	This project was developed with extensive assistance from:
	- ChatGPT for architecture design and debugging
	- Cursor AI for code completion and refactoring
	- PyTorch and torchvision communities for reference implementations

	---

	Note: Training logs, model checkpoints, and detailed per-class metrics are available in the `log/` and `plots_complete/` directories.