
Assignment Image: Vision Transformer Documentation

1) Overview

This documentation explains two scripts:

  • assignment_image/code/c1.py: end-to-end training pipeline for a custom Vision Transformer (ViT) on CIFAR-10.
  • assignment_image/code/c1_test.py: evaluation and analysis pipeline for saved checkpoints, with an optional transfer-learning experiment using a pre-trained torchvision ViT.

Together, these scripts cover:

  1. Data preprocessing and DataLoader creation
  2. ViT architecture definition
  3. Training, validation, checkpointing, and early stopping
  4. Final test evaluation
  5. Error analysis (per-class accuracy + confusion patterns + misclassified images)

2) Project Organization

Logical separation in the codebase:

  • Data preprocessing
    • get_cifar10_dataloaders() in c1.py
    • get_imagenet_style_cifar10_dataloaders() in c1_test.py (for pre-trained ViT)
  • Model architecture
    • PatchifyEmbedding, TransformerEncoderBlock, ViTEncoder, ViTClassifier in c1.py
  • Training loop
    • train_one_epoch(), train_model() in c1.py
  • Evaluation
    • evaluate() in c1.py
    • evaluate_model() in c1_test.py
  • Error analysis and visualization
    • collect_misclassified(), visualize_misclassified() in c1.py
    • collect_predictions(), build_confusion_matrix(), format_error_analysis() in c1_test.py

3) c1.py (Training Script) Documentation

Purpose

c1.py trains a custom ViT classifier on CIFAR-10 and saves:

  • best checkpoint by validation accuracy: vit_cifar10_best.pt
  • final checkpoint after training ends: vit_cifar10_last.pt
  • optional misclassification visualization image

Data Pipeline

get_cifar10_dataloaders() performs:

  • resize CIFAR-10 images to image_size x image_size (default 64x64)
  • convert to tensor ([0, 255] -> [0, 1])
  • normalize channels from [0,1] to [-1,1] using mean/std (0.5, 0.5, 0.5)
  • split official training set into train/validation by val_ratio
  • build train/val/test DataLoaders with configurable batch size and workers

Model Architecture

The custom ViT follows standard encoder-style design:

  1. Patchify + Projection
    PatchifyEmbedding creates non-overlapping patches and projects each patch to embed_dim.

  2. Token + Position Encoding
    ViTEncoder prepends a learnable CLS token and adds learnable positional embeddings.

  3. Transformer Blocks
    TransformerEncoderBlock applies:

    • LayerNorm -> Multi-Head Self-Attention -> Residual
    • LayerNorm -> MLP (GELU + Dropout) -> Residual

  4. Classification Head
    ViTClassifier extracts CLS representation and maps it to 10 class logits.
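Step 1 is commonly implemented as a strided convolution. The following is a minimal sketch of a PatchifyEmbedding-style module under that assumption (the script's actual implementation may differ in detail):

```python
import torch
import torch.nn as nn

class PatchifyEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one to
    embed_dim. A Conv2d with kernel_size == stride == patch_size performs
    the patchify and the linear projection in a single operation."""
    def __init__(self, in_channels=3, patch_size=4, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # (B, 3, H, W)
        x = self.proj(x)                      # (B, embed_dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

# A 64x64 input with 4x4 patches yields 256 tokens of dimension 256.
tokens = PatchifyEmbedding()(torch.randn(2, 3, 64, 64))
```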

Training and Validation

train_model() uses:

  • loss: CrossEntropyLoss
  • optimizer: AdamW
  • scheduler: StepLR(step_size=5, gamma=0.5)
  • early stopping: stop when validation accuracy does not improve for early_stopping_patience epochs
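The early-stopping rule can be isolated as a small helper. This is illustrative only; train_model() interleaves the same logic with the actual training and validation passes:

```python
def early_stop_epoch(val_accuracies, patience=5):
    """Return the 0-based epoch at which training would stop under the
    rule: halt once validation accuracy has not improved for `patience`
    consecutive epochs. Returns the last epoch if the budget runs out."""
    best, stale = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_accuracies) - 1
```

For example, a run whose validation accuracy peaks at epoch 1 and then plateaus stops at epoch 6, i.e. five stale epochs after the peak.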

Main Outputs

During training:

  • epoch-wise train/validation loss, accuracy, and learning rate logs
  • checkpoint files saved in save_dir

After training:

  • final validation summary
  • test loss/accuracy using best checkpoint
  • optional plot of misclassified examples
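Checkpoint handling can look like the following sketch. The key names in the saved dict are assumptions for illustration, not necessarily the script's exact format:

```python
import torch

def save_checkpoint(model, path, epoch, val_acc):
    # Store weights plus minimal metadata; key names are illustrative.
    torch.save({"model_state_dict": model.state_dict(),
                "epoch": epoch,
                "val_acc": val_acc}, path)

def load_checkpoint(model, path):
    # Restore weights into an already-constructed model of the same shape.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    return ckpt
```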

4) c1_test.py (Evaluation + Analysis Script) Documentation

Purpose

c1_test.py is a separate script for:

  • loading a trained checkpoint
  • evaluating on test data
  • generating error analysis reports
  • optionally running transfer learning with pre-trained ViT-B/16

Baseline Evaluation Flow

  1. Load checkpoint with load_model_from_checkpoint()
  2. Recreate the test DataLoader with the same preprocessing used during training
  3. Run evaluate_model() for test loss and accuracy
  4. Collect predictions via collect_predictions()
  5. Generate:
    • per-class accuracy
    • top confusion pairs (true -> predicted)
  6. Save analysis text report and misclassified image grid
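Steps 4-5 reduce to simple counting once predictions are collected. A minimal sketch follows; the function name and return shape are illustrative, not the exact interface of collect_predictions() or build_confusion_matrix():

```python
from collections import Counter

def error_analysis(y_true, y_pred, top_k=3):
    """Per-class accuracy plus the most frequent (true -> predicted)
    confusion pairs, as reported in the baseline analysis."""
    correct, total, confusions = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
        else:
            confusions[(t, p)] += 1
    per_class = {c: correct[c] / total[c] for c in total}
    return per_class, confusions.most_common(top_k)
```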

Optional Transfer-Learning Experiment

When --run-pretrained-experiment is enabled:

  • build pre-trained vit_b_16 from torchvision
  • replace classification head for 10 CIFAR-10 classes
  • preprocess data with ImageNet normalization and 224x224 resize
  • fine-tune with fine_tune_pretrained()
  • evaluate and save separate analysis artifacts

Baseline vs Pre-trained Comparison (Recorded Result)

From results/comparison_report.txt:

  Model                              Test Loss   Test Accuracy
  ---------------------------------  ----------  -------------
  Baseline ViT (custom checkpoint)   0.8916      68.57%
  Pre-trained ViT-B/16               0.1495      95.15%

Key comparison metrics:

  • Accuracy gain (pre-trained - baseline): +26.58 percentage points
  • Loss delta (pre-trained - baseline): -0.7420

Interpretation: transfer learning with pre-trained ViT-B/16 provides a large performance improvement over the baseline custom-trained ViT in this run.


5) Hyperparameters and Their Significance

Core model hyperparameters (c1.py)

  • image_size=64
    Upscales CIFAR-10 images from 32x32 to 64x64, yielding more patch tokens for a richer tokenization.

  • patch_size=4
    Each patch covers 4x4 pixels; the number of patches per image becomes (64/4)^2 = 256.

  • embed_dim=256
    Dimensionality of token embeddings; larger values increase representation capacity and compute cost.

  • depth=6
    Number of transformer encoder blocks; deeper models can learn more complex patterns but train slower.

  • num_heads=8
    Attention heads per block; each head attends to the token sequence in a separate subspace (embed_dim / num_heads = 32 dimensions per head).

  • mlp_ratio=4.0
    Hidden size of feed-forward block equals 4 * embed_dim.

  • dropout=0.1
    Regularization in transformer blocks to reduce overfitting risk.
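These settings imply the following derived sizes, shown here as a quick arithmetic check using the defaults above:

```python
def vit_derived_sizes(image_size=64, patch_size=4, embed_dim=256, mlp_ratio=4.0):
    """Sizes implied by the core hyperparameters (illustrative helper)."""
    num_patches = (image_size // patch_size) ** 2  # 16 * 16 = 256 patch tokens
    seq_len = num_patches + 1                      # +1 for the CLS token
    mlp_hidden = int(embed_dim * mlp_ratio)        # 4 * 256 = 1024
    return num_patches, seq_len, mlp_hidden
```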

Training hyperparameters (c1.py)

  • batch_size=128
    Balance between gradient stability, memory use, and throughput.

  • num_epochs=10
    Maximum number of training epochs; early stopping may end training sooner.

  • lr=3e-4
    Initial learning rate for AdamW.

  • weight_decay=1e-4
    L2-style regularization used by AdamW.

  • early_stopping_patience=5
    Stops training if validation accuracy does not improve for 5 epochs.

  • StepLR(step_size=5, gamma=0.5)
    Learning rate decays by half every 5 epochs.
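Under this schedule the learning rate at any epoch has a simple closed form:

```python
def lr_at_epoch(epoch, base_lr=3e-4, step_size=5, gamma=0.5):
    """Learning rate under StepLR(step_size=5, gamma=0.5):
    halved after every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)
```

So with lr=3e-4, epochs 0-4 run at 3e-4, epochs 5-9 at 1.5e-4, and so on.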

Transfer-learning hyperparameters (c1_test.py)

  • pretrained_epochs=2 (default)
    Short fine-tuning schedule for quick comparison against baseline.

  • lr=1e-4, weight_decay=1e-4
    Conservative adaptation from ImageNet features to CIFAR-10.

  • ImageNet transform: Resize(224,224) + ImageNet mean/std
    Matches input assumptions of pre-trained ViT-B/16.


6) CLI Usage

Train custom ViT

From assignment_image/code:

python c1.py

Evaluate custom checkpoint

python c1_test.py --checkpoint-path /path/to/vit_cifar10_best.pt

Evaluate + run pre-trained ViT transfer experiment

python c1_test.py \
  --checkpoint-path /path/to/vit_cifar10_best.pt \
  --run-pretrained-experiment \
  --pretrained-epochs 2

7) Generated Artifacts

Common artifacts produced by the scripts:

  • saved_model/vit_cifar10_best.pt
  • saved_model/vit_cifar10_last.pt
  • misclassified_examples.png (training script visualization)
  • results/baseline_analysis.txt
  • results/misclassified_examples_test.png
  • results/pretrained_vit_analysis.txt (if transfer experiment runs)
  • results/misclassified_examples_pretrained_vit.png (if transfer experiment runs)

8) Notes and Best Practices

  • Keep training and evaluation preprocessing consistent when testing custom checkpoints.
  • Do not use the test set for model selection; select checkpoints on the validation split.
  • Use error analysis outputs (per-class and confusion pairs) to guide augmentation or architecture tuning.
  • If GPU memory is limited, reduce batch_size or image_size.

9) References

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. https://arxiv.org/abs/2010.11929