
Assignment Image: Vision Transformer Documentation

1) Overview

This documentation explains two scripts:

  • assignment_image/code/c1.py: end-to-end training pipeline for a custom Vision Transformer (ViT) on CIFAR-10.
  • assignment_image/code/c1_test.py: evaluation and analysis pipeline for saved checkpoints, with an optional transfer-learning experiment using a pre-trained torchvision ViT.

Together, these scripts cover:

  1. Data preprocessing and DataLoader creation
  2. ViT architecture definition
  3. Training, validation, checkpointing, and early stopping
  4. Final test evaluation
  5. Error analysis (per-class accuracy + confusion patterns + misclassified images)

2) Project Organization

Logical separation in the codebase:

  • Data preprocessing
    • get_cifar10_dataloaders() in c1.py
    • get_imagenet_style_cifar10_dataloaders() in c1_test.py (for pre-trained ViT)
  • Model architecture
    • PatchifyEmbedding, TransformerEncoderBlock, ViTEncoder, ViTClassifier in c1.py
  • Training loop
    • train_one_epoch(), train_model() in c1.py
  • Evaluation
    • evaluate() in c1.py
    • evaluate_model() in c1_test.py
  • Error analysis and visualization
    • collect_misclassified(), visualize_misclassified() in c1.py
    • collect_predictions(), build_confusion_matrix(), format_error_analysis() in c1_test.py

3) c1.py (Training Script) Documentation

Purpose

c1.py trains a custom ViT classifier on CIFAR-10 and saves:

  • best checkpoint by validation accuracy: vit_cifar10_best.pt
  • final checkpoint after training ends: vit_cifar10_last.pt
  • optional misclassification visualization image

Data Pipeline

get_cifar10_dataloaders() performs:

  • resize CIFAR-10 images to image_size x image_size (default 64x64)
  • convert to tensor ([0, 255] -> [0, 1])
  • normalize channels from [0,1] to [-1,1] using mean/std (0.5, 0.5, 0.5)
  • split official training set into train/validation by val_ratio
  • build train/val/test DataLoaders with configurable batch size and workers

Model Architecture

The custom ViT follows standard encoder-style design:

  1. Patchify + Projection
    PatchifyEmbedding creates non-overlapping patches and projects each patch to embed_dim.

  2. Token + Position Encoding
    ViTEncoder prepends a learnable CLS token and adds learnable positional embeddings.

  3. Transformer Blocks
    TransformerEncoderBlock applies:

    • LayerNorm -> Multi-Head Self-Attention -> Residual
    • LayerNorm -> MLP (GELU + Dropout) -> Residual

  4. Classification Head
    ViTClassifier extracts CLS representation and maps it to 10 class logits.
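Step 1 is commonly implemented as a strided convolution. The following is a minimal sketch of a PatchifyEmbedding-style module under that assumption (the script's actual implementation may differ in detail):

```python
import torch
import torch.nn as nn

class PatchifyEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one to
    embed_dim. A Conv2d with kernel_size == stride == patch_size performs
    the patchify and the linear projection in a single operation."""
    def __init__(self, in_channels=3, patch_size=4, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # (B, 3, H, W)
        x = self.proj(x)                      # (B, embed_dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

# A 64x64 input with 4x4 patches yields 256 tokens of dimension 256.
tokens = PatchifyEmbedding()(torch.randn(2, 3, 64, 64))
```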

Training and Validation

train_model() uses:

  • loss: CrossEntropyLoss
  • optimizer: AdamW
  • scheduler: StepLR(step_size=5, gamma=0.5)
  • early stopping: stop when validation accuracy does not improve for early_stopping_patience epochs
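The early-stopping rule can be isolated as a small helper. This is illustrative only; train_model() interleaves the same logic with the actual training and validation passes:

```python
def early_stop_epoch(val_accuracies, patience=5):
    """Return the 0-based epoch at which training would stop under the
    rule: halt once validation accuracy has not improved for `patience`
    consecutive epochs. Returns the last epoch if the budget runs out."""
    best, stale = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_accuracies) - 1
```

For example, a run whose validation accuracy peaks at epoch 1 and then plateaus stops at epoch 6, i.e. five stale epochs after the peak.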

Main Outputs

During training:

  • epoch-wise train/validation loss, accuracy, and learning rate logs
  • checkpoint files saved in save_dir

After training:

  • final validation summary
  • test loss/accuracy using best checkpoint
  • optional plot of misclassified examples
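Checkpoint handling can look like the following sketch. The key names in the saved dict are assumptions for illustration, not necessarily the script's exact format:

```python
import torch

def save_checkpoint(model, path, epoch, val_acc):
    # Store weights plus minimal metadata; key names are illustrative.
    torch.save({"model_state_dict": model.state_dict(),
                "epoch": epoch,
                "val_acc": val_acc}, path)

def load_checkpoint(model, path):
    # Restore weights into an already-constructed model of the same shape.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    return ckpt
```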

4) c1_test.py (Evaluation + Analysis Script) Documentation

Purpose

c1_test.py is a separate script for:

  • loading a trained checkpoint
  • evaluating on test data
  • generating error analysis reports
  • optionally running transfer learning with pre-trained ViT-B/16

Baseline Evaluation Flow

  1. Load checkpoint with load_model_from_checkpoint()
  2. Recreate the test DataLoader with the same preprocessing used during training
  3. Run evaluate_model() for test loss and accuracy
  4. Collect predictions via collect_predictions()
  5. Generate:
    • per-class accuracy
    • top confusion pairs (true -> predicted)
  6. Save analysis text report and misclassified image grid
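Steps 4-5 reduce to simple counting once predictions are collected. A minimal sketch follows; the function name and return shape are illustrative, not the exact interface of collect_predictions() or build_confusion_matrix():

```python
from collections import Counter

def error_analysis(y_true, y_pred, top_k=3):
    """Per-class accuracy plus the most frequent (true -> predicted)
    confusion pairs, as reported in the baseline analysis."""
    correct, total, confusions = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
        else:
            confusions[(t, p)] += 1
    per_class = {c: correct[c] / total[c] for c in total}
    return per_class, confusions.most_common(top_k)
```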

Optional Transfer-Learning Experiment

When --run-pretrained-experiment is enabled:

  • build pre-trained vit_b_16 from torchvision
  • replace classification head for 10 CIFAR-10 classes
  • preprocess data with ImageNet normalization and 224x224 resize
  • fine-tune with fine_tune_pretrained()
  • evaluate and save separate analysis artifacts

Baseline vs Pre-trained Comparison (Recorded Result)

From results/comparison_report.txt:

  Model                              Test Loss   Test Accuracy
  ---------------------------------  ----------  -------------
  Baseline ViT (custom checkpoint)   0.8916      68.57%
  Pre-trained ViT-B/16               0.1495      95.15%

Key comparison metrics:

  • Accuracy gain (pre-trained - baseline): +26.58 percentage points
  • Loss delta (pre-trained - baseline): -0.7420

Interpretation: transfer learning with pre-trained ViT-B/16 provides a large performance improvement over the baseline custom-trained ViT in this run.


5) Hyperparameters and Their Significance

Core model hyperparameters (c1.py)

  • image_size=64
    Upscales CIFAR-10 images from 32x32 to 64x64, yielding more patch tokens for a richer tokenization.

  • patch_size=4
    Each patch covers 4x4 pixels; the number of patches per image becomes (64/4)^2 = 256.

  • embed_dim=256
    Dimensionality of token embeddings; larger values increase representation capacity and compute cost.

  • depth=6
    Number of transformer encoder blocks; deeper models can learn more complex patterns but train slower.

  • num_heads=8
    Attention heads per block; each head attends to the token sequence in a separate subspace (embed_dim / num_heads = 32 dimensions per head).

  • mlp_ratio=4.0
    Hidden size of feed-forward block equals 4 * embed_dim.

  • dropout=0.1
    Regularization in transformer blocks to reduce overfitting risk.
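These settings imply the following derived sizes, shown here as a quick arithmetic check using the defaults above:

```python
def vit_derived_sizes(image_size=64, patch_size=4, embed_dim=256, mlp_ratio=4.0):
    """Sizes implied by the core hyperparameters (illustrative helper)."""
    num_patches = (image_size // patch_size) ** 2  # 16 * 16 = 256 patch tokens
    seq_len = num_patches + 1                      # +1 for the CLS token
    mlp_hidden = int(embed_dim * mlp_ratio)        # 4 * 256 = 1024
    return num_patches, seq_len, mlp_hidden
```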

Training hyperparameters (c1.py)

  • batch_size=128
    Balance between gradient stability, memory use, and throughput.

  • num_epochs=10
    Maximum number of training epochs; early stopping may end training sooner.

  • lr=3e-4
    Initial learning rate for AdamW.

  • weight_decay=1e-4
    L2-style regularization used by AdamW.

  • early_stopping_patience=5
    Stops training if validation accuracy does not improve for 5 epochs.

  • StepLR(step_size=5, gamma=0.5)
    Learning rate decays by half every 5 epochs.
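Under this schedule the learning rate at any epoch has a simple closed form:

```python
def lr_at_epoch(epoch, base_lr=3e-4, step_size=5, gamma=0.5):
    """Learning rate under StepLR(step_size=5, gamma=0.5):
    halved after every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)
```

So with lr=3e-4, epochs 0-4 run at 3e-4, epochs 5-9 at 1.5e-4, and so on.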

Transfer-learning hyperparameters (c1_test.py)

  • pretrained_epochs=2 (default)
    Short fine-tuning schedule for quick comparison against baseline.

  • lr=1e-4, weight_decay=1e-4
    Conservative adaptation from ImageNet features to CIFAR-10.

  • ImageNet transform: Resize(224,224) + ImageNet mean/std
    Matches input assumptions of pre-trained ViT-B/16.


6) CLI Usage

Train custom ViT

From assignment_image/code:

python c1.py

Evaluate custom checkpoint

python c1_test.py --checkpoint-path /path/to/vit_cifar10_best.pt

Evaluate + run pre-trained ViT transfer experiment

python c1_test.py \
  --checkpoint-path /path/to/vit_cifar10_best.pt \
  --run-pretrained-experiment \
  --pretrained-epochs 2

7) Generated Artifacts

Common artifacts produced by the scripts:

  • saved_model/vit_cifar10_best.pt
  • saved_model/vit_cifar10_last.pt
  • misclassified_examples.png (training script visualization)
  • results/baseline_analysis.txt
  • results/misclassified_examples_test.png
  • results/pretrained_vit_analysis.txt (if transfer experiment runs)
  • results/misclassified_examples_pretrained_vit.png (if transfer experiment runs)

8) Notes and Best Practices

  • Keep training and evaluation preprocessing consistent when testing custom checkpoints.
  • Do not use the test set for model selection; select checkpoints on the validation split.
  • Use error analysis outputs (per-class and confusion pairs) to guide augmentation or architecture tuning.
  • If GPU memory is limited, reduce batch_size or image_size.

9) References

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. https://arxiv.org/abs/2010.11929