Assignment Image: Vision Transformer Documentation
1) Overview
This documentation explains two scripts:
- `assignment_image/code/c1.py`: end-to-end training pipeline for a custom Vision Transformer (ViT) on CIFAR-10.
- `assignment_image/code/c1_test.py`: evaluation and analysis pipeline for saved checkpoints, with an optional transfer-learning experiment using a pre-trained torchvision ViT.
Together, these scripts cover:
- Data preprocessing and DataLoader creation
- ViT architecture definition
- Training, validation, checkpointing, and early stopping
- Final test evaluation
- Error analysis (per-class accuracy + confusion patterns + misclassified images)
2) Project Organization
Logical separation in the codebase:
- Data preprocessing: `get_cifar10_dataloaders()` in `c1.py`; `get_imagenet_style_cifar10_dataloaders()` in `c1_test.py` (for the pre-trained ViT)
- Model architecture: `PatchifyEmbedding`, `TransformerEncoderBlock`, `ViTEncoder`, `ViTClassifier` in `c1.py`
- Training loop: `train_one_epoch()`, `train_model()` in `c1.py`
- Evaluation: `evaluate()` in `c1.py`; `evaluate_model()` in `c1_test.py`
- Error analysis and visualization: `collect_misclassified()`, `visualize_misclassified()` in `c1.py`; `collect_predictions()`, `build_confusion_matrix()`, `format_error_analysis()` in `c1_test.py`
3) c1.py (Training Script) Documentation
Purpose
c1.py trains a custom ViT classifier on CIFAR-10 and saves:
- best checkpoint by validation accuracy: `vit_cifar10_best.pt`
- final checkpoint after training ends: `vit_cifar10_last.pt`
- optional misclassification visualization image
Data Pipeline
get_cifar10_dataloaders() performs:
- resize CIFAR-10 images to `image_size x image_size` (default `64x64`)
- convert to tensor (`[0, 255]` -> `[0, 1]`)
- normalize channels from `[0, 1]` to `[-1, 1]` using mean/std `(0.5, 0.5, 0.5)`
- split the official training set into train/validation by `val_ratio`
- build train/val/test DataLoaders with configurable batch size and worker count
Model Architecture
The custom ViT follows standard encoder-style design:
- Patchify + Projection: `PatchifyEmbedding` creates non-overlapping patches and projects each patch to `embed_dim`.
- Token + Position Encoding: `ViTEncoder` prepends a learnable CLS token and adds learnable positional embeddings.
- Transformer Blocks: `TransformerEncoderBlock` applies:
  - LayerNorm -> Multi-Head Self-Attention -> Residual
  - LayerNorm -> MLP (GELU + Dropout) -> Residual
- Classification Head: `ViTClassifier` extracts the CLS representation and maps it to 10 class logits.
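The patchify step can be sketched with a strided convolution, a common way to split an image into non-overlapping patches and project them in one operation. This is an assumption about the approach; the actual `PatchifyEmbedding` in `c1.py` may instead reshape patches explicitly before a linear projection.

```python
import torch
import torch.nn as nn

class PatchifyEmbeddingSketch(nn.Module):
    """Illustrative patch embedding: a Conv2d with kernel == stride == patch_size
    yields one embed_dim vector per non-overlapping patch."""
    def __init__(self, in_channels: int = 3, patch_size: int = 4, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
```

With the documented defaults (`image_size=64`, `patch_size=4`, `embed_dim=256`), a batch of images maps to a `(B, 256, 256)` token sequence before the CLS token is prepended.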
Training and Validation
train_model() uses:
- loss: `CrossEntropyLoss`
- optimizer: `AdamW`
- scheduler: `StepLR(step_size=5, gamma=0.5)`
- early stopping: stop when validation accuracy does not improve for `early_stopping_patience` epochs
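The early-stopping rule can be isolated into a small helper. This is a hypothetical sketch of the documented stopping criterion, not the actual loop inside `train_model()`.

```python
def run_early_stopping(val_accuracies, patience: int = 5):
    """Simulate the documented rule: stop once validation accuracy has not
    improved for `patience` consecutive epochs. Returns (last_epoch_run,
    best_epoch), both zero-indexed. Assumes at least one epoch."""
    best_acc = float("-inf")
    best_epoch = -1
    epochs_since_improvement = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:
            break  # patience exhausted: stop training early
    return epoch, best_epoch
```

Because the best checkpoint is saved whenever validation accuracy improves, the model restored for testing corresponds to `best_epoch`, not to the last epoch run.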
Main Outputs
During training:
- epoch-wise train/validation loss, accuracy, and learning-rate logs
- checkpoint files saved in `save_dir`
After training:
- final validation summary
- test loss/accuracy using best checkpoint
- optional plot of misclassified examples
4) c1_test.py (Evaluation + Analysis Script) Documentation
Purpose
c1_test.py is a separate script for:
- loading a trained checkpoint
- evaluating on test data
- generating error analysis reports
- optionally running transfer learning with pre-trained ViT-B/16
Baseline Evaluation Flow
- Load a checkpoint with `load_model_from_checkpoint()`
- Recreate the test DataLoader with the same preprocessing used during training
- Run `evaluate_model()` for test loss and accuracy
- Collect predictions via `collect_predictions()`
- Generate:
  - per-class accuracy
  - top confusion pairs (true -> predicted)
- Save the analysis text report and the misclassified image grid
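The confusion-pair step of this flow can be sketched in plain Python. The helper name and signature are hypothetical; the actual logic lives in `build_confusion_matrix()` and `format_error_analysis()` in `c1_test.py`.

```python
from collections import Counter

def top_confusion_pairs(y_true, y_pred, class_names, k: int = 3):
    """Count (true -> predicted) label pairs among misclassified samples
    and return the k most frequent as (true_name, pred_name, count)."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return [(class_names[t], class_names[p], n)
            for (t, p), n in pairs.most_common(k)]
```

On CIFAR-10 such a report typically surfaces semantically close pairs (e.g. visually similar animal classes), which is what makes it useful for guiding augmentation choices.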
Optional Transfer-Learning Experiment
When --run-pretrained-experiment is enabled:
- build pre-trained `vit_b_16` from torchvision
- replace the classification head for the 10 CIFAR-10 classes
- preprocess data with ImageNet normalization and `224x224` resize
- fine-tune with `fine_tune_pretrained()`
- evaluate and save separate analysis artifacts
Baseline vs Pre-trained Comparison (Recorded Result)
From results/comparison_report.txt:
| Model | Test Loss | Test Accuracy |
|---|---|---|
| Baseline ViT (custom checkpoint) | 0.8916 | 68.57% |
| Pre-trained ViT-B/16 | 0.1495 | 95.15% |
Key comparison metrics:
- Accuracy gain (pre-trained - baseline): +26.58 percentage points
- Loss delta (pre-trained - baseline): -0.7420
Interpretation: transfer learning with pre-trained ViT-B/16 provides a large performance improvement over the baseline custom-trained ViT in this run.
5) Hyperparameters and Their Significance
Core model hyperparameters (c1.py)
- `image_size=64`: upscales CIFAR-10 images from `32x32` to allow richer patch tokenization.
- `patch_size=4`: the number of patches per image becomes `(64/4)^2 = 256`.
- `embed_dim=256`: dimensionality of token embeddings; larger values increase representation capacity and compute cost.
- `depth=6`: number of transformer encoder blocks; deeper models can learn more complex patterns but train more slowly.
- `num_heads=8`: attention heads per block; controls the multi-view attention decomposition.
- `mlp_ratio=4.0`: the hidden size of the feed-forward block equals `4 * embed_dim`.
- `dropout=0.1`: regularization in transformer blocks to reduce overfitting risk.
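The patch-count arithmetic above generalizes to any (image size, patch size) pair; a tiny helper makes the sequence length seen by the encoder explicit (the extra token is the CLS token):

```python
def vit_token_count(image_size: int = 64, patch_size: int = 4):
    """Number of patches per image and the resulting encoder sequence
    length (patches + 1 CLS token). Assumes image_size % patch_size == 0."""
    num_patches = (image_size // patch_size) ** 2
    return num_patches, num_patches + 1
```

Since self-attention cost grows quadratically with sequence length, doubling `image_size` (or halving `patch_size`) quadruples the patch count and roughly sixteen-folds the attention cost.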
Training hyperparameters (c1.py)
- `batch_size=128`: balances gradient stability, memory use, and throughput.
- `num_epochs=10`: maximum number of training epochs; early stopping may end the run sooner.
- `lr=3e-4`: initial learning rate for AdamW.
- `weight_decay=1e-4`: L2-style regularization used by AdamW.
- `early_stopping_patience=5`: stops training if validation accuracy does not improve for 5 consecutive epochs.
- `StepLR(step_size=5, gamma=0.5)`: the learning rate is halved every 5 epochs.
Transfer-learning hyperparameters (c1_test.py)
- `pretrained_epochs=2` (default): short fine-tuning schedule for a quick comparison against the baseline.
- `lr=1e-4`, `weight_decay=1e-4`: conservative adaptation from ImageNet features to CIFAR-10.
- ImageNet transform: `Resize(224, 224)` + ImageNet mean/std, matching the input assumptions of pre-trained ViT-B/16.
6) CLI Usage
Train custom ViT
From assignment_image/code:
python c1.py
Evaluate custom checkpoint
python c1_test.py --checkpoint-path /path/to/vit_cifar10_best.pt
Evaluate + run pre-trained ViT transfer experiment
python c1_test.py \
--checkpoint-path /path/to/vit_cifar10_best.pt \
--run-pretrained-experiment \
--pretrained-epochs 2
7) Generated Artifacts
Common artifacts produced by the scripts:
- `saved_model/vit_cifar10_best.pt`
- `saved_model/vit_cifar10_last.pt`
- `misclassified_examples.png` (training-script visualization)
- `results/baseline_analysis.txt`
- `results/misclassified_examples_test.png`
- `results/pretrained_vit_analysis.txt` (if the transfer experiment runs)
- `results/misclassified_examples_pretrained_vit.png` (if the transfer experiment runs)
8) Notes and Best Practices
- Keep training and evaluation preprocessing consistent when testing custom checkpoints.
- Do not use the test set for model selection; select checkpoints on the validation split.
- Use the error-analysis outputs (per-class accuracy and confusion pairs) to guide augmentation or architecture tuning.
- If GPU memory is limited, reduce `batch_size` or `image_size`.
9) References
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. https://arxiv.org/abs/2010.11929