# Assignment Image: Vision Transformer Documentation
## 1) Overview
This documentation explains two scripts:
- `assignment_image/code/c1.py`: end-to-end **training pipeline** for a custom Vision Transformer (ViT) on CIFAR-10.
- `assignment_image/code/c1_test.py`: **evaluation and analysis pipeline** for saved checkpoints, with an optional transfer-learning experiment using a pre-trained torchvision ViT.
Together, these scripts cover:
1. Data preprocessing and DataLoader creation
2. ViT architecture definition
3. Training, validation, checkpointing, and early stopping
4. Final test evaluation
5. Error analysis (per-class accuracy + confusion patterns + misclassified images)
---
## 2) Project Organization
Logical separation in the codebase:
- **Data preprocessing**
- `get_cifar10_dataloaders()` in `c1.py`
- `get_imagenet_style_cifar10_dataloaders()` in `c1_test.py` (for pre-trained ViT)
- **Model architecture**
- `PatchifyEmbedding`, `TransformerEncoderBlock`, `ViTEncoder`, `ViTClassifier` in `c1.py`
- **Training loop**
- `train_one_epoch()`, `train_model()` in `c1.py`
- **Evaluation**
- `evaluate()` in `c1.py`
- `evaluate_model()` in `c1_test.py`
- **Error analysis and visualization**
- `collect_misclassified()`, `visualize_misclassified()` in `c1.py`
- `collect_predictions()`, `build_confusion_matrix()`, `format_error_analysis()` in `c1_test.py`
---
## 3) `c1.py` (Training Script) Documentation
### Purpose
`c1.py` trains a custom ViT classifier on CIFAR-10 and saves:
- best checkpoint by validation accuracy: `vit_cifar10_best.pt`
- final checkpoint after training ends: `vit_cifar10_last.pt`
- optional misclassification visualization image
### Data Pipeline
`get_cifar10_dataloaders()` performs:
- resize CIFAR-10 images to `image_size x image_size` (default `64x64`)
- convert to tensor (`[0, 255] -> [0, 1]`)
- normalize channels from `[0,1]` to `[-1,1]` using mean/std `(0.5, 0.5, 0.5)`
- split official training set into train/validation by `val_ratio`
- build train/val/test DataLoaders with configurable batch size and workers
### Model Architecture
The custom ViT follows standard encoder-style design:
1. **Patchify + Projection**
`PatchifyEmbedding` creates non-overlapping patches and projects each patch to `embed_dim`.
2. **Token + Position Encoding**
`ViTEncoder` prepends a learnable CLS token and adds learnable positional embeddings.
3. **Transformer Blocks**
`TransformerEncoderBlock` applies:
- LayerNorm -> Multi-Head Self-Attention -> Residual
- LayerNorm -> MLP (GELU + Dropout) -> Residual
4. **Classification Head**
`ViTClassifier` extracts CLS representation and maps it to 10 class logits.
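The patchify-and-project step (step 1 above) can be sketched with a strided convolution, a common implementation trick where `kernel_size == stride` cuts the image into non-overlapping patches. This is an illustrative sketch assuming the defaults `patch_size=4` and `embed_dim=256`; the real `PatchifyEmbedding` in `c1.py` may differ in detail.

```python
# Minimal patch-embedding sketch (assumed defaults: patch_size=4,
# embed_dim=256); the real module is PatchifyEmbedding in c1.py.
import torch
import torch.nn as nn

class PatchifyEmbedding(nn.Module):
    def __init__(self, in_channels=3, patch_size=4, embed_dim=256):
        super().__init__()
        # kernel_size == stride => non-overlapping patches, each
        # linearly projected to embed_dim.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

tokens = PatchifyEmbedding()(torch.randn(2, 3, 64, 64))
print(tuple(tokens.shape))  # (2, 256, 256): (64/4)^2 = 256 patches
```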
### Training and Validation
`train_model()` uses:
- loss: `CrossEntropyLoss`
- optimizer: `AdamW`
- scheduler: `StepLR(step_size=5, gamma=0.5)`
- early stopping: stop when validation accuracy does not improve for `early_stopping_patience` epochs
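The early-stopping rule above can be sketched as a small loop over validation accuracies. The accuracy sequence here is illustrative, not a real run; the real logic lives inside `train_model()` in `c1.py`.

```python
# Sketch of early stopping: stop once validation accuracy has not
# improved for `patience` consecutive epochs (illustrative values).
def run_with_early_stopping(val_accuracies, patience=5):
    best_acc, epochs_without_improvement = 0.0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc = acc               # new best: checkpoint here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_acc   # stop early
    return len(val_accuracies), best_acc

# Accuracy peaks at epoch 3, so training stops at epoch 8 (3 + 5).
stop_epoch, best = run_with_early_stopping(
    [0.45, 0.55, 0.60, 0.59, 0.58, 0.60, 0.59, 0.58, 0.57, 0.56])
print(stop_epoch, best)  # 8 0.6
```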
### Main Outputs
During training:
- epoch-wise train/validation loss, accuracy, and learning rate logs
- checkpoint files saved in `save_dir`
After training:
- final validation summary
- test loss/accuracy using best checkpoint
- optional plot of misclassified examples
---
## 4) `c1_test.py` (Evaluation + Analysis Script) Documentation
### Purpose
`c1_test.py` is a separate script for:
- loading a trained checkpoint
- evaluating on test data
- generating error analysis reports
- optionally running transfer learning with pre-trained ViT-B/16
### Baseline Evaluation Flow
1. Load checkpoint with `load_model_from_checkpoint()`
2. Recreate the test DataLoader with the same preprocessing used during training
3. Run `evaluate_model()` for test loss and accuracy
4. Collect predictions via `collect_predictions()`
5. Generate:
- per-class accuracy
- top confusion pairs (true -> predicted)
6. Save analysis text report and misclassified image grid
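Steps 4 and 5 above can be sketched with plain counters over the prediction lists. This assumes the predictions are available as parallel lists of true and predicted labels; the real helpers are `build_confusion_matrix()` and `format_error_analysis()` in `c1_test.py`, which may organize the output differently.

```python
# Sketch of per-class accuracy and top (true -> predicted) confusion
# pairs, assuming parallel label lists from collect_predictions().
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}

def top_confusions(y_true, y_pred, k=3):
    # Count (true -> predicted) pairs over misclassified samples only.
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(k)

# Tiny illustrative example, not real CIFAR-10 predictions.
y_true = ["cat", "cat", "cat", "dog", "dog", "ship"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "ship"]
print(per_class_accuracy(y_true, y_pred))
print(top_confusions(y_true, y_pred))  # most frequent: cat -> dog
```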
### Optional Transfer-Learning Experiment
When `--run-pretrained-experiment` is enabled:
- build pre-trained `vit_b_16` from torchvision
- replace classification head for 10 CIFAR-10 classes
- preprocess data with ImageNet normalization and `224x224` resize
- fine-tune with `fine_tune_pretrained()`
- evaluate and save separate analysis artifacts
### Baseline vs Pre-trained Comparison (Recorded Result)
From `results/comparison_report.txt`:
| Model | Test Loss | Test Accuracy |
|---|---:|---:|
| Baseline ViT (custom checkpoint) | 0.8916 | 68.57% |
| Pre-trained ViT-B/16 | 0.1495 | 95.15% |
Key comparison metrics:
- Accuracy gain (pre-trained - baseline): **+26.58 percentage points**
- Loss delta (pre-trained - baseline): **-0.7420**
Interpretation: transfer learning with pre-trained ViT-B/16 provides a large performance improvement over the baseline custom-trained ViT in this run.
---
## 5) Hyperparameters and Their Significance
### Core model hyperparameters (`c1.py`)
- `image_size=64`
  Upscales CIFAR-10 images from `32x32` to `64x64`, yielding more patch tokens per image.
- `patch_size=4`
Number of patches per image becomes `(64/4)^2 = 256`.
- `embed_dim=256`
Dimensionality of token embeddings; larger values increase representation capacity and compute cost.
- `depth=6`
Number of transformer encoder blocks; deeper models can learn more complex patterns but train slower.
- `num_heads=8`
Attention heads per block; controls multi-view attention decomposition.
- `mlp_ratio=4.0`
Hidden size of feed-forward block equals `4 * embed_dim`.
- `dropout=0.1`
Regularization in transformer blocks to reduce overfitting risk.
### Training hyperparameters (`c1.py`)
- `batch_size=128`
Balance between gradient stability, memory use, and throughput.
- `num_epochs=10`
  Maximum number of training epochs; early stopping may end the run sooner.
- `lr=3e-4`
Initial learning rate for AdamW.
- `weight_decay=1e-4`
L2-style regularization used by AdamW.
- `early_stopping_patience=5`
Stops training if validation accuracy does not improve for 5 epochs.
- `StepLR(step_size=5, gamma=0.5)`
Learning rate decays by half every 5 epochs.
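The resulting schedule can be written in closed form: with `StepLR(step_size=5, gamma=0.5)` and an initial `lr=3e-4`, the learning rate at a 0-indexed epoch is `lr * gamma ** (epoch // step_size)`. A quick sketch over the 10-epoch run:

```python
# Closed-form StepLR schedule: lr halves every step_size epochs.
lr, gamma, step_size = 3e-4, 0.5, 5
schedule = [lr * gamma ** (epoch // step_size) for epoch in range(10)]
print(schedule)  # 3e-4 for epochs 0-4, then 1.5e-4 for epochs 5-9
```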
### Transfer-learning hyperparameters (`c1_test.py`)
- `pretrained_epochs=2` (default)
  Short fine-tuning schedule for a quick comparison against the baseline.
- `lr=1e-4`, `weight_decay=1e-4`
Conservative adaptation from ImageNet features to CIFAR-10.
- ImageNet transform: `Resize(224,224)` + ImageNet mean/std
Matches input assumptions of pre-trained ViT-B/16.
---
## 6) CLI Usage
### Train custom ViT
From `assignment_image/code`:
```bash
python c1.py
```
### Evaluate custom checkpoint
```bash
python c1_test.py --checkpoint-path /path/to/vit_cifar10_best.pt
```
### Evaluate + run pre-trained ViT transfer experiment
```bash
python c1_test.py \
--checkpoint-path /path/to/vit_cifar10_best.pt \
--run-pretrained-experiment \
--pretrained-epochs 2
```
---
## 7) Generated Artifacts
Common artifacts produced by the scripts:
- `saved_model/vit_cifar10_best.pt`
- `saved_model/vit_cifar10_last.pt`
- `misclassified_examples.png` (training script visualization)
- `results/baseline_analysis.txt`
- `results/misclassified_examples_test.png`
- `results/pretrained_vit_analysis.txt` (if transfer experiment runs)
- `results/misclassified_examples_pretrained_vit.png` (if transfer experiment runs)
---
## 8) Notes and Best Practices
- Keep training and evaluation preprocessing consistent when testing custom checkpoints.
- Do not use test set for model selection; use validation split for checkpoint selection.
- Use error analysis outputs (per-class and confusion pairs) to guide augmentation or architecture tuning.
- If GPU memory is limited, reduce `batch_size` or `image_size`.
---
## 9) References
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*. ICLR 2021. https://arxiv.org/abs/2010.11929