# Assignment Image: Vision Transformer Documentation

## 1) Overview

This documentation explains two scripts:

- `assignment_image/code/c1.py`: end-to-end **training pipeline** for a custom Vision Transformer (ViT) on CIFAR-10.
- `assignment_image/code/c1_test.py`: **evaluation and analysis pipeline** for saved checkpoints, with an optional transfer-learning experiment using a pre-trained torchvision ViT.

Together, these scripts cover:

1. Data preprocessing and DataLoader creation
2. ViT architecture definition
3. Training, validation, checkpointing, and early stopping
4. Final test evaluation
5. Error analysis (per-class accuracy, confusion patterns, and misclassified images)

---

## 2) Project Organization

Logical separation in the codebase:

- **Data preprocessing**
  - `get_cifar10_dataloaders()` in `c1.py`
  - `get_imagenet_style_cifar10_dataloaders()` in `c1_test.py` (for the pre-trained ViT)
- **Model architecture**
  - `PatchifyEmbedding`, `TransformerEncoderBlock`, `ViTEncoder`, `ViTClassifier` in `c1.py`
- **Training loop**
  - `train_one_epoch()`, `train_model()` in `c1.py`
- **Evaluation**
  - `evaluate()` in `c1.py`
  - `evaluate_model()` in `c1_test.py`
- **Error analysis and visualization**
  - `collect_misclassified()`, `visualize_misclassified()` in `c1.py`
  - `collect_predictions()`, `build_confusion_matrix()`, `format_error_analysis()` in `c1_test.py`

---

## 3) `c1.py` (Training Script) Documentation

### Purpose

`c1.py` trains a custom ViT classifier on CIFAR-10 and saves:

- the best checkpoint by validation accuracy: `vit_cifar10_best.pt`
- the final checkpoint after training ends: `vit_cifar10_last.pt`
- an optional misclassification visualization image

### Data Pipeline

`get_cifar10_dataloaders()` performs the following (a minimal sketch follows the list):

- resize CIFAR-10 images to `image_size x image_size` (default `64x64`)
- convert to tensor (`[0, 255] -> [0, 1]`)
- normalize channels from `[0, 1]` to `[-1, 1]` using mean/std `(0.5, 0.5, 0.5)`
- split the official training set into train/validation subsets by `val_ratio`
- build train/val/test DataLoaders with configurable batch size and worker count
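
A minimal sketch of this pipeline, assuming the standard torchvision CIFAR-10 dataset and a `random_split`-based validation split (the exact signature, defaults, and `data_dir` argument in `c1.py` are assumptions):

```python
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

def get_cifar10_dataloaders(image_size=64, batch_size=128, val_ratio=0.1,
                            num_workers=2, data_dir="./data"):
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size)),  # 32x32 -> 64x64
        transforms.ToTensor(),                        # [0, 255] -> [0, 1]
        transforms.Normalize((0.5, 0.5, 0.5),         # [0, 1] -> [-1, 1]
                             (0.5, 0.5, 0.5)),
    ])
    full_train = datasets.CIFAR10(data_dir, train=True, download=True,
                                  transform=transform)
    test_set = datasets.CIFAR10(data_dir, train=False, download=True,
                                transform=transform)
    # Carve the validation split out of the official training set.
    val_len = int(len(full_train) * val_ratio)
    train_set, val_set = random_split(full_train,
                                      [len(full_train) - val_len, val_len])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True,
                              num_workers=num_workers)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False,
                            num_workers=num_workers)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False,
                             num_workers=num_workers)
    return train_loader, val_loader, test_loader
```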

### Model Architecture

The custom ViT follows a standard encoder-style design (an illustrative sketch follows the list):

1. **Patchify + Projection**
   `PatchifyEmbedding` creates non-overlapping patches and projects each patch to `embed_dim`.

2. **Token + Position Encoding**
   `ViTEncoder` prepends a learnable CLS token and adds learnable positional embeddings.

3. **Transformer Blocks**
   `TransformerEncoderBlock` applies:
   - LayerNorm -> Multi-Head Self-Attention -> Residual
   - LayerNorm -> MLP (GELU + Dropout) -> Residual

4. **Classification Head**
   `ViTClassifier` extracts the CLS representation and maps it to 10 class logits.
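
An illustrative sketch of steps 1 and 3, assuming a strided-convolution patchify and `nn.MultiheadAttention`; the actual layer choices in `c1.py` may differ:

```python
import torch.nn as nn

class PatchifyEmbedding(nn.Module):
    """Non-overlapping patches via a strided convolution (step 1)."""
    def __init__(self, in_channels=3, patch_size=4, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

class TransformerEncoderBlock(nn.Module):
    """Pre-LayerNorm encoder block (step 3)."""
    def __init__(self, embed_dim=256, num_heads=8, mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)  # 4 * 256 = 1024
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, embed_dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        # LayerNorm -> Multi-Head Self-Attention -> Residual
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # LayerNorm -> MLP -> Residual
        return x + self.mlp(self.norm2(x))
```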

### Training and Validation

`train_model()` uses (a skeleton of the loop follows):

- loss: `CrossEntropyLoss`
- optimizer: `AdamW`
- scheduler: `StepLR(step_size=5, gamma=0.5)`
- early stopping: training stops when validation accuracy does not improve for `early_stopping_patience` consecutive epochs
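
A skeleton of this setup, assuming `train_one_epoch()` and `evaluate()` (section 2) return `(loss, accuracy)` pairs; the real checkpoint in `c1.py` may store more than the bare `state_dict`:

```python
import torch

def train_model(model, train_loader, val_loader, num_epochs=10, lr=3e-4,
                weight_decay=1e-4, early_stopping_patience=5,
                save_path="saved_model/vit_cifar10_best.pt", device="cuda"):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5,
                                                gamma=0.5)
    best_val_acc, epochs_since_best = 0.0, 0
    for epoch in range(num_epochs):
        train_loss, train_acc = train_one_epoch(model, train_loader,
                                                criterion, optimizer, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        scheduler.step()  # halve the learning rate every 5 epochs
        print(f"epoch {epoch + 1}: train_loss={train_loss:.4f} "
              f"train_acc={train_acc:.4f} val_loss={val_loss:.4f} "
              f"val_acc={val_acc:.4f} lr={scheduler.get_last_lr()[0]:.2e}")
        if val_acc > best_val_acc:
            best_val_acc, epochs_since_best = val_acc, 0
            torch.save(model.state_dict(), save_path)  # best checkpoint
        else:
            epochs_since_best += 1
            if epochs_since_best >= early_stopping_patience:
                break  # early stopping
    return best_val_acc
```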

### Main Outputs

During training:

- per-epoch logs of train/validation loss, accuracy, and learning rate
- checkpoint files saved in `save_dir`

After training:

- a final validation summary
- test loss/accuracy using the best checkpoint
- an optional plot of misclassified examples

---

## 4) `c1_test.py` (Evaluation + Analysis Script) Documentation

### Purpose

`c1_test.py` is a separate script for:

- loading a trained checkpoint
- evaluating on the test data
- generating error analysis reports
- optionally running transfer learning with a pre-trained ViT-B/16

### Baseline Evaluation Flow

1. Load the checkpoint with `load_model_from_checkpoint()`
2. Recreate the test DataLoader with the same preprocessing used during training
3. Run `evaluate_model()` to obtain test loss and accuracy
4. Collect predictions via `collect_predictions()` (steps 4-5 are sketched after this list)
5. Generate:
   - per-class accuracy
   - top confusion pairs (true -> predicted)
6. Save the analysis text report and a misclassified-image grid
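
A sketch of steps 4-5, assuming `collect_predictions()` returns flat lists of true and predicted labels; the two analysis helpers below are illustrative stand-ins for the reporting logic:

```python
from collections import Counter
import torch

@torch.no_grad()
def collect_predictions(model, loader, device="cuda"):
    model.eval()
    y_true, y_pred = [], []
    for images, labels in loader:
        logits = model(images.to(device))
        y_true += labels.tolist()
        y_pred += logits.argmax(dim=1).cpu().tolist()
    return y_true, y_pred

def per_class_accuracy(y_true, y_pred, num_classes=10):
    correct, total = [0] * num_classes, [0] * num_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return [c / max(n, 1) for c, n in zip(correct, total)]

def top_confusion_pairs(y_true, y_pred, k=5):
    # Most frequent (true -> predicted) pairs among misclassified samples.
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(k)
```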

### Optional Transfer-Learning Experiment

When `--run-pretrained-experiment` is enabled, the script will (see the model-building sketch below):

- build a pre-trained `vit_b_16` from torchvision
- replace the classification head for the 10 CIFAR-10 classes
- preprocess data with ImageNet normalization and a `224x224` resize
- fine-tune with `fine_tune_pretrained()`
- evaluate and save separate analysis artifacts
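
A minimal sketch of the first two steps; `vit_b_16` and its `heads.head` classifier are part of the torchvision API, while `build_pretrained_vit()` is an illustrative wrapper name:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

def build_pretrained_vit(num_classes=10):
    model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
    in_features = model.heads.head.in_features
    model.heads.head = nn.Linear(in_features, num_classes)  # 10-way CIFAR-10 head
    return model
```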

### Baseline vs Pre-trained Comparison (Recorded Result)

From `results/comparison_report.txt`:

| Model | Test Loss | Test Accuracy |
|---|---:|---:|
| Baseline ViT (custom checkpoint) | 0.8916 | 68.57% |
| Pre-trained ViT-B/16 | 0.1495 | 95.15% |

Key comparison metrics:

- Accuracy gain (pre-trained - baseline): **+26.58 percentage points**
- Loss delta (pre-trained - baseline): **-0.7420**

Interpretation: transfer learning with the pre-trained ViT-B/16 provides a large performance improvement over the baseline custom-trained ViT in this run.

---

## 5) Hyperparameters and Their Significance

### Core model hyperparameters (`c1.py`)

- `image_size=64`
  Upscales CIFAR-10 images from `32x32` to `64x64` to allow richer patch tokenization.

- `patch_size=4`
  The number of patches per image becomes `(64/4)^2 = 256`.

- `embed_dim=256`
  Dimensionality of token embeddings; larger values increase representation capacity and compute cost.

- `depth=6`
  Number of transformer encoder blocks; deeper models can learn more complex patterns but train more slowly.

- `num_heads=8`
  Attention heads per block; controls how many parallel attention views each block computes.

- `mlp_ratio=4.0`
  Hidden size of the feed-forward block equals `4 * embed_dim` (here, 1024).

- `dropout=0.1`
  Regularization in the transformer blocks to reduce overfitting risk.
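
For reference, an illustrative instantiation with these defaults (the exact `ViTClassifier` constructor signature in `c1.py` is an assumption):

```python
from c1 import ViTClassifier  # run from assignment_image/code

# Hypothetical keyword arguments mirroring the defaults listed above.
model = ViTClassifier(
    image_size=64,   # 64x64 inputs -> (64/4)^2 = 256 patch tokens
    patch_size=4,
    embed_dim=256,
    depth=6,
    num_heads=8,
    mlp_ratio=4.0,
    dropout=0.1,
    num_classes=10,
)
```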

### Training hyperparameters (`c1.py`)

- `batch_size=128`
  Balances gradient stability, memory use, and throughput.

- `num_epochs=10`
  Maximum number of training epochs; training may end sooner if early stopping triggers.

- `lr=3e-4`
  Initial learning rate for AdamW.

- `weight_decay=1e-4`
  L2-style regularization used by AdamW.

- `early_stopping_patience=5`
  Stops training if validation accuracy does not improve for 5 consecutive epochs.

- `StepLR(step_size=5, gamma=0.5)`
  The learning rate is halved every 5 epochs (e.g., `3e-4` for epochs 1-5, then `1.5e-4`).

### Transfer-learning hyperparameters (`c1_test.py`)

- `pretrained_epochs=2` (default)
  Short fine-tuning schedule for a quick comparison against the baseline.

- `lr=1e-4`, `weight_decay=1e-4`
  Conservative adaptation from ImageNet features to CIFAR-10.

- ImageNet transform: `Resize(224, 224)` + ImageNet mean/std (sketched below)
  Matches the input assumptions of the pre-trained ViT-B/16.
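
A sketch of that transform, using the standard ImageNet statistics:

```python
from torchvision import transforms

imagenet_transform = transforms.Compose([
    transforms.Resize((224, 224)),                    # ViT-B/16 input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet mean
                         std=[0.229, 0.224, 0.225]),  # ImageNet std
])
```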

---

## 6) CLI Usage

### Train custom ViT

From `assignment_image/code`:

```bash
python c1.py
```

### Evaluate custom checkpoint

```bash
python c1_test.py --checkpoint-path /path/to/vit_cifar10_best.pt
```

### Evaluate + run pre-trained ViT transfer experiment

```bash
python c1_test.py \
  --checkpoint-path /path/to/vit_cifar10_best.pt \
  --run-pretrained-experiment \
  --pretrained-epochs 2
```

---

## 7) Generated Artifacts

Common artifacts produced by the scripts:

- `saved_model/vit_cifar10_best.pt`
- `saved_model/vit_cifar10_last.pt`
- `misclassified_examples.png` (training script visualization)
- `results/baseline_analysis.txt`
- `results/misclassified_examples_test.png`
- `results/pretrained_vit_analysis.txt` (if the transfer experiment runs)
- `results/misclassified_examples_pretrained_vit.png` (if the transfer experiment runs)

---

## 8) Notes and Best Practices

- Keep training and evaluation preprocessing consistent when testing custom checkpoints.
- Do not use the test set for model selection; use the validation split to select checkpoints.
- Use the error analysis outputs (per-class accuracy and confusion pairs) to guide augmentation or architecture tuning.
- If GPU memory is limited, reduce `batch_size` or `image_size`.

---

## 9) References

- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*. ICLR 2021. https://arxiv.org/abs/2010.11929
|