# PlainMLP vs ResMLP Comparison - Distant Identity Task
## Objective
Compare a 20-layer PlainMLP and a 20-layer ResMLP on a synthetic "Distant Identity" task to demonstrate the vanishing gradient problem and how residual connections solve it.
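In brief: for a plain layer $h_{l+1} = \sigma(W_l h_l)$, the backward pass multiplies one Jacobian per layer,

$$\frac{\partial h_L}{\partial h_1} = \prod_{l=1}^{L-1} D_l W_l, \qquad D_l = \mathrm{diag}\big(\sigma'(W_l h_l)\big),$$

and a product of many such factors tends to shrink (or blow up) exponentially with depth. A residual block $h_{l+1} = h_l + \sigma(W_l h_l)$ instead has Jacobian $I + D_l W_l$, so the identity term gives gradients a path that does not attenuate with depth; Phase 3 below should make this visible.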
## Tasks
### Phase 1: Implementation
- [ ] Implement PlainMLP (20 layers, hidden dim 64, ReLU, Kaiming/He init)
- [ ] Implement ResMLP (20 layers, hidden dim 64, residual connections, Kaiming/He init)
- [ ] Generate synthetic data (1024 vectors, dim 64, entries drawn from U(-1, 1), target Y = X); see the sketch after this list
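A minimal sketch of both models and the dataset. All names are illustrative, and `kaiming_normal_` is an assumption (the spec does not say normal vs. uniform Kaiming init):

```python
import torch
import torch.nn as nn

class PlainMLP(nn.Module):
    """20 Linear+ReLU layers with no skip connections."""
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
        for layer in self.layers:
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

class ResMLP(nn.Module):
    """Same stack, but each block adds its input back: x + relu(Wx)."""
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
        for layer in self.layers:
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))
        return x

# Synthetic "Distant Identity" data: the target is the input itself.
X = torch.empty(1024, 64).uniform_(-1.0, 1.0)
Y = X.clone()
```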
### Phase 2: Training
- [ ] Train both models for 500 steps with Adam (lr = 1e-3)
- [ ] Record the MSE loss at every step; a minimal loop is sketched below
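A training loop consistent with Phase 2, assuming full-batch updates (the spec does not mention minibatching):

```python
import torch.nn.functional as F

def train(model, X, Y, steps=500, lr=1e-3):
    """Runs Adam for `steps` full-batch updates; returns per-step MSE losses."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(X), Y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

plain_model, res_model = PlainMLP(), ResMLP()
plain_losses = train(plain_model, X, Y)
res_losses = train(res_model, X, Y)
```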
### Phase 3: Final-State Analysis
- [ ] Implement PyTorch hooks to capture gradients and activations
- [ ] Run one forward/backward pass on a fresh random batch
- [ ] Capture the L2 norm of the gradient at each layer
- [ ] Capture the mean and std of the activations at each layer; see the hook sketch after this list
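One possible hook setup. It interprets "gradient at each layer" as the L2 norm of d(loss)/d(output) for each `Linear` layer, captured by a tensor hook registered inside a forward hook; activation statistics are taken on the same pre-activation outputs. Both choices are assumptions about the spec:

```python
def probe(model, X, Y):
    """One forward/backward pass, recording per-layer activation stats
    and the L2 norm of the gradient w.r.t. each Linear layer's output."""
    n = len(model.layers)
    stats = {"act_mean": [], "act_std": [], "grad_norm": [None] * n}
    handles = []

    def make_fwd_hook(idx):
        def fwd_hook(module, inputs, output):
            # Pre-activation statistics of this layer's output.
            stats["act_mean"].append(output.mean().item())
            stats["act_std"].append(output.std().item())

            def save_grad(grad):
                # Fires during backward with d(loss)/d(output).
                stats["grad_norm"][idx] = grad.norm().item()

            output.register_hook(save_grad)
        return fwd_hook

    for i, layer in enumerate(model.layers):
        handles.append(layer.register_forward_hook(make_fwd_hook(i)))

    loss = F.mse_loss(model(X), Y)
    loss.backward()
    for h in handles:
        h.remove()
    return stats

# Fresh random batch for the final-state analysis (target is still Y = X).
X_new = torch.empty(1024, 64).uniform_(-1.0, 1.0)
stats_plain = probe(plain_model, X_new, X_new)
stats_res = probe(res_model, X_new, X_new)
```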
### Phase 4: Visualization & Reporting
- [ ] Plot Training Loss vs Steps (both models)
- [ ] Plot Gradient Magnitude vs Layer Depth
- [ ] Plot Activation Mean vs Layer Depth
- [ ] Plot Activation Std vs Layer Depth
- [ ] Write a summary report with analysis; a plotting sketch follows this list
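A plotting sketch with matplotlib; the 2x2 layout, log-scale loss axis, and output filename are choices, not part of the spec:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Training loss curves for both models.
axes[0, 0].plot(plain_losses, label="PlainMLP")
axes[0, 0].plot(res_losses, label="ResMLP")
axes[0, 0].set(title="Training Loss vs Steps", xlabel="Step",
               ylabel="MSE", yscale="log")

# Per-layer statistics from the Phase 3 probe.
for name, s in [("PlainMLP", stats_plain), ("ResMLP", stats_res)]:
    axes[0, 1].plot(s["grad_norm"], label=name)
    axes[1, 0].plot(s["act_mean"], label=name)
    axes[1, 1].plot(s["act_std"], label=name)
axes[0, 1].set(title="Gradient L2 Norm vs Layer Depth", xlabel="Layer")
axes[1, 0].set(title="Activation Mean vs Layer Depth", xlabel="Layer")
axes[1, 1].set(title="Activation Std vs Layer Depth", xlabel="Layer")

for ax in axes.flat:
    ax.legend()
plt.tight_layout()
plt.savefig("plain_vs_res_mlp.png")
```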
## Expected Outcomes
- PlainMLP: gradient norms collapse toward the early layers (vanishing gradients), so it learns the identity function poorly
- ResMLP: gradient norms stay stable across depth, so it learns the identity function successfully