# PlainMLP vs ResMLP Comparison - Distant Identity Task

## Objective

Compare a 20-layer PlainMLP and a 20-layer ResMLP on a synthetic "Distant Identity" task to demonstrate the vanishing gradient problem and how residual connections solve it.

## Tasks

*(Hedged implementation sketches for each phase follow the Expected Outcomes section below.)*

### Phase 1: Implementation

- [ ] Implement PlainMLP (20 layers, hidden dim 64, ReLU, Kaiming He init)
- [ ] Implement ResMLP (20 layers, hidden dim 64, residual connections, Kaiming He init)
- [ ] Generate synthetic data (1024 vectors, dim 64, U(-1,1), Y=X)

### Phase 2: Training

- [ ] Train both models for 500 steps with Adam (lr=1e-3)
- [ ] Record MSE loss at each step

### Phase 3: Final State Analysis

- [ ] Implement PyTorch hooks for gradient and activation capture
- [ ] Perform a forward/backward pass on a new random batch
- [ ] Capture the L2 norm of gradients at each layer
- [ ] Capture the mean and std of activations at each layer

### Phase 4: Visualization & Reporting

- [ ] Plot Training Loss vs Steps (both models)
- [ ] Plot Gradient Magnitude vs Layer Depth
- [ ] Plot Activation Mean vs Layer Depth
- [ ] Plot Activation Std vs Layer Depth
- [ ] Write a summary report with analysis

## Expected Outcomes

- PlainMLP: vanishing gradients; poor fit to the identity function
- ResMLP: stable gradients; successful fit to the identity function
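
## Implementation Sketches

The sketches below are one possible realization of the task list, not a prescribed implementation. All class, function, and variable names (`PlainMLP`, `ResMLP`, `train`, `analyze`, `compare_curves`, `plain_model`, `res_model`) are illustrative choices, not names fixed by the task.

### Phase 1 sketch: models

A minimal sketch of the two architectures, assuming every layer is a 64x64 `nn.Linear`. One assumption worth flagging: the PlainMLP's final layer is left linear (no ReLU), since a trailing ReLU would make the negative half of the U(-1,1) targets unrepresentable; the task list does not pin down this detail.

```python
import torch
import torch.nn as nn


def kaiming_init(layers: nn.ModuleList) -> None:
    """Kaiming (He) initialization matched to the ReLU nonlinearity."""
    for layer in layers:
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)


class PlainMLP(nn.Module):
    """20 stacked Linear layers with ReLU between them and no skip paths."""

    def __init__(self, dim: int = 64, depth: int = 20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        kaiming_init(self.layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        # Final layer left linear (assumption): a trailing ReLU would clip
        # the negative targets of the identity task.
        return self.layers[-1](x)


class ResMLP(nn.Module):
    """Same stack, but each block adds its input back: x <- x + ReLU(Wx + b)."""

    def __init__(self, dim: int = 64, depth: int = 20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        kaiming_init(self.layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = x + torch.relu(layer(x))  # residual (skip) connection
        return x
```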
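
### Phase 2 sketch: data and training loop

A sketch of the synthetic data and the training loop, assuming full-batch training on all 1024 vectors each step; the task fixes the dataset size, optimizer, learning rate, and step count, but not the batching.

```python
import torch

torch.manual_seed(0)  # seed is an arbitrary choice for reproducibility

# 1024 vectors of dim 64 drawn from U(-1, 1); the target is the input itself
X = torch.rand(1024, 64) * 2 - 1
Y = X.clone()


def train(model: torch.nn.Module, steps: int = 500) -> list[float]:
    """Train with Adam (lr=1e-3), recording the MSE loss at each step."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X), Y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


plain_model, res_model = PlainMLP(), ResMLP()
plain_losses = train(plain_model)
res_losses = train(res_model)
```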
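
### Phase 3 sketch: hooks for gradients and activations

One way to wire the Phase 3 hooks: a forward hook on each `nn.Linear` records activation statistics, and a full backward hook records the L2 norm of the gradient arriving at that layer's output. Two judgment calls here, both left open by the task list: the statistics are taken at the Linear output (pre-ReLU), and the gradient measured is with respect to the layer's output rather than its weights.

```python
import torch


def analyze(model, dim: int = 64, batch: int = 1024):
    """Run one forward/backward pass on a fresh random batch and collect
    per-layer activation stats and gradient norms via hooks."""
    act_mean, act_std, grad_norm = {}, {}, {}
    handles = []

    for i, layer in enumerate(model.layers):  # assumes the Phase 1 sketch
        def fwd_hook(module, inputs, output, i=i):
            # Activation statistics at the Linear output (pre-ReLU)
            act_mean[i] = output.mean().item()
            act_std[i] = output.std().item()

        def bwd_hook(module, grad_input, grad_output, i=i):
            # L2 norm of the loss gradient w.r.t. this layer's output
            grad_norm[i] = grad_output[0].norm().item()

        handles.append(layer.register_forward_hook(fwd_hook))
        handles.append(layer.register_full_backward_hook(bwd_hook))

    # New random batch; requires_grad ensures backward reaches every hook
    x = (torch.rand(batch, dim) * 2 - 1).requires_grad_(True)
    model.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), x.detach())
    loss.backward()

    for h in handles:  # detach the hooks so later passes are unaffected
        h.remove()
    return act_mean, act_std, grad_norm
```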
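
### Phase 4 sketch: plots

A sketch of the four Phase 4 plots with matplotlib, reusing the results of `train` and `analyze` above. The log-scaled y-axes are a presentation choice that makes vanishing gradients easier to see; the `compare_curves` helper and the output filenames are illustrative.

```python
import matplotlib.pyplot as plt


def compare_curves(plain, res, xlabel, ylabel, title, logy=False):
    """Overlay one metric for both models and save the figure as a PNG."""
    plt.figure()
    plt.plot(plain, label="PlainMLP")
    plt.plot(res, label="ResMLP")
    if logy:
        plt.yscale("log")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend()
    plt.savefig(title.lower().replace(" ", "_") + ".png")
    plt.close()


# Plot 1: training loss vs steps
compare_curves(plain_losses, res_losses, "Step", "MSE loss",
               "Training Loss vs Steps", logy=True)

# Plots 2-4: per-layer diagnostics from the Phase 3 hooks
p_mean, p_std, p_grad = analyze(plain_model)
r_mean, r_std, r_grad = analyze(res_model)
depth = sorted(p_grad)
compare_curves([p_grad[i] for i in depth], [r_grad[i] for i in depth],
               "Layer depth", "Gradient L2 norm",
               "Gradient Magnitude vs Layer Depth", logy=True)
compare_curves([p_mean[i] for i in depth], [r_mean[i] for i in depth],
               "Layer depth", "Activation mean",
               "Activation Mean vs Layer Depth")
compare_curves([p_std[i] for i in depth], [r_std[i] for i in depth],
               "Layer depth", "Activation std",
               "Activation Std vs Layer Depth")
```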