
PlainMLP vs ResMLP Comparison - Distant Identity Task

Objective

Compare a 20-layer PlainMLP with a 20-layer ResMLP on a synthetic "Distant Identity" task (learning Y = X through a deep stack of layers) to demonstrate the vanishing-gradient problem and how residual connections mitigate it.

Tasks

Phase 1: Implementation

  • Implement PlainMLP (20 layers, hidden dim 64, ReLU, Kaiming/He initialization)
  • Implement ResMLP (same architecture plus residual connections, Kaiming/He initialization)
  • Generate synthetic data (1024 vectors, dim 64, entries drawn from U(-1, 1), targets Y = X)
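The Phase 1 items above can be sketched as follows. This is a minimal sketch, not the canonical implementation: the class names match the task description, but the exact residual formulation (`x + relu(layer(x))`) and the use of `kaiming_normal_` are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class PlainMLP(nn.Module):
    """20 stacked Linear+ReLU layers with no skip connections."""
    def __init__(self, depth=20, dim=64):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
        for layer in self.layers:
            # Kaiming/He initialization, matched to the ReLU nonlinearity
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

class ResMLP(PlainMLP):
    """Same stack, but each block's output is added back to its input."""
    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))  # residual skip connection (assumed form)
        return x

# Synthetic "Distant Identity" data: inputs ~ U(-1, 1), targets Y = X
X = torch.empty(1024, 64).uniform_(-1.0, 1.0)
Y = X.clone()
```

Note that because every PlainMLP layer ends in ReLU, its output is non-negative and cannot match negative targets exactly; this compounds the task's difficulty for the plain network.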

Phase 2: Training

  • Train both models for 500 steps with Adam (lr=1e-3)
  • Record MSE loss at each step
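A training loop covering both Phase 2 bullets might look like the sketch below. The shallow `nn.Sequential` stand-in model and the shortened run (50 steps rather than the full 500) are placeholders for illustration; in the real experiment the loop would be run once per model from Phase 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical shallow stand-in; swap in the PlainMLP / ResMLP from Phase 1
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

def train(model, X, Y, steps=500, lr=1e-3):
    """Train with Adam on MSE and return the per-step loss curve."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), Y)
        loss.backward()
        opt.step()
        losses.append(loss.item())  # record MSE at each step
    return losses

X = torch.empty(1024, 64).uniform_(-1.0, 1.0)
losses = train(model, X, X.clone(), steps=50)  # shortened run for illustration
```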

Phase 3: Final State Analysis

  • Implement PyTorch hooks for gradient and activation capture
  • Perform a forward/backward pass on a new random batch
  • Capture L2 norm of gradients at each layer
  • Capture mean and std of activations at each layer
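The hook-based capture in Phase 3 could be sketched as below. The freshly initialized 20-block `nn.Sequential` is a hypothetical stand-in (the real analysis would run on the trained models), and the batch size of 32 is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in: 20 Linear+ReLU blocks (use the trained models in practice)
depth, dim = 20, 64
model = nn.Sequential(*[nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                        for _ in range(depth)])

act_stats = {}  # layer index -> (mean, std) of that block's activation

def make_act_hook(i):
    # Forward hook: record activation statistics for block i
    def hook(module, inputs, output):
        act_stats[i] = (output.mean().item(), output.std().item())
    return hook

for i, block in enumerate(model):
    block.register_forward_hook(make_act_hook(i))

# One forward/backward pass on a fresh random batch
x = torch.empty(32, dim).uniform_(-1.0, 1.0)
loss = nn.functional.mse_loss(model(x), x)
loss.backward()

# L2 norm of each block's weight gradient, indexed by depth
grad_norms = {i: block[0].weight.grad.norm().item()
              for i, block in enumerate(model)}
```

Reading the weight gradients directly after `backward()` avoids needing backward hooks; a `register_full_backward_hook` on each block is an equivalent alternative.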

Phase 4: Visualization & Reporting

  • Plot Training Loss vs Steps (both models)
  • Plot Gradient Magnitude vs Layer Depth
  • Plot Activation Mean vs Layer Depth
  • Plot Activation Std vs Layer Depth
  • Write summary report with analysis
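The loss-curve plot from Phase 4 could be produced along these lines; the other three plots follow the same pattern with layer depth on the x-axis. The curves here are hypothetical placeholders, and the filename `loss_curves.png` is an assumption; substitute the losses recorded in Phase 2.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

# Placeholder curves (hypothetical shapes; substitute the recorded losses)
steps = range(500)
plain_losses = [0.35 * 0.999 ** s for s in steps]
res_losses = [0.35 * 0.99 ** s for s in steps]

plt.figure()
plt.plot(steps, plain_losses, label="PlainMLP")
plt.plot(steps, res_losses, label="ResMLP")
plt.xlabel("Training step")
plt.ylabel("MSE loss")
plt.yscale("log")  # log scale makes the gap between models easier to read
plt.legend()
plt.title("Training Loss vs Steps")
plt.savefig("loss_curves.png")
```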

Expected Outcomes

  • PlainMLP: Vanishing gradients, poor learning of identity function
  • ResMLP: Stable gradients, successful learning of identity function