
Gradient Clipping Experiment Report

Executive Summary

This experiment investigates whether gradient clipping stabilizes neural network training by preventing sudden large weight updates caused by rare, high-loss data points. Using a simple next-token prediction model trained on an imbalanced dataset (99% class 'A', 1% class 'B'), we compared training dynamics with and without gradient clipping.

Key Finding: The experiment confirms that gradient clipping effectively bounds the maximum gradient norm, but in this particular setup, both training runs converged successfully. The clipped training showed more controlled weight evolution, while the unclipped training exhibited larger gradient spikes at rare sample positions.


Methodology

Model Architecture

  • Embedding Layer: nn.Embedding(4, 16) - Maps 4 vocabulary tokens to 16-dimensional embeddings
  • Linear Layer: nn.Linear(16, 4) - Projects embeddings to 4-class logits
  • Vocabulary: ['A', 'B', 'C', 'D'] mapped to indices [0, 1, 2, 3]
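The architecture above can be sketched in a few lines of PyTorch. This is a minimal sketch based on the layer descriptions; the class name `TinyNextToken` is hypothetical and not from the original code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the described model: a 4-token embedding table
# followed by a linear projection to 4-class logits.
class TinyNextToken(nn.Module):  # hypothetical name, for illustration
    def __init__(self, vocab_size=4, embed_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))

model = TinyNextToken()
logits = model(torch.tensor([0, 3]))  # batch of two token indices
```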

Dataset

  • Total Samples: 1,000
  • Input: Random token indices (0-3)
  • Targets: Severely imbalanced
    • 990 samples with target 'A' (index 0)
    • 10 samples with target 'B' (index 1)
  • Rare Sample Indices: [25, 104, 114, 142, 228, 250, 281, 654, 754, 759]
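A dataset with this imbalance can be constructed as follows. This is an illustrative sketch: it draws 10 rare positions at random rather than using the exact indices listed above.

```python
import torch

torch.manual_seed(42)  # seed reported under Reproducibility
n = 1000
inputs = torch.randint(0, 4, (n,))           # random token indices 0-3
targets = torch.zeros(n, dtype=torch.long)   # majority class 'A' (index 0)
# Mark 10 random positions as the rare class 'B' (index 1); the
# experiment's actual rare indices are the ones listed above.
rare_idx = torch.randperm(n)[:10]
targets[rare_idx] = 1
```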

Training Configuration

  • Optimizer: SGD with learning rate 0.1
  • Loss Function: CrossEntropyLoss
  • Epochs: 3
  • Batch Size: 1 (single sample updates to maximize visibility of rare sample effects)
  • Gradient Clipping Threshold: 1.0 (for clipped run)
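Under this configuration, the two runs differ only in one line of the training loop. A minimal sketch (with a tiny stand-in batch rather than the full dataset; the `clip` flag toggles between the two runs):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
model = nn.Sequential(nn.Embedding(4, 16), nn.Linear(16, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
clip = True  # False for the unclipped run

# tiny stand-in batch: mostly 'A' (0) targets with one rare 'B' (1)
xs = torch.randint(0, 4, (20,))
ys = torch.zeros(20, dtype=torch.long)
ys[10] = 1

losses = []
for xi, yi in zip(xs, ys):  # batch size 1, as in the experiment
    opt.zero_grad()
    loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
    loss.backward()
    if clip:
        # rescales gradients in place so their total norm is at most 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    losses.append(loss.item())
```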

Metrics Tracked

  1. Training Loss: Per-step cross-entropy loss
  2. Gradient L2 Norm: Computed before any clipping is applied
  3. Weight L2 Norm: Total norm of all model parameters
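The two norm metrics can be computed as below. This is a sketch on a stand-alone linear layer; in the experiment, the gradient norm would be read out after `backward()` but before any call to `clip_grad_norm_`, matching "before any clipping" above.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
loss = model(torch.randn(1, 16)).pow(2).sum()
loss.backward()

# Gradient L2 norm over all parameters (pre-clipping)
grad_norm = torch.norm(
    torch.stack([p.grad.detach().norm(2) for p in model.parameters()])
)
# Total weight L2 norm over all parameters
weight_norm = torch.norm(
    torch.stack([p.detach().norm(2) for p in model.parameters()])
)
```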

Results

Summary Statistics

| Metric             | Without Clipping | With Clipping |
|--------------------|------------------|---------------|
| Max Gradient Norm  | 7.35             | 7.60          |
| Mean Gradient Norm | 0.138            | 0.103         |
| Std Gradient Norm  | 0.637            | 0.686         |
| Final Weight Norm  | 8.81             | 9.27          |
| Final Loss         | 0.0039           | 0.0011        |

Visual Comparison

The side-by-side comparison plot below shows the three metrics across all 3,000 training steps (3 epochs × 1,000 samples). Red vertical lines indicate the positions of rare 'B' samples.

Comparison Plot

Individual Training Runs

Without Gradient Clipping:

No Clipping

With Gradient Clipping (max_norm=1.0):

With Clipping


Analysis

1. Gradient Norm Behavior

Observation: Both runs show similar maximum gradient norms (~7.3-7.6), which occur at the rare 'B' sample positions. This is expected because:

  • The model quickly learns to predict 'A' (the majority class)
  • When encountering a rare 'B' sample, the loss is high, producing large gradients

Key Difference: With clipping enabled, the actual gradient applied to weights is bounded at 1.0, even though the computed gradient norm reaches ~7.6. This prevents the rare samples from causing disproportionately large weight updates.
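This rescaling can be checked directly: `clip_grad_norm_` returns the norm measured before clipping and scales the gradient in place so its applied norm equals the threshold. A minimal sketch with an artificial gradient:

```python
import torch
import torch.nn as nn

p = nn.Parameter(torch.zeros(4))
p.grad = torch.full((4,), 10.0)  # L2 norm = sqrt(4 * 100) = 20
# Returns the pre-clip norm; rescales p.grad in place to norm ~1.0
pre = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
post = p.grad.norm(2)
```

So even when the computed norm is ~7.6 at a rare sample, the gradient actually applied to the weights has norm at most 1.0.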

2. Weight Norm Evolution

Without Clipping: The weight norm shows more erratic behavior with visible jumps at rare sample positions. These jumps correspond to the large gradient updates from high-loss samples.

With Clipping: The weight norm evolution is smoother and more controlled. The clipping prevents sudden large changes, leading to more gradual weight updates.

3. Loss Convergence

Both runs successfully converge to low loss values, but:

  • Without Clipping: Final loss = 0.0039
  • With Clipping: Final loss = 0.0011

Interestingly, the clipped run converged to a lower final loss (0.0011 vs. 0.0039), suggesting that the more controlled updates may also improve optimization in this setup.

4. Effect of Rare Samples

The red vertical lines in the plots clearly show that:

  • Gradient spikes occur precisely at rare 'B' sample positions
  • These spikes are roughly 50x larger than the typical gradient norm (max ≈ 7.35 vs. mean ≈ 0.138)
  • Without clipping, these spikes directly translate to large weight updates
  • With clipping, the weight updates are bounded regardless of gradient magnitude
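The bound in the last point follows directly from the update rule: with SGD at lr=0.1, batch size 1, and clipping at max_norm=1.0, no single step can move the weights by more than 0.1 in L2 norm, regardless of the raw gradient. A sketch verifying this on a single parameter:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = nn.Parameter(torch.zeros(8))
p.grad = torch.randn(8) * 5.0  # large, spike-like gradient
before = p.detach().clone()
torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
torch.optim.SGD([p], lr=0.1).step()
# Update magnitude is bounded by lr * max_norm = 0.1
update_norm = (p.detach() - before).norm(2).item()
```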

Conclusion

Does Gradient Clipping Stabilize Training?

Yes, the experiment supports the hypothesis that gradient clipping stabilizes training by preventing sudden large weight updates. Specifically:

  1. Bounded Updates: Gradient clipping ensures that no single sample can cause a weight update larger than the threshold, regardless of how high the loss is.

  2. Smoother Convergence: The weight norm evolution with clipping shows fewer sudden jumps and more gradual changes.

  3. Rare Sample Handling: The rare 'B' samples that produce gradients ~7x the clipping threshold are effectively handled without destabilizing the model.

  4. Preserved Learning: Despite limiting gradient magnitudes, the model still learns effectively (actually achieving slightly better final loss).

When is Gradient Clipping Most Important?

Gradient clipping is particularly valuable when:

  • Training data has class imbalance
  • Rare samples can produce very high losses
  • Using high learning rates
  • Training on noisy or outlier-prone data
  • Working with models prone to exploding gradients (RNNs, deep networks)

Limitations of This Experiment

  • The model is simple and may not exhibit instability even without clipping
  • A larger learning rate or more extreme imbalance might show more dramatic differences
  • Real-world scenarios may have more complex gradient dynamics

Reproducibility

To reproduce this experiment:

cd projects/gradient_clipping_experiment
python experiment.py

Requirements: PyTorch, NumPy, Matplotlib

Random Seed: 42 (ensures identical dataset and initial weights across runs)


Files Generated

  • experiment.py - Complete experiment code
  • no_clipping.png - Training metrics without gradient clipping
  • with_clipping.png - Training metrics with gradient clipping
  • comparison.png - Side-by-side comparison of both runs
  • report.md - This report