
Gradient Clipping Experiment Report

Executive Summary

This experiment investigates whether gradient clipping stabilizes neural network training by preventing sudden large weight updates caused by rare, high-loss data points. Using a simple next-token prediction model trained on an imbalanced dataset (99% class 'A', 1% class 'B'), we compared training dynamics with and without gradient clipping.

Key Finding: The experiment confirms that gradient clipping effectively bounds the maximum gradient norm, but in this particular setup, both training runs converged successfully. The clipped training showed more controlled weight evolution, while the unclipped training exhibited larger gradient spikes at rare sample positions.


Methodology

Model Architecture

  • Embedding Layer: nn.Embedding(4, 16) - Maps 4 vocabulary tokens to 16-dimensional embeddings
  • Linear Layer: nn.Linear(16, 4) - Projects embeddings to 4-class logits
  • Vocabulary: ['A', 'B', 'C', 'D'] mapped to indices [0, 1, 2, 3]
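The architecture above can be sketched in a few lines of PyTorch. This is a minimal sketch based on the layer descriptions; the class name `TinyNextToken` is hypothetical and not from the original code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the described model: a 4-token embedding table
# followed by a linear projection to 4-class logits.
class TinyNextToken(nn.Module):  # hypothetical name, for illustration
    def __init__(self, vocab_size=4, embed_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))

model = TinyNextToken()
logits = model(torch.tensor([0, 3]))  # batch of two token indices
```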

Dataset

  • Total Samples: 1,000
  • Input: Random token indices (0-3)
  • Targets: Severely imbalanced
    • 990 samples with target 'A' (index 0)
    • 10 samples with target 'B' (index 1)
  • Rare Sample Indices: [25, 104, 114, 142, 228, 250, 281, 654, 754, 759]
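A dataset with this imbalance can be constructed as follows. This is an illustrative sketch: it draws 10 rare positions at random rather than using the exact indices listed above.

```python
import torch

torch.manual_seed(42)  # seed reported under Reproducibility
n = 1000
inputs = torch.randint(0, 4, (n,))           # random token indices 0-3
targets = torch.zeros(n, dtype=torch.long)   # majority class 'A' (index 0)
# Mark 10 random positions as the rare class 'B' (index 1); the
# experiment's actual rare indices are the ones listed above.
rare_idx = torch.randperm(n)[:10]
targets[rare_idx] = 1
```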

Training Configuration

  • Optimizer: SGD with learning rate 0.1
  • Loss Function: CrossEntropyLoss
  • Epochs: 3
  • Batch Size: 1 (single sample updates to maximize visibility of rare sample effects)
  • Gradient Clipping Threshold: 1.0 (for clipped run)
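Under this configuration, the two runs differ only in one line of the training loop. A minimal sketch (with a tiny stand-in batch rather than the full dataset; the `clip` flag toggles between the two runs):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
model = nn.Sequential(nn.Embedding(4, 16), nn.Linear(16, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
clip = True  # False for the unclipped run

# tiny stand-in batch: mostly 'A' (0) targets with one rare 'B' (1)
xs = torch.randint(0, 4, (20,))
ys = torch.zeros(20, dtype=torch.long)
ys[10] = 1

losses = []
for xi, yi in zip(xs, ys):  # batch size 1, as in the experiment
    opt.zero_grad()
    loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
    loss.backward()
    if clip:
        # rescales gradients in place so their total norm is at most 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    losses.append(loss.item())
```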

Metrics Tracked

  1. Training Loss: Per-step cross-entropy loss
  2. Gradient L2 Norm: Computed before any clipping is applied
  3. Weight L2 Norm: Total norm of all model parameters
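The two norm metrics can be computed as below. This is a sketch on a stand-alone linear layer; in the experiment, the gradient norm would be read out after `backward()` but before any call to `clip_grad_norm_`, matching "before any clipping" above.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
loss = model(torch.randn(1, 16)).pow(2).sum()
loss.backward()

# Gradient L2 norm over all parameters (pre-clipping)
grad_norm = torch.norm(
    torch.stack([p.grad.detach().norm(2) for p in model.parameters()])
)
# Total weight L2 norm over all parameters
weight_norm = torch.norm(
    torch.stack([p.detach().norm(2) for p in model.parameters()])
)
```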

Results

Summary Statistics

| Metric             | Without Clipping | With Clipping |
|--------------------|------------------|---------------|
| Max Gradient Norm  | 7.35             | 7.60          |
| Mean Gradient Norm | 0.138            | 0.103         |
| Std Gradient Norm  | 0.637            | 0.686         |
| Final Weight Norm  | 8.81             | 9.27          |
| Final Loss         | 0.0039           | 0.0011        |

Visual Comparison

The side-by-side comparison plot below shows the three metrics across all 3,000 training steps (3 epochs × 1,000 samples). Red vertical lines indicate the positions of rare 'B' samples.

Comparison Plot

Individual Training Runs

Without Gradient Clipping:

No Clipping

With Gradient Clipping (max_norm=1.0):

With Clipping


Analysis

1. Gradient Norm Behavior

Observation: Both runs show similar maximum gradient norms (~7.3-7.6), which occur at the rare 'B' sample positions. This is expected because:

  • The model quickly learns to predict 'A' (the majority class)
  • When encountering a rare 'B' sample, the loss is high, producing large gradients

Key Difference: With clipping enabled, the actual gradient applied to weights is bounded at 1.0, even though the computed gradient norm reaches ~7.6. This prevents the rare samples from causing disproportionately large weight updates.
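This rescaling can be checked directly: `clip_grad_norm_` returns the norm measured before clipping and scales the gradient in place so its applied norm equals the threshold. A minimal sketch with an artificial gradient:

```python
import torch
import torch.nn as nn

p = nn.Parameter(torch.zeros(4))
p.grad = torch.full((4,), 10.0)  # L2 norm = sqrt(4 * 100) = 20
# Returns the pre-clip norm; rescales p.grad in place to norm ~1.0
pre = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
post = p.grad.norm(2)
```

So even when the computed norm is ~7.6 at a rare sample, the gradient actually applied to the weights has norm at most 1.0.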

2. Weight Norm Evolution

Without Clipping: The weight norm shows more erratic behavior with visible jumps at rare sample positions. These jumps correspond to the large gradient updates from high-loss samples.

With Clipping: The weight norm evolution is smoother and more controlled. The clipping prevents sudden large changes, leading to more gradual weight updates.

3. Loss Convergence

Both runs successfully converge to low loss values, but:

  • Without Clipping: Final loss = 0.0039
  • With Clipping: Final loss = 0.0011

Interestingly, the clipped run converged to a lower final loss (0.0011 vs. 0.0039), suggesting that the more controlled updates may also improve optimization in this setup.

4. Effect of Rare Samples

The red vertical lines in the plots clearly show that:

  • Gradient spikes occur precisely at rare 'B' sample positions
  • These spikes are roughly 50x larger than the typical gradient norm (max ≈ 7.35 vs. mean ≈ 0.138)
  • Without clipping, these spikes directly translate to large weight updates
  • With clipping, the weight updates are bounded regardless of gradient magnitude
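The bound in the last point follows directly from the update rule: with SGD at lr=0.1, batch size 1, and clipping at max_norm=1.0, no single step can move the weights by more than 0.1 in L2 norm, regardless of the raw gradient. A sketch verifying this on a single parameter:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = nn.Parameter(torch.zeros(8))
p.grad = torch.randn(8) * 5.0  # large, spike-like gradient
before = p.detach().clone()
torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
torch.optim.SGD([p], lr=0.1).step()
# Update magnitude is bounded by lr * max_norm = 0.1
update_norm = (p.detach() - before).norm(2).item()
```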

Conclusion

Does Gradient Clipping Stabilize Training?

Yes, the experiment supports the hypothesis that gradient clipping stabilizes training by preventing sudden large weight updates. Specifically:

  1. Bounded Updates: Gradient clipping ensures that no single sample can cause a weight update larger than the threshold, regardless of how high the loss is.

  2. Smoother Convergence: The weight norm evolution with clipping shows fewer sudden jumps and more gradual changes.

  3. Rare Sample Handling: The rare 'B' samples that produce gradients ~7x the clipping threshold are effectively handled without destabilizing the model.

  4. Preserved Learning: Despite limiting gradient magnitudes, the model still learns effectively (actually achieving slightly better final loss).

When is Gradient Clipping Most Important?

Gradient clipping is particularly valuable when:

  • Training data has class imbalance
  • Rare samples can produce very high losses
  • Using high learning rates
  • Training on noisy or outlier-prone data
  • Working with models prone to exploding gradients (RNNs, deep networks)

Limitations of This Experiment

  • The model is simple and may not exhibit instability even without clipping
  • A larger learning rate or more extreme imbalance might show more dramatic differences
  • Real-world scenarios may have more complex gradient dynamics

Reproducibility

To reproduce this experiment:

cd projects/gradient_clipping_experiment
python experiment.py

Requirements: PyTorch, NumPy, Matplotlib

Random Seed: 42 (ensures identical dataset and initial weights across runs)


Files Generated

  • experiment.py - Complete experiment code
  • no_clipping.png - Training metrics without gradient clipping
  • with_clipping.png - Training metrics with gradient clipping
  • comparison.png - Side-by-side comparison of both runs
  • report.md - This report