# Gradient Clipping Experiment Report

## Executive Summary

This experiment investigates whether gradient clipping stabilizes neural network training by preventing sudden large weight updates caused by rare, high-loss data points. Using a simple next-token prediction model trained on an imbalanced dataset (99% class 'A', 1% class 'B'), we compared training dynamics with and without gradient clipping.

**Key Finding**: The experiment confirms that gradient clipping effectively bounds the maximum gradient norm, but in this particular setup, both training runs converged successfully. The clipped run showed more controlled weight evolution, while the unclipped run exhibited larger gradient spikes at rare-sample positions.

---

## Methodology

### Model Architecture

- **Embedding Layer**: `nn.Embedding(4, 16)` - maps 4 vocabulary tokens to 16-dimensional embeddings
- **Linear Layer**: `nn.Linear(16, 4)` - projects embeddings to 4-class logits
- **Vocabulary**: ['A', 'B', 'C', 'D'] mapped to indices [0, 1, 2, 3]

### Dataset

- **Total Samples**: 1,000
- **Input**: Random token indices (0-3)
- **Targets**: Severely imbalanced
  - 990 samples with target 'A' (index 0)
  - 10 samples with target 'B' (index 1)
- **Rare Sample Indices**: [25, 104, 114, 142, 228, 250, 281, 654, 754, 759]

### Training Configuration

- **Optimizer**: SGD with learning rate 0.1
- **Loss Function**: CrossEntropyLoss
- **Epochs**: 3
- **Batch Size**: 1 (single-sample updates to maximize the visibility of rare-sample effects)
- **Gradient Clipping Threshold**: 1.0 (for the clipped run)

### Metrics Tracked

1. **Training Loss**: Per-step cross-entropy loss
2. **Gradient L2 Norm**: Computed before any clipping is applied
3. **Weight L2 Norm**: Total norm of all model parameters

---

## Results

### Summary Statistics

| Metric | Without Clipping | With Clipping |
|--------|------------------|---------------|
| Max Gradient Norm | 7.35 | 7.60 |
| Mean Gradient Norm | 0.138 | 0.103 |
| Std Gradient Norm | 0.637 | 0.686 |
| Final Weight Norm | 8.81 | 9.27 |
| Final Loss | 0.0039 | 0.0011 |

### Visual Comparison

The side-by-side comparison plot below shows the three metrics across all 3,000 training steps (3 epochs × 1,000 samples). Red vertical lines indicate the positions of rare 'B' samples.

![Comparison Plot](comparison.png)

### Individual Training Runs

**Without Gradient Clipping:**

![No Clipping](no_clipping.png)

**With Gradient Clipping (max_norm=1.0):**

![With Clipping](with_clipping.png)

---

## Analysis

### 1. Gradient Norm Behavior

**Observation**: Both runs show similar maximum gradient norms (~7.3-7.6), which occur at the rare 'B' sample positions. This is expected because:

- The model quickly learns to predict 'A' (the majority class)
- When encountering a rare 'B' sample, the loss is high, producing large gradients

**Key Difference**: With clipping enabled, the gradient actually applied to the weights is rescaled to norm 1.0, even though the computed gradient norm reaches ~7.6. This prevents rare samples from causing disproportionately large weight updates.

### 2. Weight Norm Evolution

**Without Clipping**: The weight norm shows more erratic behavior, with visible jumps at rare-sample positions. These jumps correspond to the large gradient updates from high-loss samples.

**With Clipping**: The weight norm evolution is smoother and more controlled. Clipping prevents sudden large changes, leading to more gradual weight updates.
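The per-step update described above can be sketched in PyTorch as follows. This is a minimal illustration using the hyperparameters from the Methodology; the function and variable names (`train_step`, `clip`) are ours and need not match the original `experiment.py`:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Toy model matching the Methodology: 4-token vocab -> 16-dim embedding -> 4 logits
model = nn.Sequential(nn.Embedding(4, 16), nn.Linear(16, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, clip=None):
    """One batch-size-1 SGD step; returns (loss, pre-clip gradient L2 norm)."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Gradient L2 norm over all parameters, measured BEFORE any clipping
    grad_norm = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters()])
    ).item()
    if clip is not None:
        # Rescales all gradients in place so their total norm is at most `clip`
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)
    optimizer.step()
    return loss.item(), grad_norm

# A rare 'B' target (index 1) produces a comparatively large gradient,
# but with clip=1.0 the applied update stays bounded.
loss, grad_norm = train_step(torch.tensor([2]), torch.tensor([1]), clip=1.0)
```

Because the norm is recorded before `clip_grad_norm_` is called, this sketch reproduces the report's convention of plotting the unclipped gradient norm even in the clipped run.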
### 3. Loss Convergence

Both runs successfully converge to low loss values:

- **Without Clipping**: Final loss = 0.0039
- **With Clipping**: Final loss = 0.0011

Interestingly, the clipped run achieved a slightly lower final loss, suggesting that the more controlled updates may lead to better optimization in this case.

### 4. Effect of Rare Samples

The red vertical lines in the plots clearly show that:

- Gradient spikes occur precisely at rare 'B' sample positions
- These spikes are ~50x larger than the typical gradient norm
- Without clipping, these spikes translate directly into large weight updates
- With clipping, the weight updates are bounded regardless of the raw gradient magnitude

---

## Conclusion

### Does Gradient Clipping Stabilize Training?

**Yes**, the experiment supports the hypothesis that gradient clipping stabilizes training by preventing sudden large weight updates. Specifically:

1. **Bounded Updates**: Gradient clipping ensures that no single sample can cause a weight update larger than the threshold allows, regardless of how high the loss is.
2. **Smoother Convergence**: The weight norm evolution with clipping shows fewer sudden jumps and more gradual changes.
3. **Rare Sample Handling**: The rare 'B' samples, which produce gradients ~7x the clipping threshold, are handled without destabilizing the model.
4. **Preserved Learning**: Despite limiting gradient magnitudes, the model still learns effectively (in fact achieving a slightly lower final loss).

### When is Gradient Clipping Most Important?
Gradient clipping is particularly valuable when:

- Training data has class imbalance
- Rare samples can produce very high losses
- Using high learning rates
- Training on noisy or outlier-prone data
- Working with models prone to exploding gradients (RNNs, very deep networks)

### Limitations of This Experiment

- The model is simple and may not exhibit instability even without clipping
- A larger learning rate or a more extreme imbalance might show more dramatic differences
- Real-world scenarios may have more complex gradient dynamics

---

## Reproducibility

To reproduce this experiment:

```bash
cd projects/gradient_clipping_experiment
python experiment.py
```

**Requirements**: PyTorch, NumPy, Matplotlib

**Random Seed**: 42 (ensures identical dataset and initial weights across runs)

---

## Files Generated

- `experiment.py` - Complete experiment code
- `no_clipping.png` - Training metrics without gradient clipping
- `with_clipping.png` - Training metrics with gradient clipping
- `comparison.png` - Side-by-side comparison of both runs
- `report.md` - This report
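As a closing sanity check, the ~7x rescaling discussed in the Analysis and Conclusion can be verified in isolation: PyTorch's `clip_grad_norm_` returns the total norm *before* clipping and rescales the gradients in place. The gradient values below are chosen to mirror the observed peak norm of ~7.6; this snippet is a standalone illustration, not part of `experiment.py`:

```python
import torch

# A single parameter whose gradient has L2 norm sqrt(4 * 3.8^2) = 7.6,
# mirroring the peak gradient norm observed in the clipped run
p = torch.nn.Parameter(torch.zeros(4))
p.grad = torch.full((4,), 3.8)

# Returns the total norm BEFORE clipping, then rescales the gradient in place
pre_clip = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)

print(pre_clip.item())        # ~7.6: the unclipped norm, as plotted in the report
print(p.grad.norm().item())   # ~1.0: the norm actually applied by the optimizer
```

This is why the gradient-norm plots for the clipped run still show spikes at 7.6: the recorded metric is the pre-clip norm, while the update applied to the weights is scaled down by a factor of roughly 1/7.6.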