# 🧠 Activation Functions: Deep Neural Network Analysis
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
> **Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.**
This repository provides a comprehensive comparison of five activation functions in deep neural networks, demonstrating the **vanishing gradient problem** with Sigmoid and showing why modern activations make deep networks trainable.
---
## 🎯 Key Findings
| Activation | Final MSE | Gradient Ratio (L1/L10) | Status |
|------------|-----------|-------------------------|--------|
| **ReLU** | **0.008** | 1.93 (stable) | ✅ Excellent |
| **Leaky ReLU** | **0.008** | 0.72 (stable) | ✅ Excellent |
| **GELU** | **0.008** | 0.83 (stable) | ✅ Excellent |
| Linear | 0.213 | 0.84 (stable) | ⚠️ Cannot learn non-linearity |
| Sigmoid | 0.518 | **2.59×10⁷** (vanishing!) | ❌ Failed |
### 🔬 The Vanishing Gradient Problem - Visualized
```
Sigmoid Network (10 layers):
Layer 1  ████████████████████████████████████████  Gradient: 5.04×10⁻¹
Layer 5  ████████████                              Gradient: 1.02×10⁻⁴
Layer 10 ▏                                         Gradient: 1.94×10⁻⁸  ← 26 MILLION times smaller!

ReLU Network (10 layers):
Layer 1  ████████████████████████████████████████  Gradient: 2.70×10⁻³
Layer 5  ██████████████████████████████████████    Gradient: 2.10×10⁻³
Layer 10 ████████████████████████████████████████  Gradient: 1.36×10⁻³  ← Healthy flow!
```
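The per-layer magnitudes above can be measured directly in PyTorch. Below is a minimal sketch (not the repo's exact script; `make_mlp` and its parameters are illustrative) that builds the 10-layer, 64-unit MLP, runs one backward pass on the sine task, and prints each hidden layer's `weight.grad.norm()`:
```python
import torch
import torch.nn as nn

def make_mlp(act, depth=10, width=64):
    """Deep MLP: `depth` hidden layers of `width` units, 1D input and output."""
    layers, in_dim = [], 1
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), act()]
        in_dim = width
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

torch.manual_seed(42)
x = torch.linspace(-torch.pi, torch.pi, 200).unsqueeze(1)
y = torch.sin(x)

for act in (nn.Sigmoid, nn.ReLU):
    model = make_mlp(act)
    nn.functional.mse_loss(model(x), y).backward()
    grads = [m.weight.grad.norm().item() for m in model if isinstance(m, nn.Linear)]
    print(act.__name__, [f"{g:.1e}" for g in grads[:10]])  # hidden layers 1..10
```
Exact numbers vary with initialization, but the pattern should reproduce: Sigmoid's gradient norms collapse by orders of magnitude across the stack, while ReLU's stay within a small factor of each other.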
---
## 📊 Visual Results
### Learned Functions
![Learned Functions](learned_functions.png)
*ReLU, Leaky ReLU, and GELU perfectly approximate the sine wave. Linear learns only a straight line. Sigmoid completely fails to learn.*
### Training Dynamics
![Loss Curves](loss_curves.png)
### Gradient Flow Analysis
![Gradient Flow](gradient_flow.png)
### Comprehensive Summary
![Summary](summary_figure.png)
---
## 🧪 Experimental Setup
### Architecture
- **Network**: 10 hidden layers × 64 neurons each
- **Task**: 1D non-linear regression (sine-wave approximation)
- **Dataset**: `y = sin(x) + ε`, where `x ∈ [-π, π]` and `ε ~ N(0, 0.1)`
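A minimal sketch of how such a dataset can be generated (names and shapes are illustrative, not the repo's exact code):
```python
import torch

torch.manual_seed(42)                            # fixed seed for reproducibility
x = (torch.rand(200, 1) * 2 - 1) * torch.pi      # 200 points, uniform in [-π, π]
y = torch.sin(x) + 0.1 * torch.randn(200, 1)     # y = sin(x) + ε, ε ~ N(0, 0.1)
```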
### Training Configuration
```python
import torch

torch.manual_seed(42)                                       # fixed seed
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # model: the 10×64 MLP
loss_fn = torch.nn.MSELoss()
epochs = 500
batch_size = 200  # full batch: every step uses all 200 samples
```
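Put together, the full-batch loop might look like this (a sketch assuming `model` is one of the 10×64 MLPs and `x`, `y` are the dataset above):
```python
for epoch in range(epochs):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # one step over all 200 samples
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f"epoch {epoch:3d}: MSE = {loss.item():.4f}")
```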
### Activation Functions Tested
| Function | Formula | Gradient Range |
|----------|---------|----------------|
| Linear | `f(x) = x` | Always 1 |
| Sigmoid | `f(x) = 1/(1+e⁻ˣ)` | (0, 0.25] |
| ReLU | `f(x) = max(0, x)` | {0, 1} |
| Leaky ReLU | `f(x) = max(0.01x, x)` | {0.01, 1} |
| GELU | `f(x) = x·Φ(x)` | Smooth; dips slightly below 0, peaks just above 1 |
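All five have standard PyTorch counterparts; a sketch of the lineup (each entry can be passed as `act` to the `make_mlp` sketch earlier):
```python
import torch.nn as nn

ACTIVATIONS = {
    "Linear":     nn.Identity,                 # no non-linearity at all
    "Sigmoid":    nn.Sigmoid,
    "ReLU":       nn.ReLU,
    "Leaky ReLU": lambda: nn.LeakyReLU(0.01),  # small slope for x < 0
    "GELU":       nn.GELU,
}
```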
---
## 🚀 Quick Start
### Installation
```bash
git clone https://huggingface.co/AmberLJC/activation_functions
cd activation_functions
pip install torch numpy matplotlib
```
### Run the Experiment
```bash
# Basic 5-activation comparison
python train.py
# Extended tutorial with 8 activations and 4 experiments
python tutorial_experiments.py
# Training dynamics analysis
python train_dynamics.py
```
---
## 📁 Repository Structure
```
activation_functions/
├── README.md                           # This file
├── report.md                           # Detailed analysis report
├── activation_tutorial.md              # Educational tutorial
│
├── train.py                            # Main experiment (5 activations)
├── tutorial_experiments.py             # Extended experiments (8 activations)
├── train_dynamics.py                   # Training dynamics analysis
│
├── learned_functions.png               # Predictions vs ground truth
├── loss_curves.png                     # Training loss over epochs
├── gradient_flow.png                   # Gradient magnitude per layer
├── hidden_activations.png              # Activation patterns
├── summary_figure.png                  # 9-panel comprehensive summary
│
├── exp1_gradient_flow.png              # Extended gradient analysis
├── exp2_activation_distributions.png   # Activation distribution analysis
├── exp2_sparsity_dead_neurons.png      # Sparsity and dead-neuron analysis
├── exp3_stability.png                  # Training stability analysis
├── exp4_predictions.png                # Function approximation comparison
├── exp4_representational_heatmap.png   # Representational capacity heatmap
│
├── activation_evolution.png            # Activation evolution during training
├── gradient_evolution.png              # Gradient evolution during training
├── training_dynamics_functions.png     # Training dynamics visualization
├── training_dynamics_summary.png       # Training dynamics summary
│
├── loss_histories.json                 # Raw loss data
├── gradient_magnitudes.json            # Gradient measurements
├── gradient_magnitudes_epochs.json     # Gradient evolution data
├── exp1_gradient_flow.json             # Extended gradient data
└── final_losses.json # Final MSE per activation
```
---
## 📖 Key Insights
### Why Sigmoid Fails in Deep Networks
The **vanishing gradient problem** occurs because:
1. **Sigmoid derivative is bounded**: max σ'(x) = 0.25 at x = 0
2. **Chain rule multiplies gradients**: for 10 layers, gradient ≈ (0.25)¹⁰ ≈ 10⁻⁶
3. **Early layers don't learn**: Gradient signal vanishes before reaching input layers
```python
# Theoretical gradient decay for Sigmoid: every layer contributes
# a factor of at most max σ'(x) = 0.25, so after 10 layers:
gradient_layer_10 = gradient_output * 0.25 ** 10
# ≈ gradient_output * 9.5e-7, i.e. effectively zero!
```
### Why ReLU Works
ReLU maintains **unit gradient** for positive inputs:
```python
# ReLU gradient: exactly 1 for positive inputs, 0 otherwise
def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# No multiplicative decay along active paths:
# gradient_layer_10 = gradient_output * 1 ** 10 == gradient_output
```
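A quick numerical check of both claims: multiply per-layer derivative values along single paths through 10 layers (illustrative; real networks mix many paths per unit, but the contrast holds):
```python
import torch

torch.manual_seed(0)
sigmoid_grad = torch.ones(10_000)        # 10,000 independent paths
relu_grad = torch.ones(10_000)
for _ in range(10):                      # 10 layers
    z = torch.randn(10_000)              # fresh pre-activations at each layer
    s = torch.sigmoid(z)
    sigmoid_grad *= s * (1 - s)          # σ'(z) ≤ 0.25 everywhere
    relu_grad *= (z > 0).float()         # ReLU'(z) = 1 on active paths

print(sigmoid_grad.max().item())         # < 1e-6: every Sigmoid path has vanished
print(relu_grad.max().item())            # 1.0: surviving ReLU paths keep full gradient
```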
### Practical Recommendations
| Use Case | Recommended |
|----------|-------------|
| Default choice | ReLU or Leaky ReLU |
| Transformers/LLMs | GELU |
| Very deep networks | Leaky ReLU + skip connections |
| Output (classification) | Sigmoid/Softmax |
| Output (regression) | Linear |
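For the "very deep networks" row, a minimal sketch of what "Leaky ReLU + skip connections" means in practice (illustrative, not code from this repo):
```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One hidden block with an identity skip path."""
    def __init__(self, width: int = 64):
        super().__init__()
        self.fc = nn.Linear(width, width)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):
        # The identity term gives gradients a decay-free route backwards,
        # so depth no longer compounds derivative factors.
        return x + self.act(self.fc(x))
```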
---
## 📚 Extended Experiments
The `tutorial_experiments.py` script includes four additional experiments (the depth sweep from experiment 1 is sketched after the list):
1. **Gradient Flow Analysis** - Depths 5, 10, 20, 50 layers
2. **Activation Distributions** - Sparsity and dead neuron analysis
3. **Training Stability** - Learning rate and depth sensitivity
4. **Representational Capacity** - Multiple target function approximation
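As an example of experiment 1, the depth sweep can be sketched as follows (illustrative only; it reuses the hypothetical `make_mlp`, `x`, and `y` from the earlier gradient-flow snippet, and `tutorial_experiments.py` remains the authoritative version):
```python
import torch.nn as nn

for depth in (5, 10, 20, 50):
    model = make_mlp(nn.Sigmoid, depth=depth)
    nn.functional.mse_loss(model(x), y).backward()
    first = next(m for m in model if isinstance(m, nn.Linear))
    print(f"depth {depth:2d}: first-layer grad norm = "
          f"{first.weight.grad.norm().item():.2e}")
```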
---
## 🔗 References
- [Deep Learning Book - Chapter 6.3: Hidden Units](https://www.deeplearningbook.org/)
- [Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a.html)
- [He et al. (2015): Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852)
- [Hendrycks & Gimpel (2016): GELU](https://arxiv.org/abs/1606.08415)
---
## 📄 Citation
```bibtex
@misc{activation_functions_analysis,
  title     = {Activation Functions: Deep Neural Network Analysis},
  author    = {Orchestra Research},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/AmberLJC/activation_functions}
}
```
---
## 📜 License
MIT License - feel free to use for education and research!
---
*Generated by Orchestra Research Assistant*