# 🧠 Activation Functions: Deep Neural Network Analysis

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)

> **Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.**

This repository provides a comprehensive comparison of 5 activation functions in deep neural networks, demonstrating the **vanishing gradient problem** with Sigmoid and showing why modern activations make deep networks trainable.

---

## 🎯 Key Findings

| Activation | Final MSE | Gradient Ratio (L1/L10) | Status |
|------------|-----------|-------------------------|--------|
| **ReLU** | **0.008** | 1.93 (stable) | ✅ Excellent |
| **Leaky ReLU** | **0.008** | 0.72 (stable) | ✅ Excellent |
| **GELU** | **0.008** | 0.83 (stable) | ✅ Excellent |
| Linear | 0.213 | 0.84 (stable) | ⚠️ Cannot learn non-linearity |
| Sigmoid | 0.518 | **2.59×10⁷** (vanishing!) | ❌ Failed |

### 🔬 The Vanishing Gradient Problem - Visualized

```
Sigmoid Network (10 layers):
Layer 1  ████████████████████████████████████████ Gradient: 5.04×10⁻¹
Layer 5  ████████████ Gradient: 1.02×10⁻⁴
Layer 10 ▏ Gradient: 1.94×10⁻⁸ ← 26 MILLION times smaller!

ReLU Network (10 layers):
Layer 1  ████████████████████████████████████████ Gradient: 2.70×10⁻³
Layer 5  ██████████████████████████████████████ Gradient: 2.10×10⁻³
Layer 10 ████████████████████████████████████████ Gradient: 1.36×10⁻³ ← Healthy flow!
```

---

## 📊 Visual Results

### Learned Functions
![Learned Functions](learned_functions.png)

*ReLU, Leaky ReLU, and GELU perfectly approximate the sine wave. Linear learns only a straight line. Sigmoid completely fails to learn.*

### Training Dynamics
![Loss Curves](loss_curves.png)

### Gradient Flow Analysis
![Gradient Flow](gradient_flow.png)

### Comprehensive Summary
![Summary](summary_figure.png)

---

## 🧪 Experimental Setup

### Architecture
- **Network**: 10 hidden layers × 64 neurons each
- **Task**: 1D non-linear regression (sine wave approximation)
- **Dataset**: `y = sin(x) + ε`, where `x ∈ [-π, π]` and `ε ~ N(0, 0.1)`

### Training Configuration
```python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()
epochs = 500
batch_size = 200  # full batch
torch.manual_seed(42)
```
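The dataset above is easy to re-create. The following standard-library sketch is illustrative only: `make_data` is a name invented here, and the repo's actual sampling scheme may differ (e.g. random uniform draws rather than an evenly spaced grid).

```python
import math
import random

def make_data(n=200, noise_std=0.1, seed=42):
    """Sketch of the dataset: y = sin(x) + eps, with x in [-pi, pi]."""
    rng = random.Random(seed)
    # Evenly spaced grid over [-pi, pi] (an assumption of this sketch)
    xs = [-math.pi + 2 * math.pi * i / (n - 1) for i in range(n)]
    # Gaussian noise with standard deviation 0.1, as stated above
    ys = [math.sin(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

xs, ys = make_data()
```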

### Activation Functions Tested
| Function | Formula | Gradient Range |
|----------|---------|----------------|
| Linear | `f(x) = x` | Always 1 |
| Sigmoid | `f(x) = 1/(1+e⁻ˣ)` | (0, 0.25] |
| ReLU | `f(x) = max(0, x)` | {0, 1} |
| Leaky ReLU | `f(x) = max(0.01x, x)` | {0.01, 1} |
| GELU | `f(x) = x·Φ(x)` | Smooth, ≈(−0.13, 1.13) |

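For concreteness, the table's formulas can be written out in plain Python. This is a reference sketch using only the standard library; the repo itself presumably uses the corresponding `torch.nn` activations.

```python
import math

# Scalar reference implementations of the five activations
def linear(x):     return x
def sigmoid(x):    return 1.0 / (1.0 + math.exp(-x))
def relu(x):       return max(0.0, x)
def leaky_relu(x): return x if x > 0 else 0.01 * x
def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum 0.25, reached at x = 0
```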

---

## 🚀 Quick Start

### Installation
```bash
git clone https://huggingface.co/AmberLJC/activation_functions
cd activation_functions
pip install torch numpy matplotlib
```

### Run the Experiment
```bash
# Basic 5-activation comparison
python train.py

# Extended tutorial with 8 activations and 4 experiments
python tutorial_experiments.py

# Training dynamics analysis
python train_dynamics.py
```

---

## 📁 Repository Structure

```
activation_functions/
├── README.md                         # This file
├── report.md                         # Detailed analysis report
├── activation_tutorial.md            # Educational tutorial
│
├── train.py                          # Main experiment (5 activations)
├── tutorial_experiments.py           # Extended experiments (8 activations)
├── train_dynamics.py                 # Training dynamics analysis
│
├── learned_functions.png             # Predictions vs ground truth
├── loss_curves.png                   # Training loss over epochs
├── gradient_flow.png                 # Gradient magnitude per layer
├── hidden_activations.png            # Activation patterns
├── summary_figure.png                # 9-panel comprehensive summary
│
├── exp1_gradient_flow.png            # Extended gradient analysis
├── exp2_activation_distributions.png # Activation distribution analysis
├── exp2_sparsity_dead_neurons.png    # Sparsity and dead-neuron analysis
├── exp3_stability.png                # Training stability analysis
├── exp4_predictions.png              # Function approximation comparison
├── exp4_representational_heatmap.png # Representational capacity heatmap
│
├── activation_evolution.png          # Activation evolution during training
├── gradient_evolution.png            # Gradient evolution during training
├── training_dynamics_functions.png   # Training dynamics visualization
├── training_dynamics_summary.png     # Training dynamics summary
│
├── loss_histories.json               # Raw loss data
├── gradient_magnitudes.json          # Gradient measurements
├── gradient_magnitudes_epochs.json   # Gradient evolution data
├── exp1_gradient_flow.json           # Extended gradient data
└── final_losses.json                 # Final MSE per activation
```

---

## 📖 Key Insights

### Why Sigmoid Fails in Deep Networks

The **vanishing gradient problem** occurs because:

1. **The Sigmoid derivative is bounded**: max σ'(x) = 0.25, reached at x = 0
2. **The chain rule multiplies gradients**: for 10 layers, gradient ≈ (0.25)¹⁰ ≈ 10⁻⁶
3. **Early layers don't learn**: the gradient signal vanishes before reaching the input layers

```python
# Theoretical worst-case gradient decay for Sigmoid (10 layers)
gradient_layer_10 = gradient_output * 0.25 ** 10
#                 ≈ gradient_output * 0.000001
#                 ≈ 0  # effectively zero!
```
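The multiplication in point 2 can be checked by backpropagating by hand through a toy 10-layer Sigmoid chain. Unit weights and no biases are assumptions of this sketch, not the repo's setup.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forward pass: push one scalar through 10 Sigmoid layers (unit weights)
a = 0.5
outputs = []
for _ in range(10):
    a = sigmoid(a)
    outputs.append(a)

# Backward pass: each layer contributes a factor sigmoid'(z) <= 0.25
grad = 1.0
for a in reversed(outputs):
    grad *= a * (1.0 - a)  # sigmoid'(z) written in terms of the output a

print(f"gradient after 10 layers: {grad:.2e}")  # on the order of 1e-7
```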

### Why ReLU Works

ReLU maintains a **unit gradient** for positive inputs:

```python
# ReLU gradient
relu_grad = lambda x: 1.0 if x > 0 else 0.0

# No multiplicative decay along active paths:
# gradient_layer_10 ≈ gradient_output * 1**10 = gradient_output
```
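This can also be checked by backpropagating by hand through a toy 10-layer ReLU chain (unit weights and a positive input are assumptions of this sketch, not the repo's measurement code):

```python
# Forward and backward through 10 ReLU layers with unit weights.
# A positive input stays positive, so every local gradient is exactly 1.
a = 0.5
grad = 1.0
for _ in range(10):
    a = max(0.0, a)                 # forward: ReLU (0.5 stays 0.5)
    grad *= 1.0 if a > 0 else 0.0   # backward: ReLU'

print(grad)  # 1.0 -- no decay
```

Note the caveat: units that land in the negative region get a gradient of exactly 0, which is the "dead neuron" failure mode analyzed in the extended experiments below.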

### Practical Recommendations

| Use Case | Recommended |
|----------|-------------|
| Default choice | ReLU or Leaky ReLU |
| Transformers/LLMs | GELU |
| Very deep networks | Leaky ReLU + skip connections |
| Output (classification) | Sigmoid/Softmax |
| Output (regression) | Linear |

---

## 📚 Extended Experiments

The `tutorial_experiments.py` script includes 4 additional experiments:

1. **Gradient Flow Analysis** - depths of 5, 10, 20, and 50 layers
2. **Activation Distributions** - sparsity and dead-neuron analysis
3. **Training Stability** - learning-rate and depth sensitivity
4. **Representational Capacity** - approximation of multiple target functions

---

## 🔗 References

- [Deep Learning Book, Chapter 6.3: Hidden Units](https://www.deeplearningbook.org/)
- [Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a.html)
- [He et al. (2015): Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852)
- [Hendrycks & Gimpel (2016): Gaussian Error Linear Units (GELU)](https://arxiv.org/abs/1606.08415)

---

## 📄 Citation

```bibtex
@misc{activation_functions_analysis,
  title={Activation Functions: Deep Neural Network Analysis},
  author={Orchestra Research},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AmberLJC/activation_functions}
}
```

---

## 📜 License

MIT License. Feel free to use this material for education and research!

---

*Generated by Orchestra Research Assistant*