	# 🧠 Activation Functions: Deep Neural Network Analysis

	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
	[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)

	> Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.

	This repository compares 5 activation functions (Linear, Sigmoid, ReLU, Leaky ReLU, GELU) in a 10-hidden-layer network. It shows how Sigmoid suffers from vanishing gradients and why ReLU, Leaky ReLU, and GELU keep deep networks trainable.

	---

	## 🎯 Key Findings

	| Activation | Final MSE | Gradient Ratio (L10/L1) | Status |
	|------------|-----------|-------------------------|--------|
	| ReLU | 0.008 | 1.93 (stable) | ✅ Excellent |
	| Leaky ReLU | 0.008 | 0.72 (stable) | ✅ Excellent |
	| GELU | 0.008 | 0.83 (stable) | ✅ Excellent |
	| Linear | 0.213 | 0.84 (stable) | ⚠️ Cannot learn non-linearity |
	| Sigmoid | 0.518 | 2.59×10⁷ (vanishing!) | ❌ Failed |

	### 🔬 The Vanishing Gradient Problem - Visualized

	```
	Sigmoid Network (10 layers):
	Layer 1 ████████████████████████████████████████ Gradient: 5.04×10⁻¹
	Layer 5 ████████████ Gradient: 1.02×10⁻⁴
	Layer 10 ▏ Gradient: 1.94×10⁻⁸ ← 26 MILLION times smaller!

	ReLU Network (10 layers):
	Layer 1 ████████████████████████████████████████ Gradient: 2.70×10⁻³
	Layer 5 ██████████████████████████████████████ Gradient: 2.10×10⁻³
	Layer 10 ████████████████████████████████████████ Gradient: 1.36×10⁻³ ← Healthy flow!
	```

	---

	## 📊 Visual Results

	### Learned Functions
	![Learned Functions](learned_functions.png)

	ReLU, Leaky ReLU, and GELU closely approximate the sine wave (final MSE 0.008). Linear learns only a straight line (MSE 0.213). Sigmoid fails to fit the curve at all (MSE 0.518).

	### Training Dynamics
	![Loss Curves](loss_curves.png)

	### Gradient Flow Analysis
	![Gradient Flow](gradient_flow.png)

	### Comprehensive Summary
	![Summary](summary_figure.png)

	---

	## 🧪 Experimental Setup

	### Architecture
	- Network: 10 hidden layers × 64 neurons each
	- Task: 1D non-linear regression (sine wave approximation)
	- Dataset: `y = sin(x) + ε`, where `x ∈ [-π, π]` and `ε ~ N(0, 0.1)`
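
The setup above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the code in `train.py`: the helper names `make_mlp` and `make_sine_data` are hypothetical, `0.1` is assumed to be the noise standard deviation, the 200 samples are assumed to be evenly spaced, and the layers use PyTorch's default initialization.

```python
import math

import torch
from torch import nn


def make_mlp(activation: nn.Module, depth: int = 10, width: int = 64) -> nn.Sequential:
    """Build a 1 -> width -> ... -> width -> 1 MLP with `depth` hidden layers.

    The same stateless activation module is reused after every hidden layer.
    Pass nn.Identity() to get the "Linear" baseline.
    """
    layers = []
    in_features = 1
    for _ in range(depth):
        layers.append(nn.Linear(in_features, width))
        layers.append(activation)
        in_features = width
    layers.append(nn.Linear(in_features, 1))  # linear output for regression
    return nn.Sequential(*layers)


def make_sine_data(n: int = 200, noise_std: float = 0.1, seed: int = 42):
    """Return evenly spaced x in [-pi, pi] and y = sin(x) + Gaussian noise."""
    gen = torch.Generator().manual_seed(seed)
    x = torch.linspace(-math.pi, math.pi, n).unsqueeze(1)
    y = torch.sin(x) + noise_std * torch.randn(x.shape, generator=gen)
    return x, y


model = make_mlp(nn.ReLU())
x, y = make_sine_data()
print(model(x).shape)  # one prediction per sample
```

Swapping `nn.ReLU()` for `nn.Sigmoid()`, `nn.LeakyReLU(0.01)`, `nn.GELU()`, or `nn.Identity()` gives the other four variants under the same assumptions.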

	### Training Configuration
	```python
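	import torch
	from torch import nn

	torch.manual_seed(42)  # seed = 42
	loss_fn = nn.MSELoss()
	epochs = 500
	batch_size = 200  # full batch: all 200 samples
	lr = 0.001  # learning rate for torch.optim.Adam(model.parameters(), lr=lr)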
	```
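
A full-batch training loop with these settings might look like the sketch below. The function name `train_full_batch` is hypothetical, and a small one-hidden-layer stand-in model with only 50 epochs is used so the example runs quickly; the experiment itself uses 10 hidden layers and 500 epochs.

```python
import math

import torch
from torch import nn


def train_full_batch(model, x, y, epochs=500, lr=1e-3):
    """Train with Adam on the full batch and return the MSE seen at each epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)  # one forward pass over all samples
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history


torch.manual_seed(42)
x = torch.linspace(-math.pi, math.pi, 200).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)  # 0.1 assumed to be the noise std

# Stand-in model (assumption): much shallower than the 10-layer network above.
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
history = train_full_batch(model, x, y, epochs=50)
print(f"first epoch MSE {history[0]:.3f}, last epoch MSE {history[-1]:.3f}")
```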

	### Activation Functions Tested
	| Function | Formula | Gradient Range |
	|----------|---------|----------------|
	| Linear | `f(x) = x` | Always 1 |
	| Sigmoid | `f(x) = 1/(1+e⁻ˣ)` | (0, 0.25] |
	| ReLU | `f(x) = max(0, x)` | {0, 1} |
	| Leaky ReLU | `f(x) = max(0.01x, x)` | {0.01, 1} |
	| GELU | `f(x) = x·Φ(x)` | Smooth, ≈ [-0.13, 1.13] |
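
The derivative ranges can be checked numerically with the standard library alone. The sketch below evaluates each closed-form derivative on a grid over [-6, 6]; the function names and the grid are illustrative choices, not part of the repository. It shows the sigmoid derivative peaking at 0.25 and the GELU derivative dipping slightly below 0 and rising slightly above 1.

```python
import math


def d_sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def d_relu(x):
    return 1.0 if x > 0 else 0.0


def d_leaky_relu(x, slope=0.01):
    return 1.0 if x > 0 else slope


def d_gelu(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), where Phi and phi are the
    # standard normal CDF and PDF.
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


xs = [i / 100.0 for i in range(-600, 601)]  # grid from -6.00 to 6.00
derivatives = {
    "sigmoid": d_sigmoid,
    "relu": d_relu,
    "leaky_relu": d_leaky_relu,
    "gelu": d_gelu,
}
for name, fn in derivatives.items():
    values = [fn(x) for x in xs]
    print(f"{name:>10}: min={min(values):.3f}  max={max(values):.3f}")
```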

	---

	## 🚀 Quick Start

	### Installation
	```bash
	git clone https://huggingface.co/AmberLJC/activation_functions
	cd activation_functions
	pip install torch numpy matplotlib
	```

	### Run the Experiment
	```bash
	# Basic 5-activation comparison
	python train.py

	# Extended tutorial with 8 activations and 4 experiments
	python tutorial_experiments.py

	# Training dynamics analysis
	python train_dynamics.py
	```

	---

	## 📁 Repository Structure

	```
	activation_functions/
	├── README.md # This file
	├── report.md # Detailed analysis report
	├── activation_tutorial.md # Educational tutorial
	│
	├── train.py # Main experiment (5 activations)
	├── tutorial_experiments.py # Extended experiments (8 activations)
	├── train_dynamics.py # Training dynamics analysis
	│
	├── learned_functions.png # Predictions vs ground truth
	├── loss_curves.png # Training loss over epochs
	├── gradient_flow.png # Gradient magnitude per layer
	├── hidden_activations.png # Activation patterns
	├── summary_figure.png # 9-panel comprehensive summary
	│
	├── exp1_gradient_flow.png # Extended gradient analysis
	├── exp2_activation_distributions.png # Activation distribution analysis
	├── exp2_sparsity_dead_neurons.png # Sparsity and dead neuron analysis
	├── exp3_stability.png # Training stability analysis
	├── exp4_predictions.png # Function approximation comparison
	├── exp4_representational_heatmap.png # Representational capacity heatmap
	│
	├── activation_evolution.png # Activation evolution during training
	├── gradient_evolution.png # Gradient evolution during training
	├── training_dynamics_functions.png # Training dynamics visualization
	├── training_dynamics_summary.png # Training dynamics summary
	│
	├── loss_histories.json # Raw loss data
	├── gradient_magnitudes.json # Gradient measurements
	├── gradient_magnitudes_epochs.json # Gradient evolution data
	├── exp1_gradient_flow.json # Extended gradient data
	└── final_losses.json # Final MSE per activation
	```

	---

	## 📖 Key Insights

	### Why Sigmoid Fails in Deep Networks

	The vanishing gradient problem occurs because:

	1. Sigmoid derivative is bounded: max(σ'(x)) = 0.25 at x=0
	2. Chain rule multiplies gradients: Across 10 layers, the activation-derivative factor alone is at most (0.25)¹⁰ ≈ 10⁻⁶
	3. Early layers don't learn: Gradient signal vanishes before reaching input layers

	```python
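	# Theoretical upper bound on gradient decay through 10 Sigmoid layers
	# (activation-derivative factor only; weight matrices are ignored)
	gradient_output = 1.0
	gradient_layer_10 = gradient_output * 0.25 ** 10
	print(gradient_layer_10)  # ≈ 9.5e-07, effectively zero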
	```
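
The bound in step 2 can be written out for a simplified scalar chain. Assume one unit per layer with pre-activation $z_l = w_l a_{l-1} + b_l$ and activation $a_l = \sigma(z_l)$; the symbols $w_l$, $b_l$, $z_l$, $a_l$ are introduced here for illustration and are not taken from the repository's code.

```latex
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4},
\qquad
\frac{\partial a_{10}}{\partial a_{0}} = \prod_{l=1}^{10} w_l\,\sigma'(z_l)
\;\Longrightarrow\;
\left|\frac{\partial a_{10}}{\partial a_{0}}\right|
\le \left(\tfrac{1}{4}\right)^{10} \prod_{l=1}^{10} |w_l|
\approx 9.5 \times 10^{-7} \prod_{l=1}^{10} |w_l| .
```

The factor $(1/4)^{10}$ is the best case, reached only when every pre-activation is exactly 0. Unless the weights are large enough to compensate, the product shrinks geometrically with depth.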

	### Why ReLU Works

	ReLU maintains unit gradient for positive inputs:

	```python
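	# ReLU derivative: 1 for positive inputs, 0 otherwise
	def relu_grad(x):
	    return 1.0 if x > 0 else 0.0

	# No multiplicative decay along a path of active (positive-input) units
	gradient_output = 1.0
	gradient_layer_10 = gradient_output * relu_grad(2.0) ** 10  # == gradient_output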
	```
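
The contrast between Sigmoid and ReLU can also be observed directly by backpropagating once through a freshly initialized network and reading the gradient of each layer's weights. The sketch below is an illustration under stated assumptions, not the measurement code in `train.py`: the function name `layer_gradient_norms` is hypothetical, gradients are taken at initialization (PyTorch default init) rather than after training, the target is a noise-free sine, and layers are listed from input to output.

```python
import math

import torch
from torch import nn


def layer_gradient_norms(activation: nn.Module, depth: int = 10, width: int = 64, seed: int = 42):
    """Return mean |dLoss/dW| for each Linear layer, ordered from input to output."""
    torch.manual_seed(seed)
    layers, in_features = [], 1
    for _ in range(depth):
        layers += [nn.Linear(in_features, width), activation]
        in_features = width
    layers.append(nn.Linear(in_features, 1))
    model = nn.Sequential(*layers)

    x = torch.linspace(-math.pi, math.pi, 200).unsqueeze(1)
    loss = nn.functional.mse_loss(model(x), torch.sin(x))
    loss.backward()  # populates .grad on every parameter
    return [m.weight.grad.abs().mean().item() for m in model if isinstance(m, nn.Linear)]


for name, act in [("sigmoid", nn.Sigmoid()), ("relu", nn.ReLU())]:
    norms = layer_gradient_norms(act)
    # norms[0] is the first hidden layer; norms[-2] is the last hidden layer.
    print(f"{name}: first hidden {norms[0]:.2e}, last hidden {norms[-2]:.2e}")
```

With Sigmoid, the first hidden layer's gradient should come out many orders of magnitude smaller than the last hidden layer's, while with ReLU the two stay far closer together. Exact values depend on the seed and initialization.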

	### Practical Recommendations

	| Use Case | Recommended |
	|----------|-------------|
	| Default choice | ReLU or Leaky ReLU |
	| Transformers/LLMs | GELU |
	| Very deep networks | Leaky ReLU + skip connections |
	| Output (classification) | Sigmoid/Softmax |
	| Output (regression) | Linear |

	---

	## 📚 Extended Experiments

	The `tutorial_experiments.py` script includes 4 additional experiments:

	1. Gradient Flow Analysis - Depths 5, 10, 20, 50 layers
	2. Activation Distributions - Sparsity and dead neuron analysis
	3. Training Stability - Learning rate and depth sensitivity
	4. Representational Capacity - Multiple target function approximation
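
As an illustration of what the sparsity and dead neuron analysis in experiment 2 could measure, the sketch below pushes the inputs through a stack of randomly initialized ReLU layers and reports two per-layer quantities. The definitions are assumptions made for this example and may differ from those in `tutorial_experiments.py`: "sparsity" is the share of activations equal to zero, and a "dead" unit is one that outputs zero for every input. The function name `relu_sparsity` is hypothetical.

```python
import math

import torch
from torch import nn


def relu_sparsity(depth: int = 10, width: int = 64, n: int = 200, seed: int = 42):
    """Return a (sparsity, dead_fraction) pair for each hidden ReLU layer."""
    torch.manual_seed(seed)
    h = torch.linspace(-math.pi, math.pi, n).unsqueeze(1)
    stats = []
    with torch.no_grad():
        for _ in range(depth):
            linear = nn.Linear(h.shape[1], width)  # PyTorch default initialization
            h = torch.relu(linear(h))
            zero = h == 0
            sparsity = zero.float().mean().item()         # share of zero activations
            dead = zero.all(dim=0).float().mean().item()  # units silent on all inputs
            stats.append((sparsity, dead))
    return stats


for layer, (sparsity, dead) in enumerate(relu_sparsity(), start=1):
    print(f"layer {layer:2d}: sparsity={sparsity:.2f}  dead={dead:.2f}")
```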

	---

	## 🔗 References

	- [Deep Learning Book - Chapter 6.3: Hidden Units](https://www.deeplearningbook.org/)
	- [Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a.html)
	- [He et al. (2015): Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852)
	- [Hendrycks & Gimpel (2016): GELU](https://arxiv.org/abs/1606.08415)

	---

	## 📄 Citation

	```bibtex
	@misc{activation_functions_analysis,
	title={Activation Functions: Deep Neural Network Analysis},
	author={Orchestra Research},
	year={2024},
	publisher={HuggingFace},
	url={https://huggingface.co/AmberLJC/activation_functions}
	}
	```

	---

	## 📜 License

	MIT License - feel free to use for education and research!

	---

	Generated by Orchestra Research Assistant