tiny-torch-viz / tinytorch /core /activations.py
Adrian Gabriel
latest changs
0ffa4e9
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.17.1
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---
# %% [markdown]
"""
# Activations - Intelligence Through Nonlinearity
Welcome to Activations! Today you'll add the secret ingredient that makes neural networks intelligent: **nonlinearity**.
## ๐Ÿ”— Prerequisites & Progress
**You've Built**: Tensor with data manipulation and basic operations
**You'll Build**: Activation functions that add nonlinearity to transformations
**You'll Enable**: Neural networks with the ability to learn complex patterns
**Connection Map**:
```
Tensor โ†’ Activations โ†’ Layers
(data) (intelligence) (architecture)
```
## ๐ŸŽฏ Learning Objectives
By the end of this module, you will:
1. Implement 5 core activation functions (Sigmoid, ReLU, Tanh, GELU, Softmax)
2. Understand how nonlinearity enables neural network intelligence
3. Test activation behaviors and output ranges
4. Connect activations to real neural network components
Let's add intelligence to your tensors!
"""
# %% [markdown]
"""
## ๐Ÿ“ฆ Where This Code Lives in the Final Package
**Learning Side:** You work in modules/02_activations/activations_dev.py
**Building Side:** Code exports to tinytorch.core.activations
```python
# Final package structure:
from tinytorch.core.activations import Sigmoid, ReLU, Tanh, GELU, Softmax # This module
from tinytorch.core.tensor import Tensor # Foundation (Module 01)
```
**Why this matters:**
- **Learning:** Complete activation system in one focused module for deep understanding
- **Production:** Proper organization like PyTorch's torch.nn.functional with all activation operations together
- **Consistency:** All activation functions and behaviors in core.activations
- **Integration:** Works seamlessly with Tensor for complete nonlinear transformations
"""
# %% [markdown]
"""
## ๐Ÿ“‹ Module Dependencies
**Prerequisites**: Module 01 (Tensor) must be completed
**External Dependencies**:
- `numpy` (for numerical operations)
**TinyTorch Dependencies**:
- **Module 01 (Tensor)**: Foundation for all activation computations and data flow
- Used for: Input/output data structures, shape operations, element-wise operations
- Required: Yes - activations operate on Tensor objects
**Dependency Flow**:
```
Module 01 (Tensor) โ†’ Module 02 (Activations) โ†’ Module 03 (Layers)
โ†“ โ†“ โ†“
Foundation Nonlinearity Architecture
```
**Import Strategy**:
This module imports directly from the TinyTorch package (`from tinytorch.core.*`).
**Assumption**: Module 01 (Tensor) has been completed and exported to the package.
If you see import errors, ensure you've run `tito export` after completing Module 01.
"""
# %% nbgrader={"grade": false, "grade_id": "setup", "solution": true}
#| default_exp core.activations
#| export
import numpy as np
# Import from TinyTorch package (previous modules must be completed and exported)
from .tensor import Tensor
# Constants for numerical comparisons
TOLERANCE = 1e-10 # Small tolerance for floating-point comparisons in tests
# Export only activation classes
__all__ = ['Sigmoid', 'ReLU', 'Tanh', 'GELU', 'Softmax']
# %% [markdown]
"""
## ๐Ÿ’ก Introduction - What Makes Neural Networks Intelligent?
Consider two scenarios:
**Without Activations (Linear Only):**
```
Input โ†’ Linear Transform โ†’ Output
[1, 2] โ†’ [3, 4] โ†’ [11] # Just weighted sum
```
**With Activations (Nonlinear):**
```
Input โ†’ Linear โ†’ Activation โ†’ Linear โ†’ Activation โ†’ Output
[1, 2] โ†’ [3, 4] โ†’ [3, 4] โ†’ [7] โ†’ [7] โ†’ Complex Pattern!
```
The magic happens in those activation functions. They introduce **nonlinearity** - the ability to curve, bend, and create complex decision boundaries instead of just straight lines.
### Why Nonlinearity Matters
Without activation functions, stacking multiple linear layers is pointless:
```
Linear(Linear(x)) = Linear(x) # Same as single layer!
```
With activation functions, each layer can learn increasingly complex patterns:
```
Layer 1: Simple edges and lines
Layer 2: Curves and shapes
Layer 3: Complex objects and concepts
```
This is how deep networks build intelligence from simple mathematical operations.
"""
# %% [markdown]
"""
## ๐Ÿ“ Mathematical Foundations
Each activation function serves a different purpose in neural networks:
### The Five Essential Activations
1. **Sigmoid**: Maps to (0, 1) - perfect for probabilities
2. **ReLU**: Removes negatives - creates sparsity and efficiency
3. **Tanh**: Maps to (-1, 1) - zero-centered for better training
4. **GELU**: Smooth ReLU - modern choice for transformers
5. **Softmax**: Creates probability distributions - essential for classification
Let's implement each one with clear explanations and immediate testing!
"""
# %% [markdown]
"""
## ๐Ÿ—๏ธ Implementation - Building Activation Functions
### ๐Ÿ—๏ธ Implementation Pattern
Each activation follows this structure:
```python
class ActivationName:
def forward(self, x: Tensor) -> Tensor:
# Apply mathematical transformation
# Return new Tensor with result
def backward(self, grad: Tensor) -> Tensor:
# Stub for Module 06 - gradient computation
pass
```
"""
# %% [markdown]
"""
## ๐Ÿ—๏ธ Sigmoid - The Probability Gatekeeper
Sigmoid maps any real number to the range (0, 1), making it perfect for probabilities and binary decisions.
### Mathematical Definition
```
ฯƒ(x) = 1/(1 + e^(-x))
```
### Visual Behavior
```
Input: [-3, -1, 0, 1, 3]
โ†“ โ†“ โ†“ โ†“ โ†“ Sigmoid Function
Output: [0.05, 0.27, 0.5, 0.73, 0.95]
```
### ASCII Visualization
```
Sigmoid Curve:
1.0 โ”ค โ•ญโ”€โ”€โ”€โ”€โ”€
โ”‚ โ•ฑ
0.5 โ”ค โ•ฑ
โ”‚ โ•ฑ
0.0 โ”คโ”€โ•ฑโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
-3 0 3
```
**Why Sigmoid matters**: In binary classification, we need outputs between 0 and 1 to represent probabilities. Sigmoid gives us exactly that!
"""
# %% nbgrader={"grade": false, "grade_id": "sigmoid-impl", "solution": true}
#| export
class Sigmoid:
"""
Sigmoid activation: ฯƒ(x) = 1/(1 + e^(-x))
Maps any real number to (0, 1) range.
Perfect for probabilities and binary classification.
"""
def parameters(self):
"""Return empty list (activations have no learnable parameters)."""
return []
def forward(self, x: Tensor) -> Tensor:
"""
Apply sigmoid activation element-wise.
TODO: Implement sigmoid function
APPROACH:
1. Apply sigmoid formula: 1 / (1 + exp(-x))
2. Use np.exp for exponential
3. Return result wrapped in new Tensor
EXAMPLE:
>>> sigmoid = Sigmoid()
>>> x = Tensor([-2, 0, 2])
>>> result = sigmoid(x)
>>> print(result.data)
[0.119, 0.5, 0.881] # All values between 0 and 1
HINT: Use np.exp(-x.data) for numerical stability
"""
### BEGIN SOLUTION
# Apply sigmoid: 1 / (1 + exp(-x))
# Clip extreme values to prevent overflow (sigmoid(-500) โ‰ˆ 0, sigmoid(500) โ‰ˆ 1)
# Clipping at ยฑ500 ensures exp() stays within float64 range
z = np.clip(x.data, -500, 500)
# Use numerically stable sigmoid
# For positive values: 1 / (1 + exp(-x))
# For negative values: exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x)) after clipping
result_data = np.zeros_like(z)
# Positive values (including zero)
pos_mask = z >= 0
result_data[pos_mask] = 1.0 / (1.0 + np.exp(-z[pos_mask]))
# Negative values
neg_mask = z < 0
exp_z = np.exp(z[neg_mask])
result_data[neg_mask] = exp_z / (1.0 + exp_z)
return Tensor(result_data)
### END SOLUTION
def __call__(self, x: Tensor) -> Tensor:
"""Allows the activation to be called like a function."""
return self.forward(x)
def backward(self, grad: Tensor) -> Tensor:
"""Compute gradient (implemented in Module 06)."""
pass # Will implement backward pass in Module 06
# %% [markdown]
"""
### ๐Ÿ”ฌ Unit Test: Sigmoid
This test validates sigmoid activation behavior.
**What we're testing**: Sigmoid maps inputs to (0, 1) range
**Why it matters**: Ensures proper probability-like outputs
**Expected**: All outputs between 0 and 1, sigmoid(0) = 0.5
"""
# %% nbgrader={"grade": true, "grade_id": "test-sigmoid", "locked": true, "points": 10}
def test_unit_sigmoid():
"""๐Ÿ”ฌ Test Sigmoid implementation."""
print("๐Ÿ”ฌ Unit Test: Sigmoid...")
sigmoid = Sigmoid()
# Test basic cases
x = Tensor([0.0])
result = sigmoid.forward(x)
assert np.allclose(result.data, [0.5]), f"sigmoid(0) should be 0.5, got {result.data}"
# Test range property - all outputs should be in (0, 1)
x = Tensor([-10, -1, 0, 1, 10])
result = sigmoid.forward(x)
assert np.all(result.data > 0) and np.all(result.data < 1), "All sigmoid outputs should be in (0, 1)"
# Test specific values
x = Tensor([-1000, 1000]) # Extreme values
result = sigmoid.forward(x)
assert np.allclose(result.data[0], 0, atol=TOLERANCE), "sigmoid(-โˆž) should approach 0"
assert np.allclose(result.data[1], 1, atol=TOLERANCE), "sigmoid(+โˆž) should approach 1"
print("โœ… Sigmoid works correctly!")
if __name__ == "__main__":
test_unit_sigmoid()
# %% [markdown]
"""
## ๐Ÿ—๏ธ ReLU - The Sparsity Creator
ReLU (Rectified Linear Unit) is the most popular activation function. It simply removes negative values, creating sparsity that makes neural networks more efficient.
### Mathematical Definition
```
f(x) = max(0, x)
```
### Visual Behavior
```
Input: [-2, -1, 0, 1, 2]
โ†“ โ†“ โ†“ โ†“ โ†“ ReLU Function
Output: [ 0, 0, 0, 1, 2]
```
### ASCII Visualization
```
ReLU Function:
โ•ฑ
2 โ•ฑ
โ•ฑ
1โ•ฑ
โ•ฑ
โ•ฑ
โ•ฑ
โ”€โ”ดโ”€โ”€โ”€โ”€โ”€
-2 0 2
```
**Why ReLU matters**: By zeroing negative values, ReLU creates sparsity (many zeros) which makes computation faster and helps prevent overfitting.
"""
# %% nbgrader={"grade": false, "grade_id": "relu-impl", "solution": true}
#| export
class ReLU:
"""
ReLU activation: f(x) = max(0, x)
Sets negative values to zero, keeps positive values unchanged.
Most popular activation for hidden layers.
"""
def parameters(self):
"""Return empty list (activations have no learnable parameters)."""
return []
def forward(self, x: Tensor) -> Tensor:
"""
Apply ReLU activation element-wise.
TODO: Implement ReLU function
APPROACH:
1. Use np.maximum(0, x.data) for element-wise max with zero
2. Return result wrapped in new Tensor
EXAMPLE:
>>> relu = ReLU()
>>> x = Tensor([-2, -1, 0, 1, 2])
>>> result = relu(x)
>>> print(result.data)
[0, 0, 0, 1, 2] # Negative values become 0, positive unchanged
HINT: np.maximum handles element-wise maximum automatically
"""
### BEGIN SOLUTION
# Apply ReLU: max(0, x)
result = np.maximum(0, x.data)
return Tensor(result)
### END SOLUTION
def __call__(self, x: Tensor) -> Tensor:
"""Allows the activation to be called like a function."""
return self.forward(x)
def backward(self, grad: Tensor) -> Tensor:
"""Compute gradient (implemented in Module 06)."""
pass # Will implement backward pass in Module 06
# %% [markdown]
"""
### ๐Ÿ”ฌ Unit Test: ReLU
This test validates ReLU activation behavior.
**What we're testing**: ReLU zeros negative values, preserves positive
**Why it matters**: ReLU's sparsity helps neural networks train efficiently
**Expected**: Negative โ†’ 0, positive unchanged, zero โ†’ 0
"""
# %% nbgrader={"grade": true, "grade_id": "test-relu", "locked": true, "points": 10}
def test_unit_relu():
"""๐Ÿ”ฌ Test ReLU implementation."""
print("๐Ÿ”ฌ Unit Test: ReLU...")
relu = ReLU()
# Test mixed positive/negative values
x = Tensor([-2, -1, 0, 1, 2])
result = relu.forward(x)
expected = [0, 0, 0, 1, 2]
assert np.allclose(result.data, expected), f"ReLU failed, expected {expected}, got {result.data}"
# Test all negative
x = Tensor([-5, -3, -1])
result = relu.forward(x)
assert np.allclose(result.data, [0, 0, 0]), "ReLU should zero all negative values"
# Test all positive
x = Tensor([1, 3, 5])
result = relu.forward(x)
assert np.allclose(result.data, [1, 3, 5]), "ReLU should preserve all positive values"
# Test sparsity property
x = Tensor([-1, -2, -3, 1])
result = relu.forward(x)
zeros = np.sum(result.data == 0)
assert zeros == 3, f"ReLU should create sparsity, got {zeros} zeros out of 4"
print("โœ… ReLU works correctly!")
if __name__ == "__main__":
test_unit_relu()
# %% [markdown]
"""
## ๐Ÿ—๏ธ Tanh - The Zero-Centered Alternative
Tanh (hyperbolic tangent) is like sigmoid but centered around zero, mapping inputs to (-1, 1). This zero-centering helps with gradient flow during training.
### Mathematical Definition
```
f(x) = (e^x - e^(-x))/(e^x + e^(-x))
```
### Visual Behavior
```
Input: [-2, 0, 2]
โ†“ โ†“ โ†“ Tanh Function
Output: [-0.96, 0, 0.96]
```
### ASCII Visualization
```
Tanh Curve:
1 โ”ค โ•ญโ”€โ”€โ”€โ”€โ”€
โ”‚ โ•ฑ
0 โ”คโ”€โ”€โ”€โ•ฑโ”€โ”€โ”€โ”€โ”€
โ”‚ โ•ฑ
-1 โ”คโ”€โ•ฑโ”€โ”€โ”€โ”€โ”€โ”€โ”€
-3 0 3
```
**Why Tanh matters**: Unlike sigmoid, tanh outputs are centered around zero, which can help gradients flow better through deep networks.
"""
# %% nbgrader={"grade": false, "grade_id": "tanh-impl", "solution": true}
#| export
class Tanh:
"""
Tanh activation: f(x) = (e^x - e^(-x))/(e^x + e^(-x))
Maps any real number to (-1, 1) range.
Zero-centered alternative to sigmoid.
"""
def parameters(self):
"""Return empty list (activations have no learnable parameters)."""
return []
def forward(self, x: Tensor) -> Tensor:
"""
Apply tanh activation element-wise.
TODO: Implement tanh function
APPROACH:
1. Use np.tanh(x.data) for hyperbolic tangent
2. Return result wrapped in new Tensor
EXAMPLE:
>>> tanh = Tanh()
>>> x = Tensor([-2, 0, 2])
>>> result = tanh(x)
>>> print(result.data)
[-0.964, 0.0, 0.964] # Range (-1, 1), symmetric around 0
HINT: NumPy provides np.tanh function
"""
### BEGIN SOLUTION
# Apply tanh using NumPy
result = np.tanh(x.data)
return Tensor(result)
### END SOLUTION
def __call__(self, x: Tensor) -> Tensor:
"""Allows the activation to be called like a function."""
return self.forward(x)
def backward(self, grad: Tensor) -> Tensor:
"""Compute gradient (implemented in Module 06)."""
pass # Will implement backward pass in Module 06
# %% [markdown]
"""
### ๐Ÿ”ฌ Unit Test: Tanh
This test validates tanh activation behavior.
**What we're testing**: Tanh maps inputs to (-1, 1) range, zero-centered
**Why it matters**: Zero-centered activations can help with gradient flow
**Expected**: All outputs in (-1, 1), tanh(0) = 0, symmetric behavior
"""
# %% nbgrader={"grade": true, "grade_id": "test-tanh", "locked": true, "points": 10}
def test_unit_tanh():
"""๐Ÿ”ฌ Test Tanh implementation."""
print("๐Ÿ”ฌ Unit Test: Tanh...")
tanh = Tanh()
# Test zero
x = Tensor([0.0])
result = tanh.forward(x)
assert np.allclose(result.data, [0.0]), f"tanh(0) should be 0, got {result.data}"
# Test range property - all outputs should be in (-1, 1)
x = Tensor([-10, -1, 0, 1, 10])
result = tanh.forward(x)
assert np.all(result.data >= -1) and np.all(result.data <= 1), "All tanh outputs should be in [-1, 1]"
# Test symmetry: tanh(-x) = -tanh(x)
x = Tensor([2.0])
pos_result = tanh.forward(x)
x_neg = Tensor([-2.0])
neg_result = tanh.forward(x_neg)
assert np.allclose(pos_result.data, -neg_result.data), "tanh should be symmetric: tanh(-x) = -tanh(x)"
# Test extreme values
x = Tensor([-1000, 1000])
result = tanh.forward(x)
assert np.allclose(result.data[0], -1, atol=TOLERANCE), "tanh(-โˆž) should approach -1"
assert np.allclose(result.data[1], 1, atol=TOLERANCE), "tanh(+โˆž) should approach 1"
print("โœ… Tanh works correctly!")
if __name__ == "__main__":
test_unit_tanh()
# %% [markdown]
"""
## ๐Ÿ—๏ธ GELU - The Smooth Modern Choice
GELU (Gaussian Error Linear Unit) is a smooth approximation to ReLU that's become popular in modern architectures like transformers. Unlike ReLU's sharp corner, GELU is smooth everywhere.
### Mathematical Definition
```
f(x) = x * ฮฆ(x) โ‰ˆ x * Sigmoid(1.702 * x)
```
Where ฮฆ(x) is the cumulative distribution function of standard normal distribution.
### Visual Behavior
```
Input: [-1, 0, 1]
โ†“ โ†“ โ†“ GELU Function
Output: [-0.16, 0, 0.84]
```
### ASCII Visualization
```
GELU Function:
โ•ฑ
1 โ•ฑ
โ•ฑ
โ•ฑ
โ•ฑ
โ•ฑ โ†™ (smooth curve, no sharp corner)
โ•ฑ
โ”€โ”ดโ”€โ”€โ”€โ”€โ”€
-2 0 2
```
**Why GELU matters**: Used in GPT, BERT, and other transformers. The smoothness helps with optimization compared to ReLU's sharp corner.
"""
# %% nbgrader={"grade": false, "grade_id": "gelu-impl", "solution": true}
#| export
class GELU:
"""
GELU activation: f(x) = x * ฮฆ(x) โ‰ˆ x * Sigmoid(1.702 * x)
Smooth approximation to ReLU, used in modern transformers.
Where ฮฆ(x) is the cumulative distribution function of standard normal.
"""
def parameters(self):
"""Return empty list (activations have no learnable parameters)."""
return []
def forward(self, x: Tensor) -> Tensor:
"""
Apply GELU activation element-wise.
TODO: Implement GELU approximation
APPROACH:
1. Use approximation: x * sigmoid(1.702 * x)
2. Compute sigmoid part: 1 / (1 + exp(-1.702 * x))
3. Multiply by x element-wise
4. Return result wrapped in new Tensor
EXAMPLE:
>>> gelu = GELU()
>>> x = Tensor([-1, 0, 1])
>>> result = gelu(x)
>>> print(result.data)
[-0.159, 0.0, 0.841] # Smooth, like ReLU but differentiable everywhere
HINT: The 1.702 constant comes from โˆš(2/ฯ€) approximation
"""
### BEGIN SOLUTION
# GELU approximation: x * sigmoid(1.702 * x)
# First compute sigmoid part
sigmoid_part = 1.0 / (1.0 + np.exp(-1.702 * x.data))
# Then multiply by x
result = x.data * sigmoid_part
return Tensor(result)
### END SOLUTION
def __call__(self, x: Tensor) -> Tensor:
"""Allows the activation to be called like a function."""
return self.forward(x)
def backward(self, grad: Tensor) -> Tensor:
"""Compute gradient (implemented in Module 06)."""
pass # Will implement backward pass in Module 06
# %% [markdown]
"""
### ๐Ÿ”ฌ Unit Test: GELU
This test validates GELU activation behavior.
**What we're testing**: GELU provides smooth ReLU-like behavior
**Why it matters**: GELU is used in modern transformers like GPT and BERT
**Expected**: Smooth curve, GELU(0) โ‰ˆ 0, positive values preserved roughly
"""
# %% nbgrader={"grade": true, "grade_id": "test-gelu", "locked": true, "points": 10}
def test_unit_gelu():
"""๐Ÿ”ฌ Test GELU implementation."""
print("๐Ÿ”ฌ Unit Test: GELU...")
gelu = GELU()
# Test zero (should be approximately 0)
x = Tensor([0.0])
result = gelu.forward(x)
assert np.allclose(result.data, [0.0], atol=TOLERANCE), f"GELU(0) should be โ‰ˆ0, got {result.data}"
# Test positive values (should be roughly preserved)
x = Tensor([1.0])
result = gelu.forward(x)
assert result.data[0] > 0.8, f"GELU(1) should be โ‰ˆ0.84, got {result.data[0]}"
# Test negative values (should be small but not zero)
x = Tensor([-1.0])
result = gelu.forward(x)
assert result.data[0] < 0 and result.data[0] > -0.2, f"GELU(-1) should be โ‰ˆ-0.16, got {result.data[0]}"
# Test smoothness property (no sharp corners like ReLU)
x = Tensor([-0.001, 0.0, 0.001])
result = gelu.forward(x)
# Values should be close to each other (smooth)
diff1 = abs(result.data[1] - result.data[0])
diff2 = abs(result.data[2] - result.data[1])
assert diff1 < 0.01 and diff2 < 0.01, "GELU should be smooth around zero"
print("โœ… GELU works correctly!")
if __name__ == "__main__":
test_unit_gelu()
# %% [markdown]
"""
## ๐Ÿ—๏ธ Softmax - The Probability Distributor
Softmax converts any vector into a valid probability distribution. All outputs are positive and sum to exactly 1.0, making it essential for multi-class classification.
### Mathematical Definition
```
f(x_i) = e^(x_i) / ฮฃ(e^(x_j))
```
### Visual Behavior
```
Input: [1, 2, 3]
โ†“ โ†“ โ†“ Softmax Function
Output: [0.09, 0.24, 0.67] # Sum = 1.0
```
### ASCII Visualization
```
Softmax Transform:
Raw scores: [1, 2, 3, 4]
โ†“ Exponential โ†“
[2.7, 7.4, 20.1, 54.6]
โ†“ Normalize โ†“
[0.03, 0.09, 0.24, 0.64] โ† Sum = 1.0
```
**Why Softmax matters**: In multi-class classification, we need outputs that represent probabilities for each class. Softmax guarantees valid probabilities.
"""
# %% nbgrader={"grade": false, "grade_id": "softmax-impl", "solution": true}
#| export
class Softmax:
"""
Softmax activation: f(x_i) = e^(x_i) / ฮฃ(e^(x_j))
Converts any vector to a probability distribution.
Sum of all outputs equals 1.0.
"""
def parameters(self):
"""Return empty list (activations have no learnable parameters)."""
return []
def forward(self, x: Tensor, dim: int = -1) -> Tensor:
"""
Apply softmax activation along specified dimension.
TODO: Implement numerically stable softmax
APPROACH:
1. Subtract max for numerical stability: x - max(x)
2. Compute exponentials: exp(x - max(x))
3. Sum along dimension: sum(exp_values)
4. Divide: exp_values / sum
5. Return result wrapped in new Tensor
EXAMPLE:
>>> softmax = Softmax()
>>> x = Tensor([1, 2, 3])
>>> result = softmax(x)
>>> print(result.data)
[0.090, 0.245, 0.665] # Sums to 1.0, larger inputs get higher probability
HINTS:
- Use np.max(x.data, axis=dim, keepdims=True) for max
- Use np.sum(exp_values, axis=dim, keepdims=True) for sum
- The max subtraction prevents overflow in exponentials
"""
### BEGIN SOLUTION
# Numerical stability: subtract max to prevent overflow
x_max_data = np.max(x.data, axis=dim, keepdims=True)
x_max = Tensor(x_max_data)
x_shifted = x - x_max # Tensor subtraction
# Compute exponentials
exp_values = Tensor(np.exp(x_shifted.data))
# Sum along dimension
exp_sum_data = np.sum(exp_values.data, axis=dim, keepdims=True)
exp_sum = Tensor(exp_sum_data)
# Normalize to get probabilities
result = exp_values / exp_sum
return result
### END SOLUTION
def __call__(self, x: Tensor, dim: int = -1) -> Tensor:
"""Allows the activation to be called like a function."""
return self.forward(x, dim)
class LogSoftmax:
"""
Log-Softmax activation: log(softmax(x))
Computes log-softmax with numerical stability using the log-sum-exp trick.
More numerically stable than computing softmax then log separately.
Essential for cross-entropy loss computation.
"""
def parameters(self):
"""Return empty list (activations have no learnable parameters)."""
return []
def forward(self, x: Tensor, dim: int = -1) -> Tensor:
"""
Apply log-softmax activation along specified dimension.
Uses the log-sum-exp trick for numerical stability:
log_softmax(x) = x - max(x) - log(sum(exp(x - max(x))))
"""
# Step 1: Find max along dimension for numerical stability
max_vals = np.max(x.data, axis=dim, keepdims=True)
# Step 2: Subtract max to prevent overflow
shifted = x.data - max_vals
# Step 3: Compute log(sum(exp(shifted)))
log_sum_exp = np.log(np.sum(np.exp(shifted), axis=dim, keepdims=True))
# Step 4: Return log_softmax = input - max - log_sum_exp
result = x.data - max_vals - log_sum_exp
return Tensor(result)
def __call__(self, x: Tensor, dim: int = -1) -> Tensor:
"""Allows the activation to be called like a function."""
return self.forward(x, dim)
# %% [markdown]
"""
### ๐Ÿ”ฌ Unit Test: Softmax
This test validates softmax activation behavior.
**What we're testing**: Softmax creates valid probability distributions
**Why it matters**: Essential for multi-class classification outputs
**Expected**: Outputs sum to 1.0, all values in (0, 1), largest input gets highest probability
"""
# %% nbgrader={"grade": true, "grade_id": "test-softmax", "locked": true, "points": 10}
def test_unit_softmax():
"""๐Ÿ”ฌ Test Softmax implementation."""
print("๐Ÿ”ฌ Unit Test: Softmax...")
softmax = Softmax()
# Test basic probability properties
x = Tensor([1, 2, 3])
result = softmax.forward(x)
# Should sum to 1
assert np.allclose(np.sum(result.data), 1.0), f"Softmax should sum to 1, got {np.sum(result.data)}"
# All values should be positive
assert np.all(result.data > 0), "All softmax values should be positive"
# All values should be less than 1
assert np.all(result.data < 1), "All softmax values should be less than 1"
# Largest input should get largest output
max_input_idx = np.argmax(x.data)
max_output_idx = np.argmax(result.data)
assert max_input_idx == max_output_idx, "Largest input should get largest softmax output"
# Test numerical stability with large numbers
x = Tensor([1000, 1001, 1002]) # Would overflow without max subtraction
result = softmax.forward(x)
assert np.allclose(np.sum(result.data), 1.0), "Softmax should handle large numbers"
assert not np.any(np.isnan(result.data)), "Softmax should not produce NaN"
assert not np.any(np.isinf(result.data)), "Softmax should not produce infinity"
# Test with 2D tensor (batch dimension)
x = Tensor([[1, 2], [3, 4]])
result = softmax.forward(x, dim=-1) # Softmax along last dimension
assert result.shape == (2, 2), "Softmax should preserve input shape"
# Each row should sum to 1
row_sums = np.sum(result.data, axis=-1)
assert np.allclose(row_sums, [1.0, 1.0]), "Each row should sum to 1"
print("โœ… Softmax works correctly!")
if __name__ == "__main__":
test_unit_softmax()
# %% [markdown]
"""
## ๐Ÿ”ง Integration - Bringing It Together
Now let's test how all our activation functions work together and understand their different behaviors.
"""
# %% [markdown]
"""
### Understanding the Output Patterns
From the demonstration above, notice how each activation serves a different purpose:
**Sigmoid**: Squashes everything to (0, 1) - good for probabilities
**ReLU**: Zeros negatives, keeps positives - creates sparsity
**Tanh**: Like sigmoid but centered at zero (-1, 1) - better gradient flow
**GELU**: Smooth ReLU-like behavior - modern choice for transformers
**Softmax**: Converts to probability distribution - sum equals 1
These different behaviors make each activation suitable for different parts of neural networks.
"""
# %% [markdown]
"""
## ๐Ÿงช Module Integration Test
Final validation that everything works together correctly.
"""
# %% nbgrader={"grade": true, "grade_id": "module-test", "locked": true, "points": 20}
def test_module():
"""๐Ÿงช Module Test: Complete Integration
Comprehensive test of entire module functionality.
This final test runs before module summary to ensure:
- All unit tests pass
- Functions work together correctly
- Module is ready for integration with TinyTorch
"""
print("๐Ÿงช RUNNING MODULE INTEGRATION TEST")
print("=" * 50)
# Run all unit tests
print("Running unit tests...")
test_unit_sigmoid()
test_unit_relu()
test_unit_tanh()
test_unit_gelu()
test_unit_softmax()
print("\nRunning integration scenarios...")
# Test 1: All activations preserve tensor properties
print("๐Ÿ”ฌ Integration Test: Tensor property preservation...")
test_data = Tensor([[1, -1], [2, -2]]) # 2D tensor
activations = [Sigmoid(), ReLU(), Tanh(), GELU()]
for activation in activations:
result = activation.forward(test_data)
assert result.shape == test_data.shape, f"Shape not preserved by {activation.__class__.__name__}"
assert isinstance(result, Tensor), f"Output not Tensor from {activation.__class__.__name__}"
print("โœ… All activations preserve tensor properties!")
# Test 2: Softmax works with different dimensions
print("๐Ÿ”ฌ Integration Test: Softmax dimension handling...")
data_3d = Tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]) # (2, 2, 3)
softmax = Softmax()
# Test different dimensions
result_last = softmax(data_3d, dim=-1)
assert result_last.shape == (2, 2, 3), "Softmax should preserve shape"
# Check that last dimension sums to 1
last_dim_sums = np.sum(result_last.data, axis=-1)
assert np.allclose(last_dim_sums, 1.0), "Last dimension should sum to 1"
print("โœ… Softmax handles different dimensions correctly!")
# Test 3: Activation chaining (simulating neural network)
print("๐Ÿ”ฌ Integration Test: Activation chaining...")
# Simulate: Input โ†’ Linear โ†’ ReLU โ†’ Linear โ†’ Softmax (like a simple network)
x = Tensor([[-1, 0, 1, 2]]) # Batch of 1, 4 features
# Apply ReLU (hidden layer activation)
relu = ReLU()
hidden = relu.forward(x)
# Apply Softmax (output layer activation)
softmax = Softmax()
output = softmax.forward(hidden)
# Verify the chain
assert hidden.data[0, 0] == 0, "ReLU should zero negative input"
assert np.allclose(np.sum(output.data), 1.0), "Final output should be probability distribution"
print("โœ… Activation chaining works correctly!")
print("\n" + "=" * 50)
print("๐ŸŽ‰ ALL TESTS PASSED! Module ready for export.")
print("Run: tito module complete 02")
# Run comprehensive module test
if __name__ == "__main__":
test_module()
# %% [markdown]
"""
## ๐Ÿค” ML Systems Thinking
Now that you've built activation functions, let's think about their systems-level characteristics.
Understanding computational cost, numerical stability, and gradient behavior helps you make
informed choices when building neural networks.
### Computational Cost Analysis
Different activations have different computational profiles:
**ReLU: O(n) comparisons**
- Simple element-wise comparison: max(0, x)
- Fastest activation function (baseline)
- No exponentials, no divisions
- Ideal for large hidden layers
**Sigmoid/Tanh: O(n) exponentials**
- Each element requires exp() computation
- 3-4ร— slower than ReLU
- Exponentials are expensive operations
- Use sparingly in hidden layers
**GELU: O(n) exponentials + multiplications**
- Approximation involves sigmoid (exponential)
- 4-5ร— slower than ReLU
- Worth the cost in transformers (better gradients)
- Trade-off: performance vs. optimization quality
**Softmax: O(n) exponentials + O(n) sum + O(n) divisions**
- Most expensive: exp, sum, divide for entire vector
- Use only for output layers (not hidden layers)
- Requires synchronization across dimension
- Numerical stability tricks add overhead
### Numerical Stability Considerations
Activations can fail catastrophically without proper handling:
**Sigmoid/Tanh overflow:**
```
Problem: exp(1000) = inf, exp(-1000) = 0
Solution: Clip inputs to reasonable range (ยฑ500)
Your implementation: Uses stable computation for Sigmoid
```
**Softmax catastrophic overflow:**
```
Problem: exp(1000) = inf, causing NaN
Solution: Subtract max before exp (doesn't change result)
Your implementation: Uses max subtraction in Softmax.forward()
```
**ReLU dying neurons:**
```
Problem: Large negative gradient โ†’ weights become negative โ†’ ReLU always outputs 0
Solution: Monitor dead neuron percentage, use LeakyReLU variants
Your implementation: Basic ReLU (watch for this in Module 08 training)
```
### Gradient Behavior Preview
While you'll implement gradients in Module 06, understanding gradient characteristics helps:
**ReLU gradient: Sharp discontinuity**
- Gradient = 1 if x > 0, else 0
- Sharp corner at zero
- Dead neurons never recover (gradient = 0 forever)
**Sigmoid/Tanh gradient: Vanishing problem**
- Gradient โ‰ˆ 0 for large |x|
- Deep networks struggle (gradients die in early layers)
- Why ReLU replaced sigmoid in hidden layers
**GELU gradient: Smooth everywhere**
- No sharp corners (unlike ReLU)
- No vanishing at extremes (like sigmoid)
- Best of both worlds (modern architectures use this)
**Softmax gradient: Coupled across dimension**
- Changing one input affects all outputs
- Jacobian matrix (not element-wise)
- More complex backward pass than others
### Memory Considerations
**Forward pass memory:**
- All activations: Same size as input (element-wise operations)
- Softmax temporary buffers: exp array + sum array (small overhead)
**Backward pass memory (Module 06):**
- Must cache inputs for gradient computation
- 2ร— memory per activation layer (input + gradient)
- For 1000-layer network: Memory adds up!
### Key Insights for Module 02
**For early modules, focus on correctness:**
- Your activations work correctly (test_module validates this)
- Numerical stability is handled (Sigmoid clipping, Softmax max-subtraction)
- Integration ready (Module 03 will use these in layers)
**Systems awareness for later:**
- ReLU is fastest, use for hidden layers by default
- Sigmoid/Tanh: Output layers only (or special cases like gates)
- GELU: Worth the cost in transformers (Module 13)
- Softmax: Output layer for classification only
You've built activations that are both correct AND production-ready!
"""
# %% [markdown]
"""
## ๐Ÿ“Š Real-World Production Context
Now that you've implemented these activations, let's understand how they're used in real ML systems.
### Activation Selection Guide
**When to Use Each Activation:**
**Sigmoid**
- **Use case**: Binary classification output layers, gates in LSTMs/GRUs
- **Production example**: Spam detection (output: probability of spam)
- **Why**: Outputs valid probabilities in (0, 1)
- **Avoid**: Hidden layers in deep networks (vanishing gradients)
**ReLU**
- **Use case**: Hidden layers in CNNs, feedforward networks
- **Production example**: Image classification networks (ResNet, VGG)
- **Why**: Fast computation, prevents vanishing gradients, creates sparsity
- **Avoid**: Output layers (can't output negative values or probabilities)
**Tanh**
- **Use case**: RNN hidden states, when zero-centered outputs matter
- **Production example**: Sentiment analysis RNNs, time series prediction
- **Why**: Zero-centered helps with gradient flow in recurrent networks
- **Avoid**: Very deep networks (still suffers from vanishing gradients)
**GELU**
- **Use case**: Transformer models, modern architectures
- **Production example**: GPT, BERT, modern language models
- **Why**: Smooth approximation of ReLU, better gradient flow, state-of-the-art results
- **Avoid**: When computational efficiency is critical (slightly slower than ReLU)
**Softmax**
- **Use case**: Multi-class classification output layers
- **Production example**: ImageNet classification (1000 classes), NLP token prediction
- **Why**: Converts logits to valid probability distribution (sums to 1)
- **Avoid**: Hidden layers (loses information through normalization)
### Common Production Patterns
**Pattern 1: CNN Image Classification**
```
Input โ†’ Conv+ReLU โ†’ Conv+ReLU โ†’ ... โ†’ Linear โ†’ Softmax โ†’ Class Probabilities
```
**Pattern 2: Binary Classifier**
```
Input โ†’ Linear+ReLU โ†’ Linear+ReLU โ†’ Linear โ†’ Sigmoid โ†’ Binary Probability
```
**Pattern 3: Modern Transformer**
```
Input โ†’ Attention โ†’ Linear+GELU โ†’ Linear+GELU โ†’ Output
```
### Common Pitfalls and Debugging
**Sigmoid/Tanh Pitfalls:**
- **Vanishing gradients**: Gradients near 0 for extreme inputs
- **Saturation**: Outputs plateau, learning slows
- **Debug tip**: Check activation distribution - avoid all values near 0 or 1
**ReLU Pitfalls:**
- **Dying ReLU**: Neurons output 0 forever after large negative gradient
- **No negative outputs**: Can't represent negative relationships
- **Debug tip**: Monitor % of dead neurons (always output 0)
**Softmax Pitfalls:**
- **Numerical overflow**: exp(x) explodes for large x (solved by max subtraction)
- **Dimension confusion**: Must apply along correct axis for batched data
- **Debug tip**: Verify outputs sum to 1.0 along correct dimension
**GELU Pitfalls:**
- **Approximation error**: Using wrong approximation constant
- **Speed**: Slightly slower than ReLU
- **Debug tip**: Compare outputs to reference implementation
### Performance Characteristics
**Computational Cost (relative to ReLU = 1.0):**
- ReLU: 1.0ร— (fastest - just comparison and max)
- Sigmoid: ~3ร—-4ร— (exponential computation)
- Tanh: ~3ร—-4ร— (two exponentials)
- GELU: ~4ร—-5ร— (exponential in approximation)
- Softmax: ~5ร—+ (exponentials + division across all elements)
**Memory Impact:**
- All activations: Minimal memory overhead (output same size as input)
- Softmax: Slightly higher (temporary buffers for exp and sum)
- For autograd (Module 06): Must cache inputs for backward pass
### Integration with TinyTorch
Your activation functions integrate seamlessly with other modules:
**Module 03 (Layers)**: Will use these activations
```python
# Coming in Module 03
class Linear:
def __init__(self, in_features, out_features, activation=None):
self.activation = activation # Your ReLU, Sigmoid, etc.
def forward(self, x):
out = self.compute_linear(x)
if self.activation:
out = self.activation(out) # Uses your forward()
return out
```
**Module 06 (Autograd)**: Will add gradient computation
```python
# Coming in Module 06
class Sigmoid:
def backward(self, grad):
# โˆ‚sigmoid/โˆ‚x = sigmoid(x) * (1 - sigmoid(x))
return grad * self.output * (1 - self.output)
```
"""
# %% [markdown]
"""
## โญ Aha Moment: Activations Transform Data
**What you built:** Five activation functions that introduce nonlinearity to neural networks.
**Why it matters:** Without activations, stacking layers would just be matrix multiplicationโ€”
a linear operation. ReLU's simple "zero out negatives" rule is what allows networks to learn
complex patterns like recognizing faces or understanding language.
In the next module, you'll combine these activations with Linear layers to build complete
neural network architectures. The nonlinearity you just implemented is the secret sauce!
"""
# %%
def demo_activations():
"""๐ŸŽฏ See how activations transform data."""
print("๐ŸŽฏ AHA MOMENT: Activations Transform Data")
print("=" * 45)
# Test input with positive and negative values
x = Tensor(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
print(f"Input: {x.data}")
# ReLU - zeros out negatives
relu = ReLU()
relu_out = relu(x)
print(f"ReLU: {relu_out.data} โ† Negatives become 0!")
# Sigmoid - squashes to (0, 1)
sigmoid = Sigmoid()
sigmoid_out = sigmoid(x)
print(f"Sigmoid: {np.round(sigmoid_out.data, 2)} โ† Squashed to (0,1)")
print("\nโœจ Activations add nonlinearityโ€”the key to deep learning!")
# %%
if __name__ == "__main__":
test_module()
print("\n")
demo_activations()
# %% [markdown]
"""
## ๐Ÿš€ MODULE SUMMARY: Activations
Congratulations! You've built the intelligence engine of neural networks!
### Key Accomplishments
- Built 5 core activation functions with distinct behaviors and use cases
- Implemented forward passes for Sigmoid, ReLU, Tanh, GELU, and Softmax
- Discovered how nonlinearity enables complex pattern learning
- All tests pass โœ… (validated by `test_module()`)
### Ready for Next Steps
Your activation implementations enable neural network layers to learn complex, nonlinear patterns instead of just linear transformations.
Export with: `tito module complete 02`
**Next**: Module 03 will combine your Tensors and Activations to build complete neural network Layers!
"""