# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---
# %% [markdown]
"""
# Module 07: Optimizers - Sophisticated Learning Algorithms
Welcome to Module 07! You'll build optimizers that enable neural networks to learn from gradients using sophisticated algorithms.
## 📍 Prerequisites & Progress
**You've Built**: Tensor with gradients (Modules 01-06)
**You'll Build**: SGD, Adam, and AdamW optimizers with sophisticated momentum and adaptive learning
**You'll Enable**: Modern optimization algorithms that power state-of-the-art neural networks
**Connection Map**:
```
Gradients    →    Optimizers    →    Training
(Module 06)       (Module 07)       (Module 08)
```
## 🎯 Learning Objectives
By the end of this module, you will:
1. Implement SGD with momentum for stable gradient descent
2. Build Adam optimizer with adaptive learning rates
3. Create AdamW optimizer with decoupled weight decay
4. Understand memory and computational trade-offs in optimization algorithms
Let's get started!
## 📦 Where This Code Lives in the Final Package
**Learning Side:** You work in `modules/07_optimizers/optimizers_dev.py`
**Building Side:** Code exports to `tinytorch.core.optimizers`
```python
# How to use this module:
from tinytorch.core.optimizers import SGD, Adam, AdamW
```
**Why this matters:**
- **Learning:** Complete optimization system for modern neural network training
- **Production:** Proper organization like PyTorch's torch.optim with all optimization algorithms together
- **Consistency:** All optimization logic and parameter updating in core.optimizers
- **Integration:** Works seamlessly with gradients from Module 06 for complete training capability
"""
# %% nbgrader={"grade": false, "grade_id": "imports", "solution": true}
#| default_exp core.optimizers
#| export
import numpy as np
from typing import List, Union, Optional, Dict, Any
# Import Tensor from Module 01 (now with gradient support from Module 06)
from tinytorch.core.tensor import Tensor
# Enable autograd to add gradient tracking to Tensor
# This module depends on Module 06 (Autograd) being available
from tinytorch.core.autograd import enable_autograd
enable_autograd()
# Constants for optimizer defaults
DEFAULT_LEARNING_RATE_SGD = 0.01 # Default learning rate for SGD
DEFAULT_LEARNING_RATE_ADAM = 0.001 # Default learning rate for Adam/AdamW
DEFAULT_MOMENTUM = 0.9 # Default momentum for SGD
DEFAULT_BETA1 = 0.9 # First moment decay rate for Adam
DEFAULT_BETA2 = 0.999 # Second moment decay rate for Adam
DEFAULT_EPS = 1e-8 # Small epsilon for numerical stability in Adam
DEFAULT_WEIGHT_DECAY_ADAMW = 0.01 # Default weight decay for AdamW
# %% [markdown]
"""
## 💡 Introduction: What are Optimizers?
Optimizers are the engines that drive neural network learning. They take gradients computed from your loss function and use them to update model parameters toward better solutions. Think of optimization as navigating a complex landscape where you're trying to find the lowest valley (minimum loss).
### The Optimization Challenge
Imagine you're hiking in dense fog, trying to reach the bottom of a valley. You can only feel the slope under your feet (the gradient), but you can't see where you're going. Different optimization strategies are like different hiking approaches:
```
Loss Landscape (2D visualization):

  🏔️                      🏔️
    \\                    /
     \\   🚶            /
      \\               /
       \\             /
        \\____🎯____/     ← Global minimum (goal)

Challenge: Navigate to 🎯 using only local slope information!
```
### Our Optimizer Toolkit
**SGD (Stochastic Gradient Descent)**
- Strategy: Always step downhill
- Problem: Can get stuck oscillating in narrow valleys
- Solution: Add momentum to "coast" through oscillations
**Adam (Adaptive Moment Estimation)**
- Strategy: Adapt step size for each parameter individually
- Advantage: Different learning rates for different dimensions
- Key Insight: Some directions need big steps, others need small steps
**AdamW (Adam with Weight Decay)**
- Strategy: Adam + proper regularization
- Fix: Separates optimization from regularization
- Result: Better generalization and training stability
### The Mathematics Behind Movement
At its core, optimization follows: **θ_new = θ_old - α * direction**
Where:
- `θ` = parameters (your position in the landscape)
- `α` = step size (learning rate)
- `direction` = where to step (gradient-based)
But sophisticated optimizers do much more than basic gradient descent!
"""
# %% [markdown]
"""
## 📐 Foundations: Mathematical Background
### Understanding Momentum: The Physics of Optimization
Momentum in optimization works like momentum in physics. A ball rolling down a hill doesn't immediately change direction when it hits a small bump - it has momentum that carries it forward.
```
Without Momentum (SGD):           With Momentum:
  ↗ ↘ ↗ ↘   ← oscillation          → → → →   ← smooth path

Narrow valley problem:            Momentum solution:
|\\       /|                       |\\       /|
| \\ •↔  / |  ← ping-pong          | \\ •→  / |  ← smoother
|  \\   /  |     motion            |  \\   /  |     descent
|    ↓    |                       |    ↓    |
```
**SGD with Momentum Formula:**
```
velocity = β * previous_velocity + current_gradient
parameter = parameter - learning_rate * velocity
Where β ≈ 0.9 means "90% memory of previous direction".
(Some texts use the exponential-moving-average variant
velocity = β * v + (1-β) * grad; we follow the PyTorch-style
recurrence above, which is what our step() implements.)
```
### Adam: Adaptive Learning for Each Parameter
Adam solves a key problem: different parameters need different learning rates. Imagine adjusting the focus and zoom on a camera - you need fine control for focus but coarse control for zoom.
```
Parameter Landscape (2 dimensions):
      param2
        ^
        |
        |   steep gradient
        |   (needs small steps)
        |
  ------+----→----→ param1
        |    \\
        |     \\   gentle gradient
        |      \\  (needs big steps)
Adam Solution: Automatic step size per parameter!
```
**Adam's Two-Memory System:**
1. **First Moment (m)**: "Which direction am I usually going?"
   - `m = β₁ * old_m + (1-β₁) * gradient`
   - Like momentum, but for direction
2. **Second Moment (v)**: "How big are my gradients usually?"
   - `v = β₂ * old_v + (1-β₂) * gradient²`
   - Tracks gradient magnitude
3. **Adaptive Update**:
   - `step_size = m / √v`
   - Big gradients → smaller steps
   - Small gradients → relatively bigger steps
### AdamW: Fixing Weight Decay
Adam has a subtle bug in how it applies weight decay (regularization). AdamW fixes this:
```
Adam (incorrect):                     AdamW (correct):
gradient += weight_decay * param      [compute gradient update]
update_param_with_gradient()          param -= learning_rate * gradient_update
                                      param *= (1 - weight_decay)  ← separate!
Why it matters:
- Adam: Weight decay affected by adaptive learning rates
- AdamW: Weight decay is consistent regardless of gradients
```
"""
# %% [markdown]
"""
## 🏗️ Implementation: Building Optimizers
Now we'll implement each optimizer step by step, following the pattern: understand the algorithm β implement it β test it immediately. Each optimizer builds on the foundation of the previous one.
### Implementation Strategy
```
Optimizer Base Class
        ↓
SGD (foundation algorithm)
        ↓
SGD + Momentum (reduce oscillations)
        ↓
Adam (adaptive learning rates)
        ↓
AdamW (proper weight decay)
```
"""
# %% nbgrader={"grade": false, "grade_id": "optimizer-base", "solution": true}
#| export
class Optimizer:
    """
    Base class for all optimizers.

    This class defines the common interface that all optimizers must implement:
    - zero_grad(): Clear gradients from parameters
    - step(): Update parameters based on gradients
    """
    def __init__(self, params: List[Tensor]):
        """
        Initialize optimizer with parameters to optimize.

        TODO: Set up the parameter list for optimization

        APPROACH:
        1. Store parameters as a list for iteration
        2. Validate that all parameters require gradients
        3. Initialize step counter for algorithms that need it

        EXAMPLE:
        >>> linear = Linear(784, 128)
        >>> optimizer = SGD(linear.parameters(), lr=0.01)

        HINT: Store parameters for iteration during optimization steps
        """
        ### BEGIN SOLUTION
        # Validate and store parameters
        if not isinstance(params, list):
            params = list(params)
        # Store parameters - gradient tracking is handled by autograd module
        self.params = params
        self.step_count = 0  # For algorithms that need step counting
        ### END SOLUTION

    def zero_grad(self):
        """
        Clear gradients from all parameters.

        TODO: Reset all parameter gradients to None

        APPROACH:
        1. Iterate through all parameters
        2. Set each parameter's grad to None

        EXAMPLE:
        >>> optimizer.zero_grad()  # Clears all gradients
        >>> assert all(param.grad is None for param in optimizer.params)

        WHY: Gradients accumulate by default, so we need to clear them between batches
        """
        ### BEGIN SOLUTION
        for param in self.params:
            param.grad = None
        ### END SOLUTION

    def step(self):
        """
        Update parameters based on gradients.

        This is abstract - each optimizer implements its own update rule.
        """
        raise NotImplementedError("Subclasses must implement step()")
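# %% [markdown]
"""
The `zero_grad()`/`step()` contract defines the canonical training loop. The sketch below uses tiny stand-in classes (`Param` and `ToySGD` are hypothetical names, not the real Tensor or SGD) to show why gradients are cleared every iteration:

```python
import numpy as np

# Minimal stand-ins to show the zero_grad/step contract (not the real classes).
class Param:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None

class ToySGD:
    def __init__(self, params, lr):
        self.params, self.lr = params, lr
    def zero_grad(self):
        for p in self.params:
            p.grad = None          # without this, gradients would accumulate
    def step(self):
        for p in self.params:
            if p.grad is not None:
                p.data = p.data - self.lr * p.grad

w = Param([1.0, 2.0])
opt = ToySGD([w], lr=0.1)
for _ in range(2):                 # canonical loop: zero_grad → backward → step
    opt.zero_grad()
    w.grad = np.array([0.1, 0.2])  # stand-in for loss.backward()
    opt.step()
print(w.data)                      # [0.98 1.96]
```

The real optimizers below follow exactly this shape; only `step()` gets smarter.
"""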
# %% [markdown]
"""
### 🔬 Unit Test: Base Optimizer
This test validates our base Optimizer class works correctly.
**What we're testing**: Parameter validation and zero_grad functionality
**Why it matters**: Foundation for all specific optimizer implementations
**Expected**: Proper parameter storage and gradient clearing
"""
# %% nbgrader={"grade": true, "grade_id": "test-optimizer-base", "locked": true, "points": 10}
def test_unit_optimizer_base():
    """🔬 Test base Optimizer functionality."""
    print("🔬 Unit Test: Base Optimizer...")
    # Create test parameters
    param1 = Tensor([1.0, 2.0], requires_grad=True)
    param2 = Tensor([[3.0, 4.0], [5.0, 6.0]], requires_grad=True)
    # Add some gradients
    param1.grad = Tensor([0.1, 0.2])
    param2.grad = Tensor([[0.3, 0.4], [0.5, 0.6]])
    # Create optimizer
    optimizer = Optimizer([param1, param2])
    # Test parameter storage
    assert len(optimizer.params) == 2
    assert optimizer.params[0] is param1
    assert optimizer.params[1] is param2
    assert optimizer.step_count == 0
    # Test zero_grad
    optimizer.zero_grad()
    assert param1.grad is None
    assert param2.grad is None
    # Test that optimizer accepts any tensor (no validation required)
    # Gradient tracking is handled by the autograd module
    regular_param = Tensor([1.0])
    opt = Optimizer([regular_param])
    assert len(opt.params) == 1
    print("✅ Base Optimizer works correctly!")

if __name__ == "__main__":
    test_unit_optimizer_base()
# %% [markdown]
"""
## 🏗️ SGD - Stochastic Gradient Descent
SGD is the foundation of neural network optimization. It implements the simple but powerful idea: "move in the direction opposite to the gradient."
### Why SGD Works
Gradients point uphill (toward higher loss). To minimize loss, we go downhill:
```
Loss Surface (side view):
Loss
  ^
  |
  |   •  ← you are here (current position)
  |    \\
  |     \\   gradient points uphill
  |      \\
  +-------\\--------→ parameters
           \\
            ↘  SGD steps downhill
              (opposite to gradient)
               •  ← goal (minimum loss)
```
### The Oscillation Problem
Pure SGD can get trapped oscillating in narrow valleys:
```
Narrow valley (top view):
  \\            /
   \\          /    ← steep sides
    \\        /
  4 ←---•---→ 2     ← SGD bounces back and forth
    /        \\
   1          3       instead of going down the valley
    \\        /
       ↓
      goal
```
### Momentum Solution
Momentum remembers the direction you were going and continues in that direction:
```
With momentum:
  \\          /
   \\        /
    \\      /
     •             ← smooth path down the valley
     ↓
     ↓             ← momentum carries us through oscillations
    goal
```
**Implementation:** SGD keeps a "velocity" buffer that accumulates momentum.
"""
# %% nbgrader={"grade": false, "grade_id": "sgd-optimizer", "solution": true}
#| export
class SGD(Optimizer):
    """
    Stochastic Gradient Descent with momentum.

    SGD is the foundational optimization algorithm that moves parameters
    in the direction opposite to gradients. With momentum, it remembers
    previous updates to reduce oscillations and accelerate convergence.
    """
    def __init__(self, params: List[Tensor], lr: float = DEFAULT_LEARNING_RATE_SGD,
                 momentum: float = 0.0, weight_decay: float = 0.0):
        """
        Initialize SGD optimizer.

        TODO: Set up SGD with momentum and weight decay

        APPROACH:
        1. Call parent constructor to set up parameters
        2. Store learning rate, momentum, and weight decay
        3. Initialize momentum buffers for each parameter

        EXAMPLE:
        >>> optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)

        HINTS:
        - Momentum buffers should be initialized as None
        - They'll be created lazily on first step
        """
        ### BEGIN SOLUTION
        super().__init__(params)
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        # Initialize momentum buffers (created lazily)
        self.momentum_buffers = [None for _ in self.params]
        ### END SOLUTION

    def has_momentum(self) -> bool:
        """
        Check if this optimizer uses momentum.

        This explicit API method replaces the need for hasattr() checks
        in checkpointing code (Module 08).

        Returns:
            bool: True if momentum is enabled (momentum > 0), False otherwise

        EXAMPLE:
        >>> optimizer = SGD(params, lr=0.01, momentum=0.9)
        >>> optimizer.has_momentum()
        True
        """
        return self.momentum > 0

    def get_momentum_state(self) -> Optional[List]:
        """
        Get momentum buffers for checkpointing.

        This explicit API method provides safe access to momentum buffers
        without using hasattr(), making the API contract clear.

        Returns:
            Optional[List]: List of momentum buffers if momentum is enabled,
                None otherwise

        EXAMPLE:
        >>> optimizer = SGD(params, lr=0.01, momentum=0.9)
        >>> optimizer.step()  # Initialize buffers
        >>> state = optimizer.get_momentum_state()
        >>> # Later: optimizer.set_momentum_state(state)
        """
        if not self.has_momentum():
            return None
        return [buf.copy() if buf is not None else None
                for buf in self.momentum_buffers]

    def set_momentum_state(self, state: Optional[List]) -> None:
        """
        Restore momentum buffers from checkpointing.

        This explicit API method provides safe restoration of momentum state
        without using hasattr().

        Args:
            state: List of momentum buffers or None

        EXAMPLE:
        >>> optimizer = SGD(params, lr=0.01, momentum=0.9)
        >>> state = optimizer.get_momentum_state()
        >>> # Training interruption...
        >>> new_optimizer = SGD(params, lr=0.01, momentum=0.9)
        >>> new_optimizer.set_momentum_state(state)
        """
        if state is None or not self.has_momentum():
            return
        if len(state) != len(self.momentum_buffers):
            raise ValueError(
                f"State length {len(state)} doesn't match "
                f"optimizer parameters {len(self.momentum_buffers)}"
            )
        for i, buf in enumerate(state):
            if buf is not None:
                self.momentum_buffers[i] = buf.copy()

    def step(self):
        """
        Perform SGD update step with momentum.

        TODO: Implement SGD parameter update with momentum

        APPROACH:
        1. For each parameter with gradients:
           a. Apply weight decay if specified
           b. Update momentum buffer
           c. Update parameter using momentum

        FORMULA:
        - With weight decay: grad = grad + weight_decay * param
        - Momentum: v = momentum * v_prev + grad
        - Update: param = param - lr * v

        HINTS:
        - Skip parameters without gradients
        - Initialize momentum buffers on first use
        - Use in-place operations to save memory
        """
        ### BEGIN SOLUTION
        for i, param in enumerate(self.params):
            if param.grad is None:
                continue
            # Get gradient data - grad can be Tensor or numpy array
            grad = param.grad
            # Handle both Tensor (with .data) and numpy array (from autograd) cases
            if isinstance(grad, Tensor):
                grad_data = grad.data
            else:
                # grad is already a numpy array from autograd
                grad_data = grad
            # Apply weight decay
            if self.weight_decay != 0:
                grad_data = grad_data + self.weight_decay * param.data
            # Update momentum buffer
            if self.momentum != 0:
                if self.momentum_buffers[i] is None:
                    # Initialize momentum buffer
                    self.momentum_buffers[i] = np.zeros_like(param.data)
                # Update momentum: v = momentum * v_prev + grad
                self.momentum_buffers[i] = self.momentum * self.momentum_buffers[i] + grad_data
                grad_data = self.momentum_buffers[i]
            # Update parameter: param = param - lr * grad
            param.data = param.data - self.lr * grad_data
        # Increment step counter
        self.step_count += 1
        ### END SOLUTION
# %% [markdown]
"""
### 🔬 Unit Test: SGD Optimizer
This test validates our SGD implementation works correctly.
**What we're testing**: SGD updates with and without momentum
**Why it matters**: Core optimization algorithm used in neural network training
**Expected**: Correct parameter updates following SGD formulas
"""
# %% nbgrader={"grade": true, "grade_id": "test-sgd", "locked": true, "points": 15}
def test_unit_sgd_optimizer():
    """🔬 Test SGD optimizer implementation."""
    print("🔬 Unit Test: SGD Optimizer...")
    # Test basic SGD without momentum
    param = Tensor([1.0, 2.0], requires_grad=True)
    param.grad = Tensor([0.1, 0.2])
    optimizer = SGD([param], lr=0.1)
    original_data = param.data.copy()
    optimizer.step()
    # Expected: param = param - lr * grad = [1.0, 2.0] - 0.1 * [0.1, 0.2] = [0.99, 1.98]
    expected = original_data - 0.1 * param.grad.data
    assert np.allclose(param.data, expected)
    assert optimizer.step_count == 1
    # Test SGD with momentum
    param2 = Tensor([1.0, 2.0], requires_grad=True)
    param2.grad = Tensor([0.1, 0.2])
    optimizer_momentum = SGD([param2], lr=0.1, momentum=0.9)
    # First step: v = 0.9 * 0 + [0.1, 0.2] = [0.1, 0.2]
    optimizer_momentum.step()
    expected_first = np.array([1.0, 2.0]) - 0.1 * np.array([0.1, 0.2])
    assert np.allclose(param2.data, expected_first)
    # Second step with same gradient
    param2.grad = Tensor([0.1, 0.2])
    optimizer_momentum.step()
    # v = 0.9 * [0.1, 0.2] + [0.1, 0.2] = [0.19, 0.38]
    expected_momentum = np.array([0.19, 0.38])
    expected_second = expected_first - 0.1 * expected_momentum
    assert np.allclose(param2.data, expected_second, rtol=1e-5)
    # Test weight decay
    param3 = Tensor([1.0, 2.0], requires_grad=True)
    param3.grad = Tensor([0.1, 0.2])
    optimizer_wd = SGD([param3], lr=0.1, weight_decay=0.01)
    optimizer_wd.step()
    # grad_with_decay = [0.1, 0.2] + 0.01 * [1.0, 2.0] = [0.11, 0.22]
    expected_wd = np.array([1.0, 2.0]) - 0.1 * np.array([0.11, 0.22])
    assert np.allclose(param3.data, expected_wd)
    print("✅ SGD optimizer works correctly!")

if __name__ == "__main__":
    test_unit_sgd_optimizer()
# %% [markdown]
"""
## 🏗️ Adam - Adaptive Moment Estimation
Adam solves a fundamental problem with SGD: different parameters often need different learning rates. Think of tuning a complex system where some knobs need gentle adjustments and others need bold changes.
### The Parameter Scaling Problem
Consider a neural network with both embedding weights and output weights:
```
Parameter Sensitivity Landscape:
   output_weight                 embedding_weight
        ↓                              ↓
        |                              |
        |  steep cliff                 |  gentle slope
        |  (needs tiny steps)          |  (needs big steps)
        |                              |
  ──────┴──────────────────────────────┴──────
Same learning rate = disaster!
• Small LR: output weights stay stable, but embeddings crawl
• Large LR: embeddings learn well, but output weights explode
```
### Adam's Adaptive Solution
Adam automatically adjusts learning rates by tracking two statistics:
```
1. MOMENTUM (first moment): "Which way am I usually going?"
   m = 0.9 * old_direction + 0.1 * current_gradient
   Visualization:
     old: →→→→
     new: ↘
     m:   →→→↘  (weighted average)
2. SCALE (second moment): "How big are my steps usually?"
   v = 0.999 * old_scale + 0.001 * (current_gradient)²
   Big gradients   → bigger v  → smaller effective steps
   Small gradients → smaller v → bigger effective steps
3. ADAPTIVE UPDATE:
   step = momentum / √scale
   param = param - learning_rate * step
```
### Bias Correction: The Cold Start Problem
Adam starts with m=0 and v=0, which creates a bias toward zero initially:
```
Without bias correction:             With bias correction:
Step 1: m = 0.9*0 + 0.1*g            Step 1: m̂ = m / (1-0.9¹) = m / 0.1
          = 0.1*g  (too small!)                = g  (correct!)
Step 2: m = 0.9*0.1*g + 0.1*g        Step 2: m̂ = m / (1-0.9²) = m / 0.19
          = 0.19*g (still small)               ≈ g  (better!)
```
**Key Insight:** Adam is like having an automatic transmission that adjusts gear ratios for each parameter individually.
"""
# %% nbgrader={"grade": false, "grade_id": "adam-optimizer", "solution": true}
#| export
class Adam(Optimizer):
    """
    Adam optimizer with adaptive learning rates.

    Adam computes individual adaptive learning rates for different parameters
    from estimates of first and second moments of the gradients.
    This makes it effective for problems with sparse gradients or noisy data.
    """
    def __init__(self, params: List[Tensor], lr: float = DEFAULT_LEARNING_RATE_ADAM,
                 betas: tuple = (DEFAULT_BETA1, DEFAULT_BETA2),
                 eps: float = DEFAULT_EPS, weight_decay: float = 0.0):
        """
        Initialize Adam optimizer.

        TODO: Set up Adam with adaptive learning rates

        APPROACH:
        1. Call parent constructor
        2. Store hyperparameters (lr, betas, eps, weight_decay)
        3. Initialize first and second moment buffers

        PARAMETERS:
        - lr: Learning rate (default: 0.001)
        - betas: Coefficients for computing running averages (default: (0.9, 0.999))
        - eps: Small constant for numerical stability (default: 1e-8)
        - weight_decay: L2 penalty coefficient (default: 0.0)

        EXAMPLE:
        >>> optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
        """
        ### BEGIN SOLUTION
        super().__init__(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        # Initialize moment buffers (created lazily)
        self.m_buffers = [None for _ in self.params]  # First moment (mean)
        self.v_buffers = [None for _ in self.params]  # Second moment (variance)
        ### END SOLUTION

    def step(self):
        """
        Perform Adam update step.

        TODO: Implement Adam parameter update with adaptive learning rates

        APPROACH:
        1. For each parameter with gradients:
           a. Apply weight decay if specified
           b. Update first moment estimate (momentum of gradient)
           c. Update second moment estimate (momentum of squared gradient)
           d. Compute bias-corrected moments
           e. Update parameter using adaptive learning rate

        FORMULAS:
        - m_t = β₁ * m_{t-1} + (1-β₁) * g_t
        - v_t = β₂ * v_{t-1} + (1-β₂) * g_t²
        - m̂_t = m_t / (1-β₁^t)
        - v̂_t = v_t / (1-β₂^t)
        - θ_t = θ_{t-1} - lr * m̂_t / (√v̂_t + ε)

        HINTS:
        - Initialize buffers as zeros on first use
        - Use step_count for bias correction
        - Square gradients element-wise for second moment
        """
        ### BEGIN SOLUTION
        # Increment step counter first (needed for bias correction)
        self.step_count += 1
        for i, param in enumerate(self.params):
            if param.grad is None:
                continue
            # Get gradient data - grad can be Tensor or numpy array
            grad = param.grad
            # Handle both Tensor (with .data) and numpy array (from autograd) cases
            if isinstance(grad, Tensor):
                grad_data = grad.data
            else:
                # grad is already a numpy array from autograd
                grad_data = grad
            # Apply weight decay
            if self.weight_decay != 0:
                grad_data = grad_data + self.weight_decay * param.data
            # Initialize buffers if needed
            if self.m_buffers[i] is None:
                self.m_buffers[i] = np.zeros_like(param.data)
                self.v_buffers[i] = np.zeros_like(param.data)
            # Update biased first moment estimate
            self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad_data
            # Update biased second moment estimate
            self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad_data ** 2)
            # Compute bias correction
            bias_correction1 = 1 - self.beta1 ** self.step_count
            bias_correction2 = 1 - self.beta2 ** self.step_count
            # Compute bias-corrected moments
            m_hat = self.m_buffers[i] / bias_correction1
            v_hat = self.v_buffers[i] / bias_correction2
            # Update parameter
            param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        ### END SOLUTION
# %% [markdown]
"""
### 🔬 Unit Test: Adam Optimizer
This test validates our Adam implementation works correctly.
**What we're testing**: Adam updates with adaptive learning rates and bias correction
**Why it matters**: Most popular optimizer for modern neural networks
**Expected**: Correct parameter updates following Adam formulas
"""
# %% nbgrader={"grade": true, "grade_id": "test-adam", "locked": true, "points": 20}
def test_unit_adam_optimizer():
    """🔬 Test Adam optimizer implementation."""
    print("🔬 Unit Test: Adam Optimizer...")
    # Test basic Adam functionality
    param = Tensor([1.0, 2.0], requires_grad=True)
    param.grad = Tensor([0.1, 0.2])
    optimizer = Adam([param], lr=0.01, betas=(0.9, 0.999), eps=1e-8)
    original_data = param.data.copy()
    # First step
    optimizer.step()
    # Manually compute expected values
    grad = np.array([0.1, 0.2])
    # First moment: m = 0.9 * 0 + 0.1 * grad = 0.1 * grad
    m = 0.1 * grad
    # Second moment: v = 0.999 * 0 + 0.001 * grad^2 = 0.001 * grad^2
    v = 0.001 * (grad ** 2)
    # Bias correction
    bias_correction1 = 1 - 0.9 ** 1  # = 0.1
    bias_correction2 = 1 - 0.999 ** 1  # = 0.001
    m_hat = m / bias_correction1  # = grad
    v_hat = v / bias_correction2  # = grad^2
    # Update
    expected = original_data - 0.01 * m_hat / (np.sqrt(v_hat) + 1e-8)
    assert np.allclose(param.data, expected, rtol=1e-6)
    assert optimizer.step_count == 1
    # Test second step to verify moment accumulation
    param.grad = Tensor([0.1, 0.2])
    optimizer.step()
    # Should have updated moments
    assert optimizer.m_buffers[0] is not None
    assert optimizer.v_buffers[0] is not None
    assert optimizer.step_count == 2
    # Test with weight decay
    param2 = Tensor([1.0, 2.0], requires_grad=True)
    param2.grad = Tensor([0.1, 0.2])
    optimizer_wd = Adam([param2], lr=0.01, weight_decay=0.01)
    optimizer_wd.step()
    # Weight decay should modify the effective gradient
    # grad_with_decay = [0.1, 0.2] + 0.01 * [1.0, 2.0] = [0.11, 0.22]
    # The exact computation is complex, but we can verify the parameter changed
    assert not np.array_equal(param2.data, np.array([1.0, 2.0]))
    print("✅ Adam optimizer works correctly!")

if __name__ == "__main__":
    test_unit_adam_optimizer()
# %% [markdown]
"""
## 🏗️ AdamW - Adam with Decoupled Weight Decay
AdamW fixes a subtle but important bug in Adam's weight decay implementation. The bug affects how regularization interacts with adaptive learning rates.
### The Adam Weight Decay Bug
In standard Adam, weight decay is added to gradients before the adaptive scaling:
```
Adam's approach (problematic):
1. gradient = computed_gradient + weight_decay * parameter
2. m = β₁ * m + (1-β₁) * gradient
3. v = β₂ * v + (1-β₂) * gradient²
4. step = m / √v
5. parameter = parameter - learning_rate * step
Problem: Weight decay gets "adapted" by the learning rate scaling!
```
### Why This Matters
Weight decay should be a consistent regularization force, but Adam makes it inconsistent:
```
Parameter Update Comparison:
Large gradients → small adaptive LR → weak weight decay effect
Small gradients → large adaptive LR → strong weight decay effect
This is backwards! We want consistent regularization.
```
### AdamW's Fix: Decoupled Weight Decay
AdamW separates gradient-based updates from weight decay:
```
AdamW's approach (correct):
1. m = β₁ * m + (1-β₁) * pure_gradient   ← NO weight decay here
2. v = β₂ * v + (1-β₂) * pure_gradient²
3. step = m / √v
4. parameter = parameter - learning_rate * step   ← gradient update
5. parameter = parameter * (1 - learning_rate * weight_decay)   ← separate decay
Result: Consistent regularization independent of gradient magnitudes!
```
### Visual Comparison
```
Adam weight decay:                 AdamW weight decay:

gradient ──┐                       gradient ──▶ adaptive ──▶ param
           ├──▶ adaptive ──▶ param              scaling      update
weight ────┘    scaling     update
decay                              weight ──▶ param
                                   decay      shrinkage

Coupled (inconsistent)             Decoupled (consistent)
```
**Key Insight:** AdamW treats optimization and regularization as separate, independent processes, leading to better training dynamics and generalization.
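The inconsistency is easy to see numerically. The sketch below (plain NumPy, illustrative variable names that are not part of this module's API) compares the size of the weight-decay contribution on a first optimization step: coupled decay gets divided by the adaptive denominator √v, so large gradients attenuate it, while decoupled decay depends only on the parameter:

```python
import numpy as np

lr, wd, eps = 0.1, 0.1, 1e-8
theta = np.array([1.0, 1.0])

for grad_scale in (0.01, 10.0):  # small vs. large gradients
    g = np.full_like(theta, grad_scale)

    # Coupled (Adam-style): decay folded into the gradient, then rescaled
    # by the adaptive denominator. v is the first-step second moment.
    g_coupled = g + wd * theta
    v = g_coupled ** 2
    coupled_decay = lr * (wd * theta) / (np.sqrt(v) + eps)

    # Decoupled (AdamW-style): decay applied directly to the parameter.
    decoupled_decay = lr * wd * theta

    print(f"grad={grad_scale:>5}: coupled decay ~{coupled_decay[0]:.4f}, "
          f"decoupled decay {decoupled_decay[0]:.4f}")
```

The coupled decay term varies by roughly two orders of magnitude between the small- and large-gradient cases, while the decoupled term is identical in both.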
"""
# %% nbgrader={"grade": false, "grade_id": "adamw-optimizer", "solution": true}
#| export
class AdamW(Optimizer):
"""
AdamW optimizer with decoupled weight decay.
AdamW fixes a bug in Adam's weight decay implementation by decoupling
weight decay from the gradient-based update. This leads to better
regularization and is the preferred version for most applications.
"""
def __init__(self, params: List[Tensor], lr: float = DEFAULT_LEARNING_RATE_ADAM, betas: tuple = (DEFAULT_BETA1, DEFAULT_BETA2), eps: float = DEFAULT_EPS, weight_decay: float = DEFAULT_WEIGHT_DECAY_ADAMW):
"""
Initialize AdamW optimizer.
TODO: Set up AdamW with decoupled weight decay
APPROACH:
1. Call parent constructor
2. Store hyperparameters (note higher default weight_decay)
3. Initialize moment buffers like Adam
KEY DIFFERENCE from Adam:
- Weight decay is applied directly to parameters, not added to gradients
- This provides better regularization behavior
EXAMPLE:
>>> optimizer = AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
"""
### BEGIN SOLUTION
super().__init__(params)
self.lr = lr
self.beta1, self.beta2 = betas
self.eps = eps
self.weight_decay = weight_decay
# Initialize moment buffers (same as Adam)
self.m_buffers = [None for _ in self.params]
self.v_buffers = [None for _ in self.params]
### END SOLUTION
def step(self):
"""
Perform AdamW update step with decoupled weight decay.
TODO: Implement AdamW parameter update
APPROACH:
1. For each parameter with gradients:
a. Update moments using gradients (NOT modified by weight decay)
b. Compute bias-corrected moments
c. Apply gradient-based update
d. Apply weight decay directly to parameters
KEY DIFFERENCE from Adam:
- Weight decay: θ_t = θ_t - lr * weight_decay * θ_t (applied after gradient update)
- NOT: grad = grad + weight_decay * param (Adam's incorrect approach)
FORMULAS:
- Same moment updates as Adam (using unmodified gradients)
- Gradient update: θ_t = θ_{t-1} - lr * m̂_t / (√v̂_t + ε)
- Weight decay: θ_t = θ_t * (1 - lr * weight_decay)
HINT: Apply weight decay after gradient update for proper decoupling
"""
### BEGIN SOLUTION
# Increment step counter first
self.step_count += 1
for i, param in enumerate(self.params):
if param.grad is None:
continue
# Get gradient data - grad can be Tensor or numpy array
grad = param.grad
# Handle both Tensor (with .data) and numpy array (from autograd) cases
if isinstance(grad, Tensor):
grad_data = grad.data
else:
# grad is already a numpy array from autograd
grad_data = grad
# Initialize buffers if needed
if self.m_buffers[i] is None:
self.m_buffers[i] = np.zeros_like(param.data)
self.v_buffers[i] = np.zeros_like(param.data)
# Update moments using pure gradients
self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad_data
self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad_data ** 2)
# Compute bias correction
bias_correction1 = 1 - self.beta1 ** self.step_count
bias_correction2 = 1 - self.beta2 ** self.step_count
# Compute bias-corrected moments
m_hat = self.m_buffers[i] / bias_correction1
v_hat = self.v_buffers[i] / bias_correction2
# Apply gradient-based update
param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
# Apply decoupled weight decay
if self.weight_decay != 0:
param.data = param.data * (1 - self.lr * self.weight_decay)
### END SOLUTION
# %% [markdown]
"""
### 🔬 Unit Test: AdamW Optimizer
This test validates our AdamW implementation with decoupled weight decay.
**What we're testing**: AdamW updates with proper weight decay decoupling
**Why it matters**: State-of-the-art optimizer for transformer models
**Expected**: Correct separation of gradient updates and weight decay
"""
# %% nbgrader={"grade": true, "grade_id": "test-adamw", "locked": true, "points": 20}
def test_unit_adamw_optimizer():
"""π¬ Test AdamW optimizer implementation."""
print("π¬ Unit Test: AdamW Optimizer...")
# Test AdamW vs Adam difference in weight decay
# Create identical parameters for comparison
param_adam = Tensor([1.0, 2.0], requires_grad=True)
param_adamw = Tensor([1.0, 2.0], requires_grad=True)
param_adam.grad = Tensor([0.1, 0.2])
param_adamw.grad = Tensor([0.1, 0.2])
# Create optimizers with same settings
adam = Adam([param_adam], lr=0.01, weight_decay=0.01)
adamw = AdamW([param_adamw], lr=0.01, weight_decay=0.01)
# Take one step
adam.step()
adamw.step()
# Results should be different due to weight decay implementation
assert not np.allclose(param_adam.data, param_adamw.data, rtol=1e-6)
# Test AdamW basic functionality
param = Tensor([1.0, 2.0], requires_grad=True)
param.grad = Tensor([0.1, 0.2])
optimizer = AdamW([param], lr=0.01, weight_decay=0.01)
original_data = param.data.copy()
optimizer.step()
# Parameter should have changed
assert not np.array_equal(param.data, original_data)
assert optimizer.step_count == 1
# Test that moment buffers are created
assert optimizer.m_buffers[0] is not None
assert optimizer.v_buffers[0] is not None
# Test zero weight decay behaves like Adam
param1 = Tensor([1.0, 2.0], requires_grad=True)
param2 = Tensor([1.0, 2.0], requires_grad=True)
param1.grad = Tensor([0.1, 0.2])
param2.grad = Tensor([0.1, 0.2])
adam_no_wd = Adam([param1], lr=0.01, weight_decay=0.0)
adamw_no_wd = AdamW([param2], lr=0.01, weight_decay=0.0)
adam_no_wd.step()
adamw_no_wd.step()
# Should be very similar (within numerical precision)
assert np.allclose(param1.data, param2.data, rtol=1e-10)
print("β
AdamW optimizer works correctly!")
if __name__ == "__main__":
test_unit_adamw_optimizer()
# %% [markdown]
"""
## 🔧 Integration: Bringing It Together
Now let's see how our optimizers perform in realistic scenarios. We'll compare their behavior on the same optimization problem to understand their different characteristics.
### Optimizer Behavior Comparison
Each optimizer takes a different approach to the same problem:
```
Optimization Problem: Find minimum of f(x) = x²

SGD approach:       Adam approach:      AdamW approach:
      │                   │                   │
  x ──▶ minimize      x ──▶ minimize      x ──▶ minimize
      │                   │                   │
  fixed LR            adaptive LR         adaptive LR + decay
```
"""
# %% [markdown]
"""
## 📊 Systems Analysis: Optimizer Performance and Memory
Different optimizers have very different resource requirements. Understanding these trade-offs is crucial for production ML systems.
### Memory Usage Patterns
```
Optimizer Memory Requirements (per parameter):

SGD:              Adam/AdamW:
┌──────────┐      ┌──────────┐
│  param   │      │  param   │
├──────────┤      ├──────────┤
│ momentum │      │    m     │ ◀ first moment
└──────────┘      ├──────────┤
                  │    v     │ ◀ second moment
                  └──────────┘
2× memory         3× memory
```
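These ratios can be verified with concrete byte counts. A small sketch using plain NumPy (the buffer names are illustrative, not the module's actual attributes):

```python
import numpy as np

n = 1_000_000  # 1M float32 parameters
param = np.zeros(n, dtype=np.float32)

sgd_buffers = [np.zeros_like(param)]                         # momentum
adam_buffers = [np.zeros_like(param), np.zeros_like(param)]  # m, v

sgd_total = param.nbytes + sum(b.nbytes for b in sgd_buffers)
adam_total = param.nbytes + sum(b.nbytes for b in adam_buffers)

print(f"params:     {param.nbytes / 1e6:.0f} MB")
print(f"SGD total:  {sgd_total / 1e6:.0f} MB ({sgd_total // param.nbytes}x)")
print(f"Adam total: {adam_total / 1e6:.0f} MB ({adam_total // param.nbytes}x)")
```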
### Computational Complexity
```
Per-step Operations:

SGD:                   Adam:
• 1 multiplication     • 3 multiplications
• 1 addition           • 4 additions
• 1 subtraction        • 1 subtraction
                       • 1 square root
                       • 1 division

O(n) simple ops        O(n) complex ops
```
"""
# %% nbgrader={"grade": false, "grade_id": "optimizer-analysis", "solution": true}
def analyze_optimizer_memory_usage():
"""π Analyze memory usage of different optimizers."""
print("π Analyzing Optimizer Memory Usage...")
# Create test parameters of different sizes
param_sizes = [1000, 10000, 100000] # 1K, 10K, 100K parameters
print("Optimizer Memory Analysis (per parameter tensor):")
print("=" * 60)
print(f"{'Size':<10} {'SGD':<10} {'Adam':<10} {'AdamW':<10} {'Ratio':<10}")
print("-" * 60)
for size in param_sizes:
# Create parameter
param = Tensor(np.random.randn(size), requires_grad=True)
param.grad = Tensor(np.random.randn(size))
# SGD memory (parameter + momentum buffer)
sgd = SGD([param], momentum=0.9)
sgd.step() # Initialize buffers
sgd_memory = size * 2 # param + momentum buffer
# Adam memory (parameter + 2 moment buffers)
param_adam = Tensor(np.random.randn(size), requires_grad=True)
param_adam.grad = Tensor(np.random.randn(size))
adam = Adam([param_adam])
adam.step() # Initialize buffers
adam_memory = size * 3 # param + m_buffer + v_buffer
# AdamW memory (same as Adam)
adamw_memory = adam_memory
# Memory ratio (Adam/SGD)
ratio = adam_memory / sgd_memory
print(f"{size:<10} {sgd_memory:<10} {adam_memory:<10} {adamw_memory:<10} {ratio:.1f}x")
print("\nπ‘ Key Insights:")
print("- SGD: 2Γ parameter memory (momentum buffer)")
print("- Adam/AdamW: 3Γ parameter memory (two moment buffers)")
print("- Memory scales linearly with model size")
print("- Trade-off: More memory for better convergence")
# %% nbgrader={"grade": false, "grade_id": "optimizer-convergence", "solution": true}
def analyze_optimizer_convergence_behavior():
"""π Analyze convergence behavior of different optimizers."""
print("π Analyzing Optimizer Convergence Behavior...")
# Simulate optimization of a quadratic function: f(x) = 0.5 * x^2
# Optimal solution: x* = 0, gradient = x
def quadratic_loss(x):
"""Simple quadratic function for optimization testing."""
return 0.5 * (x ** 2).sum()
def compute_gradient(x):
"""Gradient of quadratic function: df/dx = x."""
return x.copy()
# Starting point
x_start = np.array([5.0, -3.0, 2.0]) # Far from optimum [0, 0, 0]
# Test different optimizers
optimizers_to_test = [
("SGD", SGD, {"lr": 0.1}),
("SGD+Momentum", SGD, {"lr": 0.1, "momentum": 0.9}),
("Adam", Adam, {"lr": 0.1}),
("AdamW", AdamW, {"lr": 0.1, "weight_decay": 0.01})
]
print("Convergence Analysis (quadratic function f(x) = 0.5 * xΒ²):")
print("=" * 70)
print(f"{'Optimizer':<15} {'Step 0':<12} {'Step 5':<12} {'Step 10':<12} {'Final Loss':<12}")
print("-" * 70)
for name, optimizer_class, kwargs in optimizers_to_test:
# Reset parameter
param = Tensor(x_start.copy(), requires_grad=True)
optimizer = optimizer_class([param], **kwargs)
losses = []
# Run optimization for 10 steps
for step in range(11):
# Compute loss and gradient
loss = quadratic_loss(param.data)
param.grad = Tensor(compute_gradient(param.data))
losses.append(loss)
# Update parameters
if step < 10: # Don't update after last evaluation
optimizer.step()
optimizer.zero_grad()
# Format results
step0 = f"{losses[0]:.6f}"
step5 = f"{losses[5]:.6f}"
step10 = f"{losses[10]:.6f}"
final = f"{losses[10]:.6f}"
print(f"{name:<15} {step0:<12} {step5:<12} {step10:<12} {final:<12}")
print("\nπ‘ Key Insights:")
print("- SGD: Steady progress but can be slow")
print("- SGD+Momentum: Faster convergence, less oscillation")
print("- Adam: Adaptive rates help with different parameter scales")
print("- AdamW: Similar to Adam with regularization effects")
# %% [markdown]
"""
## 🧪 Module Integration Test
Final validation that everything works together correctly.
"""
# %% nbgrader={"grade": true, "grade_id": "module-integration", "locked": true, "points": 25}
def test_module():
"""π§ͺ Module Test: Complete Integration
Comprehensive test of entire module functionality.
This final test runs before module summary to ensure:
- All unit tests pass
- Functions work together correctly
- Module is ready for integration with TinyTorch
"""
print("π§ͺ RUNNING MODULE INTEGRATION TEST")
print("=" * 50)
# Run all unit tests
print("Running unit tests...")
test_unit_optimizer_base()
test_unit_sgd_optimizer()
test_unit_adam_optimizer()
test_unit_adamw_optimizer()
print("\nRunning integration scenarios...")
# Test realistic neural network optimization scenario
print("π¬ Integration Test: Multi-layer Network Optimization...")
# Import components from TinyTorch package (previous modules must be completed and exported)
from tinytorch.core.layers import Linear
from tinytorch.core.activations import ReLU
from tinytorch.core.losses import MSELoss
# Create parameters for a 2-layer network
# Layer 1: 3 inputs -> 4 hidden
W1 = Tensor(np.random.randn(3, 4) * 0.1, requires_grad=True)
b1 = Tensor(np.zeros(4), requires_grad=True)
# Layer 2: 4 hidden -> 2 outputs
W2 = Tensor(np.random.randn(4, 2) * 0.1, requires_grad=True)
b2 = Tensor(np.zeros(2), requires_grad=True)
params = [W1, b1, W2, b2]
# Add realistic gradients
W1.grad = Tensor(np.random.randn(3, 4) * 0.01)
b1.grad = Tensor(np.random.randn(4) * 0.01)
W2.grad = Tensor(np.random.randn(4, 2) * 0.01)
b2.grad = Tensor(np.random.randn(2) * 0.01)
# Test all optimizers on same network
optimizers = [
SGD(params, lr=0.01, momentum=0.9),
Adam([p for p in params], lr=0.001), # new list object, same parameter tensors
AdamW([p for p in params], lr=0.001, weight_decay=0.01) # new list object, same parameter tensors
]
# Save original parameter values
original_params = [p.data.copy() for p in params]
# Test SGD
optimizers[0].step()
sgd_params = [p.data.copy() for p in params]
# Restore parameters and test Adam
for i, p in enumerate(params):
p.data = original_params[i].copy()
# Re-add gradients since they may have been modified
if i == 0:
p.grad = Tensor(np.random.randn(3, 4) * 0.01)
elif i == 1:
p.grad = Tensor(np.random.randn(4) * 0.01)
elif i == 2:
p.grad = Tensor(np.random.randn(4, 2) * 0.01)
else:
p.grad = Tensor(np.random.randn(2) * 0.01)
# Update parameter references for Adam
optimizers[1].params = params
optimizers[1].step()
adam_params = [p.data.copy() for p in params]
# Restore parameters and test AdamW
for i, p in enumerate(params):
p.data = original_params[i].copy()
# Re-add gradients
if i == 0:
p.grad = Tensor(np.random.randn(3, 4) * 0.01)
elif i == 1:
p.grad = Tensor(np.random.randn(4) * 0.01)
elif i == 2:
p.grad = Tensor(np.random.randn(4, 2) * 0.01)
else:
p.grad = Tensor(np.random.randn(2) * 0.01)
# Update parameter references for AdamW
optimizers[2].params = params
optimizers[2].step()
adamw_params = [p.data.copy() for p in params]
# Verify parameters changed differently for each optimizer
for i in range(len(params)):
# Parameters should be different from original
assert not np.array_equal(sgd_params[i], original_params[i])
assert not np.array_equal(adam_params[i], original_params[i])
assert not np.array_equal(adamw_params[i], original_params[i])
# Different optimizers should produce different results
assert not np.allclose(sgd_params[i], adam_params[i], rtol=1e-6)
print("β
Multi-layer network optimization works!")
# Test optimizer state management
print("π¬ Integration Test: Optimizer State Management...")
param = Tensor([1.0, 2.0], requires_grad=True)
param.grad = Tensor([0.1, 0.2])
optimizer = Adam([param], lr=0.001)
# First step should initialize buffers
optimizer.step()
assert optimizer.m_buffers[0] is not None
assert optimizer.v_buffers[0] is not None
assert optimizer.step_count == 1
# Zero grad should clear gradients but preserve optimizer state
optimizer.zero_grad()
assert param.grad is None
assert optimizer.m_buffers[0] is not None # State preserved
assert optimizer.step_count == 1 # Step count preserved
print("β
Optimizer state management works!")
print("\n" + "=" * 50)
print("π ALL TESTS PASSED! Module ready for export.")
print("Run: tito module complete 07_optimizers")
# %% [markdown]
"""
## 🤔 ML Systems Thinking
Now that your optimizers work, let's explore the systems trade-offs between them. Every optimizer choice affects memory usage, convergence speed, and training stability.
### Questions to Consider
**Q1: Memory vs Performance**
You've implemented SGD (2× memory) and Adam (3× memory). For a model with 10 billion parameters at float32 (4 bytes each):
- How much total memory does each optimizer require?
- At what model size does Adam's extra 50% memory overhead become prohibitive?
- What real-world constraints might force you to choose SGD over Adam?
**Q2: Learning Rate Sensitivity**
SGD uses a fixed learning rate for all parameters, while Adam adapts per-parameter:
- Why might Adam converge faster on problems with parameters at different scales?
- When might SGD's uniform learning rate actually be an advantage?
- How does momentum in SGD relate to Adam's first moment estimation?
**Q3: Optimizer State Management**
Adam and AdamW maintain momentum buffers (m, v) that persist across training steps:
- What happens to these buffers when you checkpoint during training?
- If you resume training with different hyperparameters, should you restore the old buffers?
- How does optimizer state affect distributed training across multiple GPUs?
**Q4: Weight Decay Trade-offs**
AdamW decouples weight decay from gradient updates:
- Why does Adam's coupled weight decay behave inconsistently?
- In what scenarios would AdamW's consistent regularization matter most?
- How does weight decay interact with learning rate schedules?
### Systems Implications
**Memory Hierarchy:**
```
Model Size: 1B parameters (4GB)
├─ SGD: 8GB total (4GB params + 4GB momentum)
├─ Adam: 12GB total (4GB params + 4GB m + 4GB v)
└─ Impact: May not fit in GPU memory, forcing:
   • Smaller batch sizes
   • Model parallelism
   • Optimizer state sharding (ZeRO optimization)
```
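The arithmetic behind these totals scales linearly with parameter count. A back-of-envelope sketch (float32 only; gradients, activations, and mixed-precision copies are ignored, and the helper name is illustrative):

```python
BYTES_PER_PARAM = 4  # float32

def optimizer_memory_gb(n_params: int, extra_buffers: int) -> float:
    """Parameters plus `extra_buffers` same-sized state tensors, in GB."""
    return n_params * BYTES_PER_PARAM * (1 + extra_buffers) / 1e9

for n in (1_000_000_000, 10_000_000_000):  # 1B and 10B parameters
    print(f"{n // 1_000_000_000}B params: "
          f"SGD+momentum {optimizer_memory_gb(n, 1):.0f} GB, "
          f"Adam/AdamW {optimizer_memory_gb(n, 2):.0f} GB")
```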
**Convergence Patterns:**
- **SGD + Momentum:** Steady progress, may need learning rate tuning
- **Adam:** Fast initial convergence, may overfit without proper regularization
- **AdamW:** Adam's speed + better generalization, standard for transformers
**Production Considerations:**
- **Training cost:** Adam's extra memory means fewer models per GPU
- **Hyperparameter tuning:** SGD more sensitive to learning rate choice
- **Model generalization:** AdamW often generalizes better than Adam
- **Checkpoint size:** Adam checkpoints are 1.5× larger than SGD
### Performance Analysis
Our earlier analysis functions revealed:
- `analyze_optimizer_memory_usage()`: Adam requires exactly 1.5× SGD's memory
- `analyze_optimizer_convergence_behavior()`: Adam often converges in fewer steps
**The Key Insight:**
Optimizer choice is a systems trade-off between:
- **Memory budget** (can you afford 3× parameter memory?)
- **Convergence speed** (how many training steps can you afford?)
- **Generalization quality** (does your model perform well on unseen data?)
There's no universally best optimizer, only the right choice for your constraints!
"""
# %% [markdown]
"""
## ⭐ Aha Moment: Optimizers Update Weights
**What you built:** Optimization algorithms (SGD, Adam) that update neural network weights.
**Why it matters:** Gradients tell us which direction reduces the loss, but someone has to
actually move the weights. That's what optimizers do! SGD takes simple steps, while Adam
adapts the learning rate for each parameter, like having a personal trainer for each weight.
In the next module, you'll combine optimizers with a training loop to actually train networks!
"""
# %%
def demo_optimizers():
"""π― See optimizers update weights."""
print("π― AHA MOMENT: Optimizers Update Weights")
print("=" * 45)
# Create a parameter with a gradient
weight = Tensor(np.array([5.0]), requires_grad=True)
weight.grad = np.array([1.0]) # Gradient pointing "uphill"
print(f"Initial weight: {weight.data[0]:.2f}")
print(f"Gradient: {weight.grad[0]:.2f} (pointing uphill)")
# SGD takes a step in the opposite direction
optimizer = SGD([weight], lr=0.5)
optimizer.step()
print(f"\nAfter SGD step: {weight.data[0]:.2f}")
print(f"Moved: {5.0 - weight.data[0]:.2f} (opposite to gradient)")
print("\n⨠Optimizer moves weights to reduce loss!")
# %%
if __name__ == "__main__":
test_module()
print("\n")
demo_optimizers()
# %% [markdown]
"""
## 🎉 MODULE SUMMARY: Optimizers
Congratulations! You've built sophisticated optimization algorithms that power modern neural network training!
### Key Accomplishments
- Built SGD optimizer with momentum for stable gradient descent and oscillation reduction
- Implemented Adam optimizer with adaptive learning rates and bias correction for different parameter scales
- Created AdamW optimizer with decoupled weight decay for proper regularization
- Analyzed memory trade-offs: SGD (2×), Adam/AdamW (3× parameter memory)
- All tests pass ✅ (validated by `test_module()`)
### Ready for Next Steps
Your optimizer implementations enable sophisticated neural network training! With gradients from Module 06 and optimizers from Module 07, you're ready to build complete training loops.
Export with: `tito module complete 07_optimizers`
**Next**: Module 08 will add training loops, learning rate scheduling, and checkpointing for complete end-to-end neural network training!
""" |