File size: 54,391 Bytes
af4939b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.17.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# %% [markdown]
"""
# Module 07: Optimizers - Sophisticated Learning Algorithms

Welcome to Module 07! You'll build optimizers that enable neural networks to learn from gradients using sophisticated algorithms.

## πŸ”— Prerequisites & Progress
**You've Built**: Tensor with gradients (Modules 01-06)
**You'll Build**: SGD, Adam, and AdamW optimizers with sophisticated momentum and adaptive learning
**You'll Enable**: Modern optimization algorithms that power state-of-the-art neural networks

**Connection Map**:
```
Gradients β†’ Optimizers β†’ Training
(Module 06)  (Module 07)  (Module 08)
```

## 🎯 Learning Objectives
By the end of this module, you will:
1. Implement SGD with momentum for stable gradient descent
2. Build Adam optimizer with adaptive learning rates
3. Create AdamW optimizer with decoupled weight decay
4. Understand memory and computational trade-offs in optimization algorithms

Let's get started!

## πŸ“¦ Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/07_optimizers/optimizers_dev.py`
**Building Side:** Code exports to `tinytorch.core.optimizers`

```python
# How to use this module:
from tinytorch.core.optimizers import SGD, Adam, AdamW
```

**Why this matters:**
- **Learning:** Complete optimization system for modern neural network training
- **Production:** Proper organization like PyTorch's torch.optim with all optimization algorithms together
- **Consistency:** All optimization logic and parameter updating in core.optimizers
- **Integration:** Works seamlessly with gradients from Module 06 for complete training capability
"""

# %% nbgrader={"grade": false, "grade_id": "imports", "solution": true}
#| default_exp core.optimizers
#| export

import numpy as np
from typing import List, Union, Optional, Dict, Any

# Import Tensor from Module 01 (now with gradient support from Module 06)
from tinytorch.core.tensor import Tensor

# Enable autograd to add gradient tracking to Tensor
# This module depends on Module 06 (Autograd) being available
from tinytorch.core.autograd import enable_autograd
enable_autograd()

# Constants for optimizer defaults
DEFAULT_LEARNING_RATE_SGD = 0.01  # Default learning rate for SGD
DEFAULT_LEARNING_RATE_ADAM = 0.001  # Default learning rate for Adam/AdamW
DEFAULT_MOMENTUM = 0.9  # Default momentum for SGD
DEFAULT_BETA1 = 0.9  # First moment decay rate for Adam
DEFAULT_BETA2 = 0.999  # Second moment decay rate for Adam
DEFAULT_EPS = 1e-8  # Small epsilon for numerical stability in Adam
DEFAULT_WEIGHT_DECAY_ADAMW = 0.01  # Default weight decay for AdamW

# %% [markdown]
"""
## πŸ’‘ Introduction: What are Optimizers?

Optimizers are the engines that drive neural network learning. They take gradients computed from your loss function and use them to update model parameters toward better solutions. Think of optimization as navigating a complex landscape where you're trying to find the lowest valley (minimum loss).

### The Optimization Challenge

Imagine you're hiking in dense fog, trying to reach the bottom of a valley. You can only feel the slope under your feet (the gradient), but you can't see where you're going. Different optimization strategies are like different hiking approaches:

```
Loss Landscape (2D visualization):
       πŸ”οΈ
      /  \\
   🚢 /    \\
    /      \\
   /   🎯   \\  ← Global minimum (goal)
  /          \\
 πŸ”οΈ          πŸ”οΈ

Challenge: Navigate to 🎯 using only local slope information!
```

### Our Optimizer Toolkit

**SGD (Stochastic Gradient Descent)**
- Strategy: Always step downhill
- Problem: Can get stuck oscillating in narrow valleys
- Solution: Add momentum to "coast" through oscillations

**Adam (Adaptive Moment Estimation)**
- Strategy: Adapt step size for each parameter individually
- Advantage: Different learning rates for different dimensions
- Key Insight: Some directions need big steps, others need small steps

**AdamW (Adam with Weight Decay)**
- Strategy: Adam + proper regularization
- Fix: Separates optimization from regularization
- Result: Better generalization and training stability

### The Mathematics Behind Movement

At its core, optimization follows: **ΞΈ_new = ΞΈ_old - Ξ± * direction**

Where:
- `ΞΈ` = parameters (your position in the landscape)
- `Ξ±` = step size (learning rate)
- `direction` = where to step (gradient-based)

But sophisticated optimizers do much more than basic gradient descent!
"""

# %% [markdown]
"""
## πŸ“ Foundations: Mathematical Background

### Understanding Momentum: The Physics of Optimization

Momentum in optimization works like momentum in physics. A ball rolling down a hill doesn't immediately change direction when it hits a small bump - it has momentum that carries it forward.

```
Without Momentum (SGD):           With Momentum:
     ↓                                β†˜οΈ
  ←  β€’  β†’  ← oscillation           β†’  β€’  β†’ smooth path
     ↑                                ↙️

Narrow valley problem:            Momentum solution:
|\\     /|                        |\\     /|
| \\ β€’ / | ← ping-pong             | \\ β€’β†’/ | ← smoother
|  \\ /  |   motion                |  \\ /  |   descent
|   ●   |                        |   ●   |
```

**SGD with Momentum Formula:**
```
velocity = Ξ² * previous_velocity + (1-Ξ²) * current_gradient
parameter = parameter - learning_rate * velocity

Where Ξ² β‰ˆ 0.9 means "90% memory of previous direction"
```

### Adam: Adaptive Learning for Each Parameter

Adam solves a key problem: different parameters need different learning rates. Imagine adjusting the focus and zoom on a camera - you need fine control for focus but coarse control for zoom.

```
Parameter Landscape (2 dimensions):

   param2
     ^
     |
   😞|    steep gradient
     |    (needs small steps)
     |
  ---+--●--β†’ param1
     |     \\
     |      \\ gentle gradient
     |       \\ (needs big steps)

Adam Solution: Automatic step size per parameter!
```

**Adam's Two-Memory System:**

1. **First Moment (m)**: "Which direction am I usually going?"
   - `m = β₁ * old_m + (1-β₁) * gradient`
   - Like momentum, but for direction

2. **Second Moment (v)**: "How big are my gradients usually?"
   - `v = Ξ²β‚‚ * old_v + (1-Ξ²β‚‚) * gradientΒ²`
   - Tracks gradient magnitude

3. **Adaptive Update**:
   - `step_size = m / √v`
   - Big gradients β†’ smaller steps
   - Small gradients β†’ relatively bigger steps

### AdamW: Fixing Weight Decay

Adam has a subtle bug in how it applies weight decay (regularization). AdamW fixes this:

```
Adam (incorrect):               AdamW (correct):
gradient += weight_decay * param    [compute gradient update]
update_param_with_gradient()        param -= learning_rate * gradient_update
                                   param *= (1 - weight_decay)  ← separate!

Why it matters:
- Adam: Weight decay affected by adaptive learning rates
- AdamW: Weight decay is consistent regardless of gradients
```
"""

# %% [markdown]
"""
## πŸ—οΈ Implementation: Building Optimizers

Now we'll implement each optimizer step by step, following the pattern: understand the algorithm β†’ implement it β†’ test it immediately. Each optimizer builds on the foundation of the previous one.

### Implementation Strategy

```
Optimizer Base Class
    ↓
SGD (foundation algorithm)
    ↓
SGD + Momentum (reduce oscillations)
    ↓
Adam (adaptive learning rates)
    ↓
AdamW (proper weight decay)
```
"""

# %% nbgrader={"grade": false, "grade_id": "optimizer-base", "solution": true}
#| export
class Optimizer:
    """
    Base class for all optimizers.

    This class defines the common interface that all optimizers must implement:
    - zero_grad(): Clear gradients from parameters
    - step(): Update parameters based on gradients
    """

    def __init__(self, params: List[Tensor]):
        """
        Initialize optimizer with parameters to optimize.

        TODO: Set up the parameter list for optimization

        APPROACH:
        1. Store parameters as a list for iteration
        2. Validate that all parameters require gradients
        3. Initialize step counter for algorithms that need it

        EXAMPLE:
        >>> linear = Linear(784, 128)
        >>> optimizer = SGD(linear.parameters(), lr=0.01)

        HINT: Store parameters for iteration during optimization steps
        """
        ### BEGIN SOLUTION
        # Validate and store parameters
        if not isinstance(params, list):
            params = list(params)

        # Store parameters - gradient tracking is handled by autograd module
        self.params = params
        self.step_count = 0  # For algorithms that need step counting
        ### END SOLUTION

    def zero_grad(self):
        """
        Clear gradients from all parameters.

        TODO: Reset all parameter gradients to None

        APPROACH:
        1. Iterate through all parameters
        2. Set each parameter's grad to None

        EXAMPLE:
        >>> optimizer.zero_grad()  # Clears all gradients
        >>> assert param.grad is None for param in optimizer.params

        WHY: Gradients accumulate by default, so we need to clear them between batches
        """
        ### BEGIN SOLUTION
        for param in self.params:
            param.grad = None
        ### END SOLUTION

    def step(self):
        """
        Update parameters based on gradients.

        This is abstract - each optimizer implements its own update rule.
        """
        raise NotImplementedError("Subclasses must implement step()")

# %% [markdown]
"""
### πŸ”¬ Unit Test: Base Optimizer
This test validates our base Optimizer class works correctly.
**What we're testing**: Parameter validation and zero_grad functionality
**Why it matters**: Foundation for all specific optimizer implementations
**Expected**: Proper parameter storage and gradient clearing
"""

# %% nbgrader={"grade": true, "grade_id": "test-optimizer-base", "locked": true, "points": 10}
def test_unit_optimizer_base():
    """πŸ”¬ Test base Optimizer functionality."""
    print("πŸ”¬ Unit Test: Base Optimizer...")

    # Create test parameters
    param1 = Tensor([1.0, 2.0], requires_grad=True)
    param2 = Tensor([[3.0, 4.0], [5.0, 6.0]], requires_grad=True)

    # Add some gradients
    param1.grad = Tensor([0.1, 0.2])
    param2.grad = Tensor([[0.3, 0.4], [0.5, 0.6]])

    # Create optimizer
    optimizer = Optimizer([param1, param2])

    # Test parameter storage
    assert len(optimizer.params) == 2
    assert optimizer.params[0] is param1
    assert optimizer.params[1] is param2
    assert optimizer.step_count == 0

    # Test zero_grad
    optimizer.zero_grad()
    assert param1.grad is None
    assert param2.grad is None

    # Test that optimizer accepts any tensor (no validation required)
    # Gradient tracking is handled by the autograd module
    regular_param = Tensor([1.0])
    opt = Optimizer([regular_param])
    assert len(opt.params) == 1

    print("βœ… Base Optimizer works correctly!")

if __name__ == "__main__":
    test_unit_optimizer_base()

# %% [markdown]
"""
## πŸ—οΈ SGD - Stochastic Gradient Descent

SGD is the foundation of neural network perf. It implements the simple but powerful idea: "move in the direction opposite to the gradient."

### Why SGD Works

Gradients point uphill (toward higher loss). To minimize loss, we go downhill:

```
Loss Surface (side view):

    Loss
     ^
     |
  πŸ“ˆ |     current position
     |    /
     |   β€’ ← you are here
     |  / \\
     | /   \\ gradient points uphill
     |/     \\
     ●-------\\--β†’ parameters
      \\        \\
       \\        β†˜οΈ SGD steps downhill
        \\        (opposite to gradient)
         \\⭐ ← goal (minimum loss)
```

### The Oscillation Problem

Pure SGD can get trapped oscillating in narrow valleys:

```
Narrow valley (top view):
  \\     /
   \\   /   ← steep sides
    \\ /
  4← β€’ β†’2  ← SGD bounces back and forth
    / \\
   1   3   instead of going down the valley
  /     \\
 ●       \\
 goal     \\
```

### Momentum Solution

Momentum remembers the direction you were going and continues in that direction:

```
With momentum:
  \\     /
   \\   /
    \\ /
     β€’  ← smooth path down the valley
    / ↓
   /   ↓
  ●    ↓  momentum carries us through oscillations
 goal
```

**Implementation:** SGD keeps a "velocity" buffer that accumulates momentum.
"""

# %% nbgrader={"grade": false, "grade_id": "sgd-optimizer", "solution": true}
#| export
class SGD(Optimizer):
    """
    Stochastic Gradient Descent with momentum.

    SGD is the foundational optimization algorithm that moves parameters
    in the direction opposite to gradients. With momentum, it remembers
    previous updates to reduce oscillations and accelerate convergence.
    """

    def __init__(self, params: List[Tensor], lr: float = DEFAULT_LEARNING_RATE_SGD, momentum: float = 0.0, weight_decay: float = 0.0):
        """
        Initialize SGD optimizer.

        TODO: Set up SGD with momentum and weight decay

        APPROACH:
        1. Call parent constructor to set up parameters
        2. Store learning rate, momentum, and weight decay
        3. Initialize momentum buffers for each parameter

        EXAMPLE:
        >>> optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)

        HINTS:
        - Momentum buffers should be initialized as None
        - They'll be created lazily on first step
        """
        ### BEGIN SOLUTION
        super().__init__(params)

        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay

        # Initialize momentum buffers (created lazily)
        self.momentum_buffers = [None for _ in self.params]
        ### END SOLUTION

    def has_momentum(self) -> bool:
        """
        Check if this optimizer uses momentum.

        This explicit API method replaces the need for hasattr() checks
        in checkpointing code (Module 08).

        Returns:
            bool: True if momentum is enabled (momentum > 0), False otherwise

        EXAMPLE:
            >>> optimizer = SGD(params, lr=0.01, momentum=0.9)
            >>> optimizer.has_momentum()
            True
        """
        return self.momentum > 0

    def get_momentum_state(self) -> Optional[List]:
        """
        Get momentum buffers for checkpointing.

        This explicit API method provides safe access to momentum buffers
        without using hasattr(), making the API contract clear.

        Returns:
            Optional[List]: List of momentum buffers if momentum is enabled,
                          None otherwise

        EXAMPLE:
            >>> optimizer = SGD(params, lr=0.01, momentum=0.9)
            >>> optimizer.step()  # Initialize buffers
            >>> state = optimizer.get_momentum_state()
            >>> # Later: optimizer.set_momentum_state(state)
        """
        if not self.has_momentum():
            return None
        return [buf.copy() if buf is not None else None
                for buf in self.momentum_buffers]

    def set_momentum_state(self, state: Optional[List]) -> None:
        """
        Restore momentum buffers from checkpointing.

        This explicit API method provides safe restoration of momentum state
        without using hasattr().

        Args:
            state: List of momentum buffers or None

        EXAMPLE:
            >>> optimizer = SGD(params, lr=0.01, momentum=0.9)
            >>> state = optimizer.get_momentum_state()
            >>> # Training interruption...
            >>> new_optimizer = SGD(params, lr=0.01, momentum=0.9)
            >>> new_optimizer.set_momentum_state(state)
        """
        if state is None or not self.has_momentum():
            return

        if len(state) != len(self.momentum_buffers):
            raise ValueError(
                f"State length {len(state)} doesn't match "
                f"optimizer parameters {len(self.momentum_buffers)}"
            )

        for i, buf in enumerate(state):
            if buf is not None:
                self.momentum_buffers[i] = buf.copy()

    def step(self):
        """
        Perform SGD update step with momentum.

        TODO: Implement SGD parameter update with momentum

        APPROACH:
        1. For each parameter with gradients:
           a. Apply weight decay if specified
           b. Update momentum buffer
           c. Update parameter using momentum

        FORMULA:
        - With weight decay: grad = grad + weight_decay * param
        - Momentum: v = momentum * v_prev + grad
        - Update: param = param - lr * v

        HINTS:
        - Skip parameters without gradients
        - Initialize momentum buffers on first use
        - Use in-place operations to save memory
        """
        ### BEGIN SOLUTION
        for i, param in enumerate(self.params):
            if param.grad is None:
                continue

            # Get gradient data - grad can be Tensor or numpy array
            grad = param.grad
            # Handle both Tensor (with .data) and numpy array (from autograd) cases
            if isinstance(grad, Tensor):
                grad_data = grad.data
            else:
                # grad is already a numpy array from autograd
                grad_data = grad

            # Apply weight decay
            if self.weight_decay != 0:
                grad_data = grad_data + self.weight_decay * param.data

            # Update momentum buffer
            if self.momentum != 0:
                if self.momentum_buffers[i] is None:
                    # Initialize momentum buffer
                    self.momentum_buffers[i] = np.zeros_like(param.data)

                # Update momentum: v = momentum * v_prev + grad
                self.momentum_buffers[i] = self.momentum * self.momentum_buffers[i] + grad_data
                grad_data = self.momentum_buffers[i]

            # Update parameter: param = param - lr * grad
            param.data = param.data - self.lr * grad_data

        # Increment step counter
        self.step_count += 1
        ### END SOLUTION

# %% [markdown]
"""
### πŸ”¬ Unit Test: SGD Optimizer
This test validates our SGD implementation works correctly.
**What we're testing**: SGD updates with and without momentum
**Why it matters**: Core optimization algorithm used in neural network training
**Expected**: Correct parameter updates following SGD formulas
"""

# %% nbgrader={"grade": true, "grade_id": "test-sgd", "locked": true, "points": 15}
def test_unit_sgd_optimizer():
    """πŸ”¬ Test SGD optimizer implementation."""
    print("πŸ”¬ Unit Test: SGD Optimizer...")

    # Test basic SGD without momentum
    param = Tensor([1.0, 2.0], requires_grad=True)
    param.grad = Tensor([0.1, 0.2])

    optimizer = SGD([param], lr=0.1)
    original_data = param.data.copy()

    optimizer.step()

    # Expected: param = param - lr * grad = [1.0, 2.0] - 0.1 * [0.1, 0.2] = [0.99, 1.98]
    expected = original_data - 0.1 * param.grad.data
    assert np.allclose(param.data, expected)
    assert optimizer.step_count == 1

    # Test SGD with momentum
    param2 = Tensor([1.0, 2.0], requires_grad=True)
    param2.grad = Tensor([0.1, 0.2])

    optimizer_momentum = SGD([param2], lr=0.1, momentum=0.9)

    # First step: v = 0.9 * 0 + [0.1, 0.2] = [0.1, 0.2]
    optimizer_momentum.step()
    expected_first = np.array([1.0, 2.0]) - 0.1 * np.array([0.1, 0.2])
    assert np.allclose(param2.data, expected_first)

    # Second step with same gradient
    param2.grad = Tensor([0.1, 0.2])
    optimizer_momentum.step()
    # v = 0.9 * [0.1, 0.2] + [0.1, 0.2] = [0.19, 0.38]
    expected_momentum = np.array([0.19, 0.38])
    expected_second = expected_first - 0.1 * expected_momentum
    assert np.allclose(param2.data, expected_second, rtol=1e-5)

    # Test weight decay
    param3 = Tensor([1.0, 2.0], requires_grad=True)
    param3.grad = Tensor([0.1, 0.2])

    optimizer_wd = SGD([param3], lr=0.1, weight_decay=0.01)
    optimizer_wd.step()

    # grad_with_decay = [0.1, 0.2] + 0.01 * [1.0, 2.0] = [0.11, 0.22]
    expected_wd = np.array([1.0, 2.0]) - 0.1 * np.array([0.11, 0.22])
    assert np.allclose(param3.data, expected_wd)

    print("βœ… SGD optimizer works correctly!")

if __name__ == "__main__":
    test_unit_sgd_optimizer()

# %% [markdown]
"""
## πŸ—οΈ Adam - Adaptive Moment Estimation

Adam solves a fundamental problem with SGD: different parameters often need different learning rates. Think of tuning a complex system where some knobs need gentle adjustments and others need bold changes.

### The Parameter Scaling Problem

Consider a neural network with both embedding weights and output weights:

```
Parameter Sensitivity Landscape:

  output_weight                 embedding_weight
       ↑                              ↑
       |                              |
    😱 |  steep cliff                 |  🐌 gentle slope
       |  (needs tiny steps)          |  (needs big steps)
       |                              |
    ━━━●━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━●━━━→

Same learning rate = disaster!
β€’ Small LR: output weights learn fast, embeddings crawl
β€’ Large LR: embeddings learn well, output weights explode
```

### Adam's Adaptive Solution

Adam automatically adjusts learning rates by tracking two statistics:

```
1. MOMENTUM (first moment): "Which way am I usually going?"
   m = 0.9 * old_direction + 0.1 * current_gradient

   Visualization:
   old: β†’β†’β†’β†’
   new:     ↗️
   m:   →→→↗️  (weighted average)

2. SCALE (second moment): "How big are my steps usually?"
   v = 0.999 * old_scale + 0.001 * (current_gradient)Β²

   Big gradients β†’ bigger v β†’ smaller effective steps
   Small gradients β†’ smaller v β†’ bigger effective steps

3. ADAPTIVE UPDATE:
   step = momentum / √scale
   param = param - learning_rate * step
```

### Bias Correction: The Cold Start Problem

Adam starts with m=0 and v=0, which creates a bias toward zero initially:

```
Without bias correction:    With bias correction:

Step 1: m = 0.9*0 + 0.1*g    Step 1: mΜ‚ = m / (1-0.9ΒΉ) = m / 0.1
       = 0.1*g (too small!)           = g (correct!)

Step 2: m = 0.9*0.1*g + 0.1*g Step 2: mΜ‚ = m / (1-0.9Β²) = m / 0.19
       = 0.19*g (still small)         β‰ˆ g (better!)
```

**Key Insight:** Adam is like having an automatic transmission that adjusts gear ratios for each parameter individually.
"""

# %% nbgrader={"grade": false, "grade_id": "adam-optimizer", "solution": true}
#| export
class Adam(Optimizer):
    """
    Adam optimizer with adaptive learning rates.

    Adam computes individual adaptive learning rates for different parameters
    from estimates of first and second moments of the gradients.
    This makes it effective for problems with sparse gradients or noisy data.
    """

    def __init__(self, params: List[Tensor], lr: float = DEFAULT_LEARNING_RATE_ADAM, betas: tuple = (DEFAULT_BETA1, DEFAULT_BETA2), eps: float = DEFAULT_EPS, weight_decay: float = 0.0):
        """
        Initialize Adam optimizer.

        TODO: Set up Adam with adaptive learning rates

        APPROACH:
        1. Call parent constructor
        2. Store hyperparameters (lr, betas, eps, weight_decay)
        3. Initialize first and second moment buffers

        PARAMETERS:
        - lr: Learning rate (default: 0.001)
        - betas: Coefficients for computing running averages (default: (0.9, 0.999))
        - eps: Small constant for numerical stability (default: 1e-8)
        - weight_decay: L2 penalty coefficient (default: 0.0)

        EXAMPLE:
        >>> optimizer = Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
        """
        ### BEGIN SOLUTION
        super().__init__(params)

        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay

        # Initialize moment buffers (created lazily)
        self.m_buffers = [None for _ in self.params]  # First moment (mean)
        self.v_buffers = [None for _ in self.params]  # Second moment (variance)
        ### END SOLUTION

    def step(self):
        """
        Perform Adam update step.

        TODO: Implement Adam parameter update with adaptive learning rates

        APPROACH:
        1. For each parameter with gradients:
           a. Apply weight decay if specified
           b. Update first moment estimate (momentum of gradient)
           c. Update second moment estimate (momentum of squared gradient)
           d. Compute bias-corrected moments
           e. Update parameter using adaptive learning rate

        FORMULAS:
        - m_t = β₁ * m_{t-1} + (1-β₁) * g_t
        - v_t = Ξ²β‚‚ * v_{t-1} + (1-Ξ²β‚‚) * g_tΒ²
        - mΜ‚_t = m_t / (1-β₁^t)
        - vΜ‚_t = v_t / (1-Ξ²β‚‚^t)
        - ΞΈ_t = ΞΈ_{t-1} - lr * mΜ‚_t / (√vΜ‚_t + Ξ΅)

        HINTS:
        - Initialize buffers as zeros on first use
        - Use step_count for bias correction
        - Square gradients element-wise for second moment
        """
        ### BEGIN SOLUTION
        # Increment step counter first (needed for bias correction)
        self.step_count += 1

        for i, param in enumerate(self.params):
            if param.grad is None:
                continue

            # Get gradient data - grad can be Tensor or numpy array
            grad = param.grad
            # Handle both Tensor (with .data) and numpy array (from autograd) cases
            if isinstance(grad, Tensor):
                grad_data = grad.data
            else:
                # grad is already a numpy array from autograd
                grad_data = grad

            # Apply weight decay
            if self.weight_decay != 0:
                grad_data = grad_data + self.weight_decay * param.data

            # Initialize buffers if needed
            if self.m_buffers[i] is None:
                self.m_buffers[i] = np.zeros_like(param.data)
                self.v_buffers[i] = np.zeros_like(param.data)

            # Update biased first moment estimate
            self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad_data

            # Update biased second moment estimate
            self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad_data ** 2)

            # Compute bias correction
            bias_correction1 = 1 - self.beta1 ** self.step_count
            bias_correction2 = 1 - self.beta2 ** self.step_count

            # Compute bias-corrected moments
            m_hat = self.m_buffers[i] / bias_correction1
            v_hat = self.v_buffers[i] / bias_correction2

            # Update parameter
            param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        ### END SOLUTION

# %% [markdown]
"""
### πŸ”¬ Unit Test: Adam Optimizer
This test validates our Adam implementation works correctly.
**What we're testing**: Adam updates with adaptive learning rates and bias correction
**Why it matters**: Most popular optimizer for modern neural networks
**Expected**: Correct parameter updates following Adam formulas
"""

# %% nbgrader={"grade": true, "grade_id": "test-adam", "locked": true, "points": 20}
def test_unit_adam_optimizer():
    """πŸ”¬ Test Adam optimizer implementation."""
    print("πŸ”¬ Unit Test: Adam Optimizer...")

    # Test basic Adam functionality
    param = Tensor([1.0, 2.0], requires_grad=True)
    param.grad = Tensor([0.1, 0.2])

    optimizer = Adam([param], lr=0.01, betas=(0.9, 0.999), eps=1e-8)
    original_data = param.data.copy()

    # First step
    optimizer.step()

    # Manually compute expected values
    grad = np.array([0.1, 0.2])

    # First moment: m = 0.9 * 0 + 0.1 * grad = 0.1 * grad
    m = 0.1 * grad

    # Second moment: v = 0.999 * 0 + 0.001 * grad^2 = 0.001 * grad^2
    v = 0.001 * (grad ** 2)

    # Bias correction
    bias_correction1 = 1 - 0.9 ** 1  # = 0.1
    bias_correction2 = 1 - 0.999 ** 1  # = 0.001

    m_hat = m / bias_correction1  # = grad
    v_hat = v / bias_correction2  # = grad^2

    # Update
    expected = original_data - 0.01 * m_hat / (np.sqrt(v_hat) + 1e-8)

    assert np.allclose(param.data, expected, rtol=1e-6)
    assert optimizer.step_count == 1

    # Test second step to verify moment accumulation
    param.grad = Tensor([0.1, 0.2])
    optimizer.step()

    # Should have updated moments
    assert optimizer.m_buffers[0] is not None
    assert optimizer.v_buffers[0] is not None
    assert optimizer.step_count == 2

    # Test with weight decay
    param2 = Tensor([1.0, 2.0], requires_grad=True)
    param2.grad = Tensor([0.1, 0.2])

    optimizer_wd = Adam([param2], lr=0.01, weight_decay=0.01)
    optimizer_wd.step()

    # Weight decay should modify the effective gradient
    # grad_with_decay = [0.1, 0.2] + 0.01 * [1.0, 2.0] = [0.11, 0.22]
    # The exact computation is complex, but we can verify parameter changed
    assert not np.array_equal(param2.data, np.array([1.0, 2.0]))

    print("βœ… Adam optimizer works correctly!")

if __name__ == "__main__":
    test_unit_adam_optimizer()

# %% [markdown]
"""
## πŸ—οΈ AdamW - Adam with Decoupled Weight Decay

AdamW fixes a subtle but important bug in Adam's weight decay implementation. The bug affects how regularization interacts with adaptive learning rates.

### The Adam Weight Decay Bug

In standard Adam, weight decay is added to gradients before the adaptive scaling:

```
Adam's approach (problematic):
1. gradient = computed_gradient + weight_decay * parameter
2. m = β₁ * m + (1-β₁) * gradient
3. v = Ξ²β‚‚ * v + (1-Ξ²β‚‚) * gradientΒ²
4. step = m / √v
5. parameter = parameter - learning_rate * step

Problem: Weight decay gets "adapted" by the learning rate scaling!
```

### Why This Matters

Weight decay should be a consistent regularization force, but Adam makes it inconsistent:

```
Parameter Update Comparison:

Large gradients β†’ small adaptive LR β†’ weak weight decay effect
Small gradients β†’ large adaptive LR β†’ strong weight decay effect

This is backwards! We want consistent regularization.
```

### AdamW's Fix: Decoupled Weight Decay

AdamW separates gradient-based updates from weight decay:

```
AdamW's approach (correct):
1. m = β₁ * m + (1-β₁) * pure_gradient  ← NO weight decay here
2. v = Ξ²β‚‚ * v + (1-Ξ²β‚‚) * pure_gradientΒ²
3. step = m / √v
4. parameter = parameter - learning_rate * step        ← gradient update
5. parameter = parameter * (1 - weight_decay_rate)    ← separate decay

Result: Consistent regularization independent of gradient magnitudes!
```

### Visual Comparison

```
Adam weight decay:               AdamW weight decay:

gradient ──┐                    gradient ──→ adaptive ──→ param
           β”œβ”€β†’ adaptive ──→ param                  update
weight β”€β”€β”€β”€β”˜   scaling
decay
                                weight ─────────→ param
                                decay           shrinkage

Coupled (inconsistent)          Decoupled (consistent)
```

**Key Insight:** AdamW treats optimization and regularization as separate, independent processes, leading to better training dynamics and generalization.
"""

# %% nbgrader={"grade": false, "grade_id": "adamw-optimizer", "solution": true}
#| export
class AdamW(Optimizer):
    """
    AdamW optimizer with decoupled weight decay.

    AdamW fixes a bug in Adam's weight decay implementation by decoupling
    weight decay from the gradient-based update. This leads to better
    regularization and is the preferred version for most applications.
    """

    def __init__(self, params: List[Tensor], lr: float = DEFAULT_LEARNING_RATE_ADAM, betas: tuple = (DEFAULT_BETA1, DEFAULT_BETA2), eps: float = DEFAULT_EPS, weight_decay: float = DEFAULT_WEIGHT_DECAY_ADAMW):
        """
        Initialize AdamW optimizer.

        TODO: Set up AdamW with decoupled weight decay

        APPROACH:
        1. Call parent constructor
        2. Store hyperparameters (note higher default weight_decay)
        3. Initialize moment buffers like Adam

        KEY DIFFERENCE from Adam:
        - Weight decay is applied directly to parameters, not added to gradients
        - This provides better regularization behavior

        EXAMPLE:
        >>> optimizer = AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
        """
        ### BEGIN SOLUTION
        super().__init__(params)

        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay

        # Initialize moment buffers (same as Adam)
        self.m_buffers = [None for _ in self.params]
        self.v_buffers = [None for _ in self.params]
        ### END SOLUTION

    def step(self):
        """
        Perform AdamW update step with decoupled weight decay.

        TODO: Implement AdamW parameter update

        APPROACH:
        1. For each parameter with gradients:
           a. Update moments using gradients (NOT modified by weight decay)
           b. Compute bias-corrected moments
           c. Apply gradient-based update
           d. Apply weight decay directly to parameters

        KEY DIFFERENCE from Adam:
        - Weight decay: ΞΈ_t = ΞΈ_t - lr * weight_decay * ΞΈ_t (applied after gradient update)
        - NOT: grad = grad + weight_decay * param (Adam's incorrect approach)

        FORMULAS:
        - Same moment updates as Adam (using unmodified gradients)
        - Gradient update: ΞΈ_t = ΞΈ_{t-1} - lr * mΜ‚_t / (√vΜ‚_t + Ξ΅)
        - Weight decay: ΞΈ_t = ΞΈ_t * (1 - lr * weight_decay)

        HINT: Apply weight decay after gradient update for proper decoupling
        """
        ### BEGIN SOLUTION
        # Increment step counter first
        self.step_count += 1

        for i, param in enumerate(self.params):
            if param.grad is None:
                continue

            # Get gradient data - grad can be Tensor or numpy array
            grad = param.grad
            # Handle both Tensor (with .data) and numpy array (from autograd) cases
            if isinstance(grad, Tensor):
                grad_data = grad.data
            else:
                # grad is already a numpy array from autograd
                grad_data = grad

            # Initialize buffers if needed
            if self.m_buffers[i] is None:
                self.m_buffers[i] = np.zeros_like(param.data)
                self.v_buffers[i] = np.zeros_like(param.data)

            # Update moments using pure gradients
            self.m_buffers[i] = self.beta1 * self.m_buffers[i] + (1 - self.beta1) * grad_data
            self.v_buffers[i] = self.beta2 * self.v_buffers[i] + (1 - self.beta2) * (grad_data ** 2)

            # Compute bias correction
            bias_correction1 = 1 - self.beta1 ** self.step_count
            bias_correction2 = 1 - self.beta2 ** self.step_count

            # Compute bias-corrected moments
            m_hat = self.m_buffers[i] / bias_correction1
            v_hat = self.v_buffers[i] / bias_correction2

            # Apply gradient-based update
            param.data = param.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

            # Apply decoupled weight decay
            if self.weight_decay != 0:
                param.data = param.data * (1 - self.lr * self.weight_decay)
        ### END SOLUTION

# %% [markdown]
"""
### πŸ”¬ Unit Test: AdamW Optimizer
This test validates our AdamW implementation with decoupled weight decay.
**What we're testing**: AdamW updates with proper weight decay decoupling
**Why it matters**: State-of-the-art optimizer for transformer models
**Expected**: Correct separation of gradient updates and weight decay
"""

# %% nbgrader={"grade": true, "grade_id": "test-adamw", "locked": true, "points": 20}
def test_unit_adamw_optimizer():
    """πŸ”¬ Test AdamW optimizer implementation."""
    print("πŸ”¬ Unit Test: AdamW Optimizer...")

    # Test AdamW vs Adam difference in weight decay
    # Create identical parameters for comparison
    param_adam = Tensor([1.0, 2.0], requires_grad=True)
    param_adamw = Tensor([1.0, 2.0], requires_grad=True)

    param_adam.grad = Tensor([0.1, 0.2])
    param_adamw.grad = Tensor([0.1, 0.2])

    # Create optimizers with same settings
    adam = Adam([param_adam], lr=0.01, weight_decay=0.01)
    adamw = AdamW([param_adamw], lr=0.01, weight_decay=0.01)

    # Take one step
    adam.step()
    adamw.step()

    # Results should be different due to weight decay implementation
    assert not np.allclose(param_adam.data, param_adamw.data, rtol=1e-6)

    # Test AdamW basic functionality
    param = Tensor([1.0, 2.0], requires_grad=True)
    param.grad = Tensor([0.1, 0.2])

    optimizer = AdamW([param], lr=0.01, weight_decay=0.01)
    original_data = param.data.copy()

    optimizer.step()

    # Parameter should have changed
    assert not np.array_equal(param.data, original_data)
    assert optimizer.step_count == 1

    # Test that moment buffers are created
    assert optimizer.m_buffers[0] is not None
    assert optimizer.v_buffers[0] is not None

    # Test zero weight decay behaves like Adam
    param1 = Tensor([1.0, 2.0], requires_grad=True)
    param2 = Tensor([1.0, 2.0], requires_grad=True)

    param1.grad = Tensor([0.1, 0.2])
    param2.grad = Tensor([0.1, 0.2])

    adam_no_wd = Adam([param1], lr=0.01, weight_decay=0.0)
    adamw_no_wd = AdamW([param2], lr=0.01, weight_decay=0.0)

    adam_no_wd.step()
    adamw_no_wd.step()

    # Should be very similar (within numerical precision)
    assert np.allclose(param1.data, param2.data, rtol=1e-10)

    print("βœ… AdamW optimizer works correctly!")

if __name__ == "__main__":
    test_unit_adamw_optimizer()

# %% [markdown]
"""
## πŸ”§ Integration: Bringing It Together

Now let's see how our optimizers perform in realistic scenarios. We'll compare their behavior on the same optimization problem to understand their different characteristics.

### Optimizer Behavior Comparison

Each optimizer takes a different approach to the same problem:

```
Optimization Problem: Find minimum of f(x) = xΒ²

SGD approach:        Adam approach:        AdamW approach:
  ↓                    ↓                     ↓
 x ──→ minimize       x ──→ minimize       x ──→ minimize
  ↑                    ↑                     ↑
fixed LR           adaptive LR          adaptive LR + decay
```
"""


# %% [markdown]
"""
## πŸ“Š Systems Analysis: Optimizer Performance and Memory

Different optimizers have very different resource requirements. Understanding these trade-offs is crucial for production ML systems.

### Memory Usage Patterns

```
Optimizer Memory Requirements (per parameter):

SGD:           Adam/AdamW:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ param  β”‚     β”‚ param  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚momentumβ”‚     β”‚   m    β”‚ ← first moment
β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€
               β”‚   v    β”‚ ← second moment
               β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2Γ— memory       3Γ— memory
```

### Computational Complexity

```
Per-step Operations:

SGD:                     Adam:
β€’ 1 multiplication       β€’ 3 multiplications
β€’ 1 addition            β€’ 4 additions
β€’ 1 subtraction         β€’ 1 subtraction
                        β€’ 1 square root
                        β€’ 1 division

O(n) simple ops         O(n) complex ops
```
"""

# %% nbgrader={"grade": false, "grade_id": "optimizer-analysis", "solution": true}
def analyze_optimizer_memory_usage():
    """πŸ“Š Analyze memory usage of different optimizers."""
    print("πŸ“Š Analyzing Optimizer Memory Usage...")

    # Create test parameters of different sizes
    param_sizes = [1000, 10000, 100000]  # 1K, 10K, 100K parameters

    print("Optimizer Memory Analysis (per parameter tensor):")
    print("=" * 60)
    print(f"{'Size':<10} {'SGD':<10} {'Adam':<10} {'AdamW':<10} {'Ratio':<10}")
    print("-" * 60)

    for size in param_sizes:
        # Create parameter
        param = Tensor(np.random.randn(size), requires_grad=True)
        param.grad = Tensor(np.random.randn(size))

        # SGD memory (parameter + momentum buffer)
        sgd = SGD([param], momentum=0.9)
        sgd.step()  # Initialize buffers
        sgd_memory = size * 2  # param + momentum buffer

        # Adam memory (parameter + 2 moment buffers)
        param_adam = Tensor(np.random.randn(size), requires_grad=True)
        param_adam.grad = Tensor(np.random.randn(size))
        adam = Adam([param_adam])
        adam.step()  # Initialize buffers
        adam_memory = size * 3  # param + m_buffer + v_buffer

        # AdamW memory (same as Adam)
        adamw_memory = adam_memory

        # Memory ratio (Adam/SGD)
        ratio = adam_memory / sgd_memory

        print(f"{size:<10} {sgd_memory:<10} {adam_memory:<10} {adamw_memory:<10} {ratio:.1f}x")

    print("\nπŸ’‘ Key Insights:")
    print("- SGD: 2Γ— parameter memory (momentum buffer)")
    print("- Adam/AdamW: 3Γ— parameter memory (two moment buffers)")
    print("- Memory scales linearly with model size")
    print("- Trade-off: More memory for better convergence")

# %% nbgrader={"grade": false, "grade_id": "optimizer-convergence", "solution": true}
def analyze_optimizer_convergence_behavior():
    """πŸ“Š Analyze convergence behavior of different optimizers."""
    print("πŸ“Š Analyzing Optimizer Convergence Behavior...")

    # Simulate optimization of a quadratic function: f(x) = 0.5 * x^2
    # Optimal solution: x* = 0, gradient = x

    def quadratic_loss(x):
        """Simple quadratic function for optimization testing."""
        return 0.5 * (x ** 2).sum()

    def compute_gradient(x):
        """Gradient of quadratic function: df/dx = x."""
        return x.copy()

    # Starting point
    x_start = np.array([5.0, -3.0, 2.0])  # Far from optimum [0, 0, 0]

    # Test different optimizers
    optimizers_to_test = [
        ("SGD", SGD, {"lr": 0.1}),
        ("SGD+Momentum", SGD, {"lr": 0.1, "momentum": 0.9}),
        ("Adam", Adam, {"lr": 0.1}),
        ("AdamW", AdamW, {"lr": 0.1, "weight_decay": 0.01})
    ]

    print("Convergence Analysis (quadratic function f(x) = 0.5 * xΒ²):")
    print("=" * 70)
    print(f"{'Optimizer':<15} {'Step 0':<12} {'Step 5':<12} {'Step 10':<12} {'Final Loss':<12}")
    print("-" * 70)

    for name, optimizer_class, kwargs in optimizers_to_test:
        # Reset parameter
        param = Tensor(x_start.copy(), requires_grad=True)
        optimizer = optimizer_class([param], **kwargs)

        losses = []

        # Run optimization for 10 steps
        for step in range(11):
            # Compute loss and gradient
            loss = quadratic_loss(param.data)
            param.grad = Tensor(compute_gradient(param.data))

            losses.append(loss)

            # Update parameters
            if step < 10:  # Don't update after last evaluation
                optimizer.step()
                optimizer.zero_grad()

        # Format results
        step0 = f"{losses[0]:.6f}"
        step5 = f"{losses[5]:.6f}"
        step10 = f"{losses[10]:.6f}"
        final = f"{losses[10]:.6f}"

        print(f"{name:<15} {step0:<12} {step5:<12} {step10:<12} {final:<12}")

    print("\nπŸ’‘ Key Insights:")
    print("- SGD: Steady progress but can be slow")
    print("- SGD+Momentum: Faster convergence, less oscillation")
    print("- Adam: Adaptive rates help with different parameter scales")
    print("- AdamW: Similar to Adam with regularization effects")

# %% [markdown]
"""
## πŸ§ͺ Module Integration Test

Final validation that everything works together correctly.
"""

# %% nbgrader={"grade": true, "grade_id": "module-integration", "locked": true, "points": 25}
def test_module():
    """πŸ§ͺ Module Test: Complete Integration

    Comprehensive test of entire module functionality.

    This final test runs before module summary to ensure:
    - All unit tests pass
    - Functions work together correctly
    - Module is ready for integration with TinyTorch
    """
    print("πŸ§ͺ RUNNING MODULE INTEGRATION TEST")
    print("=" * 50)

    # Run all unit tests
    print("Running unit tests...")
    test_unit_optimizer_base()
    test_unit_sgd_optimizer()
    test_unit_adam_optimizer()
    test_unit_adamw_optimizer()

    print("\nRunning integration scenarios...")

    # Test realistic neural network optimization scenario
    print("πŸ”¬ Integration Test: Multi-layer Network Optimization...")

    # Import components from TinyTorch package (previous modules must be completed and exported)
    from tinytorch.core.layers import Linear
    from tinytorch.core.activations import ReLU
    from tinytorch.core.losses import MSELoss

    # Create parameters for a 2-layer network
    # Layer 1: 3 inputs -> 4 hidden
    W1 = Tensor(np.random.randn(3, 4) * 0.1, requires_grad=True)
    b1 = Tensor(np.zeros(4), requires_grad=True)

    # Layer 2: 4 hidden -> 2 outputs
    W2 = Tensor(np.random.randn(4, 2) * 0.1, requires_grad=True)
    b2 = Tensor(np.zeros(2), requires_grad=True)

    params = [W1, b1, W2, b2]

    # Add realistic gradients
    W1.grad = Tensor(np.random.randn(3, 4) * 0.01)
    b1.grad = Tensor(np.random.randn(4) * 0.01)
    W2.grad = Tensor(np.random.randn(4, 2) * 0.01)
    b2.grad = Tensor(np.random.randn(2) * 0.01)

    # Test all optimizers on same network
    optimizers = [
        SGD(params, lr=0.01, momentum=0.9),
        Adam([p for p in params], lr=0.001),  # Fresh param list for Adam
        AdamW([p for p in params], lr=0.001, weight_decay=0.01)  # Fresh param list for AdamW
    ]

    # Save original parameter values
    original_params = [p.data.copy() for p in params]

    # Test SGD
    optimizers[0].step()
    sgd_params = [p.data.copy() for p in params]

    # Restore parameters and test Adam
    for i, p in enumerate(params):
        p.data = original_params[i].copy()
        # Re-add gradients since they may have been modified
        if i == 0:
            p.grad = Tensor(np.random.randn(3, 4) * 0.01)
        elif i == 1:
            p.grad = Tensor(np.random.randn(4) * 0.01)
        elif i == 2:
            p.grad = Tensor(np.random.randn(4, 2) * 0.01)
        else:
            p.grad = Tensor(np.random.randn(2) * 0.01)

    # Update parameter references for Adam
    optimizers[1].params = params
    optimizers[1].step()
    adam_params = [p.data.copy() for p in params]

    # Restore parameters and test AdamW
    for i, p in enumerate(params):
        p.data = original_params[i].copy()
        # Re-add gradients
        if i == 0:
            p.grad = Tensor(np.random.randn(3, 4) * 0.01)
        elif i == 1:
            p.grad = Tensor(np.random.randn(4) * 0.01)
        elif i == 2:
            p.grad = Tensor(np.random.randn(4, 2) * 0.01)
        else:
            p.grad = Tensor(np.random.randn(2) * 0.01)

    # Update parameter references for AdamW
    optimizers[2].params = params
    optimizers[2].step()
    adamw_params = [p.data.copy() for p in params]

    # Verify parameters changed differently for each optimizer
    for i in range(len(params)):
        # Parameters should be different from original
        assert not np.array_equal(sgd_params[i], original_params[i])
        assert not np.array_equal(adam_params[i], original_params[i])
        assert not np.array_equal(adamw_params[i], original_params[i])

        # Different optimizers should produce different results
        assert not np.allclose(sgd_params[i], adam_params[i], rtol=1e-6)

    print("βœ… Multi-layer network optimization works!")

    # Test optimizer state management
    print("πŸ”¬ Integration Test: Optimizer State Management...")

    param = Tensor([1.0, 2.0], requires_grad=True)
    param.grad = Tensor([0.1, 0.2])

    optimizer = Adam([param], lr=0.001)

    # First step should initialize buffers
    optimizer.step()
    assert optimizer.m_buffers[0] is not None
    assert optimizer.v_buffers[0] is not None
    assert optimizer.step_count == 1

    # Zero grad should clear gradients but preserve optimizer state
    optimizer.zero_grad()
    assert param.grad is None
    assert optimizer.m_buffers[0] is not None  # State preserved
    assert optimizer.step_count == 1  # Step count preserved

    print("βœ… Optimizer state management works!")

    print("\n" + "=" * 50)
    print("πŸŽ‰ ALL TESTS PASSED! Module ready for export.")
    print("Run: tito module complete 07_optimizers")

# %% [markdown]
"""
## πŸ€” ML Systems Thinking

Now that your optimizers work, let's explore the systems trade-offs between them. Every optimizer choice affects memory usage, convergence speed, and training stability.

### Questions to Consider

**Q1: Memory vs Performance**
You've implemented SGD (2Γ— memory) and Adam (3Γ— memory). For a model with 10 billion parameters at float32 (4 bytes each):
- How much total memory does each optimizer require?
- At what model size does Adam's extra 50% memory overhead become prohibitive?
- What real-world constraints might force you to choose SGD over Adam?

**Q2: Learning Rate Sensitivity**
SGD uses a fixed learning rate for all parameters, while Adam adapts per-parameter:
- Why might Adam converge faster on problems with parameters at different scales?
- When might SGD's uniform learning rate actually be an advantage?
- How does momentum in SGD relate to Adam's first moment estimation?

**Q3: Optimizer State Management**
Adam and AdamW maintain momentum buffers (m, v) that persist across training steps:
- What happens to these buffers when you checkpoint during training?
- If you resume training with different hyperparameters, should you restore the old buffers?
- How does optimizer state affect distributed training across multiple GPUs?

**Q4: Weight Decay Trade-offs**
AdamW decouples weight decay from gradient updates:
- Why does Adam's coupled weight decay behave inconsistently?
- In what scenarios would AdamW's consistent regularization matter most?
- How does weight decay interact with learning rate schedules?

### Systems Implications

**Memory Hierarchy:**
```
Model Size: 1B parameters (4GB)
β”œβ”€ SGD:     8GB total (4GB params + 4GB momentum)
β”œβ”€ Adam:    12GB total (4GB params + 4GB m + 4GB v)
└─ Impact:  May not fit in GPU memory, forcing:
            β€’ Smaller batch sizes
            β€’ Model parallelism
            β€’ Optimizer state sharding (ZeRO optimization)
```

**Convergence Patterns:**
- **SGD + Momentum:** Steady progress, may need learning rate tuning
- **Adam:** Fast initial convergence, may overfit without proper regularization
- **AdamW:** Adam's speed + better generalization, standard for transformers

**Production Considerations:**
- **Training cost:** Adam's extra memory means fewer models per GPU
- **Hyperparameter tuning:** SGD more sensitive to learning rate choice
- **Model generalization:** AdamW often generalizes better than Adam
- **Checkpoint size:** Adam checkpoints are 1.5Γ— larger than SGD

### Performance Analysis

Our earlier analysis functions revealed:
- `analyze_optimizer_memory_usage()`: Adam requires exactly 1.5Γ— SGD's memory
- `analyze_optimizer_convergence_behavior()`: Adam often converges in fewer steps

**The Key Insight:**
Optimizer choice is a systems trade-off between:
- **Memory budget** (can you afford 3Γ— parameter memory?)
- **Convergence speed** (how many training steps can you afford?)
- **Generalization quality** (does your model perform well on unseen data?)

There's no universally best optimizerβ€”only the right choice for your constraints!
"""

# %% [markdown]
"""
## ⭐ Aha Moment: Optimizers Update Weights

**What you built:** Optimization algorithms (SGD, Adam) that update neural network weights.

**Why it matters:** Gradients tell us which direction reduces the loss, but someone has to
actually move the weights. That's what optimizers do! SGD takes simple steps, while Adam
adapts the learning rate for each parameterβ€”like having a personal trainer for each weight.

In the next module, you'll combine optimizers with a training loop to actually train networks!
"""

# %%
def demo_optimizers():
    """🎯 See optimizers update weights."""
    print("🎯 AHA MOMENT: Optimizers Update Weights")
    print("=" * 45)

    # Create a parameter with a gradient
    weight = Tensor(np.array([5.0]), requires_grad=True)
    weight.grad = np.array([1.0])  # Gradient pointing "uphill"

    print(f"Initial weight: {weight.data[0]:.2f}")
    print(f"Gradient:       {weight.grad[0]:.2f} (pointing uphill)")

    # SGD takes a step in the opposite direction
    optimizer = SGD([weight], lr=0.5)
    optimizer.step()

    print(f"\nAfter SGD step: {weight.data[0]:.2f}")
    print(f"Moved: {5.0 - weight.data[0]:.2f} (opposite to gradient)")

    print("\n✨ Optimizer moves weights to reduce loss!")

# %%
if __name__ == "__main__":
    test_module()
    print("\n")
    demo_optimizers()

# %% [markdown]
"""
## πŸš€ MODULE SUMMARY: Optimizers

Congratulations! You've built sophisticated optimization algorithms that power modern neural network training!

### Key Accomplishments
- Built SGD optimizer with momentum for stable gradient descent and oscillation reduction
- Implemented Adam optimizer with adaptive learning rates and bias correction for different parameter scales
- Created AdamW optimizer with decoupled weight decay for proper regularization
- Analyzed memory trade-offs: SGD (2Γ—), Adam/AdamW (3Γ— parameter memory)
- All tests pass βœ… (validated by `test_module()`)

### Ready for Next Steps
Your optimizer implementations enable sophisticated neural network training! With gradients from Module 06 and optimizers from Module 07, you're ready to build complete training loops.

Export with: `tito module complete 07_optimizers`

**Next**: Module 08 will add training loops, learning rate scheduling, and checkpointing for complete end-to-end neural network training!
"""