btherien committed on
Commit
cce9a53
·
verified ·
1 Parent(s): 9aa0bd6

Update README.md

Files changed (1)
  1. README.md +141 -4
README.md CHANGED
@@ -1,11 +1,148 @@
1
  ---
2
  license: apache-2.0
3
  tags:
4
- - learned-optimizer
5
  - model_hub_mixin
6
  - pytorch_model_hub_mixin
 
7
  ---
8
 
9
- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
10
- - Library: [More Information Needed]
11
- - Docs: [More Information Needed]

1
  ---
2
  license: apache-2.0
3
  tags:
 
4
  - model_hub_mixin
5
  - pytorch_model_hub_mixin
6
+ - learned-optimizer
7
  ---
8
 
9
+
10
+ # Description
11
+ `btherien/mulo` is a learned optimizer meta-trained in μ-parameterization ($\mu$P). It corresponds to the $\mu$LO$_M$ optimizer of [μLO: Compute-Efficient Meta-Generalization of Learned Optimizers](https://arxiv.org/abs/2406.00153). Because it is meta-trained in $\mu$P, $\mu$LO$_M$ has strong meta-generalization capabilities (the ability to optimize unseen tasks), despite its relatively short and inexpensive meta-training.
12
+ ### Learned optimizer meta-training and architectural details
13
+ | **Field** | **Value** |
14
+ |------------------------------|---------------------------------------------------------------------------|
15
+ | **Meta-training distribution** | ImageNet classification, 3-layer MLP, width ∈ {128, 512, 1024} |
16
+ | **Number of meta-training steps** | 5000 |
17
+ | **Target inner problem length** | 1000 iterations |
18
+ | **Gradient estimator** | PES (Persistent Evolution Strategies) |
19
+ | **Truncation length** | 50 |
20
+ | **Architecture** | small_fc_lopt [(Metz et al., 2022a)](https://arxiv.org/abs/2203.11860) |
21
+ | **Optimizer Input size** | 39 |
22
+ | **Optimizer Hidden size** | 32 |
23
+ | **Optimizer Output size** | 2 |
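The relationship between the inner problem length and the truncation length in the table above can be sketched as a small configuration object. This is only an illustration: `MetaTrainingConfig` and its field names are hypothetical and not part of PyLO, and the sketch assumes the truncation length is measured in inner-problem iterations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaTrainingConfig:
    """Hypothetical container for the values listed in the table above."""
    meta_training_steps: int = 5000
    inner_problem_length: int = 1000   # target inner problem length (iterations)
    truncation_length: int = 50        # assumed to be in inner iterations
    mlp_widths: tuple = (128, 512, 1024)

    def truncations_per_inner_problem(self) -> int:
        # Number of truncated unrolls needed to cover one full inner problem.
        return self.inner_problem_length // self.truncation_length


config = MetaTrainingConfig()
print(config.truncations_per_inner_problem())  # 20
```

Under this assumption, each 1000-iteration inner problem is covered by 20 truncated unrolls of 50 iterations.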
24
+
25
+ # Usage
26
+ ---
27
+ ## (1) Install PyLO
28
+ The following commands clone PyLO and install it from source:
29
+ ```bash
30
+ git clone https://github.com/Belilovsky-Lab/pylo
31
+ cd pylo
32
+ pip install . --config-settings="--build-option=--cuda"  # optional: build with CUDA kernels
33
+ ```
34
+
35
+ ## (2) Use $\mu$LO as a drop-in replacement for PyTorch optimizers
36
+
37
+ ```python
+ # `model` is any torch.nn.Module that has already been created.
+ USE_CUDA_KERNEL = False  # set to True to use the accelerated CUDA kernels
+ if USE_CUDA_KERNEL:
+     from pylo.optim import MuLO_CUDA
+     optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
+ else:
+     from pylo.optim import MuLO_naive
+     optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')
+ ```
45
+
46
+ ## (3) A simple example
47
+
48
+ The following example is for illustration purposes only and does not implement the correct parameterization ($\mu$P). For a correct implementation, see https://github.com/Belilovsky-Lab/pylo/tree/main/examples
49
+
50
+ ```python
+ import torch
+ import torch.nn as nn
+ from torchvision import datasets, transforms
+ from torch.utils.data import DataLoader
+
+ # Device (defined first so the model can be moved to it)
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+ # Model
+ class MLP(nn.Module):
+     def __init__(self):
+         super().__init__()
+         self.net = nn.Sequential(
+             nn.Flatten(),
+             nn.Linear(28 * 28, 128),
+             nn.ReLU(),
+             nn.Linear(128, 10)
+         )
+
+     def forward(self, x):
+         return self.net(x)
+
+ model = MLP().to(device)
+
+ ###########################
+ # Setup Learned Optimizer #
+ ###########################
+ USE_CUDA_KERNEL = False  # Set to True for accelerated kernels
+ if USE_CUDA_KERNEL:
+     from pylo.optim import MuLO_CUDA
+     optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
+ else:
+     from pylo.optim import MuLO_naive
+     optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')
+
+ # Data
+ transform = transforms.ToTensor()
+ train_loader = DataLoader(
+     datasets.MNIST(root='./data', train=True, download=True, transform=transform),
+     batch_size=64, shuffle=True)
+
+ criterion = nn.CrossEntropyLoss()
+
+ # Training loop
+ for epoch in range(1):  # Just 1 epoch for simplicity
+     for x, y in train_loader:
+         x, y = x.to(device), y.to(device)
+         optimizer.zero_grad()
+         loss = criterion(model(x), y)
+         loss.backward()
+         optimizer.step()
+
+ print("Done!")
+ ```
103
+
104
+
105
+ ### Per-Parameter Input Features Used by MuLO
106
+
107
+ | **Type** | **# Features** | **Description** | **Equation** |
108
+ |---------------------------|----------------|------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|
109
+ | **Accumulators** | 3 | Momentum accumulators with coefficients βᵢ, i ∈ {1, 2, 3}. | mₜ,ᵢ = βᵢ·mₜ₋₁,ᵢ + (1 − βᵢ)·∇ₜ |
110
+ | | 1 | Second moment accumulator with coefficient β₄. | vₜ = β₄·vₜ₋₁ + (1 − β₄)·∇ₜ² |
111
+ | | 3 | Adafactor row accumulators with coefficients βᵢ, i ∈ {5, 6, 7}. | rₜ,ᵢ = βᵢ·rₜ₋₁,ᵢ + (1 − βᵢ)·row_mean(∇ₜ²) |
112
+ | | 3 | Adafactor column accumulators with coefficients βᵢ, i ∈ {5, 6, 7}. | cₜ,ᵢ = βᵢ·cₜ₋₁,ᵢ + (1 − βᵢ)·col_mean(∇ₜ²) |
113
+ | **Accumulator Features** | 3 | Normalized momentum: momentum divided by sqrt of second moment for i ∈ {1, 2, 3}. | mₜ,ᵢ / √v |
114
+ | | 1 | Reciprocal sqrt of second moment value. | 1 / √v |
115
+ | | 6 | Reciprocal sqrt of Adafactor accumulators. | 1 / √(rₜ,ᵢ or cₜ,ᵢ) |
116
+ | | 3 | Adafactor gradient features for i ∈ {5, 6, 7}. | ∇ₜ × rₜ,ᵢ × cₜ,ᵢ |
117
+ | | 3 | Adafactor momentum features for (i, j) ∈ {(5,1), (6,2), (7,3)}. | mₜ,ⱼ × rₜ,ᵢ × cₜ,ᵢ |
118
+ | **Time Features** | 11 | Time features for x ∈ {1, 3, 10, 30, 100, 300, 1000, 3000, 10⁴, 3·10⁴, 10⁵}. | tanh(t / x) |
119
+ | **Parameters** | 1 | Parameter value. | wₜ |
120
+ | | 1 | Gradient value. | ∇ₜ |
121
+ | **Total** | 39 | — | — |
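The accumulator and time-feature equations in the table above can be sketched in plain Python. This is a minimal illustration, not PyLO's implementation: the helper names (`ema`, `row_mean`, `col_mean`, `time_features`) and the value `beta = 0.9` are hypothetical, and the sketch assumes `row_mean` averages over the columns of each row while `col_mean` averages over the rows of each column.

```python
import math

# Time scales x from the table: tanh(t / x) is computed for each of them.
TIME_SCALES = [1, 3, 10, 30, 100, 300, 1000, 3000, 10**4, 3 * 10**4, 10**5]

# Feature counts copied from the table; they should sum to the input size (39).
FEATURE_COUNTS = {
    "momentum accumulators": 3,
    "second moment accumulator": 1,
    "Adafactor row accumulators": 3,
    "Adafactor column accumulators": 3,
    "normalized momentum": 3,
    "reciprocal sqrt of second moment": 1,
    "reciprocal sqrt of Adafactor accumulators": 6,
    "Adafactor gradient features": 3,
    "Adafactor momentum features": 3,
    "time features": 11,
    "parameter value": 1,
    "gradient value": 1,
}


def ema(prev, new, beta):
    """Exponential moving average: beta * prev + (1 - beta) * new."""
    return beta * prev + (1.0 - beta) * new


def row_mean(matrix):
    """One value per row (assumed meaning of row_mean in the table)."""
    return [sum(row) / len(row) for row in matrix]


def col_mean(matrix):
    """One value per column (assumed meaning of col_mean in the table)."""
    return [sum(col) / len(matrix) for col in zip(*matrix)]


def time_features(t):
    """The 11 time features tanh(t / x) for iteration t."""
    return [math.tanh(t / x) for x in TIME_SCALES]


# Toy 2x2 gradient; all accumulators start at zero.
grad = [[1.0, -2.0], [3.0, 0.5]]
grad_sq = [[g * g for g in row] for row in grad]
beta = 0.9  # hypothetical coefficient, not taken from the model card

m = [[ema(0.0, g, beta) for g in row] for row in grad]      # momentum
v = [[ema(0.0, g2, beta) for g2 in row] for row in grad_sq]  # second moment
r = [ema(0.0, x, beta) for x in row_mean(grad_sq)]           # Adafactor rows
c = [ema(0.0, x, beta) for x in col_mean(grad_sq)]           # Adafactor columns

print(sum(FEATURE_COUNTS.values()), len(time_features(10)))
```

The count check confirms that the rows of the table add up to the optimizer input size of 39 reported in the architecture table.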
122
+
123
+
124
+ # Cite
125
+ If you found this optimizer useful in your research, please consider citing our work:
126
+ ```bibtex
127
+ @misc{therien2024mulo,
128
+ title = {$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers},
129
+ author = {Benjamin Thérien and Charles-Étienne Joseph and Boris Knyazev and Edouard Oyallon and Irina Rish and Eugene Belilovsky},
130
+ year = {2024},
131
+ eprint = {2406.00153},
132
+ archivePrefix = {arXiv},
133
+ primaryClass = {cs.LG},
134
+ url = {https://arxiv.org/abs/2406.00153}
135
+ }
136
+ ```
137
+
138
+
139
+
140
+
141
+
142
+
143
+
144
+
145
+
146
+
147
+
148
+