---
license: apache-2.0
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- learned-optimizer
---


# Description 
`btherien/mulo` is a learned optimizer meta-trained in μ-parameterization (μP). It corresponds to the μLO_M optimizer from [μLO: Compute-Efficient Meta-Generalization of Learned Optimizers](https://arxiv.org/abs/2406.00153). Because it is meta-trained in μP, μLO_M has strong meta-generalization capabilities (i.e., the ability to optimize unseen tasks), despite its relatively small and inexpensive meta-training distribution.
### Learned optimizer meta-training and architectural details
| **Field**                          | **Value**                                                       |
|------------------------------------|-----------------------------------------------------------------|
| **Meta-training distribution**     | ImageNet classification, 3-layer MLP, width ∈ {128, 512, 1024}  |
| **Number of meta-training steps**  | 5,000                                                           |
| **Target inner-problem length**    | 1,000 iterations                                                |
| **Gradient estimator**             | Persistent Evolution Strategies                                 |
| **Truncation length**              | 50                                                              |
| **Architecture**                   | small_fc_lopt                                                   |
| **Optimizer input size**           | 39                                                              |
| **Optimizer hidden size**          | 32                                                              |
| **Optimizer output size**          | 2                                                               |
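
For intuition, here is a minimal sketch (not the PyLO implementation) of what a `small_fc_lopt`-style update looks like: a small per-parameter MLP maps the 39 input features to 2 outputs, a direction and a log-magnitude, which are combined into a weight update. The layer sizes (39 → 32 → 2) match the table above; the `step_mult`/`exp_mult` constants and the exact combination rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a small_fc_lopt-style per-parameter MLP.
# Sizes (39 -> 32 -> 2) match the table above; the update rule and the
# step_mult/exp_mult constants are assumptions for illustration only.
class PerParamMLP(nn.Module):
    def __init__(self, in_size=39, hidden=32, out_size=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_size),
        )

    def forward(self, features, step_mult=1e-3, exp_mult=1e-3):
        # features: (num_params, 39) per-parameter feature vectors
        out = self.net(features)
        direction, magnitude = out[:, 0], out[:, 1]
        # Update = direction scaled by an exponential magnitude term.
        return direction * step_mult * torch.exp(magnitude * exp_mult)
```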

# Usage

## (1) Install PyLO
The following commands clone the PyLO repository and install it from source:
```bash
git clone https://github.com/Belilovsky-Lab/pylo
cd pylo
pip install . --config-settings="--build-option=--cuda" # Optional: build with CUDA kernel support
```
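
To confirm the installation, a quick import check (assuming the package is importable as `pylo`, as in the snippets below):

```python
# Sanity check: the optimizer should import after installation.
from pylo.optim import MuLO_naive   # pure-PyTorch implementation
# from pylo.optim import MuLO_CUDA  # only available with the CUDA build
print("PyLO installed correctly")
```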

## (2) Use $\mu$LO as a drop-in replacement for PyTorch optimizers

```python
USE_CUDA_KERNEL = True  # set to False to use the pure-PyTorch implementation

if USE_CUDA_KERNEL:
    from pylo.optim import MuLO_CUDA
    optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
else:
    from pylo.optim import MuLO_naive
    optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')
```
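
In both cases, the `hf_key` argument points the optimizer at this Hugging Face Hub repository, so the meta-trained optimizer weights are loaded automatically when the optimizer is constructed.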

## (3) A simple example

The following example is for illustration only and does not implement the correct parameterization. For a correct implementation, see https://github.com/Belilovsky-Lab/pylo/tree/main/examples; a rough sketch of what a μP-style model definition involves follows the example.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )
    def forward(self, x):
        return self.net(x)

model = MLP().to(device)

#########################
# Setup Learned Optimizer
#########################
USE_CUDA_KERNEL = False  # Set to True for accelerated CUDA kernels
if USE_CUDA_KERNEL:
    from pylo.optim import MuLO_CUDA
    optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
else:
    from pylo.optim import MuLO_naive
    optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')

# Data
transform = transforms.ToTensor()
train_loader = DataLoader(datasets.MNIST(root='./data', train=True, download=True, transform=transform),
                          batch_size=64, shuffle=True)

criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(1):  # Just 1 epoch for simplicity
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

print("Done!")
```
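
For reference, here is a minimal sketch of what a μP-style model definition could look like using the `mup` package (https://github.com/microsoft/mup). This is an assumption for illustration, not the exact setup used in the PyLO examples:

```python
# A minimal sketch of a μP-style model using the `mup` package.
# This is an illustrative assumption; see the pylo examples for the
# reference parameterization.
import torch.nn as nn
from mup import MuReadout, set_base_shapes

class MuMLP(nn.Module):
    def __init__(self, width=128):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, width)
        self.act = nn.ReLU()
        # MuReadout replaces the final Linear so the output layer is
        # rescaled correctly as width grows.
        self.readout = MuReadout(width, 10)

    def forward(self, x):
        return self.readout(self.act(self.fc1(self.flatten(x))))

# set_base_shapes infers which dimensions scale with width by comparing
# a base-width model against the target-width model.
base = MuMLP(width=128)
model = MuMLP(width=1024)
set_base_shapes(model, base)
```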


### Per-Parameter Input Features Used by MuLO

| **Type**                  | **# Features** | **Description**                                                                                      | **Equation**                                                              |
|---------------------------|----------------|------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|
| **Accumulators**          | 3              | Momentum accumulators with coefficients βᵢ, i ∈ {1, 2, 3}.                                           | mₜ,ᵢ = βᵢ·mₜ₋₁,ᵢ + (1 − βᵢ)·∇ₜ                                           |
|                           | 1              | Second moment accumulator with coefficient β₄.                                                       | vₜ = β₄·vₜ₋₁ + (1 − β₄)·∇ₜ²                                              |
|                           | 3              | Adafactor row accumulators with coefficients βᵢ, i ∈ {5, 6, 7}.                                     | rₜ,ᵢ = βᵢ·rₜ₋₁,ᵢ + (1 − βᵢ)·row_mean(∇ₜ²)                                  |
|                           | 3              | Adafactor column accumulators with coefficients βᵢ, i ∈ {5, 6, 7}.                                  | cₜ,ᵢ = βᵢ·cₜ₋₁,ᵢ + (1 − βᵢ)·col_mean(∇ₜ²)                                  |
| **Accumulator Features**  | 3              | Normalized momentum: momentum divided by sqrt of the second moment, for i ∈ {1, 2, 3}.              | mₜ,ᵢ / √vₜ                                                                |
|                           | 1              | Reciprocal sqrt of the second moment.                                                               | 1 / √vₜ                                                                   |
|                           | 6              | Reciprocal sqrt of the Adafactor row and column accumulators, for i ∈ {5, 6, 7}.                    | 1 / √rₜ,ᵢ, 1 / √cₜ,ᵢ                                                      |
|                           | 3              | Adafactor gradient features for i ∈ {5, 6, 7}.                                                       | ∇ₜ × rₜ,ᵢ × cₜ,ᵢ                                                         |
|                           | 3              | Adafactor momentum features for (i, j) ∈ {(5,1), (6,2), (7,3)}.                                      | mₜ,ⱼ × rₜ,ᵢ × cₜ,ᵢ                                                       |
| **Time Features**         | 11             | Time features for x ∈ {1, 3, 10, 30, 100, 300, 1000, 3000, 10⁴, 3·10⁴, 10⁵}.                         | tanh(t / x)                                                               |
| **Parameters**            | 1              | Parameter value.                                                                                     | wₜ                                                                        |
|                           | 1              | Gradient value.                                                                                      | ∇ₜ                                                                        |
| **Total**                 | 39             | —                                                                                                    | —                                                                         |
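
The sketch below is a rough illustration (not PyLO's code) of how a few of these features could be computed for a single weight tensor; the β coefficients used here are placeholder values.

```python
# Illustrative computation of a subset of the per-parameter features.
# The beta coefficients below are assumed placeholders, not the
# meta-trained values used by MuLO.
import torch

betas_m = (0.9, 0.99, 0.999)   # assumed momentum coefficients β1..β3
beta_v = 0.999                 # assumed second-moment coefficient β4
eps = 1e-8

w = torch.randn(512, 128)      # a weight tensor
g = torch.randn_like(w)        # its gradient
m = [torch.zeros_like(w) for _ in betas_m]
v = torch.zeros_like(w)
t = 1                          # current step

# Accumulators: m_{t,i} = β_i·m_{t-1,i} + (1 - β_i)·∇_t ; v_t likewise on ∇_t²
for i, b in enumerate(betas_m):
    m[i] = b * m[i] + (1 - b) * g
v = beta_v * v + (1 - beta_v) * g.square()

# Accumulator features
normed_m = [mi / (v.sqrt() + eps) for mi in m]   # m_{t,i} / sqrt(v_t)
rsqrt_v = (v + eps).rsqrt()                      # 1 / sqrt(v_t)

# Time features: tanh(t / x) over 11 timescales x
timescales = [1, 3, 10, 30, 100, 300, 1000, 3000, 1e4, 3e4, 1e5]
time_feats = [torch.tanh(torch.tensor(t / x)) for x in timescales]
```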


# Cite 
If you find this optimizer useful in your research, please consider citing our work:
```bibtex
@misc{therien2024mulo,
  title = {$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers},
  author = {Benjamin Thérien and Charles-Étienne Joseph and Boris Knyazev and Edouard Oyallon and Irina Rish and Eugene Belilovsky},
  year = {2024},
  eprint = {2406.00153},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2406.00153}
}
```