---
license: apache-2.0
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- learned-optimizer
---
# Description
`btherien/mulo` is a learned optimizer meta-trained in μ-parameterization (μP). It corresponds to the μLO_M optimizer from [μLO: Compute-Efficient Meta-Generalization of Learned Optimizers](https://arxiv.org/abs/2406.00153). Because it is meta-trained in μP, μLO_M has strong meta-generalization capabilities (i.e., the ability to optimize unseen tasks), despite its relatively short and inexpensive meta-training.
### Learned optimizer meta-training and architectural details
| **Field** | **Value** |
|------------------------------|---------------------------------------------------------------------------|
| **Meta-training distribution** | ImageNet classification, 3-layer MLP, width ∈ {128, 512, 1024} |
| **Number of meta-training steps** | 5000 |
| **Target inner problem length** | 1000 iterations |
| **Gradient estimator** | Persistent Evolution Strategies |
| **Truncation length** | 50 |
| **Architecture** | small_fc_lopt |
| **Optimizer Input size** | 39 |
| **Optimizer Hidden size** | 32 |
| **Optimizer Output size** | 2 |
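
To make the sizes in the table concrete, here is a minimal pure-Python sketch of a per-parameter MLP with 39 inputs, 32 hidden units, and 2 outputs. It is not the PyLO implementation: the single hidden layer, the ReLU activation, and the random initialization are assumptions made only for illustration, and the helper names (`init_layer`, `linear`, `relu`) are hypothetical.

```python
import random

# Sizes taken from the table above.
INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE = 39, 32, 2


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with random weights (illustrative init)."""
    weights = [[rng.gauss(0.0, n_in ** -0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Apply a dense layer: y = W x + b."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


rng = random.Random(0)
hidden_layer = init_layer(INPUT_SIZE, HIDDEN_SIZE, rng)  # assumption: one hidden layer
output_layer = init_layer(HIDDEN_SIZE, OUTPUT_SIZE, rng)

# One 39-dimensional feature vector, as would be built for a single parameter.
features = [rng.uniform(-1.0, 1.0) for _ in range(INPUT_SIZE)]
outputs = linear(relu(linear(features, hidden_layer)), output_layer)

# Weights plus biases under the single-hidden-layer assumption.
num_params = sum(len(w) * len(w[0]) + len(b) for w, b in (hidden_layer, output_layer))
print(len(outputs), num_params)
```

The same small network is applied independently to every parameter of the model being optimized, which is why the optimizer itself can stay this small.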
# Usage
---
## (1) Install PyLO
The following commands clone PyLO and install it from source:
```bash
git clone https://github.com/Belilovsky-Lab/pylo
cd pylo
pip install . --config-settings="--build-option=--cuda"  # Optional: install with CUDA support
```
## (2) Use μLO as a drop-in replacement for PyTorch optimizers
```python
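# `model` is any torch.nn.Module you want to train.
USE_CUDA_KERNEL = False  # Set to True for accelerated kernels
if USE_CUDA_KERNEL:
    from pylo.optim import MuLO_CUDA
    optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
else:
    from pylo.optim import MuLO_naive
    optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')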
```
## (3) A simple example
The following example is for illustration purposes only and does not implement the correct parameterization (μP). For a correct implementation, see the [PyLO examples](https://github.com/Belilovsky-Lab/pylo/tree/main/examples).
```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Device (defined first so the model can be moved to it)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.net(x)

model = MLP().to(device)

#########################
# Setup Learned Optimizer
#########################
USE_CUDA_KERNEL = False  # Set to True for accelerated kernels
if USE_CUDA_KERNEL:
    from pylo.optim import MuLO_CUDA
    optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
else:
    from pylo.optim import MuLO_naive
    optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')

# Data
transform = transforms.ToTensor()
train_loader = DataLoader(
    datasets.MNIST(root='./data', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(1):  # Just 1 epoch for simplicity
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

print("Done!")
```
### Per-Parameter Input Features Used by MuLO
| **Type** | **# Features** | **Description** | **Equation** |
|---------------------------|----------------|------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|
| **Accumulators** | 3 | Momentum accumulators with coefficients βᵢ, i ∈ {1, 2, 3}. | mₜ,ᵢ = βᵢ·mₜ₋₁,ᵢ + (1 − βᵢ)·∇ₜ |
| | 1 | Second moment accumulator with coefficient β₄. | vₜ = β₄·vₜ₋₁ + (1 − β₄)·∇ₜ² |
| | 3 | Adafactor row accumulators with coefficients βᵢ, i ∈ {5, 6, 7}. | rₜ,ᵢ = βᵢ·rₜ₋₁,ᵢ + (1 − βᵢ)·row_mean(∇ₜ²) |
| | 3 | Adafactor column accumulators with coefficients βᵢ, i ∈ {5, 6, 7}. | cₜ,ᵢ = βᵢ·cₜ₋₁,ᵢ + (1 − βᵢ)·col_mean(∇ₜ²) |
| **Accumulator Features**  | 3              | Normalized momentum: momentum divided by sqrt of second moment for i ∈ {1, 2, 3}.                     | mₜ,ᵢ / √vₜ                                                                |
| | 1 | Reciprocal sqrt of second moment value. | 1 / √v |
| | 6 | Reciprocal sqrt of Adafactor accumulators. | 1 / √(rₜ,ᵢ or cₜ,ᵢ) |
| | 3 | Adafactor gradient features for i ∈ {5, 6, 7}. | ∇ₜ × rₜ,ᵢ × cₜ,ᵢ |
| | 3 | Adafactor momentum features for (i, j) ∈ {(5,1), (6,2), (7,3)}. | mₜ,ⱼ × rₜ,ᵢ × cₜ,ᵢ |
| **Time Features** | 11 | Time features for x ∈ {1, 3, 10, 30, 100, 300, 1000, 3000, 10⁴, 3·10⁴, 10⁵}. | tanh(t / x) |
| **Parameters** | 1 | Parameter value. | wₜ |
| | 1 | Gradient value. | ∇ₜ |
| **Total** | 39 | — | — |
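
As a rough illustration of the equations in the table, the sketch below computes the momentum and second-moment accumulators, the normalized-momentum and reciprocal-sqrt features, and the 11 time features for a single scalar parameter. It is not PyLO's code: the decay values in `BETAS` and `BETA_V`, the `EPS` stabilizer, and the toy gradient sequence are all assumptions chosen for the example, and the Adafactor features are omitted for brevity.

```python
import math

# Time-feature scales x listed in the table.
TIME_SCALES = [1, 3, 10, 30, 100, 300, 1000, 3000, 10**4, 3 * 10**4, 10**5]

BETAS = (0.9, 0.99, 0.999)  # assumption: illustrative momentum decays beta_1..beta_3
BETA_V = 0.999              # assumption: illustrative second-moment decay beta_4
EPS = 1e-8                  # assumption: small constant to avoid division by zero


def update_momentum(m_prev, grad, beta):
    """m_t = beta * m_{t-1} + (1 - beta) * grad"""
    return beta * m_prev + (1.0 - beta) * grad


def update_second_moment(v_prev, grad, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * grad^2"""
    return beta * v_prev + (1.0 - beta) * grad * grad


def time_features(t):
    """Return tanh(t / x) for each scale x."""
    return [math.tanh(t / x) for x in TIME_SCALES]


grads = [0.5, -0.2, 0.1]  # toy gradient sequence for one scalar parameter
m = [0.0, 0.0, 0.0]       # one momentum accumulator per beta
v = 0.0
for g in grads:
    m = [update_momentum(m_i, g, b) for m_i, b in zip(m, BETAS)]
    v = update_second_moment(v, g, BETA_V)

rsqrt_v = 1.0 / (math.sqrt(v) + EPS)            # 1 / sqrt(v)
normalized_momentum = [m_i * rsqrt_v for m_i in m]  # m_i / sqrt(v)
t_feats = time_features(len(grads))

print(normalized_momentum, rsqrt_v, t_feats)
```

Because `tanh(t / x)` saturates at different rates for different `x`, the 11 time features together give the optimizer a coarse view of training progress across several orders of magnitude of step count.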
# Cite
If you found this optimizer useful in your research, please consider citing our work:
```bibtex
@misc{therien2024mulo,
title = {$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers},
author = {Benjamin Thérien and Charles-Étienne Joseph and Boris Knyazev and Edouard Oyallon and Irina Rish and Eugene Belilovsky},
year = {2024},
eprint = {2406.00153},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2406.00153}
}
```