	---
	license: apache-2.0
	tags:
	- model_hub_mixin
	- pytorch_model_hub_mixin
	- learned-optimizer
	---


	# Description
	`btherien/mulo` is a learned optimizer meta-trained in μ-parameterization (μP). It corresponds to the μLO_M optimizer from [μLO: Compute-Efficient Meta-Generalization of Learned Optimizers](https://arxiv.org/abs/2406.00153). Because it was meta-trained in μP, μLO_M has strong meta-generalization capabilities (i.e., the ability to optimize unseen tasks) despite its relatively short and inexpensive meta-training.
	### Learned optimizer meta-training and architectural details
	| Field | Value |
	|-------------------------------|----------------------------------------------------------------|
	| Meta-training distribution | ImageNet classification, 3-layer MLP, width ∈ {128, 512, 1024} |
	| Number of meta-training steps | 5000 |
	| Target inner problem length | 1000 iterations |
	| Gradient estimator | Persistent Evolution Strategies |
	| Truncation length | 50 |
	| Architecture | small_fc_lopt |
	| Optimizer input size | 39 |
	| Optimizer hidden size | 32 |
	| Optimizer output size | 2 |
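
To make the sizes in the table concrete, the sketch below builds a per-parameter MLP with 39 inputs, 32 hidden units, and 2 outputs in plain PyTorch. Only those three sizes come from the table. The number of hidden layers, the ReLU activation, the class name `TinyLearnedOptimizerMLP`, and the way the two outputs are combined into an update (a direction scaled by an exponentiated magnitude, with hypothetical constants `step_mult` and `exp_mult`) are assumptions for illustration and are not read from the released `small_fc_lopt` weights or the PyLO code.

```python
import torch
import torch.nn as nn


class TinyLearnedOptimizerMLP(nn.Module):
    """Illustrative per-parameter MLP: 39 features in, 2 values out."""

    def __init__(self, in_size=39, hidden_size=32, out_size=2,
                 step_mult=1e-3, exp_mult=1e-3):
        super().__init__()
        # Assumption: two hidden layers with ReLU activations.
        self.net = nn.Sequential(
            nn.Linear(in_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, out_size),
        )
        # Hypothetical scaling constants, chosen only for this sketch.
        self.step_mult = step_mult
        self.exp_mult = exp_mult

    def forward(self, features):
        # features: (num_params, 39), one row per scalar parameter.
        out = self.net(features)
        direction, magnitude = out[:, 0], out[:, 1]
        # Assumption: update = direction * exp(magnitude * exp_mult) * step_mult.
        return direction * torch.exp(magnitude * self.exp_mult) * self.step_mult


torch.manual_seed(0)
lopt = TinyLearnedOptimizerMLP()
features = torch.randn(10, 39)  # 10 scalar parameters, 39 features each
update = lopt(features)         # one update value per parameter
print(update.shape)             # torch.Size([10])
```

The same small network is applied independently to every scalar parameter, which is why its input is a feature vector per parameter rather than the full weight tensor.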

	# Usage
	---
	## (1) Install PyLO
	The following commands clone PyLO and install it from source:
	```bash
	git clone https://github.com/Belilovsky-Lab/pylo
	cd pylo
	pip install . --config-settings="--build-option=--cuda"  # optional: the --config-settings flag builds the CUDA kernels
	```

	## (2) Use $\mu$LO as a drop-in replacement for PyTorch optimizers

	```python
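	# `model` is any torch.nn.Module defined elsewhere.
	USE_CUDA_KERNEL = False  # set to True to use the accelerated CUDA kernels
	if USE_CUDA_KERNEL:
	    from pylo.optim import MuLO_CUDA
	    optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
	else:
	    from pylo.optim import MuLO_naive
	    optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')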
	```

	## (3) A simple example

	The following example is for illustration only and does not implement the correct parameterization (μP). For a correct implementation, see https://github.com/Belilovsky-Lab/pylo/tree/main/examples

	```python
	import torch
	import torch.nn as nn
	from torchvision import datasets, transforms
	from torch.utils.data import DataLoader

	# Device (defined first so the model can be moved to it)
	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

	# Model
	class MLP(nn.Module):
	    def __init__(self):
	        super().__init__()
	        self.net = nn.Sequential(
	            nn.Flatten(),
	            nn.Linear(28 * 28, 128),
	            nn.ReLU(),
	            nn.Linear(128, 10),
	        )

	    def forward(self, x):
	        return self.net(x)

	model = MLP().to(device)

	#########################
	# Setup Learned Optimizer
	#########################
	USE_CUDA_KERNEL = False  # Set to True for accelerated kernels
	if USE_CUDA_KERNEL:
	    from pylo.optim import MuLO_CUDA
	    optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
	else:
	    from pylo.optim import MuLO_naive
	    optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')

	# Data
	transform = transforms.ToTensor()
	train_loader = DataLoader(
	    datasets.MNIST(root='./data', train=True, download=True, transform=transform),
	    batch_size=64,
	    shuffle=True,
	)

	criterion = nn.CrossEntropyLoss()

	# Training loop
	for epoch in range(1):  # Just 1 epoch for simplicity
	    for x, y in train_loader:
	        x, y = x.to(device), y.to(device)
	        optimizer.zero_grad()
	        loss = criterion(model(x), y)
	        loss.backward()
	        optimizer.step()

	print("Done!")
	```


	### Per-Parameter Input Features Used by MuLO

	| Type | # Features | Description | Equation |
	|----------------------|------------|-------------|----------|
	| Accumulators | 3 | Momentum accumulators with coefficients βᵢ, i ∈ {1, 2, 3}. | mₜ,ᵢ = βᵢ·mₜ₋₁,ᵢ + (1 − βᵢ)·∇ₜ |
	| | 1 | Second moment accumulator with coefficient β₄. | vₜ = β₄·vₜ₋₁ + (1 − β₄)·∇ₜ² |
	| | 3 | Adafactor row accumulators with coefficients βᵢ, i ∈ {5, 6, 7}. | rₜ,ᵢ = βᵢ·rₜ₋₁,ᵢ + (1 − βᵢ)·row_mean(∇ₜ²) |
	| | 3 | Adafactor column accumulators with coefficients βᵢ, i ∈ {5, 6, 7}. | cₜ,ᵢ = βᵢ·cₜ₋₁,ᵢ + (1 − βᵢ)·col_mean(∇ₜ²) |
	| Accumulator features | 3 | Normalized momentum: momentum divided by the square root of the second moment, for i ∈ {1, 2, 3}. | mₜ,ᵢ / √vₜ |
	| | 1 | Reciprocal square root of the second moment. | 1 / √vₜ |
	| | 6 | Reciprocal square root of the Adafactor row and column accumulators. | 1 / √(rₜ,ᵢ or cₜ,ᵢ) |
	| | 3 | Adafactor gradient features for i ∈ {5, 6, 7}. | ∇ₜ × rₜ,ᵢ × cₜ,ᵢ |
	| | 3 | Adafactor momentum features for (i, j) ∈ {(5,1), (6,2), (7,3)}. | mₜ,ⱼ × rₜ,ᵢ × cₜ,ᵢ |
	| Time features | 11 | Time features for x ∈ {1, 3, 10, 30, 100, 300, 1000, 3000, 10⁴, 3·10⁴, 10⁵}. | tanh(t / x) |
	| Parameters | 1 | Parameter value. | wₜ |
	| | 1 | Gradient value. | ∇ₜ |
	| Total | 39 | - | - |
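
The accumulator updates and time features in the table can be sketched for a single 2-D weight matrix as follows. This is an illustration only: the decay values in `momentum_betas`, `second_moment_beta`, and `adafactor_betas`, the helper names `init_state`, `update_accumulators`, and `time_features`, and the choice of which axis counts as "row" versus "column" are assumptions, not values read from the released optimizer.

```python
import torch

# Hypothetical decay coefficients (assumptions, not the released values).
momentum_betas = (0.9, 0.99, 0.999)    # β1..β3
second_moment_beta = 0.999             # β4
adafactor_betas = (0.9, 0.99, 0.999)   # β5..β7
# Time scales x from the table, used in tanh(t / x).
time_scales = torch.tensor(
    [1, 3, 10, 30, 100, 300, 1000, 3000, 1e4, 3e4, 1e5], dtype=torch.float32
)


def init_state(shape):
    """Zero-initialised accumulators for one (rows, cols) weight matrix."""
    rows, cols = shape
    return {
        "m": [torch.zeros(shape) for _ in momentum_betas],
        "v": torch.zeros(shape),
        "r": [torch.zeros(rows) for _ in adafactor_betas],
        "c": [torch.zeros(cols) for _ in adafactor_betas],
    }


def update_accumulators(state, grad):
    """Apply one exponential-moving-average step to every accumulator."""
    g2 = grad ** 2
    for i, b in enumerate(momentum_betas):
        state["m"][i] = b * state["m"][i] + (1 - b) * grad
    b4 = second_moment_beta
    state["v"] = b4 * state["v"] + (1 - b4) * g2
    for i, b in enumerate(adafactor_betas):
        # Assumption: row_mean averages over columns, col_mean over rows.
        state["r"][i] = b * state["r"][i] + (1 - b) * g2.mean(dim=1)
        state["c"][i] = b * state["c"][i] + (1 - b) * g2.mean(dim=0)
    return state


def time_features(t):
    """Return the 11 features tanh(t / x) for training step t."""
    return torch.tanh(t / time_scales)


state = init_state((4, 3))
grad = torch.ones(4, 3)
state = update_accumulators(state, grad)
feats = time_features(1000.0)
print(feats.shape)  # torch.Size([11])
```

The row and column accumulators store one value per row or column rather than one per weight, so they would have to be broadcast back to the weight's shape before being used as per-parameter features.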


	# Cite
	If you found this optimizer useful in your research, please consider citing our work:
	```bibtex
	@misc{therien2024mulo,
	title = {$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers},
	author = {Benjamin Thérien and Charles-Étienne Joseph and Boris Knyazev and Edouard Oyallon and Irina Rish and Eugene Belilovsky},
	year = {2024},
	eprint = {2406.00153},
	archivePrefix = {arXiv},
	primaryClass = {cs.LG},
	url = {https://arxiv.org/abs/2406.00153}
	}
	```