Update README.md
---
license: apache-2.0
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- learned-optimizer
---

# Description

`btherien/mulo` is a learned optimizer meta-trained in μ-parameterization. It corresponds to the $\mu$LO$_M$ optimizer of [μLO: Compute-Efficient Meta-Generalization of Learned Optimizers](https://arxiv.org/abs/2406.00153). Because it was meta-trained in $\mu$P, $\mu$LO$_M$ has strong meta-generalization capabilities (the ability to optimize unseen tasks), despite being meta-trained on a relatively short and inexpensive task distribution.
### Learned optimizer meta-training and architectural details

| **Field** | **Value** |
|------------------------------|---------------------------------------------------------------------------|
| **Meta-training distribution** | ImageNet classification, 3-layer MLP, width ∈ {128, 512, 1024} |
| **Number of meta-training steps** | 5000 |
| **Target inner problem length** | 1000 iterations |
| **Gradient estimator** | PES (Persistent Evolution Strategies) |
| **Truncation length** | 50 |
| **Architecture** | small_fc_lopt [(Metz et al., 2022a)](https://arxiv.org/abs/2203.11860) |
| **Optimizer input size** | 39 |
| **Optimizer hidden size** | 32 |
| **Optimizer output size** | 2 |
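To make these dimensions concrete, here is a minimal sketch of a small_fc_lopt-style per-parameter MLP (39 input features, 32 hidden units, 2 outputs). The class name `TinyLOpt`, the constants `lambda1` and `lambda2`, and the `direction * exp(log_scale)` update rule are illustrative assumptions, not PyLO's actual implementation:

```python
import torch
import torch.nn as nn

class TinyLOpt(nn.Module):
    """Hypothetical small_fc_lopt-style MLP: 39 features -> 32 hidden -> 2 outputs."""

    def __init__(self, in_features=39, hidden=32, out_features=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, features, lambda1=1e-3, lambda2=1e-3):
        # features: (num_params, 39) per-parameter inputs
        out = self.net(features)                    # (num_params, 2)
        direction, log_scale = out[:, 0], out[:, 1]
        # Illustrative update: scaled direction with an exponential magnitude
        return lambda1 * direction * torch.exp(lambda2 * log_scale)

# Propose updates for five parameters from random features
print(TinyLOpt()(torch.randn(5, 39)).shape)  # torch.Size([5])
```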

# Usage
---

## (1) Install PyLO

The following commands clone and install PyLO from source:

```bash
git clone https://github.com/Belilovsky-Lab/pylo
cd pylo
pip install . --config-settings="--build-option=--cuda"  # optional flag: build with the accelerated CUDA kernels
```

## (2) Use $\mu$LO as a drop-in replacement for PyTorch optimizers

```python
USE_CUDA_KERNEL = True  # set to False to use the pure-PyTorch implementation

if USE_CUDA_KERNEL:
    from pylo.optim import MuLO_CUDA
    optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
else:
    from pylo.optim import MuLO_naive
    optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')
```
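Here `hf_key` names this repository on the Hugging Face Hub, so the meta-learned optimizer weights should be downloaded automatically when the optimizer is constructed (consistent with the `pytorch_model_hub_mixin` tag above).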

## (3) A simple example

The following example is for illustration purposes and does not implement the correct parameterization. For a correct implementation, see https://github.com/Belilovsky-Lab/pylo/tree/main/examples

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.net(x)

model = MLP().to(device)

#########################
# Setup Learned Optimizer
#########################
USE_CUDA_KERNEL = False  # set to True for the accelerated CUDA kernels
if USE_CUDA_KERNEL:
    from pylo.optim import MuLO_CUDA
    optimizer = MuLO_CUDA(model.parameters(), hf_key='btherien/mulo')
else:
    from pylo.optim import MuLO_naive
    optimizer = MuLO_naive(model.parameters(), hf_key='btherien/mulo')

# Data
transform = transforms.ToTensor()
train_loader = DataLoader(datasets.MNIST(root='./data', train=True, download=True, transform=transform),
                          batch_size=64, shuffle=True)

criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(1):  # just 1 epoch for simplicity
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

print("Done!")
```

### Per-Parameter Input Features Used by MuLO

| **Type** | **# Features** | **Description** | **Equation** |
|---------------------------|----------------|------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|
| **Accumulators** | 3 | Momentum accumulators with coefficients βᵢ, i ∈ {1, 2, 3}. | mₜ,ᵢ = βᵢ·mₜ₋₁,ᵢ + (1 − βᵢ)·∇ₜ |
| | 1 | Second moment accumulator with coefficient β₄. | vₜ = β₄·vₜ₋₁ + (1 − β₄)·∇ₜ² |
| | 3 | Adafactor row accumulators with coefficients βᵢ, i ∈ {5, 6, 7}. | rₜ,ᵢ = βᵢ·rₜ₋₁,ᵢ + (1 − βᵢ)·row_mean(∇ₜ²) |
| | 3 | Adafactor column accumulators with coefficients βᵢ, i ∈ {5, 6, 7}. | cₜ,ᵢ = βᵢ·cₜ₋₁,ᵢ + (1 − βᵢ)·col_mean(∇ₜ²) |
| **Accumulator Features** | 3 | Normalized momentum: momentum divided by the square root of the second moment, i ∈ {1, 2, 3}. | mₜ,ᵢ / √vₜ |
| | 1 | Reciprocal square root of the second moment. | 1 / √vₜ |
| | 6 | Reciprocal square root of the Adafactor accumulators. | 1 / √(rₜ,ᵢ or cₜ,ᵢ) |
| | 3 | Adafactor gradient features, i ∈ {5, 6, 7}. | ∇ₜ × rₜ,ᵢ × cₜ,ᵢ |
| | 3 | Adafactor momentum features, (i, j) ∈ {(5,1), (6,2), (7,3)}. | mₜ,ⱼ × rₜ,ᵢ × cₜ,ᵢ |
| **Time Features** | 11 | Time features for x ∈ {1, 3, 10, 30, 100, 300, 1000, 3000, 10⁴, 3·10⁴, 10⁵}. | tanh(t / x) |
| **Parameters** | 1 | Parameter value. | wₜ |
| | 1 | Gradient value. | ∇ₜ |
| **Total** | 39 | — | — |
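For intuition, the sketch below updates a few of these accumulators for a single parameter tensor and derives the corresponding features. The helper `example_features` and the specific β and ε values are hypothetical; PyLO maintains these statistics internally:

```python
import torch

def example_features(grad, m, v, t, betas=(0.9, 0.99, 0.999), beta4=0.999, eps=1e-8):
    # Hypothetical helper: momentum accumulators m_i (i in {1, 2, 3})
    m = [b * m_i + (1 - b) * grad for b, m_i in zip(betas, m)]
    # Second moment accumulator v (coefficient beta4)
    v = beta4 * v + (1 - beta4) * grad ** 2
    rsqrt_v = 1.0 / torch.sqrt(v + eps)          # reciprocal sqrt feature
    normalized = [m_i * rsqrt_v for m_i in m]    # normalized momentum features
    # Time features tanh(t / x) for a few of the eleven horizons x
    time_feats = [torch.tanh(torch.tensor(t / x)) for x in (1, 100, 10_000)]
    return m, v, normalized, rsqrt_v, time_feats

grad = torch.randn(4, 3)
m, v, normalized, rsqrt_v, time_feats = example_features(
    grad, m=[torch.zeros(4, 3)] * 3, v=torch.zeros(4, 3), t=1
)
```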

# Cite

If you found this optimizer useful in your research, please consider citing our work:

```bibtex
@misc{therien2024mulo,
  title         = {$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers},
  author        = {Benjamin Thérien and Charles-Étienne Joseph and Boris Knyazev and Edouard Oyallon and Irina Rish and Eugene Belilovsky},
  year          = {2024},
  eprint        = {2406.00153},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2406.00153}
}
```