Add optimization docs and update implementation guide [skip-build]

Browse files

Files changed (2) hide show

docs/implementation.md +63 -19
docs/optimizations.md +125 -0

docs/implementation.md CHANGED Viewed

@@ -8,11 +8,12 @@ This document explains the internal architecture of the Muon optimizer for revie
 2. [Entry Point and Parameter Routing](#entry-point-and-parameter-routing)
 3. [Execution Paths](#execution-paths)
 4. [Parallel Pipeline (the core feature)](#parallel-pipeline)
-5. [Distributed Utilities](#distributed-utilities)
-6. [Newton-Schulz Orthogonalization](#newton-schulz-orthogonalization)
-7. [QK Clipping](#qk-clipping)
-8. [AdamW for Non-Muon Parameters](#adamw-for-non-muon-parameters)
-9. [Source File Map](#source-file-map)
 ---
@@ -33,17 +34,19 @@ Users must provide parameter groups with `use_muon=True/False` flags (via `get_d
 ```
 _step_muon(group)
   |
   +-- DTensor, all Replicate placements  --> base()       (no sharding)
-  +-- DTensor, numel <= threshold        --> distributed_muon()  (small param fallback)
   +-- DTensor, sharded                   --> parallel()   (pipelined all-to-all)
   +-- plain Tensor                       --> base()       (single device)
 ```
 Parameters are classified by their DTensor placements:
 - **Fully replicated** DTensors and plain tensors use `base()` &mdash; standard single-device Muon.
-- **Small sharded** DTensors (below `small_param_numel_threshold`, default 65536) use `distributed_muon()` &mdash; gathers the full tensor via `full_tensor()`, computes the update, then redistributes.
-- **Large sharded** DTensors use `parallel()` &mdash; the pipelined all-to-all approach described below.
 ## Execution Paths
@@ -51,9 +54,9 @@ Parameters are classified by their DTensor placements:
 Straightforward per-parameter loop: momentum update &rarr; Newton-Schulz orthogonalization &rarr; parameter update &rarr; optional QK clipping.
-### distributed_muon() &mdash; Full Gather
-Each parameter's gradient is gathered to full via `g.full_tensor()`, orthogonalized on every rank, then the updated full parameter is redistributed back to the original sharded placement. Simple but communication-heavy &mdash; used only as a fallback for small parameters.
 ### parallel() &mdash; Pipelined All-to-All
@@ -171,6 +174,47 @@ Inverse of gather:
 Each rank applies weight decay and the Muon update to its local parameter shard. Also applies QK clipping if configured.
 ## Distributed Utilities
 **File:** `distributed/utils.py`
@@ -181,7 +225,7 @@ These utilities solve the problem of mapping from a DTensor's arbitrary sharding
 Given a DTensor's placements and device mesh, this function:
-1. **Sorts** placements: Replicate dims first, then Shard dims by dimension (with `_StridedShard` after regular `Shard` on the same dim).
 2. **Permutes** the mesh accordingly.
 3. **Separates** replicate dims from shard dims &mdash; each replicate group gets its own shard sub-mesh.
 4. **Creates** a ProcessGroup for the current rank's shard mesh.
@@ -214,7 +258,7 @@ def _is_shard(placement):
 **File:** `newton_schulz.py`
-`_zeropower_via_newtonschulz5()` computes the orthogonal approximation of a matrix using 5 quintic Newton-Schulz iterations with pre-optimized coefficients. The result approximates `US'V^T` where `S'` is near-uniform on `[0.5, 1.5]`, which empirically does not hurt model performance vs. exact `UV^T`.
 Each iteration uses `matmul_transpose_assign()` (a Triton kernel for `X @ X^T`) for efficiency.
@@ -248,15 +292,15 @@ Parameters not eligible for Muon (1D parameters, embeddings, LM head) are optimi
 | File | Lines | Purpose |
 |------|-------|---------|
-| `muon.py` | ~525 | Optimizer class, parameter routing, 3 execution paths |
-| `pipeline.py` | ~290 | Generator-based parallel pipeline (gather/compute/scatter/update) |
 | `async_utils.py` | ~75 | Pipeline scheduler with bounded concurrency |
-| `core.py` | ~110 | `_muon_state` dataclass, momentum/update helpers, param grouping |
 | `distributed/utils.py` | ~230 | Shard mesh construction, DTensor index computation |
-| `newton_schulz.py` | ~50 | Newton-Schulz iteration |
-| `matmul_transpose_triton.py` | ~120 | Triton kernel for symmetric matmul |
-| `qk_clip.py` | ~130 | QK logit clipping |
-| `adamw.py` | ~160 | Fused AdamW for non-Muon params |
 ### Dependency Graph

 2. [Entry Point and Parameter Routing](#entry-point-and-parameter-routing)
 3. [Execution Paths](#execution-paths)
 4. [Parallel Pipeline (the core feature)](#parallel-pipeline)
+5. [MoE Expert Weight Support](#moe-expert-weight-support-expert_keys)
+6. [Distributed Utilities](#distributed-utilities)
+7. [Newton-Schulz Orthogonalization](#newton-schulz-orthogonalization)
+8. [QK Clipping](#qk-clipping)
+9. [AdamW for Non-Muon Parameters](#adamw-for-non-muon-parameters)
+10. [Source File Map](#source-file-map)
 ---
 ```
 _step_muon(group)
+  |
+  +-- momentum update (batched _foreach_* ops)
+  +-- _expand_expert_params()   -- 3D expert params → per-expert 2D views (cached)
   |
   +-- DTensor, all Replicate placements  --> base()       (no sharding)
   +-- DTensor, sharded                   --> parallel()   (pipelined all-to-all)
   +-- plain Tensor                       --> base()       (single device)
 ```
 Parameters are classified by their DTensor placements:
 - **Fully replicated** DTensors and plain tensors use `base()` &mdash; standard single-device Muon.
+- **Sharded** DTensors use `parallel()` &mdash; the pipelined all-to-all approach described below.
+- `distributed_muon()` exists as a **test-only reference implementation** for correctness verification.
 ## Execution Paths
 Straightforward per-parameter loop: momentum update &rarr; Newton-Schulz orthogonalization &rarr; parameter update &rarr; optional QK clipping.
+### distributed_muon() &mdash; Full Gather (test-only)
+Reference implementation for correctness verification. Uses batched all-gather to reconstruct full tensors, computes Newton-Schulz on the full grad, then slices back to local shards. Simple but communication-heavy &mdash; not used in production.
 ### parallel() &mdash; Pipelined All-to-All
 Each rank applies weight decay and the Muon update to its local parameter shard. Also applies QK clipping if configured.
+## MoE Expert Weight Support (`expert_keys`)
+**File:** `muon.py` &mdash; `_expand_expert_params()`
+MoE models have 3D expert weights with shape `(num_experts, out_dim, in_dim)`. Since Muon operates on 2D matrices, expert params need special handling.
+### Configuration
+Pass `expert_keys` to both `get_default_muon_param_groups()` and `Muon()`:
+```python
+params = get_default_muon_param_groups(model, expert_keys=["experts"])
+optim = Muon(params, expert_keys=["experts"], ...)
+```
+Any parameter whose name contains a string in `expert_keys` is treated as an expert-parallel parameter. Non-matching 3D+ parameters raise `AssertionError` to catch misconfiguration.
+### How It Works
+`_expand_expert_params()` runs after momentum and before routing to `base()`/`parallel()`/`distributed_muon()`:
+1. **Split on dim 0**: A 3D `(E, out, in)` tensor becomes `E` separate 2D `(out, in)` `nn.Parameter` views. Views share storage with the original, so in-place updates propagate back.
+2. **Placement remapping**: When the original is a DTensor, `Shard(k)` on dim `k > 0` becomes `Shard(k-1)` on the 2D slice (since dim 0 is consumed by the split).
+3. **Submesh wrapping**: Non-dim-0 shard placements are preserved by wrapping each 2D slice as a DTensor on the corresponding submesh. This is **placement-agnostic** &mdash; the same logic handles TP `Shard(1/2)`, EFSDP `Shard(1)`, or any other non-dim-0 sharding.
+### Placement-Agnostic Design
+The expansion logic does not care *why* a dimension is sharded &mdash; only whether it's on dim 0 (consumed by split) or not (preserved on submesh):
+| Original Placement | After Expansion |
+|-------------------|-----------------|
+| `Shard(0)` (EP) | Consumed by split &rarr; plain tensor |
+| `Shard(1)` (TP or EFSDP) | `Shard(0)` on submesh &rarr; 2D DTensor |
+| `Shard(2)` (TP row-wise) | `Shard(1)` on submesh &rarr; 2D DTensor |
+| `Replicate` | Ignored (not a shard) |
+| `_StridedShard(0)` (EFSDP) | Consumed by split &rarr; plain tensor |
+After expansion, the 2D params flow through the standard routing: DTensors with shard placements go to `parallel()`, plain tensors go to `base()`.
+For EP/EFSDP background and torchtitan integration details, see [`docs/expert_parallel.md`](expert_parallel.md).
 ## Distributed Utilities
 **File:** `distributed/utils.py`
 Given a DTensor's placements and device mesh, this function:
+1. **Sorts** placements: Replicate dims first, then Shard dims by dimension (with `_StridedShard` before regular `Shard` on the same dim, so the outer sharding is applied first).
 2. **Permutes** the mesh accordingly.
 3. **Separates** replicate dims from shard dims &mdash; each replicate group gets its own shard sub-mesh.
 4. **Creates** a ProcessGroup for the current rank's shard mesh.
 **File:** `newton_schulz.py`
+`_zeropower_via_newtonschulz5()` computes the polar factor of a matrix using the Polar Express method &mdash; quintic Newton-Schulz iterations with analytically optimal (minimax/Remez) coefficients precomputed by `_optimal_composition()`. The default configuration uses 10 iterations with `l=1e-3`, converging all singular values to 1 to produce the exact polar factor `UV^T`. Wrapped by `zeropower_via_newtonschulz5()` which adds per-shape `torch.compile` caching with CUDA graph support.
 Each iteration uses `matmul_transpose_assign()` (a Triton kernel for `X @ X^T`) for efficiency.
 | File | Lines | Purpose |
 |------|-------|---------|
+| `muon.py` | ~815 | Optimizer class, parameter routing, 3 execution paths, MoE expert expansion + caching |
+| `pipeline.py` | ~400 | Generator-based parallel pipeline (gather/compute/scatter/update) |
 | `async_utils.py` | ~75 | Pipeline scheduler with bounded concurrency |
+| `core.py` | ~175 | `_muon_state` dataclass, batched momentum/update helpers, param grouping |
 | `distributed/utils.py` | ~230 | Shard mesh construction, DTensor index computation |
+| `newton_schulz.py` | ~190 | Polar Express coefficients, Newton-Schulz iteration + compile/CUDA graph |
+| `matmul_transpose_triton.py` | ~130 | Triton kernel for symmetric matmul |
+| `qk_clip.py` | ~135 | QK logit clipping |
+| `adamw.py` | ~170 | Fused AdamW for non-Muon params |
 ### Dependency Graph

docs/optimizations.md ADDED Viewed

	@@ -0,0 +1,125 @@

+# Performance Optimizations (vs. main)
+Summary of optimizations on branch `perf/pipelined-distributed-muon-clean` relative to `main`.
+---
+## 1. Batched Momentum (`core.py`)
+**Before:** Per-param `update_g()` — one `torch.add` + optional `torch.add_` per parameter.
+**After:** `_batch_pre_ortho()` — `_foreach_mul_`, `_foreach_add_` on lists of local tensors (unwrapped from DTensor). Single fused kernel per batch instead of N individual kernels.
+**Impact:** Eliminates N per-param Python-loop overhead + N small kernel launches. Scales with parameter count.
+---
+## 2. Pipeline Buffer Packing (`pipeline.py`)
+### Gather send buffer
+**Before:** Per-param `.to(COMM_DTYPE).contiguous()` followed by per-destination `append` to list, then `torch.cat` on the per-dst lists.
+**After:** Collect all grad slices in destination order in a single pass, then one `torch.cat` call. Avoids intermediate per-destination lists and redundant dtype conversions.
+### Scatter send buffer
+**Before:** Per-param, per-destination-rank: index `u_full[indices].flatten()`, append to per-dst list, then flatten+cat.
+**After:** Cache `u_full` conversions (avoid redundant `.to()` per dst_rank). Collect all slices in dst order in one pass, single `torch.cat`.
+**Impact:** Fewer kernel launches, less Python overhead, reduced intermediate allocations.
+---
+## 3. Zero-Copy Scatter (`pipeline.py`)
+**Before:** `_launch_scatter` pre-allocates `torch.empty_like(p.to_local())` for every param. `_complete_scatter` copies from recv_buf into these pre-allocated tensors via `copy_()`.
+**After:** `_complete_scatter` assigns **views** into `recv_buf` directly (via `recv_buf.narrow(...).view_as(...)`). No pre-allocation, no copy. The recv_buf storage stays alive through the views until `_update_params` consumes them.
+**Impact:** Eliminates N `empty_like` allocations + N `copy_` kernel launches per scatter stage.
+---
+## 4. Batched Parameter Update (`pipeline.py`)
+**Before:** Per-param loop calling `update_p()` (which unwraps DTensor, applies weight decay, applies update individually).
+**After:** Batched using `_foreach_mul_` (weight decay) and `_foreach_add_` (Muon update), grouped by `adjusted_lr` to preserve float32 alpha precision. Single kernel per group instead of per param.
+**Impact:** Reduces N per-param kernel launches to 1-2 batched kernel launches.
+---
+## 5. Parallel Metadata Caching (`muon.py`)
+**Before:** `init_state_and_assign_params()` called every step — sorts params by FLOP cost, assigns ownership via round-robin, precomputes per-rank indices/numels for all-to-all.
+**After:** `_parallel_cache` keyed by `tuple(names)`. First call computes and caches `ordered_names`, `name_to_state`, `rank`, `chunk_size`. Subsequent calls reuse cached metadata, only rebuilding `param_to_state` with current `id(p)` keys (since param objects are stable but ids may change for QK clip updates).
+**Impact:** Eliminates repeated sorting, mesh construction, and index precomputation on every step.
+---
+## 6. Expert Param Expansion Caching (`muon.py`)
+**Before:** `_expand_expert_params()` called every step — for each expert param `(E, out, in)`, creates E `nn.Parameter` wrappers (triggers `aten::detach`), indexes data and grad (`aten::select`), and wraps in DTensor for TP.
+**After:** `_expert_expand_cache` keyed by `tuple(id(p) for p in params)`. Cold path runs `_expand_expert_params` once and caches:
+- `expanded_names` / `expanded_params` — the nn.Parameter wrappers with stable data views
+- `grad_info` — per-expert-group metadata (orig param index, num experts, expanded start index, DTensor flag, TP mesh/placements)
+Hot path reuses cached nn.Parameter objects (data views are stable since optimizer updates happen in-place on the same storage). Only updates `.grad` on each cached expert param by slicing the current step's gradient.
+**Eliminated on hot path:**
+- `nn.Parameter()` construction — removes `aten::detach`
+- `local_data[i]` data slicing — removes half of `aten::select` + `aten::as_strided`
+- `DTensor.from_local()` for data — only needed for grad now
+- `is_expert_param()` name matching per step
+**Still required per step:**
+- `local_grad[i]` — grad tensor changes each step (nesterov)
+- `DTensor.from_local(slice_grad, ...)` — for TP expert grads
+- `p.grad = None` — freeing original 3D grad storage
+**Impact:** ~8ms CPU overhead reduction per step at production scale (64 GPUs, 48 local experts).
+---
+## 7. Newton-Schulz Compile + CUDA Graph (`newton_schulz.py`)
+**Before:** `_zeropower_via_newtonschulz5()` called directly every time.
+**After:** `zeropower_via_newtonschulz5()` wrapper with per-shape `torch.compile` caching + CUDA graph (`triton.cudagraphs=True`). Each unique shape gets its own compiled function stored in `_ns_per_shape`. Toggled via `set_ns_compile(enabled)`.
+**Impact:** After warmup, NS iterations run as CUDA graphs — eliminates per-step compilation overhead and CPU-GPU synchronization.
+---
+## 8. Removed `small_param_numel_threshold` (`muon.py`)
+**Before:** Small sharded DTensors (below threshold, default 65536) fell back to `distributed_muon()` which used per-param `full_tensor()` + redistribute.
+**After:** All sharded DTensors go to `parallel()`. `distributed_muon()` is retained as a test-only reference implementation. Uneven shard splits (e.g., MoE gate weights with fewer rows than shard ranks) are handled inline via `full_tensor()` fallback within the batched distributed_muon path.
+**Impact:** Simpler routing, no silent fallback to slower path.
+---
+## Summary Table
+| Optimization | Location | Category | Kernel Launches Saved |
+|---|---|---|---|
+| Batched momentum | `core.py` | CPU + GPU | N per-param → 2-3 batched |
+| Buffer packing (gather) | `pipeline.py` | CPU + GPU | N cat+cast → 1 cat+cast |
+| Buffer packing (scatter) | `pipeline.py` | CPU + GPU | N cat → 1 cat |
+| Zero-copy scatter | `pipeline.py` | GPU memory | N alloc+copy → 0 |
+| Batched param update | `pipeline.py` | CPU + GPU | N update → 1-2 batched |
+| Parallel metadata cache | `muon.py` | CPU | Sort+index per step → once |
+| Expert expand cache | `muon.py` | CPU | N detach+select → grad-only |
+| NS compile + CUDA graph | `newton_schulz.py` | GPU | JIT warmup → graph replay |
+| Remove small_param_threshold | `muon.py` | Routing | Simpler, unified path |