arxiv:2606.12397

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Published on Jun 10

Submitted by Songhao Wu on Jun 11

#1 Paper of the day

Authors: Songhao Wu, Ang Lv

Abstract

Researchers propose a novel router redesign for Mixture-of-Experts models that aligns router rows with the principal singular directions of expert matrices using Manifold Power Iteration to improve model effectiveness.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

The router is a cornerstone component of Mixture-of-Experts (MoE) models. The rows of the router matrix serve as expert proxies: their similarity to the MoE input determines which subset of experts is activated. Ideally, each router row should condense its expert matrix into a representative vector, so that its dot product with a token better reflects token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, MPI introduces a "Power-then-Retract" paradigm: a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint, ensuring both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters and confirm that this alignment yields more effective MoE models.
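The "Power-then-Retract" step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation; it makes several assumptions that the abstract does not state: each expert is represented by a single weight matrix of shape (d_model, d_ff), the target is that matrix's top left singular vector, the retraction is a rescaling onto a sphere of radius 1, and one step is applied per call. The function names `power_then_retract` and `route_top_k` are hypothetical.

```python
import numpy as np


def power_then_retract(router, experts, radius=1.0, eps=1e-12):
    """One assumed MPI-style update for every router row.

    router:  (n_experts, d_model), one row per expert
    experts: (n_experts, d_model, d_ff), one matrix W_i per expert (assumption)
    """
    # Power step: r_i <- W_i (W_i^T r_i), without forming W_i W_i^T explicitly.
    proj = np.einsum("edf,ed->ef", experts, router)   # W_i^T r_i
    powered = np.einsum("edf,ef->ed", experts, proj)  # W_i (W_i^T r_i)
    # Retraction: rescale each row back onto the sphere of the given radius.
    norms = np.linalg.norm(powered, axis=1, keepdims=True)
    return radius * powered / np.maximum(norms, eps)


def route_top_k(tokens, router, k=2):
    """Standard dot-product routing: pick the k highest-scoring experts per token."""
    logits = tokens @ router.T                        # (n_tokens, n_experts)
    return np.argsort(-logits, axis=1)[:, :k]


rng = np.random.default_rng(0)
n_experts, d_model, d_ff = 4, 16, 32

# Toy experts with a clear spectral gap (rank-one signal plus small noise),
# chosen only so that the power iteration converges quickly in this demo.
u = rng.standard_normal((n_experts, d_model))
u /= np.linalg.norm(u, axis=1, keepdims=True)
v = rng.standard_normal((n_experts, d_ff))
v /= np.linalg.norm(v, axis=1, keepdims=True)
noise = rng.standard_normal((n_experts, d_model, d_ff))
experts = 3.0 * np.einsum("ed,ef->edf", u, v) + 0.1 * noise

# Start from random router rows and repeat the update.
router = rng.standard_normal((n_experts, d_model))
for _ in range(50):
    router = power_then_retract(router, experts)

# Compare each row with the top left singular vector from a full SVD.
u_top = np.stack([np.linalg.svd(W)[0][:, 0] for W in experts])
alignment = np.abs(np.sum(router * u_top, axis=1))    # |cosine|, sign-invariant

tokens = rng.standard_normal((8, d_model))
chosen = route_top_k(tokens, router, k=2)
print("alignment per expert:", alignment)
print("experts chosen per token:\n", chosen)
```

In this toy setting the router rows line up with the principal singular directions because repeated multiplication by W_i W_i^T amplifies the dominant component, while the retraction keeps the row norm fixed. How the update interacts with gradient-based training, and which expert matrix is used in a real MoE layer, are left open here.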

Community

shwu

Paper author Paper submitter about 11 hours ago

edited about 3 hours ago

We propose a redesign of the MoE router that applies power iteration during the forward pass to couple router weights and expert parameters within the singular space of the expert parameters. We contend that this imposes an explicit constraint that forces router weights to better reflect the parametric characteristics of the expert weights, resulting in better expert routing. Our initial results and extensive analysis validate the effectiveness of this design. We hope our work inspires researchers to rethink MoE routers and leads to more valuable insights for future router designs.

noahml

about 4 hours ago

This is a neat approach to MoE routing. I like the idea of moving away from arbitrary router weights and instead using the principal singular direction of the experts to guide the selection process. It feels like a much more grounded way to define token-expert affinity than how most models currently handle it.

Since this uses a Power-then-Retract paradigm, how much computational overhead does it add to the training loop compared to standard routing?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/b091d9ea-bfd5-4ea9-bced-18546d1f87e4