arxiv:2606.12397

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Published on Jun 10 · Submitted by Songhao Wu on Jun 11
#1 Paper of the day

Abstract

Researchers propose a novel router redesign for Mixture-of-Experts models that aligns router rows with the principal singular directions of expert matrices using Manifold Power Iteration to improve model effectiveness.

The router is a cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row encodes its expert matrix into a single representative vector, so that its dot product with a token better reflects token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive single-vector description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, in which a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of the associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters to confirm that this alignment yields more effective MoE models.
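As a rough illustration of the "Power-then-Retract" idea described in the abstract, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a single expert matrix W of shape (d_hidden, d_model), a power step of the form W^T W r, and a retraction that simply rescales the result back to a fixed norm. The function name `power_then_retract`, the `radius` parameter, and the synthetic expert matrix are all hypothetical and chosen only for this example.

```python
import numpy as np

def power_then_retract(router_row, expert_matrix, radius=1.0):
    """One illustrative update for a single router row (assumed form)."""
    # Power step: applying W^T W amplifies the component of the row that
    # lies along the principal right singular vector of W.
    powered = expert_matrix.T @ (expert_matrix @ router_row)
    # Retraction: rescale back onto the sphere of the given radius so the
    # row's norm stays fixed (the norm constraint).
    return radius * powered / np.linalg.norm(powered)

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 32

# Build a synthetic expert matrix W = U diag(s) V^T with a known
# principal right singular vector V[:, 0] and a clear singular gap.
U, _ = np.linalg.qr(rng.standard_normal((d_hidden, d_model)))
V, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
s = np.concatenate(([3.0], np.linspace(1.0, 0.1, d_model - 1)))
W = U @ np.diag(s) @ V.T

# Start from a random unit-norm router row and iterate.
r = rng.standard_normal(d_model)
r /= np.linalg.norm(r)
for _ in range(30):
    r = power_then_retract(r, W)

# Absolute cosine similarity with the principal direction (sign-invariant).
alignment = abs(r @ V[:, 0])
print(f"alignment with principal singular direction: {alignment:.6f}")
```

In this toy setting the row approaches the principal right singular vector up to sign, and how quickly it does so depends on the gap between the two largest singular values. How the actual method interleaves such steps with gradient training is not specified on this page.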

Community

Paper author Paper submitter

We propose a redesign of the MoE router that uses power iteration during the forward pass to couple router weights and expert parameters within the singular space of the parameters. We contend that this imposes an explicit constraint that forces router weights to better reflect the parametric characteristics of the expert weights, resulting in better expert routing. Our initial results and extensive analysis validate the effectiveness of this design. We hope our work inspires researchers to rethink MoE routers and leads to more valuable insights for future router designs.
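To make the forward-pass coupling concrete, the following NumPy sketch refines every router row with one power step against its own expert and then performs ordinary top-k routing by dot product. It is only a sketch under assumptions: the function names `refine_router` and `route_tokens`, the use of a single expert matrix per expert, one power step per forward pass, and the softmax over the selected logits are illustrative choices, not details confirmed by the paper page.

```python
import numpy as np

def refine_router(router, experts):
    """Apply one power step per router row, then renormalize each row.

    router:  (n_experts, d_model)
    experts: (n_experts, d_hidden, d_model), one matrix per expert (assumed)
    """
    # W_e @ r_e for every expert e.
    hidden = np.einsum("emd,ed->em", experts, router)
    # W_e^T @ (W_e @ r_e) for every expert e.
    powered = np.einsum("emd,em->ed", experts, hidden)
    # Retract each row back to unit norm.
    return powered / np.linalg.norm(powered, axis=1, keepdims=True)

def route_tokens(tokens, router, top_k=2):
    """Pick the top_k experts per token by dot-product affinity."""
    logits = tokens @ router.T                      # (n_tokens, n_experts)
    top_idx = np.argsort(-logits, axis=1)[:, :top_k]
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    # Softmax over the selected experts only (a common convention).
    shifted = top_logits - top_logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    return top_idx, weights

rng = np.random.default_rng(0)
n_experts, d_hidden, d_model, n_tokens = 4, 32, 16, 5
experts = rng.standard_normal((n_experts, d_hidden, d_model))
router = rng.standard_normal((n_experts, d_model))
tokens = rng.standard_normal((n_tokens, d_model))

router = refine_router(router, experts)
top_idx, weights = route_tokens(tokens, router, top_k=2)
print(top_idx.shape, weights.shape)
```

Under these assumptions the extra work per forward pass is two matrix-vector products per expert, which does not scale with the number of tokens.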

This is a neat approach to MoE routing. I like the idea of moving away from arbitrary router weights and instead using the principal singular direction of the experts to guide the selection process. It feels like a much more grounded way to define token-expert affinity than how most models currently handle it.

Since this uses a Power-then-Retract paradigm, how much of a computational overhead does this add during the training loop compared to standard routing?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/b091d9ea-bfd5-4ea9-bced-18546d1f87e4


