Title: Improving Routing in Sparse Mixture of Experts with Graph of Tokens

URL Source: https://arxiv.org/html/2505.00792


License: CC BY 4.0
arXiv:2505.00792v1 [cs.LG] 01 May 2025

Improving Routing in Sparse Mixture of Experts with Graph of Tokens

Tam Nguyen, Ngoc N. Tran, Khai Nguyen, Richard G. Baraniuk

Abstract

Sparse Mixture of Experts (SMoE) has emerged as a key to achieving unprecedented scalability in deep learning. By activating only a small subset of parameters per sample, SMoE achieves an exponential increase in parameter counts while maintaining a constant computational overhead. However, SMoE models are susceptible to routing fluctuations - changes in the routing of a given input to its target expert - at the late stage of model training, leading to model non-robustness. In this work, we unveil the limitation of SMoE through the perspective of the probabilistic graphical model (PGM). Through this PGM framework, we highlight the independence in the expert selection of tokens, which exposes the model to routing fluctuation and non-robustness. Alleviating this independence, we propose the novel Similarity-Aware (S)MoE, which considers interactions between tokens during expert selection. We then derive a new PGM underlying an (S)MoE-Attention block, going beyond just a single (S)MoE layer. Leveraging the token similarities captured by the attention matrix, we propose the innovative Attention-Aware (S)MoE, which employs the attention matrix to guide the routing of tokens to appropriate experts in (S)MoE. We theoretically prove that Similarity/Attention-Aware routing helps reduce the entropy of expert selection, resulting in more stable token routing mechanisms. We empirically validate our models on various tasks and domains, showing significant improvements in reducing routing fluctuations, enhancing accuracy, and increasing model robustness over the baseline MoE-Transformer with token routing via softmax gating.

1 Introduction

Mixture of Experts (MoE) (Jacobs et al., 1991; Jordan & Jacobs, 1994) has been widely used to scale up the number of parameters of deep neural networks while maintaining efficient computational overhead. MoE appears in applications of deep learning including language processing (Devlin et al., 2018; Radford et al., 2019; Raffel et al., 2020; Kaplan et al., 2020; Brown et al., 2020; Touvron et al., 2023), vision understanding (Neil & Dirk, 2020; Bao et al., 2021, 2022; Li et al., 2023; Bai et al., 2024), speech processing (Gaur et al., 2021), and other applications (Subramanian et al., 2024; Gormley & Murphy, 2011). A recent variation of MoE, called Sparse MoE (SMoE) (Shazeer et al., 2017), has been introduced to scale models to billions of parameters while maintaining constant computational costs by modularizing the network and activating only specific subsets of experts for each input. SMoE has been applied successfully across various machine learning tasks. For model training, it has been used in pre-training (Fedus et al., 2022; Artetxe et al., 2021) and fine-tuning tasks. In terms of applications, it has shown effectiveness in machine translation (Lepikhin et al., 2020), image classification (Riquelme et al., 2021), and large language modeling (Du et al., 2022).

1.1 Sparse Mixture of Experts

Mixture of Experts (MoE) introduces dynamic routing into deep learning models by replacing components such as feed-forward or convolutional layers with a set of specialized neural networks called experts. While this approach enhances model capacity and flexibility, it comes with a significant computational overhead, as the model needs to maintain and process multiple expert networks. Sparse Mixture of Experts (SMoE) addresses this limitation by activating only a subset of experts for each input, substantially reducing computational costs while maintaining performance.

Let ๐ธ be the number of experts in MoE. For each input token ๐’– ๐‘– โˆˆ โ„ ๐ท of MoE, a routerโ€™s component ๐‘’ -th computes the affinity scores between ๐’– ๐‘– and expert ๐‘’ -th as ๐›พ ๐‘’ โข ( ๐’– ๐‘– ) and each corresponding expert network ๐’ˆ ๐‘’ processes ๐’– ๐‘– to obtain the output ๐’ˆ ๐‘’ โข ( ๐’– ๐‘– ) , where ๐‘’ โˆˆ [ 1 , 2 , โ€ฆ , ๐ธ ] . In practice, the router ๐’“ ( . ) is often chosen as ๐’“ โข ( ๐’– ๐‘– )

[ ๐‘Ÿ 1 โข ( ๐’– ๐‘– ) , โ€ฆ , ๐‘Ÿ ๐ธ โข ( ๐’– ๐‘– ) ]

softmax โข ( [ ๐›พ 1 โข ( ๐’– ๐‘– ) , โ€ฆ , ๐›พ ๐พ โข ( ๐’– ๐‘– ) ] โŠค )

softmax โข ( ๐– โข ๐’– ๐‘– + ๐’ƒ ) , where ๐–

[ ๐’˜ 1 , โ€ฆ , ๐’˜ ๐ธ ] โŠค โˆˆ โ„ ๐พ ร— ๐ท , ๐’ƒ

[ ๐‘ 1 , โ€ฆ , ๐‘ ๐ธ ] โŠค โˆˆ โ„ ๐พ , and ๐›พ ๐‘’ โข ( ๐’– ๐‘– )

๐’˜ ๐‘’ โŠค โข ๐’– ๐‘– + ๐‘ ๐‘’ . We also refer to ๐’“ โข ( ๐’– ๐‘– ) as expert scores for token ๐’– ๐‘– . MoE aggregates the outputs of all experts as:

MoE โข ( ๐’– )

โˆ‘ ๐‘’ ๐ธ ๐‘Ÿ ๐‘’ โข ( ๐’– ๐‘– ) โข ๐’ˆ ๐‘’ โข ( ๐’– ๐‘– ) .

(1)

To improve computational efficiency, SMoE applies the $\mathrm{TopK}_0$ function, which retains an expert score $r_e(\boldsymbol{u}_i)$ only if it ranks among the top-$K$ highest scores and sets it to $0$ otherwise. The selected expert scores are then renormalized to sum to $1$. For brevity, we denote the combined $\mathrm{Renormalize}(\mathrm{TopK}_0)$ operation simply as $\mathrm{TopK}$ unless otherwise specified.

Outputs from selected experts are then linearly combined:

$$\mathrm{SMoE}(\boldsymbol{u}_i) = \sum_{e=1}^{E} \mathrm{TopK}\big(r_e(\boldsymbol{u}_i)\big)\, \boldsymbol{g}_e(\boldsymbol{u}_i). \quad (2)$$

Remark 1.

The routing scores in SMoE can be computed either as $\mathrm{Renormalize}(\mathrm{TopK}_0(r_e(\boldsymbol{u}_i)))$ or $\mathrm{Softmax}(\mathrm{TopK}_{\infty}(\gamma_e(\boldsymbol{u}_i)))$, where $\mathrm{TopK}_{\infty}$ assigns $-\infty$ to elements outside the $K$ highest affinity scores. These formulations are mathematically equivalent, as proven in Appendix B.3.
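The equivalence in Remark 1 can be checked numerically. Below is a minimal sketch (NumPy; the helper names are ours, not the paper's) comparing the two formulations on random affinity scores:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def renorm_topk0(scores, k):
    # Renormalize(TopK_0): keep the k largest softmax scores, zero the rest,
    # then rescale the survivors to sum to 1.
    out = np.zeros_like(scores)
    idx = np.argsort(scores)[-k:]
    out[idx] = scores[idx]
    return out / out.sum()

def softmax_topk_inf(gamma, k):
    # Softmax(TopK_inf): assign -inf to affinities outside the top-k, then softmax.
    masked = np.full_like(gamma, -np.inf)
    idx = np.argsort(gamma)[-k:]
    masked[idx] = gamma[idx]
    return softmax(masked)

gamma = np.array([1.2, -0.3, 0.8, 2.1, 0.0])  # affinity scores gamma_e(u_i)
k = 2
a = renorm_topk0(softmax(gamma), k)
b = softmax_topk_inf(gamma, k)
print(np.allclose(a, b))  # True: the two routings coincide
```

Both paths keep the same $K$ experts (softmax is monotone), and renormalizing a subset of softmax outputs equals softmaxing the masked logits.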

Routing fluctuation in SMoE. One of the major challenges in training SMoE is the fluctuating routing decisions during training (Dai et al., 2022; Zoph et al., 2022; Chi et al., 2022). For instance, even in the final epochs of training, up to 33% of tokens still switch their assigned experts (cf. Figure 3). This routing instability leads to model non-robustness, as small input perturbations can significantly change expert routing decisions, causing different experts to process similar inputs and leading to inconsistent outputs. In addition, improving the consistency of expert routing decisions is necessary for model training, since routing fluctuations, especially in the later stages of training, make it challenging to determine an appropriate stopping point. Therefore, reinforcing consistent routing decisions enhances model robustness and improves overall performance.
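One simple way to quantify this fluctuation is the fraction of tokens whose top-1 expert changes between two checkpoints. The sketch below (our illustrative metric on synthetic scores, not the paper's exact measurement protocol) assumes per-token expert scores saved at two points in training:

```python
import numpy as np

def routing_fluctuation(scores_prev, scores_curr):
    """Fraction of tokens whose top-1 expert changes between two
    checkpoints. Both arrays have shape (num_tokens, num_experts)."""
    prev = scores_prev.argmax(axis=1)
    curr = scores_curr.argmax(axis=1)
    return float((prev != curr).mean())

rng = np.random.default_rng(0)
prev = rng.standard_normal((1000, 8))
curr = prev + 0.5 * rng.standard_normal((1000, 8))  # small drift in router outputs
print(f"{routing_fluctuation(prev, curr):.1%} of tokens switched experts")
```

Even a small perturbation of the scores flips the argmax for a noticeable fraction of tokens, which is the instability the paper targets.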

1.2 Contributions

We utilize the attention mechanism and similarity between tokens to reduce routing fluctuation in SMoE. In particular, from the probabilistic graphical model (PGM) perspective underlying (S)MoE, we show that the independence between the chosen experts and each individual token leads to routing fluctuations. We then propose the novel Similarity-Aware (S)MoE, which promotes the assignment of similar tokens to the same expert. Consequently, Similarity-Aware (S)MoE enables tokens to influence each other's routing decisions, reducing routing fluctuations and improving the model's robustness. We extend the PGM framework to an MoE-Attention block in MoE-Transformer models and propose Attention-Aware (S)MoE, which leverages the dependencies captured in the attention matrix to guide the router's decisions. Our contributions are four-fold:

1. Under the probabilistic graphical model (PGM) perspective, we show that the independence between the chosen experts and each individual token results in routing fluctuations in (S)MoE.

2. We propose the novel Similarity-Aware (S)MoE, which breaks the independence between the chosen experts and tokens by building a graph of tokens, to mitigate routing fluctuations.

3. Going beyond a single (S)MoE layer, we develop the PGM framework for the (S)MoE-Attention block in the (S)MoE-Transformer. We propose Attention-Aware (S)MoE, an extension of Similarity-Aware (S)MoE, which leverages the dependencies between tokens captured by the attention matrix to guide the router's decision.

4. We theoretically prove that Similarity-Aware (S)MoE, as well as Attention-Aware (S)MoE, reduces the entropy in the decision-making process of indecisive tokens. This entropy reduction fosters more confident and consistent expert assignments.

We empirically corroborate the advantages of Similarity-Aware (S)MoE and Attention-Aware (S)MoE models across various tasks and domains.

Organization. In Section 2, we begin by presenting the Probabilistic Graphical Model (PGM) of (S)MoE and analyzing its limitations. Based on these insights, we introduce our Similarity-Aware (S)MoE framework in the same section. We then extend our analysis beyond single-layer (S)MoE by examining (S)MoE-Attention blocks within (S)MoE-Transformers, leading to our proposed Attention-Aware (S)MoE in Section 3. Section 4 provides a detailed entropy analysis of these models, followed by comprehensive experimental results across various tasks in Section 5. We conclude our work in Section 7, with supplementary materials provided in the Appendix.

Notation. To facilitate the understanding of the content of our work, all used notations are explained in Appendix A.

2 Routing in SMoE with Graph of Tokens

Probabilistic graphical models (PGMs) provide a framework to understand conditional dependencies among variables and to perform inference on them. In this section, we examine the PGM that forms the basis of (S)MoE. From this graphical model perspective, we expose the underlying assumptions about the independence between variables, highlighting the limitations that can arise from these assumptions.

2.1 A Probabilistic Graphical Model for (S)MoE

Figure 1: PGMs for (S)MoE ($\mathcal{G}_1$) and Similarity-Aware (S)MoE ($\mathcal{G}_2$). Directed paths are shown by arrows; the dotted arrow indicates concatenation; blue arrows highlight differences. $\mathbf{U} = [\boldsymbol{u}_1, \dots, \boldsymbol{u}_N]^\top$ is the input sequence of (S)MoE and Similarity-Aware (S)MoE. $\boldsymbol{u}_i, \boldsymbol{o}_i$ are the input and output at position $i \in \{1, \dots, N\}$. Variables $e_i$ and $e_i^s$ denote the expert selection for $\boldsymbol{u}_i$ in (S)MoE and Similarity-Aware (S)MoE, respectively; $s_i$ represents the similarity variable for $\mathbf{U}$.

Given a token ๐’– ๐‘– , the expert network ๐’ˆ ๐‘’ . for ๐‘’ โˆˆ [ 1 , โ€ฆ , ๐ธ ] , and the router ๐’“ ( . ) are defined in Section 1.1, the generation for (S)MoE (Bishop & Svensรฉn, 2003) takes the form:

๐‘’ ๐‘– | ๐’– ๐‘–

โˆผ Cat โข ( ๐’“ โข ( ๐’– ๐‘– ) )

(3)

๐’ ๐‘– | ๐’– ๐‘– , ๐‘’ ๐‘–

โˆผ ๐’ฉ โข ( ๐’ˆ ๐‘’ ๐‘– โข ( ๐’– ๐‘– ) , ๐•€ )

The generation process, illustrated as the graph ๐’ข 1 in Fig. 1, can be summarized as such: (S)MoE generative model samples expert selection ๐‘’ ๐‘– for token ๐’– ๐‘– from a categorical distribution with probability ๐’“ โข ( ๐’– ๐‘– ) . Then, conditioned on the expert selection ๐‘’ ๐‘– and token ๐’– ๐‘– , output target ๐’ ๐‘– is generated following a Gaussian distribution with mean ๐’ˆ ๐‘’ ๐‘– โข ( ๐’– ๐‘– ) .
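The two-step generative process in Eqn. 3 can be sketched as ancestral sampling. The snippet below is an illustrative toy (our own linear "experts", not the paper's networks), assuming a softmax-gated router:

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 4, 3
W = rng.standard_normal((E, D))  # router weights: gamma_e(u) = w_e^T u + b_e
b = np.zeros(E)
experts = [rng.standard_normal((D, D)) for _ in range(E)]  # toy linear g_e

def sample_output(u):
    """Ancestral sampling from Eqn. 3: e ~ Cat(r(u)), o ~ N(g_e(u), I)."""
    logits = W @ u + b
    r = np.exp(logits - logits.max())
    r = r / r.sum()                      # expert scores r(u)
    e = rng.choice(E, p=r)               # categorical expert selection
    return experts[e] @ u + rng.standard_normal(D)  # Gaussian around g_e(u)

o = sample_output(rng.standard_normal(D))
print(o.shape)  # (4,)
```

The conditional expectation of this sampler over $e$ recovers exactly the MoE forward pass of Eqn. 1, which is the point made next.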

(S)MoE as Optimal Estimation. When dealing with continuous-valued target variables, regression analysis provides a natural framework for prediction tasks. In such problems, our goal is to predict a target variable given an input variable while minimizing the expected squared error loss between the true target value and our prediction. Under this squared error criterion, the optimal predictor for such regression tasks is the conditional expectation of the target given the input (Hastie, 2009). In the case of MoE, the optimal estimate of the output ๐’ ๐‘– is given as:

๐’ ยฏ ๐‘–

๐”ผ [ ๐’ ๐‘– | ๐’– ๐‘– ]

๐”ผ ๐‘’ ๐‘– [ ๐”ผ ๐’ ๐‘– [ ๐’ ๐‘– | ๐‘’ ๐‘– , ๐’– ๐‘– ] | ๐’– ๐‘– ]

(4)

= โˆ‘ ๐‘’ ๐‘– ๐‘Ÿ ๐‘’ ๐‘– โข ( ๐’– ๐‘– ) โข ๐’ˆ ๐‘’ ๐‘– โข ( ๐’– ๐‘– )

MoE โข ( ๐’– ๐‘– ) .

Here, ๐”ผ โข [ ๐’ ๐‘– | ๐’– ๐‘– , ๐‘’ ๐‘– ]

๐’ˆ ๐‘’ ๐‘– โข ( ๐’– ๐‘– ) . As can be seen in Eqn. 4, the optimal estimation ๐’ ยฏ ๐‘– of ๐’ ๐‘– matches the formula of MoE in Eqn. 1. SMoE can be derived accordingly by using the TopK function on the expert scores, resulting in SMoE โข ( ๐’– ๐‘– )

โˆ‘ ๐‘’ TopK โข ( ๐‘Ÿ ๐‘’ โข ( ๐’– ๐‘– ) ) โข ๐’ˆ ๐‘’ โข ( ๐’– ๐‘– ) . When the expert scores ๐’“ โข ( ๐’– ๐‘– )

Softmax โข ( ๐– โข ๐’– ๐‘– + ๐’ƒ ) , the estimation recover (S)MoE with softmax gating.

Remark 2 (Deriving Other Routers).

Among the important advantages of the PGM formulation of MoE is the flexibility it offers to derive a variety of routers, including popular ones such as cosine routing and random routing, by modifying the formula for the parameter $\boldsymbol{r}(\boldsymbol{u}_i)$ of the Cat distribution in Eqn. 3.

Limitations. From the PGM for MoE, we observe that expert selections for individual tokens are mutually independent, i.e., $(e_i \perp\!\!\!\perp e_j)$ for all $i, j$, when $\boldsymbol{u}_i, \boldsymbol{u}_j$ are i.i.d. sampled. This lack of interaction between tokens' decisions can lead to routing fluctuation.

In particular, near the end of SMoE training, when the learning rate is significantly small and the model parameters stabilize, we expect minimal changes in routing decisions, given that the approximation function is reasonably smooth and token representations do not change considerably. However, empirical evidence shows this is not the case. Fig. 3 in Sec. 5 presents an empirical analysis demonstrating that up to 33% of tokens still switch their selected experts in the final epoch, highlighting a persistent instability in routing. This observation suggests that similar tokens should be routed to the same expert, but the current independent routing does not guarantee this. By letting tokens influence each other's expert selection, we could reduce fluctuation and ensure more stable, consistent routing decisions.

2.2 Similarity-Aware (S)MoE

Leveraging token similarity to guide expert selection can both reduce routing fluctuations and facilitate expert learning by presenting less diverse inputs to each expert. In this section, from the PGM perspective, we introduce the novel Similarity-Aware (S)MoE, where expert decisions directly depend on tokens' similarity.

Probabilistic Graphical Model of Similarity-Aware (S)MoE. For each token $\boldsymbol{u}_i$ in a given sequence $\mathbf{U}$, we introduce a similarity variable $s_i$, whose probability distribution quantifies how likely token $\boldsymbol{u}_i \in \mathbb{R}^D$ is to resemble the other tokens. This similarity directly influences the expert selection variable, which is denoted as $e_i^s$ for token $\boldsymbol{u}_i$. This new notation for expert selection distinguishes it from the expert selection variable $e_i$ used in the generative model of (S)MoE. The generative process for producing the target $\boldsymbol{o}_i \in \mathbb{R}^D$ for each token $\boldsymbol{u}_i$ is detailed in Def. 1 and illustrated as graph $\mathcal{G}_2$ in Fig. 1.

Definition 1 (Similarity-Aware MoE Generative Model (SAM)).

Given a sequence of tokens $\mathbf{U} = [\boldsymbol{u}_1, \dots, \boldsymbol{u}_N]^\top$, the similarity variable $s_i \in \{1, \dots, N\}$ of token $\boldsymbol{u}_i$, and the expert selection variable $e_i^s \in \{1, \dots, E\}$ of SAM, the generative process in SAM generates the target variable $\boldsymbol{o}_i$ as follows:

$$s_i \mid \mathbf{U} \sim \mathrm{Cat}\Big(\mathrm{Softmax}\Big(\frac{\boldsymbol{u}_i^\top \mathbf{W}_s \mathbf{U}^\top}{\tau}\Big)\Big), \qquad e_i^s \mid s_i, \mathbf{U} \sim \mathrm{Cat}\big(\boldsymbol{r}(\boldsymbol{u}_{s_i})\big), \qquad \boldsymbol{o}_i \mid \boldsymbol{u}_i, e_i^s \sim \mathcal{N}\big(\boldsymbol{g}_{e_i^s}(\boldsymbol{u}_i), \mathbb{I}\big), \quad (5)$$

where $\boldsymbol{u}_{s_i} = \mathbf{U}[s_i, :]$, $\mathbf{W}_s \in \mathbb{R}^{D \times D}$ is a learnable parameter matrix, and $\tau > 0$ is a temperature parameter controlling the sharpness of the similarity distribution. In practice, to be computationally efficient, we set $\mathbf{W}_s = \mathbb{I}$.

Remark 3.

We establish a connection between the expert selection variable $e_i^s$ in SAM and $e_i$ in the (S)MoE generative model as follows:

$$\mathbb{P}(e_i^s = e \mid \mathbf{U}) = \sum_{s_i} \mathbb{P}(e_i^s = e \mid s_i, \mathbf{U})\, \mathbb{P}(s_i \mid \mathbf{U}) = \sum_{s_i} r_e(\boldsymbol{u}_{s_i})\, \mathbb{P}(s_i \mid \mathbf{U}). \quad (6)$$

Equation 6 reveals a key distinction in expert routing mechanisms: the (S)MoE approach routes token $i$ based solely on its embedding $\boldsymbol{u}_i$, while our Similarity-Aware routing considers the relationships between all tokens, weighting decisions by token similarities. This process implies that similar tokens are more likely to be routed to the same expert, promoting consistency in the processing of related information and reducing routing fluctuation.

Optimal Estimation of ๐‘œ ๐‘– in SAM. Similar to the derivation of (S)MoE in Section 2.1, we compute the expectation ๐”ผ โข [ ๐’ ๐‘– | ๐” ] as follows:

๐’ ยฏ ๐‘–

๐”ผ โข [ ๐’ ๐‘– | ๐” ]

๐”ผ ๐‘’ ๐‘– ๐‘  โข [ ๐”ผ ๐’ ๐‘– โข [ ๐’ ๐‘– | ๐‘’ ๐‘– ๐‘  , ๐” ] | ๐” ]

(7)

= โˆ‘ ๐‘’

1 ๐ธ โ„™ ( ๐‘’ ๐‘– ๐‘ 

๐‘’ | ๐” ) [ ๐”ผ ๐’ ๐‘– [ ๐’ ๐‘– | ๐‘’ ๐‘– ๐‘ 

๐‘’ , ๐’– ๐‘– ] ]

โˆ‘ ๐‘’

1 ๐ธ โˆ‘ ๐‘  ๐‘–

1 ๐‘ ๐‘Ÿ ๐‘’ โข ( ๐’– ๐‘  ๐‘– ) โข โ„™ โข ( ๐‘  ๐‘– | ๐” ) โข ๐’ˆ ๐‘’ โข ( ๐’– ๐‘– ) .

With this result, we now define Similarity-Aware (S)MoE:

Definition 2 (Similarity-Aware (S)MoE).

Given a token sequence $\mathbf{U} = [\boldsymbol{u}_1, \dots, \boldsymbol{u}_N]^\top$, the expert scores $\boldsymbol{r}(\boldsymbol{u}_i)$ for each token $i$, and the similarity scores $\mathbf{S}[i, j] = \mathrm{Softmax}(\boldsymbol{u}_i^\top \mathbf{W}_s \boldsymbol{u}_j / \tau)$, Similarity-Aware MoE estimates the output $\boldsymbol{o}_i$ at token $i$ as

$$\bar{\boldsymbol{o}}_i = \sum_{e=1}^{E} \sum_{j=1}^{N} \mathbf{S}[i, j]\, r_e(\boldsymbol{u}_j)\, \boldsymbol{g}_e(\boldsymbol{u}_i),$$

and its sparse version, Similarity-Aware SMoE, calculates

$$\bar{\boldsymbol{o}}_i = \sum_{e=1}^{E} \mathrm{TopK}\Big(\sum_{j=1}^{N} \mathbf{S}[i, j]\, r_e(\boldsymbol{u}_j)\Big)\, \boldsymbol{g}_e(\boldsymbol{u}_i). \quad (8)$$

By incorporating token similarities, Similarity-Aware (S)MoE encourages experts to specialize in handling clusters of similar tokens, leading to more efficient learning and better performance. In addition, the approach is less likely to make different expert selections for similar tokens, which reduces routing fluctuations and improves robustness.
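The similarity-weighted routing of Eqn. 8 can be sketched in a few lines. This is a minimal NumPy illustration of the routing scores only (it assumes $\mathbf{W}_s = \mathbb{I}$ as the paper's efficient variant, and the function name is ours):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_aware_smoe_scores(U, W, b, k, tau=1.0):
    """Similarity-weighted TopK expert scores of Eqn. 8 (routing only).
    U: (N, D) tokens; W: (E, D) and b: (E,) router parameters."""
    r = softmax(U @ W.T + b, axis=1)    # per-token expert scores r_e(u_j), (N, E)
    S = softmax(U @ U.T / tau, axis=1)  # similarity S[i, j] with W_s = I
    mixed = S @ r                       # sum_j S[i, j] r_e(u_j)
    # TopK: keep the k largest mixed scores per token, then renormalize.
    out = np.zeros_like(mixed)
    idx = np.argsort(mixed, axis=1)[:, -k:]
    np.put_along_axis(out, idx, np.take_along_axis(mixed, idx, axis=1), axis=1)
    return out / out.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
N, D, E = 6, 4, 8
scores = similarity_aware_smoe_scores(rng.standard_normal((N, D)),
                                      rng.standard_normal((E, D)),
                                      np.zeros(E), k=2)
print(scores.shape)  # (6, 8): each row sums to 1 with exactly 2 active experts
```

The only change from plain SMoE routing is the `S @ r` mixing step, which lets each token's neighbors vote on its expert before the TopK cut.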

3 Beyond One Layer: Routing in MoE Transformer with Graph of Tokens

In this section, we extend the PGM for (S)MoE to a 2-layer block setting, which includes the (S)MoE following a self-attention layer as in recent (S)MoE transformer models, such as Switch Transformer (Fedus et al., 2021) and Swin-MoE (Liu et al., 2021). We then propose Attention-Aware (S)MoE, which leverages the attention matrix in self-attention to guide expert selection in (S)MoE.

3.1 PGM for MoE Transformer

3.1.1 Multi-head Attention and MoE

Multi-head Attention. For a given input sequence $\mathbf{X} = [\boldsymbol{x}_1, \dots, \boldsymbol{x}_N]^\top \in \mathbb{R}^{N \times D}$, self-attention transforms $\mathbf{X}$ into the output sequence $\mathrm{Softmax}\big(\mathbf{X}\mathbf{W}_{Q,h}^\top \mathbf{W}_{K,h} \mathbf{X}^\top / \sqrt{D_{qk}}\big)\, \mathbf{X}\mathbf{W}_{V,h}^\top := \mathbf{A}_h \mathbf{X}\mathbf{W}_{V,h}^\top$, for each head $h = 1, \dots, H$. The matrix $\mathbf{A}_h \in \mathbb{R}^{N \times N}$ is called the attention matrix, and $\mathbf{W}_{Q,h}, \mathbf{W}_{K,h} \in \mathbb{R}^{D_{qk} \times D}$ and $\mathbf{W}_{V,h} \in \mathbb{R}^{D_v \times D}$ are the weight matrices for head $h$. MHA aggregates the outputs of the $H$ heads as

$$\mathbf{U} = \mathrm{MHA}(\mathbf{X}) := \frac{1}{H} \sum_{h=1}^{H} \mathbf{A}_h \mathbf{X}\mathbf{W}_{V,h}^\top \mathbf{W}_{O,h}, \quad (9)$$

where $\mathbf{W}_{O,h} \in \mathbb{R}^{D_v \times D}$ is the projection matrix for the output of each head $h$. Here, we merge $\mathbf{W}_h := \mathbf{W}_{V,h}^\top \mathbf{W}_{O,h} \in \mathbb{R}^{D \times D}$ for convenience, resulting in $\mathrm{MHA}(\mathbf{X}) = \frac{1}{H} \sum_h \mathbf{A}_h \mathbf{X}\mathbf{W}_h$.

(S)MoE Transformer. The (S)MoE Transformer integrates (S)MoE into a transformer architecture by replacing the standard feed-forward network following the self-attention layer with an (S)MoE layer. This block, referred to as (S)MoE-Attention, computes the output token at position $i$ as $\mathrm{MoE}(\mathrm{MHA}(\mathbf{X})[i])$ or $\mathrm{SMoE}(\mathrm{MHA}(\mathbf{X})[i])$, where MHA, MoE, and SMoE are defined in Eqns. 9, 1, and 2, respectively.
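The (S)MoE-Attention block composes the two pieces already defined. The sketch below is a toy single-head version (so $H = 1$, with the merged $\mathbf{W}_h$ and linear "experts" as simplifying assumptions of ours):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mha(X, WQ, WK, W):
    """One attention head of Eqn. 9 (H = 1, with W = W_V^T W_O merged)."""
    A = softmax(X @ WQ.T @ WK @ X.T / np.sqrt(WQ.shape[0]), axis=1)
    return A @ X @ W

def smoe(U, Wr, br, experts, k):
    """SMoE of Eqn. 2 with toy linear experts g_e(u) = W_e u."""
    r = softmax(U @ Wr.T + br, axis=1)                     # expert scores (N, E)
    gate = np.zeros_like(r)
    idx = np.argsort(r, axis=1)[:, -k:]                    # top-k experts per token
    np.put_along_axis(gate, idx, np.take_along_axis(r, idx, axis=1), axis=1)
    gate = gate / gate.sum(axis=1, keepdims=True)          # renormalize survivors
    outs = np.stack([U @ We.T for We in experts], axis=1)  # (N, E, D)
    return np.einsum("ne,ned->nd", gate, outs)

rng = np.random.default_rng(0)
N, D, E, Dqk = 5, 8, 4, 8
X = rng.standard_normal((N, D))
U = mha(X, rng.standard_normal((Dqk, D)), rng.standard_normal((Dqk, D)),
        rng.standard_normal((D, D)))
O = smoe(U, rng.standard_normal((E, D)), np.zeros(E),
         [rng.standard_normal((D, D)) for _ in range(E)], k=2)
print(O.shape)  # (5, 8): one output token per input token
```

The composition `smoe(mha(X, ...), ...)` is exactly the $\mathrm{SMoE}(\mathrm{MHA}(\mathbf{X})[i])$ computation described above.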

3.1.2 Graphical Model for (S)MoE-Attention

Extending the PGM for (S)MoE in Sec. 2.1, we develop the PGM for (S)MoE-Attention, namely the (S)MoE-Attention Generative Model (MAM), illustrated as $\mathcal{G}_3$ in Fig. 2. We define MAM as follows:

Definition 3 ((S)MoE-Attention Generative Model (MAM)).

Given the input sequence $\mathbf{X} = [\boldsymbol{x}_1, \dots, \boldsymbol{x}_N]^\top$, let $h_i \in \{1, \dots, H\}$ and $z_i \in \{1, \dots, N\}$ be the head selection variable and the attention position selection variable, respectively, for each token $\boldsymbol{x}_i$. Let $\boldsymbol{u}_i$ be the input token of MoE and $e_i \in \{1, \dots, E\}$ be the expert selection for token $\boldsymbol{u}_i$. MAM generates the output $\boldsymbol{o}_i$ as follows:

$$\begin{aligned}
h_i &\sim \mathrm{Uniform}(\{1, \dots, H\}) \\
z_i \mid h_i, \mathbf{X} &\sim \mathrm{Cat}\Big(\mathrm{Softmax}\Big(\frac{\boldsymbol{x}_i^\top \mathbf{W}_{Q,h_i}^\top \mathbf{W}_{K,h_i} \mathbf{X}^\top}{\sqrt{D_{qk}}}\Big)\Big) \\
\boldsymbol{u}_i \mid z_i, h_i, \mathbf{X} &\sim \mathcal{N}\big(\mathbf{W}_{h_i} \boldsymbol{x}_{z_i}, \sigma^2 \mathbb{I}\big) \\
e_i \mid \boldsymbol{u}_i &\sim \mathrm{Cat}\big(\boldsymbol{r}(\boldsymbol{u}_i)\big) \\
\boldsymbol{o}_i \mid \boldsymbol{u}_i, e_i &\sim \mathcal{N}\big(\boldsymbol{g}_{e_i}(\boldsymbol{u}_i), \mathbb{I}\big),
\end{aligned} \quad (10)$$

where $\mathbf{W}_{Q,h}, \mathbf{W}_{K,h}, \mathbf{W}_h$ are learnable parameters and $\boldsymbol{x}_{z_i} = \mathbf{X}[z_i, :]$.

The MAM generation process consists of two main steps: (1) The model generates the input tokens $\boldsymbol{u}_i$, $i = 1, \dots, N$, of (S)MoE from the input sequence $\mathbf{X}$. In particular, the head selection $h_i$ is uniformly sampled. The attention position $z_i$ is then chosen given the sampled $h_i$ and the input sequence $\mathbf{X}$. Subsequently, $\boldsymbol{u}_i$ is generated via a Gaussian centered at $\mathbf{W}_{h_i}\boldsymbol{x}_{z_i}$. (2) Following the (S)MoE generation described in Sec. 2.1, for each token $\boldsymbol{u}_i$, MAM samples an expert $e_i$ from a categorical distribution with parameter $\boldsymbol{r}(\boldsymbol{u}_i)$ and generates the output $\boldsymbol{o}_i$ from a Gaussian with mean $\boldsymbol{g}_{e_i}(\boldsymbol{u}_i)$.

Figure 2: PGMs for (S)MoE-Attention ($\mathcal{G}_3$) and Attention-Aware (S)MoE ($\mathcal{G}_4$), defined in Def. 3 and Def. 4, respectively. Directed paths are shown by arrows; the dotted arrow indicates concatenation; blue arrows highlight differences.

(S)MoE-Attention as a Point Estimation. We show that MoE-Attention can be derived as a point estimate of the output $\boldsymbol{o}_i$ from MAM. In particular, the output $\boldsymbol{o}_i$ in MAM can be estimated as

$$\bar{\boldsymbol{o}}_i = \mathbb{E}[\boldsymbol{o}_i \mid \mathbf{X}] = \mathbb{E}_{\boldsymbol{u}_i}\big[\mathbb{E}_{\boldsymbol{o}_i}[\boldsymbol{o}_i \mid \boldsymbol{u}_i] \mid \mathbf{X}\big] = \mathbb{E}_{\boldsymbol{u}_i}\big[\mathrm{MoE}(\boldsymbol{u}_i) \mid \mathbf{X}\big],$$

where $\mathbb{E}_{\boldsymbol{o}_i}[\boldsymbol{o}_i \mid \boldsymbol{u}_i] = \mathrm{MoE}(\boldsymbol{u}_i)$ as in Eqn. 4, and $\mathbb{E}[\boldsymbol{o}_i \mid \mathbf{X}] = \mathbb{E}_{\boldsymbol{u}_i}[\mathbb{E}_{\boldsymbol{o}_i}[\boldsymbol{o}_i \mid \boldsymbol{u}_i] \mid \mathbf{X}]$ is obtained by the tower rule. Notice that $\mathbb{E}_{\boldsymbol{u}_i}[\mathrm{MoE}(\boldsymbol{u}_i) \mid \mathbf{X}]$ does not have a closed-form expression. MoE-Attention approximates this conditional expectation using a point estimate of $\boldsymbol{u}_i$ given $\mathbf{X}$, giving:

$$\bar{\boldsymbol{o}}_i = \mathbb{E}\big[\boldsymbol{o}_i \mid \boldsymbol{u}_i = \mathbb{E}[\boldsymbol{u}_i \mid \mathbf{X}]\big] = \mathrm{MoE}\big(\mathrm{MHA}(\mathbf{X})[i]\big), \quad (11)$$

where $\mathrm{MHA}(\mathbf{X})$ in Eqn. 11 is obtained as

$$\mathbb{E}[\boldsymbol{u}_i \mid \mathbf{X}] = \mathbb{E}_{h_i}\big[\mathbb{E}_{z_i}[\mathbb{E}_{\boldsymbol{u}_i}[\boldsymbol{u}_i \mid z_i, h_i, \mathbf{X}] \mid h_i, \mathbf{X}]\big] = \frac{1}{H} \sum_{h=1}^{H} \mathbf{W}_h \sum_{j=1}^{N} \mathbf{A}_h[i, j]\, \boldsymbol{x}_j = \mathrm{MHA}(\mathbf{X})[i],$$

which is the multi-head attention in Eqn. 9. The detailed derivation of $\mathbb{E}[\boldsymbol{u}_i \mid \mathbf{X}]$ is given in Appendix C.

As can be seen in Eqn. 11, MoE-Attention utilizes MHA to compute the token $\boldsymbol{u}_i$ from the input sequence $\mathbf{X}$; $\boldsymbol{u}_i$ is then sent to an MoE layer to estimate the output token $\boldsymbol{o}_i$. SMoE-Attention can be derived by replacing the MoE layer in Eqn. 11 with the SMoE module in Eqn. 2, obtaining $\bar{\boldsymbol{o}}_i = \mathrm{SMoE}(\mathrm{MHA}(\mathbf{X})[i])$.

Limitation: The graphical model $\mathcal{G}_3$ reveals that expert selections for individual tokens exhibit conditional independence given the input tokens, expressed as $(e_i \perp\!\!\!\perp e_j \mid \mathbf{X})$ for all $i, j$. This conditional independence and lack of direct interaction in token routing decisions can lead to instability across the network. Our empirical analysis in Fig. 3 (Sec. 5) demonstrates significant routing volatility, with 10-33% of tokens changing their assigned experts across layers in the final training epoch. This volatility arises because conditional independence fails to ensure that similar tokens are routed consistently to the same expert. By introducing inter-token influences in expert selection, we can reduce this fluctuation and achieve more stable, consistent routing patterns.

3.2 Attention-Aware (S)MoE

The routing decision for each token can also be informed by the dependencies captured in the attention layers. In particular, we establish a link from the variable $z_i$, which represents the position of the token that token $\boldsymbol{x}_i$ attends to in the attention layer, to the expert selection variable. This approach, which we call Attention-Aware Routing (A2 Routing), allows us to utilize the similarity information directly from the attention layer to inform the expert selection, instead of computing the similarity matrix based on $\mathbf{U}$ as in Similarity-Aware (S)MoE.

Probabilistic Generative Model for Attention-Routing. We define the PGM for Attention-Aware (S)MoE (shown as $\mathcal{G}_4$ in Fig. 2), which employs A2 Routing for expert selection:

Definition 4 (Attention-Aware MoE Generative Model (A2MM)).

Given the input sequence $\mathbf{X} = [\boldsymbol{x}_1, \dots, \boldsymbol{x}_N]^\top$, let $h_i \in \{1, \dots, H\}$ and $z_i \in \{1, \dots, N\}$ be the head selection and attention position selection variables of the token at position $i$, respectively. Let $\mathbf{U} = [\boldsymbol{u}_1, \dots, \boldsymbol{u}_N]^\top$ be the input tokens to the MoE layer and $e_i^a \in \{1, \dots, E\}$ be the expert selection for token $i$. A2MM generates the output $\boldsymbol{o}_i$ as follows:

$$\begin{aligned}
h_i &\sim \mathrm{Uniform}(\{1, \dots, H\}) \\
z_i \mid h_i, \mathbf{X} &\sim \mathrm{Cat}\Big(\mathrm{Softmax}\Big(\frac{\boldsymbol{x}_i^\top \mathbf{W}_{Q,h_i}^\top \mathbf{W}_{K,h_i} \mathbf{X}^\top}{\sqrt{D_{qk}}}\Big)\Big) \\
\boldsymbol{u}_i \mid z_i, h_i, \mathbf{X} &\sim \mathcal{N}\big(\mathbf{W}_{h_i} \boldsymbol{x}_{z_i}, \sigma^2 \mathbb{I}\big) \\
e_i^a \mid z_i, \mathbf{U} &\sim \mathrm{Cat}\big(\boldsymbol{r}(\boldsymbol{u}_{z_i})\big) \\
\boldsymbol{o}_i \mid \boldsymbol{u}_i, e_i^a &\sim \mathcal{N}\big(\boldsymbol{g}_{e_i^a}(\boldsymbol{u}_i), \mathbb{I}\big),
\end{aligned}$$

where $\boldsymbol{x}_{z_i} = \mathbf{X}[z_i, :]$ and $\boldsymbol{u}_{z_i} = \mathbf{U}[z_i, :]$. A2MM generates outputs in two steps: (1) Create the sequence of tokens $\mathbf{U} = [\boldsymbol{u}_1, \dots, \boldsymbol{u}_N]^\top$ via multi-head attention, where each $\boldsymbol{u}_i$ is generated from the input sequence $\mathbf{X}$ following the process in Def. 3. (2) For each token $\boldsymbol{u}_i$, sample expert $e_i^a$ from a categorical distribution with parameters $\boldsymbol{r}(\boldsymbol{u}_{z_i})$, where $z_i$ indicates the attended position. The final output $\boldsymbol{o}_i$ is drawn from $\mathcal{N}(\boldsymbol{g}_{e_i^a}(\boldsymbol{u}_i), \mathbb{I})$.

Estimation of the Target Values $\boldsymbol{o}_i$. By the tower rule, we calculate the conditional expectation $\mathbb{E}[\boldsymbol{o}_i \mid \mathbf{X}]$ under A2MM as follows:

$$\bar{\boldsymbol{o}}_i = \mathbb{E}[\boldsymbol{o}_i \mid \mathbf{X}] = \mathbb{E}_{\mathbf{U}}\big[\mathbb{E}_{e_i^a}[\mathbb{E}_{\boldsymbol{o}_i}[\boldsymbol{o}_i \mid e_i^a, \boldsymbol{u}_i, \mathbf{X}] \mid \mathbf{U}, \mathbf{X}] \mid \mathbf{X}\big]. \quad (12)$$

Lemma 1 provides the key result for computing the expectation in Eqn. 12. The detailed derivation of Lemma 1 is found in Appendix B.1.

Lemma 1.

The distribution of the expert selection $e_i^a$ conditioned on $\mathbf{U}, \mathbf{X}$ is given by

$$\mathbb{P}(e_i^a = e \mid \mathbf{U}, \mathbf{X}) = \sum_{h=1}^{H} \sum_{j=1}^{N} \mathbb{H}^p[i, h]\, \mathbb{A}_h^p[i, j]\, r_e(\boldsymbol{u}_j),$$

where the posteriors are

$$\mathbb{A}_h^p[i, j] := \mathbb{P}(z_i = j \mid h_i = h, \boldsymbol{u}_i, \mathbf{X}) = \frac{\mathbb{A}_h[i, j]\, \mathbf{L}_h[i, j]}{\sum_{j'} \mathbb{A}_h[i, j']\, \mathbf{L}_h[i, j']},$$

$$\mathbb{H}^p[i, h] := \mathbb{P}(h_i = h \mid \boldsymbol{u}_i, \mathbf{X}) = \frac{\mathbb{H}[i, h] \sum_{j} \mathbb{A}_h[i, j]\, \mathbf{L}_h[i, j]}{\sum_{h'} \mathbb{H}[i, h'] \sum_{j'} \mathbb{A}_{h'}[i, j']\, \mathbf{L}_{h'}[i, j']},$$

with the priors $\mathbb{A}_h[i, j] := \mathbb{P}(z_i = j \mid h_i = h, \mathbf{X})$ and $\mathbb{H}[i, h] := \mathbb{P}(h_i = h)$, and the likelihood $\mathbf{L}_h[i, j] := \mathcal{N}(\boldsymbol{u}_i \mid \mathbf{W}_h \boldsymbol{x}_j, \sigma^2 \mathbb{I})$. This results in

$$\mathbb{E}[\boldsymbol{o}_i \mid \mathbf{X}] = \mathbb{E}_{\mathbf{U}}\Big[\sum_{e} \mathbb{P}(e_i^a = e \mid \mathbf{U}, \mathbf{X})\, \boldsymbol{g}_e(\boldsymbol{u}_i) \,\Big|\, \mathbf{X}\Big]. \quad (13)$$

Lemma 1 unveils a sophisticated decision-making process in Attention-Aware MoE, where the final routing decision for a token is influenced by the decisions of other tokens, as well as by the relevance of each attention head. This formulation can be interpreted as a two-stage process: first, each token's original decision is adjusted by the decisions of other tokens, weighted by $\mathbb{A}_h^p[i, j]$, which represents the "responsibility" of token $\boldsymbol{x}_j$ in explaining token $\boldsymbol{u}_i$'s representation within attention head $h$. Then, these weighted decisions from each head are further combined, weighted by $\mathbb{H}^p[i, h]$, which represents the responsibility of head $h$ in explaining token $\boldsymbol{u}_i$. This hierarchical weighting scheme allows the model to integrate context from multiple attention patterns.

Enhancing Efficiency of Estimating ๐‘œ ๐‘– . Computing the RHS of Eqn 13 in Lem. 1 is costly due to the summation over the head โ„Ž . To reduce this computational overhead, we propose an approximation for ๐‡ ๐‘ that avoids full posterior inference across all heads:

โ„ ยฏ ๐‘ โข [ ๐‘– , โ„Ž ]

{ 1 , if  โข โ„Ž

โ„Ž โˆ— := arg โข min โ„Ž โก ๐”ผ โข [ โ„‹ โข ( ๐€ โ„Ž โข [ ๐‘– , : ] ) ]

0 , otherwise .

(14)

where ๐€ โ„Ž โข [ ๐‘– , : ] is the ๐‘– ๐‘ก โข โ„Ž row of ๐€ โ„Ž and โ„‹ โข ( ๐€ โ„Ž โข [ ๐‘– , : ] )

โˆ’ โˆ‘ ๐‘–

๐‘— ๐‘ ๐€ โ„Ž โข [ ๐‘– , ๐‘— ] โข log โก ๐€ โ„Ž โข [ ๐‘– , ๐‘— ] is the entropy of attention score for token ๐’™ ๐‘– at head โ„Ž and the expectation ๐”ผ โข [ โ„‹ โข ( ๐€ โ„Ž โข [ ๐‘– , : ] ) ] is taken over tokens ๐’™ ๐‘– . This means that only the attention head with the lowest average entropy should contribute to the posteriors. Finally, since the expectation over ๐” in Equation 13 does not have a closed-form, we approximate it by using the point estimate ๐”

๐”ผ โข [ ๐” | ๐— ]

MHA โข ( ๐— ) as in derivation . By applying the result in Lem. 1 and the head selection in Eqn. 14, the target values ๐’ ๐‘– is then estimated as:

๐’ ยฏ ๐‘–

โˆ‘ ๐‘’

1 ๐ธ โˆ‘ ๐‘—

1 ๐‘ ๐”ธ โ„Ž โˆ— ๐‘ โข [ ๐‘– , ๐‘— ] โข ๐‘Ÿ ๐‘’ โข ( ๐’– ๐‘— ) โข ๐’ˆ ๐‘’ โข ( ๐’– ๐‘– ) .

(15)
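Eqns. 14 and 15 together amount to a few tensor operations. The following NumPy sketch uses random stand-ins for the attention matrices, posterior attention, expert scores, and expert outputs (all names and sizes are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, H, E = 5, 3, 4  # toy sizes

A = rng.dirichlet(np.ones(N), size=(H, N))    # A[h] = attention matrix of head h
A_p = rng.dirichlet(np.ones(N), size=(H, N))  # posterior attention per head
r = rng.dirichlet(np.ones(E), size=N)         # expert scores r(u_j)
g = rng.standard_normal((E, N, 8))            # g[e, i] = expert e's output for u_i (toy)

# Eqn. 14: pick the head whose attention rows have the lowest average entropy
row_entropy = -(A * np.log(A)).sum(axis=2)    # entropy of A[h][i, :] per head/token
h_star = row_entropy.mean(axis=1).argmin()    # average over tokens, then argmin

# Eqn. 15: o_i ~= sum_e sum_j A_p[h*, i, j] * r_e(u_j) * g_e(u_i)
mixed_scores = A_p[h_star] @ r                # [i, e] = sum_j A_p[h*, i, j] r_e(u_j)
o = np.einsum('ie,eid->id', mixed_scores, g)

assert 0 <= h_star < H and o.shape == (N, 8)
```

Only one head's posterior attention is used downstream, which is what removes the per-head summation cost.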

We are now ready to define Attention-Aware (S)MoE.

Definition 5 (Attention-Aware (S)MoE).

Given a sequence of input tokens $\mathbf{X}$, the output of the multihead attention layer $\mathbf{U} = \mathrm{MHA}(\mathbf{X}) = [\boldsymbol{u}_1,\dots,\boldsymbol{u}_N]^T$, the expert scores $\boldsymbol{r}(\boldsymbol{u}_i)$, $i \in \{1,\dots,N\}$, computed as in Sec. 1.1, and the posterior score $\mathbf{A}_{h^*}^p$ computed as in Lemma 1, with $h^*$ being the index of the head with the lowest average attention entropy as in Eqn. 14. The Attention-Aware SMoE approximates the output $\boldsymbol{o}_i$ as

$$\bar{\boldsymbol{o}}_i = \sum_{e=1}^{E}\mathrm{TopK}\Big(\sum_{j=1}^{N}\mathbf{A}_{h^*}^p[i,j]\,r_e(\boldsymbol{u}_j)\Big)\,\boldsymbol{g}_e(\boldsymbol{u}_i). \tag{16}$$

4 Entropy Analysis of Similarity and Attention-Aware Routing

When the model is uncertain in its routing decision, a small perturbation in either weight space or input space can change its discrete decision. As a result, a high entropy of a token's expert scores $\boldsymbol{r}(\boldsymbol{u}_i)$, as defined in Section 1.1, suggests increased routing fluctuation. In this section, we demonstrate that our Similarity/Attention-Aware MoE reduces routing fluctuations by lowering the entropy of routing scores.
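A tiny example of this sensitivity: a near-uniform (high-entropy) score vector flips its top-1 expert under a small logit perturbation, while a confident (low-entropy) one does not (all numbers below are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

# A near-uniform (high-entropy) router logit vector vs a confident (low-entropy) one
uncertain = np.array([1.00, 0.98, 0.95])
confident = np.array([3.00, 0.50, 0.20])
perturb = np.array([-0.05, 0.05, 0.0])  # a small input/weight perturbation (toy)

# The uncertain token's top-1 expert flips; the confident token's does not
assert softmax(uncertain).argmax() != softmax(uncertain + perturb).argmax()
assert softmax(confident).argmax() == softmax(confident + perturb).argmax()
```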

Consider any $i \in \{1,\dots,N\}$ and define $J_i := \{\, j \mid \mathcal{H}(e_j \mid \boldsymbol{u}_j) \le \mathcal{H}(e_i \mid \boldsymbol{u}_i) \,\}$. Here, we slightly abuse the notation of entropy $\mathcal{H}$, using it interchangeably for both a random variable and its associated distribution. We apply the Similarity/Attention-Aware MoE to token $\boldsymbol{u}_i$ with the set $J_i$. The score function $s(i,j)$, capturing the correspondence between tokens $\boldsymbol{u}_i$ and $\boldsymbol{u}_j$ for $j \in J_i$, is defined either as $s(i,j) = \mathrm{Softmax}(\boldsymbol{u}_i^T\mathbf{W}_s\boldsymbol{u}_j/\tau)$ (in Def. 1) or as $s(i,j) = \mathbf{A}_{h^*}^p[i,j]$ as in Eqn. 15. We show that the expert selections of Similarity/Attention-Aware (S)MoEs have lower entropy than those of MoE:

Proposition 1.

Let $\boldsymbol{p}_i = [p_1,\dots,p_K]^T$ denote the distribution of the expert selection variables, i.e., $e_i^s$ for SAM and $e_i^a$ for AAMM. The expert score in the baseline MoE for token $\boldsymbol{u}_i$ is abbreviated as $\boldsymbol{r}_i := \boldsymbol{r}(\boldsymbol{u}_i)$ as in Section 1.1. Similarity/Attention-Aware MoE transforms these expert scores $\boldsymbol{r}_i$ into $\boldsymbol{p}_i = \sum_{j \in J_i} s(i,j)\,\boldsymbol{r}_j$, where $s(i,j)$ denotes the influence weight between tokens $\boldsymbol{u}_i$ and $\boldsymbol{u}_j$. The upper bound on the expert selection's entropy in Similarity/Attention-Aware MoE is then given by:

$$\mathcal{H}(\boldsymbol{p}_i) \le \sum_{j=1}^{|J_i|} s(i,j)\,\mathcal{H}(\boldsymbol{r}_j) + \mathcal{H}(\boldsymbol{s}_i), \tag{17}$$

where $\boldsymbol{s}_i = [s(i,1),\dots,s(i,|J_i|)]^T$. As the temperature parameter $\tau \to 0$ (defined in Def. 1) or $\sigma \to 0$ (defined in Def. 4), $\mathcal{H}(\boldsymbol{p}_i) \le \mathcal{H}(\boldsymbol{r}_i)$.

Our upper bound $\mathcal{H}(\boldsymbol{p}_i) \le \mathcal{H}(\boldsymbol{r}_i)$ in Prop. 1 shows that the entropy of the expert scores decreases when applying our methods. Thus, the model improves its decision certainty, reducing fluctuations in token routing.
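The bound in Eqn. 17 is the standard mixture-entropy inequality and can be checked numerically. The sketch below draws random neighbor scores $\boldsymbol{r}_j$ and influence weights $s(i,j)$ (toy sizes, not from the paper):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a probability vector, skipping zeros."""
    p = np.asarray(p, dtype=float)
    return float(-(p[p > 0] * np.log(p[p > 0])).sum())

rng = np.random.default_rng(2)
K, J = 4, 3                               # number of experts, |J_i| neighbors (toy)
r = rng.dirichlet(np.ones(K), size=J)     # r_j: neighbor expert-score distributions
s = rng.dirichlet(np.ones(J))             # s_i: influence weights over J_i

p = s @ r                                 # p_i = sum_j s(i,j) r_j  (a mixture)
bound = sum(s[j] * entropy(r[j]) for j in range(J)) + entropy(s)

assert entropy(p) <= bound + 1e-12        # Eqn. 17 upper bound holds
```

The inequality holds for any choice of weights because the mixture's entropy never exceeds the joint entropy of the mixing variable and the components.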

5 Experimental Results

To highlight the strengths of Similarity/Attention-Aware SMoE, we conduct experiments on ImageNet classification, Wikitext-103 language modeling, and fine-tuning tasks. Our results demonstrate that the proposed models: (1) consistently outperform baseline SMoE across all tasks; (2) achieve significant robustness improvements when evaluated on both adversarially and naturally perturbed versions of these datasets; (3) exhibit adaptivity; and (4) surpass previous works while serving as effective plug-and-play enhancements for various MoE architectures. Additionally, empirical analysis reveals that our methods: (5) reduce expert decision entropy and routing fluctuations compared to the baseline, and (6) enhance load balancing for expert assignment. Additional experiments and ablation studies, which further demonstrate the advantages of our methods, are presented in Appendix 5.

Language modeling on Wikitext-103. Table 1 demonstrates that our Similarity/Attention-Aware SMoEs outperform the SMoE and GLAM baselines on Wikitext-103 language modeling, using TopK experts (K=2). The models are evaluated on the validation and test sets using perplexity scores (lower is better) on clean data and in adversarial scenarios, i.e., under word-swap attacks. Our proposed models consistently outperform both baselines. In particular, GLAM variants with Similarity/Attention-Aware SMoEs also show significant improvements over the standard versions, demonstrating the approach's effectiveness for both performance and robustness.

Table 1: PPL evaluation (lower is better) on the clean and attacked Wikitext-103 sets for baseline SMoE, X-MoE, SMoE-dropout, and their Similarity/Attention-Aware variants, $K = 2$.

| Model | Clean Valid PPL | Clean Test PPL | Attacked Valid PPL | Attacked Test PPL |
| --- | --- | --- | --- | --- |
| SMoE | 33.29 | 34.84 | 41.75 | 43.59 |
| Similarity-Aware SMoE | 30.75 | 32.03 | 38.33 | 39.92 |
| Attention-Aware SMoE | 31.31 | 32.23 | 39.68 | 40.91 |
| GLAM | 37.55 | 39.10 | 48.01 | 49.75 |
| Similarity-Aware GLAM | 33.72 | 34.92 | 42.19 | 43.72 |
| Attention-Aware GLAM | 35.17 | 36.71 | 44.17 | 45.85 |
| X-MoE | 33.05 | 34.49 | 41.68 | 42.96 |
| Similarity-Aware X-MoE | 31.83 | 33.06 | 39.92 | 41.28 |
| Attention-Aware X-MoE | 32.06 | 33.24 | 40.35 | 41.73 |
| SMoE-dropout | 33.08 | 34.67 | 41.11 | 43.09 |
| Similarity-Aware SMoE-dropout | 32.47 | 33.69 | 40.60 | 41.99 |
| Attention-Aware SMoE-dropout | 32.21 | 33.91 | 40.56 | 42.17 |

Compare and improve previous works. We compare our Similarity/Attention-Aware SMoE with X-MoE (Chi et al., 2022), which addresses routing fluctuation, and SMoE-dropout (Chen et al., 2023), which improves upon standard SMoE. As shown in Table 1, our models achieve lower PPL scores on both Clean and Attacked Wikitext-103 datasets. Additionally, integrating our methods with these models, creating Similarity/Attention-Aware X-MoE and Similarity/Attention-Aware SMoE-dropout, further improved their performance, demonstrating our approachโ€™s effectiveness as a plug-and-play enhancement for various MoE architectures.

ImageNet Classification. As shown in Table 2, our Similarity-Aware and Attention-Aware variants outperform the baseline V-MoE (Riquelme et al., 2021) on both clean data and robustness tests, including ImageNet-C (corruptions) (Hendrycks & Dietterich, 2019), ImageNet-A (adversarial) (Hendrycks et al., 2021b), ImageNet-R (out-of-distribution) (Hendrycks et al., 2021a), and ImageNet-O (OOD detection) (Hendrycks et al., 2021b), further demonstrating the advantages of our methods.

Table 2: Test set accuracy on different ImageNet variants for the baseline and our Similarity/Attention-Aware variants. All models are trained only on the original ImageNet dataset.

| Model | Params | IN-1K Top-1 ↑ | IN-R Top-1 ↑ | IN-A Top-1 ↑ | IN-C Top-1 ↑ |
| --- | --- | --- | --- | --- | --- |
| V-MoE (baseline) | 297M | 72.71 | 35.42 | 5.27 | 48.72 |
| Similarity-Aware V-MoE | 297M | 73.21 | 36.58 | 5.60 | 50.45 |
| Attention-Aware V-MoE | 297M | 73.33 | 36.66 | 6.78 | 50.85 |

Finetuning on downstream tasks.

Table 3: Top-1 test accuracy on the Stanford Sentiment Treebank 5 and 2 (SST5, SST2) and Banking-77 (B77) finetuning tasks.

| Model | SST5 | SST2 | Banking-77 |
| --- | --- | --- | --- |
| SMoE | 36.54 | 70.23 | 83.96 |
| Similarity-Aware SMoE | 37.91 | 71.72 | 85.19 |
| Attention-Aware SMoE | 38.89 | 72.41 | 85.84 |

We evaluate the adaptivity of pretrained SMoE variants through fine-tuning on SST5 (Socher et al., 2013), SST2 (Socher et al., 2013), and Banking-77 (Casanueva et al., 2020) datasets. Table 3 shows that Attention-Aware SMoE achieves the highest accuracy across all datasets, followed by Similarity-Aware SMoE, while baseline SMoE performs worst. These results demonstrate the advantageous adaptivity of our proposed architectures compared to baseline SMoEs.

Figure 3:Comparison of routing fluctuation and entropy ratio across layers for Baseline SMoE, Attention-Aware SMoE, and Similarity-Aware SMoE

Similarity/Attention-Aware MoE reduces routing fluctuation. Fig. 3 (Left) compares routing fluctuation across the baseline SMoE and our proposed Similarity/Attention-Aware SMoEs on Wikitext-103. The fluctuation rate measures the percentage of tokens changing expert assignments between epochs 59 and 60. While the baseline SMoE shows the highest fluctuation, especially in the early layers, Similarity/Attention-Aware SMoEs achieve lower rates throughout. Similarity-Aware SMoE maintains consistently low fluctuation across all layers, demonstrating better routing stability, while Attention-Aware SMoE also significantly improves upon the baseline.
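The fluctuation rate reported here can be computed with a few lines of NumPy; the expert ids below are toy values, not the paper's measurements:

```python
import numpy as np

def fluctuation_rate(experts_prev, experts_curr):
    """Fraction of tokens whose top-1 expert assignment changed between two epochs."""
    experts_prev = np.asarray(experts_prev)
    experts_curr = np.asarray(experts_curr)
    return float((experts_prev != experts_curr).mean())

# Top-1 expert ids for 8 tokens at epoch 59 and epoch 60 (toy values)
epoch59 = [0, 1, 2, 2, 3, 0, 1, 3]
epoch60 = [0, 1, 2, 3, 3, 0, 2, 3]
print(fluctuation_rate(epoch59, epoch60))  # 2 of 8 tokens changed -> 0.25
```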

Similarity/Attention-Aware MoE reduces decision entropy. Fig. 3 (Right) shows the average entropy rate of token routing decisions across layers for our models compared to the baseline SMoE (on Wikitext-103), calculated at epoch 59, just before the final epoch where routing fluctuation occurs. Our Similarity/Attention-Aware SMoE methods exhibit lower average entropy (rate < 1) than the baseline, consistent with the reduced routing fluctuation seen in the left graph. This indicates more stable and consistent routing decisions. The Similarity-Aware SMoE, in particular, maintains lower entropy across all layers, reflecting its better routing stability. These results highlight the advantages of our methods in enhancing routing consistency.

Similarity/Attention-Aware SMoE improves load balancing.

Figure 4:Comparison of expert routing distribution for baseline SMoE, Attention-Aware SMoE, and Similarity-Aware SMoE

Fig. 4 displays token distribution across experts in the V-MoE architecture on the ImageNet validation set. The baseline model shows experts 3 and 4 handling significantly more tokens, while our Similarity/Attention-Aware SMoE models achieve a more uniform distribution. This implicit load balancing allows busier experts to specialize and ensures less utilized experts handle a broader range of inputs, improving overall efficiency and balance.
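One way to quantify the uniformity shown in Fig. 4 is the normalized entropy of the expert-load distribution; the sketch below uses hypothetical assignment counts, not the paper's data:

```python
import numpy as np

def load_entropy_ratio(assignments, num_experts):
    """Normalized entropy of the expert-load distribution (1.0 = perfectly uniform)."""
    counts = np.bincount(assignments, minlength=num_experts).astype(float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum() / np.log(num_experts))

# Toy assignments: a skewed router (experts 3 and 4 overloaded) vs a balanced one
skewed = np.array([3] * 6 + [4] * 6 + [0, 1, 2, 5])
balanced = np.repeat(np.arange(6), 4)  # hypothetical sizes, not the paper's setup

assert load_entropy_ratio(balanced, 6) > load_entropy_ratio(skewed, 6)
assert np.isclose(load_entropy_ratio(balanced, 6), 1.0)
```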

6 Related Work

Routing method. Numerous approaches have been proposed for assigning tokens to experts, including deterministic hashing (Roller et al., 2021), linear programming (Lewis et al., 2021), and cosine similarity-based methods (Chi et al., 2022). Other techniques leverage reinforcement learning (Bengio et al., 2015), greedy top-k expert selection (Shazeer et al., 2017), and optimal transport (Liu et al., 2023). However, these approaches make expert assignment decisions for each token independently, without considering token-to-token interactions. In contrast, our proposed method enables collaborative expert selection by allowing tokens to share information with each other during the assignment process.

Routing fluctuation. Routing fluctuation has been discussed in the existing literature. (Nguyen et al., 2024) mentions that various SMoE routers (Csordás et al., 2023; Do et al., 2023) suffer from routing fluctuation without proposing solutions. In addition, (Su et al., 2024) suggests that routing fluctuation arises due to the variation of the learnable parameters in the router. StableMoE (Dai et al., 2022) reduces routing fluctuations by first distilling a balanced routing strategy into a lightweight router and then locking token-to-expert assignments during the second training phase for stable routing. SMoE-dropout (Chen et al., 2023) provides another solution to improve the stability of the model: it initially randomizes and freezes the router during training to provide stable routing strategies. (Zoph et al., 2022) examine several approaches to improve stability, including removing multiplicative interactions, injecting model noise, and constraining activations and gradients; after this examination, the authors propose the router z-loss, which enhances training stability with no quality degradation. (Chi et al., 2022) proposes to estimate the routing scores between tokens and experts on a low-dimensional hypersphere to achieve more consistent routing compared to the conventional approach. Feedforward layers are replaced by hash layers in (Roller et al., 2021) to keep routing choices consistent. (Lewis et al., 2021) formulates routing as a linear assignment problem that globally maximizes token-expert similarities to increase stability. Our work is orthogonal to these approaches: to reduce routing fluctuation, we encourage tokens to influence each other's routing decisions based on their similarity.

7 Concluding Remarks

In this work, under the Probabilistic Graphical Model perspective, we show that in (S)MoE, expert selections are independent, resulting in routing fluctuations. To address this, we introduce Similarity/Attention-Aware (S)MoE, which establishes connections between tokens through a graph structure, effectively breaking this independence. Our theoretical analysis proves that both approaches reduce the entropy in the decision-making process for indecisive tokens, leading to more stable and confident expert assignments. These theoretical advantages are validated through empirical evaluations across diverse tasks and domains, demonstrating the effectiveness of our proposed methods. A limitation of our paper is that we have not considered a generative model that captures the token generation process in our PGM. Studying transformer-MoE from a generative model perspective is an exciting research direction, which we leave for future work.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References Artetxe et al. (2021) โ†‘ Artetxe, M., Bhosale, S., Goyal, N., Mihaylov, T., Ott, M., Shleifer, S., Lin, X. V., Du, J., Iyer, S., Pasunuru, R., et al.Efficient large scale language modeling with mixtures of experts.arXiv preprint arXiv:2112.10684, 2021. Bai et al. (2024) โ†‘ Bai, Y., Geng, X., Mangalam, K., Bar, A., Yuille, A. L., Darrell, T., Malik, J., and Efros, A. A.Sequential modeling enables scalable learning for large vision models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22861โ€“22872, 2024. Bao et al. (2021) โ†‘ Bao, H., Dong, L., Piao, S., and Wei, F.Beit: Bert pre-training of image transformers.arXiv preprint arXiv:2106.08254, 2021. Bao et al. (2022) โ†‘ Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O. K., Aggarwal, K., Som, S., Piao, S., and Wei, F.Vlmo: Unified vision-language pre-training with mixture-of-modality-experts.Advances in Neural Information Processing Systems, 35:32897โ€“32912, 2022. Bengio et al. (2015) โ†‘ Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D.Conditional computation in neural networks for faster models.arXiv preprint arXiv:1511.06297, 2015. Bishop & Svensรฉn (2003) โ†‘ Bishop, C. and Svensรฉn, M.Bayesian hierarchical mixtures of experts.In Proceedings Nineteenth Conference on Uncertainty in Artificial Intelligence, pp.  57โ€“64. Morgan Kaufmann, January 2003.URL https://www.microsoft.com/en-us/research/publication/bayesian-hierarchical-mixtures-of-experts/. Brown et al. (2020) โ†‘ Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.Language models are few-shot learners.arXiv preprint arXiv:2005.14165, 2020. Casanueva et al. 
(2020) โ†‘ Casanueva, I., Temcinas, T., Gerz, D., Henderson, M., and Vulic, I.Efficient intent detection with dual sentence encoders.In Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020, mar 2020.URL https://arxiv.org/abs/2003.04807.Data available at https://github.com/PolyAI-LDN/task-specific-datasets. Chen et al. (2023) โ†‘ Chen, T., Zhang, Z., JAISWAL, A. K., Liu, S., and Wang, Z.Sparse moe as the new dropout: Scaling dense and self-slimmable transformers.In The Eleventh International Conference on Learning Representations, 2023.URL https://openreview.net/forum?id=w1hwFUb_81. Chi et al. (2022) โ†‘ Chi, Z., Dong, L., Huang, S., Dai, D., Ma, S., Patra, B., Singhal, S., Bajaj, P., Song, X., Mao, X.-L., et al.On the representation collapse of sparse mixture of experts.Advances in Neural Information Processing Systems, 35:34600โ€“34613, 2022. Csordรกs et al. (2023) โ†‘ Csordรกs, R., Irie, K., and Schmidhuber, J.Approximating two-layer feedforward networks for efficient transformers.In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  674โ€“692, Singapore, December 2023. Association for Computational Linguistics.doi: 10.18653/v1/2023.findings-emnlp.49.URL https://aclanthology.org/2023.findings-emnlp.49. Dai et al. (2022) โ†‘ Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., and Wei, F.Stablemoe: Stable routing strategy for mixture of experts.In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  7085โ€“7095, 2022. Devlin et al. (2018) โ†‘ Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018. Do et al. (2023) โ†‘ Do, T. 
G., Khiem, L., Pham, Q., Nguyen, T., Doan, T.-N., Nguyen, B., Liu, C., Ramasamy, S., Li, X., and Hoi, S.HyperRouter: Towards efficient training and inference of sparse mixture of experts.In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  5754โ€“5765, Singapore, December 2023. Association for Computational Linguistics.doi: 10.18653/v1/2023.emnlp-main.351.URL https://aclanthology.org/2023.emnlp-main.351. Du et al. (2022) โ†‘ Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al.Glam: Efficient scaling of language models with mixture-of-experts.In International Conference on Machine Learning, pp.  5547โ€“5569. PMLR, 2022. Fedus et al. (2021) โ†‘ Fedus, W., Zoph, B., and Shazeer, N.Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.arXiv preprint arXiv:2101.03961, 2021. Fedus et al. (2022) โ†‘ Fedus, W., Zoph, B., and Shazeer, N.Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1โ€“39, 2022. Gaur et al. (2021) โ†‘ Gaur, N., Farris, B., Haghani, P., Leal, I., Moreno, P. J., Prasad, M., Ramabhadran, B., and Zhu, Y.Mixture of informed experts for multilingual speech recognition.In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  6234โ€“6238. IEEE, 2021. Gormley & Murphy (2011) โ†‘ Gormley, I. C. and Murphy, T. B.Mixture of experts modelling with social science applications.Mixtures: Estimation and applications, pp.  101โ€“121, 2011. Hastie (2009) โ†‘ Hastie, T.The elements of statistical learning: data mining, inference, and prediction, 2009. Hendrycks & Dietterich (2019) โ†‘ Hendrycks, D. and Dietterich, T.Benchmarking neural network robustness to common corruptions and perturbations.arXiv preprint arXiv:1903.12261, 2019. Hendrycks et al. 
(2021a) โ†‘ Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al.The many faces of robustness: A critical analysis of out-of-distribution generalization.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  8340โ€“8349, 2021a. Hendrycks et al. (2021b) โ†‘ Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D.Natural adversarial examples.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  15262โ€“15271, 2021b. Jacobs et al. (1991) โ†‘ Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E.Adaptive mixtures of local experts.Neural computation, 3(1):79โ€“87, 1991. Jordan & Jacobs (1994) โ†‘ Jordan, M. I. and Jacobs, R. A.Hierarchical mixtures of experts and the em algorithm.Neural computation, 6(2):181โ€“214, 1994. Kaplan et al. (2020) โ†‘ Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D.Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020. Lepikhin et al. (2020) โ†‘ Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z.Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020. Lewis et al. (2021) โ†‘ Lewis, M., Bhosale, S., Dettmers, T., Goyal, N., and Zettlemoyer, L.Base layers: Simplifying training of large, sparse models.In International Conference on Machine Learning, pp.  6265โ€“6274. PMLR, 2021. Li et al. (2023) โ†‘ Li, J., Li, D., Savarese, S., and Hoi, S.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.In International conference on machine learning, pp.  19730โ€“19742. PMLR, 2023. Liu et al. (2023) โ†‘ Liu, T., Puigcerver, J., and Blondel, M.Sparsity-constrained optimal transport.In The Eleventh International Conference on Learning Representations, 2023. 
Liu et al. (2021) โ†‘ Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B.Swin transformer: Hierarchical vision transformer using shifted windows.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  10012โ€“10022, 2021. Merity et al. (2016) โ†‘ Merity, S., Xiong, C., Bradbury, J., and Socher, R.Pointer sentinel mixture models, 2016.URL https://arxiv.org/abs/1609.07843. Morris et al. (2020) โ†‘ Morris, J. X., Lifland, E., Yoo, J. Y., Grigsby, J., Jin, D., and Qi, Y.Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp.arXiv preprint arXiv:2005.05909, 2020. Neil & Dirk (2020) โ†‘ Neil, H. and Dirk, W.Transformers for image recognition at scale.Online: https://ai. googleblog. com/2020/12/transformers-for-image-recognitionat. html, 2020. Nguyen et al. (2024) โ†‘ Nguyen, N. V., Doan, T. T., Tran, L., Nguyen, V., and Pham, Q.Libmoe: A library for comprehensive benchmarking mixture of experts in large language models, 2024.URL https://arxiv.org/abs/2411.00918. Pham et al. (2024) โ†‘ Pham, Q., Do, G., Nguyen, H., Nguyen, T., Liu, C., Sartipi, M., Nguyen, B. T., Ramasamy, S., Li, X., Hoi, S., and Ho, N.Competesmoe โ€“ effective training of sparse mixture of experts via competition, 2024. Press et al. (2020) โ†‘ Press, O., Smith, N. A., and Levy, O.Improving transformer models by reordering their sublayers.In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  2996โ€“3005, Online, July 2020. Association for Computational Linguistics.doi: 10.18653/v1/2020.acl-main.270.URL https://www.aclweb.org/anthology/2020.acl-main.270. Radford et al. (2019) โ†‘ Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019. Raffel et al. (2020) โ†‘ Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. 
J.Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1โ€“67, 2020. Riquelme et al. (2021) โ†‘ Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., and Houlsby, N.Scaling vision with sparse mixture of experts.Advances in Neural Information Processing Systems, 34:8583โ€“8595, 2021. Roller et al. (2021) โ†‘ Roller, S., Sukhbaatar, S., Weston, J., et al.Hash layers for large sparse models.Advances in Neural Information Processing Systems, 34:17555โ€“17566, 2021. Shazeer et al. (2017) โ†‘ Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J.Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017. Socher et al. (2013) โ†‘ Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C.Recursive deep models for semantic compositionality over a sentiment treebank.In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp.  1631โ€“1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.URL https://www.aclweb.org/anthology/D13-1170. Su et al. (2024) โ†‘ Su, Z., Lin, Z., Bai, X., Wu, X., Xiong, Y., Lian, H., Ma, G., Chen, H., Ding, G., Zhou, W., et al.Maskmoe: Boosting token-level learning via routing mask in mixture-of-experts.arXiv preprint arXiv:2407.09816, 2024. Subramanian et al. (2024) โ†‘ Subramanian, S., Harrington, P., Keutzer, K., Bhimji, W., Morozov, D., Mahoney, M. W., and Gholami, A.Towards foundation models for scientific machine learning: Characterizing scaling and transfer behavior.Advances in Neural Information Processing Systems, 36, 2024. Touvron et al. (2023) โ†‘ Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Roziรจre, B., Goyal, N., Hambro, E., Azhar, F., et al.Open and efficient foundation language models.Preprint at arXiv. 
arXiv:2302.13971, 2023. Zoph et al. (2022) ↑ Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., and Fedus, W. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

Supplement to “Improving Routing in Sparse Mixture of Experts with Graph of Tokens”


Appendix A Notation

To facilitate the understanding of the theoretical content of our work, all used notations are explained in Table 4.

Table 4: Table of notations

**Variables**

| Symbol | Definition |
| --- | --- |
| $\boldsymbol{x}_i$ | input token of MHA at position $i$ |
| $\mathbf{X}$ | $[\boldsymbol{x}_1,\dots,\boldsymbol{x}_N]^\top$, sequence of input tokens of MHA |
| $\boldsymbol{u}_i$ | input token of (S)MoE at position $i$ |
| $\mathbf{U}$ | $[\boldsymbol{u}_1,\dots,\boldsymbol{u}_N]^\top$, sequence of input tokens of (S)MoE |
| $\boldsymbol{o}_i$ | target variable at position $i$ |
| $h_i$ | head selection at position $i$ |
| $z_i$ | attention position selection of token $\boldsymbol{x}_i$ |
| $s_i$ | similarity variable of token $\boldsymbol{u}_i$ |
| $e_i$ | expert selection of token $\boldsymbol{u}_i$ for the MoE generative model defined in Section 2.1 |
| $e_i^s$ | expert selection of token $\boldsymbol{u}_i$ for SAM defined in Definition 1 |
| $e_i^a$ | expert selection of token $\boldsymbol{u}_i$ for AAMM defined in Definition 4 |

**Other Notations**

| Symbol | Definition |
| --- | --- |
| $N$ | number of tokens in a sequence |
| $H$ | number of heads in MHA |
| $E$ | number of experts in MoE |
| $K$ | number of selected experts in SMoE and its variants |
| $\gamma_e(\boldsymbol{u}_i)$ | routing function giving the affinity score between expert $e$ and $\boldsymbol{u}_i$ |
| $\boldsymbol{r}(\boldsymbol{u}_i)$ | $[r_1(\boldsymbol{u}_i),\dots,r_E(\boldsymbol{u}_i)]^\top = \mathrm{softmax}([\gamma_1(\boldsymbol{u}_i),\dots,\gamma_E(\boldsymbol{u}_i)]^\top)$, expert scores of token $\boldsymbol{u}_i$ in (S)MoE |
| $\boldsymbol{r}_i$ | shorthand for $\boldsymbol{r}(\boldsymbol{u}_i)$ |
| $\boldsymbol{p}_i$ | expert scores of token $\boldsymbol{u}_i$ in Similarity/Attention-Aware (S)MoE |
| $\boldsymbol{g}_e$ | expert network of expert $e$ |
| $\mathbf{S}$ | similarity matrix of sequence $\mathbf{U}$, whose elements are defined as in Definition 1 |
| $\mathbf{A}_h$ | attention matrix of head $h$ |
| $\mathbf{A}_h^p$ | posterior attention matrix, whose element $\mathbf{A}_h^p[i,j] := \mathbb{P}(z_i = j \mid h_i = h, \boldsymbol{u}_i, \mathbf{X})$ is defined as in Lemma 1 |
| $\mathbf{H}_p$ | posterior head selection matrix, whose element $\mathbf{H}_p[i,h] := \mathbb{P}(h_i = h \mid \boldsymbol{u}_i, \mathbf{X})$ represents the "responsibility" of head $h$ in explaining token $\boldsymbol{u}_i$, defined as in Lemma 1 |
| $\mathbf{L}_h$ | likelihood matrix corresponding to head $h$, whose element $\mathbf{L}_h[i,j] := \mathcal{N}(\boldsymbol{u}_i \mid \mathbf{W}_h\boldsymbol{x}_j, \sigma^2\mathbb{I})$ is defined as in Lemma 1 |

**Parameters**

| Symbol | Definition |
| --- | --- |
| $\mathbf{W}, \boldsymbol{b}$ | weight matrix and bias vector used to compute expert scores, respectively |
| $\mathbf{W}_{Q,h}, \mathbf{W}_{K,h}, \mathbf{W}_{V,h}, \mathbf{W}_{O,h}$ | query, key, value, and output weight matrices used in MHA for head $h$, respectively |
| $\mathbf{W}_h$ | $\mathbf{W}_{V,h}^\top\mathbf{W}_{O,h}$, merging the value and output matrices for notational convenience |
| $\mathbf{W}_s$ | parameter matrix used to compute the similarity matrix $\mathbf{S}$ of sequence $\mathbf{U}$ |
| $\tau$ | temperature parameter used to compute the similarity between tokens of $\mathbf{U}$ |
| $\sigma$ | variance parameter of the distribution of $\boldsymbol{u}_i$ given $e_i, h_i, \mathbf{X}$, defined as in Definition 3 |

Appendix B Technical Proofs

B.1 Proof of Lemma 1

Restate Lemma 1. The expert decision $e_i^a$ given $\mathbf{U}, \mathbf{X}$ is given by

$$\mathbb{P}(e_i^a = e \mid \mathbf{U}, \mathbf{X}) = \sum_{h=1}^{H}\sum_{j=1}^{N}\mathbb{H}_p[i,h]\,\mathbb{A}_h^p[i,j]\,r_e(\boldsymbol{u}_j),$$

where the posteriors

๐”ธ โ„Ž ๐‘ โข [ ๐‘– , ๐‘— ]
:= โ„™ ( ๐‘ง ๐‘–

๐‘— | โ„Ž ๐‘–

โ„Ž , ๐’– ๐‘– , ๐— )

๐”ธ โ„Ž โข [ ๐‘– , ๐‘— ] โข ๐‹ โ„Ž โข [ ๐‘– , ๐‘— ] โˆ‘ ๐‘— โ€ฒ ๐”ธ โ„Ž โข [ ๐‘– , ๐‘— โ€ฒ ] โข ๐‹ โ„Ž โข [ ๐‘– , ๐‘— โ€ฒ ] ,

โ„ ๐‘ โข [ ๐‘– , โ„Ž ]
:= โ„™ โข ( โ„Ž ๐‘–

โ„Ž | ๐’– ๐‘– , ๐— )

โ„ โข [ ๐‘– , โ„Ž ] โข โˆ‘ ๐‘— ๐”ธ โ„Ž โข [ ๐‘– , ๐‘— ] โข ๐‹ โ„Ž โข [ ๐‘– , ๐‘— ] โˆ‘ โ„Ž โ€ฒ โข โ„ โข [ ๐‘– , โ„Ž โ€ฒ ] โข โˆ‘ ๐‘— โ€ฒ ๐”ธ โ„Ž โ€ฒ โข [ ๐‘– , ๐‘— โ€ฒ ] โข ๐‹ โ„Ž โ€ฒ โข [ ๐‘– , ๐‘— โ€ฒ ] ,

with the prior ๐”ธ โ„Ž [ ๐‘– , ๐‘— ] := โ„™ ( ๐‘ง ๐‘–

๐‘— | โ„Ž ๐‘–

โ„Ž , ๐— ) and โ„ โข [ ๐‘– , โ„Ž ] := โ„™ โข ( โ„Ž ๐‘–

โ„Ž ) and the likelihood ๐‹ โ„Ž โข [ ๐‘– , ๐‘— ] := ๐’ฉ โข ( ๐’– ๐‘– | ๐– โ„Ž โข ๐’™ ๐‘— , ๐œŽ 2 โข ๐•€ ) . This results in

๐”ผ โข [ ๐’ ๐‘– | ๐— ]

๐”ผ ๐” โข [ โˆ‘ ๐‘’

1 ๐ธ โ„™ โข ( ๐‘’ ๐‘– ๐‘Ž

๐‘’ | ๐” , ๐— ) โข ๐  ๐‘’ โข ( ๐’– ๐‘– ) | ๐— ]

Recall the generative process on the graph $\mathcal{G}_4$. Given the input sequence $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]^\top$, let $h_i \in \{1, \dots, H\}$ and $z_i \in \{1, \dots, N\}$ be the selected head and attention position of each token $i$, respectively. Let $\mathbf{U} = [\mathbf{u}_1, \dots, \mathbf{u}_N]^\top$ be the sequence of latent variables and $e_i^a \in \{1, \dots, E\}$ the expert assignment for token $i$. AAMM generates the output $\mathbf{o}_i$ as follows:

$$\begin{aligned}
h_i &\sim \mathrm{Uniform}(\{1, \dots, H\}) \\
z_i \mid h_i, \mathbf{X} &\sim \mathrm{Cat}\!\left(\mathrm{Softmax}\!\left(\frac{\mathbf{x}_i^\top \mathbf{W}_{Q,h_i}^\top \mathbf{W}_{K,h_i} \mathbf{X}^\top}{\sqrt{D}}\right)\right) \\
\mathbf{u}_i \mid z_i, h_i, \mathbf{X} &\sim \mathcal{N}(\mathbf{W}_{h_i} \mathbf{x}_{z_i}, \sigma^2 \mathbb{I}) \\
e_i^a \mid z_i, \mathbf{U} &\sim \mathrm{Cat}(\mathbf{r}(\mathbf{u}_{z_i})) \\
\mathbf{o}_i \mid \mathbf{u}_i, e_i^a &\sim \mathcal{N}(\mathbf{g}_{e_i^a}(\mathbf{u}_i), \mathbb{I})
\end{aligned}$$

Proof: Following the above generative process, we start with the probability of the expert decision $e_i^a$ of AAMM for token $i$ given the sequences $\mathbf{U}$ and $\mathbf{X}$:

$$\begin{aligned}
\mathbb{P}(e_i^a = e \mid \mathbf{U}, \mathbf{X}) &= \sum_{j=1}^{N} \mathbb{P}(e_i^a = e \mid z_i = j, \mathbf{U}, \mathbf{X})\, \mathbb{P}(z_i = j \mid \mathbf{U}, \mathbf{X}) \\
&= \sum_{j=1}^{N} \mathbb{P}(e_i^a = e \mid z_i = j, \mathbf{U}) \sum_{h=1}^{H} \mathbb{P}(z_i = j, h_i = h \mid \mathbf{U}, \mathbf{X}) \\
&= \sum_{j=1}^{N} \mathbb{P}(e_i^a = e \mid z_i = j, \mathbf{U}) \sum_{h=1}^{H} \mathbb{P}(z_i = j \mid h_i = h, \mathbf{U}, \mathbf{X})\, \mathbb{P}(h_i = h \mid \mathbf{U}, \mathbf{X}) \\
&= \sum_{j=1}^{N} \mathbb{P}(e_i^a = e \mid z_i = j, \mathbf{U}) \sum_{h=1}^{H} \mathbb{P}(z_i = j \mid h_i = h, \mathbf{u}_i, \mathbf{X})\, \mathbb{P}(h_i = h \mid \mathbf{u}_i, \mathbf{X})
\end{aligned} \tag{18}$$

The posterior distribution of the attention variable given the observation of the MoE input $\mathbf{u}_i$ is

$$\begin{aligned}
\mathbb{P}(z_i = j \mid h_i = h, \mathbf{u}_i, \mathbf{X}) &= \frac{\mathbb{P}(z_i = j \mid h_i = h, \mathbf{X})\, \mathbb{P}(\mathbf{u}_i \mid z_i = j, h_i = h, \mathbf{x}_j)}{\sum_{j'} \mathbb{P}(z_i = j' \mid h_i = h, \mathbf{X})\, \mathbb{P}(\mathbf{u}_i \mid z_i = j', h_i = h, \mathbf{x}_{j'})} \\
&= \frac{\mathbb{A}_h[i,j]\, \mathcal{N}(\mathbf{u}_i \mid \mathbf{W}_{O,h}\mathbf{W}_{V,h}\mathbf{x}_j, \sigma^2\mathbb{I})}{\sum_{j'} \mathbb{A}_h[i,j']\, \mathcal{N}(\mathbf{u}_i \mid \mathbf{W}_{O,h}\mathbf{W}_{V,h}\mathbf{x}_{j'}, \sigma^2\mathbb{I})} \\
&= \frac{\mathbb{A}_h[i,j]\,\mathbb{L}_h[i,j]}{\sum_{j'} \mathbb{A}_h[i,j']\,\mathbb{L}_h[i,j']} =: \mathbb{A}_h^p[i,j],
\end{aligned} \tag{19}$$

where $\mathbf{A}_h$ is the attention matrix of head $h$ and the likelihood $\mathbb{L}_h[i,j] := \mathcal{N}(\mathbf{u}_i \mid \mathbf{W}_h \mathbf{x}_j, \sigma^2\mathbb{I})$ with $\mathbf{W}_h := \mathbf{W}_{O,h}\mathbf{W}_{V,h}$.

Then, the posterior probability of the head index given the input $\mathbf{u}_i$ of the (S)MoE and the input $\mathbf{X}$ of the attention, which represents the responsibility of head $h$ in explaining token $i$, is

$$\begin{aligned}
\mathbb{P}(h_i = h \mid \mathbf{u}_i, \mathbf{X}) &= \frac{\mathbb{P}(h_i = h) \sum_{j=1}^{N} \mathbb{P}(z_i = j \mid h_i = h, \mathbf{X})\, \mathbb{P}(\mathbf{u}_i \mid z_i = j, h_i = h, \mathbf{x}_j)}{\sum_{h'=1}^{H} \mathbb{P}(h_i = h') \sum_{j'=1}^{N} \mathbb{P}(z_i = j' \mid h_i = h', \mathbf{X})\, \mathbb{P}(\mathbf{u}_i \mid z_i = j', h_i = h', \mathbf{x}_{j'})} \\
&= \frac{\mathbb{H}[i,h] \sum_{j} \mathbb{A}_h[i,j]\, \mathcal{N}(\mathbf{u}_i \mid \mathbf{W}_{O,h}\mathbf{W}_{V,h}\mathbf{x}_j, \sigma^2\mathbb{I})}{\sum_{h'} \mathbb{H}[i,h'] \sum_{j'} \mathbb{A}_{h'}[i,j']\, \mathcal{N}(\mathbf{u}_i \mid \mathbf{W}_{O,h'}\mathbf{W}_{V,h'}\mathbf{x}_{j'}, \sigma^2\mathbb{I})} \\
&= \frac{\mathbb{H}[i,h] \sum_{j} \mathbb{A}_h[i,j]\,\mathbb{L}_h[i,j]}{\sum_{h'} \mathbb{H}[i,h'] \sum_{j'} \mathbb{A}_{h'}[i,j']\,\mathbb{L}_{h'}[i,j']} =: \mathbb{H}^p[i,h].
\end{aligned} \tag{20}$$

Substituting these results into Eqn. 18, we obtain:

$$\mathbb{P}(e_i^a = e \mid \mathbf{U}, \mathbf{X}) = \sum_{h=1}^{H} \sum_{j=1}^{N} \mathbb{H}^p[i,h]\, \mathbb{A}_h^p[i,j]\, r_e(\mathbf{u}_j).$$

To compute $\mathbb{E}[\mathbf{o}_i \mid \mathbf{X}]$, we condition $\mathbf{o}_i$ on $e_i^a$, $\mathbf{U}$, and $\mathbf{X}$, which results in:

$$\begin{aligned}
\mathbb{E}[\mathbf{o}_i \mid \mathbf{X}] &= \mathbb{E}_{\mathbf{U}}\!\left[\mathbb{E}_{e_i^a}\!\left[\mathbb{E}_{\mathbf{o}_i}[\mathbf{o}_i \mid e_i^a, \mathbf{U}] \,\middle|\, \mathbf{U}, \mathbf{X}\right] \,\middle|\, \mathbf{X}\right] \\
&= \mathbb{E}_{\mathbf{U}}\!\left[\mathbb{E}_{e_i^a}\!\left[\mathbb{E}_{\mathbf{o}_i}[\mathbf{o}_i \mid e_i^a, \mathbf{u}_i] \,\middle|\, \mathbf{U}, \mathbf{X}\right] \,\middle|\, \mathbf{X}\right] \\
&= \mathbb{E}_{\mathbf{U}}\!\left[\sum_{e} \mathbb{P}(e_i^a = e \mid \mathbf{U}, \mathbf{X})\, \mathbf{g}_e(\mathbf{u}_i) \,\middle|\, \mathbf{X}\right].
\end{aligned}$$

Thus, we have derived the complete dependency of ๐‘’ ๐‘– ๐‘Ž on ๐” and ๐— through the expert selection process of Attention-Aware SMoE , proving Lemma 1.
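As a numerical sanity check on this result, the posteriors $\mathbb{A}_h^p$ and $\mathbb{H}^p$ and the resulting expert distribution can be assembled in log-space with NumPy. The priors, likelihood parameters, and router below are random stand-ins, not trained quantities:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, H, E, sigma = 5, 6, 3, 4, 0.5

X = rng.normal(size=(N, D))
U = rng.normal(size=(N, D))
W = rng.normal(size=(H, D, D)) / np.sqrt(D)   # W_h (stand-ins)
W_r = rng.normal(size=(E, D))                 # router weights (stand-ins)

def softmax(v, axis=-1):
    v = v - v.max(axis=axis, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(rng.normal(size=(H, N, N)))       # prior A_h[i, j] (attention rows)
Hprior = np.full((N, H), 1.0 / H)             # prior H[i, h] = P(h_i = h) = 1/H

# log L_h[i, j] = log N(u_i | W_h x_j, sigma^2 I), dropping the constant
# (it cancels in every normalization below)
means = np.einsum('hde,je->hjd', W, X)                            # (H, N, D): W_h x_j
sq = ((U[None, :, None, :] - means[:, None, :, :]) ** 2).sum(-1)  # (H, N, N)
logL = -0.5 * sq / sigma**2

# posterior A^p_h[i, j] proportional to A_h[i, j] L_h[i, j]
logAp = np.log(A) + logL
Ap = softmax(logAp, axis=-1)

# posterior H^p[i, h] proportional to H[i, h] * sum_j A_h[i, j] L_h[i, j]
m = logAp.max(axis=-1, keepdims=True)
logsum = m[..., 0] + np.log(np.exp(logAp - m).sum(-1))  # (H, N), log-sum-exp over j
Hp = softmax(np.log(Hprior) + logsum.T, axis=-1)        # (N, H)

r = softmax(U @ W_r.T)                                  # r_e(u_j), shape (N, E)
p = np.einsum('ih,hij,je->ie', Hp, Ap, r)               # P(e_i^a = e | U, X)
assert np.allclose(p.sum(-1), 1.0)
```

The final assertion checks that $\mathbb{P}(e_i^a = e \mid \mathbf{U}, \mathbf{X})$ is a valid distribution over experts, which follows because $\mathbb{H}^p$, each row of $\mathbb{A}_h^p$, and each $\mathbf{r}(\mathbf{u}_j)$ normalize to one.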

B.2 Proof of Proposition 1

Restate Proposition 1

Proposition 1. Let $\mathbf{p}_i = [p_1, \dots, p_K]^\top$ denote the distribution of the expert-selection variables, i.e., $e_i^s$ for SAM and $e_i^a$ for A2MM. The expert score in the baseline MoE for token $\mathbf{u}_i$ is abbreviated as $\mathbf{r}_i := \mathbf{r}(\mathbf{u}_i)$ as in Section 1.1. Similarity/Attention-Aware MoE transforms these expert scores $\mathbf{r}_i$ into $\mathbf{p}_i = \sum_{j \in J_i} s(i,j)\, \mathbf{r}_j$, where $s(i,j)$ denotes the influence weight between tokens $\mathbf{u}_i$ and $\mathbf{u}_j$. The upper bound on the entropy of expert selection in Similarity/Attention-Aware MoE is then given by:

$$\mathcal{H}(\mathbf{p}_i) \le \sum_{j=1}^{|J_i|} s(i,j)\, \mathcal{H}(\mathbf{r}_j) + \mathcal{H}(\mathbf{s}_i), \tag{21}$$

where $\mathbf{s}_i = [s(i,1), \dots, s(i,|J_i|)]^\top$. As the temperature parameter $\tau \to 0$ (defined in Def. 1) or $\sigma \to 0$ (defined in Def. 4), $\mathcal{H}(\mathbf{p}_i) \le \mathcal{H}(\mathbf{r}_i)$.

Proof: We first prove the case of Similarity-Aware routing; the proof for Attention-Aware routing is analogous. From $\mathbf{p}_i = \sum_{j=1}^{|J_i|} s(i,j)\, \mathbf{r}_j$, $e_i^s$ is a mixture of $|J_i|$ discrete distributions of the $e_j$ with probability masses $\mathbf{r}_j$. Denote by $t_i$ the latent random variable that admits the weighting coefficients $s(i, \cdot)$ as its probability distribution at token $i$. We obtain the decomposition of the joint entropy as follows:

$$\mathcal{H}(e_i^s, t_i) = \mathcal{H}(t_i) + \mathcal{H}(e_i^s \mid t_i) = \mathcal{H}(t_i) + \sum_{j=1}^{|J_i|} s(i,j)\, \mathcal{H}(e_j).$$

Since entropy is non-negative,

$$\mathcal{H}(e_i^s, t_i) = \mathcal{H}(e_i^s) + \mathcal{H}(t_i \mid e_i^s) \ge \mathcal{H}(e_i^s).$$

Hence,

$$\mathcal{H}(e_i^s) \le \mathcal{H}(t_i) + \sum_{j=1}^{|J_i|} s(i,j)\, \mathcal{H}(e_j) \le \mathcal{H}(t_i) + \mathcal{H}(e_i),$$

because $\mathcal{H}(e_i) = \mathcal{H}(e_j)$ for any $j \in J_i$. Again, here, we slightly abuse the entropy notation $\mathcal{H}$, using it interchangeably for both a random variable and its associated distribution.

The final piece of this Proposition's proof is to verify the above limits. For $\tau \to 0$, the temperature-softmax distribution gradually morphs into a one-hot distribution, and thus the entropy $\mathcal{H}(t_i)$ goes to 0. Therefore, when $\tau \to 0$, we have $\mathcal{H}(e_i^s) \le \mathcal{H}(e_i)$, or $\mathcal{H}(\mathbf{p}_i) \le \mathcal{H}(\mathbf{r}_i)$, proving the inequality in Prop. 1 for Similarity-Aware MoE.

Similarly, for Attention-Aware routing, we show that as $\sigma \to 0$, $\mathcal{H}(e_i^a) \le \mathcal{H}(e_i)$, or $\mathcal{H}(\mathbf{p}_i) \le \mathcal{H}(\mathbf{r}_i)$. As $\sigma \to 0$, $\mathbb{P}(\mathbf{u}_i \mid z_i = j, h_i = h, \mathbf{X}) = \mathcal{N}(\mathbf{u}_i \mid \mathbf{W}_h \mathbf{x}_j, \sigma^2\mathbb{I})$ converges to the Dirac delta function centered at its mean. The closest mean then yields a density that greatly dominates the others, in turn making $\mathbb{A}_h^p$ a one-hot distribution and driving the entropy $\mathcal{H}(t_i)$ to zero.

With that, we have fully proved Prop 1.
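The mixture-entropy bound in Eqn. (21) is easy to verify numerically; a minimal check with random influence weights and expert scores (illustrative stand-ins for $s(i,\cdot)$ and $\mathbf{r}_j$):

```python
import numpy as np

rng = np.random.default_rng(2)
K, J = 4, 6  # number of experts, neighborhood size |J_i|

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def entropy(q):
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

s = softmax(rng.normal(size=J))                                # influence weights s(i, .)
R = np.stack([softmax(rng.normal(size=K)) for _ in range(J)])  # expert scores r_j, j in J_i

p = s @ R                            # mixture p_i = sum_j s(i,j) r_j
lhs = entropy(p)                     # H(p_i)
rhs = sum(s[j] * entropy(R[j]) for j in range(J)) + entropy(s)
assert lhs <= rhs + 1e-12            # H(p_i) <= sum_j s(i,j) H(r_j) + H(s_i)
```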

B.3 The equivalence of $\mathrm{Renormalize}(\mathrm{TopK}_0(r_e))$ and $\mathrm{Softmax}(\mathrm{TopK}_\infty(\gamma_e))$, and its proof

Renormalization. Given a token $\mathbf{u}_i$, let $\mathcal{K}$ be the set of indices with the $K$ highest expert scores $r_k(\mathbf{u}_i)$; it is equivalently the set of the $K$ highest affinity scores $\gamma_k(\mathbf{u}_i)$. For any $e \in \mathcal{K}$, we have

$$\begin{aligned}
\mathrm{Renormalize}(\mathrm{TopK}_0(r_e)) &= \frac{r_e(\mathbf{u}_i)}{\sum_{k \in \mathcal{K}} r_k(\mathbf{u}_i)} = \frac{\exp \gamma_e(\mathbf{u}_i) \big/ \sum_{e'=1}^{E} \exp \gamma_{e'}(\mathbf{u}_i)}{\sum_{k \in \mathcal{K}} \exp \gamma_k(\mathbf{u}_i) \big/ \sum_{e'=1}^{E} \exp \gamma_{e'}(\mathbf{u}_i)} \\
&= \frac{\exp \gamma_e(\mathbf{u}_i)}{\sum_{k \in \mathcal{K}} \exp \gamma_k(\mathbf{u}_i)} = \mathrm{Softmax}(\mathrm{TopK}_\infty(\gamma_e(\mathbf{u}_i))).
\end{aligned}$$

This shows the equivalence of the two operators for $e \in \mathcal{K}$. For any $e \notin \mathcal{K}$, $\mathrm{TopK}_0(r_e(\mathbf{u}_i)) = \exp(\mathrm{TopK}_\infty(\gamma_e(\mathbf{u}_i))) = 0$ also yields the equivalence of the two operators, finishing the proof.
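This equivalence can be checked directly: masking the post-softmax scores with zeros and renormalizing gives the same routing weights as masking the logits with $-\infty$ before the softmax. A small NumPy check (arbitrary random affinity scores as stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
E, K = 8, 2
gamma = rng.normal(size=E)             # affinity scores gamma_e(u_i)
topk = np.argsort(gamma)[-K:]          # index set of the K largest scores
mask = np.isin(np.arange(E), topk)

# Route 1: Renormalize(TopK_0(r_e)) -- softmax first, zero out, renormalize
r = np.exp(gamma - gamma.max())
r /= r.sum()
r_masked = np.where(mask, r, 0.0)
route1 = r_masked / r_masked.sum()

# Route 2: Softmax(TopK_inf(gamma_e)) -- mask logits with -inf, then softmax
g_masked = np.where(mask, gamma, -np.inf)
e = np.exp(g_masked - gamma[topk].max())   # exp(-inf) = 0 kills masked experts
route2 = e / e.sum()

assert np.allclose(route1, route2)
```

In practice the second form is often the more convenient one to implement, since the $-\infty$ mask composes with a single numerically stabilized softmax.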

Appendix C Derivation

Conditional Expectation of $\mathbf{u}_i$ given $\mathbf{X}$ following MAM

By using the tower rule, we obtain

$$\begin{aligned}
\mathbb{E}[\mathbf{u}_i \mid \mathbf{X}] &= \mathbb{E}_{h_i}\!\left[\mathbb{E}_{z_i}\!\left[\mathbb{E}_{\mathbf{u}_i}[\mathbf{u}_i \mid z_i, h_i, \mathbf{X}] \mid h_i, \mathbf{X}\right]\right] \\
&= \sum_{h=1}^{H} \sum_{j=1}^{N} \mathbb{E}[\mathbf{u}_i \mid z_i = j, h_i = h, \mathbf{x}_j]\, \mathbb{P}(z_i = j \mid h_i = h, \mathbf{X})\, \mathbb{P}(h_i = h) \\
&= \frac{1}{H} \sum_{h=1}^{H} \mathbf{W}_h \sum_{j=1}^{N} \mathrm{Softmax}\!\left(\frac{\mathbf{x}_i^\top \mathbf{W}_{Q,h}^\top \mathbf{W}_{K,h} \mathbf{x}_j}{\sqrt{D_{qk}}}\right) \mathbf{x}_j.
\end{aligned}$$
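The final expression is just a multi-head attention forward pass averaged over heads; a compact NumPy rendering under assumed shapes (random weights as stand-ins, and $D_{qk} = D$ for simplicity):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, H = 4, 8, 2
X = rng.normal(size=(N, D))
W_Q = rng.normal(size=(H, D, D)) / np.sqrt(D)  # stand-in query weights
W_K = rng.normal(size=(H, D, D)) / np.sqrt(D)  # stand-in key weights
W = rng.normal(size=(H, D, D)) / np.sqrt(D)    # stand-in W_h

def softmax_rows(M):
    e = np.exp(M - M.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# E[u_i | X] = (1/H) sum_h W_h sum_j A_h[i, j] x_j
E_u = np.zeros((N, D))
for h in range(H):
    Q, Km = X @ W_Q[h].T, X @ W_K[h].T
    A = softmax_rows(Q @ Km.T / np.sqrt(D))    # attention matrix of head h
    E_u += (A @ X) @ W[h].T                    # apply W_h to the attended mean
E_u /= H
```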

Appendix D Experiment Details

Table 5: Summary of the baselines' sizes

| Baseline | # params | # layers | Token dimension $D$ | Hidden size of MLP $d_{\text{ff}}$ | Sequence length | # heads |
|---|---|---|---|---|---|---|
| VMoE | 60M | 8 | 512 | 2048 | 50 | 8 |
| SMoE-medium | 215M | 6 | 352 | 352 | 512 | 8 |
| SMoE-large | 388M | 12 | 512 | 512 | 512 | 8 |
| GLAM-medium | 201M | 4 | 352 | 351 | 512 | 8 |

D.1 WikiText-103 Language Modeling

D.1.1 Dataset

The WikiText-103 dataset (Merity et al., 2016), sourced from Wikipedia, is crafted to examine extended contextual relationships. Its training component encompasses roughly 28,000 articles, totaling 103 million words. These articles are segmented into blocks of about 3,600 words each. The validation and test sets consist of 60 articles each, with word counts of 218,000 and 246,000 respectively, amounting to approximately 268,000 words combined. To assess the resilience of our methods, we employ TextAttackโ€™s word swap attack (Morris et al., 2020) to modify both the validation and test data. This adversarial method randomly substitutes words with "AAA," challenging the modelโ€™s ability to accurately predict subsequent words in the sequence.

D.1.2 Models and baselines

In our study, we utilize the Switch Transformer (Fedus et al., 2021) (denoted as SMoE in our data presentations) and GLaM (Du et al., 2022) as baseline models. The Switch Transformer substitutes all multilayer perceptron (MLP) layers with SMoE layers, while GLaM replaces every alternate MLP layer. Our standard model for experiments is medium-sized with 6 layers. Each model incorporates 16 experts in every SMoE layer, selecting the Top-1 or Top-2 experts (K = 1 or K = 2) per input. All models employ an identical sparse router function, comprising a linear network that processes the input, followed by TopK and Softmax functions. The SMoE models undergo 60 epochs of training, while the GLaM models train for 80 epochs, without any additional load-balancing loss. Our implementation builds upon the codebase developed by (Pham et al., 2024; Press et al., 2020), which is publicly accessible at https://github.com/ofirpress/sandwich_transformer and https://github.com/giangdip2410/CompeteSMoE/tree/main.

The model sizes are summarized in Tab. 5. Note that except for models presented in Tab. 7, all models used in language tasks have medium sizes.

In all our Similarity/Attention-Aware SMoEs, we set the hyperparameter $\tau = 1$. In Similarity-Aware SMoE, instead of learning $\mathbf{W}_s$ in Def. 1, we set $\mathbf{W}_s = \mathbb{I}$ to save computation and avoid introducing extra parameters. In Attention-Aware SMoE, we set the hyperparameter $\sigma = 1$.

D.2 ImageNet-1K Object Recognition

D.2.1 Datasets

Our study employs the ImageNet-1K dataset, which consists of 1.28 million training images and 50,000 validation images across 1,000 object classes. The model is trained for object recognition. To evaluate resilience to input data distribution shifts, we use ImageNet-A (IN-A) (Hendrycks et al., 2021b). This dataset includes adversarially filtered images from a 200-class subset of ImageNet-1K. We also test our modelโ€™s ability to generalize to abstract visual representations using ImageNet-R (IN-R) , which contains various artistic renditions of images.

D.2.2 Model and baselines

For our ImageNet-1K object recognition task and standard robustness benchmarks, we employ a small Vision Mixture of Experts (V-MoE) (Riquelme et al., 2021) model as the SMoE baseline. This V-MoE variant is composed of 8 Vision Transformer (ViT) blocks, with the MLPs in the final two blocks replaced by SMoE layers. In our Similarity/Attention-Aware SMoEs, we alternate between Attention-Aware SMoE and Similarity-Aware SMoE layers, replacing every other MLP layer. All our vision SMoE models select 2 experts ($K = 2$) per patch at each SMoE layer. We adhere to the training configurations and settings outlined in the cited work. The codebase for this implementation is publicly available at https://github.com/google-research/vmoe/. As in the language modeling experiments, we set the hyperparameter $\tau = 1$ and $\mathbf{W}_s = \mathbb{I}$ in Similarity-Aware SMoE.

The VMoE baseline has 8 layers, a token dimension of 512, and 60M parameters.

D.3 Finetuning on downstream tasks

D.3.1 Datasets

The Stanford Sentiment Treebank-2 (SST2). (Socher et al., 2013). The dataset is designed for analyzing how sentiment is composed in language. It contains a binary classification setup, featuring 11,855 individual sentences drawn from movie reviews. Using the Stanford parser, these sentences were broken down into parse trees, generating 215,154 distinct phrases. Each of these phrases was evaluated by three human annotators. The datasetโ€™s structure allows researchers to examine how sentiment meaning is built up through language composition.

The Stanford Sentiment Treebank-5 (SST5) (Socher et al., 2013). The sentiment analysis dataset consists of five sentiment categories. It comprises 11,855 individual sentences taken from movie reviews. The sentences were parsed into trees, yielding 215,154 unique phrases, with each phrase receiving ratings from three human evaluators. The sentiment classifications in this dataset range across five levels: negative, somewhat negative, neutral, somewhat positive, and positive.

The Banking-77 (B77) (Casanueva et al., 2020). This is a detailed intent classification dataset for customer service in banking. It features 13,083 customer queries categorized across 77 distinct intent classes, providing a highly granular classification system for banking-related customer inquiries.

D.3.2 Model and baselines

The models were initialized using pretrained language models that were trained on the Wikitext-103 dataset. For the Stanford Sentiment Treebank datasets (SST2 and SST5), the model training process involves 5 epochs of finetuning, utilizing the Adam optimizer with a 0.001 base learning rate. The training uses no warmup period and processes 16 samples per batch. The Banking-77 dataset requires longer training at 50 epochs, also using Adam but with a much lower base learning rate of 0.00001, maintaining the same batch size of 16 and no warmup period.

Appendix E Additional Experiments and Analysis

E.1 Experiments with varying the number of experts and TopK

We examine our method under more settings of the number of experts and TopK, including Top-1, Top-8, and 32 experts. Across all these settings, both Similarity-Aware SMoE and Attention-Aware SMoE consistently demonstrate better performance than the baseline SMoE, achieving lower PPL scores on both the clean and attacked Wikitext-103 datasets (Tab. 6). When using 32 experts, our methods achieve PPL reductions of up to 1.56 PPL compared to the baseline, and when increasing to Top-8 active experts, they maintain their advantage with improvements of up to 1.64 PPL. These consistent performance gains across different architectural configurations demonstrate the robustness and effectiveness of our proposed methods.

Table 6: PPL evaluation (lower is better) on the clean and attacked Wikitext-103 test sets of baseline SMoEs and Similarity/Attention-Aware SMoE(s) with different numbers of experts and TopK

| Model | Clean Valid PPL | Clean Test PPL | Attacked Valid PPL | Attacked Test PPL |
|---|---|---|---|---|
| SMoE (K = 1, E = 16) | 39.55 | 40.75 | 48.82 | 50.21 |
| Similarity-Aware SMoE (K = 1, E = 16) | 37.78 | 39.18 | 46.93 | 48.66 |
| Attention-Aware SMoE (K = 1, E = 16) | 38.02 | 39.35 | 47.20 | 48.72 |
| SMoE (K = 2, E = 16) | 33.29 | 34.84 | 41.75 | 43.59 |
| Similarity-Aware SMoE (K = 2, E = 16) | 30.75 | 32.03 | 38.33 | 39.92 |
| Attention-Aware SMoE (K = 2, E = 16) | 31.31 | 32.23 | 39.68 | 40.91 |
| SMoE (K = 8, E = 16) | 33.48 | 34.92 | 41.36 | 42.98 |
| Similarity-Aware SMoE (K = 8, E = 16) | 32.5 | 33.81 | 40.6 | 42.37 |
| Attention-Aware SMoE (K = 8, E = 16) | 31.97 | 33.28 | 39.98 | 41.45 |
| SMoE (K = 2, E = 32) | 31.82 | 33.41 | 39.9 | 41.79 |
| Similarity-Aware SMoE (K = 2, E = 32) | 30.41 | 31.62 | 38.23 | 39.77 |
| Attention-Aware SMoE (K = 2, E = 32) | 30.39 | 31.85 | 37.8 | 39.65 |

E.2 Routing Fluctuation and Entropy of SMoEs Top-1

Figure 5: Comparison of routing fluctuation and entropy ratio across layers for baseline SMoE Top-1, Attention-Aware SMoE Top-1, and Similarity-Aware SMoE Top-1

Attention-Aware SMoE Top-1 reduces routing fluctuation. Fig. 5 (Left) compares the routing fluctuation of the baseline SMoE Top-1 and Attention-Aware SMoE for Top-1 routing. The fluctuation rate, computed as the proportion of tokens that switch their expert choice between the last two training epochs (epochs 59 and 60), provides insight into routing stability. The baseline SMoE exhibits higher fluctuation rates across all layers. In contrast, the Attention-Aware SMoE demonstrates consistently lower fluctuation rates across all layers and maintains more stable routing decisions throughout the network, indicating improved consistency in expert utilization. These results suggest that our proposed Attention-Aware SMoE significantly enhances routing stability compared to the baseline approach, potentially leading to more consistent and efficient utilization of experts in the Mixture of Experts model. The results also align with the better performance and enhanced robustness of Attention-Aware SMoE Top-1 in Tab. 6.
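For reference, the fluctuation rate reduces to a one-line computation once the Top-1 expert assignment of every token is recorded at each epoch; a minimal sketch (the toy expert ids below are illustrative, not measured values):

```python
import numpy as np

def fluctuation_rate(experts_prev, experts_curr):
    """Fraction of tokens whose Top-1 expert changed between two epochs."""
    experts_prev = np.asarray(experts_prev)
    experts_curr = np.asarray(experts_curr)
    return float((experts_prev != experts_curr).mean())

# toy example: expert ids recorded for the same tokens at epochs 59 and 60
epoch59 = np.array([3, 7, 7, 1, 0, 2, 5, 5])
epoch60 = np.array([3, 7, 2, 1, 0, 2, 5, 4])
print(fluctuation_rate(epoch59, epoch60))   # 2 of 8 tokens switched -> 0.25
```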

Attention-Aware SMoE Top-1 reduces decision entropy. Fig. 5 (Right) illustrates the ratio of the average entropy of tokens' routing decisions across layers for the Attention-Aware SMoE compared to the baseline SMoE at epoch 59. The Attention-Aware SMoE demonstrates consistently lower entropy than the baseline SMoE across all layers, as evidenced by ratios below 1.0. This trend aligns with the lower routing fluctuation observed in the left graph, suggesting that our approach leads to more stable and consistent routing decisions.

E.3 Scalability of Similarity/Attention-Aware SMoEs

Table 7: PPL evaluation (lower is better) on the clean and attacked Wikitext-103 test sets of the baseline SMoE (large size), Attention-Aware SMoE (large size), and Similarity-Aware SMoE (large size)

| Model | Clean Valid PPL | Clean Test PPL | Attacked Valid PPL | Attacked Test PPL |
|---|---|---|---|---|
| SMoE (K = 2) | 28.737 | 30.378 | 36.43 | 38.34 |
| Similarity-Aware SMoE (K = 2) | 27.06 | 28.34 | 34.65 | 36.28 |
| Attention-Aware SMoE (K = 2) | 27.26 | 28.69 | 34.69 | 36.37 |

To further demonstrate the scalability of our models, we evaluate them with a larger Transformer-MoE baseline of approximately 390M parameters and 12 layers. The results in Tab. 7 confirm that the scaling law holds, as all models show improved language modeling performance with the increased parameter count. Importantly, both Similarity-Aware SMoE and Attention-Aware SMoE maintain their performance advantage over the conventional SMoE at this larger scale, with Similarity-Aware SMoE emerging as the best-performing variant. These findings validate that the benefits of our proposed methods are preserved when scaling up model size.

E.4 Computation and memory

Table 8: Computation and memory ratios of the forward pass (relative to the baseline SMoE, XMoE, and SMoE-dropout) for different SMoE-medium size variants, Top-K = 2

| Model | Computation Ratio | Memory Ratio |
|---|---|---|
| Similarity-Aware SMoE | 1.048 | 1.008 |
| Attention-Aware SMoE | 1.070 | 1.060 |
| Similarity-Aware XMoE | 1.026 | 1.009 |
| Attention-Aware XMoE | 1.038 | 1.060 |
| Similarity-Aware SMoE-dropout | 1.047 | 1.008 |
| Attention-Aware SMoE-dropout | 1.064 | 1.060 |

We compare the computational and memory complexity of our Similarity/Attention-Aware routing techniques against the conventional approach without them. In particular, we measure the computation time and memory of Similarity-Aware SMoE and Attention-Aware SMoE divided by the corresponding computation time and memory of the conventional SMoE in Tab. 8. Similarly, we report the ratios for XMoE and SMoE-dropout in Tab. 8. From the table, we can see that the Similarity/Attention-Aware SMoE variants only increase the computational complexity slightly.

E.5 Hyperparameter ablation

We present an ablation study of the temperature hyperparameters $\tau$ in Similarity-SMoE and $\sigma$ in Attention-SMoE. Tab. 9 demonstrates that both Similarity-SMoE and Attention-SMoE are relatively insensitive to their respective temperature parameters across values of 0.1, 1, 2, and 352 (the model dimension). In the case of Similarity-SMoE, a $\tau$ that is too large or too small can lead to a decrease in performance.

Table 9: Perplexity comparison for different SMoE variants with various $\tau$ and $\sigma$ values on the validation and test sets

| Model | Valid PPL | Test PPL |
|---|---|---|
| SMoE | 33.29 | 34.84 |
| Similarity-SMoE ($\tau$ = 0.1) | 32.79 | 34.01 |
| Similarity-SMoE ($\tau$ = 1.0) | 30.75 | 32.03 |
| Similarity-SMoE ($\tau$ = 2.0) | 30.68 | 32.88 |
| Similarity-SMoE ($\tau$ = 352) | 32.26 | 33.83 |
| Attention-SMoE ($\sigma$ = 0.1) | 31.93 | 32.67 |
| Attention-SMoE ($\sigma$ = 1.0) | 31.31 | 32.23 |
| Attention-SMoE ($\sigma$ = 2.0) | 31.13 | 32.85 |
| Attention-SMoE ($\sigma$ = 352) | 31.62 | 32.90 |

E.6 The influence of token similarity on expert decisions during training

Token similarity influences expert decisions through the score matrix $S = \mathrm{Softmax}(\mathbf{U}\mathbf{W}_s\mathbf{U}^\top / \tau)$, where $\tau$ is the temperature parameter and, in practice, we set $\mathbf{W}_s = \mathbb{I}$. As $\tau \to 0$, $S$ converges to the identity matrix, meaning each token becomes conditionally independent in its decision-making. We examine how gradually varying $\tau$ during training affects model performance. As expected, when $\tau \to 0$, the performance of Similarity-Aware SMoE becomes comparable to the baseline SMoE. On the other hand, increasing $\tau$ from 0.01 to 10 improves performance over the baseline.

Table 10: PPL evaluation (lower is better) on the clean and attacked Wikitext-103 test sets of medium-size baseline SMoEs and Similarity-Aware SMoE(s) with $\tau$ decreasing from 10 to 0.01 and increasing from 0.01 to 10

| Model | Clean Valid PPL | Clean Test PPL | Attacked Valid PPL | Attacked Test PPL |
|---|---|---|---|---|
| SMoE | 33.29 | 34.84 | 41.75 | 43.59 |
| Similarity-Aware SMoE ($\tau$ decreases) | 33.37 | 34.65 | 41.61 | 43.08 |
| Similarity-Aware SMoE ($\tau$ increases) | 32.40 | 33.91 | 39.92 | 41.67 |
