Papers - MoE - Router
• Turn Waste into Worth: Rectifying Top-k Router of MoE (arXiv:2402.12399)
• CompeteSMoE: Effective Training of Sparse Mixture of Experts via Competition (arXiv:2402.02526)
• Buffer Overflow in Mixture of Experts (arXiv:2402.05526)
• OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models (arXiv:2402.01739)
• LocMoE: A Low-overhead MoE for Large Language Model Training (arXiv:2401.13920)
• HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts (arXiv:2312.07035)
• Routers in Vision Mixture of Experts: An Empirical Study (arXiv:2401.15969)
• BASE Layers: Simplifying Training of Large, Sparse Models (arXiv:2103.16716)
• Hash Layers For Large Sparse Models (arXiv:2106.04426)
• Direct Neural Machine Translation with Task-level Mixture of Experts models (arXiv:2310.12236)
• Adaptive Gating in Mixture-of-Experts based Language Models (arXiv:2310.07188)
• Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy (arXiv:2310.01334)
• Sparse Backpropagation for MoE Training (arXiv:2310.00811)
• A Review of Sparse Expert Models in Deep Learning (arXiv:2209.01667)
• SpeechMoE2: Mixture-of-Experts Model with Improved Routing (arXiv:2111.11831)
• Towards More Effective and Economic Sparsely-Activated Model (arXiv:2110.07431)
• Taming Sparsely Activated Transformer with Stochastic Experts (arXiv:2110.04260)
• Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference (arXiv:2110.03742)
• Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts (arXiv:2306.04845)
• Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks (arXiv:2306.04073)
• Unified Scaling Laws for Routed Language Models (arXiv:2202.01169)
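For context, most of the papers above start from the same top-k softmax routing baseline and then modify some part of it (hash-based assignment, stochastic experts, task-level gating, competition-based selection, and so on). Below is a minimal PyTorch sketch of that common baseline, for orientation only: the function name top_k_route and the toy shapes are illustrative assumptions, not taken from any listed paper.

import torch
import torch.nn.functional as F

def top_k_route(x, w_gate, k=2):
    # x:      (num_tokens, d_model) token representations
    # w_gate: (d_model, num_experts) learned router projection
    # k:      number of experts activated per token
    logits = x @ w_gate                      # (num_tokens, num_experts)
    topk_logits, topk_idx = logits.topk(k, dim=-1)
    gates = F.softmax(topk_logits, dim=-1)   # renormalize over the chosen k
    return topk_idx, gates

# Toy usage: 4 tokens, d_model=8, 4 experts, top-2 routing.
x = torch.randn(4, 8)
w_gate = torch.randn(8, 4)
idx, gates = top_k_route(x, w_gate, k=2)
print(idx.shape, gates.shape)  # torch.Size([4, 2]) torch.Size([4, 2])

Each token's final output is then a gate-weighted sum of its selected experts' outputs; the routing papers in this collection mostly differ in how topk_idx is chosen, how the gate is trained, or how capacity overflow is handled.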