Papers
arxiv:2603.13364

FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach

Published on Mar 9 · Submitted by NingLiao on Mar 17
Abstract

As revealed by the scaling law of fine-grained MoE, model performance ceases to improve once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both the intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern the activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method to build FineRMoE in a cost-effective manner. Extensive experiments demonstrate the superior performance achieved by FineRMoE across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.

Community

Paper author · Paper submitter

To break the performance ceiling of fine-grained MoE designs that are confined solely to the intermediate dimension, as revealed by the scaling law of MoE, we introduce the FineRMoE (FineR-Grained MoE) architecture. It extends fine-grained expert design in MoE models from the intermediate dimension alone to the output dimension as well, aiming to enhance expert specialization beyond the single-dimension limit. The core contributions of this work are:

  • Finer-grained expert design across intermediate and output dimensions;
  • Bi-level sparse forward computation paradigm for multi-expert fusion;
  • Unified routing mechanism with one router governing two sparse layers;
  • Generalized upcycling compatible with FineRMoE and conventional MoEs.
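
The bi-level sparse forward with a single shared router can be pictured with a minimal sketch. This is an illustrative toy implementation, not the paper's code: all dimensions, parameter names (`W1`, `W2`, `Wr`), the ReLU activation, and the top-k selection scheme are assumptions chosen only to show the idea of one router governing two sparse layers (intermediate-dimension slices and output-dimension slices).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8   # hidden size (illustrative)
n_int = 4     # fine-grained expert slices along the intermediate dimension
n_out = 4     # fine-grained expert slices along the output dimension
d_int = 4     # width of each intermediate slice
top_k = 2     # active slices per level

# Hypothetical parameters: W1 maps hidden -> intermediate slices,
# W2 maps the fused intermediate back out through output slices,
# Wr is the single router scoring BOTH sparse levels at once.
W1 = rng.normal(size=(n_int, d_model, d_int))
W2 = rng.normal(size=(n_out, d_int, d_model))
Wr = rng.normal(size=(d_model, n_int + n_out))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def finermoe_forward(x):
    """Bi-level sparse forward: one router's logits are split into
    gates for the intermediate-level and output-level expert slices."""
    logits = x @ Wr
    g_int = softmax(logits[:n_int])
    g_out = softmax(logits[n_int:])
    # Level 1: activate only the top-k intermediate slices and fuse them.
    idx1 = np.argsort(g_int)[-top_k:]
    h = sum(g_int[i] * np.maximum(x @ W1[i], 0.0) for i in idx1)
    # Level 2: activate only the top-k output slices on the fused result.
    idx2 = np.argsort(g_out)[-top_k:]
    return sum(g_out[j] * (h @ W2[j]) for j in idx2)
```

Because only `top_k` of the `n_int` intermediate slices and `top_k` of the `n_out` output slices run per token, the compute stays sparse at both levels while routing cost is paid once.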


Models citing this paper 3

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1