hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.

hf-mem now splits MoE memory into base model weights, routed experts, and KV cache.

- Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
- Active params aren't the same as memory footprint, especially for sparse architectures
- Runtime memory reflects what each request/token actually uses, while load-time memory also includes every expert's weights, which must stay resident
- KV cache can still dominate depending on context length, batch size, and concurrency
- Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
- Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
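To make the points above concrete, here is a rough back-of-the-envelope sketch in Python. It is not how hf-mem computes its numbers. The `MoEConfig` class, its fields, and all the values are hypothetical, and the KV cache formula assumes standard multi-head or grouped-query attention with a full cache in every layer.

```python
from dataclasses import dataclass

GIB = 1024**3


@dataclass
class MoEConfig:
    """Hypothetical MoE description; every value here is illustrative."""
    base_params: int        # shared weights: attention, embeddings, routers
    params_per_expert: int  # one expert's weights, summed over all MoE layers
    num_experts: int        # experts that must be resident in memory
    experts_per_token: int  # experts each token is routed to
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_param: int = 2  # e.g. bf16 weights
    bytes_per_kv: int = 2     # e.g. bf16 KV cache


def resident_weight_bytes(cfg: MoEConfig) -> int:
    # Everything that has to be loaded, whether or not a token uses it.
    total = cfg.base_params + cfg.params_per_expert * cfg.num_experts
    return total * cfg.bytes_per_param


def active_weight_bytes(cfg: MoEConfig) -> int:
    # Weights a single token actually touches in one forward pass.
    active = cfg.base_params + cfg.params_per_expert * cfg.experts_per_token
    return active * cfg.bytes_per_param


def kv_cache_bytes(cfg: MoEConfig, context_len: int, batch_size: int) -> int:
    # Assumption: keys and values (factor 2) cached for every layer and token.
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim
    return per_token * cfg.bytes_per_kv * context_len * batch_size


def per_device_weight_bytes(cfg: MoEConfig, ep_size: int) -> int:
    # Assumption: EP shards experts evenly and replicates the base weights.
    if cfg.num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    local_experts = cfg.num_experts // ep_size
    total = cfg.base_params + cfg.params_per_expert * local_experts
    return total * cfg.bytes_per_param


cfg = MoEConfig(
    base_params=2_000_000_000,
    params_per_expert=500_000_000,
    num_experts=64,
    experts_per_token=4,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)

print(f"resident weights: {resident_weight_bytes(cfg) / GIB:.1f} GiB")
print(f"active weights:   {active_weight_bytes(cfg) / GIB:.1f} GiB")
print(f"KV cache:         {kv_cache_bytes(cfg, 32_768, 16) / GIB:.1f} GiB")
print(f"per device, EP=8: {per_device_weight_bytes(cfg, 8) / GIB:.1f} GiB")
```

With these made-up numbers, the resident weights are several times larger than the active ones, and the KV cache at long context and moderate batch size is in the same range as the full weights, which is why active params alone tell you little about the serving footprint.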
Check the repository at https://github.com/alvarobartt/hf-mem