MoE routing for reasoning workloads
#7
by O96a - opened
The Puzzle MoE architecture with 88B parameters is an interesting approach to scaling reasoning capabilities. We've been experimenting with MoE models for multi-agent orchestration, where different experts handle different cognitive tasks. The key question is whether MoE routing actually helps reasoning enough to justify its overhead, or whether it just dilutes the signal. Has anyone measured expert activation patterns during chain-of-thought reasoning versus simple generation tasks? It would be useful to know whether certain experts specialize in planning versus execution phases.
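For anyone who wants to try this, here is a rough sketch of the comparison I have in mind. It assumes you can already dump per-token router logits for one MoE layer (how you get them depends on the model and framework, and I haven't checked what this model exposes). The function names, the toy logits, and the choice of Jensen-Shannon divergence as the comparison metric are all my own assumptions, not anything from the model card.

```python
import math
from collections import Counter


def top_k_experts(router_logits, k):
    """Return the indices of the k largest router logits for one token."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]


def activation_histogram(token_logits, num_experts, k):
    """Fraction of routing slots that went to each expert over a token span."""
    counts = Counter()
    for logits in token_logits:
        counts.update(top_k_experts(logits, k))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens to aggregate")
    return [counts[e] / total for e in range(num_experts)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y > 0 whenever x > 0 here.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy router logits (hypothetical): 2 tokens per span, 4 experts, top-2 routing.
cot_logits = [[2.0, 1.5, 0.1, 0.0], [1.8, 2.2, 0.3, 0.1]]  # chain-of-thought span
gen_logits = [[0.1, 0.2, 2.0, 1.7], [0.0, 0.3, 1.9, 2.1]]  # plain generation span

cot_hist = activation_histogram(cot_logits, num_experts=4, k=2)
gen_hist = activation_histogram(gen_logits, num_experts=4, k=2)
divergence = js_divergence(cot_hist, gen_hist)
print(cot_hist, gen_hist, divergence)
```

On real traces you would compute one histogram per layer and per phase (e.g. tokens tagged as planning vs. execution) and look for layers where the divergence is consistently well above what you get between two random splits of the same phase. A high value alone does not prove specialization; it could just reflect vocabulary differences between the spans.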