A newer version of the Gradio SDK is available: 6.14.0
title: Building a Neural Mixture-of-Experts Router from Scratch
thumbnail: https://huggingface.co/spaces/ianshank/MangoMAS/resolve/main/thumbnail.png
authors:
- ianshank
tags:
- mixture-of-experts
- pytorch
- neural-routing
- multi-agent
- reinforcement-learning
Building a Neural Mixture-of-Experts Router from Scratch
Author: Ian Shanker | Date: February 2026 | Reading time: ~12 min
🧪 Try it live! Route tasks through the neural MoE gate on the MangoMAS Interactive Demo — select the 🔀 MoE Router tab to see feature extraction and expert weights in real time.
Introduction
Mixture-of-Experts (MoE) architectures have powered some of the most capable AI systems of the last decade — from Switch Transformer to GPT-4. But most tutorials treat MoE as a black box. In this post, I'll walk through building a production-grade neural MoE router from scratch in PyTorch, including the feature extraction pipeline, learned routing gate, and feedback-driven weight updates.
This is the exact architecture powering MangoMAS's multi-agent orchestration layer. The full model is available on HuggingFace Hub: ianshank/MangoMAS-MoE-7M.
What Is a Mixture-of-Experts Router?
A MoE router is a learned function that maps an input to a probability distribution over a set of "experts" (specialized sub-networks or agents). Instead of routing every input through the same computation, MoE selects the most relevant experts for each input.
Input → Feature Extractor → RouterNet (MLP) → Softmax → Expert Weights
↓
[Expert 1, Expert 2, ..., Expert N]
↓
Weighted Aggregation → Output
The key insight: routing is a learned function, not a hand-crafted heuristic.
Architecture Overview
MangoMAS's MoE has three components:
1. Feature Extractor (64-Dimensional Vector)
Converts raw text into a compact feature vector:
def featurize64(text: str) -> np.ndarray:
"""
Extract 64 routing features from raw text.
Features include:
- Hash-based sinusoidal encoding (32 dims)
- Domain tag signals: code, security, architecture, data (16 dims)
- Structural signals: length, punctuation density, questions (8 dims)
- Sentiment polarity estimate (4 dims)
- Novelty/complexity scores (4 dims)
"""
features = np.zeros(64, dtype=np.float32)
# ... feature extraction logic
return features / (np.linalg.norm(features) + 1e-8) # L2 normalize
Why 64 dimensions? It's the sweet spot between expressiveness and routing latency. At 64 dims, the RouterNet forward pass takes < 1ms on CPU.
2. RouterNet (Neural Gate)
A lightweight MLP with residual connections:
class RouterNet(nn.Module):
"""
Neural routing gate for MoE expert selection.
Architecture: Linear(64→128) → ReLU → Dropout → Linear(128→64)
→ ReLU → Linear(64→N_experts) → Softmax
"""
def __init__(self, n_experts: int, hidden_dim: int = 128, dropout: float = 0.1):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(64, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, hidden_dim // 2),
nn.ReLU(),
nn.Linear(hidden_dim // 2, n_experts),
)
self.softmax = nn.Softmax(dim=-1)
def forward(self, x: torch.Tensor) -> torch.Tensor:
logits = self.layers(x)
return self.softmax(logits)
3. MixtureOfExperts7M (~7M Parameters)
The full model with 16 expert towers:
class MixtureOfExperts7M(nn.Module):
"""
Architecture:
- Gating: Linear(64→512) → ReLU → Linear(512→16) → Softmax
- 16 Expert Towers: Linear(64→512) → ReLU → Linear(512→512) → ReLU → Linear(512→256)
- Classifier: Linear(256→N_classes)
"""
🔗 Model on Hub:
ianshank/MangoMAS-MoE-7M— download the weights and config.
4. AggregatorCell
Combines expert outputs using the router's weight distribution (weighted average, max confidence, or ensemble).
The Routing Pipeline
Here's the complete routing flow:
def route(task: str, strategy: str = "moe_routing") -> RoutingResult:
# 1. Extract features
features = featurize64(task) # 64-dim vector, < 0.5ms
# 2. Neural routing
with torch.no_grad():
weights = router_net(torch.tensor(features)) # softmax over N experts
# 3. Select top-K experts (sparse routing)
top_k = torch.topk(weights, k=3)
selected_experts = [EXPERTS[i] for i in top_k.indices]
# 4. Execute experts in parallel
results = await asyncio.gather(*[
expert.execute(task) for expert in selected_experts
])
# 5. Aggregate with learned weights
return aggregator.aggregate(results, weights=dict(zip(selected_experts, top_k.values)))
Learned Routing: Feedback Loop
The router improves over time via a REINFORCE-style gradient update:
class RouterFeedbackLoop:
"""Updates router weights based on expert output quality."""
def update(self, routing_result: RoutingResult, feedback: float) -> None:
# Compute policy gradient loss
log_probs = torch.log(routing_result.weights + 1e-8)
loss = -feedback * log_probs.sum()
# Update with Adam optimizer
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
In production, we use PPO with a value baseline to reduce variance.
Key Design Decisions
Why Not Attention-Based Routing?
For MangoMAS's use case — routing between specialized agents — we need:
- Sub-millisecond latency (attention is O(n²))
- CPU-only inference (no GPU required)
- Interpretable routing decisions
A simple MLP with 64-dim features achieves all three.
Sparse vs. Dense Routing
We use sparse routing (top-K=3 out of N experts). This reduces compute by 60-80%, forces specialization, and enables load balancing.
Load Balancing Loss
def load_balance_loss(weights: torch.Tensor, n_experts: int) -> torch.Tensor:
"""Encourage uniform expert utilization."""
expert_load = weights.mean(dim=0)
target_load = torch.ones(n_experts) / n_experts
return F.kl_div(expert_load.log(), target_load, reduction="batchmean")
Performance Results
| Metric | Value |
|---|---|
| Routing latency (P50) | 0.8ms |
| Routing latency (P99) | 2.1ms |
| Expert utilization (entropy) | 2.94 / 3.00 |
| Quality improvement vs. random | +23% |
| Quality improvement vs. greedy | +11% |
📊 See live benchmarks on the MangoMAS Demo — select the 📈 Metrics tab.
Conclusion
Building a neural MoE router from scratch taught us:
- Feature engineering matters more than model size — 64 well-chosen features outperform 256 raw features
- Sparse routing is essential for production latency
- Load balancing loss prevents collapse
- Feedback loops close the loop between routing decisions and output quality
Next in this series: MCTS for Multi-Agent Task Planning
Full source code: MangoMAS on GitHub