---
license: mit
language:
- en
---

# HiMoE — Hierarchical Mixture of Experts

> *A Matryoshka-inspired two-level routing architecture for efficient large-scale language modelling.*

**Author:** AG  ·  **Year:** 2026

---

## Overview

HiMoE replaces the standard feed-forward network (FFN) in each Transformer block with a hierarchical routing system. A **Level-1 router** selects one of N MoE blocks; that block's own **Level-2 router** selects one of M local experts. Only a single expert is ever activated per token — regardless of total model size.

```
Token
 └─► Level-1 Router  (1 of 6 MoE blocks)
      └─► Level-2 Router  (1 of 8 experts)
           └─► Expert FFN ──► output
```

With the default config (N=6, M=8, 2 layers) the model holds **~52M parameters** but activates only **~3.3% per token** — the compute footprint of a ~1.7M-parameter dense model.

---

## Repository Structure

```
.
├── train_himoe.py        # Full training script (self-contained)
├── hamlet.txt            # Training corpus (place here before running)
├── README.md
└── model/                # Created automatically on first save
    ├── config.json       # Hyperparameters + vocab snapshot
    ├── backbone.pt       # Embeddings, attention, LN, LM head
    ├── main_router.pt    # Level-1 gate (or layer_01_main_router.pt for n_layer > 1)
    ├── moe_expert_001/
    │   ├── router.pt     # Level-2 gate for this MoE block
    │   ├── model_001.pt
    │   ├── model_002.pt
    │   └── ... (model_008.pt)
    ├── moe_expert_002/
    │   └── ...
    ├── ...
    ├── moe_expert_006/
    ├── sample.txt        # Generated text after training
    └── routing_log.json  # Expert attribution for first 50 tokens
```

Each learnable component lives in its own file — making it straightforward to hot-swap, quantise, or fine-tune individual experts without touching the rest of the model.

---

## Quickstart

### 1. Install dependencies

```bash
pip install torch
```

No other dependencies are required; everything else uses the Python standard library.

### 2. Add your data

Place `hamlet.txt` (or any plain-text corpus) in the same directory as `train_himoe.py`.

### 3. Train

```bash
python train_himoe.py
```

Checkpoints are saved to `model/` every `eval_interval` steps and at the end of training. A sample generation and routing log are written automatically.

### 4. Resume training

```bash
python train_himoe.py --resume
```

### 5. Custom config

All hyperparameters are overridable from the command line:

```bash
python train_himoe.py \
  --num_moes 8 \
  --num_experts 16 \
  --n_embd 512 \
  --n_layer 4 \
  --max_iters 10000 \
  --lr 2e-4 \
  --data_file my_corpus.txt \
  --model_dir checkpoints/run_01
```

---

## Architecture

### HiMoEConfig defaults

| Parameter | Default | Description |
|---|---|---|
| `n_embd` | 256 | Embedding / hidden dimension |
| `n_layer` | 2 | Number of Transformer layers |
| `n_head` | 4 | Attention heads |
| `block_size` | 128 | Context window (tokens) |
| `num_moes` | 6 | Level-1 choices (MoE blocks) |
| `num_experts` | 8 | Level-2 choices per MoE block |
| `dropout` | 0.1 | Dropout rate |
| `batch_size` | 32 | Training batch size |
| `max_iters` | 3000 | Training steps |
| `lr` | 3e-4 | Peak learning rate |

### Sparsity

| Routing Level | Active | Total | % Active |
|---|---|---|---|
| Level-1 (MoE blocks) | 1 | 6 | 16.7% |
| Level-2 (experts) | 1 | 48 | 2.1% |
| **Full model (params)** | **~1.7M** | **~52M** | **~3.3%** |

### Checkpoint layout for multi-layer models

When `n_layer > 1`, routers and expert directories are prefixed by layer:

```
model/
  layer_01_main_router.pt
  layer_01_moe_expert_001/
  layer_01_moe_expert_002/
  ...
  layer_02_main_router.pt
  layer_02_moe_expert_001/
  ...
```
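To make the two-level hard routing concrete, here is a minimal sketch of one HiMoE layer's forward pass under the default config. It is an illustration, not the code in `train_himoe.py`: the names `HiMoELayer` and `Expert` are hypothetical, and the gate-probability scaling in the inner loop is the standard trick for keeping hard top-1 routing differentiable, which the actual implementation may handle differently.

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """A plain FFN expert (4x hidden expansion, as is conventional)."""
    def __init__(self, n_embd: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class HiMoELayer(nn.Module):
    """Hierarchical routing: pick 1 of num_moes blocks (Level 1),
    then 1 of num_experts experts inside that block (Level 2)."""
    def __init__(self, n_embd: int = 256, num_moes: int = 6, num_experts: int = 8):
        super().__init__()
        self.main_router = nn.Linear(n_embd, num_moes)   # Level-1 gate
        self.sub_routers = nn.ModuleList(                # one Level-2 gate per block
            [nn.Linear(n_embd, num_experts) for _ in range(num_moes)]
        )
        self.experts = nn.ModuleList(                    # num_moes x num_experts FFNs
            [nn.ModuleList([Expert(n_embd) for _ in range(num_experts)])
             for _ in range(num_moes)]
        )

    def forward(self, x):
        # x: (batch, seq, n_embd) -> route each token independently
        B, T, C = x.shape
        flat = x.reshape(-1, C)
        out = torch.zeros_like(flat)
        p1 = self.main_router(flat).softmax(dim=-1)      # (tokens, num_moes)
        moe_idx = p1.argmax(dim=-1)                      # hard top-1, Level 1
        for m in moe_idx.unique().tolist():
            rows = (moe_idx == m).nonzero(as_tuple=True)[0]
            tok = flat[rows]
            p2 = self.sub_routers[m](tok).softmax(dim=-1)
            exp_idx = p2.argmax(dim=-1)                  # hard top-1, Level 2
            for e in exp_idx.unique().tolist():
                sel = (exp_idx == e).nonzero(as_tuple=True)[0]
                # Scaling by the gate probabilities lets gradients reach
                # both routers despite the hard argmax selection.
                gate = (p1[rows[sel], m] * p2[sel, e]).unsqueeze(-1)
                out[rows[sel]] = gate * self.experts[m][e](tok[sel])
        return out.reshape(B, T, C)
```

With the defaults this layer holds 6 × 8 = 48 expert FFNs but runs exactly one per token, which is where the ~3.3% active-parameter figure in the sparsity table comes from.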
---

## Training Details

- **Optimiser:** AdamW with weight decay 0.1 on matrix parameters, 0.0 on biases and norms
- **LR schedule:** Cosine decay with 100-step linear warmup, minimum LR = 10% of peak
- **Gradient clipping:** 1.0
- **Weight tying:** Token embedding matrix and LM head share weights
- **Routing:** Hard top-1 at both levels (no auxiliary load-balancing loss required)

---

## Modular Deployment

Because every component is a separate file, you can:

**Load only what you need:**

```python
import torch

# Load just one expert for inspection or fine-tuning
expert_weights = torch.load("model/moe_expert_003/model_005.pt", map_location="cpu")
```

**Swap a router:**

```python
torch.save(new_router.state_dict(), "model/moe_expert_003/router.pt")
```

**Fine-tune a single MoE block** without touching the backbone or other experts.

**Add a new expert** by saving a new `model_009.pt` and retraining only the corresponding router.

---

## Output Files

After training completes:

| File | Contents |
|---|---|
| `model/sample.txt` | 400-token generation from a blank context |
| `model/routing_log.json` | Per-token (MoE, expert) routing decisions for the first 50 generated tokens (see the inspection sketch at the end of this README) |
| `model/config.json` | Full config + vocabulary + last saved step |

The training loop also prints an **expert utilisation summary** — a bar chart in the terminal showing how evenly tokens are distributed across MoE blocks and experts.

---

## Paper

A full write-up of the architecture, sparsity analysis, and experiments is included as `himoe_paper.pdf`.

---

## Citation

```
@misc{himoe2026,
  title  = {HiMoE: Hierarchical Mixture of Experts for Efficient Large-Scale Language Modelling},
  author = {AG},
  year   = {2026}
}
```
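---

As referenced in the output-files table above, here is a quick way to inspect `model/routing_log.json`. The exact schema is not documented in this README, so the sketch assumes each entry records a token together with its chosen `(moe, expert)` pair; adjust the field names to match what your run actually writes.

```python
import json
from collections import Counter

# Hypothetical schema: [{"token": "...", "moe": 3, "expert": 5}, ...]
with open("model/routing_log.json") as f:
    log = json.load(f)

# Count how often each (MoE block, expert) pair was selected
usage = Counter((entry["moe"], entry["expert"]) for entry in log)
for (moe, expert), count in usage.most_common():
    print(f"MoE block {moe:02d} / expert {expert:02d}: {count} tokens")
```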