| --- |
| license: mit |
| language: |
| - en |
| --- |
<img src="himoe_visual.png" alt="HiMoE">
|
|
| # HiMoE — Hierarchical Mixture of Experts |
|
|
| > *A Matryoshka-inspired two-level routing architecture for efficient large-scale language modelling.* |
|
|
| **Author:** AG · **Year:** 2026 |
|
|
| --- |
|
|
| ## Overview |
|
|
| HiMoE replaces the standard feed-forward network (FFN) in each Transformer block with a hierarchical routing system. A **Level-1 router** selects one of N MoE blocks; that block's own **Level-2 router** selects one of M local experts. Only a single expert is ever activated per token — regardless of total model size. |
|
|
| ``` |
| Token |
| └─► Level-1 Router (1 of 6 MoE blocks) |
| └─► Level-2 Router (1 of 8 experts) |
| └─► Expert FFN ──► output |
| ``` |
|
|
With the default configuration (N=6, M=8, 2 layers) the model holds **~52M parameters** but activates only **~3.3%** of them per token, giving it roughly the compute footprint of a ~1.7M-parameter dense model.
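
To make the routing concrete, here is a minimal sketch of the two-level hard top-1 forward pass. All names (`HierarchicalMoE`, `main_router`, `sub_routers`) and the 4x FFN expansion are illustrative assumptions, not the actual internals of `train_himoe.py`:

```python
import torch
import torch.nn as nn

class HierarchicalMoE(nn.Module):
    """Sketch of two-level hard top-1 routing (names are hypothetical)."""

    def __init__(self, n_embd=256, num_moes=6, num_experts=8):
        super().__init__()
        self.main_router = nn.Linear(n_embd, num_moes)    # Level-1 gate
        self.sub_routers = nn.ModuleList(                 # one Level-2 gate per block
            nn.Linear(n_embd, num_experts) for _ in range(num_moes))
        self.experts = nn.ModuleList(                     # num_moes x num_experts FFNs
            nn.ModuleList(
                nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                              nn.Linear(4 * n_embd, n_embd))
                for _ in range(num_experts))
            for _ in range(num_moes))

    def forward(self, x):                                 # x: (tokens, n_embd)
        out = torch.empty_like(x)
        gate1 = self.main_router(x).softmax(-1)
        moe_idx = gate1.argmax(-1)                        # pick 1 of num_moes blocks
        for m in moe_idx.unique().tolist():
            rows = (moe_idx == m).nonzero(as_tuple=True)[0]
            gate2 = self.sub_routers[m](x[rows]).softmax(-1)
            exp_idx = gate2.argmax(-1)                    # pick 1 of num_experts
            for e in exp_idx.unique().tolist():
                sel = rows[exp_idx == e]
                # Scaling by the gate probabilities keeps both routers
                # trainable despite the hard selection (a common trick,
                # assumed here rather than taken from the script).
                scale = (gate1[sel, m] * gate2[exp_idx == e, e]).unsqueeze(-1)
                out[sel] = scale * self.experts[m][e](x[sel])
        return out
```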
|
|
| --- |
|
|
| ## Repository Structure |
|
|
| ``` |
| . |
| ├── train_himoe.py # Full training script (self-contained) |
| ├── hamlet.txt # Training corpus (place here before running) |
| ├── README.md |
| └── model/ # Created automatically on first save |
| ├── config.json # Hyperparameters + vocab snapshot |
| ├── backbone.pt # Embeddings, attention, LN, LM head |
| ├── main_router.pt # Level-1 gate (or layer_01_main_router.pt for n_layer > 1) |
| ├── moe_expert_001/ |
| │ ├── router.pt # Level-2 gate for this MoE block |
| │ ├── model_001.pt |
| │ ├── model_002.pt |
| │ └── ... (model_008.pt) |
| ├── moe_expert_002/ |
| │ └── ... |
| ├── ... |
| ├── moe_expert_006/ |
| ├── sample.txt # Generated text after training |
| └── routing_log.json # Expert attribution for first 50 tokens |
| ``` |
|
|
| Each learnable component lives in its own file — making it straightforward to hot-swap, quantise, or fine-tune individual experts without touching the rest of the model. |
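
As a sketch of what this layout implies, a save routine might write each component to its own file like so (the attribute names `backbone`, `moe_blocks`, `router`, and `experts` are assumptions about the script's internals, chosen to match the tree above):

```python
import os
import torch

def save_modular(model, model_dir="model"):
    """Hypothetical per-component save matching the directory tree above."""
    os.makedirs(model_dir, exist_ok=True)
    torch.save(model.backbone.state_dict(), f"{model_dir}/backbone.pt")
    torch.save(model.main_router.state_dict(), f"{model_dir}/main_router.pt")
    for m, block in enumerate(model.moe_blocks, start=1):
        block_dir = f"{model_dir}/moe_expert_{m:03d}"
        os.makedirs(block_dir, exist_ok=True)
        torch.save(block.router.state_dict(), f"{block_dir}/router.pt")
        for e, expert in enumerate(block.experts, start=1):
            torch.save(expert.state_dict(), f"{block_dir}/model_{e:03d}.pt")
```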
|
|
| --- |
|
|
| ## Quickstart |
|
|
| ### 1. Install dependencies |
|
|
| ```bash |
| pip install torch |
| ``` |
|
|
No other third-party packages are required; everything else comes from the Python standard library.
|
|
| ### 2. Add your data |
|
|
| Place `hamlet.txt` (or any plain-text corpus) in the same directory as `train_himoe.py`. |
|
|
| ### 3. Train |
|
|
| ```bash |
| python train_himoe.py |
| ``` |
|
|
| Checkpoints are saved to `model/` every `eval_interval` steps and at the end of training. A sample generation and routing log are written automatically. |
|
|
| ### 4. Resume training |
|
|
| ```bash |
| python train_himoe.py --resume |
| ``` |
|
|
| ### 5. Custom config |
|
|
| All hyperparameters are overridable from the command line: |
|
|
| ```bash |
| python train_himoe.py \ |
| --num_moes 8 \ |
| --num_experts 16 \ |
| --n_embd 512 \ |
| --n_layer 4 \ |
| --max_iters 10000 \ |
| --lr 2e-4 \ |
| --data_file my_corpus.txt \ |
| --model_dir checkpoints/run_01 |
| ``` |
|
|
| --- |
|
|
| ## Architecture |
|
|
| ### HiMoEConfig defaults |
|
|
| | Parameter | Default | Description | |
| |---|---|---| |
| | `n_embd` | 256 | Embedding / hidden dimension | |
| | `n_layer` | 2 | Number of Transformer layers | |
| | `n_head` | 4 | Attention heads | |
| | `block_size` | 128 | Context window (tokens) | |
| | `num_moes` | 6 | Level-1 choices (MoE blocks) | |
| | `num_experts` | 8 | Level-2 choices per MoE block | |
| | `dropout` | 0.1 | Dropout rate | |
| | `batch_size` | 32 | Training batch size | |
| | `max_iters` | 3000 | Training steps | |
| | `lr` | 3e-4 | Peak learning rate | |
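
For reference, these defaults map onto a flat config object. A minimal sketch with field names mirroring the CLI flags above (the actual `HiMoEConfig` in `train_himoe.py` may differ):

```python
from dataclasses import dataclass

@dataclass
class HiMoEConfig:
    # Sketch only; mirrors the table above, not necessarily the script's class body.
    n_embd: int = 256
    n_layer: int = 2
    n_head: int = 4
    block_size: int = 128
    num_moes: int = 6
    num_experts: int = 8
    dropout: float = 0.1
    batch_size: int = 32
    max_iters: int = 3000
    lr: float = 3e-4
```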
|
|
| ### Sparsity |
|
|
| | Routing Level | Active | Total | % Active | |
| |---|---|---|---| |
| | Level-1 (MoE blocks) | 1 | 6 | 16.7% | |
| | Level-2 (experts) | 1 | 48 | 2.1% | |
| | **Full model (params)** | **~1.7M** | **~52M** | **~3.3%** | |
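
These percentages follow directly from the defaults; a quick sanity check using nothing but the table's own numbers:

```python
num_moes, num_experts = 6, 8

print(f"Level-1: 1/{num_moes} = {1 / num_moes:.1%}")                          # 16.7%
print(f"Level-2: 1/{num_moes * num_experts} = {1 / (num_moes * num_experts):.1%}")  # 2.1%
print(f"Params:  1.7/52 = {1.7 / 52:.1%}")                                    # ~3.3%
```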
|
|
| ### Checkpoint layout for multi-layer models |
|
|
| When `n_layer > 1`, routers and expert directories are prefixed by layer: |
|
|
| ``` |
| model/ |
| layer_01_main_router.pt |
| layer_01_moe_expert_001/ |
| layer_01_moe_expert_002/ |
| ... |
| layer_02_main_router.pt |
| layer_02_moe_expert_001/ |
| ... |
| ``` |
|
|
| --- |
|
|
| ## Training Details |
|
|
| - **Optimiser:** AdamW with weight decay 0.1 on matrix parameters, 0.0 on biases and norms |
| - **LR schedule:** Cosine decay with 100-step linear warmup, minimum LR = 10% of peak |
| - **Gradient clipping:** 1.0 |
| - **Weight tying:** Token embedding matrix and LM head share weights |
| - **Routing:** Hard top-1 at both levels (no auxiliary load-balancing loss required) |
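
This recipe corresponds to a fairly standard PyTorch setup. The sketch below follows the bullets (decay split, 100-step warmup, cosine floor at 10% of peak) but is a reconstruction, not the script's exact code:

```python
import math
import torch

def configure_optim(model, lr=3e-4, max_iters=3000, warmup=100):
    # Weight decay 0.1 on matrix (>= 2-D) parameters only, per the recipe above.
    decay, no_decay = [], []
    for p in model.parameters():
        if p.requires_grad:
            (decay if p.dim() >= 2 else no_decay).append(p)
    optim = torch.optim.AdamW(
        [{"params": decay, "weight_decay": 0.1},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )

    def lr_at(step):  # linear warmup, then cosine decay to 10% of peak
        if step < warmup:
            return step / warmup
        t = (step - warmup) / max(1, max_iters - warmup)
        return 0.1 + 0.45 * (1 + math.cos(math.pi * t))

    sched = torch.optim.lr_scheduler.LambdaLR(optim, lr_at)
    # In the training loop, per the recipe above:
    #   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    return optim, sched
```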
|
|
| --- |
|
|
| ## Modular Deployment |
|
|
| Because every component is a separate file, you can: |
|
|
| **Load only what you need:** |
| ```python |
import torch

# Load just one expert's weights for inspection or fine-tuning;
# map_location="cpu" avoids needing a GPU just to look at the tensors.
expert_weights = torch.load("model/moe_expert_003/model_005.pt", map_location="cpu")
| ``` |
|
|
| **Swap a router:** |
| ```python |
# new_router must match the existing gate's shape (n_embd -> num_experts logits)
torch.save(new_router.state_dict(), "model/moe_expert_003/router.pt")
| ``` |
|
|
| **Fine-tune a single MoE block** without touching the backbone or other experts. |
|
|
| **Add a new expert** by saving a new `model_009.pt` and retraining only the corresponding router. |
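
For example, growing a block from 8 to 9 experts might look like the following, assuming the experts are plain FFNs and the Level-2 gate is a single linear layer whose output width must grow by one (both assumptions about the script's internals); the router still needs retraining afterwards:

```python
import torch
import torch.nn as nn

n_embd = 256

# Save a fresh, untrained expert alongside the existing eight.
# The 4x FFN shape here is an assumption, not taken from the script.
new_expert = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
)
torch.save(new_expert.state_dict(), "model/moe_expert_003/model_009.pt")

# The Level-2 gate needs one more output logit; copy the old rows over.
old_gate = nn.Linear(n_embd, 8)
old_gate.load_state_dict(torch.load("model/moe_expert_003/router.pt"))
new_gate = nn.Linear(n_embd, 9)
with torch.no_grad():
    new_gate.weight[:8] = old_gate.weight
    new_gate.bias[:8] = old_gate.bias
torch.save(new_gate.state_dict(), "model/moe_expert_003/router.pt")
```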
|
|
| --- |
|
|
| ## Output Files |
|
|
| After training completes: |
|
|
| | File | Contents | |
| |---|---| |
| | `model/sample.txt` | 400-token generation from a blank context | |
| | `model/routing_log.json` | Per-token (MoE, expert) routing decisions for the first 50 generated tokens | |
| | `model/config.json` | Full config + vocabulary + last saved step | |
|
|
| The training loop also prints an **expert utilisation summary** — a bar chart in the terminal showing how evenly tokens are distributed across MoE blocks and experts. |
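
The routing log can also be inspected offline. A minimal sketch, assuming each record carries the chosen MoE block and expert under keys like `"moe"` and `"expert"` (check the actual file for the exact schema):

```python
import json
from collections import Counter

with open("model/routing_log.json") as f:
    log = json.load(f)

# Tally how often each (MoE block, expert) pair was selected.
counts = Counter((entry["moe"], entry["expert"]) for entry in log)
for (moe, expert), n in counts.most_common():
    print(f"MoE {moe:2d} / expert {expert:2d}: {n} tokens")
```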
|
|
| --- |
|
|
| ## Paper |
|
|
| A full write-up of the architecture, sparsity analysis, and experiments is included as `himoe_paper.pdf`. |
|
|
| --- |
|
|
| ## Citation |
|
|
| ``` |
| @misc{himoe2026, |
| title = {HiMoE: Hierarchical Mixture of Experts for Efficient Large-Scale Language Modelling}, |
| author = {AG}, |
| year = {2026} |
| } |
| ``` |