---
license: mit
language:
- en
---
<img src="himoe_visual.png">
# HiMoE: Hierarchical Mixture of Experts
> *A Matryoshka-inspired two-level routing architecture for efficient large-scale language modelling.*
**Author:** AG · **Year:** 2026
---
## Overview
HiMoE replaces the standard feed-forward network (FFN) in each Transformer block with a hierarchical routing system. A **Level-1 router** selects one of N MoE blocks; that block's own **Level-2 router** selects one of M local experts. Only a single expert is activated per token in each layer, regardless of total model size.
```
Token
  └─► Level-1 Router  (1 of 6 MoE blocks)
        └─► Level-2 Router  (1 of 8 experts)
              └─► Expert FFN ──► output
```
With the default config (N=6, M=8, 2 layers) the model holds **~52M parameters** but activates only **~3.3% per token**, roughly the compute footprint of a ~1.7M-parameter dense model.
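The two-level decision described above can be illustrated without any framework. The following is a minimal sketch, not the code in `train_himoe.py`: it assumes each router is a plain linear gate followed by an argmax, and the helper names (`route`, `linear`, `random_matrix`) and the toy dimension of 16 are made up for illustration.

```python
import random


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def linear(weights, x):
    """Multiply a (rows x dim) weight matrix by a length-dim vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def route(x, main_router, block_routers):
    """Hard top-1 routing at both levels (assumed gate form: linear + argmax)."""
    moe_idx = argmax(linear(main_router, x))                  # Level 1: pick a MoE block
    expert_idx = argmax(linear(block_routers[moe_idx], x))    # Level 2: pick an expert in it
    return moe_idx, expert_idx


def random_matrix(rows, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
dim, num_moes, num_experts = 16, 6, 8   # toy dim; 6 and 8 match the default config
main_router = random_matrix(num_moes, dim, rng)
block_routers = [random_matrix(num_experts, dim, rng) for _ in range(num_moes)]

token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
moe_idx, expert_idx = route(token, main_router, block_routers)
print(f"token -> MoE block {moe_idx + 1}, expert {expert_idx + 1}")
```

Only the Level-2 router of the chosen block is evaluated, so the routing cost per token is N + M gate logits rather than N × M.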
---
## Repository Structure
```
.
├── train_himoe.py          # Full training script (self-contained)
├── hamlet.txt              # Training corpus (place here before running)
├── README.md
└── model/                  # Created automatically on first save
    ├── config.json         # Hyperparameters + vocab snapshot
    ├── backbone.pt         # Embeddings, attention, LN, LM head
    ├── main_router.pt      # Level-1 gate (or layer_01_main_router.pt for n_layer > 1)
    ├── moe_expert_001/
    │   ├── router.pt       # Level-2 gate for this MoE block
    │   ├── model_001.pt
    │   ├── model_002.pt
    │   └── ... (model_008.pt)
    ├── moe_expert_002/
    │   └── ...
    ├── ...
    ├── moe_expert_006/
    ├── sample.txt          # Generated text after training
    └── routing_log.json    # Expert attribution for first 50 tokens
```
Each learnable component lives in its own file, making it straightforward to hot-swap, quantise, or fine-tune individual experts without touching the rest of the model.
---
## Quickstart
### 1. Install dependencies
```bash
pip install torch
```
No other dependencies. Everything else is standard library.
### 2. Add your data
Place `hamlet.txt` (or any plain-text corpus) in the same directory as `train_himoe.py`.
### 3. Train
```bash
python train_himoe.py
```
Checkpoints are saved to `model/` every `eval_interval` steps and at the end of training. A sample generation and routing log are written automatically.
### 4. Resume training
```bash
python train_himoe.py --resume
```
### 5. Custom config
All hyperparameters are overridable from the command line:
```bash
python train_himoe.py \
--num_moes 8 \
--num_experts 16 \
--n_embd 512 \
--n_layer 4 \
--max_iters 10000 \
--lr 2e-4 \
--data_file my_corpus.txt \
--model_dir checkpoints/run_01
```
---
## Architecture
### HiMoEConfig defaults
| Parameter | Default | Description |
|---|---|---|
| `n_embd` | 256 | Embedding / hidden dimension |
| `n_layer` | 2 | Number of Transformer layers |
| `n_head` | 4 | Attention heads |
| `block_size` | 128 | Context window (tokens) |
| `num_moes` | 6 | Level-1 choices (MoE blocks) |
| `num_experts` | 8 | Level-2 choices per MoE block |
| `dropout` | 0.1 | Dropout rate |
| `batch_size` | 32 | Training batch size |
| `max_iters` | 3000 | Training steps |
| `lr` | 3e-4 | Peak learning rate |
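For reference, the defaults in the table can be mirrored in a small dataclass. This is a sketch only: the class name `HiMoEConfigSketch` is hypothetical, the real `HiMoEConfig` in `train_himoe.py` may hold additional fields, and the divisibility check is an assumption based on how multi-head attention is usually implemented.

```python
from dataclasses import asdict, dataclass


@dataclass
class HiMoEConfigSketch:
    """Hypothetical mirror of the defaults table; not the script's actual class."""
    n_embd: int = 256        # embedding / hidden dimension
    n_layer: int = 2         # Transformer layers
    n_head: int = 4          # attention heads
    block_size: int = 128    # context window in tokens
    num_moes: int = 6        # Level-1 choices (MoE blocks)
    num_experts: int = 8     # Level-2 choices per MoE block
    dropout: float = 0.1
    batch_size: int = 32
    max_iters: int = 3000
    lr: float = 3e-4         # peak learning rate

    def __post_init__(self):
        # Assumption: heads split n_embd evenly, as in standard multi-head attention.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    @property
    def experts_per_layer(self) -> int:
        return self.num_moes * self.num_experts


default_cfg = HiMoEConfigSketch()
# Same overrides as the command-line example in the Quickstart.
custom_cfg = HiMoEConfigSketch(num_moes=8, num_experts=16, n_embd=512, n_layer=4)
print(asdict(default_cfg))
print(custom_cfg.experts_per_layer)
```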
### Sparsity
| Routing Level | Active | Total | % Active |
|---|---|---|---|
| Level-1 (MoE blocks) | 1 | 6 | 16.7% |
| Level-2 (experts) | 1 | 48 | 2.1% |
| **Full model (params)** | **~1.7M** | **~52M** | **~3.3%** |
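The figures in the table can be approximated with a back-of-the-envelope count. The sketch below rests on two assumptions that the source does not confirm: each expert is a two-layer FFN with a 4x hidden expansion and biases, and everything that always runs (embeddings, attention, norms, routers) is lumped into a rough `backbone_params` guess of 600k. It is meant to show where the ~3% figure comes from, not to reproduce the exact counts.

```python
def expert_params(n_embd: int, mult: int = 4) -> int:
    """Parameters of one expert FFN (assumed: two linear layers with biases)."""
    hidden = mult * n_embd
    return n_embd * hidden + hidden + hidden * n_embd + n_embd


def sparsity(n_embd=256, n_layer=2, num_moes=6, num_experts=8,
             backbone_params=600_000):
    """Return (total, active, active_fraction) under the assumptions above."""
    per_expert = expert_params(n_embd)
    total_expert = n_layer * num_moes * num_experts * per_expert
    active_expert = n_layer * per_expert   # one expert per token per layer
    total = total_expert + backbone_params
    active = active_expert + backbone_params
    return total, active, active / total


total, active, fraction = sparsity()
print(f"total ~{total / 1e6:.1f}M, active ~{active / 1e6:.2f}M, {fraction:.1%} active")
```

Under these assumptions the expert FFNs dominate the total, while the always-on backbone makes up a sizeable share of the active parameters. That is why the full-model fraction sits above the 1-in-48 (2.1%) expert ratio.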
### Checkpoint layout for multi-layer models
When `n_layer > 1`, routers and expert directories are prefixed by layer:
```
model/
layer_01_main_router.pt
layer_01_moe_expert_001/
layer_01_moe_expert_002/
...
layer_02_main_router.pt
layer_02_moe_expert_001/
...
```
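The naming scheme is regular enough to generate paths programmatically. The helpers below are hypothetical (they are not part of `train_himoe.py`) and assume 1-based indices, two-digit layer numbers, three-digit block and expert numbers, and that each layer-prefixed block directory still holds its own `router.pt`, as the listings above suggest.

```python
from pathlib import Path
from typing import Optional


def _prefix(layer: Optional[int]) -> str:
    """Layer prefix, e.g. 'layer_01_'; empty for the unprefixed layout."""
    return f"layer_{layer:02d}_" if layer is not None else ""


def main_router_file(model_dir: str, layer: Optional[int] = None) -> Path:
    """Path of the Level-1 gate."""
    return Path(model_dir) / f"{_prefix(layer)}main_router.pt"


def block_dir(model_dir: str, moe: int, layer: Optional[int] = None) -> Path:
    """Directory of one MoE block."""
    return Path(model_dir) / f"{_prefix(layer)}moe_expert_{moe:03d}"


def block_router_file(model_dir: str, moe: int, layer: Optional[int] = None) -> Path:
    """Path of the Level-2 gate inside a MoE block (assumed location)."""
    return block_dir(model_dir, moe, layer) / "router.pt"


def expert_file(model_dir: str, moe: int, expert: int,
                layer: Optional[int] = None) -> Path:
    """Path of a single expert's weights."""
    return block_dir(model_dir, moe, layer) / f"model_{expert:03d}.pt"


print(expert_file("model", moe=3, expert=5).as_posix())
print(expert_file("model", moe=1, expert=8, layer=2).as_posix())
print(main_router_file("model", layer=1).as_posix())
```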
---
## Training Details
- **Optimiser:** AdamW with weight decay 0.1 on matrix parameters, 0.0 on biases and norms
- **LR schedule:** Cosine decay with 100-step linear warmup, minimum LR = 10% of peak
- **Gradient clipping:** 1.0
- **Weight tying:** Token embedding matrix and LM head share weights
- **Routing:** Hard top-1 at both levels (no auxiliary load-balancing loss required)
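The learning-rate schedule in the list above can be written as a small function. This is a sketch under stated assumptions: warmup is taken to ramp linearly from `peak_lr / warmup` up to the peak, the cosine phase is assumed to span the remaining steps up to `max_iters`, and the function name `lr_at` is made up. The script's exact step indexing may differ.

```python
import math


def lr_at(step: int, peak_lr: float = 3e-4, warmup: int = 100,
          max_iters: int = 3000, min_ratio: float = 0.1) -> float:
    """Linear warmup, then cosine decay down to min_ratio * peak_lr."""
    min_lr = peak_lr * min_ratio
    if step < warmup:
        # Linear ramp; step 0 starts at peak_lr / warmup rather than zero.
        return peak_lr * (step + 1) / warmup
    if step >= max_iters:
        return min_lr
    progress = (step - warmup) / (max_iters - warmup)    # 0.0 -> 1.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1.0 -> 0.0
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 50, 100, 1550, 3000):
    print(step, f"{lr_at(step):.2e}")
```

With the defaults, the rate reaches the 3e-4 peak at the end of warmup and settles at 3e-5 (10% of peak) by the final step.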
---
## Modular Deployment
Because every component is a separate file, you can:
**Load only what you need:**
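```python
import torch
# Load just one expert for inspection or fine-tuning.
# map_location="cpu" lets this run on a machine without the GPU used for training.
# With n_layer > 1 the directory is layer-prefixed, e.g. model/layer_01_moe_expert_003/.
expert_weights = torch.load("model/moe_expert_003/model_005.pt", map_location="cpu")
```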
**Swap a router:**
```python
# new_router: a module whose state_dict matches the Level-2 gate being replaced
torch.save(new_router.state_dict(), "model/moe_expert_003/router.pt")
```
**Fine-tune a single MoE block** without touching the backbone or other experts.
**Add a new expert** by saving a new `model_009.pt` and retraining only the corresponding Level-2 router, which needs one extra output so it can select the new expert.
---
## Output Files
After training completes:
| File | Contents |
|---|---|
| `model/sample.txt` | 400-token generation from a blank context |
| `model/routing_log.json` | Per-token (MoE, expert) routing decisions for the first 50 generated tokens |
| `model/config.json` | Full config + vocabulary + last saved step |
The training loop also prints an **expert utilisation summary**: a bar chart in the terminal showing how evenly tokens are distributed across MoE blocks and experts.
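A similar summary can be derived offline from `routing_log.json`. The schema of that file is not documented here, so the sketch below assumes a JSON list of records with integer `moe` and `expert` keys. Both the schema and the sample records are made up for illustration and are not real training output.

```python
import json
from collections import Counter


def utilisation(records, key):
    """Share of tokens routed to each value of `key` (e.g. 'moe' or 'expert')."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}


def bar_chart(shares, width=30):
    """Render shares as a plain-text bar chart, one line per entry."""
    lines = []
    for name, share in shares.items():
        bar = "#" * round(share * width)
        lines.append(f"{name:>4} | {bar} {share:.1%}")
    return "\n".join(lines)


# Made-up records in the ASSUMED schema; replace with
# json.load(open("model/routing_log.json")) once the real format is known.
sample_log = json.loads(
    '[{"moe": 1, "expert": 3}, {"moe": 1, "expert": 3},'
    ' {"moe": 2, "expert": 7}, {"moe": 1, "expert": 5}]'
)
print(bar_chart(utilisation(sample_log, "moe")))
```

A heavily skewed chart would suggest that a few blocks are absorbing most tokens, which is worth checking given that routing is hard top-1 with no auxiliary load-balancing loss.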
---
## Paper
A full write-up of the architecture, sparsity analysis, and experiments is included as `himoe_paper.pdf`.
---
## Citation
```
@misc{himoe2026,
title = {HiMoE: Hierarchical Mixture of Experts for Efficient Large-Scale Language Modelling},
author = {AG},
year = {2026}
}
```