---
license: mit
language:
- en
---
<img src="himoe_visual.png">

# HiMoE: Hierarchical Mixture of Experts

> *A Matryoshka-inspired two-level routing architecture for efficient large-scale language modelling.*

**Author:** AG &nbsp;·&nbsp; **Year:** 2026

---

## Overview

HiMoE replaces the standard feed-forward network (FFN) in each Transformer block with a hierarchical routing system. A **Level-1 router** selects one of N MoE blocks; that block's own **Level-2 router** then selects one of its M local experts. Each layer therefore activates exactly one expert per token, regardless of total model size.

```
Token
  └─► Level-1 Router  (1 of 6 MoE blocks)
          └─► Level-2 Router  (1 of 8 experts)
                  └─► Expert FFN  ──► output
```

With the default config (N=6, M=8, 2 layers) the model holds **~52M parameters** but activates only **~3.3% per token**, the compute footprint of a ~1.7M dense model.
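
The two-level selection described above can be illustrated with a dependency-free toy in plain Python. This is only a sketch of the forward routing decision, not the code in `train_himoe.py`: the class `ToyHiMoE`, its helpers, the bias-free gates, and the use of a single linear map as a stand-in for each expert FFN are all assumptions made for illustration.

```python
import random

def argmax(values):
    """Index of the largest value (the hard top-1 choice)."""
    return max(range(len(values)), key=values.__getitem__)

def linear(weights, x):
    """Bias-free linear map: one output score per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as a list of lists."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

class ToyHiMoE:
    """Hypothetical two-level hard top-1 router over num_moes x num_experts experts."""

    def __init__(self, n_embd=8, num_moes=6, num_experts=8, seed=0):
        rng = random.Random(seed)
        # Level-1 gate: one score per MoE block.
        self.level1 = make_matrix(num_moes, n_embd, rng)
        # One Level-2 gate per MoE block: one score per local expert.
        self.level2 = [make_matrix(num_experts, n_embd, rng) for _ in range(num_moes)]
        # Each toy "expert" is a single square linear map standing in for an FFN.
        self.experts = [
            [make_matrix(n_embd, n_embd, rng) for _ in range(num_experts)]
            for _ in range(num_moes)
        ]

    def forward(self, x):
        moe = argmax(linear(self.level1, x))          # pick 1 of num_moes blocks
        expert = argmax(linear(self.level2[moe], x))  # pick 1 of num_experts inside it
        y = linear(self.experts[moe][expert], x)      # only this expert is evaluated
        return y, (moe, expert)

layer = ToyHiMoE()
token = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75, 0.1, 0.9]
output, route = layer.forward(token)
print("routed to MoE block", route[0], "expert", route[1])
```

Only the selected block's Level-2 gate and a single expert are evaluated, which is why per-token compute stays flat as N and M grow.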

---

## Repository Structure

```
.
├── train_himoe.py       # Full training script (self-contained)
├── hamlet.txt           # Training corpus (place here before running)
├── README.md
└── model/               # Created automatically on first save
    ├── config.json                  # Hyperparameters + vocab snapshot
    ├── backbone.pt                  # Embeddings, attention, LN, LM head
    ├── main_router.pt               # Level-1 gate  (or layer_01_main_router.pt for n_layer > 1)
    ├── moe_expert_001/
    │   ├── router.pt                # Level-2 gate for this MoE block
    │   ├── model_001.pt
    │   ├── model_002.pt
    │   └── ...  (model_008.pt)
    ├── moe_expert_002/
    │   └── ...
    ├── ...
    ├── moe_expert_006/
    ├── sample.txt                   # Generated text after training
    └── routing_log.json             # Expert attribution for first 50 tokens
```

Each learnable component lives in its own file, making it straightforward to hot-swap, quantise, or fine-tune individual experts without touching the rest of the model.

---

## Quickstart

### 1. Install dependencies

```bash
pip install torch
```

There are no other dependencies; everything else comes from the Python standard library.

### 2. Add your data

Place `hamlet.txt` (or any plain-text corpus) in the same directory as `train_himoe.py`.

### 3. Train

```bash
python train_himoe.py
```

Checkpoints are saved to `model/` every `eval_interval` steps and at the end of training. A sample generation and routing log are written automatically.

### 4. Resume training

```bash
python train_himoe.py --resume
```

### 5. Custom config

All hyperparameters are overridable from the command line:

```bash
python train_himoe.py \
  --num_moes 8 \
  --num_experts 16 \
  --n_embd 512 \
  --n_layer 4 \
  --max_iters 10000 \
  --lr 2e-4 \
  --data_file my_corpus.txt \
  --model_dir checkpoints/run_01
```

---

## Architecture

### HiMoEConfig defaults

| Parameter | Default | Description |
|---|---|---|
| `n_embd` | 256 | Embedding / hidden dimension |
| `n_layer` | 2 | Number of Transformer layers |
| `n_head` | 4 | Attention heads |
| `block_size` | 128 | Context window (tokens) |
| `num_moes` | 6 | Level-1 choices (MoE blocks) |
| `num_experts` | 8 | Level-2 choices per MoE block |
| `dropout` | 0.1 | Dropout rate |
| `batch_size` | 32 | Training batch size |
| `max_iters` | 3000 | Training steps |
| `lr` | 3e-4 | Peak learning rate |
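
The defaults in the table, together with the command-line overrides shown in the Quickstart, could be wired up roughly as follows. This is a sketch under assumptions, not the script's actual code: the dataclass layout and the helper `parse_config` are hypothetical, and the `--data_file`, `--model_dir`, and `--resume` flags are left out for brevity.

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class HiMoEConfig:
    # Field names and defaults copied from the table above.
    n_embd: int = 256
    n_layer: int = 2
    n_head: int = 4
    block_size: int = 128
    num_moes: int = 6
    num_experts: int = 8
    dropout: float = 0.1
    batch_size: int = 32
    max_iters: int = 3000
    lr: float = 3e-4

def parse_config(argv=None):
    """Hypothetical helper: expose every config field as a --flag with its default."""
    parser = argparse.ArgumentParser(description="HiMoE config sketch")
    for f in fields(HiMoEConfig):
        # Use the default's type (int or float) to convert the command-line string.
        parser.add_argument(f"--{f.name}", type=type(f.default), default=f.default)
    args = parser.parse_args(argv)
    return HiMoEConfig(**vars(args))

cfg = parse_config(["--num_moes", "8", "--num_experts", "16", "--lr", "2e-4"])
print(cfg)
```

Generating the flags from the dataclass keeps the table of defaults and the CLI in one place, so a new hyperparameter only has to be declared once.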

### Sparsity

| Routing Level | Active | Total | % Active |
|---|---|---|---|
| Level-1 (MoE blocks) | 1 | 6 | 16.7% |
| Level-2 (experts) | 1 | 48 | 2.1% |
| **Full model (params)** | **~1.7M** | **~52M** | **~3.3%** |
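
The parameter figures in the table can be sanity-checked with a back-of-the-envelope count. The sketch below rests on assumptions the README does not state: each expert is a two-layer FFN with 4x expansion and biases, attention uses four `n_embd x n_embd` projections with biases, each block has two LayerNorms, and the character vocabulary has 65 entries. The function name `himoe_param_estimate` is hypothetical.

```python
def ffn_params(d, expansion=4):
    """Two linear layers with biases: d -> expansion*d -> d (assumed expert shape)."""
    hidden = expansion * d
    return (d * hidden + hidden) + (hidden * d + d)

def himoe_param_estimate(n_embd=256, n_layer=2, num_moes=6, num_experts=8,
                         block_size=128, vocab_size=65):
    """Return (total, active_per_token) parameter counts under the stated assumptions."""
    expert = ffn_params(n_embd)
    attn = 4 * n_embd * n_embd + 4 * n_embd          # q, k, v, output projections
    norms = 2 * (2 * n_embd)                         # two LayerNorms per block
    l1_router = n_embd * num_moes + num_moes         # Level-1 gate
    l2_router = n_embd * num_experts + num_experts   # one Level-2 gate
    # Tied token embedding / LM head, positional embedding, final LayerNorm.
    shared = vocab_size * n_embd + block_size * n_embd + 2 * n_embd

    per_layer_total = attn + norms + l1_router + num_moes * (l2_router + num_experts * expert)
    per_layer_active = attn + norms + l1_router + l2_router + expert

    total = shared + n_layer * per_layer_total
    active = shared + n_layer * per_layer_active
    return total, active

total, active = himoe_param_estimate()
print(f"total  ~ {total / 1e6:.1f}M")
print(f"active ~ {active / 1e6:.1f}M ({active / total:.1%} of total)")
```

Under these assumptions the estimate comes out at roughly 51M total and 1.6M active parameters (about 3.2%), in the same range as the table; the exact figures depend on details of the real model that this sketch only guesses at.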

### Checkpoint layout for multi-layer models

When `n_layer > 1`, routers and expert directories are prefixed by layer:

```
model/
  layer_01_main_router.pt
  layer_01_moe_expert_001/
  layer_01_moe_expert_002/
  ...
  layer_02_main_router.pt
  layer_02_moe_expert_001/
  ...
```
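
The naming scheme can be captured in a few path helpers, which is handy when loading or swapping individual components. The function names below are hypothetical, indices are taken to be 1-based as in the listings, and the sketch assumes that layer-prefixed expert directories contain the same `router.pt` / `model_NNN.pt` files as the single-layer layout.

```python
from pathlib import PurePosixPath

def _prefix(layer, n_layer):
    """Layer prefix is only used when the model has more than one layer."""
    return f"layer_{layer:02d}_" if n_layer > 1 else ""

def main_router_path(model_dir, layer, n_layer):
    """Level-1 gate for a given layer."""
    return PurePosixPath(model_dir) / f"{_prefix(layer, n_layer)}main_router.pt"

def moe_dir(model_dir, layer, moe, n_layer):
    """Directory holding one MoE block (its Level-2 gate and experts)."""
    return PurePosixPath(model_dir) / f"{_prefix(layer, n_layer)}moe_expert_{moe:03d}"

def local_router_path(model_dir, layer, moe, n_layer):
    """Level-2 gate inside a MoE block."""
    return moe_dir(model_dir, layer, moe, n_layer) / "router.pt"

def expert_path(model_dir, layer, moe, expert, n_layer):
    """Weights of a single expert."""
    return moe_dir(model_dir, layer, moe, n_layer) / f"model_{expert:03d}.pt"

print(expert_path("model", layer=2, moe=3, expert=5, n_layer=2))
print(main_router_path("model", layer=1, n_layer=1))
```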

---

## Training Details

- **Optimiser:** AdamW with weight decay 0.1 on matrix parameters, 0.0 on biases and norms
- **LR schedule:** Cosine decay with 100-step linear warmup, minimum LR = 10% of peak
- **Gradient clipping:** 1.0
- **Weight tying:** Token embedding matrix and LM head share weights
- **Routing:** Hard top-1 at both levels (no auxiliary load-balancing loss required)
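
The learning-rate schedule listed above can be written as a small pure function. This is one common way to implement warmup plus cosine decay, not necessarily the exact form used in `train_himoe.py`: the function name `lr_at`, the `(step + 1) / warmup_steps` warmup ramp, and the choice of `max_iters` as the decay horizon are assumptions.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=100, max_iters=3000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    min_lr = peak_lr * min_ratio
    if step < warmup_steps:
        # Linear ramp; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_iters:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (max_iters - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + (peak_lr - min_lr) * cosine

for step in (0, 99, 100, 1550, 3000):
    print(step, f"{lr_at(step):.2e}")
```

In a training loop, the returned value would typically be written into each optimiser parameter group's `lr` entry before the step is taken.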

---

## Modular Deployment

Because every component is a separate file, you can:

**Load only what you need:**
```python
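import torch

# Load just one expert for inspection or fine-tuning.
# This path is the single-layer layout; multi-layer checkpoints add a layer_NN_ prefix.
# map_location="cpu" lets the file load on a machine without the original device.
expert_weights = torch.load("model/moe_expert_003/model_005.pt", map_location="cpu")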
```

**Swap a router:**
```python
# new_router is a placeholder for your replacement Level-2 gate (an nn.Module)
torch.save(new_router.state_dict(), "model/moe_expert_003/router.pt")
```

**Fine-tune a single MoE block** without touching the backbone or other experts.

**Add a new expert** by saving a new `model_009.pt` and retraining only the corresponding Level-2 router, whose output must grow by one entry to cover the new expert.

---

## Output Files

After training completes:

| File | Contents |
|---|---|
| `model/sample.txt` | 400-token generation from a blank context |
| `model/routing_log.json` | Per-token (MoE, expert) routing decisions for the first 50 generated tokens |
| `model/config.json` | Full config + vocabulary + last saved step |

The training loop also prints an **expert utilisation summary**: a bar chart in the terminal showing how evenly tokens are distributed across MoE blocks and experts.
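
A terminal bar chart of this kind can be produced from a list of routing decisions with a few lines of standard-library Python. The sketch below is an illustration, not the script's own summary code: the function `utilisation_bars` is hypothetical, and it assumes decisions are available as `(moe, expert)` pairs with 0-based indices, which may not match the layout of `routing_log.json`.

```python
from collections import Counter

def utilisation_bars(routes, num_moes=6, width=30):
    """Return one text bar per MoE block showing its share of routed tokens."""
    counts = Counter(moe for moe, _expert in routes)
    total = max(len(routes), 1)  # avoid division by zero on an empty log
    lines = []
    for moe in range(num_moes):
        share = counts.get(moe, 0) / total
        bar = "#" * round(share * width)
        lines.append(f"MoE {moe:>2} | {bar:<{width}} | {share:6.1%}")
    return lines

# Hypothetical routing decisions for four tokens.
routes = [(0, 1), (0, 3), (2, 7), (5, 0)]
for line in utilisation_bars(routes):
    print(line)
```

A heavily skewed chart (one block taking most tokens) is the usual sign of router collapse, which is worth watching for with hard top-1 routing.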

---

## Paper

A full write-up of the architecture, sparsity analysis, and experiments is included as `himoe_paper.pdf`.

---

## Citation

```
@misc{himoe2026,
  title   = {HiMoE: Hierarchical Mixture of Experts for Efficient Large-Scale Language Modelling},
  author  = {AG},
  year    = {2026}
}
```