| --- |
| license: mit |
| language: |
| - en |
| --- |
<img src="himoe_visual.png" alt="HiMoE">
|
|
| # HiMoE — Hierarchical Mixture of Experts |
|
|
| > *A Matryoshka-inspired two-level routing architecture for efficient large-scale language modelling.* |
|
|
| **Author:** AG · **Year:** 2026 |
|
|
| --- |
|
|
| ## Overview |
|
|
| HiMoE replaces the standard feed-forward network (FFN) in each Transformer block with a hierarchical routing system. A **Level-1 router** selects one of N MoE blocks; that block's own **Level-2 router** selects one of M local experts. Only a single expert is ever activated per token — regardless of total model size. |
|
|
| ``` |
| Token |
| └─► Level-1 Router (1 of 6 MoE blocks) |
| └─► Level-2 Router (1 of 8 experts) |
| └─► Expert FFN ──► output |
| ``` |
|
|
With the default configuration (N=6, M=8, 2 layers) the model holds **~52M parameters** but activates only **~3.3%** of them per token, giving it roughly the compute footprint of a ~1.7M-parameter dense model.
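
To make the routing concrete, here is a minimal sketch of the two-level hard top-1 forward pass. All names (`HierarchicalMoE`, `main_router`, `sub_routers`) and the 4x FFN expansion are illustrative assumptions, not the actual internals of `train_himoe.py`:

```python
import torch
import torch.nn as nn

class HierarchicalMoE(nn.Module):
    """Sketch of two-level hard top-1 routing (names are hypothetical)."""

    def __init__(self, n_embd=256, num_moes=6, num_experts=8):
        super().__init__()
        self.main_router = nn.Linear(n_embd, num_moes)    # Level-1 gate
        self.sub_routers = nn.ModuleList(                 # one Level-2 gate per block
            nn.Linear(n_embd, num_experts) for _ in range(num_moes))
        self.experts = nn.ModuleList(                     # num_moes x num_experts FFNs
            nn.ModuleList(
                nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                              nn.Linear(4 * n_embd, n_embd))
                for _ in range(num_experts))
            for _ in range(num_moes))

    def forward(self, x):                                 # x: (tokens, n_embd)
        out = torch.empty_like(x)
        gate1 = self.main_router(x).softmax(-1)
        moe_idx = gate1.argmax(-1)                        # pick 1 of num_moes blocks
        for m in moe_idx.unique().tolist():
            rows = (moe_idx == m).nonzero(as_tuple=True)[0]
            gate2 = self.sub_routers[m](x[rows]).softmax(-1)
            exp_idx = gate2.argmax(-1)                    # pick 1 of num_experts
            for e in exp_idx.unique().tolist():
                sel = rows[exp_idx == e]
                # Scaling by the gate probabilities keeps both routers
                # trainable despite the hard selection (a common trick,
                # assumed here rather than taken from the script).
                scale = (gate1[sel, m] * gate2[exp_idx == e, e]).unsqueeze(-1)
                out[sel] = scale * self.experts[m][e](x[sel])
        return out
```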
|
|
| --- |
|
|
| ## Repository Structure |
|
|
| ``` |
| . |
| ├── train_himoe.py # Full training script (self-contained) |
| ├── hamlet.txt # Training corpus (place here before running) |
| ├── README.md |
| └── model/ # Created automatically on first save |
| ├── config.json # Hyperparameters + vocab snapshot |
| ├── backbone.pt # Embeddings, attention, LN, LM head |
| ├── main_router.pt # Level-1 gate (or layer_01_main_router.pt for n_layer > 1) |
| ├── moe_expert_001/ |
| │ ├── router.pt # Level-2 gate for this MoE block |
| │ ├── model_001.pt |
| │ ├── model_002.pt |
| │ └── ... (model_008.pt) |
| ├── moe_expert_002/ |
| │ └── ... |
| ├── ... |
| ├── moe_expert_006/ |
| ├── sample.txt # Generated text after training |
| └── routing_log.json # Expert attribution for first 50 tokens |
| ``` |
|
|
| Each learnable component lives in its own file — making it straightforward to hot-swap, quantise, or fine-tune individual experts without touching the rest of the model. |
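
As a sketch of what this layout implies, a save routine might write each component to its own file like so (the attribute names `backbone`, `moe_blocks`, `router`, and `experts` are assumptions about the script's internals, chosen to match the tree above):

```python
import os
import torch

def save_modular(model, model_dir="model"):
    """Hypothetical per-component save matching the directory tree above."""
    os.makedirs(model_dir, exist_ok=True)
    torch.save(model.backbone.state_dict(), f"{model_dir}/backbone.pt")
    torch.save(model.main_router.state_dict(), f"{model_dir}/main_router.pt")
    for m, block in enumerate(model.moe_blocks, start=1):
        block_dir = f"{model_dir}/moe_expert_{m:03d}"
        os.makedirs(block_dir, exist_ok=True)
        torch.save(block.router.state_dict(), f"{block_dir}/router.pt")
        for e, expert in enumerate(block.experts, start=1):
            torch.save(expert.state_dict(), f"{block_dir}/model_{e:03d}.pt")
```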
|
|
| --- |
|
|
| ## Quickstart |
|
|
| ### 1. Install dependencies |
|
|
| ```bash |
| pip install torch |
| ``` |
|
|
No other third-party packages are required; everything else comes from the Python standard library.
|
|
| ### 2. Add your data |
|
|
| Place `hamlet.txt` (or any plain-text corpus) in the same directory as `train_himoe.py`. |
|
|
| ### 3. Train |
|
|
| ```bash |
| python train_himoe.py |
| ``` |
|
|
| Checkpoints are saved to `model/` every `eval_interval` steps and at the end of training. A sample generation and routing log are written automatically. |
|
|
| ### 4. Resume training |
|
|
| ```bash |
| python train_himoe.py --resume |
| ``` |
|
|
| ### 5. Custom config |
|
|
| All hyperparameters are overridable from the command line: |
|
|
| ```bash |
| python train_himoe.py \ |
| --num_moes 8 \ |
| --num_experts 16 \ |
| --n_embd 512 \ |
| --n_layer 4 \ |
| --max_iters 10000 \ |
| --lr 2e-4 \ |
| --data_file my_corpus.txt \ |
| --model_dir checkpoints/run_01 |
| ``` |
|
|
| --- |
|
|
| ## Architecture |
|
|
| ### HiMoEConfig defaults |
|
|
| | Parameter | Default | Description | |
| |---|---|---| |
| | `n_embd` | 256 | Embedding / hidden dimension | |
| | `n_layer` | 2 | Number of Transformer layers | |
| | `n_head` | 4 | Attention heads | |
| | `block_size` | 128 | Context window (tokens) | |
| | `num_moes` | 6 | Level-1 choices (MoE blocks) | |
| | `num_experts` | 8 | Level-2 choices per MoE block | |
| | `dropout` | 0.1 | Dropout rate | |
| | `batch_size` | 32 | Training batch size | |
| | `max_iters` | 3000 | Training steps | |
| | `lr` | 3e-4 | Peak learning rate | |
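
For reference, these defaults map onto a flat config object. A minimal sketch with field names mirroring the CLI flags above (the actual `HiMoEConfig` in `train_himoe.py` may differ):

```python
from dataclasses import dataclass

@dataclass
class HiMoEConfig:
    # Sketch only; mirrors the table above, not necessarily the script's class body.
    n_embd: int = 256
    n_layer: int = 2
    n_head: int = 4
    block_size: int = 128
    num_moes: int = 6
    num_experts: int = 8
    dropout: float = 0.1
    batch_size: int = 32
    max_iters: int = 3000
    lr: float = 3e-4
```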
|
|
| ### Sparsity |
|
|
| | Routing Level | Active | Total | % Active | |
| |---|---|---|---| |
| | Level-1 (MoE blocks) | 1 | 6 | 16.7% | |
| | Level-2 (experts) | 1 | 48 | 2.1% | |
| | **Full model (params)** | **~1.7M** | **~52M** | **~3.3%** | |
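
These percentages follow directly from the defaults; a quick sanity check using nothing but the table's own numbers:

```python
num_moes, num_experts = 6, 8

print(f"Level-1: 1/{num_moes} = {1 / num_moes:.1%}")                          # 16.7%
print(f"Level-2: 1/{num_moes * num_experts} = {1 / (num_moes * num_experts):.1%}")  # 2.1%
print(f"Params:  1.7/52 = {1.7 / 52:.1%}")                                    # ~3.3%
```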
|
|
| ### Checkpoint layout for multi-layer models |
|
|
| When `n_layer > 1`, routers and expert directories are prefixed by layer: |
|
|
| ``` |
| model/ |
| layer_01_main_router.pt |
| layer_01_moe_expert_001/ |
| layer_01_moe_expert_002/ |
| ... |
| layer_02_main_router.pt |
| layer_02_moe_expert_001/ |
| ... |
| ``` |
|
|
| --- |
|
|
| ## Training Details |
|
|
| - **Optimiser:** AdamW with weight decay 0.1 on matrix parameters, 0.0 on biases and norms |
| - **LR schedule:** Cosine decay with 100-step linear warmup, minimum LR = 10% of peak |
| - **Gradient clipping:** 1.0 |
| - **Weight tying:** Token embedding matrix and LM head share weights |
| - **Routing:** Hard top-1 at both levels (no auxiliary load-balancing loss required) |
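
This recipe corresponds to a fairly standard PyTorch setup. The sketch below follows the bullets (decay split, 100-step warmup, cosine floor at 10% of peak) but is a reconstruction, not the script's exact code:

```python
import math
import torch

def configure_optim(model, lr=3e-4, max_iters=3000, warmup=100):
    # Weight decay 0.1 on matrix (>= 2-D) parameters only, per the recipe above.
    decay, no_decay = [], []
    for p in model.parameters():
        if p.requires_grad:
            (decay if p.dim() >= 2 else no_decay).append(p)
    optim = torch.optim.AdamW(
        [{"params": decay, "weight_decay": 0.1},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )

    def lr_at(step):  # linear warmup, then cosine decay to 10% of peak
        if step < warmup:
            return step / warmup
        t = (step - warmup) / max(1, max_iters - warmup)
        return 0.1 + 0.45 * (1 + math.cos(math.pi * t))

    sched = torch.optim.lr_scheduler.LambdaLR(optim, lr_at)
    # In the training loop, per the recipe above:
    #   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    return optim, sched
```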
|
|
| --- |
|
|
| ## Modular Deployment |
|
|
| Because every component is a separate file, you can: |
|
|
| **Load only what you need:** |
| ```python |
import torch

# Load just one expert's weights for inspection or fine-tuning;
# map_location="cpu" avoids needing a GPU just to look at the tensors.
expert_weights = torch.load("model/moe_expert_003/model_005.pt", map_location="cpu")
| ``` |
|
|
| **Swap a router:** |
| ```python |
# new_router must match the existing gate's shape (n_embd -> num_experts logits)
torch.save(new_router.state_dict(), "model/moe_expert_003/router.pt")
| ``` |
|
|
| **Fine-tune a single MoE block** without touching the backbone or other experts. |
|
|
| **Add a new expert** by saving a new `model_009.pt` and retraining only the corresponding router. |
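
For example, growing a block from 8 to 9 experts might look like the following, assuming the experts are plain FFNs and the Level-2 gate is a single linear layer whose output width must grow by one (both assumptions about the script's internals); the router still needs retraining afterwards:

```python
import torch
import torch.nn as nn

n_embd = 256

# Save a fresh, untrained expert alongside the existing eight.
# The 4x FFN shape here is an assumption, not taken from the script.
new_expert = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
)
torch.save(new_expert.state_dict(), "model/moe_expert_003/model_009.pt")

# The Level-2 gate needs one more output logit; copy the old rows over.
old_gate = nn.Linear(n_embd, 8)
old_gate.load_state_dict(torch.load("model/moe_expert_003/router.pt"))
new_gate = nn.Linear(n_embd, 9)
with torch.no_grad():
    new_gate.weight[:8] = old_gate.weight
    new_gate.bias[:8] = old_gate.bias
torch.save(new_gate.state_dict(), "model/moe_expert_003/router.pt")
```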
|
|
| --- |
|
|
| ## Output Files |
|
|
| After training completes: |
|
|
| | File | Contents | |
| |---|---| |
| | `model/sample.txt` | 400-token generation from a blank context | |
| | `model/routing_log.json` | Per-token (MoE, expert) routing decisions for the first 50 generated tokens | |
| | `model/config.json` | Full config + vocabulary + last saved step | |
|
|
| The training loop also prints an **expert utilisation summary** — a bar chart in the terminal showing how evenly tokens are distributed across MoE blocks and experts. |
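
The routing log can also be inspected offline. A minimal sketch, assuming each record carries the chosen MoE block and expert under keys like `"moe"` and `"expert"` (check the actual file for the exact schema):

```python
import json
from collections import Counter

with open("model/routing_log.json") as f:
    log = json.load(f)

# Tally how often each (MoE block, expert) pair was selected.
counts = Counter((entry["moe"], entry["expert"]) for entry in log)
for (moe, expert), n in counts.most_common():
    print(f"MoE {moe:2d} / expert {expert:2d}: {n} tokens")
```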
|
|
| --- |
|
|
| ## Paper |
|
|
| A full write-up of the architecture, sparsity analysis, and experiments is included as `himoe_paper.pdf`. |
|
|
| --- |
|
|
| ## Citation |
|
|
| ``` |
| @misc{himoe2026, |
| title = {HiMoE: Hierarchical Mixture of Experts for Efficient Large-Scale Language Modelling}, |
| author = {AG}, |
| year = {2026} |
| } |
| ``` |