---
|
|
tags: |
|
|
- mixture-of-experts |
|
|
- moe |
|
|
- transformer |
|
|
- language-model |
|
|
- pytorch |
|
|
- conditional-computation |
|
|
datasets: |
|
|
- custom |
|
|
pipeline_tag: text-generation |
|
|
license: mit |
|
|
--- |
|
|
|
|
|
# Mixture-of-Experts Language Models |
|
|
|
|
|
A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE). |
|
|
|
|
|
## Models |
|
|
|
|
|
This repository contains two MoE architectures: |
|
|
|
|
|
### 1. Sparse MoE (Top-K Routing) |
|
|
Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute. |
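
A minimal, illustrative sketch of top-k gating (not the repository's exact implementation): the router scores every expert, keeps the two highest-scoring experts per token, and renormalizes their gate weights.

```python
import torch
import torch.nn.functional as F

def top_k_route(router_logits: torch.Tensor, k: int = 2):
    """Pick k experts per token and renormalize their gate weights.

    router_logits: (num_tokens, num_experts)
    returns: gate weights and expert indices, both (num_tokens, k)
    """
    probs = F.softmax(router_logits, dim=-1)
    gate_weights, expert_ids = probs.topk(k, dim=-1)
    gate_weights = gate_weights / gate_weights.sum(dim=-1, keepdim=True)
    return gate_weights, expert_ids
```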
|
|
|
|
|
### 2. Dynamic MoE (Confidence-Based Routing) |
|
|
Dynamically adjusts the number of experts per token based on routing confidence: "easy" tokens use fewer experts, "hard" tokens use more.
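
One plausible reading of confidence-based routing, following the referenced paper (the repository's exact rule may differ): sort the router distribution and activate the smallest set of experts whose cumulative probability reaches the threshold τ.

```python
import torch
import torch.nn.functional as F

def dynamic_route(router_logits: torch.Tensor, tau: float = 0.8):
    """Activate the smallest expert set whose cumulative router
    probability reaches tau; confident tokens stop after one expert.

    router_logits: (num_tokens, num_experts)
    returns: masked gate weights and a boolean expert mask
    """
    probs = F.softmax(router_logits, dim=-1)
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    # Keep an expert while the probability mass *before* it is < tau,
    # so the top expert is always kept and selection stops once tau is met.
    below_tau = (sorted_probs.cumsum(dim=-1) - sorted_probs) < tau
    mask = torch.zeros_like(probs).scatter(-1, order, below_tau.float()).bool()
    return probs * mask, mask
```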
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Parameter | Sparse MoE | Dynamic MoE | |
|
|
|-----------|------------|-------------| |
|
|
| Layers | 4 | 4 | |
|
|
| Hidden Dim | 512 | 512 | |
|
|
| FFN Dim | 2048 | 2048 | |
|
|
| Attention Heads | 8 | 8 | |
|
|
| Experts | 8 | 4 | |
|
|
| Routing | Top-2 | τ = 0.8 threshold |
|
|
| Context Length | 256 | 256 | |
|
|
| Vocab Size | 10,000 | 10,000 | |
|
|
|
|
|
## Architecture |
|
|
|
|
|
```
Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output

Transformer Block:
├─ RMSNorm → Multi-Head Self-Attention → Residual
└─ RMSNorm → MoE Layer → Residual

MoE Layer:
├─ Router (softmax gating)
├─ Expert Selection (Top-K or Dynamic)
└─ Weighted Expert Outputs
```
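
As a sketch of how the block wires together (assuming PyTorch ≥ 2.4 for `nn.RMSNorm`; `moe_layer` stands in for either routing variant, and the causal attention mask is omitted for brevity):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm block matching the diagram above (illustrative)."""
    def __init__(self, d_model: int, num_heads: int, moe_layer: nn.Module):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.RMSNorm(d_model)
        self.moe = moe_layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention residual
        x = x + self.moe(self.norm2(x))                    # MoE residual
        return x
```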
|
|
|
|
|
## Training |
|
|
|
|
|
Both models were trained with the following settings (sketched in code after the list):
|
|
- **Optimizer**: AdamW (β₁ = 0.9, β₂ = 0.95)
|
|
- **Learning Rate**: 3e-4 with cosine decay |
|
|
- **Warmup Steps**: 2,000 |
|
|
- **Weight Decay**: 0.1 |
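
A minimal sketch of that setup using PyTorch's built-in `LambdaLR` scheduler; the total step count is illustrative, and `model` is a stand-in for either network:

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual MoE LM

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    """Linear warmup for 2,000 steps, then cosine decay to zero."""
    warmup, total = 2_000, 50_000  # total step count is illustrative
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```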
|
|
|
|
|
### Loss Functions |
|
|
|
|
|
**Sparse MoE:** |
|
|
```
L = L_CE + α * L_balance
```
|
|
|
|
|
**Dynamic MoE:** |
|
|
```
L = L_CE + β * L_balance + γ * L_entropy
```
|
|
|
|
|
Where: |
|
|
- `L_CE`: Cross-entropy loss |
|
|
- `L_balance`: Load balancing loss (encourages uniform expert utilization; sketched below)
|
|
- `L_entropy`: Entropy regularization (encourages sparse routing) |
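
For concreteness, one common formulation of the balance term (the Switch Transformer auxiliary loss; the repository's exact form may differ):

```python
import torch

def load_balance_loss(router_probs: torch.Tensor, expert_mask: torch.Tensor):
    """Switch-style balance loss: num_experts * sum_i f_i * P_i, where f_i
    is the fraction of tokens dispatched to expert i and P_i is the mean
    router probability assigned to it. Uniform utilization minimizes it.

    router_probs: (num_tokens, num_experts) softmax outputs
    expert_mask:  (num_tokens, num_experts) 0/1 dispatch decisions
    """
    num_experts = router_probs.size(-1)
    f = expert_mask.float().mean(dim=0)  # fraction of tokens per expert
    p = router_probs.mean(dim=0)         # mean router prob per expert
    return num_experts * (f * p).sum()
```

`L_entropy` can analogously be taken as the mean entropy of `router_probs`, pushing each token's routing distribution toward a confident peak.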
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
import torch
from moe.moelm import MoeLM, DynamicMOELM

# Load Sparse MoE
sparse_model = MoeLM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=8,
    top_k=2,
)
sparse_model.load_state_dict(torch.load("sparse_moe_final.pt"))

# Load Dynamic MoE
dynamic_model = DynamicMOELM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=4,
    confidence_threshold=0.8,
)
dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt"))
```
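
This card does not document the forward signature or a tokenizer; assuming the model maps token ids of shape `(batch, seq)` to logits of shape `(batch, seq, vocab)`, greedy decoding would look roughly like:

```python
sparse_model.eval()
tokens = torch.tensor([[1]])  # placeholder start-token id
with torch.no_grad():
    for _ in range(64):
        logits = sparse_model(tokens[:, -256:])  # stay within context length
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
```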
|
|
|
|
|
## Files |
|
|
|
|
|
| File | Description | |
|
|
|------|-------------| |
|
|
| `sparse_moe_final.pt` | Sparse MoE model weights | |
|
|
| `dynamic_moe_final.pt` | Dynamic MoE model weights | |
|
|
| `sparse_moe_config.json` | Sparse MoE configuration | |
|
|
| `dynamic_moe_config.json` | Dynamic MoE configuration | |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{moe-lm-2024,
  title={Mixture-of-Experts Language Model},
  author={Chaitanya},
  year={2024},
  url={https://github.com/chaitanya/transformers-and-MOE}
}
```
|
|
|
|
|
## Reference |
|
|
|
|
|
Based on ["Harder Tasks Need More Experts: Dynamic Routing in MoE Models"](https://arxiv.org/abs/2403.07652).
|
|
|
|
|
## License |
|
|
|
|
|
MIT |