# Custom Sparse Mixture of Experts (MoE) 8x7B-Style Model
This is a custom implementation of a Sparse Mixture of Experts (MoE) Transformer model. It is designed to demonstrate efficient scaling by decoupling active parameters from total parameters using a routing mechanism.
The architecture is inspired by modern MoE models (like Mixtral/Switch Transformer) but implemented from scratch in PyTorch with a custom configuration.
## Technical Specifications
| Hyperparameter | Value | Description |
|---|---|---|
| Model Type | Sparse MoE Transformer | Decoder-only causal language model |
| Vocabulary Size | 32,000 | Based on the Mistral/Llama SentencePiece tokenizer |
| Context Window | 1,024 | Maximum sequence length (tokens) |
| Embedding Dimension | 512 | Width of the model (d_model) |
| Hidden Dimension | 2,048 | Internal dimension of FFN/Experts |
| Num Layers | 10 | Number of transformer blocks |
| Num Attention Heads | 16 | Multi-head attention configuration |
| Num Experts | 4 | Total experts per layer |
| Active Experts | 2 | Top-k experts selected per token |
| Parameter Count | ~170M | Total trainable parameters |
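The ~170M figure in the table can be checked with a quick back-of-envelope calculation from the other hyperparameters. The breakdown below is a sketch that assumes bias-free projections and an untied output head; the variable names are illustrative, not the model's actual module names.

```python
# Back-of-envelope parameter count from the hyperparameters in the table above.
d_model, hidden, vocab, n_layers, n_experts, max_len = 512, 2048, 32000, 10, 4, 1024

embed = vocab * d_model                  # token embedding table
pos = max_len * d_model                  # learnable positional embeddings
attn = 4 * d_model * d_model             # Q, K, V, O projections per layer
expert = 3 * d_model * hidden            # gate, up, down projections (SwiGLU)
router = d_model * n_experts             # gating linear per layer
per_layer = attn + n_experts * expert + router
lm_head = vocab * d_model                # output projection (assumed untied)

total = embed + pos + n_layers * per_layer + lm_head
print(f"{total / 1e6:.0f}M")             # 170M, matching the table
```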
## Model Architecture
This model deviates from standard dense Transformers by replacing the Feed-Forward Network (FFN) with a Sparse MoE Layer.
### 1. Gating Mechanism (The Router)
- Routing Strategy: Top-K Gating
- Selection: For every token, the router calculates a probability distribution over all 4 experts.
- Top-K: The top 2 experts with the highest probability are selected.
- Load Balancing: The outputs of the selected experts are combined using soft routing weights, so each expert's contribution reflects its gate probability.
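The routing steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the model's actual router code; the function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def top_k_route(x, router_weight, k=2):
    """Top-k gating sketch: x is (tokens, d_model), router_weight is (d_model, num_experts)."""
    logits = x @ router_weight                         # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)                  # distribution over all experts
    weights, indices = probs.topk(k, dim=-1)           # keep the top-k experts per token
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize the soft routing weights
    return weights, indices

x = torch.randn(8, 512)        # 8 tokens, d_model = 512
w = torch.randn(512, 4)        # 4 experts
weights, indices = top_k_route(x, w)
print(weights.shape, indices.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```

Each token's output is then the weighted sum of the two selected experts' outputs, using `weights` as the mixing coefficients.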
### 2. Expert Architecture (SwiGLU)
Each "Expert" is an independent Feed-Forward Network using the SwiGLU activation variant for improved performance (similar to Llama 2/3).
- Structure: `Down(SiLU(Gate(x)) * Up(x))`
- Components:
  - Gate Projection: linear map from the embedding dimension to the hidden dimension.
  - Up Projection: linear map from the embedding dimension to the hidden dimension.
  - Down Projection: linear map from the hidden dimension back to the embedding dimension.
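The expert structure above maps directly onto a small PyTorch module. This is a sketch with illustrative names (`Expert`, `gate`, `up`, `down`), assuming bias-free linear layers as in Llama-style SwiGLU blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One SwiGLU feed-forward expert: Down(SiLU(Gate(x)) * Up(x))."""
    def __init__(self, d_model=512, hidden_dim=2048):
        super().__init__()
        self.gate = nn.Linear(d_model, hidden_dim, bias=False)  # Gate projection
        self.up = nn.Linear(d_model, hidden_dim, bias=False)    # Up projection
        self.down = nn.Linear(hidden_dim, d_model, bias=False)  # Down projection

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

out = Expert()(torch.randn(3, 512))
print(out.shape)  # torch.Size([3, 512])
```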
### 3. Attention Mechanism
- Type: Causal Multi-Head Self-Attention.
- Positional Embeddings: Standard learnable positional embeddings.
- Masking: Causal masking (upper-triangular `-inf`) ensures the model cannot attend to future tokens.
- Normalization: Pre-normalization using `LayerNorm`.
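The causal mask works by adding `-inf` to every score above the diagonal, so the softmax assigns those positions exactly zero weight. A minimal sketch of this mechanism (not the model's actual attention code):

```python
import torch

T = 6  # sequence length
# Upper-triangular mask: positions j > i receive -inf, so softmax zeroes them out.
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = torch.randn(T, T)               # raw attention scores for one head
attn = torch.softmax(scores + mask, dim=-1)
# The first token can only attend to itself; all future positions get zero weight:
print(attn[0, 1:].sum().item())          # 0.0
```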
## Configuration
The model can be instantiated from the provided `config.json`. Below is the raw configuration map for reproducibility:
```json
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
```
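For programmatic use, the map parses with the standard library and can be sanity-checked before instantiating the model. A small sketch (the consistency check is an assumption about how the fields relate, not part of the shipped config):

```python
import json

# The configuration map from this README, parsed for programmatic use.
config_text = """
{ "architectures": ["MoEModel"], "model_type": "custom_moe",
  "vocab_size": 32000, "d_model": 512, "num_layers": 10,
  "num_heads": 16, "num_experts": 4, "hidden_dim": 2048,
  "max_len": 1024, "router_top_k": 2 }
"""
cfg = json.loads(config_text)

# Sanity check: the head dimension must divide evenly (512 / 16 = 32).
assert cfg["d_model"] % cfg["num_heads"] == 0
print(cfg["num_experts"], cfg["router_top_k"])  # 4 2
```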