---
library_name: pytorch
tags:
  - mixture-of-experts
  - moe
  - sparse-model
  - text-generation
  - custom-architecture
  - transformers
license: mit
---

# Custom Sparse Mixture of Experts (MoE) 8x7B-Style Model

This is a custom implementation of a Sparse Mixture of Experts (MoE) Transformer model. It is designed to demonstrate efficient scaling by decoupling active parameters from total parameters using a routing mechanism.

The architecture is inspired by modern MoE models (like Mixtral/Switch Transformer) but implemented from scratch in PyTorch with a custom configuration.

## 📊 Technical Specifications

| Hyperparameter | Value | Description |
| --- | --- | --- |
| Model Type | Sparse MoE Transformer | Decoder-only causal language model |
| Vocabulary Size | 32,000 | Based on the Mistral/Llama SentencePiece tokenizer |
| Context Window | 1,024 | Maximum sequence length (tokens) |
| Embedding Dimension | 512 | Width of the model (d_model) |
| Hidden Dimension | 2,048 | Internal dimension of the FFN experts |
| Num Layers | 10 | Number of transformer blocks |
| Num Attention Heads | 16 | Multi-head attention configuration |
| Num Experts | 4 | Total experts per layer |
| Active Experts | 2 | Top-k experts selected per token |
| Parameter Count | ~170M | Total trainable parameters |
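As a sanity check, the ~170M figure can be reproduced from the hyperparameters above. This sketch assumes bias-free linear layers, an untied LM head, and learnable positional embeddings; those details are assumptions, not stated on this card:

```python
# Rough parameter count derived from the table's hyperparameters.
# Assumptions (not stated on the card): bias-free linear layers,
# untied input embedding / LM head, learnable positional embeddings.
vocab, d_model, hidden, max_len = 32000, 512, 2048, 1024
layers, experts = 10, 4

embed = vocab * d_model + max_len * d_model                # token + positional embeddings
attn = 4 * d_model * d_model                               # Q, K, V, O projections
moe = experts * 3 * d_model * hidden + d_model * experts   # 3 projections per expert + router
lm_head = vocab * d_model

total = embed + layers * (attn + moe) + lm_head
print(f"{total / 1e6:.1f}M")  # ≈ 169.6M, matching the ~170M in the table
```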

## 🧠 Model Architecture

This model deviates from standard dense Transformers by replacing the Feed-Forward Network (FFN) with a Sparse MoE Layer.

### 1. Gating Mechanism (The Router)

- **Routing Strategy:** Top-K gating.
- **Selection:** For every token, the router computes a probability distribution over all 4 experts.
- **Top-K:** The 2 experts with the highest probabilities are selected.
- **Combination:** The outputs of the selected experts are blended using their soft routing weights, so each expert contributes in proportion to its router probability.
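The routing step above can be sketched as follows. Names like `top_k_gating` are illustrative, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def top_k_gating(x, router_weight, k=2):
    """Select the top-k experts per token and return soft routing weights.

    x:             (num_tokens, d_model) token representations
    router_weight: (d_model, num_experts) router projection matrix
    """
    logits = x @ router_weight                    # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)             # distribution over all experts
    topk_probs, topk_idx = probs.topk(k, dim=-1)  # keep the k most probable
    # Renormalize so the selected experts' weights sum to 1 per token
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_idx, topk_probs

idx, wts = top_k_gating(torch.randn(8, 512), torch.randn(512, 4), k=2)
```

Each token's final MoE output is then the sum of the selected experts' outputs, weighted by `wts`.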

### 2. Expert Architecture (SwiGLU)

Each "Expert" is an independent Feed-Forward Network using the SwiGLU activation variant for improved performance (similar to Llama 2/3).

- **Structure:** `Down(SiLU(Gate(x)) * Up(x))`
- **Components:**
  - **Gate Projection:** linear map from the embedding dimension to the hidden dimension.
  - **Up Projection:** linear map from the embedding dimension to the hidden dimension.
  - **Down Projection:** linear map from the hidden dimension back to the embedding dimension.
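A minimal sketch of one such expert, using the dimensions from the table (the class and attribute names are illustrative, and bias-free projections are an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One SwiGLU feed-forward expert: Down(SiLU(Gate(x)) * Up(x))."""
    def __init__(self, d_model=512, hidden_dim=2048):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, hidden_dim, bias=False)  # Gate
        self.up_proj = nn.Linear(d_model, hidden_dim, bias=False)    # Up
        self.down_proj = nn.Linear(hidden_dim, d_model, bias=False)  # Down

    def forward(self, x):
        # SiLU(Gate(x)) gates the Up projection; Down maps back to d_model
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

out = Expert()(torch.randn(4, 512))
```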

### 3. Attention Mechanism

- **Type:** Causal multi-head self-attention.
- **Positional Embeddings:** Standard learnable positional embeddings.
- **Masking:** A causal mask (upper-triangular `-inf`) ensures the model cannot attend to future tokens.
- **Normalization:** Pre-normalization using LayerNorm.
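The causal mask described above is an additive mask applied to the attention scores before the softmax; a minimal sketch:

```python
import torch

def causal_mask(seq_len):
    """Upper-triangular -inf mask added to attention scores before softmax.

    Position i may attend only to positions j <= i; masked entries become
    -inf so they receive zero weight after the softmax.
    """
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)  # keep -inf strictly above the diagonal

m = causal_mask(4)
```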

## 🛠️ Configuration

The model can be instantiated using the provided config.json. Below is the raw configuration map for reproducibility:

```json
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
```
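A minimal sketch of loading this configuration. How `MoEModel` consumes these fields is defined by the repository's own code and is not shown here; in practice you would read `config.json` from disk rather than an inline string:

```python
import json

# Inline copy of the config above; in practice: json.load(open("config.json"))
config = json.loads("""
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
""")

# Basic consistency checks: head dim must divide evenly, top-k <= num_experts
assert config["d_model"] % config["num_heads"] == 0   # head_dim = 32
assert config["router_top_k"] <= config["num_experts"]
```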