---
library_name: pytorch
tags:
  - mixture-of-experts
  - moe
  - sparse-model
  - text-generation
  - custom-architecture
  - transformers
license: mit
---

# Custom Sparse Mixture of Experts (MoE) 8x7B-Style Model

This is a custom implementation of a Sparse Mixture of Experts (MoE) Transformer model. It is designed to demonstrate efficient scaling by decoupling active parameters from total parameters using a routing mechanism.

The architecture is inspired by modern MoE models (like Mixtral/Switch Transformer) but implemented from scratch in PyTorch with a custom configuration.

## 📊 Technical Specifications

| Hyperparameter | Value | Description |
| --- | --- | --- |
| Model Type | Sparse MoE Transformer | Decoder-only causal language model |
| Vocabulary Size | 32,000 | Based on the Mistral/Llama SentencePiece tokenizer |
| Context Window | 1,024 | Maximum sequence length (tokens) |
| Embedding Dimension | 512 | Width of the model (d_model) |
| Hidden Dimension | 2,048 | Internal dimension of the FFN experts |
| Num Layers | 10 | Number of transformer blocks |
| Num Attention Heads | 16 | Multi-head attention configuration |
| Num Experts | 4 | Total experts per layer |
| Active Experts | 2 | Top-k experts selected per token |
| Parameter Count | ~170M | Total trainable parameters |
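As a sanity check, the ~170M figure can be reproduced from the hyperparameters above. This sketch assumes bias-free linear layers, an untied LM head, and learnable positional embeddings; those details are assumptions, not stated on this card:

```python
# Rough parameter count derived from the table's hyperparameters.
# Assumptions (not stated on the card): bias-free linear layers,
# untied input embedding / LM head, learnable positional embeddings.
vocab, d_model, hidden, max_len = 32000, 512, 2048, 1024
layers, experts = 10, 4

embed = vocab * d_model + max_len * d_model                # token + positional embeddings
attn = 4 * d_model * d_model                               # Q, K, V, O projections
moe = experts * 3 * d_model * hidden + d_model * experts   # 3 projections per expert + router
lm_head = vocab * d_model

total = embed + layers * (attn + moe) + lm_head
print(f"{total / 1e6:.1f}M")  # ≈ 169.6M, matching the ~170M in the table
```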

## 🧠 Model Architecture

This model deviates from standard dense Transformers by replacing the Feed-Forward Network (FFN) with a Sparse MoE Layer.

### 1. Gating Mechanism (The Router)

- **Routing Strategy:** Top-K gating.
- **Selection:** For every token, the router computes a probability distribution over all 4 experts.
- **Top-K:** The 2 experts with the highest probabilities are selected.
- **Combination:** The outputs of the selected experts are blended using their soft routing weights, so each expert contributes in proportion to its router probability.
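The routing step above can be sketched as follows. Names like `top_k_gating` are illustrative, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def top_k_gating(x, router_weight, k=2):
    """Select the top-k experts per token and return soft routing weights.

    x:             (num_tokens, d_model) token representations
    router_weight: (d_model, num_experts) router projection matrix
    """
    logits = x @ router_weight                    # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)             # distribution over all experts
    topk_probs, topk_idx = probs.topk(k, dim=-1)  # keep the k most probable
    # Renormalize so the selected experts' weights sum to 1 per token
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_idx, topk_probs

idx, wts = top_k_gating(torch.randn(8, 512), torch.randn(512, 4), k=2)
```

Each token's final MoE output is then the sum of the selected experts' outputs, weighted by `wts`.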

### 2. Expert Architecture (SwiGLU)

Each "Expert" is an independent Feed-Forward Network using the SwiGLU activation variant for improved performance (similar to Llama 2/3).

- **Structure:** `Down(SiLU(Gate(x)) * Up(x))`
- **Components:**
  - **Gate Projection:** linear map from the embedding dimension to the hidden dimension.
  - **Up Projection:** linear map from the embedding dimension to the hidden dimension.
  - **Down Projection:** linear map from the hidden dimension back to the embedding dimension.
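A minimal sketch of one such expert, using the dimensions from the table (the class and attribute names are illustrative, and bias-free projections are an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One SwiGLU feed-forward expert: Down(SiLU(Gate(x)) * Up(x))."""
    def __init__(self, d_model=512, hidden_dim=2048):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, hidden_dim, bias=False)  # Gate
        self.up_proj = nn.Linear(d_model, hidden_dim, bias=False)    # Up
        self.down_proj = nn.Linear(hidden_dim, d_model, bias=False)  # Down

    def forward(self, x):
        # SiLU(Gate(x)) gates the Up projection; Down maps back to d_model
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

out = Expert()(torch.randn(4, 512))
```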

### 3. Attention Mechanism

- **Type:** Causal multi-head self-attention.
- **Positional Embeddings:** Standard learnable positional embeddings.
- **Masking:** A causal mask (upper-triangular `-inf`) ensures the model cannot attend to future tokens.
- **Normalization:** Pre-normalization using LayerNorm.
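The causal mask described above is an additive mask applied to the attention scores before the softmax; a minimal sketch:

```python
import torch

def causal_mask(seq_len):
    """Upper-triangular -inf mask added to attention scores before softmax.

    Position i may attend only to positions j <= i; masked entries become
    -inf so they receive zero weight after the softmax.
    """
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)  # keep -inf strictly above the diagonal

m = causal_mask(4)
```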

## 🛠️ Configuration

The model can be instantiated using the provided config.json. Below is the raw configuration map for reproducibility:

```json
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
```
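A minimal sketch of loading this configuration. How `MoEModel` consumes these fields is defined by the repository's own code and is not shown here; in practice you would read `config.json` from disk rather than an inline string:

```python
import json

# Inline copy of the config above; in practice: json.load(open("config.json"))
config = json.loads("""
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
""")

# Basic consistency checks: head dim must divide evenly, top-k <= num_experts
assert config["d_model"] % config["num_heads"] == 0   # head_dim = 32
assert config["router_top_k"] <= config["num_experts"]
```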