
Custom Sparse Mixture of Experts (MoE) 8x7B-Style Model

This is a custom implementation of a Sparse Mixture of Experts (MoE) Transformer model. It is designed to demonstrate efficient scaling by decoupling the number of parameters active per token from the total parameter count via a learned routing mechanism.

The architecture is inspired by modern MoE models (like Mixtral/Switch Transformer) but implemented from scratch in PyTorch with a custom configuration.

πŸ“Š Technical Specifications

| Hyperparameter | Value | Description |
| --- | --- | --- |
| Model Type | Sparse MoE Transformer | Decoder-only causal language model |
| Vocabulary Size | 32,000 | Based on the Mistral/Llama SentencePiece tokenizer |
| Context Window | 1,024 | Maximum sequence length (tokens) |
| Embedding Dimension | 512 | Width of the model (d_model) |
| Hidden Dimension | 2,048 | Internal dimension of the FFN/experts |
| Num Layers | 10 | Number of transformer blocks |
| Num Attention Heads | 16 | Heads of dimension 32 each (512 / 16) |
| Num Experts | 4 | Total experts per layer |
| Active Experts | 2 | Top-k experts selected per token |
| Parameter Count | ~170M | Total trainable parameters |
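The ~170M figure can be sanity-checked from the table. The sketch below assumes untied input/output embeddings and ignores biases and LayerNorm parameters, so it is an estimate rather than the exact count:

```python
# Back-of-the-envelope parameter count from the table above.
d_model, hidden, vocab, n_layers, n_experts, max_len = 512, 2048, 32000, 10, 4, 1024

embed = vocab * d_model                      # token embeddings
pos = max_len * d_model                      # learned positional embeddings
attn = 4 * d_model * d_model                 # Q, K, V, O projections per layer
experts = n_experts * 3 * d_model * hidden   # gate/up/down per expert (SwiGLU)
router = d_model * n_experts                 # router logits per layer
lm_head = d_model * vocab                    # untied output head (assumption)

total = embed + pos + n_layers * (attn + experts + router) + lm_head
print(f"~{total / 1e6:.0f}M parameters")     # ~170M
```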

🧠 Model Architecture

This model deviates from standard dense Transformers by replacing the Feed-Forward Network (FFN) with a Sparse MoE Layer.

1. Gating Mechanism (The Router)

  • Routing Strategy: Top-K Gating
  • Selection: For every token, the router calculates a probability distribution over all 4 experts.
  • Top-K: The top 2 experts with the highest probability are selected.
  • Weighting: The outputs of the selected experts are combined using their renormalized (soft) routing weights, so each chosen expert contributes in proportion to its router probability.
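The routing steps above can be sketched as follows. Shapes and function names are illustrative, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def top_k_route(router_logits: torch.Tensor, k: int = 2):
    """Top-K gating: pick k experts per token and renormalize their weights.

    router_logits: (num_tokens, num_experts) raw scores from the router.
    """
    probs = F.softmax(router_logits, dim=-1)               # distribution over all experts
    weights, indices = probs.topk(k, dim=-1)               # top-k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # soft routing weights
    return weights, indices

logits = torch.randn(6, 4)        # 6 tokens, 4 experts
w, idx = top_k_route(logits, k=2)
```

Each token's output is then the weighted sum of its `k` selected experts' outputs, using `w` as the mixing coefficients.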

2. Expert Architecture (SwiGLU)

Each "Expert" is an independent Feed-Forward Network using the SwiGLU activation variant for improved performance (similar to Llama 2/3).

  • Structure: Down(SiLU(Gate(x)) * Up(x))
  • Components:
    • Gate Projection: Linear transform, 512 → 2,048 (embedding to hidden dimension).
    • Up Projection: Linear transform, 512 → 2,048.
    • Down Projection: Linear transform, 2,048 → 512, back to the embedding dimension.
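A minimal sketch of one expert, assuming bias-free projections as in Llama-style FFNs (class and attribute names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN computing Down(SiLU(Gate(x)) * Up(x))."""

    def __init__(self, d_model: int = 512, hidden_dim: int = 2048):
        super().__init__()
        self.gate = nn.Linear(d_model, hidden_dim, bias=False)  # gate projection
        self.up = nn.Linear(d_model, hidden_dim, bias=False)    # up projection
        self.down = nn.Linear(hidden_dim, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(3, 512)       # 3 tokens
y = SwiGLUExpert()(x)         # shape preserved: (3, 512)
```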

3. Attention Mechanism

  • Type: Causal Multi-Head Self-Attention.
  • Positional Embeddings: Standard learnable positional embeddings.
  • Masking: Causal masking (Upper triangular -inf) ensures the model cannot attend to future tokens.
  • Normalization: Pre-normalization using LayerNorm.
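The causal mask described above can be built as an additive upper-triangular matrix of -inf values, added to the attention scores before the softmax:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Additive mask: -inf strictly above the diagonal, 0 elsewhere,
    so position i cannot attend to positions j > i."""
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)

m = causal_mask(4)
# m[0] allows attending only to position 0; diagonal and below stay 0.
```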

πŸ›  Configuration

The model can be instantiated using the provided config.json. Below is the raw configuration map for reproducibility:

```json
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
```
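For programmatic use, the map can be parsed and sanity-checked as below; how the MoEModel class itself consumes these fields is not shown in this card:

```python
import json

# The raw configuration map from above, parsed from a string for illustration
# (in practice you would read the repository's config.json from disk).
raw = """
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
"""
cfg = json.loads(raw)
assert cfg["router_top_k"] <= cfg["num_experts"]  # top-k cannot exceed expert count
assert cfg["d_model"] % cfg["num_heads"] == 0     # head dim must divide evenly (512 / 16 = 32)
```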