Instructions to use harshhmaniya/Custom-MoE-Mixtral-based with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use harshhmaniya/Custom-MoE-Mixtral-based with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="harshhmaniya/Custom-MoE-Mixtral-based", trust_remote_code=True)

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("harshhmaniya/Custom-MoE-Mixtral-based", trust_remote_code=True, dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use harshhmaniya/Custom-MoE-Mixtral-based with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "harshhmaniya/Custom-MoE-Mixtral-based"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "harshhmaniya/Custom-MoE-Mixtral-based",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker

```shell
docker model run hf.co/harshhmaniya/Custom-MoE-Mixtral-based
```
- SGLang
How to use harshhmaniya/Custom-MoE-Mixtral-based with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "harshhmaniya/Custom-MoE-Mixtral-based" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "harshhmaniya/Custom-MoE-Mixtral-based",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "harshhmaniya/Custom-MoE-Mixtral-based" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "harshhmaniya/Custom-MoE-Mixtral-based",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use harshhmaniya/Custom-MoE-Mixtral-based with Docker Model Runner:
```shell
docker model run hf.co/harshhmaniya/Custom-MoE-Mixtral-based
```
Custom Sparse Mixture of Experts (MoE) 8x7B-Style Model
This is a custom implementation of a Sparse Mixture of Experts (MoE) Transformer model. It is designed to demonstrate efficient scaling by decoupling active parameters from total parameters using a routing mechanism.
The architecture is inspired by modern MoE models (like Mixtral/Switch Transformer) but implemented from scratch in PyTorch with a custom configuration.
Technical Specifications
| Hyperparameter | Value | Description |
|---|---|---|
| Model Type | Sparse MoE Transformer | Decoder-only causal language model |
| Vocabulary Size | 32,000 | Based on Mistral/Llama sentencepiece tokenizer |
| Context Window | 1,024 | Maximum sequence length (tokens) |
| Embedding Dimension | 512 | Width of the model (d_model) |
| Hidden Dimension | 2,048 | Internal dimension of FFN/Experts |
| Num Layers | 10 | Number of transformer blocks |
| Num Attention Heads | 16 | Multi-head attention configuration |
| Num Experts | 4 | Total experts per layer |
| Active Experts | 2 | Top-k experts selected per token |
| Parameter Count | ~170M | Total trainable parameters |
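As a sanity check, the ~170M total can be roughly reproduced from the table above. This is a back-of-the-envelope estimate under two assumptions not stated in the card: an untied output head, and negligible bias/LayerNorm parameters.

```python
# Rough parameter count from the table's hyperparameters
d_model, vocab, max_len = 512, 32000, 1024
hidden, layers, experts = 2048, 10, 4

emb = vocab * d_model + max_len * d_model   # token + learnable positional embeddings
attn = 4 * d_model * d_model                # Q, K, V, O projections
moe = experts * 3 * d_model * hidden        # gate/up/down per expert
router = d_model * experts                  # router logits over experts
per_layer = attn + moe + router
lm_head = vocab * d_model                   # assumed untied output head

total = emb + layers * per_layer + lm_head
print(f"{total / 1e6:.1f}M")                # ≈ 170M, matching the table
```

Note that with top-2 routing only 2 of the 4 experts run per token, so the active parameter count per forward pass is considerably lower than the total.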
Model Architecture
This model deviates from standard dense Transformers by replacing the Feed-Forward Network (FFN) with a Sparse MoE Layer.
1. Gating Mechanism (The Router)
- Routing Strategy: Top-K Gating
- Selection: For every token, the router calculates a probability distribution over all 4 experts.
- Top-K: The top 2 experts with the highest probability are selected.
- Load Balancing: The selected experts' outputs are combined using soft routing weights (the renormalized softmax probabilities), so each expert contributes in proportion to its router score rather than uniformly.
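The routing steps above can be sketched in plain PyTorch. This is a minimal illustration of top-k gating; the names (`router`, `topk_idx`, etc.) are hypothetical and need not match the model's actual custom code.

```python
import torch
import torch.nn.functional as F

d_model, num_experts, top_k = 512, 4, 2
router = torch.nn.Linear(d_model, num_experts)   # one score per expert

x = torch.randn(3, d_model)                      # 3 tokens
logits = router(x)                               # (3, 4)
probs = F.softmax(logits, dim=-1)                # distribution over all 4 experts

# Select the top-2 experts per token and renormalize their weights to sum to 1
topk_probs, topk_idx = probs.topk(top_k, dim=-1)
weights = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
```

Each token's output is then the weighted sum of the two selected experts' outputs, using `weights` as the mixing coefficients.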
2. Expert Architecture (SwiGLU)
Each "Expert" is an independent Feed-Forward Network using the SwiGLU activation variant for improved performance (similar to Llama 2/3).
- Structure: `Down(SiLU(Gate(x)) * Up(x))`
- Components:
  - Gate Projection: Linear transform to hidden dimension.
  - Up Projection: Linear transform to hidden dimension.
  - Down Projection: Linear transform back to embedding dimension.
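The expert structure above can be sketched as a standalone PyTorch module. The class and attribute names here are illustrative, not taken from the model's actual implementation, and bias-free projections are an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN: Down(SiLU(Gate(x)) * Up(x))."""
    def __init__(self, d_model=512, hidden_dim=2048):
        super().__init__()
        self.gate = nn.Linear(d_model, hidden_dim, bias=False)  # gate projection
        self.up = nn.Linear(d_model, hidden_dim, bias=False)    # up projection
        self.down = nn.Linear(hidden_dim, d_model, bias=False)  # back to d_model

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

expert = SwiGLUExpert()
y = expert(torch.randn(2, 512))   # output keeps the embedding dimension
```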
3. Attention Mechanism
- Type: Causal Multi-Head Self-Attention.
- Positional Embeddings: Standard learnable positional embeddings.
- Masking: Causal masking (upper-triangular `-inf`) ensures the model cannot attend to future tokens.
- Normalization: Pre-normalization using `LayerNorm`.
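The causal mask described above can be illustrated in a few lines of PyTorch (a minimal sketch, independent of the model's actual code): adding `-inf` above the diagonal before the softmax zeroes out all attention to future positions.

```python
import torch

T = 6  # sequence length
# Upper-triangular -inf mask (strictly above the diagonal)
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = torch.randn(T, T) + mask          # masked attention scores
weights = torch.softmax(scores, dim=-1)    # rows sum to 1; future positions get 0
```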
Configuration
The model can be instantiated using the provided config.json. Below is the raw configuration map for reproducibility:
```json
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
```