# Custom Sparse Mixture of Experts (MoE) 8x7B-Style Model
This is a custom implementation of a Sparse Mixture of Experts (MoE) Transformer model. It is designed to demonstrate efficient scaling by decoupling active parameters from total parameters using a routing mechanism.
The architecture is inspired by modern MoE models (like Mixtral/Switch Transformer) but implemented from scratch in PyTorch with a custom configuration.
## Technical Specifications
| Hyperparameter | Value | Description |
|---|---|---|
| Model Type | Sparse MoE Transformer | Decoder-only causal language model |
| Vocabulary Size | 32,000 | Based on the Mistral/Llama SentencePiece tokenizer |
| Context Window | 1,024 | Maximum sequence length (tokens) |
| Embedding Dimension | 512 | Width of the model (d_model) |
| Hidden Dimension | 2,048 | Internal dimension of FFN/Experts |
| Num Layers | 10 | Number of transformer blocks |
| Num Attention Heads | 16 | Multi-head attention configuration |
| Num Experts | 4 | Total experts per layer |
| Active Experts | 2 | Top-k experts selected per token |
| Parameter Count | ~170M | Total trainable parameters |
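The ~170M figure in the table can be checked with a quick back-of-envelope calculation from the other hyperparameters. The breakdown below is a sketch that assumes bias-free projections and an untied output head; the variable names are illustrative, not the model's actual module names.

```python
# Back-of-envelope parameter count from the hyperparameters in the table above.
d_model, hidden, vocab, n_layers, n_experts, max_len = 512, 2048, 32000, 10, 4, 1024

embed = vocab * d_model                  # token embedding table
pos = max_len * d_model                  # learnable positional embeddings
attn = 4 * d_model * d_model             # Q, K, V, O projections per layer
expert = 3 * d_model * hidden            # gate, up, down projections (SwiGLU)
router = d_model * n_experts             # gating linear per layer
per_layer = attn + n_experts * expert + router
lm_head = vocab * d_model                # output projection (assumed untied)

total = embed + pos + n_layers * per_layer + lm_head
print(f"{total / 1e6:.0f}M")             # 170M, matching the table
```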
## Model Architecture
This model deviates from standard dense Transformers by replacing the Feed-Forward Network (FFN) with a Sparse MoE Layer.
### 1. Gating Mechanism (The Router)
- Routing Strategy: Top-K Gating
- Selection: For every token, the router calculates a probability distribution over all 4 experts.
- Top-K: The top 2 experts with the highest probability are selected.
- Load Balancing: The outputs of the selected experts are combined using soft routing weights, so each expert's contribution reflects its gate probability.
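The routing steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the model's actual router code; the function and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def top_k_route(x, router_weight, k=2):
    """Top-k gating sketch: x is (tokens, d_model), router_weight is (d_model, num_experts)."""
    logits = x @ router_weight                         # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)                  # distribution over all experts
    weights, indices = probs.topk(k, dim=-1)           # keep the top-k experts per token
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize the soft routing weights
    return weights, indices

x = torch.randn(8, 512)        # 8 tokens, d_model = 512
w = torch.randn(512, 4)        # 4 experts
weights, indices = top_k_route(x, w)
print(weights.shape, indices.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```

Each token's output is then the weighted sum of the two selected experts' outputs, using `weights` as the mixing coefficients.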
### 2. Expert Architecture (SwiGLU)
Each "Expert" is an independent Feed-Forward Network using the SwiGLU activation variant for improved performance (similar to Llama 2/3).
- Structure: `Down(SiLU(Gate(x)) * Up(x))`
- Components:
  - Gate Projection: linear map from the embedding dimension to the hidden dimension.
  - Up Projection: linear map from the embedding dimension to the hidden dimension.
  - Down Projection: linear map from the hidden dimension back to the embedding dimension.
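The expert structure above maps directly onto a small PyTorch module. This is a sketch with illustrative names (`Expert`, `gate`, `up`, `down`), assuming bias-free linear layers as in Llama-style SwiGLU blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One SwiGLU feed-forward expert: Down(SiLU(Gate(x)) * Up(x))."""
    def __init__(self, d_model=512, hidden_dim=2048):
        super().__init__()
        self.gate = nn.Linear(d_model, hidden_dim, bias=False)  # Gate projection
        self.up = nn.Linear(d_model, hidden_dim, bias=False)    # Up projection
        self.down = nn.Linear(hidden_dim, d_model, bias=False)  # Down projection

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

out = Expert()(torch.randn(3, 512))
print(out.shape)  # torch.Size([3, 512])
```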
### 3. Attention Mechanism
- Type: Causal Multi-Head Self-Attention.
- Positional Embeddings: Standard learnable positional embeddings.
- Masking: Causal masking (upper-triangular `-inf`) ensures the model cannot attend to future tokens.
- Normalization: Pre-normalization using `LayerNorm`.
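The causal mask works by adding `-inf` to every score above the diagonal, so the softmax assigns those positions exactly zero weight. A minimal sketch of this mechanism (not the model's actual attention code):

```python
import torch

T = 6  # sequence length
# Upper-triangular mask: positions j > i receive -inf, so softmax zeroes them out.
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = torch.randn(T, T)               # raw attention scores for one head
attn = torch.softmax(scores + mask, dim=-1)
# The first token can only attend to itself; all future positions get zero weight:
print(attn[0, 1:].sum().item())          # 0.0
```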
## Configuration
The model can be instantiated from the provided `config.json`. Below is the raw configuration map for reproducibility:
```json
{
  "architectures": ["MoEModel"],
  "model_type": "custom_moe",
  "vocab_size": 32000,
  "d_model": 512,
  "num_layers": 10,
  "num_heads": 16,
  "num_experts": 4,
  "hidden_dim": 2048,
  "max_len": 1024,
  "router_top_k": 2
}
```
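For programmatic use, the map parses with the standard library and can be sanity-checked before instantiating the model. A small sketch (the consistency check is an assumption about how the fields relate, not part of the shipped config):

```python
import json

# The configuration map from this README, parsed for programmatic use.
config_text = """
{ "architectures": ["MoEModel"], "model_type": "custom_moe",
  "vocab_size": 32000, "d_model": 512, "num_layers": 10,
  "num_heads": 16, "num_experts": 4, "hidden_dim": 2048,
  "max_len": 1024, "router_top_k": 2 }
"""
cfg = json.loads(config_text)

# Sanity check: the head dimension must divide evenly (512 / 16 = 32).
assert cfg["d_model"] % cfg["num_heads"] == 0
print(cfg["num_experts"], cfg["router_top_k"])  # 4 2
```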