Appreciate the model drop!
But why is it only 4k? It's 2024, man, those are rookie numbers.
Haha.
Very good question. The model training concluded this June, and we have been fighting to release a detailed tech report for a long time; the release has proven to be difficult.
Meanwhile, a different version of post-training has been conducted, with a focus on multilingual and long-context ability. That model supports a 128k context and is released at https://huggingface.co/microsoft/Phi-3.5-MoE-instruct :)
@LiyuanLucasLiu I would love to try Phi 3.5 MoE Instruct and Vision locally in llama.cpp, but there has been absolutely zero movement to add support. The feature request is still open: https://github.com/ggerganov/llama.cpp/issues/9119
@YorkieOH10 I understand. It pains me as well... Meanwhile, you can try the demo at https://huggingface.co/spaces/GRIN-MoE-Demo/GRIN-MoE (not sure how long I can keep it alive).
@LiyuanLucasLiu do you know how to run this efficiently on multiple A100 GPUs? It seems that the MoE router is not using the experts efficiently across multiple GPUs, with utilization below 10%. Is there any specific setting for this in transformers?
@dtanow great question!
- With A100-80G GPUs, you should be able to run inference on one GPU. You may need to install flash-attention-2 and add _attn_implementation = 'flash_attention_2' in the config file (together with the other settings shown below). This would also greatly improve performance in the multi-GPU setting.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/GRIN-MoE",
    device_map="sequential",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
- With multiple GPUs, I would recommend converting the weights and serving the model with vLLM instead. It gives you much better throughput. We haven't had a chance to merge the code back into the vLLM repo, but it's not complicated. The only thing you need to change is the router implementation.
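For the single-GPU path above, here is a minimal sketch of picking the attention backend at load time, so the same script still runs on a machine where flash-attention-2 is not installed. The helper names pick_attn_implementation and load_grin_moe are hypothetical, and falling back to "eager" assumes the model's remote code accepts that value; the remaining arguments are copied from the snippet above.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "eager"."""
    # find_spec only checks that the package can be located; it does not import it.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: the model's remote code also supports the "eager" backend.
    return "eager"


def load_grin_moe(model_id: str = "microsoft/GRIN-MoE"):
    """Load the model with the same settings as above and a runtime-chosen attention backend."""
    # Heavy imports stay inside the function so the helper above can be used
    # on machines without torch or transformers installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="sequential",
        trust_remote_code=True,
        _attn_implementation=pick_attn_implementation(),
        torch_dtype=torch.bfloat16,
    )


if __name__ == "__main__":
    # Only report the choice here; calling load_grin_moe() downloads the full checkpoint.
    print("attention backend:", pick_attn_implementation())
```

Calling load_grin_moe() is left to the reader, since it pulls the full weights. Without flash-attention-2 the model should still load under the assumption above, but expect it to be slower.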