Appreciate the model drop!

by Nitral-AI - opened Sep 19, 2024

Discussion

Nitral-AI

Sep 19, 2024

But why is it only 4k? It's 2024, man; those are rookie numbers.

Lewdiculous

Sep 19, 2024

Haha.

LiyuanLucasLiu

Sep 19, 2024

Very good question. Model training concluded this June, and we have been fighting to release a detailed tech report for a long time; the release has proven to be difficult.

Meanwhile, a different version of post-training has been conducted, with a focus on multilingual and long-context ability. That model supports 128k and is released at https://huggingface.co/microsoft/Phi-3.5-MoE-instruct : )

YorkieOH10

Sep 19, 2024

@LiyuanLucasLiu I would love to try Phi 3.5 MoE Instruct and vision locally in llama.cpp, but there has been absolutely zero movement to add support. The feature request is still open: https://github.com/ggerganov/llama.cpp/issues/9119

LiyuanLucasLiu

Sep 19, 2024

edited Sep 25, 2024

@YorkieOH10 I understand. It pains me as well... Meanwhile, you can try the demo at https://huggingface.co/spaces/GRIN-MoE-Demo/GRIN-MoE (not sure how long I can keep it alive).

dtanow

Sep 24, 2024

@LiyuanLucasLiu do you know how to run this efficiently on multiple A100 GPUs? It seems that the MoE router is not using the experts efficiently across multiple GPUs; utilization is below 10%. Is there a specific setting for this in transformers?

LiyuanLucasLiu

Sep 24, 2024

edited Sep 25, 2024

@dtanow great question!

With A100-80G GPUs, you should be able to run inference on a single GPU. You may need to install flash-attention-2 and set _attn_implementation = 'flash_attention_2' in the config file (together with the other settings shown below). This should also greatly improve performance in the multi-GPU setting.

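import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/GRIN-MoE",
    device_map="sequential",  # fill the first GPU before spilling onto the next
    trust_remote_code=True,  # the model ships its own modeling code
    _attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype=torch.bfloat16,
)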
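
A minimal sketch of guarding the flash-attention setting, in case the package is not installed. It assumes the package is importable as `flash_attn` and that the model's remote code also accepts `"eager"` as a fallback; the helper name `pick_attn_implementation` is hypothetical and not part of transformers.

```python
import importlib.util


def pick_attn_implementation(module_name: str = "flash_attn") -> str:
    """Return "flash_attention_2" if the module is importable, else "eager"."""
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attention_2"
    # Assumption: the model's remote code supports the "eager" fallback.
    return "eager"


# Usage (not executed here, since it downloads the full model):
#   model = AutoModelForCausalLM.from_pretrained(
#       "microsoft/GRIN-MoE",
#       trust_remote_code=True,
#       _attn_implementation=pick_attn_implementation(),
#   )
```

The check only looks for the module and does not import it, so it stays cheap and does not touch the GPU.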

With multiple GPUs, I would recommend converting the weights and serving the model with vLLM instead; it gives much better throughput. We haven't had a chance to merge the code back into the vLLM repo, but it's not complicated. The only thing you need to change is the router implementation.
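
One possible reason for the low utilization reported earlier in the thread, offered as an assumption rather than a diagnosis: when transformers spreads a model's layers across GPUs with a device_map, a single forward pass visits the GPUs one after another, so only one of them is busy at any moment. Under that assumption, the average per-GPU utilization cannot exceed 1/N for N GPUs. The function name below is hypothetical.

```python
def max_average_utilization(num_gpus: int) -> float:
    """Upper bound on average per-GPU utilization when GPUs take turns.

    Assumes layers are placed on different GPUs and executed one GPU at a
    time, so the busy fractions of all GPUs sum to at most 1.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return 1.0 / num_gpus


for n in (1, 2, 4, 8):
    bound = max_average_utilization(n)
    print(f"{n} GPU(s): at most {bound:.1%} average utilization")
```

This bound applies only to layer-wise placement with one request in flight; a serving engine that uses tensor parallelism and batching is not limited in this way.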
