Instructions to use microsoft/Phi-3.5-MoE-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Phi-3.5-MoE-instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-3.5-MoE-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-MoE-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-MoE-instruct", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/Phi-3.5-MoE-instruct with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "microsoft/Phi-3.5-MoE-instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Phi-3.5-MoE-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/microsoft/Phi-3.5-MoE-instruct
```
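The curl call above can also be made from Python. Below is a minimal sketch using only the standard library; the port, model name, and response shape are taken from the steps above and assume the vLLM server is already running locally:

```python
# Minimal client for the OpenAI-compatible chat endpoint served by vLLM above.
import json
import urllib.request


def build_request(base_url, model, messages):
    """Assemble the URL and JSON body for an OpenAI-compatible chat call."""
    url = f"{base_url}/v1/chat/completions"
    body = {"model": model, "messages": messages}
    return url, body


def chat_completion(base_url, model, messages):
    """POST the request and return the parsed JSON response."""
    url, body = build_request(base_url, model, messages)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example (uncomment once the server from the previous step is running):
# reply = chat_completion("http://localhost:8000",
#                         "microsoft/Phi-3.5-MoE-instruct",
#                         [{"role": "user", "content": "What is the capital of France?"}])
# print(reply["choices"][0]["message"]["content"])
```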
- SGLang
How to use microsoft/Phi-3.5-MoE-instruct with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "microsoft/Phi-3.5-MoE-instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Phi-3.5-MoE-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-3.5-MoE-instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Phi-3.5-MoE-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use microsoft/Phi-3.5-MoE-instruct with Docker Model Runner:
```shell
docker model run hf.co/microsoft/Phi-3.5-MoE-instruct
```
need gguf
GGUF would be nice. Hopefully they'll add Phi 3.5 MOE support to llama.cpp.
I need GGUF too.
ChatLLM.cpp supports this:
```
    ________          __  __    __    __  ___  (Φ)
   / ____/ /_  ____ _/ /_/ /   / /   /  |/  /_________  ____
  / /   / __ \/ __ `/ __/ /   / /   / /|_/ // ___/ __ \/ __ \
 / /___/ / / / /_/ / /_/ /___/ /___/ /  / // /__/ /_/ / /_/ /
 \____/_/ /_/\__,_/\__/_____/_____/_/  /_(_)___/ .___/ .___/
You are served by Phi-3.5 MoE,                /_/   /_/
with 41873153344 (6.6B effect.) parameters.

You  > write a python program to calculate 10!
A.I. > Certainly! Below is a Python program that calculates the factorial of 10 (10!):

def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n - 1)

result = factorial(10)
print("The factorial of 10 is:", result)

Here's an alternative using a loop for better performance:
...
```
How can I use ChatLLM.cpp to convert microsoft/Phi-3.5-MoE-instruct to GGUF?
Mistral.rs supports this now: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3.5MOE.md
You can quantize in place with GGUF and HQQ quantization, and there is a model-topology feature for per-layer device mapping and ISQ parameterization. CUDA and Metal are supported, as well as CPU SIMD acceleration.
Built on Candle: https://github.com/huggingface/candle
@goodasdgood as of writing, llama.cpp doesn't support Phi 3.5 MoE models, so GGUF models for that wouldn't really make sense.
Mistral.rs uses, among other methods, a technique called ISQ, which enables you to quantize the model quickly, locally, and in place. You can then use our OpenAI-compatible server, Python API, or Rust API to interface with your application.
For your information, it is best to quantize the expert weights only, not the gating or attention weights. The expert weights take up most of the memory, so expert-only quantization gives very good compression.
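A back-of-the-envelope calculation shows why expert-only quantization captures almost all of the memory. The dimensions below are assumptions based on the published Phi-3.5-MoE configuration (hidden size 4096, expert FFN size 6400, 16 experts per layer, 32 layers, three matrices per SwiGLU-style expert), and the total is the parameter count quoted earlier in the thread:

```python
# Rough estimate of the fraction of parameters held by the MoE expert weights.
# Config numbers are assumed from the public Phi-3.5-MoE configuration.
hidden = 4096           # hidden size
ffn = 6400              # expert intermediate size
experts = 16            # experts per layer
layers = 32             # transformer layers
total_params = 41_873_153_344  # total reported by the ChatLLM.cpp banner above

# Each expert is a SwiGLU-style FFN with three hidden x ffn matrices.
expert_params = layers * experts * 3 * hidden * ffn
fraction = expert_params / total_params
print(f"expert weights: {expert_params / 1e9:.1f}B "
      f"({fraction:.1%} of all parameters)")
# → expert weights: 40.3B (96.2% of all parameters)
```

Everything else (attention, gating, embeddings) is only a few percent of the total, which is why quantizing it buys little memory while hurting quality.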
Not for this model specifically, but we have some studies on the quantization of MoE models. We could easily quantize models down to 3 bits (experts only) with PTQ (plain absolute min-max). With QAT (experts only), we could push it down to 2 bits. But quantizing the other parts hurts performance significantly. https://arxiv.org/abs/2310.02410
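For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of plain min-max post-training quantization applied per-tensor to a toy weight list. The function names and the per-tensor granularity are illustrative; real pipelines would apply this per-channel over the expert matrices only:

```python
# Minimal sketch of plain min-max post-training quantization (PTQ).

def quantize_minmax(weights, bits):
    """Map floats to integer levels in [0, 2**bits - 1] using min-max scaling."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2**bits - 1)
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo


def dequantize(q, scale, lo):
    """Reconstruct approximate floats from the integer levels."""
    return [level * scale + lo for level in q]


weights = [-0.42, -0.11, 0.0, 0.07, 0.23, 0.55]
q, scale, lo = quantize_minmax(weights, bits=3)  # 3 bits -> 8 levels
recon = dequantize(q, scale, lo)
max_err = max(abs(w - r) for w, r in zip(weights, recon))
assert max_err <= scale / 2 + 1e-12  # rounding error is at most half a step
```

With only 8 levels the reconstruction error per weight is bounded by half the quantization step, which is tolerable for the (numerous, redundant) expert weights but, as the paper finds, not for the gating and attention parts.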
Bump. I'm also interested in the GGUF format and want to watch this thread.
@EricB that's real quick! Awesome and thank you!!
Most people will use llama.cpp, though (or, more accurately, the programs built on it, like Oobabooga, Koboldcpp, Ollama, LM Studio, and countless others). Sadly, this model has not seen the recognition it deserves at all. Part of the reason might be that Phi's capabilities for creative writing are very weak and it is more censored than competing models, so the llama.cpp volunteers have less incentive to put the work into this more complex MoE architecture.
I would highly recommend dedicating some of the team to helping llama.cpp gain Phi Vision and MoE support.
Finally added support in llama.cpp: https://github.com/ggerganov/llama.cpp/pull/11003
And quantized models added to my collection: https://huggingface.co/collections/phymbert/phi-35-moe-instruct-gguf-676ff4882b1891292b6bd9c1
Once the PR is merged, gguf-my-repo will work nicely.