Paper: [Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone](https://arxiv.org/abs/2404.14219)
F16 GGUF conversion of microsoft/Phi-3.5-MoE-instruct with Rust bindings for llama.cpp's MoE CPU offloading functionality.
This model supports MoE CPU offloading via llama.cpp (implemented in PR #15077). Shimmy provides Rust bindings for this functionality, with the following measured performance:
| Configuration | VRAM | TPS (tokens/sec) | TTFT (time to first token) |
|---|---|---|---|
| GPU-only | 77.7 GB | 13.8 | 730 ms |
| CPU offload | 2.8 GB | 4.5 | 2,251 ms |
Trade-off: speed for memory. CPU offloading cuts VRAM usage from roughly 78 GB to under 3 GB at the cost of about 3× lower throughput and higher first-token latency, so it is best suited to VRAM-constrained scenarios where generation speed matters less than fitting the model at all.
Download the GGUF file:

```bash
huggingface-cli download MikeKuykendall/phi-3.5-moe-cpu-offload-gguf \
  --include "phi-3.5-moe-f16.gguf" \
  --local-dir ./models
```
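If you prefer to script the download, the same file can be fetched with the `huggingface_hub` Python library. A minimal sketch, assuming the repository and filename shown in the CLI command above:

```python
# Sketch: programmatic download of the same F16 GGUF via huggingface_hub.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="MikeKuykendall/phi-3.5-moe-cpu-offload-gguf",
    filename="phi-3.5-moe-f16.gguf",  # single-file F16 GGUF
    local_dir="./models",
)
print(f"Model downloaded to {model_path}")
```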
Usage with llama.cpp:

```bash
# Standard loading (requires ~80GB VRAM)
./llama-server -m phi-3.5-moe-f16.gguf -c 4096

# With MoE CPU offloading (requires ~3GB VRAM + ~80GB system RAM)
./llama-server -m phi-3.5-moe-f16.gguf -c 4096 --cpu-moe
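```

Once `llama-server` is running, it can be queried over HTTP. The sketch below assumes the server's default host and port (localhost:8080) and its `/completion` endpoint; adjust if you start the server with different `--host`/`--port` options.

```python
# Sketch: query a running llama-server instance over HTTP.
# Assumes the default localhost:8080 and the /completion endpoint;
# adjust if llama-server was started with other options.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Explain mixture of experts in simple terms",
        "n_predict": 256,  # maximum number of tokens to generate
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json().get("content"))
```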
Usage with Shimmy:

```bash
# Install Shimmy with the CUDA-enabled llama.cpp backend
cargo install --git https://github.com/Michael-A-Kuykendall/shimmy --features llama-cuda

# Standard loading
shimmy serve --model phi-3.5-moe-f16.gguf

# With MoE CPU offloading
shimmy serve --model phi-3.5-moe-f16.gguf --cpu-moe
```
```bash
# Query the API
curl http://localhost:11435/api/generate \
  -d '{
    "model": "phi-3.5-moe",
    "prompt": "Explain mixture of experts in simple terms",
    "max_tokens": 256,
    "stream": false
  }'
```
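The same request can be issued programmatically. This is a minimal sketch mirroring the curl payload above; the exact shape of the JSON response is not documented here, so the example prints the raw JSON rather than assuming field names.

```python
# Sketch: the same generate request as the curl example, issued from Python.
# The payload mirrors the curl call above; the response schema is not
# documented here, so we print the raw JSON instead of guessing field names.
import requests

resp = requests.post(
    "http://localhost:11435/api/generate",
    json={
        "model": "phi-3.5-moe",
        "prompt": "Explain mixture of experts in simple terms",
        "max_tokens": 256,
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```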
Prompt format:

```
<|system|>
You are a helpful assistant.<|end|>
<|user|>
Your question here<|end|>
<|assistant|>
```
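When calling a completion-style endpoint that does not apply the chat template for you, the raw prompt can be assembled in code. A minimal helper following the template layout shown above (the function name and default system message are illustrative):

```python
# Sketch: build a raw prompt string in the Phi-3.5 chat format shown above.
# Helper name and default system message are illustrative, not part of any API.
def build_prompt(user_message: str,
                 system_message: str = "You are a helpful assistant.") -> str:
    return (
        f"<|system|>\n{system_message}<|end|>\n"
        f"<|user|>\n{user_message}<|end|>\n"
        f"<|assistant|>\n"
    )

print(build_prompt("Explain mixture of experts in simple terms"))
```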
Benchmark details for both standard GPU loading and CPU offloading, including controlled baselines, are documented in the full validation report: Shimmy MoE CPU Offloading Technical Report.
```bibtex
@techreport{abdin2024phi,
  title={Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author={Abdin, Marah and others},
  year={2024},
  institution={Microsoft}
}
```
GGUF conversion and MoE offloading validation by MikeKuykendall