# Phi-4-multimodal-instruct GGUF Quantizations
GGUF quantizations of microsoft/Phi-4-multimodal-instruct, a 5.6B-parameter multimodal model by Microsoft supporting text, vision (image), and audio inputs.

Produced with llama.cpp build b8347 on an RTX 5090.

License: MIT (© Microsoft Corporation). These quantizations carry the same MIT license as the original model.
## Available Files

| File | Quant | Size | BPW | Best for |
|---|---|---|---|---|
| `phi4-mm-f16.gguf` | F16 | 7.17 GB | 16.0 | Re-quantization base, maximum quality |
| `phi4-mm-Q8_0.gguf` | Q8_0 | 3.90 GB | 8.0 | High-end GPU, near-lossless |
| `phi4-mm-Q4_K_M.gguf` | Q4_K_M | 2.37 GB | 5.18 | CPU / constrained VRAM, good quality |
| `mmproj-phi4-mm-f16.gguf` | F16 | 825 MB | 16.0 | Vision encoder, required for image input |
**One mmproj for all:** `mmproj-phi4-mm-f16.gguf` works with every text GGUF above. It cannot be quantized further: the CLIP feed-forward layer dimension (4304) is not divisible by 32.
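A quick sanity check of that constraint, assuming the 32-element block size that llama.cpp's block-based quantization formats pack weights into:

```python
# CLIP feed-forward dimension from the mmproj metadata.
ff_dim = 4304

# Block-quantized tensors need row lengths divisible by the 32-element
# block size; 4304 leaves a remainder, so the tensor must stay F16.
print(ff_dim % 32)        # 16
print(ff_dim % 32 == 0)   # False
```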
## VRAM Requirements

Full GPU offload (`-ngl 99`):

| Configuration | VRAM |
|---|---|
| F16 + mmproj-F16 | ~10,000 MiB |
| Q8_0 + mmproj-F16 | ~5,400 MiB |
| Q4_K_M + mmproj-F16 | ~3,500 MiB |
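The weight files themselves set a floor for these figures; the measured numbers exceed that floor by the KV cache and compute buffers, which grow with context size. A rough sketch of the floor (my own arithmetic, not a measurement):

```python
# File sizes from the table above, in decimal GB.
files_gb = {"F16": 7.17, "Q8_0": 3.90, "Q4_K_M": 2.37}
MMPROJ_MIB = 825  # vision encoder is always loaded at F16

for name, gb in files_gb.items():
    weights_mib = gb * 1000**3 / 2**20   # decimal GB -> MiB
    floor_mib = weights_mib + MMPROJ_MIB
    print(f"{name}: >= {floor_mib:.0f} MiB before KV cache / compute buffers")
```

For Q4_K_M this gives roughly 3,100 MiB of weights, consistent with the ~3,500 MiB measured once runtime buffers are added.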
## Quality Metrics

### Perplexity (wikitext-2-raw test set, context 512)

| Model | PPL | vs F16 |
|---|---|---|
| F16 (baseline) | 14.9338 ± 0.107 | — |
| Q8_0 | 14.9107 ± 0.106 | −0.15% (effectively lossless) |
| Q4_K_M | 16.3183 ± 0.121 | +9.3% |
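The "vs F16" column is the relative perplexity change and can be reproduced directly from the PPL values:

```python
f16, q8_0, q4_k_m = 14.9338, 14.9107, 16.3183

def ppl_delta_pct(quant, base=f16):
    """Relative perplexity change vs the F16 baseline, in percent."""
    return (quant - base) / base * 100

print(f"Q8_0:   {ppl_delta_pct(q8_0):+.2f}%")    # -0.15%
print(f"Q4_K_M: {ppl_delta_pct(q4_k_m):+.1f}%")  # +9.3%
```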
### Throughput (llama-bench, RTX 5090, pp512 / tg128)
| Model | Prompt (t/s) | Generation (t/s) |
|---|---|---|
| Q8_0 | 21,352 | 247 |
| Q4_K_M | 19,904 | 324 |
### Multimodal Benchmarks (lmms-eval)

VQA evaluation is in progress; results will be added when complete. Suite: MMStar, OCRBench, AI2D, MathVista, HallusionBench.
## Usage

### LM Studio (Recommended for Desktop)

LM Studio has native GGUF support, including multimodal vision; no command line is needed.
**Text-only:**

1. Open LM Studio and search for `Swicked86/phi4-mm-gguf`
2. Download `phi4-mm-Q8_0.gguf` (GPU) or `phi4-mm-Q4_K_M.gguf` (CPU / low VRAM)
3. Load the model and chat

**With vision (image input):**

1. Download both `phi4-mm-Q8_0.gguf` and `mmproj-phi4-mm-f16.gguf`
2. Load the main model in LM Studio
3. In Model Settings → Multimodal → Vision Model (mmproj), browse to `mmproj-phi4-mm-f16.gguf`
4. In Chat, click the image icon to attach a photo and ask questions about it

The mmproj file is the vision encoder; without it the model runs text-only. `mmproj-phi4-mm-f16.gguf` is compatible with all three text GGUFs.
### llama.cpp CLI

**Step 1: Download the files**

```bash
# Install huggingface-cli if needed
pip install huggingface_hub

# Text + vision (recommended)
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q8_0.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm

# CPU / low-VRAM variant
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q4_K_M.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm
```

Build llama.cpp if you haven't already:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # omit -DGGML_CUDA=ON for CPU-only
cmake --build build --config Release -j$(nproc)
```
**Step 2: Interactive multimodal chat (images + text)**

`llama-mtmd-cli` launches an interactive session. Type your prompt, or load an image first with the `/image` command:

```bash
./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 --ctx-size 8192
```

Inside the session:

```
> /image photo.jpg
Image loaded.
> What is in this image?
[model describes the image]
> /image chart.png
Image loaded.
> Summarise the trend shown in this chart.
[model analyses the chart]
> Explain the previous image again but in French.
[responds without re-loading the image]
```
Single-shot (non-interactive):

```bash
./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "Describe this image in detail." \
  -ngl 99 --threads 16 --no-display-prompt
```

CPU (no GPU):

```bash
./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q4_K_M.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "What objects are in this photo?" \
  --threads 8
```

**Text-only (no image):**

```bash
./build/bin/llama-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --ctx-size 65536 --flash-attn on --kv-offload -ngl 99 --threads 16 \
  -p "<|system|>You are a helpful assistant.<|end|><|user|>Hello<|end|><|assistant|>"
```
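When passing raw prompts like this, the Phi-4 chat template can be generated programmatically rather than typed by hand. A minimal sketch (the helper name is mine, not part of llama.cpp):

```python
def build_phi4_prompt(messages):
    """Render (role, content) pairs into the Phi-4 chat template:
    <|system|>...<|end|><|user|>...<|end|><|assistant|>
    """
    prompt = ""
    for role, content in messages:
        prompt += f"<|{role}|>{content}<|end|>"
    return prompt + "<|assistant|>"

print(build_phi4_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "Hello"),
]))
# <|system|>You are a helpful assistant.<|end|><|user|>Hello<|end|><|assistant|>
```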
### llama-server (OpenAI-compatible API)

Serves both text and vision via `/v1/chat/completions`. Useful for integrations (Open WebUI, SillyTavern, Continue.dev, etc.):

```bash
./build/bin/llama-server \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 \
  --ctx-size 8192 --parallel 4 \
  --port 8080 --host 127.0.0.1
```
Text query (curl):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4-mm",
    "messages": [{"role": "user", "content": "Explain the Pythagorean theorem."}],
    "max_tokens": 300
  }'
```

Image query (curl, base64):

```bash
IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"
```
Image query (Python, openai SDK):
import base64, httpx
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
with open("photo.jpg", "rb") as f:
image_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
model="phi4-mm",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
{"type": "text", "text": "Describe this image."},
],
}],
max_tokens=300,
)
print(response.choices[0].message.content)
### Ollama (CPU / NUC / edge)

See the `deploy/` folder for a complete Modelfile, NUC install script, and OpenClaw integration config. A minimal Modelfile:

```
FROM ./phi4-mm-Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER num_thread 8
PARAMETER num_gpu 0
PARAMETER flash_attn false
PARAMETER temperature 0.7
```
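After registering the model with `ollama create phi4-mm -f Modelfile`, it can be queried over Ollama's REST API. A sketch using only the standard library (the model name `phi4-mm` is whatever you passed to `ollama create`; the server must be running for the request to succeed):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_payload(prompt, model="phi4-mm"):
    """Non-streaming request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt):
    """POST the payload and return the generated text."""
    data = json.dumps(build_generate_payload(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# ollama_generate("Why is the sky blue?")  # requires a running Ollama server
```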
## Architecture

| Property | Value |
|---|---|
| Base model | Phi-4-Mini (3.8 B LLM backbone) |
| Total parameters | ~5.6 B |
| GGUF arch | `phi3` |
| Context length | 128 K tokens (131,072) |
| Modalities | Text, vision (CLIP-based), audio/speech |

The vision encoder (`mmproj-phi4-mm-f16.gguf`) is a CLIP-style image encoder with a projection MLP. Audio/speech capability is embedded in the base GGUF weights.
## Conversion Details

| Item | Value |
|---|---|
| Converter | llama.cpp `convert_hf_to_gguf.py` (text) + custom mmproj converter |
| llama.cpp build | b8347 / fc350fdf9 |
| Source | microsoft/Phi-4-multimodal-instruct |
| Hardware | RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04 |

Quantization commands:

```bash
# Q8_0
llama-quantize phi4-mm-f16.gguf phi4-mm-Q8_0.gguf Q8_0

# Q4_K_M
llama-quantize phi4-mm-f16.gguf phi4-mm-Q4_K_M.gguf Q4_K_M
```
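For reference, the compression these commands achieve relative to the F16 base follows directly from the file sizes listed earlier:

```python
sizes_gb = {"F16": 7.17, "Q8_0": 3.90, "Q4_K_M": 2.37}

for name, size in sizes_gb.items():
    ratio = sizes_gb["F16"] / size
    print(f"{name}: {size} GB ({ratio:.2f}x vs F16)")
# Q8_0 is ~1.84x smaller, Q4_K_M ~3.03x smaller
```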
## NUC / Edge Deployment

The Q4_K_M + `mmproj-phi4-mm-f16.gguf` combination (~3,500 MiB VRAM) fits on:

- Intel NUC 13/14 Pro (Intel Arc iGPU, 4-8 GB shared VRAM)
- Systems with 8 GB unified memory (Apple Silicon M-series, etc.)

See `deploy/` for install scripts and configuration.
## Related

- microsoft/Phi-4-multimodal-instruct: original model weights
- llama.cpp: inference engine
- lmms-eval: multimodal evaluation harness