Phi-4-multimodal-instruct GGUF Quantizations

GGUF quantizations of microsoft/Phi-4-multimodal-instruct, a 5.6B-parameter multimodal model from Microsoft that accepts text, image, and audio inputs.

Produced with llama.cpp build b8347 on an RTX 5090.

License: MIT (© Microsoft Corporation). These quantizations carry the same MIT license as the original model.


Available Files

File Quant Size BPW Best for
phi4-mm-f16.gguf F16 7.17 GB 16.0 Re-quantization base, maximum quality
phi4-mm-Q8_0.gguf Q8_0 3.90 GB 8.0 High-end GPU; near-lossless
phi4-mm-Q4_K_M.gguf Q4_K_M 2.37 GB 5.18 CPU / constrained VRAM; good quality
mmproj-phi4-mm-f16.gguf F16 825 MB 16.0 Vision encoder (required for image input)

One mmproj for all: mmproj-phi4-mm-f16.gguf works with every text GGUF above. It cannot be quantized further: the CLIP feed-forward (FF) layer dimension (4304) is not divisible by 32.
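
The divisibility constraint is easy to check: llama.cpp's block-quant formats pack weights into fixed-size blocks, and the smallest block size is 32 elements (k-quants use 256), so every quantized tensor row length must be a multiple of the block size. A quick sanity check using the FF dimension stated above:

```python
# CLIP feed-forward dimension, per the note above
ff_dim = 4304

# llama.cpp block-quant sizes: 32 for Q8_0/Q4_0-style formats, 256 for k-quants
for block in (32, 256):
    remainder = ff_dim % block
    print(f"{ff_dim} % {block} = {remainder}")  # non-zero => cannot quantize

# 4304 % 32 = 16, 4304 % 256 = 208 -- neither divides evenly,
# so the mmproj stays at F16.
```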


VRAM Requirements

Full GPU offload (-ngl 99):

Configuration VRAM
F16 + mmproj-F16 ~10,000 MiB
Q8_0 + mmproj-F16 ~5,400 MiB
Q4_K_M + mmproj-F16 ~3,500 MiB

Quality Metrics

Perplexity (wikitext-2-raw test set, context 512)

Model PPL vs F16
F16 (baseline) 14.9338 ± 0.107 -
Q8_0 14.9107 ± 0.106 -0.15% (effectively lossless)
Q4_K_M 16.3183 ± 0.121 +9.3%
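
The "vs F16" column is just the relative change in perplexity; recomputing it from the table values:

```python
f16, q8, q4 = 14.9338, 14.9107, 16.3183

def delta_pct(q, base=f16):
    # relative perplexity change vs the F16 baseline, in percent
    return (q - base) / base * 100

print(f"Q8_0:   {delta_pct(q8):+.2f}%")  # -0.15%
print(f"Q4_K_M: {delta_pct(q4):+.1f}%")  # +9.3%
```

Q8_0's perplexity is marginally below the baseline, within the ±0.1 error bars, which is why it is labeled effectively lossless.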

Throughput (llama-bench, RTX 5090, pp512 / tg128)

Model Prompt (t/s) Generation (t/s)
Q8_0 21,352 247
Q4_K_M 19,904 324
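
The two quants trade off differently: Q8_0 prefills faster, Q4_K_M decodes faster. For the pp512/tg128 workload above, total wall time favors Q4_K_M (a back-of-envelope calculation from the table, ignoring sampling and launch overhead):

```python
def total_seconds(pp_tps, tg_tps, n_prompt=512, n_gen=128):
    # wall time = prompt tokens / prefill speed + generated tokens / decode speed
    return n_prompt / pp_tps + n_gen / tg_tps

q8 = total_seconds(21352, 247)
q4 = total_seconds(19904, 324)
print(f"Q8_0:   {q8:.2f} s")  # 0.54 s
print(f"Q4_K_M: {q4:.2f} s")  # 0.42 s
```

Generation dominates at this prompt length, so the quant with the higher tg speed wins end to end.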

Multimodal Benchmarks (lmms-eval)

VQA evaluation is in progress; results will be added when complete. Suite: MMStar, OCRBench, AI2D, MathVista, HallusionBench.


Usage

LM Studio (Recommended for Desktop)

LM Studio has native GGUF support, including multimodal vision, so no command line is needed.

Text-only:

  1. Open LM Studio → search Swicked86/phi4-mm-gguf
  2. Download phi4-mm-Q8_0.gguf (GPU) or phi4-mm-Q4_K_M.gguf (CPU / low VRAM)
  3. Load the model → Chat

With vision (image input):

  1. Download both phi4-mm-Q8_0.gguf and mmproj-phi4-mm-f16.gguf
  2. Load the main model in LM Studio
  3. In Model Settings → Multimodal → Vision Model (mmproj), browse to mmproj-phi4-mm-f16.gguf
  4. In Chat, click the image icon to attach a photo and ask questions about it

The mmproj file is the vision encoder. Without it the model runs text-only.
mmproj-phi4-mm-f16.gguf is compatible with all three text GGUFs.


llama.cpp CLI

Step 1 - Download the files

# Install huggingface-cli if needed
pip install huggingface_hub

# Text + vision (recommended)
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q8_0.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm

# CPU / low-VRAM variant
huggingface-cli download Swicked86/phi4-mm-gguf phi4-mm-Q4_K_M.gguf mmproj-phi4-mm-f16.gguf --local-dir ./phi4-mm

Build llama.cpp if you haven't already:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # omit -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j$(nproc)

Step 2 - Interactive multimodal chat (images + text)

llama-mtmd-cli launches an interactive session. Type your prompt, or prefix it with an image path using /image:

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 --ctx-size 8192

Inside the session:

> /image photo.jpg
Image loaded.
> What is in this image?
[model describes the image]

> /image chart.png
Image loaded.
> Summarise the trend shown in this chart.
[model analyses the chart]

> Explain the previous image again but in French.
[responds without re-loading the image]

Single-shot (non-interactive):

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "Describe this image in detail." \
  -ngl 99 --threads 16 --no-display-prompt

CPU (no GPU):

./build/bin/llama-mtmd-cli \
  -m ./phi4-mm/phi4-mm-Q4_K_M.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  --image photo.jpg \
  -p "What objects are in this photo?" \
  --threads 8

Text-only (no image)

./build/bin/llama-cli \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --ctx-size 65536 --flash-attn on -ngl 99 --threads 16 \
  -p "<|system|>You are a helpful assistant.<|end|><|user|>Hello<|end|><|assistant|>"
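
The -p string above follows the Phi-4 chat template: each turn is wrapped as <|role|>...<|end|>, with a trailing <|assistant|> tag where generation begins. A small helper for building such prompts programmatically (a sketch; llama-cli can also apply the template stored in the GGUF metadata when run in conversation mode):

```python
def phi4_prompt(messages):
    # messages: list of {"role": ..., "content": ...} dicts,
    # formatted per the Phi-4 chat template shown in the -p string above
    out = []
    for m in messages:
        out.append(f"<|{m['role']}|>{m['content']}<|end|>")
    out.append("<|assistant|>")  # generation starts after this tag
    return "".join(out)

prompt = phi4_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
])
print(prompt)
```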

llama-server - OpenAI-compatible API

Serves both text and vision via /v1/chat/completions. Useful for integrations (Open WebUI, SillyTavern, Continue.dev, etc.):

./build/bin/llama-server \
  -m ./phi4-mm/phi4-mm-Q8_0.gguf \
  --mmproj ./phi4-mm/mmproj-phi4-mm-f16.gguf \
  -ngl 99 --threads 16 \
  --ctx-size 8192 --parallel 4 \
  --port 8080 --host 127.0.0.1

Text query (curl):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4-mm",
    "messages": [{"role": "user", "content": "Explain the Pythagorean theorem."}],
    "max_tokens": 300
  }'

Image query (curl, base64):

IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"

Image query (Python, openai SDK):

import base64, httpx
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)

Ollama (CPU / NUC / edge)

See the deploy/ folder for a complete Modelfile, NUC install script, and OpenClaw integration config.

FROM ./phi4-mm-Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER num_thread 8
PARAMETER num_gpu 0
PARAMETER flash_attn false
PARAMETER temperature 0.7
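
To register the Modelfile above with Ollama (the model name phi4-mm is illustrative; ollama create/run require a running Ollama daemon):

```shell
# Write the Modelfile shown above next to the downloaded GGUF
cat > Modelfile <<'EOF'
FROM ./phi4-mm-Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER num_thread 8
PARAMETER num_gpu 0
PARAMETER flash_attn false
PARAMETER temperature 0.7
EOF

# Then register and run it:
#   ollama create phi4-mm -f Modelfile
#   ollama run phi4-mm "Hello"
```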

Architecture

Property Value
Base model Phi-4-Mini (3.8B LLM backbone)
Total parameters ~5.6B
GGUF arch phi3
Context length 128 K tokens (131,072)
Modalities Text, Vision (CLIP-based), Audio/Speech

The vision encoder (mmproj-phi4-mm-f16.gguf) is a CLIP-style image encoder with a projection MLP. Audio/speech capability is embedded in the base GGUF weights.


Conversion Details

Item Value
Converter llama.cpp convert_hf_to_gguf.py (text) + custom mmproj converter
llama.cpp build b8347 / fc350fdf9
Source microsoft/Phi-4-multimodal-instruct
Hardware RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04

Quantization commands:

# Q8_0
llama-quantize phi4-mm-f16.gguf phi4-mm-Q8_0.gguf Q8_0

# Q4_K_M
llama-quantize phi4-mm-f16.gguf phi4-mm-Q4_K_M.gguf Q4_K_M

NUC / Edge Deployment

The Q4_K_M + mmproj-phi4-mm-f16.gguf combination (~3,500 MiB VRAM) fits on:

  • Intel NUC 13/14 Pro (Intel Arc iGPU, 4-8 GB shared VRAM)
  • Systems with 8 GB unified memory (Apple Silicon M-series, etc.)

See deploy/ for install scripts and configuration.

