⚡ Quantized Models (Q4-Q8)
Collection
Aggressively quantized: Q4_K_M, Q5_K_M, Q6_K, Q8_0, int4. Same model, fraction of the size. • 14 items
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="dispatchAI/Phi-3.5-mini-instruct-Q5-mobile")
messages = [
{"role": "user", "content": "Who are you?"},
]
pipe(messages)
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("dispatchAI/Phi-3.5-mini-instruct-Q5-mobile", dtype="auto")
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="dispatchAI/Phi-3.5-mini-instruct-Q5-mobile",
    filename="model.gguf",
)
llm.create_chat_completion(
messages = [
{
"role": "user",
"content": "What is the capital of France?"
}
]
)
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with llama.cpp:
# Install with the install script:
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
# Run inference directly in the terminal:
llama cli -hf dispatchAI/Phi-3.5-mini-instruct-Q5-mobile

# Install with WinGet (Windows):
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
# Run inference directly in the terminal:
llama cli -hf dispatchAI/Phi-3.5-mini-instruct-Q5-mobile

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
# Run inference directly in the terminal:
./llama-cli -hf dispatchAI/Phi-3.5-mini-instruct-Q5-mobile

# Build from source:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
# Run inference directly in the terminal:
./build/bin/llama-cli -hf dispatchAI/Phi-3.5-mini-instruct-Q5-mobile

# Use Docker:
docker model run hf.co/dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dispatchAI/Phi-3.5-mini-instruct-Q5-mobile"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "dispatchAI/Phi-3.5-mini-instruct-Q5-mobile",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
# Alternatively, use Docker:
docker model run hf.co/dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
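The curl call above can also be made from Python. The following is a minimal sketch using only the standard library, assuming the server is reachable at the URL from the curl example and returns the usual OpenAI-style `choices[0].message.content` response shape. The helper names `build_chat_request` and `chat` are hypothetical and are not part of vLLM or any other library.

```python
import json
import urllib.request

# Model id and base URL are taken from the curl example above.
MODEL_ID = "dispatchAI/Phi-3.5-mini-instruct-Q5-mobile"
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response body; requires a running server.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` should return the reply text. For a server listening on port 30000, pass `base_url="http://localhost:30000/v1"`.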
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "dispatchAI/Phi-3.5-mini-instruct-Q5-mobile" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "dispatchAI/Phi-3.5-mini-instruct-Q5-mobile",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
# Alternatively, run the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "dispatchAI/Phi-3.5-mini-instruct-Q5-mobile" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "dispatchAI/Phi-3.5-mini-instruct-Q5-mobile",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with Ollama:
ollama run hf.co/dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with Unsloth Studio:
# Install with the shell script:
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dispatchAI/Phi-3.5-mini-instruct-Q5-mobile to start chatting

# Install with PowerShell (Windows):
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dispatchAI/Phi-3.5-mini-instruct-Q5-mobile to start chatting

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for dispatchAI/Phi-3.5-mini-instruct-Q5-mobile to start chatting
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with Docker Model Runner:
docker model run hf.co/dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
How to use dispatchAI/Phi-3.5-mini-instruct-Q5-mobile with Lemonade:
# Download Lemonade from https://lemonade-server.ai/
lemonade pull dispatchAI/Phi-3.5-mini-instruct-Q5-mobile
lemonade run user.Phi-3.5-mini-instruct-Q5-mobile-{{QUANT_TAG}}
lemonade list
✅ WORKS. Verified June 2026.
| Prompt | Response | Correct? |
|---|---|---|
| What is the capital of France? | "The capital of France is Paris. Paris is not only the larges" | ✅ |
| What is 2+2? Just the number. | "The sum of 2 and 2 is 4. This is a basic arithmetic operatio" | ✅ |

| Attribute | Value |
|---|---|
| Base Model | microsoft/Phi-3.5-mini-instruct |
| File Size | 2685 MB |
| Format | GGUF |
| Chat Format | chatml |
| CPU Speed | 7.0 tokens/sec |
| License | mit |
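The figures in the table allow two quick back-of-the-envelope estimates: the effective bits stored per weight, and how long a reply of a given length takes at the listed CPU speed. The sketch below assumes the base model has roughly 3.8 billion parameters and that "MB" means decimal megabytes. Neither is stated in this card, so treat the results as approximate.

```python
FILE_SIZE_MB = 2685          # "File Size" from the table
TOKENS_PER_SEC = 7.0         # "CPU Speed" from the table
ASSUMED_PARAMS = 3.8e9       # assumption: approximate parameter count of the base model
ASSUMED_BYTES_PER_MB = 1e6   # assumption: decimal megabytes, not MiB


def bits_per_weight(file_size_mb, n_params, bytes_per_mb=ASSUMED_BYTES_PER_MB):
    """Average number of bits the file spends per model parameter."""
    return file_size_mb * bytes_per_mb * 8 / n_params


def generation_seconds(n_tokens, tokens_per_sec=TOKENS_PER_SEC):
    """Time to generate n_tokens at a steady decoding rate (ignores prompt processing)."""
    return n_tokens / tokens_per_sec


print(f"{bits_per_weight(FILE_SIZE_MB, ASSUMED_PARAMS):.2f} bits per weight")
print(f"{generation_seconds(50):.1f} s for a 50-token reply")
```

Under these assumptions the file works out to about 5.65 bits per weight, which is in the range expected for a Q5 variant, and a 50-token reply takes about 7.1 seconds of decoding time.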
from llama_cpp import Llama
llm = Llama(model_path="model.gguf", chat_format="chatml", n_ctx=512, n_threads=4, verbose=False)
response = llm.create_chat_completion(
messages=[{"role": "user", "content": "What is the capital of France?"}],
max_tokens=50,
)
print(response["choices"][0]["message"]["content"])
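The `chat_format="chatml"` argument tells the library to wrap the messages in ChatML markers before passing them to the model. As a rough illustration of what that prompt looks like, here is a sketch of a ChatML renderer. The function `render_chatml` is hypothetical, and the exact whitespace or system-prompt handling inside llama-cpp-python may differ.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [{"role": "user", "content": "What is the capital of France?"}]
)
print(prompt)
```

When using `create_chat_completion`, this formatting is handled by the library, so the sketch is only useful for inspecting or debugging the prompt.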
from dispatchai import load_model
model = load_model("Phi-3.5-mini-instruct-Q5-mobile", backend="gguf")
print(model.chat("Hello!"))
dispatchAI