
Phi-3.5-mini-instruct-LiteRT

Phi 3.5 Mini Instruct, a compact on-device assistant, converted for mobile and edge deployment by DuoNeural.

  • Source model: microsoft/Phi-3.5-mini-instruct
  • Format: GGUF Q4_K_M (llama.cpp-compatible)
  • Parameters: 3.8B
  • Quantization: 4-bit k-quant, medium (Q4_K_M), a strong accuracy/size balance
  • Target platforms: Android, iOS, desktop edge inference
  • Converted: 2026-05-06 by Archon / DuoNeural

Why This Model?

Phi-3.5-mini punches well above its weight class: Microsoft's 3.8B model consistently beats models 2-3× larger on reasoning benchmarks. Q4_K_M keeps the file under 2.5 GB while preserving near-full quality, making it an ideal edge model when you need real intelligence in a small footprint.

Usage

llama.cpp (CLI)

./llama-cli -m Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf \
  -n 512 --temp 0.7 \
  -p "<|system|>You are a helpful assistant.<|end|><|user|>What is the capital of France?<|end|><|assistant|>"
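
Recent llama.cpp builds can also apply the chat template stored in the GGUF metadata for you; conversation mode (the -cnv flag, assuming a 2024-or-later build) avoids hand-writing the <|system|>/<|user|> tags:

./llama-cli -m Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf -cnv --temp 0.7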

Google AI Edge / MediaPipe (Android/iOS)

This GGUF is compatible with MLC-LLM and llama.cpp Android bindings for on-device inference. For use with Google Edge Gallery, convert to .task bundle using MediaPipe LLM conversion tools.
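
For orientation, a hypothetical sketch of that MediaPipe conversion step, written against the mediapipe Python package's LLM conversion docs: the model_type value for Phi-3.5 and the exact ConversionConfig fields are assumptions to verify against your installed mediapipe version, and note the converter consumes the original safetensors checkpoint, not this GGUF.

from mediapipe.tasks.python.genai import converter

# Hypothetical invocation: field names follow the MediaPipe LLM conversion
# docs, but Phi-3.5 support and exact parameters must be verified against
# your mediapipe release. Input is the original safetensors checkpoint.
config = converter.ConversionConfig(
    input_ckpt="Phi-3.5-mini-instruct/",  # original HF checkpoint directory
    ckpt_format="safetensors",
    model_type="PHI_3_5_MINI",            # assumption: check the supported model types
    backend="gpu",
    output_dir="converted/",
    output_tflite_file="phi35-mini.bin",  # then bundled into a .task for Edge Gallery
)
converter.convert_checkpoint(config)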

Python via llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the derivative of sin(xยฒ)?"},
    ]
)
print(response["choices"][0]["message"]["content"])
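
You can also let llama-cpp-python fetch the weights straight from the Hub (this requires the huggingface_hub package) instead of pointing at a local file:

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="DuoNeural/Phi-3.5-mini-instruct-LiteRT",
    filename="Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf",
)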

Ollama

ollama run hf.co/DuoNeural/Phi-3.5-mini-instruct-LiteRT
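
Ollama pulls GGUF repos directly from the Hub; recent releases also let you pin the quantization with a tag:

ollama run hf.co/DuoNeural/Phi-3.5-mini-instruct-LiteRT:Q4_K_M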

Performance Notes

  • Quantization: Q4_K_M
  • RAM required: ~3 GB with context (the ~2.4 GB weight file plus KV cache and runtime overhead)
  • Recommended devices: phones with 6 GB+ RAM, laptops
  • Quantization loss: minimal; Phi-3.5 is robust to 4-bit quantization

Phi-3.5 Mini Highlights

  • 3.8B params, trained on 3.4T tokens
  • Strong reasoning, coding, and instruction-following
  • 128K context window (trimmed to device-safe lengths for edge)
  • One of the top 4B-class models in its generation

About the Conversion

Converted using the llama.cpp GGUF pipeline with CUDA acceleration: source weights were downloaded from Hugging Face in safetensors format, converted to an F16 GGUF, then quantized to Q4_K_M.
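
For reproducibility, the two steps look roughly like this; a sketch assuming a local llama.cpp checkout with the quantize tool built, where the checkpoint directory and file names are illustrative:

python convert_hf_to_gguf.py ./Phi-3.5-mini-instruct --outtype f16 --outfile phi35-mini-f16.gguf
./llama-quantize phi35-mini-f16.gguf Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf Q4_K_M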


DuoNeural

DuoNeural is an open AI research lab: human + AI in collaboration.

DuoNeural Research Publications

Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, and Aura (DuoNeural).

Research Team

  • Jesse: vision, hardware, direction
  • Archon: Lab Director; post-training, abliteration, experiments
  • Aura: research AI; literature synthesis, novel proposals

Subscribe to the lab newsletter at duoneural.beehiiv.com for model drops before they go anywhere else.
