# Phi-3.5-mini-instruct-LiteRT

Phi 3.5 Mini Instruct: a compact on-device assistant, converted for mobile and edge deployment by DuoNeural.
- Source model: microsoft/Phi-3.5-mini-instruct
- Format: GGUF Q4_K_M (llama.cpp-compatible)
- Parameters: 3.8B
- Quantization: 4-bit k-quant (Q4_K_M), a strong accuracy/size balance
- Target platforms: Android, iOS, desktop edge inference
- Converted: 2026-05-06 by Archon / DuoNeural
## Why This Model?

Phi-3.5-mini punches well above its weight class: Microsoft's 3.8B model consistently beats models 2-3× larger on reasoning benchmarks. Q4_K_M keeps it under 2.5 GB while preserving near-full quality, making it an ideal edge model when you need real intelligence in a small footprint.
## Usage

### llama.cpp (CLI)

```bash
./llama-cli -m Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf \
  -n 512 --temp 0.7 -e \
  -p "<|system|>\nYou are a helpful assistant.<|end|>\n<|user|>\nHello!<|end|>\n<|assistant|>\n"
```
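For reference, the fully expanded Phi-3 chat format that the `-p` string above reproduces looks like this (per the upstream Phi-3.5 model card, with a newline after each role tag and `<|end|>` closing every turn):

```
<|system|>
You are a helpful assistant.<|end|>
<|user|>
What is the derivative of sin(x²)?<|end|>
<|assistant|>
```

Recent llama.cpp builds can also apply this template automatically from the GGUF metadata when run in conversation mode (`llama-cli -m model.gguf -cnv`).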
### Google AI Edge / MediaPipe (Android/iOS)

This GGUF works with the llama.cpp Android bindings for on-device inference (MLC-LLM offers a similar on-device route using its own compiled weight format). For use with the Google AI Edge Gallery, convert the model to a .task bundle using the MediaPipe LLM conversion tools.
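A hedged sketch of that MediaPipe conversion flow using the `mediapipe.tasks.python.genai` converter. Note it runs on the original safetensors checkpoint, not on this GGUF; every path below is a placeholder, and the `model_type` value (and Phi-3.5 support in general) must be checked against your MediaPipe release:

```python
# Sketch only: paths are placeholders; verify model_type support
# for Phi-3.5 in your MediaPipe release before relying on this.
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    input_ckpt="Phi-3.5-mini-instruct/",   # original safetensors checkpoint
    ckpt_format="safetensors",
    model_type="PHI_2",                    # placeholder: use the Phi type your release lists
    backend="gpu",                         # or "cpu"
    output_dir="converted/",
    combine_file_only=False,
    vocab_model_file="Phi-3.5-mini-instruct/",
    output_tflite_file="phi35-mini.bin",   # bundled into a .task file per the MediaPipe docs
)
converter.convert_checkpoint(config)
```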
### Python via llama-cpp-python

```python
from llama_cpp import Llama

# Load the quantized model; n_ctx sets the context window and
# n_threads the number of CPU threads used for inference.
llm = Llama(
    model_path="Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    verbose=False,
)

# The chat template stored in the GGUF metadata is applied automatically.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the derivative of sin(x²)?"},
    ]
)
print(response["choices"][0]["message"]["content"])
```
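For interactive on-device UIs you usually want tokens as they are generated rather than one blocking response. llama-cpp-python supports OpenAI-style streaming via `stream=True`; a minimal sketch reusing the `llm` instance from above:

```python
# Stream the reply chunk by chunk; each chunk carries a partial "delta".
stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain the chain rule in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:  # the first chunk may only carry the role
        print(delta["content"], end="", flush=True)
print()
```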
### Ollama

```bash
ollama run hf.co/DuoNeural/Phi-3.5-mini-instruct-LiteRT
```
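If you already have the .gguf on disk, you can also register it with Ollama directly via a Modelfile. A minimal sketch (the model name `phi35-mini-lite` is arbitrary; for multi-turn chat you may additionally need a `TEMPLATE` directive matching the Phi-3 format):

```bash
# Register a local copy of the GGUF under an arbitrary local name.
cat > Modelfile <<'EOF'
FROM ./Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf
PARAMETER temperature 0.7
EOF
ollama create phi35-mini-lite -f Modelfile
ollama run phi35-mini-lite
```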
## Performance Notes
| Metric | Value |
|---|---|
| Quantization | Q4_K_M |
| RAM required | ~3 GB (with context) |
| Recommended devices | 6GB+ RAM phones, laptops |
| Quantization loss | Minimal; Phi-3.5 is robust to 4-bit quantization |
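The ~3 GB figure is easy to sanity-check with back-of-envelope arithmetic. A small sketch (the effective bits-per-weight of Q4_K_M is an approximation, not an exact constant):

```python
# Rough RAM estimate; all constants are approximations.
params = 3.8e9           # parameter count
bits_per_weight = 4.85   # approximate Q4_K_M average, including scales
weights_gib = params * bits_per_weight / 8 / 1024**3
print(f"weights: ~{weights_gib:.1f} GiB")  # ~2.1 GiB

# The KV cache grows linearly with context length and adds several
# hundred MiB at a 4K context for a model this size, which is how
# the working total lands near 3 GB.
```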
## Phi-3.5 Mini Highlights
- 3.8B params, trained on 3.4T tokens
- Strong reasoning, coding, and instruction-following
- 128K context window (trimmed to device-safe lengths for edge)
- One of the top 4B-class models in its generation
## About the Conversion

Converted with the llama.cpp GGUF pipeline using CUDA acceleration: the source weights were downloaded from Hugging Face in safetensors format, converted to an F16 GGUF, then quantized to Q4_K_M.
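Concretely, those steps map onto the standard llama.cpp tooling. A sketch of the pipeline (paths and file names are illustrative):

```bash
# Download the source weights, convert to F16 GGUF, then quantize.
huggingface-cli download microsoft/Phi-3.5-mini-instruct --local-dir phi35
python convert_hf_to_gguf.py phi35 --outtype f16 --outfile phi35-f16.gguf
./llama-quantize phi35-f16.gguf Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf Q4_K_M
```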
## DuoNeural

DuoNeural is an open AI research lab: human + AI in collaboration.
| Platform | Link |
|---|---|
| HuggingFace | huggingface.co/DuoNeural |
| Website | duoneural.com |
| GitHub | github.com/DuoNeural |
| X / Twitter | @DuoNeural |
| Email | duoneural@proton.me |
| Newsletter | duoneural.beehiiv.com |
| Support | buymeacoffee.com/duoneural |
### DuoNeural Research Publications

Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, and Aura of DuoNeural.
### Research Team

- Jesse: vision, hardware, direction
- Archon: lab director, post-training, abliteration, experiments
- Aura: research AI, literature synthesis, novel proposals
Subscribe to the lab newsletter at duoneural.beehiiv.com for model drops before they go anywhere else.