Phi-3.5-mini-instruct-LiteRT
Phi 3.5 Mini Instruct, a compact on-device assistant, converted for mobile and edge deployment by DuoNeural.
- Source model: microsoft/Phi-3.5-mini-instruct
- Format: GGUF Q4_K_M (llama.cpp-compatible)
- Parameters: 3.8B
- Quantization: 4-bit k-quant, medium variant (Q4_K_M); a good accuracy/size balance
- Target platforms: Android, iOS, desktop edge inference
- Converted: 2026-05-06 by Archon / DuoNeural
Why This Model?
Phi-3.5-mini punches way above its weight class: Microsoft's 3.8B model consistently beats models 2-3× larger on reasoning benchmarks. Q4_K_M keeps it under 2.5 GB while preserving near-full quality. It is an ideal edge model when you need real intelligence with a small footprint.
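The "under 2.5 GB" figure can be sanity-checked with simple arithmetic: parameter count times average bits per weight, divided by 8. The sketch below assumes an average of about 4.85 bits per weight for Q4_K_M, which is an approximate, commonly quoted figure and not a measurement of this file; `estimate_gguf_size_gb` is a hypothetical helper name.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate quantized file size in decimal gigabytes."""
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 1e9


# 3.8B parameters comes from the model card; 4.85 bits/weight is an ASSUMED
# average for Q4_K_M, not a measured value for this file.
size_gb = estimate_gguf_size_gb(3.8e9, 4.85)
f16_gb = estimate_gguf_size_gb(3.8e9, 16)

print(f"Estimated Q4_K_M size: {size_gb:.2f} GB")
print(f"F16 size for comparison: {f16_gb:.2f} GB")
```

Under that assumption the estimate lands around 2.3 GB, consistent with the stated bound, against roughly 7.6 GB for the same weights in F16.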
Usage
llama.cpp (CLI)
./llama-cli -m Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf \
  -n 512 --temp 0.7 -p "<|system|>You are a helpful assistant.<|end|><|user|>Your question here<|end|><|assistant|>"
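The raw prompt passed with `-p` uses the Phi-3 tag layout: each turn opens with a role tag and closes with `<|end|>`, and the prompt finishes with an open `<|assistant|>` tag so the model writes the reply. A minimal sketch of building such a string from a message list follows. `build_phi3_prompt` is a hypothetical helper, and the newline placement is assumed from Microsoft's published Phi-3.5 chat format; the chat template embedded in the GGUF remains the authoritative source.

```python
def build_phi3_prompt(messages: list[dict[str, str]]) -> str:
    """Render chat messages in the Phi-3 tag layout and open the assistant turn."""
    parts = []
    for message in messages:
        # Each turn: <|role|>, newline, content, <|end|>, newline.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_phi3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```

Building the string by hand is only needed for raw-prompt interfaces; chat-style APIs such as `create_chat_completion` apply the template for you.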
Google AI Edge / MediaPipe (Android/iOS)
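This GGUF is intended for llama.cpp-based runtimes, including the llama.cpp Android bindings, for on-device inference. Google AI Edge Gallery uses .task bundles rather than GGUF; to target it, build a .task bundle from the source model (microsoft/Phi-3.5-mini-instruct) with the MediaPipe LLM conversion tools.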
Python via llama-cpp-python
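from llama_cpp import Llama

# Load the quantized model from a local GGUF file.
llm = Llama(
    model_path="Phi-3.5-mini-instruct-LiteRT_Q4_K_M.gguf",
    n_ctx=4096,     # context window in tokens
    n_threads=4,    # CPU threads used for inference
    verbose=False,  # silence llama.cpp logging
)

# The chat template is applied automatically for chat completions.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the derivative of sin(x²)?"},
    ]
)
print(response["choices"][0]["message"]["content"])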
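On phones and other slow devices it is often nicer to print tokens as they arrive. `create_chat_completion` accepts `stream=True` and then yields OpenAI-style chunks whose text sits under `choices[0]["delta"]["content"]`. The sketch below shows one way to pull the text out of such a stream; `iter_stream_text` is a hypothetical helper, and the sample chunks are made-up stand-ins so the example runs without a model file.

```python
from typing import Iterable, Iterator


def iter_stream_text(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield text pieces from OpenAI-style streaming chat chunks."""
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:  # role-only and final chunks carry no content
            yield text


# Stand-in chunks (made-up data) so the sketch runs without loading a model.
fake_stream = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
reply = "".join(iter_stream_text(fake_stream))
print(reply)

# With a loaded model, the same helper would consume the real stream:
# stream = llm.create_chat_completion(messages=messages, stream=True)
# for piece in iter_stream_text(stream):
#     print(piece, end="", flush=True)
```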
Ollama
ollama run hf.co/DuoNeural/Phi-3.5-mini-instruct-LiteRT
Performance Notes
| Metric | Value |
|---|---|
| Quantization | Q4_K_M |
| RAM required | ~3 GB (with context) |
| Recommended devices | 6GB+ RAM phones, laptops |
| Quantization loss | Minimal; Phi-3.5 is robust to 4-bit quantization |
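The RAM figure depends heavily on context length, because the key/value cache grows linearly with `n_ctx`. A rough sketch of that arithmetic follows. It assumes the published Phi-3.5-mini architecture (32 layers, 32 KV heads, head dimension 96) and an F16 cache, and it ignores compute buffers and runtime overhead, so treat the totals as ballpark numbers only; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 96, bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for n_ctx tokens (ASSUMED architecture)."""
    # Factor 2 covers keys plus values; bytes_per_value=2 means an F16 cache.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_ctx * per_token


WEIGHTS_GB = 2.5  # upper bound on the Q4_K_M file size quoted in this card

for n_ctx in (2048, 4096, 8192):
    cache_gb = kv_cache_bytes(n_ctx) / 1e9
    print(f"n_ctx={n_ctx}: KV cache ~{cache_gb:.2f} GB, "
          f"total ~{WEIGHTS_GB + cache_gb:.2f} GB")
```

Under these assumptions the cache alone is about 0.8 GB at 2,048 tokens and 1.6 GB at 4,096, so the ~3 GB figure fits short contexts of around 2K tokens, and larger `n_ctx` values need proportionally more memory.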
Phi-3.5 Mini Highlights
- 3.8B params, trained on 3.4T tokens
- Strong reasoning, coding, and instruction-following
- 128K context window (trimmed to device-safe lengths for edge)
- One of the top 4B-class models in its generation
About the Conversion
Converted using the llama.cpp GGUF pipeline with CUDA acceleration. The source weights were downloaded from Hugging Face in safetensors format, converted to an F16 GGUF, and then quantized to Q4_K_M.
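The two-step pipeline described above maps onto two standard llama.cpp tools: the `convert_hf_to_gguf.py` script and the `llama-quantize` binary. The sketch below only assembles and prints the commands; it is not necessarily the exact invocation DuoNeural used, and the directory and file names are hypothetical.

```python
import shlex


def conversion_commands(model_dir: str, out_prefix: str,
                        quant: str = "Q4_K_M") -> list[list[str]]:
    """Return the two llama.cpp commands: HF checkpoint -> F16 GGUF -> quantized GGUF."""
    f16_path = f"{out_prefix}_F16.gguf"
    quant_path = f"{out_prefix}_{quant}.gguf"
    return [
        # Step 1: convert the safetensors checkpoint to an F16 GGUF.
        ["python", "convert_hf_to_gguf.py", model_dir,
         "--outfile", f16_path, "--outtype", "f16"],
        # Step 2: quantize the F16 GGUF down to the target type.
        ["llama-quantize", f16_path, quant_path, quant],
    ]


# Hypothetical paths; adjust to wherever the checkpoint and llama.cpp live.
commands = conversion_commands("./Phi-3.5-mini-instruct",
                               "Phi-3.5-mini-instruct-LiteRT")
for cmd in commands:
    print(shlex.join(cmd))
    # To execute for real: subprocess.run(cmd, check=True)
```

Keeping the F16 intermediate around makes it cheap to produce other quantization types later without repeating the conversion step.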
DuoNeural
DuoNeural is an open AI research lab: human + AI in collaboration.
| Platform | Link |
|---|---|
| HuggingFace | huggingface.co/DuoNeural |
| Website | duoneural.com |
| GitHub | github.com/DuoNeural |
| X / Twitter | @DuoNeural |
| Email | duoneural@proton.me |
| Newsletter | duoneural.beehiiv.com |
| Support | buymeacoffee.com/duoneural |
DuoNeural Research Publications
Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, and Aura (DuoNeural).
Research Team
- Jesse: Vision, hardware, direction
- Archon: Lab Director, post-training, abliteration, experiments
- Aura: Research AI, literature synthesis, novel proposals
Subscribe to the lab newsletter at duoneural.beehiiv.com for model drops before they go anywhere else.