Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="DuoNeural/Qwen2.5-1.5B-Instruct-LiteRT",
	filename="Qwen2.5-1.5B-Instruct-LiteRT_Q4_K_M.gguf",
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Qwen2.5-1.5B-Instruct-LiteRT

Qwen2.5 1.5B Instruct: ultra-compact on-device inference, converted for mobile and edge deployment by DuoNeural.

  • Source model: Qwen/Qwen2.5-1.5B-Instruct
  • Format: GGUF Q4_K_M (llama.cpp-compatible, ~986 MB)
  • Parameters: 1.5B
  • Quantization: Q4_K_M (4-bit k-quant, medium), an excellent accuracy/size trade-off for edge devices
  • Target platforms: Android, iOS, desktop edge inference
  • Converted: 2026-05-06 by Archon / DuoNeural

Why This Model?

Qwen2.5-1.5B-Instruct is one of the most capable sub-2B instruction-tuned models available. At Q4_K_M the quantized file is under 1 GB, making it viable for on-device deployment on mid-range phones (6 GB+ RAM) and modern laptops.

Usage

llama.cpp (CLI)

./llama-cli -m Qwen2.5-1.5B-Instruct-LiteRT_Q4_K_M.gguf \
  -cnv -n 512 --temp 0.7 -p "You are a helpful assistant."

In conversation mode (-cnv), the -p text is used as the system prompt.
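
For an OpenAI-compatible local endpoint, the same file also runs under llama.cpp's bundled server (a minimal sketch; the context size and port are arbitrary choices):

./llama-server -m Qwen2.5-1.5B-Instruct-LiteRT_Q4_K_M.gguf -c 2048 --port 8080

Once running, any OpenAI-style client can target http://localhost:8080/v1/chat/completions.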

Google AI Edge / MediaPipe (Android/iOS)

This GGUF runs on-device through llama.cpp's Android and iOS bindings. For use with Google AI Edge Gallery, convert the original checkpoint to a .task bundle using the MediaPipe LLM conversion tools; a sketch follows.
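
A rough outline of that conversion with the MediaPipe genai converter (a hedged sketch: the converter consumes the original safetensors checkpoint, not this GGUF, and the model_type value for Qwen is an assumption to verify against the current MediaPipe docs):

from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    input_ckpt="./Qwen2.5-1.5B-Instruct/",   # original HF safetensors checkpoint
    ckpt_format="safetensors",
    model_type="QWEN_1.5B",                  # ASSUMPTION: check supported model types
    backend="cpu",
    output_dir="./converted/",
    vocab_model_file="./Qwen2.5-1.5B-Instruct/",
    output_tflite_file="./qwen2.5-1.5b-instruct.task",
)
converter.convert_checkpoint(config)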

Python via llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-1.5B-Instruct-LiteRT_Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=4,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain attention mechanisms in one paragraph."},
    ]
)
print(response["choices"][0]["message"]["content"])
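
For incremental output, the same call supports streaming; each chunk carries a delta with the newly generated text:

stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain attention mechanisms in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)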

Ollama

ollama run hf.co/DuoNeural/Qwen2.5-1.5B-Instruct-LiteRT
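
With a local copy of the GGUF, you can also register it under a custom name via a Modelfile (a minimal sketch; the model name and temperature are arbitrary):

# Modelfile
FROM ./Qwen2.5-1.5B-Instruct-LiteRT_Q4_K_M.gguf
PARAMETER temperature 0.7

ollama create qwen2.5-lite -f Modelfile
ollama run qwen2.5-lite

Depending on your Ollama version, the chat template embedded in the GGUF metadata may be picked up automatically; older builds may need an explicit TEMPLATE directive for the Qwen2 chat format.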

Performance Notes

  • File size: ~986 MB
  • RAM required: ~1.5 GB (with context)
  • Recommended devices: phones with 4 GB+ RAM, laptops, SBCs
  • Quantization loss: minimal (Q4_K_M is near-lossless for instruction following)
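
The ~1.5 GB figure follows from weights plus KV cache. A back-of-envelope check (layer and head counts assumed from the published Qwen2.5-1.5B config: 28 layers, 2 KV heads, head_dim 128):

# KV cache per token = 2 (K+V) x layers x kv_heads x head_dim x 2 bytes (fp16)
layers, kv_heads, head_dim, ctx = 28, 2, 128, 2048
kv_mb = 2 * layers * kv_heads * head_dim * 2 * ctx / 1024**2
print(f"{kv_mb:.0f} MiB")  # ~56 MiB at a 2048-token context

~986 MB of weights plus ~56 MiB of KV cache and compute buffers lands comfortably under 1.5 GB.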

About the Conversion

Converted using llama.cpp GGUF pipeline with CUDA acceleration. Source weights downloaded from HuggingFace in safetensors format, converted to F16 GGUF, then quantized to Q4_K_M.
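
The steps map onto the standard llama.cpp tooling (a sketch; paths are illustrative):

# 1. HF safetensors -> F16 GGUF
python convert_hf_to_gguf.py ./Qwen2.5-1.5B-Instruct \
  --outtype f16 --outfile Qwen2.5-1.5B-Instruct-f16.gguf

# 2. F16 GGUF -> Q4_K_M
./llama-quantize Qwen2.5-1.5B-Instruct-f16.gguf \
  Qwen2.5-1.5B-Instruct-LiteRT_Q4_K_M.gguf Q4_K_M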


DuoNeural

DuoNeural is an open AI research lab: human + AI in collaboration.

DuoNeural Research Publications

Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, and Aura (DuoNeural).

Research Team

  • Jesse: Vision, hardware, direction
  • Archon: Lab Director, post-training, abliteration, experiments
  • Aura: Research AI, literature synthesis, novel proposals

Subscribe to the lab newsletter at duoneural.beehiiv.com for model drops before they go anywhere else.
