How to use from
llama.cpp
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
# Run inference directly in the terminal:
llama cli -hf dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
# Run inference directly in the terminal:
llama cli -hf dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
# Run inference directly in the terminal:
./llama-cli -hf dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
# Run inference directly in the terminal:
./build/bin/llama-cli -hf dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
Use Docker
docker model run hf.co/dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile
Quick Links

Llama 3.2 1B Instruct - Q4 Mobile (GGUF)

Meta's Llama 3.2 1B Instruct, quantized to INT4 GGUF format for mobile deployment by Dispatch AI.

Property Value
Base meta-llama/Llama-3.2-1B-Instruct
Parameters 1.23 billion
Quantization Q4_K_M (4-bit k-means)
Size ~767 MB
Format GGUF (llama.cpp)
License Llama 3.2 Community

Why This Model?

Mobile-optimized for deployment on Android phones (Snapdragon 865+), laptops, IoT devices, and any hardware with 4GB+ RAM. No GPU required.

Performance on Samsung S20 FE (Snapdragon 865)

Metric This Version Original FP16
Size 767 MB ~2.5 GB
Speed ~28 tok/s CPU ~8 tok/s
Memory ~1.2 GB ~3.8 GB
Quality ~95% of original 100% baseline

Use Cases

  • Chatbots & conversational AI on mobile devices
  • Instruction following in resource-constrained environments
  • Content summarization, text classification, RAG pipelines
  • Educational apps, tutoring systems

Quick Start

# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build -DLLAMA_NATIVE=ON && cmake --build build --config Release

# Download this model
huggingface-cli download dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile ggml-model-Q4_K_M.gguf --local-dir ./models

# Run inference immediately
./build/bin/main -m ./models/ggml-model-Q4_K_M.gguf -p "Hello" -n 256 -t 4

Hardware Requirements

Requirement Minimum Recommended
RAM 4 GB 6 GB+
Storage 1 GB free 2 GB+
CPU 4-core ARM64/x86_64 8-core Snapdragon 865+
GPU Not required Any (faster)

Limitations

  • ~5% quality degradation vs FP16 on complex reasoning tasks
  • Not suitable for high-precision numerical computation
  • Context window follows base model (~128K tokens)

About Dispatch AI

Re-engineering LLMs for mobile and edge deployment. HuggingFace - 40+ models, 13K+ downloads

Downloads last month
1,242
GGUF
Model size
1B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Spaces using dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile 4

Collections including dispatchAI/Llama-3.2-1B-Instruct-Q4-mobile