SmolLM3-3B - GGUF (iPhone-optimized)

GGUF quantizations of HuggingFaceTB/SmolLM3-3B, built and optimized for on-device inference on iPhone, iPad, and Apple Silicon Macs via llama.cpp or apps that wrap it (e.g. Haplo).

Built and quantized by jc-builds for the Haplo ecosystem. Original weights © Hugging Face, redistributed under Apache 2.0 per the upstream license.

TL;DR

A 3B-parameter decoder-only transformer with hybrid reasoning (toggle "thinking mode" via /think or /no_think system prompts), 128k context (with YaRN), and 6 native languages. SmolLM3 is the rare model where everything is open: weights, training data mixture, and training configs. At the 3B scale it outperforms Llama-3.2-3B and Qwen2.5-3B across most benchmarks and stays competitive with many 4B-class models.

Available quantizations

| File | Size | Bits/weight | Recommended use |
|---|---|---|---|
| SmolLM3-3B-Q4_K_M.gguf | 1.8 GB | 4.8 | Default: best size/quality tradeoff for phone & laptop |
| SmolLM3-3B-Q5_K_M.gguf | 2.1 GB | 5.7 | Slightly better quality, ~17% bigger; good for iPad / Mac |
| SmolLM3-3B-Q8_0.gguf | 3.0 GB | 8.5 | Near-FP16 quality; only worth it on Apple Silicon Mac |

Pick Q4_K_M unless you have a reason not to; it's the sweet spot for on-device use on Apple Silicon. Q5_K_M is roughly 5-10% better on hard reasoning prompts but ~17% bigger; Q8_0 is essentially indistinguishable from FP16 but about 1.7× the size of Q4_K_M.
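The size comparisons can be checked directly from the file sizes in the table. A minimal sketch of that arithmetic follows; the dictionary and function names are illustrative and not part of any tool.

```python
# File sizes in GB, copied from the quantization table.
SIZES_GB = {"Q4_K_M": 1.8, "Q5_K_M": 2.1, "Q8_0": 3.0}


def relative_size(quant: str, baseline: str = "Q4_K_M") -> float:
    """Return how many times larger `quant` is than `baseline`."""
    return SIZES_GB[quant] / SIZES_GB[baseline]


for name in SIZES_GB:
    ratio = relative_size(name)
    # Show both the multiplier and the percentage difference.
    print(f"{name}: {ratio:.2f}x Q4_K_M ({(ratio - 1) * 100:+.0f}%)")
```

With the listed sizes, Q5_K_M comes out about 17% larger than Q4_K_M and Q8_0 about 1.67× its size.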

Performance on Apple Silicon

Approximate decode throughput for single-batch greedy decoding with a 2048-token context, measured with llama-cli.

| Device | RAM | Q4_K_M tok/s | Notes |
|---|---|---|---|
| iPhone 15 Pro | 8 GB | ~22 tok/s | Smooth chat experience |
| iPhone 14 Pro | 6 GB | ~18 tok/s | Comfortable |
| iPad Pro M2 | 8 GB | ~45 tok/s | Snappy |
| MacBook Pro M3 | 16 GB | ~80 tok/s | Effectively instant |

Reference numbers only: your throughput will vary with prompt length, KV cache, and what else is running. Q5_K_M and Q8_0 are roughly 15% / 40% slower than Q4_K_M respectively.
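To turn these figures into rough expectations for the other quantizations, the Q4_K_M numbers can be scaled by the stated slowdowns. The sketch below assumes "15% slower" means 0.85× the Q4_K_M throughput; the names are illustrative and the results are estimates, not measurements.

```python
# Q4_K_M decode throughput in tokens/second, copied from the table.
Q4_TOK_PER_S = {
    "iPhone 15 Pro": 22,
    "iPhone 14 Pro": 18,
    "iPad Pro M2": 45,
    "MacBook Pro M3": 80,
}

# Approximate slowdown relative to Q4_K_M, as stated in the text.
# Assumption: "15% slower" is read as 0.85x the Q4_K_M throughput.
SLOWDOWN = {"Q4_K_M": 0.0, "Q5_K_M": 0.15, "Q8_0": 0.40}


def estimated_tok_per_s(device: str, quant: str) -> float:
    """Estimate decode throughput for a device/quantization pair."""
    return Q4_TOK_PER_S[device] * (1.0 - SLOWDOWN[quant])


def seconds_for(tokens: int, device: str, quant: str) -> float:
    """Estimate how long decoding `tokens` tokens would take."""
    return tokens / estimated_tok_per_s(device, quant)
```

For example, `seconds_for(256, "MacBook Pro M3", "Q4_K_M")` gives the approximate time for a 256-token reply, ignoring prompt processing.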

How to use

1. Haplo (iPhone / iPad / Mac)

The model appears automatically in Haplo's model browser on Kuzco-1.1.0+ builds. The download URL for Q4_K_M is:

https://huggingface.co/jc-builds/SmolLM3-3B-Instruct-GGUF/resolve/main/SmolLM3-3B-Q4_K_M.gguf
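The other quantizations should be reachable by swapping the filename in the same URL pattern. A small sketch, assuming the files in the table all live on the `main` branch of this repository (the helper name is hypothetical):

```python
from urllib.parse import quote

REPO_ID = "jc-builds/SmolLM3-3B-Instruct-GGUF"


def resolve_url(quant: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build the direct-download URL for one quantization file."""
    filename = f"SmolLM3-3B-{quant}.gguf"
    # quote() leaves these names unchanged; it only guards against odd characters.
    return f"https://huggingface.co/{repo_id}/resolve/{quote(revision)}/{quote(filename)}"


for quant in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(resolve_url(quant))
```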

2. llama.cpp (CLI)

huggingface-cli download jc-builds/SmolLM3-3B-Instruct-GGUF SmolLM3-3B-Q4_K_M.gguf --local-dir .

./llama-cli \
  -m SmolLM3-3B-Q4_K_M.gguf \
  -p "Explain gravity in two sentences." \
  -n 256 \
  --temp 0.6 \
  --top-p 0.95
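If you want to drive the same command from a script, one option is to build the argument list in Python and hand it to `subprocess`. This is a sketch only: the function names are illustrative, and some llama.cpp builds start `llama-cli` in an interactive chat session, in which case `run_llama_cli` would block waiting for input.

```python
import subprocess


def llama_cli_args(model_path, prompt, n_predict=256, temp=0.6, top_p=0.95,
                   binary="./llama-cli"):
    """Return the argv list matching the CLI invocation shown above."""
    return [
        binary,
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_predict),
        "--temp", str(temp),
        "--top-p", str(top_p),
    ]


def run_llama_cli(model_path, prompt, **kwargs):
    """Run llama-cli once and return whatever it wrote to stdout."""
    result = subprocess.run(
        llama_cli_args(model_path, prompt, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or newlines.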

3. Ollama

cat <<'EOF' > Modelfile
FROM ./SmolLM3-3B-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF
ollama create smollm3 -f Modelfile
ollama run smollm3
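Once the model is created, it can also be queried through Ollama's local HTTP API rather than the interactive prompt. The sketch below assumes the default Ollama address (`http://localhost:11434`) and the `smollm3` name from the Modelfile step; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(user_text, system=None, model="smollm3",
                       host="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_text})
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,  # ask for a single JSON response
        "options": {"temperature": 0.6, "top_p": 0.95},
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, **kwargs):
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(user_text, **kwargs)) as resp:
        return json.loads(resp.read())["message"]["content"]
```

`chat("Explain gravity in two sentences.")` needs a running Ollama server; `build_chat_request` on its own can be inspected offline.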

Reasoning modes (think / no_think)

SmolLM3 ships with hybrid reasoning. You toggle it via system prompt:

| System prompt | Behavior |
|---|---|
| /think (default) | Emits a <think>…</think> reasoning block, then the answer. Better on math / code / multi-step problems. |
| /no_think | Skips the reasoning block. Use for fast chat / simple Q&A. |
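In a chat UI you usually want to show the answer and hide or collapse the reasoning. A minimal sketch of separating the two, assuming the output contains at most one `<think>…</think>` block (the function name is hypothetical and not part of llama.cpp or Haplo):

```python
import re

# Non-greedy match so only the first reasoning block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block at all: the whole text is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

An empty reasoning block is handled the same way as a missing one: the first element of the tuple is an empty string.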

Example:

<|im_start|>system
/no_think<|im_end|>
<|im_start|>user
Capital of Australia?<|im_end|>
<|im_start|>assistant

Sampling defaults

The upstream team recommends temperature=0.6 and top_p=0.95. The GGUF metadata stores these as the defaults; most clients (llama.cpp, Haplo, Ollama) will use them automatically.
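For intuition about what these two settings do, here is a small pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It is an illustration under simplifying assumptions, not the code any client runs; in particular, clients may apply their samplers in a different order.

```python
import math


def nucleus_distribution(logits, temperature=0.6, top_p=0.95):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature < 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


dist = nucleus_distribution([2.0, 1.0, 0.0, -5.0])
print(dist)
```

With these example logits, only the two most likely tokens survive the top-p cut; the next token would then be drawn at random from that reduced distribution.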

Chat template

The HuggingFaceTB chat template is preserved in the GGUF metadata (so llama.cpp's --chat-template flag is not required). It uses ChatML-style turns:

<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{user}<|im_end|>
<|im_start|>assistant
{assistant}<|im_end|>
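If a client expects a raw prompt string rather than a message list, the turn layout above can be reproduced by hand. This is a simplified sketch of the ChatML-style layout only; the template embedded in the GGUF may add more than this, so prefer the embedded template when the client supports it. The function name is illustrative.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as ChatML-style turns."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Capital of Australia?"},
])
print(prompt)
```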

Quantization recipe

Built with llama.cpp at commit e43431b (May 7, 2026).

  1. Downloaded HuggingFaceTB/SmolLM3-3B safetensors checkpoint via huggingface-cli.
  2. Converted to GGUF FP16 via convert_hf_to_gguf.py --outtype f16.
  3. Quantized to each target type via llama-quantize:
    llama-quantize SmolLM3-3B-F16.gguf SmolLM3-3B-Q4_K_M.gguf Q4_K_M
    llama-quantize SmolLM3-3B-F16.gguf SmolLM3-3B-Q5_K_M.gguf Q5_K_M
    llama-quantize SmolLM3-3B-F16.gguf SmolLM3-3B-Q8_0.gguf   Q8_0
    

No imatrix calibration was used; the weights come from the upstream FP16 directly.
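To reproduce step 3 in one go, the three `llama-quantize` invocations can be generated from a list of target types. A sketch assuming `llama-quantize` is on your PATH and the F16 file from step 2 sits in the current directory; the helper names are hypothetical.

```python
import subprocess

QUANT_TYPES = ["Q4_K_M", "Q5_K_M", "Q8_0"]


def quantize_commands(base="SmolLM3-3B", binary="llama-quantize"):
    """Return one argv list per target type, mirroring the commands above."""
    source = f"{base}-F16.gguf"
    return [[binary, source, f"{base}-{q}.gguf", q] for q in QUANT_TYPES]


def run_all(**kwargs):
    """Run each quantization in turn, stopping at the first failure."""
    for argv in quantize_commands(**kwargs):
        subprocess.run(argv, check=True)
```

Calling `run_all()` performs the quantizations; `quantize_commands()` alone just returns the commands for inspection.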

Original model card

See the upstream model card for full architecture, training, and benchmark details: HuggingFaceTB/SmolLM3-3B.

License

Apache 2.0, inherited from the original model. Commercial use, modification, and redistribution are permitted. See LICENSE for the full terms.

SmolLM3 by Hugging Face. Licensed under Apache 2.0.

Acknowledgements

  • The Hugging Face SmolLM team for the original weights and an unusually generous open-everything release (training data, recipe, configs).
  • The llama.cpp team for the GGUF format and quantization tooling.
  • The Haplo ecosystem this drop is built for.