
How to use jc-builds/SmolLM3-3B-Instruct-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Use Docker

docker model run hf.co/jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

vLLM

How to use jc-builds/SmolLM3-3B-Instruct-GGUF with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "jc-builds/SmolLM3-3B-Instruct-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "jc-builds/SmolLM3-3B-Instruct-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker

docker model run hf.co/jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Ollama
How to use jc-builds/SmolLM3-3B-Instruct-GGUF with Ollama:
```
ollama run hf.co/jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M
```

Unsloth Studio

How to use jc-builds/SmolLM3-3B-Instruct-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for jc-builds/SmolLM3-3B-Instruct-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for jc-builds/SmolLM3-3B-Instruct-GGUF to start chatting

Use Hugging Face Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for jc-builds/SmolLM3-3B-Instruct-GGUF to start chatting

Pi

How to use jc-builds/SmolLM3-3B-Instruct-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Configure the model in Pi

```
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to ~/.pi/agent/models.json:

```
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
```

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use jc-builds/SmolLM3-3B-Instruct-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use jc-builds/SmolLM3-3B-Instruct-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use jc-builds/SmolLM3-3B-Instruct-GGUF with Docker Model Runner:
```
docker model run hf.co/jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M
```

Lemonade

How to use jc-builds/SmolLM3-3B-Instruct-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.SmolLM3-3B-Instruct-GGUF-Q4_K_M

List all available models

lemonade list

SmolLM3-3B — GGUF (iPhone-optimized)

GGUF quantizations of HuggingFaceTB/SmolLM3-3B, built and optimized for on-device inference on iPhone, iPad, and Apple Silicon Macs via llama.cpp or apps that wrap it (e.g. Haplo).

Built and quantized by jc-builds for the Haplo ecosystem. Original weights © Hugging Face, redistributed under Apache 2.0 per the upstream license.

TL;DR

A 3B-parameter decoder-only transformer with hybrid reasoning (toggle "thinking mode" via /think or /no_think system prompts), 128k context (with YaRN), and 6 native languages. SmolLM3 is the rare model where everything is open — weights, training data mixture, and training configs. At the 3B scale it outperforms Llama-3.2-3B and Qwen2.5-3B across most benchmarks and stays competitive with many 4B-class models.

Available quantizations

| File | Size | Bits/weight | Recommended use |
| --- | --- | --- | --- |
| `SmolLM3-3B-Q4_K_M.gguf` | 1.8 GB | 4.8 | Default: best size/quality tradeoff for phone and laptop |
| `SmolLM3-3B-Q5_K_M.gguf` | 2.1 GB | 5.7 | Slightly better quality, ~17% bigger; good for iPad / Mac |
| `SmolLM3-3B-Q8_0.gguf` | 3.0 GB | 8.5 | Near-FP16 quality; only worth it on Apple Silicon Mac |

Pick Q4_K_M unless you have a reason not to: it is the sweet spot for on-device inference on Apple Silicon. Q5_K_M can be slightly stronger on hard reasoning prompts but is ~17% bigger; Q8_0 is essentially indistinguishable from FP16 but about 1.7× the size of Q4_K_M.
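The size trade-off can be checked directly from the file sizes in the table. The sketch below is illustrative only: `size_ratio` and `pick_quant` are hypothetical helper names, and the budget covers the weights file alone (KV cache and runtime overhead are not modeled).

```python
from typing import Optional

# File sizes (GB) and bits/weight copied from the quantization table.
QUANTS = {
    "Q4_K_M": {"size_gb": 1.8, "bits_per_weight": 4.8},
    "Q5_K_M": {"size_gb": 2.1, "bits_per_weight": 5.7},
    "Q8_0": {"size_gb": 3.0, "bits_per_weight": 8.5},
}


def size_ratio(name: str, baseline: str = "Q4_K_M") -> float:
    """Return how many times larger `name` is on disk than `baseline`."""
    return QUANTS[name]["size_gb"] / QUANTS[baseline]["size_gb"]


def pick_quant(budget_gb: float) -> Optional[str]:
    """Hypothetical helper: the largest quantization whose file fits the budget.

    Returns None when even the smallest file is too big.
    """
    fitting = [n for n, q in QUANTS.items() if q["size_gb"] <= budget_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda n: QUANTS[n]["size_gb"])
```

For example, `pick_quant(2.5)` selects Q5_K_M, because the 3.0 GB Q8_0 file does not fit a 2.5 GB budget.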

Performance on Apple Silicon

Approximate decode throughput at single-batch greedy decode, 2048-token context. Measured with llama-cli.

| Device | RAM | Q4_K_M tok/s | Notes |
| --- | --- | --- | --- |
| iPhone 15 Pro | 8 GB | ~22 tok/s | Smooth chat experience |
| iPhone 14 Pro | 6 GB | ~18 tok/s | Comfortable |
| iPad Pro M2 | 8 GB | ~45 tok/s | Snappy |
| MacBook Pro M3 | 16 GB | ~80 tok/s | Effectively instant |

Reference numbers — your throughput will vary with prompt length, KV cache, and what else is running. Q5_K_M and Q8_0 are roughly 15% / 40% slower than Q4_K_M respectively.
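To turn these reference numbers into a rough wait time, a back-of-the-envelope estimate is enough. This sketch assumes that "X% slower" means decode throughput is multiplied by (1 - X); that reading, and the helper names, are assumptions rather than anything measured here.

```python
# Approximate Q4_K_M decode throughput (tokens/second) from the table.
Q4_TOK_PER_S = {
    "iPhone 15 Pro": 22.0,
    "iPhone 14 Pro": 18.0,
    "iPad Pro M2": 45.0,
    "MacBook Pro M3": 80.0,
}

# Assumed fractional throughput loss relative to Q4_K_M.
SLOWDOWN = {"Q4_K_M": 0.0, "Q5_K_M": 0.15, "Q8_0": 0.40}


def estimated_tok_per_s(device: str, quant: str = "Q4_K_M") -> float:
    """Scale the Q4_K_M reference throughput by the assumed slowdown."""
    return Q4_TOK_PER_S[device] * (1.0 - SLOWDOWN[quant])


def estimated_decode_seconds(device: str, n_tokens: int, quant: str = "Q4_K_M") -> float:
    """Rough time to decode `n_tokens`, ignoring prompt processing."""
    return n_tokens / estimated_tok_per_s(device, quant)
```

Under these assumptions, a 256-token reply takes about 3.2 seconds on the MacBook Pro M3 at Q4_K_M and roughly 11 to 12 seconds on the iPhone 15 Pro.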

How to use

1. Haplo (iPhone / iPad / Mac)

The model appears automatically in Haplo's model browser on Kuzco-1.1.0+ builds. The download URL for Q4_K_M is:

https://huggingface.co/jc-builds/SmolLM3-3B-Instruct-GGUF/resolve/main/SmolLM3-3B-Q4_K_M.gguf
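The URL above follows the Hugging Face Hub `resolve/<revision>/<filename>` pattern, so links for the other files can be built the same way. A minimal sketch, with `resolve_url` as a hypothetical helper and the assumption that the other quantizations live on the same `main` revision under the file names listed in the table:

```python
from urllib.parse import quote

REPO_ID = "jc-builds/SmolLM3-3B-Instruct-GGUF"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hub repository."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


# File names as listed in the quantization table.
URLS = {
    name: resolve_url(REPO_ID, f"SmolLM3-3B-{name}.gguf")
    for name in ("Q4_K_M", "Q5_K_M", "Q8_0")
}
```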

2. llama.cpp (CLI)

```
huggingface-cli download jc-builds/SmolLM3-3B-Instruct-GGUF SmolLM3-3B-Q4_K_M.gguf --local-dir .

./llama-cli \
  -m SmolLM3-3B-Q4_K_M.gguf \
  -p "Explain gravity in two sentences." \
  -n 256 \
  --temp 0.6 \
  --top-p 0.95
```

3. Ollama

```
cat <<'EOF' > Modelfile
FROM ./SmolLM3-3B-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF
ollama create smollm3 -f Modelfile
ollama run smollm3
```
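If you generate Modelfiles for several quantizations, the same text can be produced programmatically. This is a sketch only: `render_modelfile` is a hypothetical helper, and the optional `SYSTEM` line (for example to pin `/no_think`) assumes Ollama passes the system prompt through the model's chat template unchanged.

```python
from typing import Optional


def render_modelfile(
    gguf_path: str,
    temperature: float = 0.6,
    top_p: float = 0.95,
    system: Optional[str] = None,
) -> str:
    """Return Modelfile text equivalent to the heredoc shown above."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER temperature {temperature}",
        f"PARAMETER top_p {top_p}",
    ]
    if system is not None:
        # Optional: fix a system prompt such as "/no_think" for this model.
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


# Usage (not executed here):
#   with open("Modelfile", "w", encoding="utf-8") as fh:
#       fh.write(render_modelfile("./SmolLM3-3B-Q4_K_M.gguf"))
```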

Reasoning modes (think / no_think)

SmolLM3 ships with hybrid reasoning. You toggle it via system prompt:

| System prompt | Behavior |
| --- | --- |
| `/think` (default) | Emits a `<think>…</think>` reasoning block, then the answer. Better on math / code / multi-step problems. |
| `/no_think` | Skips the reasoning block. Use for fast chat / simple Q&A. |

Example:

```
<|im_start|>system
/no_think<|im_end|>
<|im_start|>user
Capital of Australia?<|im_end|>
<|im_start|>assistant
```
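In `/think` mode, a client usually wants to show the answer separately from the reasoning block. A minimal sketch of that post-processing, assuming the raw completion contains at most one `<think>…</think>` block; `split_reasoning` is a hypothetical helper, not part of llama.cpp or Haplo:

```python
import re
from typing import Tuple

# Non-greedy match across newlines for a single reasoning block.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw completion.

    If no <think> block is present (as in /no_think mode), the reasoning
    is an empty string and the whole text is treated as the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

A completion that is cut off before `</think>` will not match and is returned whole as the answer, so callers that stream tokens may want to handle that case separately.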

Sampling defaults

The upstream team recommends temperature=0.6 and top_p=0.95. The GGUF metadata stores these as the defaults — most clients (llama.cpp, Haplo, Ollama) will use them automatically.
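For clients that do not read those defaults, the values can be sent explicitly with each request. The sketch below targets an OpenAI-compatible `/v1/chat/completions` endpoint; the base URL `http://localhost:8080/v1` and the model id are assumptions taken from the local-server examples on this page, and both helper names are hypothetical.

```python
import json
import urllib.request

DEFAULT_MODEL = "jc-builds/SmolLM3-3B-Instruct-GGUF:Q4_K_M"  # assumed model id


def build_chat_request(
    user_prompt: str,
    model: str = DEFAULT_MODEL,
    system_prompt: str = "/no_think",
    temperature: float = 0.6,
    top_p: float = 0.95,
) -> dict:
    """Build a chat-completions payload with the recommended sampling values."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "top_p": top_p,
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8080/v1") -> str:
    """Send the payload to a running server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a server to be running):
#   print(post_chat(build_chat_request("Capital of Australia?")))
```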

Chat template

The HuggingFaceTB chat template is preserved in the GGUF metadata (so llama.cpp's --chat-template flag is not required). It uses ChatML-style turns:

```
<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{user}<|im_end|>
<|im_start|>assistant
{assistant}<|im_end|>
```
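When a client needs the raw prompt string (for example with a completion-style API), the turn layout above can be rendered by hand. This is a simplified sketch of the layout as shown here, not the full template stored in the GGUF, which may insert additional content; `render_chatml` is a hypothetical helper.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render role/content messages as ChatML-style turns.

    With add_generation_prompt=True the string ends with an open
    assistant turn, ready for the model to continue.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Capital of Australia?"},
    ]
)
```

Prefer the embedded template whenever the client supports it; hand-rendering is mainly useful for debugging what the model actually receives.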

Quantization recipe

Built with llama.cpp at commit e43431b (May 7, 2026).

1. Downloaded the HuggingFaceTB/SmolLM3-3B safetensors checkpoint via huggingface-cli.
2. Converted to GGUF FP16 via convert_hf_to_gguf.py --outtype f16.
3. Quantized to each target type via llama-quantize:

```
llama-quantize SmolLM3-3B-F16.gguf SmolLM3-3B-Q4_K_M.gguf Q4_K_M
llama-quantize SmolLM3-3B-F16.gguf SmolLM3-3B-Q5_K_M.gguf Q5_K_M
llama-quantize SmolLM3-3B-F16.gguf SmolLM3-3B-Q8_0.gguf   Q8_0
```

No imatrix calibration was used; every quantization was produced directly from the FP16 GGUF created in the conversion step.
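To reproduce the quantization step for all three targets in one go, the commands can be generated and run from a small script. A sketch under the assumption that `llama-quantize` is on your PATH and the FP16 file sits in the working directory; the helper names are hypothetical.

```python
import subprocess
from typing import List

TARGETS = ["Q4_K_M", "Q5_K_M", "Q8_0"]


def quantize_commands(stem: str = "SmolLM3-3B", binary: str = "llama-quantize") -> List[List[str]]:
    """Return one argv list per target, mirroring the commands in the recipe."""
    source = f"{stem}-F16.gguf"
    return [[binary, source, f"{stem}-{target}.gguf", target] for target in TARGETS]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Usage (not executed here): run_all(quantize_commands())
```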

Original model card

See the upstream model card for full architecture, training, and benchmark details: HuggingFaceTB/SmolLM3-3B.

License

Apache 2.0, inherited from the original model. Commercial use, modification, and redistribution are permitted. See LICENSE for the full terms.

SmolLM3 by Hugging Face. Licensed under Apache 2.0.

Acknowledgements

The Hugging Face SmolLM team for the original weights and an unusually generous open-everything release (training data, recipe, configs).
The llama.cpp team for the GGUF format and quantization tooling.
The Haplo ecosystem this drop is built for.

Downloads last month: 319

Format: GGUF
Model size: 3B params
Architecture: smollm3
Available precisions: 4-bit, 5-bit, 8-bit

Model tree for jc-builds/SmolLM3-3B-Instruct-GGUF

Base model: HuggingFaceTB/SmolLM3-3B-Base
Finetuned: HuggingFaceTB/SmolLM3-3B
Quantized (104): this model