Instructions for using enfuse/smol-tools-4b-32k with libraries and local apps.

Libraries

How to use enfuse/smol-tools-4b-32k with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="enfuse/smol-tools-4b-32k")
# The model is text-only, so each message carries plain string content.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-32k")
model = AutoModelForCausalLM.from_pretrained("enfuse/smol-tools-4b-32k")
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
# Apply the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use enfuse/smol-tools-4b-32k with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="enfuse/smol-tools-4b-32k",
	filename="smol-tools-4b-32k-q4_k_m.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use enfuse/smol-tools-4b-32k with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf enfuse/smol-tools-4b-32k:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf enfuse/smol-tools-4b-32k:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf enfuse/smol-tools-4b-32k:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf enfuse/smol-tools-4b-32k:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf enfuse/smol-tools-4b-32k:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf enfuse/smol-tools-4b-32k:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf enfuse/smol-tools-4b-32k:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf enfuse/smol-tools-4b-32k:Q4_K_M

Use Docker

docker model run hf.co/enfuse/smol-tools-4b-32k:Q4_K_M

vLLM

How to use enfuse/smol-tools-4b-32k with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "enfuse/smol-tools-4b-32k"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "enfuse/smol-tools-4b-32k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/enfuse/smol-tools-4b-32k:Q4_K_M

SGLang

How to use enfuse/smol-tools-4b-32k with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "enfuse/smol-tools-4b-32k" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "enfuse/smol-tools-4b-32k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "enfuse/smol-tools-4b-32k" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "enfuse/smol-tools-4b-32k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use enfuse/smol-tools-4b-32k with Ollama:
```
ollama run hf.co/enfuse/smol-tools-4b-32k:Q4_K_M
```

Unsloth Studio

How to use enfuse/smol-tools-4b-32k with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for enfuse/smol-tools-4b-32k to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for enfuse/smol-tools-4b-32k to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for enfuse/smol-tools-4b-32k to start chatting

Pi

How to use enfuse/smol-tools-4b-32k with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf enfuse/smol-tools-4b-32k:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "enfuse/smol-tools-4b-32k:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use enfuse/smol-tools-4b-32k with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf enfuse/smol-tools-4b-32k:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default enfuse/smol-tools-4b-32k:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use enfuse/smol-tools-4b-32k with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf enfuse/smol-tools-4b-32k:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "enfuse/smol-tools-4b-32k:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use enfuse/smol-tools-4b-32k with Docker Model Runner:
```
docker model run hf.co/enfuse/smol-tools-4b-32k:Q4_K_M
```

Lemonade

How to use enfuse/smol-tools-4b-32k with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull enfuse/smol-tools-4b-32k:Q4_K_M

Run and chat with the model

lemonade run user.smol-tools-4b-32k-Q4_K_M

List all available models

lemonade list

smol-tools-4b-32k — Long-Context Agentic Tool-Use Model

A 4B parameter model fine-tuned for reliable tool calling with 32K context support. Handles extended multi-turn tool-use conversations, long document analysis with tool calls, and complex multi-step agent workflows — all within a single context window 8x larger than the original smol-tools-4b.

Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 8,059 examples including multi-turn tool-use conversations up to 32K tokens.

Architecture: Qwen3_5ForCausalLM (text-only, 32 layers, hybrid attention — 24 linear + 8 full-attention). Qwen3.5's efficient attention makes long-context inference memory-friendly.

Need less context? See smol-tools-4b-16k (16K, also available in GGUF) or smol-tools-4b (4K, highest accuracy, GGUF also available).

Available Formats

| Format | File | Size | Tool F1 | Use Case |
| --- | --- | --- | --- | --- |
| BF16 safetensors | `model.safetensors` | 9.7 GB | 0.940 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | `smol-tools-4b-32k-q8_0.gguf` | 4.9 GB | 0.918 | Near-lossless: Jetson Orin NX/AGX, 8GB+ GPUs |
| Q4_K_M GGUF | `smol-tools-4b-32k-q4_k_m.gguf` | 2.9 GB | 0.925 | Edge deployment: Jetson Orin Nano, phones, RPi 5 |

All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.

Why 32K Context?

The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:

Multi-turn conversations — 10-20 rounds of tool calls and results accumulating in context
Long tool outputs — database queries returning hundreds of rows, full file contents, and lengthy web pages
Complex planning — reasoning over many prior results to decide next steps

This model was specifically trained on multi-turn tool-use data (up to 32K tokens) so it maintains tool-calling accuracy across very long conversations, not just short single-turn queries. 32K tokens is enough for most real-world agent sessions.
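
The accumulation described above can be sketched as a simple loop. This is a minimal illustration rather than the authors' evaluation harness: `fake_model` and `run_tool` are hypothetical stand-ins for a real model call and real tool implementations, and using a `tool` role for results is an assumption that depends on the chat template.

```python
import json


def fake_model(messages):
    """Hypothetical stand-in for a model call: asks for one tool, then answers."""
    tool_turns = sum(1 for m in messages if m["role"] == "tool")
    if tool_turns == 0:
        return {"tool_call": {"name": "get_weather", "arguments": {"city": "Paris"}}}
    return {"answer": "It is sunny in Paris."}


def run_tool(name, arguments):
    """Hypothetical tool dispatcher that returns a canned result."""
    return {"city": arguments["city"], "forecast": "sunny"}


def run_agent(user_message, max_rounds=20):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_rounds):
        reply = fake_model(messages)
        if "answer" in reply:
            messages.append({"role": "assistant", "content": reply["answer"]})
            break
        call = reply["tool_call"]
        # Both the call and its result stay in context for later rounds,
        # which is why long sessions need a large context window.
        messages.append({"role": "assistant", "content": json.dumps(call)})
        result = run_tool(call["name"], call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages


history = run_agent("What's the weather in Paris?")
```

Each round adds two messages, so a 10-20 round session with long tool outputs grows quickly; in a real loop it would be worth tracking the token count of `history` against the 32,768-token limit.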

Results (200-example held-out eval)

| Metric | smol-tools-4b (4K) | smol-tools-4b-32k | Delta |
| --- | --- | --- | --- |
| Tool Selection F1 | 0.955 | 0.940 | -0.015 |
| Tool Precision | 0.955 | 0.940 | -0.015 |
| Tool Recall | 0.980 | 0.965 | -0.015 |
| JSON Validity | 100% | 100% | 0 |
| Argument Correctness | 100% | 100% | 0 |
| No-Tool Accuracy | 100% | 100% | 0 |
| Max Context (tokens) | 4,096 | 32,768 | 8x |

F1 drops by only 0.015 (1.5 points) compared to the 4K model, while the context length grows 8x. Perfect JSON validity and argument correctness are preserved.

Per-Scenario Breakdown

| Scenario | F1 | Count | Description |
| --- | --- | --- | --- |
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| multi_tool_sequential | 0.972 | 36 | Chained tool calls with dependencies |
| error_recovery | 0.944 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.943 | 53 | One tool call needed |
| reasoning_heavy | 0.914 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
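
As a quick sanity check, the per-scenario rows can be combined into an overall score. The sketch below assumes the headline Tool F1 is an example-weighted mean of the per-scenario F1 values; the card does not state how it was aggregated, so treat that as an assumption.

```python
# (scenario, F1, count) rows copied from the breakdown above
rows = [
    ("multi_tool_parallel", 1.000, 18),
    ("no_tool_needed", 1.000, 18),
    ("multi_tool_sequential", 0.972, 36),
    ("error_recovery", 0.944, 18),
    ("single_tool", 0.943, 53),
    ("reasoning_heavy", 0.914, 35),
    ("complex_multi_step", 0.818, 22),
]

total = sum(count for _, _, count in rows)
weighted_f1 = sum(f1 * count for _, f1, count in rows) / total

print(total, round(weighted_f1, 3))
```

Under that assumption the counts sum to the 200 evaluation examples and the weighted mean rounds to the reported 0.940.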

Capabilities

32K context window — handles extended multi-turn agent conversations with accumulated tool results
Tool selection: Picks the right tool(s) from a provided set with 94.0% F1
Structured output: Produces valid <tool_call>{"name": "...", "arguments": {...}}</tool_call> JSON — 100% validity
Tool refusal: Correctly answers directly when no tool is needed — 100% accuracy
Multi-tool: Handles parallel and sequential multi-tool scenarios with high accuracy
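
The `<tool_call>` format above is straightforward to parse on the client side. The following is a minimal sketch, not an official parser: the function name `extract_tool_calls` and the choice to skip malformed blocks silently are illustrative assumptions.

```python
import json
import re

# Non-greedy match of everything between the opening and closing tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in generated text."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than crash
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls


sample = (
    "Let me check.\n"
    '<tool_call>{"name": "web_search", "arguments": {"query": "SpaceX news"}}</tool_call>'
)
calls = extract_tool_calls(sample)
```

Because the regex collects every tagged block, the same function covers parallel multi-tool outputs; an empty list signals that the model answered directly.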

Available Tools (training set)

The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:

web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b-32k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-32k", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))

With vLLM (faster)

from vllm import LLM, SamplingParams

# `prompt` is the chat-templated string built in the Quick Start snippet above.
llm = LLM(model="enfuse/smol-tools-4b-32k", dtype="bfloat16", max_model_len=32768, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)

With llama.cpp (GGUF, edge devices)

# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-32k smol-tools-4b-32k-q4_k_m.gguf --local-dir .

# Run with llama-server (OpenAI-compatible API).
# -c sets the context size in tokens: 4096 keeps memory low; raise it (up to 32768) for long sessions.
llama-server -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080

# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"

# Or with llama-cpp-python (Python; `prompt` is the chat-templated string from Quick Start)
from llama_cpp import Llama

# n_ctx mirrors -c above; n_gpu_layers=-1 offloads all layers to the GPU.
llm = Llama(model_path="smol-tools-4b-32k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
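
Once `llama-server` is running, it can be called from Python through its OpenAI-compatible chat endpoint. The sketch below uses only the standard library. The helper names, the `model` value, and passing `tools` in the request are assumptions: whether the server turns tool definitions into prompts depends on the llama.cpp version and its chat-template flags, so check that against your build.

```python
import json
import urllib.request


def build_chat_request(messages, tools=None, model="smol-tools-4b-32k", temperature=0.1):
    """Assemble an OpenAI-style chat completion payload (hypothetical helper)."""
    payload = {"model": model, "messages": messages, "temperature": temperature}
    if tools:
        payload["tools"] = tools
    return payload


def post_chat(payload, base_url="http://localhost:8080/v1"):
    """POST the payload to the local server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_chat_request(
    [{"role": "user", "content": "What's the latest news about SpaceX?"}]
)
# response = post_chat(payload)  # needs the llama-server process started above
```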

Training Details

| Parameter | Value |
| --- | --- |
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 8,059 (6,855 short-context + 1,204 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 1 (× 32 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 32,768 |
| Eval loss | 0.140 |
| Train loss | 0.229 |
| Training time | ~29.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
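
A few derived quantities follow from these settings. The sketch below is plain arithmetic rather than the authors' training script: the dictionary keys mirror PEFT's `LoraConfig` argument names as an illustration, the alpha/rank scaling is the standard LoRA convention, and the step count is approximate because it depends on how the trainer handles the last partial batch.

```python
import math

# Values copied from the table above; key names follow PEFT's LoraConfig arguments.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

per_device_batch = 1
grad_accum = 32
effective_batch = per_device_batch * grad_accum

# Approximate optimizer steps per epoch over 8,059 examples.
steps_per_epoch = math.ceil(8059 / effective_batch)

print(lora_scale, effective_batch, steps_per_epoch)
```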

Data Pipeline

Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
Multi-turn long-context data (1,204 examples): New synthetic conversations generated by Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). Cleaned to remove malformed tool calls and ensure proper conversation endings
Combined: 8,059 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤32K tokens
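
The final length filter can be sketched as follows. This is an illustration under stated assumptions, not the actual pipeline: `count_tokens` is a hypothetical whitespace counter standing in for the model tokenizer, and chat-template overhead is ignored.

```python
def count_tokens(text):
    """Hypothetical token counter; a real pipeline would use the model tokenizer."""
    return len(text.split())


def conversation_length(messages):
    """Total token count across all messages in one conversation."""
    return sum(count_tokens(m["content"]) for m in messages)


def filter_by_length(conversations, max_tokens=32768):
    """Keep only conversations that fit within the training sequence length."""
    return [c for c in conversations if conversation_length(c) <= max_tokens]


short_conv = [{"role": "user", "content": "hi there"}]
long_conv = [{"role": "tool", "content": "row " * 40000}]
kept = filter_by_length([short_conv, long_conv])
```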

smol-tools Family

All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:

| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA Config | HF Repo |
| --- | --- | --- | --- | --- | --- | --- |
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-16k |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | this repo |

How to choose:

4K: Single-turn tool calls with short tool outputs. Highest accuracy and lowest memory; also available in GGUF quantized formats
16K: Multi-turn conversations (5-10 rounds) with moderate tool outputs; also available in GGUF quantized formats
32K (this model): Extended agent sessions (10-20 rounds) with large tool outputs; also available in GGUF quantized formats
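
The selection rule above can be written as a small lookup. This is a sketch only: `choose_model` is a hypothetical helper, and treating 16K as 16,384 tokens is an assumption (the 4,096 and 32,768 limits come from the results table).

```python
# (maximum context in tokens, Hugging Face repo), smallest first
FAMILY = [
    (4096, "enfuse/smol-tools-4b"),
    (16384, "enfuse/smol-tools-4b-16k"),
    (32768, "enfuse/smol-tools-4b-32k"),
]


def choose_model(expected_tokens):
    """Return the smallest family member whose context covers the session."""
    for limit, repo in FAMILY:
        if expected_tokens <= limit:
            return repo
    raise ValueError(f"{expected_tokens} tokens exceeds the largest context (32768)")


print(choose_model(3000))
print(choose_model(20000))
```

Picking the smallest model that fits follows the guidance above, since the shorter-context variants report slightly higher Tool F1.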

When to Use This Model

You're building an agent that needs extended multi-turn tool conversations — research assistants, code generators, data analysis pipelines with many rounds
Your tool outputs are very long — large database results, full source files, lengthy web scrapes that push context well beyond 16K
You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
You want structured output you can trust at long context — 100% JSON validity even with 32K of accumulated context

When NOT to Use This Model

If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
If your conversations stay under 16K tokens, use smol-tools-4b-16k — it's slightly more accurate and trains faster
If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model

Limitations

complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
Trained on synthetic data only — real-world tool-use patterns may differ
The 32K training data was generated by a 27B teacher model; very long conversations (>24K tokens) may show quality degradation compared to shorter ones
Inherits Qwen3.5-4B base model limitations (knowledge cutoff)

Quantization Results (200-example eval)

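| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BF16 | 9.7 GB | 0.940 | 0.940 | 0.965 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.918 | 0.917 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.925 | 0.925 | 0.955 | 100% | 100% | 100% |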

All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization is roughly 3.3x smaller than BF16 (2.9 GB vs. 9.7 GB) and retains about 98% of the BF16 model's Tool F1 (0.925 vs. 0.940).
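
The compression and retention figures can be recomputed from the table. The sketch below uses the rounded sizes and F1 values shown above, so the ratios are approximate.

```python
# format -> (size in GB, Tool F1), copied from the table above
formats = {
    "BF16": (9.7, 0.940),
    "Q8_0": (4.9, 0.918),
    "Q4_K_M": (2.9, 0.925),
}

base_size, base_f1 = formats["BF16"]

# For each format: (how many times smaller than BF16, % of BF16 Tool F1 retained)
summary = {
    name: (round(base_size / size, 1), round(100 * f1 / base_f1, 1))
    for name, (size, f1) in formats.items()
}

for name, (shrink, retained) in summary.items():
    print(f"{name}: {shrink}x smaller, {retained}% of BF16 Tool F1")
```

On a 200-example eval, the 0.007 gap between Q8_0 and Q4_K_M corresponds to only a handful of examples, so the ordering between the two quantized formats should be read with caution.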

Hardware

Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~29.4 hours
Inference (BF16, 32K context): Any GPU with ≥24 GB VRAM
Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
Inference (Q8_0 GGUF, 32K context): Any device with ≥12 GB RAM — Jetson Orin AGX, consumer GPUs
Inference (Q8_0 GGUF, 4K context): Any device with ≥6 GB RAM
Inference (Q4_K_M GGUF, 32K context): Any device with ≥8 GB RAM — Jetson Orin NX, phones
Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM — Jetson Orin Nano, RPi 5
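
The inference rows above can be turned into a small lookup for picking a format. This is a simplified sketch: `formats_that_fit` is a hypothetical helper, it treats GPU VRAM (BF16) and system RAM (GGUF) as a single memory number, and it only knows the configurations listed here.

```python
# (format, context in tokens, minimum memory in GB), highest precision first
REQUIREMENTS = [
    ("BF16", 32768, 24),
    ("BF16", 16384, 16),
    ("BF16", 4096, 10),
    ("Q8_0 GGUF", 32768, 12),
    ("Q8_0 GGUF", 4096, 6),
    ("Q4_K_M GGUF", 32768, 8),
    ("Q4_K_M GGUF", 4096, 4),
]


def formats_that_fit(memory_gb, context_tokens):
    """List formats with a configuration that covers the context and fits in memory."""
    fits = []
    for fmt, ctx, min_gb in REQUIREMENTS:
        if ctx >= context_tokens and memory_gb >= min_gb and fmt not in fits:
            fits.append(fmt)
    return fits


print(formats_that_fit(8, 32768))
print(formats_that_fit(16, 4096))
```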