smol-tools-4b-16k — Long-Context Agentic Tool-Use Model
A 4B-parameter model fine-tuned for reliable tool calling with a 16K-token context window. It handles multi-turn tool-use conversations, long document analysis with tool calls, and complex multi-step agent workflows, all within a context window 4x larger than that of the original smol-tools-4b.
Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 7,557 examples including multi-turn tool-use conversations up to 16K tokens.
Architecture: Qwen3_5ForCausalLM (text-only, 32 layers, hybrid attention with 24 linear-attention and 8 full-attention layers). Only the 8 full-attention layers keep a key-value cache that grows with sequence length, which keeps long-context inference memory-friendly.
Need more context? See smol-tools-4b-32k for 32K context support. Need less? See smol-tools-4b for the highest-accuracy 4K variant.
Available Formats
| Format | File | Size | Tool F1 | Use Case |
|---|---|---|---|---|
| BF16 safetensors | model.safetensors | 9.7 GB | 0.948 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | smol-tools-4b-16k-q8_0.gguf | 4.9 GB | 0.923 | Near-lossless: Jetson Orin NX/AGX, 8GB+ GPUs |
| Q4_K_M GGUF | smol-tools-4b-16k-q4_k_m.gguf | 2.9 GB | 0.928 | Edge deployment: Jetson Orin Nano, phones, RPi 5 |
All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.
Why 16K Context?
The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:
- Multi-turn conversations — 5-10 rounds of tool calls and results accumulating in context
- Long tool outputs — database queries, file reads, and web pages that consume thousands of tokens
- Complex planning — reasoning over prior results to decide next steps
This model was specifically trained on multi-turn tool-use data (up to 16K tokens) so it maintains tool-calling accuracy across long conversations, not just short single-turn queries.
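To make the accumulation concrete, here is a minimal sketch of tracking how much of the 16,384-token window a conversation has used. The `rough_token_count` helper is a hypothetical stand-in (about 4 characters per token) and `reserve_for_reply` is an illustrative parameter; in practice the model's own tokenizer would give exact counts.

```python
from typing import Callable, Dict, List

MAX_CONTEXT = 16_384  # context window stated for this model


def rough_token_count(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def remaining_budget(
    messages: List[Dict[str, str]],
    count_tokens: Callable[[str], int] = rough_token_count,
    reserve_for_reply: int = 512,
) -> int:
    """Tokens left for further tool results after reserving room for the reply."""
    used = sum(count_tokens(m["content"]) for m in messages)
    return MAX_CONTEXT - reserve_for_reply - used


conversation = [
    {"role": "user", "content": "Summarise the sales table."},
    {"role": "tool", "content": "row," * 10_000},  # one long database result
]
print(remaining_budget(conversation))
```

A single long tool result like the one above would not fit in a 4K window at all, which is the situation the 16K variant targets.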
Results (200-example held-out eval)
| Metric | smol-tools-4b (4K) | smol-tools-4b-16k | Delta |
|---|---|---|---|
| Tool Selection F1 | 0.955 | 0.948 | -0.007 |
| Tool Precision | 0.955 | 0.948 | -0.007 |
| Tool Recall | 0.980 | 0.975 | -0.005 |
| JSON Validity | 100% | 100% | — |
| Argument Correctness | 100% | 100% | — |
| No-Tool Accuracy | 100% | 100% | — |
| Max Context | 4,096 | 16,384 | 4x |
Tool-calling accuracy stays within 0.007 F1 of the 4K model while supporting 4x the context length.
Per-Scenario Breakdown
| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| multi_tool_sequential | 1.000 | 36 | Chained tool calls with dependencies |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| error_recovery | 1.000 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.962 | 53 | One tool call needed |
| reasoning_heavy | 0.876 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
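The card does not state how the overall Tool F1 is aggregated. As a quick consistency check, the sketch below computes the count-weighted mean of the per-scenario F1 values from the table; under that assumption the result matches the reported 0.948 and the counts add up to the 200 evaluation examples.

```python
# (F1, example count) per scenario, copied from the table above
scenarios = {
    "multi_tool_parallel": (1.000, 18),
    "multi_tool_sequential": (1.000, 36),
    "no_tool_needed": (1.000, 18),
    "error_recovery": (1.000, 18),
    "single_tool": (0.962, 53),
    "reasoning_heavy": (0.876, 35),
    "complex_multi_step": (0.818, 22),
}

total = sum(count for _, count in scenarios.values())
weighted_f1 = sum(f1 * count for f1, count in scenarios.values()) / total

print(total, round(weighted_f1, 3))
```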
Capabilities
- 16K context window — handles multi-turn agent conversations with accumulated tool results
- Tool selection: Picks the right tool(s) from a provided set with 94.8% F1
- Structured output: Produces valid <tool_call>{"name": "...", "arguments": {...}}</tool_call> JSON with 100% validity
- Tool refusal: Correctly answers directly when no tool is needed, with 100% accuracy
- Multi-tool: Handles parallel and sequential multi-tool scenarios perfectly
- Error recovery: Perfect F1 on error recovery scenarios (malformed inputs, missing data)
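Because the output format is regular, tool calls can be pulled out of a generation with a small parser. The following is a minimal sketch using only the standard library; `parse_tool_calls` is a hypothetical helper name, and skipping malformed blocks is one possible design choice rather than something the card prescribes.

```python
import json
import re
from typing import Any, Dict, List

# Matches everything between <tool_call> and </tool_call>, across newlines
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return every well-formed tool call found in the model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(call, dict) and "name" in call:
            call.setdefault("arguments", {})
            calls.append(call)
    return calls


output = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"location": "Tokyo", "units": "celsius"}}\n'
    "</tool_call>"
)
print(parse_tool_calls(output))
```

An empty list means the model answered directly, which corresponds to the no-tool case.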
Available Tools (training set)
The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:
web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the BF16 weights; device_map="auto" places them on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b-16k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-16k", trust_remote_code=True)

# Tool schemas are passed as a list of function definitions
tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

# Render the conversation and tool schemas into a single prompt string
prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
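For multi-turn use, each tool call has to be executed and its result appended to the conversation before generating again. Below is a minimal sketch of that dispatch step. The `REGISTRY`, the toy `calculate` implementation, and feeding results back as `{"role": "tool", ...}` messages are all assumptions for illustration; the card does not specify the message format expected for tool results.

```python
import json
from typing import Any, Callable, Dict, List


def calculate(expression: str) -> str:
    """Toy calculator for this sketch: supports only 'a * b'."""
    left, right = expression.split("*")
    return str(float(left) * float(right))


# Hypothetical mapping from tool names to local Python functions
REGISTRY: Dict[str, Callable[..., str]] = {"calculate": calculate}


def run_tool_calls(
    calls: List[Dict[str, Any]], messages: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Execute each parsed call and append its result as a tool message."""
    for call in calls:
        func = REGISTRY.get(call["name"])
        if func is None:
            result = json.dumps({"error": f"unknown tool {call['name']}"})
        else:
            result = func(**call.get("arguments", {}))
        messages.append({"role": "tool", "name": call["name"], "content": result})
    return messages


history = [{"role": "user", "content": "How many euros is 500 USD at 0.92?"}]
parsed = [{"name": "calculate", "arguments": {"expression": "500 * 0.92"}}]
print(run_tool_calls(parsed, history)[-1])
```

After this step the extended message list would be passed through `apply_chat_template` again for the next round.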
With vLLM (faster)
from vllm import LLM, SamplingParams
# `prompt` is the chat-templated string built in the Quick Start snippet above
llm = LLM(model="enfuse/smol-tools-4b-16k", dtype="bfloat16", max_model_len=16384, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
With llama.cpp (GGUF, edge devices)
# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-16k smol-tools-4b-16k-q4_k_m.gguf --local-dir .
# Run with llama-server (OpenAI-compatible API)
# -c sets the context size in tokens: 4096 keeps memory low; use -c 16384 for the full 16K window
llama-server -m smol-tools-4b-16k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080
# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-16k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"
# Or with llama-cpp-python (Python, not shell); n_ctx plays the same role as -c,
# and `prompt` is the chat-templated string built in the Quick Start snippet
from llama_cpp import Llama
llm = Llama(model_path="smol-tools-4b-16k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
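Once `llama-server` is running on port 8080, it can be queried over its OpenAI-compatible HTTP API. The sketch below builds such a request with the standard library only; `build_chat_request` is a hypothetical helper, the `model` value is a placeholder, and the actual network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(
    content: str,
    base_url: str = "http://localhost:8080/v1",
    model: str = "smol-tools-4b-16k",  # placeholder model name for the request
) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.1,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What is the capital of France?")
print(request.full_url)

# To send it while llama-server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```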
Example Outputs (real model outputs, not hand-written)
These are actual outputs from the model at temperature=0.1:
Example 1 — Single tool call (input: "What's the weather like in Tokyo right now?")
<think>
The user is asking about the current weather in Tokyo. I have a weather tool
available that can provide this information. Let me use the get_weather function
with Tokyo as the location. Since Tokyo is in Japan, I'll use Celsius as the
default unit.
</think>
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo", "units": "celsius"}}
</tool_call>
Example 2 — No tool needed (input: "What is the capital of France?")
<think>
The user is asking a simple factual question about the capital of France. This
is general knowledge that I can answer directly without needing to use any of the
available tools (get_weather, web_search, or calculate).
</think>
The capital of France is **Paris**.
This is a well-known geographical fact that doesn't require any tools to verify.
Paris has been the capital of France for centuries and is one of the world's most
famous cities, known for landmarks like the Eiffel Tower, the Louvre Museum, and
Notre-Dame Cathedral.
Example 3 — Multiple parallel tool calls (input: "I'm planning a trip to Paris next week. What's the weather forecast, and can you calculate how many euros I'd get for 500 USD at a rate of 0.92?")
<think>
The user is asking for two things:
1. Weather forecast for Paris next week
2. Currency conversion calculation (500 USD to euros at 0.92 rate)
For the weather, I can use the get_weather function. For the calculation, I can
use the calculate function with the expression "500 * 0.92" to convert USD to
euros. Let me make both function calls.
</think>
I'll help you with both the weather information and the currency calculation for
your Paris trip.
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris", "units": "celsius"}}
</tool_call>
<tool_call>
{"name": "calculate", "arguments": {"expression": "500 * 0.92"}}
</tool_call>
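As the examples show, each response starts with a `<think>` block before the answer or tool calls. A minimal sketch for separating the reasoning from the rest of the output follows; `split_reasoning` is a hypothetical helper, and it assumes at most one `<think>` block per response, as in the examples above.

```python
import re
from typing import Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, visible_output) for one model response."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    visible = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, visible


response = (
    "<think>\nThis is general knowledge; no tool is needed.\n</think>\n"
    "The capital of France is **Paris**."
)
reasoning, visible = split_reasoning(response)
print(visible)
```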
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 7,557 (6,855 short-context + 702 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 2 (× 16 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 16,384 |
| Eval loss | 0.134 |
| Token accuracy | 95.3% |
| Training time | ~14.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
Data Pipeline
- Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
- Multi-turn long-context data (702 examples): New synthetic conversations generated by a Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). Cleaned to remove malformed tool calls and to ensure proper conversation endings
- Combined: 7,557 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤16K tokens
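The numbers in the training table can be cross-checked with a few lines of arithmetic. The sketch below is only that: the step estimate assumes every example is one unpacked sequence, that all 7,557 examples are used for training, and that the last partial batch is kept, none of which the card confirms. The LoRA scale uses the standard `alpha / rank` definition.

```python
import math

# Values from the Training Details table
per_device_batch = 2
grad_accum_steps = 16
epochs = 3
short_examples, long_examples = 6_855, 702
lora_rank, lora_alpha = 64, 128

effective_batch = per_device_batch * grad_accum_steps
total_examples = short_examples + long_examples
long_share = long_examples / total_examples  # fraction of long-context data
lora_scale = lora_alpha / lora_rank          # standard LoRA scaling factor

# Assumption-laden estimate of optimizer steps (see the paragraph above)
steps_per_epoch = math.ceil(total_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, total_examples, f"{long_share:.1%}", lora_scale, total_steps)
```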
smol-tools Family
All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:
| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA Config | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | this repo |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-32k |
How to choose:
- 4K: Single-turn tool calls with short tool outputs. Highest accuracy and lowest memory; also available in GGUF quantized formats
- 16K (this model): Multi-turn conversations (5-10 rounds) with moderate tool outputs. Best balance of accuracy and context length
- 32K: Extended agent sessions (10-20 rounds) with large tool outputs; also available in GGUF quantized formats
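The selection rule above can be sketched as a lookup on the expected conversation length. `pick_model` is a hypothetical helper, and the exact limit of 32,768 tokens for the 32K variant is an assumption (the card only states "32K"); the repo name for this model is taken from the page.

```python
from typing import List, Tuple

# (max context in tokens, Hugging Face repo), smallest first
FAMILY: List[Tuple[int, str]] = [
    (4_096, "enfuse/smol-tools-4b"),
    (16_384, "enfuse/smol-tools-4b-16k"),
    (32_768, "enfuse/smol-tools-4b-32k"),  # assumed exact limit for "32K"
]


def pick_model(expected_tokens: int) -> str:
    """Return the smallest family member whose context fits the conversation."""
    for limit, repo in FAMILY:
        if expected_tokens <= limit:
            return repo
    raise ValueError(f"{expected_tokens} tokens exceeds every model in the family")


print(pick_model(3_000))
print(pick_model(12_000))
```

Picking the smallest fitting variant follows the card's note that the shorter-context models score slightly higher on Tool F1.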
When to Use This Model
- You're building an agent that needs multi-turn tool conversations — research assistants, code generators, data analysis pipelines
- Your tool outputs are long — database query results, file contents, web scrapes that push context beyond 4K
- You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
- You want structured output you can trust at long context — 100% JSON validity even with 16K of accumulated context
When NOT to Use This Model
- If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
- If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
- If your conversations routinely exceed 16K tokens, use smol-tools-4b-32k (32K context)
Limitations
- complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
- Trained on synthetic data only — real-world tool-use patterns may differ
- The 16K training data was generated by a 27B teacher model; very long conversations (>12K tokens) may show quality degradation compared to shorter ones
- Inherits the limitations of the Qwen3.5-4B base model, such as its knowledge cutoff
Quantization Results (200-example eval)
| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
|---|---|---|---|---|---|---|---|
| BF16 | 9.7 GB | 0.948 | 0.948 | 0.975 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.923 | 0.922 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.928 | 0.927 | 0.950 | 100% | 100% | 100% |
All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization (about 3.3x smaller by the listed sizes, 2.9 GB vs 9.7 GB) retains 98% of the BF16 model's Tool F1 (0.928 vs 0.948).
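The size and accuracy trade-off can be recomputed directly from the table. This sketch uses the rounded sizes and F1 values as listed, so the ratios are approximate.

```python
# (size in GB, Tool F1) per format, copied from the table above
results = {
    "BF16": (9.7, 0.948),
    "Q8_0": (4.9, 0.923),
    "Q4_K_M": (2.9, 0.928),
}

base_size, base_f1 = results["BF16"]
size_ratio = {name: base_size / size for name, (size, _) in results.items()}
f1_retention = {name: f1 / base_f1 for name, (_, f1) in results.items()}

for name in results:
    print(f"{name}: {size_ratio[name]:.2f}x smaller, {f1_retention[name]:.1%} of BF16 Tool F1")
```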
Hardware
- Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~14.4 hours
- Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
- Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
- Inference (Q8_0 GGUF, 16K context): Any device with ≥8 GB RAM — Jetson Orin NX/AGX, consumer GPUs
- Inference (Q4_K_M GGUF, 16K context): Any device with ≥5 GB RAM — Jetson Orin Nano, phones, RPi 5
- Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM
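As a rough planning aid, the inference requirements above can be turned into a lookup. The sketch treats GPU VRAM and system RAM as a single memory budget, which is a simplification, and `options_for` is a hypothetical helper.

```python
from typing import List, Tuple

# (format, context in tokens, minimum memory in GB), from the list above
REQUIREMENTS: List[Tuple[str, int, int]] = [
    ("BF16", 16_384, 16),
    ("BF16", 4_096, 10),
    ("Q8_0", 16_384, 8),
    ("Q4_K_M", 16_384, 5),
    ("Q4_K_M", 4_096, 4),
]


def options_for(memory_gb: float) -> List[Tuple[str, int]]:
    """List every (format, context) pair whose stated minimum fits the budget."""
    return [(fmt, ctx) for fmt, ctx, need in REQUIREMENTS if memory_gb >= need]


print(options_for(8))
```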
Attribution
- Base model: Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled by Jackrong
- Training framework: TRL + PEFT by HuggingFace
- Inference: vLLM
Model tree for enfuse/smol-tools-4b-16k
- Base model: Qwen/Qwen3.5-4B-Base
Evaluation results (self-reported)
- Tool Selection F1: 0.948
- JSON Validity: 1.000
- No-Tool Accuracy: 1.000