smol-tools-4b-32k — Long-Context Agentic Tool-Use Model
A 4B-parameter model fine-tuned for reliable tool calling with 32K context support. It handles extended multi-turn tool-use conversations, long-document analysis with tool calls, and complex multi-step agent workflows, all within a single context window 8x the size of the one in the original smol-tools-4b.
Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 8,059 examples including multi-turn tool-use conversations up to 32K tokens.
Architecture: Qwen3_5ForCausalLM (text-only, 32 layers, hybrid attention with 24 linear + 8 full-attention layers). Qwen3.5's efficient attention makes long-context inference memory-friendly.
Need less context? See smol-tools-4b-16k (16K, also available in GGUF) or smol-tools-4b (4K, highest accuracy, GGUF also available).
Available Formats
| Format | File | Size | Tool F1 | Use Case |
|---|---|---|---|---|
| BF16 safetensors | model.safetensors | 9.7 GB | 0.940 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | smol-tools-4b-32k-q8_0.gguf | 4.9 GB | 0.918 | Near-lossless: Jetson Orin NX/AGX, 8GB+ GPUs |
| Q4_K_M GGUF | smol-tools-4b-32k-q4_k_m.gguf | 2.9 GB | 0.925 | Edge deployment: Jetson Orin Nano, phones, RPi 5 |
All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.
Why 32K Context?
The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:
- Multi-turn conversations — 10-20 rounds of tool calls and results accumulating in context
- Long tool outputs — database queries returning hundreds of rows, full file contents, and lengthy web pages
- Complex planning — reasoning over many prior results to decide next steps
This model was specifically trained on multi-turn tool-use data (up to 32K tokens) so it maintains tool-calling accuracy across very long conversations, not just short single-turn queries. 32K tokens is enough for most real-world agent sessions.
Results (200-example held-out eval)
| Metric | smol-tools-4b (4K) | smol-tools-4b-32k | Delta |
|---|---|---|---|
| Tool Selection F1 | 0.955 | 0.940 | -0.015 |
| Tool Precision | 0.955 | 0.940 | -0.015 |
| Tool Recall | 0.980 | 0.965 | -0.015 |
| JSON Validity | 100% | 100% | — |
| Argument Correctness | 100% | 100% | — |
| No-Tool Accuracy | 100% | 100% | — |
| Max Context | 4,096 | 32,768 | 8x |
The 32K model gives up only 0.015 F1 (about 1.6% relative) compared to the 4K model while supporting 8x the context length. Perfect JSON validity and argument correctness are preserved.
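The card does not spell out how Tool Selection F1 is computed. One common convention, assumed here purely for illustration, scores each example by comparing the set of predicted tool names with the set of expected ones and then averages over examples. The function names below are hypothetical, not part of any released eval harness.

```python
def tool_prf(predicted, expected):
    """Precision, recall, and F1 over sets of tool names for one example."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        # No tool expected and none called: treat as a perfect match (assumption).
        return 1.0, 1.0, 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def mean_f1(examples):
    """Average per-example F1 over (predicted, expected) pairs."""
    scores = [tool_prf(p, e)[2] for p, e in examples]
    return sum(scores) / len(scores)


examples = [
    (["web_search"], ["web_search"]),               # exact match
    (["web_search", "calculate"], ["web_search"]),  # one extra tool called
    ([], []),                                       # no tool needed
]
print(round(mean_f1(examples), 3))  # 0.889
```

Averaging per example (rather than pooling counts) would explain why the reported F1 need not equal the harmonic mean of the reported precision and recall, but that reading is an assumption.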
Per-Scenario Breakdown
| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| multi_tool_sequential | 0.972 | 36 | Chained tool calls with dependencies |
| error_recovery | 0.944 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.943 | 53 | One tool call needed |
| reasoning_heavy | 0.914 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
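As a quick sanity check, the per-scenario scores can be combined into a count-weighted average. This assumes the overall F1 is a plain per-example mean, which the card does not state explicitly; under that assumption the result lands on the reported 0.940.

```python
# (scenario, F1, count) rows copied from the table above
rows = [
    ("multi_tool_parallel", 1.000, 18),
    ("no_tool_needed", 1.000, 18),
    ("multi_tool_sequential", 0.972, 36),
    ("error_recovery", 0.944, 18),
    ("single_tool", 0.943, 53),
    ("reasoning_heavy", 0.914, 35),
    ("complex_multi_step", 0.818, 22),
]

total = sum(count for _, _, count in rows)
weighted_f1 = sum(f1 * count for _, f1, count in rows) / total

print(total, round(weighted_f1, 3))  # 200 0.94
```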
Capabilities
- 32K context window — handles extended multi-turn agent conversations with accumulated tool results
- Tool selection: Picks the right tool(s) from a provided set with 94.0% F1
- Structured output: Produces valid <tool_call>{"name": "...", "arguments": {...}}</tool_call> JSON, with 100% validity
- Tool refusal: Correctly answers directly when no tool is needed, with 100% accuracy
- Multi-tool: Handles parallel and sequential multi-tool scenarios with high accuracy
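To act on the structured output, the caller has to pull the JSON out of the <tool_call> tags. The following is a minimal sketch using only the standard library; `extract_tool_calls` is a hypothetical helper name, and the sample text is hand-written rather than real model output.

```python
import json
import re

# Everything between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in generated text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


sample = (
    "Let me look that up.\n"
    '<tool_call>{"name": "web_search", "arguments": {"query": "SpaceX news"}}</tool_call>'
)
print(extract_tool_calls(sample))
```

An empty list means the model chose to answer directly, which corresponds to the tool-refusal case above.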
Available Tools (training set)
The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:
web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"enfuse/smol-tools-4b-32k",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-32k", trust_remote_code=True)
tools = [
{"type": "function", "function": {
"name": "web_search",
"description": "Search the web for information",
"parameters": {"type": "object", "properties": {
"query": {"type": "string"}
}, "required": ["query"]}
}}
]
messages = [
{"role": "system", "content": "You are a helpful assistant with access to tools."},
{"role": "user", "content": "What's the latest news about SpaceX?"},
]
prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
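For multi-turn use, each tool call and its result have to be appended to the message history before generating again. The sketch below shows that bookkeeping with the assistant `tool_calls` / `tool` role layout described in the Transformers chat-templating documentation; whether this model's chat template renders that layout exactly as intended is an assumption worth verifying. The `add` tool and the helper names are hypothetical.

```python
import json


def run_tool(name, arguments, registry):
    """Call a local Python function by tool name; report failures as JSON text."""
    func = registry.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps({"result": func(**arguments)})
    except Exception as exc:  # surface tool errors to the model instead of crashing
        return json.dumps({"error": str(exc)})


def append_tool_round(messages, calls, registry):
    """Record the assistant's tool calls and each tool's result in the history."""
    messages.append({
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"type": "function",
             "function": {"name": c["name"], "arguments": c["arguments"]}}
            for c in calls
        ],
    })
    for c in calls:
        messages.append({
            "role": "tool",
            "name": c["name"],
            "content": run_tool(c["name"], c["arguments"], registry),
        })
    return messages


# Hypothetical tool and a hand-written call standing in for parsed model output.
registry = {"add": lambda a, b: a + b}
history = [{"role": "user", "content": "What is 2 + 3?"}]
append_tool_round(history, [{"name": "add", "arguments": {"a": 2, "b": 3}}], registry)
print(history[-1]["content"])  # {"result": 5}
```

After each round, the history can be re-rendered with apply_chat_template (passing the same tools list and add_generation_prompt=True) and generation repeated until the model replies without a tool call.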
With vLLM (faster)
from vllm import LLM, SamplingParams
# `prompt` is the chat-templated string built in Quick Start above
llm = LLM(model="enfuse/smol-tools-4b-32k", dtype="bfloat16", max_model_len=32768, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
With llama.cpp (GGUF, edge devices)
# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-32k smol-tools-4b-32k-q4_k_m.gguf --local-dir .
# Run with llama-server (OpenAI-compatible API)
# -c sets the context size in tokens; 4096 keeps memory low, raise it (up to 32768) for long sessions
llama-server -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080
# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-32k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"
# Or with llama-cpp-python (Python, not shell; `prompt` is the chat-templated string from Quick Start)
from llama_cpp import Llama
# n_ctx mirrors -c above; n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(model_path="smol-tools-4b-32k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
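Once llama-server is running, it can be called from Python over its OpenAI-compatible endpoint. The sketch below builds the request with the standard library only. The `get_weather` schema is illustrative, the `model` value is a placeholder (llama-server typically serves whichever model it loaded), and depending on the llama.cpp version, tool calling through this endpoint may require starting the server with its Jinja chat-template option (`--jinja`); treat those details as assumptions to check against your build.

```python
import json
import urllib.request

# Illustrative tool schema in the same format as the Quick Start example.
tools = [
    {"type": "function", "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {"type": "object", "properties": {
            "city": {"type": "string"}
        }, "required": ["city"]},
    }}
]


def build_chat_request(user_text, tools, base_url="http://localhost:8080"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": "smol-tools-4b-32k-q4_k_m",  # placeholder name
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "temperature": 0.1,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_chat_request("What's the weather in Paris?", tools)
# response = send(request)  # uncomment with llama-server listening on port 8080
```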
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 8,059 (6,855 short-context + 1,204 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 1 (× 32 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 32,768 |
| Eval loss | 0.140 |
| Train loss | 0.229 |
| Training time | ~29.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
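The numbers in the table imply a few derived quantities. The arithmetic below is a rough sketch: it assumes a single GPU (as stated), that a final partial batch still counts as an optimizer step, and the standard LoRA scaling factor of alpha / rank. The exact step count depends on how the trainer handles the last batch.

```python
import math

# Values copied from the Training Details table
num_examples = 8059
per_device_batch = 1
grad_accum = 32
epochs = 3
lora_rank, lora_alpha = 64, 128

effective_batch = per_device_batch * grad_accum               # examples per optimizer step
steps_per_epoch = math.ceil(num_examples / effective_batch)   # approximate
total_steps = steps_per_epoch * epochs
lora_scale = lora_alpha / lora_rank                           # multiplier on the LoRA update

print(effective_batch, steps_per_epoch, total_steps, lora_scale)
```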
Data Pipeline
- Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
- Multi-turn long-context data (1,204 examples): New synthetic conversations generated by a Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). These were cleaned to remove malformed tool calls and to ensure proper conversation endings
- Combined: 8,059 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤32K tokens
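The final length filter can be sketched as follows. This is not the authors' pipeline: the helper name is hypothetical, and the whitespace token counter is a stand-in so the example runs without downloading a tokenizer. A real filter would presumably count tokens with the model tokenizer on the chat-templated text.

```python
def filter_by_token_budget(conversations, count_tokens, max_tokens=32768):
    """Keep conversations whose summed message length fits the token budget."""
    kept = []
    for conversation in conversations:
        total = sum(count_tokens(message["content"]) for message in conversation)
        if total <= max_tokens:
            kept.append(conversation)
    return kept


def whitespace_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


short = [{"role": "user", "content": "list files"},
         {"role": "assistant", "content": "calling list_directory"}]
long = [{"role": "tool", "content": "row " * 50}]

# A tiny budget is used here only to make the demo visible.
kept = filter_by_token_budget([short, long], whitespace_tokens, max_tokens=10)
print(len(kept))  # 1
```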
smol-tools Family
All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:
| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | Parameters | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-16k |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | this repo |
How to choose:
- 4K: Single-turn tool calls, short tool outputs. Highest accuracy, lowest memory. Also available in GGUF quantized formats
- 16K: Multi-turn conversations (5-10 rounds), moderate tool outputs. Also available in GGUF quantized formats
- 32K (this model): Extended agent sessions (10-20 rounds), large tool outputs. Also available in GGUF quantized formats
When to Use This Model
- You're building an agent that needs extended multi-turn tool conversations — research assistants, code generators, data analysis pipelines with many rounds
- Your tool outputs are very long — large database results, full source files, lengthy web scrapes that push context well beyond 16K
- You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
- You want structured output you can trust at long context — 100% JSON validity even with 32K of accumulated context
When NOT to Use This Model
- If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
- If your conversations stay under 16K tokens, use smol-tools-4b-16k — it's slightly more accurate and trains faster
- If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
Limitations
- complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
- Trained on synthetic data only — real-world tool-use patterns may differ
- The 32K training data was generated by a 27B teacher model; very long conversations (>24K tokens) may show quality degradation compared to shorter ones
- Inherits Qwen3.5-4B base model limitations (knowledge cutoff)
Quantization Results (200-example eval)
| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
|---|---|---|---|---|---|---|---|
| BF16 | 9.7 GB | 0.940 | 0.940 | 0.965 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.918 | 0.917 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.925 | 0.925 | 0.955 | 100% | 100% | 100% |
All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization (about 3.3x smaller than BF16) retains 98% of the BF16 model's tool-calling F1.
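The size and accuracy trade-off can be read directly off the table. This short calculation, using only the figures above, shows the compression ratio and the share of BF16 F1 retained by each format.

```python
# format -> (size in GB, Tool F1), copied from the table above
formats = {
    "BF16": (9.7, 0.940),
    "Q8_0": (4.9, 0.918),
    "Q4_K_M": (2.9, 0.925),
}

base_size, base_f1 = formats["BF16"]
ratios = {
    name: (base_size / size, f1 / base_f1)
    for name, (size, f1) in formats.items()
}

for name, (compression, retained) in ratios.items():
    print(f"{name}: {compression:.2f}x smaller, {retained:.1%} of BF16 F1")
```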
Hardware
- Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~29.4 hours
- Inference (BF16, 32K context): Any GPU with ≥24 GB VRAM
- Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
- Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
- Inference (Q8_0 GGUF, 32K context): Any device with ≥12 GB RAM — Jetson Orin AGX, consumer GPUs
- Inference (Q8_0 GGUF, 4K context): Any device with ≥6 GB RAM
- Inference (Q4_K_M GGUF, 32K context): Any device with ≥8 GB RAM — Jetson Orin NX, phones
- Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM — Jetson Orin Nano, RPi 5
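The list above can be encoded as a small lookup for picking a format on a given device. This is a convenience sketch with a hypothetical helper name; the thresholds are copied from the list, where BF16 figures refer to GPU VRAM and GGUF figures to device RAM.

```python
# (format, context length in tokens, minimum memory in GB) from the Hardware list
REQUIREMENTS = [
    ("BF16", 32768, 24),
    ("BF16", 16384, 16),
    ("BF16", 4096, 10),
    ("Q8_0", 32768, 12),
    ("Q8_0", 4096, 6),
    ("Q4_K_M", 32768, 8),
    ("Q4_K_M", 4096, 4),
]


def options_for(memory_gb, min_context=4096):
    """List (format, context) pairs that fit in memory and meet the context need."""
    return [
        (fmt, ctx)
        for fmt, ctx, needed_gb in REQUIREMENTS
        if needed_gb <= memory_gb and ctx >= min_context
    ]


print(options_for(8, min_context=32768))  # [('Q4_K_M', 32768)]
print(options_for(4))                     # [('Q4_K_M', 4096)]
```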
Attribution
- Base model: Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled by Jackrong
- Training framework: TRL + PEFT by HuggingFace
- Inference: vLLM