# qwen3-0.6b-tool-router
A low-latency, schema-strict tool/function-calling model optimized for edge-device inference.
## Overview
qwen3-0.6b-tool-router is a verticalized Small Language Model (SLM) derived from Qwen3-0.6B, purpose-built for tool and function routing under strict JSON schemas.
Unlike general-purpose chat or instruction-following models, this model is optimized to run as a deterministic router in agentic systems, especially in resource-constrained edge environments (e.g., CPUs, embedded GPUs, mobile accelerators).
Its sole responsibility is to map natural-language queries to structured tool calls reliably, with minimal latency and zero tolerance for hallucinated tools.
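For example, with the single `get_weather` tool from the usage section below, a query maps to exactly one structured call. This is a sketch of the intended contract; the payload shape assumes the Qwen-style `{"name", "arguments"}` convention:

```
User:  What's the weather in Tokyo?
Model: <tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>
```

When no advertised tool fits the query, the model is expected to decline rather than invent one (see the Relevance Detection score below).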
## Key Properties
- Model Size: 0.6B parameters
- No Chain-of-Thought: Disabled to reduce token count and parsing cost
- Strict JSON Output: Designed for direct machine consumption
- Low Memory Footprint: QLoRA fine-tuning, edge-friendly quantization
- Fast Cold Start: Ideal for on-device or near-device inference
This makes it well-suited for:
- On-device assistants
- Local agent routers
- Offline-capable systems
- Privacy-sensitive deployments
## BFCL Results

Scores on the Berkeley Function-Calling Leaderboard (BFCL):
| Category | Score |
|---|---|
| Non-Live Parallel AST | 83.50% |
| Multi-Turn Base | 90.42% |
| Live Simple AST | 62.86% |
| Live Parallel AST | 52.00% |
| Relevance Detection | 90.89% |
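## Usage

A minimal routing example with Hugging Face Transformers: load the model, advertise one tool in the system prompt, and generate a structured call for a user query.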
```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "AryanNsc/qwen3-0.6b-tool-router"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Define a tool with a JSON-schema parameter spec
tools = [{
    "name": "get_weather",
    "description": "Get weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"}
        },
        "required": ["city"]
    }
}]

# Build a system prompt that advertises the available tools
system_prompt = (
    "You may call one or more functions.\n\n"
    "<tools>\n"
    + "\n".join(json.dumps(t) for t in tools)
    + "\n</tools>\n\n"
    "Return the function call inside <tool_call></tool_call> tags."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather in Tokyo?"},
]

# Apply the chat template; thinking stays disabled for this router model
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the routing deterministic
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens
generated = outputs[:, inputs.input_ids.shape[1]:]
text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```
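The printed output should contain a single tool call inside `<tool_call></tool_call>` tags. Because the payload is strict JSON, downstream consumption reduces to a tag strip plus `json.loads`. A minimal parsing sketch follows; `extract_tool_call` is an illustrative helper, not part of the model or library API, and it assumes the Qwen-style `{"name", "arguments"}` payload:

```python
import json
import re

def extract_tool_call(text):
    """Pull the first <tool_call>...</tool_call> payload out of the model output."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None  # no tool call emitted, e.g. for an irrelevant query
    return json.loads(match.group(1))

call = extract_tool_call(text)
if call is not None:
    # Assumes the payload carries "name" and "arguments" keys (Qwen convention)
    print(call["name"], call["arguments"])  # e.g. get_weather {'city': 'Tokyo'}
```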
## Why This Model for Edge Inference?
Edge environments demand:
- Small model size
- Predictable latency
- Deterministic outputs
- Minimal parsing overhead
This model was explicitly trained to satisfy those constraints.
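For especially tight memory budgets, the checkpoint can also be loaded with 4-bit weights via `bitsandbytes`. This is a sketch, not an official recipe: it assumes a CUDA-capable device with the `bitsandbytes` package installed, and the NF4 settings mirror common QLoRA practice rather than a published configuration for this model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_PATH = "AryanNsc/qwen3-0.6b-tool-router"

# NF4 4-bit weights with bf16 compute (assumed settings, matching common QLoRA practice)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```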