# qwen3-0.6b-tool-router
A low-latency, schema-strict tool/function-calling model optimized for edge-device inference.
## Overview
qwen3-0.6b-tool-router is a verticalized Small Language Model (SLM) derived from Qwen3-0.6B, purpose-built for tool and function routing under strict JSON schemas.
Unlike general-purpose chat or instruction-following models, this model is optimized to run as a deterministic router in agentic systems, especially in resource-constrained edge environments (e.g., CPUs, embedded GPUs, mobile accelerators).
Its sole responsibility is to map natural-language queries to structured tool calls reliably, with minimal latency and zero tolerance for hallucinated tools.
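For example, with the single `get_weather` tool from the usage section below, a query maps to exactly one structured call. This is a sketch of the intended contract; the payload shape assumes the Qwen-style `{"name", "arguments"}` convention:

```
User:  What's the weather in Tokyo?
Model: <tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>
```

When no advertised tool fits the query, the model is expected to decline rather than invent one (see the Relevance Detection score below).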
## Key Properties
- Model Size: 0.6B parameters
- No Chain-of-Thought: Disabled to reduce token count and parsing cost
- Strict JSON Output: Designed for direct machine consumption
- Low Memory Footprint: QLoRA fine-tuning, edge-friendly quantization
- Fast Cold Start: Ideal for on-device or near-device inference
This makes it well-suited for:
- On-device assistants
- Local agent routers
- Offline-capable systems
- Privacy-sensitive deployments
## BFCL Results

Scores on the Berkeley Function-Calling Leaderboard (BFCL):
| Category | Score |
|---|---|
| Non-Live Parallel AST | 83.50% |
| Multi-Turn Base | 90.42% |
| Live Simple AST | 62.86% |
| Live Parallel AST | 52.00% |
| Relevance Detection | 90.89% |
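## Usage

A minimal routing example with Hugging Face Transformers: load the model, advertise one tool in the system prompt, and generate a structured call for a user query.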
```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "AryanNsc/qwen3-0.6b-tool-router"

# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Define a tool with a JSON-schema parameter spec
tools = [{
    "name": "get_weather",
    "description": "Get weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"}
        },
        "required": ["city"]
    }
}]

# Build a system prompt that advertises the available tools
system_prompt = (
    "You may call one or more functions.\n\n"
    "<tools>\n"
    + "\n".join(json.dumps(t) for t in tools)
    + "\n</tools>\n\n"
    "Return the function call inside <tool_call></tool_call> tags."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather in Tokyo?"},
]

# Apply the chat template; thinking stays disabled for this router model
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the routing deterministic
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens
generated = outputs[:, inputs.input_ids.shape[1]:]
text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```
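The printed output should contain a single tool call inside `<tool_call></tool_call>` tags. Because the payload is strict JSON, downstream consumption reduces to a tag strip plus `json.loads`. A minimal parsing sketch follows; `extract_tool_call` is an illustrative helper, not part of the model or library API, and it assumes the Qwen-style `{"name", "arguments"}` payload:

```python
import json
import re

def extract_tool_call(text):
    """Pull the first <tool_call>...</tool_call> payload out of the model output."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None  # no tool call emitted, e.g. for an irrelevant query
    return json.loads(match.group(1))

call = extract_tool_call(text)
if call is not None:
    # Assumes the payload carries "name" and "arguments" keys (Qwen convention)
    print(call["name"], call["arguments"])  # e.g. get_weather {'city': 'Tokyo'}
```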
## Why This Model for Edge Inference?
Edge environments demand:
- Small model size
- Predictable latency
- Deterministic outputs
- Minimal parsing overhead
This model was explicitly trained to satisfy those constraints.
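For especially tight memory budgets, the checkpoint can also be loaded with 4-bit weights via `bitsandbytes`. This is a sketch, not an official recipe: it assumes a CUDA-capable device with the `bitsandbytes` package installed, and the NF4 settings mirror common QLoRA practice rather than a published configuration for this model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_PATH = "AryanNsc/qwen3-0.6b-tool-router"

# NF4 4-bit weights with bf16 compute (assumed settings, matching common QLoRA practice)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```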