Qwen3.5-2B Voice Assistant (Tool Calling)

A LoRA fine-tune of unsloth/Qwen3.5-2B for hands-free voice assistance with native Qwen3.5 XML tool calling. Trained on 11,044 conversations (572 tool-call, 10,472 voice-only).

Tool Call Format

This model uses the native Qwen3.5 XML parameter format — the same format produced by the model's built-in chat_template.jinja. No custom prompt engineering is needed at inference.

<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>

This is parsed automatically by llama.cpp (--jinja), vLLM, LM Studio, and Ollama when using the bundled chat template.
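If you are running outside those frameworks and need to handle the raw completion text yourself, a minimal regex-based parser for this XML format might look like the following (a sketch, not part of this repo):

```python
import re

def parse_tool_call(text):
    """Extract the function name and parameters from a Qwen3.5-style
    <tool_call> XML block. Returns (name, {param: value}) or None."""
    m = re.search(r"<function=([\w.-]+)>(.*?)</function>", text, re.DOTALL)
    if m is None:
        return None
    name, body = m.group(1), m.group(2)
    # Parameter values are newline-delimited inside their tags; strip whitespace.
    params = {
        k: v.strip()
        for k, v in re.findall(r"<parameter=([\w.-]+)>(.*?)</parameter>", body, re.DOTALL)
    }
    return name, params

call = "<tool_call>\n<function=get_weather>\n<parameter=location>\nAustin\n</parameter>\n</function>\n</tool_call>"
print(parse_tool_call(call))  # → ('get_weather', {'location': 'Austin'})
```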

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")

Inference

llama-server

./llama.cpp/build/bin/llama-server \
    -m unsloth/Qwen3.5-2B-q4_k_m.gguf \
    --jinja \
    --ctx-size 2048 \
    --temp 0.7 \
    --top-p 0.9 \
    --repeat-penalty 1.0 \
    --host 0.0.0.0 \
    --port 8080

Important: Use --jinja — this reads the native chat_template.jinja bundled with the model, which handles tool schema injection and output parsing automatically. --repeat-penalty 1.0 is critical — higher values corrupt XML structure in tool calls.

OpenAI SDK (via llama-server or vLLM)

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
tools = json.load(open("tools.json"))

response = client.chat.completions.create(
    model="your-model",
    messages=[
        {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
        {"role": "user", "content": "What's the weather in Austin?"},
    ],
    tools=tools,
    temperature=0.7,
    top_p=0.9,
)

message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = tool_call.function.arguments
    if isinstance(args, str):
        args = json.loads(args)
    tool_result = execute_tool(tool_call.function.name, args)

    response2 = client.chat.completions.create(
        model="your-model",
        messages=[
            {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
            {"role": "user", "content": "What's the weather in Austin?"},
            message,
            {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
        ],
        tools=tools,
        temperature=0.7,
    )
    spoken = response2.choices[0].message.content
else:
    spoken = message.content

Known issue (llama.cpp #20198): arguments may be returned as a dict instead of a JSON string. The isinstance check above handles both.
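The execute_tool helper above is application code, not part of this repo. A minimal dispatcher sketch (the handlers here are stubs; replace them with real integrations):

```python
def execute_tool(name, args):
    """Dispatch a parsed tool call to a local handler by name.
    Handlers below are placeholders returning canned results."""
    handlers = {
        "get_weather": lambda a: {"location": a["location"], "temp_f": 74, "conditions": "sunny"},
        "set_timer": lambda a: {"status": "set", "duration": a.get("duration")},
    }
    handler = handlers.get(name)
    if handler is None:
        # Return an error payload the model can verbalize, rather than raising.
        return {"error": f"unknown tool: {name}"}
    return handler(args)

print(execute_tool("get_weather", {"location": "Austin"}))
```

Returning an error dict (instead of raising) lets you pass the failure back to the model as a tool message so it can apologize or retry in the spoken response.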

Transformers (direct)

import json

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools", device_map="auto")

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
    {"role": "user", "content": "Set a timer for 5 minutes"},
]
tools = json.load(open("tools.json"))

# Native template handles tool schema injection automatically
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))

vLLM

vllm serve cowWhySo/qwen3_5_2B_voice_assistant_tools \
    --max-model-len 2048 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

Training Details

| Parameter            | Value                            |
|----------------------|----------------------------------|
| Base model           | unsloth/Qwen3.5-2B               |
| Method               | LoRA (r=16, alpha=32)            |
| Precision            | bf16                             |
| Max seq length       | 2048                             |
| Learning rate        | 0.0001                           |
| Effective batch size | 64                               |
| Epochs               | 3                                |
| Early stopping       | patience=3 (eval every 15 steps) |
| Thinking             | Disabled                         |

Tools

get_weather · set_timer · create_reminder · control_smart_home · play_music · web_search

Full tool schemas are in tools.json in this repo.
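The entries follow the standard OpenAI function-calling schema. A plausible shape for one entry (illustrative only; the authoritative schemas are in tools.json):

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City name, e.g. Austin"
        }
      },
      "required": ["location"]
    }
  }
}
```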

Design Decisions

  • Native Qwen3.5 format: Training data formatted using the model's own chat_template.jinja, so tool calls use the XML parameter format (<function=name><parameter=key>value</parameter></function>) that every inference framework expects. Zero custom prompt engineering at deployment.
  • Tools always visible: Every training example (including voice-only) sees tool schemas in the system prompt, teaching the model when NOT to call tools.
  • Thinking disabled: enable_thinking=False throughout training and inference — avoids reasoning loops on a 2B model and keeps voice responses instant. For Qwen3.5 0.8B/2B/4B/9B, thinking is disabled by default.
  • Voice-first responses: All non-tool assistant responses filtered for conciseness (20-400 chars) and conversational tone (no markdown, lists, or code).
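The voice-first filtering described above can be sketched as a simple heuristic check (length thresholds are from this card; the specific rejection patterns are illustrative, not the exact filter used in training):

```python
import re

CODE_FENCE = "`" * 3  # built programmatically to avoid a literal fence here

def is_voice_friendly(text, min_len=20, max_len=400):
    """Heuristic filter for concise, speakable responses:
    within length bounds and free of markdown/list/table artifacts."""
    if not (min_len <= len(text) <= max_len):
        return False
    if CODE_FENCE in text:
        return False
    # Reject markdown headings, bullet or numbered lists at line starts.
    if re.search(r"^\s*([-*#]|\d+\.)\s", text, re.MULTILINE):
        return False
    # Pipes almost always indicate a markdown table.
    if "|" in text:
        return False
    return True

print(is_voice_friendly("It's 74 and sunny in Austin right now."))  # → True
```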