# Qwen3.5-2B Voice Assistant (Tool Calling)
A LoRA fine-tune of `unsloth/Qwen3.5-2B` for hands-free voice assistance with native Qwen3.5 XML tool calling. Trained on 11,044 conversations (572 tool-call, 10,472 voice-only).
## Tool Call Format
This model uses the native Qwen3.5 XML parameter format, the same format produced by the model's built-in `chat_template.jinja`. No custom prompt engineering is needed at inference.
```xml
<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>
```
This is parsed automatically by llama.cpp (`--jinja`), vLLM, LM Studio, and Ollama when using the bundled chat template.
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
```
## Inference
### llama-server
```shell
./llama.cpp/build/bin/llama-server \
  -m unsloth/Qwen3.5-2B-q4_k_m.gguf \
  --jinja \
  --ctx-size 2048 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.0 \
  --host 0.0.0.0 \
  --port 8080
```
**Important:** Use `--jinja`; it loads the native `chat_template.jinja` bundled with the model, which handles tool schema injection and output parsing automatically. `--repeat-penalty 1.0` is critical: higher values corrupt the XML structure in tool calls.
### OpenAI SDK (via llama-server or vLLM)
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
tools = json.load(open("tools.json"))

response = client.chat.completions.create(
    model="your-model",
    messages=[
        {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
        {"role": "user", "content": "What's the weather in Austin?"},
    ],
    tools=tools,
    temperature=0.7,
    top_p=0.9,
)

message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = tool_call.function.arguments
    if isinstance(args, str):  # llama.cpp may return a dict instead of a JSON string
        args = json.loads(args)
    tool_result = execute_tool(tool_call.function.name, args)  # your own tool dispatcher

    # Second round trip: feed the tool result back for the spoken answer
    response2 = client.chat.completions.create(
        model="your-model",
        messages=[
            {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
            {"role": "user", "content": "What's the weather in Austin?"},
            message,
            {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
        ],
        tools=tools,
        temperature=0.7,
    )
    spoken = response2.choices[0].message.content
else:
    spoken = message.content
```
**Known issue (llama.cpp #20198):** `arguments` may be returned as a dict instead of a JSON string. The `isinstance` check above handles both.
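If you handle tool calls in several places, that normalization can be factored into a small helper. This is a sketch; `normalize_arguments` is a name introduced here, not part of any SDK:

```python
import json

def normalize_arguments(args):
    """Return tool-call arguments as a dict, whether the server sent
    a JSON-encoded string (the spec-compliant form) or an
    already-decoded dict (the llama.cpp quirk noted above)."""
    if isinstance(args, str):
        return json.loads(args)
    return args
```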
### Transformers (direct)
```python
import json

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools", device_map="auto")

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
    {"role": "user", "content": "Set a timer for 5 minutes"},
]
tools = json.load(open("tools.json"))

# Native template handles tool schema injection automatically
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
output = model.generate(**inputs, max_new_tokens=256, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
```
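Decoding raw generations like this bypasses any server-side tool-call parsing, so the XML has to be extracted yourself. A minimal sketch for the format shown earlier; `parse_tool_call` is a hypothetical helper, and real template output may vary in whitespace:

```python
import re

def parse_tool_call(text):
    """Extract the function name and parameters from a native
    Qwen-style XML tool call, or return None if there is none."""
    match = re.search(r"<function=([\w.]+)>(.*?)</function>", text, re.DOTALL)
    if match is None:
        return None
    name, body = match.group(1), match.group(2)
    # Each parameter value is the text between its open/close tags.
    params = {
        key: value.strip()
        for key, value in re.findall(r"<parameter=([\w.]+)>(.*?)</parameter>", body, re.DOTALL)
    }
    return {"name": name, "arguments": params}
```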
### vLLM
```shell
vllm serve cowWhySo/qwen3_5_2B_voice_assistant_tools \
  --max-model-len 2048 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | unsloth/Qwen3.5-2B |
| Method | LoRA (r=16, alpha=32) |
| Precision | bf16 |
| Max seq length | 2048 |
| Learning rate | 0.0001 |
| Effective batch size | 64 |
| Epochs | 3 |
| Early stopping | patience=3 (eval every 15 steps) |
| Thinking | Disabled |
## Tools
`get_weather` · `set_timer` · `create_reminder` · `control_smart_home` · `play_music` · `web_search`
Full tool schemas are in `tools.json` in this repo.
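The file follows the standard OpenAI function-calling schema. A hypothetical entry to illustrate the shape (the real schemas in `tools.json` may differ in fields and descriptions):

```python
import json

# Illustrative shape of one entry in tools.json; the values here are
# invented for this example, not copied from the repo.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
            },
            "required": ["location"],
        },
    },
}

# tools.json itself is a JSON array of such entries:
print(json.dumps([get_weather_tool], indent=2))
```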
## Design Decisions
- **Native Qwen3.5 format:** Training data was formatted with the model's own `chat_template.jinja`, so tool calls use the XML parameter format (`<function=name><parameter=key>value</parameter></function>`) that every inference framework expects. Zero custom prompt engineering at deployment.
- **Tools always visible:** Every training example (including voice-only) sees the tool schemas in the system prompt, teaching the model when NOT to call tools.
- **Thinking disabled:** `enable_thinking=False` throughout training and inference avoids reasoning loops on a 2B model and keeps voice responses instant. For Qwen3.5 0.8B/2B/4B/9B, thinking is disabled by default.
- **Voice-first responses:** All non-tool assistant responses were filtered for conciseness (20-400 chars) and conversational tone (no markdown, lists, or code).
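As a rough illustration of that last filter, a heuristic in the same spirit might look like the sketch below. The actual training filter is not published, and `is_voice_friendly` is a name invented here:

```python
import re

def is_voice_friendly(text):
    """Heuristic mirroring the filter described above: 20-400 chars,
    and no markdown code fences, headers, or list bullets."""
    if not 20 <= len(text) <= 400:
        return False
    markdown = re.compile(r"```|^#{1,6}\s|^\s*[-*]\s|^\s*\d+\.\s", re.MULTILINE)
    return markdown.search(text) is None
```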