# Qwen3.5-2B Voice Assistant
Fine-tuned Qwen3.5-2B for voice assistant and conversational use. It is designed to produce short responses, with thinking mode disabled.
Trained on curated, concise datasets: all assistant responses are short and natural-sounding, optimized for spoken output rather than written text.
## Training Details
| Parameter | Value |
|---|---|
| Base model | unsloth/Qwen3.5-2B |
| Method | LoRA (rank=16, alpha=32) |
| LoRA dropout | 0.05 |
| Learning rate | 0.0001 |
| Epochs | 3 (early stopping, patience=4) |
| Effective batch size | 64 |
| Max sequence length | 1024 |
| Scheduler | Cosine with 50 warmup steps |
| Precision | bf16 |
| Thinking mode | Disabled |
| GPU | NVIDIA L4 (22 GB) |
| Framework | Unsloth + TRL SFTTrainer |
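The hyperparameters from the table can be restated as plain Python dicts for reference. This is a sketch only: the key names loosely mirror PEFT's `LoraConfig` and TRL's `SFTConfig`, and the per-device batch / gradient-accumulation split behind the effective batch size of 64 is not stated in this card.

```python
# Reference sketch of the training configuration from the table above.
# Not the actual training script; names approximate LoraConfig / SFTConfig.
lora = dict(
    r=16,            # LoRA rank
    lora_alpha=32,   # LoRA scaling factor
    lora_dropout=0.05,
)

training = dict(
    learning_rate=1e-4,
    num_train_epochs=3,          # with early stopping, patience=4
    lr_scheduler_type="cosine",
    warmup_steps=50,
    bf16=True,
    max_seq_length=1024,
)

# Effective batch size = per-device batch x gradient accumulation steps;
# the card states only the product (64), not the split.
effective_batch_size = 64
```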
## Datasets
All datasets are filtered for concise, voice-friendly assistant responses (20–400 chars for general data, 20–500 chars for reasoning). Responses containing markdown formatting (bold, inline code, numbered lists, bullet points, headings) are excluded. Exact-match deduplication is applied across all sources before training.
| Dataset | Rows | Purpose |
|---|---|---|
| OpenAssistant/oasst_top1_2023-08-25 | 2,388 | Real human multi-turn conversations |
| HuggingFaceTB/everyday-conversations-llama3.1-2k | 1,910 | Greetings, small talk, basic Q&A |
| argilla/synthetic-concise-reasoning-sft | 535 | Short factual reasoning answers |
| WizardLM/WizardLM_evol_instruct_70k | 7,000 | Casual single-turn Q&A |
| Duplicates removed | 1,992 | |
| Total (after dedup) | 9,841 | |
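The exact-match deduplication across sources can be sketched as below. This is illustrative: the card only states "exact-match" dedup, so the key construction (joining stripped turn contents) is an assumption about how conversations are compared.

```python
# Sketch of exact-match deduplication across dataset sources.
# Key construction is an assumption; the card specifies only exact matching.
def dedup(conversations):
    """Keep the first occurrence of each conversation, dropping exact repeats."""
    seen, kept = set(), []
    for conv in conversations:
        key = "\n".join(turn["content"].strip() for turn in conv)
        if key not in seen:
            seen.add(key)
            kept.append(conv)
    return kept
```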
## Filtering Pipeline (v7)
Each assistant response is checked against the following before inclusion:
- Length: 20–400 chars (general), 20–500 chars (reasoning)
- No markdown: `**bold**`, `` `inline code` ``, `[link](url)`, and `# headings` are all excluded
- No lists: numbered (`1.`) and bullet (`-`, `*`) patterns excluded at line start and after colons
- No list lead-ins: phrases like "the process involves:", "as follows:", "the following" excluded
- No AI-isms: "certainly!", "as an AI", "in conclusion", "delve" excluded
- Post-dedup sanity check: % of markdown patterns logged to W&B before training
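The checks above can be sketched as a single predicate. The exact patterns used in training are not published, so these regexes and phrase lists are illustrative approximations of the listed rules:

```python
import re

# Approximations of the v7 filter rules; the production patterns are not public.
MARKDOWN_PATTERNS = [
    r"\*\*[^*]+\*\*",        # **bold**
    r"`[^`]+`",              # `inline code`
    r"\[[^\]]+\]\([^)]+\)",  # [link](url)
    r"(?m)^#{1,6}\s",        # # headings
    r"(?m)^\s*\d+\.\s",      # numbered list at line start
    r"(?m)^\s*[-*]\s",       # bullet list at line start
]
AI_ISMS = ["certainly!", "as an ai", "in conclusion", "delve"]
LEAD_INS = ["the process involves:", "as follows:", "the following"]

def keep_response(text: str, max_len: int = 400) -> bool:
    """Return True if a response passes the length, markdown,
    list, lead-in, and AI-ism checks (20-400 chars for general data)."""
    if not 20 <= len(text) <= max_len:
        return False
    if any(re.search(p, text) for p in MARKDOWN_PATTERNS):
        return False
    lowered = text.lower()
    if any(phrase in lowered for phrase in AI_ISMS + LEAD_INS):
        return False
    return True
```

Passing `max_len=500` covers the reasoning-data variant of the length rule.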
## System Prompt
All training samples include this system prompt:
> You are a casual, hands-free voice assistant. Speak in short, punchy sentences as if we are having a real-time conversation. Never use bullet points, markdown, or code. If explaining a complex topic, use a simple, everyday analogy. Respond immediately without any preamble or internal monologue.
## Available Formats
| Repo | Format | Use case |
|---|---|---|
| cowWhySo/qwen3_5_2B_voice_assistant | Merged 16-bit | Transformers / vLLM / SGLang |
| cowWhySo/qwen3_5_2B_voice_assistant-lora | LoRA adapters | Merge with base yourself |
| cowWhySo/qwen3_5_2B_voice_assistant-GGUF | GGUF (q4_k_m, q5_k_m, q8_0, f16) | llama.cpp / Ollama / LM Studio |
## Usage with llama.cpp
```bash
huggingface-cli download cowWhySo/qwen3_5_2B_voice_assistant-GGUF --include "*q4_k_m*" --local-dir .
./llama-cli -m *q4_k_m*.gguf --ctx-size 2048 --temp 0.7 --top-p 0.9
```
## Usage with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant")
tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant")

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant. Speak in short, punchy sentences as if we are having a real-time conversation. Never use bullet points, markdown, or code. If explaining a complex topic, use a simple, everyday analogy. Respond immediately without any preamble or internal monologue."},
    {"role": "user", "content": "What's the weather like today?"},
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
# do_sample=True is required for temperature/top_p to take effect
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Fine-tuned with Unsloth on an NVIDIA L4 GPU.