# Qwen2.5-0.5B-LocalLLMs-ToolCalling

Fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct, optimized for tool calling in ElBruno.LocalLLMs.

No Python needed: download the model and use it directly in .NET with ONNX Runtime GenAI.
## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-0.5B-Instruct |
| Fine-Tuning | QLoRA (rank 16, alpha 32) |
| Training Data | Tool calling + RAG + instruction following (5,000 examples) |
| Format | ONNX INT4 (ONNX Runtime GenAI) |
| Size | ~837 MB |
| Context Length | 2,048 tokens |
| Parameters | 0.5B |
| License | Apache 2.0 |
## Key Features

- ✅ **No Python needed**: download and use directly in .NET
- ✅ **Optimized for ElBruno.LocalLLMs**: matches the QwenFormatter ChatML template exactly
- ✅ **Better tool calling accuracy**: improved `<tool_call>` JSON format compliance
- ✅ **RAG grounded answering**: cites context sources accurately
- ✅ **Runs on CPU**: no GPU required (faster with GPU)
- ✅ **Tiny model**: 0.5B parameters fit on edge devices and laptops
## Usage with ElBruno.LocalLLMs

### Install the NuGet package

```shell
dotnet add package ElBruno.LocalLLMs
```
### C# Code Example

```csharp
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Configure the fine-tuned model
var options = new LocalLLMsOptions
{
    Model = new ModelDefinition
    {
        Id = "Qwen2.5-0.5B-LocalLLMs-ToolCalling".ToLower(),
        HuggingFaceRepoId = "elbruno/Qwen2.5-0.5B-LocalLLMs-ToolCalling",
        RequiredFiles = ["*"],
        ModelType = OnnxModelType.GenAI,
        ChatTemplate = ChatTemplateFormat.Qwen,
        SupportsToolCalling = true
    }
};

// Create the chat client (downloads the model automatically on first use)
using var client = await LocalChatClient.CreateAsync(options);

// --- Tool Calling Example ---
var tools = new List<AITool>
{
    AIFunctionFactory.Create(
        (string city) => $"{{\"temp\": 22, \"condition\": \"sunny\"}}",
        "get_weather",
        "Get current weather for a city")
};

var response = await client.GetResponseAsync(
    new[] { new ChatMessage(ChatRole.User, "What's the weather in Paris?") },
    new ChatOptions { Tools = tools });

Console.WriteLine(response);

// --- RAG Example ---
var ragMessages = new[]
{
    new ChatMessage(ChatRole.System, "Answer based on the provided context."),
    new ChatMessage(ChatRole.User,
        "Context:\n[1] ONNX Runtime GenAI enables local LLM inference.\n\n"
        + "Question: What does ONNX Runtime GenAI do?")
};

var ragResponse = await client.GetResponseAsync(ragMessages);
Console.WriteLine(ragResponse);
```
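If you drive the model outside the library's tool plumbing, you still need to pull the `<tool_call>` payload out of the raw completion yourself. A minimal Python sketch (the tag format follows this card's description; real model output can be malformed, so it parses defensively — `extract_tool_calls` is an illustrative helper, not part of the library):

```python
import json
import re

def extract_tool_calls(text):
    """Pull <tool_call>...</tool_call> JSON blocks out of a model completion."""
    calls = []
    for block in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # skip malformed blocks rather than crash
    return calls

output = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
calls = extract_tool_calls(output)
print(calls)
```

The non-greedy match plus `re.DOTALL` handles multi-line JSON bodies, and skipping unparseable blocks keeps one bad generation from breaking the whole turn.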
## Training Details

### Hyperparameters

| Parameter | Value |
|---|---|
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 16 (effective: 4 × 4 gradient accumulation) |
| Optimizer | paged_adamw_8bit |
| Scheduler | Cosine with 50-step warmup |
| Max Sequence Length | 2,048 |
| Precision | FP16 (mixed precision training) |
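The scheduler row above (cosine with 50-step warmup) can be made concrete with a small function. This is an illustrative reimplementation using the table's values as defaults, not the trainer's own code:

```python
import math

def lr_at(step, total_steps, base_lr=2e-4, warmup=50):
    """Linear warmup to base_lr over `warmup` steps, then cosine decay to ~0."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(25, 1000))    # mid-warmup: half of 2e-4
print(lr_at(50, 1000))    # warmup done: full 2e-4
print(lr_at(1000, 1000))  # end of training: ~0
```
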
### Training Data

The model was fine-tuned on a curated dataset of 5,000 examples:

| Category | Count | Source |
|---|---|---|
| Tool Calling | 2,000 | Glaive Function Calling v2 + custom ElBruno.LocalLLMs examples |
| RAG Grounded | 1,500 | MS MARCO + custom library documentation Q&A |
| Chat Template | 1,500 | Alpaca + ShareGPT (filtered, reformatted to ChatML) |
All training data matches the exact format produced by QwenFormatter.cs, including `<tool_call>` tags, ChatML tokens (`<|im_start|>`, `<|im_end|>`), and tool result formatting.
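To make that format concrete, here is a sketch of how such a training sample is laid out. `render_chatml` is an illustrative helper (not QwenFormatter itself), and the sample content is invented; only the token names come from this card:

```python
import json

def render_chatml(messages):
    """Render a conversation in the ChatML layout described above:
    <|im_start|>role\ncontent<|im_end|> blocks, ending at the assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

tool_call = {"name": "get_weather", "arguments": {"city": "Paris"}}
sample = render_chatml([
    {"role": "system", "content": "You may call tools."},
    {"role": "user", "content": "What's the weather in Paris?"},
]) + "<tool_call>\n" + json.dumps(tool_call) + "\n</tool_call><|im_end|>"
print(sample)
```
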
### Training Framework

- **Unsloth**: 2x faster QLoRA training with 50% less VRAM
- **HuggingFace TRL**: SFTTrainer for supervised fine-tuning
- **Hardware**: NVIDIA RTX 4090 (24 GB VRAM) or equivalent
## Benchmark Results

| Metric | Base Model | Fine-Tuned | Improvement |
|---|---|---|---|
| Tool Call Accuracy | TBD | TBD | TBD |
| JSON Format Compliance | TBD | TBD | TBD |
| RAG Citation Accuracy | TBD | TBD | TBD |
| ChatML Adherence | TBD | TBD | TBD |
| Inference Speed (tokens/sec) | TBD | TBD | TBD |

Benchmarks will be updated after comprehensive evaluation.
## ONNX Conversion Pipeline

The model was converted using this pipeline:

Qwen2.5 Base → QLoRA Fine-tune → Merge LoRA → ONNX Export (INT4)

1. Fine-tune with QLoRA (Unsloth + TRL)
2. Merge LoRA adapters into the base model (`merge_lora.py`)
3. Convert to ONNX with `onnxruntime_genai.models.builder` INT4 quantization (`convert_to_onnx.py`)
4. Validate against the QwenFormatter test suite (`validate_onnx.py`)
5. Upload to HuggingFace (`upload_to_hf.py`)
All scripts are available at `scripts/finetune/`.
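The INT4 export step typically goes through the ONNX Runtime GenAI model builder. A sketch of the kind of invocation `convert_to_onnx.py` likely wraps — the paths are placeholders and the flags shown here are illustrative, not taken from the actual script:

```shell
# Export the merged FP16 model to ONNX with INT4 quantization, targeting CPU.
# -i: input model folder, -o: output folder, -p: precision, -e: execution provider.
python -m onnxruntime_genai.models.builder \
    -i ./merged-model \
    -o ./onnx-int4 \
    -p int4 \
    -e cpu
```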
## Intended Use

### Primary Use Cases

- **Tool Calling**: a small model that reliably produces `<tool_call>` JSON for function execution
- **RAG**: grounded answering with source citations from provided context
- **Local Inference**: privacy-preserving AI on laptops, edge devices, and CI/CD pipelines
- **.NET Applications**: seamless integration via the ElBruno.LocalLLMs NuGet package
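For the RAG use case, the numbered-context prompt shown in the usage example can be built with a small helper. This sketch is illustrative; the exact prompt wording is not prescribed by the library:

```python
def build_rag_prompt(question, chunks):
    """Format retrieved chunks as numbered [1], [2], ... context the model can cite."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_rag_prompt(
    "What does ONNX Runtime GenAI do?",
    ["ONNX Runtime GenAI enables local LLM inference."],
)
print(prompt)
```

Numbering the chunks gives the model stable identifiers to cite, which is what the "RAG Citation Accuracy" metric above measures.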
### Out of Scope

- Complex multi-step reasoning (use 7B+ models)
- Multilingual tasks (English-only training data)
- Long-context tasks beyond 2,048 tokens
- Safety-critical applications without additional guardrails
## Limitations

- **0.5B model**: limited reasoning compared to larger models (3B, 7B, 14B)
- **English only**: not trained on multilingual data
- **Simple tools**: best with 1-3 tools per conversation; may struggle with 10+ complex tools
- **INT4 quantization**: slight quality degradation (~1-3%) compared to FP16, especially on edge cases
- **No streaming tool calls**: tool call output is generated as a complete block
## Citation

```bibtex
@misc{qwen2_5_0_5b_localllms_toolcalling,
  author = {Bruno Capuano},
  title = {Qwen2.5-0.5B-LocalLLMs-ToolCalling},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/elbruno/Qwen2.5-0.5B-LocalLLMs-ToolCalling}
}
```
## Acknowledgments

- **Base Model**: Qwen Team (Qwen2.5 family)
- **Training Framework**: Unsloth (fast QLoRA training)
- **ONNX Conversion**: ONNX Runtime GenAI (Microsoft)
- **Training Data**: Glaive AI (function calling dataset)
- **Library**: ElBruno.LocalLLMs (.NET local LLM inference)