# Qwen2.5-0.5B-LocalLLMs-ToolCalling

Fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct, optimized for tool calling in ElBruno.LocalLLMs.

No Python needed: download the model and use it directly in .NET with ONNX Runtime GenAI.
## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-0.5B-Instruct |
| Fine-Tuning | QLoRA (rank 16, alpha 32) |
| Training Data | Tool calling + RAG + instruction following (5,000 examples) |
| Format | ONNX INT4 (ONNX Runtime GenAI) |
| Size | ~837 MB |
| Context Length | 2,048 tokens |
| Parameters | 0.5B |
| License | Apache 2.0 |
## Key Features

- ✅ **No Python needed**: download and use directly in .NET
- ✅ **Optimized for ElBruno.LocalLLMs**: matches the QwenFormatter ChatML template exactly
- ✅ **Better tool calling accuracy**: improved `<tool_call>` JSON format compliance
- ✅ **RAG grounded answering**: cites context sources accurately
- ✅ **Runs on CPU**: no GPU required (faster with GPU)
- ✅ **Tiny model**: 0.5B parameters fit on edge devices and laptops
## Usage with ElBruno.LocalLLMs

### Install the NuGet package

```shell
dotnet add package ElBruno.LocalLLMs
```
### C# Code Example

```csharp
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Configure the fine-tuned model
var options = new LocalLLMsOptions
{
    Model = new ModelDefinition
    {
        Id = "Qwen2.5-0.5B-LocalLLMs-ToolCalling".ToLower(),
        HuggingFaceRepoId = "elbruno/Qwen2.5-0.5B-LocalLLMs-ToolCalling",
        RequiredFiles = ["*"],
        ModelType = OnnxModelType.GenAI,
        ChatTemplate = ChatTemplateFormat.Qwen,
        SupportsToolCalling = true
    }
};

// Create the chat client (downloads the model automatically on first use)
using var client = await LocalChatClient.CreateAsync(options);

// --- Tool Calling Example ---
var tools = new List<AITool>
{
    AIFunctionFactory.Create(
        (string city) => $"{{\"temp\": 22, \"condition\": \"sunny\"}}",
        "get_weather",
        "Get current weather for a city")
};

var response = await client.GetResponseAsync(
    new[] { new ChatMessage(ChatRole.User, "What's the weather in Paris?") },
    new ChatOptions { Tools = tools });

Console.WriteLine(response);

// --- RAG Example ---
var ragMessages = new[]
{
    new ChatMessage(ChatRole.System, "Answer based on the provided context."),
    new ChatMessage(ChatRole.User,
        "Context:\n[1] ONNX Runtime GenAI enables local LLM inference.\n\n"
        + "Question: What does ONNX Runtime GenAI do?")
};

var ragResponse = await client.GetResponseAsync(ragMessages);
Console.WriteLine(ragResponse);
```
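If you drive the model outside the library's tool plumbing, you still need to pull the `<tool_call>` payload out of the raw completion yourself. A minimal Python sketch (the tag format follows this card's description; real model output can be malformed, so it parses defensively — `extract_tool_calls` is an illustrative helper, not part of the library):

```python
import json
import re

def extract_tool_calls(text):
    """Pull <tool_call>...</tool_call> JSON blocks out of a model completion."""
    calls = []
    for block in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # skip malformed blocks rather than crash
    return calls

output = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
calls = extract_tool_calls(output)
print(calls)
```

The non-greedy match plus `re.DOTALL` handles multi-line JSON bodies, and skipping unparseable blocks keeps one bad generation from breaking the whole turn.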
## Training Details

### Hyperparameters

| Parameter | Value |
|---|---|
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 16 (effective: 4 × 4 gradient accumulation) |
| Optimizer | paged_adamw_8bit |
| Scheduler | Cosine with 50-step warmup |
| Max Sequence Length | 2,048 |
| Precision | FP16 (mixed precision training) |
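The scheduler row above (cosine with 50-step warmup) can be made concrete with a small function. This is an illustrative reimplementation using the table's values as defaults, not the trainer's own code:

```python
import math

def lr_at(step, total_steps, base_lr=2e-4, warmup=50):
    """Linear warmup to base_lr over `warmup` steps, then cosine decay to ~0."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(25, 1000))    # mid-warmup: half of 2e-4
print(lr_at(50, 1000))    # warmup done: full 2e-4
print(lr_at(1000, 1000))  # end of training: ~0
```
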
### Training Data

The model was fine-tuned on a curated dataset of 5,000 examples:

| Category | Count | Source |
|---|---|---|
| Tool Calling | 2,000 | Glaive Function Calling v2 + custom ElBruno.LocalLLMs examples |
| RAG Grounded | 1,500 | MS MARCO + custom library documentation Q&A |
| Chat Template | 1,500 | Alpaca + ShareGPT (filtered, reformatted to ChatML) |
All training data matches the exact format produced by QwenFormatter.cs, including `<tool_call>` tags, ChatML tokens (`<|im_start|>`, `<|im_end|>`), and tool result formatting.
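To make that format concrete, here is a sketch of how such a training sample is laid out. `render_chatml` is an illustrative helper (not QwenFormatter itself), and the sample content is invented; only the token names come from this card:

```python
import json

def render_chatml(messages):
    """Render a conversation in the ChatML layout described above:
    <|im_start|>role\ncontent<|im_end|> blocks, ending at the assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

tool_call = {"name": "get_weather", "arguments": {"city": "Paris"}}
sample = render_chatml([
    {"role": "system", "content": "You may call tools."},
    {"role": "user", "content": "What's the weather in Paris?"},
]) + "<tool_call>\n" + json.dumps(tool_call) + "\n</tool_call><|im_end|>"
print(sample)
```
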
### Training Framework

- **Unsloth**: 2x faster QLoRA training with 50% less VRAM
- **HuggingFace TRL**: SFTTrainer for supervised fine-tuning
- **Hardware**: NVIDIA RTX 4090 (24 GB VRAM) or equivalent
## Benchmark Results

| Metric | Base Model | Fine-Tuned | Improvement |
|---|---|---|---|
| Tool Call Accuracy | TBD | TBD | TBD |
| JSON Format Compliance | TBD | TBD | TBD |
| RAG Citation Accuracy | TBD | TBD | TBD |
| ChatML Adherence | TBD | TBD | TBD |
| Inference Speed (tokens/sec) | TBD | TBD | TBD |

Benchmarks will be updated after comprehensive evaluation.
## ONNX Conversion Pipeline

The model was converted using this pipeline:

Qwen2.5 Base → QLoRA Fine-tune → Merge LoRA → ONNX Export (INT4)

1. Fine-tune with QLoRA (Unsloth + TRL)
2. Merge LoRA adapters into the base model (`merge_lora.py`)
3. Convert to ONNX with `onnxruntime_genai.models.builder` INT4 quantization (`convert_to_onnx.py`)
4. Validate against the QwenFormatter test suite (`validate_onnx.py`)
5. Upload to HuggingFace (`upload_to_hf.py`)
All scripts are available at `scripts/finetune/`.
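The INT4 export step typically goes through the ONNX Runtime GenAI model builder. A sketch of the kind of invocation `convert_to_onnx.py` likely wraps — the paths are placeholders and the flags shown here are illustrative, not taken from the actual script:

```shell
# Export the merged FP16 model to ONNX with INT4 quantization, targeting CPU.
# -i: input model folder, -o: output folder, -p: precision, -e: execution provider.
python -m onnxruntime_genai.models.builder \
    -i ./merged-model \
    -o ./onnx-int4 \
    -p int4 \
    -e cpu
```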
## Intended Use

### Primary Use Cases

- **Tool Calling**: a small model that reliably produces `<tool_call>` JSON for function execution
- **RAG**: grounded answering with source citations from provided context
- **Local Inference**: privacy-preserving AI on laptops, edge devices, and CI/CD pipelines
- **.NET Applications**: seamless integration via the ElBruno.LocalLLMs NuGet package
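For the RAG use case, the numbered-context prompt shown in the usage example can be built with a small helper. This sketch is illustrative; the exact prompt wording is not prescribed by the library:

```python
def build_rag_prompt(question, chunks):
    """Format retrieved chunks as numbered [1], [2], ... context the model can cite."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_rag_prompt(
    "What does ONNX Runtime GenAI do?",
    ["ONNX Runtime GenAI enables local LLM inference."],
)
print(prompt)
```

Numbering the chunks gives the model stable identifiers to cite, which is what the "RAG Citation Accuracy" metric above measures.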
### Out of Scope

- Complex multi-step reasoning (use 7B+ models)
- Multilingual tasks (English-only training data)
- Long-context tasks beyond 2,048 tokens
- Safety-critical applications without additional guardrails
## Limitations

- **0.5B model**: limited reasoning compared to larger models (3B, 7B, 14B)
- **English only**: not trained on multilingual data
- **Simple tools**: best with 1-3 tools per conversation; may struggle with 10+ complex tools
- **INT4 quantization**: slight quality degradation (~1-3%) compared to FP16, especially on edge cases
- **No streaming tool calls**: tool call output is generated as a complete block
## Citation

```bibtex
@misc{qwen2_5_0_5b_localllms_toolcalling,
  author = {Bruno Capuano},
  title = {Qwen2.5-0.5B-LocalLLMs-ToolCalling},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/elbruno/Qwen2.5-0.5B-LocalLLMs-ToolCalling}
}
```
## Acknowledgments

- **Base Model**: Qwen Team (Qwen2.5 family)
- **Training Framework**: Unsloth (fast QLoRA training)
- **ONNX Conversion**: ONNX Runtime GenAI (Microsoft)
- **Training Data**: Glaive AI (function calling dataset)
- **Library**: ElBruno.LocalLLMs (.NET local LLM inference)