# Qwen3-1.7B-FC: Function Calling Specialist
A function calling model based on Qwen3-1.7B, fine-tuned using RLVR (Reinforcement Learning with Verifiable Rewards) to improve tool-use capabilities on the BFCL V3 benchmark.
## 🏆 Performance Highlights
| Model | Size | BFCL Overall | Category Avg |
|---|---|---|---|
| Qwen3-1.7B-FC (Ours) | 1.7B | 54.2% | 50.8% |
| Qwen3-1.7B (Base) | 1.7B | 48.8% | 45.8% |
| Qwen3-8B | 8B | 51.9% | 48.6% |
| Qwen3-14B | 14B | 51.6% | 49.0% |
### Response Efficiency
| Model | Avg Response Tokens | Efficiency vs Base |
|---|---|---|
| Base Qwen3-1.7B | 35.6 tokens | - |
| Qwen3-1.7B-FC (Ours) | 22.7 tokens | -36% |
The fine-tuned model generates 36% fewer tokens while maintaining higher accuracy, thanks to:
- Direct tool calls without verbose preambles
- Concise refusal messages ("None of the provided tools can answer this question")
- Reduced `<think>` reasoning blocks
## 📊 Detailed Benchmark Results (BFCL V3)
### Core Function Calling
| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| simple | 81.0% | 61.5% | 69.2% | 65.5% |
| multiple | 79.0% | 55.5% | 66.0% | 57.0% |
| parallel | 78.0% | 68.0% | 78.0% | 77.0% |
| parallel_multiple | 64.5% | 51.5% | 66.5% | 66.5% |
| irrelevance | 81.2% | 86.2% | 85.4% | 90.4% |
### Executable Python
| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| exec_simple | 84.0% | 82.0% | 84.0% | 87.0% |
| exec_multiple | 70.0% | 70.0% | 78.0% | 78.0% |
| exec_parallel | 80.0% | 76.0% | 86.0% | 90.0% |
| exec_parallel_multiple | 60.0% | 60.0% | 67.5% | 65.0% |
### Live API Categories
| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| live_simple | 63.6% | 43.8% | 51.2% | 51.6% |
| live_multiple | 55.0% | 36.8% | 43.7% | 42.5% |
| live_parallel | 50.0% | 18.8% | 43.8% | 43.8% |
| live_parallel_multiple | 66.7% | 37.5% | 54.2% | 50.0% |
| live_irrelevance | 66.1% | 80.3% | 78.7% | 79.9% |
## 📚 Training Data
### Data Sources
| Source | Samples | Type | Description |
|---|---|---|---|
| xLAM | ~60,000 | Positive | High-quality function calling examples from Salesforce |
| ToolACE | ~11,000 | Positive | Diverse multi-turn tool usage scenarios |
| Toucan-1.5M | 40,000 | Negative | Irrelevant queries (Server Shuffle method) |
| Synthetic Negatives | 6,000 | Negative | Domain mismatch, partial fulfillment, permission errors |
### Negative Sample Types
The model is trained to refuse appropriately using diverse negative samples:
| Type | Description | Example |
|---|---|---|
| Toucan Irrelevant | Query has no matching tool in available functions | "What's the weather?" when only get_stock_price is available |
| Domain Mismatch | Tools from wrong domain | Asking about finance when only cooking tools available |
| Action Mismatch | Similar name but wrong action | Asking to "delete" when only "get" function exists |
| Partial Fulfillment | Tools can't fully solve query | Need 2 steps but only 1 tool available |
| Permission/Auth | Missing required permissions | Admin action without credentials |
| Format Mismatch | Wrong data format requirements | Tool expects JSON but query provides CSV |
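To make the table above concrete, a negative training record might look like the following sketch. The field names (`query`, `tools`, `ground_truth`, `expected_response`) are hypothetical; the actual dataset schema is not shown in this card.

```python
import json

# Hypothetical "Domain Mismatch" negative sample: the available tools
# cannot serve the query, so the ground truth is an empty tool-call list
# and the target behavior is a concise refusal.
negative_sample = {
    "query": "What is the current price of AAPL stock?",
    "tools": [
        {
            "name": "find_recipe",
            "description": "Search for cooking recipes by ingredient",
            "parameters": {
                "type": "object",
                "properties": {"ingredient": {"type": "string"}},
                "required": ["ingredient"],
            },
        }
    ],
    "ground_truth": [],  # No valid tool call exists
    "expected_response": "None of the provided tools can answer this question.",
}

print(json.dumps(negative_sample["ground_truth"]))  # []
```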
## 🔧 Training Methodology
### Two-Stage RLVR Fine-tuning
**Stage 1: Accuracy-focused training (V3)**
- Trained from Qwen3-1.7B base
- Dataset: ~40K samples (stage2.parquet)
- Reward: Correctness (1.0) + Format (0.1) + Efficiency (0.3) + Refusal (0.3)
- Config: max_steps=5000, LR=5e-7, temp=1.2
- Best checkpoint: step 100 (early stopping, highest accuracy)

**Stage 2: Efficiency optimization (V4)**
- Loaded from Stage 1 checkpoint-100
- Focus: Reduce verbosity, discourage `<think>` tags
- Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
- Config: max_steps=3000, LR=2e-7
- Selected checkpoint: step 1100
- Result: 36% reduction in response tokens
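Since the acknowledgments credit TRL's GRPO implementation, the Stage 1 run can be sketched roughly as below. This is a non-runnable illustration, not the actual training script: the reward function is a placeholder, the dataset is loaded elsewhere, and argument names follow recent TRL releases and may need adjusting.

```python
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward; the real one combines the format/correctness/
# efficiency/refusal components described in this card.
def combined_reward(completions, **kwargs):
    return [0.0 for _ in completions]

# Hypothetical Stage 1 configuration mirroring the hyperparameters above
config = GRPOConfig(
    output_dir="qwen3-fc-stage1",
    learning_rate=5e-7,
    max_steps=5000,
    temperature=1.2,  # rollout sampling temperature
    bf16=True,
    save_steps=100,   # checkpoint frequently; step 100 was selected
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B",
    reward_funcs=combined_reward,
    args=config,
    train_dataset=train_dataset,  # ~40K-sample dataset, loaded elsewhere
)
trainer.train()
```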
### Reward Function Design
```python
# Combined reward formula
total_reward = (
    format_weight * format_reward +        # Valid <tool_call> JSON (0.0-1.0)
    correct_weight * correctness_reward +  # Tool name + arguments match (0.0-1.0)
    refusal_weight * refusal_reward +      # +1.0 correct refusal, -1.0 hallucination
    efficiency_weight * efficiency_reward  # Penalty for verbose <think>
)

# Stage 1 weights (accuracy focus)
STAGE1_WEIGHTS = {
    'format': 0.2,
    'correctness': 1.0,  # Main focus
    'efficiency': 0.2,
    'refusal': 0.3,
}

# Stage 2 weights (efficiency focus)
STAGE2_WEIGHTS = {
    'format': 0.1,
    'correctness': 0.5,  # Reduced - already accurate from Stage 1
    'efficiency': 1.0,   # Main focus - penalize <think> tags
    'refusal': 0.3,
}
```
### Individual Reward Components
| Component | Description | Range |
|---|---|---|
| format_reward | Valid `<tool_call>...</tool_call>` JSON structure | 0.0 to 1.0 |
| correctness_reward | Tool name match + argument similarity | 0.0 to 1.0 |
| refusal_reward | +1.0 correct refusal, -1.0 hallucination | -1.0 to +1.0 |
| efficiency_reward | Stage 1: -0.3 for `<think>`, Stage 2: -1.0 | -1.0 to +0.1 |
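Putting the components and weights together, the combined score for a single response works out as in this simplified re-implementation (an illustration, not the actual training code):

```python
# Stage 1 weights, as listed in the reward design above
STAGE1_WEIGHTS = {'format': 0.2, 'correctness': 1.0, 'efficiency': 0.2, 'refusal': 0.3}

def total_reward(components: dict, weights: dict) -> float:
    """Weighted sum of the individual reward components."""
    return sum(weights[name] * value for name, value in components.items())

# Example: well-formed, fully correct tool call with no <think> block
components = {
    'format': 1.0,       # valid <tool_call> JSON
    'correctness': 1.0,  # tool name and arguments match
    'efficiency': 0.1,   # no verbose reasoning, small bonus
    'refusal': 0.0,      # not a refusal case
}
print(round(total_reward(components, STAGE1_WEIGHTS), 2))  # 1.22
```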
### Key Training Innovations
- Strong Refusal Penalty: -1.0 for calling tools when `ground_truth = []`
- Toucan Irrelevant Data: 40K high-quality "unanswerable" samples
- Efficiency Optimization: rewarding direct tool calls without preambles
- Discourage `<think>` Tags: strong penalty (-1.0) for verbose reasoning blocks
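The refusal penalty described above can be sketched as follows. This is a simplified stand-in for illustration; the actual reward function is not shown in this card.

```python
def refusal_reward(response: str, ground_truth: list) -> float:
    """+1.0 for correctly refusing when no tool applies,
    -1.0 for hallucinating a tool call, 0.0 otherwise."""
    made_tool_call = "<tool_call>" in response
    if not ground_truth:  # No valid tool call exists for this query
        return -1.0 if made_tool_call else 1.0
    return 0.0            # Positive cases are scored by correctness_reward

# Correct refusal on an unanswerable query
print(refusal_reward("None of the provided tools can answer this question.", []))  # 1.0
# Hallucinated call when the ground truth is empty
print(refusal_reward('<tool_call>{"name": "x", "arguments": {}}</tool_call>', []))  # -1.0
```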
## 🚀 Usage
### With Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "contextboxai/Qwen3-1.7B-FC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Define tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # Disable thinking for efficiency
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # Move inputs to the model's device
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
**Expected Output**
```
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>
```
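Downstream code then needs to extract the JSON payload from the `<tool_call>` tags. One minimal, regex-based way to do this (for illustration; `parse_tool_calls` is not part of any library) is:

```python
import json
import re

def parse_tool_calls(response: str) -> list:
    """Extract JSON payloads from <tool_call>...</tool_call> blocks."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(m) for m in pattern.findall(response)]

response = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Tokyo"}}\n</tool_call>'
calls = parse_tool_calls(response)
print(calls[0]["name"], calls[0]["arguments"])  # get_weather {'location': 'Tokyo'}
```

A response with no `<tool_call>` block (e.g. a refusal) simply yields an empty list.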
**Refusal Example**
When asked "What is the meaning of life?" with only the get_weather tool available:
```
None of the provided tools can answer this question.
```
### With vLLM (Recommended for Production)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="contextboxai/Qwen3-1.7B-FC")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Generate with the same prompt format as above
outputs = llm.generate([prompt], sampling_params)
```
## 💡 Key Features
### ✅ Strengths
- Compact Size: Only 1.7B parameters, runs on consumer GPUs
- High Accuracy: Outperforms larger models (8B, 14B) on function calling
- Efficient Responses: Direct tool calls without verbose preambles
- Strong Refusal: Trained on 46K negative samples to avoid hallucination
- Multilingual: Supports English and Vietnamese
- Chat Compatible: Maintains general chat ability (100% on chatable benchmark)
### ⚠️ Limitations
- Irrelevance detection: slightly more prone to calling tools on irrelevant queries than the base model (about -5% on the irrelevance category)
## 📝 Use Cases
### 🎯 Ideal For
This model is optimized for edge deployment and customer service automation where a small, efficient model is needed:
| Use Case | Description |
|---|---|
| Edge Device Deployment | Run locally on devices with limited GPU/RAM |
| Customer Service Chatbot | Automate order lookup, ticket creation, FAQ with tool calls |
| Voice Agent / Call Center | Real-time voice-to-action for phone support systems |
| IoT/Smart Home | Control devices via function calling on edge hardware |
| Mobile AI Assistant | On-device tool execution without cloud dependency |
| Cost-Efficient API Gateway | Route requests to appropriate backend services |
### 💼 Customer Service Examples
```python
# Example: Customer asks about their order
tools = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "create_ticket", "parameters": {"issue": "string", "priority": "string"}},
    {"name": "get_faq", "parameters": {"topic": "string"}}
]

# User (Vietnamese): "Đơn hàng #12345 của tôi ở đâu rồi?" ("Where is my order #12345?")
# Model output:
# <tool_call>
# {"name": "lookup_order", "arguments": {"order_id": "12345"}}
# </tool_call>

# User (Vietnamese): "Tôi muốn đổi trả sản phẩm" ("I want to return a product")
# Model output:
# <tool_call>
# {"name": "create_ticket", "arguments": {"issue": "product_return", "priority": "normal"}}
# </tool_call>
```
### ⚡ Why a Small Model?
| Benefit | Description |
|---|---|
| Low Latency | ~50ms inference on consumer GPU |
| Low Cost | 8× cheaper to deploy than the 14B model |
| Privacy | Run entirely on-premise, no data leaves device |
| Offline Capable | Works without internet connection |
## 🧠 Reduced Catastrophic Forgetting
This model uses RLVR (Reinforcement Learning with Verifiable Rewards) instead of traditional SFT, which helps reduce capability loss:
- Less forgetting than SFT: RLVR fine-tunes through reward signals rather than directly overwriting weights
- 100% chatable score: Model maintains normal conversation ability on BFCL benchmark
- Multilingual preserved: English and Vietnamese capabilities remain functional
- Lower risk: Compared to SFT, RLVR typically causes less regression on non-target tasks
## 🔬 Technical Details
| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-1.7B |
| Training Method | RLVR (RL fine-tuning) |
| Training Steps | 100 (Stage 1/V3) + 1100 of 3000 (Stage 2/V4, selected checkpoint) |
| Peak LR | 1e-6 → 2e-7 |
| Training Data | 117K samples (71K positive + 46K negative) |
| Precision | bfloat16 |
| Max Sequence Length | 32768 tokens |
| Tool Format | XML-style (<tool_call>...</tool_call>) |
## 📚 Citation
If you use this model, please cite:
```bibtex
@misc{qwen3-fc,
  title={Qwen3-1.7B-FC: Efficient Function Calling via GRPO Fine-tuning},
  author={ContextboxAI},
  year={2024},
  howpublished={\url{https://huggingface.co/contextboxai/Qwen3-1.7B-FC}},
}
```
## 🙏 Acknowledgments
- Qwen Team for the excellent base model
- Jan-nano for training methodology inspiration
- Berkeley Function Calling Leaderboard for the benchmark
- xLAM (Salesforce) for function calling data
- ToolACE for multi-turn tool usage data
- Toucan-1.5M (Agent-Ark) for irrelevant/negative samples
- TRL for GRPO implementation
## 📄 License
Apache 2.0
Model Card Contact: ContextboxAI