Qwen3-1.7B-FC: Function Calling Specialist

A function calling model based on Qwen3-1.7B, fine-tuned using RLVR (Reinforcement Learning with Verifiable Rewards) to improve tool-use capabilities on the BFCL V3 benchmark.

🏆 Performance Highlights

| Model | Size | BFCL Overall | Category Avg |
|---|---|---|---|
| Qwen3-1.7B-FC (Ours) | 1.7B | 54.2% | 50.8% |
| Qwen3-1.7B (Base) | 1.7B | 48.8% | 45.8% |
| Qwen3-8B | 8B | 51.9% | 48.6% |
| Qwen3-14B | 14B | 51.6% | 49.0% |

Response Efficiency

| Model | Avg Response Tokens | Efficiency vs Base |
|---|---|---|
| Qwen3-1.7B (Base) | 35.6 | - |
| Qwen3-1.7B-FC (Ours) | 22.7 | -36% |

The fine-tuned model generates 36% fewer tokens while maintaining higher accuracy, thanks to:

  • Direct tool calls without verbose preambles
  • Concise refusal messages ("None of the provided tools can answer this question")
  • Reduced <think> reasoning blocks
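
For illustration, here is roughly how the two models differ on the same weather query (the base-model output is a hypothetical example of the verbose pattern; exact text varies):

Base Qwen3-1.7B (verbose):
<think>
The user wants the weather in Tokyo. The get_weather tool takes a location, so I should call it...
</think>
Sure! Let me check the weather for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>

Qwen3-1.7B-FC (direct):
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>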

📊 Detailed Benchmark Results (BFCL V3)

Core Function Calling

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| simple | 81.0% | 61.5% | 69.2% | 65.5% |
| multiple | 79.0% | 55.5% | 66.0% | 57.0% |
| parallel | 78.0% | 68.0% | 78.0% | 77.0% |
| parallel_multiple | 64.5% | 51.5% | 66.5% | 66.5% |
| irrelevance | 81.2% | 86.2% | 85.4% | 90.4% |

Executable Python

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| exec_simple | 84.0% | 82.0% | 84.0% | 87.0% |
| exec_multiple | 70.0% | 70.0% | 78.0% | 78.0% |
| exec_parallel | 80.0% | 76.0% | 86.0% | 90.0% |
| exec_parallel_multiple | 60.0% | 60.0% | 67.5% | 65.0% |

Live API Categories

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| live_simple | 63.6% | 43.8% | 51.2% | 51.6% |
| live_multiple | 55.0% | 36.8% | 43.7% | 42.5% |
| live_parallel | 50.0% | 18.8% | 43.8% | 43.8% |
| live_parallel_multiple | 66.7% | 37.5% | 54.2% | 50.0% |
| live_irrelevance | 66.1% | 80.3% | 78.7% | 79.9% |

📚 Training Data

Data Sources

| Source | Samples | Type | Description |
|---|---|---|---|
| xLAM | ~60,000 | Positive | High-quality function calling examples from Salesforce |
| ToolACE | ~11,000 | Positive | Diverse multi-turn tool usage scenarios |
| Toucan-1.5M | 40,000 | Negative | Irrelevant queries (Server Shuffle method) |
| Synthetic Negatives | 6,000 | Negative | Domain mismatch, partial fulfillment, permission errors |
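
In the Server Shuffle method, a query is paired with the tool set of a different sample, so none of the available tools can answer it. A minimal sketch of the idea (our reconstruction; the field names are illustrative, not the exact Toucan pipeline):

import random

def server_shuffle(samples: list[dict]) -> list[dict]:
    """Build irrelevant-query negatives by giving each query another sample's tools."""
    negatives = []
    for i, sample in enumerate(samples):
        # Pick the tool set from a different sample so the query has no matching tool
        j = random.choice([k for k in range(len(samples)) if k != i])
        negatives.append({
            "query": sample["query"],
            "tools": samples[j]["tools"],
            "ground_truth": [],  # empty => the correct behavior is to refuse
        })
    return negatives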

Negative Sample Types

The model is trained to refuse appropriately using diverse negative samples:

| Type | Description | Example |
|---|---|---|
| Toucan Irrelevant | Query has no matching tool among the available functions | "What's the weather?" when only get_stock_price is available |
| Domain Mismatch | Tools from the wrong domain | Asking about finance when only cooking tools are available |
| Action Mismatch | Similar name but wrong action | Asking to "delete" when only a "get" function exists |
| Partial Fulfillment | Tools can't fully solve the query | The query needs 2 steps but only 1 tool is available |
| Permission/Auth | Missing required permissions | Admin action without credentials |
| Format Mismatch | Wrong data format requirements | Tool expects JSON but the query provides CSV |
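
Whatever the negative type, the supervision signal is the same: an empty ground-truth call list, so the only rewarded behavior is a refusal. A hypothetical training record (field names are illustrative):

# Hypothetical domain-mismatch negative sample (illustrative schema)
negative_sample = {
    "query": "What is the current price of AAPL stock?",
    "tools": [{"name": "search_recipes", "description": "Find cooking recipes by ingredient"}],
    "ground_truth": [],  # no valid call exists; the model should refuse
    "ideal_response": "None of the provided tools can answer this question.",
}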

🔧 Training Methodology

Two-Stage RLVR Fine-tuning

  1. Stage 1: Accuracy-focused training (V3)

    • Trained from Qwen3-1.7B base
    • Dataset: ~40K samples (stage2.parquet)
    • Reward: Correctness (1.0) + Format (0.1) + Efficiency (0.3) + Refusal (0.3)
    • Config: max_steps=5000, LR=5e-7, temp=1.2
    • Best checkpoint: step 100 (early stopping, highest accuracy)
  2. Stage 2: Efficiency optimization (V4)

    • Loaded from Stage 1 checkpoint-100
    • Focus: Reduce verbosity, discourage <think> tags
    • Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
    • Config: max_steps=3000, LR=2e-7
    • Selected checkpoint: step 1100
    • Result: 36% reduction in response tokens
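
The card does not state which RL framework was used for these two stages; as one concrete possibility, the Stage 2 run could be expressed with TRL's GRPOTrainer roughly as follows (a sketch under that assumption, with a toy stand-in for the real reward components):

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Stage 2 reward weights from the section below (efficiency-focused)
WEIGHTS = {"format": 0.1, "correctness": 0.5, "efficiency": 1.0, "refusal": 0.3}

def combined_reward(completions, **kwargs):
    """Toy stand-in: score each completion with format and efficiency terms only."""
    scores = []
    for text in completions:
        fmt = 1.0 if "<tool_call>" in text and "</tool_call>" in text else 0.0
        eff = -1.0 if "<think>" in text else 0.1  # Stage 2 <think> penalty
        scores.append(WEIGHTS["format"] * fmt + WEIGHTS["efficiency"] * eff)
    return scores

config = GRPOConfig(
    output_dir="qwen3-fc-stage2",
    learning_rate=2e-7,  # Stage 2 LR from the card
    max_steps=3000,
    bf16=True,
)
trainer = GRPOTrainer(
    model="path/to/stage1-checkpoint-100",  # resume from the Stage 1 best checkpoint
    reward_funcs=combined_reward,
    args=config,
    train_dataset=load_dataset("parquet", data_files="stage2.parquet")["train"],
)
trainer.train()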

Reward Function Design

# Combined Reward Formula
total_reward = (
    format_weight * format_reward +           # Valid <tool_call> JSON (0.0-1.0)
    correct_weight * correctness_reward +     # Tool name + arguments match (0.0-1.0)
    refusal_weight * refusal_reward +         # +1.0 correct refusal, -1.0 hallucination
    efficiency_weight * efficiency_reward     # Penalty for verbose <think>
)

# Stage 1 Weights (Accuracy Focus)
STAGE1_WEIGHTS = {
    'format': 0.2,
    'correctness': 1.0,    # Main focus
    'efficiency': 0.2,
    'refusal': 0.3,
}

# Stage 2 Weights (Efficiency Focus)
STAGE2_WEIGHTS = {
    'format': 0.1,
    'correctness': 0.5,    # Reduced - already accurate from Stage 1
    'efficiency': 1.0,     # Main focus - penalize <think> tags
    'refusal': 0.3,
}

Individual Reward Components

| Component | Description | Range |
|---|---|---|
| format_reward | Valid <tool_call>...</tool_call> JSON structure | 0.0 to 1.0 |
| correctness_reward | Tool name match + argument similarity | 0.0 to 1.0 |
| refusal_reward | +1.0 for a correct refusal, -1.0 for a hallucinated call | -1.0 to +1.0 |
| efficiency_reward | Penalty for verbose <think> blocks: -0.3 in Stage 1, -1.0 in Stage 2 | -1.0 to +0.1 |
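
The reward implementations are not published in this card; a minimal sketch of the efficiency component, consistent with the table above, might look like this:

def efficiency_reward(response: str, stage: int) -> float:
    """Penalize verbose <think> blocks; cap the bonus for concise responses at +0.1."""
    if "<think>" in response:
        return -0.3 if stage == 1 else -1.0  # stronger penalty in Stage 2
    return 0.1  # assumption: concise, direct responses earn the small positive cap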

Key Training Innovations

  1. Strong Refusal Penalty: -1.0 for calling tools when ground_truth = []
  2. Toucan Irrelevant Data: 40K high-quality "unanswerable" samples
  3. Efficiency Optimization: Rewarding direct tool calls without preambles
  4. Discourage <think> Tags: Strong penalty (-1.0) for verbose reasoning blocks
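
Innovation 1 ties directly into the refusal reward; a hypothetical implementation consistent with the -1.0 to +1.0 range above:

def refusal_reward(made_tool_call: bool, ground_truth: list) -> float:
    """Score refusal behavior on unanswerable samples (ground_truth == [])."""
    if not ground_truth:  # no valid call exists for this sample
        return -1.0 if made_tool_call else 1.0  # hallucinated call vs correct refusal
    return 0.0  # assumption: answerable samples are scored by correctness_reward instead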

🚀 Usage

With Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "contextboxai/Qwen3-1.7B-FC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# Define tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # Disable thinking for efficiency
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Expected Output

<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>
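
To act on this output programmatically, extract the JSON payload from the <tool_call> block. A minimal parser (illustrative; harden the error handling for production use):

import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool-call dicts from <tool_call>...</tool_call> blocks."""
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    return [json.loads(m) for m in re.findall(pattern, text, re.DOTALL)]

# parse_tool_calls(response) -> [{'name': 'get_weather', 'arguments': {'location': 'Tokyo'}}]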

Refusal Example

When asked "What is the meaning of life?" with only get_weather tool available:

None of the provided tools can answer this question.

With vLLM (Recommended for Production)

from vllm import LLM, SamplingParams

llm = LLM(model="contextboxai/Qwen3-1.7B-FC")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Reuse the chat-templated `prompt` built in the Transformers example above
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

💡 Key Features

✅ Strengths

  • Compact Size: Only 1.7B parameters, runs on consumer GPUs
  • High Accuracy: Outperforms larger models (8B, 14B) on function calling
  • Efficient Responses: Direct tool calls without verbose preambles
  • Strong Refusal: Trained on 46K negative samples to avoid hallucination
  • Multilingual: Supports English and Vietnamese
  • Chat Compatible: Maintains general chat ability (100% on chatable benchmark)

⚠️ Limitations

  • Irrelevance detection: slightly more aggressive at calling tools than the base model (81.2% vs 86.2% on irrelevance; the gap is larger on live_irrelevance, 66.1% vs 80.3%)

📝 Use Cases

🎯 Ideal For

This model is optimized for edge deployment and customer service automation where a small, efficient model is needed:

| Use Case | Description |
|---|---|
| Edge Device Deployment | Run locally on devices with limited GPU/RAM |
| Customer Service Chatbot | Automate order lookup, ticket creation, and FAQs with tool calls |
| Voice Agent / Call Center | Real-time voice-to-action for phone support systems |
| IoT / Smart Home | Control devices via function calling on edge hardware |
| Mobile AI Assistant | On-device tool execution without cloud dependency |
| Cost-Efficient API Gateway | Route requests to appropriate backend services |

💼 Customer Service Examples

# Example: Customer asks about their order
tools = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "create_ticket", "parameters": {"issue": "string", "priority": "string"}},
    {"name": "get_faq", "parameters": {"topic": "string"}}
]

# User: "Đơn hàng #12345 của tôi ở đâu rồi?"
# Model output:
# <tool_call>
# {"name": "lookup_order", "arguments": {"order_id": "12345"}}
# </tool_call>

# User: "Tôi muốn đổi trả sản phẩm"
# Model output:
# <tool_call>
# {"name": "create_ticket", "arguments": {"issue": "product_return", "priority": "normal"}}
# </tool_call>

⚡ Why Small Model?

| Benefit | Description |
|---|---|
| Low Latency | ~50ms inference on a consumer GPU |
| Low Cost | 8x cheaper to deploy than a 14B model |
| Privacy | Runs entirely on-premise; no data leaves the device |
| Offline Capable | Works without an internet connection |

🧠 Reduced Catastrophic Forgetting

This model uses RLVR (Reinforcement Learning with Verifiable Rewards) instead of traditional SFT, which helps reduce capability loss:

  • Less forgetting than SFT: RLVR optimizes through reward signals rather than forcing the model to imitate fixed target outputs token by token
  • 100% chatable score: the model maintains normal conversation ability on BFCL's chatable category
  • Multilingual preserved: English and Vietnamese capabilities remain functional
  • Lower risk: compared to SFT, RLVR typically causes less regression on non-target tasks

🔬 Technical Details

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-1.7B |
| Training Method | RLVR (RL fine-tuning) |
| Training Steps | 100 (V3) + 3000 (V4) |
| Peak LR | 1e-6 → 2e-7 |
| Training Data | 117K samples (71K positive + 46K negative) |
| Precision | bfloat16 |
| Max Sequence Length | 32768 tokens |
| Tool Format | XML-style (<tool_call>...</tool_call>) |

📚 Citation

If you use this model, please cite:

@misc{qwen3-fc,
  title={Qwen3-1.7B-FC: Efficient Function Calling via GRPO Fine-tuning},
  author={ContextboxAI},
  year={2025},
  howpublished={\url{https://huggingface.co/contextboxai/Qwen3-1.7B-FC}},
}

📄 License

Apache 2.0


Model Card Contact: ContextboxAI
