Qwen3-1.7B-FC: Function Calling Specialist

A function calling model based on Qwen3-1.7B, fine-tuned using RLVR (Reinforcement Learning with Verifiable Rewards) to improve tool-use capabilities on the BFCL V3 benchmark.

🏆 Performance Highlights

| Model | Size | BFCL Overall | Category Avg |
|---|---|---|---|
| Qwen3-1.7B-FC (Ours) | 1.7B | 54.2% | 50.8% |
| Qwen3-1.7B (Base) | 1.7B | 48.8% | 45.8% |
| Qwen3-8B | 8B | 51.9% | 48.6% |
| Qwen3-14B | 14B | 51.6% | 49.0% |

Response Efficiency

| Model | Avg Response Tokens | Efficiency vs Base |
|---|---|---|
| Qwen3-1.7B (Base) | 35.6 | - |
| Qwen3-1.7B-FC (Ours) | 22.7 | -36% |

The fine-tuned model generates 36% fewer tokens while maintaining higher accuracy, thanks to:

  • Direct tool calls without verbose preambles
  • Concise refusal messages ("None of the provided tools can answer this question")
  • Reduced <think> reasoning blocks
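
For illustration, here is roughly how the two models differ on the same weather query (the base-model output is a hypothetical example of the verbose pattern; exact text varies):

Base Qwen3-1.7B (verbose):
<think>
The user wants the weather in Tokyo. The get_weather tool takes a location, so I should call it...
</think>
Sure! Let me check the weather for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>

Qwen3-1.7B-FC (direct):
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>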

📊 Detailed Benchmark Results (BFCL V3)

Core Function Calling

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| simple | 81.0% | 61.5% | 69.2% | 65.5% |
| multiple | 79.0% | 55.5% | 66.0% | 57.0% |
| parallel | 78.0% | 68.0% | 78.0% | 77.0% |
| parallel_multiple | 64.5% | 51.5% | 66.5% | 66.5% |
| irrelevance | 81.2% | 86.2% | 85.4% | 90.4% |

Executable Python

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| exec_simple | 84.0% | 82.0% | 84.0% | 87.0% |
| exec_multiple | 70.0% | 70.0% | 78.0% | 78.0% |
| exec_parallel | 80.0% | 76.0% | 86.0% | 90.0% |
| exec_parallel_multiple | 60.0% | 60.0% | 67.5% | 65.0% |

Live API Categories

| Category | Qwen3-1.7B-FC (Ours) | Base 1.7B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|---|
| live_simple | 63.6% | 43.8% | 51.2% | 51.6% |
| live_multiple | 55.0% | 36.8% | 43.7% | 42.5% |
| live_parallel | 50.0% | 18.8% | 43.8% | 43.8% |
| live_parallel_multiple | 66.7% | 37.5% | 54.2% | 50.0% |
| live_irrelevance | 66.1% | 80.3% | 78.7% | 79.9% |

📚 Training Data

Data Sources

| Source | Samples | Type | Description |
|---|---|---|---|
| xLAM | ~60,000 | Positive | High-quality function calling examples from Salesforce |
| ToolACE | ~11,000 | Positive | Diverse multi-turn tool usage scenarios |
| Toucan-1.5M | 40,000 | Negative | Irrelevant queries (Server Shuffle method) |
| Synthetic Negatives | 6,000 | Negative | Domain mismatch, partial fulfillment, permission errors |
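
In the Server Shuffle method, a query is paired with the tool set of a different sample, so none of the available tools can answer it. A minimal sketch of the idea (our reconstruction; the field names are illustrative, not the exact Toucan pipeline):

import random

def server_shuffle(samples: list[dict]) -> list[dict]:
    """Build irrelevant-query negatives by giving each query another sample's tools."""
    negatives = []
    for i, sample in enumerate(samples):
        # Pick the tool set from a different sample so the query has no matching tool
        j = random.choice([k for k in range(len(samples)) if k != i])
        negatives.append({
            "query": sample["query"],
            "tools": samples[j]["tools"],
            "ground_truth": [],  # empty => the correct behavior is to refuse
        })
    return negatives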

Negative Sample Types

The model is trained to refuse appropriately using diverse negative samples:

| Type | Description | Example |
|---|---|---|
| Toucan Irrelevant | Query has no matching tool among the available functions | "What's the weather?" when only get_stock_price is available |
| Domain Mismatch | Tools from the wrong domain | Asking about finance when only cooking tools are available |
| Action Mismatch | Similar name but wrong action | Asking to "delete" when only a "get" function exists |
| Partial Fulfillment | Tools can't fully solve the query | The query needs 2 steps but only 1 tool is available |
| Permission/Auth | Missing required permissions | Admin action without credentials |
| Format Mismatch | Wrong data format requirements | Tool expects JSON but the query provides CSV |
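
Whatever the negative type, the supervision signal is the same: an empty ground-truth call list, so the only rewarded behavior is a refusal. A hypothetical training record (field names are illustrative):

# Hypothetical domain-mismatch negative sample (illustrative schema)
negative_sample = {
    "query": "What is the current price of AAPL stock?",
    "tools": [{"name": "search_recipes", "description": "Find cooking recipes by ingredient"}],
    "ground_truth": [],  # no valid call exists; the model should refuse
    "ideal_response": "None of the provided tools can answer this question.",
}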

🔧 Training Methodology

Two-Stage RLVR Fine-tuning

  1. Stage 1: Accuracy-focused training (V3)

    • Trained from Qwen3-1.7B base
    • Dataset: ~40K samples (stage2.parquet)
    • Reward: Correctness (1.0) + Format (0.1) + Efficiency (0.3) + Refusal (0.3)
    • Config: max_steps=5000, LR=5e-7, temp=1.2
    • Best checkpoint: step 100 (early stopping, highest accuracy)
  2. Stage 2: Efficiency optimization (V4)

    • Loaded from Stage 1 checkpoint-100
    • Focus: Reduce verbosity, discourage <think> tags
    • Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
    • Config: max_steps=3000, LR=2e-7
    • Selected checkpoint: step 1100
    • Result: 36% reduction in response tokens
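
The card does not state which RL framework was used for these two stages; as one concrete possibility, the Stage 2 run could be expressed with TRL's GRPOTrainer roughly as follows (a sketch under that assumption, with a toy stand-in for the real reward components):

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Stage 2 reward weights from the section below (efficiency-focused)
WEIGHTS = {"format": 0.1, "correctness": 0.5, "efficiency": 1.0, "refusal": 0.3}

def combined_reward(completions, **kwargs):
    """Toy stand-in: score each completion with format and efficiency terms only."""
    scores = []
    for text in completions:
        fmt = 1.0 if "<tool_call>" in text and "</tool_call>" in text else 0.0
        eff = -1.0 if "<think>" in text else 0.1  # Stage 2 <think> penalty
        scores.append(WEIGHTS["format"] * fmt + WEIGHTS["efficiency"] * eff)
    return scores

config = GRPOConfig(
    output_dir="qwen3-fc-stage2",
    learning_rate=2e-7,  # Stage 2 LR from the card
    max_steps=3000,
    bf16=True,
)
trainer = GRPOTrainer(
    model="path/to/stage1-checkpoint-100",  # resume from the Stage 1 best checkpoint
    reward_funcs=combined_reward,
    args=config,
    train_dataset=load_dataset("parquet", data_files="stage2.parquet")["train"],
)
trainer.train()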

Reward Function Design

# Combined Reward Formula
total_reward = (
    format_weight * format_reward +           # Valid <tool_call> JSON (0.0-1.0)
    correct_weight * correctness_reward +     # Tool name + arguments match (0.0-1.0)
    refusal_weight * refusal_reward +         # +1.0 correct refusal, -1.0 hallucination
    efficiency_weight * efficiency_reward     # Penalty for verbose <think>
)

# Stage 1 Weights (Accuracy Focus)
STAGE1_WEIGHTS = {
    'format': 0.2,
    'correctness': 1.0,    # Main focus
    'efficiency': 0.2,
    'refusal': 0.3,
}

# Stage 2 Weights (Efficiency Focus)
STAGE2_WEIGHTS = {
    'format': 0.1,
    'correctness': 0.5,    # Reduced - already accurate from Stage 1
    'efficiency': 1.0,     # Main focus - penalize <think> tags
    'refusal': 0.3,
}

Individual Reward Components

| Component | Description | Range |
|---|---|---|
| format_reward | Valid <tool_call>...</tool_call> JSON structure | 0.0 to 1.0 |
| correctness_reward | Tool name match + argument similarity | 0.0 to 1.0 |
| refusal_reward | +1.0 for a correct refusal, -1.0 for a hallucinated call | -1.0 to +1.0 |
| efficiency_reward | Penalty for verbose <think> blocks: -0.3 in Stage 1, -1.0 in Stage 2 | -1.0 to +0.1 |
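
The reward implementations are not published in this card; a minimal sketch of the efficiency component, consistent with the table above, might look like this:

def efficiency_reward(response: str, stage: int) -> float:
    """Penalize verbose <think> blocks; cap the bonus for concise responses at +0.1."""
    if "<think>" in response:
        return -0.3 if stage == 1 else -1.0  # stronger penalty in Stage 2
    return 0.1  # assumption: concise, direct responses earn the small positive cap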

Key Training Innovations

  1. Strong Refusal Penalty: -1.0 for calling tools when ground_truth = []
  2. Toucan Irrelevant Data: 40K high-quality "unanswerable" samples
  3. Efficiency Optimization: Rewarding direct tool calls without preambles
  4. Discourage <think> Tags: Strong penalty (-1.0) for verbose reasoning blocks
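
Innovation 1 ties directly into the refusal reward; a hypothetical implementation consistent with the -1.0 to +1.0 range above:

def refusal_reward(made_tool_call: bool, ground_truth: list) -> float:
    """Score refusal behavior on unanswerable samples (ground_truth == [])."""
    if not ground_truth:  # no valid call exists for this sample
        return -1.0 if made_tool_call else 1.0  # hallucinated call vs correct refusal
    return 0.0  # assumption: answerable samples are scored by correctness_reward instead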

🚀 Usage

With Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "contextboxai/Qwen3-1.7B-FC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# Define tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # Disable thinking for efficiency
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Expected Output

<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>
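
To act on this output programmatically, extract the JSON payload from the <tool_call> block. A minimal parser (illustrative; harden the error handling for production use):

import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool-call dicts from <tool_call>...</tool_call> blocks."""
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    return [json.loads(m) for m in re.findall(pattern, text, re.DOTALL)]

# parse_tool_calls(response) -> [{'name': 'get_weather', 'arguments': {'location': 'Tokyo'}}]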

Refusal Example

When asked "What is the meaning of life?" with only get_weather tool available:

None of the provided tools can answer this question.

With vLLM (Recommended for Production)

from vllm import LLM, SamplingParams

llm = LLM(model="contextboxai/Qwen3-1.7B-FC")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Reuse the chat-templated `prompt` built in the Transformers example above
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

💡 Key Features

✅ Strengths

  • Compact Size: Only 1.7B parameters, runs on consumer GPUs
  • High Accuracy: Outperforms larger models (8B, 14B) on function calling
  • Efficient Responses: Direct tool calls without verbose preambles
  • Strong Refusal: Trained on 46K negative samples to avoid hallucination
  • Multilingual: Supports English and Vietnamese
  • Chat Compatible: Maintains general chat ability (100% on chatable benchmark)

⚠️ Limitations

  • Irrelevance detection: slightly more aggressive at calling tools than the base model (81.2% vs 86.2% on irrelevance; the gap is larger on live_irrelevance, 66.1% vs 80.3%)

📝 Use Cases

🎯 Ideal For

This model is optimized for edge deployment and customer service automation where a small, efficient model is needed:

| Use Case | Description |
|---|---|
| Edge Device Deployment | Run locally on devices with limited GPU/RAM |
| Customer Service Chatbot | Automate order lookup, ticket creation, and FAQs with tool calls |
| Voice Agent / Call Center | Real-time voice-to-action for phone support systems |
| IoT / Smart Home | Control devices via function calling on edge hardware |
| Mobile AI Assistant | On-device tool execution without cloud dependency |
| Cost-Efficient API Gateway | Route requests to appropriate backend services |

💼 Customer Service Examples

# Example: Customer asks about their order
tools = [
    {"name": "lookup_order", "parameters": {"order_id": "string"}},
    {"name": "create_ticket", "parameters": {"issue": "string", "priority": "string"}},
    {"name": "get_faq", "parameters": {"topic": "string"}}
]

# User: "Đơn hàng #12345 của tôi ở đâu rồi?"
# Model output:
# <tool_call>
# {"name": "lookup_order", "arguments": {"order_id": "12345"}}
# </tool_call>

# User: "Tôi muốn đổi trả sản phẩm"
# Model output:
# <tool_call>
# {"name": "create_ticket", "arguments": {"issue": "product_return", "priority": "normal"}}
# </tool_call>

⚡ Why Small Model?

| Benefit | Description |
|---|---|
| Low Latency | ~50ms inference on a consumer GPU |
| Low Cost | 8x cheaper to deploy than a 14B model |
| Privacy | Runs entirely on-premise; no data leaves the device |
| Offline Capable | Works without an internet connection |

🧠 Reduced Catastrophic Forgetting

This model uses RLVR (Reinforcement Learning with Verifiable Rewards) instead of traditional SFT, which helps reduce capability loss:

  • Less forgetting than SFT: RLVR optimizes through reward signals rather than forcing the model to imitate fixed target outputs token by token
  • 100% chatable score: the model maintains normal conversation ability on BFCL's chatable category
  • Multilingual preserved: English and Vietnamese capabilities remain functional
  • Lower risk: compared to SFT, RLVR typically causes less regression on non-target tasks

🔬 Technical Details

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-1.7B |
| Training Method | RLVR (RL fine-tuning) |
| Training Steps | 100 (V3) + 3000 (V4) |
| Peak LR | 1e-6 → 2e-7 |
| Training Data | 117K samples (71K positive + 46K negative) |
| Precision | bfloat16 |
| Max Sequence Length | 32768 tokens |
| Tool Format | XML-style (<tool_call>...</tool_call>) |

📚 Citation

If you use this model, please cite:

@misc{qwen3-fc,
  title={Qwen3-1.7B-FC: Efficient Function Calling via GRPO Fine-tuning},
  author={ContextboxAI},
  year={2025},
  howpublished={\url{https://huggingface.co/contextboxai/Qwen3-1.7B-FC}},
}

📄 License

Apache 2.0


Model Card Contact: ContextboxAI
