---
library_name: transformers
license: mit
task_categories:
- text-generation
language:
- en
tags:
- agent
- Agentic Learning
- tool use
- BFCL
---
[![Funcdex-Collection](https://img.shields.io/badge/Hugging%20Face-Model-yellow?logo=huggingface)](https://huggingface.co/collections/prem-research/funcdex) [![Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-yellow?logo=huggingface)](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling) [![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/prem-research/Funcdex-Synthesizer) [![PremAI](https://img.shields.io/badge/Project-PremAI-green)](https://www.premai.io/)
# Funcdex-1.7B
<div align="center">
<img src="assets/funcdex_hero.png" alt="Funcdex Hero" width="70%">
</div>
Funcdex-1.7B is a research preview model by Prem Labs. It is a LoRA finetune of Qwen3-1.7B (with thinking disabled), trained on a mix of [Funcdex-MT-Function-Calling](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling), instruction-following, and single-turn function-calling datasets.
This model excels at multi-turn function calling with tools from `gmail`, `jira`, `calendar`, `docs`, etc.
The code used to generate the dataset can be found [here](https://github.com/prem-research/Funcdex-Synthesizer).
# Evaluation
<div align="center">
<img src="assets/line_plot.png" alt="Line Plot" width="80%">
</div>
Notes:
- *Funcdex-0.6B is the average of the performances of the individual Funcdex-0.6B models.*
- For cost, we track the number of prompt/completion tokens used to evaluate 300 conversations.
  - e.g. if prices are $1 per million input tokens and $10 per million output tokens, and evaluation used `0.5M` input and `0.1M` output tokens, then the cost is `0.5 * 1 + 0.1 * 10 = $1.50`.
- *Qwen3-0.6B and Qwen3-1.7B evaluation costs are estimated by extrapolating from Llama3.2-3B serverless costs. Other models' costs are sourced from OpenRouter.*
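The cost arithmetic above can be sketched as a small helper (the prices and token counts are the illustrative numbers from the note, not measured values):

```python
def eval_cost(prompt_tokens, completion_tokens, input_price, output_price):
    """Evaluation cost in dollars, given token counts and per-million-token prices."""
    return (prompt_tokens / 1e6) * input_price + (completion_tokens / 1e6) * output_price

# Numbers from the example: $1/M input, $10/M output, 0.5M prompt and 0.1M completion tokens.
cost = eval_cost(500_000, 100_000, 1.0, 10.0)
print(f"${cost:.2f}")  # $1.50
```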
## Results
### BFCL v3
- We filtered the BFCL v3 examples relevant to our toolkits/bundles and report performance on that subset.
- The filtered set contains only 83 examples, further emphasizing the need for workflow/toolkit-specialized models.
<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th>LLM</th>
<th>Acc %</th>
</tr>
</thead>
<tbody>
<tr style="text-align: center;">
<td>GPT-5 Mini<br>(medium)</td>
<td>0.71</td>
</tr>
<tr style="text-align: center;">
<td>Qwen3-1.7B</td>
<td>0.82</td>
</tr>
<tr style="text-align: center;">
<td><strong><a href="https://huggingface.co/prem-research/Funcdex-1.7B">Funcdex-1.7B</a></strong></td>
<td><strong>0.86</strong></td>
</tr>
</tbody>
</table>
### Funcdex-MT: Overall Performance
<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th>LLM</th>
<th>Exact Match</th>
<th>String Ratio</th>
<th>Total Cost ($)</th>
</tr>
</thead>
<tbody>
<tr style="text-align: center;">
<td>GPT-OSS-120B<br>(medium)</td>
<td>0.35</td>
<td>0.51</td>
<td>9.32</td>
</tr>
<tr style="text-align: center;">
<td>GPT-5 Mini<br>(medium)</td>
<td>0.35</td>
<td>0.58</td>
<td>99.71</td>
</tr>
<tr style="text-align: center;">
<td>GPT-5<br>(minimal)</td>
<td>0.18</td>
<td>0.59</td>
<td>205.45</td>
</tr>
<tr style="text-align: center;">
<td>Qwen3-0.6B</td>
<td>0.27</td>
<td>0.59</td>
<td>2.83</td>
</tr>
<tr style="text-align: center;">
<td>Qwen3-1.7B</td>
<td>0.27</td>
<td>0.69</td>
<td>5.73</td>
</tr>
<tr style="text-align: center;">
<td><strong><a href="https://huggingface.co/collections/prem-research/funcdex">Funcdex-0.6B</a></strong></td>
<td><strong>0.39</strong></td>
<td><strong>0.70</strong></td>
<td><strong>0.19</strong></td>
</tr>
<tr style="text-align: center;">
<td><strong><a href="https://huggingface.co/prem-research/Funcdex-1.7B">Funcdex-1.7B</a></strong></td>
<td><strong>0.43</strong></td>
<td><strong>0.81</strong></td>
<td>5.64</td>
</tr>
</tbody>
</table>
### Funcdex-MT: Toolkit-Level Performance
<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th rowspan="2">Toolkit</th>
<th colspan="2">GPT-OSS-120B<br>(medium)</th>
<th colspan="2">GPT-5<br>(minimal)</th>
<th colspan="2">GPT-5 Mini<br>(medium)</th>
<th colspan="2">Qwen3-0.6B</th>
<th colspan="3">Funcdex-0.6B</th>
<th colspan="2">Qwen3-1.7B</th>
<th colspan="3">Funcdex-1.7B</th>
</tr>
<tr style="text-align: center;">
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>LoRA Checkpoint</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>LoRA Checkpoint</th>
</tr>
</thead>
<tbody>
<tr style="text-align: center;">
<td><img src="assets/icons/asana.png" width="20" height="20" style="vertical-align: middle;"/> Asana</td>
<td>0.38</td>
<td>0.47</td>
<td>0.12</td>
<td>0.68</td>
<td>0.49</td>
<td>0.71</td>
<td>0.33</td>
<td>0.63</td>
<td>0.46</td>
<td>0.69</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-asana">🤗</a></td>
<td>0.30</td>
<td>0.79</td>
<td>0.52</td>
<td>0.82</td>
<td rowspan="10"><a href="https://huggingface.co/prem-research/Funcdex-1.7B">🤗</a></td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/calendly.png" width="20" height="20" style="vertical-align: middle;"/> Calendly</td>
<td>0.47</td>
<td>0.56</td>
<td>0.41</td>
<td>0.63</td>
<td>0.41</td>
<td>0.56</td>
<td>0.44</td>
<td>0.66</td>
<td>0.54</td>
<td>0.78</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-calendly">🤗</a></td>
<td>0.47</td>
<td>0.74</td>
<td>0.54</td>
<td>0.86</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/gmail.png" width="20" height="20" style="vertical-align: middle;"/> Gmail</td>
<td>0.48</td>
<td>0.70</td>
<td>0.24</td>
<td>0.69</td>
<td>0.50</td>
<td>0.73</td>
<td>0.27</td>
<td>0.61</td>
<td>0.47</td>
<td>0.72</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-gmail">🤗</a></td>
<td>0.31</td>
<td>0.73</td>
<td>0.53</td>
<td>0.83</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/google-calendar.png" width="20" height="20" style="vertical-align: middle;"/> Calendar</td>
<td>0.27</td>
<td>0.52</td>
<td>0.20</td>
<td>0.50</td>
<td>0.21</td>
<td>0.51</td>
<td>0.21</td>
<td>0.53</td>
<td>0.39</td>
<td>0.74</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googlecalendar">🤗</a></td>
<td>0.23</td>
<td>0.64</td>
<td>0.47</td>
<td>0.83</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/docs.png" width="20" height="20" style="vertical-align: middle;"/> Docs</td>
<td>0.19</td>
<td>0.38</td>
<td>0.07</td>
<td>0.49</td>
<td>0.18</td>
<td>0.46</td>
<td>0.07</td>
<td>0.58</td>
<td>0.13</td>
<td>0.64</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googledocs">🤗</a></td>
<td>0.11</td>
<td>0.62</td>
<td>0.18</td>
<td>0.79</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/google-drive.png" width="20" height="20" style="vertical-align: middle;"/> Drive</td>
<td>0.34</td>
<td>0.52</td>
<td>0.19</td>
<td>0.61</td>
<td>0.38</td>
<td>0.58</td>
<td>0.26</td>
<td>0.65</td>
<td>0.40</td>
<td>0.75</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googledrive">🤗</a></td>
<td>0.26</td>
<td>0.73</td>
<td>0.48</td>
<td>0.82</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/jira.png" width="20" height="20" style="vertical-align: middle;"/> Jira</td>
<td>0.47</td>
<td>0.53</td>
<td>0.17</td>
<td>0.65</td>
<td>0.47</td>
<td>0.66</td>
<td>0.51</td>
<td>0.69</td>
<td>0.58</td>
<td>0.76</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-jira">🤗</a></td>
<td>0.47</td>
<td>0.76</td>
<td>0.59</td>
<td>0.83</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/stripe.png" width="20" height="20" style="vertical-align: middle;"/> Stripe</td>
<td>0.15</td>
<td>0.37</td>
<td>0.10</td>
<td>0.46</td>
<td>0.12</td>
<td>0.39</td>
<td>0.08</td>
<td>0.50</td>
<td>0.17</td>
<td>0.71</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-stripe">🤗</a></td>
<td>0.09</td>
<td>0.56</td>
<td>0.16</td>
<td>0.80</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/to-do-list.png" width="20" height="20" style="vertical-align: middle;"/> Todoist</td>
<td>0.65</td>
<td>0.74</td>
<td>0.19</td>
<td>0.72</td>
<td>0.64</td>
<td>0.79</td>
<td>0.57</td>
<td>0.87</td>
<td>0.65</td>
<td>0.88</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-todoist">🤗</a></td>
<td>0.55</td>
<td>0.91</td>
<td>0.72</td>
<td>0.94</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/whatsapp.png" width="20" height="20" style="vertical-align: middle;"/> Whatsapp</td>
<td>0.23</td>
<td>0.39</td>
<td>0.13</td>
<td>0.47</td>
<td>0.24</td>
<td>0.43</td>
<td>0.20</td>
<td>0.43</td>
<td>0.28</td>
<td>0.64</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-whatsapp">🤗</a></td>
<td>0.26</td>
<td>0.55</td>
<td>0.31</td>
<td>0.71</td>
</tr>
</tbody>
</table>
- Funcdex-0.6B refers to specialized models; the reported number is the average performance of each specialized model on its respective subset.
### Funcdex-MT: Bundle/Multi-toolkit Performance
<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th rowspan="2">Bundle</th>
<th colspan="2">GPT-OSS-120B<br>(medium)</th>
<th colspan="2">GPT-5<br>(minimal)</th>
<th colspan="2">GPT-5 Mini<br>(medium)</th>
<th colspan="2">Qwen3-0.6B</th>
<th colspan="3">Funcdex-0.6B</th>
<th colspan="2">Qwen3-1.7B</th>
<th colspan="3">Funcdex-1.7B</th>
</tr>
<tr style="text-align: center;">
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>LoRA Checkpoint</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>LoRA Checkpoint</th>
</tr>
</thead>
<tbody>
<tr style="text-align: center;">
<td><img src="assets/icons/gmail.png" width="20" height="20" style="vertical-align: middle;"/>Gmail<img src="assets/icons/google-calendar.png" width="20" height="20" style="vertical-align: middle;"/>Calendar</td>
<td>0.28</td>
<td>0.53</td>
<td>0.15</td>
<td>0.54</td>
<td>0.22</td>
<td>0.56</td>
<td>0.19</td>
<td>0.51</td>
<td>0.26</td>
<td>0.54</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-gmail_googlecalendar">🤗</a></td>
<td>0.17</td>
<td>0.61</td>
<td>0.32</td>
<td>0.71</td>
<td rowspan="5"><a href="https://huggingface.co/prem-research/Funcdex-1.7B">🤗</a></td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/google-drive.png" width="20" height="20" style="vertical-align: middle;"/>Drive <img src="assets/icons/calendly.png" width="20" height="20" style="vertical-align: middle;"/> Calendly <img src="assets/icons/google-calendar.png" width="20" height="20" style="vertical-align: middle;"/> Calendar</td>
<td>0.32</td>
<td>0.45</td>
<td>0.17</td>
<td>0.52</td>
<td>0.35</td>
<td>0.47</td>
<td>0.19</td>
<td>0.49</td>
<td>0.35</td>
<td>0.60</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googledrive_calendly_googlecalendar">🤗</a></td>
<td>0.15</td>
<td>0.66</td>
<td>0.40</td>
<td>0.78</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/google-drive.png" width="20" height="20" style="vertical-align: middle;"/>Drive <img src="assets/icons/docs.png" width="20" height="20" style="vertical-align: middle;"/> Docs</td>
<td>0.28</td>
<td>0.37</td>
<td>0.12</td>
<td>0.50</td>
<td>0.33</td>
<td>0.47</td>
<td>0.18</td>
<td>0.54</td>
<td>0.34</td>
<td>0.70</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googledrive_googledocs">🤗</a></td>
<td>0.19</td>
<td>0.68</td>
<td>0.43</td>
<td>0.76</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/jira.png" width="20" height="20" style="vertical-align: middle;"/>Jira <img src="assets/icons/gmail.png" width="20" height="20" style="vertical-align: middle;"/> Gmail</td>
<td>0.42</td>
<td>0.60</td>
<td>0.18</td>
<td>0.66</td>
<td>0.36</td>
<td>0.66</td>
<td>0.29</td>
<td>0.61</td>
<td>0.39</td>
<td>0.71</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-jira_gmail">🤗</a></td>
<td>0.28</td>
<td>0.72</td>
<td>0.44</td>
<td>0.82</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/whatsapp.png" width="20" height="20" style="vertical-align: middle;"/>Whatsapp <img src="assets/icons/to-do-list.png" width="20" height="20" style="vertical-align: middle;"/> Todoist</td>
<td>0.32</td>
<td>0.58</td>
<td>0.19</td>
<td>0.66</td>
<td>0.35</td>
<td>0.69</td>
<td>0.26</td>
<td>0.50</td>
<td>0.41</td>
<td>0.70</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-whatsapp_todoist">🤗</a></td>
<td>0.27</td>
<td>0.68</td>
<td>0.39</td>
<td>0.77</td>
</tr>
</tbody>
</table>
## Inference
- Given a conversation, we extract all tuples `(context_messages, function_calls)` and use them to generate predictions. We ignore the `content` field and evaluate only the `function_calls` generated by the LLM.
- We use a vLLM deployment with `tool_choice="auto"`.
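The exact evaluation harness is in the repository linked above; as a minimal sketch, assuming conversations are lists of role-tagged message dicts where assistant turns carry a hypothetical `tool_calls` field, the tuple extraction might look like:

```python
def extract_eval_tuples(conversation):
    """For each assistant turn that calls tools, pair the preceding messages
    (the context) with the reference tool calls. The assistant's `content`
    field is ignored, per the evaluation protocol."""
    tuples = []
    for i, msg in enumerate(conversation):
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            tuples.append((conversation[:i], msg["tool_calls"]))
    return tuples

conversation = [
    {"role": "user", "content": "Create a folder named Reports."},
    {"role": "assistant", "content": "", "tool_calls": [
        {"name": "CREATE_A_FOLDER",
         "arguments": {"folder_name": "Reports", "parent_id": "root"}},
    ]},
]
pairs = extract_eval_tuples(conversation)
print(len(pairs), len(pairs[0][0]))  # 1 tuple, with 1 context message
```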
## Metrics
Given a list of predicted and reference function calls, we report two metrics:
- **Function Call String Match (SR)**: We greedily match each reference call to its most similar predicted call and report the best-matched string ratio using `difflib.SequenceMatcher.ratio`. The reported number is the average string ratio.
- **Exact Match (EM)**: Same as above, but we perform exact string matching instead. The reported number is the EM F1 score.

EM is a strict metric: it penalizes string arguments that may be acceptable, e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.
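The SR metric can be sketched as follows, assuming function calls are serialized as strings; the exact greedy-matching implementation in our harness may differ in detail:

```python
import difflib

def string_ratio(predicted, reference):
    """Greedy best-match: each reference call is paired with its most similar
    remaining prediction; report the average SequenceMatcher ratio."""
    preds = list(predicted)
    ratios = []
    for ref in reference:
        if not preds:
            ratios.append(0.0)  # no prediction left to match this reference
            continue
        best = max(preds, key=lambda p: difflib.SequenceMatcher(None, p, ref).ratio())
        ratios.append(difflib.SequenceMatcher(None, best, ref).ratio())
        preds.remove(best)  # each prediction is matched at most once
    return sum(ratios) / len(ratios) if ratios else 0.0

pred = ['CREATE_A_FOLDER({"folder_name": "Reports"})']
ref  = ['CREATE_A_FOLDER({"folder_name": "Reports"})']
print(string_ratio(pred, ref))  # 1.0
```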
## Deployment with vLLM
```shell
vllm serve ojus1/Qwen3-1.7B-Instruct \
  --enable-lora \
  --lora-modules prem-research/Funcdex-1.7B=prem-research/Funcdex-1.7B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
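Once the server is running, the adapter is exposed through vLLM's OpenAI-compatible API under the LoRA module name. A minimal sketch of the request body (the tool schema here is illustrative, reused from the quickstart below; send it with any OpenAI-compatible client or plain HTTP):

```python
import json

# Chat-completions request for the vLLM server started above.
# "model" is the LoRA module name registered via --lora-modules.
payload = {
    "model": "prem-research/Funcdex-1.7B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant that can help with tasks by using tools."},
        {"role": "user", "content": "Create a folder named 'Reports' under parent 'root'."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "CREATE_A_FOLDER",
            "description": "Create a folder in Google Drive",
            "parameters": {
                "type": "object",
                "properties": {
                    "folder_name": {"type": "string"},
                    "parent_id": {"type": "string"},
                },
                "required": ["folder_name", "parent_id"],
            },
        },
    }],
    "tool_choice": "auto",
}

# POST this to http://localhost:8000/v1/chat/completions; the tool calls
# appear in choices[0].message.tool_calls of the response.
print(json.dumps(payload)[:60])
```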
# Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load model and tokenizer
base_model_name = "ojus1/Qwen3-1.7B-Instruct"
model_name = "prem-research/Funcdex-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype="auto",
device_map="auto"
)
model = PeftModel.from_pretrained(
base_model,
model_name,
torch_dtype="auto",
device_map="auto"
)
# Define tools (supports all toolkits)
tools = [
{
"type": "function",
"function": {
"name": "CREATE_SHARED_DRIVE",
"description": "Create a new shared drive in Google Drive",
"parameters": {
"type": "object",
"properties": {
"name": {"type": "string", "description": "Name of the shared drive"},
"requestId": {"type": "string", "description": "Unique request ID"}
},
"required": ["name", "requestId"]
}
}
},
{
"type": "function",
"function": {
"name": "CREATE_A_FOLDER",
"description": "Create a folder in Google Drive",
"parameters": {
"type": "object",
"properties": {
"folder_name": {"type": "string", "description": "Name of the folder"},
"parent_id": {"type": "string", "description": "Parent drive or folder ID"}
},
"required": ["folder_name", "parent_id"]
}
}
}
]
# Define conversation
messages = [
{"role": "system", "content": "You are a helpful assistant that can help with tasks by using tools."},
{"role": "user", "content": "Create a shared drive named 'Partner-Alpha-Integration' with request ID 'req-12345'."}
]
# Apply chat template with tools
formatted_input = tokenizer.apply_chat_template(
messages,
tools=tools,
tokenize=False,
add_generation_prompt=True
)
# Tokenize and generate
input_tokens = tokenizer(formatted_input, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(output[0][input_tokens['input_ids'].shape[1]:], skip_special_tokens=True)
print("Response:", response)
# Expected output includes: <tool_call>{"name": "CREATE_SHARED_DRIVE", "arguments": {"name": "Partner-Alpha-Integration", "requestId": "req-12345"}}</tool_call>
```
For best results, provide a detailed system prompt to steer the tool-use behaviour.
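When decoding raw generated text instead of using a tool-call parser, the hermes-style `<tool_call>` tags (as in the expected output above) can be extracted with a small sketch like this:

```python
import json
import re

# Matches hermes-style tool-call tags emitted by the model.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract JSON tool-call payloads from generated text."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

response = ('<tool_call>{"name": "CREATE_SHARED_DRIVE", "arguments": '
            '{"name": "Partner-Alpha-Integration", "requestId": "req-12345"}}</tool_call>')
calls = parse_tool_calls(response)
print(calls[0]["name"], calls[0]["arguments"]["requestId"])  # CREATE_SHARED_DRIVE req-12345
```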
# License
The models, code, and dataset are licensed under the MIT License.