ojus1 committed (verified) · Commit eae828a · Parent: c697282

Update README with model card for googledocs

Files changed (1): README.md (+569 −195)
---
library_name: transformers
license: mit
task_categories:
- text-generation
language:
- en
tags:
- agent
- Agentic Learning
- tool use
- BFCL
---

[![Funcdex-Collection](https://img.shields.io/badge/Hugging%20Face-Model-yellow?logo=huggingface)](https://huggingface.co/collections/prem-research/funcdex) [![Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-yellow?logo=huggingface)](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling) [![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/prem-research/Funcdex-Synthesizer) [![PremAI](https://img.shields.io/badge/Project-PremAI-green)](https://www.premai.io/)

# Funcdex-0.6B-googledocs

<div align="center">
<img src="assets/funcdex_hero.png" alt="Funcdex Hero" width="70%">
</div>

Funcdex-0.6B is a research preview model by Prem Labs. It is a LoRA finetune of Qwen3-0.6B (with thinking disabled), trained on a mix of [Funcdex-MT-Function-Calling](https://huggingface.co/datasets/prem-research/Funcdex-MT-Function-Calling), instruction-following, and single-turn function-calling datasets.

This model excels at multi-turn function calling with tools from `googledocs`.

The code used to generate the dataset can be found [here](https://github.com/prem-research/Funcdex-Synthesizer).

# Evaluation

<div align="center">
<img src="assets/line_plot.png" alt="Line Plot" width="80%">
</div>

## Results

### BFCL v3

We filtered the BFCL v3 examples relevant to our toolkits/bundles and report performance on that subset. The filtered set contains only 83 examples, which further emphasizes the need for workflow- and toolkit-specialized models.

<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th>LLM</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr style="text-align: center;">
<td>GPT-5 Mini<br>(medium)</td>
<td>0.71</td>
</tr>
<tr style="text-align: center;">
<td>Qwen3-1.7B</td>
<td>0.82</td>
</tr>
<tr style="text-align: center;">
<td><strong><a href="https://huggingface.co/prem-research/Funcdex-1.7B">Funcdex-1.7B</a></strong></td>
<td><strong>0.86</strong></td>
</tr>
</tbody>
</table>

### Funcdex-MT: Overall Performance

<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th>LLM</th>
<th>Exact Match</th>
<th>String Ratio</th>
<th>Total Cost ($)</th>
</tr>
</thead>
<tbody>
<tr style="text-align: center;">
<td>GPT-OSS-120B<br>(medium)</td>
<td>0.35</td>
<td>0.51</td>
<td>9.32</td>
</tr>
<tr style="text-align: center;">
<td>GPT-5 Mini<br>(medium)</td>
<td>0.35</td>
<td>0.58</td>
<td>99.71</td>
</tr>
<tr style="text-align: center;">
<td>GPT-5<br>(minimal)</td>
<td>0.18</td>
<td>0.59</td>
<td>205.45</td>
</tr>
<tr style="text-align: center;">
<td>Qwen3-0.6B</td>
<td>0.27</td>
<td>0.59</td>
<td>2.83</td>
</tr>
<tr style="text-align: center;">
<td>Qwen3-1.7B</td>
<td>0.27</td>
<td>0.69</td>
<td>5.73</td>
</tr>
<tr style="text-align: center;">
<td><strong><a href="https://huggingface.co/collections/prem-research/funcdex">Funcdex-0.6B</a></strong></td>
<td><strong>0.39</strong></td>
<td><strong>0.70</strong></td>
<td><strong>0.19</strong></td>
</tr>
<tr style="text-align: center;">
<td><strong><a href="https://huggingface.co/prem-research/Funcdex-1.7B">Funcdex-1.7B</a></strong></td>
<td><strong>0.43</strong></td>
<td><strong>0.81</strong></td>
<td>5.64</td>
</tr>
</tbody>
</table>

### Funcdex-MT: Toolkit-Level Performance

<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th rowspan="2">Toolkit</th>
<th colspan="2">GPT-OSS-120B<br>(medium)</th>
<th colspan="2">GPT-5<br>(minimal)</th>
<th colspan="2">GPT-5 Mini<br>(medium)</th>
<th colspan="2">Qwen3-0.6B</th>
<th colspan="3">Funcdex-0.6B</th>
<th colspan="2">Qwen3-1.7B</th>
<th colspan="3">Funcdex-1.7B</th>
</tr>
<tr style="text-align: center;">
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>LoRA Checkpoint</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>LoRA Checkpoint</th>
</tr>
</thead>
<tbody>
<tr style="text-align: center;">
<td><img src="assets/icons/asana.png" width="20" height="20" style="vertical-align: middle;"/> Asana</td>
<td>0.38</td>
<td>0.47</td>
<td>0.12</td>
<td>0.68</td>
<td>0.49</td>
<td>0.71</td>
<td>0.33</td>
<td>0.63</td>
<td>0.46</td>
<td>0.69</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-asana">🤗</a></td>
<td>0.30</td>
<td>0.79</td>
<td>0.52</td>
<td>0.82</td>
<td rowspan="10"><a href="https://huggingface.co/prem-research/Funcdex-1.7B">🤗</a></td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/calendly.png" width="20" height="20" style="vertical-align: middle;"/> Calendly</td>
<td>0.47</td>
<td>0.56</td>
<td>0.41</td>
<td>0.63</td>
<td>0.41</td>
<td>0.56</td>
<td>0.44</td>
<td>0.66</td>
<td>0.54</td>
<td>0.78</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-calendly">🤗</a></td>
<td>0.47</td>
<td>0.74</td>
<td>0.54</td>
<td>0.86</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/gmail.png" width="20" height="20" style="vertical-align: middle;"/> Gmail</td>
<td>0.48</td>
<td>0.70</td>
<td>0.24</td>
<td>0.69</td>
<td>0.50</td>
<td>0.73</td>
<td>0.27</td>
<td>0.61</td>
<td>0.47</td>
<td>0.72</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-gmail">🤗</a></td>
<td>0.31</td>
<td>0.73</td>
<td>0.53</td>
<td>0.83</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/google-calendar.png" width="20" height="20" style="vertical-align: middle;"/> Calendar</td>
<td>0.27</td>
<td>0.52</td>
<td>0.20</td>
<td>0.50</td>
<td>0.21</td>
<td>0.51</td>
<td>0.21</td>
<td>0.53</td>
<td>0.39</td>
<td>0.74</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googlecalendar">🤗</a></td>
<td>0.23</td>
<td>0.64</td>
<td>0.47</td>
<td>0.83</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/docs.png" width="20" height="20" style="vertical-align: middle;"/> Docs</td>
<td>0.19</td>
<td>0.38</td>
<td>0.07</td>
<td>0.49</td>
<td>0.18</td>
<td>0.46</td>
<td>0.07</td>
<td>0.58</td>
<td>0.13</td>
<td>0.64</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googledocs">🤗</a></td>
<td>0.11</td>
<td>0.62</td>
<td>0.18</td>
<td>0.79</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/google-drive.png" width="20" height="20" style="vertical-align: middle;"/> Drive</td>
<td>0.34</td>
<td>0.52</td>
<td>0.19</td>
<td>0.61</td>
<td>0.38</td>
<td>0.58</td>
<td>0.26</td>
<td>0.65</td>
<td>0.40</td>
<td>0.75</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googledrive">🤗</a></td>
<td>0.26</td>
<td>0.73</td>
<td>0.48</td>
<td>0.82</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/jira.png" width="20" height="20" style="vertical-align: middle;"/> Jira</td>
<td>0.47</td>
<td>0.53</td>
<td>0.17</td>
<td>0.65</td>
<td>0.47</td>
<td>0.66</td>
<td>0.51</td>
<td>0.69</td>
<td>0.58</td>
<td>0.76</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-jira">🤗</a></td>
<td>0.47</td>
<td>0.76</td>
<td>0.59</td>
<td>0.83</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/stripe.png" width="20" height="20" style="vertical-align: middle;"/> Stripe</td>
<td>0.15</td>
<td>0.37</td>
<td>0.10</td>
<td>0.46</td>
<td>0.12</td>
<td>0.39</td>
<td>0.08</td>
<td>0.50</td>
<td>0.17</td>
<td>0.71</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-stripe">🤗</a></td>
<td>0.09</td>
<td>0.56</td>
<td>0.16</td>
<td>0.80</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/to-do-list.png" width="20" height="20" style="vertical-align: middle;"/> Todoist</td>
<td>0.65</td>
<td>0.74</td>
<td>0.19</td>
<td>0.72</td>
<td>0.64</td>
<td>0.79</td>
<td>0.57</td>
<td>0.87</td>
<td>0.65</td>
<td>0.88</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-todoist">🤗</a></td>
<td>0.55</td>
<td>0.91</td>
<td>0.72</td>
<td>0.94</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/whatsapp.png" width="20" height="20" style="vertical-align: middle;"/> Whatsapp</td>
<td>0.23</td>
<td>0.39</td>
<td>0.13</td>
<td>0.47</td>
<td>0.24</td>
<td>0.43</td>
<td>0.20</td>
<td>0.43</td>
<td>0.28</td>
<td>0.64</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-whatsapp">🤗</a></td>
<td>0.26</td>
<td>0.55</td>
<td>0.31</td>
<td>0.71</td>
</tr>
</tbody>
</table>

The Funcdex-0.6B checkpoints are specialized models; the reported number is the average performance of each specific model on its respective subset.

### Funcdex-MT: Bundle/Multi-toolkit Performance

<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th rowspan="2">Bundle</th>
<th colspan="2">GPT-OSS-120B<br>(medium)</th>
<th colspan="2">GPT-5<br>(minimal)</th>
<th colspan="2">GPT-5 Mini<br>(medium)</th>
<th colspan="2">Qwen3-0.6B</th>
<th colspan="3">Funcdex-0.6B</th>
<th colspan="2">Qwen3-1.7B</th>
<th colspan="3">Funcdex-1.7B</th>
</tr>
<tr style="text-align: center;">
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>LoRA Checkpoint</th>
<th>EM</th>
<th>SR</th>
<th>EM</th>
<th>SR</th>
<th>LoRA Checkpoint</th>
</tr>
</thead>
<tbody>
<tr style="text-align: center;">
<td><img src="assets/icons/gmail.png" width="20" height="20" style="vertical-align: middle;"/>Gmail<img src="assets/icons/google-calendar.png" width="20" height="20" style="vertical-align: middle;"/>Calendar</td>
<td>0.28</td>
<td>0.53</td>
<td>0.15</td>
<td>0.54</td>
<td>0.22</td>
<td>0.56</td>
<td>0.19</td>
<td>0.51</td>
<td>0.26</td>
<td>0.54</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-gmail_googlecalendar">🤗</a></td>
<td>0.17</td>
<td>0.61</td>
<td>0.32</td>
<td>0.71</td>
<td rowspan="5"><a href="https://huggingface.co/prem-research/Funcdex-1.7B">🤗</a></td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/google-drive.png" width="20" height="20" style="vertical-align: middle;"/>Drive <img src="assets/icons/calendly.png" width="20" height="20" style="vertical-align: middle;"/> Calendly <img src="assets/icons/google-calendar.png" width="20" height="20" style="vertical-align: middle;"/> Calendar</td>
<td>0.32</td>
<td>0.45</td>
<td>0.17</td>
<td>0.52</td>
<td>0.35</td>
<td>0.47</td>
<td>0.19</td>
<td>0.49</td>
<td>0.35</td>
<td>0.60</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googledrive_calendly_googlecalendar">🤗</a></td>
<td>0.15</td>
<td>0.66</td>
<td>0.40</td>
<td>0.78</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/google-drive.png" width="20" height="20" style="vertical-align: middle;"/>Drive <img src="assets/icons/docs.png" width="20" height="20" style="vertical-align: middle;"/> Docs</td>
<td>0.28</td>
<td>0.37</td>
<td>0.12</td>
<td>0.50</td>
<td>0.33</td>
<td>0.47</td>
<td>0.18</td>
<td>0.54</td>
<td>0.34</td>
<td>0.70</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-googledrive_googledocs">🤗</a></td>
<td>0.19</td>
<td>0.68</td>
<td>0.43</td>
<td>0.76</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/jira.png" width="20" height="20" style="vertical-align: middle;"/>Jira <img src="assets/icons/gmail.png" width="20" height="20" style="vertical-align: middle;"/> Gmail</td>
<td>0.42</td>
<td>0.60</td>
<td>0.18</td>
<td>0.66</td>
<td>0.36</td>
<td>0.66</td>
<td>0.29</td>
<td>0.61</td>
<td>0.39</td>
<td>0.71</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-jira_gmail">🤗</a></td>
<td>0.28</td>
<td>0.72</td>
<td>0.44</td>
<td>0.82</td>
</tr>
<tr style="text-align: center;">
<td><img src="assets/icons/whatsapp.png" width="20" height="20" style="vertical-align: middle;"/>Whatsapp <img src="assets/icons/to-do-list.png" width="20" height="20" style="vertical-align: middle;"/> Todoist</td>
<td>0.32</td>
<td>0.58</td>
<td>0.19</td>
<td>0.66</td>
<td>0.35</td>
<td>0.69</td>
<td>0.26</td>
<td>0.50</td>
<td>0.41</td>
<td>0.70</td>
<td><a href="https://huggingface.co/prem-research/Funcdex-0.6B-whatsapp_todoist">🤗</a></td>
<td>0.27</td>
<td>0.68</td>
<td>0.39</td>
<td>0.77</td>
</tr>
</tbody>
</table>

## Inference

- Given a conversation, we extract all tuples `(context_messages, function_calls)` and use them to generate predictions. We ignore the `content` field and evaluate only the `function_calls` generated by the LLM.
- We use a vLLM deployment with `tool_choice="auto"`.
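The tuple extraction described above can be sketched as follows. This is a minimal illustration under an assumed message structure; `extract_eval_tuples` is our own name, not part of the evaluation harness:

```python
def extract_eval_tuples(conversation):
    """For each assistant turn that makes tool calls, pair the preceding
    context messages with that turn's reference function calls."""
    tuples = []
    for i, msg in enumerate(conversation):
        if msg["role"] == "assistant" and msg.get("tool_calls"):
            tuples.append((conversation[:i], msg["tool_calls"]))
    return tuples

conversation = [
    {"role": "user", "content": "Copy the doc."},
    {"role": "assistant", "content": "",  # content is ignored during evaluation
     "tool_calls": [{"name": "COPY_GOOGLE_DOCUMENT",
                     "arguments": {"document_id": "abc", "title": "Copy"}}]},
]
pairs = extract_eval_tuples(conversation)
print(len(pairs))  # 1
```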
## Metrics

Given a list of predicted and reference function calls, we report two metrics:
- **Function Call String Match (SR)**: We greedily match predicted calls against reference calls and report the best-matched string ratio using `difflib.SequenceMatcher.ratio`. The reported number is the average string ratio.
- **Exact Match (EM)**: Same as above, but with exact string matching instead. The reported number is the EM F1 score.

EM is a strict metric: it penalizes string arguments that may be acceptable, e.g. `"email_content": "This is an example."` vs. `"email_content": "This is an Example."`, which differ by only one letter.
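As a rough sketch of the SR computation above (simplified; the exact scoring code lives in the linked repository, and `greedy_match_sr` is our own illustrative name):

```python
from difflib import SequenceMatcher

def string_ratio(pred: str, ref: str) -> float:
    """Similarity ratio between a predicted and a reference call string."""
    return SequenceMatcher(None, pred, ref).ratio()

def greedy_match_sr(preds: list[str], refs: list[str]) -> float:
    """Greedily pair each reference call with its best-scoring unused
    prediction and average the resulting string ratios (SR)."""
    remaining = list(preds)
    scores = []
    for ref in refs:
        if not remaining:
            scores.append(0.0)
            continue
        best = max(remaining, key=lambda p: string_ratio(p, ref))
        scores.append(string_ratio(best, ref))
        remaining.remove(best)
    return sum(scores) / len(refs) if refs else 0.0

pred = '{"name": "CREATE_HEADER", "arguments": {"documentId": "abc"}}'
ref = '{"name": "CREATE_HEADER", "arguments": {"documentId": "abc"}}'
print(greedy_match_sr([pred], [ref]))  # 1.0 for an exact match
```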

# Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load model and tokenizer
base_model_name = "ojus1/Qwen3-0.6B-Instruct"
model_name = "prem-research/Funcdex-0.6B-googledocs"

tokenizer = AutoTokenizer.from_pretrained(model_name)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, model_name)

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "CREATE_HEADER",
            "description": "Create a header in a Google Doc",
            "parameters": {
                "type": "object",
                "properties": {
                    "documentId": {"type": "string", "description": "Document ID"},
                    "createHeader": {"type": "object", "description": "Header configuration"}
                },
                "required": ["documentId", "createHeader"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "COPY_GOOGLE_DOCUMENT",
            "description": "Copy a Google Document",
            "parameters": {
                "type": "object",
                "properties": {
                    "document_id": {"type": "string", "description": "Source document ID"},
                    "title": {"type": "string", "description": "Title for the copy"}
                },
                "required": ["document_id", "title"]
            }
        }
    }
]

# Define conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant that can help with tasks by using tools."},
    {"role": "user", "content": "Create a header in document '9z8y7x6w5v4u3t2s1r0q'."}
]

# Apply chat template with tools
formatted_input = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and generate
input_tokens = tokenizer(formatted_input, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True)

print("Response:", response)
```

For best results, provide a detailed system prompt to steer the tool-use behaviour.
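Qwen3-based models typically emit tool calls wrapped in `<tool_call>…</tool_call>` tags. A small helper along these lines (our own, illustrative, not part of the model's API) can recover them as JSON:

```python
import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract JSON tool calls wrapped in <tool_call>...</tool_call> tags."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(m) for m in pattern.findall(text)]

response = (
    '<tool_call>\n{"name": "CREATE_HEADER", "arguments": '
    '{"documentId": "9z8y7x6w5v4u3t2s1r0q", "createHeader": {"type": "DEFAULT"}}}\n</tool_call>'
)
calls = parse_tool_calls(response)
print(calls[0]["name"])  # CREATE_HEADER
```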

# License

The models, code, and dataset are licensed under the MIT License.