---
license: mit
language:
- en
- it
base_model:
- google/functiongemma-270m-it
new_version: independently-platform/functiongemma-independently
datasets:
- independently-platform/FunctionGemma-270m-Independently
---
# Independently FunctionGemma Fine-Tune

This repository provides a fine-tuned version of `google/functiongemma-270m-it` tailored for the **Independently** desktop platform. The goal of this fine-tune is **high-precision, high-reliability tool calling** for local virtual assistant workflows (chores, expenses, recipes, alerts, and grocery lists). The model should consistently select the right tool and produce correctly structured arguments, even under varied phrasing and bilingual (English/Italian) requests.

---

## Purpose

The core objective of this project was to fine-tune a **very small model** that can be **embedded directly inside the Independently desktop app**, including on **low-power devices**, without sacrificing tool-calling accuracy or requiring external services.

---

## What This Model Optimizes For

- **Accurate tool selection** across chores, expenses, recipes, alerts, and grocery-list flows.
    
- **Correct parameterization**: filters, tags, date ranges, recurrence rules, limits, and update/delete semantics.
    
- **Robustness to phrasing variation**: short commands, conversational instructions, ambiguous user wording, and follow-ups.
    
- **Multilingual support (EN/IT)**: consistent behavior across English and Italian prompts for the same intent.
    

---

## Dataset

The dataset is **100% synthetic**, generated to match the FunctionGemma tool-calling format and the Independently tool schema.

- **Train:** 10,000 examples
    
- **Eval/Test:** 2,000 examples
    
- **Coverage:** All primary Independently tools and common usage patterns, including edge cases (e.g., missing optional fields, default behaviors, fuzzy time expressions, multi-constraint filters, and unusual or poorly worded queries).
    

### Tool Coverage

The dataset includes examples for the following tools:

#### Categories & Tags

- `data.listCategories`
    
- `data.upsertCategory`
    
- `data.listTags`
    
- `data.upsertTag`
    

#### Chores

- `data.listChores`
    
- `data.listArchivedChores`
    
- `data.createChore`
    
- `data.completeChoresByFilter`
    
- `data.archiveChoresByFilter`
    
- `data.deleteChoresByFilter`
    
- `data.updateChoresByFilter`
    
- `data.listRecurringChores`
    
- `data.createRecurringChore`
    

#### Expenses & Limits

- `data.listExpenses`
    
- `data.createExpense`
    
- `data.deleteExpensesByFilter`
    
- `data.updateExpensesByFilter`
    
- `data.listRecurringExpenses`
    
- `data.createRecurringExpense`
    
- `data.listExpenseLimits`
    
- `data.upsertExpenseLimit`
    

#### Recipes & Alerts

- `data.listRecipes`
    
- `data.suggestRecipes`
    
- `data.importRecipeDatasetFromUrl`
    
- `data.createRecipe`
    
- `data.updateRecipesByFilter`
    
- `data.scheduleRecipeAlertByName`
    
- `data.listAlerts`
    
- `data.dismissAlertsByFilter`
    

#### Integrations

- `discord.sendRecipeGroceryListByName`
    

---

## Training Configuration

|Item|Value|
|---|---|
|Base model|`google/functiongemma-270m-it`|
|Epochs|4|
|Hardware|NVIDIA RTX 4090|
|Runtime|~14 minutes|
|Train size|10,000|
|Eval size|2,000|

---

## Results

On the evaluation set (tool selection + argument correctness):

- **Base model:** ~0% accuracy
    
- **Fine-tuned model:** **98.2% accuracy**
    

This is a substantial improvement in both:

- selecting the intended tool for a given user request, and
    
- producing valid, faithful arguments aligned to Independently’s runtime schema.
    

## Evaluation Methodology

Evaluation is performed on a held-out **2,000-example** eval/test set containing natural-language user requests paired with a single expected tool call (tool name + JSON arguments). A prediction is counted as **correct** only when it produces the **right tool** _and_ the **right parameters** for the given query.

### What Counts as a Successful Prediction

A model output is considered a successful call when all of the following are true:

1. **Correct tool selection**  
    The predicted tool name exactly matches the expected tool for the request (e.g., `data.createExpense` vs. `data.updateExpensesByFilter`).
    
2. **Correct argument structure**  
    The output is a valid FunctionGemma-style tool call with well-formed JSON, using the expected argument schema for that tool.
    
3. **Correct parameterization (semantic match)**  
    The arguments faithfully represent the user’s intent, including:
    
    - Filters (tags, categories, names, status, archived/active)
        
    - Date and time constraints (explicit dates, ranges, relative expressions)
        
    - Recurrence rules (frequency, interval, days, next run)
        
    - Limits (amount thresholds, periods)
        
    - Update/delete/complete semantics and scopes (single item vs. matching set)
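
The three criteria above can be sketched as a staged check that reports the first criterion a prediction fails. This is a minimal illustration, not the project's actual evaluator. It assumes the model output has already been decoded into a JSON string with `name` and `arguments` keys, and the `SCHEMA` entry with its argument names (`name`, `amount`, `tags`, `date`) is hypothetical.

```python
import json

# Hypothetical schema: required and optional argument names per tool.
# The real Independently schema is not shown in this document.
SCHEMA = {
    "data.createExpense": {
        "required": {"name", "amount"},
        "optional": {"tags", "date"},
    },
}


def grade(raw_output: str, expected: dict) -> str:
    """Return 'ok', or the first failed criterion: 'structure', 'tool', 'parameters'."""
    # The output must be well-formed JSON before anything else can be checked.
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "structure"
    if not isinstance(call, dict) or not isinstance(call.get("arguments"), dict):
        return "structure"

    # 1. Tool selection: the name must match exactly.
    if call.get("name") != expected["name"]:
        return "tool"

    # 2. Argument structure: required keys present, no keys outside the schema.
    spec = SCHEMA[expected["name"]]
    keys = set(call["arguments"])
    if not spec["required"] <= keys or not keys <= spec["required"] | spec["optional"]:
        return "structure"

    # 3. Parameterization: plain equality stands in for a semantic comparison here.
    if call["arguments"] != expected["arguments"]:
        return "parameters"
    return "ok"
```

Reporting the failed stage rather than a bare boolean makes it easier to see whether errors come from tool choice or from argument values.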
        

### Matching Rules

- **Tool name match is strict**: the tool must be exactly the expected one.
    
- **Required fields must match**: missing or incorrect required parameters are failures.
    
- **Optional fields are allowed when consistent**: extra optional parameters are permitted only if they do not change the meaning of the request or narrow/expand the scope incorrectly.
    
- **Equivalent representations are accepted** when they resolve to the same meaning (e.g., alternative but schema-valid ways of expressing a date range), as long as they conform to the runtime tool schema.
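
One way to read the matching rules for arguments is sketched below. It is an assumption-laden illustration rather than the evaluator used for the reported score: the date fields (`month`, `from`, `to`), the `neutral_defaults` mapping, and the idea that an extra optional field is harmless only when it equals a known default are all hypothetical. Required fields are assumed to be part of the expected arguments, so checking every expected field also covers them.

```python
import calendar
from datetime import date

# Hypothetical field names that can express a date range.
DATE_KEYS = {"month", "from", "to"}


def canonical_range(args: dict):
    """Resolve the date fields to a (start, end) pair, or None if absent."""
    if "month" in args:  # e.g. "2025-01"
        year, month = map(int, args["month"].split("-"))
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    if "from" in args and "to" in args:
        return date.fromisoformat(args["from"]), date.fromisoformat(args["to"])
    return None


def args_match(predicted: dict, expected: dict, neutral_defaults: dict) -> bool:
    # Equivalent representations: date ranges are compared by meaning.
    if canonical_range(predicted) != canonical_range(expected):
        return False

    pred = {k: v for k, v in predicted.items() if k not in DATE_KEYS}
    exp = {k: v for k, v in expected.items() if k not in DATE_KEYS}

    # Every expected field (required ones included) must be reproduced exactly.
    for key, value in exp.items():
        if key not in pred or pred[key] != value:
            return False

    # Extra fields are tolerated only when they carry a known neutral default,
    # so they cannot narrow or widen the scope of the request.
    for key in pred.keys() - exp.keys():
        if key not in neutral_defaults or neutral_defaults[key] != pred[key]:
            return False
    return True
```

For example, `{"category": "food", "month": "2025-01"}` matches an expected `{"category": "food", "from": "2025-01-01", "to": "2025-01-31"}`, while adding an unexpected `tag` filter does not.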
    

### Accuracy Metric

Reported accuracy is **end-to-end exactness** on the eval set:

- **Correct = right tool + correct arguments**
    
- **Incorrect = wrong tool OR wrong/missing/misleading arguments**
    

This metric is intentionally strict, reflecting Independently’s production requirement that tool calls be executable and faithful to the user’s request without manual repair.
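
The metric itself is a plain ratio of exact matches. The sketch below assumes each prediction and reference is a dict with `name` and `arguments` keys (a hypothetical layout) and uses strict equality as a stand-in for the matching rules described earlier. An unparseable prediction is represented as `None` and counts as incorrect.

```python
def end_to_end_accuracy(predictions: list, references: list) -> float:
    """Fraction of examples with the right tool and the right arguments."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty eval set")

    correct = 0
    for pred, ref in zip(predictions, references):
        # A None prediction models an output that could not be parsed at all.
        if (
            pred is not None
            and pred["name"] == ref["name"]
            and pred["arguments"] == ref["arguments"]
        ):
            correct += 1
    return correct / len(references)
```

Under this definition there is no partial credit: a call with the right tool but one wrong argument scores the same as a call to the wrong tool.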

> Note: Both scores are measured on the same eval set, before and after fine-tuning, and count only end-to-end correctness (tool + parameters).

---

## Notes & Assumptions

- The dataset follows the **FunctionGemma tool-calling JSON format** used in Google’s official tooling guide.
    
- The model is optimized for **local-first** execution: tool decisions do not require external APIs.
    
- The **tool schema is consistent** between training and runtime to minimize schema drift and maximize reliability.
    

---

## Intended Use

This model is intended to be embedded within the Independently desktop assistant as the **tool-calling policy model**—the component responsible for producing structured tool invocations from natural language.
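
To illustrate where such a policy model sits, the sketch below routes an already-decoded tool call to a local handler. It is a hypothetical outline, not Independently's actual runtime: the registry, the `tool` decorator, the handler body, and the argument names `title` and `due` are assumptions. Only the tool name `data.createChore` comes from the list above.

```python
from typing import Any, Callable, Dict, Optional

# Registry mapping tool names to local handler functions (hypothetical).
HANDLERS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Decorator that registers a handler under a tool name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        HANDLERS[name] = fn
        return fn
    return register


@tool("data.createChore")
def create_chore(title: str, due: Optional[str] = None) -> dict:
    # Placeholder body: a real handler would write to the app's local store.
    return {"created": title, "due": due}


def dispatch(call: dict) -> Any:
    """Execute a decoded tool call of the form {'name': ..., 'arguments': {...}}."""
    handler = HANDLERS.get(call["name"])
    if handler is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return handler(**call["arguments"])
```

Because the handlers run in-process, a call such as `dispatch({"name": "data.createChore", "arguments": {"title": "Take out trash"}})` needs no network access, which fits the local-first goal stated above.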