---
library_name: transformers
language:
- ar
license: apache-2.0
base_model: google/functiongemma-270m-it
tags:
- function-calling
- tool-use
- arabic
- instruction-tuning
- gemma
- transformers
---

# FunctionGemma-270M Arabic Tool Use

This model is a fine-tuned version of **`google/functiongemma-270m-it`** for Arabic **tool use / function calling**, covering multiple dialects and domains.

It is trained to produce **exactly one tool call** when a tool is required, using **FunctionGemma-native tool formatting** (special function-call tokens) and structured JSON arguments.

## Base model
- `google/functiongemma-270m-it`

## Dataset
- `metga97/arabic-tooluse-functiongemma-v1`

## What the model outputs

When a tool is required, generation should include a FunctionGemma tool call pattern such as:

- `<start_function_call>call:TOOL_NAME{ ...json args... }<end_function_call>`

For non-tool requests, it returns a short Arabic reply.
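A minimal parser for the call pattern above could look like the following sketch (the delimiter strings come from the pattern shown; the helper name `parse_tool_call` is illustrative, and the regex assumes flat, non-nested JSON arguments):

```python
import json
import re

# Matches the FunctionGemma-style pattern shown above:
# <start_function_call>call:TOOL_NAME{ ...json args... }<end_function_call>
# Note: the non-greedy brace match assumes flat (non-nested) JSON args.
CALL_RE = re.compile(
    r"<start_function_call>call:(?P<name>[\w.]+)\s*(?P<args>\{.*?\})<end_function_call>",
    re.DOTALL,
)

def parse_tool_call(text):
    """Return (tool_name, args_dict) for the first tool call, or None."""
    m = CALL_RE.search(text)
    if m is None:
        return None  # non-tool reply (e.g. a short Arabic answer)
    try:
        args = json.loads(m.group("args"))
    except json.JSONDecodeError:
        return None  # malformed JSON counts as "not parsed OK"
    return m.group("name"), args
```

Treating both a missing pattern and malformed JSON as `None` mirrors the "Parsed OK" metric below: only a well-formed call with valid JSON arguments counts.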

## Evaluation (by slang / dialect)

Evaluated on the test split of `metga97/arabic-tooluse-functiongemma-v1`.

### Overall
- Parsed OK rate: **0.891**
- Tool name accuracy: **0.9921**
- Strict EM (exact match): **0.6564**
- Key-F1 (avg): **0.9925**
- Missed-call rate: **0.0064**
- False-call rate (negatives): **0.0**

### Strict EM by slang / dialect
- **Egyptian**: 0.6791 (denom_calls: 1069)
- **Gulf**: 0.6237 (denom_calls: 1172)
- **Levantine**: 0.6558 (denom_calls: 706)
- **MSA**: 0.6804 (denom_calls: 1408)
- **Maghrebi**: 0.5455 (denom_calls: 176)

### Strict EM by domain
- banking_finance: 0.6255 (denom_calls: 542)
- ecommerce: 0.6400 (denom_calls: 550)
- government_services: 0.7651 (denom_calls: 613)
- healthcare: 0.5754 (denom_calls: 577)
- islamic_services: 0.7119 (denom_calls: 597)
- travel: 0.6028 (denom_calls: 564)
- utilities: 0.4652 (denom_calls: 561)
- weather: 0.8653 (denom_calls: 527)

## Inference (important)

### 1) Use left padding for decoder-only generation
Set:
- `tokenizer.padding_side = "left"`
- `tokenizer.pad_token = tokenizer.eos_token` (if missing)
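As a configuration sketch (substitute this model's repo id; the base model id is used as a stand-in here):

```python
from transformers import AutoTokenizer

# Stand-in repo id; replace with this model's repo id.
tokenizer = AutoTokenizer.from_pretrained("google/functiongemma-270m-it")

# Decoder-only models generate to the right, so pad on the left.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```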

### 2) Pass tools via `apply_chat_template(..., tools=tools_list)`
This is critical for FunctionGemma-style function calling.

Example outline:
1. Select a tool subset for the request (domain pack + deterministic sampling).
2. Build prompt with `apply_chat_template` including `tools=tools_list`.
3. `generate()` deterministically (`do_sample=False`, `temperature=0.0`).
4. Parse tool call tokens and arguments.
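Step 1 of the outline above could be sketched like this (the function name, the `k` default, and the `domain`/`name` keys are all illustrative, not the card's actual code):

```python
import hashlib

def select_tools(all_tools, domain, user_text, k=6):
    """Pick a tool subset: the request's domain pack first, then a
    deterministic, hash-ordered sample of the remaining tools as distractors."""
    pack = [t for t in all_tools if t["domain"] == domain]
    distractors = sorted(
        (t for t in all_tools if t["domain"] != domain),
        key=lambda t: hashlib.sha256(
            (user_text + t["name"]).encode("utf-8")
        ).hexdigest(),
    )
    return (pack + distractors)[:k]
```

Hash-based ordering keeps the selection reproducible across runs without a global RNG seed; the resulting list is then passed as `tools=tools_list` to `apply_chat_template` in step 2.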

## Known limitations / improvement ideas

- Some outputs may translate slot values into English (e.g., “Abu Dhabi”, “ID renewal”).
  - Mitigations: stronger developer prompt constraints, post-processing, adding explicit anti-translation supervision, and/or filtering/rebalancing training examples where values are English.
- Parsed-OK rate below 1.0: formatting consistency can be improved with:
  - longer training
  - a slightly stronger prompt
  - more negative/no-tool examples with explicit non-tool responses
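The post-processing mitigation mentioned above could start with a simple Latin-script check on parsed argument values (a sketch; `has_latin_values` is an illustrative name, not part of this repo):

```python
import re

LATIN_RE = re.compile(r"[A-Za-z]")

def has_latin_values(args):
    """Flag tool-call arguments whose string values contain Latin letters,
    i.e. slot values that were likely translated out of Arabic."""
    return any(
        isinstance(v, str) and LATIN_RE.search(v) is not None
        for v in args.values()
    )
```

Some slots legitimately contain Latin characters (IDs, codes, URLs), so a flag like this is best used to route outputs for review or re-generation rather than to reject them outright.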