---
library_name: transformers
language:
- ar
tags:
- function-calling
- tool-use
- arabic
- instruction-tuning
- gemma
- transformers
license: apache-2.0
base_model: google/functiongemma-270m-it
---

# FunctionGemma-270M Arabic Tool Use

This model is a fine-tuned version of **`google/functiongemma-270m-it`** for Arabic **tool use / function calling** across multiple dialects and domains.

It is trained to produce **exactly one tool call** when a tool is required, using **FunctionGemma-native tool formatting** (special function-call tokens) with structured JSON arguments.

## Base model

- `google/functiongemma-270m-it`

## Dataset

- `metga97/arabic-tooluse-functiongemma-v1`

## What the model outputs

When a tool is required, generation should include a FunctionGemma tool-call pattern such as:

- `<start_function_call>call:TOOL_NAME{ ...json args... }<end_function_call>`

For non-tool requests, it returns a short Arabic reply.

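A call in the format above can be extracted with a small parser. A minimal sketch, assuming the arguments between the function-call tokens are valid JSON (the helper name `parse_tool_call` is illustrative, not part of the model's tooling):

```python
import json
import re

# Matches the FunctionGemma-style pattern shown above:
# <start_function_call>call:TOOL_NAME{ ...json args... }<end_function_call>
TOOL_CALL_RE = re.compile(
    r"<start_function_call>call:(?P<name>[\w.\-]+)\s*(?P<args>\{.*?\})<end_function_call>",
    re.DOTALL,
)

def parse_tool_call(text: str):
    """Return (tool_name, args_dict) for the first tool call in text, or None."""
    m = TOOL_CALL_RE.search(text)
    if m is None:
        return None  # no tool call emitted
    try:
        args = json.loads(m.group("args"))
    except json.JSONDecodeError:
        return None  # malformed arguments count against the "parsed OK" rate
    return m.group("name"), args
```

For example, `parse_tool_call('<start_function_call>call:get_weather{"city": "Cairo"}<end_function_call>')` returns `("get_weather", {"city": "Cairo"})`.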
## Evaluation (by slang / dialect)

Evaluated on the test split of `metga97/arabic-tooluse-functiongemma-v1`.

### Overall

- Parsed OK rate: **0.891**
- Tool name accuracy: **0.9921**
- Strict EM (exact match): **0.6564**
- Key-F1 (avg): **0.9925**
- Missed-call rate: **0.0064**
- False-call rate (negatives): **0.0**

### Strict EM by slang / dialect

- **Egyptian**: 0.6791 (denom_calls: 1069)
- **Gulf**: 0.6237 (denom_calls: 1172)
- **Levantine**: 0.6558 (denom_calls: 706)
- **MSA**: 0.6804 (denom_calls: 1408)
- **Maghrebi**: 0.5455 (denom_calls: 176)

### Strict EM by domain

- banking_finance: 0.6255 (denom_calls: 542)
- ecommerce: 0.64 (denom_calls: 550)
- government_services: 0.7651 (denom_calls: 613)
- healthcare: 0.5754 (denom_calls: 577)
- islamic_services: 0.7119 (denom_calls: 597)
- travel: 0.6028 (denom_calls: 564)
- utilities: 0.4652 (denom_calls: 561)
- weather: 0.8653 (denom_calls: 527)

## Inference (important)

### 1) Use left padding for decoder-only generation

Set:

- `tokenizer.padding_side = "left"`
- `tokenizer.pad_token = tokenizer.eos_token` (if missing)

### 2) Pass tools via `apply_chat_template(..., tools=tools_list)`

This is critical for FunctionGemma-style function calling.

Example outline:

1. Select a tool subset for the request (domain pack + deterministic sampling).
2. Build the prompt with `apply_chat_template`, passing `tools=tools_list`.
3. Call `generate()` deterministically (`do_sample=False`, `temperature=0.0`).
4. Parse the tool-call tokens and arguments.

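The outline above can be sketched end to end. A minimal sketch, assuming a recent `transformers` version whose `apply_chat_template` accepts a `tools` argument; `MODEL_ID` is a placeholder for this checkpoint's Hub id, and the empty `tools_list` stands in for your selected tool subset:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Deterministic decoding, as recommended above.
GENERATION_KWARGS = {"do_sample": False, "temperature": 0.0, "max_new_tokens": 256}

def build_inputs(tokenizer, user_message, tools_list):
    """Steps 1-2: build the prompt with the chat template, passing the tools."""
    messages = [{"role": "user", "content": user_message}]
    return tokenizer.apply_chat_template(
        messages,
        tools=tools_list,          # critical for FunctionGemma-style calling
        add_generation_prompt=True,
        return_tensors="pt",
    )

if __name__ == "__main__":
    MODEL_ID = "path/to/this/checkpoint"  # placeholder: replace with the Hub id
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.padding_side = "left"               # decoder-only generation
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    tools_list = []  # step 1: your selected tool subset (JSON-schema tool dicts)
    # "What is the weather like in Cairo?"
    input_ids = build_inputs(tokenizer, "ما حالة الطقس في القاهرة؟", tools_list)
    output = model.generate(input_ids, **GENERATION_KWARGS)
    # Keep special tokens so the function-call markers can be parsed (step 4).
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False))
```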
## Known limitations / improvement ideas

- Some outputs may translate slot values into English (e.g., "Abu Dhabi", "ID renewal").
  - Mitigations: stronger developer-prompt constraints, post-processing, explicit anti-translation supervision, and/or filtering or rebalancing training examples whose values are in English.
- Parsed OK rate < 1.0: formatting consistency can be improved with:
  - longer training
  - a slightly stronger prompt
  - more negative/no-tool examples with explicit non-tool responses
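The post-processing mitigation above can be as simple as flagging Latin-script slot values. A minimal sketch (the `has_english_values` helper is a hypothetical name, not part of the model's tooling); note it is only a cheap proxy, since some values such as IDs or emails may legitimately contain Latin letters:

```python
import re

# Any Latin letter in a string value is treated as a sign of translation.
LATIN_RE = re.compile(r"[A-Za-z]")

def has_english_values(args: dict) -> bool:
    """Flag tool-call arguments whose string values contain Latin letters,
    a cheap proxy for slot values translated into English."""
    return any(
        isinstance(v, str) and LATIN_RE.search(v)
        for v in args.values()
    )
```

Flagged calls can then be retried with a stronger prompt, or the values mapped back to Arabic via a lookup table.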