---
language:
- ar
tags:
- function-calling
- tool-use
- arabic
- instruction-tuning
- gemma
- transformers
license: apache-2.0
base_model: google/functiongemma-270m-it
---

FunctionGemma-270M Arabic Tool Use

This model is a fine-tuned version of google/functiongemma-270m-it for Arabic tool use / function calling across multiple dialects and domains.

It is trained to produce exactly one tool call when a tool is required, using FunctionGemma-native tool formatting (special function-call tokens) and structured JSON arguments.

Base model

  • google/functiongemma-270m-it

Dataset

  • metga97/arabic-tooluse-functiongemma-v1

What the model outputs

When a tool is required, generation should include a FunctionGemma tool call pattern such as:

  • <start_function_call>call:TOOL_NAME{ ...json args... }<end_function_call>

For non-tool requests, it returns a short Arabic reply.
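The call can be extracted from decoded text with a small parser. A minimal sketch, assuming the literal token names shown above and flat (non-nested) JSON arguments; the exact special tokens may differ in the released tokenizer:

```python
import json
import re

# Matches <start_function_call>call:NAME{...}<end_function_call>
# (non-greedy braces, so nested JSON objects are not handled)
CALL_RE = re.compile(
    r"<start_function_call>call:(?P<name>[\w.\-]+)\s*(?P<args>\{.*?\})<end_function_call>",
    re.DOTALL,
)

def parse_tool_call(text: str):
    """Return (tool_name, args_dict), or None if no well-formed call is found."""
    m = CALL_RE.search(text)
    if not m:
        return None
    try:
        args = json.loads(m.group("args"))
    except json.JSONDecodeError:
        return None
    return m.group("name"), args
```

A parse failure (malformed JSON or missing tokens) returns None, which is one way a generation can count against the Parsed OK rate below.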

Evaluation

Evaluated on the test split of metga97/arabic-tooluse-functiongemma-v1.

Overall

  • Parsed OK rate: 0.891
  • Tool name accuracy: 0.9921
  • Strict EM: 0.6564
  • Key-F1 (avg): 0.9925
  • Missed-call rate: 0.0064
  • False-call rate (negatives): 0.0
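For intuition, a key-level F1 between predicted and gold argument dicts might be computed as sketched below. This is an illustrative reading of the metric, not the actual evaluation script:

```python
def key_f1(pred: dict, gold: dict) -> float:
    """F1 over argument keys whose values match the gold call exactly (illustrative)."""
    if not pred or not gold:
        # Two empty-argument calls agree perfectly; otherwise nothing matches
        return 1.0 if pred == gold else 0.0
    matched = sum(1 for k, v in pred.items() if gold.get(k) == v)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this reading, strict EM requires every key and value to match, while Key-F1 gives partial credit, which is consistent with Key-F1 (0.9925) sitting well above strict EM (0.6564).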

Strict EM by slang / dialect

  • Egyptian: 0.6791 (denom_calls: 1069)
  • Gulf: 0.6237 (denom_calls: 1172)
  • Levantine: 0.6558 (denom_calls: 706)
  • MSA: 0.6804 (denom_calls: 1408)
  • Maghrebi: 0.5455 (denom_calls: 176)

Strict EM by domain

  • banking_finance: 0.6255 (denom_calls: 542)
  • ecommerce: 0.6400 (denom_calls: 550)
  • government_services: 0.7651 (denom_calls: 613)
  • healthcare: 0.5754 (denom_calls: 577)
  • islamic_services: 0.7119 (denom_calls: 597)
  • travel: 0.6028 (denom_calls: 564)
  • utilities: 0.4652 (denom_calls: 561)
  • weather: 0.8653 (denom_calls: 527)

Inference (important)

1) Use left padding for decoder-only generation

Set:

  • tokenizer.padding_side = "left"
  • tokenizer.pad_token = tokenizer.eos_token (if missing)
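These two settings can be bundled into a small helper; `configure_for_generation` is just an illustrative name, not part of the transformers API:

```python
def configure_for_generation(tokenizer):
    """Apply the two settings above to a loaded HF tokenizer (sketch)."""
    # Decoder-only models continue from the end of the prompt, so pad on the left
    tokenizer.padding_side = "left"
    # Fall back to the EOS token when no pad token is defined
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```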

2) Pass tools via apply_chat_template(..., tools=tools_list)

This is critical for FunctionGemma-style function calling.

Example outline:

  1. Select a tool subset for the request (domain pack + deterministic sampling).
  2. Build prompt with apply_chat_template including tools=tools_list.
  3. generate() deterministically (do_sample=False; temperature is ignored when sampling is off).
  4. Parse tool call tokens and arguments.
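A hedged end-to-end sketch of the outline above. The model id points at the base checkpoint (swap in this fine-tuned model's id), the `get_weather` schema is illustrative, and the domain-pack selection from step 1 is omitted:

```python
MODEL_ID = "google/functiongemma-270m-it"  # replace with the fine-tuned checkpoint

# Step 1 (simplified): a hand-picked tool subset instead of domain-pack sampling
tools_list = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

def generate_tool_call(user_message: str) -> str:
    # Heavy imports kept local in this sketch
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    # Step 2: attach the tool schemas when building the prompt
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_message}],
        tools=tools_list,
        add_generation_prompt=True,
        return_tensors="pt",
    )
    # Step 3: deterministic decoding
    out = model.generate(input_ids, max_new_tokens=128, do_sample=False)
    # Step 4 would parse the function-call tokens out of this text
    return tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=False)
```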

Known limitations / improvement ideas

  • Some outputs may translate slot values into English (e.g., “Abu Dhabi”, “ID renewal”).
    • Mitigations: stronger developer prompt constraints, post-processing, adding explicit anti-translation supervision, and/or filtering/rebalancing training examples where values are English.
  • Parsed OK rate < 1.0: formatting consistency can likely be improved with:
    • longer training
    • slightly stronger prompt
    • adding more negative/no-tool examples with explicit non-tool responses