---
license: gemma
language:
- ar
base_model:
- google/gemma-3-270m
pipeline_tag: text-generation
library_name: transformers
tags:
- function-calling
- tool-use
- agentic
- arabic
- reasoning
- think
- gemma3
- shared-task
- arabicnlp2026
- baseline
- dialect
datasets:
- TuwaiqAcademy/AISA-ArabicFC
model-index:
- name: AISA-AR-FunctionCall-Think
  results:
  - task:
      type: text-generation
      name: Arabic Function Calling, Track B (Reasoning-Augmented)
    dataset:
      name: AISA-ArabicFC (held-out test)
      type: TuwaiqAcademy/AISA-ArabicFC
    metrics:
    - type: function-name-accuracy
      value: 0.982
      name: FnAcc
    - type: argument-exact-match
      value: 0.541
      name: ArgEM
    - type: think-before-call-rate
      value: 0.868
      name: ThinkRate
    - type: overall
      value: 0.739
      name: Overall (Track B, v2)
---

# AISA-AR-FunctionCall-Think

### 🏷️ Official **Track B baseline** for the [AISA-ArabicFC shared task](https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task) @ **ArabicNLP 2026** (co-located with EMNLP 2026, Budapest)

> This model is the **organizer-provided baseline** for **Track B: Reasoning-Augmented Function Calling**. It defines the reference score that participating systems are expected to beat. It is released for reproducibility and as a starting point; **it is not a competition entry.**

A compact (**270M-parameter**) Arabic function-calling model. Given an Arabic user query (in any of 5 dialects) and a set of candidate tools, it **writes a short Arabic reasoning trace (the `think` field of a submission) and then emits a structured tool call**. It was fine-tuned with LoRA from **[google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m)** on the AISA-ArabicFC reasoning data.

For the non-reasoning Track A baseline, see the sibling model **[AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT)**.
---

## At a glance

| | |
|---|---|
| **Role** | Official baseline, Track B (Reasoning-Augmented) |
| **Base model** | google/gemma-3-270m (270M params) |
| **Adaptation** | LoRA fine-tune with the adapters merged; runs as a standard causal LM |
| **Languages** | Arabic: MSA, Gulf, Egyptian, Levantine, Maghrebi |
| **Behaviour** | Arabic reasoning trace → structured function call |
| **Training data** | [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) |
| **License** | Gemma (see *License* below) |

---

## The shared task

Given an Arabic user query and a set of candidate tool definitions, a system must:

1. **Decide** whether a function call is required (some queries need no tool),
2. **Select** the correct function name,
3. **Extract** the structured arguments,
4. **(Track B)** **Generate an Arabic reasoning trace** (the `think` field) *before* the call.

| Track | Description |
|-------|-------------|
| **A: Core** | Decide / Select / Extract |
| **B: Reasoning-Augmented** ← *this model* | Track A **+** an Arabic reasoning trace |
| **C: Cross-Dialect Robustness** | Diagnostic: dialect-stratified evaluation of A/B submissions |

---

## How it works: input / output format

This model uses **Gemma 3 chat turns** with a custom function-calling schema (it does **not** emit plain JSON). The exact prompt is the `text` field in the dataset; the structure is:

```
developer
declaration:NAME{description:,parameters:{…}}
…one declaration per candidate tool…
developer
التاريخ والوقت الحالي …: 2024-04-12T23:05:24
اليوم هو الجمعة
أنت نموذج يمكنه استدعاء الوظائف التالية
user
أريد مقارنة أسعار تلفاز سامسونج في الأردن
model
```

In English, the second developer turn gives the current date and time, states that today is Friday, and tells the model it can call the functions declared above. The user turn reads "I want to compare Samsung TV prices in Jordan."

The model then generates:

```
يبدو أن نية المستخدم هي الحصول على مقارنة لأسعار تلفاز سامسونج في الأردن. أداة "compare_prices" هي الأنسب …
call:compare_prices{country:Jordan,product_name:Samsung TV}
```

The trace translates roughly as "It seems the user's intent is to get a comparison of Samsung TV prices in Jordan. The "compare_prices" tool is the most suitable …".

For a query that needs **no tool**, the model omits the function-call block (→ `requires_function = false`).
---

## Usage

```python
import re

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TuwaiqAcademy/AISA-AR-FunctionCall-Think"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, device_map="auto"
).eval()

# Assumed delimiters: the reasoning trace is taken to sit inside <think>…</think>
# and string arguments inside <escape>…<escape>. Check both against the dataset's
# `text` field and adjust these patterns if they differ.
THINK_RE = re.compile(r"<think>\s*(.*?)\s*</think>", re.DOTALL)
ARG_RE = re.compile(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]+))")


def parse_model_output(text: str) -> dict:
    """Turn raw generation into the shared-task submission schema."""
    out = {"requires_function": False, "function_name": "none",
           "arguments": {}, "think": ""}
    if (m := THINK_RE.search(text)):
        out["think"] = m.group(1).strip()
    if (m := re.search(r"call:(\w+)\{(.*?)\}", text, re.DOTALL)):
        out["requires_function"] = True
        out["function_name"] = m.group(1)
        for key, str_val, num_val in ARG_RE.findall(m.group(2)):
            val = str_val if str_val else num_val
            try:
                # Numeric-looking values become float or int; others stay strings.
                val = float(val) if "." in str(val) else int(val)
            except (ValueError, TypeError):
                pass
            out["arguments"][key] = val
    return out


# Easiest path: take the ready-made prompt from the dataset's `text` field and
# cut it at the model turn (everything after is what the model should produce).
row = load_dataset("TuwaiqAcademy/AISA-ArabicFC", split="validation")[0]
prompt = row["text"].split("model\n")[0] + "model\n"

inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=250, do_sample=False)  # greedy

# Decode only the newly generated tokens, keeping special tokens for the parser.
raw = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(parse_model_output(raw))
# e.g. {'requires_function': True, 'function_name': 'compare_prices',
#       'arguments': {'country': 'Jordan', 'product_name': 'Samsung TV'},
#       'think': 'يبدو أن نية المستخدم …'}
```

The parsed dict maps directly onto a **leaderboard submission line**: `{"id", "tool_called", "arguments", "think"}` (rename `function_name` to `tool_called`).
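
As a minimal sketch of that mapping, the helper below turns one parsed dict into a JSON line. `to_submission_line` and the `"example-0"` id are hypothetical names used only for illustration, and writing one JSON object per line is an assumption about the submission file; the leaderboard page is the authority on the exact format.

```python
import json


def to_submission_line(example_id: str, parsed: dict) -> str:
    """Serialize one parsed output as a JSON line (hypothetical helper).

    Assumes `parsed` has the keys produced by `parse_model_output`:
    `function_name`, `arguments`, and `think`.
    """
    record = {
        "id": example_id,
        "tool_called": parsed["function_name"],  # renamed for the submission schema
        "arguments": parsed["arguments"],
        "think": parsed["think"],
    }
    # ensure_ascii=False keeps Arabic text readable instead of \u escapes.
    return json.dumps(record, ensure_ascii=False)


# Illustrative input, shaped like the example output of `parse_model_output`.
parsed = {
    "requires_function": True,
    "function_name": "compare_prices",
    "arguments": {"country": "Jordan", "product_name": "Samsung TV"},
    "think": "يبدو أن نية المستخدم …",
}
line = to_submission_line("example-0", parsed)
print(line)
```

For a no-tool query, `function_name` is already `"none"` and `arguments` is empty, so the same helper applies unchanged.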
---

## Evaluation

Scored on the AISA-ArabicFC **held-out test set** (1,000 positive + negative examples) using the official **v2** metrics:

- **FnAcc**: function-name accuracy over *all* samples (also penalises hallucinated / missed calls; negatives have gold `none`)
- **ArgEM**: strict argument **exact match**, over positives only
- **ThinkRate**: fraction of outputs with a non-empty reasoning trace
- **Overall (Track A)** = `0.40·FnAcc + 0.60·ArgEM`
- **Overall (Track B)** = `0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate`

### Baseline results

| System | FnAcc | ArgEM | Overall (A) | Overall (B) |
|--------|:-----:|:-----:|:-----------:|:-----------:|
| **AISA-AR-FunctionCall-Think (270M) ← this** | **0.982** | **0.541** | **0.717** | **0.739** |
| GPT-4o (zero-shot) | 0.927 | 0.070 | 0.413 | 0.313 |
| GPT-4o (3-shot) | 0.854 | 0.122 | 0.415 | 0.317 |
| Random baseline | 0.047 | 0.033 | 0.039 | 0.031 |

- **Think-Before-Call rate (ThinkRate):** **0.868** for this model; 0.000 for all non-reasoning baselines.
- **Hallucination rate:** **0.000** on negative (no-tool) queries.

**Key takeaways**

- 🎯 **Argument extraction is the open challenge.** Tool *selection* is largely solved (FnAcc ≈ 0.98), but strict argument **exact match tops out at 0.541**, and GPT-4o reaches only 0.070 zero-shot. This is where the task is won or lost.
- 🪶 **A 270M model beats GPT-4o** on every metric here, showing the value of task-specific Arabic training and lowering the compute barrier to entry.
- 🗣️ **Cross-dialect gaps remain.** FnAcc varies by roughly 10–15 points across dialects, with **Gulf and Levantine** consistently the hardest and Maghrebi (small sample) the easiest; see the Track C diagnostic in the task overview paper.
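
The weighted Overall formulas in this section can be checked directly against the results table. The sketch below is not the official scorer: `overall_a` and `overall_b` are hypothetical helper names, and the only inputs are the weights and the rounded table values stated above.

```python
def overall_a(fn_acc: float, arg_em: float) -> float:
    """Overall (Track A) = 0.40*FnAcc + 0.60*ArgEM."""
    return 0.40 * fn_acc + 0.60 * arg_em


def overall_b(fn_acc: float, arg_em: float, think_rate: float) -> float:
    """Overall (Track B) = 0.30*FnAcc + 0.50*ArgEM + 0.20*ThinkRate."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate


# (FnAcc, ArgEM, ThinkRate) as reported in the results table.
systems = {
    "AISA-AR-FunctionCall-Think": (0.982, 0.541, 0.868),
    "GPT-4o zero-shot": (0.927, 0.070, 0.000),
    "GPT-4o 3-shot": (0.854, 0.122, 0.000),
    "Random baseline": (0.047, 0.033, 0.000),
}

for name, (fn_acc, arg_em, think_rate) in systems.items():
    a = overall_a(fn_acc, arg_em)
    b = overall_b(fn_acc, arg_em, think_rate)
    print(f"{name}: Overall (A) = {a:.3f}, Overall (B) = {b:.3f}")
```

Rounded to three decimals, these reproduce the Overall (A) and Overall (B) columns. The Track B weights also mean that a system producing no reasoning trace can score at most 0.80 on Track B, since the ThinkRate term contributes nothing.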
---

## Training

- **Base:** `google/gemma-3-270m`
- **Method:** LoRA (rank 64), 3 epochs, cosine LR scheduler
- **Data:** AISA-ArabicFC training split (~10.5K examples) with 12,000 Arabic reasoning annotations for the reasoning traces
- **Objective:** produce a short Arabic reasoning trace followed by a single structured tool call (or no call for negatives)

---

## Intended use & limitations

**Intended use**

- A reference **baseline** to compare against and reproduce for the AISA-ArabicFC shared task.
- A lightweight starting point for Arabic tool-use / agentic experiments.

**Out of scope / limitations**

- Trained for the **27-tool, 8-domain AISA-ArabicFC schema** and its prompt format; behaviour on arbitrary tools or free-form chat is undefined.
- Single-turn, single-call setting: no multi-tool or multi-turn dialogue.
- **Argument extraction is imperfect** (ArgEM 0.541): expect errors in date normalisation, numeric typing, and dialectal argument phrasing.
- Uneven dialect coverage (Maghrebi is only ~1.3% of data); robustness varies by dialect.
- A 270M model, capacity-limited by design to keep the baseline accessible.

---

## Related resources

- 🏆 **Shared task page:** https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task
- 📊 **Leaderboard:** https://huggingface.co/spaces/TuwaiqAcademy/AISA-ArabicFC-SharedTask-Leaderboard
- 📚 **Dataset (train + dev):** [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC)

---

## Citation

```bibtex
@inproceedings{najar2026aisaarabicfc,
  title     = {AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems},
  author    = {Najar, Omar},
  booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
  year      = {2026}
}
```

## License

This model is a derivative of **Gemma 3** and is distributed under the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)**.
By using it you agree to those terms and to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). The AISA-ArabicFC **dataset** is released separately under Apache-2.0.

## Contact

Shared-task organizers: **trdc@tuwaiq.edu.sa** · Tuwaiq Academy