---
license: gemma
language:
- ar
base_model:
- google/gemma-3-270m
pipeline_tag: text-generation
library_name: transformers
tags:
- function-calling
- tool-use
- agentic
- arabic
- reasoning
- think
- gemma3
- shared-task
- arabicnlp2026
- baseline
- dialect
datasets:
- TuwaiqAcademy/AISA-ArabicFC
model-index:
- name: AISA-AR-FunctionCall-Think
  results:
  - task:
      type: text-generation
      name: Arabic Function Calling Track B (Reasoning-Augmented)
    dataset:
      name: AISA-ArabicFC (held-out test)
      type: TuwaiqAcademy/AISA-ArabicFC
    metrics:
    - type: function-name-accuracy
      value: 0.982
      name: FnAcc
    - type: argument-exact-match
      value: 0.541
      name: ArgEM
    - type: think-before-call-rate
      value: 0.868
      name: ThinkRate
    - type: overall
      value: 0.739
      name: Overall (Track B, v2)
---
# AISA-AR-FunctionCall-Think
### 🏷️ Official **Track B baseline** for the [AISA-ArabicFC shared task](https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task) @ **ArabicNLP 2026** (co-located with EMNLP 2026, Budapest)
> This model is the **organizer-provided baseline** for **Track B (Reasoning-Augmented Function Calling)**. It defines the reference score that participating systems are expected to beat. It is released for reproducibility and as a starting point; **it is not a competition entry.**

A compact (**270M-parameter**) Arabic function-calling model that, given an Arabic user query (in any of 5 dialects) and a set of candidate tools, **writes a short Arabic `<think>` reasoning trace and then emits a structured tool call**. Fine-tuned (LoRA) from **[google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m)** on the AISA-ArabicFC reasoning data.

For the non-reasoning Track A baseline, see the sibling model **[AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT)**.

---
## At a glance
| | |
|---|---|
| **Role** | Official baseline — Track B (Reasoning-Augmented) |
| **Base model** | google/gemma-3-270m (270M params) |
| **Adaptation** | LoRA fine-tune with the adapter merged; loads as a standard causal LM |
| **Languages** | Arabic — MSA, Gulf, Egyptian, Levantine, Maghrebi |
| **Behaviour** | `<think>` Arabic reasoning → structured function call |
| **Training data** | [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) |
| **License** | Gemma (see *License* below) |
---
## The shared task
Given an Arabic user query and a set of candidate tool definitions, a system must:
1. **Decide** whether a function call is required (some queries need no tool),
2. **Select** the correct function name,
3. **Extract** the structured arguments,
4. **(Track B)** **Generate an Arabic reasoning trace** (`<think> … </think>`) *before* the call.

| Track | Description |
|-------|-------------|
| **A: Core** | Decide / Select / Extract |
| **B: Reasoning-Augmented** ← *this model* | Track A **+** an Arabic `<think>` reasoning trace |
| **C: Cross-Dialect Robustness** | Diagnostic: dialect-stratified evaluation of A/B submissions |
---
## How it works — input / output format
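This model uses **Gemma 3 chat turns** with a custom function-calling schema (it does **not** emit plain JSON). The dataset's `text` field holds the exact prompt followed by the target output; the prompt is everything up to and including the opening of the model turn. The structure is: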
```
<bos><start_of_turn>developer
<system instruction in Arabic>
<start_function_declaration>declaration:NAME{description:<escape>…<escape>,parameters:{…}}<end_function_declaration>
…one declaration per candidate tool…<end_of_turn>
<start_of_turn>developer
التاريخ والوقت الحالي …: 2024-04-12T23:05:24
اليوم هو الجمعة
أنت نموذج يمكنه استدعاء الوظائف التالية<end_of_turn>
<start_of_turn>user
أريد مقارنة أسعار تلفاز سامسونج في الأردن<end_of_turn>
<start_of_turn>model
```
The model then generates:
```
<think>
يبدو أن نية المستخدم هي الحصول على مقارنة لأسعار تلفاز سامسونج في الأردن. أداة "compare_prices" هي الأنسب …
</think>
<start_function_call>call:compare_prices{country:<escape>Jordan<escape>,product_name:<escape>Samsung TV<escape>}<end_function_call>
```
For a query that needs **no tool**, the model omits the `<start_function_call>` block (→ `requires_function = false`).

---
## Usage
```python
import re, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "TuwaiqAcademy/AISA-AR-FunctionCall-Think"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, device_map="auto"
).eval()

def parse_model_output(text: str) -> dict:
    """Turn raw generation into the shared-task submission schema."""
    out = {"requires_function": False, "function_name": "none", "arguments": {}, "think": ""}
    if (m := re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)):
        out["think"] = m.group(1).strip()
    if (m := re.search(r"<start_function_call>\s*call:(\w+)\{(.*?)\}\s*<end_function_call>", text, re.DOTALL)):
        out["requires_function"] = True
        out["function_name"] = m.group(1)
        # Each argument is either an <escape>-delimited string or a bare literal.
        for key, str_val, num_val in re.findall(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]+))", m.group(2)):
            val = str_val if str_val else num_val
            try:
                # Cast numeric-looking values; anything else stays a string.
                val = float(val) if "." in str(val) else int(val)
            except (ValueError, TypeError):
                pass
            out["arguments"][key] = val
    return out

# Easiest path: take the ready-made prompt from the dataset's `text` field and
# cut it at the model turn (everything after is what the model should produce).
from datasets import load_dataset

row = load_dataset("TuwaiqAcademy/AISA-ArabicFC", split="validation")[0]
prompt = row["text"].split("<start_of_turn>model\n")[0] + "<start_of_turn>model\n"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=250, do_sample=False)  # greedy
raw = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(parse_model_output(raw))
# → {'requires_function': True, 'function_name': 'compare_prices',
#    'arguments': {'country': 'Jordan', 'product_name': 'Samsung TV'},
#    'think': 'يبدو أن نية المستخدم …'}
```
The parsed dict maps directly onto a **leaderboard submission line**: `{"id", "tool_called", "arguments", "think"}` (use `function_name` → `tool_called`).
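
The mapping above can be sketched as a small helper. This is only an illustration: the name `to_submission_line` is hypothetical, and it assumes the leaderboard expects one JSON object per line and that the `id` value comes from the dataset row. Check the shared task page for the authoritative file format.

```python
import json

def to_submission_line(example_id, parsed: dict) -> str:
    """Serialise one parsed prediction as a JSON line (hypothetical helper)."""
    record = {
        "id": example_id,
        # The submission schema calls this field `tool_called`, not `function_name`.
        "tool_called": parsed["function_name"],
        "arguments": parsed["arguments"],
        "think": parsed["think"],
    }
    # ensure_ascii=False keeps the Arabic reasoning trace human-readable.
    return json.dumps(record, ensure_ascii=False)
```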
---
## Evaluation
Scored on the AISA-ArabicFC **held-out test set** (1,000 positive + negative examples) using the official **v2** metrics:
- **FnAcc** — function-name accuracy over *all* samples (also penalises hallucinated / missed calls; negatives have gold `none`)
- **ArgEM** — strict argument **exact match**, over positives only
- **ThinkRate** — fraction of outputs with a non-empty `<think>` trace
- **Overall (Track A)** = `0.40·FnAcc + 0.60·ArgEM`
- **Overall (Track B)** = `0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate`
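
To make the definitions above concrete, here is a minimal scoring sketch. It is not the official scorer: the function names and the record layout (dicts with `function_name`, `arguments`, and `think`) are assumptions, and ArgEM is taken here to compare argument dicts only, which may differ from the official v2 implementation.

```python
def overall_a(fn_acc: float, arg_em: float) -> float:
    """Track A overall score, using the weights stated above."""
    return 0.40 * fn_acc + 0.60 * arg_em

def overall_b(fn_acc: float, arg_em: float, think_rate: float) -> float:
    """Track B overall score, using the weights stated above."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate

def score(preds: list, golds: list) -> dict:
    """Compute FnAcc, ArgEM and ThinkRate for aligned prediction/gold lists."""
    n = len(golds)
    pairs = list(zip(preds, golds))
    # FnAcc: name match over all samples; gold negatives carry "none".
    fn_hits = sum(p["function_name"] == g["function_name"] for p, g in pairs)
    # ArgEM: strict dict equality, over positives only (assumption: name is not re-checked).
    positives = [(p, g) for p, g in pairs if g["function_name"] != "none"]
    arg_hits = sum(p["arguments"] == g["arguments"] for p, g in positives)
    # ThinkRate: share of outputs with a non-empty reasoning trace.
    think_hits = sum(bool(p.get("think", "").strip()) for p in preds)
    return {
        "FnAcc": fn_hits / n,
        "ArgEM": arg_hits / len(positives) if positives else 0.0,
        "ThinkRate": think_hits / n,
    }
```

Plugging in this model's reported FnAcc (0.982), ArgEM (0.541), and ThinkRate (0.868) gives `overall_a` ≈ 0.717 and `overall_b` ≈ 0.739.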
### Baseline results
| System | FnAcc | ArgEM | Overall (A) | Overall (B) |
|--------|:-----:|:-----:|:-----------:|:-----------:|
| **AISA-AR-FunctionCall-Think (270M) ← this** | **0.982** | **0.541** | **0.717** | **0.739** |
| GPT-4o — zero-shot | 0.927 | 0.070 | 0.413 | 0.313 |
| GPT-4o — 3-shot | 0.854 | 0.122 | 0.415 | 0.317 |
| Random baseline | 0.047 | 0.033 | 0.039 | 0.031 |
- **Think-Before-Call rate (ThinkRate):** **0.868** for this model; 0.000 for all non-reasoning baselines.
- **Hallucination rate:** **0.000** on negative (no-tool) queries.

**Key takeaways**
- 🎯 **Argument extraction is the open challenge.** Tool *selection* is largely solved (FnAcc ≈ 0.98), but strict argument **exact match tops out at 0.541** — and GPT-4o reaches only 0.070 zero-shot. This is where the task is won or lost.
- 🪶 **A 270M model beats GPT-4o** across every metric here, showing the value of task-specific Arabic training and lowering the compute barrier to entry.
- 🗣️ **Cross-dialect gaps remain.** FnAcc varies by roughly 10–15 points across dialects, with **Gulf and Levantine** consistently the hardest and Maghrebi (small sample) the easiest — see the Track C diagnostic in the task overview paper.
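
A dialect-stratified view of FnAcc, in the spirit of the Track C diagnostic, could be computed along these lines. This is a sketch only: the helper name `fn_acc_by_dialect` and the `dialect` key on gold records are assumptions, not the official evaluation code or a confirmed dataset field.

```python
from collections import defaultdict

def fn_acc_by_dialect(preds: list, golds: list) -> dict:
    """Function-name accuracy per dialect (hypothetical helper)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, g in zip(preds, golds):
        dialect = g["dialect"]  # assumed field name on the gold record
        totals[dialect] += 1
        hits[dialect] += int(p["function_name"] == g["function_name"])
    return {d: hits[d] / totals[d] for d in totals}
```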
---
## Training
- **Base:** `google/gemma-3-270m`
- **Method:** LoRA (rank 64), 3 epochs, cosine LR scheduler
- **Data:** AISA-ArabicFC training split (~10.5K examples) with 12,000 Arabic reasoning annotations for the `<think>` traces
- **Objective:** produce a short Arabic reasoning trace followed by a single structured tool call (or no call for negatives)
---
## Intended use & limitations
**Intended use**
- A reference **baseline** to compare against and reproduce for the AISA-ArabicFC shared task.
- A lightweight starting point for Arabic tool-use / agentic experiments.

**Out of scope / limitations**
- Trained for the **27-tool, 8-domain AISA-ArabicFC schema** and its prompt format; behaviour on arbitrary tools or free-form chat is undefined.
- Single-turn, single-call setting — no multi-tool or multi-turn dialogue.
- **Argument extraction is imperfect** (ArgEM 0.541): expect errors in date normalisation, numeric typing, and dialectal argument phrasing.
- Uneven dialect coverage (Maghrebi is only ~1.3% of data); robustness varies by dialect.
- A 270M model — capacity-limited by design to keep the baseline accessible.
---
## Related resources
- 🏆 **Shared task page:** https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task
- 📊 **Leaderboard:** https://huggingface.co/spaces/TuwaiqAcademy/AISA-ArabicFC-SharedTask-Leaderboard
- 📚 **Dataset (train + dev):** [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC)
---
## Citation
```bibtex
@inproceedings{najar2026aisaarabicfc,
title = {AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems},
author = {Najar, Omar},
booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
year = {2026}
}
```
## License
This model is a derivative of **Gemma 3** and is distributed under the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)**. By using it you agree to those terms and to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). The AISA-ArabicFC **dataset** is released separately under Apache-2.0.
## Contact
Shared-task organizers — **trdc@tuwaiq.edu.sa** · Tuwaiq Academy