---
license: gemma
language:
- ar
base_model:
- google/gemma-3-270m
pipeline_tag: text-generation
library_name: transformers
tags:
- function-calling
- tool-use
- agentic
- arabic
- reasoning
- think
- gemma3
- shared-task
- arabicnlp2026
- baseline
- dialect
datasets:
- TuwaiqAcademy/AISA-ArabicFC
model-index:
- name: AISA-AR-FunctionCall-Think
  results:
  - task:
      type: text-generation
      name: Arabic Function Calling, Track B (Reasoning-Augmented)
    dataset:
      name: AISA-ArabicFC (held-out test)
      type: TuwaiqAcademy/AISA-ArabicFC
    metrics:
    - type: function-name-accuracy
      value: 0.982
      name: FnAcc
    - type: argument-exact-match
      value: 0.541
      name: ArgEM
    - type: think-before-call-rate
      value: 0.868
      name: ThinkRate
    - type: overall
      value: 0.739
      name: Overall (Track B, v2)
---

# AISA-AR-FunctionCall-Think

### 🏷️ Official **Track B baseline** for the [AISA-ArabicFC shared task](https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task) @ **ArabicNLP 2026** (co-located with EMNLP 2026, Budapest)

> This model is the **organizer-provided baseline** for **Track B — Reasoning-Augmented Function Calling**. It defines the reference score that participating systems are expected to beat. It is released for reproducibility and as a starting point — **it is not a competition entry.**

A compact (**270M-parameter**) Arabic function-calling model that, given an Arabic user query (in any of 5 dialects) and a set of candidate tools, **writes a short Arabic `<think>` reasoning trace and then emits a structured tool call**. Fine-tuned (LoRA) from **[google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m)** on the AISA-ArabicFC reasoning data.

For the non-reasoning Track A baseline, see the sibling model **[AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT)**.

---

## At a glance

| | |
|---|---|
| **Role** | Official baseline — Track B (Reasoning-Augmented) |
| **Base model** | google/gemma-3-270m (270M params) |
| **Adaptation** | LoRA fine-tune, merged into the base weights; loads and runs as a standard causal LM |
| **Languages** | Arabic — MSA, Gulf, Egyptian, Levantine, Maghrebi |
| **Behaviour** | `<think>` Arabic reasoning → structured function call |
| **Training data** | [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) |
| **License** | Gemma (see *License* below) |

---

## The shared task

Given an Arabic user query and a set of candidate tool definitions, a system must:

1. **Decide** whether a function call is required (some queries need no tool),
2. **Select** the correct function name,
3. **Extract** the structured arguments,
4. **(Track B)** **Generate an Arabic reasoning trace** (`<think> … </think>`) *before* the call.

| Track | Description |
|-------|-------------|
| **A — Core** | Decide / Select / Extract |
| **B — Reasoning-Augmented** ← *this model* | Track A **+** an Arabic `<think>` reasoning trace |
| **C — Cross-Dialect Robustness** | Diagnostic: dialect-stratified evaluation of A/B submissions |

---

## How it works — input / output format

This model uses **Gemma 3 chat turns** with a custom function-calling schema (it does **not** emit plain JSON). The exact prompt is the `text` field in the dataset; the structure is:

```
<bos><start_of_turn>developer
<system instruction in Arabic>
<start_function_declaration>declaration:NAME{description:<escape>…<escape>,parameters:{…}}<end_function_declaration>
…one declaration per candidate tool…<end_of_turn>
<start_of_turn>developer
التاريخ والوقت الحالي …: 2024-04-12T23:05:24
اليوم هو الجمعة
أنت نموذج يمكنه استدعاء الوظائف التالية<end_of_turn>
<start_of_turn>user
أريد مقارنة أسعار تلفاز سامسونج في الأردن<end_of_turn>
<start_of_turn>model
```

The model then generates:

```
<think>
يبدو أن نية المستخدم هي الحصول على مقارنة لأسعار تلفاز سامسونج في الأردن. أداة "compare_prices" هي الأنسب …
</think>
<start_function_call>call:compare_prices{country:<escape>Jordan<escape>,product_name:<escape>Samsung TV<escape>}<end_function_call>
```

For a query that needs **no tool**, the model omits the `<start_function_call>` block (→ `requires_function = false`).
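The call line shown above can be rebuilt from a function name and an argument dict, which may help when constructing gold strings or test fixtures. The following is a minimal sketch: `format_call` is a hypothetical helper, not part of the released code, and rendering non-string values without `<escape>` wrappers is an assumption rather than documented behaviour.

```python
def format_call(name: str, arguments: dict) -> str:
    """Render a call in the <start_function_call> format shown above.

    Assumption: string values are wrapped in <escape>…<escape>, while
    non-string values (e.g. numbers) are written bare.
    """
    parts = []
    for key, value in arguments.items():  # keeps the dict's insertion order
        if isinstance(value, str):
            parts.append(f"{key}:<escape>{value}<escape>")
        else:
            parts.append(f"{key}:{value}")
    body = ",".join(parts)
    return "<start_function_call>call:" + name + "{" + body + "}<end_function_call>"


call = format_call("compare_prices", {"country": "Jordan", "product_name": "Samsung TV"})
print(call)
```

For the example query this reproduces the call line shown in the output block above.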

---

## Usage

```python
import re, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "TuwaiqAcademy/AISA-AR-FunctionCall-Think"
tok   = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, device_map="auto"
).eval()

def parse_model_output(text: str) -> dict:
    """Turn raw generation into the shared-task submission schema."""
    out = {"requires_function": False, "function_name": "none", "arguments": {}, "think": ""}
    if (m := re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)):
        out["think"] = m.group(1).strip()
    if (m := re.search(r"<start_function_call>\s*call:(\w+)\{(.*?)\}\s*<end_function_call>", text, re.DOTALL)):
        out["requires_function"] = True
        out["function_name"] = m.group(1)
        for key, str_val, num_val in re.findall(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]+))", m.group(2)):
            val = str_val if str_val else num_val
            try:
                val = float(val) if "." in str(val) else int(val)
            except (ValueError, TypeError):
                pass
            out["arguments"][key] = val
    return out

# Easiest path: take the ready-made prompt from the dataset's `text` field and
# cut it at the model turn (everything after is what the model should produce).
from datasets import load_dataset
row = load_dataset("TuwaiqAcademy/AISA-ArabicFC", split="validation")[0]
prompt = row["text"].split("<start_of_turn>model\n")[0] + "<start_of_turn>model\n"

inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=250, do_sample=False)  # greedy
raw = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)

print(parse_model_output(raw))
# → {'requires_function': True, 'function_name': 'compare_prices',
#    'arguments': {'country': 'Jordan', 'product_name': 'Samsung TV'},
#    'think': 'يبدو أن نية المستخدم …'}
```

The parsed dict maps directly onto a **leaderboard submission line**: `{"id", "tool_called", "arguments", "think"}` (use `function_name` → `tool_called`).
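The renaming step can be sketched as below. `to_submission_line` is a hypothetical helper, and serialising one JSON object per line with the four keys listed above is an assumption; check the leaderboard page for the exact file format.

```python
import json


def to_submission_line(example_id: str, parsed: dict) -> str:
    """Convert a parse_model_output()-style dict into one submission line.

    Assumption: one JSON object per line, with exactly the four keys below.
    """
    record = {
        "id": example_id,
        "tool_called": parsed["function_name"],  # "none" when no call was emitted
        "arguments": parsed["arguments"],
        "think": parsed["think"],
    }
    # ensure_ascii=False keeps the Arabic reasoning trace human-readable.
    return json.dumps(record, ensure_ascii=False)


parsed = {
    "requires_function": True,
    "function_name": "compare_prices",
    "arguments": {"country": "Jordan", "product_name": "Samsung TV"},
    "think": "يبدو أن نية المستخدم …",
}
print(to_submission_line("example-0", parsed))
```

The `requires_function` flag is dropped here because it is not among the listed submission keys; a no-tool prediction is carried by `tool_called` being `"none"`.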

---

## Evaluation

Scored on the AISA-ArabicFC **held-out test set** (1,000 positive + negative examples) using the official **v2** metrics:

- **FnAcc** — function-name accuracy over *all* samples (also penalises hallucinated / missed calls; negatives have gold `none`)
- **ArgEM** — strict argument **exact match**, over positives only
- **ThinkRate** — fraction of outputs with a non-empty `<think>` trace
- **Overall (Track A)** = `0.40·FnAcc + 0.60·ArgEM`
- **Overall (Track B)** = `0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate`
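As a quick check of the two weighted formulas, the sketch below plugs in this model's reported FnAcc (0.982), ArgEM (0.541) and ThinkRate (0.868). The function names are illustrative only and are not taken from the official scorer.

```python
def overall_track_a(fn_acc: float, arg_em: float) -> float:
    """Overall (Track A) = 0.40·FnAcc + 0.60·ArgEM."""
    return 0.40 * fn_acc + 0.60 * arg_em


def overall_track_b(fn_acc: float, arg_em: float, think_rate: float) -> float:
    """Overall (Track B) = 0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate


print(round(overall_track_a(0.982, 0.541), 3))         # → 0.717
print(round(overall_track_b(0.982, 0.541, 0.868), 3))  # → 0.739
```

A system with no reasoning trace has ThinkRate 0, so its Track B score reduces to `0.30·FnAcc + 0.50·ArgEM`.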

### Baseline results

| System | FnAcc | ArgEM | Overall (A) | Overall (B) |
|--------|:-----:|:-----:|:-----------:|:-----------:|
| **AISA-AR-FunctionCall-Think (270M) ← this** | **0.982** | **0.541** | **0.717** | **0.739** |
| GPT-4o — zero-shot | 0.927 | 0.070 | 0.413 | 0.313 |
| GPT-4o — 3-shot | 0.854 | 0.122 | 0.415 | 0.317 |
| Random baseline | 0.047 | 0.033 | 0.039 | 0.031 |

- **Think-Before-Call rate (ThinkRate):** **0.868** for this model; 0.000 for all non-reasoning baselines.
- **Hallucination rate:** **0.000** on negative (no-tool) queries.
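The per-sample metrics behind these numbers can be approximated as follows. This is a sketch under stated assumptions, not the official scorer: predictions and gold labels are taken to be parallel lists of dicts shaped like the parsed output in *Usage*, negatives carry the gold name `"none"`, ArgEM compares the argument dicts for plain equality, and the hallucination rate is read as the share of negatives on which any call was emitted.

```python
def score(preds: list, golds: list) -> dict:
    """Approximate FnAcc, ArgEM, ThinkRate and hallucination rate (unofficial)."""
    pairs = list(zip(preds, golds))
    positives = [(p, g) for p, g in pairs if g["function_name"] != "none"]
    negatives = [(p, g) for p, g in pairs if g["function_name"] == "none"]

    # FnAcc: name match over all samples, so spurious and missed calls both count as errors.
    fn_acc = sum(p["function_name"] == g["function_name"] for p, g in pairs) / len(pairs)
    # ArgEM: strict equality of the argument dicts, positives only.
    arg_em = sum(p["arguments"] == g["arguments"] for p, g in positives) / max(len(positives), 1)
    # ThinkRate: share of outputs with a non-empty reasoning trace.
    think_rate = sum(bool(p.get("think", "").strip()) for p in preds) / len(preds)
    # Hallucination rate: share of no-tool queries where a call was emitted anyway.
    halluc = sum(p["function_name"] != "none" for p, _ in negatives) / max(len(negatives), 1)

    return {"FnAcc": fn_acc, "ArgEM": arg_em, "ThinkRate": think_rate, "Hallucination": halluc}


golds = [
    {"function_name": "compare_prices", "arguments": {"country": "Jordan"}},
    {"function_name": "none", "arguments": {}},
]
preds = [
    {"function_name": "compare_prices", "arguments": {"country": "Jordan"}, "think": "…"},
    {"function_name": "none", "arguments": {}, "think": ""},
]
print(score(preds, golds))
```

The toy data is invented for illustration; use the official evaluation script for any reported numbers.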

**Key takeaways**

- 🎯 **Argument extraction is the open challenge.** Tool *selection* is largely solved (FnAcc ≈ 0.98), but strict argument **exact match tops out at 0.541** — and GPT-4o reaches only 0.070 zero-shot. This is where the task is won or lost.
- 🪶 **A 270M model beats GPT-4o** across every metric here, showing the value of task-specific Arabic training and lowering the compute barrier to entry.
- 🗣️ **Cross-dialect gaps remain.** FnAcc varies by roughly 10–15 points across dialects, with **Gulf and Levantine** consistently the hardest and Maghrebi (small sample) the easiest — see the Track C diagnostic in the task overview paper.
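A dialect-stratified view of FnAcc, in the spirit of the Track C diagnostic, could be computed along these lines. The row keys `dialect`, `gold_name` and `pred_name` are hypothetical; the dataset's actual field names may differ, and the rows below are made up for illustration.

```python
from collections import defaultdict


def fn_acc_by_dialect(rows: list) -> dict:
    """Group function-name accuracy by dialect label (hypothetical row keys)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row["dialect"]] += 1
        hits[row["dialect"]] += int(row["pred_name"] == row["gold_name"])
    return {dialect: hits[dialect] / totals[dialect] for dialect in totals}


rows = [
    {"dialect": "Gulf", "gold_name": "compare_prices", "pred_name": "compare_prices"},
    {"dialect": "Gulf", "gold_name": "compare_prices", "pred_name": "none"},
    {"dialect": "MSA", "gold_name": "none", "pred_name": "none"},
]
print(fn_acc_by_dialect(rows))
```

With small per-dialect samples, such as Maghrebi here, the resulting accuracies are noisy and should be read with that in mind.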

---

## Training

- **Base:** `google/gemma-3-270m`
- **Method:** LoRA (rank 64), 3 epochs, cosine LR scheduler
- **Data:** AISA-ArabicFC training split (~10.5K examples) with 12,000 Arabic reasoning annotations for the `<think>` traces
- **Objective:** produce a short Arabic reasoning trace followed by a single structured tool call (or no call for negatives)

---

## Intended use & limitations

**Intended use**
- A reference **baseline** to compare against and reproduce for the AISA-ArabicFC shared task.
- A lightweight starting point for Arabic tool-use / agentic experiments.

**Out of scope / limitations**
- Trained for the **27-tool, 8-domain AISA-ArabicFC schema** and its prompt format; behaviour on arbitrary tools or free-form chat is undefined.
- Single-turn, single-call setting — no multi-tool or multi-turn dialogue.
- **Argument extraction is imperfect** (ArgEM 0.541): expect errors in date normalisation, numeric typing, and dialectal argument phrasing.
- Uneven dialect coverage (Maghrebi is only ~1.3% of data); robustness varies by dialect.
- A 270M model — capacity-limited by design to keep the baseline accessible.

---

## Related resources

- 🏆 **Shared task page:** https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task
- 📊 **Leaderboard:** https://huggingface.co/spaces/TuwaiqAcademy/AISA-ArabicFC-SharedTask-Leaderboard
- 📚 **Dataset (train + dev):** [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC)

---

## Citation

```bibtex
@inproceedings{najar2026aisaarabicfc,
  title     = {AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems},
  author    = {Najar, Omar},
  booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
  year      = {2026}
}
```

## License

This model is a derivative of **Gemma 3** and is distributed under the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)**. By using it you agree to those terms and to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). The AISA-ArabicFC **dataset** is released separately under Apache-2.0.

## Contact

Shared-task organizers — **trdc@tuwaiq.edu.sa** · Tuwaiq Academy