Omartificial-Intelligence-Space committed on
Commit
63ef237
·
verified ·
1 Parent(s): 49b980c

Update README.md

  1. README.md +181 -155
README.md CHANGED
@@ -1,228 +1,254 @@
1
  ---
 
2
  language:
3
  - ar
4
- license: apache-2.0
5
- base_model: AISA-Framework/AISA-AR-FunctionCall-FT
 
 
6
  tags:
7
  - function-calling
8
- - arabic
9
  - tool-use
10
  - agentic
11
- - gemma
12
  - reasoning
13
- - lora
14
  - think
 
 
 
 
 
15
  datasets:
16
- - AISA-Framework/AISA-AR-FunctionCall
17
- pipeline_tag: text-generation
18
- library_name: transformers
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
  ---
20
 
21
  # AISA-AR-FunctionCall-Think
22
 
23
- <p align="center">
24
- <img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/21Mxl67VW-RQFiXTnvheT.png" width="700"/>
25
- </p>
26
 
27
- **Reasoning-Augmented Arabic Structured Tool Calling**
28
 
29
- `AISA-AR-FunctionCall-Think` is a reasoning-enhanced variant of the Arabic function-calling model introduced in the **AISA-AR-FunctionCall** framework. The model generates an intermediate reasoning trace before invoking a tool, enabling transparent decision-making for Arabic agentic systems.
30
 
31
- This model extends [AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT) by introducing explicit reasoning supervision using `<think>` blocks prior to tool execution.
32
 
33
  ---
34
 
35
- ## Model Overview
36
 
37
- | Field | Value |
38
  |---|---|
39
- | **Model name** | AISA-AR-FunctionCall-Think |
40
- | **Base model** | AISA-AR-FunctionCall-FT |
41
- | **Architecture** | Gemma 3 (FunctionGemma 270M) |
42
- | **Training method** | LoRA reasoning fine-tuning |
43
- | **Primary task** | Arabic reasoning-aware function calling |
44
-
45
- The model produces outputs in the following pattern:
46
-
47
- ```
48
- <think>
49
- reasoning about tool selection
50
- </think>
51
- <start_function_call>
52
- call:tool_name{arguments}
53
- </end_function_call>
54
- ```
55
-
56
- This allows the system to expose the reasoning behind tool selection.
57
 
58
  ---
59
 
60
- ## Key Capabilities
61
 
62
- - Reasoning-aware tool selection
63
- - Explicit decision traces for tool invocation
64
- - Improved argument extraction consistency
65
- - Interpretable structured execution
66
 
67
- **Supported domains:**
 
 
 
68
 
69
- | Domain |
70
- |---|
71
- | Travel |
72
- | Utilities |
73
- | Islamic services |
74
- | Weather |
75
- | Healthcare |
76
- | Banking & finance |
77
- | E-commerce |
78
- | Government services |
79
-
80
- **Supported Arabic dialect groups:**
81
-
82
- - Modern Standard Arabic (MSA)
83
- - Gulf
84
- - Egyptian
85
- - Levantine
86
- - Maghrebi
87
 
88
  ---
89
 
90
- ## Training Dataset
91
-
92
- Training uses a subset of the [AISA-AR-FunctionCall](https://huggingface.co/datasets/AISA-Framework/AISA-AR-FunctionCall) dataset with reasoning annotations.
93
-
94
- | Property | Value |
95
- |---|---|
96
- | Dataset size | ~12k reasoning-augmented samples |
97
- | Dialect coverage | 5 Arabic dialects |
98
- | Domains | 8 real-world domains |
99
- | Tools | 27 structured tools |
100
-
101
- ---
102
 
103
- ## Training Methodology
104
 
105
- The reasoning model is trained by augmenting assistant outputs with explicit reasoning segments.
 
 
 
 
 
 
 
 
 
 
 
 
106
 
107
- **Training format:**
108
 
109
  ```
110
  <think>
111
- tool selection reasoning
112
  </think>
113
- <start_function_call>
114
- call:tool{arguments}
115
- </end_function_call>
116
  ```
117
 
118
- Reasoning supervision is enforced during inference by priming the model to begin its generation with `<think>`.
119
-
120
- **Training configuration:**
121
-
122
- | Parameter | Value |
123
- |---|---|
124
- | Training type | LoRA fine-tuning |
125
- | LoRA rank | 64 |
126
- | Alpha | 64 |
127
- | Dropout | 0.05 |
128
- | Trainable parameters | ~5.36% |
129
- | Epochs | 3 |
130
- | Learning rate | 3e-6 |
131
- | Effective batch size | 32 |
132
- | Optimizer | 8-bit AdamW |
133
- | Scheduler | Cosine |
134
-
135
- Additional training signals include **negative tool examples** to reduce hallucinated tool calls when no tool invocation is required.
136
 
137
  ---
138
 
139
- ## Evaluation Results
140
-
141
- Evaluation is performed on a strict reasoning evaluation subset.
142
-
143
- ### Strict Evaluation (n = 240)
144
-
145
- | Metric | Score |
146
- |---|---|
147
- | Tool Call Rate | 0.992 |
148
- | Think-Before-Call Rate | **1.000** |
149
- | Function Name Accuracy | 0.992 |
150
- | Argument F1 | **1.000** |
151
- | Decision Accuracy | 0.992 |
152
- | Hallucination Rate | **0.000** |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
153
 
154
- These results indicate that the model consistently performs reasoning before tool invocation and achieves near-perfect structured alignment within the evaluated subset.
155
 
156
- ### Important Note on Format Validation
157
 
158
- Standard function-call validators may classify reasoning outputs as **parse failures** because `<think>` tokens appear before the function call marker.
159
 
160
- This does **not** indicate structural instability; it reflects a difference in serialization format. When reasoning segments are permitted, tool invocation correctness remains near-perfect.
161
 
162
- ---
 
 
 
 
163
 
164
- ## Example Usage
165
 
166
- **User query:**
 
 
 
 
 
167
 
168
- ```
169
- ما حالة الطقس في الرياض اليوم؟
170
- ```
171
 
172
- **Model output:**
173
 
174
- ```
175
- <think>
176
- المستخدم يريد معرفة حالة الطقس في مدينة الرياض، لذا يجب استخدام أداة get_weather.
177
- </think>
178
- <start_function_call>
179
- call:get_weather{city:<escape>الرياض<escape>,days:1}
180
- </end_function_call>
181
- ```
182
 
183
  ---
184
 
185
- ## Intended Use
186
 
187
- This model is intended for:
 
 
 
188
 
189
- - Research on reasoning-aware tool calling
190
- - Interpretable agent systems
191
- - Arabic reasoning supervision experiments
192
- - Debugging tool selection behavior
193
 
194
- ### Production Recommendation
195
 
196
- This model is an **exploratory research variant**. For production deployment, we recommend using:
 
 
197
 
198
- [AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT)
 
 
 
 
 
199
 
200
  ---
201
 
202
- ## Related Resources
203
 
204
- | Resource | Link |
205
- |---|---|
206
- | Dataset | [AISA-Framework/AISA-AR-FunctionCall](https://huggingface.co/datasets/AISA-Framework/AISA-AR-FunctionCall) |
207
- | Production model | [AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT) |
208
- | Model collection | [AISA Arabic FunctionCall](https://huggingface.co/collections/AISA-Framework/aisa-arabic-functioncall-datasets-and-models) |
209
 
210
  ---
211
 
212
- ## Paper
213
-
214
- **From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning**
215
-
216
- *AISA Framework*
217
-
218
- ---
219
 
220
- ## AISA Framework
 
 
 
 
 
 
 
221
 
222
- This model is part of the **AISA** (Agentic AI Systems Architecture) initiative for building reliable multilingual AI agents.
223
 
224
- ---
225
 
226
- ## License
227
 
228
- [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
 
1
  ---
2
+ license: gemma
3
  language:
4
  - ar
5
+ base_model:
6
+ - google/gemma-3-270m
7
+ pipeline_tag: text-generation
8
+ library_name: transformers
9
  tags:
10
  - function-calling
 
11
  - tool-use
12
  - agentic
13
+ - arabic
14
  - reasoning
 
15
  - think
16
+ - gemma3
17
+ - shared-task
18
+ - arabicnlp2026
19
+ - baseline
20
+ - dialect
21
  datasets:
22
+ - TuwaiqAcademy/AISA-ArabicFC
23
+ - Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning
24
+ model-index:
25
+ - name: AISA-AR-FunctionCall-Think
26
+ results:
27
+ - task:
28
+ type: text-generation
29
+ name: Arabic Function Calling — Track B (Reasoning-Augmented)
30
+ dataset:
31
+ name: AISA-ArabicFC (held-out test)
32
+ type: TuwaiqAcademy/AISA-ArabicFC
33
+ metrics:
34
+ - type: function-name-accuracy
35
+ value: 0.982
36
+ name: FnAcc
37
+ - type: argument-exact-match
38
+ value: 0.541
39
+ name: ArgEM
40
+ - type: think-before-call-rate
41
+ value: 0.868
42
+ name: ThinkRate
43
+ - type: overall
44
+ value: 0.739
45
+ name: Overall (Track B, v2)
46
  ---
47
 
48
  # AISA-AR-FunctionCall-Think
49
 
50
+ ### 🏷️ Official **Track B baseline** for the [AISA-ArabicFC shared task](https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task) @ **ArabicNLP 2026** (co-located with EMNLP 2026, Budapest)
 
 
51
 
52
+ > This model is the **organizer-provided baseline** for **Track B — Reasoning-Augmented Function Calling**. It defines the reference score that participating systems are expected to beat. It is released for reproducibility and as a starting point — **it is not a competition entry.**
53
 
54
+ A compact (**270M-parameter**) Arabic function-calling model that, given an Arabic user query (in any of 5 dialects) and a set of candidate tools, **writes a short Arabic `<think>` reasoning trace and then emits a structured tool call**. Fine-tuned (LoRA) from **[google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m)** on the AISA-ArabicFC reasoning data.
55
 
56
+ For the non-reasoning Track A baseline, see the sibling model **[AISA-AR-FunctionCall-FT](https://huggingface.co/TuwaiqAcademy/AISA-AR-FunctionCall-FT)**.
57
 
58
  ---
59
 
60
+ ## At a glance
61
 
62
+ | | |
63
  |---|---|
64
+ | **Role** | Official baseline — Track B (Reasoning-Augmented) |
65
+ | **Base model** | google/gemma-3-270m (270M params) |
66
+ | **Adaptation** | LoRA fine-tune (merged), then full causal-LM inference |
67
+ | **Languages** | Arabic MSA, Gulf, Egyptian, Levantine, Maghrebi |
68
+ | **Behaviour** | `<think>` Arabic reasoning → structured function call |
69
+ | **Training data** | [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) + [reasoning annotations](https://huggingface.co/datasets/Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning) |
70
+ | **License** | Gemma (see *License* below) |
 
 
 
 
 
 
 
 
 
 
 
71
 
72
  ---
73
 
74
+ ## The shared task
75
 
76
+ Given an Arabic user query and a set of candidate tool definitions, a system must:
 
 
 
77
 
78
+ 1. **Decide** whether a function call is required (some queries need no tool),
79
+ 2. **Select** the correct function name,
80
+ 3. **Extract** the structured arguments,
81
+ 4. **(Track B)** **Generate an Arabic reasoning trace** (`<think> … </think>`) *before* the call.
82
 
83
+ | Track | Description |
84
+ |-------|-------------|
85
+ | **A — Core** | Decide / Select / Extract |
86
+ | **B — Reasoning-Augmented** ← *this model* | Track A **+** an Arabic `<think>` reasoning trace |
87
+ | **C — Cross-Dialect Robustness** | Diagnostic: dialect-stratified evaluation of A/B submissions |
 
 
 
 
 
 
 
 
 
 
 
 
 
88
 
89
  ---
90
 
91
+ ## How it works — input / output format
 
 
 
 
 
 
 
 
 
 
 
92
 
93
+ This model uses **Gemma 3 chat turns** with a custom function-calling schema (it does **not** emit plain JSON). The exact prompt is the `text` field in the dataset; the structure is:
94
 
95
+ ```
96
+ <bos><start_of_turn>developer
97
+ <system instruction in Arabic>
98
+ <start_function_declaration>declaration:NAME{description:<escape>…<escape>,parameters:{…}}<end_function_declaration>
99
+ …one declaration per candidate tool…<end_of_turn>
100
+ <start_of_turn>developer
101
+ التاريخ والوقت الحالي …: 2024-04-12T23:05:24
102
+ اليوم هو الجمعة
103
+ أنت نموذج يمكنه استدعاء الوظائف التالية<end_of_turn>
104
+ <start_of_turn>user
105
+ أريد مقارنة أسعار تلفاز سامسونج في الأردن<end_of_turn>
106
+ <start_of_turn>model
107
+ ```
108
 
109
+ The model then generates:
110
 
111
  ```
112
  <think>
113
+ يبدو أن نية المستخدم هي الحصول على مقارنة لأسعار تلفاز سامسونج في الأردن. أداة "compare_prices" هي الأنسب …
114
  </think>
115
+ <start_function_call>call:compare_prices{country:<escape>Jordan<escape>,product_name:<escape>Samsung TV<escape>}<end_function_call>
 
 
116
  ```
117
 
118
+ For a query that needs **no tool**, the model omits the `<start_function_call>` block (→ `requires_function = false`).
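
The output format described above can be illustrated with a small serializer that builds the model turn from a reasoning string and an optional call. This is a sketch inferred from the examples in this card only: the helper names `format_function_call` and `format_model_turn` are hypothetical, and the rule "strings wrapped in `<escape>`, numbers written bare" is an assumption drawn from the two sample calls, not a documented specification.

```python
from typing import Mapping, Optional, Union

Scalar = Union[str, int, float]


def format_function_call(name: str, arguments: Mapping[str, Scalar]) -> str:
    """Serialize one call: string values go inside <escape>...<escape>, numbers stay bare."""
    parts = []
    for key, value in arguments.items():  # keeps the caller's key order
        if isinstance(value, str):
            parts.append(f"{key}:<escape>{value}<escape>")
        else:
            parts.append(f"{key}:{value}")
    body = ",".join(parts)
    return "<start_function_call>call:" + name + "{" + body + "}<end_function_call>"


def format_model_turn(
    think: str,
    name: Optional[str] = None,
    arguments: Optional[Mapping[str, Scalar]] = None,
) -> str:
    """Build the model turn: a <think> block, then a call block only if a tool is needed."""
    turn = f"<think>\n{think}\n</think>"
    if name is not None:
        turn += "\n" + format_function_call(name, arguments or {})
    return turn


if __name__ == "__main__":
    print(format_model_turn(
        "reasoning text",
        "compare_prices",
        {"country": "Jordan", "product_name": "Samsung TV"},
    ))
    # A no-tool answer carries only the reasoning block.
    print(format_model_turn("no tool is needed"))
```

A helper like this is mainly useful for writing round-trip tests against an output parser; nested or boolean arguments are not handled here.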
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
  ---
121
 
122
+ ## Usage
123
+
124
+ ```python
+ import re, torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ MODEL_ID = "TuwaiqAcademy/AISA-AR-FunctionCall-Think"
+ tok = AutoTokenizer.from_pretrained(MODEL_ID)
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID, torch_dtype=torch.float32, device_map="auto"
+ ).eval()
+
+ def parse_model_output(text: str) -> dict:
+     """Turn raw generation into the shared-task submission schema."""
+     out = {"requires_function": False, "function_name": "none", "arguments": {}, "think": ""}
+     if (m := re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)):
+         out["think"] = m.group(1).strip()
+     if (m := re.search(r"<start_function_call>\s*call:(\w+)\{(.*?)\}\s*<end_function_call>", text, re.DOTALL)):
+         out["requires_function"] = True
+         out["function_name"] = m.group(1)
+         for key, str_val, num_val in re.findall(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]+))", m.group(2)):
+             val = str_val if str_val else num_val
+             try:
+                 val = float(val) if "." in str(val) else int(val)
+             except (ValueError, TypeError):
+                 pass
+             out["arguments"][key] = val
+     return out
+
+ # Easiest path: take the ready-made prompt from the dataset's `text` field and
+ # cut it at the model turn (everything after is what the model should produce).
+ from datasets import load_dataset
+ row = load_dataset("TuwaiqAcademy/AISA-ArabicFC", split="validation")[0]
+ prompt = row["text"].split("<start_of_turn>model\n")[0] + "<start_of_turn>model\n"
+
+ inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+ with torch.no_grad():
+     gen = model.generate(**inputs, max_new_tokens=250, do_sample=False)  # greedy
+ raw = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
+
+ print(parse_model_output(raw))
+ # → {'requires_function': True, 'function_name': 'compare_prices',
+ #    'arguments': {'country': 'Jordan', 'product_name': 'Samsung TV'},
+ #    'think': 'يبدو أن نية المستخدم …'}
+ ```
167
 
168
+ The parsed dict maps directly onto a **leaderboard submission line**: `{"id", "tool_called", "arguments", "think"}` (use `function_name` as `tool_called`).
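
The mapping from the parsed dict to a submission line might look like the sketch below. It assumes, without confirmation from the task documentation, that submissions are JSON Lines (one object per line), that a no-tool prediction is written as `"none"`, and that each example carries an identifier the caller supplies. The function name `to_submission_line` is hypothetical.

```python
import json


def to_submission_line(example_id, parsed: dict) -> str:
    """Rename `function_name` to `tool_called` and emit one JSON object as a line.

    `parsed` is expected to have the keys produced by the parser in this card:
    requires_function, function_name, arguments, think.
    """
    tool = parsed["function_name"] if parsed["requires_function"] else "none"
    record = {
        "id": example_id,
        "tool_called": tool,
        "arguments": parsed["arguments"],
        "think": parsed["think"],
    }
    # ensure_ascii=False keeps Arabic reasoning text readable in the output file.
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    parsed = {
        "requires_function": True,
        "function_name": "compare_prices",
        "arguments": {"country": "Jordan", "product_name": "Samsung TV"},
        "think": "reasoning text",
    }
    print(to_submission_line("example-0", parsed))
```

Check the leaderboard page for the authoritative file format before submitting.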
169
 
170
+ ---
171
 
172
+ ## Evaluation
173
 
174
+ Scored on the AISA-ArabicFC **held-out test set** (1,000 positive + negative examples) using the official **v2** metrics:
175
 
176
+ - **FnAcc** — function-name accuracy over *all* samples (also penalises hallucinated / missed calls; negatives have gold `none`)
177
+ - **ArgEM** — strict argument **exact match**, over positives only
178
+ - **ThinkRate** — fraction of outputs with a non-empty `<think>` trace
179
+ - **Overall (Track A)** = `0.40·FnAcc + 0.60·ArgEM`
180
+ - **Overall (Track B)** = `0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate`
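
The metric definitions above can be sketched in a few lines of Python. This is an illustration built from the bullet list, not the official scorer: the record keys (`tool_called`, `arguments`, `think`) mirror the submission schema, and comparing argument dicts with plain equality, independent of whether the function name matched, is an assumption. The official v2 implementation may normalise values differently.

```python
def fn_acc(preds, golds):
    """Function-name accuracy over all samples; negatives have gold 'none'."""
    hits = sum(p["tool_called"] == g["tool_called"] for p, g in zip(preds, golds))
    return hits / len(golds)


def arg_em(preds, golds):
    """Strict argument exact match, computed over positive examples only."""
    pairs = [(p, g) for p, g in zip(preds, golds) if g["tool_called"] != "none"]
    if not pairs:
        return 0.0
    return sum(p["arguments"] == g["arguments"] for p, g in pairs) / len(pairs)


def think_rate(preds):
    """Fraction of outputs with a non-empty reasoning trace."""
    return sum(bool((p.get("think") or "").strip()) for p in preds) / len(preds)


def overall_a(fn, em):
    return 0.40 * fn + 0.60 * em


def overall_b(fn, em, tr):
    return 0.30 * fn + 0.50 * em + 0.20 * tr


if __name__ == "__main__":
    # Plugging in this model's reported scores reproduces the table values.
    print(round(overall_a(0.982, 0.541), 3))         # 0.717
    print(round(overall_b(0.982, 0.541, 0.868), 3))  # 0.739
```

Because ThinkRate carries 20% of the Track B score, a system with no reasoning trace scores lower on Track B than on Track A for the same FnAcc and ArgEM, which is why the GPT-4o rows drop from about 0.41 to about 0.31.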
181
 
182
+ ### Baseline results
183
 
184
+ | System | FnAcc | ArgEM | Overall (A) | Overall (B) |
185
+ |--------|:-----:|:-----:|:-----------:|:-----------:|
186
+ | **AISA-AR-FunctionCall-Think (270M) ← this** | **0.982** | **0.541** | **0.717** | **0.739** |
187
+ | GPT-4o — zero-shot | 0.927 | 0.070 | 0.413 | 0.313 |
188
+ | GPT-4o — 3-shot | 0.854 | 0.122 | 0.415 | 0.317 |
189
+ | Random baseline | 0.047 | 0.033 | 0.039 | 0.031 |
190
 
191
+ - **Think-Before-Call rate (ThinkRate):** **0.868** for this model; 0.000 for all non-reasoning baselines.
192
+ - **Hallucination rate:** **0.000** on negative (no-tool) queries.
 
193
 
194
+ **Key takeaways**
195
 
196
+ - 🎯 **Argument extraction is the open challenge.** Tool *selection* is largely solved (FnAcc ≈ 0.98), but strict argument **exact match tops out at 0.541** — and GPT-4o reaches only 0.070 zero-shot. This is where the task is won or lost.
197
+ - 🪶 **A 270M model beats GPT-4o** across every metric here, showing the value of task-specific Arabic training and lowering the compute barrier to entry.
198
+ - 🗣️ **Cross-dialect gaps remain.** FnAcc varies by roughly 10–15 points across dialects, with **Gulf and Levantine** consistently the hardest and Maghrebi (small sample) the easiest — see the Track C diagnostic in the task overview paper.
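
A dialect-stratified breakdown of the kind the Track C diagnostic describes can be approximated as follows. This is a minimal sketch: the `dialect` key on gold records is a hypothetical field name, and the official diagnostic may group or weight examples differently.

```python
from collections import defaultdict


def fn_acc_by_dialect(preds, golds):
    """Function-name accuracy computed separately for each dialect label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, gold in zip(preds, golds):
        dialect = gold["dialect"]  # hypothetical field name
        totals[dialect] += 1
        hits[dialect] += int(pred["tool_called"] == gold["tool_called"])
    return {dialect: hits[dialect] / totals[dialect] for dialect in totals}


if __name__ == "__main__":
    golds = [
        {"dialect": "Gulf", "tool_called": "get_weather"},
        {"dialect": "Gulf", "tool_called": "none"},
        {"dialect": "Egyptian", "tool_called": "compare_prices"},
    ]
    preds = [
        {"tool_called": "get_weather"},
        {"tool_called": "get_weather"},
        {"tool_called": "compare_prices"},
    ]
    print(fn_acc_by_dialect(preds, golds))
```

Per-dialect figures on small groups (Maghrebi in particular) are noisy, so it helps to report the group size next to each score.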
 
 
 
 
 
199
 
200
  ---
201
 
202
+ ## Training
203
 
204
+ - **Base:** `google/gemma-3-270m`
205
+ - **Method:** LoRA (rank 64), 3 epochs, cosine LR scheduler
206
+ - **Data:** AISA-ArabicFC training split (~10.5K examples) with 12,000 Arabic reasoning annotations for the `<think>` traces
207
+ - **Objective:** produce a short Arabic reasoning trace followed by a single structured tool call (or no call for negatives)
208
 
209
+ ---
 
 
 
210
 
211
+ ## Intended use & limitations
212
 
213
+ **Intended use**
214
+ - A reference **baseline** to compare against and reproduce for the AISA-ArabicFC shared task.
215
+ - A lightweight starting point for Arabic tool-use / agentic experiments.
216
 
217
+ **Out of scope / limitations**
218
+ - Trained for the **27-tool, 8-domain AISA-ArabicFC schema** and its prompt format; behaviour on arbitrary tools or free-form chat is undefined.
219
+ - Single-turn, single-call setting — no multi-tool or multi-turn dialogue.
220
+ - **Argument extraction is imperfect** (ArgEM 0.541): expect errors in date normalisation, numeric typing, and dialectal argument phrasing.
221
+ - Uneven dialect coverage (Maghrebi is only ~1.3% of data); robustness varies by dialect.
222
+ - A 270M model — capacity-limited by design to keep the baseline accessible.
223
 
224
  ---
225
 
226
+ ## Related resources
227
 
228
+ - 🏆 **Shared task page:** https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task
229
+ - 📊 **Leaderboard:** https://huggingface.co/spaces/TuwaiqAcademy/AISA-ArabicFC-SharedTask-Leaderboard
230
+ - 📚 **Dataset (train + dev):** [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC)
231
+ - 🧠 **Reasoning dataset:** [Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning](https://huggingface.co/datasets/Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning)
232
+ - 🤝 **Sibling baseline (Track A):** [TuwaiqAcademy/AISA-AR-FunctionCall-FT](https://huggingface.co/TuwaiqAcademy/AISA-AR-FunctionCall-FT)
233
 
234
  ---
235
 
236
+ ## Citation
 
 
 
 
 
 
237
 
238
+ ```bibtex
239
+ @inproceedings{najar2026aisaarabicfc,
240
+ title = {AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems},
241
+ author = {Najar, Omar},
242
+ booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
243
+ year = {2026}
244
+ }
245
+ ```
246
 
247
+ ## License
248
 
249
+ This model is a derivative of **Gemma 3** and is distributed under the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)**. By using it you agree to those terms and to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). The AISA-ArabicFC **dataset** is released separately under Apache-2.0.
250
 
251
+ ## Contact
252
 
253
+ Shared-task organizers — **arabicnlp-shared-task-chair@sigarab.org** · Tuwaiq Academy