Update README.md
README.md
CHANGED
|
@@ -1,228 +1,254 @@
|
|
| 1 |
---
|
|
|
|
| 2 |
language:
|
| 3 |
- ar
|
| 4 |
-
|
| 5 |
-
|
|
|
|
|
|
|
| 6 |
tags:
|
| 7 |
- function-calling
|
| 8 |
-
- arabic
|
| 9 |
- tool-use
|
| 10 |
- agentic
|
| 11 |
-
-
|
| 12 |
- reasoning
|
| 13 |
-
- lora
|
| 14 |
- think
|
| 15 |
datasets:
|
| 16 |
-
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
---
|
| 20 |
|
| 21 |
# AISA-AR-FunctionCall-Think
|
| 22 |
|
| 23 |
-
|
| 24 |
-
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/21Mxl67VW-RQFiXTnvheT.png" width="700"/>
|
| 25 |
-
</p>
|
| 26 |
|
| 27 |
-
**Reasoning-Augmented
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
---
|
| 34 |
|
| 35 |
-
##
|
| 36 |
|
| 37 |
-
|
|
| 38 |
|---|---|
|
| 39 |
-
| **
|
| 40 |
-
| **Base model** |
|
| 41 |
-
| **
|
| 42 |
-
| **
|
| 43 |
-
| **
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
```
|
| 48 |
-
<think>
|
| 49 |
-
reasoning about tool selection
|
| 50 |
-
</think>
|
| 51 |
-
<start_function_call>
|
| 52 |
-
call:tool_name{arguments}
|
| 53 |
-
</end_function_call>
|
| 54 |
-
```
|
| 55 |
-
|
| 56 |
-
This allows the system to expose the reasoning behind tool selection.
|
| 57 |
|
| 58 |
---
|
| 59 |
|
| 60 |
-
##
|
| 61 |
|
| 62 |
-
|
| 63 |
-
- Explicit decision traces for tool invocation
|
| 64 |
-
- Improved argument extraction consistency
|
| 65 |
-
- Interpretable structured execution
|
| 66 |
|
| 67 |
-
**
|
| 68 |
|
| 69 |
-
|
|
| 70 |
-
|---|
|
| 71 |
-
|
|
| 72 |
-
|
|
| 73 |
-
|
|
| 74 |
-
| Weather |
|
| 75 |
-
| Healthcare |
|
| 76 |
-
| Banking & finance |
|
| 77 |
-
| E-commerce |
|
| 78 |
-
| Government services |
|
| 79 |
-
|
| 80 |
-
**Supported Arabic dialect groups:**
|
| 81 |
-
|
| 82 |
-
- Modern Standard Arabic (MSA)
|
| 83 |
-
- Gulf
|
| 84 |
-
- Egyptian
|
| 85 |
-
- Levantine
|
| 86 |
-
- Maghrebi
|
| 87 |
|
| 88 |
---
|
| 89 |
|
| 90 |
-
##
|
| 91 |
-
|
| 92 |
-
Training uses a subset of the [AISA-AR-FunctionCall](https://huggingface.co/datasets/AISA-Framework/AISA-AR-FunctionCall) dataset with reasoning annotations.
|
| 93 |
-
|
| 94 |
-
| Property | Value |
|
| 95 |
-
|---|---|
|
| 96 |
-
| Dataset size | ~12k reasoning-augmented samples |
|
| 97 |
-
| Dialect coverage | 5 Arabic dialects |
|
| 98 |
-
| Domains | 8 real-world domains |
|
| 99 |
-
| Tools | 27 structured tools |
|
| 100 |
-
|
| 101 |
-
---
|
| 102 |
|
| 103 |
-
|
| 104 |
|
| 105 |
-
|
| 106 |
|
| 107 |
-
|
| 108 |
|
| 109 |
```
|
| 110 |
<think>
|
| 111 |
-
|
| 112 |
</think>
|
| 113 |
-
<start_function_call>
|
| 114 |
-
call:tool{arguments}
|
| 115 |
-
</end_function_call>
|
| 116 |
```
|
| 117 |
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
**Training configuration:**
|
| 121 |
-
|
| 122 |
-
| Parameter | Value |
|
| 123 |
-
|---|---|
|
| 124 |
-
| Training type | LoRA fine-tuning |
|
| 125 |
-
| LoRA rank | 64 |
|
| 126 |
-
| Alpha | 64 |
|
| 127 |
-
| Dropout | 0.05 |
|
| 128 |
-
| Trainable parameters | ~5.36% |
|
| 129 |
-
| Epochs | 3 |
|
| 130 |
-
| Learning rate | 3e-6 |
|
| 131 |
-
| Effective batch size | 32 |
|
| 132 |
-
| Optimizer | 8-bit AdamW |
|
| 133 |
-
| Scheduler | Cosine |
|
| 134 |
-
|
| 135 |
-
Additional training signals include **negative tool examples** to reduce hallucinated tool calls when no tool invocation is required.
|
| 136 |
|
| 137 |
---
|
| 138 |
|
| 139 |
-
##
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
|
| 154 |
-
|
| 155 |
|
| 156 |
-
|
| 157 |
|
| 158 |
-
|
| 159 |
|
| 160 |
-
|
| 161 |
|
| 162 |
-
--
|
| 163 |
|
| 164 |
-
##
|
| 165 |
|
| 166 |
-
|
| 167 |
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
```
|
| 171 |
|
| 172 |
-
**
|
| 173 |
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
</think>
|
| 178 |
-
<start_function_call>
|
| 179 |
-
call:get_weather{city:<escape>الرياض<escape>,days:1}
|
| 180 |
-
</end_function_call>
|
| 181 |
-
```
|
| 182 |
|
| 183 |
---
|
| 184 |
|
| 185 |
-
##
|
| 186 |
|
| 187 |
-
|
|
|
|
|
|
|
|
|
|
| 188 |
|
| 189 |
-
-
|
| 190 |
-
- Interpretable agent systems
|
| 191 |
-
- Arabic reasoning supervision experiments
|
| 192 |
-
- Debugging tool selection behavior
|
| 193 |
|
| 194 |
-
##
|
| 195 |
|
| 196 |
-
|
|
|
|
|
|
|
| 197 |
|
| 198 |
-
|
| 199 |
|
| 200 |
---
|
| 201 |
|
| 202 |
-
## Related
|
| 203 |
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
|
| 210 |
---
|
| 211 |
|
| 212 |
-
##
|
| 213 |
-
|
| 214 |
-
**From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning**
|
| 215 |
-
|
| 216 |
-
*AISA Framework*
|
| 217 |
-
|
| 218 |
-
---
|
| 219 |
|
| 220 |
-
|
| 221 |
|
| 222 |
-
|
| 223 |
|
| 224 |
-
--
|
| 225 |
|
| 226 |
-
##
|
| 227 |
|
| 228 |
-
|
|
|
|
|
|
---
license: gemma
language:
- ar
base_model:
- google/gemma-3-270m
pipeline_tag: text-generation
library_name: transformers
tags:
- function-calling
- tool-use
- agentic
- arabic
- reasoning
- think
- gemma3
- shared-task
- arabicnlp2026
- baseline
- dialect
datasets:
- TuwaiqAcademy/AISA-ArabicFC
- Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning
model-index:
- name: AISA-AR-FunctionCall-Think
  results:
  - task:
      type: text-generation
      name: Arabic Function Calling - Track B (Reasoning-Augmented)
    dataset:
      name: AISA-ArabicFC (held-out test)
      type: TuwaiqAcademy/AISA-ArabicFC
    metrics:
    - type: function-name-accuracy
      value: 0.982
      name: FnAcc
    - type: argument-exact-match
      value: 0.541
      name: ArgEM
    - type: think-before-call-rate
      value: 0.868
      name: ThinkRate
    - type: overall
      value: 0.739
      name: Overall (Track B, v2)
---

# AISA-AR-FunctionCall-Think

### 🏷️ Official **Track B baseline** for the [AISA-ArabicFC shared task](https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task) @ **ArabicNLP 2026** (co-located with EMNLP 2026, Budapest)

> This model is the **organizer-provided baseline** for **Track B (Reasoning-Augmented Function Calling)**. It defines the reference score that participating systems are expected to beat. It is released for reproducibility and as a starting point; **it is not a competition entry.**

A compact (**270M-parameter**) Arabic function-calling model. Given an Arabic user query (in any of 5 dialects) and a set of candidate tools, it **writes a short Arabic `<think>` reasoning trace and then emits a structured tool call**. It was fine-tuned with LoRA from **[google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m)** on the AISA-ArabicFC reasoning data.

For the non-reasoning Track A baseline, see the sibling model **[AISA-AR-FunctionCall-FT](https://huggingface.co/TuwaiqAcademy/AISA-AR-FunctionCall-FT)**.

---

## At a glance

| | |
|---|---|
| **Role** | Official baseline for Track B (Reasoning-Augmented) |
| **Base model** | google/gemma-3-270m (270M params) |
| **Adaptation** | LoRA fine-tune with the adapters merged; runs as a standard causal LM |
| **Languages** | Arabic: MSA, Gulf, Egyptian, Levantine, Maghrebi |
| **Behaviour** | `<think>` Arabic reasoning → structured function call |
| **Training data** | [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) + [reasoning annotations](https://huggingface.co/datasets/Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning) |
| **License** | Gemma (see *License* below) |

---

## The shared task

Given an Arabic user query and a set of candidate tool definitions, a system must:

1. **Decide** whether a function call is required (some queries need no tool),
2. **Select** the correct function name,
3. **Extract** the structured arguments,
4. **(Track B)** **Generate an Arabic reasoning trace** (`<think> … </think>`) *before* the call.

| Track | Description |
|-------|-------------|
| **A: Core** | Decide / Select / Extract |
| **B: Reasoning-Augmented** ← *this model* | Track A **+** an Arabic `<think>` reasoning trace |
| **C: Cross-Dialect Robustness** | Diagnostic: dialect-stratified evaluation of A/B submissions |

---

## How it works: input / output format

This model uses **Gemma 3 chat turns** with a custom function-calling schema (it does **not** emit plain JSON). The exact prompt is the `text` field in the dataset; the structure is:

```
<bos><start_of_turn>developer
<system instruction in Arabic>
<start_function_declaration>declaration:NAME{description:<escape>…<escape>,parameters:{…}}<end_function_declaration>
…one declaration per candidate tool…<end_of_turn>
<start_of_turn>developer
التاريخ والوقت الحالي …: 2024-04-12T23:05:24
اليوم هو الجمعة
أنت نموذج يمكنه استدعاء الوظائف التالية<end_of_turn>
<start_of_turn>user
أريد مقارنة أسعار تلفاز سامسونج في الأردن<end_of_turn>
<start_of_turn>model
```

The model then generates:

```
<think>
يبدو أن نية المستخدم هي الحصول على مقارنة لأسعار تلفاز سامسونج في الأردن. أداة "compare_prices" هي الأنسب …
</think>
<start_function_call>call:compare_prices{country:<escape>Jordan<escape>,product_name:<escape>Samsung TV<escape>}<end_function_call>
```

For a query that needs **no tool**, the model omits the `<start_function_call>` block (→ `requires_function = false`).
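
The turn structure shown above can also be assembled programmatically. The following is a minimal sketch, not the dataset's own prompt builder: `build_prompt` and its parameter names are hypothetical, each declaration is assumed to arrive already serialized in the `declaration:NAME{…}` form, the `get_weather` declaration is a placeholder, and the exact newline placement between parts is an assumption. For evaluation, the ready-made `text` field of the dataset remains the reference.

```python
def build_prompt(system_instruction, declarations, context, user_query):
    """Assemble two developer turns, one user turn, and an open model turn.

    Hypothetical helper: the newline placement between parts is an assumption.
    """
    # Wrap each pre-serialized declaration in its start/end markers.
    tools = "\n".join(
        f"<start_function_declaration>{d}<end_function_declaration>"
        for d in declarations
    )
    return (
        "<bos><start_of_turn>developer\n"
        f"{system_instruction}\n"
        f"{tools}<end_of_turn>\n"
        "<start_of_turn>developer\n"
        f"{context}<end_of_turn>\n"
        "<start_of_turn>user\n"
        f"{user_query}<end_of_turn>\n"
        "<start_of_turn>model\n"  # left open so the model continues from here
    )


# Placeholder strings only; real prompts carry Arabic instructions and queries.
demo = build_prompt(
    system_instruction="SYSTEM INSTRUCTION",
    declarations=["declaration:get_weather{description:<escape>demo<escape>,parameters:{}}"],
    context="CONTEXT (date, weekday, capability note)",
    user_query="USER QUERY",
)
print(demo)
```

Because the prompt already starts with `<bos>`, it would be tokenized with `add_special_tokens=False` so the marker is not added twice.
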

---

## Usage

```python
import re, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "TuwaiqAcademy/AISA-AR-FunctionCall-Think"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, device_map="auto"
).eval()

def parse_model_output(text: str) -> dict:
    """Turn raw generation into the shared-task submission schema."""
    out = {"requires_function": False, "function_name": "none", "arguments": {}, "think": ""}
    # Reasoning trace between <think> and </think>, if present.
    if (m := re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)):
        out["think"] = m.group(1).strip()
    # Function call block; absent for queries that need no tool.
    if (m := re.search(r"<start_function_call>\s*call:(\w+)\{(.*?)\}\s*<end_function_call>", text, re.DOTALL)):
        out["requires_function"] = True
        out["function_name"] = m.group(1)
        # Arguments are either <escape>-delimited strings or bare values.
        for key, str_val, num_val in re.findall(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]+))", m.group(2)):
            val = str_val if str_val else num_val
            try:
                val = float(val) if "." in str(val) else int(val)
            except (ValueError, TypeError):
                pass  # not numeric: keep the string
            out["arguments"][key] = val
    return out

# Easiest path: take the ready-made prompt from the dataset's `text` field and
# cut it at the model turn (everything after is what the model should produce).
from datasets import load_dataset
row = load_dataset("TuwaiqAcademy/AISA-ArabicFC", split="validation")[0]
prompt = row["text"].split("<start_of_turn>model\n")[0] + "<start_of_turn>model\n"

inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=250, do_sample=False)  # greedy
raw = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)

print(parse_model_output(raw))
# → {'requires_function': True, 'function_name': 'compare_prices',
#    'arguments': {'country': 'Jordan', 'product_name': 'Samsung TV'},
#    'think': 'يبدو أن نية المستخدم …'}
```

The parsed dict maps directly onto a **leaderboard submission line**, `{"id", "tool_called", "arguments", "think"}` (rename `function_name` to `tool_called`).
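
The renaming described above could look like the sketch below. It assumes one JSON object per submission line; `to_submission_line` is a hypothetical helper, the `id` value is assumed to be supplied by the caller, and the hand-written `parsed_example` stands in for real `parse_model_output` output. The leaderboard's own instructions take precedence over anything shown here.

```python
import json


def to_submission_line(example_id, parsed):
    """Serialize one parsed prediction as a single JSON line.

    Key names follow the README; the JSON-lines layout is an assumption.
    """
    record = {
        "id": example_id,
        "tool_called": parsed["function_name"],  # function_name -> tool_called
        "arguments": parsed["arguments"],
        "think": parsed["think"],
    }
    # ensure_ascii=False keeps Arabic text readable in the output file.
    return json.dumps(record, ensure_ascii=False)


# Hand-written stand-in for the dict returned by parse_model_output.
parsed_example = {
    "requires_function": True,
    "function_name": "compare_prices",
    "arguments": {"country": "Jordan", "product_name": "Samsung TV"},
    "think": "يبدو أن نية المستخدم …",
}
line = to_submission_line("example-0", parsed_example)
print(line)
```
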

---

## Evaluation

Scored on the AISA-ArabicFC **held-out test set** (1,000 positive + negative examples) using the official **v2** metrics:

- **FnAcc**: function-name accuracy over *all* samples (also penalises hallucinated / missed calls; negatives have gold `none`)
- **ArgEM**: strict argument **exact match**, over positives only
- **ThinkRate**: fraction of outputs with a non-empty `<think>` trace
- **Overall (Track A)** = `0.40·FnAcc + 0.60·ArgEM`
- **Overall (Track B)** = `0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate`
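
The metric definitions above can be sketched in a few lines. This is not the official scorer: the record layout (`function_name`, `arguments`, `think`) mirrors the parser output, the names `score`, `toy_golds`, and `toy_preds` are hypothetical, and ArgEM is assumed to compare argument dicts for equality on positive rows regardless of whether the function name matched. The weights are the ones listed in the bullets.

```python
def overall_track_a(fn_acc, arg_em):
    """Overall (Track A) = 0.40*FnAcc + 0.60*ArgEM."""
    return 0.40 * fn_acc + 0.60 * arg_em


def overall_track_b(fn_acc, arg_em, think_rate):
    """Overall (Track B) = 0.30*FnAcc + 0.50*ArgEM + 0.20*ThinkRate."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate


def score(golds, preds):
    """Unofficial sketch of FnAcc, ArgEM, ThinkRate and both overall scores."""
    pairs = list(zip(golds, preds))
    # FnAcc: over all samples; negatives carry the gold name "none".
    fn_acc = sum(g["function_name"] == p["function_name"] for g, p in pairs) / len(pairs)
    # ArgEM: strict dict equality, positives only (assumed interpretation).
    positives = [(g, p) for g, p in pairs if g["function_name"] != "none"]
    arg_em = (
        sum(g["arguments"] == p["arguments"] for g, p in positives) / len(positives)
        if positives else 0.0
    )
    # ThinkRate: share of outputs with a non-empty reasoning trace.
    think_rate = sum(bool(p.get("think", "").strip()) for p in preds) / len(preds)
    return {
        "FnAcc": fn_acc,
        "ArgEM": arg_em,
        "ThinkRate": think_rate,
        "OverallA": overall_track_a(fn_acc, arg_em),
        "OverallB": overall_track_b(fn_acc, arg_em, think_rate),
    }


# Toy records for illustration only; these are not shared-task data.
toy_golds = [
    {"function_name": "get_weather", "arguments": {"city": "الرياض", "days": 1}},
    {"function_name": "compare_prices", "arguments": {"country": "Jordan", "product_name": "Samsung TV"}},
    {"function_name": "none", "arguments": {}},
]
toy_preds = [
    {"function_name": "get_weather", "arguments": {"city": "الرياض", "days": 1}, "think": "trace"},
    {"function_name": "compare_prices", "arguments": {"country": "Jordan", "product_name": "Samsung"}, "think": "trace"},
    {"function_name": "none", "arguments": {}, "think": ""},
]
print(score(toy_golds, toy_preds))

# Plugging the reported component scores (0.982, 0.541, 0.868) into the formulas:
print(round(overall_track_a(0.982, 0.541), 3), round(overall_track_b(0.982, 0.541, 0.868), 3))
# 0.717 0.739
```
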

### Baseline results

| System | FnAcc | ArgEM | Overall (A) | Overall (B) |
|--------|:-----:|:-----:|:-----------:|:-----------:|
| **AISA-AR-FunctionCall-Think (270M) ← this** | **0.982** | **0.541** | **0.717** | **0.739** |
| GPT-4o, zero-shot | 0.927 | 0.070 | 0.413 | 0.313 |
| GPT-4o, 3-shot | 0.854 | 0.122 | 0.415 | 0.317 |
| Random baseline | 0.047 | 0.033 | 0.039 | 0.031 |

- **Think-Before-Call rate (ThinkRate):** **0.868** for this model; 0.000 for all non-reasoning baselines.
- **Hallucination rate:** **0.000** on negative (no-tool) queries.
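
One plausible reading of the hallucination rate is the share of no-tool queries for which a system still emits a call. The sketch below implements that reading; the definition, the function name `hallucination_rate`, and the toy records are assumptions rather than the official scorer.

```python
def hallucination_rate(golds, preds):
    """Share of negative (gold "none") rows where a call was predicted anyway."""
    negatives = [p for g, p in zip(golds, preds) if g["function_name"] == "none"]
    if not negatives:
        return 0.0
    return sum(p["function_name"] != "none" for p in negatives) / len(negatives)


# Toy records for illustration only: two negatives, one spurious call.
neg_golds = [
    {"function_name": "none"},
    {"function_name": "none"},
    {"function_name": "get_weather"},
]
neg_preds = [
    {"function_name": "none"},
    {"function_name": "get_weather"},
    {"function_name": "get_weather"},
]
print(hallucination_rate(neg_golds, neg_preds))
```
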

**Key takeaways**

- 🎯 **Argument extraction is the open challenge.** Tool *selection* is largely solved (FnAcc ≈ 0.98), but strict argument **exact match reaches only 0.541** for this baseline, and GPT-4o manages just 0.070 zero-shot. This is where the task is won or lost.
- 🪶 **A 270M model beats GPT-4o** on every metric reported here, showing the value of task-specific Arabic training and lowering the compute barrier to entry.
- 🗣️ **Cross-dialect gaps remain.** FnAcc varies by roughly 10–15 points across dialects, with **Gulf and Levantine** consistently the hardest and Maghrebi (small sample) the easiest; see the Track C diagnostic in the task overview paper.
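
A dialect-stratified view of FnAcc, in the spirit of the Track C diagnostic, can be sketched as a simple group-by. The field names `dialect`, `gold`, and `pred`, the function name `fnacc_by_dialect`, and the toy rows are assumptions for illustration; the official stratification may differ.

```python
from collections import defaultdict


def fnacc_by_dialect(rows):
    """Function-name accuracy per dialect label (unofficial sketch)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in rows:
        totals[r["dialect"]] += 1
        hits[r["dialect"]] += int(r["gold"] == r["pred"])
    return {d: hits[d] / totals[d] for d in totals}


# Made-up rows for illustration only; these are not shared-task results.
toy_rows = [
    {"dialect": "Gulf", "gold": "get_weather", "pred": "get_weather"},
    {"dialect": "Gulf", "gold": "compare_prices", "pred": "none"},
    {"dialect": "Egyptian", "gold": "none", "pred": "none"},
]
print(fnacc_by_dialect(toy_rows))
```

With small per-dialect samples, such as the Maghrebi subset mentioned above, a single example can shift the per-dialect score noticeably, so the counts are worth reporting alongside the rates.
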

---

## Training

- **Base:** `google/gemma-3-270m`
- **Method:** LoRA (rank 64), 3 epochs, cosine LR scheduler
- **Data:** AISA-ArabicFC training split (~10.5K examples) with 12,000 Arabic reasoning annotations for the `<think>` traces
- **Objective:** produce a short Arabic reasoning trace followed by a single structured tool call (or no call for negatives)

---

## Intended use & limitations

**Intended use**
- A reference **baseline** to compare against and reproduce for the AISA-ArabicFC shared task.
- A lightweight starting point for Arabic tool-use / agentic experiments.

**Out of scope / limitations**
- Trained for the **27-tool, 8-domain AISA-ArabicFC schema** and its prompt format; behaviour on arbitrary tools or free-form chat is undefined.
- Single-turn, single-call setting: no multi-tool or multi-turn dialogue.
- **Argument extraction is imperfect** (ArgEM 0.541): expect errors in date normalisation, numeric typing, and dialectal argument phrasing.
- Uneven dialect coverage (Maghrebi is only ~1.3% of data); robustness varies by dialect.
- At 270M parameters the model is capacity-limited by design, to keep the baseline accessible.

---

## Related resources

- 🏆 **Shared task page:** https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task
- 📊 **Leaderboard:** https://huggingface.co/spaces/TuwaiqAcademy/AISA-ArabicFC-SharedTask-Leaderboard
- 📚 **Dataset (train + dev):** [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC)
- 🧠 **Reasoning dataset:** [Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning](https://huggingface.co/datasets/Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning)
- 🤝 **Sibling baseline (Track A):** [TuwaiqAcademy/AISA-AR-FunctionCall-FT](https://huggingface.co/TuwaiqAcademy/AISA-AR-FunctionCall-FT)

---

## Citation

```bibtex
@inproceedings{najar2026aisaarabicfc,
  title     = {AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems},
  author    = {Najar, Omar},
  booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
  year      = {2026}
}
```

## License

This model is a derivative of **Gemma 3** and is distributed under the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)**. By using it you agree to those terms and to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). The AISA-ArabicFC **dataset** is released separately under Apache-2.0.

## Contact

Shared-task organizers: **arabicnlp-shared-task-chair@sigarab.org** · Tuwaiq Academy