Omartificial-Intelligence-Space committed 4 days ago

Commit 63ef237 (verified) · 1 Parent(s): 49b980c

Update README.md

Files changed (1)

README.md +181 -155

@@ -1,228 +1,254 @@
 ---
 language:
 - ar
-license: apache-2.0
-base_model: AISA-Framework/AISA-AR-FunctionCall-FT
 tags:
 - function-calling
-- arabic
 - tool-use
 - agentic
-- gemma
 - reasoning
-- lora
 - think
 datasets:
-- AISA-Framework/AISA-AR-FunctionCall
-pipeline_tag: text-generation
-library_name: transformers
 ---
 # AISA-AR-FunctionCall-Think
-<p align="center">
-  <img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/21Mxl67VW-RQFiXTnvheT.png" width="700"/>
-</p>
-**Reasoning-Augmented Arabic Structured Tool Calling**
-`AISA-AR-FunctionCall-Think` is a reasoning-enhanced variant of the Arabic function-calling model introduced in the **AISA-AR-FunctionCall** framework. The model generates an intermediate reasoning trace before invoking a tool, enabling transparent decision-making for Arabic agentic systems.
-This model extends [AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT) by introducing explicit reasoning supervision using `<think>` blocks prior to tool execution.
 ---
-## Model Overview
-| Field | Value |
 |---|---|
-| **Model name** | AISA-AR-FunctionCall-Think |
-| **Base model** | AISA-AR-FunctionCall-FT |
-| **Architecture** | Gemma 3 (FunctionGemma 270M) |
-| **Training method** | LoRA reasoning fine-tuning |
-| **Primary task** | Arabic reasoning-aware function calling |
-The model produces outputs in the following pattern:
-```
-<think>
-reasoning about tool selection
-</think>
-<start_function_call>
-call:tool_name{arguments}
-</end_function_call>
-```
-This allows the system to expose the reasoning behind tool selection.
 ---
-## Key Capabilities
-- Reasoning-aware tool selection
-- Explicit decision traces for tool invocation
-- Improved argument extraction consistency
-- Interpretable structured execution
-**Supported domains:**
-| Domain |
-|---|
-| Travel |
-| Utilities |
-| Islamic services |
-| Weather |
-| Healthcare |
-| Banking & finance |
-| E-commerce |
-| Government services |
-**Supported Arabic dialect groups:**
-- Modern Standard Arabic (MSA)
-- Gulf
-- Egyptian
-- Levantine
-- Maghrebi
 ---
-## Training Dataset
-Training uses a subset of the [AISA-AR-FunctionCall](https://huggingface.co/datasets/AISA-Framework/AISA-AR-FunctionCall) dataset with reasoning annotations.
-| Property | Value |
-|---|---|
-| Dataset size | ~12k reasoning-augmented samples |
-| Dialect coverage | 5 Arabic dialects |
-| Domains | 8 real-world domains |
-| Tools | 27 structured tools |
----
-## Training Methodology
-The reasoning model is trained by augmenting assistant outputs with explicit reasoning segments.
-**Training format:**
 ```
 <think>
-tool selection reasoning
 </think>
-<start_function_call>
-call:tool{arguments}
-</end_function_call>
 ```
-Reasoning supervision is enforced during inference by priming the model to begin its generation with `<think>`.
-**Training configuration:**
-| Parameter | Value |
-|---|---|
-| Training type | LoRA fine-tuning |
-| LoRA rank | 64 |
-| Alpha | 64 |
-| Dropout | 0.05 |
-| Trainable parameters | ~5.36% |
-| Epochs | 3 |
-| Learning rate | 3e-6 |
-| Effective batch size | 32 |
-| Optimizer | 8-bit AdamW |
-| Scheduler | Cosine |
-Additional training signals include **negative tool examples** to reduce hallucinated tool calls when no tool invocation is required.
 ---
-## Evaluation Results
-Evaluation is performed on a strict reasoning evaluation subset.
-### Strict Evaluation (n = 240)
-| Metric | Score |
-|---|---|
-| Tool Call Rate | 0.992 |
-| Think-Before-Call Rate | **1.000** |
-| Function Name Accuracy | 0.992 |
-| Argument F1 | **1.000** |
-| Decision Accuracy | 0.992 |
-| Hallucination Rate | **0.000** |
-These results indicate that the model consistently performs reasoning before tool invocation and achieves near-perfect structured alignment within the evaluated subset.
-### Important Note on Format Validation
-Standard function-call validators may classify reasoning outputs as **parse failures** because `<think>` tokens appear before the function call marker.
-This does **not** indicate structural instability — it reflects a difference in serialization format. When reasoning segments are permitted, tool invocation correctness remains near-perfect.
----
-## Example Usage
-**User query:**
-```
-ما حالة الطقس في الرياض اليوم؟
-```
-**Model output:**
-```
-<think>
-المستخدم يريد معرفة حالة الطقس في مدينة الرياض، لذا يجب استخدام أداة get_weather.
-</think>
-<start_function_call>
-call:get_weather{city:<escape>الرياض<escape>,days:1}
-</end_function_call>
-```
 ---
-## Intended Use
-This model is intended for:
-- Research on reasoning-aware tool calling
-- Interpretable agent systems
-- Arabic reasoning supervision experiments
-- Debugging tool selection behavior
-### Production Recommendation
-This model is an **exploratory research variant**. For production deployment, we recommend using:
-[AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT)
 ---
-## Related Resources
-| Resource | Link |
-|---|---|
-| Dataset | [AISA-Framework/AISA-AR-FunctionCall](https://huggingface.co/datasets/AISA-Framework/AISA-AR-FunctionCall) |
-| Production model | [AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT) |
-| Model collection | [AISA Arabic FunctionCall](https://huggingface.co/collections/AISA-Framework/aisa-arabic-functioncall-datasets-and-models) |
 ---
-## Paper
-**From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning**
-*AISA Framework*
----
-## AISA Framework
-This model is part of the **AISA** (Agentic AI Systems Architecture) initiative for building reliable multilingual AI agents.
----
-## License
-[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

 ---
+license: gemma
 language:
 - ar
+base_model:
+- google/gemma-3-270m
+pipeline_tag: text-generation
+library_name: transformers
 tags:
 - function-calling
 - tool-use
 - agentic
+- arabic
 - reasoning
 - think
+- gemma3
+- shared-task
+- arabicnlp2026
+- baseline
+- dialect
 datasets:
+- TuwaiqAcademy/AISA-ArabicFC
+- Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning
+model-index:
+- name: AISA-AR-FunctionCall-Think
+  results:
+  - task:
+      type: text-generation
+      name: Arabic Function Calling — Track B (Reasoning-Augmented)
+    dataset:
+      name: AISA-ArabicFC (held-out test)
+      type: TuwaiqAcademy/AISA-ArabicFC
+    metrics:
+    - type: function-name-accuracy
+      value: 0.982
+      name: FnAcc
+    - type: argument-exact-match
+      value: 0.541
+      name: ArgEM
+    - type: think-before-call-rate
+      value: 0.868
+      name: ThinkRate
+    - type: overall
+      value: 0.739
+      name: Overall (Track B, v2)
 ---
 # AISA-AR-FunctionCall-Think
+### 🏷️ Official **Track B baseline** for the [AISA-ArabicFC shared task](https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task) @ **ArabicNLP 2026** (co-located with EMNLP 2026, Budapest)
+> This model is the **organizer-provided baseline** for **Track B — Reasoning-Augmented Function Calling**. It defines the reference score that participating systems are expected to beat. It is released for reproducibility and as a starting point — **it is not a competition entry.**
+A compact (**270M-parameter**) Arabic function-calling model that, given an Arabic user query (in any of 5 dialects) and a set of candidate tools, **writes a short Arabic `<think>` reasoning trace and then emits a structured tool call**. Fine-tuned (LoRA) from **[google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m)** on the AISA-ArabicFC reasoning data.
+For the non-reasoning Track A baseline, see the sibling model **[AISA-AR-FunctionCall-FT](https://huggingface.co/TuwaiqAcademy/AISA-AR-FunctionCall-FT)**.
 ---
+## At a glance
+| | |
 |---|---|
+| **Role** | Official baseline — Track B (Reasoning-Augmented) |
+| **Base model** | google/gemma-3-270m (270M params) |
+| **Adaptation** | LoRA fine-tune (merged), then full causal-LM inference |
+| **Languages** | Arabic — MSA, Gulf, Egyptian, Levantine, Maghrebi |
+| **Behaviour** | `<think>` Arabic reasoning → structured function call |
+| **Training data** | [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) + [reasoning annotations](https://huggingface.co/datasets/Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning) |
+| **License** | Gemma (see *License* below) |
 ---
+## The shared task
+Given an Arabic user query and a set of candidate tool definitions, a system must:
+1. **Decide** whether a function call is required (some queries need no tool),
+2. **Select** the correct function name,
+3. **Extract** the structured arguments,
+4. **(Track B)** **Generate an Arabic reasoning trace** (`<think> … </think>`) *before* the call.
+| Track | Description |
+|-------|-------------|
+| **A — Core** | Decide / Select / Extract |
+| **B — Reasoning-Augmented** ← *this model* | Track A **+** an Arabic `<think>` reasoning trace |
+| **C — Cross-Dialect Robustness** | Diagnostic: dialect-stratified evaluation of A/B submissions |
 ---
+## How it works — input / output format
+This model uses **Gemma 3 chat turns** with a custom function-calling schema (it does **not** emit plain JSON). The exact prompt is the `text` field in the dataset; the structure is:
+```
+<bos><start_of_turn>developer
+<system instruction in Arabic>
+<start_function_declaration>declaration:NAME{description:<escape>…<escape>,parameters:{…}}<end_function_declaration>
+…one declaration per candidate tool…<end_of_turn>
+<start_of_turn>developer
+التاريخ والوقت الحالي …: 2024-04-12T23:05:24
+اليوم هو الجمعة
+أنت نموذج يمكنه استدعاء الوظائف التالية<end_of_turn>
+<start_of_turn>user
+أريد مقارنة أسعار تلفاز سامسونج في الأردن<end_of_turn>
+<start_of_turn>model
+```
+The model then generates:
 ```
 <think>
+يبدو أن نية المستخدم هي الحصول على مقارنة لأسعار تلفاز سامسونج في الأردن. أداة "compare_prices" هي الأنسب …
 </think>
+<start_function_call>call:compare_prices{country:<escape>Jordan<escape>,product_name:<escape>Samsung TV<escape>}<end_function_call>
 ```
+For a query that needs **no tool**, the model omits the `<start_function_call>` block (→ `requires_function = false`).
 ---
+## Usage
+```python
+import re, torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+MODEL_ID = "TuwaiqAcademy/AISA-AR-FunctionCall-Think"
+tok   = AutoTokenizer.from_pretrained(MODEL_ID)
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID, torch_dtype=torch.float32, device_map="auto"
+).eval()
+def parse_model_output(text: str) -> dict:
+    """Turn raw generation into the shared-task submission schema."""
+    out = {"requires_function": False, "function_name": "none", "arguments": {}, "think": ""}
+    if (m := re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)):
+        out["think"] = m.group(1).strip()
+    if (m := re.search(r"<start_function_call>\s*call:(\w+)\{(.*?)\}\s*<end_function_call>", text, re.DOTALL)):
+        out["requires_function"] = True
+        out["function_name"] = m.group(1)
+        for key, str_val, num_val in re.findall(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]+))", m.group(2)):
+            val = str_val if str_val else num_val
+            try:
+                val = float(val) if "." in str(val) else int(val)
+            except (ValueError, TypeError):
+                pass
+            out["arguments"][key] = val
+    return out
+# Easiest path: take the ready-made prompt from the dataset's `text` field and
+# cut it at the model turn (everything after is what the model should produce).
+from datasets import load_dataset
+row = load_dataset("TuwaiqAcademy/AISA-ArabicFC", split="validation")[0]
+prompt = row["text"].split("<start_of_turn>model\n")[0] + "<start_of_turn>model\n"
+inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+with torch.no_grad():
+    gen = model.generate(**inputs, max_new_tokens=250, do_sample=False)  # greedy
+raw = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
+print(parse_model_output(raw))
+# → {'requires_function': True, 'function_name': 'compare_prices',
+#    'arguments': {'country': 'Jordan', 'product_name': 'Samsung TV'},
+#    'think': 'يبدو أن نية المستخدم …'}
+```
+The parsed dict maps directly onto a **leaderboard submission line**: `{"id", "tool_called", "arguments", "think"}` (use `function_name` → `tool_called`).
+---
+## Evaluation
+Scored on the AISA-ArabicFC **held-out test set** (1,000 positive + negative examples) using the official **v2** metrics:
+- **FnAcc** — function-name accuracy over *all* samples (also penalises hallucinated / missed calls; negatives have gold `none`)
+- **ArgEM** — strict argument **exact match**, over positives only
+- **ThinkRate** — fraction of outputs with a non-empty `<think>` trace
+- **Overall (Track A)** = `0.40·FnAcc + 0.60·ArgEM`
+- **Overall (Track B)** = `0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate`
+### Baseline results
+| System | FnAcc | ArgEM | Overall (A) | Overall (B) |
+|--------|:-----:|:-----:|:-----------:|:-----------:|
+| **AISA-AR-FunctionCall-Think (270M) ← this** | **0.982** | **0.541** | **0.717** | **0.739** |
+| GPT-4o — zero-shot | 0.927 | 0.070 | 0.413 | 0.313 |
+| GPT-4o — 3-shot | 0.854 | 0.122 | 0.415 | 0.317 |
+| Random baseline | 0.047 | 0.033 | 0.039 | 0.031 |
+- **Think-Before-Call rate (ThinkRate):** **0.868** for this model; 0.000 for all non-reasoning baselines.
+- **Hallucination rate:** **0.000** on negative (no-tool) queries.
+**Key takeaways**
+- 🎯 **Argument extraction is the open challenge.** Tool *selection* is largely solved (FnAcc ≈ 0.98), but strict argument **exact match tops out at 0.541** — and GPT-4o reaches only 0.070 zero-shot. This is where the task is won or lost.
+- 🪶 **A 270M model beats GPT-4o** across every metric here, showing the value of task-specific Arabic training and lowering the compute barrier to entry.
+- 🗣️ **Cross-dialect gaps remain.** FnAcc varies by roughly 10–15 points across dialects, with **Gulf and Levantine** consistently the hardest and Maghrebi (small sample) the easiest — see the Track C diagnostic in the task overview paper.
 ---
+## Training
+- **Base:** `google/gemma-3-270m`
+- **Method:** LoRA (rank 64), 3 epochs, cosine LR scheduler
+- **Data:** AISA-ArabicFC training split (~10.5K examples) with 12,000 Arabic reasoning annotations for the `<think>` traces
+- **Objective:** produce a short Arabic reasoning trace followed by a single structured tool call (or no call for negatives)
+---
+## Intended use & limitations
+**Intended use**
+- A reference **baseline** to compare against and reproduce for the AISA-ArabicFC shared task.
+- A lightweight starting point for Arabic tool-use / agentic experiments.
+**Out of scope / limitations**
+- Trained for the **27-tool, 8-domain AISA-ArabicFC schema** and its prompt format; behaviour on arbitrary tools or free-form chat is undefined.
+- Single-turn, single-call setting — no multi-tool or multi-turn dialogue.
+- **Argument extraction is imperfect** (ArgEM 0.541): expect errors in date normalisation, numeric typing, and dialectal argument phrasing.
+- Uneven dialect coverage (Maghrebi is only ~1.3% of data); robustness varies by dialect.
+- A 270M model — capacity-limited by design to keep the baseline accessible.
 ---
+## Related resources
+- 🏆 **Shared task page:** https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task
+- 📊 **Leaderboard:** https://huggingface.co/spaces/TuwaiqAcademy/AISA-ArabicFC-SharedTask-Leaderboard
+- 📚 **Dataset (train + dev):** [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC)
+- 🧠 **Reasoning dataset:** [Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning](https://huggingface.co/datasets/Omartificial-Intelligence-Space/AISA-AR-FunctionCall-Reasoning)
+- 🤝 **Sibling baseline (Track A):** [TuwaiqAcademy/AISA-AR-FunctionCall-FT](https://huggingface.co/TuwaiqAcademy/AISA-AR-FunctionCall-FT)
 ---
+## Citation
+```bibtex
+@inproceedings{najar2026aisaarabicfc,
+  title     = {AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems},
+  author    = {Najar, Omar},
+  booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
+  year      = {2026}
+}
+```
+## License
+This model is a derivative of **Gemma 3** and is distributed under the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)**. By using it you agree to those terms and to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). The AISA-ArabicFC **dataset** is released separately under Apache-2.0.
+## Contact
+Shared-task organizers — **arabicnlp-shared-task-chair@sigarab.org** · Tuwaiq Academy
+```
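
The v2 scoring formulas quoted in the Evaluation section of the updated README can be checked against its baseline table. The following is a minimal sketch, not the official scorer: the function names `overall_track_a` and `overall_track_b` are hypothetical, and only the weights and the metric values are taken from the README.

```python
def overall_track_a(fn_acc: float, arg_em: float) -> float:
    """Overall (Track A) = 0.40*FnAcc + 0.60*ArgEM, weights as stated in the README."""
    return 0.40 * fn_acc + 0.60 * arg_em


def overall_track_b(fn_acc: float, arg_em: float, think_rate: float) -> float:
    """Overall (Track B) = 0.30*FnAcc + 0.50*ArgEM + 0.20*ThinkRate, weights as stated in the README."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate


# Metric values reported in the README for AISA-AR-FunctionCall-Think.
fn_acc, arg_em, think_rate = 0.982, 0.541, 0.868

print(round(overall_track_a(fn_acc, arg_em), 3))              # 0.717
print(round(overall_track_b(fn_acc, arg_em, think_rate), 3))  # 0.739

# GPT-4o zero-shot row: ThinkRate is 0.000 for non-reasoning baselines.
print(round(overall_track_b(0.927, 0.070, 0.0), 3))           # 0.313
```

The recomputed values match the table, which also shows why a non-reasoning system scores lower on Track B than on Track A: the 0.20 ThinkRate weight contributes nothing when no `<think>` trace is produced.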
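
The Usage section of the updated README states that the parsed dict maps onto a leaderboard submission line with the keys `id`, `tool_called`, `arguments`, and `think`, renaming `function_name` to `tool_called`. That mapping might look like the sketch below. The helper name `to_submission_line`, the `"example-0"` id, and the choice of one JSON object per line are assumptions made for illustration; the README lists only the keys, so the exact file format should be checked against the shared-task page.

```python
import json


def to_submission_line(example_id: str, parsed: dict) -> str:
    """Map a parsed model output onto the four submission keys named in the README."""
    record = {
        "id": example_id,                        # hypothetical id; use the dataset's own ids
        "tool_called": parsed["function_name"],  # README: function_name -> tool_called
        "arguments": parsed["arguments"],
        "think": parsed["think"],
    }
    # ensure_ascii=False keeps the Arabic reasoning trace human-readable.
    return json.dumps(record, ensure_ascii=False)


# Parsed output in the shape shown in the README's usage example.
parsed = {
    "requires_function": True,
    "function_name": "compare_prices",
    "arguments": {"country": "Jordan", "product_name": "Samsung TV"},
    "think": "يبدو أن نية المستخدم …",
}

print(to_submission_line("example-0", parsed))
```

`requires_function` is dropped here because it is not among the listed submission keys; for a no-tool query the parser already sets `function_name` to `"none"`, which carries the same information.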