| license: gemma | |
| language: | |
| - ar | |
| base_model: | |
| - google/gemma-3-270m | |
| pipeline_tag: text-generation | |
| library_name: transformers | |
| tags: | |
| - function-calling | |
| - tool-use | |
| - agentic | |
| - arabic | |
| - reasoning | |
| - think | |
| - gemma3 | |
| - shared-task | |
| - arabicnlp2026 | |
| - baseline | |
| - dialect | |
| datasets: | |
| - TuwaiqAcademy/AISA-ArabicFC | |
| model-index: | |
| - name: AISA-AR-FunctionCall-Think | |
| results: | |
| - task: | |
| type: text-generation | |
| name: Arabic Function Calling — Track B (Reasoning-Augmented) | |
| dataset: | |
| name: AISA-ArabicFC (held-out test) | |
| type: TuwaiqAcademy/AISA-ArabicFC | |
| metrics: | |
| - type: function-name-accuracy | |
| value: 0.982 | |
| name: FnAcc | |
| - type: argument-exact-match | |
| value: 0.541 | |
| name: ArgEM | |
| - type: think-before-call-rate | |
| value: 0.868 | |
| name: ThinkRate | |
| - type: overall | |
| value: 0.739 | |
| name: Overall (Track B, v2) | |
| # AISA-AR-FunctionCall-Think | |
| ### 🏷️ Official **Track B baseline** for the [AISA-ArabicFC shared task](https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task) @ **ArabicNLP 2026** (co-located with EMNLP 2026, Budapest) | |
| > This model is the **organizer-provided baseline** for **Track B — Reasoning-Augmented Function Calling**. It defines the reference score that participating systems are expected to beat. It is released for reproducibility and as a starting point — **it is not a competition entry.** | |
| A compact (**270M-parameter**) Arabic function-calling model that, given an Arabic user query (in any of 5 varieties: MSA, Gulf, Egyptian, Levantine, or Maghrebi) and a set of candidate tools, **writes a short Arabic `<think>` reasoning trace and then emits a structured tool call**. Fine-tuned (LoRA) from **[google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m)** on the AISA-ArabicFC reasoning data. | |
| For the non-reasoning Track A baseline, see the sibling model **[AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT)**. | |
| --- | |
| ## At a glance | |
| | | | | |
| |---|---| | |
| | **Role** | Official baseline — Track B (Reasoning-Augmented) | | |
| | **Base model** | google/gemma-3-270m (270M params) | | |
| | **Adaptation** | LoRA fine-tune with the adapters merged into the base weights; loads as a standard causal LM for inference | | |
| | **Languages** | Arabic — MSA, Gulf, Egyptian, Levantine, Maghrebi | | |
| | **Behaviour** | `<think>` Arabic reasoning → structured function call | | |
| | **Training data** | [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) | | |
| | **License** | Gemma (see *License* below) | | |
| --- | |
| ## The shared task | |
| Given an Arabic user query and a set of candidate tool definitions, a system must: | |
| 1. **Decide** whether a function call is required (some queries need no tool), | |
| 2. **Select** the correct function name, | |
| 3. **Extract** the structured arguments, | |
| 4. **(Track B)** **Generate an Arabic reasoning trace** (`<think> … </think>`) *before* the call. | |
| | Track | Description | | |
| |-------|-------------| | |
| | **A — Core** | Decide / Select / Extract | | |
| | **B — Reasoning-Augmented** ← *this model* | Track A **+** an Arabic `<think>` reasoning trace | | |
| | **C — Cross-Dialect Robustness** | Diagnostic: dialect-stratified evaluation of A/B submissions | | |
| --- | |
| ## How it works — input / output format | |
| This model uses **Gemma 3 chat turns** with a custom function-calling schema (it does **not** emit plain JSON). The exact prompt is the `text` field in the dataset; the structure is: | |
| ``` | |
| <bos><start_of_turn>developer | |
| <system instruction in Arabic> | |
| <start_function_declaration>declaration:NAME{description:<escape>…<escape>,parameters:{…}}<end_function_declaration> | |
| …one declaration per candidate tool…<end_of_turn> | |
| <start_of_turn>developer | |
| التاريخ والوقت الحالي …: 2024-04-12T23:05:24 | |
| اليوم هو الجمعة | |
| أنت نموذج يمكنه استدعاء الوظائف التالية<end_of_turn> | |
| <start_of_turn>user | |
| أريد مقارنة أسعار تلفاز سامسونج في الأردن<end_of_turn> | |
| <start_of_turn>model | |
| ``` | |
| The model then generates: | |
| ``` | |
| <think> | |
| يبدو أن نية المستخدم هي الحصول على مقارنة لأسعار تلفاز سامسونج في الأردن. أداة "compare_prices" هي الأنسب … | |
| </think> | |
| <start_function_call>call:compare_prices{country:<escape>Jordan<escape>,product_name:<escape>Samsung TV<escape>}<end_function_call> | |
| ``` | |
| For a query that needs **no tool**, the model omits the `<start_function_call>` block (→ `requires_function = false`). | |
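The call syntax shown above can be reproduced with a small serializer, which is handy for building gold strings or unit-testing a parser. This is a minimal sketch, not the organizers' code: `format_function_call` is a hypothetical helper, and writing numbers bare (without `<escape>` markers) is an assumption inferred from the parser in the Usage section, which accepts unescaped values.

```python
from typing import Any, Mapping

ESC = "<escape>"


def format_function_call(name: str, arguments: Mapping[str, Any]) -> str:
    """Serialize a call as <start_function_call>call:NAME{k:v,...}<end_function_call>."""
    parts = []
    for key, value in arguments.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            rendered = str(value)  # assumption: numbers are written bare
        else:
            rendered = f"{ESC}{value}{ESC}"  # strings are wrapped in <escape> markers
        parts.append(f"{key}:{rendered}")
    body = ",".join(parts)
    return "<start_function_call>call:" + name + "{" + body + "}<end_function_call>"


call = format_function_call(
    "compare_prices", {"country": "Jordan", "product_name": "Samsung TV"}
)
print(call)
```

For the example query, this prints the same call string as the sample output above.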
| --- | |
| ## Usage | |
| ```python | |
| import re | |
| import torch | |
| from datasets import load_dataset | |
| from transformers import AutoTokenizer, AutoModelForCausalLM | |
| MODEL_ID = "TuwaiqAcademy/AISA-AR-FunctionCall-Think" | |
| tok = AutoTokenizer.from_pretrained(MODEL_ID) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     MODEL_ID, torch_dtype=torch.float32, device_map="auto" | |
| ).eval() | |
| def parse_model_output(text: str) -> dict: | |
|     """Turn raw generation into the shared-task submission schema.""" | |
|     out = {"requires_function": False, "function_name": "none", "arguments": {}, "think": ""} | |
|     # Reasoning trace: everything between <think> and </think>. | |
|     if (m := re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)): | |
|         out["think"] = m.group(1).strip() | |
|     # Tool call: call:NAME{key:value,...} between the function-call markers. | |
|     if (m := re.search(r"<start_function_call>\s*call:(\w+)\{(.*?)\}\s*<end_function_call>", text, re.DOTALL)): | |
|         out["requires_function"] = True | |
|         out["function_name"] = m.group(1) | |
|         # Each value is either an <escape>-delimited string or a bare literal. | |
|         for key, str_val, num_val in re.findall(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]+))", m.group(2)): | |
|             val = str_val if str_val else num_val | |
|             try: | |
|                 # Coerce numeric-looking values; anything else stays a string. | |
|                 val = float(val) if "." in str(val) else int(val) | |
|             except (ValueError, TypeError): | |
|                 pass | |
|             out["arguments"][key] = val | |
|     return out | |
| # Easiest path: take the ready-made prompt from the dataset's `text` field and | |
| # cut it at the model turn (everything after is what the model should produce). | |
| row = load_dataset("TuwaiqAcademy/AISA-ArabicFC", split="validation")[0] | |
| prompt = row["text"].split("<start_of_turn>model\n")[0] + "<start_of_turn>model\n" | |
| inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device) | |
| with torch.no_grad(): | |
|     gen = model.generate(**inputs, max_new_tokens=250, do_sample=False)  # greedy | |
| # Decode only the newly generated tokens and keep the special markers for parsing. | |
| raw = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False) | |
| print(parse_model_output(raw)) | |
| # → {'requires_function': True, 'function_name': 'compare_prices', | |
| #    'arguments': {'country': 'Jordan', 'product_name': 'Samsung TV'}, | |
| #    'think': 'يبدو أن نية المستخدم …'} | |
| ``` | |
| The parsed dict maps directly onto a **leaderboard submission line**: `{"id", "tool_called", "arguments", "think"}` (use `function_name` → `tool_called`). | |
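As a sketch of that mapping, the helper below turns one parsed dict into one JSON line. The four field names come from the sentence above; `to_submission_line`, the example `id` value, and the choice of one JSON object per line are assumptions for illustration, so check the leaderboard instructions for the exact file format.

```python
import json


def to_submission_line(example_id: str, parsed: dict) -> str:
    """Map a parse_model_output-style dict onto the four submission fields."""
    record = {
        "id": example_id,
        "tool_called": parsed["function_name"],  # function_name -> tool_called
        "arguments": parsed["arguments"],
        "think": parsed["think"],
    }
    # ensure_ascii=False keeps Arabic text readable instead of \u escapes.
    return json.dumps(record, ensure_ascii=False)


parsed = {
    "requires_function": True,
    "function_name": "compare_prices",
    "arguments": {"country": "Jordan", "product_name": "Samsung TV"},
    "think": "يبدو أن نية المستخدم …",
}
line = to_submission_line("example-0", parsed)  # "example-0" is a made-up id
print(line)
```

`requires_function` is dropped here because it is not one of the four listed fields; a no-tool prediction is already expressed as `tool_called = "none"`.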
| --- | |
| ## Evaluation | |
| Scored on the AISA-ArabicFC **held-out test set** (1,000 positive + negative examples) using the official **v2** metrics: | |
| - **FnAcc** — function-name accuracy over *all* samples (also penalises hallucinated / missed calls; negatives have gold `none`) | |
| - **ArgEM** — strict argument **exact match**, over positives only | |
| - **ThinkRate** — fraction of outputs with a non-empty `<think>` trace | |
| - **Overall (Track A)** = `0.40·FnAcc + 0.60·ArgEM` | |
| - **Overall (Track B)** = `0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate` | |
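The definitions above can be sketched in a few lines. This is an illustrative reimplementation, not the official v2 scorer: the record layout (`tool_called`, `arguments`, `think`), the helper names, and details such as comparing arguments by plain dict equality regardless of the predicted function name are assumptions. Only the weights in `overall_a` and `overall_b` are taken directly from the formulas.

```python
def overall_a(fn_acc: float, arg_em: float) -> float:
    """Track A: 0.40 * FnAcc + 0.60 * ArgEM."""
    return 0.40 * fn_acc + 0.60 * arg_em


def overall_b(fn_acc: float, arg_em: float, think_rate: float) -> float:
    """Track B: 0.30 * FnAcc + 0.50 * ArgEM + 0.20 * ThinkRate."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate


def score(preds: list, golds: list) -> dict:
    """Compute the three component metrics from aligned prediction/gold records."""
    pairs = list(zip(preds, golds))
    # FnAcc: name match over all samples; negatives carry the gold name "none".
    fn_acc = sum(p["tool_called"] == g["tool_called"] for p, g in pairs) / len(pairs)
    # ArgEM: exact argument match, over positives only.
    positives = [(p, g) for p, g in pairs if g["tool_called"] != "none"]
    arg_em = (
        sum(p["arguments"] == g["arguments"] for p, g in positives) / len(positives)
        if positives
        else 0.0
    )
    # ThinkRate: share of outputs with a non-empty reasoning trace.
    think_rate = sum(bool((p.get("think") or "").strip()) for p in preds) / len(pairs)
    return {"FnAcc": fn_acc, "ArgEM": arg_em, "ThinkRate": think_rate}


# Toy data: one positive with wrong arguments, one correctly skipped negative.
golds = [
    {"tool_called": "compare_prices", "arguments": {"country": "Jordan"}},
    {"tool_called": "none", "arguments": {}},
]
preds = [
    {"tool_called": "compare_prices", "arguments": {"country": "Egypt"}, "think": "…"},
    {"tool_called": "none", "arguments": {}, "think": ""},
]
toy = score(preds, golds)
print(toy)

# Plugging in the FnAcc, ArgEM and ThinkRate reported for this model:
print(round(overall_a(0.982, 0.541), 3), round(overall_b(0.982, 0.541, 0.868), 3))
# → 0.717 0.739
```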
| ### Baseline results | |
| | System | FnAcc | ArgEM | Overall (A) | Overall (B) | | |
| |--------|:-----:|:-----:|:-----------:|:-----------:| | |
| | **AISA-AR-FunctionCall-Think (270M) ← this** | **0.982** | **0.541** | **0.717** | **0.739** | | |
| | GPT-4o — zero-shot | 0.927 | 0.070 | 0.413 | 0.313 | | |
| | GPT-4o — 3-shot | 0.854 | 0.122 | 0.415 | 0.317 | | |
| | Random baseline | 0.047 | 0.033 | 0.039 | 0.031 | | |
| - **Think-Before-Call rate (ThinkRate):** **0.868** for this model; 0.000 for all non-reasoning baselines. | |
| - **Hallucination rate:** **0.000** on negative (no-tool) queries. | |
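The hallucination rate is not defined formally in this card. A natural reading is the share of negative (no-tool) queries for which the system still emits a call, and the sketch below implements that reading. The function name `hallucination_rate` and the record layout (a `tool_called` field with `none` meaning no call) are assumptions, not the official scorer.

```python
def hallucination_rate(preds: list, golds: list) -> float:
    """Fraction of gold-negative examples where a tool call was predicted anyway."""
    negatives = [p for p, g in zip(preds, golds) if g["tool_called"] == "none"]
    if not negatives:
        return 0.0  # no negatives to hallucinate on
    hallucinated = sum(p["tool_called"] != "none" for p in negatives)
    return hallucinated / len(negatives)


# Toy data: two negatives, one of which triggers a spurious call.
golds = [
    {"tool_called": "none"},
    {"tool_called": "none"},
    {"tool_called": "compare_prices"},
]
preds = [
    {"tool_called": "none"},
    {"tool_called": "compare_prices"},
    {"tool_called": "compare_prices"},
]
rate = hallucination_rate(preds, golds)
print(rate)
```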
| **Key takeaways** | |
| - 🎯 **Argument extraction is the open challenge.** Tool *selection* is largely solved (FnAcc ≈ 0.98), but strict argument **exact match tops out at 0.541** — and GPT-4o reaches only 0.070 zero-shot. This is where the task is won or lost. | |
| - 🪶 **A 270M model beats GPT-4o** across every metric here, showing the value of task-specific Arabic training and lowering the compute barrier to entry. | |
| - 🗣️ **Cross-dialect gaps remain.** FnAcc varies by roughly 10–15 points across dialects, with **Gulf and Levantine** consistently the hardest and Maghrebi (small sample) the easiest — see the Track C diagnostic in the task overview paper. | |
| --- | |
| ## Training | |
| - **Base:** `google/gemma-3-270m` | |
| - **Method:** LoRA (rank 64), 3 epochs, cosine LR scheduler | |
| - **Data:** AISA-ArabicFC training split (~10.5K examples) with 12,000 Arabic reasoning annotations for the `<think>` traces | |
| - **Objective:** produce a short Arabic reasoning trace followed by a single structured tool call (or no call for negatives) | |
| --- | |
| ## Intended use & limitations | |
| **Intended use** | |
| - A reference **baseline** to compare against and reproduce for the AISA-ArabicFC shared task. | |
| - A lightweight starting point for Arabic tool-use / agentic experiments. | |
| **Out of scope / limitations** | |
| - Trained for the **27-tool, 8-domain AISA-ArabicFC schema** and its prompt format; behaviour on arbitrary tools or free-form chat is undefined. | |
| - Single-turn, single-call setting — no multi-tool or multi-turn dialogue. | |
| - **Argument extraction is imperfect** (ArgEM 0.541): expect errors in date normalisation, numeric typing, and dialectal argument phrasing. | |
| - Uneven dialect coverage (Maghrebi is only ~1.3% of data); robustness varies by dialect. | |
| - A 270M model — capacity-limited by design to keep the baseline accessible. | |
| --- | |
| ## Related resources | |
| - 🏆 **Shared task page:** https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task | |
| - 📊 **Leaderboard:** https://huggingface.co/spaces/TuwaiqAcademy/AISA-ArabicFC-SharedTask-Leaderboard | |
| - 📚 **Dataset (train + dev):** [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @inproceedings{najar2026aisaarabicfc, | |
| title = {AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems}, | |
| author = {Najar, Omar}, | |
| booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)}, | |
| year = {2026} | |
| } | |
| ``` | |
| ## License | |
| This model is a derivative of **Gemma 3** and is distributed under the **[Gemma Terms of Use](https://ai.google.dev/gemma/terms)**. By using it you agree to those terms and to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). The AISA-ArabicFC **dataset** is released separately under Apache-2.0. | |
| ## Contact | |
| Shared-task organizers — **trdc@tuwaiq.edu.sa** · Tuwaiq Academy | |
| ``` |