Instructions to use TuwaiqAcademy/AISA-AR-FunctionCall-Think with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use TuwaiqAcademy/AISA-AR-FunctionCall-Think with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TuwaiqAcademy/AISA-AR-FunctionCall-Think")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TuwaiqAcademy/AISA-AR-FunctionCall-Think")
model = AutoModelForCausalLM.from_pretrained("TuwaiqAcademy/AISA-AR-FunctionCall-Think")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use TuwaiqAcademy/AISA-AR-FunctionCall-Think with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "TuwaiqAcademy/AISA-AR-FunctionCall-Think"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TuwaiqAcademy/AISA-AR-FunctionCall-Think",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/TuwaiqAcademy/AISA-AR-FunctionCall-Think

SGLang

How to use TuwaiqAcademy/AISA-AR-FunctionCall-Think with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TuwaiqAcademy/AISA-AR-FunctionCall-Think" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TuwaiqAcademy/AISA-AR-FunctionCall-Think",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TuwaiqAcademy/AISA-AR-FunctionCall-Think" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TuwaiqAcademy/AISA-AR-FunctionCall-Think",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use TuwaiqAcademy/AISA-AR-FunctionCall-Think with Docker Model Runner:
```
docker model run hf.co/TuwaiqAcademy/AISA-AR-FunctionCall-Think
```


	---
	license: gemma
	language:
	- ar
	base_model:
	- google/gemma-3-270m
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- function-calling
	- tool-use
	- agentic
	- arabic
	- reasoning
	- think
	- gemma3
	- shared-task
	- arabicnlp2026
	- baseline
	- dialect
	datasets:
	- TuwaiqAcademy/AISA-ArabicFC
	model-index:
	- name: AISA-AR-FunctionCall-Think
	  results:
	  - task:
	      type: text-generation
	      name: Arabic Function Calling - Track B (Reasoning-Augmented)
	    dataset:
	      name: AISA-ArabicFC (held-out test)
	      type: TuwaiqAcademy/AISA-ArabicFC
	    metrics:
	    - type: function-name-accuracy
	      value: 0.982
	      name: FnAcc
	    - type: argument-exact-match
	      value: 0.541
	      name: ArgEM
	    - type: think-before-call-rate
	      value: 0.868
	      name: ThinkRate
	    - type: overall
	      value: 0.739
	      name: Overall (Track B, v2)
	---

	# AISA-AR-FunctionCall-Think

	### 🏷️ Official Track B baseline for the [AISA-ArabicFC shared task](https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task) @ ArabicNLP 2026 (co-located with EMNLP 2026, Budapest)

	> This model is the organizer-provided baseline for Track B — Reasoning-Augmented Function Calling. It defines the reference score that participating systems are expected to beat. It is released for reproducibility and as a starting point — it is not a competition entry.

	A compact (270M-parameter) Arabic function-calling model that, given an Arabic user query (in any of 5 dialects) and a set of candidate tools, writes a short Arabic `<think>` reasoning trace and then emits a structured tool call. Fine-tuned (LoRA) from [google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m) on the AISA-ArabicFC reasoning data.

	For the non-reasoning Track A baseline, see the sibling model [AISA-AR-FunctionCall-FT](https://huggingface.co/AISA-Framework/AISA-AR-FunctionCall-FT).

	---

	## At a glance

	| | |
	|---|---|
	| Role | Official baseline for Track B (Reasoning-Augmented) |
	| Base model | google/gemma-3-270m (270M params) |
	| Adaptation | LoRA fine-tune (merged), then full causal-LM inference |
	| Languages | Arabic: MSA, Gulf, Egyptian, Levantine, Maghrebi |
	| Behaviour | `<think>` Arabic reasoning → structured function call |
	| Training data | [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC) |
	| License | Gemma (see License below) |

	---

	## The shared task

	Given an Arabic user query and a set of candidate tool definitions, a system must:

	1. Decide whether a function call is required (some queries need no tool),
	2. Select the correct function name,
	3. Extract the structured arguments,
	4. (Track B) Generate an Arabic reasoning trace (`<think> … </think>`) before the call.

	| Track | Description |
	|-------|-------------|
	| A: Core | Decide / Select / Extract |
	| B: Reasoning-Augmented ← this model | Track A + an Arabic `<think>` reasoning trace |
	| C: Cross-Dialect Robustness | Diagnostic: dialect-stratified evaluation of A/B submissions |

	---

	## How it works — input / output format

	This model uses Gemma 3 chat turns with a custom function-calling schema (it does not emit plain JSON). The exact prompt is the `text` field in the dataset; the structure is:

	```
	<bos><start_of_turn>developer
	<system instruction in Arabic>
	<start_function_declaration>declaration:NAME{description:<escape>…<escape>,parameters:{…}}<end_function_declaration>
	…one declaration per candidate tool…<end_of_turn>
	<start_of_turn>developer
	التاريخ والوقت الحالي …: 2024-04-12T23:05:24
	اليوم هو الجمعة
	أنت نموذج يمكنه استدعاء الوظائف التالية<end_of_turn>
	<start_of_turn>user
	أريد مقارنة أسعار تلفاز سامسونج في الأردن<end_of_turn>
	<start_of_turn>model
	```

	The model then generates:

	```
	<think>
	يبدو أن نية المستخدم هي الحصول على مقارنة لأسعار تلفاز سامسونج في الأردن. أداة "compare_prices" هي الأنسب …
	</think>
	<start_function_call>call:compare_prices{country:<escape>Jordan<escape>,product_name:<escape>Samsung TV<escape>}<end_function_call>
	```

	For a query that needs no tool, the model omits the `<start_function_call>` block (→ `requires_function = false`).
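The call syntax shown above can be reproduced with a few lines of Python. This is a minimal sketch for illustration only: `format_call` and `has_call` are hypothetical helpers that are not part of the model's tooling, and the sketch assumes that string values are wrapped in `<escape>` markers while numeric values are written bare.

```python
def format_call(name: str, arguments: dict) -> str:
    """Serialize a tool call in the card's <start_function_call> syntax.

    Hypothetical helper: numbers are written bare (an assumption), every
    other value is wrapped in <escape> markers as in the sample output.
    """
    parts = []
    for key, value in arguments.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            parts.append(f"{key}:{value}")
        else:
            parts.append(f"{key}:<escape>{value}<escape>")
    body = ",".join(parts)
    return "<start_function_call>call:" + name + "{" + body + "}<end_function_call>"


def has_call(generation: str) -> bool:
    """A generation without a call block corresponds to requires_function = false."""
    return "<start_function_call>" in generation


example = format_call(
    "compare_prices", {"country": "Jordan", "product_name": "Samsung TV"}
)
print(example)
print(has_call(example), has_call("<think>لا حاجة إلى أداة</think>"))
```

Serializing gold calls this way is mainly useful for building test fixtures that look like real generations.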

	---

	## Usage

	```python
	import re, torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	MODEL_ID = "TuwaiqAcademy/AISA-AR-FunctionCall-Think"
	tok = AutoTokenizer.from_pretrained(MODEL_ID)
	model = AutoModelForCausalLM.from_pretrained(
	    MODEL_ID, torch_dtype=torch.float32, device_map="auto"
	).eval()

	def parse_model_output(text: str) -> dict:
	    """Turn raw generation into the shared-task submission schema."""
	    out = {"requires_function": False, "function_name": "none", "arguments": {}, "think": ""}
	    # Reasoning trace: everything between <think> and </think>.
	    if (m := re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)):
	        out["think"] = m.group(1).strip()
	    # Tool call: call:NAME{key:value,...} inside the function-call markers.
	    if (m := re.search(r"<start_function_call>\s*call:(\w+)\{(.*?)\}\s*<end_function_call>", text, re.DOTALL)):
	        out["requires_function"] = True
	        out["function_name"] = m.group(1)
	        # Each value is either <escape>-wrapped or a bare token up to the next "," or "}".
	        for key, str_val, num_val in re.findall(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]+))", m.group(2)):
	            val = str_val if str_val else num_val
	            try:
	                val = float(val) if "." in str(val) else int(val)
	            except (ValueError, TypeError):
	                pass  # not numeric: keep the string as-is
	            out["arguments"][key] = val
	    return out

	# Easiest path: take the ready-made prompt from the dataset's `text` field and
	# cut it at the model turn (everything after is what the model should produce).
	from datasets import load_dataset
	row = load_dataset("TuwaiqAcademy/AISA-ArabicFC", split="validation")[0]
	prompt = row["text"].split("<start_of_turn>model\n")[0] + "<start_of_turn>model\n"

	inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
	with torch.no_grad():
	    gen = model.generate(**inputs, max_new_tokens=250, do_sample=False)  # greedy
	# Decode only the newly generated tokens, keeping the special markers for parsing.
	raw = tok.decode(gen[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)

	print(parse_model_output(raw))
	# → {'requires_function': True, 'function_name': 'compare_prices',
	#    'arguments': {'country': 'Jordan', 'product_name': 'Samsung TV'},
	#    'think': 'يبدو أن نية المستخدم …'}
	```

	The parsed dict maps directly onto a leaderboard submission line: `{"id", "tool_called", "arguments", "think"}` (use `function_name` → `tool_called`).
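The mapping described above could look like the sketch below. It assumes that a submission file holds one JSON object per line and that each example carries an identifier; `to_submission_line` and the sample id are hypothetical, so check the leaderboard page for the authoritative format.

```python
import json


def to_submission_line(example_id, parsed: dict) -> str:
    """Map a parsed model output onto the submission keys named in this card.

    Hypothetical helper: `function_name` is renamed to `tool_called`, and
    no-tool predictions keep the value "none".
    """
    record = {
        "id": example_id,
        "tool_called": parsed["function_name"],
        "arguments": parsed["arguments"],
        "think": parsed["think"],
    }
    # ensure_ascii=False keeps the Arabic reasoning trace readable in the file.
    return json.dumps(record, ensure_ascii=False)


parsed = {
    "requires_function": True,
    "function_name": "compare_prices",
    "arguments": {"country": "Jordan", "product_name": "Samsung TV"},
    "think": "يبدو أن نية المستخدم …",
}
line = to_submission_line("example-0", parsed)  # "example-0" is a made-up id
print(line)
```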

	---

	## Evaluation

	Scored on the AISA-ArabicFC held-out test set (1,000 positive + negative examples) using the official v2 metrics:

	- FnAcc — function-name accuracy over all samples (also penalises hallucinated / missed calls; negatives have gold `none`)
	- ArgEM — strict argument exact match, over positives only
	- ThinkRate — fraction of outputs with a non-empty `<think>` trace
	- Overall (Track A) = `0.40·FnAcc + 0.60·ArgEM`
	- Overall (Track B) = `0.30·FnAcc + 0.50·ArgEM + 0.20·ThinkRate`
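The three component metrics can be sketched as follows. This is not the official scorer: `component_metrics` is a hypothetical function, the record keys mirror the parsed output used in the Usage section, and "strict" exact match is assumed to mean plain equality of the argument dictionaries.

```python
def component_metrics(predictions: list, gold: list) -> dict:
    """Compute FnAcc, ArgEM and ThinkRate for aligned prediction/gold lists."""
    pairs = list(zip(predictions, gold))
    # FnAcc: name accuracy over all samples; negatives carry the gold name "none",
    # so hallucinated and missed calls both count as errors.
    fn_acc = sum(p["function_name"] == g["function_name"] for p, g in pairs) / len(pairs)
    # ArgEM: exact match over positives only (assumption: dict equality).
    positives = [(p, g) for p, g in pairs if g["function_name"] != "none"]
    arg_em = (
        sum(p["arguments"] == g["arguments"] for p, g in positives) / len(positives)
        if positives else 0.0
    )
    # ThinkRate: share of outputs with a non-empty reasoning trace.
    think_rate = sum(bool(p["think"].strip()) for p in predictions) / len(predictions)
    return {"FnAcc": fn_acc, "ArgEM": arg_em, "ThinkRate": think_rate}


# Toy data: two positives and two negatives (argument values are made up).
gold = [
    {"function_name": "compare_prices", "arguments": {"country": "Jordan"}},
    {"function_name": "compare_prices", "arguments": {"country": "Egypt"}},
    {"function_name": "none", "arguments": {}},
    {"function_name": "none", "arguments": {}},
]
predictions = [
    {"function_name": "compare_prices", "arguments": {"country": "Jordan"}, "think": "..."},
    {"function_name": "compare_prices", "arguments": {"country": "Jordan"}, "think": "..."},
    {"function_name": "none", "arguments": {}, "think": ""},
    {"function_name": "compare_prices", "arguments": {"country": "Egypt"}, "think": "..."},
]
metrics = component_metrics(predictions, gold)
print(metrics)
```

In the toy data the last prediction is a hallucinated call on a negative query, which lowers FnAcc but leaves ArgEM untouched because ArgEM only looks at positives.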

	### Baseline results

	| System | FnAcc | ArgEM | Overall (A) | Overall (B) |
	|--------|:-----:|:-----:|:-----------:|:-----------:|
	| AISA-AR-FunctionCall-Think (270M) ← this | 0.982 | 0.541 | 0.717 | 0.739 |
	| GPT-4o, zero-shot | 0.927 | 0.070 | 0.413 | 0.313 |
	| GPT-4o, 3-shot | 0.854 | 0.122 | 0.415 | 0.317 |
	| Random baseline | 0.047 | 0.033 | 0.039 | 0.031 |

	- Think-Before-Call rate (ThinkRate): 0.868 for this model; 0.000 for all non-reasoning baselines.
	- Hallucination rate: 0.000 on negative (no-tool) queries.
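The Overall columns follow directly from the two weighting formulas given under Evaluation. The short check below recomputes them from the reported FnAcc, ArgEM and ThinkRate values; the function names are illustrative, and ThinkRate is taken as 0.000 for every system other than this model, as stated above.

```python
def overall_a(fn_acc: float, arg_em: float) -> float:
    """Track A: 0.40*FnAcc + 0.60*ArgEM."""
    return 0.40 * fn_acc + 0.60 * arg_em


def overall_b(fn_acc: float, arg_em: float, think_rate: float) -> float:
    """Track B: 0.30*FnAcc + 0.50*ArgEM + 0.20*ThinkRate."""
    return 0.30 * fn_acc + 0.50 * arg_em + 0.20 * think_rate


# (FnAcc, ArgEM, ThinkRate) as reported in this card.
reported = {
    "AISA-AR-FunctionCall-Think (270M)": (0.982, 0.541, 0.868),
    "GPT-4o, zero-shot": (0.927, 0.070, 0.0),
    "GPT-4o, 3-shot": (0.854, 0.122, 0.0),
    "Random baseline": (0.047, 0.033, 0.0),
}
for system, (fn_acc, arg_em, think_rate) in reported.items():
    a = overall_a(fn_acc, arg_em)
    b = overall_b(fn_acc, arg_em, think_rate)
    print(f"{system}: Overall (A) = {a:.3f}, Overall (B) = {b:.3f}")
```

For this model the weights give 0.717 on Track A and 0.739 on Track B, matching the table.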

	Key takeaways

	- 🎯 Argument extraction is the open challenge. Tool selection is largely solved (FnAcc ≈ 0.98), but strict argument exact match tops out at 0.541 — and GPT-4o reaches only 0.070 zero-shot. This is where the task is won or lost.
	- 🪶 A 270M model beats GPT-4o across every metric here, showing the value of task-specific Arabic training and lowering the compute barrier to entry.
	- 🗣️ Cross-dialect gaps remain. FnAcc varies by roughly 10–15 points across dialects, with Gulf and Levantine consistently the hardest and Maghrebi (small sample) the easiest — see the Track C diagnostic in the task overview paper.
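A dialect-stratified view of FnAcc, in the spirit of the Track C diagnostic, can be approximated by grouping predictions before averaging. The sketch below is an assumption-laden illustration: `fn_acc_by_dialect` and the `dialect`, `predicted` and `gold` keys are hypothetical, and the toy records are not real results.

```python
from collections import defaultdict


def fn_acc_by_dialect(records: list) -> dict:
    """Function-name accuracy per dialect (hypothetical record layout)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record["dialect"]] += 1
        hits[record["dialect"]] += int(record["predicted"] == record["gold"])
    return {dialect: hits[dialect] / totals[dialect] for dialect in totals}


# Toy records only; they do not reflect the model's actual per-dialect scores.
records = [
    {"dialect": "Gulf", "predicted": "compare_prices", "gold": "compare_prices"},
    {"dialect": "Gulf", "predicted": "none", "gold": "compare_prices"},
    {"dialect": "Egyptian", "predicted": "compare_prices", "gold": "compare_prices"},
    {"dialect": "Egyptian", "predicted": "none", "gold": "none"},
]
per_dialect = fn_acc_by_dialect(records)
print(per_dialect)
```

Reporting the per-dialect sample counts next to each score helps when a group is small, as noted for Maghrebi.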

	---

	## Training

	- Base: `google/gemma-3-270m`
	- Method: LoRA (rank 64), 3 epochs, cosine LR scheduler
	- Data: AISA-ArabicFC training split (~10.5K examples) with 12,000 Arabic reasoning annotations for the `<think>` traces
	- Objective: produce a short Arabic reasoning trace followed by a single structured tool call (or no call for negatives)

	---

	## Intended use & limitations

	Intended use
	- A reference baseline to compare against and reproduce for the AISA-ArabicFC shared task.
	- A lightweight starting point for Arabic tool-use / agentic experiments.

	Out of scope / limitations
	- Trained for the 27-tool, 8-domain AISA-ArabicFC schema and its prompt format; behaviour on arbitrary tools or free-form chat is undefined.
	- Single-turn, single-call setting — no multi-tool or multi-turn dialogue.
	- Argument extraction is imperfect (ArgEM 0.541): expect errors in date normalisation, numeric typing, and dialectal argument phrasing.
	- Uneven dialect coverage (Maghrebi is only ~1.3% of data); robustness varies by dialect.
	- A 270M model — capacity-limited by design to keep the baseline accessible.

	---

	## Related resources

	- 🏆 Shared task page: https://huggingface.co/spaces/Omartificial-Intelligence-Space/AISA-ArabicFC-Shared-Task
	- 📊 Leaderboard: https://huggingface.co/spaces/TuwaiqAcademy/AISA-ArabicFC-SharedTask-Leaderboard
	- 📚 Dataset (train + dev): [TuwaiqAcademy/AISA-ArabicFC](https://huggingface.co/datasets/TuwaiqAcademy/AISA-ArabicFC)

	---

	## Citation

	```bibtex
	@inproceedings{najar2026aisaarabicfc,
	title = {AISA-ArabicFC: Arabic Function Calling for Agentic AI Systems},
	author = {Najar, Omar},
	booktitle = {Proceedings of the Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026)},
	year = {2026}
	}
	```

	## License

	This model is a derivative of Gemma 3 and is distributed under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms). By using it you agree to those terms and to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). The AISA-ArabicFC dataset is released separately under Apache-2.0.

	## Contact

	Shared-task organizers — trdc@tuwaiq.edu.sa · Tuwaiq Academy