aiqarus-agent-4b

A 4B parameter agent model fine-tuned from Qwen3-4B-Instruct for enterprise AI agent tasks: tool-calling, multi-step planning, risk escalation, confidence calibration, and multi-agent handoff.

Iteratively improved across two training rounds, with LLM-as-judge evaluation driving data and methodology changes between rounds.

Code & Docs: github.com/zeon01/aiqarus-agent-4b

Status: Research checkpoint (Round 2, SFT-only). Alignment was attempted but diverged — shipped as SFT-only. V3 planned on Qwen3.5-4B with on-policy alignment. Not recommended for production use without further fine-tuning.


Intended Use

  • Enterprise agent orchestration (tool routing, task decomposition)
  • Multi-system workflows requiring handoff and delegation
  • Research and experimentation with small agent models
  • Not recommended for: Safety-critical applications, adversarial environments, or production use without further fine-tuning and alignment

How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "zeon01/aiqarus-agent-4b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Tools are declared in the system prompt as a JSON schema list.
messages = [
    {"role": "system", "content": "You are an enterprise AI agent with access to the following tools:\n\n[{\"name\": \"search_customers\", \"parameters\": {\"query\": \"string\"}}]"},
    {"role": "user", "content": "Find all customers in the healthcare vertical with contracts expiring this quarter."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
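The model emits tool calls in Qwen3's native `<tool_call>` tag format, with a JSON object inside the tags. A minimal extraction sketch (the helper name and regex are illustrative, not part of the model's API):

```python
import json
import re

# Qwen3-style tool calls: a JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(response: str) -> list[dict]:
    """Return every well-formed tool call found in a model response."""
    calls = []
    for match in TOOL_CALL_RE.finditer(response):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed JSON rather than crash the agent loop
    return calls

response = (
    "Looking up matching customers.\n"
    "<tool_call>\n"
    '{"name": "search_customers", "arguments": {"query": "healthcare contracts expiring Q3"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(response))
# → [{'name': 'search_customers', 'arguments': {'query': 'healthcare contracts expiring Q3'}}]
```

Skipping malformed JSON instead of raising keeps a downstream agent loop alive when sampling occasionally produces a broken call.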

Training Summary

| | Round 1 (V1) | Round 2 (V2) — Current |
|---|---|---|
| Base model | Qwen3-4B-Instruct-2507 | Qwen3-4B-Instruct-2507 |
| Method | QLoRA (4-bit NF4, rank=32, alpha=64) | QLoRA (4-bit NF4, rank=32, alpha=64) |
| Dataset size | 51K samples | 77K samples |
| Custom enterprise data | ~1,600 samples | ~12,000 samples (4x upsampled) |
| Negative examples (refusal, escalation, clarification) | 0 | ~9,400 |
| Adversarial data (injection, social engineering) | 0 | ~400 |
| Curriculum | 2-stage (foundation first, then all) | Flattened (all layers from epoch 1) |
| Sequence packing | No (91% padding waste) | Yes |
| GPU | A10G, 17 hrs | B200, 11 hrs |
| Final loss | 0.376 | 0.288 |
| Token accuracy | 90.9% | 91.2% |
| Alignment | None | SimPO attempted, diverged; SFT-only |
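Both rounds used the same QLoRA recipe (4-bit NF4 quantization, LoRA rank 32, alpha 64). A sketch of that configuration with `peft` and `transformers` — the dropout value and target modules are typical choices for Qwen-family models, not confirmed training settings:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters trained on top: rank=32, alpha=64 as listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,  # assumption: not stated in this card
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[    # assumption: standard Qwen attention/MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```

In a typical SFT pipeline, `bnb_config` is passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained` and the model is then wrapped with `peft.get_peft_model(model, lora_config)`.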

Data Sources

[Figure: data source breakdown]

Training Curves (R1 vs R2)

[Figure: Training Dashboard — R1 vs R2]

R1 (grey): visible loss spike at step ~2,076 where Stage 2 data introduced a distribution shift. R2 (blue/green): smooth throughout — flattened curriculum eliminates the shock.

Key Methodology Changes (R1 to R2)

  • Systematic dataset audit across 604K candidate samples identified thousands of restraint/refusal examples in a dataset we nearly discarded, while confirming another dataset should be dropped entirely (100% tool-calling bias).
  • Action-type rebalancing from ~80% tool-calling (R1) to ~50/50 (R2) between tool and non-tool actions.
  • Flattened curriculum. R1's staged training cemented tool-calling bias before reasoning data was introduced. R2 mixes everything from the start.
  • Tool name diversification to prevent memorization of fixed tool names after upsampling.
  • Token distribution analysis informed the choice of maximum sequence length and packing configuration, cutting training time nearly in half.
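The sequence-packing gain can be illustrated with a first-fit-decreasing sketch in plain Python. This is illustrative only: a real trainer concatenates the token sequences inside each bin (with attention masking between samples) rather than merely binning lengths.

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """First-fit-decreasing packing of sample lengths (token counts)
    into bins of at most max_len tokens, reducing padding waste."""
    bins: list[list[int]] = []
    space: list[int] = []  # remaining capacity per bin
    for length in sorted(lengths, reverse=True):
        for i, free in enumerate(space):
            if length <= free:
                bins[i].append(length)
                space[i] -= length
                break
        else:
            bins.append([length])
            space.append(max_len - length)
    return bins

lengths = [900, 700, 650, 400, 300, 120, 80]
packed = pack_sequences(lengths, max_len=1024)

used = sum(lengths)
packed_waste = 1 - used / (len(packed) * 1024)       # padding with packing
no_pack_waste = 1 - used / (len(lengths) * 1024)     # one sample per row
```

Here packing fits seven samples into four 1024-token rows instead of seven, which is exactly the kind of padding reduction the table above attributes R2's shorter wall-clock time to.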

Evaluation Results

Custom Enterprise Eval (230 single-turn cases, LLM-judged)

Scored by an LLM judge (Codex). The base model is included as a baseline.

| Metric | Base Qwen3-4B | R1 | R2 |
|---|---|---|---|
| Action accuracy | 46.1% | 38.7% | 44.0% |
| Risk escalation | 36% | 4% | 40% |
| Reasoning quality | 1.9/5 | 2.1/5 | 3.1/5 |
| Response quality | 2.3/5 | 1.9/5 | 2.9/5 |

R1 finetuning degraded performance below the base model (call_tool bias from unbalanced data). R2 recovered action accuracy and added reasoning depth the base model lacks.

Multi-Turn Eval (110 cases, LLM-judged) — New in R2

Full conversation flows with simulated tool responses, errors, partial data, and injection attempts. Both base model and R2 scored by the same LLM judge.

| Metric | Base | R2 | Delta |
|---|---|---|---|
| Composite | 3.58/5 | 3.89/5 | +0.31 |
| Reasoning depth | 2.37/5 | 3.30/5 | +0.93 |
| Injection detection | 1.84/5 | 3.67/5 | +1.83 |
| Decision quality | 4.36/5 | 4.38/5 | +0.02 |

Finetuning adds reasoning and adversarial awareness. The base model is already competent at basic decisions.

[Figure: Eval Dashboard — Base vs R2]

External Benchmarks

| Benchmark | Finetuned V2 | Base Qwen3-4B | Delta | Notes |
|---|---|---|---|---|
| When2Call accuracy | 47.7% | 41.1% | +6.6 pts | MCQ format; directional signal only |
| When2Call Macro F1 | 0.3081 | 0.2477 | +24.4% (relative) | |
| BFCL v4 overall | 21.32% | 35.68% | -14.36 pts | Format mismatch (see below) |

On BFCL regression: BFCL's FC handler expects OpenAI-style function calling. Our model was trained on Qwen3's native <tool_call> tag format. The model generates correct tool calls that the benchmark parser can't extract. This is a format compatibility issue, not a capability regression. Custom eval in the model's native format shows clear improvement across all metrics.
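The mismatch can be made concrete with a thin adapter that translates the model's native tag format into the OpenAI-style `tool_calls` shape that FC-format harnesses expect. The field layout below follows the OpenAI chat-completions schema; the helper itself is a hypothetical sketch, not part of any benchmark harness:

```python
import json
import re

TAG_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def to_openai_tool_calls(response: str) -> list[dict]:
    """Convert Qwen3-native <tool_call> JSON blobs into OpenAI-style entries."""
    calls = []
    for i, m in enumerate(TAG_RE.finditer(response)):
        blob = json.loads(m.group(1))
        calls.append({
            "id": f"call_{i}",
            "type": "function",
            "function": {
                "name": blob["name"],
                # OpenAI-style "arguments" is a JSON *string*, not a dict
                "arguments": json.dumps(blob.get("arguments", {})),
            },
        })
    return calls
```

Without a translation step like this, a parser expecting the OpenAI shape sees no function call at all in an otherwise correct response, which is the failure mode behind the BFCL number above.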

On When2Call: MCQ classification format (A/B/C/D) differs from the model's generative tool-calling format. Accuracy improved, but MCQ tool-calling bias worsened. Directional signal only — the model's actual tool restraint is better measured by custom eval (risk escalation 4% to 40-60%).


Known Limitations

  • SFT-only. Alignment (SimPO) diverged — the model lacks preference-tuned decision boundaries. It may be overconfident in tool-calling for ambiguous cases.
  • Response quality 2.8-2.9/5. Improved from R1 (1.9/5) but still below production quality.
  • External benchmark format mismatch. The model uses <tool_call> tags (Qwen3 native format), not OpenAI-style function calling. BFCL and other FC-format benchmarks will undercount correct predictions.
  • Single-turn eval has gaps. Several trained capabilities have zero coverage in single-turn eval. Multi-turn composite (4.10/5) is more representative of real behavior.

Project Journey

This model was built iteratively across two rounds, with each round informed by what the previous one revealed:

Base model baseline: Running the unmodified Qwen3-4B-Instruct through the same eval revealed it scores 46.1% action accuracy — meaning R1 was actually a regression below the base model.

Round 1: 51K samples, 2-stage curriculum on A10G. Token accuracy looked great (90.9%), heuristic eval said 53%. LLM judge revealed true accuracy was 38.7% — the model formatted tool calls perfectly but couldn't decide when to use them. Root cause: 80% tool-calling bias, zero negative examples, no adversarial data.

Round 2: Complete dataset rebuild. Systematic audit of 604K candidate samples changed the data strategy fundamentally. Scaled custom enterprise data from 1,600 to 12,000 samples, added 9,400 negative examples and 400 adversarial samples. Flattened curriculum, added sequence packing. Action accuracy recovered to 44%, reasoning improved from 1.9 to 3.1/5, and a new multi-turn eval (110 cases) showed the real gains: reasoning depth +0.93 and injection detection +1.83 over the base model. SimPO alignment attempted twice, diverged both times — shipped SFT-only.

Total cost across both rounds: ~$131 (mostly GPU time on Modal).


What's Next (V3)

  • Base model upgrade to Qwen3.5-4B (same parameter count, newer architecture)
  • On-policy alignment using the model's own best completions
  • Dedicated tool-restraint training
  • Native-format benchmark handler for fair BFCL comparison

Citation

@misc{aiqarus-agent-4b,
  title={aiqarus-agent-4b: A Fine-tuned 4B Agent Model for Enterprise AI Tasks},
  author={Saad Sharif Ahmed},
  year={2026},
  url={https://huggingface.co/zeon01/aiqarus-agent-4b}
}

Acknowledgments

Built on Qwen3-4B-Instruct by Alibaba Cloud. Public training data from vericava, interstellarninja, nvidia/When2Call, and deepset/prompt-injections. Trained on Modal.com.

License

Apache 2.0
