aiqarus-agent-4b

A 4B parameter agent model fine-tuned from Qwen3-4B-Instruct for enterprise AI agent tasks: tool-calling, multi-step planning, risk escalation, confidence calibration, and multi-agent handoff.

Iteratively improved across two training rounds, with LLM-as-judge evaluation driving data and methodology changes between rounds.

Code & Docs: github.com/zeon01/aiqarus-agent-4b

Status: Research checkpoint (Round 2, SFT-only). Alignment was attempted but diverged — shipped as SFT-only. V3 planned on Qwen3.5-4B with on-policy alignment. Not recommended for production use without further fine-tuning.


Intended Use

  • Enterprise agent orchestration (tool routing, task decomposition)
  • Multi-system workflows requiring handoff and delegation
  • Research and experimentation with small agent models
  • Not recommended for: Safety-critical applications, adversarial environments, or production use without further fine-tuning and alignment

How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "zeon01/aiqarus-agent-4b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Tools are declared in the system prompt as a JSON schema list.
messages = [
    {"role": "system", "content": "You are an enterprise AI agent with access to the following tools:\n\n[{\"name\": \"search_customers\", \"parameters\": {\"query\": \"string\"}}]"},
    {"role": "user", "content": "Find all customers in the healthcare vertical with contracts expiring this quarter."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
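The model emits tool calls in Qwen3's native `<tool_call>` tag format, with a JSON object inside the tags. A minimal extraction sketch (the helper name and regex are illustrative, not part of the model's API):

```python
import json
import re

# Qwen3-style tool calls: a JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(response: str) -> list[dict]:
    """Return every well-formed tool call found in a model response."""
    calls = []
    for match in TOOL_CALL_RE.finditer(response):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed JSON rather than crash the agent loop
    return calls

response = (
    "Looking up matching customers.\n"
    "<tool_call>\n"
    '{"name": "search_customers", "arguments": {"query": "healthcare contracts expiring Q3"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(response))
# → [{'name': 'search_customers', 'arguments': {'query': 'healthcare contracts expiring Q3'}}]
```

Skipping malformed JSON instead of raising keeps a downstream agent loop alive when sampling occasionally produces a broken call.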

Training Summary

| | Round 1 (V1) | Round 2 (V2) — Current |
|---|---|---|
| Base model | Qwen3-4B-Instruct-2507 | Qwen3-4B-Instruct-2507 |
| Method | QLoRA (4-bit NF4, rank=32, alpha=64) | QLoRA (4-bit NF4, rank=32, alpha=64) |
| Dataset size | 51K samples | 77K samples |
| Custom enterprise data | ~1,600 samples | ~12,000 samples (4x upsampled) |
| Negative examples (refusal, escalation, clarification) | 0 | ~9,400 |
| Adversarial data (injection, social engineering) | 0 | ~400 |
| Curriculum | 2-stage (foundation first, then all) | Flattened (all layers from epoch 1) |
| Sequence packing | No (91% padding waste) | Yes |
| GPU | A10G, 17 hrs | B200, 11 hrs |
| Final loss | 0.376 | 0.288 |
| Token accuracy | 90.9% | 91.2% |
| Alignment | None | SimPO attempted, diverged; SFT-only |
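Both rounds used the same QLoRA recipe (4-bit NF4 quantization, LoRA rank 32, alpha 64). A sketch of that configuration with `peft` and `transformers` — the dropout value and target modules are typical choices for Qwen-family models, not confirmed training settings:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters trained on top: rank=32, alpha=64 as listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,  # assumption: not stated in this card
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[    # assumption: standard Qwen attention/MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```

In a typical SFT pipeline, `bnb_config` is passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained` and the model is then wrapped with `peft.get_peft_model(model, lora_config)`.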

Data Sources

[Figure: data source breakdown]

Training Curves (R1 vs R2)

[Figure: Training Dashboard — R1 vs R2]

R1 (grey): visible loss spike at step ~2,076 where Stage 2 data introduced a distribution shift. R2 (blue/green): smooth throughout — flattened curriculum eliminates the shock.

Key Methodology Changes (R1 to R2)

  • Systematic dataset audit across 604K candidate samples identified thousands of restraint/refusal examples in a dataset we nearly discarded, while confirming another dataset should be dropped entirely (100% tool-calling bias).
  • Action-type rebalancing from ~80% tool-calling (R1) to ~50/50 (R2) between tool and non-tool actions.
  • Flattened curriculum. R1's staged training cemented tool-calling bias before reasoning data was introduced. R2 mixes everything from the start.
  • Tool name diversification to prevent memorization of fixed tool names after upsampling.
  • Token distribution analysis informed the choice of maximum sequence length and packing configuration, cutting training time nearly in half.
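The sequence-packing gain can be illustrated with a first-fit-decreasing sketch in plain Python. This is illustrative only: a real trainer concatenates the token sequences inside each bin (with attention masking between samples) rather than merely binning lengths.

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """First-fit-decreasing packing of sample lengths (token counts)
    into bins of at most max_len tokens, reducing padding waste."""
    bins: list[list[int]] = []
    space: list[int] = []  # remaining capacity per bin
    for length in sorted(lengths, reverse=True):
        for i, free in enumerate(space):
            if length <= free:
                bins[i].append(length)
                space[i] -= length
                break
        else:
            bins.append([length])
            space.append(max_len - length)
    return bins

lengths = [900, 700, 650, 400, 300, 120, 80]
packed = pack_sequences(lengths, max_len=1024)

used = sum(lengths)
packed_waste = 1 - used / (len(packed) * 1024)       # padding with packing
no_pack_waste = 1 - used / (len(lengths) * 1024)     # one sample per row
```

Here packing fits seven samples into four 1024-token rows instead of seven, which is exactly the kind of padding reduction the table above attributes R2's shorter wall-clock time to.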

Evaluation Results

Custom Enterprise Eval (230 single-turn cases, LLM-judged)

Scored by an LLM judge (Codex). The base model is included as a baseline.

| Metric | Base Qwen3-4B | R1 | R2 |
|---|---|---|---|
| Action accuracy | 46.1% | 38.7% | 44.0% |
| Risk escalation | 36% | 4% | 40% |
| Reasoning quality | 1.9/5 | 2.1/5 | 3.1/5 |
| Response quality | 2.3/5 | 1.9/5 | 2.9/5 |

R1 finetuning degraded performance below the base model (call_tool bias from unbalanced data). R2 recovered action accuracy and added reasoning depth the base model lacks.

Multi-Turn Eval (110 cases, LLM-judged) — New in R2

Full conversation flows with simulated tool responses, errors, partial data, and injection attempts. Both base model and R2 scored by the same LLM judge.

| Metric | Base | R2 | Delta |
|---|---|---|---|
| Composite | 3.58/5 | 3.89/5 | +0.31 |
| Reasoning depth | 2.37/5 | 3.30/5 | +0.93 |
| Injection detection | 1.84/5 | 3.67/5 | +1.83 |
| Decision quality | 4.36/5 | 4.38/5 | +0.02 |

Finetuning adds reasoning and adversarial awareness. The base model is already competent at basic decisions.

[Figure: Eval Dashboard — Base vs R2]

External Benchmarks

| Benchmark | Finetuned V2 | Base Qwen3-4B | Delta | Notes |
|---|---|---|---|---|
| When2Call accuracy | 47.7% | 41.1% | +6.6 pts | MCQ format; directional signal only |
| When2Call Macro F1 | 0.3081 | 0.2477 | +24.4% (relative) | |
| BFCL v4 overall | 21.32% | 35.68% | -14.36 pts | Format mismatch (see below) |

On BFCL regression: BFCL's FC handler expects OpenAI-style function calling. Our model was trained on Qwen3's native <tool_call> tag format. The model generates correct tool calls that the benchmark parser can't extract. This is a format compatibility issue, not a capability regression. Custom eval in the model's native format shows clear improvement across all metrics.
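The mismatch can be made concrete with a thin adapter that translates the model's native tag format into the OpenAI-style `tool_calls` shape that FC-format harnesses expect. The field layout below follows the OpenAI chat-completions schema; the helper itself is a hypothetical sketch, not part of any benchmark harness:

```python
import json
import re

TAG_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def to_openai_tool_calls(response: str) -> list[dict]:
    """Convert Qwen3-native <tool_call> JSON blobs into OpenAI-style entries."""
    calls = []
    for i, m in enumerate(TAG_RE.finditer(response)):
        blob = json.loads(m.group(1))
        calls.append({
            "id": f"call_{i}",
            "type": "function",
            "function": {
                "name": blob["name"],
                # OpenAI-style "arguments" is a JSON *string*, not a dict
                "arguments": json.dumps(blob.get("arguments", {})),
            },
        })
    return calls
```

Without a translation step like this, a parser expecting the OpenAI shape sees no function call at all in an otherwise correct response, which is the failure mode behind the BFCL number above.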

On When2Call: MCQ classification format (A/B/C/D) differs from the model's generative tool-calling format. Accuracy improved, but MCQ tool-calling bias worsened. Directional signal only — the model's actual tool restraint is better measured by custom eval (risk escalation 4% to 40-60%).


Known Limitations

  • SFT-only. Alignment (SimPO) diverged — the model lacks preference-tuned decision boundaries. It may be overconfident in tool-calling for ambiguous cases.
  • Response quality 2.8-2.9/5. Improved from R1 (1.9/5) but still below production quality.
  • External benchmark format mismatch. The model uses <tool_call> tags (Qwen3 native format), not OpenAI-style function calling. BFCL and other FC-format benchmarks will undercount correct predictions.
  • Single-turn eval has gaps. Several trained capabilities have zero coverage in single-turn eval. Multi-turn composite (4.10/5) is more representative of real behavior.

Project Journey

This model was built iteratively across two rounds, with each round informed by what the previous one revealed:

Base model baseline: Running the unmodified Qwen3-4B-Instruct through the same eval revealed it scores 46.1% action accuracy — meaning R1 was actually a regression below the base model.

Round 1: 51K samples, 2-stage curriculum on A10G. Token accuracy looked great (90.9%), heuristic eval said 53%. LLM judge revealed true accuracy was 38.7% — the model formatted tool calls perfectly but couldn't decide when to use them. Root cause: 80% tool-calling bias, zero negative examples, no adversarial data.

Round 2: Complete dataset rebuild. Systematic audit of 604K candidate samples changed the data strategy fundamentally. Scaled custom enterprise data from 1,600 to 12,000 samples, added 9,400 negative examples and 400 adversarial samples. Flattened curriculum, added sequence packing. Action accuracy recovered to 44%, reasoning improved from 1.9 to 3.1/5, and a new multi-turn eval (110 cases) showed the real gains: reasoning depth +0.93 and injection detection +1.83 over the base model. SimPO alignment attempted twice, diverged both times — shipped SFT-only.

Total cost across both rounds: ~$131 (mostly GPU time on Modal).


What's Next (V3)

  • Base model upgrade to Qwen3.5-4B (same parameter count, newer architecture)
  • On-policy alignment using the model's own best completions
  • Dedicated tool-restraint training
  • Native-format benchmark handler for fair BFCL comparison

Citation

@misc{aiqarus-agent-4b,
  title={aiqarus-agent-4b: A Fine-tuned 4B Agent Model for Enterprise AI Tasks},
  author={Saad Sharif Ahmed},
  year={2026},
  url={https://huggingface.co/zeon01/aiqarus-agent-4b}
}

Acknowledgments

Built on Qwen3-4B-Instruct by Alibaba Cloud. Public training data from vericava, interstellarninja, nvidia/When2Call, and deepset/prompt-injections. Trained on Modal.com.

License

Apache 2.0
