---
language:
- en
license: apache-2.0
base_model: EphAsad/Aristaeus
tags:
- agentic
- tool-calling
- reasoning
- fine-tuned
- qwen2.5
- qlora
- chain-of-thought
- function-calling
- unsloth
datasets:
- DJLougen/hermes-agent-traces-filtered
- lambda/hermes-agent-reasoning-traces
- zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
- glaiveai/glaive-function-calling-v2
pipeline_tag: text-generation
---

# AristaeusAgent

**AristaeusAgent** is a QLoRA fine-tune of [EphAsad/Aristaeus](https://huggingface.co/EphAsad/Aristaeus) — itself a full fine-tune of Qwen2.5-1.5B-Instruct — trained to add structured agentic tool-use on top of the reasoning foundation established in Stage 1.

This is Stage 2 of a two-stage training pipeline:

- **Stage 1 — Aristaeus:** Chain-of-thought reasoning (OpenThoughts3 + Bespoke-Stratos)
- **Stage 2 — AristaeusAgent (this model):** Agentic tool-calling with `<think>`-before-act behaviour

The model uses the Hermes-style tool-call format: reasoning goes in `<think>...</think>` blocks, and tool invocations are emitted as `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`.
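
A minimal sketch of how a caller might split one assistant turn in this format into its reasoning and its tool calls. The helper name `parse_agent_output` is hypothetical and not part of any released code; the regular expressions assume the tags appear exactly as written above.

```python
import json
import re

# Patterns for the Hermes-style blocks described above.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_agent_output(text):
    """Return (reasoning, tool_calls) extracted from one assistant turn.

    Hypothetical helper: malformed JSON payloads are skipped, not raised.
    """
    think = THINK_RE.search(text)
    reasoning = think.group(1).strip() if think else ""
    tool_calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore payloads that are not valid JSON
        if isinstance(call, dict) and "name" in call:
            tool_calls.append(call)
    return reasoning, tool_calls
```

A turn with no tags at all yields an empty reasoning string and an empty call list, which a caller can treat as a direct answer.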

---

## Training

| Detail | Value |
|---|---|
| Base model | EphAsad/Aristaeus (Stage 1) |
| Fine-tune type | QLoRA (4-bit base, bf16 adapter) |
| LoRA rank | r=16, alpha=32 |
| LoRA targets | q/k/v/o/gate/up/down projections |
| Hardware | NVIDIA A100-SXM4-40GB |
| Epochs | 1 |
| Sequence length | 8192 tokens |
| Effective batch size | 16 (batch 1 × grad accum 16) |
| Learning rate | 5e-6 (cosine schedule) |
| Framework | Unsloth + TRL SFTTrainer |
| Packing | Disabled — agentic trajectories must stay intact |
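
As a rough cross-check of the table above, the sketch below estimates how many parameters an r=16 adapter on these seven projections would train. The layer dimensions are taken from the public Qwen2.5-1.5B configuration and are assumed to carry over unchanged to Aristaeus; none of the numbers come from the training logs.

```python
# Architecture values from the public Qwen2.5-1.5B config
# (assumption: unchanged by the Stage 1 full fine-tune).
HIDDEN = 1536
INTERMEDIATE = 8960
LAYERS = 28
KV_DIM = 2 * 128  # 2 key/value heads x head dim 128 (grouped-query attention)

RANK = 16
ALPHA = 32

# (input dim, output dim) of each projection targeted by LoRA.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank=RANK):
    """LoRA adds two matrices per projection: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in PROJECTIONS.values())
trainable = per_layer * LAYERS
scaling = ALPHA / RANK          # factor applied to the adapter update
effective_batch = 1 * 16        # per-device batch x gradient accumulation steps

print(f"LoRA params per layer: {per_layer:,}")
print(f"Trainable LoRA params: {trainable:,}")
print(f"LoRA scaling (alpha/r): {scaling}")
print(f"Effective batch size:  {effective_batch}")
```

Under those assumptions the adapter trains roughly 18.5M parameters, about 1% of the base model, with a LoRA scaling factor of 2.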

### Why QLoRA over full fine-tune

Stage 1 established chain-of-thought reasoning with a full fine-tune over 29,400 examples (81 minutes of training). Stage 2 uses QLoRA (r=16, LR=5e-6, 1 epoch) specifically to preserve those Stage 1 gains. A full fine-tune at Stage 2 data scale would risk overwriting the reasoning capability rather than extending it.

### Datasets

**[DJLougen/hermes-agent-traces-filtered](https://huggingface.co/datasets/DJLougen/hermes-agent-traces-filtered)** — Quality-filtered subset of the Hermes agent traces. Filtered for non-trivial `<think>` blocks (>50 chars), valid tool-call JSON, and evidence of deliberate tool selection reasoning. Primary training signal.
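
A filter of the kind described above might look like the following sketch. It is not the dataset author's code: the function name is hypothetical, the requirement that at least one tool call be present is an assumption, and the "deliberate tool selection" criterion is not modelled at all.

```python
import json
import re

MIN_THINK_CHARS = 50  # threshold quoted in the dataset description

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def passes_quality_filter(assistant_text):
    """Keep a turn only if it reasons non-trivially and every tool call parses."""
    thinks = THINK_RE.findall(assistant_text)
    if not any(len(t.strip()) > MIN_THINK_CHARS for t in thinks):
        return False  # no think block longer than the threshold
    calls = TOOL_CALL_RE.findall(assistant_text)
    if not calls:
        return False  # assumption: a kept turn must contain a tool call
    for payload in calls:
        try:
            json.loads(payload)
        except json.JSONDecodeError:
            return False  # reject turns with malformed tool-call JSON
    return True
```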

**[lambda/hermes-agent-reasoning-traces](https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces)** — Multi-turn agentic trajectories generated via the Hermes Agent harness using Kimi-K2.5 and GLM-5.1 (both run locally, no API ToS concerns). 2,000 rows from each config. Apache 2.0.

**[zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory](https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory)** — Multi-turn trajectories with per-row quality scores. Filtered to `score >= 0.7` (~2,000 rows). Includes `reasoning_content` fields converted to `<think>` blocks during normalisation. Apache 2.0.
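
The score filter and the `reasoning_content` conversion could be sketched as below. The row schema (a `messages` list whose assistant entries carry `reasoning_content` and `tool_calls`) and the function name `normalise_row` are assumptions made for illustration; the actual dataset columns and the author's normalisation script may differ.

```python
import json

MIN_SCORE = 0.7  # threshold stated above


def normalise_row(row):
    """Return Hermes-style messages, or None when the row scores below MIN_SCORE.

    Assumed (hypothetical) schema: {"score": float, "messages": [...]}.
    """
    if row.get("score", 0.0) < MIN_SCORE:
        return None
    out = []
    for msg in row["messages"]:
        if msg["role"] != "assistant":
            out.append({"role": msg["role"], "content": msg.get("content") or ""})
            continue
        parts = []
        reasoning = (msg.get("reasoning_content") or "").strip()
        if reasoning:
            parts.append(f"<think>\n{reasoning}\n</think>")
        if msg.get("content"):
            parts.append(msg["content"])
        for call in msg.get("tool_calls") or []:
            # json.dumps without indent= yields single-line JSON.
            parts.append(f"<tool_call>\n{json.dumps(call)}\n</tool_call>")
        out.append({"role": "assistant", "content": "\n".join(parts)})
    return out
```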

**[glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2)** — No-call subset only (~1,200 rows after filtering). Tool schemas present in the system prompt, assistant answers directly from knowledge. These negative examples teach tool refusal — the model learns that having tools available does not mean they must be called. Rows containing markdown code blocks were explicitly excluded to prevent format bleed into tool-call syntax. Apache 2.0.
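
A sketch of the no-call selection described above. The column names (`system`, `chat`) and the two marker strings are assumptions about how the dataset encodes tool availability and function calls; they should be checked against the actual rows before reuse.

```python
FENCE = "`" * 3  # markdown code-fence marker
TOOLS_MARKER = "with access to the following functions"  # assumed system-prompt phrasing
CALL_MARKER = "<functioncall>"  # assumed marker for a function call in the chat text


def is_clean_no_call(row):
    """True when tools are offered, none is called, and the chat has no code fences."""
    system = row.get("system", "")
    chat = row.get("chat", "")
    return (
        TOOLS_MARKER in system
        and CALL_MARKER not in chat
        and FENCE not in chat
    )
```

Rows that pass all three checks are the negative examples: tools are on offer, yet the assistant answers in plain text.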

---

## Evaluation

AristaeusAgent was evaluated against Aristaeus (base) using a custom 50-test benchmark across 5 capability dimensions (max 3 points per test, 150 total). Results below are from the final v2 training run (5-dataset stack including glaive no-call subset).

### Overall

| Model | Score | % |
|---|---|---|
| Aristaeus (Stage 1 base) | 68 / 150 | 45.3% |
| AristaeusAgent | 94 / 150 | 62.7% |
| Delta | **+26 points** | **+17.3pp** |

AristaeusAgent wins 23 tests, draws 19, loses 8.

### By dimension

| Dimension | Base | Base% | Agent | Agent% | Delta |
|---|---|---|---|---|---|
| Reasoning Quality | 5/30 | 16.7% | 25/30 | 83.3% | **+20** |
| Multi-Step Planning | 8/30 | 26.7% | 22/30 | 73.3% | **+14** |
| Tool Selection | 11/30 | 36.7% | 18/30 | 60.0% | **+7** |
| Argument Construction | 22/30 | 73.3% | 23/30 | 76.7% | +1 |
| Tool Refusal | 22/30 | 73.3% | 6/30 | 20.0% | -16 |
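
The per-dimension figures can be checked against the overall totals with a few lines of arithmetic. The scores below are copied from the table; the only assumption is that each dimension has 10 tests worth 3 points each, which follows from 50 tests spread evenly over 5 dimensions.

```python
MAX_PER_DIMENSION = 30  # assumed: 10 tests x 3 points per dimension

# (base score, agent score) per dimension, copied from the table above.
SCORES = {
    "Reasoning Quality": (5, 25),
    "Multi-Step Planning": (8, 22),
    "Tool Selection": (11, 18),
    "Argument Construction": (22, 23),
    "Tool Refusal": (22, 6),
}

base_total = sum(base for base, _ in SCORES.values())
agent_total = sum(agent for _, agent in SCORES.values())
max_total = MAX_PER_DIMENSION * len(SCORES)

for name, (base, agent) in SCORES.items():
    print(f"{name:<22} {base:>2}/{MAX_PER_DIMENSION}  {agent:>2}/{MAX_PER_DIMENSION}  {agent - base:+d}")

print(f"Base:  {base_total}/{max_total} ({100 * base_total / max_total:.1f}%)")
print(f"Agent: {agent_total}/{max_total} ({100 * agent_total / max_total:.1f}%)")
print(f"Delta: {agent_total - base_total:+d} points "
      f"({100 * (agent_total - base_total) / max_total:+.1f}pp)")
```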

### Regressions (8 tests)

| Test | Dimension | Base | Agent | Note |
|---|---|---|---|---|
| B09 — run `grep -r 'error' /var/log/` | Argument Construction | 3 | 1 | Searched web instead of running bash |
| D01 — What is 2+2? | Tool Refusal | 2 | 0 | Called a tool unnecessarily |
| D02 — What language is Python? | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D03 — Explain REST API | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D05 — Write a haiku | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D07 — Summarise AMR | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D09 — SOLID principles | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| E10 — Weather → write report | Multi-Step Planning | 2 | 1 | Partial step sequencing |

### Tool Refusal — the remaining limitation

Tool Refusal is the only dimension where AristaeusAgent scores lower than base (6/30 vs 22/30). The glaive no-call dataset reduced the over-triggering seen in v1 but did not eliminate it. The model still calls tools on static knowledge questions it should answer directly. This is a persistent training data imbalance — agentic trajectories dominate and almost always end in a tool call, so the model's prior in a tool-bearing system prompt is to use one.

This is partially addressable at inference time with an explicit system prompt instruction:
```
Only call a tool if the task genuinely requires external data, computation, or
system access. Answer directly from knowledge for factual questions, definitions,
and simple arithmetic.
```

### Spot check — format confirmation

```
Prompt: "What were the top AI news stories this week?"

── Aristaeus (base) ──
...{"name": "web_search", "arguments": {"query": "..."}}   ← raw JSON, no tags

── AristaeusAgent ──
<think> To find the top AI news stories this week, I should search for recent
articles. Let me use the web_search tool. </think>
<tool_call>
{"name": "web_search", "arguments": {"query": "top AI news stories this week"}}
</tool_call>
```

AristaeusAgent correctly learned the Hermes `<think>` + `<tool_call>` format. The base model produces raw JSON without wrapper tags. This format difference explains why several base model correct answers scored 0 in earlier benchmark runs.

---

## Honest Limitations

**Tool over-triggering (primary limitation).** AristaeusAgent scores 6/30 on Tool Refusal vs the base model's 22/30. The model consistently calls tools on static knowledge questions — "explain REST", "write a haiku", "what are the SOLID principles" — where it should answer directly. Root cause: agentic training data is dominated by trajectories that end in tool calls. The glaive no-call subset partially addressed this but not sufficiently at the data scale used. Partially mitigable via system prompt instruction; fully addressable with a larger proportion of refusal examples in training or a larger base model.

**Tool-call format resolved in v2.** An earlier training run (v1, lambda dataset only) produced a model that emitted markdown code blocks and hallucinated tool syntax. The v2 run corrected this — AristaeusAgent now reliably produces `<think>...</think>` + `<tool_call>{...}</tool_call>` format. Root cause of the v1 failure was `indent=2` in JSON serialisation during zake7749 normalisation, producing a different token sequence from the compact JSON used by all other datasets.
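
To illustrate the serialisation mismatch described above, the sketch below serialises the same tool call both ways with the standard `json` module. The helper `to_compact` is a hypothetical name for one way to bring every payload back to a single canonical form; it is not taken from the training scripts.

```python
import json

call = {"name": "web_search", "arguments": {"query": "top AI news stories this week"}}

compact = json.dumps(call)             # single line
indented = json.dumps(call, indent=2)  # multi-line, with leading spaces

print(compact)
print(indented)


def to_compact(payload_text):
    """Re-serialise any valid JSON payload into the single-line form."""
    return json.dumps(json.loads(payload_text))


# Both strings decode to the same object, but the text (and therefore the
# token sequence the model is trained on) differs.
print(json.loads(compact) == json.loads(indented))
print(to_compact(indented) == compact)
```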

**Hallucination at 1.5B.** The model confabulates supporting detail for correct answers — in testing it correctly identified MIC as Minimum Inhibitory Concentration but fabricated false historical attribution. This is a fundamental capacity constraint at 1.5B parameters, not addressable through fine-tuning at this scale. Recommended mitigation: use Qwen2.5-3B or 7B as the base for any production use.

**Recursive reasoning failure.** Inherited from Aristaeus Stage 1. Tracing a deep recursive call stack (e.g. Fibonacci f(7)) causes the model to lose the thread and produce no answer. Documented in the Aristaeus model card.

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Stage 2 model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("EphAsad/AristaeusAgent")
tokenizer = AutoTokenizer.from_pretrained("EphAsad/AristaeusAgent")

SYSTEM = """You are a helpful assistant with access to the following tools.

<tools>
[
  {
    "name": "bash",
    "description": "Execute a bash command and return stdout/stderr.",
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "string"}
      },
      "required": ["command"]
    }
  }
]
</tools>

Think carefully before calling any tool. Use <think>...</think> to reason first.
Only call a tool if the task genuinely requires external data, computation, or
system access. Answer directly from knowledge for factual questions."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user",   "content": "How many Python files are in /workspace?"},
]

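# Render the conversation with the chat template, ending on an open assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Sample a response of up to 512 new tokens.
output = model.generate(**inputs, max_new_tokens=512, temperature=0.4,
                        top_p=0.9, do_sample=True)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))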
```

### Expected output format

```
<think>
To count Python files in /workspace I should use bash with find or ls.
The find command with -name "*.py" will be more reliable.
</think>
<tool_call>
{"name": "bash", "arguments": {"command": "find /workspace -name '*.py' | wc -l"}}
</tool_call>
```
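
Once the model emits a call in this format, the caller still has to execute it and return the result. The following is a minimal sketch of that step under stated assumptions: `run_tool_calls` and the `tools` registry are hypothetical names, and returning results as `tool`-role messages assumes the tokenizer's chat template renders that role, which should be verified for this checkpoint.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def run_tool_calls(assistant_text, tools):
    """Execute each parsed call and return tool-role messages with the results.

    `tools` maps a tool name to a Python callable taking keyword arguments.
    Failures are reported back as text so the model can react to them.
    """
    results = []
    for payload in TOOL_CALL_RE.findall(assistant_text):
        try:
            call = json.loads(payload)
            fn = tools[call["name"]]
            content = str(fn(**call.get("arguments", {})))
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            content = f"error: {exc!r}"
        results.append({"role": "tool", "content": content})
    return results


# Usage with a stand-in tool; a real deployment would sandbox shell access.
def fake_bash(command):
    return f"(pretend output of: {command})"


reply = (
    "<think>\nCount the files with find.\n</think>\n"
    "<tool_call>\n"
    '{"name": "bash", "arguments": {"command": "find /workspace -name \'*.py\' | wc -l"}}\n'
    "</tool_call>"
)
for message in run_tool_calls(reply, {"bash": fake_bash}):
    print(message)
```

The returned messages can be appended to `messages` after the assistant turn, and the chat template applied again to obtain the model's next step.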

---

## Two-Stage Pipeline

This model is the output of a reproducible two-stage training pipeline:

```
Qwen2.5-1.5B-Instruct
        │
        ▼  Stage 1 — Full fine-tune (81 min, A100)
        │  OpenThoughts3-1.2M (30k sampled) + Bespoke-Stratos-17k
        │ 
        ▼
   EphAsad/Aristaeus
        │
        ▼  Stage 2 — QLoRA r=16 (1 epoch, A100)
        │  Hermes agent traces + zake7749 + glaive no-call subset
        ▼
   EphAsad/AristaeusAgent  ← this model
```

All training scripts, validation scripts, and benchmark code are available on request.

---

## Design Notes

**Proof of concept, not production.** This model demonstrates that a two-stage reasoning → agentic pipeline is viable at 1.5B parameters with open datasets and a single A100 session. It is not production-ready. For practical deployment the recommended path is to apply the same pipeline to Qwen2.5-3B-Instruct or Qwen2.5-7B-Instruct, which would be expected to mitigate the hallucination and format consistency limitations without changing any training code.

**Dataset licensing.** All training datasets are Apache 2.0. No API-generated outputs from closed models (Claude, GPT-4, Gemini) were used at any stage.

**Deterministic fallback philosophy.** Consistent with prior work in this portfolio (BactAID, FireSOP, Eidos), the model is designed with explicit reasoning before action: `<think>` blocks are not decorative, they are the intended inference path. Deployments should treat absent or trivially short think blocks as a sign of degraded output quality.

---

## Author

Built by **Zain Asad** (Eph) — Senior Microbiology Analyst and Applied AI Engineer.

---

## Licence

Apache 2.0 — consistent with the base model and all training datasets used.