---
language:
- en
license: apache-2.0
base_model: EphAsad/Aristaeus
tags:
- agentic
- tool-calling
- reasoning
- fine-tuned
- qwen2.5
- qlora
- chain-of-thought
- function-calling
- unsloth
datasets:
- DJLougen/hermes-agent-traces-filtered
- lambda/hermes-agent-reasoning-traces
- zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
- glaiveai/glaive-function-calling-v2
pipeline_tag: text-generation
---

# AristaeusAgent

**AristaeusAgent** is a QLoRA fine-tune of [EphAsad/Aristaeus](https://huggingface.co/EphAsad/Aristaeus), itself a full fine-tune of Qwen2.5-1.5B-Instruct, trained to add structured agentic tool use on top of the reasoning foundation established in Stage 1.

This is Stage 2 of a two-stage training pipeline:

- **Stage 1 (Aristaeus):** Chain-of-thought reasoning (OpenThoughts3 + Bespoke-Stratos)
- **Stage 2 (AristaeusAgent, this model):** Agentic tool-calling with `<think>`-before-act behaviour

The model uses the Hermes-style tool-call format: reasoning in `<think>...</think>` blocks, tool invocations as `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`.

---

## Training

| Detail | Value |
|---|---|
| Base model | EphAsad/Aristaeus (Stage 1) |
| Fine-tune type | QLoRA (4-bit base, bf16 adapter) |
| LoRA rank | r=16, alpha=32 |
| LoRA targets | q/k/v/o/gate/up/down projections |
| Hardware | NVIDIA A100-SXM4-40GB |
| Epochs | 1 |
| Sequence length | 8192 tokens |
| Effective batch size | 16 (batch 1 × grad accum 16) |
| Learning rate | 5e-6 (cosine schedule) |
| Framework | Unsloth + TRL SFTTrainer |
| Packing | Disabled (agentic trajectories must stay intact) |

### Why QLoRA over full fine-tune

Stage 1 established chain-of-thought reasoning with a full fine-tune over 29,400 examples (81 minutes of training). Stage 2 uses QLoRA (r=16, LR=5e-6, 1 epoch) specifically to preserve those Stage 1 gains. A full fine-tune at Stage 2 data scale would risk overwriting the reasoning capability rather than extending it.
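
To give a rough sense of how small the Stage 2 update is, the sketch below counts the trainable LoRA parameters implied by r=16 on the seven projection types listed in the training table. The layer dimensions are assumptions taken from the published Qwen2.5-1.5B configuration (hidden size 1536, MLP size 8960, 28 layers, 2 key/value heads of dimension 128). They are not stated in this card, so check them against the checkpoint's `config.json` before relying on the numbers.

```python
# Rough count of trainable LoRA parameters for the Stage 2 adapter.
# ASSUMPTION: the dimensions below come from the published Qwen2.5-1.5B
# config, not from this model card; verify against the real config.json.
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
HEAD_DIM = 128
NUM_KV_HEADS = 2
KV_DIM = HEAD_DIM * NUM_KV_HEADS  # grouped-query attention: narrower k/v

LORA_RANK = 16   # from the training table
LORA_ALPHA = 32  # from the training table

# (in_features, out_features) of each targeted projection in one layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in PROJECTIONS.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_RANK  # multiplier applied to the low-rank update

print(f"LoRA params per layer: {per_layer:,}")
print(f"Total trainable LoRA params: {total_trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapter holds about 18.5M trainable parameters, on the order of 1% of a 1.5B-parameter model, which fits the stated goal of extending the Stage 1 behaviour without rewriting it.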
### Datasets

**[DJLougen/hermes-agent-traces-filtered](https://huggingface.co/datasets/DJLougen/hermes-agent-traces-filtered)**: Quality-filtered subset of the Hermes agent traces. Filtered for non-trivial `<think>` blocks (>50 chars), valid tool-call JSON, and evidence of deliberate tool selection reasoning. Primary training signal.

**[lambda/hermes-agent-reasoning-traces](https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces)**: Multi-turn agentic trajectories generated via the Hermes Agent harness using Kimi-K2.5 and GLM-5.1 (both run locally, no API ToS concerns). 2,000 rows from each config. Apache 2.0.

**[zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory](https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory)**: Multi-turn trajectories with per-row quality scores. Filtered to `score >= 0.7` (~2,000 rows). Includes `reasoning_content` fields converted to `<think>` blocks during normalisation. Apache 2.0.

**[glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2)**: No-call subset only (~1,200 rows after filtering). Tool schemas are present in the system prompt, but the assistant answers directly from knowledge. These negative examples teach tool refusal: the model learns that having tools available does not mean they must be called. Rows containing markdown code blocks were explicitly excluded to prevent format bleed into tool-call syntax. Apache 2.0.

---

## Evaluation

AristaeusAgent was evaluated against Aristaeus (base) using a custom 50-test benchmark across 5 capability dimensions (max 3 points per test, 150 total). Results below are from the final v2 training run (5-dataset stack including glaive no-call subset).

### Overall

| Model | Score | % |
|---|---|---|
| Aristaeus (Stage 1 base) | 68 / 150 | 45.3% |
| AristaeusAgent | 94 / 150 | 62.7% |
| Delta | **+26 points** | **+17.4pp** |

AristaeusAgent wins 23 tests, draws 19, loses 8.
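
The headline numbers above follow from simple arithmetic on the 50-test, 3-points-per-test design. The sketch below recomputes them and shows one plausible way a win/draw/loss tally could be derived from per-test scores. The `tally` helper and the toy score lists are hypothetical illustrations, not the benchmark code or its data.

```python
NUM_TESTS = 50
MAX_POINTS_PER_TEST = 3
MAX_TOTAL = NUM_TESTS * MAX_POINTS_PER_TEST  # 150 points available


def percent(score: int, total: int = MAX_TOTAL) -> float:
    """Score as a percentage of the maximum, rounded to one decimal place."""
    return round(100 * score / total, 1)


# Totals reported in the Overall table
base_total, agent_total = 68, 94
base_pct = percent(base_total)
agent_pct = percent(agent_total)
delta_points = agent_total - base_total


def tally(base_scores, agent_scores):
    """Count tests where the agent scores higher, equal, or lower than base."""
    pairs = list(zip(agent_scores, base_scores))
    wins = sum(a > b for a, b in pairs)
    draws = sum(a == b for a, b in pairs)
    losses = sum(a < b for a, b in pairs)
    return wins, draws, losses


# HYPOTHETICAL toy scores for four tests, only to exercise the helper
toy_base = [3, 2, 0, 1]
toy_agent = [1, 2, 3, 3]

print(base_pct, agent_pct, delta_points)
print(tally(toy_base, toy_agent))
```

As a sanity check on the reported tally, 23 wins, 19 draws and 8 losses add up to the 50 tests in the benchmark.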
### By dimension

| Dimension | Base | Base% | Agent | Agent% | Delta |
|---|---|---|---|---|---|
| Reasoning Quality | 5/30 | 16.7% | 25/30 | 83.3% | **+20** |
| Multi-Step Planning | 8/30 | 26.7% | 22/30 | 73.3% | **+14** |
| Tool Selection | 11/30 | 36.7% | 18/30 | 60.0% | **+7** |
| Argument Construction | 22/30 | 73.3% | 23/30 | 76.7% | +1 |
| Tool Refusal | 22/30 | 73.3% | 6/30 | 20.0% | -16 |

### Regressions (8 tests)

| Test | Dimension | Base | Agent | Note |
|---|---|---|---|---|
| B09: run `grep -r 'error' /var/log/` | Argument Construction | 3 | 1 | Searched web instead of running bash |
| D01: What is 2+2? | Tool Refusal | 2 | 0 | Called a tool unnecessarily |
| D02: What language is Python? | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D03: Explain REST API | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D05: Write a haiku | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D07: Summarise AMR | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D09: SOLID principles | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| E10: Weather → write report | Multi-Step Planning | 2 | 1 | Partial step sequencing |

### Tool Refusal: the remaining limitation

Tool Refusal is the only dimension where AristaeusAgent scores lower than base (6/30 vs 22/30). The glaive no-call dataset reduced the over-triggering seen in v1 but did not eliminate it. The model still calls tools on static knowledge questions it should answer directly.

This is a persistent training data imbalance: agentic trajectories dominate and almost always end in a tool call, so the model's prior in a tool-bearing system prompt is to use one. Partially addressable at inference time with an explicit system prompt instruction:

```
Only call a tool if the task genuinely requires external data, computation, or system access.
Answer directly from knowledge for factual questions, definitions, and simple arithmetic.
```

### Spot check: format confirmation

```
Prompt: "What were the top AI news stories this week?"

── Aristaeus (base) ──
...{"name": "web_search", "arguments": {"query": "..."}}   ← raw JSON, no tags

── AristaeusAgent ──
<think>
To find the top AI news stories this week, I should search for recent articles.
Let me use the web_search tool.
</think>
<tool_call>
{"name": "web_search", "arguments": {"query": "top AI news stories this week"}}
</tool_call>
```

AristaeusAgent correctly learned the Hermes `<think>` + `<tool_call>` format. The base model produces raw JSON without wrapper tags. This format difference explains why several correct base-model answers scored 0 in earlier benchmark runs.

---

## Honest Limitations

**Tool over-triggering (primary limitation).** AristaeusAgent scores 6/30 on Tool Refusal vs the base model's 22/30. The model consistently calls tools on static knowledge questions ("explain REST", "write a haiku", "what are the SOLID principles") where it should answer directly. Root cause: agentic training data is dominated by trajectories that end in tool calls. The glaive no-call subset partially addressed this, but not sufficiently at the data scale used. Partially mitigable via system prompt instruction; fully addressable with a larger proportion of refusal examples in training or a larger base model.

**Tool-call format resolved in v2.** An earlier training run (v1, lambda dataset only) produced a model that emitted markdown code blocks and hallucinated tool syntax. The v2 run corrected this: AristaeusAgent now reliably produces the `<think>...</think>` + `<tool_call>{...}</tool_call>` format. Root cause of the v1 failure was `indent=2` in JSON serialisation during zake7749 normalisation, producing a different token sequence from the compact JSON used by all other datasets.

**Hallucination at 1.5B.** The model confabulates supporting detail for correct answers. In testing it correctly identified MIC as Minimum Inhibitory Concentration but fabricated a false historical attribution.
This is a fundamental capacity constraint at 1.5B parameters, not addressable through fine-tuning at this scale. Recommended mitigation: use Qwen2.5-3B or 7B as the base for any production use.

**Recursive reasoning failure.** Inherited from Aristaeus Stage 1. Deep recursive call stack tracing (e.g. Fibonacci f(7)) causes the model to lose the thread and produce no answer. Documented in the Aristaeus model card.

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EphAsad/AristaeusAgent")
tokenizer = AutoTokenizer.from_pretrained("EphAsad/AristaeusAgent")

# System prompt: tool schemas plus an explicit instruction that limits over-triggering.
SYSTEM = """You are a helpful assistant with access to the following tools.

[
  {
    "name": "bash",
    "description": "Execute a bash command and return stdout/stderr.",
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "string"}
      },
      "required": ["command"]
    }
  }
]

Think carefully before calling any tool. Use <think>...</think> to reason first.
Only call a tool if the task genuinely requires external data, computation, or system access.
Answer directly from knowledge for factual questions."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "How many Python files are in /workspace?"},
]

# Render the chat template to text, then tokenize and move tensors to the model's device.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.4,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### Expected output format

```
<think>
To count Python files in /workspace I should use bash with find or ls.
The find command with -name "*.py" will be more reliable.
</think>
<tool_call>
{"name": "bash", "arguments": {"command": "find /workspace -name '*.py' | wc -l"}}
</tool_call>
```

---

## Two-Stage Pipeline

This model is the output of a reproducible two-stage training pipeline:

```
Qwen2.5-1.5B-Instruct
        │
        ▼
Stage 1: Full fine-tune (81 min, A100)
        │  OpenThoughts3-1.2M (30k sampled) + Bespoke-Stratos-17k
        │
        ▼
EphAsad/Aristaeus
        │
        ▼
Stage 2: QLoRA r=16 (1 epoch, A100)
        │  Hermes agent traces + zake7749 + glaive no-call subset
        ▼
EphAsad/AristaeusAgent  ← this model
```

All training scripts, validation scripts, and benchmark code are available on request.

---

## Design Notes

**Proof of concept, not production.** This model demonstrates that a two-stage reasoning → agentic pipeline is viable at 1.5B parameters with open datasets and a single A100 session. It is not production-ready. For practical deployment the recommended path is to apply the same pipeline to Qwen2.5-3B-Instruct or Qwen2.5-7B-Instruct, which would address the hallucination and format consistency limitations without changing any training code.

**Dataset licensing.** All training datasets are Apache 2.0. No API-generated outputs from closed models (Claude, GPT-4, Gemini) were used at any stage.

**Deterministic fallback philosophy.** Consistent with prior work in this portfolio (BactAID, FireSOP, Eidos), the model is designed with explicit reasoning before action: `<think>` blocks are not decorative, they are the intended inference path. Deployments should treat absent or trivially short think blocks as a quality signal.

---

## Author

Built by **Zain Asad** (Eph), Senior Microbiology Analyst and Applied AI Engineer.

---

## Licence

Apache 2.0, consistent with the base model and all training datasets used.
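
---

## Appendix: output format check

The Design Notes recommend treating absent or trivially short think blocks as a quality signal. A minimal sketch of such a check is shown below. It assumes the Hermes convention of `<think>...</think>` and `<tool_call>...</tool_call>` wrappers around the reasoning and the tool-call JSON. The `inspect_output` helper and the 50-character threshold (borrowed from the dataset filter described under Datasets) are hypothetical choices for illustration, not part of the released code.

```python
import json
import re

# ASSUMPTION: Hermes-style wrapper tags; adjust if your chat template differs.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
MIN_THINK_CHARS = 50  # hypothetical threshold, mirrors the ">50 chars" dataset filter


def inspect_output(text: str) -> dict:
    """Extract reasoning and tool calls and flag weak or malformed output."""
    think_match = THINK_RE.search(text)
    thinking = think_match.group(1).strip() if think_match else ""

    calls, malformed = [], 0
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            malformed += 1
            continue
        # A usable call needs a name and a dict of arguments.
        if isinstance(call, dict) and "name" in call and isinstance(call.get("arguments"), dict):
            calls.append(call)
        else:
            malformed += 1

    return {
        "thinking": thinking,
        "tool_calls": calls,
        "malformed_calls": malformed,
        "low_quality": len(thinking) <= MIN_THINK_CHARS,
    }


sample = (
    "<think>\nTo count Python files in /workspace I should use bash with find or ls.\n"
    "The find command with -name \"*.py\" will be more reliable.\n</think>\n"
    "<tool_call>\n"
    '{"name": "bash", "arguments": {"command": "find /workspace -name \'*.py\' | wc -l"}}\n'
    "</tool_call>"
)

report = inspect_output(sample)
print(report["tool_calls"][0]["name"], report["low_quality"])
```

A caller could retry generation or fall back to a direct answer when `low_quality` is true or `malformed_calls` is non-zero. That policy is left to the deployment.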