	---
	language:
	- en
	license: apache-2.0
	base_model: EphAsad/Aristaeus
	tags:
	- agentic
	- tool-calling
	- reasoning
	- fine-tuned
	- qwen2.5
	- qlora
	- chain-of-thought
	- function-calling
	- unsloth
	datasets:
	- DJLougen/hermes-agent-traces-filtered
	- lambda/hermes-agent-reasoning-traces
	- zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
	- glaiveai/glaive-function-calling-v2
	pipeline_tag: text-generation
	---

	# AristaeusAgent

	AristaeusAgent is a QLoRA fine-tune of [EphAsad/Aristaeus](https://huggingface.co/EphAsad/Aristaeus) — itself a full fine-tune of Qwen2.5-1.5B-Instruct — trained to add structured agentic tool-use on top of the reasoning foundation established in Stage 1.

	This is Stage 2 of a two-stage training pipeline:

	- Stage 1 — Aristaeus: Chain-of-thought reasoning (OpenThoughts3 + Bespoke-Stratos)
	- Stage 2 — AristaeusAgent (this model): Agentic tool-calling with `<think>`-before-act behaviour

	The model uses Hermes-style tool-call format: reasoning in `<think>...</think>` blocks, tool invocations as `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`.
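
The format described above can be consumed with a small parser. The following is a minimal sketch, not the author's tooling: the function name `parse_agent_output` and the choice to skip malformed payloads silently are assumptions made for illustration.

```python
import json
import re

# Non-greedy matches so multiple blocks in one completion are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_agent_output(text):
    """Return (reasoning, calls): the first <think> body and all valid tool calls."""
    think = THINK_RE.search(text)
    reasoning = think.group(1).strip() if think else None

    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # assumption: malformed payloads are skipped, not raised
        if isinstance(call, dict) and "name" in call and "arguments" in call:
            calls.append(call)
    return reasoning, calls


if __name__ == "__main__":
    sample = (
        "<think>I need current data, so a search is appropriate.</think>\n"
        '<tool_call>\n{"name": "web_search", "arguments": {"query": "AI news"}}\n</tool_call>'
    )
    print(parse_agent_output(sample))
```

A completion with no tags at all yields `(None, [])`, which a caller can treat as a direct answer.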

	---

	## Training

	| Detail | Value |
	|---|---|
	| Base model | EphAsad/Aristaeus (Stage 1) |
	| Fine-tune type | QLoRA (4-bit base, bf16 adapter) |
	| LoRA rank | r=16, alpha=32 |
	| LoRA targets | q/k/v/o/gate/up/down projections |
	| Hardware | NVIDIA A100-SXM4-40GB |
	| Epochs | 1 |
	| Sequence length | 8192 tokens |
	| Effective batch size | 16 (batch 1 × grad accum 16) |
	| Learning rate | 5e-6 (cosine schedule) |
	| Framework | Unsloth + TRL SFTTrainer |
	| Packing | Disabled (agentic trajectories must stay intact) |
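
To give a sense of what these settings imply, the sketch below estimates the number of trainable adapter parameters. The layer dimensions are assumptions taken from the published Qwen2.5-1.5B configuration, not from this card, so treat the result as an estimate rather than a reported figure.

```python
# Assumed Qwen2.5-1.5B dimensions (from the upstream model config, not this card).
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 256  # 2 key/value heads x 128 head dim (grouped-query attention)

# Values stated in the training table.
RANK = 16
ALPHA = 32
BATCH_SIZE = 1
GRAD_ACCUM = 16

# (in_features, out_features) for each LoRA target projection.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank=RANK):
    """LoRA adds A (rank x in) and B (out x rank) per adapted matrix."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # standard LoRA scaling factor applied to B @ A
effective_batch = BATCH_SIZE * GRAD_ACCUM

print(f"adapter params per layer: {per_layer:,}")
print(f"adapter params total:     {total:,}")
print(f"LoRA scaling (alpha/r):   {scaling}")
print(f"effective batch size:     {effective_batch}")
```

Under these assumptions the adapters come to roughly 18.5M parameters, a small fraction of the 1.5B base, which is consistent with the goal of extending rather than replacing Stage 1 behaviour.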

	### Why QLoRA over full fine-tune

	Stage 1 established chain-of-thought reasoning through a full fine-tune on 29,400 examples (81 minutes of training). Stage 2 uses QLoRA (r=16, LR=5e-6, 1 epoch) specifically to preserve those gains: the base weights stay frozen and only the low-rank adapters are updated. A full fine-tune at Stage 2 data scale would risk overwriting the reasoning capability rather than extending it.

	### Datasets

	[DJLougen/hermes-agent-traces-filtered](https://huggingface.co/datasets/DJLougen/hermes-agent-traces-filtered) — Quality-filtered subset of the Hermes agent traces. Filtered for non-trivial `<think>` blocks (>50 chars), valid tool-call JSON, and evidence of deliberate tool selection reasoning. Primary training signal.

	[lambda/hermes-agent-reasoning-traces](https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces) — Multi-turn agentic trajectories generated via the Hermes Agent harness using Kimi-K2.5 and GLM-5.1 (both run locally, no API ToS concerns). 2,000 rows from each config. Apache 2.0.

	[zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory](https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory) — Multi-turn trajectories with per-row quality scores. Filtered to `score >= 0.7` (~2,000 rows). Includes `reasoning_content` fields converted to `<think>` blocks during normalisation. Apache 2.0.

	[glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) — No-call subset only (~1,200 rows after filtering). Tool schemas present in the system prompt, assistant answers directly from knowledge. These negative examples teach tool refusal — the model learns that having tools available does not mean they must be called. Rows containing markdown code blocks were explicitly excluded to prevent format bleed into tool-call syntax. Apache 2.0.
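
The filtering rules described for these datasets can be sketched as a few predicates. This is an illustrative reconstruction, not the author's preprocessing script: the function names, the `score` key lookup, and the exact checks are assumptions, and the "deliberate tool selection reasoning" criterion is not modelled here because the card does not say how it was measured.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FENCE = "`" * 3  # markdown code fence marker


def has_nontrivial_think(text, min_chars=50):
    """True if at least one <think> block is longer than min_chars characters."""
    return any(len(body.strip()) > min_chars for body in THINK_RE.findall(text))


def tool_calls_are_valid(text):
    """True if every <tool_call> payload is JSON with 'name' and 'arguments'."""
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            return False
        if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
            return False
    return True


def keep_agentic_turn(text):
    """Quality filter for agent traces: real reasoning plus well-formed calls."""
    return has_nontrivial_think(text) and tool_calls_are_valid(text)


def keep_scored_row(row, threshold=0.7):
    """Keep rows whose per-row quality score meets the threshold."""
    return row.get("score", 0.0) >= threshold


def wrap_reasoning(reasoning_content, answer):
    """Convert a separate reasoning field into an inline <think> block."""
    return f"<think>\n{reasoning_content.strip()}\n</think>\n{answer}"


def keep_no_call_row(assistant_text):
    """No-call examples: no tool call and no markdown code fences."""
    return "<tool_call>" not in assistant_text and FENCE not in assistant_text
```

Keeping each rule as its own predicate makes it straightforward to report how many rows each filter removes.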

	---

	## Evaluation

	AristaeusAgent was evaluated against Aristaeus (base) using a custom 50-test benchmark across 5 capability dimensions (max 3 points per test, 150 total). Results below are from the final v2 training run (5-dataset stack including glaive no-call subset).

	### Overall

	| Model | Score | % |
	|---|---|---|
	| Aristaeus (Stage 1 base) | 68 / 150 | 45.3% |
	| AristaeusAgent | 94 / 150 | 62.7% |
	| Delta | +26 points | +17.3pp |

	AristaeusAgent wins 23 tests, draws 19, loses 8.

	### By dimension

	| Dimension | Base | Base% | Agent | Agent% | Delta |
	|---|---|---|---|---|---|
	| Reasoning Quality | 5/30 | 16.7% | 25/30 | 83.3% | +20 |
	| Multi-Step Planning | 8/30 | 26.7% | 22/30 | 73.3% | +14 |
	| Tool Selection | 11/30 | 36.7% | 18/30 | 60.0% | +7 |
	| Argument Construction | 22/30 | 73.3% | 23/30 | 76.7% | +1 |
	| Tool Refusal | 22/30 | 73.3% | 6/30 | 20.0% | -16 |
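
As a cross-check, the per-dimension rows can be summed to recover the overall totals. The snippet below is a small sketch using only the numbers in the table; the variable names are illustrative.

```python
MAX_PER_DIMENSION = 30  # 10 tests x 3 points each

# (base points, agent points) per dimension, copied from the table above.
scores = {
    "Reasoning Quality": (5, 25),
    "Multi-Step Planning": (8, 22),
    "Tool Selection": (11, 18),
    "Argument Construction": (22, 23),
    "Tool Refusal": (22, 6),
}


def pct(points, maximum):
    """Percentage rounded to one decimal place."""
    return round(100 * points / maximum, 1)


for name, (base, agent) in scores.items():
    print(
        f"{name:<22} base {pct(base, MAX_PER_DIMENSION):>5}%  "
        f"agent {pct(agent, MAX_PER_DIMENSION):>5}%  delta {agent - base:+d}"
    )

max_total = MAX_PER_DIMENSION * len(scores)
base_total = sum(base for base, _ in scores.values())
agent_total = sum(agent for _, agent in scores.values())
print(f"base  {base_total}/{max_total} ({pct(base_total, max_total)}%)")
print(f"agent {agent_total}/{max_total} ({pct(agent_total, max_total)}%)")
```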

	### Regressions (8 tests)

	| Test | Dimension | Base | Agent | Note |
	|---|---|---|---|---|
	| B09: run `grep -r 'error' /var/log/` | Argument Construction | 3 | 1 | Searched web instead of running bash |
	| D01: What is 2+2? | Tool Refusal | 2 | 0 | Called a tool unnecessarily |
	| D02: What language is Python? | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
	| D03: Explain REST API | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
	| D05: Write a haiku | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
	| D07: Summarise AMR | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
	| D09: SOLID principles | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
	| E10: Weather → write report | Multi-Step Planning | 2 | 1 | Partial step sequencing |

	### Tool Refusal — the remaining limitation

	Tool Refusal is the only dimension where AristaeusAgent scores lower than base (6/30 vs 22/30). The glaive no-call dataset reduced the over-triggering seen in v1 but did not eliminate it. The model still calls tools on static knowledge questions it should answer directly. This is a persistent training data imbalance — agentic trajectories dominate and almost always end in a tool call, so the model's prior in a tool-bearing system prompt is to use one.

	This can be partially mitigated at inference time with an explicit system prompt instruction:
	```
	Only call a tool if the task genuinely requires external data, computation, or
	system access. Answer directly from knowledge for factual questions, definitions,
	and simple arithmetic.
	```

	### Spot check — format confirmation

	```
	Prompt: "What were the top AI news stories this week?"

	── Aristaeus (base) ──
	...{"name": "web_search", "arguments": {"query": "..."}} ← raw JSON, no tags

	── AristaeusAgent ──
	<think> To find the top AI news stories this week, I should search for recent
	articles. Let me use the web_search tool. </think>
	<tool_call>
	{"name": "web_search", "arguments": {"query": "top AI news stories this week"}}
	</tool_call>
	```

	AristaeusAgent correctly learned the Hermes `<think>` + `<tool_call>` format. The base model produces raw JSON without wrapper tags. This format difference explains why several base model correct answers scored 0 in earlier benchmark runs.

	---

	## Honest Limitations

	Tool over-triggering (primary limitation). AristaeusAgent scores 6/30 on Tool Refusal vs the base model's 22/30. The model consistently calls tools on static knowledge questions — "explain REST", "write a haiku", "what are the SOLID principles" — where it should answer directly. Root cause: agentic training data is dominated by trajectories that end in tool calls. The glaive no-call subset partially addressed this but not sufficiently at the data scale used. Partially mitigable via system prompt instruction; fully addressable with a larger proportion of refusal examples in training or a larger base model.

	Tool-call format resolved in v2. An earlier training run (v1, lambda dataset only) produced a model that emitted markdown code blocks and hallucinated tool syntax. The v2 run corrected this — AristaeusAgent now reliably produces `<think>...</think>` + `<tool_call>{...}</tool_call>` format. Root cause of the v1 failure was `indent=2` in JSON serialisation during zake7749 normalisation, producing a different token sequence from the compact JSON used by all other datasets.
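
The serialisation mismatch is easy to reproduce with the standard library. The sketch below shows how `indent=2` changes the emitted string for an identical payload, and one way to normalise everything to a single-line form; the helper name `normalise_tool_call` is hypothetical and not taken from the author's scripts.

```python
import json

call = {"name": "web_search", "arguments": {"query": "top AI news stories this week"}}

single_line = json.dumps(call)        # default separators, no newlines
indented = json.dumps(call, indent=2)  # multi-line, two-space indentation

# Same data, different character (and therefore token) sequences.
assert json.loads(single_line) == json.loads(indented)
assert single_line != indented


def normalise_tool_call(payload):
    """Re-serialise any valid JSON payload into the single-line form."""
    return json.dumps(json.loads(payload), ensure_ascii=False)


print(single_line)
print(indented)
print(normalise_tool_call(indented) == single_line)
```

Running every dataset through one serialiser before training keeps the target format uniform regardless of how each source stored its tool calls.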

	Hallucination at 1.5B. The model confabulates supporting detail for correct answers: in testing it correctly identified MIC as Minimum Inhibitory Concentration but fabricated a false historical attribution. This is most likely a capacity constraint at 1.5B parameters rather than something fine-tuning at this scale can fix. Recommended mitigation: use Qwen2.5-3B or 7B as the base for any production use.

	Recursive reasoning failure. Inherited from Aristaeus Stage 1. Tracing a deep recursive call stack (e.g. Fibonacci f(7)) causes the model to lose the thread and produce no answer. Documented in the Aristaeus model card.

	---

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained("EphAsad/AristaeusAgent")
	tokenizer = AutoTokenizer.from_pretrained("EphAsad/AristaeusAgent")

	SYSTEM = """You are a helpful assistant with access to the following tools.

	<tools>
	[
	{
	"name": "bash",
	"description": "Execute a bash command and return stdout/stderr.",
	"parameters": {
	"type": "object",
	"properties": {
	"command": {"type": "string"}
	},
	"required": ["command"]
	}
	}
	]
	</tools>

	Think carefully before calling any tool. Use <think>...</think> to reason first.
	Only call a tool if the task genuinely requires external data, computation, or
	system access. Answer directly from knowledge for factual questions."""

	messages = [
	{"role": "system", "content": SYSTEM},
	{"role": "user", "content": "How many Python files are in /workspace?"},
	]

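	# Render the conversation with the model's chat template, ending at the assistant turn.
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(text, return_tensors="pt").to(model.device)
	# Sample up to 512 new tokens with low-temperature nucleus sampling.
	output = model.generate(**inputs, max_new_tokens=512, temperature=0.4,
	                        top_p=0.9, do_sample=True)
	# Decode only the newly generated tokens, skipping the prompt.
	print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))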
	```

	### Expected output format

	```
	<think>
	To count Python files in /workspace I should use bash with find or ls.
	The find command with -name "*.py" will be more reliable.
	</think>
	<tool_call>
	{"name": "bash", "arguments": {"command": "find /workspace -name '*.py' | wc -l"}}
	</tool_call>
	```
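
Once the model emits a tool call in this format, the caller has to execute it and feed the result back. The following is a minimal sketch of that step under stated assumptions: `fake_bash` is a hypothetical stand-in that returns a canned result instead of running a shell, and the `"tool"` message role is assumed to be accepted by the chat template, which this card does not confirm.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def fake_bash(command):
    """Hypothetical stand-in for a sandboxed shell; returns a canned result."""
    return {"stdout": "42\n", "stderr": "", "returncode": 0}


# Maps tool names from the system prompt to Python callables.
REGISTRY = {"bash": fake_bash}


def dispatch(completion, registry):
    """Run each tool call found in the completion and return result messages."""
    messages = []
    for payload in TOOL_CALL_RE.findall(completion):
        call = json.loads(payload)
        handler = registry.get(call["name"])
        if handler is None:
            result = {"error": f"unknown tool: {call['name']}"}
        else:
            result = handler(**call["arguments"])
        # Assumption: results go back to the model as a "tool" role message.
        messages.append(
            {"role": "tool", "name": call["name"], "content": json.dumps(result)}
        )
    return messages


if __name__ == "__main__":
    completion = (
        "<think>Use find to count Python files.</think>\n"
        '<tool_call>\n{"name": "bash", "arguments": '
        '{"command": "find /workspace -name \'*.py\' | wc -l"}}\n</tool_call>'
    )
    print(dispatch(completion, REGISTRY))
```

In a real deployment the stand-in would be replaced with a sandboxed executor, and the returned messages appended to the conversation before the next generation call.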

	---

	## Two-Stage Pipeline

	This model is the output of a reproducible two-stage training pipeline:

	```
	Qwen2.5-1.5B-Instruct
	│
	▼ Stage 1 — Full fine-tune (81 min, A100)
	│ OpenThoughts3-1.2M (30k sampled) + Bespoke-Stratos-17k
	│
	▼
	EphAsad/Aristaeus
	│
	▼ Stage 2 — QLoRA r=16 (1 epoch, A100)
	│ Hermes agent traces + zake7749 + glaive no-call subset
	▼
	EphAsad/AristaeusAgent ← this model
	```

	All training scripts, validation scripts, and benchmark code are available on request.

	---

	## Design Notes

	Proof of concept, not production. This model demonstrates that a two-stage reasoning → agentic pipeline is viable at 1.5B parameters with open datasets and a single A100 session. It is not production-ready. For practical deployment the recommended path is to apply the same pipeline to Qwen2.5-3B-Instruct or Qwen2.5-7B-Instruct, which is expected to reduce the hallucination and format consistency limitations without requiring changes to the training code.

	Dataset licensing. All training datasets are Apache 2.0. No API-generated outputs from closed models (Claude, GPT-4, Gemini) were used at any stage.

	Deterministic fallback philosophy. Consistent with prior work in this portfolio (BactAID, FireSOP, Eidos), the model is designed with explicit reasoning before action — `<think>` blocks are not decorative, they are the intended inference path. Deployments should treat absent or trivially short think blocks as a quality signal.
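
One way a deployment might act on this is a lightweight guard that labels each completion by the state of its reasoning block. This is a sketch only: the function name `think_signal` and the 50-character threshold (borrowed from the dataset filter described earlier) are assumptions, not a documented part of the model.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def think_signal(completion, min_chars=50):
    """Classify the reasoning in a completion as 'missing', 'short', or 'ok'."""
    match = THINK_RE.search(completion)
    if match is None:
        return "missing"
    if len(match.group(1).strip()) < min_chars:
        return "short"
    return "ok"


if __name__ == "__main__":
    print(think_signal('<tool_call>{"name": "bash", "arguments": {}}</tool_call>'))
    print(think_signal("<think>Search.</think>"))
```

A caller could log "missing" and "short" cases, retry them, or route them for review rather than executing the tool call directly.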

	---

	## Author

	Built by Zain Asad (Eph) — Senior Microbiology Analyst and Applied AI Engineer.

	---

	## Licence

	Apache 2.0 — consistent with the base model and all training datasets used.