---
language:
- en
license: apache-2.0
base_model: EphAsad/Aristaeus
tags:
- agentic
- tool-calling
- reasoning
- fine-tuned
- qwen2.5
- qlora
- chain-of-thought
- function-calling
- unsloth
datasets:
- DJLougen/hermes-agent-traces-filtered
- lambda/hermes-agent-reasoning-traces
- zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
- glaiveai/glaive-function-calling-v2
pipeline_tag: text-generation
---
# AristaeusAgent
**AristaeusAgent** is a QLoRA fine-tune of [EphAsad/Aristaeus](https://huggingface.co/EphAsad/Aristaeus), itself a full fine-tune of Qwen2.5-1.5B-Instruct, trained to add structured agentic tool use on top of the reasoning foundation established in Stage 1.

This is Stage 2 of a two-stage training pipeline:

- **Stage 1, Aristaeus:** Chain-of-thought reasoning (OpenThoughts3 + Bespoke-Stratos)
- **Stage 2, AristaeusAgent (this model):** Agentic tool-calling with `<think>`-before-act behaviour

The model uses the Hermes-style tool-call format: reasoning in `<think>...</think>` blocks, tool invocations as `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`.

---
## Training
| Detail | Value |
|---|---|
| Base model | EphAsad/Aristaeus (Stage 1) |
| Fine-tune type | QLoRA (4-bit base, bf16 adapter) |
| LoRA rank | r=16, alpha=32 |
| LoRA targets | q/k/v/o/gate/up/down projections |
| Hardware | NVIDIA A100-SXM4-40GB |
| Epochs | 1 |
| Sequence length | 8192 tokens |
| Effective batch size | 16 (batch 1 × grad accum 16) |
| Learning rate | 5e-6 (cosine schedule) |
| Framework | Unsloth + TRL SFTTrainer |
| Packing | Disabled: agentic trajectories must stay intact |
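
As a rough illustration of how the numbers in the table relate, the sketch below derives the effective batch size and the standard LoRA scaling factor (alpha / r). The `optimizer_steps` helper and the 8,000-row figure are hypothetical and only show the arithmetic; they are not taken from the training scripts, and the trainer's exact step count may differ slightly.

```python
import math

# Values copied from the training table above.
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 16
LORA_RANK = 16
LORA_ALPHA = 32


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of sequences that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def lora_scaling(alpha: int, rank: int) -> float:
    """Standard LoRA multiplies the low-rank update by alpha / r."""
    return alpha / rank


def optimizer_steps(num_examples: int, batch: int, epochs: int = 1) -> int:
    """Hypothetical helper: approximate optimizer updates for a dataset size."""
    return math.ceil(num_examples / batch) * epochs


batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS)
print(batch)                                # 16
print(lora_scaling(LORA_ALPHA, LORA_RANK))  # 2.0
print(optimizer_steps(8000, batch))         # 500 (illustrative row count only)
```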
### Why QLoRA over full fine-tune
Stage 1 established chain-of-thought reasoning with a full fine-tune over 29,400 examples (81 minutes of training). Stage 2 uses QLoRA (r=16, LR=5e-6, 1 epoch) specifically to preserve those gains: the base weights stay frozen and only the low-rank adapters are updated. A full fine-tune at Stage 2 data scale would risk overwriting the reasoning capability rather than extending it.
### Datasets
**[DJLougen/hermes-agent-traces-filtered](https://huggingface.co/datasets/DJLougen/hermes-agent-traces-filtered):** Quality-filtered subset of the Hermes agent traces. Filtered for non-trivial `<think>` blocks (>50 chars), valid tool-call JSON, and evidence of deliberate tool-selection reasoning. Primary training signal.

**[lambda/hermes-agent-reasoning-traces](https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces):** Multi-turn agentic trajectories generated via the Hermes Agent harness using Kimi-K2.5 and GLM-5.1 (both run locally, no API ToS concerns). 2,000 rows from each config. Apache 2.0.

**[zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory](https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory):** Multi-turn trajectories with per-row quality scores. Filtered to `score >= 0.7` (~2,000 rows). Includes `reasoning_content` fields, converted to `<think>` blocks during normalisation. Apache 2.0.

**[glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2):** No-call subset only (~1,200 rows after filtering). Tool schemas are present in the system prompt, but the assistant answers directly from knowledge. These negative examples teach tool refusal: the model learns that having tools available does not mean they must be called. Rows containing markdown code blocks were explicitly excluded to prevent format bleed into tool-call syntax. Apache 2.0.
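
The filtering rules described above can be sketched in plain Python. This is a minimal illustration, not the actual normalisation code: the function names, the `score` key lookup, and the exact `<think>` layout produced by `to_think_block` are assumptions, and the "deliberate tool-selection reasoning" check is not modelled here.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
CODE_FENCE = "`" * 3  # markdown code-fence marker


def has_nontrivial_think(text, min_chars=50):
    """True if at least one <think> block is longer than min_chars."""
    return any(len(block.strip()) > min_chars for block in THINK_RE.findall(text))


def tool_calls_are_valid(text):
    """True if every <tool_call> body is a JSON object with name and arguments."""
    for body in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(body)
        except json.JSONDecodeError:
            return False
        if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
            return False
    return True


def keep_agent_trace(text):
    """Hermes-trace style filter: real reasoning plus well-formed tool calls."""
    return has_nontrivial_think(text) and tool_calls_are_valid(text)


def keep_scored_row(row, min_score=0.7):
    """Score-threshold filter for rows that carry a per-row quality score."""
    return row.get("score", 0.0) >= min_score


def to_think_block(reasoning_content, answer):
    """Fold a separate reasoning field into a <think> block (assumed layout)."""
    return f"<think>\n{reasoning_content.strip()}\n</think>\n{answer}"


def keep_no_call_row(assistant_text):
    """No-call filter: reject rows with tool calls or markdown code blocks."""
    return "<tool_call>" not in assistant_text and CODE_FENCE not in assistant_text
```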
---
## Evaluation
AristaeusAgent was evaluated against Aristaeus (base) using a custom 50-test benchmark across 5 capability dimensions (max 3 points per test, 150 total). Results below are from the final v2 training run (5-dataset stack including glaive no-call subset).
### Overall
| Model | Score | % |
|---|---|---|
| Aristaeus (Stage 1 base) | 68 / 150 | 45.3% |
| AristaeusAgent | 94 / 150 | 62.7% |
| Delta | **+26 points** | **+17.4pp** |

AristaeusAgent wins 23 tests, draws 19, and loses 8.
### By dimension
| Dimension | Base | Base% | Agent | Agent% | Delta |
|---|---|---|---|---|---|
| Reasoning Quality | 5/30 | 16.7% | 25/30 | 83.3% | **+20** |
| Multi-Step Planning | 8/30 | 26.7% | 22/30 | 73.3% | **+14** |
| Tool Selection | 11/30 | 36.7% | 18/30 | 60.0% | **+7** |
| Argument Construction | 22/30 | 73.3% | 23/30 | 76.7% | +1 |
| Tool Refusal | 22/30 | 73.3% | 6/30 | 20.0% | -16 |
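
The overall totals can be re-derived from the per-dimension scores. The sketch below is only a consistency check on the published numbers; `summarise` and `dimension_deltas` are hypothetical helpers, not part of the benchmark code.

```python
# (base, agent) points per dimension, copied from the table above.
SCORES = {
    "Reasoning Quality": (5, 25),
    "Multi-Step Planning": (8, 22),
    "Tool Selection": (11, 18),
    "Argument Construction": (22, 23),
    "Tool Refusal": (22, 6),
}
MAX_PER_DIMENSION = 30  # each dimension is scored out of 30


def summarise(scores, max_per_dim=MAX_PER_DIMENSION):
    """Aggregate per-dimension points into overall totals and percentages."""
    total_max = max_per_dim * len(scores)
    base_total = sum(base for base, _ in scores.values())
    agent_total = sum(agent for _, agent in scores.values())
    return {
        "base_total": base_total,
        "agent_total": agent_total,
        "total_max": total_max,
        "base_pct": round(100 * base_total / total_max, 1),
        "agent_pct": round(100 * agent_total / total_max, 1),
        "delta_points": agent_total - base_total,
    }


def dimension_deltas(scores):
    """Agent minus base, per dimension."""
    return {name: agent - base for name, (base, agent) in scores.items()}


summary = summarise(SCORES)
print(summary["base_total"], summary["agent_total"], summary["total_max"])  # 68 94 150
print(summary["base_pct"], summary["agent_pct"])                            # 45.3 62.7
print(dimension_deltas(SCORES)["Tool Refusal"])                             # -16
```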
### Regressions (8 tests)
| Test | Dimension | Base | Agent | Note |
|---|---|---|---|---|
| B09: run `grep -r 'error' /var/log/` | Argument Construction | 3 | 1 | Searched web instead of running bash |
| D01: What is 2+2? | Tool Refusal | 2 | 0 | Called a tool unnecessarily |
| D02: What language is Python? | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D03: Explain REST API | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D05: Write a haiku | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D07: Summarise AMR | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| D09: SOLID principles | Tool Refusal | 3 | 0 | Called a tool unnecessarily |
| E10: Weather → write report | Multi-Step Planning | 2 | 1 | Partial step sequencing |
### Tool Refusal: the remaining limitation
Tool Refusal is the only dimension where AristaeusAgent scores lower than base (6/30 vs 22/30). The glaive no-call dataset reduced the over-triggering seen in v1 but did not eliminate it: the model still calls tools on static knowledge questions it should answer directly. This reflects a persistent training-data imbalance. Agentic trajectories dominate and almost always end in a tool call, so the model's prior under a tool-bearing system prompt is to use one.

This is partially addressable at inference time with an explicit system prompt instruction:
```
Only call a tool if the task genuinely requires external data, computation, or
system access. Answer directly from knowledge for factual questions, definitions,
and simple arithmetic.
```
### Spot check: format confirmation
```
Prompt: "What were the top AI news stories this week?"
── Aristaeus (base) ──
...{"name": "web_search", "arguments": {"query": "..."}} ← raw JSON, no tags
── AristaeusAgent ──
<think> To find the top AI news stories this week, I should search for recent
articles. Let me use the web_search tool. </think>
<tool_call>
{"name": "web_search", "arguments": {"query": "top AI news stories this week"}}
</tool_call>
```
AristaeusAgent correctly learned the Hermes `<think>` + `<tool_call>` format, whereas the base model produces raw JSON without wrapper tags. This format difference explains why several correct base-model answers scored 0 in earlier benchmark runs.

---
## Honest Limitations
**Tool over-triggering (primary limitation).** AristaeusAgent scores 6/30 on Tool Refusal vs the base model's 22/30. The model consistently calls tools on static knowledge questions ("explain REST", "write a haiku", "what are the SOLID principles") where it should answer directly. Root cause: agentic training data is dominated by trajectories that end in tool calls. The glaive no-call subset partially addressed this, but not sufficiently at the data scale used. The problem is partially mitigable via a system prompt instruction; a fuller fix would need a larger proportion of refusal examples in training or a larger base model.

**Tool-call format resolved in v2.** An earlier training run (v1, lambda dataset only) produced a model that emitted markdown code blocks and hallucinated tool syntax. The v2 run corrected this: AristaeusAgent now reliably produces the `<think>...</think>` + `<tool_call>{...}</tool_call>` format. Root cause of the v1 failure was `indent=2` in JSON serialisation during zake7749 normalisation, producing a different token sequence from the compact JSON used by all other datasets.

**Hallucination at 1.5B.** The model confabulates supporting detail for correct answers: in testing it correctly identified MIC as Minimum Inhibitory Concentration but fabricated a historical attribution. This is a fundamental capacity constraint at 1.5B parameters, not addressable through fine-tuning at this scale. Recommended mitigation: use Qwen2.5-3B or 7B as the base for any production use.

**Recursive reasoning failure.** Inherited from Aristaeus Stage 1. Deep recursive call-stack tracing (e.g. Fibonacci f(7)) causes the model to lose the thread and produce no answer. Documented in the Aristaeus model card.
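
The serialisation mismatch described under "Tool-call format resolved in v2" is easy to reproduce: `json.dumps` with `indent=2` and the single-line form encode the same object as different strings, and therefore as different token sequences. The sketch below is illustrative only. `to_compact_tool_call` is a hypothetical helper, and the use of Python's default separators is an assumption based on the examples shown in this card.

```python
import json

call = {"name": "web_search", "arguments": {"query": "top AI news stories this week"}}

indented = json.dumps(call, indent=2)  # multi-line, as in the problematic normalisation
compact = json.dumps(call)             # single line, default ", " and ": " separators


def to_compact_tool_call(raw_json: str) -> str:
    """Re-serialise any valid JSON tool call into one single-line form."""
    body = json.dumps(json.loads(raw_json), ensure_ascii=False)
    return "<tool_call>\n" + body + "\n</tool_call>"


# Same object, different text: the two strings only agree after parsing.
print(indented == compact)                          # False
print(json.loads(indented) == json.loads(compact))  # True
print(to_compact_tool_call(indented))
```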
---
## Usage
```python
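from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("EphAsad/AristaeusAgent")
tokenizer = AutoTokenizer.from_pretrained("EphAsad/AristaeusAgent")

# System prompt: tool schemas inside <tools>, then the think-first and refusal instructions.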
SYSTEM = """You are a helpful assistant with access to the following tools.
<tools>
[
{
"name": "bash",
"description": "Execute a bash command and return stdout/stderr.",
"parameters": {
"type": "object",
"properties": {
"command": {"type": "string"}
},
"required": ["command"]
}
}
]
</tools>
Think carefully before calling any tool. Use <think>...</think> to reason first.
Only call a tool if the task genuinely requires external data, computation, or
system access. Answer directly from knowledge for factual questions."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "How many Python files are in /workspace?"},
]

# Render the chat template to text, then tokenize and move the tensors to the model's device.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample a completion with the recommended decoding settings.
output = model.generate(**inputs, max_new_tokens=512, temperature=0.4,
                        top_p=0.9, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### Expected output format
```
<think>
To count Python files in /workspace I should use bash with find or ls.
The find command with -name "*.py" will be more reliable.
</think>
<tool_call>
{"name": "bash", "arguments": {"command": "find /workspace -name '*.py' | wc -l"}}
</tool_call>
```
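
Downstream code usually needs to split this output into its reasoning, its tool calls, and any direct answer. The parser below is a minimal sketch under the assumption that the model emits the tag layout shown above; `ParsedOutput` and `parse_agent_output` are hypothetical names, not part of any released code. Malformed tool-call JSON is skipped rather than raised, and a deployment could additionally flag outputs whose `think` field is empty or very short.

```python
import json
import re
from dataclasses import dataclass, field

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


@dataclass
class ParsedOutput:
    think: str = ""                                  # concatenated reasoning text
    tool_calls: list = field(default_factory=list)   # decoded tool-call dicts
    answer: str = ""                                 # text outside both tag types


def parse_agent_output(text: str) -> ParsedOutput:
    """Split a generation into reasoning, tool calls, and direct answer."""
    think = "\n".join(block.strip() for block in THINK_RE.findall(text))

    tool_calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(body)
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of raising
        if isinstance(call, dict) and "name" in call:
            tool_calls.append(call)

    # Whatever remains after removing both tag types is the direct answer.
    answer = TOOL_CALL_RE.sub("", THINK_RE.sub("", text)).strip()
    return ParsedOutput(think=think, tool_calls=tool_calls, answer=answer)
```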
---
## Two-Stage Pipeline
This model is the output of a reproducible two-stage training pipeline:
```
Qwen2.5-1.5B-Instruct
        │
        ▼  Stage 1: Full fine-tune (81 min, A100)
        │  OpenThoughts3-1.2M (30k sampled) + Bespoke-Stratos-17k
        │
        ▼
EphAsad/Aristaeus
        │
        ▼  Stage 2: QLoRA r=16 (1 epoch, A100)
        │  Hermes agent traces + zake7749 + glaive no-call subset
        ▼
EphAsad/AristaeusAgent  ← this model
```
All training scripts, validation scripts, and benchmark code are available on request.
---
## Design Notes
**Proof of concept, not production.** This model demonstrates that a two-stage reasoning → agentic pipeline is viable at 1.5B parameters with open datasets and a single A100 session. It is not production-ready. For practical deployment the recommended path is to apply the same pipeline to Qwen2.5-3B-Instruct or Qwen2.5-7B-Instruct, which should reduce the hallucination and format-consistency limitations without changing any training code.

**Dataset licensing.** All training datasets are Apache 2.0. No API-generated outputs from closed models (Claude, GPT-4, Gemini) were used at any stage.

**Deterministic fallback philosophy.** Consistent with prior work in this portfolio (BactAID, FireSOP, Eidos), the model is designed with explicit reasoning before action: `<think>` blocks are not decorative, they are the intended inference path. Deployments should treat absent or trivially short think blocks as a warning sign of degraded output quality.

---
## Author
Built by **Zain Asad** (Eph), Senior Microbiology Analyst and Applied AI Engineer.

---
## Licence
Apache 2.0, consistent with the base model and all training datasets used.