| language: | |
| - en | |
| license: apache-2.0 | |
| base_model: EphAsad/Aristaeus | |
| tags: | |
| - agentic | |
| - tool-calling | |
| - reasoning | |
| - fine-tuned | |
| - qwen2.5 | |
| - qlora | |
| - chain-of-thought | |
| - function-calling | |
| - unsloth | |
| datasets: | |
| - DJLougen/hermes-agent-traces-filtered | |
| - lambda/hermes-agent-reasoning-traces | |
| - zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory | |
| - glaiveai/glaive-function-calling-v2 | |
| pipeline_tag: text-generation | |
| # AristaeusAgent | |
| **AristaeusAgent** is a QLoRA fine-tune of [EphAsad/Aristaeus](https://huggingface.co/EphAsad/Aristaeus), itself a full fine-tune of Qwen2.5-1.5B-Instruct, trained to add structured agentic tool-use on top of the reasoning foundation established in Stage 1. | |
| This is Stage 2 of a two-stage training pipeline: | |
| - **Stage 1 (Aristaeus):** Chain-of-thought reasoning (OpenThoughts3 + Bespoke-Stratos) | |
| - **Stage 2 (AristaeusAgent, this model):** Agentic tool-calling with `<think>`-before-act behaviour | |
| The model uses the Hermes-style tool-call format: reasoning goes in `<think>...</think>` blocks, and tool invocations are emitted as `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`. | |
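| A minimal sketch of how a caller might split a completion in this format into reasoning text and tool calls. The function name `parse_hermes_output` and the returned dictionary shape are illustrative assumptions, not part of the model or any library. | |
```python
import json
import re

# Non-greedy matches so that several blocks in one completion are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_hermes_output(text):
    """Return the reasoning text and the list of well-formed tool calls."""
    thoughts = [block.strip() for block in THINK_RE.findall(text)]
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than guess at a repair
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"],
                          "arguments": call.get("arguments", {})})
    return {"thinking": "\n".join(thoughts), "tool_calls": calls}
```
| An empty `tool_calls` list means the model answered directly, which is the desired outcome for static knowledge questions. | |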
| --- | |
| ## Training | |
| | Detail | Value | | |
| |---|---| | |
| | Base model | EphAsad/Aristaeus (Stage 1) | | |
| | Fine-tune type | QLoRA (4-bit base, bf16 adapter) | | |
| | LoRA rank | r=16, alpha=32 | | |
| | LoRA targets | q/k/v/o/gate/up/down projections | | |
| | Hardware | NVIDIA A100-SXM4-40GB | | |
| | Epochs | 1 | | |
| | Sequence length | 8192 tokens | | |
| | Effective batch size | 16 (batch 1 × grad accum 16) | | |
| | Learning rate | 5e-6 (cosine schedule) | | |
| | Framework | Unsloth + TRL SFTTrainer | | |
| | Packing | Disabled (agentic trajectories must stay intact) | | |
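| The table can be restated as plain keyword dictionaries. This is a sketch only: the key names follow the common PEFT `LoraConfig` and Transformers `TrainingArguments` argument names, and the `*_proj` module names are an assumption about how the "q/k/v/o/gate/up/down projections" are named in the Qwen2 architecture. The actual training script is not reproduced here. | |
```python
# Values copied from the training table; dictionary names are hypothetical.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    # Assumed layer names for the q/k/v/o/gate/up/down projections.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

TRAIN_KWARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 5e-6,
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
}


def effective_batch_size(cfg, num_devices=1):
    """Sequences contributing to each optimizer step."""
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * num_devices)


def lora_scaling(cfg):
    """Standard LoRA scales the low-rank update by alpha / r."""
    return cfg["lora_alpha"] / cfg["r"]


print(effective_batch_size(TRAIN_KWARGS), lora_scaling(LORA_KWARGS))
```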
| ### Why QLoRA over full fine-tune | |
| Stage 1 established chain-of-thought reasoning with a full fine-tune on 29,400 examples (81 minutes of training). Stage 2 uses QLoRA (r=16, LR=5e-6, 1 epoch) specifically to preserve those Stage 1 gains. A full fine-tune at Stage 2 data scale would risk overwriting the reasoning capability rather than extending it. | |
| ### Datasets | |
| **[DJLougen/hermes-agent-traces-filtered](https://huggingface.co/datasets/DJLougen/hermes-agent-traces-filtered)**: Quality-filtered subset of the Hermes agent traces. Filtered for non-trivial `<think>` blocks (>50 chars), valid tool-call JSON, and evidence of deliberate tool selection reasoning. Primary training signal. | |
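| The first two filter criteria are mechanical and could be approximated as below. This is an illustrative sketch, not the dataset author's actual filter: the function name is hypothetical, a row with no tool call is assumed to pass the JSON check, and the third criterion (deliberate tool selection reasoning) needs human or model judgement and is left out. | |
```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def passes_trace_filter(assistant_text, min_think_chars=50):
    """Keep turns with a non-trivial think block and parseable tool calls."""
    thinks = THINK_RE.findall(assistant_text)
    if not any(len(t.strip()) > min_think_chars for t in thinks):
        return False  # no think block, or only trivially short ones
    for raw in TOOL_CALL_RE.findall(assistant_text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            return False  # invalid tool-call JSON
        if not isinstance(call, dict) or "name" not in call:
            return False
    return True
```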
| **[lambda/hermes-agent-reasoning-traces](https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces)**: Multi-turn agentic trajectories generated via the Hermes Agent harness using Kimi-K2.5 and GLM-5.1 (both run locally, no API ToS concerns). 2,000 rows from each config. Apache 2.0. | |
| **[zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory](https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory)**: Multi-turn trajectories with per-row quality scores. Filtered to `score >= 0.7` (~2,000 rows). Includes `reasoning_content` fields converted to `<think>` blocks during normalisation. Apache 2.0. | |
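| A sketch of the score filter and the `reasoning_content` conversion described above. Only `score` and `reasoning_content` are named in this card; the row layout (`messages`, `role`, `content`) and the function name are assumptions made for illustration. | |
```python
def normalise_row(row, min_score=0.7):
    """Drop low-scoring rows and fold reasoning_content into <think> blocks."""
    if row.get("score", 0.0) < min_score:
        return None  # filtered out
    messages = []
    for msg in row["messages"]:  # assumed layout: list of role/content dicts
        content = msg.get("content") or ""
        reasoning = msg.get("reasoning_content")
        if msg["role"] == "assistant" and reasoning:
            # Reasoning comes first so the model learns think-before-act.
            content = f"<think>\n{reasoning.strip()}\n</think>\n{content}"
        messages.append({"role": msg["role"], "content": content})
    return messages
```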
| **[glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2)**: No-call subset only (~1,200 rows after filtering). Tool schemas are present in the system prompt, but the assistant answers directly from knowledge. These negative examples teach tool refusal: the model learns that having tools available does not mean they must be called. Rows containing markdown code blocks were explicitly excluded to prevent format bleed into tool-call syntax. Apache 2.0. | |
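| Selecting the no-call subset could look roughly like this. The `<functioncall>` marker is an assumption about how the source dataset flags a call, and the function name is hypothetical; the real filtering script may differ. | |
```python
FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the example readable
CALL_MARKERS = ("<functioncall>", "<tool_call>")  # assumed call markers


def is_clean_no_call_example(assistant_turns):
    """True if no assistant turn calls a tool or contains a markdown code block."""
    for turn in assistant_turns:
        if any(marker in turn for marker in CALL_MARKERS):
            return False  # this conversation does call a tool
        if FENCE in turn:
            return False  # excluded to avoid format bleed
    return True
```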
| --- | |
| ## Evaluation | |
| AristaeusAgent was evaluated against Aristaeus (base) using a custom 50-test benchmark across 5 capability dimensions (max 3 points per test, 150 total). Results below are from the final v2 training run (5-dataset stack including glaive no-call subset). | |
| ### Overall | |
| | Model | Score | % | | |
| |---|---|---| | |
| | Aristaeus (Stage 1 base) | 68 / 150 | 45.3% | | |
| | AristaeusAgent | 94 / 150 | 62.7% | | |
| | Delta | **+26 points** | **+17.4pp** | | |
| AristaeusAgent wins 23 tests, draws 19, loses 8. | |
| ### By dimension | |
| | Dimension | Base | Base% | Agent | Agent% | Delta | | |
| |---|---|---|---|---|---| | |
| | Reasoning Quality | 5/30 | 16.7% | 25/30 | 83.3% | **+20** | | |
| | Multi-Step Planning | 8/30 | 26.7% | 22/30 | 73.3% | **+14** | | |
| | Tool Selection | 11/30 | 36.7% | 18/30 | 60.0% | **+7** | | |
| | Argument Construction | 22/30 | 73.3% | 23/30 | 76.7% | +1 | | |
| | Tool Refusal | 22/30 | 73.3% | 6/30 | 20.0% | -16 | | |
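| The overall figures follow directly from the per-dimension scores. The short check below recomputes them from the table; the variable and function names are illustrative. | |
```python
# (base, agent) scores per dimension, copied from the table above.
SCORES = {
    "Reasoning Quality": (5, 25),
    "Multi-Step Planning": (8, 22),
    "Tool Selection": (11, 18),
    "Argument Construction": (22, 23),
    "Tool Refusal": (22, 6),
}
MAX_PER_DIMENSION = 30  # 10 tests x 3 points


def summarise(scores, max_per_dim=MAX_PER_DIMENSION):
    """Aggregate per-dimension scores into totals, percentages, and deltas."""
    base = sum(b for b, _ in scores.values())
    agent = sum(a for _, a in scores.values())
    total = max_per_dim * len(scores)
    return {
        "base": base,
        "agent": agent,
        "total": total,
        "base_pct": round(100 * base / total, 1),
        "agent_pct": round(100 * agent / total, 1),
        "deltas": {name: a - b for name, (b, a) in scores.items()},
    }


print(summarise(SCORES))
```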
| ### Regressions (8 tests) | |
| | Test | Dimension | Base | Agent | Note | | |
| |---|---|---|---|---| | |
| | B09: run `grep -r 'error' /var/log/` | Argument Construction | 3 | 1 | Searched web instead of running bash | | |
| | D01: What is 2+2? | Tool Refusal | 2 | 0 | Called a tool unnecessarily | | |
| | D02: What language is Python? | Tool Refusal | 3 | 0 | Called a tool unnecessarily | | |
| | D03: Explain REST API | Tool Refusal | 3 | 0 | Called a tool unnecessarily | | |
| | D05: Write a haiku | Tool Refusal | 3 | 0 | Called a tool unnecessarily | | |
| | D07: Summarise AMR | Tool Refusal | 3 | 0 | Called a tool unnecessarily | | |
| | D09: SOLID principles | Tool Refusal | 3 | 0 | Called a tool unnecessarily | | |
| | E10: Weather → write report | Multi-Step Planning | 2 | 1 | Partial step sequencing | | |
| ### Tool Refusal: the remaining limitation | |
| Tool Refusal is the only dimension where AristaeusAgent scores lower than base (6/30 vs 22/30). The glaive no-call dataset reduced the over-triggering seen in v1 but did not eliminate it. The model still calls tools on static knowledge questions it should answer directly. This is a persistent training data imbalance: agentic trajectories dominate and almost always end in a tool call, so the model's prior in a tool-bearing system prompt is to use one. | |
| This is partially addressable at inference time with an explicit system prompt instruction: | |
| ``` | |
| Only call a tool if the task genuinely requires external data, computation, or | |
| system access. Answer directly from knowledge for factual questions, definitions, | |
| and simple arithmetic. | |
| ``` | |
| ### Spot check β format confirmation | |
| ``` | |
| Prompt: "What were the top AI news stories this week?" | |
── Aristaeus (base) ──
...{"name": "web_search", "arguments": {"query": "..."}} ← raw JSON, no tags
── AristaeusAgent ──
| <think> To find the top AI news stories this week, I should search for recent | |
| articles. Let me use the web_search tool. </think> | |
| <tool_call> | |
| {"name": "web_search", "arguments": {"query": "top AI news stories this week"}} | |
| </tool_call> | |
| ``` | |
| AristaeusAgent correctly learned the Hermes `<think>` + `<tool_call>` format, whereas the base model produces raw JSON without wrapper tags. This format difference explains why several correct base-model answers scored 0 in earlier benchmark runs. | |
| --- | |
| ## Honest Limitations | |
| **Tool over-triggering (primary limitation).** AristaeusAgent scores 6/30 on Tool Refusal vs the base model's 22/30. The model consistently calls tools on static knowledge questions ("explain REST", "write a haiku", "what are the SOLID principles") where it should answer directly. Root cause: agentic training data is dominated by trajectories that end in tool calls. The glaive no-call subset partially addressed this, but not sufficiently at the data scale used. The problem is partially mitigable via a system prompt instruction; a larger proportion of refusal examples in training, or a larger base model, would likely address it more fully. | |
| **Tool-call format resolved in v2.** An earlier training run (v1, lambda dataset only) produced a model that emitted markdown code blocks and hallucinated tool syntax. The v2 run corrected this: AristaeusAgent now reliably produces the `<think>...</think>` + `<tool_call>{...}</tool_call>` format. The root cause of the v1 failure was `indent=2` in JSON serialisation during zake7749 normalisation, which produced a different token sequence from the compact JSON used by all other datasets. | |
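| The serialisation difference is easy to reproduce with the standard library. The sketch below shows that `indent=2` yields a multi-line string while the default call yields single-line JSON; the helper `canonical_tool_call` is a hypothetical name for a shared serialiser that every dataset could pass through. | |
```python
import json

call = {"name": "web_search",
        "arguments": {"query": "top AI news stories this week"}}

compact = json.dumps(call)           # single line, default separators
pretty = json.dumps(call, indent=2)  # multi-line, indented

# Same data, different text, so a tokenizer sees different token sequences.
assert json.loads(compact) == json.loads(pretty)
assert compact != pretty


def canonical_tool_call(call):
    """Serialise every tool call the same way, whatever dataset it came from."""
    return "<tool_call>\n" + json.dumps(call, ensure_ascii=False) + "\n</tool_call>"


print(canonical_tool_call(call))
```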
| **Hallucination at 1.5B.** The model confabulates supporting detail for correct answers: in testing it correctly identified MIC as Minimum Inhibitory Concentration but fabricated a false historical attribution. This is a fundamental capacity constraint at 1.5B parameters, not addressable through fine-tuning at this scale. Recommended mitigation: use Qwen2.5-3B or 7B as the base for any production use. | |
| **Recursive reasoning failure.** Inherited from Aristaeus Stage 1. Deep recursive call stack tracing (e.g. Fibonacci f(7)) causes the model to lose the thread and produce no answer. Documented in the Aristaeus model card. | |
| --- | |
| ## Usage | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model = AutoModelForCausalLM.from_pretrained("EphAsad/AristaeusAgent") | |
| tokenizer = AutoTokenizer.from_pretrained("EphAsad/AristaeusAgent") | |
| SYSTEM = """You are a helpful assistant with access to the following tools. | |
| <tools> | |
| [ | |
| { | |
| "name": "bash", | |
| "description": "Execute a bash command and return stdout/stderr.", | |
| "parameters": { | |
| "type": "object", | |
| "properties": { | |
| "command": {"type": "string"} | |
| }, | |
| "required": ["command"] | |
| } | |
| } | |
| ] | |
| </tools> | |
| Think carefully before calling any tool. Use <think>...</think> to reason first. | |
| Only call a tool if the task genuinely requires external data, computation, or | |
| system access. Answer directly from knowledge for factual questions.""" | |
| messages = [ | |
| {"role": "system", "content": SYSTEM}, | |
| {"role": "user", "content": "How many Python files are in /workspace?"}, | |
| ] | |
# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Sample at a low temperature, then decode only the newly generated tokens.
output = model.generate(**inputs, max_new_tokens=512, temperature=0.4,
                        top_p=0.9, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
| ``` | |
| ### Expected output format | |
| ``` | |
| <think> | |
| To count Python files in /workspace I should use bash with find or ls. | |
| The find command with -name "*.py" will be more reliable. | |
| </think> | |
| <tool_call> | |
| {"name": "bash", "arguments": {"command": "find /workspace -name '*.py' | wc -l"}} | |
| </tool_call> | |
| ``` | |
| --- | |
| ## Two-Stage Pipeline | |
| This model is the output of a reproducible two-stage training pipeline: | |
| ``` | |
| Qwen2.5-1.5B-Instruct | |
        │
        ▼  Stage 1: Full fine-tune (81 min, A100)
        │  OpenThoughts3-1.2M (30k sampled) + Bespoke-Stratos-17k
        │
        ▼
EphAsad/Aristaeus
        │
        ▼  Stage 2: QLoRA r=16 (1 epoch, A100)
        │  Hermes agent traces + zake7749 + glaive no-call subset
        ▼
EphAsad/AristaeusAgent  ← this model
| ``` | |
| All training scripts, validation scripts, and benchmark code are available on request. | |
| --- | |
| ## Design Notes | |
| **Proof of concept, not production.** This model demonstrates that a two-stage reasoning → agentic pipeline is viable at 1.5B parameters with open datasets and a single A100 session. It is not production-ready. For practical deployment the recommended path is to apply the same pipeline to Qwen2.5-3B-Instruct or Qwen2.5-7B-Instruct, which would address the hallucination and format consistency limitations without changing any training code. | |
| **Dataset licensing.** All training datasets are Apache 2.0. No API-generated outputs from closed models (Claude, GPT-4, Gemini) were used at any stage. | |
| **Deterministic fallback philosophy.** Consistent with prior work in this portfolio (BactAID, FireSOP, Eidos), the model is designed with explicit reasoning before action: `<think>` blocks are not decorative, they are the intended inference path. Deployments should treat absent or trivially short think blocks as a quality signal. | |
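| One way a deployment might act on that signal is sketched below. The 50-character threshold is borrowed from the dataset filter described earlier and is an assumption, as are the function name and the three labels. | |
```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def think_quality(completion, min_chars=50):
    """Classify a completion by the presence and length of its think block."""
    blocks = [block.strip() for block in THINK_RE.findall(completion)]
    if not blocks:
        return "absent"
    if max(len(block) for block in blocks) < min_chars:
        return "short"
    return "ok"
```
| A caller could, for example, retry or route to a fallback path whenever the label is not `ok`. | |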
| --- | |
| ## Author | |
| Built by **Zain Asad** (Eph), Senior Microbiology Analyst and Applied AI Engineer. | |
| --- | |
| ## Licence | |
| Apache 2.0, consistent with the base model and all training datasets used. | |