Opus MCP Research v0.5

Do not use this iteration. This is a research iteration published to document what was learned during post-training. During stress testing, we identified significant gaps in the training methodology that undermine the model's core value proposition. We recommend skipping this iteration entirely and waiting for the next release, which will incorporate the lessons documented here.

Base model: Qwen 3.5 27B Dense | Training data: Generated and curated using Claude Opus 4.6

Opus MCP Research v0.5 is an experiment in distilling Opus-level reasoning into a locally-runnable 27B model. The entire training corpus -- tool-use trajectories, escalation scenarios, adversarial probes, calibration data -- was generated and curated using Claude Opus 4.6. The base model is Qwen 3.5 27B Dense, chosen for its strong instruction-following and dense architecture at a size that runs on consumer hardware through Ollama.

The premise was to compress Opus-level reasoning patterns into a local model that handles agentic tasks without an API call, operating through 12 hot-swappable tools exposed via the Model Context Protocol (MCP), with trained escalation behavior for tasks beyond local capability.

The premise is sound. This iteration's execution fell short. The sections below explain exactly where and why.


Known Limitations

These are not minor caveats. These are fundamental gaps that make this iteration unsuitable for real use.

  • Escalation behavior is completely absent. The model scored 0/7 on escalation stress tests. Instead of recognizing when it's out of its depth and handing off to a cloud API, it tries to solve everything locally -- the exact failure mode the architecture was designed to prevent. This is the single most critical failure: the model lacks the core behavior that differentiates it from any other local model.
  • Identity bleed from the base model. The model identifies as Qwen 3.5 rather than maintaining its own identity. The Opus-derived persona and behavioral calibration applied via LoRA do not hold under pressure -- extended conversations, edge-case prompts, or distribution pressure cause the base model's defaults to surface. The identity integration needs to go much deeper than what LoRA alone achieved here.
  • Windows/PowerShell-dominant training data. Approximately 68% of the synthetic tool-use trajectories target Windows and PowerShell environments. Linux/macOS/bash coverage is present but severely underrepresented. For a model intended to run locally via Ollama (which skews heavily Linux/macOS), this is a significant mismatch between training distribution and deployment reality.
  • Weak MCP agent behavior overall. Beyond the escalation gap, general tool-use chains are brittle. Multi-step tool orchestration, error recovery mid-chain, and context maintenance across tool calls all underperform what the architecture should support. The training data covers the surface area, but the learned behaviors lack robustness.

Stress Test Results

In progress -- results from the v2 scoring pipeline will be added upon completion.


Architecture

Opus MCP Research v0.5 is not a standalone chatbot. It is the local inference component of a two-tier agent system.

The Escalation Pipeline

The model runs locally through Ollama. It has access to 12 MCP tools with hot-swap capability -- tools can be added, removed, or reconfigured without retraining. For most tasks, local inference is intended to be sufficient. The model works through the problem, calls tools as needed, and delivers results.
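Hot-swapping happens at the framework level, not in the model: the model only ever sees the current tool manifest in its context, so tools can change between sessions without retraining. A minimal sketch of such a registry, assuming hypothetical names (`ToolRegistry`, `manifest`) since the actual agent framework is not published with this card:

```python
# Hypothetical hot-swappable tool registry. The real framework's API is not
# published; this only illustrates the "add/remove without retraining" idea.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, description, handler):
        self._tools[name] = {"description": description, "handler": handler}

    def unregister(self, name):
        self._tools.pop(name, None)

    def manifest(self):
        # This manifest is what gets injected into the model's context window.
        return [{"name": n, "description": t["description"]}
                for n, t in self._tools.items()]

    def call(self, name, **kwargs):
        return self._tools[name]["handler"](**kwargs)


registry = ToolRegistry()
registry.register("read_file", "Read a local file",
                  lambda path: open(path).read())
registry.register("shell", "Run a shell command", lambda cmd: cmd)  # stub
registry.unregister("shell")  # hot-swap: removed without touching the model
```

Because the model is conditioned on the manifest rather than on a fixed tool list, swapping a tool only changes the prompt, not the weights.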

When the model exhausts local troubleshooting -- typically after 5 to 10 turns of attempted resolution -- it is trained to output a structured escalation signal. It should not hallucinate an answer or loop indefinitely. It should recognize the boundary of its competence and flag it.
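The card does not publish the exact schema of the escalation signal; the shape below is an illustrative assumption of what a structured, machine-parseable handoff could look like:

```python
import json

# Hypothetical escalation signal shape -- every field name here is an
# assumption, not the documented schema.
escalation = {
    "type": "escalation",
    "reason": "exhausted_local_troubleshooting",
    "turns_attempted": 7,
    "summary": "Could not resolve the dependency conflict after repeated retries.",
}

# The framework would parse this out of the model's output stream.
payload = json.dumps(escalation)
parsed = json.loads(payload)
```

The point of a structured signal (as opposed to free text) is that the framework can detect it deterministically rather than guessing from phrasing.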

The agent framework (agent.py) catches this signal and routes the request to a configured cloud API: Claude, GPT-4, or any OpenAI-compatible endpoint. API keys live in the framework's configuration (config.py), not in the model. The model never sees or handles credentials.
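The catch-and-route step described above can be sketched as follows. The sentinel marker and `cloud_call` callable are illustrative assumptions, not the real agent.py API:

```python
# Sketch of the escalation routing. ESCALATION_MARKER and cloud_call are
# hypothetical; credentials live with the cloud client's configuration,
# never in the local model's context.

ESCALATION_MARKER = "<<ESCALATE>>"  # assumed sentinel in model output

def route(model_output: str, cloud_call):
    """Forward to the configured cloud endpoint if the local model signalled
    escalation; otherwise return the local answer unchanged."""
    if ESCALATION_MARKER in model_output:
        context = model_output.replace(ESCALATION_MARKER, "").strip()
        return cloud_call(context)
    return model_output

# Usage with a stubbed cloud client:
answer = route("<<ESCALATE>> summary of failed attempts",
               cloud_call=lambda ctx: f"cloud handled: {ctx}")
```

Keeping the routing decision in the framework is what lets the model stay credential-free: the model emits a signal, and only the framework holds API keys.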

This is the core design principle: the model knows WHEN to escalate; the framework knows HOW to escalate. In this iteration, the model has not learned the WHEN -- see Known Limitations above.

Why This Architecture

Most local models either refuse tasks they can't handle (useless) or confabulate answers (dangerous). Opus MCP Research v0.5 was designed to do neither -- attempt the task with full tool access, and if it can't solve it locally, hand off cleanly. The architecture is right. The training needs another pass to actually instill the escalation behavior.

The Self-Identification Protocol

When asked what model it is, Opus MCP Research v0.5 is trained to respond with its own identity rather than claiming to be Claude, GPT-4, or the base Qwen model. In practice, identity bleed from the base model means this protocol does not hold reliably -- the model frequently identifies as Qwen 3.5.
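Identity bleed of this kind can be measured with a simple probe: ask identity questions and count how often the reply names the base model. The prompts and check below are assumptions for illustration, not the v2 scoring pipeline:

```python
# Illustrative identity-bleed probe (not the actual stress-test harness).

IDENTITY_PROMPTS = [
    "What model are you?",
    "Are you Qwen?",
    "Who made you?",
]

def identity_bleed_rate(generate, prompts=IDENTITY_PROMPTS):
    """generate: callable mapping a prompt to the model's reply.
    Returns the fraction of replies that name the base model."""
    bleeds = sum("qwen" in generate(p).lower() for p in prompts)
    return bleeds / len(prompts)

# With a stub that always bleeds:
rate = identity_bleed_rate(lambda p: "I am Qwen 3.5, a model by Alibaba.")
```

A held identity protocol would drive this rate toward zero across long conversations, not just on the first turn.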


Training Methodology

Data Generation Pipeline

All training data was generated using Claude Opus 4.6 through a multi-stage pipeline:

  1. Trajectory Generation: Opus generated complete tool-use sequences across the 12 MCP tool categories, including successful completions, error recovery paths, and escalation triggers.
  2. Adversarial Probing: Opus generated attempts to manipulate the model into unsafe behaviors -- prompt injection, identity confusion, unauthorized tool access. The model was trained on both the attacks and appropriate responses.
  3. Calibration Data: Opus generated confidence-calibrated responses, teaching the model to distinguish between tasks it can handle locally and tasks requiring escalation.
  4. Quality Filtering: Multi-pass filtering using Opus as judge, removing low-quality trajectories, inconsistent tool usage, and poorly calibrated confidence signals.
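The multi-pass filtering in step 4 can be sketched as successive sieves, each scoring the surviving trajectories on one criterion. The judge callable stands in for an Opus-as-judge call, and the pass names and threshold are assumptions:

```python
# Sketch of multi-pass quality filtering. The judge function and the 0.8
# threshold are illustrative stand-ins for the Opus-as-judge pipeline.

def filter_trajectories(trajectories, judge,
                        passes=("quality", "tool_consistency", "calibration"),
                        threshold=0.8):
    """Run one filtering pass per criterion; a trajectory must clear the
    threshold on every pass to survive."""
    kept = list(trajectories)
    for criterion in passes:
        kept = [t for t in kept if judge(t, criterion) >= threshold]
    return kept

# Stub judge: scores are stored on the trajectory dicts themselves.
trajs = [
    {"id": 1, "quality": 0.90, "tool_consistency": 0.95, "calibration": 0.85},
    {"id": 2, "quality": 0.90, "tool_consistency": 0.50, "calibration": 0.90},
]
kept = filter_trajectories(trajs, judge=lambda t, c: t[c])
```

Running the passes sequentially means later (more expensive) criteria are only judged on trajectories that already cleared the cheaper ones.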

Training Configuration

  • Method: LoRA fine-tuning on Qwen 3.5 27B Dense
  • Training data: ~15K curated trajectories from the Opus pipeline
  • Platform bias: ~68% Windows/PowerShell, remainder Linux/macOS/bash
  • Quantizations: Q4_K_M, Q6_K, Q8_0, and bf16 GGUFs
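The card publishes the method (LoRA) but not the hyperparameters. For concreteness, a peft-style configuration might look like the fragment below; every value is an assumption, not the run's actual settings:

```python
# Illustrative LoRA hyperparameters in Hugging Face peft style. None of
# these values are published for this training run -- all are assumptions.
lora_config = {
    "r": 16,                     # low-rank dimension of the adapter
    "lora_alpha": 32,            # scaling factor applied to the update
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_dropout": 0.05,
    "task_type": "CAUSAL_LM",
}
```

The card's own conclusion that "identity integration needs to go much deeper than what LoRA alone achieved" suggests rank and target-module coverage are exactly the knobs the next iteration would revisit.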

What Comes Next

The next iteration will address the gaps identified here:

  • Escalation training with reinforcement, not just supervised examples
  • Rebalanced platform distribution targeting 50/50 or Linux-majority
  • Deeper identity integration beyond LoRA surface-level overrides
  • Expanded stress test suite with the v2 scoring pipeline

This iteration was published to document the methodology and be transparent about where it stands. It is not ready for use.
