Organon: a skill-first agent architecture for end-to-end scientific workflows

Community Article Published May 5, 2026

Most agent scaffolds treat the prompt and the loop as the unit of capability. Organon's bet is that the unit of capability is a skill backed by state that persists across sessions, composed against research context the agent already holds about the scientist using it.

It is open-source, runs on top of Claude Code with Claude Opus 4.7 as the base model, and ships 30+ skills covering the full daily research workflow from literature search through dissemination. MCP servers handle federated paper search and full-text biomedical grep, PreToolUse hooks enforce citation discipline at the file-write boundary, and a multi-persona research council reviews hypotheses adversarially. The full source is MIT-licensed at github.com/krmdel/organon.

This post walks through the architecture, the end-to-end workflow it supports, and the engineering details that make it reliable enough to use on real research. A short stress-test section uses the Einstein Arena to demonstrate that the same skill stack holds up beyond its design domain, with a sealed-sandbox ablation against the same base model. The Arena is the cross-validation, not the headline.

Why a skill-first architecture

Two styles of LLM-powered scientific system have settled into the literature in the last eighteen months. Evolutionary search systems, of which DeepMind's AlphaEvolve and its predecessor FunSearch are the visible examples, pair a base LLM with a programmatic evaluator and run an island-based evolutionary loop over code. End-to-end pipeline systems, such as Sakana AI's AI-Scientist and the v2 follow-up, chain idea generation, coding, experimentation, and manuscript drafting into a single autonomous run.

Both styles assume the base LLM is the unit of capability. Evolutionary systems wrap the LLM in a search loop. Pipeline systems wrap the LLM in a plan-act-reflect loop. In both cases the scaffolding is specialized to a narrow task shape, and most of the user-visible state is recreated per run.

Organon takes a different design point. The unit of capability is a skill, not a model and not a prompt. A skill is a self-contained folder with a YAML-fronted instruction file, depth references, executable scripts, and assets. The agent identity, including personality, working memory, and a long-running learnings journal, is persistent across sessions. Research context, including a researcher's actual papers, methods, journals, and active questions, is loaded into every skill's invocation as a thin layer that personalizes output without changing skill code.

The same skill stack handles parallel-fanout literature search across PubMed, arXiv, OpenAlex, and Semantic Scholar in the morning, runs assumption-checked statistical tests on a CSV after lunch, and adapts the same evidence into publications in the evening. The compounding effect is what the architecture is designed for. A correction logged against the data-analysis skill on Tuesday changes how it behaves on Friday. A new methodology added as a skill in week one is available, indexed, and routable in week six.

Three-layer architecture

Organon is organized into three layers, each with its own storage, update rule, and reconciliation policy.

Layer 1: Agent Identity. Personality, working memory, and learnings live in a small set of markdown files. SOUL.md defines non-negotiable behavior rules (be helpful not performative, have scientific opinions, preserve hedging language, report effect sizes alongside p-values). USER.md captures the researcher's name, affiliation, career stage, and field. context/memory/ contains one file per day with numbered session blocks tracking goal, deliverables, decisions, and open threads. context/learnings.md is the accumulated long-term memory: a "what works well, what doesn't" journal plus per-skill sections that persist across all sessions.

Layer 2: Skills Pack. Skills live in .claude/skills/{category}-{skill-name}/ with a fixed structure. Categories are sci for scientific workflow, viz for diagrams and illustrations, ops for scheduling and compute primitives, tool for utility integrations, and meta for self-improvement. At the time of writing the framework ships about 30 skills, each independently versioned, self-contained, and intentionally stateless between invocations. Persistent state lives in the identity layer or in research context, not in the skill.
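
The folder naming convention can be checked mechanically. The sketch below is a minimal illustration, not code from the repository: it splits a `{category}-{skill-name}` directory name on the first hyphen and rejects categories outside the five listed above. The `SkillId` type and `parse_skill_dir` function are hypothetical names.

```python
from typing import NamedTuple

# Category prefixes listed in the text above.
CATEGORIES = {"sci", "viz", "ops", "tool", "meta"}


class SkillId(NamedTuple):
    category: str
    name: str


def parse_skill_dir(dirname: str) -> SkillId:
    """Split '{category}-{skill-name}' on the first hyphen only."""
    category, sep, name = dirname.partition("-")
    if not sep or not name:
        raise ValueError(f"expected '<category>-<name>', got {dirname!r}")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category {category!r} in {dirname!r}")
    return SkillId(category, name)


print(parse_skill_dir("sci-data-analysis"))
```

Splitting on the first hyphen only keeps multi-word skill names such as data-analysis intact.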

Layer 3: Research Context. A small set of markdown files under research_context/ records the researcher's primary field, subfields, active questions, tool ecosystem, citation style, preferred journals, access constraints, and writing conventions. Every scientific skill reads this layer at the start of its run and degrades gracefully if files are missing. The graceful-degradation contract is enforced by a context matrix that maps each skill to the files it consumes.
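
One way to picture the graceful-degradation contract is a lookup that returns whatever context files exist and reports the rest as missing instead of failing. This is a sketch under assumptions: the matrix contents, the research-profile.md file name, and the `load_context` function are illustrative, and the repository may store the context matrix in a different form.

```python
import tempfile
from pathlib import Path

# Hypothetical matrix: skill name -> research-context files it reads.
CONTEXT_MATRIX = {
    "sci-data-analysis": ["research-preferences.md", "research-profile.md"],
}


def load_context(skill: str, root: Path) -> tuple[dict[str, str], list[str]]:
    """Return (loaded file contents, names of missing files) for one skill."""
    loaded: dict[str, str] = {}
    missing: list[str] = []
    for name in CONTEXT_MATRIX.get(skill, []):
        path = root / name
        if path.is_file():
            loaded[name] = path.read_text(encoding="utf-8")
        else:
            missing.append(name)  # degrade: record the gap and keep going
    return loaded, missing


# Demo with a throwaway directory that holds only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "research-preferences.md").write_text("plots: violin\n", encoding="utf-8")
    loaded, missing = load_context("sci-data-analysis", root)

print(sorted(loaded), missing)
```

A skill that receives a non-empty missing list can fall back to generic defaults and still run.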

A heartbeat routine runs at the start of every session: load identity, load research context, scan installed skills, then enter the routing layer. Reconciliation rules detect new skills appearing on disk and wire them into the registry automatically; removed skills require user confirmation before removal. The framework keeps itself consistent on every wake.
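
The asymmetric reconciliation rule (additions are automatic, removals need a human) reduces to two set differences. The function below is an illustrative sketch; the `reconcile` name and its return shape are assumptions, not the framework's API.

```python
def reconcile(on_disk: set[str], registry: set[str]) -> tuple[list[str], list[str]]:
    """Compare skill folders on disk with the registry.

    Returns (auto_register, needs_confirmation).
    """
    auto_register = sorted(on_disk - registry)       # new on disk: wire in
    needs_confirmation = sorted(registry - on_disk)  # gone from disk: ask first
    return auto_register, needs_confirmation


added, to_confirm = reconcile(
    on_disk={"sci-data-analysis", "sci-council"},
    registry={"sci-data-analysis", "tool-substack"},
)
print(added)       # ['sci-council']
print(to_confirm)  # ['tool-substack']
```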

Skill routing cascade

Routing from a user request to a skill follows a four-step cascade. The first step that returns a usable match wins:

1. Match the user's phrasing against installed skill trigger phrases.
2. If no direct match, check whether an adjacent installed skill
   can handle the task with different inputs.
3. If still nothing, query ToolUniverse, the open-source biomedical
   tool catalog, for a routable match.
4. If all three fail, fall back to web search or propose a new skill
   via meta-skill-creator.

Each routed call prints a small notice the user can read and override:

--- SKILL ROUTED ---
Skill:   sci-data-analysis
Trigger: "run a t-test on these two groups"
Reason:  direct trigger match; adjacent skills (sci-hypothesis,
         sci-trending-research) less specific
---

The routing notice is the operational embodiment of context isolation. A routed skill loads only the depth material it needs, runs in its own context, and writes back results that the next skill can pick up. Most scientific requests route in step one. ToolUniverse, the open-source biomedical tool catalog from Marinka Zitnik's lab at Harvard, is the recovery path when the local skill set does not yet have a primitive for the request, and it covers more than 2,200 tools.
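
The cascade and the notice format can be sketched as a list of step functions tried in order. Everything here is illustrative: the `Route` type, the trigger table, and the stubbed second and third steps are assumptions, and real trigger matching is presumably richer than a substring test.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Route:
    skill: str
    reason: str


# Each step takes the request text and returns a Route or None.
Step = Callable[[str], Optional[Route]]

# Hypothetical trigger table for the demo.
TRIGGERS = {"t-test": "sci-data-analysis", "literature": "sci-literature-research"}


def match_triggers(request: str) -> Optional[Route]:
    for phrase, skill in TRIGGERS.items():
        if phrase in request.lower():
            return Route(skill, "direct trigger match")
    return None


def route(request: str, steps: list[Step]) -> Route:
    """Return the first usable match; fall back when every step fails."""
    for step in steps:
        found = step(request)
        if found is not None:
            return found
    return Route("meta-skill-creator", "no match; web search or new skill")


def notice(request: str, r: Route) -> str:
    return (
        "--- SKILL ROUTED ---\n"
        f"Skill:   {r.skill}\n"
        f'Trigger: "{request}"\n'
        f"Reason:  {r.reason}\n"
        "---"
    )


# Steps 2 and 3 (adjacent skill, ToolUniverse lookup) are stubbed out here.
steps: list[Step] = [match_triggers, lambda _: None, lambda _: None]
request = "run a t-test on these two groups"
print(notice(request, route(request, steps)))
```

Keeping each step behind the same signature means a new recovery path can be added to the list without touching the loop.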

End-to-end scientific workflow

Skills are designed to compose. A typical research week routes through the same stack from literature to dissemination, with the agent loading only what each step needs.

Literature: federated MCP plus full-text grep

Literature work goes through three Model Context Protocol servers that complement each other.

The first is a local Node-based server that runs federated parallel queries across PubMed, arXiv, OpenAlex, and Semantic Scholar. The second is the Paperclip HTTP MCP, which provides full-text search over more than 8 million biomedical papers from bioRxiv, medRxiv, and PubMed Central, with capabilities the federated layer cannot match: sub-second corpus-wide regex grep, hybrid BM25 plus vector search, per-paper LLM reading via a map operation, figure analysis via ask-image, SQL queries over the unified document table, and stable line-anchor URLs for citation verification. The third is a ToolUniverse MCP that exposes the Harvard biomedical-tool catalog.

Routing between Paperclip and the federated layer follows three rules. For biomedical discovery and review, both run in parallel, with results deduplicated by DOI: Paperclip surfaces recent preprints and full-text passages, OpenAlex citation ranking surfaces the seminal canon, and neither alone is complete. For biomedical deep capability (regex grep, per-paper LLM reading, figure analysis, SQL), Paperclip is the only path. For non-biomedical queries, the federated layer alone is preferred, because Paperclip's biomedical corpus generates keyword-collision noise outside its domain (a query for "gravitational waves" returns a termite paper about acoustic black holes in wood).
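
The backend choice and the DOI deduplication described above can be sketched in a few lines. The function names and the record shape (a dict with an optional "doi" field) are assumptions for illustration. Keeping records that lack a DOI is also an assumption, since the text does not say how those are handled.

```python
def choose_backends(biomedical: bool, needs_deep_capability: bool) -> list[str]:
    """Pick search backends following the rules in the text."""
    if not biomedical:
        return ["federated"]
    if needs_deep_capability:          # regex grep, per-paper reading, figures, SQL
        return ["paperclip"]
    return ["paperclip", "federated"]  # discovery and review: run both


def dedupe_by_doi(results: list[dict]) -> list[dict]:
    """Keep the first hit per DOI, comparing case-insensitively."""
    seen: set[str] = set()
    unique = []
    for rec in results:
        doi = (rec.get("doi") or "").strip().lower()
        if doi and doi in seen:
            continue
        if doi:
            seen.add(doi)
        unique.append(rec)  # records without a DOI are kept (assumption)
    return unique


hits = [
    {"doi": "10.1000/ABC", "source": "paperclip"},
    {"doi": "10.1000/abc", "source": "federated"},
    {"title": "preprint without a DOI", "source": "federated"},
]
print(choose_backends(biomedical=True, needs_deep_capability=False))
print(len(dedupe_by_doi(hits)))
```

DOIs are case-insensitive, so lowercasing before comparison avoids treating the same paper as two hits.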

Credentials for these servers come from a local .env file. A small shim sources .env and execs the real MCP command, so MCP credentials are picked up automatically without changes to the user's shell rc:

"paper-search": {
  "command": "bash",
  "args": ["scripts/with-env.sh", "node",
           "mcp-servers/paper-search/dist/index.js"]
}
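
For readers who want to see what such a shim does, here is a Python analogue. It is not the repository's with-env.sh, which is a shell script. The variable names in the sample are made up, and the parser handles only plain KEY=VALUE lines.

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def exec_with_env(argv: list[str], env_path: str = ".env") -> None:
    """Merge .env into the current environment, then become the real command."""
    with open(env_path, encoding="utf-8") as fh:
        merged = {**os.environ, **parse_env(fh.read())}
    os.execvpe(argv[0], argv, merged)  # replaces this process; does not return


# Demo of the parsing half only (hypothetical variable names).
sample = 'EXAMPLE_API_KEY="abc123"\n# a comment\n\nEXAMPLE_EMAIL=me@example.org\n'
print(parse_env(sample))
```

Because the shim replaces itself with the real command, the MCP client still talks to a single process and the user's shell configuration stays untouched.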

A literature search from the user is just a sentence. The skill responsible for routing is sci-literature-research:

> sweep recent literature on ribosomal frameshifting in viral
  translation, prefer mechanism-level papers from the last three years,
  give me a summary plus BibTeX

The skill picks the routing rule (biomedical: both servers), runs parallel queries, deduplicates, ranks by recency and citation, summarizes each top hit against the researcher's profile (so a translation biologist gets different highlights than a structural biologist), and emits a BibTeX file ready to drop into a manuscript.

Hypothesis generation under an adversarial council

Pattern-recognition skills make hypotheses cheap. The harder problem is filtering them to the small subset that is mechanistic, falsifiable, and not already in the literature. Organon handles this with two skills working together.

sci-hypothesis takes an observed pattern, a research question, or a partial dataset, and proposes a small set of mechanistic explanations along with falsifiable follow-up experiments and the statistical power needed to detect each one. sci-council then runs an adversarial review across multiple personas in parallel.

The council pattern is borrowed from how senior researchers actually run early-stage idea review: distinct experts, working independently, with their conclusions surfaced before any consensus is forced. Three general mathematical personas (Gauss for algebraic-structural angles, Erdős for probabilistic and extremal angles, Tao for harmonic and arithmetic-combinatorial angles) form the spine. Two or three domain-specific expert personas join based on the topic (a structural biologist for protein design, a statistical geneticist for GWAS, a kinetics specialist for reaction mechanisms). Each persona reads only the prompt and its assigned literature window; they cannot see each other's conclusions.

> we observe a U-shaped dose-response in the cytokine assay between
  10 nM and 1 μM. propose three mechanistic hypotheses, run them
  past the council, and design a falsification experiment for each.

The output is structured: convergent claims (where personas agree) versus divergent claims (where they disagree, often the more interesting axis), with each persona's reasoning attributed and each falsifiable test mapped to a follow-up study design. Divergence is the load-bearing signal; consensus tells you the obvious has been seen, and disagreement tells you where the real research lives.
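
The convergent-versus-divergent split can be illustrated with plain set operations, assuming each persona's conclusions have already been normalized to shared claim labels. That normalization is the hard part and is not shown here. The labels H1 to H3 and the `split_claims` function are placeholders.

```python
def split_claims(
    by_persona: dict[str, set[str]],
) -> tuple[set[str], dict[str, set[str]]]:
    """Return (claims every persona made, remaining claim -> personas who made it)."""
    all_sets = list(by_persona.values())
    convergent = set.intersection(*all_sets) if all_sets else set()
    divergent: dict[str, set[str]] = {}
    for persona, claims in by_persona.items():
        for claim in claims - convergent:
            divergent.setdefault(claim, set()).add(persona)
    return convergent, divergent


reviews = {
    "Gauss": {"H1", "H2"},
    "Erdős": {"H1", "H3"},
    "Tao": {"H1", "H2"},
}
convergent, divergent = split_claims(reviews)
print(convergent)
print(divergent)
```

Attributing each divergent claim to the personas that raised it keeps the disagreement traceable, which matches the attributed output the text describes.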

Data analysis with assumption checks

sci-data-analysis loads CSV, Excel, or Parquet, runs the requested test, and reports effect sizes alongside p-values with confidence intervals. It checks assumptions automatically: normality via Shapiro-Wilk, equal variance via Levene, sample-size sufficiency for the requested test, and presence of outliers via robust z-scoring. If an assumption fails, the skill recommends the appropriate non-parametric or robust alternative instead of silently running the requested test.

> load data/expression.csv, compare control and treated groups on
  the gene_z column, plot the result with the field's preferred style.

The plot style is read from research_context/research-preferences.md. A scientist who prefers violin plots over box plots gets violins by default. A scientist whose field uses passive voice in figure captions gets passive captions. The default is field-appropriate, and the per-skill learnings journal records the corrections the user makes so future runs converge on the right defaults without re-asking.
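
Two of the checks named in this section, robust z-scoring for outliers and effect-size reporting, need only the standard library. The sketch below uses the common modified z-score (median and MAD, scaled by 0.6745) and Cohen's d with a pooled standard deviation. Whether the skill uses exactly these formulas, or the 3.5 cutoff shown in the demo, is an assumption. Shapiro-Wilk and Levene tests are available in SciPy and are omitted here.

```python
import statistics


def robust_z(values: list[float]) -> list[float]:
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]  # MAD of zero: scores undefined, return zeros
    return [0.6745 * (v - med) / mad for v in values]


def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardized mean difference using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


# Made-up numbers for the demo.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [2.0, 3.0, 4.0, 5.0, 6.0]
print(round(cohens_d(treated, control), 4))
flags = [abs(z) > 3.5 for z in robust_z(control + [40.0])]
print(flags)
```

Median-based scores are used for the outlier check because a single extreme value inflates the ordinary standard deviation and can mask itself.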

Figure generation across four primitives

Four visualization skills cover the spread from architecture diagrams to publication-quality figures.

viz-diagram-code renders Mermaid for flowcharts, sequence diagrams, architecture diagrams, mind maps, and timelines. The output is SVG plus PNG with precise text rendering.
viz-excalidraw-diagram produces hand-drawn-style architecture and workflow diagrams in the Excalidraw JSON format, suited to whiteboard-style explanations and concept slides.
viz-nano-banana generates illustrations through Google's Gemini 3 Pro Image, in six styles ranging from publication-quality scientific (signaling pathways, cell diagrams, molecular mechanisms) to editorial color (infographics, lay-summary visuals) to monochrome (clean technical figures).
viz-presentation converts markdown into slide decks via Marp, supporting LaTeX equations, syntax-highlighted code, inline Mermaid, and speaker notes.

Each visualization skill follows the same gate pattern as the writing skills: it confirms the user's style choice before generating, learns from past selections, and records new preferences into the learnings journal so the suggestion narrows over time.

Writing with a four-agent verification pipeline

Manuscript drafting in sci-writing runs as a four-agent pipeline with strict context isolation.

sci-researcher → sci-writer → sci-verifier → sci-reviewer

sci-researcher builds a numbered evidence table from the literature MCPs, generates a .bib, and produces a quotes sidecar with source-anchored candidate passages. sci-writer drafts the section using only the evidence table; every claim must use a [@Key] citation marker that maps to the bibliography. The writer never sees the verifier's rules.

sci-verifier runs mechanical checks (DOI validation against CrossRef, citation marker syntax, hedging analysis, statistical reporting completeness) followed by a semantic pass that checks whether each quote actually appears in the cited paper at the cited line. It reads the draft cold, with no memory of the writing process. sci-reviewer then runs an adversarial review and produces FATAL, MAJOR, and MINOR findings with inline annotations.
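
The simplest of the mechanical checks, that every [@Key] marker resolves to a bibliography entry, can be sketched with regular expressions. This is an illustration, not the verifier's implementation: the patterns cover only basic BibTeX entries and bracketed markers, and the sample draft and keys are invented.

```python
import re

BIB_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")   # @article{key,
BRACKET = re.compile(r"\[([^\]]*@[^\]]*)\]")          # [@a; @b]
CITE_KEY = re.compile(r"@([A-Za-z0-9_][\w:\-]*)")     # key inside the brackets


def undefined_citations(draft: str, bibtex: str) -> list[str]:
    """Return cited keys that have no matching BibTeX entry, in order of use."""
    known = set(BIB_KEY.findall(bibtex))
    missing: list[str] = []
    for group in BRACKET.findall(draft):
        for key in CITE_KEY.findall(group):
            if key not in known and key not in missing:
                missing.append(key)
    return missing


bib = "@article{smith2020, title={A}}\n@misc{lee2021, title={B}}"
draft = "First claim [@smith2020; @lee2021]. Second claim [@ghost1999]."
print(undefined_citations(draft, bib))  # ['ghost1999']
```

A check like this catches only keys that do not exist. Whether a quoted passage really appears in the cited paper is the job of the semantic pass described above.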

Context isolation is the load-bearing engineering decision here. When one agent both writes and verifies in the same context window, it tends to hedge just enough to pass its own checks rather than being genuinely accurate. Separate agents force honest evaluation. The verifier is hard to game because it never sees the writing prompt.

A PreToolUse hook intercepts every file write to manuscript directories and blocks the save if the verifier reports a fabricated citation or unsupported claim. Even if an upstream pipeline stage has a bug, a fabrication the verifier can detect does not reach disk. The hook is small enough to read in one screen:

# .claude/hooks/verify_gate.py (sketch)
def on_pretooluse(event):
    if event.tool != "Write" or "manuscripts/" not in event.path:
        return ALLOW
    findings = run_verifier(event.content, project_bib(event.path))
    if findings.has_fatal():
        return DENY(reason=findings.summary())
    return ALLOW
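
A runnable variant of the sketch above might look like the following. It rests on assumptions worth checking against the current Claude Code hooks documentation: that the hook command receives a JSON event on stdin with tool_name and tool_input fields, and that exit code 2 blocks the tool call and returns stderr to the agent. The unknown-key check is a stand-in for the real verifier, and `decide` and the sample key set are hypothetical.

```python
import json
import re
import sys

MARKER = re.compile(r"\[@([\w:\-]+)\]")


def decide(event: dict, known_keys: set[str]) -> tuple[int, str]:
    """Return (exit_code, message); exit code 2 is assumed to block the write."""
    tool_input = event.get("tool_input", {})
    path = tool_input.get("file_path", "")
    if event.get("tool_name") != "Write" or "manuscripts/" not in path:
        return 0, ""
    cited = set(MARKER.findall(tool_input.get("content", "")))
    unknown = sorted(cited - known_keys)
    if unknown:
        return 2, "unknown citation keys: " + ", ".join(unknown)
    return 0, ""


def main() -> None:
    """Entry point when installed as a hook command (not called in this demo)."""
    code, message = decide(json.load(sys.stdin), known_keys={"smith2020"})
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)


event = {
    "tool_name": "Write",
    "tool_input": {
        "file_path": "manuscripts/intro.md",
        "content": "Claim one [@smith2020]. Claim two [@ghost1999].",
    },
}
print(decide(event, known_keys={"smith2020"}))
```

Keeping the decision in a pure function makes the gate testable without a live agent session.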

A second pipeline, sci-auditor, runs the verification-plus-review pass on outputs from the science-communication skill (blog posts, tutorials, lay summaries) so non-academic prose receives the same citation discipline as a manuscript section.

Dissemination across three platforms

sci-communication adapts the same evidence into seven formats: blog posts, tutorials, concept explainers, lay summaries, social threads, newsletters, and press releases. Each format follows a tested methodology with field-appropriate structure. The same evidence table that produced a manuscript introduction can produce a blog post, a Twitter thread, and a lay summary for a patient-facing newsletter, with the appropriate voice and depth for each audience.

After drafting, three integrations route deliverables to where collaborators will see them.

tool-substack converts the markdown into Substack's editor schema, uploads images to the platform's CDN, pre-renders Mermaid diagrams to PNG, and creates a draft. Publishing remains a human click; the integration intentionally does not auto-publish.
tool-gdrive stages binary deliverables (figures, decks, datasets, manuscripts) into a Google Drive sync folder so collaborators with read access pick them up without further sharing.
tool-obsidian writes knowledge artifacts (paper summaries, experiment designs, research notes) into a local Obsidian vault with proper frontmatter, tags, and wiki-links, so the user's personal knowledge graph picks them up.

A humanizer pass runs on publishable text before any of these integrations fire. The pass strips characteristic AI tells (em-dash overuse, formulaic transitions, hedging stack-up) and is offered as a confirmable gate rather than applied silently.

Self-evolution: how learning compounds

The architecture is designed so corrections compound across sessions, not just within one. The mechanism is a flywheel between attempts, observations, learnings, patterns, and skills.

Attempts are whatever the researcher is doing. Every attempt is logged as a session block in the daily memory file with goal, deliverables, decisions, and open threads.

Observations are what the attempt produced, especially where expectation and reality diverged. "The polish step was expected to be a near no-op at the float64 floor, not a 10x improvement." "Subagent truncation is a recurring failure mode when the final return message carries both an evaluator dict and a JSON summary."

Learnings are observations re-expressed as advice for future-you. They go into context/learnings.md under the specific skill section (or under General if cross-cutting) with a Why: line and a How to apply: line. The file is append-only; entries never disappear. Skills read only their own section at invocation time.

Patterns emerge when a learnings entry recurs across three or more sessions. At session close, the wrap-up skill scans for such recurrences and proposes crystallizing the pattern into a new skill. The arena-patterns library and the optimization-recipes catalog are concrete instances: each pattern is a named recipe (cross-resolution transfer, k-climbing, Dinkelbach fractional program, Remez exchange) with its own trigger conditions and applicability rules.
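
The three-session recurrence rule is easy to state in code, assuming each learnings entry has been tagged with a session identifier and a stable key. The journal itself is free-form markdown, so that tagging is an assumption, as are the function name and sample keys.

```python
from collections import defaultdict


def recurring(entries: list[tuple[str, str]], min_sessions: int = 3) -> list[str]:
    """entries are (session_id, learning_key) pairs.

    Return keys that appear in at least min_sessions distinct sessions.
    """
    sessions: dict[str, set[str]] = defaultdict(set)
    for session_id, key in entries:
        sessions[key].add(session_id)
    return sorted(k for k, seen in sessions.items() if len(seen) >= min_sessions)


log = [
    ("s1", "subagent-truncation"),
    ("s2", "subagent-truncation"),
    ("s3", "subagent-truncation"),
    ("s3", "polish-step"),
    ("s3", "polish-step"),  # repeated within one session: counts once
]
print(recurring(log))  # ['subagent-truncation']
```

Counting distinct sessions rather than raw mentions keeps one noisy session from promoting a pattern on its own.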

Skills are the terminal state of a pattern. A new skill folder is scaffolded, reviewed by the user, and registered automatically into the skill registry, the context matrix, and the relevant learnings section.

The reconciliation rule that runs at session start is what closes the loop. A new skill on disk gets picked up silently. A skill removed from disk triggers a confirmation prompt rather than a silent edit. The framework keeps itself consistent on every wake, and the researcher can focus on the science.

Engineering details that matter for reliability

A few non-obvious pieces of engineering account for most of the reliability gap between a demo and a system you can run on real research.

Context isolation between agents in a pipeline. Separate writer, verifier, and reviewer agents are not a stylistic choice; they are how you keep the verifier honest. A single agent self-verifying drifts toward language that passes its own checks, and that drift compounds. The four-agent split is mechanically simple and the failure mode is loud rather than silent.

Hooks at write and session boundaries. Both PreToolUse (citation gate) and SessionStart (heartbeat) are intercepts run by the harness outside the prompt, not in-prompt instructions. The former blocks fabricated citations from reaching disk independently of whether the writer agent caught them. The latter loads identity and research context deterministically every session, so the agent never starts cold with a researcher it has already worked with.

Federated MCPs with credential shims. Multiple specialized servers that each speak the protocol cleanly, with a small with-env.sh shim that sources credentials from .env and execs the real command, beat one monolithic search wrapper. Each server can be restarted, swapped, or extended independently.

Adversarial council with parallel fan-out. Three general personas plus domain experts each work in isolation and finish within a wall-clock window before any synthesis runs. The synthesis step deliberately surfaces divergence rather than forcing consensus. Divergent claims are where new research lives.

Reconciliation rules at session start. The framework checks its own state every wake: skills present on disk versus listed in the registry, learnings sections versus installed skills, MCP servers versus credentials, cron jobs versus the dispatcher. New skills are integrated silently. Removals require confirmation. Drift surfaces immediately rather than weeks later.

Stress test: the Einstein Arena

The architecture above is built for daily research. To test whether it generalizes outside its design domain, the same stack was pointed at the Einstein Arena, a public benchmark of nineteen open mathematical construction problems adapted from the companion paper to AlphaEvolve. Each Arena problem ships with a numerical verifier, a public leaderboard, and a minImprovement threshold that a candidate must clear to claim a new top rank.

A sealed-sandbox ablation on the Arena's Prime Number Theorem problem isolates the contribution of the orchestration layer over the base model. Raw Claude Opus 4.7 in a Claude Code CLI sandbox received only the verbatim problem statement, the exact scorer, a Bash shell, and pre-installed numpy and scipy. Every Organon scaffold was withheld: no skills, no council, no memory, no learnings journal, no Arena API access, no web access, no human steering.

Three sandbox conditions ran against the same model. The strict T0 condition (no Bash, one-shot reasoning) produced no solution within the per-message API ceiling. The capped T0 retry produced a Möbius-on-squarefree certificate that is correct in the limit but violates the finite-N constraint and scored negative infinity. The iterative T1 condition with Bash and a 10-iteration budget reached a final score of 0.99283 across five hours before the rate limit terminated the run. The cross-session Organon run on the same problem reached a final score of 0.99490 in under four hours.

The 2.07e-3 gap is not "the model versus the framework"; the framework is the model plus scaffolding plus human intuition and judgement. What the gap measures is the cumulative contribution of skill composition, cross-session memory, the council, and a human at each decision gate. It is approximately forty times the margin separating the cross-session run from the next public agent on the same problem. The full protocol, with three sandbox conditions and an honest set of caveats around training-data contamination and rate-limit termination, is documented in the project's longer writeup.
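
The gap quoted above follows directly from the two final scores. The implied margin to the next public agent is a back-of-envelope figure derived here from the phrase "approximately forty times", so it is only as precise as that phrase and is not a reported result.

```python
organon = 0.99490  # cross-session Organon run, final score
t1_raw = 0.99283   # iterative T1 sandbox run, final score

gap = organon - t1_raw
print(f"{gap:.2e}")  # 2.07e-03

# Rough margin to the next public agent, if the gap is about 40x that margin.
implied_margin = gap / 40
print(f"{implied_margin:.1e}")
```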

The broader Arena results are a portfolio rather than a headline: the same skill stack used for daily research holds up on a domain it was not designed for, with multiple top-rank submissions accepted to the public board and several others falling below the per-problem minImprovement gate by margins five to seven orders of magnitude smaller than the threshold. The point is not the leaderboard. The point is that an agent architecture built around composable skills, persistent state, and human-checkpointed gates produced verifiable mathematical artifacts on a benchmark adjacent to AlphaEvolve's evaluation surface, using a single base model and zero benchmark-specific code.

Try it yourself

Organon is MIT-licensed and runs on macOS, Linux, and Windows.

git clone https://github.com/krmdel/organon.git
cd organon
bash scripts/install.sh
claude

The installer handles prerequisites, scientific Python, skill dependencies, MCP servers, and the cron dispatcher in one shot. The first session triggers an interactive onboarding that builds the research profile from the ground up: drop papers, manuscripts, notebooks, datasets, or reference lists into the research_artifacts/ folder; paste links to ORCID, Google Scholar, lab page, or GitHub; answer four short questions about field, active questions, statistical preferences, and writing conventions. Every downstream skill reads the resulting profile.

The Einstein Arena skills (tool-arena-attack-problem, tool-arena-runner, tool-einstein-arena) ship with the framework but are entirely optional. Uninstall any skill with one command:

bash scripts/remove-skill.sh tool-arena-attack-problem

Skills can also be added back, listed, or composed across multiple research clients (each with its own profile and memory) from the same install. The full skill registry, install scripts, MCP configurations, and hooks are in the repository.

Limitations and open questions

The architecture is several months old and has known gaps worth flagging.

Single-base-model dependence. Every skill at the time of writing runs through Claude Code with Opus 4.7. The skill abstraction is base-model-agnostic in principle (skills are markdown plus scripts plus YAML), but the routing layer and the council are tested only against this one base. Cross-model portability is an open engineering question.

Plain-text learnings journal. At the current scale (hundreds of entries) an append-only markdown file is fine. At ten thousand or a hundred thousand entries it is not. A hierarchical or vector-indexed learnings store will be required before that point.

Single-axis ablation. The sealed sandbox above held out four things at once: skills, council, memory, and the human in the loop. A four-cell ablation with one held out at a time would isolate which contributes most. That experiment is on the roadmap and will be reported separately.

Reproducibility on a budget. The PNT ablation saturated its time cap, not its dollar cap. A budget-matched T1 replication with substantially more inference compute remains an open experiment.

These are limitations of the current implementation, not of the architecture. The three-layer split, the routing cascade, the council pattern, and the verification hooks are designed to transfer to other base models and other research domains without structural change, although, as noted above, that portability has not yet been tested.

Citation

@misc{delikoyun2026organon,
  title  = {Organon: a skill-first agent architecture for end-to-end
            scientific workflows},
  author = {Delikoyun, Kerem},
  year   = {2026},
  url    = {https://github.com/krmdel/organon}
}

Acknowledgements and resources

The experiments in this post used Claude Opus 4.7 on Claude Code as the base model. The Einstein Arena benchmark was constructed by the Arena team drawing on Tao and colleagues' companion paper to AlphaEvolve. ToolUniverse is the open-source biomedical tool catalog from Marinka Zitnik's lab at Harvard Medical School. Paperclip is the full-text biomedical MCP server used for corpus-wide grep and figure analysis. The comparison points referenced in the architecture section (AlphaEvolve, FunSearch, AI-Scientist and v2, ReAct, Toolformer, Voyager) are points of reference against which Organon is positioned, not systems against which it competes.

Source code, full documentation, and the longer writeup with the complete Arena portfolio are at github.com/krmdel/organon.
