Agent-Tool-State Contract Registry

Status: Canonical Source of Truth
Last Updated: 2025-12-06
Purpose: Developer reference for multi-agent coordination

This document defines the exact contracts between agents, tools, and shared state. Use this when:

Adding new agents or tools
Modifying agent behavior
Debugging coordination issues
Understanding "if I change X, what breaks?"

System Overview
Agent Contracts
Judge Decision Criteria
Shared State (ResearchMemory)
Tool Contracts
Event Flow
Break Conditions
Dependency Matrix

System Overview

┌─────────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR (AdvancedOrchestrator)               │
│                                                                      │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐               │
│  │   Manager   │──▶│   Agents    │──▶│   Memory    │               │
│  │  (Magentic) │   │ (ChatAgent) │   │(ResearchMem)│               │
│  └─────────────┘   └─────────────┘   └─────────────┘               │
│         │                │                   │                      │
│         │                ▼                   ▼                      │
│         │         ┌─────────────┐   ┌─────────────┐                │
│         └────────▶│    Tools    │──▶│  Embeddings │                │
│                   │(@ai_function)│   │  (ChromaDB) │                │
│                   └─────────────┘   └─────────────┘                │
└─────────────────────────────────────────────────────────────────────┘

Agent Inventory

Agent	File	Role	Tools
SearchAgent	`magentic_agents.py`	Evidence gathering	search_pubmed, search_clinical_trials, search_preprints
JudgeAgent	`magentic_agents.py`	Evidence evaluation	None (LLM only)
HypothesisAgent	`magentic_agents.py`	Mechanism generation	None (LLM only)
ReportAgent	`magentic_agents.py`	Report synthesis	get_bibliography
RetrievalAgent	`retrieval_agent.py`	Web search	search_web

⚠️ Dead Code Warning: RetrievalAgent is implemented but NOT wired into magentic_agents.py. The orchestrator uses only SearchAgent (PubMed, ClinicalTrials.gov, Europe PMC), not web search. See GitHub issue #134 for the decision on whether to delete it or wire it in.

Agent Contracts

SearchAgent

Factory: create_search_agent(chat_client, domain, api_key) -> ChatAgent

Input

# Manager instruction (string)
"Search for testosterone and libido mechanisms in peer-reviewed literature"

Output

# ChatMessage with:
message.text = """
Found 15 sources (12 new added to context):
- [Title 1](url): Abstract excerpt...
- [Title 2](url): Abstract excerpt...
"""
message.additional_properties = {
    "evidence": [Evidence.model_dump(), ...]
}

State Access

Operation	Key	Type	Description
READ	`memory.query`	str	Current research question
READ	`memory.evidence_ids`	list[str]	Existing evidence URLs
WRITE	`memory._evidence_cache`	dict[str, Evidence]	Caches Evidence objects
WRITE	`memory.evidence_ids`	list[str]	Appends new URLs
WRITE	`embedding_service`	VectorDB	Stores embeddings

Side Effects

Calls external APIs (PubMed, ClinicalTrials, Europe PMC)
Deduplicates via semantic similarity (0.9 threshold)
Stores in vector database
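
The similarity-based deduplication listed above can be sketched roughly as follows. This is an illustration only, not the project's implementation: `cosine_similarity`, `is_duplicate`, the plain-list vectors, and the `>=` comparison are assumptions; only the 0.9 threshold comes from this document.

```python
import math

DEDUP_THRESHOLD = 0.9  # documented value; everything else here is illustrative


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(
    candidate: list[float],
    stored: list[list[float]],
    threshold: float = DEDUP_THRESHOLD,
) -> bool:
    """Hypothetical helper: True if any stored embedding is at least `threshold` similar."""
    return any(cosine_similarity(candidate, vec) >= threshold for vec in stored)
```

In practice the vector database would perform the nearest-neighbour lookup; the linear scan here only makes the threshold check visible.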

Error Behavior

API failure → Returns "No results found for: {query}"
Rate limit → Raises RateLimitError (caught by orchestrator)

JudgeAgent

Factory: create_judge_agent(chat_client, domain, api_key) -> ChatAgent

Input

# Manager instruction with evidence context
"Evaluate if we have sufficient evidence to answer: {query}"
# + Evidence list in context

Output

# ChatMessage with:
message.text = """
## Assessment
✅ SUFFICIENT EVIDENCE (confidence: 85%). STOP SEARCHING.

### Scores
- Mechanism: 8/10
- Clinical: 7/10

### Reasoning
Strong evidence for testosterone-AR pathway...
"""
message.additional_properties = {
    "assessment": JudgeAssessment.model_dump()
}

State Access

Operation	Key	Type	Description
READ	Evidence from context	list[Evidence]	Passed by Manager
WRITE	None	-	Read-only evaluation

Side Effects

None (pure evaluation)

Critical Output Signal

"✅ SUFFICIENT EVIDENCE" → Manager delegates to ReportAgent
"❌ INSUFFICIENT" → Manager calls SearchAgent with suggested queries
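
Because these signals are matched as text, the order of the checks matters: the string "INSUFFICIENT EVIDENCE" contains "SUFFICIENT EVIDENCE", so a plain substring test for the positive signal would also fire on a negative verdict phrased that way. A minimal sketch of a safer parser, assuming a hypothetical helper name `parse_judge_signal` (not part of the codebase as documented here):

```python
from typing import Literal

Signal = Literal["sufficient", "insufficient", "unknown"]


def parse_judge_signal(text: str) -> Signal:
    """Hypothetical helper: classify the Judge's text output."""
    upper = text.upper()
    # Test the negative marker first, since "INSUFFICIENT" contains "SUFFICIENT".
    if "INSUFFICIENT" in upper:
        return "insufficient"
    if "SUFFICIENT EVIDENCE" in upper:
        return "sufficient"
    return "unknown"
```

Where the structured `assessment` payload is available, reading `sufficient` or `recommendation` from it avoids text matching altogether.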

HypothesisAgent

Factory: create_hypothesis_agent(chat_client, domain, api_key) -> ChatAgent

Input

# Manager instruction
"Generate mechanistic hypotheses for: {query}"

Output

# ChatMessage with:
message.text = """
## Hypothesis 1 (Confidence: 75%)
**Mechanism**: Testosterone → Androgen Receptor → BDNF → Libido
**Suggested searches**: testosterone BDNF, androgen receptor signaling

## Primary Hypothesis
Testosterone → AR → dopamine release → reward pathway

## Knowledge Gaps
- Dose-response relationship unclear
"""
message.additional_properties = {
    "assessment": HypothesisAssessment.model_dump()
}

State Access

Operation	Key	Type	Description
READ	`memory.query`	str	Research question
READ	Evidence from context	list[Evidence]	Current evidence
WRITE	`evidence_store["hypotheses"]`	list	Appends hypotheses

ReportAgent

Factory: create_report_agent(chat_client, domain, api_key) -> ChatAgent

Input

# Manager instruction
"Generate final research report for: {query}"

Output

# ChatMessage with:
message.text = ResearchReport.to_markdown()  # Full markdown report
message.additional_properties = {
    "report": ResearchReport.model_dump()
}

State Access

Operation	Key	Type	Description
READ	`memory.get_all_evidence()`	list[Evidence]	All collected evidence
READ	`evidence_store["hypotheses"]`	list	Generated hypotheses
READ	`evidence_store["last_assessment"]`	JudgeAssessment	Final assessment
WRITE	`evidence_store["final_report"]`	ResearchReport	Stores report

Tool: get_bibliography()

@ai_function
def get_bibliography() -> str:
    """Returns formatted reference list from all evidence."""
    evidence = state.memory.get_all_evidence()
    return format_as_references(evidence)

Judge Decision Criteria

Scoring Dimensions

Mechanism Score (0-10)

Score	Meaning
0-3	Minimal mechanism understanding
4-5	Partial mechanism (some targets identified)
6-7	Clear mechanism (targets + pathways)
8-9	Comprehensive (multiple pathways, regulation)
10	Complete understanding

Clinical Evidence Score (0-10)

Score	Meaning
0-3	Preclinical only or weak human evidence
4-5	Some human evidence (small trials, case reports)
6-7	Strong human evidence (RCTs)
8-9	Robust (meta-analysis, large RCTs)
10	Definitive clinical proof

Sufficiency Decision

# SUFFICIENT (recommendation="synthesize")
if (
    confidence >= 0.7  # 70%
    and mechanism_score >= 6
    and clinical_evidence_score >= 6
):
    sufficient = True
    recommendation = "synthesize"

# INSUFFICIENT (recommendation="continue")
else:
    sufficient = False
    recommendation = "continue"
    next_search_queries = ["suggested query 1", "suggested query 2"]
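
The rule above can be wrapped in a small function to make it testable in isolation. This is a sketch under assumptions: the function name `decide_sufficiency` and the dict return shape are illustrative; the three thresholds are the ones documented here.

```python
from typing import Optional


def decide_sufficiency(
    confidence: float,
    mechanism_score: int,
    clinical_evidence_score: int,
    suggested_queries: Optional[list[str]] = None,
) -> dict:
    """Apply the three documented thresholds and return the decision fields."""
    sufficient = (
        confidence >= 0.7
        and mechanism_score >= 6
        and clinical_evidence_score >= 6
    )
    return {
        "sufficient": sufficient,
        "recommendation": "synthesize" if sufficient else "continue",
        # Follow-up queries only matter when the loop continues.
        "next_search_queries": [] if sufficient else list(suggested_queries or []),
    }
```

All three conditions must hold, so a high confidence value cannot compensate for a clinical score of 5.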

JudgeAssessment Model

from typing import Literal

from pydantic import BaseModel


class AssessmentDetails(BaseModel):
    mechanism_score: int              # 0-10
    mechanism_reasoning: str          # min 10 chars
    clinical_evidence_score: int      # 0-10
    clinical_reasoning: str           # min 10 chars
    drug_candidates: list[str]
    key_findings: list[str]


class JudgeAssessment(BaseModel):
    details: AssessmentDetails        # Nested scores and findings
    sufficient: bool                  # Ready for synthesis?
    confidence: float                 # 0.0-1.0
    recommendation: Literal["continue", "synthesize"]
    next_search_queries: list[str]    # If continue
    reasoning: str                    # min 20 chars

Shared State (ResearchMemory)

Initialization

# Per-query isolation via ContextVar
state = init_magentic_state(query, embedding_service)
# Returns MagenticState wrapping ResearchMemory

Memory Structure

class ResearchMemory:
    query: str                              # Research question
    hypotheses: list[Hypothesis]            # Generated hypotheses
    conflicts: list[Conflict]               # Detected conflicts
    evidence_ids: list[str]                 # URLs (unique keys)
    _evidence_cache: dict[str, Evidence]    # URL -> Evidence
    iteration_count: int                    # Current iteration
    _embedding_service: EmbeddingServiceProtocol

Key Methods

Method	Returns	Description
`store_evidence(evidence)`	`list[str]`	Store with dedup, return new IDs
`get_all_evidence()`	`list[Evidence]`	All accumulated evidence
`get_relevant_evidence(n)`	`list[Evidence]`	Top N by semantic similarity
`get_context_summary()`	`str`	Markdown summary for fallback
`add_hypothesis(h)`	`None`	Append hypothesis
`get_confirmed_hypotheses()`	`list[Hypothesis]`	Confidence > 0.8
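
The method contracts in the table can be illustrated with a simplified in-memory model. Everything here is a sketch: `DemoMemory`, `DemoEvidence`, and `DemoHypothesis` are hypothetical, the methods are synchronous, and deduplication is by exact URL only, whereas the real memory also deduplicates by semantic similarity.

```python
from dataclasses import dataclass, field


@dataclass
class DemoEvidence:
    url: str
    title: str


@dataclass
class DemoHypothesis:
    statement: str
    confidence: float


@dataclass
class DemoMemory:
    query: str
    evidence_ids: list[str] = field(default_factory=list)
    hypotheses: list[DemoHypothesis] = field(default_factory=list)
    _evidence_cache: dict[str, DemoEvidence] = field(default_factory=dict)

    def store_evidence(self, evidence: list[DemoEvidence]) -> list[str]:
        """Store unseen items and return only the newly added IDs (URLs)."""
        new_ids: list[str] = []
        for item in evidence:
            if item.url in self._evidence_cache:
                continue  # already known: skip
            self._evidence_cache[item.url] = item
            self.evidence_ids.append(item.url)
            new_ids.append(item.url)
        return new_ids

    def get_all_evidence(self) -> list[DemoEvidence]:
        return [self._evidence_cache[url] for url in self.evidence_ids]

    def add_hypothesis(self, hypothesis: DemoHypothesis) -> None:
        self.hypotheses.append(hypothesis)

    def get_confirmed_hypotheses(self) -> list[DemoHypothesis]:
        # Strictly greater than 0.8, matching the table above.
        return [h for h in self.hypotheses if h.confidence > 0.8]
```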

State Flow

User Query
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  ResearchMemory initialized (empty)                          │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
SearchAgent ──▶ store_evidence([Evidence]) ──▶ evidence_ids grows
    │
    ▼
JudgeAgent ──▶ reads evidence from context ──▶ returns assessment
    │
    ├─── INSUFFICIENT ──▶ SearchAgent (with next_search_queries)
    │
    └─── SUFFICIENT ──▶ ReportAgent
                              │
                              ▼
                       get_all_evidence() ──▶ ResearchReport

Tool Contracts

search_pubmed

File: src/agents/tools.py

@ai_function
async def search_pubmed(query: str, max_results: int = 10) -> str:
    """Search PubMed for biomedical research papers."""

Aspect	Value
External API	NCBI E-utilities
Rate Limit	3/sec (10/sec with NCBI_API_KEY)
Output	Formatted string with titles/abstracts
Side Effect	Stores Evidence in memory
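
One way to respect the rate limit in the table is a minimum-interval limiter shared by all PubMed calls. This is a sketch, not the tool's actual implementation: `MinIntervalLimiter` and `pubmed_rate` are hypothetical names; only the 3/sec and 10/sec figures and the `NCBI_API_KEY` variable come from this document.

```python
import asyncio
import os
import time
from typing import Mapping, Optional


class MinIntervalLimiter:
    """Hypothetical limiter: spaces calls at least 1/rate seconds apart."""

    def __init__(self, requests_per_second: float) -> None:
        self._interval = 1.0 / requests_per_second
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                await asyncio.sleep(delay)
                now = time.monotonic()
            self._next_allowed = now + self._interval


def pubmed_rate(env: Optional[Mapping[str, str]] = None) -> float:
    """Return 10 requests/sec when NCBI_API_KEY is set, else 3."""
    env = os.environ if env is None else env
    return 10.0 if env.get("NCBI_API_KEY") else 3.0
```

A caller would create one limiter inside the running event loop, for example `MinIntervalLimiter(pubmed_rate())`, and `await limiter.wait()` before each request.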

search_clinical_trials

@ai_function
async def search_clinical_trials(query: str, max_results: int = 10) -> str:
    """Search ClinicalTrials.gov for clinical studies."""

Aspect	Value
External API	ClinicalTrials.gov (uses `requests` not httpx)
Rate Limit	Standard HTTP limits
Output	Trial status, conditions, interventions
Side Effect	Stores Evidence in memory

search_preprints

@ai_function
async def search_preprints(query: str, max_results: int = 10) -> str:
    """Search Europe PMC for preprints and papers."""

Aspect	Value
External API	Europe PMC REST API
Output	Papers with PMIDs, DOIs
Side Effect	Stores Evidence in memory

get_bibliography

@ai_function
def get_bibliography() -> str:
    """Get formatted reference list from all collected evidence."""

Aspect	Value
External API	None
Reads	`memory.get_all_evidence()`
Output	Numbered reference list

search_web

@ai_function
async def search_web(query: str, max_results: int = 10) -> str:
    """Search web using DuckDuckGo."""

Aspect	Value
External API	DuckDuckGo
Output	Web results with URLs
Side Effect	Stores Evidence in memory

Event Flow

AgentEvent Types

Type	When Emitted	Data
`started`	Workflow begins	None
`thinking`	Before first agent event	None
`searching`	SearchAgent active	agent_id
`search_complete`	SearchAgent done	evidence count
`judging`	JudgeAgent active	agent_id
`judge_complete`	JudgeAgent done	assessment
`hypothesizing`	HypothesisAgent active	agent_id
`synthesizing`	ReportAgent active	agent_id
`streaming`	Real-time text	text, agent_id
`complete`	Workflow done	report, iterations
`error`	Error occurred	error message
`progress`	Status update	status message
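
The event types in the table could be modelled as a closed set so that a typo in a type name fails fast. A minimal sketch, assuming a hypothetical `DemoEvent` dataclass; the real `AgentEvent` class and its field names may differ.

```python
from dataclasses import dataclass
from typing import Any, Literal, get_args

EventType = Literal[
    "started", "thinking", "searching", "search_complete",
    "judging", "judge_complete", "hypothesizing", "synthesizing",
    "streaming", "complete", "error", "progress",
]


@dataclass(frozen=True)
class DemoEvent:
    """Hypothetical stand-in for AgentEvent."""
    type: EventType
    message: str = ""
    data: Any = None

    def __post_init__(self) -> None:
        # Literal is only checked statically, so validate at runtime too.
        if self.type not in get_args(EventType):
            raise ValueError(f"Unknown event type: {self.type!r}")
```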

Typical Sequence

1. started → "Starting research..."
2. progress → "Loading embedding service..."
3. thinking → "Multi-agent reasoning..."
4. streaming (searcher) → "Found 15 sources..."
5. streaming (judge) → "✅ SUFFICIENT..."
6. streaming (reporter) → "## Research Report..."
7. complete → Final report

Break Conditions

The orchestrator exits when ANY of these occur:

1. Judge Approval ✅

if "SUFFICIENT EVIDENCE" in judge_response:
    # Manager delegates to ReportAgent
    # ReportAgent completes → Workflow ends

2. Max Rounds Reached 🔄

# MagenticBuilder config
max_round_count = 5  # Default

# After 5 manager rounds:
if not reporter_ran:
    # Force fallback synthesis
    async for event in _synthesize_fallback(iteration, "max_rounds"):
        yield event
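
The interaction between the round limit and the fallback can be sketched as a plain loop. This is a deliberate simplification: in the real workflow the Magentic manager decides which agent speaks in each round, and `run_rounds` and its string events are hypothetical.

```python
from typing import Callable, Iterator


def run_rounds(
    judge_approves: Callable[[int], bool],
    max_round_count: int = 5,
) -> Iterator[str]:
    """Yield simplified events for a search/judge loop with a round limit."""
    reporter_ran = False
    for round_no in range(1, max_round_count + 1):
        yield f"round {round_no}: search"
        if judge_approves(round_no):
            yield f"round {round_no}: report"
            reporter_ran = True
            break
    if not reporter_ran:
        # No report was produced within the limit: force a fallback synthesis.
        yield "fallback synthesis (max_rounds)"
```

For example, `list(run_rounds(lambda r: False, max_round_count=2))` ends with the fallback event, while a judge that approves in round 2 ends with `"round 2: report"`.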

3. Timeout ⏱️

try:
    async with asyncio.timeout(settings.advanced_timeout):  # 600s default
        async for event in workflow.run_stream(task):
            yield event
except TimeoutError:
    async for event in _synthesize_fallback(iteration, "timeout"):
        yield event
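
The snippet above relies on `asyncio.timeout`, which requires Python 3.11 or later. A runnable sketch of the same stream-with-deadline behaviour is shown below using `asyncio.wait_for` on each item instead, which is a swapped-in technique rather than the orchestrator's code. `slow_workflow` and `run_with_deadline` are hypothetical names.

```python
import asyncio
import time
from typing import AsyncIterator


async def slow_workflow() -> AsyncIterator[str]:
    """Hypothetical workflow that stalls after its first event."""
    yield "started"
    await asyncio.sleep(10)
    yield "never reached"


async def run_with_deadline(
    events: AsyncIterator[str], timeout_s: float
) -> AsyncIterator[str]:
    """Forward events until an overall deadline, then emit a fallback event."""
    deadline = time.monotonic() + timeout_s
    iterator = events.__aiter__()
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise asyncio.TimeoutError
            try:
                # Wait for the next event, but never past the deadline.
                event = await asyncio.wait_for(iterator.__anext__(), remaining)
            except StopAsyncIteration:
                return
            yield event
    except asyncio.TimeoutError:
        yield "fallback synthesis (timeout)"


async def collect(timeout_s: float) -> list[str]:
    return [e async for e in run_with_deadline(slow_workflow(), timeout_s)]
```

With a short deadline such as `asyncio.run(collect(0.05))`, the first event is forwarded and the stalled workflow is replaced by the fallback event.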

4. Token Budget 💾

# Implicit via PydanticAI/LLM client
# ~50K tokens per query (from settings)
# Individual agent calls handle retries

Dependency Matrix

"If I change X, what breaks?"

Changed Component	Affected Components	Impact
Evidence model	All agents, Memory, Tools	HIGH - Core data type
JudgeAssessment	Judge, Orchestrator	HIGH - Decision flow
ResearchMemory	All agents	HIGH - Shared state
search_pubmed	SearchAgent	MEDIUM - One tool
get_bibliography	ReportAgent	MEDIUM - References
AgentEvent	Orchestrator, UI	MEDIUM - Streaming
EmbeddingService	Memory, Dedup	MEDIUM - Similarity
Judge thresholds	Workflow loop count	LOW - Tuning
System prompts	Agent behavior	LOW - Prompt eng

Agent Dependencies

SearchAgent
├── REQUIRES: MagenticState, EmbeddingService
├── WRITES TO: ResearchMemory (evidence)
└── NO DEPS ON: Other agents

JudgeAgent
├── REQUIRES: Evidence context (from Manager)
├── WRITES TO: Nothing
└── CONTROLS: SearchAgent (continue) or ReportAgent (synthesize)

HypothesisAgent
├── REQUIRES: Evidence context
├── WRITES TO: evidence_store["hypotheses"]
└── NO DEPS ON: Other agents

ReportAgent
├── REQUIRES: ResearchMemory, hypotheses, assessment
├── READS FROM: All prior state
└── WRITES TO: evidence_store["final_report"]

Critical Thresholds

Threshold	Value	Location	Impact
Confidence threshold	0.7 (70%)	JudgeAssessment	Sufficiency decision
Mechanism score threshold	6	Judge criteria	Sufficiency decision
Clinical score threshold	6	Judge criteria	Sufficiency decision
Max manager rounds	5	AdvancedOrchestrator	Loop termination
Max stall count	3	MagenticBuilder	Stall detection
Dedup similarity	0.9	EmbeddingService	Evidence dedup
Max evidence for judge	30	prompts/judge.py	Context limit
Confirmed hypothesis	0.8	ResearchMemory	High-confidence filter
Timeout	600s	settings.advanced_timeout	Workflow timeout

Developer Checklist

When modifying agents:

Update this document if contracts change
Verify state access (read/write) is correct
Check tool side effects
Test with make check
Verify event emission

When adding new agents:

Create factory function in magentic_agents.py
Define input/output contract
Document state access
Add to Agent Inventory table
Update Dependency Matrix

When changing Judge criteria:

Update JudgeAssessment model
Update Critical Thresholds table
Test workflow loop behavior
Verify fallback synthesis triggers correctly

This document is the source of truth for multi-agent coordination.
