P2 Bug: First Agent Turn Exceeds Workflow Timeout
Date: 2025-12-03
Status: FIXED (PR fix/p2-double-bug-squash)
Severity: P2 (UX - Workflow always times out on complex queries)
Component: src/orchestrators/advanced.py + src/agents/search_agent.py
Affects: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
Executive Summary
The search agent's first turn can exceed the 5-minute workflow timeout, causing:
iterations=0at timeout (no agent completed a turn)_handle_timeout()synthesizes from partial evidence- Users get incomplete research results
This is a performance/architecture bug, not a model issue.
Symptom
[warning] Workflow timed out iterations=0
The workflow times out with iterations=0 - meaning the first agent (search agent) never completed its turn before the 5-minute timeout.
Root Cause
The search agent's first turn is extremely expensive:
Search Agent First Turn:
βββ Manager assigns task
βββ Search agent starts
β βββ Calls PubMed search tool (10 results)
β βββ Calls ClinicalTrials search tool (10 results)
β βββ Calls EuropePMC search tool (10 results)
β βββ For EACH result (30 total):
β βββ Generate embedding (OpenAI API call)
β βββ Check for duplicates (ChromaDB query)
β βββ Store in ChromaDB
β
β TOTAL: 30 results Γ (embedding + dedup + store) = 90+ API/DB operations
β
βββ Agent turn completes (if timeout hasn't fired)
The timeout is on the WORKFLOW, not individual agent turns. A single greedy agent can consume the entire timeout budget.
Impact
| Aspect | Impact |
|---|---|
| UX | Queries always timeout on first turn |
| Research quality | Synthesis happens on partial evidence |
| Confusion | iterations=0 looks like nothing happened |
The Fix (Consensus)
Reduce work per turn + increase timeout budget.
Implementation
1. Reduce results per tool (immediate)
src/agents/search_agent.py line 70:
# Change from 10 to 5
result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
2. Increase workflow timeout (immediate)
src/utils/config.py:
advanced_timeout: float = Field(
default=600.0, # Was 300.0 (5 min), now 10 min
ge=60.0,
le=900.0,
description="Timeout for Advanced mode in seconds",
)
Why NOT Per-Turn Timeout
DANGER: The SearchHandler uses asyncio.gather():
# src/tools/search_handler.py line 163-164
results = await asyncio.gather(*tasks, return_exceptions=True)
This is an all-or-nothing operation. If you wrap it with asyncio.timeout() and the timeout fires, you get zero results, not partial results.
# DON'T DO THIS - yields nothing on timeout
async with asyncio.timeout(60):
result = await self._handler.execute(query) # Cancelled = zero results
Per-turn timeout requires SearchHandler to support cancellation with partial results. That's a separate architectural change (see Future Work).
Future Work (Streaming Evidence Ingestion)
For proper fix, SearchHandler.execute() should:
- Yield results as they arrive (async generator)
- Support cancellation with partial results
- Allow agent to return "what we have so far" on timeout
# Future architecture
async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]:
for tool in self.tools:
async for evidence in tool.search_streaming(query):
yield evidence # Can be cancelled at any point
This is out of scope for the immediate fix.
Test Plan
- Run query with 10-minute timeout
- Verify first agent turn completes before timeout
- Verify
iterations >= 1at workflow end
Verification Data
From diagnostic run:
=== RAW FRAMEWORK EVENTS ===
MagenticAgentDeltaEvent: 284
MagenticOrchestratorMessageEvent: 3
...
NO MagenticAgentMessageEvent β Agent never completed a turn!
[warning] Workflow timed out iterations=0
Related
- P2 Duplicate Report Bug (separate issue, happens after successful completion)
_handle_timeout()correctly synthesizes, but with partial evidence- Not related to model quality - this is infrastructure/performance