# P2 Bug: First Agent Turn Exceeds Workflow Timeout

**Date**: 2025-12-03
**Status**: FIXED (PR fix/p2-double-bug-squash)
**Severity**: P2 (UX - Workflow always times out on complex queries)
**Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py`
**Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)

---

## Executive Summary

The search agent's first turn can exceed the 5-minute workflow timeout, causing:

1. `iterations=0` at timeout (no agent completed a turn)
2. `_handle_timeout()` synthesizes from partial evidence
3. Users get incomplete research results

This is a **performance/architecture bug**, not a model issue.

---

## Symptom

```
[warning] Workflow timed out iterations=0
```

The workflow times out with `iterations=0`, meaning the first agent (the search agent) never completed its turn before the 5-minute timeout fired.

---

## Root Cause

The search agent's first turn is **extremely expensive**:

```
Search Agent First Turn:
├── Manager assigns task
├── Search agent starts
│   ├── Calls PubMed search tool (10 results)
│   ├── Calls ClinicalTrials search tool (10 results)
│   ├── Calls EuropePMC search tool (10 results)
│   └── For EACH result (30 total):
│       ├── Generate embedding (OpenAI API call)
│       ├── Check for duplicates (ChromaDB query)
│       └── Store in ChromaDB
│
│   TOTAL: 30 results × (embedding + dedup + store) = 90+ API/DB operations
│
└── Agent turn completes (if timeout hasn't fired)
```

**The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget.

---

## Impact

| Aspect | Impact |
|--------|--------|
| UX | Complex queries time out on the first turn |
| Research quality | Synthesis happens on partial evidence |
| Confusion | `iterations=0` looks like nothing happened |

---

## The Fix (Consensus)

**Reduce work per turn + increase timeout budget.**

### Implementation

**1. Reduce results per tool (immediate)**

`src/agents/search_agent.py` line 70:

```python
# Change from 10 to 5
result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
```

**2. Increase workflow timeout (immediate)**

`src/utils/config.py`:

```python
advanced_timeout: float = Field(
    default=600.0,  # Was 300.0 (5 min), now 10 min
    ge=60.0,
    le=900.0,
    description="Timeout for Advanced mode in seconds",
)
```

### Why NOT Per-Turn Timeout

**DANGER**: The SearchHandler uses `asyncio.gather()`:

```python
# src/tools/search_handler.py line 163-164
results = await asyncio.gather(*tasks, return_exceptions=True)
```

This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` and the timeout fires, the whole `gather()` call is cancelled and you get **zero results**, not partial results.

```python
# DON'T DO THIS - yields nothing on timeout
async with asyncio.timeout(60):
    result = await self._handler.execute(query)  # Cancelled = zero results
```

A per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work).

---

## Future Work (Streaming Evidence Ingestion)

For a proper fix, `SearchHandler.execute()` should:

1. Yield results as they arrive (async generator)
2. Support cancellation with partial results
3. Allow the agent to return "what we have so far" on timeout

```python
# Future architecture
from collections.abc import AsyncIterator

async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]:
    # Tools are consumed one after another in this sketch
    for tool in self.tools:
        async for evidence in tool.search_streaming(query):
            yield evidence  # Can be cancelled at any point
```

This is out of scope for the immediate fix.

---

## Test Plan

1. Run a query with the 10-minute timeout
2. Verify the first agent turn completes before the timeout
3. Verify `iterations >= 1` at workflow end

---

## Verification Data

From diagnostic run:

```
=== RAW FRAMEWORK EVENTS ===
MagenticAgentDeltaEvent: 284
MagenticOrchestratorMessageEvent: 3
...
NO MagenticAgentMessageEvent ← Agent never completed a turn!

[warning] Workflow timed out iterations=0
```

---

## Related

- P2 Duplicate Report Bug (separate issue, happens after successful completion)
- `_handle_timeout()` correctly synthesizes, but with partial evidence
- Not related to model quality - this is infrastructure/performance
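
---

## Appendix: All-or-Nothing vs Partial Results (Sketch)

The all-or-nothing behavior described under "Why NOT Per-Turn Timeout" can be reproduced in isolation. The sketch below is not the project's `SearchHandler`. `fake_tool`, `gather_with_timeout`, and `wait_with_timeout` are hypothetical names, and the tools are simulated with `asyncio.sleep`. It uses `asyncio.wait_for` instead of `asyncio.timeout()` so that it also runs on Python versions before 3.11. It contrasts a timed-out `asyncio.gather()` with `asyncio.wait(..., timeout=...)`, which is one possible way to keep whatever finished within the budget.

```python
import asyncio


async def fake_tool(name: str, delay: float) -> str:
    """Hypothetical stand-in for a search tool: sleeps, then returns a label."""
    await asyncio.sleep(delay)
    return f"{name}-results"


async def gather_with_timeout(delays: dict[str, float], budget: float) -> list[str]:
    """All-or-nothing: if the budget expires, finished results are lost too."""
    coros = [fake_tool(name, delay) for name, delay in delays.items()]
    try:
        return list(await asyncio.wait_for(asyncio.gather(*coros), timeout=budget))
    except asyncio.TimeoutError:
        # gather() was cancelled as a whole, so nothing can be recovered.
        return []


async def wait_with_timeout(delays: dict[str, float], budget: float) -> list[str]:
    """Partial results: keep tasks that finished, cancel the ones that did not."""
    tasks = [asyncio.create_task(fake_tool(name, delay)) for name, delay in delays.items()]
    done, pending = await asyncio.wait(tasks, timeout=budget)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks unwind before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return sorted(task.result() for task in done)


if __name__ == "__main__":
    # Assumed delays: two fast tools and one that overruns the budget.
    demo_delays = {"pubmed": 0.01, "clinicaltrials": 0.02, "europepmc": 0.5}
    print(asyncio.run(gather_with_timeout(demo_delays, budget=0.2)))
    print(asyncio.run(wait_with_timeout(demo_delays, budget=0.2)))
```

With one slow tool, the `gather()` variant returns an empty list while the `wait()` variant returns the two results that completed in time. A real implementation would also need to handle tool exceptions, which this sketch ignores.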