| # P2 Bug: First Agent Turn Exceeds Workflow Timeout | |
| **Date**: 2025-12-03 | |
| **Status**: FIXED (PR fix/p2-double-bug-squash) | |
| **Severity**: P2 (UX - Workflow always times out on complex queries) | |
| **Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py` | |
| **Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI) | |
| --- | |
| ## Executive Summary | |
| The search agent's first turn can exceed the 5-minute workflow timeout, causing: | |
| 1. `iterations=0` at timeout (no agent completed a turn) | |
| 2. `_handle_timeout()` synthesizes from partial evidence | |
| 3. Users get incomplete research results | |
| This is a **performance/architecture bug**, not a model issue. | |
| --- | |
| ## Symptom | |
| ``` | |
| [warning] Workflow timed out iterations=0 | |
| ``` | |
| The workflow times out with `iterations=0` - meaning the first agent (search agent) never completed its turn before the 5-minute timeout. | |
| --- | |
| ## Root Cause | |
| The search agent's first turn is **extremely expensive**: | |
| ``` | |
| Search Agent First Turn: | |
| βββ Manager assigns task | |
| βββ Search agent starts | |
| β βββ Calls PubMed search tool (10 results) | |
| β βββ Calls ClinicalTrials search tool (10 results) | |
| β βββ Calls EuropePMC search tool (10 results) | |
| β βββ For EACH result (30 total): | |
| β βββ Generate embedding (OpenAI API call) | |
| β βββ Check for duplicates (ChromaDB query) | |
| β βββ Store in ChromaDB | |
| β | |
| β TOTAL: 30 results Γ (embedding + dedup + store) = 90+ API/DB operations | |
| β | |
| βββ Agent turn completes (if timeout hasn't fired) | |
| ``` | |
| **The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget. | |
| --- | |
| ## Impact | |
| | Aspect | Impact | | |
| |--------|--------| | |
| | UX | Queries always timeout on first turn | | |
| | Research quality | Synthesis happens on partial evidence | | |
| | Confusion | `iterations=0` looks like nothing happened | | |
| --- | |
| ## The Fix (Consensus) | |
| **Reduce work per turn + increase timeout budget.** | |
| ### Implementation | |
| **1. Reduce results per tool (immediate)** | |
| `src/agents/search_agent.py` line 70: | |
| ```python | |
| # Change from 10 to 5 | |
| result: SearchResult = await self._handler.execute(query, max_results_per_tool=5) | |
| ``` | |
| **2. Increase workflow timeout (immediate)** | |
| `src/utils/config.py`: | |
| ```python | |
| advanced_timeout: float = Field( | |
| default=600.0, # Was 300.0 (5 min), now 10 min | |
| ge=60.0, | |
| le=900.0, | |
| description="Timeout for Advanced mode in seconds", | |
| ) | |
| ``` | |
| ### Why NOT Per-Turn Timeout | |
| **DANGER**: The SearchHandler uses `asyncio.gather()`: | |
| ```python | |
| # src/tools/search_handler.py line 163-164 | |
| results = await asyncio.gather(*tasks, return_exceptions=True) | |
| ``` | |
| This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` and the timeout fires, you get **zero results**, not partial results. | |
| ```python | |
| # DON'T DO THIS - yields nothing on timeout | |
| async with asyncio.timeout(60): | |
| result = await self._handler.execute(query) # Cancelled = zero results | |
| ``` | |
| Per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work). | |
| --- | |
| ## Future Work (Streaming Evidence Ingestion) | |
| For proper fix, `SearchHandler.execute()` should: | |
| 1. Yield results as they arrive (async generator) | |
| 2. Support cancellation with partial results | |
| 3. Allow agent to return "what we have so far" on timeout | |
| ```python | |
| # Future architecture | |
| async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]: | |
| for tool in self.tools: | |
| async for evidence in tool.search_streaming(query): | |
| yield evidence # Can be cancelled at any point | |
| ``` | |
| This is out of scope for the immediate fix. | |
| --- | |
| ## Test Plan | |
| 1. Run query with 10-minute timeout | |
| 2. Verify first agent turn completes before timeout | |
| 3. Verify `iterations >= 1` at workflow end | |
| --- | |
| ## Verification Data | |
| From diagnostic run: | |
| ``` | |
| === RAW FRAMEWORK EVENTS === | |
| MagenticAgentDeltaEvent: 284 | |
| MagenticOrchestratorMessageEvent: 3 | |
| ... | |
| NO MagenticAgentMessageEvent β Agent never completed a turn! | |
| [warning] Workflow timed out iterations=0 | |
| ``` | |
| --- | |
| ## Related | |
| - P2 Duplicate Report Bug (separate issue, happens after successful completion) | |
| - `_handle_timeout()` correctly synthesizes, but with partial evidence | |
| - Not related to model quality - this is infrastructure/performance | |