| # P2 Bug: First Agent Turn Exceeds Workflow Timeout |
|
|
| **Date**: 2025-12-03 |
| **Status**: FIXED (PR fix/p2-double-bug-squash) |
| **Severity**: P2 (UX - Workflow always times out on complex queries) |
| **Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py` |
| **Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI) |
|
|
| --- |
|
|
| ## Executive Summary |
|
|
| The search agent's first turn can exceed the 5-minute workflow timeout, causing: |
| 1. `iterations=0` at timeout (no agent completed a turn) |
| 2. `_handle_timeout()` synthesizes from partial evidence |
| 3. Users get incomplete research results |
|
|
| This is a **performance/architecture bug**, not a model issue. |
|
|
| --- |
|
|
| ## Symptom |
|
|
| ``` |
| [warning] Workflow timed out iterations=0 |
| ``` |
|
|
| The workflow times out with `iterations=0` - meaning the first agent (search agent) never completed its turn before the 5-minute timeout. |
|
|
| --- |
|
|
| ## Root Cause |
|
|
| The search agent's first turn is **extremely expensive**: |
|
|
| ``` |
| Search Agent First Turn: |
| βββ Manager assigns task |
| βββ Search agent starts |
| β βββ Calls PubMed search tool (10 results) |
| β βββ Calls ClinicalTrials search tool (10 results) |
| β βββ Calls EuropePMC search tool (10 results) |
| β βββ For EACH result (30 total): |
| β βββ Generate embedding (OpenAI API call) |
| β βββ Check for duplicates (ChromaDB query) |
| β βββ Store in ChromaDB |
| β |
| β TOTAL: 30 results Γ (embedding + dedup + store) = 90+ API/DB operations |
| β |
| βββ Agent turn completes (if timeout hasn't fired) |
| ``` |
|
|
| **The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget. |
|
|
| --- |
|
|
| ## Impact |
|
|
| | Aspect | Impact | |
| |--------|--------| |
| | UX | Queries always timeout on first turn | |
| | Research quality | Synthesis happens on partial evidence | |
| | Confusion | `iterations=0` looks like nothing happened | |
|
|
| --- |
|
|
| ## The Fix (Consensus) |
|
|
| **Reduce work per turn + increase timeout budget.** |
|
|
| ### Implementation |
|
|
| **1. Reduce results per tool (immediate)** |
|
|
| `src/agents/search_agent.py` line 70: |
| ```python |
| # Change from 10 to 5 |
| result: SearchResult = await self._handler.execute(query, max_results_per_tool=5) |
| ``` |
|
|
| **2. Increase workflow timeout (immediate)** |
|
|
| `src/utils/config.py`: |
| ```python |
| advanced_timeout: float = Field( |
| default=600.0, # Was 300.0 (5 min), now 10 min |
| ge=60.0, |
| le=900.0, |
| description="Timeout for Advanced mode in seconds", |
| ) |
| ``` |
|
|
| ### Why NOT Per-Turn Timeout |
|
|
| **DANGER**: The SearchHandler uses `asyncio.gather()`: |
|
|
| ```python |
| # src/tools/search_handler.py line 163-164 |
| results = await asyncio.gather(*tasks, return_exceptions=True) |
| ``` |
|
|
| This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` and the timeout fires, you get **zero results**, not partial results. |
|
|
| ```python |
| # DON'T DO THIS - yields nothing on timeout |
| async with asyncio.timeout(60): |
| result = await self._handler.execute(query) # Cancelled = zero results |
| ``` |
|
|
| Per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work). |
|
|
| --- |
|
|
| ## Future Work (Streaming Evidence Ingestion) |
|
|
| For proper fix, `SearchHandler.execute()` should: |
| 1. Yield results as they arrive (async generator) |
| 2. Support cancellation with partial results |
| 3. Allow agent to return "what we have so far" on timeout |
|
|
| ```python |
| # Future architecture |
| async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]: |
| for tool in self.tools: |
| async for evidence in tool.search_streaming(query): |
| yield evidence # Can be cancelled at any point |
| ``` |
|
|
| This is out of scope for the immediate fix. |
|
|
| --- |
|
|
| ## Test Plan |
|
|
| 1. Run query with 10-minute timeout |
| 2. Verify first agent turn completes before timeout |
| 3. Verify `iterations >= 1` at workflow end |
|
|
| --- |
|
|
| ## Verification Data |
|
|
| From diagnostic run: |
| ``` |
| === RAW FRAMEWORK EVENTS === |
| MagenticAgentDeltaEvent: 284 |
| MagenticOrchestratorMessageEvent: 3 |
| ... |
| NO MagenticAgentMessageEvent β Agent never completed a turn! |
| |
| [warning] Workflow timed out iterations=0 |
| ``` |
|
|
| --- |
|
|
| ## Related |
|
|
| - P2 Duplicate Report Bug (separate issue, happens after successful completion) |
| - `_handle_timeout()` correctly synthesizes, but with partial evidence |
| - Not related to model quality - this is infrastructure/performance |
|
|