File size: 4,379 Bytes
21dd8fe |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 |
# P2 Bug: First Agent Turn Exceeds Workflow Timeout
**Date**: 2025-12-03
**Status**: FIXED (PR fix/p2-double-bug-squash)
**Severity**: P2 (UX - Workflow always times out on complex queries)
**Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py`
**Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
---
## Executive Summary
The search agent's first turn can exceed the 5-minute workflow timeout, causing:
1. `iterations=0` at timeout (no agent completed a turn)
2. `_handle_timeout()` synthesizes from partial evidence
3. Users get incomplete research results
This is a **performance/architecture bug**, not a model issue.
---
## Symptom
```
[warning] Workflow timed out iterations=0
```
The workflow times out with `iterations=0` - meaning the first agent (search agent) never completed its turn before the 5-minute timeout.
---
## Root Cause
The search agent's first turn is **extremely expensive**:
```
Search Agent First Turn:
βββ Manager assigns task
βββ Search agent starts
β βββ Calls PubMed search tool (10 results)
β βββ Calls ClinicalTrials search tool (10 results)
β βββ Calls EuropePMC search tool (10 results)
β βββ For EACH result (30 total):
β βββ Generate embedding (OpenAI API call)
β βββ Check for duplicates (ChromaDB query)
β βββ Store in ChromaDB
β
β TOTAL: 30 results Γ (embedding + dedup + store) = 90+ API/DB operations
β
βββ Agent turn completes (if timeout hasn't fired)
```
**The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget.
---
## Impact
| Aspect | Impact |
|--------|--------|
| UX | Queries always timeout on first turn |
| Research quality | Synthesis happens on partial evidence |
| Confusion | `iterations=0` looks like nothing happened |
---
## The Fix (Consensus)
**Reduce work per turn + increase timeout budget.**
### Implementation
**1. Reduce results per tool (immediate)**
`src/agents/search_agent.py` line 70:
```python
# Change from 10 to 5
result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
```
**2. Increase workflow timeout (immediate)**
`src/utils/config.py`:
```python
advanced_timeout: float = Field(
default=600.0, # Was 300.0 (5 min), now 10 min
ge=60.0,
le=900.0,
description="Timeout for Advanced mode in seconds",
)
```
### Why NOT Per-Turn Timeout
**DANGER**: The SearchHandler uses `asyncio.gather()`:
```python
# src/tools/search_handler.py line 163-164
results = await asyncio.gather(*tasks, return_exceptions=True)
```
This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` and the timeout fires, you get **zero results**, not partial results.
```python
# DON'T DO THIS - yields nothing on timeout
async with asyncio.timeout(60):
result = await self._handler.execute(query) # Cancelled = zero results
```
Per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work).
---
## Future Work (Streaming Evidence Ingestion)
For proper fix, `SearchHandler.execute()` should:
1. Yield results as they arrive (async generator)
2. Support cancellation with partial results
3. Allow agent to return "what we have so far" on timeout
```python
# Future architecture
async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]:
for tool in self.tools:
async for evidence in tool.search_streaming(query):
yield evidence # Can be cancelled at any point
```
This is out of scope for the immediate fix.
---
## Test Plan
1. Run query with 10-minute timeout
2. Verify first agent turn completes before timeout
3. Verify `iterations >= 1` at workflow end
---
## Verification Data
From diagnostic run:
```
=== RAW FRAMEWORK EVENTS ===
MagenticAgentDeltaEvent: 284
MagenticOrchestratorMessageEvent: 3
...
NO MagenticAgentMessageEvent β Agent never completed a turn!
[warning] Workflow timed out iterations=0
```
---
## Related
- P2 Duplicate Report Bug (separate issue, happens after successful completion)
- `_handle_timeout()` correctly synthesizes, but with partial evidence
- Not related to model quality - this is infrastructure/performance
|