DeepBoner / docs /bugs /P2_FIRST_TURN_TIMEOUT.md
VibecoderMcSwaggins's picture
fix(p2): Resolve duplicate report content and first-turn timeout bugs (#123)
21dd8fe unverified
|
raw
history blame
4.38 kB
# P2 Bug: First Agent Turn Exceeds Workflow Timeout
**Date**: 2025-12-03
**Status**: FIXED (PR fix/p2-double-bug-squash)
**Severity**: P2 (UX - Workflow always times out on complex queries)
**Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py`
**Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
---
## Executive Summary
The search agent's first turn can exceed the 5-minute workflow timeout, causing:
1. `iterations=0` at timeout (no agent completed a turn)
2. `_handle_timeout()` synthesizes from partial evidence
3. Users get incomplete research results
This is a **performance/architecture bug**, not a model issue.
---
## Symptom
```
[warning] Workflow timed out iterations=0
```
The workflow times out with `iterations=0` - meaning the first agent (search agent) never completed its turn before the 5-minute timeout.
---
## Root Cause
The search agent's first turn is **extremely expensive**:
```
Search Agent First Turn:
β”œβ”€β”€ Manager assigns task
β”œβ”€β”€ Search agent starts
β”‚ β”œβ”€β”€ Calls PubMed search tool (10 results)
β”‚ β”œβ”€β”€ Calls ClinicalTrials search tool (10 results)
β”‚ β”œβ”€β”€ Calls EuropePMC search tool (10 results)
β”‚ └── For EACH result (30 total):
β”‚ β”œβ”€β”€ Generate embedding (OpenAI API call)
β”‚ β”œβ”€β”€ Check for duplicates (ChromaDB query)
β”‚ └── Store in ChromaDB
β”‚
β”‚ TOTAL: 30 results Γ— (embedding + dedup + store) = 90+ API/DB operations
β”‚
└── Agent turn completes (if timeout hasn't fired)
```
**The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget.
---
## Impact
| Aspect | Impact |
|--------|--------|
| UX | Queries always timeout on first turn |
| Research quality | Synthesis happens on partial evidence |
| Confusion | `iterations=0` looks like nothing happened |
---
## The Fix (Consensus)
**Reduce work per turn + increase timeout budget.**
### Implementation
**1. Reduce results per tool (immediate)**
`src/agents/search_agent.py` line 70:
```python
# Change from 10 to 5
result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
```
**2. Increase workflow timeout (immediate)**
`src/utils/config.py`:
```python
advanced_timeout: float = Field(
default=600.0, # Was 300.0 (5 min), now 10 min
ge=60.0,
le=900.0,
description="Timeout for Advanced mode in seconds",
)
```
### Why NOT Per-Turn Timeout
**DANGER**: The SearchHandler uses `asyncio.gather()`:
```python
# src/tools/search_handler.py line 163-164
results = await asyncio.gather(*tasks, return_exceptions=True)
```
This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` and the timeout fires, you get **zero results**, not partial results.
```python
# DON'T DO THIS - yields nothing on timeout
async with asyncio.timeout(60):
result = await self._handler.execute(query) # Cancelled = zero results
```
Per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work).
---
## Future Work (Streaming Evidence Ingestion)
For proper fix, `SearchHandler.execute()` should:
1. Yield results as they arrive (async generator)
2. Support cancellation with partial results
3. Allow agent to return "what we have so far" on timeout
```python
# Future architecture
async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]:
for tool in self.tools:
async for evidence in tool.search_streaming(query):
yield evidence # Can be cancelled at any point
```
This is out of scope for the immediate fix.
---
## Test Plan
1. Run query with 10-minute timeout
2. Verify first agent turn completes before timeout
3. Verify `iterations >= 1` at workflow end
---
## Verification Data
From diagnostic run:
```
=== RAW FRAMEWORK EVENTS ===
MagenticAgentDeltaEvent: 284
MagenticOrchestratorMessageEvent: 3
...
NO MagenticAgentMessageEvent ← Agent never completed a turn!
[warning] Workflow timed out iterations=0
```
---
## Related
- P2 Duplicate Report Bug (separate issue, happens after successful completion)
- `_handle_timeout()` correctly synthesizes, but with partial evidence
- Not related to model quality - this is infrastructure/performance