VibecoderMcSwaggins/DeepBoner

VibecoderMcSwaggins committed 21 days ago

Commit

3753c39

1 Parent(s): b9709ff

docs: Add P2 bug reports for first agent turn timeout and duplicate report content

- Introduced documentation for two performance-related bugs:
1. **First Agent Turn Exceeds Workflow Timeout**: Identified root cause as excessive API calls in the search agent's first turn, leading to timeouts. Recommended fixes include reducing results per tool and increasing the workflow timeout.
2. **Duplicate Report Content**: Documented the issue of repeated report content due to lack of deduplication between streamed and final events. Suggested handling final events inline to avoid duplication.

Both issues affect user experience and require immediate attention for resolution.

Files changed (3)

docs/bugs/ACTIVE_BUGS.md +14 -1
docs/bugs/P2_DUPLICATE_REPORT_CONTENT.md +77 -268
docs/bugs/P2_FIRST_TURN_TIMEOUT.md +160 -0

docs/bugs/ACTIVE_BUGS.md CHANGED

@@ -18,7 +18,20 @@
 **Root Cause:** Both `MagenticFinalResultEvent` and `WorkflowOutputEvent` emit the full report content that was already streamed. No deduplication exists.
-**Recommended Fix:** Track streamed content length in orchestrator; emit minimal "Research complete." message instead of repeating content.
 ---

 **Root Cause:** Both `MagenticFinalResultEvent` and `WorkflowOutputEvent` emit the full report content that was already streamed. No deduplication exists.
+**Recommended Fix:** Handle final events inline in `run()` loop where buffer context exists. Track `last_streamed_length`; if > 100 chars, emit "Research complete." instead of full content.
+---
+### P2 - First Agent Turn Exceeds Workflow Timeout
+**File:** `docs/bugs/P2_FIRST_TURN_TIMEOUT.md`
+**Status:** OPEN - Performance Bug
+**Problem:** The search agent's first turn can exceed the 5-minute workflow timeout, causing `iterations=0` at timeout. Users get partial research results.
+**Root Cause:** Search agent does too much work in a single turn: 3 API searches → 30 results → 30 embedding calls → 30 ChromaDB stores. The timeout is on the WORKFLOW, not individual agent turns.
+**Recommended Fix:** Reduce `max_results_per_tool` from 10 to 5; increase `advanced_timeout` to 600s (10 min).
 ---
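
As a rough illustration of the root cause summarized in the first-turn timeout entry, the operation count for the search agent's first turn can be estimated as below. This is only a sketch: the helper name `first_turn_operations` is hypothetical, and the figure of three operations per result (embedding, duplicate check, store) is an assumption taken from the bug report's own breakdown. It counts operations, not latency.

```python
def first_turn_operations(
    tools: int, max_results_per_tool: int, ops_per_result: int = 3
) -> int:
    """Estimate API/DB operations in one search turn (hypothetical helper).

    ops_per_result assumes one embedding call, one duplicate check,
    and one store per result, as described in the bug report.
    """
    search_calls = tools  # one search request per tool
    per_result_ops = tools * max_results_per_tool * ops_per_result
    return search_calls + per_result_ops


before = first_turn_operations(tools=3, max_results_per_tool=10)
after = first_turn_operations(tools=3, max_results_per_tool=5)
print(before, after)
```

Under these assumptions, halving `max_results_per_tool` roughly halves the work in the first turn, which is why the recommended fix pairs it with a longer workflow timeout rather than relying on either change alone.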

docs/bugs/P2_DUPLICATE_REPORT_CONTENT.md CHANGED

@@ -3,7 +3,7 @@
 **Date**: 2025-12-03
 **Status**: OPEN
 **Severity**: P2 (UX - Duplicate content confuses users)
-**Component**: `src/orchestrators/advanced.py` + `src/app.py`
 **Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
 ---
@@ -25,318 +25,127 @@ The final research report appears **twice** in the UI output:
 1. First as streaming content (with `📡 **STREAMING**:` prefix)
 2. Then again as a complete event (without prefix)
-Example:
-```
-📡 **STREAMING**:
-### Summary of Drugs and Mechanisms of Action
-...
-### Conclusion
-Post-menopausal women experiencing libido issues can benefit from...
-### Recommendations
-- Estrogen Therapy: Effective in enhancing...
-Based on the information gathered, we have identified...   <-- DUPLICATE STARTS
-### Summary of Drugs and Mechanisms of Action
-...
-### Conclusion
-Post-menopausal women experiencing libido issues can benefit from...
-### Recommendations
-- Estrogen Therapy: Effective in enhancing...
-```
 ---
-## Root Cause Analysis
-### Event Flow (Current - Buggy)
-```
-1. Reporter Agent streams content
-   └─ MagenticAgentDeltaEvent × N
-      └─ Each yields AgentEvent(type="streaming", message=delta)
-      └─ app.py: streaming_buffer += event.message
-      └─ User sees: "📡 **STREAMING**: [content building up]"
-2. Reporter Agent completes
-   └─ MagenticAgentMessageEvent
-      └─ Yields truncated completion: "reporter: [first 200 chars]..."
-      └─ app.py: flushes streaming_buffer to response_parts
-3. Workflow ends
-   └─ MagenticFinalResultEvent OR WorkflowOutputEvent
-      └─ Contains FULL report content (same as streaming)
-      └─ Yields AgentEvent(type="complete", message=FULL_CONTENT)
-      └─ app.py: appends event.message to response_parts
-      └─ User sees: [SAME CONTENT AGAIN]
-```
-### Bug Location
-**`src/orchestrators/advanced.py` lines 532-552:**
 ```python
-elif isinstance(event, MagenticFinalResultEvent):
-    text = self._extract_text(event.message) if event.message else "No result"
-    return AgentEvent(
-        type="complete",
-        message=text,  # <-- FULL content, already streamed
-        ...
-    )
-elif isinstance(event, WorkflowOutputEvent):
-    if event.data:
-        text = self._extract_text(event.data)
-        return AgentEvent(
-            type="complete",
-            message=text,  # <-- FULL content, already streamed
-            ...
-        )
-```
-**`src/app.py` lines 229-232:**
-```python
-if event.type == "complete":
-    response_parts.append(event.message)  # <-- Appends duplicate
-    yield "\n\n".join(response_parts)
 ```
-### Why It Happens
-1. **Streaming events** yield the full report character-by-character
-2. **Final events** (`MagenticFinalResultEvent`, `WorkflowOutputEvent`) contain the same full content
-3. **No deduplication** exists between streamed content and final event content
-4. **app.py appends both** to the output
 ---
-## Impact
-| Aspect | Impact |
-|--------|--------|
-| UX | Report appears twice, looks buggy |
-| Token usage | Renders same content twice |
-| Trust | Users may think system is broken |
----
-## Proposed Fix Options
-### Option 1: Skip Complete Event if Content Matches Streaming (Recommended)
-**Location**: `src/app.py` lines 229-232
-```python
-if event.type == "complete":
-    # Skip if content matches what we already streamed
-    streaming_content = next(
-        (p.replace("📡 **STREAMING**: ", "") for p in response_parts if p.startswith("📡 **STREAMING**:")),
-        None
-    )
-    if streaming_content and event.message.strip() == streaming_content.strip():
-        continue  # Skip duplicate
-    response_parts.append(event.message)
-    yield "\n\n".join(response_parts)
-```
-**Pros**: Simple, targets exact issue
-**Cons**: String comparison may be fragile
-### Option 2: Track Streamed Content Hash
-**Location**: `src/app.py`
 ```python
-streaming_hash = None
-...
-if streaming_buffer:
-    streaming_hash = hash(streaming_buffer.strip())
-    response_parts.append(f"📡 **STREAMING**: {streaming_buffer}")
-    streaming_buffer = ""
-...
-if event.type == "complete":
-    if streaming_hash and hash(event.message.strip()) == streaming_hash:
-        continue  # Skip duplicate
-    response_parts.append(event.message)
-```
-**Pros**: More robust comparison
-**Cons**: Hash collision possible (unlikely)
-### Option 3: Don't Emit Complete Event Content from Orchestrator
-**Location**: `src/orchestrators/advanced.py` lines 532-552
-Replace full content with summary:
-```python
-elif isinstance(event, MagenticFinalResultEvent):
-    return AgentEvent(
-        type="complete",
-        message="Research complete.",  # Don't repeat content
-        data={"iterations": iteration},
-        iteration=iteration,
-    )
 ```
-**Pros**: Clean separation of streaming vs completion
-**Cons**: Loses fallback if streaming failed
-### Option 4: Flag-Based Deduplication in Orchestrator
-**Location**: `src/orchestrators/advanced.py`
-Track if substantial streaming occurred:
-```python
-has_substantial_streaming = len(current_message_buffer) > 100
-# In _process_event for final events:
-if has_substantial_streaming:
-    return AgentEvent(
-        type="complete",
-        message="Research complete.",  # Don't repeat
-        ...
-    )
-```
 ---
-## Recommended Fix
-**Option 3** is cleanest - the orchestrator should not re-emit content that was already streamed.
-**Implementation**:
-1. Track `streamed_report_length` in the run loop
-2. If substantial content was streamed (>500 chars), emit minimal complete message
-3. If no streaming occurred, emit full content as fallback
 ---
-## Files Involved
-| File | Role |
-|------|------|
-| `src/orchestrators/advanced.py:532-552` | Emits duplicate complete events |
-| `src/app.py:229-232` | Appends duplicate to output |
 ---
 ## Test Plan
-1. Run Free Tier query: "What drugs improve female libido post-menopause?"
-2. Verify report appears ONCE (with streaming prefix)
-3. Verify `complete` event does NOT repeat content
-4. Verify fallback works if streaming fails
 ---
-## Deep Technical Analysis
-### Microsoft Agent Framework Event Types
-The framework emits these event types (all inherit from `WorkflowEvent`):
-| Event Type | Purpose | Key Attributes |
-|------------|---------|----------------|
-| `MagenticAgentDeltaEvent` | Streaming tokens | `text`, `agent_id` |
-| `MagenticAgentMessageEvent` | Agent turn complete | `message` (ChatMessage), `agent_id` |
-| `MagenticFinalResultEvent` | Workflow final result | `message` (ChatMessage) |
-| `MagenticOrchestratorMessageEvent` | Manager bookkeeping | `message`, `kind`, `orchestrator_id` |
-| `WorkflowOutputEvent` | Workflow output | `data`, `source_executor_id` |
-### Event Flow Trace
-```
-PHASE 1: Agent Streaming (Reporter)
-─────────────────────────────────────
-MagenticAgentDeltaEvent(text="##", agent_id="reporter")     → yields streaming event
-MagenticAgentDeltaEvent(text=" Summary", agent_id="reporter") → yields streaming event
-MagenticAgentDeltaEvent(text="\n", agent_id="reporter")     → yields streaming event
-... (hundreds more delta events)
-MagenticAgentDeltaEvent(text=".", agent_id="reporter")      → yields streaming event
-→ Result: Full report content in streaming_buffer (app.py) and current_message_buffer (orchestrator)
-PHASE 2: Agent Completion
-─────────────────────────────────────
-MagenticAgentMessageEvent(message=ChatMessage(...), agent_id="reporter")
-→ _handle_completion_event() yields: "reporter: [first 200 chars]..."
-→ Clears current_message_buffer
-→ app.py flushes streaming_buffer to response_parts with "📡 **STREAMING**:" prefix
-PHASE 3: Workflow Termination (THE BUG)
-─────────────────────────────────────
-MagenticFinalResultEvent(message=ChatMessage(...))  ← Contains SAME full report!
-OR
-WorkflowOutputEvent(data=ChatMessage(...))          ← Contains SAME full report!
-→ _process_event() extracts text with _extract_text()
-→ Returns AgentEvent(type="complete", message=FULL_REPORT)
-→ app.py appends FULL_REPORT to response_parts (NO prefix)
-RESULT: Report appears twice:
-1. "📡 **STREAMING**: [full report]"
-2. "[full report again]"
-```
-### Key Code Paths
-**`advanced.py` lines 299-345 (main loop):**
-```python
-# Buffer is cleared HERE (line 337) after MagenticAgentMessageEvent
-current_message_buffer = ""
-# But MagenticFinalResultEvent comes AFTER and _process_event has no buffer context!
-agent_event = self._process_event(event, iteration)  # line 341
-if agent_event:
-    yield agent_event  # line 345 - yields duplicate!
-```
-**`advanced.py` lines 532-539 (_process_event):**
-```python
-elif isinstance(event, MagenticFinalResultEvent):
-    text = self._extract_text(event.message)  # Extracts FULL content
-    return AgentEvent(type="complete", message=text)  # Returns FULL content
-```
-**`app.py` lines 229-232 (UI handling):**
-```python
-if event.type == "complete":
-    response_parts.append(event.message)  # Appends to existing streamed content!
-    yield "\n\n".join(response_parts)
-```
-### Why Buffer Clearing Doesn't Help
-The `current_message_buffer` is cleared (line 337) BEFORE the final events arrive. So even if we wanted to compare, we've already lost the reference:
-```python
-# Line 327-338: Handle MagenticAgentMessageEvent
-iteration += 1
-comp_event, prog_event = self._handle_completion_event(...)
-yield comp_event
-yield prog_event
-current_message_buffer = ""  # CLEARED!
-continue
-# Line 341-345: Handle final events (buffer is empty now!)
-agent_event = self._process_event(event, iteration)  # No buffer context
-```
-### Potential Edge Cases
-1. **Tool-only turns**: If agent makes tool calls without text, buffer is empty → fallback text used
-2. **Multiple agents streaming**: Buffer clears on agent switch (line 311-313) → OK
-3. **Timeout**: Uses `_handle_timeout()` which invokes ReportAgent directly → Different path
-4. **No final event**: Falls back to "Research completed..." message (line 354-363) → OK
-### Verification Needed
-- [ ] Confirm `MagenticFinalResultEvent` vs `WorkflowOutputEvent` - which is emitted?
-- [ ] Confirm bug occurs on both Free and Paid tiers
-- [ ] Measure content length match between streaming and final event
 ---
 ## Related
-- **Not related to model quality** - This is a stack bug, not model limitation
-- P1 Free Tier fix (PR fix/P1-free-tier) enabled streaming, exposing this bug
-- SPEC-17 Accumulator Pattern addressed repr bug but created this side effect

 **Date**: 2025-12-03
 **Status**: OPEN
 **Severity**: P2 (UX - Duplicate content confuses users)
+**Component**: `src/orchestrators/advanced.py`
 **Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
 ---
 1. First as streaming content (with `📡 **STREAMING**:` prefix)
 2. Then again as a complete event (without prefix)
 ---
+## Root Cause
+The `_process_event()` method handles final events but has **no access to buffer state**. The buffer was already cleared at line 337 before these events arrive.
 ```python
+# Line 337: Buffer cleared
+current_message_buffer = ""
+continue
+# Line 341: Final events processed WITHOUT buffer context
+agent_event = self._process_event(event, iteration)  # No buffer info!
 ```
 ---
+## The Fix (Consensus: Stateful Orchestrator Logic)
+**Location**: `src/orchestrators/advanced.py` `run()` method
+**Strategy**: Handle final events **inline in the run() loop** where buffer state exists. Track streaming volume to decide whether to re-emit content.
+### Why This Is Correct
+| Rejected Approach | Why Wrong |
+|-------------------|-----------|
+| UI-side string comparison | Wrong layer, fragile, treats symptom |
+| Stateless `_process_event` fix | No state = can't know if streaming occurred |
+| **Stateful run() loop** | ✅ Only place with full lifecycle visibility |
+The `run()` loop is the **single source of truth** for the request lifecycle. It "saw" the content stream out. It must decide whether to re-emit.
+### Implementation
 ```python
+# In run() method, add tracking variable after line 302:
+last_streamed_length: int = 0
+# Before clearing buffer at line 337, save its length:
+last_streamed_length = len(current_message_buffer)
+current_message_buffer = ""
+continue
+# Replace lines 340-345 with inline handling of final events:
+if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
+    final_event_received = True
+    # DECISION: Did we stream substantial content?
+    if last_streamed_length > 100:
+        # YES: Final event is a SIGNAL, not a payload
+        yield AgentEvent(
+            type="complete",
+            message="Research complete.",
+            data={"iterations": iteration, "streamed_chars": last_streamed_length},
+            iteration=iteration,
+        )
+    else:
+        # NO: Final event must carry the payload (tool-only turn, cache hit)
+        if isinstance(event, MagenticFinalResultEvent):
+            text = self._extract_text(event.message) if event.message else "No result"
+        else:  # WorkflowOutputEvent
+            text = self._extract_text(event.data) if event.data else "Research complete"
+        yield AgentEvent(
+            type="complete",
+            message=text,
+            data={"iterations": iteration},
+            iteration=iteration,
+        )
+    continue
+# Keep existing fallback for other events:
+agent_event = self._process_event(event, iteration)
 ```
+### Why Threshold of 100 Chars?
+- `> 0` is too aggressive (might catch single-word streams)
+- `> 500` is too conservative (might miss short but complete responses)
+- `> 100` distinguishes "real content was streamed" from "just status messages"
 ---
+## Edge Cases Handled
+| Scenario | `last_streamed_length` | Action |
+|----------|------------------------|--------|
+| Normal streaming report | 5000+ | Emit "Research complete." |
+| Tool call, no text | 0 | Emit full content from final event |
+| Very short response | 50 | Emit full content (fallback) |
+| Agent switch mid-stream | Reset on switch | Tracks only final agent |
 ---
+## Files to Modify
+| File | Lines | Change |
+|------|-------|--------|
+| `src/orchestrators/advanced.py` | 296-345 | Add `last_streamed_length`, handle final events inline |
+| `src/orchestrators/advanced.py` | 532-552 | Optional: remove dead code from `_process_event()` |
 ---
 ## Test Plan
+1. **Happy Path**: Run query, verify report appears ONCE
+2. **Fallback**: Mock tool-only turn (no streaming), verify full content emitted
+3. **Both Tiers**: Test Free Tier and Paid Tier
 ---
+## Validation
+This fix was independently validated by two AI agents (Claude and Gemini) analyzing the architecture. Both concluded:
+> "The Stateful Orchestrator Fix is the correct engineering solution. The 'Source of Truth' is the Orchestrator's runtime state."
 ---
 ## Related
+- **Not related to model quality** - This is a stack bug
+- P1 Free Tier fix enabled streaming, exposing this bug
+- SPEC-17 Accumulator Pattern addressed repr bug but created this side effect

docs/bugs/P2_FIRST_TURN_TIMEOUT.md ADDED

@@ -0,0 +1,160 @@

+# P2 Bug: First Agent Turn Exceeds Workflow Timeout
+**Date**: 2025-12-03
+**Status**: OPEN
+**Severity**: P2 (UX - Workflow always times out on complex queries)
+**Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py`
+**Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
+---
+## Executive Summary
+The search agent's first turn can exceed the 5-minute workflow timeout, causing:
+1. `iterations=0` at timeout (no agent completed a turn)
+2. `_handle_timeout()` synthesizes from partial evidence
+3. Users get incomplete research results
+This is a **performance/architecture bug**, not a model issue.
+---
+## Symptom
+```
+[warning] Workflow timed out             iterations=0
+```
+The workflow times out with `iterations=0` - meaning the first agent (search agent) never completed its turn before the 5-minute timeout.
+---
+## Root Cause
+The search agent's first turn is **extremely expensive**:
+```
+Search Agent First Turn:
+├── Manager assigns task
+├── Search agent starts
+│   ├── Calls PubMed search tool (10 results)
+│   ├── Calls ClinicalTrials search tool (10 results)
+│   ├── Calls EuropePMC search tool (10 results)
+│   └── For EACH result (30 total):
+│       ├── Generate embedding (OpenAI API call)
+│       ├── Check for duplicates (ChromaDB query)
+│       └── Store in ChromaDB
+│
+│   TOTAL: 30 results × (embedding + dedup + store) = 90+ API/DB operations
+│
+└── Agent turn completes (if timeout hasn't fired)
+```
+**The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget.
+---
+## Impact
+| Aspect | Impact |
+|--------|--------|
+| UX | Complex queries time out during the first turn |
+| Research quality | Synthesis happens on partial evidence |
+| Confusion | `iterations=0` looks like nothing happened |
+---
+## The Fix (Consensus)
+**Reduce work per turn + increase timeout budget.**
+### Implementation
+**1. Reduce results per tool (immediate)**
+`src/agents/search_agent.py` line 70:
+```python
+# Change from 10 to 5
+result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
+```
+**2. Increase workflow timeout (immediate)**
+`src/utils/config.py`:
+```python
+advanced_timeout: float = Field(
+    default=600.0,  # Was 300.0 (5 min), now 10 min
+    ge=60.0,
+    le=900.0,
+    description="Timeout for Advanced mode in seconds",
+)
+```
+### Why NOT Per-Turn Timeout
+**DANGER**: The SearchHandler uses `asyncio.gather()`:
+```python
+# src/tools/search_handler.py line 163-164
+results = await asyncio.gather(*tasks, return_exceptions=True)
+```
+This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` (Python 3.11+) and the timeout fires, the pending `gather()` is cancelled along with every search task inside it, so you get **zero results**, not partial results.
+```python
+# DON'T DO THIS - yields nothing on timeout
+async with asyncio.timeout(60):
+    result = await self._handler.execute(query)  # Cancelled = zero results
+```
+Per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work).
+---
+## Future Work (Streaming Evidence Ingestion)
+For a proper fix, `SearchHandler.execute()` should:
+1. Yield results as they arrive (async generator)
+2. Support cancellation with partial results
+3. Allow agent to return "what we have so far" on timeout
+```python
+# Future architecture
+async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]:
+    for tool in self.tools:
+        async for evidence in tool.search_streaming(query):
+            yield evidence  # Can be cancelled at any point
+```
+This is out of scope for the immediate fix.
+---
+## Test Plan
+1. Run query with 10-minute timeout
+2. Verify first agent turn completes before timeout
+3. Verify `iterations >= 1` at workflow end
+---
+## Verification Data
+From diagnostic run:
+```
+=== RAW FRAMEWORK EVENTS ===
+  MagenticAgentDeltaEvent: 284
+  MagenticOrchestratorMessageEvent: 3
+  ...
+  NO MagenticAgentMessageEvent  ← Agent never completed a turn!
+[warning] Workflow timed out             iterations=0
+```
+---
+## Related
+- P2 Duplicate Report Bug (separate issue, happens after successful completion)
+- `_handle_timeout()` correctly synthesizes, but with partial evidence
+- Not related to model quality - this is infrastructure/performance
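
The all-or-nothing behavior that the report warns about, and one possible way to keep partial results, can be demonstrated with plain `asyncio`. This is a minimal sketch, not the project's `SearchHandler`: `fake_search`, `all_or_nothing`, and `partial_results` are hypothetical names, the delays are arbitrary, and `asyncio.wait_for` is used in place of `asyncio.timeout()` so the example also runs on Python versions before 3.11.

```python
import asyncio


async def fake_search(name: str, delay: float) -> str:
    """Hypothetical search tool that returns its own name after a delay."""
    await asyncio.sleep(delay)
    return name


async def all_or_nothing(timeout: float) -> list[str]:
    """Timeout around gather(): a slow task discards the fast task's result."""
    coros = [fake_search("fast", 0.01), fake_search("slow", 1.0)]
    try:
        return await asyncio.wait_for(asyncio.gather(*coros), timeout)
    except asyncio.TimeoutError:
        return []  # the gather was cancelled, so nothing is returned


async def partial_results(timeout: float) -> list[str]:
    """asyncio.wait() with a timeout: keep whatever finished in time."""
    tasks = [
        asyncio.create_task(fake_search("fast", 0.01)),
        asyncio.create_task(fake_search("slow", 1.0)),
    ]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks finish unwinding before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return sorted(task.result() for task in done)


lost = asyncio.run(all_or_nothing(0.2))
kept = asyncio.run(partial_results(0.2))
print(lost, kept)
```

The second function is only one option for the "cancellation with partial results" item listed under Future Work. The async-generator design in the report would go further by yielding evidence as it arrives.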