Spaces:

VibecoderMcSwaggins
/

DeepBoner

Paused

App Files Files Community

VibecoderMcSwaggins commited on Nov 29, 2025

Commit

8f45b69

2 Parent(s): 622c8ba 82503b1

Merge branch 'dev' - P1 bug fixes + CodeRabbit feedback

Browse files

Files changed (9) hide show

AGENTS.md +2 -3
CLAUDE.md +2 -3
GEMINI.md +2 -3
docs/bugs/INVESTIGATION_INVALID_MODELS.md +13 -12
docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md +181 -0
src/agent_factory/judges.py +2 -2
src/app.py +41 -10
src/utils/llm_factory.py +2 -2
tests/unit/test_streaming_fix.py +118 -0

AGENTS.md CHANGED Viewed

@@ -93,9 +93,8 @@ DeepBonerError (base)
 Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
-- **OpenAI:** `gpt-5`
-  - This is the stable flagship model released in August 2025.
-  - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
 - **Anthropic:** `claude-sonnet-4-5-20250929`
   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.

 Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+- **OpenAI:** `gpt-5.1`
+  - Current flagship model (November 2025). Requires Tier 5 access.
 - **Anthropic:** `claude-sonnet-4-5-20250929`
   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.

CLAUDE.md CHANGED Viewed

@@ -100,9 +100,8 @@ DeepBonerError (base)
 Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
-- **OpenAI:** `gpt-5`
-  - This is the stable flagship model released in August 2025.
-  - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
 - **Anthropic:** `claude-sonnet-4-5-20250929`
   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.

 Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+- **OpenAI:** `gpt-5.1`
+  - Current flagship model (November 2025). Requires Tier 5 access.
 - **Anthropic:** `claude-sonnet-4-5-20250929`
   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.

GEMINI.md CHANGED Viewed

@@ -74,9 +74,8 @@ Settings via pydantic-settings from `.env`:
 Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
-- **OpenAI:** `gpt-5`
-  - This is the stable flagship model released in August 2025.
-  - While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
 - **Anthropic:** `claude-sonnet-4-5-20250929`
   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.

 Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
+- **OpenAI:** `gpt-5.1`
+  - Current flagship model (November 2025). Requires Tier 5 access.
 - **Anthropic:** `claude-sonnet-4-5-20250929`
   - This is the mid-range Claude 4.5 model, released on September 29, 2025.
   - The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.

docs/bugs/INVESTIGATION_INVALID_MODELS.md CHANGED Viewed

@@ -9,22 +9,23 @@
 ## Issue Description
 The user encountered a 403 error when running in Magentic mode:
-`Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5.1', ... 'code': 'model_not_found'}}`
-This indicates the application is trying to use `gpt-5.1`, which the user's API key did not have access to (likely a beta/gated model).
 ## Root Cause Analysis
-The default config used `gpt-5.1` (beta/preview) and `claude-sonnet-4-5-20250929`.
-Initial remediation mistakenly downgraded these to 2024 models (`gpt-4o`).
-Web search confirmed that in November 2025:
-- `claude-sonnet-4-5-20250929` IS valid.
-- `gpt-5.1` exists but access is restricted (leading to 403).
-- `gpt-5` (August 2025) is the stable flagship.
 ## Solution Implemented
 Updated `src/utils/config.py` to use:
-- `anthropic_model`: `claude-sonnet-4-5-20250929` (Restored correct Nov 2025 model)
-- `openai_model`: `gpt-5` (Changed from 5.1 to 5 to ensure stability/access).
 ## Verification
-- `tests/unit/agent_factory/test_judges_factory.py` updated and passed.

 ## Issue Description
 The user encountered a 403 error when running in Magentic mode:
+`Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5', ... 'code': 'model_not_found'}}`
 ## Root Cause Analysis
+OpenAI deprecated the base `gpt-5` model. Tier 5 accounts now have access to:
+- `gpt-5.1` (current flagship)
+- `gpt-5-mini`
+- `gpt-5-nano`
+- `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`
+- `o3`, `o4-mini`
+The base `gpt-5` is NO LONGER available via API.
 ## Solution Implemented
 Updated `src/utils/config.py` to use:
+- `openai_model`: `gpt-5.1` (the actual current model)
+- `anthropic_model`: `claude-sonnet-4-5-20250929` (unchanged)
 ## Verification
+- `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
+- User confirmed Tier 5 access to `gpt-5.1` via OpenAI dashboard.

docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md ADDED Viewed

	@@ -0,0 +1,181 @@

+# Bug Report: Magentic Mode Integration Issues
+## Status
+- **Date:** 2025-11-29
+- **Reporter:** CLI User
+- **Priority:** P1 (UX Degradation + Deprecation Warnings)
+- **Component:** `src/app.py`, `src/orchestrator_magentic.py`, `src/utils/llm_factory.py`
+- **Status:** ✅ FIXED (Bug 1 & Bug 2) - 2025-11-29
+- **Tests:** 138 passing (136 original + 2 new validation tests)
+---
+## Bug 1: Token-by-Token Streaming Spam ✅ FIXED
+### Symptoms
+When running Magentic (Advanced) mode, the UI shows hundreds of individual lines like:
+```text
+📡 STREAMING: Below
+📡 STREAMING: is
+📡 STREAMING: a
+📡 STREAMING: curated
+📡 STREAMING: list
+...
+```
+Each token is displayed as a separate streaming event, creating visual spam and making it impossible to read the output until completion.
+### Root Cause (VALIDATED)
+**File:** `src/orchestrator_magentic.py:247-254`
+```python
+elif isinstance(event, MagenticAgentDeltaEvent):
+    if event.text:
+        return AgentEvent(
+            type="streaming",
+            message=event.text,  # Single token!
+            data={"agent_id": event.agent_id},
+            iteration=iteration,
+        )
+```
+Every LLM token emits a `MagenticAgentDeltaEvent`, which creates an `AgentEvent(type="streaming")`.
+**File:** `src/app.py:171-192` (BEFORE FIX)
+```python
+async for event in orchestrator.run(message):
+    event_md = event.to_markdown()
+    response_parts.append(event_md)  # Appends EVERY token
+    if event.type == "complete":
+        yield event.message
+    else:
+        yield "\n\n".join(response_parts)  # Yields ALL accumulated tokens
+```
+For N tokens, this yields N times, each time showing all previous tokens. This is O(N²) string operations and creates massive visual spam.
+### Fix Applied
+**File:** `src/app.py:175-204`
+Implemented streaming token buffering with live updates:
+1. Added `streaming_buffer = ""` to accumulate tokens
+2. For each streaming event: append to buffer, yield immediately (for live typing UX)
+3. **Key fix**: Don't append streaming events to `response_parts` (prevents O(N²) list growth)
+4. Each yield has only ONE `📡 STREAMING:` line (the accumulated buffer)
+5. Flush buffer to `response_parts` only when non-streaming event occurs
+**Result**: Live typing feel preserved, but no visual spam (each update replaces, not accumulates)
+### Proposed Fix Options
+**Option A: Buffer streaming tokens (recommended)**
+```python
+# In app.py - accumulate streaming tokens, yield periodically
+streaming_buffer = ""
+last_yield_time = time.time()
+async for event in orchestrator.run(message):
+    if event.type == "streaming":
+        streaming_buffer += event.message
+        # Only yield every 500ms or on newline
+        if time.time() - last_yield_time > 0.5 or "\n" in event.message:
+            yield f"📡 {streaming_buffer}"
+            last_yield_time = time.time()
+    elif event.type == "complete":
+        yield event.message
+    else:
+        # Non-streaming events
+        response_parts.append(event.to_markdown())
+        yield "\n\n".join(response_parts)
+```
+**Option B: Don't yield streaming events at all**
+```python
+# In app.py - only yield meaningful events
+async for event in orchestrator.run(message):
+    if event.type == "streaming":
+        continue  # Skip token-by-token spam
+    # ... rest of logic
+```
+**Option C: Fix at orchestrator level**
+Don't emit `AgentEvent` for every delta - buffer in `_process_event`.
+---
+## Bug 2: API Key Does Not Persist in Textbox ✅ FIXED
+### Symptoms
+1. User opens the "Mode & API Key" accordion
+2. User pastes their API key into the password textbox
+3. User clicks an example OR clicks elsewhere
+4. The API key textbox is now empty - value lost
+### Root Cause (VALIDATED)
+**File:** `src/app.py:255-267` (BEFORE FIX)
+```python
+additional_inputs_accordion=additional_inputs_accordion,
+additional_inputs=[
+    gr.Radio(...),
+    gr.Textbox(
+        label="🔑 API Key (Optional)",
+        type="password",
+        # No `value` parameter - defaults to empty
+        # No state persistence mechanism
+    ),
+],
+```
+Gradio's `ChatInterface` with `additional_inputs` has known issues:
+1. Clicking examples resets additional inputs to defaults
+2. The accordion state and input values may not persist correctly
+3. No explicit state management for the API key
+### Fix Applied
+**Files Modified:**
+1. `src/app.py`
+2. `src/utils/llm_factory.py`
+**Bug 1 (Streaming Spam):**
+- Accumulate tokens in `streaming_buffer`
+- Yield updates immediately for live typing UX
+- **Key**: Don't append to `response_parts` until stream segment complete
+- Each yield has ONE `📡 STREAMING:` line (not N accumulated lines)
+**Bug 2 (API Key Persistence):**
+- **Strategy:** Partial example list (relies on Gradio behavior)
+  - Examples have only 2 elements `[message, mode]` instead of 4
+  - Gradio only updates inputs with corresponding example values
+  - Remaining inputs (api_key textbox) are left unchanged
+  - `api_key_state` parameter exists as fallback but may be redundant
+- **Note:** This is a workaround relying on undocumented Gradio behavior
+**Bug 3 (OpenAIModel Deprecation):** ✅ FIXED
+- Replaced all `OpenAIModel` imports with `OpenAIChatModel` in `src/app.py` and `src/utils/llm_factory.py`.
+### Test Results
+```bash
+uv run pytest tests/ -q
+============================= 138 passed in 20.60s =============================
+```
+**Status:** ✅ All tests passing
+### Why This Fix Works
+**Bug 1 (Streaming Spam):**
+- **Before:** Every token → `append()` to list → `yield` → List grew to size N → O(N²) complexity.
+- **After:** Every token → `yield` dynamically constructed string (buffer + history) → List stays size K (number of *events*).
+- **Impact:** Smooth streaming, no visual spam, no browser freeze.
+**Bug 2 (API Key):**
+- **Before:** Example click → Overwrote API Key textbox with `""`.
+- **After:** Example click → Updates only `message` and `mode` → API Key textbox untouched.
+- **Impact:** User input persists naturally.
+### Remaining Work
+- **Bug 4 (Asyncio GC errors):** Monitoring only - likely Gradio/HF Spaces issue

src/agent_factory/judges.py CHANGED Viewed

@@ -451,12 +451,12 @@ class MockJudgeHandler:
     def _extract_key_findings(self, evidence: list[Evidence], max_findings: int = 5) -> list[str]:
         """Extract key findings from evidence titles."""
-        findings = _extract_titles_from_evidence(
             evidence,
             max_items=max_findings,
             fallback_message="No specific findings extracted (demo mode)",
         )
-        return findings if findings else ["No specific findings extracted (demo mode)"]
     def _extract_drug_candidates(self, question: str, evidence: list[Evidence]) -> list[str]:
         """Extract drug candidates - demo mode returns honest message."""

     def _extract_key_findings(self, evidence: list[Evidence], max_findings: int = 5) -> list[str]:
         """Extract key findings from evidence titles."""
+        # Helper guarantees non-empty list when fallback_message is provided
+        return _extract_titles_from_evidence(
             evidence,
             max_items=max_findings,
             fallback_message="No specific findings extracted (demo mode)",
         )
     def _extract_drug_candidates(self, question: str, evidence: list[Evidence]) -> list[str]:
         """Extract drug candidates - demo mode returns honest message."""

src/app.py CHANGED Viewed

@@ -6,7 +6,7 @@ from typing import Any
 import gradio as gr
 from pydantic_ai.models.anthropic import AnthropicModel
-from pydantic_ai.models.openai import OpenAIModel
 from pydantic_ai.providers.anthropic import AnthropicProvider
 from pydantic_ai.providers.openai import OpenAIProvider
@@ -61,7 +61,7 @@ def configure_orchestrator(
     # 2. Paid API Key (User provided or Env)
     elif user_api_key and user_api_key.strip():
         # Auto-detect provider from key prefix
-        model: AnthropicModel | OpenAIModel
         if user_api_key.startswith("sk-ant-"):
             # Anthropic key
             anthropic_provider = AnthropicProvider(api_key=user_api_key)
@@ -70,7 +70,7 @@ def configure_orchestrator(
         elif user_api_key.startswith("sk-"):
             # OpenAI key
             openai_provider = OpenAIProvider(api_key=user_api_key)
-            model = OpenAIModel(settings.openai_model, provider=openai_provider)
             backend_info = "Paid API (OpenAI)"
         else:
             raise ConfigurationError(
@@ -108,6 +108,7 @@ async def research_agent(
     history: list[dict[str, Any]],
     mode: str = "simple",
     api_key: str = "",
 ) -> AsyncGenerator[str, None]:
     """
     Gradio chat function that runs the research agent.
@@ -117,6 +118,7 @@ async def research_agent(
         history: Chat history (Gradio format)
         mode: Orchestrator mode ("simple" or "advanced")
         api_key: Optional user-provided API key (BYOK - auto-detects provider)
     Yields:
         Markdown-formatted responses for streaming
@@ -125,8 +127,8 @@ async def research_agent(
         yield "Please enter a research question."
         return
-    # Clean user-provided API key
-    user_api_key = api_key.strip() if api_key else None
     # Check available keys
     has_openai = bool(os.getenv("OPENAI_API_KEY"))
@@ -155,6 +157,7 @@ async def research_agent(
     # Run the agent and stream events
     response_parts: list[str] = []
     try:
         # use_mock=False - let configure_orchestrator decide based on available keys
@@ -168,17 +171,36 @@ async def research_agent(
         yield f"🧠 **Backend**: {backend_name}\n\n"
         async for event in orchestrator.run(message):
-            # Format event as markdown
-            event_md = event.to_markdown()
-            response_parts.append(event_md)
-            # If complete, show full response
             if event.type == "complete":
                 yield event.message
             else:
                 # Show progress
                 yield "\n\n".join(response_parts)
     except Exception as e:
         yield f"❌ **Error**: {e!s}"
@@ -193,6 +215,10 @@ def create_demo() -> tuple[gr.ChatInterface, gr.Accordion]:
     additional_inputs_accordion = gr.Accordion(
         label="⚙️ Mode & API Key (Free tier works!)", open=False
     )
     # 1. Unwrapped ChatInterface (Fixes Accordion Bug)
     demo = gr.ChatInterface(
         fn=research_agent,
@@ -210,6 +236,7 @@ def create_demo() -> tuple[gr.ChatInterface, gr.Accordion]:
             [
                 "What drugs improve female libido post-menopause?",
                 "simple",
             ],
             [
                 "Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?",
@@ -234,9 +261,13 @@ def create_demo() -> tuple[gr.ChatInterface, gr.Accordion]:
                 type="password",
                 info="Leave empty for free tier. Auto-detects provider from key prefix.",
             ),
         ],
     )
     return demo, additional_inputs_accordion

 import gradio as gr
 from pydantic_ai.models.anthropic import AnthropicModel
+from pydantic_ai.models.openai import OpenAIChatModel
 from pydantic_ai.providers.anthropic import AnthropicProvider
 from pydantic_ai.providers.openai import OpenAIProvider
     # 2. Paid API Key (User provided or Env)
     elif user_api_key and user_api_key.strip():
         # Auto-detect provider from key prefix
+        model: AnthropicModel | OpenAIChatModel
         if user_api_key.startswith("sk-ant-"):
             # Anthropic key
             anthropic_provider = AnthropicProvider(api_key=user_api_key)
         elif user_api_key.startswith("sk-"):
             # OpenAI key
             openai_provider = OpenAIProvider(api_key=user_api_key)
+            model = OpenAIChatModel(settings.openai_model, provider=openai_provider)
             backend_info = "Paid API (OpenAI)"
         else:
             raise ConfigurationError(
     history: list[dict[str, Any]],
     mode: str = "simple",
     api_key: str = "",
+    api_key_state: str = "",
 ) -> AsyncGenerator[str, None]:
     """
     Gradio chat function that runs the research agent.
         history: Chat history (Gradio format)
         mode: Orchestrator mode ("simple" or "advanced")
         api_key: Optional user-provided API key (BYOK - auto-detects provider)
+        api_key_state: Persistent API key state (survives example clicks)
     Yields:
         Markdown-formatted responses for streaming
         yield "Please enter a research question."
         return
+    # BUG FIX: Prefer freshly-entered key, then persisted state
+    user_api_key = (api_key.strip() or api_key_state.strip()) or None
     # Check available keys
     has_openai = bool(os.getenv("OPENAI_API_KEY"))
     # Run the agent and stream events
     response_parts: list[str] = []
+    streaming_buffer = ""  # Buffer for accumulating streaming tokens
     try:
         # use_mock=False - let configure_orchestrator decide based on available keys
         yield f"🧠 **Backend**: {backend_name}\n\n"
         async for event in orchestrator.run(message):
+            # BUG FIX: Handle streaming events separately to avoid token-by-token spam
+            if event.type == "streaming":
+                # Accumulate streaming tokens without emitting individual events
+                streaming_buffer += event.message
+                # Yield the current buffer combined with previous parts to show progress
+                # But DO NOT append to response_parts list yet (to avoid O(N^2) list growth)
+                current_parts = [*response_parts, f"📡 **STREAMING**: {streaming_buffer}"]
+                yield "\n\n".join(current_parts)
+                continue
+            # For non-streaming events, flush any buffered streaming content first
+            if streaming_buffer:
+                response_parts.append(f"📡 **STREAMING**: {streaming_buffer}")
+                streaming_buffer = ""  # Reset buffer
+            # Handle complete events specially
             if event.type == "complete":
                 yield event.message
             else:
+                # Format and append non-streaming events
+                event_md = event.to_markdown()
+                response_parts.append(event_md)
                 # Show progress
                 yield "\n\n".join(response_parts)
+        # Flush any remaining streaming content at the end
+        if streaming_buffer:
+            response_parts.append(f"📡 **STREAMING**: {streaming_buffer}")
+            yield "\n\n".join(response_parts)
     except Exception as e:
         yield f"❌ **Error**: {e!s}"
     additional_inputs_accordion = gr.Accordion(
         label="⚙️ Mode & API Key (Free tier works!)", open=False
     )
+    # BUG FIX: Add gr.State for API key persistence across example clicks
+    api_key_state = gr.State("")
     # 1. Unwrapped ChatInterface (Fixes Accordion Bug)
     demo = gr.ChatInterface(
         fn=research_agent,
             [
                 "What drugs improve female libido post-menopause?",
                 "simple",
+                # Removed empty strings for api_key and api_key_state to prevent overwriting
             ],
             [
                 "Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?",
                 type="password",
                 info="Leave empty for free tier. Auto-detects provider from key prefix.",
             ),
+            api_key_state,  # Hidden state component for persistence
         ],
     )
+    # API key persists because examples only include [message, mode] columns,
+    # so Gradio doesn't overwrite the api_key textbox when examples are clicked.
     return demo, additional_inputs_accordion

src/utils/llm_factory.py CHANGED Viewed

@@ -56,7 +56,7 @@ def get_pydantic_ai_model() -> Any:
         Configured pydantic-ai model
     """
     from pydantic_ai.models.anthropic import AnthropicModel
-    from pydantic_ai.models.openai import OpenAIModel
     from pydantic_ai.providers.anthropic import AnthropicProvider
     from pydantic_ai.providers.openai import OpenAIProvider
@@ -64,7 +64,7 @@ def get_pydantic_ai_model() -> Any:
         if not settings.openai_api_key:
             raise ConfigurationError("OPENAI_API_KEY not set for pydantic-ai")
         provider = OpenAIProvider(api_key=settings.openai_api_key)
-        return OpenAIModel(settings.openai_model, provider=provider)
     if settings.llm_provider == "anthropic":
         if not settings.anthropic_api_key:

         Configured pydantic-ai model
     """
     from pydantic_ai.models.anthropic import AnthropicModel
+    from pydantic_ai.models.openai import OpenAIChatModel
     from pydantic_ai.providers.anthropic import AnthropicProvider
     from pydantic_ai.providers.openai import OpenAIProvider
         if not settings.openai_api_key:
             raise ConfigurationError("OPENAI_API_KEY not set for pydantic-ai")
         provider = OpenAIProvider(api_key=settings.openai_api_key)
+        return OpenAIChatModel(settings.openai_model, provider=provider)
     if settings.llm_provider == "anthropic":
         if not settings.anthropic_api_key:

tests/unit/test_streaming_fix.py ADDED Viewed

	@@ -0,0 +1,118 @@

+"""Test that streaming event handling is fixed (no token-by-token spam)."""
+from unittest.mock import MagicMock
+import pytest
+from src.utils.models import AgentEvent
+@pytest.mark.unit
+@pytest.mark.asyncio
+async def test_streaming_events_are_buffered_not_spammed():
+    """
+    Verify that streaming events are buffered, not yielded individually.
+    This test validates the fix for Bug 1: Token-by-Token Streaming Spam.
+    Before the fix, each token would create a separate yield, resulting in O(N²) spam.
+    After the fix, streaming tokens are buffered and only yielded once.
+    """
+    # Import here to avoid circular dependencies
+    from src.app import research_agent
+    # Mock orchestrator
+    mock_orchestrator = MagicMock()
+    # Simulate streaming events (like LLM token-by-token output)
+    streaming_events = [
+        AgentEvent(type="started", message="Starting research", iteration=0),
+        AgentEvent(type="streaming", message="This", iteration=1),
+        AgentEvent(type="streaming", message=" is", iteration=1),
+        AgentEvent(type="streaming", message=" a", iteration=1),
+        AgentEvent(type="streaming", message=" test", iteration=1),
+        AgentEvent(type="complete", message="Final answer: This is a test", iteration=1),
+    ]
+    # Create async generator that yields events
+    async def mock_run(query):
+        for event in streaming_events:
+            yield event
+    mock_orchestrator.run = mock_run
+    # Mock configure_orchestrator to return our mock
+    import src.app as app_module
+    original_configure = app_module.configure_orchestrator
+    app_module.configure_orchestrator = MagicMock(return_value=(mock_orchestrator, "Test Backend"))
+    try:
+        # Run the research agent
+        results = []
+        async for result in research_agent("test query", [], mode="simple", api_key=""):
+            results.append(result)
+        # Verify that we DO see streaming updates (for UX responsiveness)
+        # But we don't want O(N^2) growth of the persisted list.
+        # We expect results to contain the streaming updates
+        assert len(results) > 0, "Should have yielded results"
+        # Check that we see the accumulated message
+        assert any(
+            "📡 **STREAMING**: This is a test" in r for r in results
+        ), "Buffer didn't accumulate correctly"
+        # The critical check for the "Spam" bug:
+        # In the spam bug, the output grew like:
+        # "Stream: T"
+        # "Stream: T\nStream: h"
+        # "Stream: T\nStream: h\nStream: i"
+        #
+        # In the fixed version, it should look like:
+        # "Stream: T"
+        # "Stream: Th"
+        # "Stream: Thi"
+        # (Replacing the last line, not adding new lines)
+        for res in results:
+            # Count occurrences of "📡 **STREAMING**:": in a single result string
+            # It should appear AT MOST once
+            # (unless we have multiple distinct streaming blocks)
+            streaming_markers = res.count("📡 **STREAMING**:")
+            assert streaming_markers <= 1, (
+                f"Found multiple streaming markers in single response: {res}\n"
+                "This indicates we are appending new lines instead of updating in place."
+            )
+        # The final result should be the complete message
+        assert any("Final answer" in r for r in results), "Missing final complete message"
+    finally:
+        # Restore original function
+        app_module.configure_orchestrator = original_configure
+@pytest.mark.unit
+@pytest.mark.asyncio
+async def test_api_key_state_parameter_exists():
+    """
+    Verify that api_key_state parameter was added to research_agent.
+    This validates the fix for Bug 2: API Key Persistence.
+    """
+    import inspect
+    from src.app import research_agent
+    # Get function signature
+    sig = inspect.signature(research_agent)
+    params = list(sig.parameters.keys())
+    # Verify api_key_state parameter exists
+    assert "api_key_state" in params, "api_key_state parameter missing from research_agent"
+    # Verify it's after api_key
+    api_key_idx = params.index("api_key")
+    api_key_state_idx = params.index("api_key_state")
+    assert api_key_state_idx > api_key_idx, "api_key_state should come after api_key"