Merge branch 'dev' - P1 bug fixes + CodeRabbit feedback
Browse files- AGENTS.md +2 -3
- CLAUDE.md +2 -3
- GEMINI.md +2 -3
- docs/bugs/INVESTIGATION_INVALID_MODELS.md +13 -12
- docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md +181 -0
- src/agent_factory/judges.py +2 -2
- src/app.py +41 -10
- src/utils/llm_factory.py +2 -2
- tests/unit/test_streaming_fix.py +118 -0
AGENTS.md
CHANGED
|
@@ -93,9 +93,8 @@ DeepBonerError (base)
|
|
| 93 |
|
| 94 |
Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
|
| 95 |
|
| 96 |
-
- **OpenAI:** `gpt-5`
|
| 97 |
-
-
|
| 98 |
-
- While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
|
| 99 |
- **Anthropic:** `claude-sonnet-4-5-20250929`
|
| 100 |
- This is the mid-range Claude 4.5 model, released on September 29, 2025.
|
| 101 |
- The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
|
|
|
|
| 93 |
|
| 94 |
Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
|
| 95 |
|
| 96 |
+
- **OpenAI:** `gpt-5.1`
|
| 97 |
+
- Current flagship model (November 2025). Requires Tier 5 access.
|
|
|
|
| 98 |
- **Anthropic:** `claude-sonnet-4-5-20250929`
|
| 99 |
- This is the mid-range Claude 4.5 model, released on September 29, 2025.
|
| 100 |
- The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
|
CLAUDE.md
CHANGED
|
@@ -100,9 +100,8 @@ DeepBonerError (base)
|
|
| 100 |
|
| 101 |
Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
|
| 102 |
|
| 103 |
-
- **OpenAI:** `gpt-5`
|
| 104 |
-
-
|
| 105 |
-
- While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
|
| 106 |
- **Anthropic:** `claude-sonnet-4-5-20250929`
|
| 107 |
- This is the mid-range Claude 4.5 model, released on September 29, 2025.
|
| 108 |
- The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
|
|
|
|
| 100 |
|
| 101 |
Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
|
| 102 |
|
| 103 |
+
- **OpenAI:** `gpt-5.1`
|
| 104 |
+
- Current flagship model (November 2025). Requires Tier 5 access.
|
|
|
|
| 105 |
- **Anthropic:** `claude-sonnet-4-5-20250929`
|
| 106 |
- This is the mid-range Claude 4.5 model, released on September 29, 2025.
|
| 107 |
- The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
|
GEMINI.md
CHANGED
|
@@ -74,9 +74,8 @@ Settings via pydantic-settings from `.env`:
|
|
| 74 |
|
| 75 |
Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
|
| 76 |
|
| 77 |
-
- **OpenAI:** `gpt-5`
|
| 78 |
-
-
|
| 79 |
-
- While `gpt-5.1` (released November 2025) exists, it is currently gated, and attempts to use it resulted in a `403 model_not_found` error for typical API keys. Advanced users with access to `gpt-5.1-instant`, `gpt-5.1-thinking`, or `gpt-5.1-codex-max` may configure their `.env` accordingly.
|
| 80 |
- **Anthropic:** `claude-sonnet-4-5-20250929`
|
| 81 |
- This is the mid-range Claude 4.5 model, released on September 29, 2025.
|
| 82 |
- The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
|
|
|
|
| 74 |
|
| 75 |
Given the rapid advancements, as of November 29, 2025, the DeepBoner project uses the following default LLM models in its configuration (`src/utils/config.py`):
|
| 76 |
|
| 77 |
+
- **OpenAI:** `gpt-5.1`
|
| 78 |
+
- Current flagship model (November 2025). Requires Tier 5 access.
|
|
|
|
| 79 |
- **Anthropic:** `claude-sonnet-4-5-20250929`
|
| 80 |
- This is the mid-range Claude 4.5 model, released on September 29, 2025.
|
| 81 |
- The flagship `Claude Opus 4.5` (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
|
docs/bugs/INVESTIGATION_INVALID_MODELS.md
CHANGED
|
@@ -9,22 +9,23 @@
|
|
| 9 |
|
| 10 |
## Issue Description
|
| 11 |
The user encountered a 403 error when running in Magentic mode:
|
| 12 |
-
`Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5
|
| 13 |
-
|
| 14 |
-
This indicates the application is trying to use `gpt-5.1`, which the user's API key did not have access to (likely a beta/gated model).
|
| 15 |
|
| 16 |
## Root Cause Analysis
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
- `
|
| 21 |
-
- `gpt-
|
| 22 |
-
- `
|
|
|
|
|
|
|
| 23 |
|
| 24 |
## Solution Implemented
|
| 25 |
Updated `src/utils/config.py` to use:
|
| 26 |
-
- `
|
| 27 |
-
- `
|
| 28 |
|
| 29 |
## Verification
|
| 30 |
-
- `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
|
|
|
|
|
|
| 9 |
|
| 10 |
## Issue Description
|
| 11 |
The user encountered a 403 error when running in Magentic mode:
|
| 12 |
+
`Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5', ... 'code': 'model_not_found'}}`
|
|
|
|
|
|
|
| 13 |
|
| 14 |
## Root Cause Analysis
|
| 15 |
+
OpenAI deprecated the base `gpt-5` model. Tier 5 accounts now have access to:
|
| 16 |
+
- `gpt-5.1` (current flagship)
|
| 17 |
+
- `gpt-5-mini`
|
| 18 |
+
- `gpt-5-nano`
|
| 19 |
+
- `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`
|
| 20 |
+
- `o3`, `o4-mini`
|
| 21 |
+
|
| 22 |
+
The base `gpt-5` is NO LONGER available via API.
|
| 23 |
|
| 24 |
## Solution Implemented
|
| 25 |
Updated `src/utils/config.py` to use:
|
| 26 |
+
- `openai_model`: `gpt-5.1` (the actual current model)
|
| 27 |
+
- `anthropic_model`: `claude-sonnet-4-5-20250929` (unchanged)
|
| 28 |
|
| 29 |
## Verification
|
| 30 |
+
- `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
|
| 31 |
+
- User confirmed Tier 5 access to `gpt-5.1` via OpenAI dashboard.
|
docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md
ADDED
|
@@ -0,0 +1,181 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Bug Report: Magentic Mode Integration Issues
|
| 2 |
+
|
| 3 |
+
## Status
|
| 4 |
+
- **Date:** 2025-11-29
|
| 5 |
+
- **Reporter:** CLI User
|
| 6 |
+
- **Priority:** P1 (UX Degradation + Deprecation Warnings)
|
| 7 |
+
- **Component:** `src/app.py`, `src/orchestrator_magentic.py`, `src/utils/llm_factory.py`
|
| 8 |
+
- **Status:** ✅ FIXED (Bug 1 & Bug 2) - 2025-11-29
|
| 9 |
+
- **Tests:** 138 passing (136 original + 2 new validation tests)
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## Bug 1: Token-by-Token Streaming Spam ✅ FIXED
|
| 14 |
+
|
| 15 |
+
### Symptoms
|
| 16 |
+
When running Magentic (Advanced) mode, the UI shows hundreds of individual lines like:
|
| 17 |
+
```text
|
| 18 |
+
📡 STREAMING: Below
|
| 19 |
+
📡 STREAMING: is
|
| 20 |
+
📡 STREAMING: a
|
| 21 |
+
📡 STREAMING: curated
|
| 22 |
+
📡 STREAMING: list
|
| 23 |
+
...
|
| 24 |
+
```
|
| 25 |
+
|
| 26 |
+
Each token is displayed as a separate streaming event, creating visual spam and making it impossible to read the output until completion.
|
| 27 |
+
|
| 28 |
+
### Root Cause (VALIDATED)
|
| 29 |
+
**File:** `src/orchestrator_magentic.py:247-254`
|
| 30 |
+
|
| 31 |
+
```python
|
| 32 |
+
elif isinstance(event, MagenticAgentDeltaEvent):
|
| 33 |
+
if event.text:
|
| 34 |
+
return AgentEvent(
|
| 35 |
+
type="streaming",
|
| 36 |
+
message=event.text, # Single token!
|
| 37 |
+
data={"agent_id": event.agent_id},
|
| 38 |
+
iteration=iteration,
|
| 39 |
+
)
|
| 40 |
+
```
|
| 41 |
+
|
| 42 |
+
Every LLM token emits a `MagenticAgentDeltaEvent`, which creates an `AgentEvent(type="streaming")`.
|
| 43 |
+
|
| 44 |
+
**File:** `src/app.py:171-192` (BEFORE FIX)
|
| 45 |
+
|
| 46 |
+
```python
|
| 47 |
+
async for event in orchestrator.run(message):
|
| 48 |
+
event_md = event.to_markdown()
|
| 49 |
+
response_parts.append(event_md) # Appends EVERY token
|
| 50 |
+
|
| 51 |
+
if event.type == "complete":
|
| 52 |
+
yield event.message
|
| 53 |
+
else:
|
| 54 |
+
yield "\n\n".join(response_parts) # Yields ALL accumulated tokens
|
| 55 |
+
```
|
| 56 |
+
|
| 57 |
+
For N tokens, this yields N times, each time showing all previous tokens. This is O(N²) string operations and creates massive visual spam.
|
| 58 |
+
|
| 59 |
+
### Fix Applied
|
| 60 |
+
**File:** `src/app.py:175-204`
|
| 61 |
+
|
| 62 |
+
Implemented streaming token buffering with live updates:
|
| 63 |
+
1. Added `streaming_buffer = ""` to accumulate tokens
|
| 64 |
+
2. For each streaming event: append to buffer, yield immediately (for live typing UX)
|
| 65 |
+
3. **Key fix**: Don't append streaming events to `response_parts` (prevents O(N²) list growth)
|
| 66 |
+
4. Each yield has only ONE `📡 STREAMING:` line (the accumulated buffer)
|
| 67 |
+
5. Flush buffer to `response_parts` only when non-streaming event occurs
|
| 68 |
+
|
| 69 |
+
**Result**: Live typing feel preserved, but no visual spam (each update replaces, not accumulates)
|
| 70 |
+
|
| 71 |
+
### Proposed Fix Options
|
| 72 |
+
|
| 73 |
+
**Option A: Buffer streaming tokens (recommended)**
|
| 74 |
+
```python
|
| 75 |
+
# In app.py - accumulate streaming tokens, yield periodically
|
| 76 |
+
streaming_buffer = ""
|
| 77 |
+
last_yield_time = time.time()
|
| 78 |
+
|
| 79 |
+
async for event in orchestrator.run(message):
|
| 80 |
+
if event.type == "streaming":
|
| 81 |
+
streaming_buffer += event.message
|
| 82 |
+
# Only yield every 500ms or on newline
|
| 83 |
+
if time.time() - last_yield_time > 0.5 or "\n" in event.message:
|
| 84 |
+
yield f"📡 {streaming_buffer}"
|
| 85 |
+
last_yield_time = time.time()
|
| 86 |
+
elif event.type == "complete":
|
| 87 |
+
yield event.message
|
| 88 |
+
else:
|
| 89 |
+
# Non-streaming events
|
| 90 |
+
response_parts.append(event.to_markdown())
|
| 91 |
+
yield "\n\n".join(response_parts)
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
**Option B: Don't yield streaming events at all**
|
| 95 |
+
```python
|
| 96 |
+
# In app.py - only yield meaningful events
|
| 97 |
+
async for event in orchestrator.run(message):
|
| 98 |
+
if event.type == "streaming":
|
| 99 |
+
continue # Skip token-by-token spam
|
| 100 |
+
# ... rest of logic
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
**Option C: Fix at orchestrator level**
|
| 104 |
+
Don't emit `AgentEvent` for every delta - buffer in `_process_event`.
|
| 105 |
+
|
| 106 |
+
---
|
| 107 |
+
|
| 108 |
+
## Bug 2: API Key Does Not Persist in Textbox ✅ FIXED
|
| 109 |
+
|
| 110 |
+
### Symptoms
|
| 111 |
+
1. User opens the "Mode & API Key" accordion
|
| 112 |
+
2. User pastes their API key into the password textbox
|
| 113 |
+
3. User clicks an example OR clicks elsewhere
|
| 114 |
+
4. The API key textbox is now empty - value lost
|
| 115 |
+
|
| 116 |
+
### Root Cause (VALIDATED)
|
| 117 |
+
**File:** `src/app.py:255-267` (BEFORE FIX)
|
| 118 |
+
|
| 119 |
+
```python
|
| 120 |
+
additional_inputs_accordion=additional_inputs_accordion,
|
| 121 |
+
additional_inputs=[
|
| 122 |
+
gr.Radio(...),
|
| 123 |
+
gr.Textbox(
|
| 124 |
+
label="🔑 API Key (Optional)",
|
| 125 |
+
type="password",
|
| 126 |
+
# No `value` parameter - defaults to empty
|
| 127 |
+
# No state persistence mechanism
|
| 128 |
+
),
|
| 129 |
+
],
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
Gradio's `ChatInterface` with `additional_inputs` has known issues:
|
| 133 |
+
1. Clicking examples resets additional inputs to defaults
|
| 134 |
+
2. The accordion state and input values may not persist correctly
|
| 135 |
+
3. No explicit state management for the API key
|
| 136 |
+
|
| 137 |
+
### Fix Applied
|
| 138 |
+
**Files Modified:**
|
| 139 |
+
1. `src/app.py`
|
| 140 |
+
2. `src/utils/llm_factory.py`
|
| 141 |
+
|
| 142 |
+
**Bug 1 (Streaming Spam):**
|
| 143 |
+
- Accumulate tokens in `streaming_buffer`
|
| 144 |
+
- Yield updates immediately for live typing UX
|
| 145 |
+
- **Key**: Don't append to `response_parts` until stream segment complete
|
| 146 |
+
- Each yield has ONE `📡 STREAMING:` line (not N accumulated lines)
|
| 147 |
+
|
| 148 |
+
**Bug 2 (API Key Persistence):**
|
| 149 |
+
- **Strategy:** Partial example list (relies on Gradio behavior)
|
| 150 |
+
- Examples have only 2 elements `[message, mode]` instead of 4
|
| 151 |
+
- Gradio only updates inputs with corresponding example values
|
| 152 |
+
- Remaining inputs (api_key textbox) are left unchanged
|
| 153 |
+
- `api_key_state` parameter exists as fallback but may be redundant
|
| 154 |
+
- **Note:** This is a workaround relying on undocumented Gradio behavior
|
| 155 |
+
|
| 156 |
+
**Bug 3 (OpenAIModel Deprecation):** ✅ FIXED
|
| 157 |
+
- Replaced all `OpenAIModel` imports with `OpenAIChatModel` in `src/app.py` and `src/utils/llm_factory.py`.
|
| 158 |
+
|
| 159 |
+
### Test Results
|
| 160 |
+
```bash
|
| 161 |
+
uv run pytest tests/ -q
|
| 162 |
+
============================= 138 passed in 20.60s =============================
|
| 163 |
+
```
|
| 164 |
+
|
| 165 |
+
**Status:** ✅ All tests passing
|
| 166 |
+
|
| 167 |
+
### Why This Fix Works
|
| 168 |
+
|
| 169 |
+
**Bug 1 (Streaming Spam):**
|
| 170 |
+
- **Before:** Every token → `append()` to list → `yield` → List grew to size N → O(N²) complexity.
|
| 171 |
+
- **After:** Every token → `yield` dynamically constructed string (buffer + history) → List stays size K (number of *events*).
|
| 172 |
+
- **Impact:** Smooth streaming, no visual spam, no browser freeze.
|
| 173 |
+
|
| 174 |
+
**Bug 2 (API Key):**
|
| 175 |
+
- **Before:** Example click → Overwrote API Key textbox with `""`.
|
| 176 |
+
- **After:** Example click → Updates only `message` and `mode` → API Key textbox untouched.
|
| 177 |
+
- **Impact:** User input persists naturally.
|
| 178 |
+
|
| 179 |
+
### Remaining Work
|
| 180 |
+
- **Bug 4 (Asyncio GC errors):** Monitoring only - likely Gradio/HF Spaces issue
|
| 181 |
+
|
src/agent_factory/judges.py
CHANGED
|
@@ -451,12 +451,12 @@ class MockJudgeHandler:
|
|
| 451 |
|
| 452 |
def _extract_key_findings(self, evidence: list[Evidence], max_findings: int = 5) -> list[str]:
|
| 453 |
"""Extract key findings from evidence titles."""
|
| 454 |
-
|
|
|
|
| 455 |
evidence,
|
| 456 |
max_items=max_findings,
|
| 457 |
fallback_message="No specific findings extracted (demo mode)",
|
| 458 |
)
|
| 459 |
-
return findings if findings else ["No specific findings extracted (demo mode)"]
|
| 460 |
|
| 461 |
def _extract_drug_candidates(self, question: str, evidence: list[Evidence]) -> list[str]:
|
| 462 |
"""Extract drug candidates - demo mode returns honest message."""
|
|
|
|
| 451 |
|
| 452 |
def _extract_key_findings(self, evidence: list[Evidence], max_findings: int = 5) -> list[str]:
|
| 453 |
"""Extract key findings from evidence titles."""
|
| 454 |
+
# Helper guarantees non-empty list when fallback_message is provided
|
| 455 |
+
return _extract_titles_from_evidence(
|
| 456 |
evidence,
|
| 457 |
max_items=max_findings,
|
| 458 |
fallback_message="No specific findings extracted (demo mode)",
|
| 459 |
)
|
|
|
|
| 460 |
|
| 461 |
def _extract_drug_candidates(self, question: str, evidence: list[Evidence]) -> list[str]:
|
| 462 |
"""Extract drug candidates - demo mode returns honest message."""
|
src/app.py
CHANGED
|
@@ -6,7 +6,7 @@ from typing import Any
|
|
| 6 |
|
| 7 |
import gradio as gr
|
| 8 |
from pydantic_ai.models.anthropic import AnthropicModel
|
| 9 |
-
from pydantic_ai.models.openai import
|
| 10 |
from pydantic_ai.providers.anthropic import AnthropicProvider
|
| 11 |
from pydantic_ai.providers.openai import OpenAIProvider
|
| 12 |
|
|
@@ -61,7 +61,7 @@ def configure_orchestrator(
|
|
| 61 |
# 2. Paid API Key (User provided or Env)
|
| 62 |
elif user_api_key and user_api_key.strip():
|
| 63 |
# Auto-detect provider from key prefix
|
| 64 |
-
model: AnthropicModel |
|
| 65 |
if user_api_key.startswith("sk-ant-"):
|
| 66 |
# Anthropic key
|
| 67 |
anthropic_provider = AnthropicProvider(api_key=user_api_key)
|
|
@@ -70,7 +70,7 @@ def configure_orchestrator(
|
|
| 70 |
elif user_api_key.startswith("sk-"):
|
| 71 |
# OpenAI key
|
| 72 |
openai_provider = OpenAIProvider(api_key=user_api_key)
|
| 73 |
-
model =
|
| 74 |
backend_info = "Paid API (OpenAI)"
|
| 75 |
else:
|
| 76 |
raise ConfigurationError(
|
|
@@ -108,6 +108,7 @@ async def research_agent(
|
|
| 108 |
history: list[dict[str, Any]],
|
| 109 |
mode: str = "simple",
|
| 110 |
api_key: str = "",
|
|
|
|
| 111 |
) -> AsyncGenerator[str, None]:
|
| 112 |
"""
|
| 113 |
Gradio chat function that runs the research agent.
|
|
@@ -117,6 +118,7 @@ async def research_agent(
|
|
| 117 |
history: Chat history (Gradio format)
|
| 118 |
mode: Orchestrator mode ("simple" or "advanced")
|
| 119 |
api_key: Optional user-provided API key (BYOK - auto-detects provider)
|
|
|
|
| 120 |
|
| 121 |
Yields:
|
| 122 |
Markdown-formatted responses for streaming
|
|
@@ -125,8 +127,8 @@ async def research_agent(
|
|
| 125 |
yield "Please enter a research question."
|
| 126 |
return
|
| 127 |
|
| 128 |
-
#
|
| 129 |
-
user_api_key = api_key.strip()
|
| 130 |
|
| 131 |
# Check available keys
|
| 132 |
has_openai = bool(os.getenv("OPENAI_API_KEY"))
|
|
@@ -155,6 +157,7 @@ async def research_agent(
|
|
| 155 |
|
| 156 |
# Run the agent and stream events
|
| 157 |
response_parts: list[str] = []
|
|
|
|
| 158 |
|
| 159 |
try:
|
| 160 |
# use_mock=False - let configure_orchestrator decide based on available keys
|
|
@@ -168,17 +171,36 @@ async def research_agent(
|
|
| 168 |
yield f"🧠 **Backend**: {backend_name}\n\n"
|
| 169 |
|
| 170 |
async for event in orchestrator.run(message):
|
| 171 |
-
#
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 176 |
if event.type == "complete":
|
| 177 |
yield event.message
|
| 178 |
else:
|
|
|
|
|
|
|
|
|
|
| 179 |
# Show progress
|
| 180 |
yield "\n\n".join(response_parts)
|
| 181 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 182 |
except Exception as e:
|
| 183 |
yield f"❌ **Error**: {e!s}"
|
| 184 |
|
|
@@ -193,6 +215,10 @@ def create_demo() -> tuple[gr.ChatInterface, gr.Accordion]:
|
|
| 193 |
additional_inputs_accordion = gr.Accordion(
|
| 194 |
label="⚙️ Mode & API Key (Free tier works!)", open=False
|
| 195 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 196 |
# 1. Unwrapped ChatInterface (Fixes Accordion Bug)
|
| 197 |
demo = gr.ChatInterface(
|
| 198 |
fn=research_agent,
|
|
@@ -210,6 +236,7 @@ def create_demo() -> tuple[gr.ChatInterface, gr.Accordion]:
|
|
| 210 |
[
|
| 211 |
"What drugs improve female libido post-menopause?",
|
| 212 |
"simple",
|
|
|
|
| 213 |
],
|
| 214 |
[
|
| 215 |
"Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?",
|
|
@@ -234,9 +261,13 @@ def create_demo() -> tuple[gr.ChatInterface, gr.Accordion]:
|
|
| 234 |
type="password",
|
| 235 |
info="Leave empty for free tier. Auto-detects provider from key prefix.",
|
| 236 |
),
|
|
|
|
| 237 |
],
|
| 238 |
)
|
| 239 |
|
|
|
|
|
|
|
|
|
|
| 240 |
return demo, additional_inputs_accordion
|
| 241 |
|
| 242 |
|
|
|
|
| 6 |
|
| 7 |
import gradio as gr
|
| 8 |
from pydantic_ai.models.anthropic import AnthropicModel
|
| 9 |
+
from pydantic_ai.models.openai import OpenAIChatModel
|
| 10 |
from pydantic_ai.providers.anthropic import AnthropicProvider
|
| 11 |
from pydantic_ai.providers.openai import OpenAIProvider
|
| 12 |
|
|
|
|
| 61 |
# 2. Paid API Key (User provided or Env)
|
| 62 |
elif user_api_key and user_api_key.strip():
|
| 63 |
# Auto-detect provider from key prefix
|
| 64 |
+
model: AnthropicModel | OpenAIChatModel
|
| 65 |
if user_api_key.startswith("sk-ant-"):
|
| 66 |
# Anthropic key
|
| 67 |
anthropic_provider = AnthropicProvider(api_key=user_api_key)
|
|
|
|
| 70 |
elif user_api_key.startswith("sk-"):
|
| 71 |
# OpenAI key
|
| 72 |
openai_provider = OpenAIProvider(api_key=user_api_key)
|
| 73 |
+
model = OpenAIChatModel(settings.openai_model, provider=openai_provider)
|
| 74 |
backend_info = "Paid API (OpenAI)"
|
| 75 |
else:
|
| 76 |
raise ConfigurationError(
|
|
|
|
| 108 |
history: list[dict[str, Any]],
|
| 109 |
mode: str = "simple",
|
| 110 |
api_key: str = "",
|
| 111 |
+
api_key_state: str = "",
|
| 112 |
) -> AsyncGenerator[str, None]:
|
| 113 |
"""
|
| 114 |
Gradio chat function that runs the research agent.
|
|
|
|
| 118 |
history: Chat history (Gradio format)
|
| 119 |
mode: Orchestrator mode ("simple" or "advanced")
|
| 120 |
api_key: Optional user-provided API key (BYOK - auto-detects provider)
|
| 121 |
+
api_key_state: Persistent API key state (survives example clicks)
|
| 122 |
|
| 123 |
Yields:
|
| 124 |
Markdown-formatted responses for streaming
|
|
|
|
| 127 |
yield "Please enter a research question."
|
| 128 |
return
|
| 129 |
|
| 130 |
+
# BUG FIX: Prefer freshly-entered key, then persisted state
|
| 131 |
+
user_api_key = (api_key.strip() or api_key_state.strip()) or None
|
| 132 |
|
| 133 |
# Check available keys
|
| 134 |
has_openai = bool(os.getenv("OPENAI_API_KEY"))
|
|
|
|
| 157 |
|
| 158 |
# Run the agent and stream events
|
| 159 |
response_parts: list[str] = []
|
| 160 |
+
streaming_buffer = "" # Buffer for accumulating streaming tokens
|
| 161 |
|
| 162 |
try:
|
| 163 |
# use_mock=False - let configure_orchestrator decide based on available keys
|
|
|
|
| 171 |
yield f"🧠 **Backend**: {backend_name}\n\n"
|
| 172 |
|
| 173 |
async for event in orchestrator.run(message):
|
| 174 |
+
# BUG FIX: Handle streaming events separately to avoid token-by-token spam
|
| 175 |
+
if event.type == "streaming":
|
| 176 |
+
# Accumulate streaming tokens without emitting individual events
|
| 177 |
+
streaming_buffer += event.message
|
| 178 |
+
# Yield the current buffer combined with previous parts to show progress
|
| 179 |
+
# But DO NOT append to response_parts list yet (to avoid O(N^2) list growth)
|
| 180 |
+
current_parts = [*response_parts, f"📡 **STREAMING**: {streaming_buffer}"]
|
| 181 |
+
yield "\n\n".join(current_parts)
|
| 182 |
+
continue
|
| 183 |
+
|
| 184 |
+
# For non-streaming events, flush any buffered streaming content first
|
| 185 |
+
if streaming_buffer:
|
| 186 |
+
response_parts.append(f"📡 **STREAMING**: {streaming_buffer}")
|
| 187 |
+
streaming_buffer = "" # Reset buffer
|
| 188 |
+
|
| 189 |
+
# Handle complete events specially
|
| 190 |
if event.type == "complete":
|
| 191 |
yield event.message
|
| 192 |
else:
|
| 193 |
+
# Format and append non-streaming events
|
| 194 |
+
event_md = event.to_markdown()
|
| 195 |
+
response_parts.append(event_md)
|
| 196 |
# Show progress
|
| 197 |
yield "\n\n".join(response_parts)
|
| 198 |
|
| 199 |
+
# Flush any remaining streaming content at the end
|
| 200 |
+
if streaming_buffer:
|
| 201 |
+
response_parts.append(f"📡 **STREAMING**: {streaming_buffer}")
|
| 202 |
+
yield "\n\n".join(response_parts)
|
| 203 |
+
|
| 204 |
except Exception as e:
|
| 205 |
yield f"❌ **Error**: {e!s}"
|
| 206 |
|
|
|
|
| 215 |
additional_inputs_accordion = gr.Accordion(
|
| 216 |
label="⚙️ Mode & API Key (Free tier works!)", open=False
|
| 217 |
)
|
| 218 |
+
|
| 219 |
+
# BUG FIX: Add gr.State for API key persistence across example clicks
|
| 220 |
+
api_key_state = gr.State("")
|
| 221 |
+
|
| 222 |
# 1. Unwrapped ChatInterface (Fixes Accordion Bug)
|
| 223 |
demo = gr.ChatInterface(
|
| 224 |
fn=research_agent,
|
|
|
|
| 236 |
[
|
| 237 |
"What drugs improve female libido post-menopause?",
|
| 238 |
"simple",
|
| 239 |
+
# Removed empty strings for api_key and api_key_state to prevent overwriting
|
| 240 |
],
|
| 241 |
[
|
| 242 |
"Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?",
|
|
|
|
| 261 |
type="password",
|
| 262 |
info="Leave empty for free tier. Auto-detects provider from key prefix.",
|
| 263 |
),
|
| 264 |
+
api_key_state, # Hidden state component for persistence
|
| 265 |
],
|
| 266 |
)
|
| 267 |
|
| 268 |
+
# API key persists because examples only include [message, mode] columns,
|
| 269 |
+
# so Gradio doesn't overwrite the api_key textbox when examples are clicked.
|
| 270 |
+
|
| 271 |
return demo, additional_inputs_accordion
|
| 272 |
|
| 273 |
|
src/utils/llm_factory.py
CHANGED
|
@@ -56,7 +56,7 @@ def get_pydantic_ai_model() -> Any:
|
|
| 56 |
Configured pydantic-ai model
|
| 57 |
"""
|
| 58 |
from pydantic_ai.models.anthropic import AnthropicModel
|
| 59 |
-
from pydantic_ai.models.openai import
|
| 60 |
from pydantic_ai.providers.anthropic import AnthropicProvider
|
| 61 |
from pydantic_ai.providers.openai import OpenAIProvider
|
| 62 |
|
|
@@ -64,7 +64,7 @@ def get_pydantic_ai_model() -> Any:
|
|
| 64 |
if not settings.openai_api_key:
|
| 65 |
raise ConfigurationError("OPENAI_API_KEY not set for pydantic-ai")
|
| 66 |
provider = OpenAIProvider(api_key=settings.openai_api_key)
|
| 67 |
-
return
|
| 68 |
|
| 69 |
if settings.llm_provider == "anthropic":
|
| 70 |
if not settings.anthropic_api_key:
|
|
|
|
| 56 |
Configured pydantic-ai model
|
| 57 |
"""
|
| 58 |
from pydantic_ai.models.anthropic import AnthropicModel
|
| 59 |
+
from pydantic_ai.models.openai import OpenAIChatModel
|
| 60 |
from pydantic_ai.providers.anthropic import AnthropicProvider
|
| 61 |
from pydantic_ai.providers.openai import OpenAIProvider
|
| 62 |
|
|
|
|
| 64 |
if not settings.openai_api_key:
|
| 65 |
raise ConfigurationError("OPENAI_API_KEY not set for pydantic-ai")
|
| 66 |
provider = OpenAIProvider(api_key=settings.openai_api_key)
|
| 67 |
+
return OpenAIChatModel(settings.openai_model, provider=provider)
|
| 68 |
|
| 69 |
if settings.llm_provider == "anthropic":
|
| 70 |
if not settings.anthropic_api_key:
|
tests/unit/test_streaming_fix.py
ADDED
|
@@ -0,0 +1,118 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Test that streaming event handling is fixed (no token-by-token spam)."""
|
| 2 |
+
|
| 3 |
+
from unittest.mock import MagicMock
|
| 4 |
+
|
| 5 |
+
import pytest
|
| 6 |
+
|
| 7 |
+
from src.utils.models import AgentEvent
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
@pytest.mark.unit
|
| 11 |
+
@pytest.mark.asyncio
|
| 12 |
+
async def test_streaming_events_are_buffered_not_spammed():
|
| 13 |
+
"""
|
| 14 |
+
Verify that streaming events are buffered, not yielded individually.
|
| 15 |
+
|
| 16 |
+
This test validates the fix for Bug 1: Token-by-Token Streaming Spam.
|
| 17 |
+
Before the fix, each token would create a separate yield, resulting in O(N²) spam.
|
| 18 |
+
After the fix, streaming tokens are buffered and only yielded once.
|
| 19 |
+
"""
|
| 20 |
+
# Import here to avoid circular dependencies
|
| 21 |
+
from src.app import research_agent
|
| 22 |
+
|
| 23 |
+
# Mock orchestrator
|
| 24 |
+
mock_orchestrator = MagicMock()
|
| 25 |
+
|
| 26 |
+
# Simulate streaming events (like LLM token-by-token output)
|
| 27 |
+
streaming_events = [
|
| 28 |
+
AgentEvent(type="started", message="Starting research", iteration=0),
|
| 29 |
+
AgentEvent(type="streaming", message="This", iteration=1),
|
| 30 |
+
AgentEvent(type="streaming", message=" is", iteration=1),
|
| 31 |
+
AgentEvent(type="streaming", message=" a", iteration=1),
|
| 32 |
+
AgentEvent(type="streaming", message=" test", iteration=1),
|
| 33 |
+
AgentEvent(type="complete", message="Final answer: This is a test", iteration=1),
|
| 34 |
+
]
|
| 35 |
+
|
| 36 |
+
# Create async generator that yields events
|
| 37 |
+
async def mock_run(query):
|
| 38 |
+
for event in streaming_events:
|
| 39 |
+
yield event
|
| 40 |
+
|
| 41 |
+
mock_orchestrator.run = mock_run
|
| 42 |
+
|
| 43 |
+
# Mock configure_orchestrator to return our mock
|
| 44 |
+
import src.app as app_module
|
| 45 |
+
|
| 46 |
+
original_configure = app_module.configure_orchestrator
|
| 47 |
+
app_module.configure_orchestrator = MagicMock(return_value=(mock_orchestrator, "Test Backend"))
|
| 48 |
+
|
| 49 |
+
try:
|
| 50 |
+
# Run the research agent
|
| 51 |
+
results = []
|
| 52 |
+
async for result in research_agent("test query", [], mode="simple", api_key=""):
|
| 53 |
+
results.append(result)
|
| 54 |
+
|
| 55 |
+
# Verify that we DO see streaming updates (for UX responsiveness)
|
| 56 |
+
# But we don't want O(N^2) growth of the persisted list.
|
| 57 |
+
|
| 58 |
+
# We expect results to contain the streaming updates
|
| 59 |
+
assert len(results) > 0, "Should have yielded results"
|
| 60 |
+
|
| 61 |
+
# Check that we see the accumulated message
|
| 62 |
+
assert any(
|
| 63 |
+
"📡 **STREAMING**: This is a test" in r for r in results
|
| 64 |
+
), "Buffer didn't accumulate correctly"
|
| 65 |
+
|
| 66 |
+
# The critical check for the "Spam" bug:
|
| 67 |
+
# In the spam bug, the output grew like:
|
| 68 |
+
# "Stream: T"
|
| 69 |
+
# "Stream: T\nStream: h"
|
| 70 |
+
# "Stream: T\nStream: h\nStream: i"
|
| 71 |
+
#
|
| 72 |
+
# In the fixed version, it should look like:
|
| 73 |
+
# "Stream: T"
|
| 74 |
+
# "Stream: Th"
|
| 75 |
+
# "Stream: Thi"
|
| 76 |
+
# (Replacing the last line, not adding new lines)
|
| 77 |
+
|
| 78 |
+
for res in results:
|
| 79 |
+
# Count occurrences of "📡 **STREAMING**:": in a single result string
|
| 80 |
+
# It should appear AT MOST once
|
| 81 |
+
# (unless we have multiple distinct streaming blocks)
|
| 82 |
+
streaming_markers = res.count("📡 **STREAMING**:")
|
| 83 |
+
assert streaming_markers <= 1, (
|
| 84 |
+
f"Found multiple streaming markers in single response: {res}\n"
|
| 85 |
+
"This indicates we are appending new lines instead of updating in place."
|
| 86 |
+
)
|
| 87 |
+
|
| 88 |
+
# The final result should be the complete message
|
| 89 |
+
assert any("Final answer" in r for r in results), "Missing final complete message"
|
| 90 |
+
|
| 91 |
+
finally:
|
| 92 |
+
# Restore original function
|
| 93 |
+
app_module.configure_orchestrator = original_configure
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
@pytest.mark.unit
|
| 97 |
+
@pytest.mark.asyncio
|
| 98 |
+
async def test_api_key_state_parameter_exists():
|
| 99 |
+
"""
|
| 100 |
+
Verify that api_key_state parameter was added to research_agent.
|
| 101 |
+
|
| 102 |
+
This validates the fix for Bug 2: API Key Persistence.
|
| 103 |
+
"""
|
| 104 |
+
import inspect
|
| 105 |
+
|
| 106 |
+
from src.app import research_agent
|
| 107 |
+
|
| 108 |
+
# Get function signature
|
| 109 |
+
sig = inspect.signature(research_agent)
|
| 110 |
+
params = list(sig.parameters.keys())
|
| 111 |
+
|
| 112 |
+
# Verify api_key_state parameter exists
|
| 113 |
+
assert "api_key_state" in params, "api_key_state parameter missing from research_agent"
|
| 114 |
+
|
| 115 |
+
# Verify it's after api_key
|
| 116 |
+
api_key_idx = params.index("api_key")
|
| 117 |
+
api_key_state_idx = params.index("api_key_state")
|
| 118 |
+
assert api_key_state_idx > api_key_idx, "api_key_state should come after api_key"
|