VibecoderMcSwaggins committed on
Commit 153a9c0 · unverified · 2 parents: 6b5e05b dbe535c

Merge pull request #62 from The-Obstacle-Is-The-Way/dev


fix: resolve all P0-P3 bugs (termination, streaming, thinking state)

docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -1,39 +1,54 @@
  # Active Bugs

- > Last updated: 2025-11-28

- ## P0 - Critical

- ### Magentic Mode Report Generation
- **File**: [FIX_PLAN_MAGENTIC_MODE.md](./FIX_PLAN_MAGENTIC_MODE.md)

- **Symptom**: Magentic mode returns `ChatMessage` object instead of synthesized report text.

- **Root Cause**:
- - `event.message.text` extraction fails in orchestrator
- - `max_rounds=3` too low for SearchAgent + JudgeAgent + ReportAgent sequence

- **Workaround**: Use Simple mode (default) - works correctly with all LLM providers.

- **Status**: Fix plan documented, not yet implemented.

- ---

- ## P1 - Minor UX

- ### Gradio Settings Accordion Won't Collapse
- **File**: [P1_GRADIO_SETTINGS_CLEANUP.md](./P1_GRADIO_SETTINGS_CLEANUP.md)

- **Symptom**: Settings accordion stays open after user interaction.

- **Root Cause**: Nested `gr.Blocks` context prevents accordion state management.

- **Impact**: UX only - all functionality works correctly.

- **Status**: Solution documented, not yet implemented.

  ---

- ## Resolved Bugs

- *None currently - bugs above are still open.*
  # Active Bugs

+ > Last updated: 2025-11-29

+ ## P3 - Edge Case

+ *(None)*

+ ---

+ ## Resolved Bugs

+ ### ~~P3 - Magentic Mode Missing Termination Guarantee~~ FIXED
+ **Commit**: `d36ce3c` (2025-11-29)

+ - Added `final_event_received` tracking in `orchestrator_magentic.py`
+ - Added fallback yield for "max iterations reached" scenario
+ - Verified with `test_magentic_termination.py`
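The termination guarantee can be sketched as a generator that tracks whether a final event arrived and emits a fallback one otherwise. Event shapes and names below are illustrative assumptions, not the real `agent-framework` API:

```python
def drain_events(events, max_iterations=10):
    """Forward workflow events, guaranteeing exactly one terminal event."""
    final_event_received = False
    for i, event in enumerate(events):
        if i >= max_iterations:
            break  # hard stop: the workflow ran too long
        if event.get("type") == "final":
            final_event_received = True
        yield event
    if not final_event_received:
        # Fallback yield: the UI always gets a terminal event,
        # even when the workflow hits the iteration cap silently.
        yield {"type": "final", "message": "Max iterations reached without a final result."}
```

The invariant being tested in `test_magentic_termination.py` is presumably the same: the last event is always terminal.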
+ ### ~~P0 - Magentic Mode Report Generation~~ FIXED
+ **Commit**: `9006d69` (2025-11-29)

+ - Fixed `_extract_text()` to handle various message object formats
+ - Increased `max_rounds=10` (was 3)
+ - Added `temperature=1.0` for reasoning model compatibility
+ - Advanced mode now produces full research reports

+ ### ~~P1 - Streaming Spam + API Key Persistence~~ FIXED
+ **Commit**: `0c9be4a` (2025-11-29)

+ - Streaming events now buffered (not token-by-token spam)
+ - API key persists across example clicks via `gr.State`
+ - Examples use explicit `None` values to avoid overwriting keys
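Buffering token events as described above might look like the following sketch; the flush threshold and the token-iterable shape are assumptions for illustration:

```python
def buffer_stream(tokens, flush_every=8):
    """Coalesce token-level stream events into chunks so the chat UI
    renders a few updates instead of one line per token."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) >= flush_every:
            yield "".join(buf)
            buf = []
    if buf:  # flush the remainder
        yield "".join(buf)
```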
+ ### ~~P2 - Missing "Thinking" State~~ FIXED
+ **Commit**: `9006d69` (2025-11-29)

+ - Added `"thinking"` event type with hourglass icon
+ - Yields "Multi-agent reasoning in progress..." before blocking workflow call
+ - Users now see feedback during 2-5 minute initial processing
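A minimal sketch of the "thinking" event pattern: yield a status event before awaiting the long-running call. The event dicts and the `run_workflow` callable are illustrative, not the real orchestrator API:

```python
import asyncio

async def research_with_feedback(run_workflow):
    """Yield a 'thinking' status before the blocking multi-agent call,
    then the final report, so users see immediate feedback."""
    yield {"type": "thinking", "message": "⏳ Multi-agent reasoning in progress..."}
    report = await run_workflow()  # may take minutes in practice
    yield {"type": "final", "message": report}
```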
+ ### ~~P1 - Gradio Settings Accordion~~ WONTFIX
+
+ Decision: Removed nested Blocks, using ChatInterface directly.
+ Accordion behavior is default Gradio - acceptable for demo.

  ---

+ ## How to Report Bugs

+ 1. Create `docs/bugs/P{N}_{SHORT_NAME}.md`
+ 2. Include: Symptom, Root Cause, Fix Plan, Test Plan
+ 3. Update this index
+ 4. Priority: P0=blocker, P1=important, P2=UX, P3=edge case
docs/bugs/FIX_PLAN_CRITICAL_BUGS.md DELETED
@@ -1,36 +0,0 @@
- # Fix Plan: Critical Bugs (P0)
-
- **Date**: 2025-11-28
- **Status**: COMPLETED (2025-11-29)
- **Based on**: `docs/bugs/SENIOR_AUDIT_RESULTS.md`
-
- ---
-
- ## Summary of Fixes
-
- ### 1. Fixed Data Leak (Bug 4 & 2)
- - **Action**: Removed singleton `_embedding_service` in `src/services/embeddings.py`.
- - **Action**: Updated `EmbeddingService.__init__` to use a unique collection name (`evidence_{uuid}`) for complete isolation per instance.
- - **Action**: Refactored `SentenceTransformer` loading to a shared global to maintain performance while isolating state.
- - **Verified**: Unit tests passed, including new isolation verification.
-
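The isolation pattern described above, reduced to its essentials. The model stand-in and names are simplified assumptions; the real code loads a `SentenceTransformer` and creates a ChromaDB collection:

```python
import uuid

_MODEL_CACHE: dict = {}  # heavyweight model, loaded once and shared

def _load_model(name: str):
    """Lazily cache the embedding model; loading it per request is too slow."""
    if name not in _MODEL_CACHE:
        _MODEL_CACHE[name] = object()  # stand-in for SentenceTransformer(name)
    return _MODEL_CACHE[name]

class EmbeddingService:
    """Per-request service: shared read-only model, isolated collection."""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        self.model = _load_model(model_name)
        # Unique collection name prevents cross-session data leaks
        self.collection_name = f"evidence_{uuid.uuid4().hex}"
```

Two instances share one model object but never share a collection, which is exactly what the isolation test verifies.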
- ### 2. Fixed Advanced Mode BYOK (Bug 3)
- - **Action**: Updated `create_orchestrator` in `src/orchestrator_factory.py` to accept `api_key`.
- - **Action**: Updated `MagenticOrchestrator` to accept and use the `api_key` for the manager and agents.
- - **Action**: Updated `src/app.py` to pass the user's API key during orchestrator configuration.
- - **Verified**: `test_dual_mode_e2e.py` passed.
-
- ### 3. Fixed Free Tier Experience (Bug 1)
- - **Action**: Updated `HFInferenceJudgeHandler` in `src/agent_factory/judges.py` to catch 402 (Payment Required) errors.
- - **Action**: Added logic to return a "synthesize" assessment with a clear error message when quota is exhausted, stopping the infinite loop.
- - **Verified**: Unit tests passed.
-
- ---
-
- ## Verification
-
- All changes have been verified with:
- - `make check` (lint, typecheck, test) - ALL PASSED
- - Custom reproduction script for isolation - PASSED
-
- The system is now stable for the hackathon demo.
docs/bugs/FIX_PLAN_MAGENTIC_MODE.md DELETED
@@ -1,227 +0,0 @@
- # Fix Plan: Magentic Mode Report Generation
-
- **Related Bug**: `P0_MAGENTIC_MODE_BROKEN.md`
- **Approach**: Test-Driven Development (TDD)
- **Estimated Scope**: 4 tasks, ~2-3 hours
-
- ---
-
- ## Problem Summary
-
- Magentic mode runs but fails to produce readable reports due to:
-
- 1. **Primary Bug**: `MagenticFinalResultEvent.message` returns `ChatMessage` object, not text
- 2. **Secondary Bug**: Max rounds (3) reached before ReportAgent completes
- 3. **Tertiary Issues**: Stale "bioRxiv" references in prompts
-
- ---
-
- ## Fix Order (TDD)
-
- ### Phase 1: Write Failing Tests
-
- **Task 1.1**: Create test for ChatMessage text extraction
-
- ```python
- # tests/unit/test_orchestrator_magentic.py
-
- def test_process_event_extracts_text_from_chat_message():
-     """Final result event should extract text from ChatMessage object."""
-     # Arrange: Mock ChatMessage with .content attribute
-     # Act: Call _process_event with MagenticFinalResultEvent
-     # Assert: Returned AgentEvent.message is a string, not object repr
- ```
-
- **Task 1.2**: Create test for max rounds configuration
36
-
37
- ```python
38
- def test_orchestrator_uses_configured_max_rounds():
39
- """MagenticOrchestrator should use max_rounds from constructor."""
40
- # Arrange: Create orchestrator with max_rounds=10
41
- # Act: Build workflow
42
- # Assert: Workflow has max_round_count=10
43
- ```
44
-
45
- **Task 1.3**: Create test for bioRxiv reference removal
46
-
47
- ```python
48
- def test_task_prompt_references_europe_pmc():
49
- """Task prompt should reference Europe PMC, not bioRxiv."""
50
- # Arrange: Create orchestrator
51
- # Act: Check task string in run()
52
- # Assert: Contains "Europe PMC", not "bioRxiv"
53
- ```
54
-
55
- ---
56
-
57
- ### Phase 2: Fix ChatMessage Text Extraction
58
-
59
- **File**: `src/orchestrator_magentic.py`
60
- **Lines**: 192-199
61
-
62
- **Current Code**:
63
- ```python
64
- elif isinstance(event, MagenticFinalResultEvent):
65
- text = event.message.text if event.message else "No result"
66
- ```
67
-
68
- **Fixed Code**:
69
- ```python
70
- elif isinstance(event, MagenticFinalResultEvent):
71
- if event.message:
72
- # ChatMessage may have .content or .text depending on version
73
- if hasattr(event.message, 'content') and event.message.content:
74
- text = str(event.message.content)
75
- elif hasattr(event.message, 'text') and event.message.text:
76
- text = str(event.message.text)
77
- else:
78
- # Fallback: convert entire message to string
79
- text = str(event.message)
80
- else:
81
- text = "No result generated"
82
- ```
83
-
84
- **Why**: The `agent_framework.ChatMessage` object structure may vary. We need defensive extraction.
85
-
86
- ---
87
-
88
- ### Phase 3: Fix Max Rounds Configuration
89
-
90
- **File**: `src/orchestrator_magentic.py`
91
- **Lines**: 97-99
92
-
93
- **Current Code**:
94
- ```python
95
- .with_standard_manager(
96
- chat_client=manager_client,
97
- max_round_count=self._max_rounds, # Already uses config
98
- max_stall_count=3,
99
- max_reset_count=2,
100
- )
101
- ```
102
-
103
- **Issue**: Default `max_rounds` in `__init__` is 10, but workflow may need more for complex queries.
104
-
105
- **Fix**: Verify the value flows through correctly. Add logging.
106
-
107
- ```python
108
- logger.info(
109
- "Building Magentic workflow",
110
- max_rounds=self._max_rounds,
111
- max_stall=3,
112
- max_reset=2,
113
- )
114
- ```
115
-
116
- **Also check**: `src/orchestrator_factory.py` passes config correctly:
117
- ```python
118
- return MagenticOrchestrator(
119
- max_rounds=config.max_iterations if config else 10,
120
- )
121
- ```
122
-
123
- ---
124
-
125
- ### Phase 4: Fix Stale bioRxiv References
126
-
127
- **Files to update**:
128
-
129
- | File | Line | Change |
130
- |------|------|--------|
131
- | `src/orchestrator_magentic.py` | 131 | "bioRxiv" → "Europe PMC" |
132
- | `src/agents/magentic_agents.py` | 32-33 | "bioRxiv" → "Europe PMC" |
133
- | `src/app.py` | 202-203 | "bioRxiv" → "Europe PMC" |
134
-
135
- **Search command to verify**:
136
- ```bash
137
- grep -rn "bioRxiv\|biorxiv" src/
138
- ```
139
-
140
- ---
141
-
142
- ## Implementation Checklist
143
-
144
- ```
145
- [ ] Phase 1: Write failing tests
146
- [ ] 1.1 Test ChatMessage text extraction
147
- [ ] 1.2 Test max rounds configuration
148
- [ ] 1.3 Test Europe PMC references
149
-
150
- [ ] Phase 2: Fix ChatMessage extraction
151
- [ ] Update _process_event() in orchestrator_magentic.py
152
- [ ] Run test 1.1 - should pass
153
-
154
- [ ] Phase 3: Fix max rounds
155
- [ ] Add logging to _build_workflow()
156
- [ ] Verify factory passes config correctly
157
- [ ] Run test 1.2 - should pass
158
-
159
- [ ] Phase 4: Fix bioRxiv references
160
- [ ] Update orchestrator_magentic.py task prompt
161
- [ ] Update magentic_agents.py descriptions
162
- [ ] Update app.py UI text
163
- [ ] Run test 1.3 - should pass
164
- [ ] Run grep to verify no remaining refs
165
-
166
- [ ] Final Verification
167
- [ ] make check passes
168
- [ ] All tests pass (108+)
169
- [ ] Manual test: run_magentic.py produces readable report
170
- ```
171
-
172
- ---
173
-
174
- ## Test Commands
175
-
176
- ```bash
177
- # Run specific test file
178
- uv run pytest tests/unit/test_orchestrator_magentic.py -v
179
-
180
- # Run all tests
181
- uv run pytest tests/unit/ -v
182
-
183
- # Full check
184
- make check
185
-
186
- # Manual integration test
187
- set -a && source .env && set +a
188
- uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
189
- ```
190
-
191
- ---
192
-
193
- ## Success Criteria
194
-
195
- 1. `run_magentic.py` outputs a readable research report (not `<ChatMessage object>`)
196
- 2. Report includes: Executive Summary, Key Findings, Drug Candidates, References
197
- 3. No "Max round count reached" error with default settings
198
- 4. No "bioRxiv" references anywhere in codebase
199
- 5. All 108+ tests pass
200
- 6. `make check` passes
201
-
202
- ---
203
-
204
- ## Files Modified
205
-
206
- ```
207
- src/
208
- ├── orchestrator_magentic.py # ChatMessage fix, logging
209
- ├── agents/magentic_agents.py # bioRxiv → Europe PMC
210
- └── app.py # bioRxiv → Europe PMC
211
-
212
- tests/unit/
213
- └── test_orchestrator_magentic.py # NEW: 3 tests
214
- ```
215
-
216
- ---
217
-
218
- ## Notes for AI Agent
219
-
220
- When implementing this fix plan:
221
-
222
- 1. **DO NOT** create mock data or fake responses
223
- 2. **DO** write real tests that verify actual behavior
224
- 3. **DO** run `make check` after each phase
225
- 4. **DO** test with real OpenAI API key via `.env`
226
- 5. **DO** preserve existing functionality - simple mode must still work
227
- 6. **DO NOT** over-engineer - minimal changes to fix the specific bugs
docs/bugs/FIX_UI_SIMPLIFICATION.md DELETED
@@ -1,314 +0,0 @@
- # UI Simplification: Remove API Provider Dropdown
-
- **Issues**: #52, #53
- **Priority**: P1 - UX improvement for hackathon demo
- **Estimated Time**: 30 minutes
- **Senior Review**: ✅ Approved with changes (incorporated below)
-
- ---
-
- ## Problem
-
- The current UI has confusing BYOK (Bring Your Own Key) settings:
-
- 1. **Provider dropdown is misleading** - Shows "openai" but actually uses free tier when no key
- 2. **Examples table shows useless columns** - API Key (empty), Provider (ignored)
- 3. **Anthropic doesn't work with Advanced mode** - Only OpenAI has `agent-framework` support
-
- ## Solution
-
- Remove the `api_provider` dropdown entirely. Auto-detect the provider from the key prefix.
-
- **Functionality preserved:**
- - Simple mode: Free tier, OpenAI, OR Anthropic (all work)
- - Advanced mode: OpenAI only (Magentic multi-agent requires `OpenAIChatClient`)
-
- ---
-
- ## Implementation
-
- ### File: `src/app.py`
-
- #### Change 1: Update `configure_orchestrator()` signature (lines 23-28)
-
- ```python
- # BEFORE
- def configure_orchestrator(
-     use_mock: bool = False,
-     mode: str = "simple",
-     user_api_key: str | None = None,
-     api_provider: str = "openai",  # ← REMOVE
- ) -> tuple[Any, str]:
-
- # AFTER
- def configure_orchestrator(
-     use_mock: bool = False,
-     mode: str = "simple",
-     user_api_key: str | None = None,
- ) -> tuple[Any, str]:
- ```
-
- #### Change 2: Update docstring (lines 29-40)
-
- ```python
- # AFTER
- """
- Create an orchestrator instance.
-
- Args:
-     use_mock: If True, use MockJudgeHandler (no API key needed)
-     mode: Orchestrator mode ("simple" or "advanced")
-     user_api_key: Optional user-provided API key (BYOK) - auto-detects provider
-
- Returns:
-     Tuple of (Orchestrator instance, backend_name)
- """
- ```
-
- #### Change 3: Replace provider logic with auto-detection (lines 62-88)
-
- ```python
- # BEFORE (lines 62-88) - complex provider checking with api_provider param
-
- # AFTER - auto-detect from key prefix
- # 2. Paid API Key (User provided or Env)
- elif user_api_key and user_api_key.strip():
-     # Auto-detect provider from key prefix
-     model: AnthropicModel | OpenAIModel
-     if user_api_key.startswith("sk-ant-"):
-         # Anthropic key
-         anthropic_provider = AnthropicProvider(api_key=user_api_key)
-         model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
-         backend_info = "Paid API (Anthropic)"
-     elif user_api_key.startswith("sk-"):
-         # OpenAI key
-         openai_provider = OpenAIProvider(api_key=user_api_key)
-         model = OpenAIModel(settings.openai_model, provider=openai_provider)
-         backend_info = "Paid API (OpenAI)"
-     else:
-         raise ValueError(
-             "Invalid API key format. Expected sk-... (OpenAI) or sk-ant-... (Anthropic)"
-         )
-     judge_handler = JudgeHandler(model=model)
-
- # 3. Environment API Keys (fallback)
- elif os.getenv("OPENAI_API_KEY"):
-     judge_handler = JudgeHandler(model=None)  # Uses env key
-     backend_info = "Paid API (OpenAI from env)"
-
- elif os.getenv("ANTHROPIC_API_KEY"):
-     judge_handler = JudgeHandler(model=None)  # Uses env key
-     backend_info = "Paid API (Anthropic from env)"
-
- # 4. Free Tier (HuggingFace Inference)
- else:
-     judge_handler = HFInferenceJudgeHandler()
-     backend_info = "Free Tier (Llama 3.1 / Mistral)"
- ```
-
- #### Change 4: Update `research_agent()` signature (lines 105-111)
-
- ```python
- # BEFORE
- async def research_agent(
-     message: str,
-     history: list[dict[str, Any]],
-     mode: str = "simple",
-     api_key: str = "",
-     api_provider: str = "openai",  # ← REMOVE
- ) -> AsyncGenerator[str, None]:
-
- # AFTER
- async def research_agent(
-     message: str,
-     history: list[dict[str, Any]],
-     mode: str = "simple",
-     api_key: str = "",
- ) -> AsyncGenerator[str, None]:
- ```
-
- #### Change 5: Update docstring (lines 112-124)
-
- ```python
- # AFTER
- """
- Gradio chat function that runs the research agent.
-
- Args:
-     message: User's research question
-     history: Chat history (Gradio format)
-     mode: Orchestrator mode ("simple" or "advanced")
-     api_key: Optional user-provided API key (BYOK - auto-detects provider)
-
- Yields:
-     Markdown-formatted responses for streaming
- """
- ```
-
- #### Change 6: Fix Advanced mode check (line 139)
-
- ```python
- # BEFORE
- if mode == "advanced" and not (has_openai or (has_user_key and api_provider == "openai")):
-
- # AFTER - auto-detect OpenAI key from prefix
- is_openai_user_key = (
-     user_api_key
-     and user_api_key.startswith("sk-")
-     and not user_api_key.startswith("sk-ant-")
- )
- if mode == "advanced" and not (has_openai or is_openai_user_key):
-     yield (
-         "⚠️ **Advanced mode requires OpenAI API key.** "
-         "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
-     )
-     mode = "simple"
- ```
-
- #### Change 7: Remove premature "Using your key" message (lines 146-151)
-
- ```python
- # BEFORE - uses api_provider which no longer exists
- if has_user_key:
-     yield (
-         f"🔑 **Using your {api_provider.upper()} API key** - "
-         "Your key is used only for this session and is never stored.\n\n"
-     )
-
- # AFTER - remove this block entirely
- # The backend_name from configure_orchestrator already shows
- # "Paid API (OpenAI)" or "Paid API (Anthropic)" - no need for duplicate messaging
- ```
-
- #### Change 8: Update configure_orchestrator call (lines 165-170)
-
- ```python
- # BEFORE
- orchestrator, backend_name = configure_orchestrator(
-     use_mock=False,
-     mode=mode,
-     user_api_key=user_api_key,
-     api_provider=api_provider,  # ← REMOVE
- )
-
- # AFTER
- orchestrator, backend_name = configure_orchestrator(
-     use_mock=False,
-     mode=mode,
-     user_api_key=user_api_key,
- )
- ```
-
- #### Change 9: Simplify examples (lines 210-229)
-
- ```python
- # BEFORE - 4 items per example
- examples=[
-     ["What drugs improve female libido post-menopause?", "simple", "", "openai"],
-     ["Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?", "simple", "", "openai"],
-     ["Evidence for testosterone therapy in women with HSDD?", "simple", "", "openai"],
- ],
-
- # AFTER - 2 items per example (query, mode) - API key always empty in examples
- examples=[
-     ["What drugs improve female libido post-menopause?", "simple"],
-     ["Clinical trials for ED alternatives to PDE5 inhibitors?", "simple"],
-     ["Evidence for testosterone therapy in women with HSDD?", "simple"],
- ],
- ```
-
- #### Change 10: Update additional_inputs (lines 231-252)
-
- ```python
- # BEFORE - 3 inputs (mode, api_key, api_provider)
- additional_inputs=[
-     gr.Radio(
-         choices=["simple", "advanced"],
-         value="simple",
-         label="Orchestrator Mode",
-         info="Simple: Linear (Free Tier Friendly) | Advanced: Multi-Agent (Requires OpenAI)",
-     ),
-     gr.Textbox(
-         label="🔑 API Key (Optional - BYOK)",
-         placeholder="sk-... or sk-ant-...",
-         type="password",
-         info="Enter your own API key. Never stored.",
-     ),
-     gr.Radio(  # ← REMOVE THIS ENTIRE BLOCK
-         choices=["openai", "anthropic"],
-         value="openai",
-         label="API Provider",
-         info="Select the provider for your API key",
-     ),
- ],
-
- # AFTER - 2 inputs (mode, api_key)
- additional_inputs=[
-     gr.Radio(
-         choices=["simple", "advanced"],
-         value="simple",
-         label="Orchestrator Mode",
-         info="Simple: Works with any key or free tier | Advanced: Requires OpenAI key",
-     ),
-     gr.Textbox(
-         label="🔑 API Key (Optional)",
-         placeholder="sk-... (OpenAI) or sk-ant-... (Anthropic)",
-         type="password",
-         info="Leave empty for free tier. Auto-detects provider from key prefix.",
-     ),
- ],
- ```
-
- #### Change 11: Update accordion label (line 230)
-
- ```python
- # BEFORE
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
-
- # AFTER
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings (Free tier works without API key)", open=False),
- ```
-
- ---
-
- ## Testing Checklist
-
- ### Manual Tests
- - [ ] **No key**: Shows "Free Tier (Llama 3.1 / Mistral)" in backend
- - [ ] **OpenAI key (sk-...)**: Shows "Paid API (OpenAI)" in backend
- - [ ] **Anthropic key (sk-ant-...)**: Shows "Paid API (Anthropic)" in backend
- - [ ] **Invalid key format**: Shows error message
- - [ ] **Anthropic key + Advanced mode**: Falls back to Simple with warning
- - [ ] **OpenAI key + Advanced mode**: Uses full Magentic multi-agent
- - [ ] **Examples table**: Shows only 2 columns (query, mode)
- - [ ] **MCP server**: Still accessible at `/gradio_api/mcp/`
-
- ### Unit Test Updates
- - [ ] `tests/unit/test_app_smoke.py` - may need update if checking input count
-
- ---
-
- ## Definition of Done
-
- - [ ] `api_provider` parameter removed from `configure_orchestrator()`
- - [ ] `api_provider` parameter removed from `research_agent()`
- - [ ] Auto-detection logic works for `sk-` and `sk-ant-` prefixes
- - [ ] Advanced mode check uses auto-detection (not removed param)
- - [ ] "Using your X key" message removed (backend_name handles this)
- - [ ] Examples table shows 2 columns
- - [ ] Accordion label updated
- - [ ] Placeholder text shows both key formats
- - [ ] All existing tests pass
- - [ ] MCP server still works
-
- ---
-
- ## Mode Compatibility Matrix (Unchanged)
-
- | Mode | No Key | OpenAI Key | Anthropic Key |
- |------|--------|------------|---------------|
- | **Simple** | ✅ Free tier | ✅ GPT-5.1 | ✅ Claude Sonnet 4.5 |
- | **Advanced** | ⚠️ Falls back | ✅ Full Magentic | ⚠️ Falls back to Simple |
-
- ---
-
- ## Related
- - Issue #52: UI Polish - Examples table confusion
- - Issue #53: API Provider Simplification
- - Senior Review: Approved 2025-11-28
docs/bugs/INVESTIGATION_INVALID_MODELS.md DELETED
@@ -1,31 +0,0 @@
- # Bug Investigation: Invalid Default LLM Models
-
- ## Status
- - **Date:** 2025-11-29
- - **Reporter:** CLI User
- - **Component:** `src/utils/config.py`
- - **Priority:** High (Magentic Mode Blocker)
- - **Resolution:** FIXED
-
- ## Issue Description
- The user encountered a 403 error when running in Magentic mode:
- `Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5', ... 'code': 'model_not_found'}}`
-
- ## Root Cause Analysis
- OpenAI deprecated the base `gpt-5` model. Tier 5 accounts now have access to:
- - `gpt-5.1` (current flagship)
- - `gpt-5-mini`
- - `gpt-5-nano`
- - `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`
- - `o3`, `o4-mini`
-
- The base `gpt-5` is NO LONGER available via API.
-
- ## Solution Implemented
- Updated `src/utils/config.py` to use:
- - `openai_model`: `gpt-5.1` (the actual current model)
- - `anthropic_model`: `claude-sonnet-4-5-20250929` (unchanged)
-
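A minimal sketch of the resulting defaults. The real `src/utils/config.py` may well use pydantic settings; this frozen dataclass only illustrates the pinned model names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMSettings:
    # Base `gpt-5` now returns 403 model_not_found; pin the current names
    openai_model: str = "gpt-5.1"
    anthropic_model: str = "claude-sonnet-4-5-20250929"
```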
- ## Verification
- - `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
- - User confirmed Tier 5 access to `gpt-5.1` via OpenAI dashboard.
docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md DELETED
@@ -1,49 +0,0 @@
- # Bug Investigation: HF Free Tier Quota Exhaustion
-
- ## Status
- - **Date:** 2025-11-29
- - **Reporter:** CLI User
- - **Component:** `HFInferenceJudgeHandler`
- - **Priority:** High (UX Blocker for Free Tier)
- - **Resolution:** FIXED
-
- ## Issue Description
- On a fresh run with a simple query ("What drugs improve female libido post-menopause?"), the system retrieved 20 valid sources but failed during the Judge/Analysis phase with:
- `⚠️ Free Tier Quota Exceeded ⚠️`
-
- This results in a "Synthesis" step that has 0 candidates and 0 findings, rendering the application useless for free users once the (very low) limit is hit, despite having valid search results.
-
- ## Evidence
- Output provided:
- ```text
- ### Citations (20 sources)
- ...
- ### Reasoning
- ⚠️ **Free Tier Quota Exceeded** ⚠️
- ```
-
- ## Root Cause Analysis
- 1. **Search Success:** `SearchAgent` correctly found 20 documents (PubMed/EuropePMC).
- 2. **Judge Failure:** `HFInferenceJudgeHandler` called the HF Inference API.
- 3. **Quota Trap:** The API returned a 402 (Payment Required) or Quota error.
- 4. **Previous Handling:** The handler caught this error and returned a `JudgeAssessment` with `sufficient=True` (to stop the loop) and *empty* fields.
- 5. **Data Loss:** The 20 valid search results were effectively discarded from the "Analysis" perspective.
-
- ## The "Deep Blocker"
- The system had a "hard failure" mode for quota exhaustion, assuming that if the LLM can't judge, we have *no* useful information. This "bricked" the UX for free users immediately upon hitting the limit.
-
- ## Solution Implemented
- Modified `HFInferenceJudgeHandler._create_quota_exhausted_assessment` to:
- 1. Accept the `evidence` list as an argument.
- 2. Perform basic heuristic extraction (borrowed from `MockJudgeHandler` logic):
-    - Use titles as "Key Findings" (first 5 sources).
-    - Add a clear message in "Drug Candidates" telling the user to upgrade.
- 3. Return this "Partial" assessment instead of an empty one.
-
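The heuristic fallback can be sketched as follows; the dict shapes are simplified assumptions, since the real code returns a `JudgeAssessment` model:

```python
def quota_exhausted_assessment(evidence: list[dict]) -> dict:
    """Salvage search results when the judge LLM returns 402,
    instead of discarding them in an empty assessment."""
    return {
        "sufficient": True,  # stops the retry loop
        "key_findings": [doc["title"] for doc in evidence[:5]],
        "drug_candidates": ["Upgrade to a paid API key for full analysis"],
        "reasoning": "⚠️ Free Tier Quota Exceeded ⚠️ Showing partial results from search.",
    }
```

Setting `sufficient=True` is the key design choice: it converts a hard failure into a graceful stop while still surfacing the retrieved evidence.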
43
- ## Verification
44
- - Created `tests/unit/agent_factory/test_judges_hf_quota.py` to verify that:
45
- - 402 errors are caught.
46
- - `sufficient` is set to `True` (stops loop).
47
- - `key_findings` are populated from search result titles.
48
- - `reasoning` contains the warning message.
49
- - Ran existing tests `tests/unit/agent_factory/test_judges_hf.py` - All passed.
docs/bugs/P0_CRITICAL_BUGS.md DELETED
@@ -1,43 +0,0 @@
- # P0 Critical Bugs - DeepBoner Demo Broken
-
- **Date**: 2025-11-28
- **Status**: RESOLVED (2025-11-29)
- **Priority**: P0 - Blocking hackathon submission
-
- ---
-
- ## Summary
-
- The Gradio demo was non-functional due to 4 critical bugs. All have been fixed and verified.
-
- ---
-
- ## Bug 1: Free Tier LLM Quota Exhausted (P0) - FIXED
-
- **Resolution**:
- - Implemented `QuotaExhaustedError` detection in `HFInferenceJudgeHandler`.
- - The agent now gracefully stops and displays a clear "Free Tier Quota Exceeded" message instead of looping infinitely.
-
- ## Bug 2: Evidence Counter Shows 0 After Dedup (P1) - FIXED
-
- **Resolution**:
- - Fixed by resolving Bug 4 (Data Leak). Deduplication now works correctly on isolated per-request collections.
-
- ## Bug 3: API Key Not Passed to Advanced Mode (P0) - FIXED
-
- **Resolution**:
- - Plumbed `api_key` from the UI through `configure_orchestrator` -> `create_orchestrator` -> `MagenticOrchestrator`.
- - Magentic agents now correctly use the user-provided OpenAI key.
-
- ## Bug 4: Singleton EmbeddingService Causes Cross-Session Pollution (P0) - FIXED
-
- **Resolution**:
- - Removed the singleton pattern for `EmbeddingService`.
- - Each request now gets a fresh `EmbeddingService` with a unique, isolated ChromaDB collection (`evidence_{uuid}`).
- - The `SentenceTransformer` model is lazily cached globally to maintain performance.
-
- ---
-
- ## Verification
-
- Run `make check` to verify all tests pass.
docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md DELETED
@@ -1,81 +0,0 @@
- # P1 Bug: Gradio Settings Accordion Not Collapsing
-
- **Priority**: P1 (UX Bug)
- **Status**: OPEN
- **Date**: 2025-11-27
- **Target Component**: `src/app.py`
-
- ---
-
- ## 1. Problem Description
-
- The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history.
-
- ### Symptoms
- - Accordion arrow toggles visually, but content remains visible.
- - Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces.
-
- ---
-
- ## 2. Root Cause Analysis
-
- **Definitive Cause**: Nested `Blocks` Context Bug.
- `gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`.
-
- **Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block.
-
- ---
-
- ## 3. Solution Strategy: "The Unwrap Fix"
-
- We will remove the redundant `gr.Blocks` wrapper. This restores the native behavior of `ChatInterface`, ensuring the accordion respects `open=False`.
-
- ### Implementation Plan
-
- **Refactor `src/app.py` / `create_demo()`**:
-
- 1. **Remove** the `with gr.Blocks() as demo:` context manager.
- 2. **Instantiate** `gr.ChatInterface` directly as the `demo` object.
- 3. **Migrate UI Elements**:
-    * **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`.
-    * **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat.
-
- ### Before (Buggy)
- ```python
- def create_demo():
-     with gr.Blocks() as demo:  # <--- CAUSE OF BUG
-         gr.Markdown("# Title")
-         gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False))
-         gr.Markdown("Footer")
-     return demo
- ```
-
- ### After (Correct)
- ```python
- def create_demo():
-     return gr.ChatInterface(  # <--- FIX: Top-level component
-         ...,
-         title="🧬 DeepBoner",
-         description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...",
-         additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
-     )
- ```
-
- ---
-
- ## 4. Validation
-
- 1. **Run**: `uv run python src/app.py`
- 2. **Check**: Open `http://localhost:7860`
- 3. **Verify**:
-    * Settings accordion starts **COLLAPSED**.
-    * Header title ("DeepBoner") is visible.
-    * Footer text ("MCP Server Active") is visible in the description area.
-    * Chat functionality works (Magentic/Simple modes).
-
- ---
-
- ## 5. Constraints & Notes
-
- - **Layout**: We lose the ability to place arbitrary elements *below* the chat box (the footer moves to the top, under the title), but this is an acceptable trade-off for a working UI.
- - **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style.
docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md DELETED
@@ -1,181 +0,0 @@
1
- # Bug Report: Magentic Mode Integration Issues
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Reporter:** CLI User
6
- - **Priority:** P1 (UX Degradation + Deprecation Warnings)
7
- - **Component:** `src/app.py`, `src/orchestrator_magentic.py`, `src/utils/llm_factory.py`
8
- - **Status:** ✅ FIXED (Bug 1 & Bug 2) - 2025-11-29
9
- - **Tests:** 138 passing (136 original + 2 new validation tests)
10
-
11
- ---
12
-
13
- ## Bug 1: Token-by-Token Streaming Spam ✅ FIXED
14
-
15
- ### Symptoms
16
- When running Magentic (Advanced) mode, the UI shows hundreds of individual lines like:
17
- ```text
18
- 📡 STREAMING: Below
19
- 📡 STREAMING: is
20
- 📡 STREAMING: a
21
- 📡 STREAMING: curated
22
- 📡 STREAMING: list
23
- ...
24
- ```
25
-
26
- Each token is displayed as a separate streaming event, creating visual spam and making it impossible to read the output until completion.
27
-
28
- ### Root Cause (VALIDATED)
29
- **File:** `src/orchestrator_magentic.py:247-254`
30
-
31
- ```python
32
- elif isinstance(event, MagenticAgentDeltaEvent):
33
- if event.text:
34
- return AgentEvent(
35
- type="streaming",
36
- message=event.text, # Single token!
37
- data={"agent_id": event.agent_id},
38
- iteration=iteration,
39
- )
40
- ```
41
-
42
- Every LLM token emits a `MagenticAgentDeltaEvent`, which creates an `AgentEvent(type="streaming")`.
43
-
44
- **File:** `src/app.py:171-192` (BEFORE FIX)
45
-
46
- ```python
47
- async for event in orchestrator.run(message):
48
- event_md = event.to_markdown()
49
- response_parts.append(event_md) # Appends EVERY token
50
-
51
- if event.type == "complete":
52
- yield event.message
53
- else:
54
- yield "\n\n".join(response_parts) # Yields ALL accumulated tokens
55
- ```
56
-
57
- For N tokens, this yields N times, each time re-rendering every previous token. That is O(N²) string work, and it creates massive visual spam.
58
-
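A toy count of emitted lines makes the asymmetry concrete (the helper names below are hypothetical, not the app's code):

```python
# Toy model of the two yield strategies (hypothetical names, not src/app.py).
def naive_lines_emitted(n_tokens: int) -> int:
    """Old behavior: every yield re-emits all accumulated token lines."""
    parts: list[str] = []
    emitted = 0
    for i in range(n_tokens):
        parts.append(f"tok{i}")
        emitted += len(parts)  # the UI receives one line per accumulated token
    return emitted


def buffered_lines_emitted(n_tokens: int) -> int:
    """Fixed behavior: every yield replaces the UI with ONE buffer line."""
    buffer = ""
    emitted = 0
    for i in range(n_tokens):
        buffer += f"tok{i} "
        emitted += 1  # a single streaming line per update
    return emitted
```

For a 100-token stream the naive path emits 5,050 lines in total, versus 100 single-line replacements for the buffered path.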
59
- ### Fix Applied
60
- **File:** `src/app.py:175-204`
61
-
62
- Implemented streaming token buffering with live updates:
63
- 1. Added `streaming_buffer = ""` to accumulate tokens
64
- 2. For each streaming event: append to buffer, yield immediately (for live typing UX)
65
- 3. **Key fix**: Don't append streaming events to `response_parts` (prevents O(N²) list growth)
66
- 4. Each yield has only ONE `📡 STREAMING:` line (the accumulated buffer)
67
- 5. Flush buffer to `response_parts` only when non-streaming event occurs
68
-
69
- **Result**: Live typing feel preserved, but no visual spam (each update replaces, not accumulates)
70
-
71
- ### Proposed Fix Options
72
-
73
- **Option A: Buffer streaming tokens (recommended)**
74
- ```python
75
- # In app.py - accumulate streaming tokens, yield periodically
- import time
-
- response_parts = []
76
- streaming_buffer = ""
77
- last_yield_time = time.time()
78
-
79
- async for event in orchestrator.run(message):
80
- if event.type == "streaming":
81
- streaming_buffer += event.message
82
- # Only yield every 500ms or on newline
83
- if time.time() - last_yield_time > 0.5 or "\n" in event.message:
84
- yield f"📡 {streaming_buffer}"
85
- last_yield_time = time.time()
86
- elif event.type == "complete":
87
- yield event.message
88
- else:
89
- # Non-streaming events
90
- response_parts.append(event.to_markdown())
91
- yield "\n\n".join(response_parts)
92
- ```
93
-
94
- **Option B: Don't yield streaming events at all**
95
- ```python
96
- # In app.py - only yield meaningful events
97
- async for event in orchestrator.run(message):
98
- if event.type == "streaming":
99
- continue # Skip token-by-token spam
100
- # ... rest of logic
101
- ```
102
-
103
- **Option C: Fix at orchestrator level**
104
- Don't emit `AgentEvent` for every delta - buffer in `_process_event`.
105
-
106
- ---
107
-
108
- ## Bug 2: API Key Does Not Persist in Textbox ✅ FIXED
109
-
110
- ### Symptoms
111
- 1. User opens the "Mode & API Key" accordion
112
- 2. User pastes their API key into the password textbox
113
- 3. User clicks an example OR clicks elsewhere
114
- 4. The API key textbox is now empty - value lost
115
-
116
- ### Root Cause (VALIDATED)
117
- **File:** `src/app.py:255-267` (BEFORE FIX)
118
-
119
- ```python
120
- additional_inputs_accordion=additional_inputs_accordion,
121
- additional_inputs=[
122
- gr.Radio(...),
123
- gr.Textbox(
124
- label="🔑 API Key (Optional)",
125
- type="password",
126
- # No `value` parameter - defaults to empty
127
- # No state persistence mechanism
128
- ),
129
- ],
130
- ```
131
-
132
- Gradio's `ChatInterface` with `additional_inputs` has known issues:
133
- 1. Clicking examples resets additional inputs to defaults
134
- 2. The accordion state and input values may not persist correctly
135
- 3. No explicit state management for the API key
136
-
137
- ### Fix Applied
138
- **Files Modified:**
139
- 1. `src/app.py`
140
- 2. `src/utils/llm_factory.py`
141
-
142
- **Bug 1 (Streaming Spam):**
143
- - Accumulate tokens in `streaming_buffer`
144
- - Yield updates immediately for live typing UX
145
- - **Key**: Don't append to `response_parts` until stream segment complete
146
- - Each yield has ONE `📡 STREAMING:` line (not N accumulated lines)
147
-
148
- **Bug 2 (API Key Persistence):**
149
- - **Strategy:** Partial example list (relies on Gradio behavior)
150
- - Examples have only 2 elements `[message, mode]` instead of 4
151
- - Gradio only updates inputs with corresponding example values
152
- - Remaining inputs (api_key textbox) are left unchanged
153
- - `api_key_state` parameter exists as fallback but may be redundant
154
- - **Note:** This is a workaround relying on undocumented Gradio behavior
155
-
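The relied-upon behavior can be modeled in a few lines (`apply_example` is a hypothetical stand-in for Gradio's internal example handling, not its real API):

```python
# Toy model of the example-click behavior the workaround relies on:
# only inputs with a corresponding example value are overwritten.
def apply_example(current_values: list, example: list) -> list:
    updated = list(current_values)
    for i, value in enumerate(example):
        updated[i] = value  # trailing inputs (no example value) are untouched
    return updated


# Inputs: [message, mode, api_key]; the example supplies only [message, mode].
state = ["", "simple", "sk-user-secret"]
state = apply_example(state, ["What drugs treat HSDD?", "advanced"])
```

With a 2-element example, the third input (the API key textbox) keeps its user-entered value.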
156
- **Bug 3 (OpenAIModel Deprecation):** ✅ FIXED
157
- - Replaced all `OpenAIModel` imports with `OpenAIChatModel` in `src/app.py` and `src/utils/llm_factory.py`.
158
-
159
- ### Test Results
160
- ```bash
161
- uv run pytest tests/ -q
162
- ============================= 138 passed in 20.60s =============================
163
- ```
164
-
165
- **Status:** ✅ All tests passing
166
-
167
- ### Why This Fix Works
168
-
169
- **Bug 1 (Streaming Spam):**
170
- - **Before:** Every token → `append()` to list → `yield` → List grew to size N → O(N²) complexity.
171
- - **After:** Every token → `yield` dynamically constructed string (buffer + history) → List stays size K (number of *events*).
172
- - **Impact:** Smooth streaming, no visual spam, no browser freeze.
173
-
174
- **Bug 2 (API Key):**
175
- - **Before:** Example click → Overwrote API Key textbox with `""`.
176
- - **After:** Example click → Updates only `message` and `mode` → API Key textbox untouched.
177
- - **Impact:** User input persists naturally.
178
-
179
- ### Remaining Work
180
- - **Bug 4 (Asyncio GC errors):** Monitoring only - likely Gradio/HF Spaces issue
181
-
docs/bugs/P3_MAGENTIC_NO_TERMINATION_EVENT.md ADDED
@@ -0,0 +1,177 @@
1
+ # P3 Bug Report: Advanced Mode Missing Termination Guarantee
2
+
3
+ ## Status
4
+ - **Date:** 2025-11-29
5
+ - **Priority:** P3 (Edge case, but confusing UX)
6
+ - **Component:** `src/orchestrator_magentic.py`
7
+ - **Resolution:** Fixed (Guarantee termination event)
8
+
9
+ ---
10
+
11
+ ## Symptoms
12
+
13
+ In **Advanced (Magentic) mode** with OpenAI API key:
14
+
15
+ 1. Workflow runs for many iterations (up to 10 rounds)
16
+ 2. Agents search, judge, hypothesize repeatedly
17
+ 3. Eventually... **nothing happens**
18
+ - No "complete" event
19
+ - No error message
20
+ - UI just stops updating
21
+
22
+ **User perception:** "Did it finish? Did it crash? What happened?"
23
+
24
+ ### Observed Behavior
25
+
26
+ When workflow hits `max_round_count=10`:
27
+ - `workflow.run_stream(task)` iterator ends
28
+ - NO `MagenticFinalResultEvent` is emitted by agent-framework
29
+ - Our code yields nothing after the loop
30
+ - User is left hanging
31
+
32
+ ---
33
+
34
+ ## Root Cause Analysis
35
+
36
+ ### Code Path (`src/orchestrator_magentic.py:170-186`)
37
+
38
+ ```python
39
+ iteration = 0
40
+ try:
41
+ async for event in workflow.run_stream(task):
42
+ agent_event = self._process_event(event, iteration)
43
+ if agent_event:
44
+ if isinstance(event, MagenticAgentMessageEvent):
45
+ iteration += 1
46
+ yield agent_event
47
+ # BUG: NO FALLBACK HERE!
48
+ # If loop ends without FinalResultEvent, user sees nothing
49
+
50
+ except Exception as e:
51
+ logger.error("Magentic workflow failed", error=str(e))
52
+ yield AgentEvent(
53
+ type="error",
54
+ message=f"Workflow error: {e!s}",
55
+ iteration=iteration,
56
+ )
57
+ # BUG: NO FINALLY BLOCK TO GUARANTEE TERMINATION EVENT
58
+ ```
59
+
60
+ ### Workflow Configuration (`src/orchestrator_magentic.py:110-116`)
61
+
62
+ ```python
63
+ .with_standard_manager(
64
+ chat_client=manager_client,
65
+ max_round_count=self._max_rounds, # 10 - can hit this limit
66
+ max_stall_count=3, # If agents repeat 3x
67
+ max_reset_count=2, # Workflow reset limit
68
+ )
69
+ ```
70
+
71
+ ### Failure Modes
72
+
73
+ | Scenario | What Happens | User Sees |
74
+ |----------|--------------|-----------|
75
+ | `MagenticFinalResultEvent` emitted | `_process_event` yields "complete" | Final report |
76
+ | Max rounds (10) reached, no final event | Loop ends silently | **Nothing** |
77
+ | `max_stall_count` triggered | Workflow ends | **Nothing** |
78
+ | `max_reset_count` triggered | Workflow ends | **Nothing** |
79
+ | OpenAI API error | Exception caught | Error message |
80
+
81
+ ---
82
+
83
+ ## The Fix
84
+
85
+ Add guaranteed termination event after the loop:
86
+
87
+ ```python
88
+ iteration = 0
89
+ final_event_received = False
90
+
91
+ try:
92
+ async for event in workflow.run_stream(task):
93
+ agent_event = self._process_event(event, iteration)
94
+ if agent_event:
95
+ if isinstance(event, MagenticAgentMessageEvent):
96
+ iteration += 1
97
+ if agent_event.type == "complete":
98
+ final_event_received = True
99
+ yield agent_event
100
+
101
+ except Exception as e:
102
+ logger.error("Magentic workflow failed", error=str(e))
103
+ yield AgentEvent(
104
+ type="error",
105
+ message=f"Workflow error: {e!s}",
106
+ iteration=iteration,
107
+ )
108
+ final_event_received = True # Error is a form of termination
109
+
110
+ finally:
111
+ # GUARANTEE: Always emit termination event
112
+ if not final_event_received:
113
+ logger.warning(
114
+ "Workflow ended without final event",
115
+ iterations=iteration,
116
+ )
117
+ yield AgentEvent(
118
+ type="complete",
119
+ message=(
120
+ f"Research completed after {iteration} agent rounds. "
121
+ "Max iterations reached - results may be partial. "
122
+ "Try a more specific query for better results."
123
+ ),
124
+ data={"iterations": iteration, "reason": "max_rounds_reached"},
125
+ iteration=iteration,
126
+ )
127
+ ```
128
+
129
+ ---
130
+
131
+ ## Alternative: Increase Max Rounds
132
+
133
+ The default `max_rounds=10` might be too low for complex queries.
134
+
135
+ In `src/orchestrator_factory.py:52-53`:
136
+ ```python
137
+ return orchestrator_cls(
138
+ max_rounds=config.max_iterations if config else 10, # Could increase to 15-20
139
+ api_key=api_key,
140
+ )
141
+ ```
142
+
143
+ **Trade-off:** More rounds = more API cost, but better chance of complete results.
144
+
145
+ ---
146
+
147
+ ## Test Plan
148
+
149
+ - [ ] Add fallback yield after async for loop
150
+ - [ ] Add `final_event_received` flag tracking
151
+ - [ ] Log warning when fallback is used
152
+ - [ ] Test with `max_rounds=2` to force hitting limit
153
+ - [ ] Verify user always sees termination event
154
+ - [ ] `make check` passes
155
+
156
+ ---
157
+
158
+ ## Related Files
159
+
160
+ - `src/orchestrator_magentic.py` - Main fix location
161
+ - `src/orchestrator_factory.py` - Max rounds configuration
162
+ - `src/utils/models.py` - AgentEvent types
163
+ - `docs/bugs/P2_MAGENTIC_THINKING_STATE.md` - Related UX issue (implemented)
164
+
165
+ ---
166
+
167
+ ## Priority Justification
168
+
169
+ **P3** because:
170
+ - Advanced mode is working for most queries
171
+ - Only hits edge case when max rounds reached without synthesis
172
+ - User CAN retry with different query
173
+ - Not blocking hackathon demo (free tier Simple mode works)
174
+
175
+ Would be P2 if:
176
+ - This happened frequently
177
+ - No workaround existed
docs/bugs/SENIOR_AGENT_AUDIT_PROMPT.md DELETED
@@ -1,247 +0,0 @@
1
- # Senior Agent Audit Request: DeepBoner Codebase Bug Hunt
2
-
3
- **Date**: 2025-11-28
4
- **Requesting Agent**: Claude (Opus)
5
- **Purpose**: Comprehensive bug audit and verification of P0_CRITICAL_BUGS.md
6
-
7
- ---
8
-
9
- ## Your Mission
10
-
11
- You are a senior software engineer performing a comprehensive audit of the DeepBoner codebase. Your goals:
12
-
13
- 1. **VERIFY** the 4 bugs documented in `docs/bugs/P0_CRITICAL_BUGS.md` are accurately described
14
- 2. **FIND** any additional bugs (P0-P4) that could affect the demo
15
- 3. **TRACE** the complete code paths for Simple and Advanced modes
16
- 4. **IDENTIFY** any silent failures, race conditions, or edge cases
17
-
18
- ---
19
-
20
- ## Context: What DeepBoner Does
21
-
22
- DeepBoner is a Gradio-based biomedical research agent that:
23
- 1. Takes a research question from user
24
- 2. Searches PubMed, ClinicalTrials.gov, Europe PMC
25
- 3. Uses an LLM "judge" to evaluate if evidence is sufficient
26
- 4. Either loops for more evidence or synthesizes a final report
27
-
28
- **Two Modes**:
29
- - **Simple**: Linear orchestrator with search → judge → report loop
30
- - **Advanced**: Magentic multi-agent with SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent
31
-
32
- **Three Backend Options**:
33
- - Free tier: HuggingFace Inference API (Llama/Mistral)
34
- - OpenAI: User-provided or env var key
35
- - Anthropic: User-provided or env var key (Simple mode only)
36
-
37
- ---
38
-
39
- ## Files to Audit (Priority Order)
40
-
41
- ### Critical Path Files:
42
- 1. `src/app.py` - Gradio UI, entry point, key routing
43
- 2. `src/orchestrator.py` - Simple mode main loop
44
- 3. `src/orchestrator_factory.py` - Mode selection and orchestrator creation
45
- 4. `src/orchestrator_magentic.py` - Advanced mode implementation
46
- 5. `src/services/embeddings.py` - Deduplication singleton (KNOWN BUG)
47
- 6. `src/agent_factory/judges.py` - LLM judge handlers (HF, OpenAI, Anthropic)
48
-
49
- ### Supporting Files:
50
- 7. `src/tools/search_handler.py` - Parallel search orchestration
51
- 8. `src/tools/pubmed.py` - PubMed API integration
52
- 9. `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
53
- 10. `src/tools/europepmc.py` - Europe PMC API
54
- 11. `src/agents/magentic_agents.py` - Agent factories (KNOWN BUG: hardcoded env key)
55
- 12. `src/utils/config.py` - Settings and configuration
56
- 13. `src/utils/models.py` - Data models (Evidence, Citation, etc.)
57
-
58
- ---
59
-
60
- ## Known Bugs to Verify
61
-
62
- ### Bug 1: Free Tier LLM Quota Exhausted
63
- **Claim**: HuggingFace Inference returns 402, all 3 fallback models fail
64
- **Verify**:
65
- - Check `src/agent_factory/judges.py` class `HFInferenceJudgeHandler`
66
- - Trace the fallback chain: Llama → Mistral → Zephyr
67
- - Confirm what happens when ALL fail (does it return default "continue"?)
68
- - Check if the error message reaches the user or is swallowed
69
-
70
- ### Bug 2: Evidence Counter Shows 0 After Dedup
71
- **Claim**: `_deduplicate_and_rank()` can return empty list, losing all evidence
72
- **Verify**:
73
- - Check `src/orchestrator.py` lines 97-114 and 219
74
- - Trace what happens if `embeddings.deduplicate()` returns `[]`
75
- - Is there defensive handling? Does exception handler catch this?
76
- - Could this be a race condition in async code?
77
-
78
- ### Bug 3: API Key Not Passed to Advanced Mode
79
- **Claim**: User's API key from Gradio is never passed to MagenticOrchestrator
80
- **Verify**:
81
- - Trace: `app.py:research_agent()` → `configure_orchestrator()` → `orchestrator_factory.py`
82
- - Check if `user_api_key` is passed to `create_orchestrator()`
83
- - Check if `MagenticOrchestrator.__init__()` receives a key
84
- - Check `src/agents/magentic_agents.py` - do agents use `settings.openai_api_key`?
85
-
86
- ### Bug 4: Singleton EmbeddingService Cross-Session Pollution
87
- **Claim**: ChromaDB collection persists across requests, causing false duplicates
88
- **Verify**:
89
- - Check `src/services/embeddings.py` singleton pattern
90
- - Is `_embedding_service` ever reset?
91
- - What happens to ChromaDB collection between Gradio requests?
92
- - Could this cause "Found 20 new sources (0 total)"?
93
-
94
- ---
95
-
96
- ## Additional Bug Categories to Search For
97
-
98
- ### A. Error Handling Gaps
99
- - [ ] Silent `except: pass` blocks
100
- - [ ] Exceptions logged but not re-raised
101
- - [ ] Missing error messages to user
102
- - [ ] Swallowed API errors
103
-
104
- ### B. Async/Concurrency Issues
105
- - [ ] Race conditions in parallel searches
106
- - [ ] Shared mutable state across async calls
107
- - [ ] Missing `await` keywords
108
- - [ ] Event loop blocking (sync code in async context)
109
-
110
- ### C. API Integration Bugs
111
- - [ ] Missing rate limiting
112
- - [ ] Hardcoded timeouts that are too short
113
- - [ ] XML/JSON parsing failures not handled
114
- - [ ] Empty response handling
115
-
116
- ### D. State Management Issues
117
- - [ ] Global singletons that should be session-scoped
118
- - [ ] Gradio state not properly isolated between users
119
- - [ ] Memory leaks from accumulated data
120
-
121
- ### E. Configuration Bugs
122
- - [ ] Missing env var defaults
123
- - [ ] Type mismatches in settings
124
- - [ ] Hardcoded values that should be configurable
125
-
126
- ### F. UI/UX Bugs
127
- - [ ] Streaming not working properly
128
- - [ ] Progress messages misleading
129
- - [ ] Examples not matching actual functionality
130
- - [ ] Error messages not user-friendly
131
-
132
- ---
133
-
134
- ## Output Format
135
-
136
- Please produce a report with:
137
-
138
- ### 1. Verification of Known Bugs
139
- For each of the 4 bugs in P0_CRITICAL_BUGS.md:
140
- - **CONFIRMED** or **INCORRECT** or **PARTIALLY CORRECT**
141
- - Exact file:line references
142
- - Any corrections or additional details
143
-
144
- ### 2. New Bugs Found
145
- For each new bug:
146
- ```
147
- ## Bug N: [Title]
148
- **Priority**: P0/P1/P2/P3/P4
149
- **File**: path/to/file.py:line
150
- **Symptoms**: What the user sees
151
- **Root Cause**: Technical explanation
152
- **Code**:
153
- ```python
154
- # The buggy code
155
- ```
156
- **Fix**:
157
- ```python
158
- # The corrected code
159
- ```
160
- ```
161
-
162
- ### 3. Code Quality Concerns
163
- Any patterns that aren't bugs but could cause issues:
164
- - Technical debt
165
- - Missing tests for critical paths
166
- - Unclear error handling
167
-
168
- ### 4. Recommended Fix Order
169
- Prioritized list of what to fix first for a working demo.
170
-
171
- ---
172
-
173
- ## Commands to Help Your Investigation
174
-
175
- ```bash
176
- # Run the tests
177
- make check
178
-
179
- # Test search works
180
- uv run python -c "
181
- import asyncio
182
- from src.tools.pubmed import PubMedTool
183
- async def test():
184
- tool = PubMedTool()
185
- results = await tool.search('female libido', 5)
186
- print(f'Found {len(results)} results')
187
- asyncio.run(test())
188
- "
189
-
190
- # Test HF inference (will show 402 if quota exhausted)
191
- uv run python -c "
192
- from huggingface_hub import InferenceClient
193
- client = InferenceClient()
194
- try:
195
- resp = client.chat_completion(
196
- messages=[{'role': 'user', 'content': 'Hi'}],
197
- model='meta-llama/Llama-3.1-8B-Instruct',
198
- max_tokens=10
199
- )
200
- print(resp)
201
- except Exception as e:
202
- print(f'Error: {e}')
203
- "
204
-
205
- # Test full orchestrator (simple mode)
206
- uv run python -c "
207
- import asyncio
208
- from src.app import configure_orchestrator
209
- async def test():
210
- orch, backend = configure_orchestrator(use_mock=True, mode='simple')
211
- print(f'Backend: {backend}')
212
- async for event in orch.run('test query'):
213
- print(f'{event.type}: {event.message[:50] if event.message else \"\"}'[:60])
214
- asyncio.run(test())
215
- "
216
-
217
- # Check for hardcoded API keys (security)
218
- grep -r "sk-" src/ --include="*.py" | grep -v "sk-..." | grep -v "sk-ant-..."
219
-
220
- # Find all singletons
221
- grep -r "_.*: .* | None = None" src/ --include="*.py"
222
-
223
- # Find all except blocks
224
- grep -rn "except.*:" src/ --include="*.py" | head -50
225
- ```
226
-
227
- ---
228
-
229
- ## Important Notes
230
-
231
- 1. **DO NOT fix bugs** - just document them
232
- 2. **Be thorough** - check edge cases and error paths
233
- 3. **Be specific** - include file:line references
234
- 4. **Be skeptical** - verify claims in P0_CRITICAL_BUGS.md independently
235
- 5. **Think like a user** - what would break the demo experience?
236
-
237
- The hackathon deadline is approaching. We need a working demo. Your audit will determine what gets fixed first.
238
-
239
- ---
240
-
241
- ## Deliverable
242
-
243
- A comprehensive markdown report that:
244
- 1. Confirms or corrects the 4 known bugs
245
- 2. Lists any new bugs found (with priority)
246
- 3. Recommends the optimal fix order
247
- 4. Can be saved as `docs/bugs/SENIOR_AUDIT_RESULTS.md`
docs/bugs/SENIOR_AUDIT_RESULTS.md DELETED
@@ -1,84 +0,0 @@
1
- # Senior Agent Audit Results: DeepBoner Codebase
2
-
3
- **Date**: 2025-11-28
4
- **Auditor**: Claude (Senior Software Engineer)
5
- **Status**: COMPLETE
6
-
7
- ---
8
-
9
- ## Executive Summary
10
-
11
- The DeepBoner codebase has **4 critical defects** that render the demo non-functional for most users. The most severe is a **data leak** where the vector database persists across user sessions, causing search result corruption and potential privacy issues. Additionally, the "Advanced" mode ignores user-provided API keys, and the "Free Tier" mode fails silently when quotas are exhausted.
12
-
13
- **Recommendation**: Immediate remediation of P0 bugs is required before hackathon submission.
14
-
15
- ---
16
-
17
- ## 1. Verification of Known Bugs (P0_CRITICAL_BUGS.md)
18
-
19
- | Bug | Claim | Verification Status | Notes |
20
- | :--- | :--- | :--- | :--- |
21
- | **Bug 1** | Free Tier LLM Quota Exhausted | **CONFIRMED** | `HFInferenceJudgeHandler` catches errors but returns a fallback assessment with `recommendation="continue"`. This causes the orchestrator to loop uselessly until `max_iterations` is reached. The user sees no error message. |
22
- | **Bug 2** | Evidence Counter Shows 0 | **CONFIRMED** | Directly caused by Bug 4. Deduplication logic works correctly *in isolation*, but fails because the underlying ChromaDB collection is polluted with stale data from previous sessions. |
23
- | **Bug 3** | API Key Not Passed to Advanced | **CONFIRMED** | `create_orchestrator` in `orchestrator_factory.py` ignores the user's API key. `MagenticOrchestrator` and its agents fall back to `settings.openai_api_key` (env var), which is empty for BYOK users. |
24
- | **Bug 4** | Singleton EmbeddingService | **CONFIRMED** | `EmbeddingService` is a global singleton with an in-memory ChromaDB. The collection is never cleared. Data leaks between sessions, causing valid new results to be marked as duplicates of old results. |
25
-
26
- ---
27
-
28
- ## 2. New Bugs Found
29
-
30
- ### Bug 5: Search Error Swallowing (P2)
31
- **File**: `src/orchestrator.py` / `src/tools/search_handler.py`
32
- **Symptoms**: If all search tools fail (e.g., network issue, API limit), the UI shows "Found 0 sources" without explaining why.
33
- **Root Cause**: `SearchHandler` captures exceptions and returns them in an `errors` list, but `Orchestrator` only logs them to the console (`logger.warning`) and proceeds with empty evidence.
34
- **Fix**: Yield an `AgentEvent(type="error")` or include errors in the `search_complete` event message.
35
-
36
- ### Bug 6: Hardcoded Model Names (P3)
37
- **File**: `src/agent_factory/judges.py`
38
- **Symptoms**: Maintenance burden.
39
- **Root Cause**: Model names like `meta-llama/Llama-3.1-8B-Instruct` are hardcoded in the class `HFInferenceJudgeHandler` rather than pulled from `config.py`.
40
- **Fix**: Move to `Settings`.
41
-
42
- ---
43
-
44
- ## 3. Code Quality Concerns
45
-
46
- 1. **Singleton Abuse**: The `_embedding_service` global in `src/services/embeddings.py` is a major architectural flaw for a multi-user web app (even a demo). It should be scoped to the `Orchestrator` instance.
47
- 2. **Inconsistent Factory Signatures**: `create_orchestrator` does not accept `api_key`, forcing hacks or reliance on global env vars.
48
- 3. **Silent Failures**: The pervasive use of `try...except Exception` with only logging (no user feedback) makes debugging difficult for end-users.
49
-
50
- ---
51
-
52
- ## 4. Recommended Fix Order
53
-
54
- ### Step 1: Fix the Data Leak (Bug 4 & 2)
55
- **Why**: Prevents result corruption and cross-user data leakage.
56
- **Plan**:
57
- 1. Remove singleton pattern from `src/services/embeddings.py`.
58
- 2. Make `EmbeddingService` an instance variable of `Orchestrator`.
59
- 3. Initialize a fresh `EmbeddingService` (and ChromaDB collection) for each `run()`.
60
-
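A minimal sketch of the session-scoped shape (a plain `set` stands in for the real ChromaDB-backed service; the point is the lifetime, not the similarity logic):

```python
# Sketch only: set-based dedup stands in for the real EmbeddingService.
class EmbeddingService:
    def __init__(self) -> None:
        self._seen: set[str] = set()  # fresh "collection" per instance

    def deduplicate(self, items: list[str]) -> list[str]:
        fresh = []
        for item in items:
            if item not in self._seen:
                self._seen.add(item)
                fresh.append(item)
        return fresh


class Orchestrator:
    def run_search(self, results: list[str]) -> list[str]:
        # One service per run: a prior session can no longer mark
        # this session's results as duplicates.
        embeddings = EmbeddingService()
        return embeddings.deduplicate(results)
```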
61
- ### Step 2: Fix Advanced Mode BYOK (Bug 3)
62
- **Why**: Enables the core "Advanced" feature for judges/users.
63
- **Plan**:
64
- 1. Update `create_orchestrator` signature to accept `api_key`.
65
- 2. Update `MagenticOrchestrator` to accept `api_key`.
66
- 3. Update `configure_orchestrator` in `app.py` to pass the key.
67
- 4. Ensure `MagenticOrchestrator` constructs `OpenAIChatClient` with the user's key.
68
-
69
- ### Step 3: Fix Free Tier Experience (Bug 1)
70
- **Why**: Ensures a usable fallback for those without keys.
71
- **Plan**:
72
- 1. In `HFInferenceJudgeHandler`, detect 402/429 errors.
73
- 2. If caught, return a `JudgeAssessment` that triggers a "Complete" event with a clear error message, rather than "Continue".
74
- 3. Add `HF_TOKEN` to the deployment environment if possible.
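The terminal-error behavior can be sketched as follows (`JudgeAssessment` fields and the error type are illustrative; the real handler wraps the HF client's HTTP errors):

```python
from dataclasses import dataclass


@dataclass
class JudgeAssessment:
    recommendation: str  # e.g. "continue" or "complete"
    rationale: str


def assess_with_quota_guard(call_llm) -> JudgeAssessment:
    try:
        return call_llm()
    except RuntimeError as exc:  # stand-in for the HF client's HTTP error
        if "402" in str(exc) or "429" in str(exc):
            # Terminal: surface the problem instead of looping on "continue"
            return JudgeAssessment(
                recommendation="complete",
                rationale="Free-tier quota exhausted; add HF_TOKEN or bring your own API key.",
            )
        raise
```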
75
-
76
- ---
77
-
78
- ## Verification Plan
79
-
80
- After applying fixes, run:
81
- 1. **Unit Tests**: `make check`
82
- 2. **Manual Test (Simple)**: Run without key, verify 402 error is handled OR works if token added.
83
- 3. **Manual Test (Advanced)**: Run with OpenAI key, verify it proceeds past initialization.
84
- 4. **Manual Test (Dedup)**: Run same query twice. Second run should find same number of results (not 0).
src/app.py CHANGED
@@ -173,7 +173,11 @@ async def research_agent(
173
  user_api_key=user_api_key,
174
  )
175
 
176
- yield f"🧠 **Backend**: {backend_name}\n\n"
 
 
 
 
177
 
178
  # Immediate loading feedback so user knows something is happening
179
  yield (
 
173
  user_api_key=user_api_key,
174
  )
175
 
176
+ # Immediate backend info + loading feedback so user knows something is happening
177
+ yield (
178
+ f"🧠 **Backend**: {backend_name}\n\n"
179
+ "⏳ **Processing...** Searching PubMed, ClinicalTrials.gov, Europe PMC...\n"
180
+ )
181
 
182
  # Immediate loading feedback so user knows something is happening
183
  yield (
src/orchestrator_magentic.py CHANGED
@@ -168,14 +168,38 @@ The final output should be a structured research report."""
168
  )
169
 
170
  iteration = 0
 
 
171
  try:
172
  async for event in workflow.run_stream(task):
173
  agent_event = self._process_event(event, iteration)
174
  if agent_event:
175
  if isinstance(event, MagenticAgentMessageEvent):
176
  iteration += 1
 
 
 
 
177
  yield agent_event
178
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
179
  except Exception as e:
180
  logger.error("Magentic workflow failed", error=str(e))
181
  yield AgentEvent(
 
168
  )
169
 
170
  iteration = 0
171
+ final_event_received = False
172
+
173
  try:
174
  async for event in workflow.run_stream(task):
175
  agent_event = self._process_event(event, iteration)
176
  if agent_event:
177
  if isinstance(event, MagenticAgentMessageEvent):
178
  iteration += 1
179
+
180
+ if agent_event.type == "complete":
181
+ final_event_received = True
182
+
183
  yield agent_event
184
 
185
+ # GUARANTEE: Always emit termination event if stream ends without one
186
+ # (e.g., max rounds reached)
187
+ if not final_event_received:
188
+ logger.warning(
189
+ "Workflow ended without final event",
190
+ iterations=iteration,
191
+ )
192
+ yield AgentEvent(
193
+ type="complete",
194
+ message=(
195
+ f"Research completed after {iteration} agent rounds. "
196
+ "Max iterations reached - results may be partial. "
197
+ "Try a more specific query for better results."
198
+ ),
199
+ data={"iterations": iteration, "reason": "max_rounds_reached"},
200
+ iteration=iteration,
201
+ )
202
+
203
  except Exception as e:
204
  logger.error("Magentic workflow failed", error=str(e))
205
  yield AgentEvent(
tests/unit/test_magentic_termination.py ADDED
@@ -0,0 +1,111 @@
1
+ """Tests for Magentic Orchestrator termination guarantee."""
2
+
3
+ from unittest.mock import MagicMock, patch
4
+
5
+ import pytest
6
+ from agent_framework import MagenticAgentMessageEvent
7
+
8
+ from src.orchestrator_magentic import MagenticOrchestrator
9
+ from src.utils.models import AgentEvent
10
+
11
+ # Skip tests if agent_framework is not installed
12
+ pytest.importorskip("agent_framework")
13
+
14
+
15
+ class MockChatMessage:
16
+ def __init__(self, content):
17
+ self.content = content
18
+ self.role = "assistant"
19
+
20
+ @property
21
+ def text(self):
22
+ return self.content
23
+
24
+
25
+ @pytest.fixture
26
+ def mock_magentic_requirements():
27
+ """Mock requirements check."""
28
+ with patch("src.orchestrator_magentic.check_magentic_requirements"):
29
+ yield
30
+
31
+
32
+ @pytest.mark.asyncio
33
+ async def test_termination_event_emitted_on_stream_end(mock_magentic_requirements):
34
+ """
35
+ Verify that a termination event is emitted when the workflow stream ends
36
+ without a MagenticFinalResultEvent (e.g. max rounds reached).
37
+ """
38
+ orchestrator = MagenticOrchestrator(max_rounds=2)
39
+
40
+ # Use real event class
41
+ mock_message = MockChatMessage("Thinking...")
42
+ mock_agent_event = MagenticAgentMessageEvent(agent_id="SearchAgent", message=mock_message)
43
+
44
+ # Mock the workflow and its run_stream method
45
+ mock_workflow = MagicMock()
46
+
47
+ # Create an async generator for run_stream
48
+ async def mock_stream(task):
49
+ # Yield the real message event
50
+ yield mock_agent_event
51
+ # STOP HERE - No FinalResultEvent
52
+
53
+ mock_workflow.run_stream = mock_stream
54
+
55
+ # Mock _build_workflow to return our mock workflow
56
+ with patch.object(orchestrator, "_build_workflow", return_value=mock_workflow):
57
+ events = []
58
+ async for event in orchestrator.run("Research query"):
59
+ events.append(event)
60
+
61
+ for i, e in enumerate(events):
62
+ print(f"Event {i}: {e.type} - {e.message}")
63
+
64
+ assert len(events) >= 2
65
+ assert events[0].type == "started"
66
+
67
+ # Verify the message event was processed
68
+ # Depending on _process_event logic, MagenticAgentMessageEvent might map to different types
69
+ # We assume it maps to something valid or we just check presence.
70
+ assert any("Thinking..." in e.message for e in events)
71
+
72
+ # THE CRITICAL CHECK: Did we get the fallback termination event?
73
+ last_event = events[-1]
74
+ assert last_event.type == "complete"
75
+ assert "Max iterations reached" in last_event.message
76
+ assert last_event.data.get("reason") == "max_rounds_reached"
77
+
78
+
79
+ @pytest.mark.asyncio
80
+ async def test_no_double_termination_event(mock_magentic_requirements):
81
+ """
82
+ Verify that we DO NOT emit a fallback event if the workflow finished normally.
83
+ """
84
+ orchestrator = MagenticOrchestrator()
85
+
86
+ mock_workflow = MagicMock()
87
+
88
+ with patch.object(orchestrator, "_build_workflow", return_value=mock_workflow):
89
+ # Mock _process_event to simulate a natural completion event
90
+ with patch.object(orchestrator, "_process_event") as mock_process:
91
+ mock_process.side_effect = [
92
+ AgentEvent(type="thinking", message="Working...", iteration=1),
93
+ AgentEvent(type="complete", message="Done!", iteration=2),
94
+ ]
95
+
96
+ async def mock_stream_with_yields(task):
97
+ yield "raw_event_1"
98
+ yield "raw_event_2"
99
+
100
+ mock_workflow.run_stream = mock_stream_with_yields
101
+
102
+ events = []
103
+ async for event in orchestrator.run("Research query"):
104
+ events.append(event)
105
+
106
+ assert events[-1].message == "Done!"
107
+ assert events[-1].type == "complete"
108
+
109
+ # Verify we didn't get a SECOND "Max iterations reached" event
110
+ fallback_events = [e for e in events if "Max iterations reached" in e.message]
111
+ assert len(fallback_events) == 0
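Both tests above stub the streaming method by assigning a plain async generator function onto the mock workflow (`mock_workflow.run_stream = mock_stream`) rather than configuring `MagicMock` return values. A minimal, self-contained sketch of why that works (class and names here are illustrative, not the project's API):

```python
import asyncio


class Workflow:
    # Illustrative stand-in for the real workflow class.
    async def run_stream(self, task):
        raise NotImplementedError("real implementation would talk to agents")


async def fake_stream(task):
    # A plain async generator function: calling it returns an async
    # iterator, exactly what `async for` expects.
    yield f"event-for:{task}"
    # Stops early, simulating a truncated stream (max rounds reached).


wf = Workflow()
# Assigning to the instance attribute shadows the class method; since it is
# not looked up on the class, no `self` is bound and the plain function works.
wf.run_stream = fake_stream


async def _drain():
    return [e async for e in wf.run_stream("demo task")]


events = asyncio.run(_drain())
print(events)  # → ['event-for:demo task']
```

This keeps the test's fake stream readable as ordinary code; `MagicMock` alone cannot produce an object usable with `async for` without extra configuration, which is why the tests swap in a real async generator for `run_stream`.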