Commit · 7f11675
1 Parent(s): d36ce3c
docs: clean up resolved bug reports, update P3 commit hash
Delete 13 obsolete bug docs (all resolved):
- FIX_PLAN_*.md (superseded by implementations)
- INVESTIGATION_*.md (completed)
- P0_*, P1_*, P2_* (all fixed)
- SENIOR_AGENT_*.md (one-time prompts)
Update ACTIVE_BUGS.md:
- P3 commit hash: (Pending) → d36ce3c
- Remove broken link to deleted P1_GRADIO_SETTINGS_CLEANUP.md
Bug index now shows all bugs resolved. Zero active bugs.
- docs/bugs/ACTIVE_BUGS.md +1 -2
- docs/bugs/FIX_PLAN_CRITICAL_BUGS.md +0 -36
- docs/bugs/FIX_PLAN_MAGENTIC_MODE.md +0 -227
- docs/bugs/FIX_UI_SIMPLIFICATION.md +0 -314
- docs/bugs/INVESTIGATION_INVALID_MODELS.md +0 -31
- docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md +0 -49
- docs/bugs/P0_CRITICAL_BUGS.md +0 -43
- docs/bugs/P0_GRADIO_EXAMPLE_CACHING_CRASH.md +0 -134
- docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md +0 -81
- docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md +0 -181
- docs/bugs/P1_MULTIPLE_UX_BUGS.md +0 -49
- docs/bugs/P2_MAGENTIC_THINKING_STATE.md +0 -232
- docs/bugs/SENIOR_AGENT_AUDIT_PROMPT.md +0 -247
- docs/bugs/SENIOR_AUDIT_RESULTS.md +0 -84
docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -11,7 +11,7 @@
 ## Resolved Bugs
 
 ### ~~P3 - Magentic Mode Missing Termination Guarantee~~ FIXED
-**Commit**: `
+**Commit**: `d36ce3c` (2025-11-29)
 
 - Added `final_event_received` tracking in `orchestrator_magentic.py`
 - Added fallback yield for "max iterations reached" scenario
@@ -40,7 +40,6 @@
 - Users now see feedback during 2-5 minute initial processing
 
 ### ~~P1 - Gradio Settings Accordion~~ WONTFIX
-**File**: [P1_GRADIO_SETTINGS_CLEANUP.md](./P1_GRADIO_SETTINGS_CLEANUP.md)
 
 Decision: Removed nested Blocks, using ChatInterface directly.
 Accordion behavior is default Gradio - acceptable for demo.
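The termination guarantee recorded in ACTIVE_BUGS.md (`final_event_received` tracking plus a fallback yield when the iteration limit is hit) might look roughly like the sketch below. It is not the code from `orchestrator_magentic.py`: `AgentEvent`, `run_with_termination_guarantee`, and the `"complete"` event type are hypothetical names chosen for illustration.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass


@dataclass
class AgentEvent:
    """Hypothetical stand-in for the project's event type."""

    type: str
    message: str


async def run_with_termination_guarantee(
    events: AsyncIterator[AgentEvent],
    max_iterations: int,
) -> AsyncIterator[AgentEvent]:
    """Relay events and always finish with a "complete" event."""
    final_event_received = False
    iterations = 0
    async for event in events:
        iterations += 1
        yield event
        if event.type == "complete":
            final_event_received = True
            break
        if iterations >= max_iterations:
            break
    if not final_event_received:
        # Fallback yield: the consumer is never left waiting for a final event.
        yield AgentEvent("complete", "Max iterations reached before a final result.")
```

With this shape, the consumer sees a closing event whether the upstream stream finishes normally, stalls at the limit, or simply ends without a final result.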
docs/bugs/FIX_PLAN_CRITICAL_BUGS.md DELETED
@@ -1,36 +0,0 @@
-# Fix Plan: Critical Bugs (P0)
-
-**Date**: 2025-11-28
-**Status**: COMPLETED (2025-11-29)
-**Based on**: `docs/bugs/SENIOR_AUDIT_RESULTS.md`
-
----
-
-## Summary of Fixes
-
-### 1. Fixed Data Leak (Bug 4 & 2)
-- **Action**: Removed singleton `_embedding_service` in `src/services/embeddings.py`.
-- **Action**: Updated `EmbeddingService.__init__` to use a unique collection name (`evidence_{uuid}`) for complete isolation per instance.
-- **Action**: Refactored `SentenceTransformer` loading to a shared global to maintain performance while isolating state.
-- **Verified**: Unit tests passed, including new isolation verification.
-
-### 2. Fixed Advanced Mode BYOK (Bug 3)
-- **Action**: Updated `create_orchestrator` in `src/orchestrator_factory.py` to accept `api_key`.
-- **Action**: Updated `MagenticOrchestrator` to accept and use the `api_key` for the manager and agents.
-- **Action**: Updated `src/app.py` to pass the user's API key during orchestrator configuration.
-- **Verified**: `test_dual_mode_e2e.py` passed.
-
-### 3. Fixed Free Tier Experience (Bug 1)
-- **Action**: Updated `HFInferenceJudgeHandler` in `src/agent_factory/judges.py` to catch 402 (Payment Required) errors.
-- **Action**: Added logic to return a "synthesize" assessment with a clear error message when quota is exhausted, stopping the infinite loop.
-- **Verified**: Unit tests passed.
-
----
-
-## Verification
-
-All changes have been verified with:
-- `make check` (lint, typecheck, test) - ALL PASSED
-- Custom reproduction script for isolation - PASSED
-
-The system is now stable for the hackathon demo.
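The data-leak fix described in the deleted FIX_PLAN_CRITICAL_BUGS.md (per-instance collection names, with the expensive model kept as a shared global) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `_FakeModel` stands in for `SentenceTransformer`, a plain dict stands in for the vector store collection, and `get_shared_model`, `add`, and `count` are hypothetical names.

```python
from __future__ import annotations

import uuid
from functools import lru_cache


class _FakeModel:
    """Placeholder for an expensive-to-load model such as SentenceTransformer."""

    def encode(self, text: str) -> list[float]:
        return [float(len(text))]


@lru_cache(maxsize=1)
def get_shared_model() -> _FakeModel:
    # Loaded once per process and reused: the model itself holds no user data.
    return _FakeModel()


class EmbeddingService:
    """Each instance gets its own collection, so evidence never leaks across sessions."""

    def __init__(self) -> None:
        self.collection_name = f"evidence_{uuid.uuid4().hex}"
        self._model = get_shared_model()
        self._store: dict[str, list[float]] = {}  # stand-in for a real collection

    def add(self, doc_id: str, text: str) -> None:
        self._store[doc_id] = self._model.encode(text)

    def count(self) -> int:
        return len(self._store)
```

The design choice is to share only what is stateless (the loaded model) and to give everything that accumulates user data a unique name per instance.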
docs/bugs/FIX_PLAN_MAGENTIC_MODE.md DELETED
@@ -1,227 +0,0 @@
-# Fix Plan: Magentic Mode Report Generation
-
-**Related Bug**: `P0_MAGENTIC_MODE_BROKEN.md`
-**Approach**: Test-Driven Development (TDD)
-**Estimated Scope**: 4 tasks, ~2-3 hours
-
----
-
-## Problem Summary
-
-Magentic mode runs but fails to produce readable reports due to:
-
-1. **Primary Bug**: `MagenticFinalResultEvent.message` returns `ChatMessage` object, not text
-2. **Secondary Bug**: Max rounds (3) reached before ReportAgent completes
-3. **Tertiary Issues**: Stale "bioRxiv" references in prompts
-
----
-
-## Fix Order (TDD)
-
-### Phase 1: Write Failing Tests
-
-**Task 1.1**: Create test for ChatMessage text extraction
-
-```python
-# tests/unit/test_orchestrator_magentic.py
-
-def test_process_event_extracts_text_from_chat_message():
-    """Final result event should extract text from ChatMessage object."""
-    # Arrange: Mock ChatMessage with .content attribute
-    # Act: Call _process_event with MagenticFinalResultEvent
-    # Assert: Returned AgentEvent.message is a string, not object repr
-```
-
-**Task 1.2**: Create test for max rounds configuration
-
-```python
-def test_orchestrator_uses_configured_max_rounds():
-    """MagenticOrchestrator should use max_rounds from constructor."""
-    # Arrange: Create orchestrator with max_rounds=10
-    # Act: Build workflow
-    # Assert: Workflow has max_round_count=10
-```
-
-**Task 1.3**: Create test for bioRxiv reference removal
-
-```python
-def test_task_prompt_references_europe_pmc():
-    """Task prompt should reference Europe PMC, not bioRxiv."""
-    # Arrange: Create orchestrator
-    # Act: Check task string in run()
-    # Assert: Contains "Europe PMC", not "bioRxiv"
-```
-
----
-
-### Phase 2: Fix ChatMessage Text Extraction
-
-**File**: `src/orchestrator_magentic.py`
-**Lines**: 192-199
-
-**Current Code**:
-```python
-elif isinstance(event, MagenticFinalResultEvent):
-    text = event.message.text if event.message else "No result"
-```
-
-**Fixed Code**:
-```python
-elif isinstance(event, MagenticFinalResultEvent):
-    if event.message:
-        # ChatMessage may have .content or .text depending on version
-        if hasattr(event.message, 'content') and event.message.content:
-            text = str(event.message.content)
-        elif hasattr(event.message, 'text') and event.message.text:
-            text = str(event.message.text)
-        else:
-            # Fallback: convert entire message to string
-            text = str(event.message)
-    else:
-        text = "No result generated"
-```
-
-**Why**: The `agent_framework.ChatMessage` object structure may vary. We need defensive extraction.
-
----
-
-### Phase 3: Fix Max Rounds Configuration
-
-**File**: `src/orchestrator_magentic.py`
-**Lines**: 97-99
-
-**Current Code**:
-```python
-.with_standard_manager(
-    chat_client=manager_client,
-    max_round_count=self._max_rounds,  # Already uses config
-    max_stall_count=3,
-    max_reset_count=2,
-)
-```
-
-**Issue**: Default `max_rounds` in `__init__` is 10, but workflow may need more for complex queries.
-
-**Fix**: Verify the value flows through correctly. Add logging.
-
-```python
-logger.info(
-    "Building Magentic workflow",
-    max_rounds=self._max_rounds,
-    max_stall=3,
-    max_reset=2,
-)
-```
-
-**Also check**: `src/orchestrator_factory.py` passes config correctly:
-```python
-return MagenticOrchestrator(
-    max_rounds=config.max_iterations if config else 10,
-)
-```
-
----
-
-### Phase 4: Fix Stale bioRxiv References
-
-**Files to update**:
-
-| File | Line | Change |
-|------|------|--------|
-| `src/orchestrator_magentic.py` | 131 | "bioRxiv" → "Europe PMC" |
-| `src/agents/magentic_agents.py` | 32-33 | "bioRxiv" → "Europe PMC" |
-| `src/app.py` | 202-203 | "bioRxiv" → "Europe PMC" |
-
-**Search command to verify**:
-```bash
-grep -rn "bioRxiv\|biorxiv" src/
-```
-
----
-
-## Implementation Checklist
-
-```
-[ ] Phase 1: Write failing tests
-    [ ] 1.1 Test ChatMessage text extraction
-    [ ] 1.2 Test max rounds configuration
-    [ ] 1.3 Test Europe PMC references
-
-[ ] Phase 2: Fix ChatMessage extraction
-    [ ] Update _process_event() in orchestrator_magentic.py
-    [ ] Run test 1.1 - should pass
-
-[ ] Phase 3: Fix max rounds
-    [ ] Add logging to _build_workflow()
-    [ ] Verify factory passes config correctly
-    [ ] Run test 1.2 - should pass
-
-[ ] Phase 4: Fix bioRxiv references
-    [ ] Update orchestrator_magentic.py task prompt
-    [ ] Update magentic_agents.py descriptions
-    [ ] Update app.py UI text
-    [ ] Run test 1.3 - should pass
-    [ ] Run grep to verify no remaining refs
-
-[ ] Final Verification
-    [ ] make check passes
-    [ ] All tests pass (108+)
-    [ ] Manual test: run_magentic.py produces readable report
-```
-
----
-
-## Test Commands
-
-```bash
-# Run specific test file
-uv run pytest tests/unit/test_orchestrator_magentic.py -v
-
-# Run all tests
-uv run pytest tests/unit/ -v
-
-# Full check
-make check
-
-# Manual integration test
-set -a && source .env && set +a
-uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
-```
-
----
-
-## Success Criteria
-
-1. `run_magentic.py` outputs a readable research report (not `<ChatMessage object>`)
-2. Report includes: Executive Summary, Key Findings, Drug Candidates, References
-3. No "Max round count reached" error with default settings
-4. No "bioRxiv" references anywhere in codebase
-5. All 108+ tests pass
-6. `make check` passes
-
----
-
-## Files Modified
-
-```
-src/
-├── orchestrator_magentic.py    # ChatMessage fix, logging
-├── agents/magentic_agents.py   # bioRxiv → Europe PMC
-└── app.py                      # bioRxiv → Europe PMC
-
-tests/unit/
-└── test_orchestrator_magentic.py  # NEW: 3 tests
-```
-
----
-
-## Notes for AI Agent
-
-When implementing this fix plan:
-
-1. **DO NOT** create mock data or fake responses
-2. **DO** write real tests that verify actual behavior
-3. **DO** run `make check` after each phase
-4. **DO** test with real OpenAI API key via `.env`
-5. **DO** preserve existing functionality - simple mode must still work
-6. **DO NOT** over-engineer - minimal changes to fix the specific bugs
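The defensive text extraction from Phase 2 of the deleted FIX_PLAN_MAGENTIC_MODE.md can be pulled into a standalone helper, which also makes Task 1.1 straightforward to test. This is a sketch under assumptions: `extract_final_text` is a hypothetical name, and `SimpleNamespace` objects stand in for `ChatMessage`, whose real attributes depend on the `agent_framework` version in use.

```python
from __future__ import annotations

from types import SimpleNamespace


def extract_final_text(message: object | None) -> str:
    """Return readable text from a message object of uncertain shape."""
    if not message:
        return "No result generated"
    # Prefer .content, then .text, mirroring the order in the fix plan.
    content = getattr(message, "content", None)
    if content:
        return str(content)
    text = getattr(message, "text", None)
    if text:
        return str(text)
    # Last resort: string form of the whole message.
    return str(message)


# Usage with hypothetical message shapes:
with_content = SimpleNamespace(content="Final report", text=None)
with_text_only = SimpleNamespace(text="Report via .text")
print(extract_final_text(with_content))
print(extract_final_text(with_text_only))
print(extract_final_text(None))
```

Using `getattr` with a default keeps the helper working even when an attribute is missing entirely, which is the case the original `event.message.text` access did not handle.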
docs/bugs/FIX_UI_SIMPLIFICATION.md DELETED
@@ -1,314 +0,0 @@
-# UI Simplification: Remove API Provider Dropdown
-
-**Issues**: #52, #53
-**Priority**: P1 - UX improvement for hackathon demo
-**Estimated Time**: 30 minutes
-**Senior Review**: ✅ Approved with changes (incorporated below)
-
----
-
-## Problem
-
-The current UI has confusing BYOK (Bring Your Own Key) settings:
-
-1. **Provider dropdown is misleading** - Shows "openai" but actually uses free tier when no key
-2. **Examples table shows useless columns** - API Key (empty), Provider (ignored)
-3. **Anthropic doesn't work with Advanced mode** - Only OpenAI has `agent-framework` support
-
-## Solution
-
-Remove `api_provider` dropdown entirely. Auto-detect provider from key prefix.
-
-**Functionality preserved:**
-- Simple mode: Free tier, OpenAI, OR Anthropic (all work)
-- Advanced mode: OpenAI only (Magentic multi-agent requires `OpenAIChatClient`)
-
----
-
-## Implementation
-
-### File: `src/app.py`
-
-#### Change 1: Update `configure_orchestrator()` signature (lines 23-28)
-
-```python
-# BEFORE
-def configure_orchestrator(
-    use_mock: bool = False,
-    mode: str = "simple",
-    user_api_key: str | None = None,
-    api_provider: str = "openai",  # ← REMOVE
-) -> tuple[Any, str]:
-
-# AFTER
-def configure_orchestrator(
-    use_mock: bool = False,
-    mode: str = "simple",
-    user_api_key: str | None = None,
-) -> tuple[Any, str]:
-```
-
-#### Change 2: Update docstring (lines 29-40)
-
-```python
-# AFTER
-"""
-Create an orchestrator instance.
-
-Args:
-    use_mock: If True, use MockJudgeHandler (no API key needed)
-    mode: Orchestrator mode ("simple" or "advanced")
-    user_api_key: Optional user-provided API key (BYOK) - auto-detects provider
-
-Returns:
-    Tuple of (Orchestrator instance, backend_name)
-"""
-```
-
-#### Change 3: Replace provider logic with auto-detection (lines 62-88)
-
-```python
-# BEFORE (lines 62-88) - complex provider checking with api_provider param
-
-# AFTER - auto-detect from key prefix
-# 2. Paid API Key (User provided or Env)
-elif user_api_key and user_api_key.strip():
-    # Auto-detect provider from key prefix
-    model: AnthropicModel | OpenAIModel
-    if user_api_key.startswith("sk-ant-"):
-        # Anthropic key
-        anthropic_provider = AnthropicProvider(api_key=user_api_key)
-        model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
-        backend_info = "Paid API (Anthropic)"
-    elif user_api_key.startswith("sk-"):
-        # OpenAI key
-        openai_provider = OpenAIProvider(api_key=user_api_key)
-        model = OpenAIModel(settings.openai_model, provider=openai_provider)
-        backend_info = "Paid API (OpenAI)"
-    else:
-        raise ValueError(
-            "Invalid API key format. Expected sk-... (OpenAI) or sk-ant-... (Anthropic)"
-        )
-    judge_handler = JudgeHandler(model=model)
-
-# 3. Environment API Keys (fallback)
-elif os.getenv("OPENAI_API_KEY"):
-    judge_handler = JudgeHandler(model=None)  # Uses env key
-    backend_info = "Paid API (OpenAI from env)"
-
-elif os.getenv("ANTHROPIC_API_KEY"):
-    judge_handler = JudgeHandler(model=None)  # Uses env key
-    backend_info = "Paid API (Anthropic from env)"
-
-# 4. Free Tier (HuggingFace Inference)
-else:
-    judge_handler = HFInferenceJudgeHandler()
-    backend_info = "Free Tier (Llama 3.1 / Mistral)"
-```
-
-#### Change 4: Update `research_agent()` signature (lines 105-111)
-
-```python
-# BEFORE
-async def research_agent(
-    message: str,
-    history: list[dict[str, Any]],
-    mode: str = "simple",
-    api_key: str = "",
-    api_provider: str = "openai",  # ← REMOVE
-) -> AsyncGenerator[str, None]:
-
-# AFTER
-async def research_agent(
-    message: str,
-    history: list[dict[str, Any]],
-    mode: str = "simple",
-    api_key: str = "",
-) -> AsyncGenerator[str, None]:
-```
-
-#### Change 5: Update docstring (lines 112-124)
-
-```python
-# AFTER
-"""
-Gradio chat function that runs the research agent.
-
-Args:
-    message: User's research question
-    history: Chat history (Gradio format)
-    mode: Orchestrator mode ("simple" or "advanced")
-    api_key: Optional user-provided API key (BYOK - auto-detects provider)
-
-Yields:
-    Markdown-formatted responses for streaming
-"""
-```
-
-#### Change 6: Fix Advanced mode check (line 139)
-
-```python
-# BEFORE
-if mode == "advanced" and not (has_openai or (has_user_key and api_provider == "openai")):
-
-# AFTER - auto-detect OpenAI key from prefix
-is_openai_user_key = user_api_key and user_api_key.startswith("sk-") and not user_api_key.startswith("sk-ant-")
-if mode == "advanced" and not (has_openai or is_openai_user_key):
-    yield (
-        "⚠️ **Advanced mode requires OpenAI API key.** "
-        "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
-    )
-    mode = "simple"
-```
-
-#### Change 7: Remove premature "Using your key" message (lines 146-151)
-
-```python
-# BEFORE - uses api_provider which no longer exists
-if has_user_key:
-    yield (
-        f"🔑 **Using your {api_provider.upper()} API key** - "
-        "Your key is used only for this session and is never stored.\n\n"
-    )
-
-# AFTER - remove this block entirely
-# The backend_name from configure_orchestrator already shows "Paid API (OpenAI)" or "Paid API (Anthropic)"
-# No need for duplicate messaging
-```
-
-#### Change 8: Update configure_orchestrator call (lines 165-170)
-
-```python
-# BEFORE
-orchestrator, backend_name = configure_orchestrator(
-    use_mock=False,
-    mode=mode,
-    user_api_key=user_api_key,
-    api_provider=api_provider,  # ← REMOVE
-)
-
-# AFTER
-orchestrator, backend_name = configure_orchestrator(
-    use_mock=False,
-    mode=mode,
-    user_api_key=user_api_key,
-)
-```
-
-#### Change 9: Simplify examples (lines 210-229)
-
-```python
-# BEFORE - 4 items per example
-examples=[
-    ["What drugs improve female libido post-menopause?", "simple", "", "openai"],
-    ["Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?", "simple", "", "openai"],
-    ["Evidence for testosterone therapy in women with HSDD?", "simple", "", "openai"],
-],
-
-# AFTER - 2 items per example (query, mode) - API key always empty in examples
-examples=[
-    ["What drugs improve female libido post-menopause?", "simple"],
-    ["Clinical trials for ED alternatives to PDE5 inhibitors?", "simple"],
-    ["Evidence for testosterone therapy in women with HSDD?", "simple"],
-],
-```
-
-#### Change 10: Update additional_inputs (lines 231-252)
-
-```python
-# BEFORE - 3 inputs (mode, api_key, api_provider)
-additional_inputs=[
-    gr.Radio(
-        choices=["simple", "advanced"],
-        value="simple",
-        label="Orchestrator Mode",
-        info="Simple: Linear (Free Tier Friendly) | Advanced: Multi-Agent (Requires OpenAI)",
-    ),
-    gr.Textbox(
-        label="🔑 API Key (Optional - BYOK)",
-        placeholder="sk-... or sk-ant-...",
-        type="password",
-        info="Enter your own API key. Never stored.",
-    ),
-    gr.Radio(  # ← REMOVE THIS ENTIRE BLOCK
-        choices=["openai", "anthropic"],
-        value="openai",
-        label="API Provider",
-        info="Select the provider for your API key",
-    ),
-],
-
-# AFTER - 2 inputs (mode, api_key)
-additional_inputs=[
-    gr.Radio(
-        choices=["simple", "advanced"],
-        value="simple",
-        label="Orchestrator Mode",
-        info="Simple: Works with any key or free tier | Advanced: Requires OpenAI key",
-    ),
-    gr.Textbox(
-        label="🔑 API Key (Optional)",
-        placeholder="sk-... (OpenAI) or sk-ant-... (Anthropic)",
-        type="password",
-        info="Leave empty for free tier. Auto-detects provider from key prefix.",
-    ),
-],
-```
-
-#### Change 11: Update accordion label (line 230)
-
-```python
-# BEFORE
-additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
-
-# AFTER
-additional_inputs_accordion=gr.Accordion(label="⚙️ Settings (Free tier works without API key)", open=False),
-```
-
----
-
-## Testing Checklist
-
-### Manual Tests
-- [ ] **No key**: Shows "Free Tier (Llama 3.1 / Mistral)" in backend
-- [ ] **OpenAI key (sk-...)**: Shows "Paid API (OpenAI)" in backend
-- [ ] **Anthropic key (sk-ant-...)**: Shows "Paid API (Anthropic)" in backend
-- [ ] **Invalid key format**: Shows error message
-- [ ] **Anthropic key + Advanced mode**: Falls back to Simple with warning
-- [ ] **OpenAI key + Advanced mode**: Uses full Magentic multi-agent
-- [ ] **Examples table**: Shows only 2 columns (query, mode)
-- [ ] **MCP server**: Still accessible at `/gradio_api/mcp/`
-
-### Unit Test Updates
-- [ ] `tests/unit/test_app_smoke.py` - may need update if checking input count
-
----
-
-## Definition of Done
-
-- [ ] `api_provider` parameter removed from `configure_orchestrator()`
-- [ ] `api_provider` parameter removed from `research_agent()`
-- [ ] Auto-detection logic works for `sk-` and `sk-ant-` prefixes
-- [ ] Advanced mode check uses auto-detection (not removed param)
-- [ ] "Using your X key" message removed (backend_name handles this)
-- [ ] Examples table shows 2 columns
-- [ ] Accordion label updated
-- [ ] Placeholder text shows both key formats
-- [ ] All existing tests pass
-- [ ] MCP server still works
-
----
-
-## Mode Compatibility Matrix (Unchanged)
-
-| Mode | No Key | OpenAI Key | Anthropic Key |
-|------|--------|------------|---------------|
-| **Simple** | ✅ Free tier | ✅ GPT-5.1 | ✅ Claude Sonnet 4.5 |
-| **Advanced** | ⚠️ Falls back | ✅ Full Magentic | ⚠️ Falls back to Simple |
-
----
-
-## Related
-- Issue #52: UI Polish - Examples table confusion
-- Issue #53: API Provider Simplification
-- Senior Review: Approved 2025-11-28
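The prefix-based auto-detection and the Advanced-mode fallback described in the deleted FIX_UI_SIMPLIFICATION.md can be isolated into two small pure functions, which makes the mode compatibility matrix easy to unit test. This is a sketch only: `detect_provider`, `resolve_mode`, and the `env_has_openai` flag are hypothetical names, not functions from `src/app.py`.

```python
from __future__ import annotations


def detect_provider(api_key: str | None) -> str | None:
    """Map a user-supplied key to a provider name, or None when no key is given."""
    key = (api_key or "").strip()
    if not key:
        return None  # no key: caller falls back to env keys or the free tier
    # Check the more specific prefix first: "sk-ant-" also starts with "sk-".
    if key.startswith("sk-ant-"):
        return "anthropic"
    if key.startswith("sk-"):
        return "openai"
    raise ValueError(
        "Invalid API key format. Expected sk-... (OpenAI) or sk-ant-... (Anthropic)"
    )


def resolve_mode(mode: str, api_key: str | None, env_has_openai: bool = False) -> str:
    """Downgrade "advanced" to "simple" unless an OpenAI key is available."""
    if mode != "advanced":
        return mode
    if env_has_openai or detect_provider(api_key) == "openai":
        return "advanced"
    return "simple"
```

The ordering of the two prefix checks is the part that is easy to get wrong: testing `sk-` first would classify every Anthropic key as OpenAI.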
docs/bugs/INVESTIGATION_INVALID_MODELS.md DELETED
@@ -1,31 +0,0 @@
-# Bug Investigation: Invalid Default LLM Models
-
-## Status
-- **Date:** 2025-11-29
-- **Reporter:** CLI User
-- **Component:** `src/utils/config.py`
-- **Priority:** High (Magentic Mode Blocker)
-- **Resolution:** FIXED
-
-## Issue Description
-The user encountered a 403 error when running in Magentic mode:
-`Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5', ... 'code': 'model_not_found'}}`
-
-## Root Cause Analysis
-OpenAI deprecated the base `gpt-5` model. Tier 5 accounts now have access to:
-- `gpt-5.1` (current flagship)
-- `gpt-5-mini`
-- `gpt-5-nano`
-- `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`
-- `o3`, `o4-mini`
-
-The base `gpt-5` is NO LONGER available via API.
-
-## Solution Implemented
-Updated `src/utils/config.py` to use:
-- `openai_model`: `gpt-5.1` (the actual current model)
-- `anthropic_model`: `claude-sonnet-4-5-20250929` (unchanged)
-
-## Verification
-- `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
-- User confirmed Tier 5 access to `gpt-5.1` via OpenAI dashboard.
docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md
DELETED
|
@@ -1,49 +0,0 @@
|
|
| 1 |
-
# Bug Investigation: HF Free Tier Quota Exhaustion
|
| 2 |
-
|
| 3 |
-
## Status
|
| 4 |
-
- **Date:** 2025-11-29
|
| 5 |
-
- **Reporter:** CLI User
|
| 6 |
-
- **Component:** `HFInferenceJudgeHandler`
|
| 7 |
-
- **Priority:** High (UX Blocker for Free Tier)
|
| 8 |
-
- **Resolution:** FIXED
|
| 9 |
-
|
| 10 |
-
## Issue Description
|
| 11 |
-
On a fresh run with a simple query ("What drugs improve female libido post-menopause?"), the system retrieved 20 valid sources but failed during the Judge/Analysis phase with:
|
| 12 |
-
`β οΈ Free Tier Quota Exceeded β οΈ`
|
| 13 |
-
|
| 14 |
-
This results in a "Synthesis" step that has 0 candidates and 0 findings, rendering the application useless for free users once the (very low) limit is hit, despite having valid search results.
|
| 15 |
-
|
| 16 |
-
## Evidence
|
| 17 |
-
Output provided:
|
| 18 |
-
```text
|
| 19 |
-
### Citations (20 sources)
|
| 20 |
-
...
|
| 21 |
-
### Reasoning
|
| 22 |
-
β οΈ **Free Tier Quota Exceeded** β οΈ
|
| 23 |
-
```
|
| 24 |
-
|
| 25 |
-
## Root Cause Analysis
|
| 26 |
-
1. **Search Success:** `SearchAgent` correctly found 20 documents (PubMed/EuropePMC).
|
| 27 |
-
2. **Judge Failure:** `HFInferenceJudgeHandler` called the HF Inference API.
|
| 28 |
-
3. **Quota Trap:** The API returned a 402 (Payment Required) or Quota error.
|
| 29 |
-
4. **Previous Handling:** The handler caught this error and returned a `JudgeAssessment` with `sufficient=True` (to stop the loop) and *empty* fields.
|
| 30 |
-
5. **Data Loss:** The 20 valid search results were effectively discarded from the "Analysis" perspective.
|
| 31 |
-
|
| 32 |
-
## The "Deep Blocker"
|
| 33 |
-
The system had a "hard failure" mode for quota exhaustion, assuming that if the LLM can't judge, we have *no* useful information. This "bricked" the UX for free users immediately upon hitting the limit.
|
| 34 |
-
|
| 35 |
-
## Solution Implemented
|
| 36 |
-
Modified `HFInferenceJudgeHandler._create_quota_exhausted_assessment` to:
|
| 37 |
-
1. Accept the `evidence` list as an argument.
|
| 38 |
-
2. Perform basic heuristic extraction (borrowed from `MockJudgeHandler` logic):
|
| 39 |
-
- Use titles as "Key Findings" (first 5 sources).
|
| 40 |
-
- Add a clear message in "Drug Candidates" telling the user to upgrade.
|
| 41 |
-
3. Return this "Partial" assessment instead of an empty one.
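
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Evidence` and `JudgeAssessment` classes are simplified, hypothetical stand-ins, and the upgrade message text is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical stand-in for one search result."""

    title: str


@dataclass
class JudgeAssessment:
    """Hypothetical stand-in; the real model likely has more fields."""

    sufficient: bool
    reasoning: str
    key_findings: list[str] = field(default_factory=list)
    drug_candidates: list[str] = field(default_factory=list)


def create_quota_exhausted_assessment(evidence: list[Evidence]) -> JudgeAssessment:
    """Build a partial assessment from titles when the LLM judge is unavailable."""
    return JudgeAssessment(
        sufficient=True,  # stop the loop; retrying cannot succeed without quota
        reasoning="Free Tier Quota Exceeded",
        key_findings=[item.title for item in evidence[:5]],  # first 5 source titles
        drug_candidates=["Upgrade to a paid tier for drug candidate analysis."],
    )
```

The point of the design is that the search results survive the failure: the user still sees the first few titles even though no LLM analysis ran.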
|
| 42 |
-
|
| 43 |
-
## Verification
|
| 44 |
-
- Created `tests/unit/agent_factory/test_judges_hf_quota.py` to verify that:
|
| 45 |
-
- 402 errors are caught.
|
| 46 |
-
- `sufficient` is set to `True` (stops loop).
|
| 47 |
-
- `key_findings` are populated from search result titles.
|
| 48 |
-
- `reasoning` contains the warning message.
|
| 49 |
-
- Ran existing tests `tests/unit/agent_factory/test_judges_hf.py` - All passed.
|
docs/bugs/P0_CRITICAL_BUGS.md
DELETED
|
@@ -1,43 +0,0 @@
|
|
| 1 |
-
# P0 Critical Bugs - DeepBoner Demo Broken
|
| 2 |
-
|
| 3 |
-
**Date**: 2025-11-28
|
| 4 |
-
**Status**: RESOLVED (2025-11-29)
|
| 5 |
-
**Priority**: P0 - Blocking hackathon submission
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## Summary
|
| 10 |
-
|
| 11 |
-
The Gradio demo was non-functional due to 4 critical bugs. All have been fixed and verified.
|
| 12 |
-
|
| 13 |
-
---
|
| 14 |
-
|
| 15 |
-
## Bug 1: Free Tier LLM Quota Exhausted (P0) - FIXED
|
| 16 |
-
|
| 17 |
-
**Resolution**:
|
| 18 |
-
- Implemented `QuotaExhaustedError` detection in `HFInferenceJudgeHandler`.
|
| 19 |
-
- The agent now gracefully stops and displays a clear "Free Tier Quota Exceeded" message instead of looping infinitely.
|
| 20 |
-
|
| 21 |
-
## Bug 2: Evidence Counter Shows 0 After Dedup (P1) - FIXED
|
| 22 |
-
|
| 23 |
-
**Resolution**:
|
| 24 |
-
- Fixed by resolving Bug 4 (Data Leak). Deduplication now works correctly on isolated per-request collections.
|
| 25 |
-
|
| 26 |
-
## Bug 3: API Key Not Passed to Advanced Mode (P0) - FIXED
|
| 27 |
-
|
| 28 |
-
**Resolution**:
|
| 29 |
-
- Plumbed `api_key` from the UI through `configure_orchestrator` -> `create_orchestrator` -> `MagenticOrchestrator`.
|
| 30 |
-
- Magentic agents now correctly use the user-provided OpenAI key.
|
| 31 |
-
|
| 32 |
-
## Bug 4: Singleton EmbeddingService Causes Cross-Session Pollution (P0) - FIXED
|
| 33 |
-
|
| 34 |
-
**Resolution**:
|
| 35 |
-
- Removed the singleton pattern for `EmbeddingService`.
|
| 36 |
-
- Each request now gets a fresh `EmbeddingService` with a unique, isolated ChromaDB collection (`evidence_{uuid}`).
|
| 37 |
-
- `SentenceTransformer` model is lazily cached globally to maintain performance.
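
A rough sketch of the pattern described in Bug 4, assuming a process-wide cached model and a per-request collection name. The class body and `get_shared_model` helper are hypothetical, and a placeholder object stands in for the real `SentenceTransformer` so the example has no third-party dependencies.

```python
import uuid
from functools import lru_cache


@lru_cache(maxsize=1)
def get_shared_model() -> object:
    """Load the expensive model once per process (placeholder object here)."""
    return object()  # the real code would construct a SentenceTransformer


class EmbeddingService:
    """Hypothetical per-request service with an isolated collection name."""

    def __init__(self) -> None:
        self.model = get_shared_model()  # shared across requests, never reloaded
        self.collection_name = f"evidence_{uuid.uuid4().hex}"  # unique per request
```

Sharing only the model keeps startup cost low, while the unique collection name prevents one session's evidence from leaking into another.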
|
| 38 |
-
|
| 39 |
-
---
|
| 40 |
-
|
| 41 |
-
## Verification
|
| 42 |
-
|
| 43 |
-
Run `make check` to verify all tests pass.
|
docs/bugs/P0_GRADIO_EXAMPLE_CACHING_CRASH.md
DELETED
|
@@ -1,134 +0,0 @@
|
|
| 1 |
-
# P0 Bug Report: Gradio Example Caching Crash
|
| 2 |
-
|
| 3 |
-
## Status
|
| 4 |
-
- **Date:** 2025-11-29
|
| 5 |
-
- **Priority:** P0 CRITICAL (Production Down)
|
| 6 |
-
- **Component:** `src/app.py:131`
|
| 7 |
-
- **Environment:** HuggingFace Spaces (Python 3.11, Gradio)
|
| 8 |
-
|
| 9 |
-
## Error Message
|
| 10 |
-
|
| 11 |
-
```text
|
| 12 |
-
AttributeError: 'NoneType' object has no attribute 'strip'
|
| 13 |
-
```
|
| 14 |
-
|
| 15 |
-
## Full Stack Trace
|
| 16 |
-
|
| 17 |
-
```text
|
| 18 |
-
File "/app/src/app.py", line 131, in research_agent
|
| 19 |
-
user_api_key = (api_key.strip() or api_key_state.strip()) or None
|
| 20 |
-
^^^^^^^^^^^^^
|
| 21 |
-
AttributeError: 'NoneType' object has no attribute 'strip'
|
| 22 |
-
```
|
| 23 |
-
|
| 24 |
-
## Root Cause Analysis
|
| 25 |
-
|
| 26 |
-
### The Trigger
|
| 27 |
-
Gradio's example caching mechanism runs the `research_agent` function during startup to pre-cache example outputs. This happens at:
|
| 28 |
-
|
| 29 |
-
```text
|
| 30 |
-
File "/usr/local/lib/python3.11/site-packages/gradio/helpers.py", line 509, in _start_caching
|
| 31 |
-
await self.cache()
|
| 32 |
-
```
|
| 33 |
-
|
| 34 |
-
### The Problem
|
| 35 |
-
Our examples only provide values for 2 of the 4 function parameters:
|
| 36 |
-
|
| 37 |
-
```python
|
| 38 |
-
examples=[
|
| 39 |
-
["What is the evidence for testosterone therapy in women with HSDD?", "simple"],
|
| 40 |
-
["Promising drug candidates for endometriosis pain management", "simple"],
|
| 41 |
-
]
|
| 42 |
-
```
|
| 43 |
-
|
| 44 |
-
These map to `[message, mode]` but **NOT** to `api_key` or `api_key_state`.
|
| 45 |
-
|
| 46 |
-
When Gradio runs the function for caching, it passes `None` for the unprovided parameters:
|
| 47 |
-
|
| 48 |
-
```python
|
| 49 |
-
async def research_agent(
|
| 50 |
-
message: str, # ✅ Provided by example
|
| 51 |
-
history: list[...], # ✅ Empty list default
|
| 52 |
-
mode: str = "simple", # ✅ Provided by example
|
| 53 |
-
api_key: str = "", # ❌ Becomes None during caching!
|
| 54 |
-
api_key_state: str = "" # ❌ Becomes None during caching!
|
| 55 |
-
) -> AsyncGenerator[...]:
|
| 56 |
-
```
|
| 57 |
-
|
| 58 |
-
### The Crash
|
| 59 |
-
Line 131 attempts to call `.strip()` on `None`:
|
| 60 |
-
|
| 61 |
-
```python
|
| 62 |
-
user_api_key = (api_key.strip() or api_key_state.strip()) or None
|
| 63 |
-
# ^^^^^^^^^^^^^
|
| 64 |
-
# NoneType has no attribute 'strip'
|
| 65 |
-
```
|
| 66 |
-
|
| 67 |
-
## Gradio Warning (Ignored)
|
| 68 |
-
|
| 69 |
-
Gradio actually warned us about this:
|
| 70 |
-
|
| 71 |
-
```text
|
| 72 |
-
UserWarning: Examples will be cached but not all input components have
|
| 73 |
-
example values. This may result in an exception being thrown by your function.
|
| 74 |
-
```
|
| 75 |
-
|
| 76 |
-
## Solution
|
| 77 |
-
|
| 78 |
-
### Option A: Defensive None Handling (Recommended)
|
| 79 |
-
Add None guards before calling `.strip()`:
|
| 80 |
-
|
| 81 |
-
```python
|
| 82 |
-
# Handle None values from Gradio example caching
|
| 83 |
-
api_key_str = api_key or ""
|
| 84 |
-
api_key_state_str = api_key_state or ""
|
| 85 |
-
user_api_key = (api_key_str.strip() or api_key_state_str.strip()) or None
|
| 86 |
-
```
|
| 87 |
-
|
| 88 |
-
### Option B: Disable Example Caching
|
| 89 |
-
Set `cache_examples=False` in ChatInterface:
|
| 90 |
-
|
| 91 |
-
```python
|
| 92 |
-
gr.ChatInterface(
|
| 93 |
-
fn=research_agent,
|
| 94 |
-
examples=[...],
|
| 95 |
-
cache_examples=False, # Disable caching
|
| 96 |
-
)
|
| 97 |
-
```
|
| 98 |
-
|
| 99 |
-
This avoids the crash but loses the UX benefit of pre-cached examples.
|
| 100 |
-
|
| 101 |
-
### Option C: Provide Full Example Values
|
| 102 |
-
Include all 4 columns in examples:
|
| 103 |
-
|
| 104 |
-
```python
|
| 105 |
-
examples=[
|
| 106 |
-
["What is the evidence...", "simple", "", ""], # [msg, mode, api_key, state]
|
| 107 |
-
]
|
| 108 |
-
```
|
| 109 |
-
|
| 110 |
-
This is verbose and exposes internal state to users.
|
| 111 |
-
|
| 112 |
-
## Recommendation
|
| 113 |
-
|
| 114 |
-
**Option A** is the cleanest fix. It:
|
| 115 |
-
1. Maintains cached examples for fast UX
|
| 116 |
-
2. Handles edge cases defensively
|
| 117 |
-
3. Doesn't expose internal state in examples
|
| 118 |
-
|
| 119 |
-
## Pre-Merge Checklist
|
| 120 |
-
|
| 121 |
-
- [ ] Fix applied to `src/app.py`
|
| 122 |
-
- [ ] Unit test added for None parameter handling
|
| 123 |
-
- [ ] `make check` passes
|
| 124 |
-
- [ ] Test locally with `uv run python -m src.app`
|
| 125 |
-
- [ ] Verify example caching works without crash
|
| 126 |
-
- [ ] Deploy to HuggingFace Spaces
|
| 127 |
-
- [ ] Verify Space starts without error
|
| 128 |
-
|
| 129 |
-
## Lessons Learned
|
| 130 |
-
|
| 131 |
-
1. Always test Gradio apps with example caching enabled locally before deploying
|
| 132 |
-
2. Gradio's "partial examples" feature passes `None` for missing columns
|
| 133 |
-
3. Default parameter values (`str = ""`) are ignored when Gradio explicitly passes `None`
|
| 134 |
-
4. The Gradio warning about missing example values should be treated as an error
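
Lesson 3 can be illustrated with a small, testable helper. This is a sketch only: `resolve_api_key` is a hypothetical name, not a function in `src/app.py`, and it simply packages the Option A guard so it can be unit tested.

```python
from typing import Optional


def resolve_api_key(
    api_key: Optional[str] = "", api_key_state: Optional[str] = ""
) -> Optional[str]:
    """Return the first non-blank key, or None if neither is usable."""
    # Defaults do not apply when a caller passes None explicitly, so guard both.
    for candidate in (api_key, api_key_state):
        cleaned = (candidate or "").strip()
        if cleaned:
            return cleaned
    return None
```

Calling `resolve_api_key(None, None)` mimics what happens during example caching and returns `None` instead of raising `AttributeError`.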
|
docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md
DELETED
|
@@ -1,81 +0,0 @@
|
|
| 1 |
-
# P1 Bug: Gradio Settings Accordion Not Collapsing
|
| 2 |
-
|
| 3 |
-
**Priority**: P1 (UX Bug)
|
| 4 |
-
**Status**: OPEN
|
| 5 |
-
**Date**: 2025-11-27
|
| 6 |
-
**Target Component**: `src/app.py`
|
| 7 |
-
|
| 8 |
-
---
|
| 9 |
-
|
| 10 |
-
## 1. Problem Description
|
| 11 |
-
|
| 12 |
-
The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history.
|
| 13 |
-
|
| 14 |
-
### Symptoms
|
| 15 |
-
- Accordion arrow toggles visually, but content remains visible.
|
| 16 |
-
- Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces.
|
| 17 |
-
|
| 18 |
-
---
|
| 19 |
-
|
| 20 |
-
## 2. Root Cause Analysis
|
| 21 |
-
|
| 22 |
-
**Definitive Cause**: Nested `Blocks` Context Bug.
|
| 23 |
-
`gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`.
|
| 24 |
-
|
| 25 |
-
**Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block.
|
| 26 |
-
|
| 27 |
-
---
|
| 28 |
-
|
| 29 |
-
## 3. Solution Strategy: "The Unwrap Fix"
|
| 30 |
-
|
| 31 |
-
We will remove the redundant `gr.Blocks` wrapper. This restores the native behavior of `ChatInterface`, ensuring the accordion respects `open=False`.
|
| 32 |
-
|
| 33 |
-
### Implementation Plan
|
| 34 |
-
|
| 35 |
-
**Refactor `src/app.py` / `create_demo()`**:
|
| 36 |
-
|
| 37 |
-
1. **Remove** the `with gr.Blocks() as demo:` context manager.
|
| 38 |
-
2. **Instantiate** `gr.ChatInterface` directly as the `demo` object.
|
| 39 |
-
3. **Migrate UI Elements**:
|
| 40 |
-
* **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`.
|
| 41 |
-
* **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat.
|
| 42 |
-
|
| 43 |
-
### Before (Buggy)
|
| 44 |
-
```python
|
| 45 |
-
def create_demo():
|
| 46 |
-
with gr.Blocks() as demo: # <--- CAUSE OF BUG
|
| 47 |
-
gr.Markdown("# Title")
|
| 48 |
-
gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False))
|
| 49 |
-
gr.Markdown("Footer")
|
| 50 |
-
return demo
|
| 51 |
-
```
|
| 52 |
-
|
| 53 |
-
### After (Correct)
|
| 54 |
-
```python
|
| 55 |
-
def create_demo():
|
| 56 |
-
return gr.ChatInterface( # <--- FIX: Top-level component
|
| 57 |
-
...,
|
| 58 |
-
title="🧬 DeepBoner",
|
| 59 |
-
description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...",
|
| 60 |
-
additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False)
|
| 61 |
-
)
|
| 62 |
-
```
|
| 63 |
-
|
| 64 |
-
---
|
| 65 |
-
|
| 66 |
-
## 4. Validation
|
| 67 |
-
|
| 68 |
-
1. **Run**: `uv run python src/app.py`
|
| 69 |
-
2. **Check**: Open `http://localhost:7860`
|
| 70 |
-
3. **Verify**:
|
| 71 |
-
* Settings accordion starts **COLLAPSED**.
|
| 72 |
-
* Header title ("DeepBoner") is visible.
|
| 73 |
-
* Footer text ("MCP Server Active") is visible in the description area.
|
| 74 |
-
* Chat functionality works (Magentic/Simple modes).
|
| 75 |
-
|
| 76 |
-
---
|
| 77 |
-
|
| 78 |
-
## 5. Constraints & Notes
|
| 79 |
-
|
| 80 |
-
- **Layout**: We lose the ability to place arbitrary elements *below* the chat box (footer will move to top, under title), but this is an acceptable trade-off for a working UI.
|
| 81 |
-
- **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style.
|
docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md
DELETED
|
@@ -1,181 +0,0 @@
|
|
| 1 |
-
# Bug Report: Magentic Mode Integration Issues
|
| 2 |
-
|
| 3 |
-
## Status
|
| 4 |
-
- **Date:** 2025-11-29
|
| 5 |
-
- **Reporter:** CLI User
|
| 6 |
-
- **Priority:** P1 (UX Degradation + Deprecation Warnings)
|
| 7 |
-
- **Component:** `src/app.py`, `src/orchestrator_magentic.py`, `src/utils/llm_factory.py`
|
| 8 |
-
- **Status:** ✅ FIXED (Bug 1 & Bug 2) - 2025-11-29
|
| 9 |
-
- **Tests:** 138 passing (136 original + 2 new validation tests)
|
| 10 |
-
|
| 11 |
-
---
|
| 12 |
-
|
| 13 |
-
## Bug 1: Token-by-Token Streaming Spam ✅ FIXED
|
| 14 |
-
|
| 15 |
-
### Symptoms
|
| 16 |
-
When running Magentic (Advanced) mode, the UI shows hundreds of individual lines like:
|
| 17 |
-
```text
|
| 18 |
-
📡 STREAMING: Below
|
| 19 |
-
📡 STREAMING: is
|
| 20 |
-
📡 STREAMING: a
|
| 21 |
-
📡 STREAMING: curated
|
| 22 |
-
📡 STREAMING: list
|
| 23 |
-
...
|
| 24 |
-
```
|
| 25 |
-
|
| 26 |
-
Each token is displayed as a separate streaming event, creating visual spam and making it impossible to read the output until completion.
|
| 27 |
-
|
| 28 |
-
### Root Cause (VALIDATED)
|
| 29 |
-
**File:** `src/orchestrator_magentic.py:247-254`
|
| 30 |
-
|
| 31 |
-
```python
|
| 32 |
-
elif isinstance(event, MagenticAgentDeltaEvent):
|
| 33 |
-
if event.text:
|
| 34 |
-
return AgentEvent(
|
| 35 |
-
type="streaming",
|
| 36 |
-
message=event.text, # Single token!
|
| 37 |
-
data={"agent_id": event.agent_id},
|
| 38 |
-
iteration=iteration,
|
| 39 |
-
)
|
| 40 |
-
```
|
| 41 |
-
|
| 42 |
-
Every LLM token emits a `MagenticAgentDeltaEvent`, which creates an `AgentEvent(type="streaming")`.
|
| 43 |
-
|
| 44 |
-
**File:** `src/app.py:171-192` (BEFORE FIX)
|
| 45 |
-
|
| 46 |
-
```python
|
| 47 |
-
async for event in orchestrator.run(message):
|
| 48 |
-
event_md = event.to_markdown()
|
| 49 |
-
response_parts.append(event_md) # Appends EVERY token
|
| 50 |
-
|
| 51 |
-
if event.type == "complete":
|
| 52 |
-
yield event.message
|
| 53 |
-
else:
|
| 54 |
-
yield "\n\n".join(response_parts) # Yields ALL accumulated tokens
|
| 55 |
-
```
|
| 56 |
-
|
| 57 |
-
For N tokens, this yields N times, each time showing all previous tokens. This is O(N²) string operations and creates massive visual spam.
|
| 58 |
-
|
| 59 |
-
### Fix Applied
|
| 60 |
-
**File:** `src/app.py:175-204`
|
| 61 |
-
|
| 62 |
-
Implemented streaming token buffering with live updates:
|
| 63 |
-
1. Added `streaming_buffer = ""` to accumulate tokens
|
| 64 |
-
2. For each streaming event: append to buffer, yield immediately (for live typing UX)
|
| 65 |
-
3. **Key fix**: Don't append streaming events to `response_parts` (prevents O(N²) list growth)
|
| 66 |
-
4. Each yield has only ONE `📡 STREAMING:` line (the accumulated buffer)
|
| 67 |
-
5. Flush buffer to `response_parts` only when non-streaming event occurs
|
| 68 |
-
|
| 69 |
-
**Result**: Live typing feel preserved, but no visual spam (each update replaces, not accumulates)
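
The buffering approach listed above can be sketched as a plain generator. This is an illustration under simplifying assumptions, not the code in `src/app.py`: events are modelled as hypothetical `(type, message)` tuples, and the special handling of `complete` events is omitted.

```python
from collections.abc import Iterable, Iterator


def render_updates(events: Iterable[tuple[str, str]]) -> Iterator[str]:
    """Yield one UI string per event; `events` holds (type, message) pairs."""
    response_parts: list[str] = []  # completed, non-streaming events only
    streaming_buffer = ""  # tokens of the in-progress stream segment

    for event_type, message in events:
        if event_type == "streaming":
            streaming_buffer += message
            # Show history plus ONE live streaming line; do not store it yet.
            yield "\n\n".join([*response_parts, f"📡 STREAMING: {streaming_buffer}"])
            continue
        if streaming_buffer:
            # Segment finished: flush the buffer into the history exactly once.
            response_parts.append(f"📡 STREAMING: {streaming_buffer}")
            streaming_buffer = ""
        response_parts.append(message)
        yield "\n\n".join(response_parts)
```

Because streaming tokens never enter `response_parts` individually, the history grows with the number of events rather than the number of tokens.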
|
| 70 |
-
|
| 71 |
-
### Proposed Fix Options
|
| 72 |
-
|
| 73 |
-
**Option A: Buffer streaming tokens (recommended)**
|
| 74 |
-
```python
|
| 75 |
-
# In app.py - accumulate streaming tokens, yield periodically
|
| 76 |
-
streaming_buffer = ""
|
| 77 |
-
last_yield_time = time.time()
|
| 78 |
-
|
| 79 |
-
async for event in orchestrator.run(message):
|
| 80 |
-
if event.type == "streaming":
|
| 81 |
-
streaming_buffer += event.message
|
| 82 |
-
# Only yield every 500ms or on newline
|
| 83 |
-
if time.time() - last_yield_time > 0.5 or "\n" in event.message:
|
| 84 |
-
yield f"📡 {streaming_buffer}"
|
| 85 |
-
last_yield_time = time.time()
|
| 86 |
-
elif event.type == "complete":
|
| 87 |
-
yield event.message
|
| 88 |
-
else:
|
| 89 |
-
# Non-streaming events
|
| 90 |
-
response_parts.append(event.to_markdown())
|
| 91 |
-
yield "\n\n".join(response_parts)
|
| 92 |
-
```
|
| 93 |
-
|
| 94 |
-
**Option B: Don't yield streaming events at all**
|
| 95 |
-
```python
|
| 96 |
-
# In app.py - only yield meaningful events
|
| 97 |
-
async for event in orchestrator.run(message):
|
| 98 |
-
if event.type == "streaming":
|
| 99 |
-
continue # Skip token-by-token spam
|
| 100 |
-
# ... rest of logic
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
**Option C: Fix at orchestrator level**
|
| 104 |
-
Don't emit `AgentEvent` for every delta - buffer in `_process_event`.
|
| 105 |
-
|
| 106 |
-
---
|
| 107 |
-
|
| 108 |
-
## Bug 2: API Key Does Not Persist in Textbox ✅ FIXED
|
| 109 |
-
|
| 110 |
-
### Symptoms
|
| 111 |
-
1. User opens the "Mode & API Key" accordion
|
| 112 |
-
2. User pastes their API key into the password textbox
|
| 113 |
-
3. User clicks an example OR clicks elsewhere
|
| 114 |
-
4. The API key textbox is now empty - value lost
|
| 115 |
-
|
| 116 |
-
### Root Cause (VALIDATED)
|
| 117 |
-
**File:** `src/app.py:255-267` (BEFORE FIX)
|
| 118 |
-
|
| 119 |
-
```python
|
| 120 |
-
additional_inputs_accordion=additional_inputs_accordion,
|
| 121 |
-
additional_inputs=[
|
| 122 |
-
gr.Radio(...),
|
| 123 |
-
gr.Textbox(
|
| 124 |
-
label="🔑 API Key (Optional)",
|
| 125 |
-
type="password",
|
| 126 |
-
# No `value` parameter - defaults to empty
|
| 127 |
-
# No state persistence mechanism
|
| 128 |
-
),
|
| 129 |
-
],
|
| 130 |
-
```
|
| 131 |
-
|
| 132 |
-
Gradio's `ChatInterface` with `additional_inputs` has known issues:
|
| 133 |
-
1. Clicking examples resets additional inputs to defaults
|
| 134 |
-
2. The accordion state and input values may not persist correctly
|
| 135 |
-
3. No explicit state management for the API key
|
| 136 |
-
|
| 137 |
-
### Fix Applied
|
| 138 |
-
**Files Modified:**
|
| 139 |
-
1. `src/app.py`
|
| 140 |
-
2. `src/utils/llm_factory.py`
|
| 141 |
-
|
| 142 |
-
**Bug 1 (Streaming Spam):**
|
| 143 |
-
- Accumulate tokens in `streaming_buffer`
|
| 144 |
-
- Yield updates immediately for live typing UX
|
| 145 |
-
- **Key**: Don't append to `response_parts` until stream segment complete
|
| 146 |
-
- Each yield has ONE `📡 STREAMING:` line (not N accumulated lines)
|
| 147 |
-
|
| 148 |
-
**Bug 2 (API Key Persistence):**
|
| 149 |
-
- **Strategy:** Partial example list (relies on Gradio behavior)
|
| 150 |
-
- Examples have only 2 elements `[message, mode]` instead of 4
|
| 151 |
-
- Gradio only updates inputs with corresponding example values
|
| 152 |
-
- Remaining inputs (api_key textbox) are left unchanged
|
| 153 |
-
- `api_key_state` parameter exists as fallback but may be redundant
|
| 154 |
-
- **Note:** This is a workaround relying on undocumented Gradio behavior
|
| 155 |
-
|
| 156 |
-
**Bug 3 (OpenAIModel Deprecation):** ✅ FIXED
|
| 157 |
-
- Replaced all `OpenAIModel` imports with `OpenAIChatModel` in `src/app.py` and `src/utils/llm_factory.py`.
|
| 158 |
-
|
| 159 |
-
### Test Results
|
| 160 |
-
```bash
|
| 161 |
-
uv run pytest tests/ -q
|
| 162 |
-
============================= 138 passed in 20.60s =============================
|
| 163 |
-
```
|
| 164 |
-
|
| 165 |
-
**Status:** ✅ All tests passing
|
| 166 |
-
|
| 167 |
-
### Why This Fix Works
|
| 168 |
-
|
| 169 |
-
**Bug 1 (Streaming Spam):**
|
| 170 |
-
- **Before:** Every token → `append()` to list → `yield` → List grew to size N → O(N²) complexity.
|
| 171 |
-
- **After:** Every token → `yield` dynamically constructed string (buffer + history) → List stays size K (number of *events*).
|
| 172 |
-
- **Impact:** Smooth streaming, no visual spam, no browser freeze.
|
| 173 |
-
|
| 174 |
-
**Bug 2 (API Key):**
|
| 175 |
-
- **Before:** Example click → Overwrote API Key textbox with `""`.
|
| 176 |
-
- **After:** Example click → Updates only `message` and `mode` → API Key textbox untouched.
|
| 177 |
-
- **Impact:** User input persists naturally.
|
| 178 |
-
|
| 179 |
-
### Remaining Work
|
| 180 |
-
- **Bug 4 (Asyncio GC errors):** Monitoring only - likely Gradio/HF Spaces issue
|
| 181 |
-
|
docs/bugs/P1_MULTIPLE_UX_BUGS.md
DELETED
|
@@ -1,49 +0,0 @@
|
|
| 1 |
-
# P1 Bug Report: Multiple UX and Configuration Issues
|
| 2 |
-
|
| 3 |
-
## Status
|
| 4 |
-
- **Date:** 2025-11-29
|
| 5 |
-
- **Priority:** P1 (Multiple user-facing issues)
|
| 6 |
-
- **Components:** `src/app.py`, `src/orchestrator_magentic.py`
|
| 7 |
-
|
| 8 |
-
## Resolved Issues (Fixed 2025-11-29)
|
| 9 |
-
|
| 10 |
-
### Bug 1: API Key Cleared When Clicking Examples
|
| 11 |
-
**Fixed.** Updated `examples` in `app.py` to include explicit `None` values for additional inputs. Gradio preserves values when the example value is `None`.
|
| 12 |
-
|
| 13 |
-
### Bug 2: No Loading/Processing Indicator
|
| 14 |
-
**Fixed.** `research_agent` yields an immediate "⏳ Processing..." message before starting the orchestrator.
|
| 15 |
-
|
| 16 |
-
### Bug 3: Advanced Mode Temperature Error
|
| 17 |
-
**Fixed.** Explicitly set `temperature=1.0` for all Magentic agents in `src/agents/magentic_agents.py`. This is compatible with OpenAI reasoning models (o1/o3) which require `temperature=1` and were rejecting the default (likely 0.3 or None).
|
| 18 |
-
|
| 19 |
-
### Bug 4: HSDD Acronym Not Spelled Out
|
| 20 |
-
**Fixed.** Updated example text in `app.py` to "HSDD (Hypoactive Sexual Desire Disorder)".
|
| 21 |
-
|
| 22 |
-
---
|
| 23 |
-
|
| 24 |
-
## Open / Deferred Issues
|
| 25 |
-
|
| 26 |
-
### Bug 5: Free Tier Quota Exhausted (UX Improvement)
|
| 27 |
-
**Deferred.** Currently shows standard error message. Improve if users report confusion.
|
| 28 |
-
|
| 29 |
-
### Bug 6: Asyncio File Descriptor Warnings
|
| 30 |
-
**Won't Fix.** Cosmetic issue only.
|
| 31 |
-
|
| 32 |
-
---
|
| 33 |
-
|
| 34 |
-
## Priority Order (Completed)
|
| 35 |
-
|
| 36 |
-
1. **Bug 4 (HSDD)** - Fixed
|
| 37 |
-
2. **Bug 2 (Loading indicator)** - Fixed
|
| 38 |
-
3. **Bug 3 (Temperature)** - Fixed
|
| 39 |
-
4. **Bug 1 (API key)** - Fixed
|
| 40 |
-
|
| 41 |
-
---
|
| 42 |
-
|
| 43 |
-
## Test Plan
|
| 44 |
-
- [x] Fix HSDD acronym
|
| 45 |
-
- [x] Add loading indicator yield
|
| 46 |
-
- [x] Test advanced mode with temperature fix (Static analysis/Code change)
|
| 47 |
-
- [x] Research Gradio example behavior for API key (Implemented None fix)
|
| 48 |
-
- [ ] Run `make check`
|
| 49 |
-
- [ ] Deploy and test on HuggingFace Spaces
|
docs/bugs/P2_MAGENTIC_THINKING_STATE.md
DELETED
|
@@ -1,232 +0,0 @@
|
|
| 1 |
-
# P2 Bug Report: Advanced Mode Missing "Thinking" State
|
| 2 |
-
|
| 3 |
-
## Status
|
| 4 |
-
- **Date:** 2025-11-29
|
| 5 |
-
- **Priority:** P2 (UX polish, not blocking functionality)
|
| 6 |
-
- **Component:** `src/orchestrator_magentic.py`, `src/app.py`
|
| 7 |
-
|
| 8 |
-
---
|
| 9 |
-
|
| 10 |
-
## Symptoms
|
| 11 |
-
|
| 12 |
-
User experience in **Advanced (Magentic) mode**:
|
| 13 |
-
1. Click example or submit query
|
| 14 |
-
2. See: `🚀 **STARTED**: Starting research (Magentic mode)...`
|
| 15 |
-
3. **2+ minutes of nothing** (no spinner, no progress, no indication work is happening)
|
| 16 |
-
4. Eventually see: `🧠 **JUDGING**: Manager (user_task)...`
|
| 17 |
-
|
| 18 |
-
**User perception:** "Is it frozen? Did it crash?"
|
| 19 |
-
|
| 20 |
-
### Container Logs Confirm Work IS Happening
|
| 21 |
-
```text
|
| 22 |
-
14:54:22 [info] Starting Magentic orchestrator query='...'
|
| 23 |
-
14:54:22 [info] Embedding service enabled
|
| 24 |
-
... 2+ MINUTES OF SILENCE (agent-framework doing internal LLM calls) ...
|
| 25 |
-
14:56:38 [info] Creating orchestrator mode=advanced
|
| 26 |
-
```
|
| 27 |
-
|
| 28 |
-
The silence is because `workflow.run_stream()` doesn't yield events during its setup phase.
|
| 29 |
-
|
| 30 |
-
---
|
| 31 |
-
|
| 32 |
-
## Root Cause Analysis
|
| 33 |
-
|
| 34 |
-
### Current Flow (`src/orchestrator_magentic.py`)
|
| 35 |
-
```python
|
| 36 |
-
async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
|
| 37 |
-
# 1. Immediately yields "started"
|
| 38 |
-
yield AgentEvent(type="started", message=f"Starting research (Magentic mode): {query}")
|
| 39 |
-
|
| 40 |
-
# 2. Setup (fast, no yield needed)
|
| 41 |
-
embedding_service = self._init_embedding_service()
|
| 42 |
-
init_magentic_state(embedding_service)
|
| 43 |
-
workflow = self._build_workflow()
|
| 44 |
-
|
| 45 |
-
# 3. GAP: workflow.run_stream() blocks for 2+ minutes before first event
|
| 46 |
-
async for event in workflow.run_stream(task): # <-- THE BOTTLENECK
|
| 47 |
-
yield self._process_event(event)
|
| 48 |
-
```
|
| 49 |
-
|
| 50 |
-
The `agent-framework`'s `workflow.run_stream()` is calling OpenAI's API, building the manager prompt, coordinating agents, etc. **It doesn't yield events during this setup phase**.
|
| 51 |
-
|
| 52 |
-
---
|
| 53 |
-
|
| 54 |
-
## Gold Standard UX (What We'd Want)
|
| 55 |
-
|
| 56 |
-
### Gradio's Native Thinking Support
|
| 57 |
-
|
| 58 |
-
Per [Gradio Chatbot Docs](https://www.gradio.app/docs/gradio/chatbot):
|
| 59 |
-
|
| 60 |
-
> "The Gradio Chatbot can natively display intermediate thoughts and tool usage in a collapsible accordion next to a chat message. This makes it perfect for creating UIs for LLM agents and chain-of-thought (CoT) or reasoning demos."
|
| 61 |
-
|
| 62 |
-
**Features available:**
|
| 63 |
-
- `gr.ChatMessage` with `metadata={"status": "pending"}` shows spinner
|
| 64 |
-
- `metadata={"title": "Thinking...", "status": "pending"}` creates collapsible accordion
|
| 65 |
-
- Nested thoughts via `id` and `parent_id`
|
| 66 |
-
- `duration` metadata shows time spent
|
| 67 |
-
|
| 68 |
-
**Example from Gradio docs:**
|
| 69 |
-
```python
|
| 70 |
-
import gradio as gr
|
| 71 |
-
|
| 72 |
-
def chat_fn(message, history):
|
| 73 |
-
# Yield thinking state with spinner
|
| 74 |
-
yield gr.ChatMessage(
|
| 75 |
-
role="assistant",
|
| 76 |
-
metadata={"title": "🧠 Thinking...", "status": "pending"}
|
| 77 |
-
)
|
| 78 |
-
|
| 79 |
-
# Do work...
|
| 80 |
-
|
| 81 |
-
# Update with completed thought
|
| 82 |
-
yield gr.ChatMessage(
|
| 83 |
-
role="assistant",
|
| 84 |
-
content="Analysis complete",
|
| 85 |
-
metadata={"title": "🧠 Thinking...", "status": "done", "duration": 5.2}
|
| 86 |
-
)
|
| 87 |
-
|
| 88 |
-
yield "Here's the final answer..."
|
| 89 |
-
```
|
| 90 |
-
|
| 91 |
-
---
|
| 92 |
-
|
| 93 |
-
## Why This is Complex for DeepBoner
|
| 94 |
-
|
| 95 |
-
### Constraint 1: ChatInterface Returns Strings
|
| 96 |
-
Our `research_agent()` yields plain strings:
|
| 97 |
-
```python
|
| 98 |
-
yield f"🧠 **Backend**: {backend_name}\n\n"
|
| 99 |
-
yield "⏳ **Processing...** Searching PubMed...\n"
|
| 100 |
-
yield "\n\n".join(response_parts)
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
Converting to `gr.ChatMessage` objects would require refactoring the entire response pipeline.
|
| 104 |
-
|
| 105 |
-
### Constraint 2: Agent-Framework is the Bottleneck
|
| 106 |
-
The 2-minute gap is inside `workflow.run_stream(task)`, which is the `agent-framework` library. We can't inject yields into a third-party library's blocking call.
|
| 107 |
-
|
| 108 |
-
### Constraint 3: ChatInterface vs Blocks
|
| 109 |
-
`gr.ChatInterface` is a convenience wrapper. The full `gr.ChatMessage` metadata features work best with raw `gr.Blocks` + `gr.Chatbot` components.
|
| 110 |
-
|
| 111 |
-
---
|
| 112 |
-
|
| 113 |
-
## Options
|
| 114 |
-
|
| 115 |
-
### Option A: Yield "Thinking" Before Blocking Call (Recommended)
|
| 116 |
-
**Effort:** 5 minutes
|
| 117 |
-
**Impact:** Users see *something* while waiting
|
| 118 |
-
|
| 119 |
-
```python
|
| 120 |
-
# In src/orchestrator_magentic.py
|
| 121 |
-
async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
|
| 122 |
-
yield AgentEvent(type="started", message=f"Starting research (Magentic mode): {query}")
|
| 123 |
-
|
| 124 |
-
# NEW: Yield thinking state before the blocking call
|
| 125 |
-
yield AgentEvent(
|
| 126 |
-
type="thinking", # New event type
|
| 127 |
-
message="🧠 Agents are reasoning... This may take 2-5 minutes for complex queries.",
|
| 128 |
-
iteration=0,
|
| 129 |
-
)
|
| 130 |
-
|
| 131 |
-
# ... rest of setup ...
|
| 132 |
-
|
| 133 |
-
async for event in workflow.run_stream(task):
|
| 134 |
-
yield self._process_event(event)
|
| 135 |
-
```
|
| 136 |
-
|
| 137 |
-
**Pros:**
|
| 138 |
-
- Simple, doesn't require Gradio changes
|
| 139 |
-
- Works with current string-based approach
|
| 140 |
-
- Sets user expectations ("2-5 minutes")
|
| 141 |
-
|
| 142 |
-
**Cons:**
|
| 143 |
-
- No spinner/animation (static text)
|
| 144 |
-
- Doesn't show real-time progress during the gap
|
| 145 |
-
|
| 146 |
-
### Option B: Use `gr.ChatMessage` with Metadata (Major Refactor)
|
| 147 |
-
**Effort:** 2-4 hours
|
| 148 |
-
**Impact:** Full gold-standard UX
|
| 149 |
-
|
| 150 |
-
Would require:
|
| 151 |
-
1. Changing `research_agent()` to yield `gr.ChatMessage` objects
|
| 152 |
-
2. Adding thinking states with `metadata={"status": "pending"}`
|
| 153 |
-
3. Updating all event handlers to produce proper ChatMessage objects
|
| 154 |
-
|
| 155 |
-
### Option C: Heartbeat/Polling (Over-Engineering)
|
| 156 |
-
**Effort:** 4+ hours
|
| 157 |
-
**Impact:** Spinner during blocking call
|
| 158 |
-
|
| 159 |
-
Create a background task that yields "still working..." every 10 seconds while waiting for the agent-framework. Requires:
|
| 160 |
-
- `asyncio.create_task()` for heartbeat
|
| 161 |
-
- Task cancellation when real events arrive
|
| 162 |
-
- Proper cleanup
|
| 163 |
-
|
| 164 |
-
**Verdict:** Over-engineering for a demo.
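
For reference, the heartbeat idea in Option C could be sketched roughly as below. This is a minimal illustration, not the project's code: `with_heartbeat` and `HEARTBEAT` are hypothetical names, plain strings stand in for `AgentEvent` objects, and it only helps if the wrapped stream awaits (yields control to the event loop) while it works.

```python
import asyncio
from collections.abc import AsyncGenerator, AsyncIterator

HEARTBEAT = "still working..."  # hypothetical placeholder message


async def with_heartbeat(
    events: AsyncIterator[str], interval: float = 10.0
) -> AsyncGenerator[str, None]:
    """Relay items from `events`, yielding HEARTBEAT whenever `interval`
    seconds pass without a new item."""
    iterator = events.__aiter__()
    # Keep one pending request for the next item alive across heartbeats.
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                # Nothing arrived in time: emit a heartbeat and keep waiting.
                yield HEARTBEAT
                continue
            try:
                item = pending.result()
            except StopAsyncIteration:
                return  # the wrapped stream is exhausted
            yield item
            pending = asyncio.ensure_future(iterator.__anext__())
    finally:
        # Cleanup: cancel the outstanding request if the consumer stops early.
        if not pending.done():
            pending.cancel()
```

Under these assumptions the orchestrator would iterate over `with_heartbeat(workflow.run_stream(task))` instead of the raw stream. The extra task management and cancellation handling is the cost the verdict above refers to.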
|
| 165 |
-
|
| 166 |
-
### Option D: Accept the Limitation (Document It)
|
| 167 |
-
**Effort:** 0
|
| 168 |
-
**Impact:** None (users still confused)
|
| 169 |
-
|
| 170 |
-
Just document that Advanced mode takes 2-5 minutes and users should wait.
|
| 171 |
-
|
| 172 |
-
---
|
| 173 |
-
|
| 174 |
-
## Recommendation
|
| 175 |
-
|
| 176 |
-
**Implement Option A** - Add a "thinking" yield before the blocking call.
|
| 177 |
-
|
| 178 |
-
It's:
|
| 179 |
-
1. Minimal code change (5 minutes)
|
| 180 |
-
2. Sets user expectations clearly
|
| 181 |
-
3. Doesn't require Gradio refactoring
|
| 182 |
-
4. Better than silence
|
| 183 |
-
|
| 184 |
-
---
|
| 185 |
-
|
| 186 |
-
## Implementation Plan
|
| 187 |
-
|
| 188 |
-
### Step 1: Add "thinking" Event Type
|
| 189 |
-
```python
# In src/utils/models.py
class AgentEvent(BaseModel):
    type: Literal[
        "started", "thinking", "searching", ...  # Add "thinking"
    ]
```
|
| 196 |
-
|
| 197 |
-
### Step 2: Yield Thinking Event in Magentic Orchestrator
|
| 198 |
-
```python
# In src/orchestrator_magentic.py, run() method
yield AgentEvent(
    type="thinking",
    message="🧠 Multi-agent reasoning in progress... This may take 2-5 minutes.",
    iteration=0,
)
```
|
| 206 |
-
|
| 207 |
-
### Step 3: Handle in App
|
| 208 |
-
```python
# In src/app.py, research_agent()
if event.type == "thinking":
    yield f"⏳ {event.message}"
```
|
| 213 |
-
|
| 214 |
-
---
|
| 215 |
-
|
| 216 |
-
## Test Plan
|
| 217 |
-
|
| 218 |
-
- [ ] Add `"thinking"` to AgentEvent type literals
|
| 219 |
-
- [ ] Add yield before `workflow.run_stream()`
|
| 220 |
-
- [ ] Handle in app.py
|
| 221 |
-
- [ ] `make check` passes
|
| 222 |
-
- [ ] Manual test: Advanced mode shows "reasoning in progress" message
|
| 223 |
-
- [ ] Deploy to HuggingFace, verify UX improvement
|
| 224 |
-
|
| 225 |
-
---
|
| 226 |
-
|
| 227 |
-
## References
|
| 228 |
-
|
| 229 |
-
- [Gradio ChatInterface Docs](https://www.gradio.app/docs/gradio/chatinterface)
|
| 230 |
-
- [Gradio Chatbot Metadata](https://www.gradio.app/docs/gradio/chatbot)
|
| 231 |
-
- [Agents and Tool Usage Guide](https://www.gradio.app/guides/agents-and-tool-usage)
|
| 232 |
-
- [GitHub Issue: Streaming text not working](https://github.com/gradio-app/gradio/issues/11443)
docs/bugs/SENIOR_AGENT_AUDIT_PROMPT.md
DELETED
|
@@ -1,247 +0,0 @@
|
|
| 1 |
-
# Senior Agent Audit Request: DeepBoner Codebase Bug Hunt
|
| 2 |
-
|
| 3 |
-
**Date**: 2025-11-28
|
| 4 |
-
**Requesting Agent**: Claude (Opus)
|
| 5 |
-
**Purpose**: Comprehensive bug audit and verification of P0_CRITICAL_BUGS.md
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## Your Mission
|
| 10 |
-
|
| 11 |
-
You are a senior software engineer performing a comprehensive audit of the DeepBoner codebase. Your goals:
|
| 12 |
-
|
| 13 |
-
1. **VERIFY** the 4 bugs documented in `docs/bugs/P0_CRITICAL_BUGS.md` are accurately described
|
| 14 |
-
2. **FIND** any additional bugs (P0-P4) that could affect the demo
|
| 15 |
-
3. **TRACE** the complete code paths for Simple and Advanced modes
|
| 16 |
-
4. **IDENTIFY** any silent failures, race conditions, or edge cases
|
| 17 |
-
|
| 18 |
-
---
|
| 19 |
-
|
| 20 |
-
## Context: What DeepBoner Does
|
| 21 |
-
|
| 22 |
-
DeepBoner is a Gradio-based biomedical research agent that:
|
| 23 |
-
1. Takes a research question from user
|
| 24 |
-
2. Searches PubMed, ClinicalTrials.gov, Europe PMC
|
| 25 |
-
3. Uses an LLM "judge" to evaluate if evidence is sufficient
|
| 26 |
-
4. Either loops for more evidence or synthesizes a final report
|
| 27 |
-
|
| 28 |
-
**Two Modes**:
|
| 29 |
-
- **Simple**: Linear orchestrator with search → judge → report loop
|
| 30 |
-
- **Advanced**: Magentic multi-agent with SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent
|
| 31 |
-
|
| 32 |
-
**Three Backend Options**:
|
| 33 |
-
- Free tier: HuggingFace Inference API (Llama/Mistral)
|
| 34 |
-
- OpenAI: User-provided or env var key
|
| 35 |
-
- Anthropic: User-provided or env var key (Simple mode only)
|
| 36 |
-
|
| 37 |
-
---
|
| 38 |
-
|
| 39 |
-
## Files to Audit (Priority Order)
|
| 40 |
-
|
| 41 |
-
### Critical Path Files:
|
| 42 |
-
1. `src/app.py` - Gradio UI, entry point, key routing
|
| 43 |
-
2. `src/orchestrator.py` - Simple mode main loop
|
| 44 |
-
3. `src/orchestrator_factory.py` - Mode selection and orchestrator creation
|
| 45 |
-
4. `src/orchestrator_magentic.py` - Advanced mode implementation
|
| 46 |
-
5. `src/services/embeddings.py` - Deduplication singleton (KNOWN BUG)
|
| 47 |
-
6. `src/agent_factory/judges.py` - LLM judge handlers (HF, OpenAI, Anthropic)
|
| 48 |
-
|
| 49 |
-
### Supporting Files:
|
| 50 |
-
7. `src/tools/search_handler.py` - Parallel search orchestration
|
| 51 |
-
8. `src/tools/pubmed.py` - PubMed API integration
|
| 52 |
-
9. `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
|
| 53 |
-
10. `src/tools/europepmc.py` - Europe PMC API
|
| 54 |
-
11. `src/agents/magentic_agents.py` - Agent factories (KNOWN BUG: hardcoded env key)
|
| 55 |
-
12. `src/utils/config.py` - Settings and configuration
|
| 56 |
-
13. `src/utils/models.py` - Data models (Evidence, Citation, etc.)
|
| 57 |
-
|
| 58 |
-
---
|
| 59 |
-
|
| 60 |
-
## Known Bugs to Verify
|
| 61 |
-
|
| 62 |
-
### Bug 1: Free Tier LLM Quota Exhausted
|
| 63 |
-
**Claim**: HuggingFace Inference returns 402, all 3 fallback models fail
|
| 64 |
-
**Verify**:
|
| 65 |
-
- Check `src/agent_factory/judges.py` class `HFInferenceJudgeHandler`
|
| 66 |
-
- Trace the fallback chain: Llama → Mistral → Zephyr
|
| 67 |
-
- Confirm what happens when ALL fail (does it return default "continue"?)
|
| 68 |
-
- Check if the error message reaches the user or is swallowed
|
| 69 |
-
|
| 70 |
-
### Bug 2: Evidence Counter Shows 0 After Dedup
|
| 71 |
-
**Claim**: `_deduplicate_and_rank()` can return empty list, losing all evidence
|
| 72 |
-
**Verify**:
|
| 73 |
-
- Check `src/orchestrator.py` lines 97-114 and 219
|
| 74 |
-
- Trace what happens if `embeddings.deduplicate()` returns `[]`
|
| 75 |
-
- Is there defensive handling? Does exception handler catch this?
|
| 76 |
-
- Could this be a race condition in async code?
|
| 77 |
-
|
| 78 |
-
### Bug 3: API Key Not Passed to Advanced Mode
|
| 79 |
-
**Claim**: User's API key from Gradio is never passed to MagenticOrchestrator
|
| 80 |
-
**Verify**:
|
| 81 |
-
- Trace: `app.py:research_agent()` → `configure_orchestrator()` → `orchestrator_factory.py`
|
| 82 |
-
- Check if `user_api_key` is passed to `create_orchestrator()`
|
| 83 |
-
- Check if `MagenticOrchestrator.__init__()` receives a key
|
| 84 |
-
- Check `src/agents/magentic_agents.py` - do agents use `settings.openai_api_key`?
|
| 85 |
-
|
| 86 |
-
### Bug 4: Singleton EmbeddingService Cross-Session Pollution
|
| 87 |
-
**Claim**: ChromaDB collection persists across requests, causing false duplicates
|
| 88 |
-
**Verify**:
|
| 89 |
-
- Check `src/services/embeddings.py` singleton pattern
|
| 90 |
-
- Is `_embedding_service` ever reset?
|
| 91 |
-
- What happens to ChromaDB collection between Gradio requests?
|
| 92 |
-
- Could this cause "Found 20 new sources (0 total)"?
|
| 93 |
-
|
| 94 |
-
---
|
| 95 |
-
|
| 96 |
-
## Additional Bug Categories to Search For
|
| 97 |
-
|
| 98 |
-
### A. Error Handling Gaps
|
| 99 |
-
- [ ] Silent `except: pass` blocks
|
| 100 |
-
- [ ] Exceptions logged but not re-raised
|
| 101 |
-
- [ ] Missing error messages to user
|
| 102 |
-
- [ ] Swallowed API errors
|
| 103 |
-
|
| 104 |
-
### B. Async/Concurrency Issues
|
| 105 |
-
- [ ] Race conditions in parallel searches
|
| 106 |
-
- [ ] Shared mutable state across async calls
|
| 107 |
-
- [ ] Missing `await` keywords
|
| 108 |
-
- [ ] Event loop blocking (sync code in async context)
|
| 109 |
-
|
| 110 |
-
### C. API Integration Bugs
|
| 111 |
-
- [ ] Missing rate limiting
|
| 112 |
-
- [ ] Hardcoded timeouts that are too short
|
| 113 |
-
- [ ] XML/JSON parsing failures not handled
|
| 114 |
-
- [ ] Empty response handling
|
| 115 |
-
|
| 116 |
-
### D. State Management Issues
|
| 117 |
-
- [ ] Global singletons that should be session-scoped
|
| 118 |
-
- [ ] Gradio state not properly isolated between users
|
| 119 |
-
- [ ] Memory leaks from accumulated data
|
| 120 |
-
|
| 121 |
-
### E. Configuration Bugs
|
| 122 |
-
- [ ] Missing env var defaults
|
| 123 |
-
- [ ] Type mismatches in settings
|
| 124 |
-
- [ ] Hardcoded values that should be configurable
|
| 125 |
-
|
| 126 |
-
### F. UI/UX Bugs
|
| 127 |
-
- [ ] Streaming not working properly
|
| 128 |
-
- [ ] Progress messages misleading
|
| 129 |
-
- [ ] Examples not matching actual functionality
|
| 130 |
-
- [ ] Error messages not user-friendly
|
| 131 |
-
|
| 132 |
-
---
|
| 133 |
-
|
| 134 |
-
## Output Format
|
| 135 |
-
|
| 136 |
-
Please produce a report with:
|
| 137 |
-
|
| 138 |
-
### 1. Verification of Known Bugs
|
| 139 |
-
For each of the 4 bugs in P0_CRITICAL_BUGS.md:
|
| 140 |
-
- **CONFIRMED** or **INCORRECT** or **PARTIALLY CORRECT**
|
| 141 |
-
- Exact file:line references
|
| 142 |
-
- Any corrections or additional details
|
| 143 |
-
|
| 144 |
-
### 2. New Bugs Found
|
| 145 |
-
For each new bug:
|
| 146 |
-
```
|
| 147 |
-
## Bug N: [Title]
|
| 148 |
-
**Priority**: P0/P1/P2/P3/P4
|
| 149 |
-
**File**: path/to/file.py:line
|
| 150 |
-
**Symptoms**: What the user sees
|
| 151 |
-
**Root Cause**: Technical explanation
|
| 152 |
-
**Code**:
|
| 153 |
-
```python
|
| 154 |
-
# The buggy code
|
| 155 |
-
```
|
| 156 |
-
**Fix**:
|
| 157 |
-
```python
|
| 158 |
-
# The corrected code
|
| 159 |
-
```
|
| 160 |
-
```
|
| 161 |
-
|
| 162 |
-
### 3. Code Quality Concerns
|
| 163 |
-
Any patterns that aren't bugs but could cause issues:
|
| 164 |
-
- Technical debt
|
| 165 |
-
- Missing tests for critical paths
|
| 166 |
-
- Unclear error handling
|
| 167 |
-
|
| 168 |
-
### 4. Recommended Fix Order
|
| 169 |
-
Prioritized list of what to fix first for a working demo.
|
| 170 |
-
|
| 171 |
-
---
|
| 172 |
-
|
| 173 |
-
## Commands to Help Your Investigation
|
| 174 |
-
|
| 175 |
-
```bash
# Run the tests
make check

# Test search works
uv run python -c "
import asyncio
from src.tools.pubmed import PubMedTool
async def test():
    tool = PubMedTool()
    results = await tool.search('female libido', 5)
    print(f'Found {len(results)} results')
asyncio.run(test())
"

# Test HF inference (will show 402 if quota exhausted)
uv run python -c "
from huggingface_hub import InferenceClient
client = InferenceClient()
try:
    resp = client.chat_completion(
        messages=[{'role': 'user', 'content': 'Hi'}],
        model='meta-llama/Llama-3.1-8B-Instruct',
        max_tokens=10
    )
    print(resp)
except Exception as e:
    print(f'Error: {e}')
"

# Test full orchestrator (simple mode)
uv run python -c "
import asyncio
from src.app import configure_orchestrator
async def test():
    orch, backend = configure_orchestrator(use_mock=True, mode='simple')
    print(f'Backend: {backend}')
    async for event in orch.run('test query'):
        print(f'{event.type}: {event.message[:50] if event.message else \"\"}'[:60])
asyncio.run(test())
"

# Check for hardcoded API keys (security)
# -F treats the placeholder strings literally, so real keys are not filtered out
grep -r "sk-" src/ --include="*.py" | grep -vF "sk-..." | grep -vF "sk-ant-..."

# Find all singletons
grep -r "_.*: .* | None = None" src/ --include="*.py"

# Find all except blocks
grep -rn "except.*:" src/ --include="*.py" | head -50
```
|
| 226 |
-
|
| 227 |
-
---
|
| 228 |
-
|
| 229 |
-
## Important Notes
|
| 230 |
-
|
| 231 |
-
1. **DO NOT fix bugs** - just document them
|
| 232 |
-
2. **Be thorough** - check edge cases and error paths
|
| 233 |
-
3. **Be specific** - include file:line references
|
| 234 |
-
4. **Be skeptical** - verify claims in P0_CRITICAL_BUGS.md independently
|
| 235 |
-
5. **Think like a user** - what would break the demo experience?
|
| 236 |
-
|
| 237 |
-
The hackathon deadline is approaching. We need a working demo. Your audit will determine what gets fixed first.
|
| 238 |
-
|
| 239 |
-
---
|
| 240 |
-
|
| 241 |
-
## Deliverable
|
| 242 |
-
|
| 243 |
-
A comprehensive markdown report that:
|
| 244 |
-
1. Confirms or corrects the 4 known bugs
|
| 245 |
-
2. Lists any new bugs found (with priority)
|
| 246 |
-
3. Recommends the optimal fix order
|
| 247 |
-
4. Can be saved as `docs/bugs/SENIOR_AUDIT_RESULTS.md`
docs/bugs/SENIOR_AUDIT_RESULTS.md
DELETED
|
@@ -1,84 +0,0 @@
|
|
| 1 |
-
# Senior Agent Audit Results: DeepBoner Codebase
|
| 2 |
-
|
| 3 |
-
**Date**: 2025-11-28
|
| 4 |
-
**Auditor**: Claude (Senior Software Engineer)
|
| 5 |
-
**Status**: COMPLETE
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## Executive Summary
|
| 10 |
-
|
| 11 |
-
The DeepBoner codebase has **4 critical defects** that render the demo non-functional for most users. The most severe is a **data leak** where the vector database persists across user sessions, causing search result corruption and potential privacy issues. Additionally, the "Advanced" mode ignores user-provided API keys, and the "Free Tier" mode fails silently when quotas are exhausted.
|
| 12 |
-
|
| 13 |
-
**Recommendation**: Immediate remediation of P0 bugs is required before hackathon submission.
|
| 14 |
-
|
| 15 |
-
---
|
| 16 |
-
|
| 17 |
-
## 1. Verification of Known Bugs (P0_CRITICAL_BUGS.md)
|
| 18 |
-
|
| 19 |
-
| Bug | Claim | Verification Status | Notes |
| :--- | :--- | :--- | :--- |
| **Bug 1** | Free Tier LLM Quota Exhausted | **CONFIRMED** | `HFInferenceJudgeHandler` catches errors but returns a fallback assessment with `recommendation="continue"`. This causes the orchestrator to loop uselessly until `max_iterations` is reached. The user sees no error message. |
| **Bug 2** | Evidence Counter Shows 0 | **CONFIRMED** | Directly caused by Bug 4. Deduplication logic works correctly *in isolation*, but fails because the underlying ChromaDB collection is polluted with stale data from previous sessions. |
| **Bug 3** | API Key Not Passed to Advanced | **CONFIRMED** | `create_orchestrator` in `orchestrator_factory.py` ignores the user's API key. `MagenticOrchestrator` and its agents fall back to `settings.openai_api_key` (env var), which is empty for BYOK users. |
| **Bug 4** | Singleton EmbeddingService | **CONFIRMED** | `EmbeddingService` is a global singleton with an in-memory ChromaDB. The collection is never cleared. Data leaks between sessions, causing valid new results to be marked as duplicates of old results. |
|
| 25 |
-
|
| 26 |
-
---
|
| 27 |
-
|
| 28 |
-
## 2. New Bugs Found
|
| 29 |
-
|
| 30 |
-
### Bug 5: Search Error Swallowing (P2)
|
| 31 |
-
**File**: `src/orchestrator.py` / `src/tools/search_handler.py`
|
| 32 |
-
**Symptoms**: If all search tools fail (e.g., network issue, API limit), the UI shows "Found 0 sources" without explaining why.
|
| 33 |
-
**Root Cause**: `SearchHandler` captures exceptions and returns them in an `errors` list, but `Orchestrator` only logs them to the console (`logger.warning`) and proceeds with empty evidence.
|
| 34 |
-
**Fix**: Yield an `AgentEvent(type="error")` or include errors in the `search_complete` event message.
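
One way the Bug 5 fix could look is sketched below. It is only an illustration under stated assumptions: `AgentEvent` and `SearchResult` here are simplified stand-ins for the project's models, and `events_for_search` is a hypothetical helper, not an existing function.

```python
from dataclasses import dataclass, field


@dataclass
class AgentEvent:
    """Simplified stand-in for the project's AgentEvent model."""
    type: str
    message: str


@dataclass
class SearchResult:
    """Assumed shape of what SearchHandler returns: evidence plus errors."""
    evidence: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def events_for_search(result: SearchResult) -> list:
    """Build user-visible events instead of only logging search errors."""
    events = []
    if result.errors and not result.evidence:
        # Every tool failed: explain why zero sources were found.
        events.append(
            AgentEvent("error", "All searches failed: " + "; ".join(result.errors))
        )
    message = f"Found {len(result.evidence)} sources"
    if result.errors:
        # Partial failure: mention it in the search_complete message.
        message += f" ({len(result.errors)} search tool(s) failed)"
    events.append(AgentEvent("search_complete", message))
    return events
```

The orchestrator would then yield each returned event, so a total failure shows an explicit error rather than a bare "Found 0 sources".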
|
| 35 |
-
|
| 36 |
-
### Bug 6: Hardcoded Model Names (P3)
|
| 37 |
-
**File**: `src/agent_factory/judges.py`
|
| 38 |
-
**Symptoms**: Maintenance burden.
|
| 39 |
-
**Root Cause**: Model names like `meta-llama/Llama-3.1-8B-Instruct` are hardcoded in the class `HFInferenceJudgeHandler` rather than pulled from `config.py`.
|
| 40 |
-
**Fix**: Move to `Settings`.
|
| 41 |
-
|
| 42 |
-
---
|
| 43 |
-
|
| 44 |
-
## 3. Code Quality Concerns
|
| 45 |
-
|
| 46 |
-
1. **Singleton Abuse**: The `_embedding_service` global in `src/services/embeddings.py` is a major architectural flaw for a multi-user web app (even a demo). It should be scoped to the `Orchestrator` instance.
|
| 47 |
-
2. **Inconsistent Factory Signatures**: `create_orchestrator` does not accept `api_key`, forcing hacks or reliance on global env vars.
|
| 48 |
-
3. **Silent Failures**: The pervasive use of `try...except Exception` with only logging (no user feedback) makes debugging difficult for end-users.
|
| 49 |
-
|
| 50 |
-
---
|
| 51 |
-
|
| 52 |
-
## 4. Recommended Fix Order
|
| 53 |
-
|
| 54 |
-
### Step 1: Fix the Data Leak (Bug 4 & 2)
|
| 55 |
-
**Why**: Prevents result corruption and cross-user data leakage.
|
| 56 |
-
**Plan**:
|
| 57 |
-
1. Remove singleton pattern from `src/services/embeddings.py`.
|
| 58 |
-
2. Make `EmbeddingService` an instance variable of `Orchestrator`.
|
| 59 |
-
3. Initialize a fresh `EmbeddingService` (and ChromaDB collection) for each `run()`.
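
The plan above can be sketched as follows. This is a toy model, not the real implementation: an in-memory set stands in for the ChromaDB collection, exact string matching stands in for embedding similarity, and the class bodies are assumptions made for illustration.

```python
class EmbeddingService:
    """Toy stand-in: a set replaces the in-memory ChromaDB collection."""

    def __init__(self) -> None:
        self._seen = set()

    def deduplicate(self, items: list) -> list:
        """Return items not seen before by this service instance."""
        fresh = []
        for item in items:
            if item not in self._seen:
                self._seen.add(item)
                fresh.append(item)
        return fresh


class Orchestrator:
    """Owns its EmbeddingService instead of sharing a module-level singleton."""

    def __init__(self) -> None:
        self._embeddings = EmbeddingService()

    def run(self, results: list) -> list:
        # A fresh service per run: nothing survives from earlier sessions.
        self._embeddings = EmbeddingService()
        return self._embeddings.deduplicate(results)
```

With this shape, running the same query twice returns the same number of results, which matches the dedup check in the verification plan.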
|
| 60 |
-
|
| 61 |
-
### Step 2: Fix Advanced Mode BYOK (Bug 3)
|
| 62 |
-
**Why**: Enables the core "Advanced" feature for judges/users.
|
| 63 |
-
**Plan**:
|
| 64 |
-
1. Update `create_orchestrator` signature to accept `api_key`.
|
| 65 |
-
2. Update `MagenticOrchestrator` to accept `api_key`.
|
| 66 |
-
3. Update `configure_orchestrator` in `app.py` to pass the key.
|
| 67 |
-
4. Ensure `MagenticOrchestrator` constructs `OpenAIChatClient` with the user's key.
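
A rough sketch of threading the key through the factory is shown below. The signatures are assumptions based on the plan, not the actual code: the real `create_orchestrator` takes other arguments, and `resolve_key` is a hypothetical helper that shows the intended precedence (user key first, env var second).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MagenticOrchestrator:
    """Simplified stand-in that only models API key handling."""
    api_key: Optional[str] = None

    def resolve_key(self, env_key: Optional[str]) -> str:
        """Prefer the user's key, then the env var; fail loudly if neither exists."""
        key = self.api_key or env_key
        if not key:
            raise ValueError(
                "No OpenAI API key: enter one in the UI or set it in the environment"
            )
        return key


def create_orchestrator(mode: str, api_key: Optional[str] = None) -> MagenticOrchestrator:
    """Hypothetical factory signature that accepts and forwards the user's key."""
    if mode != "advanced":
        raise ValueError(f"This sketch only covers advanced mode, got {mode!r}")
    return MagenticOrchestrator(api_key=api_key)
```

Raising when no key is available keeps the failure visible at initialization instead of surfacing later as an opaque client error.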
|
| 68 |
-
|
| 69 |
-
### Step 3: Fix Free Tier Experience (Bug 1)
|
| 70 |
-
**Why**: Ensures a usable fallback for those without keys.
|
| 71 |
-
**Plan**:
|
| 72 |
-
1. In `HFInferenceJudgeHandler`, detect 402/429 errors.
|
| 73 |
-
2. If caught, return a `JudgeAssessment` that triggers a "Complete" event with a clear error message, rather than "Continue".
|
| 74 |
-
3. Add `HF_TOKEN` to the deployment environment if possible.
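
The quota handling in this step might be sketched as below. All names are illustrative assumptions: `InferenceError` stands in for whatever exception the HF client raises, `JudgeAssessment` is reduced to two fields, and `assess_with_fallback` is a hypothetical function rather than the handler's real method.

```python
from dataclasses import dataclass
from typing import Callable

QUOTA_STATUS_CODES = {402, 429}


@dataclass
class JudgeAssessment:
    """Reduced stand-in for the project's JudgeAssessment."""
    recommendation: str
    reasoning: str


class InferenceError(Exception):
    """Stand-in for an HTTP error raised by the inference client."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def assess_with_fallback(
    models: list, call_model: Callable[[str], JudgeAssessment]
) -> JudgeAssessment:
    """Try each fallback model; stop the loop with a clear message on quota errors."""
    quota_hit = False
    for model in models:
        try:
            return call_model(model)
        except InferenceError as exc:
            if exc.status_code in QUOTA_STATUS_CODES:
                quota_hit = True
    if quota_hit:
        # Terminal assessment: the orchestrator should finish, not keep looping.
        return JudgeAssessment(
            "complete",
            "Free tier quota exhausted. Provide an API key or try again later.",
        )
    return JudgeAssessment("continue", "All models failed for a non-quota reason.")
```

Returning a terminal assessment on 402/429 avoids looping until `max_iterations` with no message for the user.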
|
| 75 |
-
|
| 76 |
-
---
|
| 77 |
-
|
| 78 |
-
## Verification Plan
|
| 79 |
-
|
| 80 |
-
After applying fixes, run:
|
| 81 |
-
1. **Unit Tests**: `make check`
|
| 82 |
-
2. **Manual Test (Simple)**: Run without key, verify 402 error is handled OR works if token added.
|
| 83 |
-
3. **Manual Test (Advanced)**: Run with OpenAI key, verify it proceeds past initialization.
|
| 84 |
-
4. **Manual Test (Dedup)**: Run same query twice. Second run should find same number of results (not 0).