> Commit ae22306 by VibecoderMcSwaggins (parent: 37b8559)
> docs: add senior agent audit prompt for comprehensive bug hunt
> File added: `docs/bugs/SENIOR_AGENT_AUDIT_PROMPT.md`
# Senior Agent Audit Request: DeepBoner Codebase Bug Hunt

**Date**: 2025-11-28
**Requesting Agent**: Claude (Opus)
**Purpose**: Comprehensive bug audit and verification of `P0_CRITICAL_BUGS.md`

---

## Your Mission

You are a senior software engineer performing a comprehensive audit of the DeepBoner codebase. Your goals:

1. **VERIFY** the 4 bugs documented in `docs/bugs/P0_CRITICAL_BUGS.md` are accurately described
2. **FIND** any additional bugs (P0-P4) that could affect the demo
3. **TRACE** the complete code paths for Simple and Advanced modes
4. **IDENTIFY** any silent failures, race conditions, or edge cases

---
## Context: What DeepBoner Does

DeepBoner is a Gradio-based biomedical research agent that:
1. Takes a research question from the user
2. Searches PubMed, ClinicalTrials.gov, and Europe PMC
3. Uses an LLM "judge" to evaluate whether the evidence is sufficient
4. Either loops for more evidence or synthesizes a final report

**Two Modes**:
- **Simple**: Linear orchestrator with a search → judge → report loop
- **Advanced**: Magentic multi-agent with SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent

**Three Backend Options**:
- Free tier: HuggingFace Inference API (Llama/Mistral)
- OpenAI: user-provided or env var key
- Anthropic: user-provided or env var key (Simple mode only)

---
## Files to Audit (Priority Order)

### Critical Path Files:
1. `src/app.py` - Gradio UI, entry point, key routing
2. `src/orchestrator.py` - Simple mode main loop
3. `src/orchestrator_factory.py` - Mode selection and orchestrator creation
4. `src/orchestrator_magentic.py` - Advanced mode implementation
5. `src/services/embeddings.py` - Deduplication singleton (KNOWN BUG)
6. `src/agent_factory/judges.py` - LLM judge handlers (HF, OpenAI, Anthropic)

### Supporting Files:
7. `src/tools/search_handler.py` - Parallel search orchestration
8. `src/tools/pubmed.py` - PubMed API integration
9. `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
10. `src/tools/europepmc.py` - Europe PMC API
11. `src/agents/magentic_agents.py` - Agent factories (KNOWN BUG: hardcoded env key)
12. `src/utils/config.py` - Settings and configuration
13. `src/utils/models.py` - Data models (Evidence, Citation, etc.)

---
## Known Bugs to Verify

### Bug 1: Free Tier LLM Quota Exhausted
**Claim**: HuggingFace Inference returns 402 and all 3 fallback models fail
**Verify**:
- Check the `HFInferenceJudgeHandler` class in `src/agent_factory/judges.py`
- Trace the fallback chain: Llama → Mistral → Zephyr
- Confirm what happens when ALL of them fail (does it return a default "continue" verdict?)
- Check whether the error message reaches the user or is swallowed

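The failure mode to look for can be sketched as a minimal reproduction (all names here are illustrative, not the actual `judges.py` implementation):

```python
# Hypothetical sketch of a fallback chain that ends in a silent default.
FALLBACK_MODELS = [
    "meta-llama/Llama-3.1-8B-Instruct",    # Llama
    "mistralai/Mistral-7B-Instruct-v0.3",  # Mistral
    "HuggingFaceH4/zephyr-7b-beta",        # Zephyr
]

def judge_with_fallback(call_model, prompt: str) -> str:
    """Try each model in turn; note what happens when ALL of them fail."""
    for model in FALLBACK_MODELS:
        try:
            return call_model(model, prompt)
        except Exception:
            continue  # 402 quota errors land here -- is anything logged?
    # If the chain ends by returning a default verdict instead of raising,
    # the user never learns the judge is broken: the loop just keeps going.
    return "continue"

def always_402(model, prompt):
    # Stand-in for an exhausted free tier: every call fails.
    raise RuntimeError("402 Payment Required")

verdict = judge_with_fallback(always_402, "Is the evidence sufficient?")
```

If the real handler follows this shape, the audit should flag whether the final `return` is reachable and whether anything surfaces to the UI when it is.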
### Bug 2: Evidence Counter Shows 0 After Dedup
**Claim**: `_deduplicate_and_rank()` can return an empty list, losing all evidence
**Verify**:
- Check `src/orchestrator.py` lines 97-114 and 219
- Trace what happens if `embeddings.deduplicate()` returns `[]`
- Is there defensive handling? Does the exception handler catch this?
- Could this be a race condition in the async code?

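The defensive handling to look for might resemble this sketch (a hypothetical helper, not the actual `_deduplicate_and_rank()` code):

```python
# Guard sketch: if deduplication wipes out every item, fall back to the
# pre-dedup evidence rather than silently losing it all.
def deduplicate_and_rank(evidence: list[dict], deduplicate) -> list[dict]:
    deduped = deduplicate(evidence)
    if not deduped and evidence:
        # Dedup removing 100% of fresh results is almost certainly an
        # upstream bug (e.g. stale vector-store state), not 20 genuine dupes.
        return evidence
    return deduped

items = [{"id": "pmid:1"}, {"id": "pmid:2"}]
kept = deduplicate_and_rank(items, lambda ev: [])  # buggy dedup returns []
```

The audit should report whether any such guard exists on the real path, or whether `[]` flows straight into the "0 total" counter.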
79
+ **Claim**: User's API key from Gradio is never passed to MagenticOrchestrator
80
+ **Verify**:
81
+ - Trace: `app.py:research_agent()` β†’ `configure_orchestrator()` β†’ `orchestrator_factory.py`
82
+ - Check if `user_api_key` is passed to `create_orchestrator()`
83
+ - Check if `MagenticOrchestrator.__init__()` receives a key
84
+ - Check `src/agents/magentic_agents.py` - do agents use `settings.openai_api_key`?
85
+
### Bug 4: Singleton EmbeddingService Cross-Session Pollution
**Claim**: The ChromaDB collection persists across requests, causing false duplicates
**Verify**:
- Check the singleton pattern in `src/services/embeddings.py`
- Is `_embedding_service` ever reset?
- What happens to the ChromaDB collection between Gradio requests?
- Could this cause "Found 20 new sources (0 total)"?

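A minimal reproduction of the suspected pattern (names are illustrative; `seen_ids` stands in for the ChromaDB collection):

```python
# Module-level singleton whose state is never cleared between sessions.
_embedding_service = None  # shared across ALL Gradio requests

class EmbeddingService:
    def __init__(self):
        self.seen_ids: set[str] = set()  # stand-in for the ChromaDB collection

    def deduplicate(self, items: list[str]) -> list[str]:
        fresh = [i for i in items if i not in self.seen_ids]
        self.seen_ids.update(items)
        return fresh

def get_embedding_service() -> EmbeddingService:
    global _embedding_service
    if _embedding_service is None:
        _embedding_service = EmbeddingService()
    return _embedding_service

# Session 1 indexes 20 sources; session 2 runs the same query and every hit
# is flagged as a duplicate: "Found 20 new sources (0 total)".
first = get_embedding_service().deduplicate([f"pmid:{i}" for i in range(20)])
second = get_embedding_service().deduplicate([f"pmid:{i}" for i in range(20)])
```

If the real singleton behaves this way, the fix direction is session-scoped collections or an explicit reset per request; the audit should just confirm the lifecycle.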
---

## Additional Bug Categories to Search For

### A. Error Handling Gaps
- [ ] Silent `except: pass` blocks
- [ ] Exceptions logged but not re-raised
- [ ] Missing error messages to user
- [ ] Swallowed API errors

### B. Async/Concurrency Issues
- [ ] Race conditions in parallel searches
- [ ] Shared mutable state across async calls
- [ ] Missing `await` keywords
- [ ] Event loop blocking (sync code in async context)

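The "missing `await`" item above fails silently, which is what makes it audit-worthy; a generic demonstration (not DeepBoner code, just the smell to look for):

```python
import asyncio

async def fetch() -> list[str]:
    # Stand-in for a search call.
    await asyncio.sleep(0)
    return ["result"]

async def main():
    results = fetch()  # BUG: missing `await` -- this binds a coroutine
    return results     # object, not a list, and raises no exception here

obj = asyncio.run(main())
is_unawaited = asyncio.iscoroutine(obj)  # how an audit confirms the symptom
obj.close()  # silence the "coroutine was never awaited" warning
```

Type checkers and `RuntimeWarning: coroutine ... was never awaited` in logs are the fastest ways to sweep for this across the async call sites.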
### C. API Integration Bugs
- [ ] Missing rate limiting
- [ ] Hardcoded timeouts that are too short
- [ ] XML/JSON parsing failures not handled
- [ ] Empty response handling

### D. State Management Issues
- [ ] Global singletons that should be session-scoped
- [ ] Gradio state not properly isolated between users
- [ ] Memory leaks from accumulated data

### E. Configuration Bugs
- [ ] Missing env var defaults
- [ ] Type mismatches in settings
- [ ] Hardcoded values that should be configurable

### F. UI/UX Bugs
- [ ] Streaming not working properly
- [ ] Misleading progress messages
- [ ] Examples not matching actual functionality
- [ ] Error messages that are not user-friendly

---
133
+
134
+ ## Output Format
135
+
136
+ Please produce a report with:
137
+
138
+ ### 1. Verification of Known Bugs
139
+ For each of the 4 bugs in P0_CRITICAL_BUGS.md:
140
+ - **CONFIRMED** or **INCORRECT** or **PARTIALLY CORRECT**
141
+ - Exact file:line references
142
+ - Any corrections or additional details
143
+
144
+ ### 2. New Bugs Found
145
+ For each new bug:
146
+ ```
147
+ ## Bug N: [Title]
148
+ **Priority**: P0/P1/P2/P3/P4
149
+ **File**: path/to/file.py:line
150
+ **Symptoms**: What the user sees
151
+ **Root Cause**: Technical explanation
152
+ **Code**:
153
+ ```python
154
+ # The buggy code
155
+ ```
156
+ **Fix**:
157
+ ```python
158
+ # The corrected code
159
+ ```
160
+ ```
161
+
162
+ ### 3. Code Quality Concerns
163
+ Any patterns that aren't bugs but could cause issues:
164
+ - Technical debt
165
+ - Missing tests for critical paths
166
+ - Unclear error handling
167
+
168
+ ### 4. Recommended Fix Order
169
+ Prioritized list of what to fix first for a working demo.
170
+
171
+ ---
172
+
173
+ ## Commands to Help Your Investigation
174
+
175
+ ```bash
176
+ # Run the tests
177
+ make check
178
+
179
+ # Test search works
180
+ uv run python -c "
181
+ import asyncio
182
+ from src.tools.pubmed import PubMedTool
183
+ async def test():
184
+ tool = PubMedTool()
185
+ results = await tool.search('female libido', 5)
186
+ print(f'Found {len(results)} results')
187
+ asyncio.run(test())
188
+ "
189
+
190
+ # Test HF inference (will show 402 if quota exhausted)
191
+ uv run python -c "
192
+ from huggingface_hub import InferenceClient
193
+ client = InferenceClient()
194
+ try:
195
+ resp = client.chat_completion(
196
+ messages=[{'role': 'user', 'content': 'Hi'}],
197
+ model='meta-llama/Llama-3.1-8B-Instruct',
198
+ max_tokens=10
199
+ )
200
+ print(resp)
201
+ except Exception as e:
202
+ print(f'Error: {e}')
203
+ "
204
+
205
+ # Test full orchestrator (simple mode)
206
+ uv run python -c "
207
+ import asyncio
208
+ from src.app import configure_orchestrator
209
+ async def test():
210
+ orch, backend = configure_orchestrator(use_mock=True, mode='simple')
211
+ print(f'Backend: {backend}')
212
+ async for event in orch.run('test query'):
213
+ print(f'{event.type}: {event.message[:50] if event.message else \"\"}'[:60])
214
+ asyncio.run(test())
215
+ "
216
+
217
+ # Check for hardcoded API keys (security)
218
+ grep -r "sk-" src/ --include="*.py" | grep -v "sk-..." | grep -v "sk-ant-..."
219
+
220
+ # Find all singletons
221
+ grep -r "_.*: .* | None = None" src/ --include="*.py"
222
+
223
+ # Find all except blocks
224
+ grep -rn "except.*:" src/ --include="*.py" | head -50
225
+ ```
226
+
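The `except` grep above is noisy; an `ast`-based pass (a small standalone sketch, not part of the repo) flags only handlers whose body is nothing but `pass`:

```python
import ast

def find_silent_excepts(source: str) -> list[int]:
    """Return line numbers of except-blocks whose body is only `pass`."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                hits.append(node.lineno)
    return hits

sample = """
try:
    risky()
except Exception:
    pass
"""
lines = find_silent_excepts(sample)
```

Run it over each file under `src/` and cross-check the hits against the error-handling checklist above.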
---

## Important Notes

1. **DO NOT fix bugs** - just document them
2. **Be thorough** - check edge cases and error paths
3. **Be specific** - include file:line references
4. **Be skeptical** - verify the claims in `P0_CRITICAL_BUGS.md` independently
5. **Think like a user** - what would break the demo experience?

The hackathon deadline is approaching and we need a working demo. Your audit will determine what gets fixed first.

---

## Deliverable

A comprehensive markdown report that:
1. Confirms or corrects the 4 known bugs
2. Lists any new bugs found (with priorities)
3. Recommends the optimal fix order
4. Can be saved as `docs/bugs/SENIOR_AUDIT_RESULTS.md`