# Senior Agent Audit Request: DeepBoner Codebase Bug Hunt
Date: 2025-11-28
Requesting Agent: Claude (Opus)
Purpose: Comprehensive bug audit and verification of P0_CRITICAL_BUGS.md
## Your Mission
You are a senior software engineer performing a comprehensive audit of the DeepBoner codebase. Your goals:
- VERIFY that the 4 bugs documented in `docs/bugs/P0_CRITICAL_BUGS.md` are accurately described
- FIND any additional bugs (P0-P4) that could affect the demo
- TRACE the complete code paths for Simple and Advanced modes
- IDENTIFY any silent failures, race conditions, or edge cases
## Context: What DeepBoner Does
DeepBoner is a Gradio-based biomedical research agent that:
- Takes a research question from the user
- Searches PubMed, ClinicalTrials.gov, Europe PMC
- Uses an LLM "judge" to evaluate if evidence is sufficient
- Either loops for more evidence or synthesizes a final report
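For orientation while tracing the code paths, the search, judge, and report flow described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `Assessment`, `run_loop`, the `fake_*` stubs, and the `max_iterations` parameter are all hypothetical names introduced here.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Assessment:
    """Hypothetical judge verdict; not the project's real model."""
    sufficient: bool
    next_queries: list[str] = field(default_factory=list)


async def run_loop(question, search, judge, synthesize, max_iterations=3):
    """Search, judge, then either loop for more evidence or synthesize a report."""
    evidence: list[str] = []
    queries = [question]
    for _ in range(max_iterations):
        for query in queries:
            evidence.extend(await search(query))
        assessment = await judge(question, evidence)
        if assessment.sufficient:
            break
        # Not enough evidence yet: search again with the judge's suggestions.
        queries = assessment.next_queries or [question]
    return await synthesize(question, evidence)


# Stubs standing in for the real search tools, LLM judge, and report writer.
async def fake_search(query):
    return [f"paper about {query}"]


async def fake_judge(question, evidence):
    # Declare the evidence sufficient once two items have been collected.
    return Assessment(sufficient=len(evidence) >= 2, next_queries=[question + " trial"])


async def fake_synthesize(question, evidence):
    return f"{question}: {len(evidence)} sources"
```

Calling `asyncio.run(run_loop("q", fake_search, fake_judge, fake_synthesize))` walks two iterations before the stub judge accepts the evidence. The real orchestrator may differ in every detail; the sketch only fixes the shape of the loop to compare against.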
Two Modes:
- Simple: Linear orchestrator with search → judge → report loop
- Advanced: Magentic multi-agent with SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent
Three Backend Options:
- Free tier: HuggingFace Inference API (Llama/Mistral)
- OpenAI: User-provided or env var key
- Anthropic: User-provided or env var key (Simple mode only)
## Files to Audit (Priority Order)
Critical Path Files:
- `src/app.py` - Gradio UI, entry point, key routing
- `src/orchestrator.py` - Simple mode main loop
- `src/orchestrator_factory.py` - Mode selection and orchestrator creation
- `src/orchestrator_magentic.py` - Advanced mode implementation
- `src/services/embeddings.py` - Deduplication singleton (KNOWN BUG)
- `src/agent_factory/judges.py` - LLM judge handlers (HF, OpenAI, Anthropic)
Supporting Files:
- `src/tools/search_handler.py` - Parallel search orchestration
- `src/tools/pubmed.py` - PubMed API integration
- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
- `src/tools/europepmc.py` - Europe PMC API
- `src/agents/magentic_agents.py` - Agent factories (KNOWN BUG: hardcoded env key)
- `src/utils/config.py` - Settings and configuration
- `src/utils/models.py` - Data models (Evidence, Citation, etc.)
## Known Bugs to Verify
### Bug 1: Free Tier LLM Quota Exhausted
Claim: HuggingFace Inference returns 402, all 3 fallback models fail
Verify:
- Check `src/agent_factory/judges.py` class `HFInferenceJudgeHandler`
- Trace the fallback chain: Llama → Mistral → Zephyr
- Confirm what happens when ALL fail (does it return default "continue"?)
- Check if the error message reaches the user or is swallowed
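As a reference point for what to look for in the fallback chain, here is a minimal sketch of a chain that still defaults to "continue" when every model fails but keeps the errors attached so a caller could surface them. `JudgeOutcome`, `judge_with_fallback`, and `always_402` are hypothetical names for illustration only and say nothing about how `HFInferenceJudgeHandler` is actually written.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JudgeOutcome:
    """Hypothetical result type used only for this illustration."""
    recommendation: str
    model: Optional[str] = None
    errors: list = field(default_factory=list)


def judge_with_fallback(prompt, models, call_model):
    """Try each model in order; if every call fails, keep the errors visible."""
    errors = []
    for model in models:
        try:
            return JudgeOutcome(call_model(model, prompt), model=model)
        except Exception as exc:  # broad on purpose: this is the pattern under audit
            errors.append(f"{model}: {exc}")
    # All models failed. Returning "continue" is the defaulting behaviour the
    # audit asks about; attaching the errors lets the caller show them.
    return JudgeOutcome("continue", errors=errors)


def always_402(model, prompt):
    """Stub that simulates an exhausted quota on every model."""
    raise RuntimeError("402 Payment Required")
```

If the real handler returns a default without anything like `errors`, the user would likely see the loop continue with no indication that the judge never ran.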
### Bug 2: Evidence Counter Shows 0 After Dedup
Claim: `_deduplicate_and_rank()` can return an empty list, losing all evidence
Verify:
- Check `src/orchestrator.py` lines 97-114 and 219
- Trace what happens if `embeddings.deduplicate()` returns `[]`
- Is there defensive handling? Does the exception handler catch this?
- Could this be a race condition in async code?
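To make the "defensive handling" question concrete, the sketch below shows one shape such a guard could take. It is an assumption about what to compare against, not the code in `src/orchestrator.py`; `dedupe_or_keep` and `broken_dedup` are hypothetical names.

```python
import asyncio


async def dedupe_or_keep(evidence, deduplicate):
    """Return deduplicated evidence, falling back to the input if it vanishes."""
    try:
        unique = await deduplicate(evidence)
    except Exception:
        # An exception handler alone does not cover the empty-list case below.
        return evidence
    if evidence and not unique:
        # Non-empty input, empty output: treat as a dedup failure, not a result.
        return evidence
    return unique


async def broken_dedup(evidence):
    """Stub that simulates every item being flagged as a duplicate."""
    return []
```

The point of the example is that an empty list is a normal return value, so a `try`/`except` around the call would not catch it; only an explicit check on the result would.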
### Bug 3: API Key Not Passed to Advanced Mode
Claim: User's API key from Gradio is never passed to `MagenticOrchestrator`
Verify:
- Trace: `app.py:research_agent()` → `configure_orchestrator()` → `orchestrator_factory.py`
- Check if `user_api_key` is passed to `create_orchestrator()`
- Check if `MagenticOrchestrator.__init__()` receives a key
- Check `src/agents/magentic_agents.py` - do agents use `settings.openai_api_key`?
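The difference between the suspected pattern and a key that is threaded through the call chain can be illustrated as below. Everything here is hypothetical: `resolve_api_key`, `EnvOnlyAgent`, `KeyAwareAgent`, and the `OPENAI_API_KEY` variable name are assumptions for the sketch, not names taken from the codebase.

```python
import os
from typing import Optional


def resolve_api_key(user_api_key: Optional[str], env_var: str = "OPENAI_API_KEY") -> Optional[str]:
    """Prefer the key typed into the UI; fall back to the environment."""
    if user_api_key and user_api_key.strip():
        return user_api_key.strip()
    return os.environ.get(env_var)


class EnvOnlyAgent:
    """The suspected pattern: the constructor has no way to accept a user key."""

    def __init__(self):
        self.api_key = os.environ.get("OPENAI_API_KEY")


class KeyAwareAgent:
    """The alternative: the key is passed down as a parameter."""

    def __init__(self, api_key: Optional[str] = None):
        self.api_key = resolve_api_key(api_key)
```

When tracing, the thing to confirm is whether any constructor on the Advanced path resembles `EnvOnlyAgent`, because at that point a user-supplied key has nowhere to go.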
### Bug 4: Singleton EmbeddingService Cross-Session Pollution
Claim: ChromaDB collection persists across requests, causing false duplicates
Verify:
- Check `src/services/embeddings.py` singleton pattern
- Is `_embedding_service` ever reset?
- What happens to ChromaDB collection between Gradio requests?
- Could this cause "Found 20 new sources (0 total)"?
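The claimed mechanism can be reproduced in miniature with a toy singleton. This is a simplified stand-in: it uses exact-match membership in a set instead of embedding similarity in ChromaDB, and `ToyEmbeddingService`, `get_service`, and `handle_request` are hypothetical names.

```python
class ToyEmbeddingService:
    """Stand-in for a vector store that remembers everything it has seen."""

    def __init__(self):
        self._seen = set()

    def deduplicate(self, items):
        fresh = [item for item in items if item not in self._seen]
        self._seen.update(fresh)
        return fresh


_service = None  # module-level singleton, shared by every request


def get_service():
    global _service
    if _service is None:
        _service = ToyEmbeddingService()
    return _service


def handle_request(results):
    """Each call models one independent user request."""
    return get_service().deduplicate(results)
```

Two identical requests in a row return the full list and then an empty list, which is the same shape as a search that finds new sources while the running total stays at 0.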
## Additional Bug Categories to Search For
### A. Error Handling Gaps
- Silent `except: pass` blocks
- Exceptions logged but not re-raised
- Missing error messages to user
- Swallowed API errors
### B. Async/Concurrency Issues
- Race conditions in parallel searches
- Shared mutable state across async calls
- Missing `await` keywords
- Event loop blocking (sync code in async context)
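Shared mutable state is the easiest of these to miss, because single-threaded asyncio code can still lose updates whenever an `await` sits between a read and a write. The following self-contained sketch shows the effect and one conventional remedy (`asyncio.Lock`); the function names are illustrative only.

```python
import asyncio


async def unsafe_increment(state):
    current = state["n"]
    await asyncio.sleep(0)  # yields to the event loop between read and write
    state["n"] = current + 1


async def safe_increment(state, lock):
    async with lock:  # the read-modify-write is now one critical section
        current = state["n"]
        await asyncio.sleep(0)
        state["n"] = current + 1


async def demo():
    """Run ten increments each way and return both final counts."""
    unsafe_state = {"n": 0}
    await asyncio.gather(*(unsafe_increment(unsafe_state) for _ in range(10)))

    safe_state = {"n": 0}
    lock = asyncio.Lock()
    await asyncio.gather(*(safe_increment(safe_state, lock) for _ in range(10)))
    return unsafe_state["n"], safe_state["n"]
```

In the unsafe version every task reads the counter before any task writes it, so nine of the ten increments are lost. The same pattern applies to lists or dicts of evidence mutated from parallel search tasks.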
### C. API Integration Bugs
- Missing rate limiting
- Hardcoded timeouts that are too short
- XML/JSON parsing failures not handled
- Empty response handling
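When reviewing the search tools, it helps to have a picture of a parser that handles empty and malformed responses without hiding the reason. The sketch below assumes PubMed-style XML containing `ArticleTitle` elements; `parse_article_titles` and its `(titles, error)` return shape are hypothetical and may not match how the project's tools are structured.

```python
import xml.etree.ElementTree as ET


def parse_article_titles(xml_text):
    """Return (titles, error); error is None when parsing succeeded."""
    if not xml_text or not xml_text.strip():
        return [], "empty response body"
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Typical cause: an HTML error page returned instead of XML.
        return [], f"malformed XML: {exc}"
    titles = [
        element.text.strip()
        for element in root.iter("ArticleTitle")
        if element.text and element.text.strip()
    ]
    return titles, None
```

The design choice worth checking for in the real code is the second element of the tuple: returning a bare empty list in all three cases would make a rate-limit page indistinguishable from a query with no hits.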
### D. State Management Issues
- Global singletons that should be session-scoped
- Gradio state not properly isolated between users
- Memory leaks from accumulated data
### E. Configuration Bugs
- Missing env var defaults
- Type mismatches in settings
- Hardcoded values that should be configurable
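For comparison with whatever `src/utils/config.py` does, here is a minimal sketch of reading an integer setting with a default and an explicit type check. `env_int` and the `DEMO_MAX_ITERATIONS` variable used in the usage note are hypothetical; the project may rely on a settings library instead.

```python
import os


def env_int(name, default, minimum=None):
    """Read an integer setting, with a default and a loud failure on bad input."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default  # missing or blank: fall back instead of failing later
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if minimum is not None and value < minimum:
        raise ValueError(f"{name} must be >= {minimum}, got {value}")
    return value
```

For example, `env_int("DEMO_MAX_ITERATIONS", 5, minimum=1)` returns 5 when the variable is unset and raises immediately when it is set to a non-numeric string, rather than letting a string reach code that expects a number.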
### F. UI/UX Bugs
- Streaming not working properly
- Progress messages misleading
- Examples not matching actual functionality
- Error messages not user-friendly
## Output Format
Please produce a report with:
### 1. Verification of Known Bugs
For each of the 4 bugs in P0_CRITICAL_BUGS.md:
- CONFIRMED or INCORRECT or PARTIALLY CORRECT
- Exact file:line references
- Any corrections or additional details
### 2. New Bugs Found
For each new bug:
````markdown
## Bug N: [Title]
**Priority**: P0/P1/P2/P3/P4
**File**: path/to/file.py:line
**Symptoms**: What the user sees
**Root Cause**: Technical explanation
**Code**:
```python
# The buggy code
```
**Fix**:
```python
# The corrected code
```
````
### 3. Code Quality Concerns
Any patterns that aren't bugs but could cause issues:
- Technical debt
- Missing tests for critical paths
- Unclear error handling
### 4. Recommended Fix Order
Prioritized list of what to fix first for a working demo.
---
## Commands to Help Your Investigation
```bash
# Run the tests
make check

# Test search works
uv run python -c "
import asyncio
from src.tools.pubmed import PubMedTool
async def test():
    tool = PubMedTool()
    results = await tool.search('female libido', 5)
    print(f'Found {len(results)} results')
asyncio.run(test())
"

# Test HF inference (will show 402 if quota exhausted)
uv run python -c "
from huggingface_hub import InferenceClient
client = InferenceClient()
try:
    resp = client.chat_completion(
        messages=[{'role': 'user', 'content': 'Hi'}],
        model='meta-llama/Llama-3.1-8B-Instruct',
        max_tokens=10
    )
    print(resp)
except Exception as e:
    print(f'Error: {e}')
"

# Test full orchestrator (simple mode)
uv run python -c "
import asyncio
from src.app import configure_orchestrator
async def test():
    orch, backend = configure_orchestrator(use_mock=True, mode='simple')
    print(f'Backend: {backend}')
    async for event in orch.run('test query'):
        print(f'{event.type}: {event.message[:50] if event.message else \"\"}'[:60])
asyncio.run(test())
"

# Check for hardcoded API keys (security)
# -F matches the placeholders literally; as a regex, "sk-..." would also hide real keys
grep -r "sk-" src/ --include="*.py" | grep -vF "sk-..." | grep -vF "sk-ant-..."

# Find all singletons
grep -r "_.*: .* | None = None" src/ --include="*.py"

# Find all except blocks
grep -rn "except.*:" src/ --include="*.py" | head -50
```
## Important Notes
- DO NOT fix bugs - just document them
- Be thorough - check edge cases and error paths
- Be specific - include file:line references
- Be skeptical - verify claims in P0_CRITICAL_BUGS.md independently
- Think like a user - what would break the demo experience?
The hackathon deadline is approaching. We need a working demo. Your audit will determine what gets fixed first.
## Deliverable
A comprehensive markdown report that:
- Confirms or corrects the 4 known bugs
- Lists any new bugs found (with priority)
- Recommends the optimal fix order
- Can be saved as `docs/bugs/SENIOR_AUDIT_RESULTS.md`