VibecoderMcSwaggins committed on
Commit 153a9c0 · unverified · 2 parents: 6b5e05b dbe535c

Merge pull request #62 from The-Obstacle-Is-The-Way/dev


fix: resolve all P0-P3 bugs (termination, streaming, thinking state)

docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -1,39 +1,54 @@
  # Active Bugs

- > Last updated: 2025-11-28

- ## P0 - Critical

- ### Magentic Mode Report Generation
- **File**: [FIX_PLAN_MAGENTIC_MODE.md](./FIX_PLAN_MAGENTIC_MODE.md)

- **Symptom**: Magentic mode returns `ChatMessage` object instead of synthesized report text.

- **Root Cause**:
- - `event.message.text` extraction fails in orchestrator
- - `max_rounds=3` too low for SearchAgent + JudgeAgent + ReportAgent sequence

- **Workaround**: Use Simple mode (default) - works correctly with all LLM providers.

- **Status**: Fix plan documented, not yet implemented.

- ---

- ## P1 - Minor UX

- ### Gradio Settings Accordion Won't Collapse
- **File**: [P1_GRADIO_SETTINGS_CLEANUP.md](./P1_GRADIO_SETTINGS_CLEANUP.md)

- **Symptom**: Settings accordion stays open after user interaction.

- **Root Cause**: Nested `gr.Blocks` context prevents accordion state management.

- **Impact**: UX only - all functionality works correctly.

- **Status**: Solution documented, not yet implemented.

  ---

- ## Resolved Bugs

- *None currently - bugs above are still open.*
  # Active Bugs

+ > Last updated: 2025-11-29

+ ## P3 - Edge Case

+ *(None)*

+ ---

+ ## Resolved Bugs

+ ### ~~P3 - Magentic Mode Missing Termination Guarantee~~ FIXED
+ **Commit**: `d36ce3c` (2025-11-29)

+ - Added `final_event_received` tracking in `orchestrator_magentic.py`
+ - Added fallback yield for "max iterations reached" scenario
+ - Verified with `test_magentic_termination.py`
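The termination guarantee can be sketched as a generator that tracks whether a final event arrived and emits a fallback one otherwise. Event shapes and names below are illustrative assumptions, not the real `agent-framework` API:

```python
def drain_events(events, max_iterations=10):
    """Forward workflow events, guaranteeing exactly one terminal event."""
    final_event_received = False
    for i, event in enumerate(events):
        if i >= max_iterations:
            break  # hard stop: the workflow ran too long
        if event.get("type") == "final":
            final_event_received = True
        yield event
    if not final_event_received:
        # Fallback yield: the UI always gets a terminal event,
        # even when the workflow hits the iteration cap silently.
        yield {"type": "final", "message": "Max iterations reached without a final result."}
```

The invariant being tested in `test_magentic_termination.py` is presumably the same: the last event is always terminal.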
+ ### ~~P0 - Magentic Mode Report Generation~~ FIXED
+ **Commit**: `9006d69` (2025-11-29)

+ - Fixed `_extract_text()` to handle various message object formats
+ - Increased `max_rounds=10` (was 3)
+ - Added `temperature=1.0` for reasoning model compatibility
+ - Advanced mode now produces full research reports

+ ### ~~P1 - Streaming Spam + API Key Persistence~~ FIXED
+ **Commit**: `0c9be4a` (2025-11-29)

+ - Streaming events now buffered (not token-by-token spam)
+ - API key persists across example clicks via `gr.State`
+ - Examples use explicit `None` values to avoid overwriting keys
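Buffering token events as described above might look like the following sketch; the flush threshold and the token-iterable shape are assumptions for illustration:

```python
def buffer_stream(tokens, flush_every=8):
    """Coalesce token-level stream events into chunks so the chat UI
    renders a few updates instead of one line per token."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) >= flush_every:
            yield "".join(buf)
            buf = []
    if buf:  # flush the remainder
        yield "".join(buf)
```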
+ ### ~~P2 - Missing "Thinking" State~~ FIXED
+ **Commit**: `9006d69` (2025-11-29)

+ - Added `"thinking"` event type with hourglass icon
+ - Yields "Multi-agent reasoning in progress..." before blocking workflow call
+ - Users now see feedback during 2-5 minute initial processing
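A minimal sketch of the "thinking" event pattern: yield a status event before awaiting the long-running call. The event dicts and the `run_workflow` callable are illustrative, not the real orchestrator API:

```python
import asyncio

async def research_with_feedback(run_workflow):
    """Yield a 'thinking' status before the blocking multi-agent call,
    then the final report, so users see immediate feedback."""
    yield {"type": "thinking", "message": "⏳ Multi-agent reasoning in progress..."}
    report = await run_workflow()  # may take minutes in practice
    yield {"type": "final", "message": report}
```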
+ ### ~~P1 - Gradio Settings Accordion~~ WONTFIX
+
+ Decision: Removed nested Blocks, using ChatInterface directly.
+ Accordion behavior is default Gradio - acceptable for demo.

  ---

+ ## How to Report Bugs

+ 1. Create `docs/bugs/P{N}_{SHORT_NAME}.md`
+ 2. Include: Symptom, Root Cause, Fix Plan, Test Plan
+ 3. Update this index
+ 4. Priority: P0=blocker, P1=important, P2=UX, P3=edge case
docs/bugs/FIX_PLAN_CRITICAL_BUGS.md DELETED
@@ -1,36 +0,0 @@
- # Fix Plan: Critical Bugs (P0)
-
- **Date**: 2025-11-28
- **Status**: COMPLETED (2025-11-29)
- **Based on**: `docs/bugs/SENIOR_AUDIT_RESULTS.md`
-
- ---
-
- ## Summary of Fixes
-
- ### 1. Fixed Data Leak (Bug 4 & 2)
- - **Action**: Removed singleton `_embedding_service` in `src/services/embeddings.py`.
- - **Action**: Updated `EmbeddingService.__init__` to use a unique collection name (`evidence_{uuid}`) for complete isolation per instance.
- - **Action**: Refactored `SentenceTransformer` loading to a shared global to maintain performance while isolating state.
- - **Verified**: Unit tests passed, including new isolation verification.
-
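The isolation pattern described above, reduced to its essentials. The model stand-in and names are simplified assumptions; the real code loads a `SentenceTransformer` and creates a ChromaDB collection:

```python
import uuid

_MODEL_CACHE: dict = {}  # heavyweight model, loaded once and shared

def _load_model(name: str):
    """Lazily cache the embedding model; loading it per request is too slow."""
    if name not in _MODEL_CACHE:
        _MODEL_CACHE[name] = object()  # stand-in for SentenceTransformer(name)
    return _MODEL_CACHE[name]

class EmbeddingService:
    """Per-request service: shared read-only model, isolated collection."""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        self.model = _load_model(model_name)
        # Unique collection name prevents cross-session data leaks
        self.collection_name = f"evidence_{uuid.uuid4().hex}"
```

Two instances share one model object but never share a collection, which is exactly what the isolation test verifies.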
- ### 2. Fixed Advanced Mode BYOK (Bug 3)
- - **Action**: Updated `create_orchestrator` in `src/orchestrator_factory.py` to accept `api_key`.
- - **Action**: Updated `MagenticOrchestrator` to accept and use the `api_key` for the manager and agents.
- - **Action**: Updated `src/app.py` to pass the user's API key during orchestrator configuration.
- - **Verified**: `test_dual_mode_e2e.py` passed.
-
- ### 3. Fixed Free Tier Experience (Bug 1)
- - **Action**: Updated `HFInferenceJudgeHandler` in `src/agent_factory/judges.py` to catch 402 (Payment Required) errors.
- - **Action**: Added logic to return a "synthesize" assessment with a clear error message when quota is exhausted, stopping the infinite loop.
- - **Verified**: Unit tests passed.
-
- ---
-
- ## Verification
-
- All changes have been verified with:
- - `make check` (lint, typecheck, test) - ALL PASSED
- - Custom reproduction script for isolation - PASSED
-
- The system is now stable for the hackathon demo.
docs/bugs/FIX_PLAN_MAGENTIC_MODE.md DELETED
@@ -1,227 +0,0 @@
- # Fix Plan: Magentic Mode Report Generation
-
- **Related Bug**: `P0_MAGENTIC_MODE_BROKEN.md`
- **Approach**: Test-Driven Development (TDD)
- **Estimated Scope**: 4 tasks, ~2-3 hours
-
- ---
-
- ## Problem Summary
-
- Magentic mode runs but fails to produce readable reports due to:
-
- 1. **Primary Bug**: `MagenticFinalResultEvent.message` returns `ChatMessage` object, not text
- 2. **Secondary Bug**: Max rounds (3) reached before ReportAgent completes
- 3. **Tertiary Issues**: Stale "bioRxiv" references in prompts
-
- ---
-
- ## Fix Order (TDD)
-
- ### Phase 1: Write Failing Tests
-
- **Task 1.1**: Create test for ChatMessage text extraction
-
- ```python
- # tests/unit/test_orchestrator_magentic.py
-
- def test_process_event_extracts_text_from_chat_message():
-     """Final result event should extract text from ChatMessage object."""
-     # Arrange: Mock ChatMessage with .content attribute
-     # Act: Call _process_event with MagenticFinalResultEvent
-     # Assert: Returned AgentEvent.message is a string, not object repr
- ```
-
- **Task 1.2**: Create test for max rounds configuration
36
-
37
- ```python
38
- def test_orchestrator_uses_configured_max_rounds():
39
- """MagenticOrchestrator should use max_rounds from constructor."""
40
- # Arrange: Create orchestrator with max_rounds=10
41
- # Act: Build workflow
42
- # Assert: Workflow has max_round_count=10
43
- ```
44
-
45
- **Task 1.3**: Create test for bioRxiv reference removal
46
-
47
- ```python
48
- def test_task_prompt_references_europe_pmc():
49
- """Task prompt should reference Europe PMC, not bioRxiv."""
50
- # Arrange: Create orchestrator
51
- # Act: Check task string in run()
52
- # Assert: Contains "Europe PMC", not "bioRxiv"
53
- ```
54
-
55
- ---
56
-
57
- ### Phase 2: Fix ChatMessage Text Extraction
58
-
59
- **File**: `src/orchestrator_magentic.py`
60
- **Lines**: 192-199
61
-
62
- **Current Code**:
63
- ```python
64
- elif isinstance(event, MagenticFinalResultEvent):
65
- text = event.message.text if event.message else "No result"
66
- ```
67
-
68
- **Fixed Code**:
69
- ```python
70
- elif isinstance(event, MagenticFinalResultEvent):
71
- if event.message:
72
- # ChatMessage may have .content or .text depending on version
73
- if hasattr(event.message, 'content') and event.message.content:
74
- text = str(event.message.content)
75
- elif hasattr(event.message, 'text') and event.message.text:
76
- text = str(event.message.text)
77
- else:
78
- # Fallback: convert entire message to string
79
- text = str(event.message)
80
- else:
81
- text = "No result generated"
82
- ```
83
-
84
- **Why**: The `agent_framework.ChatMessage` object structure may vary. We need defensive extraction.
85
-
86
- ---
87
-
88
- ### Phase 3: Fix Max Rounds Configuration
89
-
90
- **File**: `src/orchestrator_magentic.py`
91
- **Lines**: 97-99
92
-
93
- **Current Code**:
94
- ```python
95
- .with_standard_manager(
96
- chat_client=manager_client,
97
- max_round_count=self._max_rounds, # Already uses config
98
- max_stall_count=3,
99
- max_reset_count=2,
100
- )
101
- ```
102
-
103
- **Issue**: Default `max_rounds` in `__init__` is 10, but workflow may need more for complex queries.
104
-
105
- **Fix**: Verify the value flows through correctly. Add logging.
106
-
107
- ```python
108
- logger.info(
109
- "Building Magentic workflow",
110
- max_rounds=self._max_rounds,
111
- max_stall=3,
112
- max_reset=2,
113
- )
114
- ```
115
-
116
- **Also check**: `src/orchestrator_factory.py` passes config correctly:
117
- ```python
118
- return MagenticOrchestrator(
119
- max_rounds=config.max_iterations if config else 10,
120
- )
121
- ```
122
-
123
- ---
124
-
125
- ### Phase 4: Fix Stale bioRxiv References
126
-
127
- **Files to update**:
128
-
129
- | File | Line | Change |
130
- |------|------|--------|
131
- | `src/orchestrator_magentic.py` | 131 | "bioRxiv" → "Europe PMC" |
132
- | `src/agents/magentic_agents.py` | 32-33 | "bioRxiv" → "Europe PMC" |
133
- | `src/app.py` | 202-203 | "bioRxiv" → "Europe PMC" |
134
-
135
- **Search command to verify**:
136
- ```bash
137
- grep -rn "bioRxiv\|biorxiv" src/
138
- ```
139
-
140
- ---
141
-
142
- ## Implementation Checklist
143
-
144
- ```
145
- [ ] Phase 1: Write failing tests
146
- [ ] 1.1 Test ChatMessage text extraction
147
- [ ] 1.2 Test max rounds configuration
148
- [ ] 1.3 Test Europe PMC references
149
-
150
- [ ] Phase 2: Fix ChatMessage extraction
151
- [ ] Update _process_event() in orchestrator_magentic.py
152
- [ ] Run test 1.1 - should pass
153
-
154
- [ ] Phase 3: Fix max rounds
155
- [ ] Add logging to _build_workflow()
156
- [ ] Verify factory passes config correctly
157
- [ ] Run test 1.2 - should pass
158
-
159
- [ ] Phase 4: Fix bioRxiv references
160
- [ ] Update orchestrator_magentic.py task prompt
161
- [ ] Update magentic_agents.py descriptions
162
- [ ] Update app.py UI text
163
- [ ] Run test 1.3 - should pass
164
- [ ] Run grep to verify no remaining refs
165
-
166
- [ ] Final Verification
167
- [ ] make check passes
168
- [ ] All tests pass (108+)
169
- [ ] Manual test: run_magentic.py produces readable report
170
- ```
171
-
172
- ---
173
-
174
- ## Test Commands
175
-
176
- ```bash
177
- # Run specific test file
178
- uv run pytest tests/unit/test_orchestrator_magentic.py -v
179
-
180
- # Run all tests
181
- uv run pytest tests/unit/ -v
182
-
183
- # Full check
184
- make check
185
-
186
- # Manual integration test
187
- set -a && source .env && set +a
188
- uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
189
- ```
190
-
191
- ---
192
-
193
- ## Success Criteria
194
-
195
- 1. `run_magentic.py` outputs a readable research report (not `<ChatMessage object>`)
196
- 2. Report includes: Executive Summary, Key Findings, Drug Candidates, References
197
- 3. No "Max round count reached" error with default settings
198
- 4. No "bioRxiv" references anywhere in codebase
199
- 5. All 108+ tests pass
200
- 6. `make check` passes
201
-
202
- ---
203
-
204
- ## Files Modified
205
-
206
- ```
207
- src/
208
- ├── orchestrator_magentic.py # ChatMessage fix, logging
209
- ├── agents/magentic_agents.py # bioRxiv → Europe PMC
210
- └── app.py # bioRxiv → Europe PMC
211
-
212
- tests/unit/
213
- └── test_orchestrator_magentic.py # NEW: 3 tests
214
- ```
215
-
216
- ---
217
-
218
- ## Notes for AI Agent
219
-
220
- When implementing this fix plan:
221
-
222
- 1. **DO NOT** create mock data or fake responses
223
- 2. **DO** write real tests that verify actual behavior
224
- 3. **DO** run `make check` after each phase
225
- 4. **DO** test with real OpenAI API key via `.env`
226
- 5. **DO** preserve existing functionality - simple mode must still work
227
- 6. **DO NOT** over-engineer - minimal changes to fix the specific bugs
docs/bugs/FIX_UI_SIMPLIFICATION.md DELETED
@@ -1,314 +0,0 @@
- # UI Simplification: Remove API Provider Dropdown
-
- **Issues**: #52, #53
- **Priority**: P1 - UX improvement for hackathon demo
- **Estimated Time**: 30 minutes
- **Senior Review**: ✅ Approved with changes (incorporated below)
-
- ---
-
- ## Problem
-
- The current UI has confusing BYOK (Bring Your Own Key) settings:
-
- 1. **Provider dropdown is misleading** - Shows "openai" but actually uses free tier when no key
- 2. **Examples table shows useless columns** - API Key (empty), Provider (ignored)
- 3. **Anthropic doesn't work with Advanced mode** - Only OpenAI has `agent-framework` support
-
- ## Solution
-
- Remove the `api_provider` dropdown entirely. Auto-detect the provider from the key prefix.
-
- **Functionality preserved:**
- - Simple mode: Free tier, OpenAI, OR Anthropic (all work)
- - Advanced mode: OpenAI only (Magentic multi-agent requires `OpenAIChatClient`)
-
- ---
-
- ## Implementation
-
- ### File: `src/app.py`
-
- #### Change 1: Update `configure_orchestrator()` signature (lines 23-28)
-
- ```python
- # BEFORE
- def configure_orchestrator(
-     use_mock: bool = False,
-     mode: str = "simple",
-     user_api_key: str | None = None,
-     api_provider: str = "openai",  # ← REMOVE
- ) -> tuple[Any, str]:
-
- # AFTER
- def configure_orchestrator(
-     use_mock: bool = False,
-     mode: str = "simple",
-     user_api_key: str | None = None,
- ) -> tuple[Any, str]:
- ```
-
- #### Change 2: Update docstring (lines 29-40)
-
- ```python
- # AFTER
- """
- Create an orchestrator instance.
-
- Args:
-     use_mock: If True, use MockJudgeHandler (no API key needed)
-     mode: Orchestrator mode ("simple" or "advanced")
-     user_api_key: Optional user-provided API key (BYOK) - auto-detects provider
-
- Returns:
-     Tuple of (Orchestrator instance, backend_name)
- """
- ```
-
- #### Change 3: Replace provider logic with auto-detection (lines 62-88)
-
- ```python
- # BEFORE (lines 62-88) - complex provider checking with api_provider param
-
- # AFTER - auto-detect from key prefix
- # 2. Paid API Key (User provided or Env)
- elif user_api_key and user_api_key.strip():
-     # Auto-detect provider from key prefix
-     model: AnthropicModel | OpenAIModel
-     if user_api_key.startswith("sk-ant-"):
-         # Anthropic key
-         anthropic_provider = AnthropicProvider(api_key=user_api_key)
-         model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
-         backend_info = "Paid API (Anthropic)"
-     elif user_api_key.startswith("sk-"):
-         # OpenAI key
-         openai_provider = OpenAIProvider(api_key=user_api_key)
-         model = OpenAIModel(settings.openai_model, provider=openai_provider)
-         backend_info = "Paid API (OpenAI)"
-     else:
-         raise ValueError(
-             "Invalid API key format. Expected sk-... (OpenAI) or sk-ant-... (Anthropic)"
-         )
-     judge_handler = JudgeHandler(model=model)
-
- # 3. Environment API Keys (fallback)
- elif os.getenv("OPENAI_API_KEY"):
-     judge_handler = JudgeHandler(model=None)  # Uses env key
-     backend_info = "Paid API (OpenAI from env)"
-
- elif os.getenv("ANTHROPIC_API_KEY"):
-     judge_handler = JudgeHandler(model=None)  # Uses env key
-     backend_info = "Paid API (Anthropic from env)"
-
- # 4. Free Tier (HuggingFace Inference)
- else:
-     judge_handler = HFInferenceJudgeHandler()
-     backend_info = "Free Tier (Llama 3.1 / Mistral)"
- ```
-
- #### Change 4: Update `research_agent()` signature (lines 105-111)
-
- ```python
- # BEFORE
- async def research_agent(
-     message: str,
-     history: list[dict[str, Any]],
-     mode: str = "simple",
-     api_key: str = "",
-     api_provider: str = "openai",  # ← REMOVE
- ) -> AsyncGenerator[str, None]:
-
- # AFTER
- async def research_agent(
-     message: str,
-     history: list[dict[str, Any]],
-     mode: str = "simple",
-     api_key: str = "",
- ) -> AsyncGenerator[str, None]:
- ```
-
- #### Change 5: Update docstring (lines 112-124)
-
- ```python
- # AFTER
- """
- Gradio chat function that runs the research agent.
-
- Args:
-     message: User's research question
-     history: Chat history (Gradio format)
-     mode: Orchestrator mode ("simple" or "advanced")
-     api_key: Optional user-provided API key (BYOK - auto-detects provider)
-
- Yields:
-     Markdown-formatted responses for streaming
- """
- ```
-
- #### Change 6: Fix Advanced mode check (line 139)
-
- ```python
- # BEFORE
- if mode == "advanced" and not (has_openai or (has_user_key and api_provider == "openai")):
-
- # AFTER - auto-detect OpenAI key from prefix
- is_openai_user_key = (
-     user_api_key
-     and user_api_key.startswith("sk-")
-     and not user_api_key.startswith("sk-ant-")
- )
- if mode == "advanced" and not (has_openai or is_openai_user_key):
-     yield (
-         "⚠️ **Advanced mode requires OpenAI API key.** "
-         "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
-     )
-     mode = "simple"
- ```
-
- #### Change 7: Remove premature "Using your key" message (lines 146-151)
-
- ```python
- # BEFORE - uses api_provider which no longer exists
- if has_user_key:
-     yield (
-         f"🔑 **Using your {api_provider.upper()} API key** - "
-         "Your key is used only for this session and is never stored.\n\n"
-     )
-
- # AFTER - remove this block entirely
- # The backend_name from configure_orchestrator already shows
- # "Paid API (OpenAI)" or "Paid API (Anthropic)" - no need for duplicate messaging
- ```
-
- #### Change 8: Update configure_orchestrator call (lines 165-170)
-
- ```python
- # BEFORE
- orchestrator, backend_name = configure_orchestrator(
-     use_mock=False,
-     mode=mode,
-     user_api_key=user_api_key,
-     api_provider=api_provider,  # ← REMOVE
- )
-
- # AFTER
- orchestrator, backend_name = configure_orchestrator(
-     use_mock=False,
-     mode=mode,
-     user_api_key=user_api_key,
- )
- ```
-
- #### Change 9: Simplify examples (lines 210-229)
-
- ```python
- # BEFORE - 4 items per example
- examples=[
-     ["What drugs improve female libido post-menopause?", "simple", "", "openai"],
-     ["Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?", "simple", "", "openai"],
-     ["Evidence for testosterone therapy in women with HSDD?", "simple", "", "openai"],
- ],
-
- # AFTER - 2 items per example (query, mode) - API key always empty in examples
- examples=[
-     ["What drugs improve female libido post-menopause?", "simple"],
-     ["Clinical trials for ED alternatives to PDE5 inhibitors?", "simple"],
-     ["Evidence for testosterone therapy in women with HSDD?", "simple"],
- ],
- ```
-
- #### Change 10: Update additional_inputs (lines 231-252)
-
- ```python
- # BEFORE - 3 inputs (mode, api_key, api_provider)
- additional_inputs=[
-     gr.Radio(
-         choices=["simple", "advanced"],
-         value="simple",
-         label="Orchestrator Mode",
-         info="Simple: Linear (Free Tier Friendly) | Advanced: Multi-Agent (Requires OpenAI)",
-     ),
-     gr.Textbox(
-         label="🔑 API Key (Optional - BYOK)",
-         placeholder="sk-... or sk-ant-...",
-         type="password",
-         info="Enter your own API key. Never stored.",
-     ),
-     gr.Radio(  # ← REMOVE THIS ENTIRE BLOCK
-         choices=["openai", "anthropic"],
-         value="openai",
-         label="API Provider",
-         info="Select the provider for your API key",
-     ),
- ],
-
- # AFTER - 2 inputs (mode, api_key)
- additional_inputs=[
-     gr.Radio(
-         choices=["simple", "advanced"],
-         value="simple",
-         label="Orchestrator Mode",
-         info="Simple: Works with any key or free tier | Advanced: Requires OpenAI key",
-     ),
-     gr.Textbox(
-         label="🔑 API Key (Optional)",
-         placeholder="sk-... (OpenAI) or sk-ant-... (Anthropic)",
-         type="password",
-         info="Leave empty for free tier. Auto-detects provider from key prefix.",
-     ),
- ],
- ```
-
- #### Change 11: Update accordion label (line 230)
-
- ```python
- # BEFORE
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
-
- # AFTER
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings (Free tier works without API key)", open=False),
- ```
-
- ---
-
- ## Testing Checklist
-
- ### Manual Tests
- - [ ] **No key**: Shows "Free Tier (Llama 3.1 / Mistral)" in backend
- - [ ] **OpenAI key (sk-...)**: Shows "Paid API (OpenAI)" in backend
- - [ ] **Anthropic key (sk-ant-...)**: Shows "Paid API (Anthropic)" in backend
- - [ ] **Invalid key format**: Shows error message
- - [ ] **Anthropic key + Advanced mode**: Falls back to Simple with warning
- - [ ] **OpenAI key + Advanced mode**: Uses full Magentic multi-agent
- - [ ] **Examples table**: Shows only 2 columns (query, mode)
- - [ ] **MCP server**: Still accessible at `/gradio_api/mcp/`
-
- ### Unit Test Updates
- - [ ] `tests/unit/test_app_smoke.py` - may need update if checking input count
-
- ---
-
- ## Definition of Done
-
- - [ ] `api_provider` parameter removed from `configure_orchestrator()`
- - [ ] `api_provider` parameter removed from `research_agent()`
- - [ ] Auto-detection logic works for `sk-` and `sk-ant-` prefixes
- - [ ] Advanced mode check uses auto-detection (not removed param)
- - [ ] "Using your X key" message removed (backend_name handles this)
- - [ ] Examples table shows 2 columns
- - [ ] Accordion label updated
- - [ ] Placeholder text shows both key formats
- - [ ] All existing tests pass
- - [ ] MCP server still works
-
- ---
-
- ## Mode Compatibility Matrix (Unchanged)
-
- | Mode | No Key | OpenAI Key | Anthropic Key |
- |------|--------|------------|---------------|
- | **Simple** | ✅ Free tier | ✅ GPT-5.1 | ✅ Claude Sonnet 4.5 |
- | **Advanced** | ⚠️ Falls back | ✅ Full Magentic | ⚠️ Falls back to Simple |
-
- ---
-
- ## Related
- - Issue #52: UI Polish - Examples table confusion
- - Issue #53: API Provider Simplification
- - Senior Review: Approved 2025-11-28
docs/bugs/INVESTIGATION_INVALID_MODELS.md DELETED
@@ -1,31 +0,0 @@
- # Bug Investigation: Invalid Default LLM Models
-
- ## Status
- - **Date:** 2025-11-29
- - **Reporter:** CLI User
- - **Component:** `src/utils/config.py`
- - **Priority:** High (Magentic Mode Blocker)
- - **Resolution:** FIXED
-
- ## Issue Description
- The user encountered a 403 error when running in Magentic mode:
- `Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5', ... 'code': 'model_not_found'}}`
-
- ## Root Cause Analysis
- OpenAI deprecated the base `gpt-5` model. Tier 5 accounts now have access to:
- - `gpt-5.1` (current flagship)
- - `gpt-5-mini`
- - `gpt-5-nano`
- - `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`
- - `o3`, `o4-mini`
-
- The base `gpt-5` is NO LONGER available via API.
-
- ## Solution Implemented
- Updated `src/utils/config.py` to use:
- - `openai_model`: `gpt-5.1` (the actual current model)
- - `anthropic_model`: `claude-sonnet-4-5-20250929` (unchanged)
-
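A minimal sketch of the resulting defaults. The real `src/utils/config.py` may well use pydantic settings; this frozen dataclass only illustrates the pinned model names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMSettings:
    # Base `gpt-5` now returns 403 model_not_found; pin the current names
    openai_model: str = "gpt-5.1"
    anthropic_model: str = "claude-sonnet-4-5-20250929"
```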
- ## Verification
- - `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
- - User confirmed Tier 5 access to `gpt-5.1` via OpenAI dashboard.
docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md DELETED
@@ -1,49 +0,0 @@
- # Bug Investigation: HF Free Tier Quota Exhaustion
-
- ## Status
- - **Date:** 2025-11-29
- - **Reporter:** CLI User
- - **Component:** `HFInferenceJudgeHandler`
- - **Priority:** High (UX Blocker for Free Tier)
- - **Resolution:** FIXED
-
- ## Issue Description
- On a fresh run with a simple query ("What drugs improve female libido post-menopause?"), the system retrieved 20 valid sources but failed during the Judge/Analysis phase with:
- `⚠️ Free Tier Quota Exceeded ⚠️`
-
- This results in a "Synthesis" step that has 0 candidates and 0 findings, rendering the application useless for free users once the (very low) limit is hit, despite having valid search results.
-
- ## Evidence
- Output provided:
- ```text
- ### Citations (20 sources)
- ...
- ### Reasoning
- ⚠️ **Free Tier Quota Exceeded** ⚠️
- ```
-
- ## Root Cause Analysis
- 1. **Search Success:** `SearchAgent` correctly found 20 documents (PubMed/EuropePMC).
- 2. **Judge Failure:** `HFInferenceJudgeHandler` called the HF Inference API.
- 3. **Quota Trap:** The API returned a 402 (Payment Required) or Quota error.
- 4. **Previous Handling:** The handler caught this error and returned a `JudgeAssessment` with `sufficient=True` (to stop the loop) and *empty* fields.
- 5. **Data Loss:** The 20 valid search results were effectively discarded from the "Analysis" perspective.
-
- ## The "Deep Blocker"
- The system had a "hard failure" mode for quota exhaustion, assuming that if the LLM can't judge, we have *no* useful information. This "bricked" the UX for free users immediately upon hitting the limit.
-
- ## Solution Implemented
- Modified `HFInferenceJudgeHandler._create_quota_exhausted_assessment` to:
- 1. Accept the `evidence` list as an argument.
- 2. Perform basic heuristic extraction (borrowed from `MockJudgeHandler` logic):
-    - Use titles as "Key Findings" (first 5 sources).
-    - Add a clear message in "Drug Candidates" telling the user to upgrade.
- 3. Return this "Partial" assessment instead of an empty one.
-
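The heuristic fallback can be sketched as follows; the dict shapes are simplified assumptions, since the real code returns a `JudgeAssessment` model:

```python
def quota_exhausted_assessment(evidence: list[dict]) -> dict:
    """Salvage search results when the judge LLM returns 402,
    instead of discarding them in an empty assessment."""
    return {
        "sufficient": True,  # stops the retry loop
        "key_findings": [doc["title"] for doc in evidence[:5]],
        "drug_candidates": ["Upgrade to a paid API key for full analysis"],
        "reasoning": "⚠️ Free Tier Quota Exceeded ⚠️ Showing partial results from search.",
    }
```

Setting `sufficient=True` is the key design choice: it converts a hard failure into a graceful stop while still surfacing the retrieved evidence.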
43
- ## Verification
44
- - Created `tests/unit/agent_factory/test_judges_hf_quota.py` to verify that:
45
- - 402 errors are caught.
46
- - `sufficient` is set to `True` (stops loop).
47
- - `key_findings` are populated from search result titles.
48
- - `reasoning` contains the warning message.
49
- - Ran existing tests `tests/unit/agent_factory/test_judges_hf.py` - All passed.
docs/bugs/P0_CRITICAL_BUGS.md DELETED
@@ -1,43 +0,0 @@
- # P0 Critical Bugs - DeepBoner Demo Broken
-
- **Date**: 2025-11-28
- **Status**: RESOLVED (2025-11-29)
- **Priority**: P0 - Blocking hackathon submission
-
- ---
-
- ## Summary
-
- The Gradio demo was non-functional due to 4 critical bugs. All have been fixed and verified.
-
- ---
-
- ## Bug 1: Free Tier LLM Quota Exhausted (P0) - FIXED
-
- **Resolution**:
- - Implemented `QuotaExhaustedError` detection in `HFInferenceJudgeHandler`.
- - The agent now gracefully stops and displays a clear "Free Tier Quota Exceeded" message instead of looping infinitely.
-
- ## Bug 2: Evidence Counter Shows 0 After Dedup (P1) - FIXED
-
- **Resolution**:
- - Fixed by resolving Bug 4 (Data Leak). Deduplication now works correctly on isolated per-request collections.
-
- ## Bug 3: API Key Not Passed to Advanced Mode (P0) - FIXED
-
- **Resolution**:
- - Plumbed `api_key` from the UI through `configure_orchestrator` -> `create_orchestrator` -> `MagenticOrchestrator`.
- - Magentic agents now correctly use the user-provided OpenAI key.
-
- ## Bug 4: Singleton EmbeddingService Causes Cross-Session Pollution (P0) - FIXED
-
- **Resolution**:
- - Removed the singleton pattern for `EmbeddingService`.
- - Each request now gets a fresh `EmbeddingService` with a unique, isolated ChromaDB collection (`evidence_{uuid}`).
- - The `SentenceTransformer` model is lazily cached globally to maintain performance.
-
- ---
-
- ## Verification
-
- Run `make check` to verify all tests pass.
docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md DELETED
@@ -1,81 +0,0 @@
- # P1 Bug: Gradio Settings Accordion Not Collapsing
-
- **Priority**: P1 (UX Bug)
- **Status**: OPEN
- **Date**: 2025-11-27
- **Target Component**: `src/app.py`
-
- ---
-
- ## 1. Problem Description
-
- The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history.
-
- ### Symptoms
- - Accordion arrow toggles visually, but content remains visible.
- - Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces.
-
- ---
-
- ## 2. Root Cause Analysis
-
- **Definitive Cause**: Nested `Blocks` Context Bug.
- `gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`.
-
- **Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block.
-
- ---
-
- ## 3. Solution Strategy: "The Unwrap Fix"
-
- We will remove the redundant `gr.Blocks` wrapper. This restores the native behavior of `ChatInterface`, ensuring the accordion respects `open=False`.
-
- ### Implementation Plan
-
- **Refactor `src/app.py` / `create_demo()`**:
-
- 1. **Remove** the `with gr.Blocks() as demo:` context manager.
- 2. **Instantiate** `gr.ChatInterface` directly as the `demo` object.
- 3. **Migrate UI Elements**:
-    * **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`.
-    * **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat.
-
- ### Before (Buggy)
- ```python
- def create_demo():
-     with gr.Blocks() as demo:  # <--- CAUSE OF BUG
-         gr.Markdown("# Title")
-         gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False))
-         gr.Markdown("Footer")
-     return demo
- ```
-
- ### After (Correct)
- ```python
- def create_demo():
-     return gr.ChatInterface(  # <--- FIX: Top-level component
-         ...,
-         title="🧬 DeepBoner",
-         description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...",
-         additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
-     )
- ```
-
- ---
-
- ## 4. Validation
-
- 1. **Run**: `uv run python src/app.py`
- 2. **Check**: Open `http://localhost:7860`
- 3. **Verify**:
-    * Settings accordion starts **COLLAPSED**.
-    * Header title ("DeepBoner") is visible.
-    * Footer text ("MCP Server Active") is visible in the description area.
-    * Chat functionality works (Magentic/Simple modes).
-
- ---
-
- ## 5. Constraints & Notes
-
- - **Layout**: We lose the ability to place arbitrary elements *below* the chat box (the footer moves to the top, under the title), but this is an acceptable trade-off for a working UI.
- - **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style.
docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md DELETED
@@ -1,181 +0,0 @@
1
- # Bug Report: Magentic Mode Integration Issues
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Reporter:** CLI User
6
- - **Priority:** P1 (UX Degradation + Deprecation Warnings)
7
- - **Component:** `src/app.py`, `src/orchestrator_magentic.py`, `src/utils/llm_factory.py`
8
- - **Status:** ✅ FIXED (Bug 1 & Bug 2) - 2025-11-29
9
- - **Tests:** 138 passing (136 original + 2 new validation tests)
10
-
11
- ---
12
-
13
- ## Bug 1: Token-by-Token Streaming Spam ✅ FIXED
14
-
15
- ### Symptoms
16
- When running Magentic (Advanced) mode, the UI shows hundreds of individual lines like:
17
- ```text
18
- 📡 STREAMING: Below
19
- 📡 STREAMING: is
20
- 📡 STREAMING: a
21
- 📡 STREAMING: curated
22
- 📡 STREAMING: list
23
- ...
24
- ```
25
-
26
- Each token is displayed as a separate streaming event, creating visual spam and making it impossible to read the output until completion.
27
-
28
- ### Root Cause (VALIDATED)
29
- **File:** `src/orchestrator_magentic.py:247-254`
30
-
31
- ```python
32
- elif isinstance(event, MagenticAgentDeltaEvent):
33
- if event.text:
34
- return AgentEvent(
35
- type="streaming",
36
- message=event.text, # Single token!
37
- data={"agent_id": event.agent_id},
38
- iteration=iteration,
39
- )
40
- ```
41
-
42
- Every LLM token emits a `MagenticAgentDeltaEvent`, which creates an `AgentEvent(type="streaming")`.
43
-
44
- **File:** `src/app.py:171-192` (BEFORE FIX)
45
-
46
- ```python
47
- async for event in orchestrator.run(message):
48
- event_md = event.to_markdown()
49
- response_parts.append(event_md) # Appends EVERY token
50
-
51
- if event.type == "complete":
52
- yield event.message
53
- else:
54
- yield "\n\n".join(response_parts) # Yields ALL accumulated tokens
55
- ```
56
-
57
- For N tokens, this yields N times, each time re-rendering every previous token. That is O(N²) string work, and it creates massive visual spam.
58
-
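A toy count of emitted lines makes the asymmetry concrete (the helper names below are hypothetical, not the app's code):

```python
# Toy model of the two yield strategies (hypothetical names, not src/app.py).
def naive_lines_emitted(n_tokens: int) -> int:
    """Old behavior: every yield re-emits all accumulated token lines."""
    parts: list[str] = []
    emitted = 0
    for i in range(n_tokens):
        parts.append(f"tok{i}")
        emitted += len(parts)  # the UI receives one line per accumulated token
    return emitted


def buffered_lines_emitted(n_tokens: int) -> int:
    """Fixed behavior: every yield replaces the UI with ONE buffer line."""
    buffer = ""
    emitted = 0
    for i in range(n_tokens):
        buffer += f"tok{i} "
        emitted += 1  # a single streaming line per update
    return emitted
```

For a 100-token stream the naive path emits 5,050 lines in total, versus 100 single-line replacements for the buffered path.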
59
- ### Fix Applied
60
- **File:** `src/app.py:175-204`
61
-
62
- Implemented streaming token buffering with live updates:
63
- 1. Added `streaming_buffer = ""` to accumulate tokens
64
- 2. For each streaming event: append to buffer, yield immediately (for live typing UX)
65
- 3. **Key fix**: Don't append streaming events to `response_parts` (prevents O(N²) list growth)
66
- 4. Each yield has only ONE `📡 STREAMING:` line (the accumulated buffer)
67
- 5. Flush buffer to `response_parts` only when non-streaming event occurs
68
-
69
- **Result**: Live typing feel preserved, but no visual spam (each update replaces, not accumulates)
70
-
71
- ### Proposed Fix Options
72
-
73
- **Option A: Buffer streaming tokens (recommended)**
74
- ```python
75
- # In app.py - accumulate streaming tokens, yield periodically
- import time
-
- response_parts = []
76
- streaming_buffer = ""
77
- last_yield_time = time.time()
78
-
79
- async for event in orchestrator.run(message):
80
- if event.type == "streaming":
81
- streaming_buffer += event.message
82
- # Only yield every 500ms or on newline
83
- if time.time() - last_yield_time > 0.5 or "\n" in event.message:
84
- yield f"📡 {streaming_buffer}"
85
- last_yield_time = time.time()
86
- elif event.type == "complete":
87
- yield event.message
88
- else:
89
- # Non-streaming events
90
- response_parts.append(event.to_markdown())
91
- yield "\n\n".join(response_parts)
92
- ```
93
-
94
- **Option B: Don't yield streaming events at all**
95
- ```python
96
- # In app.py - only yield meaningful events
97
- async for event in orchestrator.run(message):
98
- if event.type == "streaming":
99
- continue # Skip token-by-token spam
100
- # ... rest of logic
101
- ```
102
-
103
- **Option C: Fix at orchestrator level**
104
- Don't emit `AgentEvent` for every delta - buffer in `_process_event`.
105
-
106
- ---
107
-
108
- ## Bug 2: API Key Does Not Persist in Textbox ✅ FIXED
109
-
110
- ### Symptoms
111
- 1. User opens the "Mode & API Key" accordion
112
- 2. User pastes their API key into the password textbox
113
- 3. User clicks an example OR clicks elsewhere
114
- 4. The API key textbox is now empty - value lost
115
-
116
- ### Root Cause (VALIDATED)
117
- **File:** `src/app.py:255-267` (BEFORE FIX)
118
-
119
- ```python
120
- additional_inputs_accordion=additional_inputs_accordion,
121
- additional_inputs=[
122
- gr.Radio(...),
123
- gr.Textbox(
124
- label="🔑 API Key (Optional)",
125
- type="password",
126
- # No `value` parameter - defaults to empty
127
- # No state persistence mechanism
128
- ),
129
- ],
130
- ```
131
-
132
- Gradio's `ChatInterface` with `additional_inputs` has known issues:
133
- 1. Clicking examples resets additional inputs to defaults
134
- 2. The accordion state and input values may not persist correctly
135
- 3. No explicit state management for the API key
136
-
137
- ### Fix Applied
138
- **Files Modified:**
139
- 1. `src/app.py`
140
- 2. `src/utils/llm_factory.py`
141
-
142
- **Bug 1 (Streaming Spam):**
143
- - Accumulate tokens in `streaming_buffer`
144
- - Yield updates immediately for live typing UX
145
- - **Key**: Don't append to `response_parts` until stream segment complete
146
- - Each yield has ONE `📡 STREAMING:` line (not N accumulated lines)
147
-
148
- **Bug 2 (API Key Persistence):**
149
- - **Strategy:** Partial example list (relies on Gradio behavior)
150
- - Examples have only 2 elements `[message, mode]` instead of 4
151
- - Gradio only updates inputs with corresponding example values
152
- - Remaining inputs (api_key textbox) are left unchanged
153
- - `api_key_state` parameter exists as fallback but may be redundant
154
- - **Note:** This is a workaround relying on undocumented Gradio behavior
155
-
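The relied-upon behavior can be modeled in a few lines (`apply_example` is a hypothetical stand-in for Gradio's internal example handling, not its real API):

```python
# Toy model of the example-click behavior the workaround relies on:
# only inputs with a corresponding example value are overwritten.
def apply_example(current_values: list, example: list) -> list:
    updated = list(current_values)
    for i, value in enumerate(example):
        updated[i] = value  # trailing inputs (no example value) are untouched
    return updated


# Inputs: [message, mode, api_key]; the example supplies only [message, mode].
state = ["", "simple", "sk-user-secret"]
state = apply_example(state, ["What drugs treat HSDD?", "advanced"])
```

With a 2-element example, the third input (the API key textbox) keeps its user-entered value.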
156
- **Bug 3 (OpenAIModel Deprecation):** ✅ FIXED
157
- - Replaced all `OpenAIModel` imports with `OpenAIChatModel` in `src/app.py` and `src/utils/llm_factory.py`.
158
-
159
- ### Test Results
160
- ```bash
161
- uv run pytest tests/ -q
162
- ============================= 138 passed in 20.60s =============================
163
- ```
164
-
165
- **Status:** ✅ All tests passing
166
-
167
- ### Why This Fix Works
168
-
169
- **Bug 1 (Streaming Spam):**
170
- - **Before:** Every token → `append()` to list → `yield` → List grew to size N → O(N²) complexity.
171
- - **After:** Every token → `yield` dynamically constructed string (buffer + history) → List stays size K (number of *events*).
172
- - **Impact:** Smooth streaming, no visual spam, no browser freeze.
173
-
174
- **Bug 2 (API Key):**
175
- - **Before:** Example click → Overwrote API Key textbox with `""`.
176
- - **After:** Example click → Updates only `message` and `mode` → API Key textbox untouched.
177
- - **Impact:** User input persists naturally.
178
-
179
- ### Remaining Work
180
- - **Bug 4 (Asyncio GC errors):** Monitoring only - likely Gradio/HF Spaces issue
181
-
docs/bugs/P3_MAGENTIC_NO_TERMINATION_EVENT.md ADDED
@@ -0,0 +1,177 @@
1
+ # P3 Bug Report: Advanced Mode Missing Termination Guarantee
2
+
3
+ ## Status
4
+ - **Date:** 2025-11-29
5
+ - **Priority:** P3 (Edge case, but confusing UX)
6
+ - **Component:** `src/orchestrator_magentic.py`
7
+ - **Resolution:** Fixed (Guarantee termination event)
8
+
9
+ ---
10
+
11
+ ## Symptoms
12
+
13
+ In **Advanced (Magentic) mode** with OpenAI API key:
14
+
15
+ 1. Workflow runs for many iterations (up to 10 rounds)
16
+ 2. Agents search, judge, hypothesize repeatedly
17
+ 3. Eventually... **nothing happens**
18
+ - No "complete" event
19
+ - No error message
20
+ - UI just stops updating
21
+
22
+ **User perception:** "Did it finish? Did it crash? What happened?"
23
+
24
+ ### Observed Behavior
25
+
26
+ When workflow hits `max_round_count=10`:
27
+ - `workflow.run_stream(task)` iterator ends
28
+ - NO `MagenticFinalResultEvent` is emitted by agent-framework
29
+ - Our code yields nothing after the loop
30
+ - User is left hanging
31
+
32
+ ---
33
+
34
+ ## Root Cause Analysis
35
+
36
+ ### Code Path (`src/orchestrator_magentic.py:170-186`)
37
+
38
+ ```python
39
+ iteration = 0
40
+ try:
41
+ async for event in workflow.run_stream(task):
42
+ agent_event = self._process_event(event, iteration)
43
+ if agent_event:
44
+ if isinstance(event, MagenticAgentMessageEvent):
45
+ iteration += 1
46
+ yield agent_event
47
+ # BUG: NO FALLBACK HERE!
48
+ # If loop ends without FinalResultEvent, user sees nothing
49
+
50
+ except Exception as e:
51
+ logger.error("Magentic workflow failed", error=str(e))
52
+ yield AgentEvent(
53
+ type="error",
54
+ message=f"Workflow error: {e!s}",
55
+ iteration=iteration,
56
+ )
57
+ # BUG: NO FINALLY BLOCK TO GUARANTEE TERMINATION EVENT
58
+ ```
59
+
60
+ ### Workflow Configuration (`src/orchestrator_magentic.py:110-116`)
61
+
62
+ ```python
63
+ .with_standard_manager(
64
+ chat_client=manager_client,
65
+ max_round_count=self._max_rounds, # 10 - can hit this limit
66
+ max_stall_count=3, # If agents repeat 3x
67
+ max_reset_count=2, # Workflow reset limit
68
+ )
69
+ ```
70
+
71
+ ### Failure Modes
72
+
73
+ | Scenario | What Happens | User Sees |
74
+ |----------|--------------|-----------|
75
+ | `MagenticFinalResultEvent` emitted | `_process_event` yields "complete" | Final report |
76
+ | Max rounds (10) reached, no final event | Loop ends silently | **Nothing** |
77
+ | `max_stall_count` triggered | Workflow ends | **Nothing** |
78
+ | `max_reset_count` triggered | Workflow ends | **Nothing** |
79
+ | OpenAI API error | Exception caught | Error message |
80
+
81
+ ---
82
+
83
+ ## The Fix
84
+
85
+ Add guaranteed termination event after the loop:
86
+
87
+ ```python
88
+ iteration = 0
89
+ final_event_received = False
90
+
91
+ try:
92
+ async for event in workflow.run_stream(task):
93
+ agent_event = self._process_event(event, iteration)
94
+ if agent_event:
95
+ if isinstance(event, MagenticAgentMessageEvent):
96
+ iteration += 1
97
+ if agent_event.type == "complete":
98
+ final_event_received = True
99
+ yield agent_event
100
+
101
+ except Exception as e:
102
+ logger.error("Magentic workflow failed", error=str(e))
103
+ yield AgentEvent(
104
+ type="error",
105
+ message=f"Workflow error: {e!s}",
106
+ iteration=iteration,
107
+ )
108
+ final_event_received = True # Error is a form of termination
109
+
110
+ finally:
111
+ # GUARANTEE: Always emit termination event
112
+ if not final_event_received:
113
+ logger.warning(
114
+ "Workflow ended without final event",
115
+ iterations=iteration,
116
+ )
117
+ yield AgentEvent(
118
+ type="complete",
119
+ message=(
120
+ f"Research completed after {iteration} agent rounds. "
121
+ "Max iterations reached - results may be partial. "
122
+ "Try a more specific query for better results."
123
+ ),
124
+ data={"iterations": iteration, "reason": "max_rounds_reached"},
125
+ iteration=iteration,
126
+ )
127
+ ```
128
+
129
+ ---
130
+
131
+ ## Alternative: Increase Max Rounds
132
+
133
+ The default `max_rounds=10` might be too low for complex queries.
134
+
135
+ In `src/orchestrator_factory.py:52-53`:
136
+ ```python
137
+ return orchestrator_cls(
138
+ max_rounds=config.max_iterations if config else 10, # Could increase to 15-20
139
+ api_key=api_key,
140
+ )
141
+ ```
142
+
143
+ **Trade-off:** More rounds = more API cost, but better chance of complete results.
144
+
145
+ ---
146
+
147
+ ## Test Plan
148
+
149
+ - [ ] Add fallback yield after async for loop
150
+ - [ ] Add `final_event_received` flag tracking
151
+ - [ ] Log warning when fallback is used
152
+ - [ ] Test with `max_rounds=2` to force hitting limit
153
+ - [ ] Verify user always sees termination event
154
+ - [ ] `make check` passes
155
+
156
+ ---
157
+
158
+ ## Related Files
159
+
160
+ - `src/orchestrator_magentic.py` - Main fix location
161
+ - `src/orchestrator_factory.py` - Max rounds configuration
162
+ - `src/utils/models.py` - AgentEvent types
163
+ - `docs/bugs/P2_MAGENTIC_THINKING_STATE.md` - Related UX issue (implemented)
164
+
165
+ ---
166
+
167
+ ## Priority Justification
168
+
169
+ **P3** because:
170
+ - Advanced mode is working for most queries
171
+ - Only hits edge case when max rounds reached without synthesis
172
+ - User CAN retry with different query
173
+ - Not blocking hackathon demo (free tier Simple mode works)
174
+
175
+ Would be P2 if:
176
+ - This happened frequently
177
+ - No workaround existed
docs/bugs/SENIOR_AGENT_AUDIT_PROMPT.md DELETED
@@ -1,247 +0,0 @@
1
- # Senior Agent Audit Request: DeepBoner Codebase Bug Hunt
2
-
3
- **Date**: 2025-11-28
4
- **Requesting Agent**: Claude (Opus)
5
- **Purpose**: Comprehensive bug audit and verification of P0_CRITICAL_BUGS.md
6
-
7
- ---
8
-
9
- ## Your Mission
10
-
11
- You are a senior software engineer performing a comprehensive audit of the DeepBoner codebase. Your goals:
12
-
13
- 1. **VERIFY** the 4 bugs documented in `docs/bugs/P0_CRITICAL_BUGS.md` are accurately described
14
- 2. **FIND** any additional bugs (P0-P4) that could affect the demo
15
- 3. **TRACE** the complete code paths for Simple and Advanced modes
16
- 4. **IDENTIFY** any silent failures, race conditions, or edge cases
17
-
18
- ---
19
-
20
- ## Context: What DeepBoner Does
21
-
22
- DeepBoner is a Gradio-based biomedical research agent that:
23
- 1. Takes a research question from user
24
- 2. Searches PubMed, ClinicalTrials.gov, Europe PMC
25
- 3. Uses an LLM "judge" to evaluate if evidence is sufficient
26
- 4. Either loops for more evidence or synthesizes a final report
27
-
28
- **Two Modes**:
29
- - **Simple**: Linear orchestrator with search → judge → report loop
30
- - **Advanced**: Magentic multi-agent with SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent
31
-
32
- **Three Backend Options**:
33
- - Free tier: HuggingFace Inference API (Llama/Mistral)
34
- - OpenAI: User-provided or env var key
35
- - Anthropic: User-provided or env var key (Simple mode only)
36
-
37
- ---
38
-
39
- ## Files to Audit (Priority Order)
40
-
41
- ### Critical Path Files:
42
- 1. `src/app.py` - Gradio UI, entry point, key routing
43
- 2. `src/orchestrator.py` - Simple mode main loop
44
- 3. `src/orchestrator_factory.py` - Mode selection and orchestrator creation
45
- 4. `src/orchestrator_magentic.py` - Advanced mode implementation
46
- 5. `src/services/embeddings.py` - Deduplication singleton (KNOWN BUG)
47
- 6. `src/agent_factory/judges.py` - LLM judge handlers (HF, OpenAI, Anthropic)
48
-
49
- ### Supporting Files:
50
- 7. `src/tools/search_handler.py` - Parallel search orchestration
51
- 8. `src/tools/pubmed.py` - PubMed API integration
52
- 9. `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
53
- 10. `src/tools/europepmc.py` - Europe PMC API
54
- 11. `src/agents/magentic_agents.py` - Agent factories (KNOWN BUG: hardcoded env key)
55
- 12. `src/utils/config.py` - Settings and configuration
56
- 13. `src/utils/models.py` - Data models (Evidence, Citation, etc.)
57
-
58
- ---
59
-
60
- ## Known Bugs to Verify
61
-
62
- ### Bug 1: Free Tier LLM Quota Exhausted
63
- **Claim**: HuggingFace Inference returns 402, all 3 fallback models fail
64
- **Verify**:
65
- - Check `src/agent_factory/judges.py` class `HFInferenceJudgeHandler`
66
- - Trace the fallback chain: Llama → Mistral → Zephyr
67
- - Confirm what happens when ALL fail (does it return default "continue"?)
68
- - Check if the error message reaches the user or is swallowed
69
-
70
- ### Bug 2: Evidence Counter Shows 0 After Dedup
71
- **Claim**: `_deduplicate_and_rank()` can return empty list, losing all evidence
72
- **Verify**:
73
- - Check `src/orchestrator.py` lines 97-114 and 219
74
- - Trace what happens if `embeddings.deduplicate()` returns `[]`
75
- - Is there defensive handling? Does exception handler catch this?
76
- - Could this be a race condition in async code?
77
-
78
- ### Bug 3: API Key Not Passed to Advanced Mode
79
- **Claim**: User's API key from Gradio is never passed to MagenticOrchestrator
80
- **Verify**:
81
- - Trace: `app.py:research_agent()` → `configure_orchestrator()` → `orchestrator_factory.py`
82
- - Check if `user_api_key` is passed to `create_orchestrator()`
83
- - Check if `MagenticOrchestrator.__init__()` receives a key
84
- - Check `src/agents/magentic_agents.py` - do agents use `settings.openai_api_key`?
85
-
86
- ### Bug 4: Singleton EmbeddingService Cross-Session Pollution
87
- **Claim**: ChromaDB collection persists across requests, causing false duplicates
88
- **Verify**:
89
- - Check `src/services/embeddings.py` singleton pattern
90
- - Is `_embedding_service` ever reset?
91
- - What happens to ChromaDB collection between Gradio requests?
92
- - Could this cause "Found 20 new sources (0 total)"?
93
-
94
- ---
95
-
96
- ## Additional Bug Categories to Search For
97
-
98
- ### A. Error Handling Gaps
99
- - [ ] Silent `except: pass` blocks
100
- - [ ] Exceptions logged but not re-raised
101
- - [ ] Missing error messages to user
102
- - [ ] Swallowed API errors
103
-
104
- ### B. Async/Concurrency Issues
105
- - [ ] Race conditions in parallel searches
106
- - [ ] Shared mutable state across async calls
107
- - [ ] Missing `await` keywords
108
- - [ ] Event loop blocking (sync code in async context)
109
-
110
- ### C. API Integration Bugs
111
- - [ ] Missing rate limiting
112
- - [ ] Hardcoded timeouts that are too short
113
- - [ ] XML/JSON parsing failures not handled
114
- - [ ] Empty response handling
115
-
116
- ### D. State Management Issues
117
- - [ ] Global singletons that should be session-scoped
118
- - [ ] Gradio state not properly isolated between users
119
- - [ ] Memory leaks from accumulated data
120
-
121
- ### E. Configuration Bugs
122
- - [ ] Missing env var defaults
123
- - [ ] Type mismatches in settings
124
- - [ ] Hardcoded values that should be configurable
125
-
126
- ### F. UI/UX Bugs
127
- - [ ] Streaming not working properly
128
- - [ ] Progress messages misleading
129
- - [ ] Examples not matching actual functionality
130
- - [ ] Error messages not user-friendly
131
-
132
- ---
133
-
134
- ## Output Format
135
-
136
- Please produce a report with:
137
-
138
- ### 1. Verification of Known Bugs
139
- For each of the 4 bugs in P0_CRITICAL_BUGS.md:
140
- - **CONFIRMED** or **INCORRECT** or **PARTIALLY CORRECT**
141
- - Exact file:line references
142
- - Any corrections or additional details
143
-
144
- ### 2. New Bugs Found
145
- For each new bug:
146
- ```
147
- ## Bug N: [Title]
148
- **Priority**: P0/P1/P2/P3/P4
149
- **File**: path/to/file.py:line
150
- **Symptoms**: What the user sees
151
- **Root Cause**: Technical explanation
152
- **Code**:
153
- ```python
154
- # The buggy code
155
- ```
156
- **Fix**:
157
- ```python
158
- # The corrected code
159
- ```
160
- ```
161
-
162
- ### 3. Code Quality Concerns
163
- Any patterns that aren't bugs but could cause issues:
164
- - Technical debt
165
- - Missing tests for critical paths
166
- - Unclear error handling
167
-
168
- ### 4. Recommended Fix Order
169
- Prioritized list of what to fix first for a working demo.
170
-
171
- ---
172
-
173
- ## Commands to Help Your Investigation
174
-
175
- ```bash
176
- # Run the tests
177
- make check
178
-
179
- # Test search works
180
- uv run python -c "
181
- import asyncio
182
- from src.tools.pubmed import PubMedTool
183
- async def test():
184
- tool = PubMedTool()
185
- results = await tool.search('female libido', 5)
186
- print(f'Found {len(results)} results')
187
- asyncio.run(test())
188
- "
189
-
190
- # Test HF inference (will show 402 if quota exhausted)
191
- uv run python -c "
192
- from huggingface_hub import InferenceClient
193
- client = InferenceClient()
194
- try:
195
- resp = client.chat_completion(
196
- messages=[{'role': 'user', 'content': 'Hi'}],
197
- model='meta-llama/Llama-3.1-8B-Instruct',
198
- max_tokens=10
199
- )
200
- print(resp)
201
- except Exception as e:
202
- print(f'Error: {e}')
203
- "
204
-
205
- # Test full orchestrator (simple mode)
206
- uv run python -c "
207
- import asyncio
208
- from src.app import configure_orchestrator
209
- async def test():
210
- orch, backend = configure_orchestrator(use_mock=True, mode='simple')
211
- print(f'Backend: {backend}')
212
- async for event in orch.run('test query'):
213
- print(f'{event.type}: {event.message[:50] if event.message else \"\"}'[:60])
214
- asyncio.run(test())
215
- "
216
-
217
- # Check for hardcoded API keys (security)
218
- grep -r "sk-" src/ --include="*.py" | grep -v "sk-..." | grep -v "sk-ant-..."
219
-
220
- # Find all singletons
221
- grep -r "_.*: .* | None = None" src/ --include="*.py"
222
-
223
- # Find all except blocks
224
- grep -rn "except.*:" src/ --include="*.py" | head -50
225
- ```
226
-
227
- ---
228
-
229
- ## Important Notes
230
-
231
- 1. **DO NOT fix bugs** - just document them
232
- 2. **Be thorough** - check edge cases and error paths
233
- 3. **Be specific** - include file:line references
234
- 4. **Be skeptical** - verify claims in P0_CRITICAL_BUGS.md independently
235
- 5. **Think like a user** - what would break the demo experience?
236
-
237
- The hackathon deadline is approaching. We need a working demo. Your audit will determine what gets fixed first.
238
-
239
- ---
240
-
241
- ## Deliverable
242
-
243
- A comprehensive markdown report that:
244
- 1. Confirms or corrects the 4 known bugs
245
- 2. Lists any new bugs found (with priority)
246
- 3. Recommends the optimal fix order
247
- 4. Can be saved as `docs/bugs/SENIOR_AUDIT_RESULTS.md`
docs/bugs/SENIOR_AUDIT_RESULTS.md DELETED
@@ -1,84 +0,0 @@
1
- # Senior Agent Audit Results: DeepBoner Codebase
2
-
3
- **Date**: 2025-11-28
4
- **Auditor**: Claude (Senior Software Engineer)
5
- **Status**: COMPLETE
6
-
7
- ---
8
-
9
- ## Executive Summary
10
-
11
- The DeepBoner codebase has **4 critical defects** that render the demo non-functional for most users. The most severe is a **data leak** where the vector database persists across user sessions, causing search result corruption and potential privacy issues. Additionally, the "Advanced" mode ignores user-provided API keys, and the "Free Tier" mode fails silently when quotas are exhausted.
12
-
13
- **Recommendation**: Immediate remediation of P0 bugs is required before hackathon submission.
14
-
15
- ---
16
-
17
- ## 1. Verification of Known Bugs (P0_CRITICAL_BUGS.md)
18
-
19
- | Bug | Claim | Verification Status | Notes |
20
- | :--- | :--- | :--- | :--- |
21
- | **Bug 1** | Free Tier LLM Quota Exhausted | **CONFIRMED** | `HFInferenceJudgeHandler` catches errors but returns a fallback assessment with `recommendation="continue"`. This causes the orchestrator to loop uselessly until `max_iterations` is reached. The user sees no error message. |
22
- | **Bug 2** | Evidence Counter Shows 0 | **CONFIRMED** | Directly caused by Bug 4. Deduplication logic works correctly *in isolation*, but fails because the underlying ChromaDB collection is polluted with stale data from previous sessions. |
23
- | **Bug 3** | API Key Not Passed to Advanced | **CONFIRMED** | `create_orchestrator` in `orchestrator_factory.py` ignores the user's API key. `MagenticOrchestrator` and its agents fall back to `settings.openai_api_key` (env var), which is empty for BYOK users. |
24
- | **Bug 4** | Singleton EmbeddingService | **CONFIRMED** | `EmbeddingService` is a global singleton with an in-memory ChromaDB. The collection is never cleared. Data leaks between sessions, causing valid new results to be marked as duplicates of old results. |
25
-
26
- ---
27
-
28
- ## 2. New Bugs Found
29
-
30
- ### Bug 5: Search Error Swallowing (P2)
31
- **File**: `src/orchestrator.py` / `src/tools/search_handler.py`
32
- **Symptoms**: If all search tools fail (e.g., network issue, API limit), the UI shows "Found 0 sources" without explaining why.
33
- **Root Cause**: `SearchHandler` captures exceptions and returns them in an `errors` list, but `Orchestrator` only logs them to the console (`logger.warning`) and proceeds with empty evidence.
34
- **Fix**: Yield an `AgentEvent(type="error")` or include errors in the `search_complete` event message.
35
-
36
- ### Bug 6: Hardcoded Model Names (P3)
37
- **File**: `src/agent_factory/judges.py`
38
- **Symptoms**: Maintenance burden.
39
- **Root Cause**: Model names like `meta-llama/Llama-3.1-8B-Instruct` are hardcoded in the class `HFInferenceJudgeHandler` rather than pulled from `config.py`.
40
- **Fix**: Move to `Settings`.
41
-
42
- ---
43
-
44
- ## 3. Code Quality Concerns
45
-
46
- 1. **Singleton Abuse**: The `_embedding_service` global in `src/services/embeddings.py` is a major architectural flaw for a multi-user web app (even a demo). It should be scoped to the `Orchestrator` instance.
47
- 2. **Inconsistent Factory Signatures**: `create_orchestrator` does not accept `api_key`, forcing hacks or reliance on global env vars.
48
- 3. **Silent Failures**: The pervasive use of `try...except Exception` with only logging (no user feedback) makes debugging difficult for end-users.
49
-
50
- ---
51
-
52
- ## 4. Recommended Fix Order
53
-
54
- ### Step 1: Fix the Data Leak (Bug 4 & 2)
55
- **Why**: Prevents result corruption and cross-user data leakage.
56
- **Plan**:
57
- 1. Remove singleton pattern from `src/services/embeddings.py`.
58
- 2. Make `EmbeddingService` an instance variable of `Orchestrator`.
59
- 3. Initialize a fresh `EmbeddingService` (and ChromaDB collection) for each `run()`.
60
-
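A minimal sketch of the session-scoped shape (a plain `set` stands in for the real ChromaDB-backed service; the point is the lifetime, not the similarity logic):

```python
# Sketch only: set-based dedup stands in for the real EmbeddingService.
class EmbeddingService:
    def __init__(self) -> None:
        self._seen: set[str] = set()  # fresh "collection" per instance

    def deduplicate(self, items: list[str]) -> list[str]:
        fresh = []
        for item in items:
            if item not in self._seen:
                self._seen.add(item)
                fresh.append(item)
        return fresh


class Orchestrator:
    def run_search(self, results: list[str]) -> list[str]:
        # One service per run: a prior session can no longer mark
        # this session's results as duplicates.
        embeddings = EmbeddingService()
        return embeddings.deduplicate(results)
```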
61
- ### Step 2: Fix Advanced Mode BYOK (Bug 3)
62
- **Why**: Enables the core "Advanced" feature for judges/users.
63
- **Plan**:
64
- 1. Update `create_orchestrator` signature to accept `api_key`.
65
- 2. Update `MagenticOrchestrator` to accept `api_key`.
66
- 3. Update `configure_orchestrator` in `app.py` to pass the key.
67
- 4. Ensure `MagenticOrchestrator` constructs `OpenAIChatClient` with the user's key.
68
-
69
- ### Step 3: Fix Free Tier Experience (Bug 1)
70
- **Why**: Ensures a usable fallback for those without keys.
71
- **Plan**:
72
- 1. In `HFInferenceJudgeHandler`, detect 402/429 errors.
73
- 2. If caught, return a `JudgeAssessment` that triggers a "Complete" event with a clear error message, rather than "Continue".
74
- 3. Add `HF_TOKEN` to the deployment environment if possible.
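The terminal-error behavior can be sketched as follows (`JudgeAssessment` fields and the error type are illustrative; the real handler wraps the HF client's HTTP errors):

```python
from dataclasses import dataclass


@dataclass
class JudgeAssessment:
    recommendation: str  # e.g. "continue" or "complete"
    rationale: str


def assess_with_quota_guard(call_llm) -> JudgeAssessment:
    try:
        return call_llm()
    except RuntimeError as exc:  # stand-in for the HF client's HTTP error
        if "402" in str(exc) or "429" in str(exc):
            # Terminal: surface the problem instead of looping on "continue"
            return JudgeAssessment(
                recommendation="complete",
                rationale="Free-tier quota exhausted; add HF_TOKEN or bring your own API key.",
            )
        raise
```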
75
-
76
- ---
77
-
78
- ## Verification Plan
79
-
80
- After applying fixes, run:
81
- 1. **Unit Tests**: `make check`
82
- 2. **Manual Test (Simple)**: Run without key, verify 402 error is handled OR works if token added.
83
- 3. **Manual Test (Advanced)**: Run with OpenAI key, verify it proceeds past initialization.
84
- 4. **Manual Test (Dedup)**: Run same query twice. Second run should find same number of results (not 0).
src/app.py CHANGED
@@ -173,7 +173,11 @@ async def research_agent(
173
  user_api_key=user_api_key,
174
  )
175
 
176
- yield f"🧠 **Backend**: {backend_name}\n\n"
 
 
 
 
177
 
178
  # Immediate loading feedback so user knows something is happening
179
  yield (
 
173
  user_api_key=user_api_key,
174
  )
175
 
176
+ # Immediate backend info + loading feedback so user knows something is happening
177
+ yield (
178
+ f"🧠 **Backend**: {backend_name}\n\n"
179
+ "⏳ **Processing...** Searching PubMed, ClinicalTrials.gov, Europe PMC...\n"
180
+ )
181
 
182
  # Immediate loading feedback so user knows something is happening
183
  yield (
src/orchestrator_magentic.py CHANGED
@@ -168,14 +168,38 @@ The final output should be a structured research report."""
168
  )
169
 
170
  iteration = 0
 
 
171
  try:
172
  async for event in workflow.run_stream(task):
173
  agent_event = self._process_event(event, iteration)
174
  if agent_event:
175
  if isinstance(event, MagenticAgentMessageEvent):
176
  iteration += 1
 
 
 
 
177
  yield agent_event
178
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
179
  except Exception as e:
180
  logger.error("Magentic workflow failed", error=str(e))
181
  yield AgentEvent(
 
168
  )
169
 
170
  iteration = 0
171
+ final_event_received = False
172
+
173
  try:
174
  async for event in workflow.run_stream(task):
175
  agent_event = self._process_event(event, iteration)
176
  if agent_event:
177
  if isinstance(event, MagenticAgentMessageEvent):
178
  iteration += 1
179
+
180
+ if agent_event.type == "complete":
181
+ final_event_received = True
182
+
183
  yield agent_event
184
 
185
+ # GUARANTEE: Always emit termination event if stream ends without one
186
+ # (e.g., max rounds reached)
187
+ if not final_event_received:
188
+ logger.warning(
189
+ "Workflow ended without final event",
190
+ iterations=iteration,
191
+ )
192
+ yield AgentEvent(
193
+ type="complete",
194
+ message=(
195
+ f"Research completed after {iteration} agent rounds. "
196
+ "Max iterations reached - results may be partial. "
197
+ "Try a more specific query for better results."
198
+ ),
199
+ data={"iterations": iteration, "reason": "max_rounds_reached"},
200
+ iteration=iteration,
201
+ )
202
+
203
  except Exception as e:
204
  logger.error("Magentic workflow failed", error=str(e))
205
  yield AgentEvent(
tests/unit/test_magentic_termination.py ADDED
@@ -0,0 +1,111 @@
1
+ """Tests for Magentic Orchestrator termination guarantee."""
2
+
3
+ from unittest.mock import MagicMock, patch
4
+
5
+ import pytest
6
+ from agent_framework import MagenticAgentMessageEvent
7
+
8
+ from src.orchestrator_magentic import MagenticOrchestrator
9
+ from src.utils.models import AgentEvent
10
+
11
+ # Skip tests if agent_framework is not installed
12
+ pytest.importorskip("agent_framework")
13
+
14
+
15
+ class MockChatMessage:
16
+ def __init__(self, content):
17
+ self.content = content
18
+ self.role = "assistant"
19
+
20
+ @property
21
+ def text(self):
22
+ return self.content
23
+
24
+
25
+ @pytest.fixture
26
+ def mock_magentic_requirements():
27
+ """Mock requirements check."""
28
+ with patch("src.orchestrator_magentic.check_magentic_requirements"):
29
+ yield
30
+
31
+
32
+ @pytest.mark.asyncio
33
+ async def test_termination_event_emitted_on_stream_end(mock_magentic_requirements):
34
+ """
35
+ Verify that a termination event is emitted when the workflow stream ends
36
+ without a MagenticFinalResultEvent (e.g. max rounds reached).
37
+ """
38
+ orchestrator = MagenticOrchestrator(max_rounds=2)
39
+
40
+ # Use real event class
41
+ mock_message = MockChatMessage("Thinking...")
42
+ mock_agent_event = MagenticAgentMessageEvent(agent_id="SearchAgent", message=mock_message)
43
+
44
+ # Mock the workflow and its run_stream method
45
+ mock_workflow = MagicMock()
46
+
47
+ # Create an async generator for run_stream
48
+ async def mock_stream(task):
49
+ # Yield the real message event
50
+ yield mock_agent_event
51
+ # STOP HERE - No FinalResultEvent
52
+
53
+ mock_workflow.run_stream = mock_stream
54
+
55
+ # Mock _build_workflow to return our mock workflow
56
+ with patch.object(orchestrator, "_build_workflow", return_value=mock_workflow):
57
+ events = []
58
+ async for event in orchestrator.run("Research query"):
59
+ events.append(event)
60
+
61
+ for i, e in enumerate(events):
62
+ print(f"Event {i}: {e.type} - {e.message}")
63
+
64
+ assert len(events) >= 2
65
+ assert events[0].type == "started"
66
+
67
+ # Verify the message event was processed
68
+ # Depending on _process_event logic, MagenticAgentMessageEvent might map to different types
69
+ # We assume it maps to something valid or we just check presence.
70
+ assert any("Thinking..." in e.message for e in events)
71
+
72
+ # THE CRITICAL CHECK: Did we get the fallback termination event?
73
+ last_event = events[-1]
74
+ assert last_event.type == "complete"
75
+ assert "Max iterations reached" in last_event.message
76
+ assert last_event.data.get("reason") == "max_rounds_reached"
77
+
78
+
79
+ @pytest.mark.asyncio
80
+ async def test_no_double_termination_event(mock_magentic_requirements):
81
+ """
82
+ Verify that we DO NOT emit a fallback event if the workflow finished normally.
83
+ """
84
+ orchestrator = MagenticOrchestrator()
85
+
86
+ mock_workflow = MagicMock()
87
+
88
+ with patch.object(orchestrator, "_build_workflow", return_value=mock_workflow):
89
+ # Mock _process_event to simulate a natural completion event
90
+ with patch.object(orchestrator, "_process_event") as mock_process:
91
+ mock_process.side_effect = [
92
+ AgentEvent(type="thinking", message="Working...", iteration=1),
93
+ AgentEvent(type="complete", message="Done!", iteration=2),
94
+ ]
95
+
96
+ async def mock_stream_with_yields(task):
97
+ yield "raw_event_1"
98
+ yield "raw_event_2"
99
+
100
+ mock_workflow.run_stream = mock_stream_with_yields
101
+
102
+ events = []
103
+ async for event in orchestrator.run("Research query"):
104
+ events.append(event)
105
+
106
+ assert events[-1].message == "Done!"
107
+ assert events[-1].type == "complete"
108
+
109
+ # Verify we didn't get a SECOND "Max iterations reached" event
110
+ fallback_events = [e for e in events if "Max iterations reached" in e.message]
111
+ assert len(fallback_events) == 0
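Both tests above stub the streaming method by assigning a plain async generator function onto the mock workflow (`mock_workflow.run_stream = mock_stream`) rather than configuring `MagicMock` return values. A minimal, self-contained sketch of why that works (class and names here are illustrative, not the project's API):

```python
import asyncio


class Workflow:
    # Illustrative stand-in for the real workflow class.
    async def run_stream(self, task):
        raise NotImplementedError("real implementation would talk to agents")


async def fake_stream(task):
    # A plain async generator function: calling it returns an async
    # iterator, exactly what `async for` expects.
    yield f"event-for:{task}"
    # Stops early, simulating a truncated stream (max rounds reached).


wf = Workflow()
# Assigning to the instance attribute shadows the class method; since it is
# not looked up on the class, no `self` is bound and the plain function works.
wf.run_stream = fake_stream


async def _drain():
    return [e async for e in wf.run_stream("demo task")]


events = asyncio.run(_drain())
print(events)  # → ['event-for:demo task']
```

This keeps the test's fake stream readable as ordinary code; `MagicMock` alone cannot produce an object usable with `async for` without extra configuration, which is why the tests swap in a real async generator for `run_stream`.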