VibecoderMcSwaggins committed on
Commit 4878d51 · 1 Parent(s): d4d872a

Update ACTIVE_BUGS.md and archive completed bug documentation


- Consolidate active bug documentation, focusing on the P3 Progress Bar Positioning issue.
- Archive resolved documentation for various P0 and P2 bugs, including the ExecutorCompletedEvent UI noise and Round Counter Semantic Mismatch.
- Ensure ACTIVE_BUGS.md reflects the current status of ongoing issues and directs users to archived documentation for completed bugs.

Files changed (34)
  1. docs/bugs/ACTIVE_BUGS.md +9 -65
  2. docs/bugs/archive/AUDIT_FINDINGS_2025_11_30.md +0 -70
  3. docs/bugs/archive/GRADIO_EXAMPLE_VS_CHAT_ARROW_ANALYSIS.md +0 -147
  4. docs/bugs/archive/P0_ADVANCED_MODE_TIMEOUT_NO_SYNTHESIS.md +0 -307
  5. docs/bugs/archive/P0_AIFUNCTION_NOT_JSON_SERIALIZABLE.md +0 -225
  6. docs/bugs/archive/P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md +0 -173
  7. docs/bugs/archive/P0_MCP_TOOLUSECONTENT_MISSING.md +0 -88
  8. docs/bugs/archive/P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md +0 -144
  9. docs/bugs/archive/P0_REPR_BUG_ROOT_CAUSE_ANALYSIS.md +0 -99
  10. docs/bugs/archive/P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS.md +0 -59
  11. docs/bugs/archive/P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md +0 -254
  12. docs/bugs/archive/P0_SYNTHESIS_PROVIDER_MISMATCH.md +0 -273
  13. docs/bugs/archive/P1_ADVANCED_MODE_UNINTERPRETABLE_CHAIN_OF_THOUGHT.md +0 -184
  14. docs/bugs/archive/P1_FREE_TIER_TOOL_EXECUTION_FAILURE.md +0 -319
  15. docs/bugs/archive/P1_GRADIO_EXAMPLE_CLICK_AUTO_SUBMIT.md +0 -273
  16. docs/bugs/archive/P1_HUGGINGFACE_NOVITA_500_ERROR.md +0 -133
  17. docs/bugs/archive/P1_HUGGINGFACE_ROUTER_401_HYPERBOLIC.md +0 -62
  18. docs/bugs/archive/P1_NARRATIVE_SYNTHESIS_FALLBACK.md +0 -185
  19. docs/bugs/archive/P1_NO_SYNTHESIS_FREE_TIER.md +0 -165
  20. docs/bugs/archive/P1_SIMPLE_MODE_REMOVED_BREAKS_FREE_TIER_UX.md +0 -61
  21. docs/bugs/archive/P1_SYNTHESIS_BROKEN_KEY_FALLBACK.md +0 -163
  22. docs/bugs/archive/P2_7B_MODEL_GARBAGE_OUTPUT.md +0 -266
  23. docs/bugs/archive/P2_ADVANCED_MODE_COLD_START_NO_FEEDBACK.md +0 -255
  24. docs/bugs/archive/P2_ARCHITECTURAL_BYOK_GAPS.md +0 -100
  25. docs/bugs/archive/P2_DUPLICATE_REPORT_CONTENT.md +0 -151
  26. docs/bugs/archive/P2_EXECUTOR_COMPLETED_EVENT_UI_NOISE.md +0 -351
  27. docs/bugs/archive/P2_FIRST_TURN_TIMEOUT.md +0 -160
  28. docs/bugs/archive/P2_GRADIO_EXAMPLE_NOT_FILLING.md +0 -68
  29. docs/bugs/archive/P2_ROUND_COUNTER_SEMANTIC_MISMATCH.md +0 -321
  30. docs/bugs/archive/P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md +0 -23
  31. docs/bugs/archive/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md +0 -150
  32. docs/bugs/archive/P3_MAGENTIC_NO_TERMINATION_EVENT.md +0 -177
  33. docs/bugs/archive/P3_MODAL_INTEGRATION_REMOVAL.md +0 -78
  34. docs/bugs/archive/P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md +0 -160
docs/bugs/ACTIVE_BUGS.md CHANGED

@@ -1,83 +1,27 @@
  # Active Bugs

  > Last updated: 2025-12-06
- >
- > **Note:** Completed bug docs archived to `docs/bugs/archive/`
- > **See also:** [ARCHITECTURE.md](../ARCHITECTURE.md) for unified architecture plan

  ---

- ## Currently Active Bugs
-
- ### P3 - Progress Bar Positioning in ChatInterface
-
- **File:** `docs/bugs/P3_PROGRESS_BAR_POSITIONING.md`
- **Status:** OPEN - Low Priority UX Polish
-
- **Problem:** The `gr.Progress()` bar renders in a strange position when used inside ChatInterface, causing visual overlap with chat messages.
-
- **Recommended Fix:** Remove `gr.Progress()` entirely and rely on emoji status messages in chat output.
-
- ---
-
- ## Resolved Bugs (December 2025)
-
- All resolved bugs have been moved to `docs/bugs/archive/`. Summary:
-
- ### P0 Bugs (All FIXED)
- - **P0 MCP ToolUseContent Missing** - FIXED, requirements.txt missing `mcp>=1.23.0` pin (HF Spaces crashed)
- - **P0 Repr Bug** - FIXED in PR #117 via Accumulator Pattern
- - **P0 AIFunction Not JSON Serializable** - FIXED, full tool support for HuggingFace
- - **P0 HuggingFace Tool Calling Broken** - FIXED, history serialization + Accumulator Pattern
- - **P0 Simple Mode Forced Synthesis Bypass** - N/A, simple.py deleted (Unified Architecture)
- - **P0 Synthesis Provider Mismatch** - FIXED, auto-detect in judges.py
- - **P0 Advanced Mode Timeout No Synthesis** - FIXED, actual synthesis on timeout
-
- ### P1 Bugs (All FIXED)
- - **P1 No Synthesis Free Tier** - FIXED in PR fix/p1-forced-synthesis, forced synthesis safety net when ReportAgent doesn't run
- - **P1 Free Tier Tool Execution Failure** - FIXED in PR fix/P1-free-tier-tool-execution, removed premature marker
- - **P1 Gradio Example Click Auto-Submits** - FIXED in PR #120, prevents auto-submit on example click
- - **P1 HuggingFace Router 401 Hyperbolic** - FIXED, invalid token was root cause
- - **P1 HuggingFace Novita 500 Error** - SUPERSEDED, switched to 7B model
- - **P1 Advanced Mode Uninterpretable Chain-of-Thought** - FIXED in PR #107
- - **P1 Synthesis Broken Key Fallback** - FIXED in PR #103
-
- ### P2 Bugs (All FIXED)
-
- - **P2 ExecutorCompletedEvent UI Noise** - FIXED in PR #133, silenced internal framework events
- - **P2 Round Counter Semantic Mismatch** - FIXED in PR #132, semantic progress tracking
- - **P2 Duplicate Report Content** - FIXED in PR fix/p2-double-bug-squash, stateful deduplication in `run()` loop
- - **P2 First Turn Timeout** - FIXED in PR fix/p2-double-bug-squash, reduced results per tool (10→5), increased timeout (5→10 min)
- - **P2 7B Model Garbage Output** - SUPERSEDED by P1 Free Tier fix (root cause was premature marker, not model capacity)
- - **P2 Advanced Mode Cold Start No Feedback** - FIXED, all phases complete
- - **P2 Architectural BYOK Gaps** - FIXED, end-to-end BYOK support in PR #119
-
- ### P3 Tech Debt (All RESOLVED)
-
- - **P3 Remove Anthropic Partial Wiring** - DONE in PR #130, all Anthropic code removed
- - **P3 Remove Modal Integration** - DONE in PR #130, all Modal code removed (~1400 lines deleted)
+ ## P3 - Progress Bar Positioning in ChatInterface
+
+ **File:** [P3_PROGRESS_BAR_POSITIONING.md](./P3_PROGRESS_BAR_POSITIONING.md)
+ **Status:** OPEN
+ **Priority:** Low (cosmetic UX issue)
+
+ **Problem:** `gr.Progress()` conflicts with ChatInterface, causing the progress bar to float/overlap with chat messages.
+
+ **Fix:** Remove `gr.Progress()` entirely and rely on emoji status messages in chat output.

  ---

  ## How to Report Bugs

  1. Create `docs/bugs/P{N}_{SHORT_NAME}.md`
- 2. Include: Symptom, Root Cause, Fix Plan, Test Plan
- 3. Update this index
- 4. Priority: P0=blocker, P1=important, P2=UX, P3=edge case/tech debt
+ 2. Add entry to this file
+ 3. Priority: P0=blocker, P1=important, P2=UX, P3=cosmetic

  ---

- ## Archived Documentation
-
- The following have been moved to `docs/bugs/archive/`:
-
- - All resolved P0-P2 bug reports
- - Code quality audit findings (2025-11-30)
- - Gradio example vs chat arrow analysis
-
- Additional documentation moved:
-
- - `HF_FREE_TIER_ANALYSIS.md` → `docs/architecture/`
- - `TOOL_ANALYSIS_CRITICAL.md` → `docs/future-roadmap/`
- - `P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md` → `docs/future-roadmap/`
+ *Historical bugs are preserved in the [v0.1.0 release tag](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/releases/tag/v0.1.0).*
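The fix the new doc settles on — emoji status messages instead of `gr.Progress()` — boils down to streaming status strings from the chat callback itself. A minimal sketch of that pattern, assuming a generator-based `gr.ChatInterface` callback; the function body and status text here are illustrative, not the app's actual `research_agent()`:

```python
import time
from collections.abc import Iterator

import gradio as gr

def research_agent(message: str, history: list) -> Iterator[str]:
    # Each yield replaces the pending bot message, so appending lines gives a
    # growing status log inside the chat bubble -- no gr.Progress() widget
    # competing with ChatInterface for layout.
    status = "🚀 **STARTED**: Starting research..."
    yield status
    time.sleep(1)  # stand-in for real search work
    status += "\n⏳ **THINKING**: Gathering evidence..."
    yield status
    time.sleep(1)
    yield status + "\n✅ **COMPLETE**: Report ready."

demo = gr.ChatInterface(fn=research_agent)
```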
docs/bugs/archive/AUDIT_FINDINGS_2025_11_30.md DELETED

@@ -1,70 +0,0 @@
- # Code Quality Audit Findings - 2025-11-30
-
- **Auditor:** Senior Staff Engineer (Gemini)
- **Date:** 2025-11-30
- **Scope:** `src/` (services, tools, agents, orchestrators)
- **Focus:** Configuration validation, Error handling, Defensive programming anti-patterns
-
- ## Summary
-
- The codebase is generally clean and modern, but exhibits specific anti-patterns related to configuration management and defensive error handling. The most critical finding is the reliance on manual `os.getenv` calls and "silent default" fallbacks which obscure configuration errors, directly contributing to the `OpenAIError` observed in production.
-
- ## Findings
-
- ### 1. Defensive Pass Block (Silent Failure) - MEDIUM
- **File:** `src/services/statistical_analyzer.py:246-247`
- ```python
- try:
-     min_p = min(float(p) for p in p_values)
-     # ... logic ...
- except ValueError:
-     pass
- ```
- **Problem:** If p-values are found by regex but fail to parse, the error is swallowed silently. This makes debugging parser issues impossible.
- **Fix:** Replace `pass` with `logger.warning("Failed to parse p-values: %s", p_values)` to aid debugging.
-
- ### 2. Missing Pydantic Validation (Manual Config) - MEDIUM
- **File:** `src/tools/code_execution.py:75-76`
- ```python
- self.modal_token_id = os.getenv("MODAL_TOKEN_ID")
- self.modal_token_secret = os.getenv("MODAL_TOKEN_SECRET")
- ```
- **Problem:** Secrets are manually fetched from env vars, bypassing the centralized `Settings` validation.
- **Fix:** Move to `src/utils/config.py` in the `Settings` class and inject `settings` into `ModalCodeExecutor`.
-
- ### 3. Broad Exception Swallowing - MEDIUM
- **File:** `src/tools/pubmed.py:129-130`
- ```python
- except Exception:
-     continue  # Skip malformed articles
- ```
- **Problem:** Catching `Exception` hides potential bugs (like `NameError` or `TypeError` in our own code), not just malformed data.
- **Fix:** Catch specific exceptions (e.g., `(KeyError, AttributeError, TypeError)`) OR log the error before continuing: `logger.debug(f"Skipping malformed article {pmid}: {e}")`.
-
- ### 4. Missing Pydantic Validation (UI Layer) - LOW
- **File:** `src/app.py:115, 119`
- ```python
- elif os.getenv("OPENAI_API_KEY"):
-     # ...
- elif os.getenv("ANTHROPIC_API_KEY"):
- ```
- **Problem:** Application logic relies on raw environment variable checks to determine available backends, creating duplication and potential inconsistency with `config.py`.
- **Fix:** Centralize this logic in `src/utils/config.py` (e.g., `settings.has_openai`, `settings.has_anthropic`).
-
- ### 5. Try/Except for Flow Control - LOW
- **File:** `src/tools/code_execution.py:244-249`
- ```python
- try:
-     start_idx = text.index(start_marker) + len(start_marker)
-     # ...
- except ValueError:
-     return text.strip()
- ```
- **Problem:** Using exceptions for expected "not found" cases is slower and less explicit.
- **Fix:** Use `find()` which returns `-1` on failure.
-
- ## Action Plan
-
- 1. **Refactor Configuration:** Eliminate `os.getenv` in favor of `src/utils/config.py` `Settings` model.
- 2. **Fix Error Handling:** Remove empty `pass` blocks; add logging.
- 3. **Address P0 Bug:** Fix the `OpenAIError` in synthesis (caused by Finding #4/General Config issue) by injecting the correct model into the orchestrator.
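Finding #5's suggested rewrite is small enough to show in full. A hedged sketch of the `find()`-based version, reusing the `text`/`start_marker` names from the snippet above; the helper name and the `end_marker` parameter are illustrative, not the actual code in `code_execution.py`:

```python
def extract_between(text: str, start_marker: str, end_marker: str) -> str:
    """Return the text between two markers, falling back to the whole text."""
    start = text.find(start_marker)
    if start == -1:
        # Marker absent: an explicit branch instead of a ValueError handler.
        return text.strip()
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return text[start:].strip()
    return text[start:end].strip()
```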
docs/bugs/archive/GRADIO_EXAMPLE_VS_CHAT_ARROW_ANALYSIS.md DELETED

@@ -1,147 +0,0 @@
- # Gradio Example Click vs Chat Arrow - Code Path Analysis
-
- **Status**: ANALYZED - NOT A BUG (Same code path, different timing)
- **Priority**: N/A (Symptom of upstream repr bug)
- **Analyzed**: 2025-12-01
- **Related**: P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md
-
- ---
-
- ## Symptom Reported
-
- User observed two different outputs when:
- 1. **Clicking an Example** → Shows progress at 10%, "THINKING" message
- 2. **Clicking Chat Arrow** → Shows full 5 rounds with repr garbage
-
- User suspected divergent code paths from vestigial Simple Mode deletion.
-
- ---
-
- ## Analysis: NO DIVERGENT CODE PATHS
-
- ### Code Trace
-
- Both Example Click and Chat Arrow use **the exact same code path**:
-
- ```text
- User Action (Example OR Chat Arrow)
-
- app.py:research_agent()                ← SAME FUNCTION
-
- app.py:configure_orchestrator()        ← SAME FUNCTION (mode="advanced" always)
-
- factory.py:create_orchestrator()       ← SAME FUNCTION
-
- factory.py:_determine_mode()           ← ALWAYS returns "advanced"
-
- AdvancedOrchestrator                   ← SAME CLASS
-
- clients/factory.py:get_chat_client()   ← SAME FUNCTION
-
- HuggingFaceChatClient (no API key) OR OpenAIChatClient (with API key)
- ```
-
- ### Evidence from Code
-
- **app.py:279-325 - ChatInterface Setup:**
- ```python
- demo = gr.ChatInterface(
-     fn=research_agent,  # ← SAME FUNCTION FOR BOTH
-     examples=[
-         ["What drugs improve female libido post-menopause?", "sexual_health", None, None],
-         # ...
-     ],
-     # ...
- )
- ```
-
- **factory.py:76-90 - Mode Determination:**
- ```python
- def _determine_mode(explicit_mode: str | None) -> str:
-     if explicit_mode == "hierarchical":
-         return "hierarchical"
-     # "simple" is deprecated -> upgrade to "advanced"
-     # "magentic" is alias for "advanced"
-     return "advanced"  # ← ALWAYS ADVANCED
- ```
-
- ---
-
- ## Explanation of Visual Difference
-
- The difference the user observed is **timing**, not code paths:
-
- | Screenshot | When Captured | Content |
- |------------|---------------|---------|
- | Example Click | Mid-execution | Progress bar at 10%, "THINKING" |
- | Chat Arrow | After completion | Full 5 rounds with repr garbage |
-
- **Both show the same process at different stages.**
-
- The repr garbage (`<agent_framework._types.ChatMessage object at 0x...>`) appears in BOTH:
- - Example Click: Would show repr garbage if captured after completion
- - Chat Arrow: Shows repr garbage because it was captured after completion
-
- ---
-
- ## The Real Bug: Upstream repr Issue
-
- The repr garbage is the **upstream Microsoft Agent Framework bug** documented in:
- - `docs/bugs/P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md`
-
- **Root cause in upstream code:**
- ```python
- # agent_framework/_workflows/_magentic.py line ~1799
- text = last.text or str(last)  # BUG: str(last) gives repr for tool-only messages
- ```
-
- **Our workaround in advanced.py:**
- ```python
- def _extract_text(self, message: Any) -> str:
-     # Filter out repr strings
-     if isinstance(message, str) and message.startswith("<") and "object at" in message:
-         return ""
-     # ...
- ```
-
- ---
-
- ## Verification
-
- 1. **No vestigial Simple Mode code** - `simple.py` is deleted, not imported anywhere
- 2. **Factory always returns AdvancedOrchestrator** - verified in `factory.py:66-73`
- 3. **Same research_agent function** - Gradio routes both Example and Chat Arrow through it
-
- ---
-
- ## Conclusion
-
- **There are NO divergent code paths.** The unified architecture is correctly implemented:
-
- | Component | Status |
- |-----------|--------|
- | Simple Mode | ✅ DELETED (no vestigial code) |
- | Factory Pattern | ✅ Always returns AdvancedOrchestrator |
- | Chat Client Factory | ✅ Auto-selects HuggingFace (free) or OpenAI (paid) |
- | Example Click | ✅ Uses same `research_agent()` function |
- | Chat Arrow Click | ✅ Uses same `research_agent()` function |
-
- **The only bug is the upstream repr display issue**, which affects BOTH paths equally.
-
- ---
-
- ## Next Steps
-
- 1. **Wait for upstream fix** - [PR #2566](https://github.com/microsoft/agent-framework/pull/2566)
- 2. **Once merged**: `uv add agent-framework@latest`
- 3. **Test**: Verify both Example Click and Chat Arrow work identically
-
- ---
-
- ## References
-
- - `src/app.py` - Line 134-247 (`research_agent()`)
- - `src/app.py` - Line 279-325 (ChatInterface with examples)
- - `src/orchestrators/factory.py` - Line 43-73 (`create_orchestrator()`)
- - `src/clients/factory.py` - Line 15-76 (`get_chat_client()`)
- - `docs/bugs/P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md` - Upstream repr bug details
docs/bugs/archive/P0_ADVANCED_MODE_TIMEOUT_NO_SYNTHESIS.md DELETED

@@ -1,307 +0,0 @@
- # P0 - Advanced Mode Timeout Yields False "Synthesizing" Message
-
- **Status:** RESOLVED
- **Priority:** P0 (Blocker for Advanced/Magentic mode)
- **Found:** 2025-11-30 (Manual Testing)
- **Resolved:** 2025-11-30
- **Component:** `src/orchestrators/advanced.py`
-
- ## Resolution Summary
-
- The issue where Advanced Mode timeouts produced a fake synthesis message has been fully resolved.
- We implemented a robust fallback mechanism that synthesizes a report from collected evidence upon timeout.
-
- ### Fix Details
-
- 1. **Implemented `ResearchMemory.get_context_summary()`**:
-    - Added missing method to `src/services/research_memory.py`.
-    - Generates a structured summary of hypotheses and top 20 evidence items.
-    - Enables the ReportAgent to function even without a formal handoff from JudgeAgent.
-
- 2. **Fixed Factory Configuration**:
-    - Updated `src/orchestrators/factory.py` to use `settings.advanced_max_rounds` (default 5).
-    - Previously used global `max_iterations` (default 10), causing workflows to run 2x longer than intended and hitting timeouts.
-
- 3. **Implemented Timeout Synthesis Logic**:
-    - Updated `src/orchestrators/advanced.py` to catch `TimeoutError`.
-    - Now retrieves `get_context_summary()` from memory.
-    - Directly invokes `ReportAgent` to generate a final report from available evidence.
-    - Yields the actual report content instead of a static placeholder message.
-
- ### Verification
-
- - **Unit Tests**: `tests/unit/orchestrators/test_advanced_timeout.py` verifies:
-   - Timeout triggers synthesis (mocked ReportAgent is called).
-   - Factory correctly sets `max_rounds=5`.
- - **Manual Verification**:
-   - Confirmed logic flow via TDD.
-   - SearchAgent verbosity mitigated by reduced round count (5 rounds = ~20KB context vs 40KB+).
-
- ---
-
- ## Symptom (Archive)
-
- When using Advanced mode (Magentic/Multi-Agent) with an OpenAI API key, the workflow:
-
- 1. Starts correctly ("Starting research (Advanced mode)")
- 2. Shows "Multi-agent reasoning in progress (10 rounds max)"
- 3. Streams SearchAgent results successfully
- 4. Shows "Round 1/10" progress
- 5. Then hangs for ~5 minutes (timeout period)
- 6. Finally shows: **"Research timed out. Synthesizing available evidence..."**
- 7. **BUT NO SYNTHESIS OCCURS** - the output ends there
-
- User sees massive streaming output from SearchAgent but NO final research report.
-
- ## Observed Output
-
- ```text
- 🚀 **STARTED**: Starting research (Advanced mode): Clinical trials for PDE5 inhibitors alternatives?
- ⏳ **THINKING**: Multi-agent reasoning in progress (10 rounds max)...
- 🧠 **JUDGING**: Manager (user_task): Research sexual health and wellness interventions...
- 📡 **STREAMING**: [MASSIVE SearchAgent output - 10KB+ of clinical trial data]
- ⏱️ **PROGRESS**: Round 1/10 (~6m 45s remaining)
- 📚 **SEARCH_COMPLETE**: searcher: Below is a structured evidence dataset...
-
- Research timed out. Synthesizing available evidence...
- [END - Nothing more happens]
- ```
-
- ## Root Cause Analysis
-
- ### Bug Location: `src/orchestrators/advanced.py:254-261`
-
- ```python
- except TimeoutError:
-     logger.warning("Workflow timed out", iterations=iteration)
-     yield AgentEvent(
-         type="complete",
-         message="Research timed out. Synthesizing available evidence...",  # <-- LIE
-         data={"reason": "timeout", "iterations": iteration},
-         iteration=iteration,
-     )
- ```
-
- **The message is a lie.** It says "Synthesizing available evidence..." but:
- 1. No synthesis code is called
- 2. The `MagenticState` (containing gathered evidence) is never accessed
- 3. The `ReportAgent` is never invoked
- 4. User just sees the raw streaming output
-
- ### Secondary Issue: Workflow Never Progresses Past Round 1
-
- The SearchAgent produces a MASSIVE response (10KB+) in Round 1, but the workflow appears to stall and never delegate to:
- - HypothesisAgent
- - JudgeAgent
- - ReportAgent
-
- This suggests the Manager agent may be:
- 1. Overwhelmed by the verbose SearchAgent output
- 2. Stuck in a decision loop
- 3. Not receiving proper signals to delegate to next agent
-
- ### Configuration Issue: Wrong `max_rounds` Used
-
- **File:** `src/orchestrators/factory.py:93-97`
-
- ```python
- return orchestrator_cls(
-     max_rounds=effective_config.max_iterations,  # <-- Uses max_iterations (10)
-     api_key=api_key,
-     domain=domain,
- )
- ```
-
- The factory passes `max_iterations` (10) instead of using `settings.advanced_max_rounds` (5).
- This means timeout is more likely since workflows run longer.
-
- ## Impact
-
- - **User Experience:** After waiting 5+ minutes, users get NO useful output
- - **Demo Killer:** Advanced mode is effectively broken for external users
- - **Misleading UX:** Message claims synthesis is happening when it's not
-
- ## Proposed Fix
-
- ### Fix 1: Implement Actual Timeout Synthesis
-
- **File:** `src/orchestrators/advanced.py`
-
- ```python
- except TimeoutError:
-     logger.warning("Workflow timed out", iterations=iteration)
-
-     # ACTUALLY synthesize from gathered evidence
-     try:
-         from src.agents.state import get_magentic_state
-         from src.agents.magentic_agents import create_report_agent
-
-         state = get_magentic_state()
-         memory: ResearchMemory = state.memory
-
-         # Get evidence summary from memory
-         evidence_summary = await memory.get_context_summary()
-
-         # Create and invoke ReportAgent for synthesis
-         report_agent = create_report_agent(self._chat_client, domain=self.domain)
-         synthesis_result = await report_agent.invoke(
-             f"Synthesize research report from this evidence:\n{evidence_summary}"
-         )
-
-         yield AgentEvent(
-             type="complete",
-             message=synthesis_result,
-             data={"reason": "timeout_synthesis", "iterations": iteration},
-             iteration=iteration,
-         )
-     except Exception as synth_error:
-         logger.error("Timeout synthesis failed", error=str(synth_error))
-         yield AgentEvent(
-             type="complete",
-             message=(
-                 f"Research timed out after {iteration} rounds. "
-                 f"Evidence gathered but synthesis failed: {synth_error}"
-             ),
-             data={"reason": "timeout_synthesis_failed", "iterations": iteration},
-             iteration=iteration,
-         )
- ```
-
- ### Fix 2: Address SearchAgent Verbosity
-
- The SearchAgent is producing large outputs (~4KB per search, accumulating to 40KB+ over 10 rounds), which overwhelms the Manager's context window.
- Consider:
- 1. Limiting SearchAgent output length further (currently 300 chars/result)
- 2. Summarizing results before returning to Manager
- 3. Using structured output format instead of prose
-
- ### Fix 3: Use Correct max_rounds
-
- **File:** `src/orchestrators/factory.py`
-
- ```python
- # Use advanced-specific setting, not max_iterations
- return orchestrator_cls(
-     max_rounds=settings.advanced_max_rounds,  # 5 by default
-     api_key=api_key,
-     domain=domain,
- )
- ```
-
- ### Fix 4: Implement `get_context_summary` in ResearchMemory
-
- **File:** `src/services/research_memory.py`
-
- The `ResearchMemory` class is missing the `get_context_summary` method required by Fix 1.
-
- ```python
- async def get_context_summary(self) -> str:
-     """Generate a summary of all collected evidence for the final report."""
-     if not self.evidence_ids:
-         return "No evidence collected."
-
-     summary = [f"Research Query: {self.query}\n"]
-
-     # Add Hypotheses
-     if self.hypotheses:
-         summary.append("## Hypotheses")
-         for h in self.hypotheses:
-             summary.append(f"- {h.drug} -> {h.target}: {h.effect} (Conf: {h.confidence})")
-         summary.append("")
-
-     # Add Top Evidence (limit to avoid token overflow)
-     # We use get_all_evidence() but might need to summarize if too large
-     evidence = self.get_all_evidence()
-     summary.append(f"## Evidence ({len(evidence)} items)")
-
-     # Group by source for cleaner summary
-     for i, ev in enumerate(evidence[:20], 1):  # Limit to top 20 items
-         summary.append(f"{i}. {ev.citation.title} ({ev.citation.date})")
-         summary.append(f"   {ev.content[:200]}...")  # Brief snippet
-
-     return "\n".join(summary)
- ```
-
- ## Call Stack Trace
-
- ```
- app.py:research_agent()
-   → configure_orchestrator(mode="advanced")
-     → factory.py:create_orchestrator()
-       → AdvancedOrchestrator(max_rounds=10)  # Should be 5
-
-   → orchestrator.run(query)
-     → advanced.py:run()
-       → init_magentic_state(query)
-       → workflow = _build_workflow()  # MagenticBuilder
-       → async for event in workflow.run_stream(task):
-           # SearchAgent runs (accumulates 4KB+ per round)
-           # Manager receives, but never delegates further
-           # TimeoutError after 300 seconds
-       → except TimeoutError:
-           → yield AgentEvent(message="Synthesizing...")  # LIE - no synthesis
- ```
-
- ## Files to Modify
-
- | File | Change |
- |------|--------|
- | `src/orchestrators/advanced.py:254-261` | Implement actual synthesis on timeout |
- | `src/orchestrators/factory.py:93-97` | Use `settings.advanced_max_rounds` |
- | `src/services/research_memory.py` | Implement `get_context_summary()` method |
- | `src/agents/magentic_agents.py` | Consider limiting SearchAgent output |
-
- ## Test Plan
-
- ### Unit Tests
-
- ```python
- # tests/unit/orchestrators/test_advanced_timeout.py
-
- @pytest.mark.asyncio
- async def test_timeout_synthesizes_evidence():
-     """Timeout should produce synthesis, not empty message."""
-     orchestrator = AdvancedOrchestrator(
-         max_rounds=1,
-         timeout_seconds=0.1,  # Force immediate timeout
-         api_key="sk-test",
-     )
-
-     events = [e async for e in orchestrator.run("test query")]
-     complete_event = [e for e in events if e.type == "complete"][-1]
-
-     # Should contain synthesis, not just "timed out"
-     assert "Research timed out" not in complete_event.message or \
-         len(complete_event.message) > 100  # Actual content present
-
- @pytest.mark.asyncio
- async def test_factory_uses_advanced_max_rounds():
-     """Factory should use settings.advanced_max_rounds for advanced mode."""
-     orchestrator = create_orchestrator(
-         mode="advanced",
-         api_key="sk-test",
-     )
-     assert orchestrator._max_rounds == settings.advanced_max_rounds
- ```
-
- ### Manual Verification
-
- 1. Set `OPENAI_API_KEY` and run app
- 2. Select "Advanced" mode
- 3. Submit: "Clinical trials for PDE5 inhibitors alternatives?"
- 4. Wait for completion or timeout
- 5. **Verify:** Final output contains synthesized report (not just "timed out" message)
-
- ## Related Issues
-
- - This may be related to the SearchAgent being too verbose
- - The Magentic pattern expects agents to produce concise outputs
- - Microsoft Agent Framework's Manager may struggle with 10KB+ messages
-
- ## Priority Justification
-
- **P0 because:**
- 1. Advanced mode is a major selling point (multi-agent, deep research)
- 2. Users with paid API keys expect it to work
- 3. The current behavior is deceptive (claims synthesis, delivers nothing)
- 4. Demo credibility is destroyed when users wait 5min for nothing
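The report repeatedly refers to a `TimeoutError` raised after ~300 seconds but never shows how that budget is enforced around `workflow.run_stream()`. A minimal sketch of one way to do it with `asyncio.wait_for`; the helper name and the 300-second default are assumptions, not the orchestrator's actual code:

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Any

async def stream_with_timeout(
    stream: AsyncIterator[Any], timeout_seconds: float = 300.0
) -> AsyncIterator[Any]:
    """Yield events from `stream`; raise TimeoutError once the budget is spent."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_seconds
    while True:
        remaining = deadline - loop.time()
        try:
            # wait_for raises TimeoutError when `remaining` elapses (or is <= 0),
            # which the orchestrator's `except TimeoutError:` block then catches.
            event = await asyncio.wait_for(anext(stream), timeout=remaining)
        except StopAsyncIteration:
            return
        yield event
```

The consuming loop would then read `async for event in stream_with_timeout(workflow.run_stream(task)):` inside the same `try` that owns the `except TimeoutError:` fallback shown in Fix 1.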
docs/bugs/archive/P0_AIFUNCTION_NOT_JSON_SERIALIZABLE.md DELETED

@@ -1,225 +0,0 @@
- # P0 Bug: AIFunction Not JSON Serializable (Free Tier Broken)
-
- **Severity**: P0 (Critical) - Free Tier cannot perform research
- **Status**: RESOLVED
- **Discovered**: 2025-12-01
- **Resolved**: 2025-12-01
- **Reporter**: Production user via HuggingFace Spaces
-
- ## Symptom
-
- Every search round fails with:
- ```
- 📚 SEARCH_COMPLETE: searcher: Agent searcher: Error processing request -
- Object of type AIFunction is not JSON serializable
- ```
-
- Research never completes. Users see 5 rounds of the same error.
-
- ## Root Cause
-
- ### The Problem
-
- In `src/clients/huggingface.py` lines 82-103:
-
- ```python
- # Extract tool configuration
- tools = chat_options.tools if chat_options.tools else None  # AIFunction objects!
- ...
- call_fn = partial(
-     self._client.chat_completion,
-     messages=hf_messages,
-     tools=tools,  # <-- RAW AIFunction objects passed here
-     ...
- )
- ```
-
- The `chat_options.tools` contains `AIFunction` objects from Microsoft's agent-framework.
- When `requests` tries to serialize these for the HTTP request, it fails:
- ```
- TypeError: Object of type AIFunction is not JSON serializable
- ```
-
- ### Why This Happens
-
- 1. Microsoft's agent-framework defines tools as `AIFunction` objects
- 2. `ChatAgent` with tools passes them via `chat_options.tools`
- 3. Our `HuggingFaceChatClient` forwards them directly to `InferenceClient.chat_completion()`
- 4. `requests.post()` internally calls `json.dumps()` on the request body
- 5. `AIFunction` has no `__json__()` method and isn't a dict → TypeError
-
- ## Impact
-
- | Component | Impact |
- |-----------|--------|
- | Free Tier (HuggingFace) | **COMPLETELY BROKEN** |
- | Advanced Mode without API key | **Cannot do research** |
- | Paid Tier (OpenAI) | Unaffected (OpenAI handles AIFunction) |
-
- ## Professional Fix (Full Implementation)
-
- Qwen2.5-72B-Instruct **SUPPORTS** function calling via HuggingFace. The fix requires:
-
- 1. **Request Serialization**: Convert `AIFunction` → OpenAI-compatible JSON
- 2. **Response Parsing**: Convert HuggingFace `tool_calls` → Framework `FunctionCallContent`
-
- ### Part 1: Tool Serialization (`_convert_tools`)
-
- ```python
- def _convert_tools(self, tools: list[Any] | None) -> list[dict[str, Any]] | None:
-     """Convert AIFunction objects to OpenAI-compatible tool definitions.
-
-     AIFunction.to_dict() returns:
-         {'type': 'ai_function', 'name': '...', 'description': '...', 'input_model': {...}}
-
-     OpenAI/HuggingFace expects:
-         {'type': 'function', 'function': {'name': '...', 'description': '...', 'parameters': {...}}}
-     """
-     if not tools:
-         return None
-
-     json_tools = []
-     for tool in tools:
-         if hasattr(tool, 'to_dict'):
-             t_dict = tool.to_dict()
-             json_tools.append({
-                 "type": "function",
-                 "function": {
-                     "name": t_dict["name"],
-                     "description": t_dict.get("description", ""),
-                     "parameters": t_dict["input_model"]
-                 }
-             })
-         elif isinstance(tool, dict):
-             json_tools.append(tool)
-         else:
-             logger.warning(f"Skipping non-serializable tool: {type(tool)}")
-
-     return json_tools if json_tools else None
- ```
-
- ### Part 2: Response Parsing (Tool Calls → FunctionCallContent)
-
- When HuggingFace returns tool calls, we must convert them to the framework's format:
-
- ```python
- from agent_framework._types import FunctionCallContent
-
- # In _inner_get_response, after getting the response:
- choice = choices[0]
- message = choice.message
- message_content = message.content or ""
-
- # Parse tool calls if present
- contents: list[Any] = []
- if hasattr(message, 'tool_calls') and message.tool_calls:
-     for tc in message.tool_calls:
-         # HF returns: tc.id, tc.function.name, tc.function.arguments
-         contents.append(FunctionCallContent(
-             call_id=tc.id,
-             name=tc.function.name,
-             arguments=tc.function.arguments  # JSON string or dict
-         ))
-
- response_msg = ChatMessage(
-     role=cast(Any, message.role),
-     text=message_content,
-     contents=contents if contents else None
- )
- ```
-
- ### Verified Schema Mapping
-
- ```python
- # AIFunction.to_dict() output (verified 2025-12-01):
- {
-     "type": "ai_function",
-     "name": "search_pubmed",
-     "description": "Search PubMed for biomedical research papers...",
-     "input_model": {
-         "properties": {"query": {"title": "Query", "type": "string"}, ...},
-         "required": ["query"],
-         "type": "object"
-     }
- }
-
- # Mapped to OpenAI format:
- {
-     "type": "function",
-     "function": {
-         "name": "search_pubmed",
-         "description": "Search PubMed for biomedical research papers...",
-         "parameters": {
-             "properties": {"query": {"title": "Query", "type": "string"}, ...},
-             "required": ["query"],
-             "type": "object"
-         }
-     }
- }
- ```
-
- ## Call Stack Trace
-
- ```
- User Query (HuggingFace Spaces)
-
- src/app.py:research_agent()
-
- src/orchestrators/advanced.py:AdvancedOrchestrator.run()
-
- agent_framework.MagenticBuilder.run_stream()
-
- agent_framework.ChatAgent (SearchAgent with tools=[search_pubmed, ...])
-
- src/clients/huggingface.py:HuggingFaceChatClient._inner_get_response()
-   → chat_options.tools contains AIFunction objects
-
- huggingface_hub.InferenceClient.chat_completion(tools=tools)
-
- requests.post(json={..., "tools": [AIFunction, ...]})
-
- json.dumps() → TypeError: Object of type AIFunction is not JSON serializable
- ```
-
- ## Testing
-
- ```bash
- # Reproduce locally (remove OpenAI key)
- unset OPENAI_API_KEY
- uv run python -c "
- import asyncio
- from src.orchestrators.advanced import AdvancedOrchestrator
-
- async def test():
-     orch = AdvancedOrchestrator(max_rounds=2)
-     async for event in orch.run('testosterone benefits'):
-         print(f'[{event.type}] {str(event.message)[:50]}...')
-
- asyncio.run(test())
- "
-
- # Expected BEFORE fix: TypeError: Object of type AIFunction is not JSON serializable
- # Expected AFTER fix: Research completes with tool calls working
- ```
-
- ## Resolution
-
- Implemented full function calling support for HuggingFace client:
-
- 1. **Request Serialization**: Added `_convert_tools` to map `AIFunction` schemas to OpenAI-compatible JSON.
- 2. **Response Parsing (Sync)**: Added `_parse_tool_calls` to convert HF `tool_calls` to `FunctionCallContent`.
- 3. **Response Parsing (Async)**: Implemented tool call accumulator in `_inner_get_streaming_response` to handle partial tool call deltas and yield valid `FunctionCallContent` objects.
-
- ## Verification
-
- Verified with unit tests and manual simulation:
-
- 1. **Serialization**: Confirmed `AIFunction` -> JSON conversion works for `search_pubmed`.
- 2. **Streaming**: Verified that fragmented tool call deltas (e.g., `{"query":` then `"testosterone"}`) are correctly reassembled into a single `FunctionCallContent`.
- 3. **Integration**: Passed project-level `make check`.
-
- ## References
-
- - [HuggingFace Chat Completion - Function Calling](https://huggingface.co/docs/inference-providers/tasks/chat-completion)
- - [Qwen Function Calling](https://qwen.readthedocs.io/en/latest/framework/function_call.html)
- - [Microsoft Agent Framework - AIFunction](https://learn.microsoft.com/en-us/python/api/agent-framework-core/agent_framework.aifunction)
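The Resolution above mentions a "tool call accumulator" for partial streaming deltas but never shows one. A hedged sketch of the idea, assuming OpenAI-style streamed `tool_calls` deltas that carry an `index`, a one-time `id`/`function.name`, and argument JSON in fragments; the class and field names are illustrative, not the client's actual code:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCallAccumulator:
    """Reassemble fragmented streaming tool-call deltas into complete calls."""

    calls: dict[int, dict[str, str]] = field(default_factory=dict)

    def add_delta(self, delta: Any) -> None:
        for tc in getattr(delta, "tool_calls", None) or []:
            slot = self.calls.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
            if getattr(tc, "id", None):
                slot["id"] = tc.id  # arrives once, on the first fragment
            fn = getattr(tc, "function", None)
            if fn is not None and getattr(fn, "name", None):
                slot["name"] = fn.name
            if fn is not None and getattr(fn, "arguments", None):
                # Argument JSON arrives in pieces ('{"query":' then '"testosterone"}');
                # concatenate in arrival order and parse only when the stream ends.
                slot["arguments"] += fn.arguments

    def completed(self) -> list[dict[str, str]]:
        return [self.calls[i] for i in sorted(self.calls)]
```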
docs/bugs/archive/P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md DELETED

@@ -1,173 +0,0 @@
- # P0 Bug: HuggingFace Free Tier Tool Calling Broken
-
- **Severity**: P0 (Critical) - Free Tier cannot perform multi-turn tool-based research
- **Status**: PARTIALLY RESOLVED - Bug #1 FIXED, Bug #2 requires upstream fix
- **Discovered**: 2025-12-01
- **Investigator**: Claude Code (Systematic First-Principles Analysis)
- **Last Updated**: 2025-12-01
-
- ## Executive Summary
-
- The HuggingFace Free Tier had two critical bugs preventing end-to-end tool-based research:
-
- 1. **Bug #1 (FIXED)**: Conversation history serialization missing `tool_calls` and `tool_call_id`
- 2. **Bug #2 (UPSTREAM)**: Microsoft Agent Framework produces repr strings instead of message text
-
- ## Current Status
-
- | Bug | Status | Location | Fix |
- |-----|--------|----------|-----|
- | #1 History Serialization | ✅ **FIXED** | `src/clients/huggingface.py` | Commit `809ad60` |
- | #2 Framework Repr Bug | ⏳ **UPSTREAM** | `agent_framework/_workflows/_magentic.py` | [Issue #2562](https://github.com/microsoft/agent-framework/issues/2562) |
-
- ---
-
- ## BUG #1: Conversation History Serialization ✅ FIXED
-
- ### What Was Wrong
- `_convert_messages()` didn't serialize `tool_calls` (for assistant messages) or `tool_call_id` (for tool messages).
-
- ### The Fix (Commit `809ad60`)
- Updated `_convert_messages()` in `src/clients/huggingface.py:71-121` to:
- 1. Extract `FunctionCallContent` from `msg.contents` → `tool_calls` array
- 2. Extract `FunctionResultContent` from `msg.contents` → `tool_call_id`
- 3. Properly format for HuggingFace/OpenAI API
-
- ### Verification
- ```python
- # Before fix: BadRequestError on multi-turn
- # After fix: Multi-turn conversations work
-
- # The message format is now correct:
- {
-     "role": "assistant",
-     "content": "",
-     "tool_calls": [{"id": "call_123", "type": "function", "function": {...}}]
- }
- ```
-
- ---
-
- ## BUG #2: Framework Message Corruption ⏳ UPSTREAM
-
- ### Symptom
- `MagenticAgentMessageEvent.message.text` contains:
- ```text
- '<agent_framework._types.ChatMessage object at 0x10c394210>'
- ```
-
- ### Root Cause (CONFIRMED)
- **File**: `agent_framework/_workflows/_magentic.py` line ~1799
-
- ```python
- async def _invoke_agent(self, ctx, ...) -> ChatMessage:
-     # ...
-     if messages and len(messages) > 0:
-         last: ChatMessage = messages[-1]
-         text = last.text or str(last)  # <-- BUG: str(last) gives repr!
-         msg = ChatMessage(role=role, text=text, author_name=author)
- ```
-
- **Why it happens**:
- 1. `ChatMessage.text` property only extracts `TextContent` items
- 2. Tool-call-only messages have empty `.text` (returns `""`)
- 3. `"" or str(last)` evaluates to `str(last)`
- 4. `ChatMessage` has no `__str__` method → default Python repr
-
- ### Impact Assessment
-
- | Aspect | Impact | Critical? |
- |--------|--------|-----------|
- | UI Display | Shows garbage instead of agent output | YES for UX |
- | Logging | Can't debug what agents did | YES for debugging |
- | Tool Execution | Tools ARE being called (middleware works) | NO - Works |
- | Research Completion | Manager may not track progress properly | MAYBE - Unclear |
-
- **Observed behavior**: Research loops often reach max rounds without synthesis. The Manager keeps saying "no progress" even though tools ARE being called. This COULD be:
- 1. The repr bug affecting Manager's understanding
- 2. Qwen 72B not handling tool message format well
- 3. Unrelated orchestration issue
-
- ### Upstream Issue Filed
- **GitHub Issue**: [microsoft/agent-framework#2562](https://github.com/microsoft/agent-framework/issues/2562)
-
- **Suggested fixes in issue**:
- 1. **Minimal**: `text = last.text or ""`
- 2. **Better UX**: Format tool calls for display
- 3. **Best**: Add `__str__` to `ChatMessage` class
-
- ### Workaround (Implemented in `advanced.py`)
- We modified `_extract_text()` in `advanced.py` to extract tool call names from `.contents` when text is empty or looks like a repr:
-
- ```python
- def _extract_text(self, message: Any) -> str:
-     # ... existing logic with repr filtering ...
-
-     # Workaround: Extract tool call info when text is repr/empty
-     if hasattr(message, "contents") and message.contents:
-         tool_names = [
-             f"[Tool: {c.name}]"
-             for c in message.contents
-             if hasattr(c, "name")  # FunctionCallContent
-         ]
-         if tool_names:
-             return " ".join(tool_names)
-
-     return ""
- ```
-
- **Decision**: Implemented locally to fix display and logging while we wait for upstream fix.
-
- ---
-
- ## Verification Matrix (Updated)
-
- | Component | Status | Notes |
- |-----------|--------|-------|
- | Tool Serialization | ✅ WORKS | `_convert_tools()` |
- | Tool Call Parsing | ✅ WORKS | `_parse_tool_calls()` |
- | History Serialization | ✅ **FIXED** | `_convert_messages()` |
- | Middleware Decorators | ✅ **FIXED** | `@use_function_invocation` etc. |
- | Event Display | ❌ UPSTREAM | Shows repr - framework bug |
- | End-to-End Research | ⚠️ UNCLEAR | Needs testing after upstream fix |
-
- ---
-
- ## Files Changed
-
- ### Fixed (Commit `809ad60`)
- - `src/clients/huggingface.py`
-   - `_convert_messages()` - Now serializes `tool_calls` and `tool_call_id`
-   - Added `@use_function_invocation`, `@use_observability`, `@use_chat_middleware` decorators
-   - Added `__function_invoking_chat_client__ = True` marker
-
- ### Also Fixed
- - `src/orchestrators/advanced.py` - `_extract_text()` now filters repr strings AND extracts tool call names
-
- ---
-
- ## Related Upstream Issues
-
- | Issue | Title | Status | Relevance |
- |-------|-------|--------|-----------|
- | [#2562](https://github.com/microsoft/agent-framework/issues/2562) | Repr string bug (OUR ISSUE) | OPEN | Direct cause |
- | [#1366](https://github.com/microsoft/agent-framework/issues/1366) | Thread corruption - unexecuted tool calls | OPEN | Same area |
- | [#2410](https://github.com/microsoft/agent-framework/issues/2410) | OpenAI client splits content/tool_calls | OPEN | Related bug |
-
- ---
-
- ## Next Steps
-
- 1. **Monitor**: Watch for response to [Issue #2562](https://github.com/microsoft/agent-framework/issues/2562)
- 2. **Test**: Run end-to-end research tests to see if Bug #2 actually blocks completion
- 3. **Optional**: Implement workaround in `_extract_text()` if display is critical
- 4. **Contribute**: Consider submitting PR to fix `_magentic.py` line 1799
-
- ---
-
- ## References
-
- - [HuggingFace Chat Completion API - Tool Use](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion)
- - [OpenAI Function Calling](https://platform.openai.com/docs/guides/function-calling)
- - [Microsoft Agent Framework Repository](https://github.com/microsoft/agent-framework)
- - [Our Upstream Issue #2562](https://github.com/microsoft/agent-framework/issues/2562)
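Bug #1's fix is described in three steps but only the resulting JSON is shown. A hedged sketch of what a `_convert_messages()` along those lines might look like, assuming framework `ChatMessage` objects whose `.contents` may hold `FunctionCallContent` / `FunctionResultContent` items; attribute names are assumptions and the sketch is not verified against commit `809ad60`:

```python
import json
from typing import Any

def convert_messages(messages: list[Any]) -> list[dict[str, Any]]:
    """Map framework ChatMessages to HuggingFace/OpenAI-style chat dicts."""
    out: list[dict[str, Any]] = []
    for msg in messages:
        entry: dict[str, Any] = {"role": str(msg.role), "content": msg.text or ""}
        tool_calls: list[dict[str, Any]] = []
        for content in getattr(msg, "contents", None) or []:
            kind = type(content).__name__
            if kind == "FunctionCallContent":
                args = content.arguments
                tool_calls.append({
                    "id": content.call_id,
                    "type": "function",
                    "function": {
                        "name": content.name,
                        "arguments": args if isinstance(args, str) else json.dumps(args),
                    },
                })
            elif kind == "FunctionResultContent":
                # Tool results become role="tool" messages keyed by tool_call_id.
                entry["role"] = "tool"
                entry["tool_call_id"] = content.call_id
                entry["content"] = str(content.result)
        if tool_calls:
            entry["tool_calls"] = tool_calls
        out.append(entry)
    return out
```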
docs/bugs/archive/P0_MCP_TOOLUSECONTENT_MISSING.md DELETED

@@ -1,88 +0,0 @@
- # P0 Bug: mcp.types.ToolUseContent AttributeError on HuggingFace Spaces
-
- **Status**: FIXED
- **Severity**: P0 (App completely broken)
- **Discovered**: 2025-12-04
- **Fixed**: 2025-12-04 (PR TBD)
-
- ---
-
- ## Symptom
-
- HuggingFace Spaces deployment crashes with:
-
- ```
- module 'mcp.types' has no attribute 'ToolUseContent'
- ```
-
- The app fails to start entirely. No functionality works.
-
- ---
-
- ## Root Cause
-
- **Dependency version mismatch between `pyproject.toml` and `requirements.txt`.**
-
- | File | MCP Pin | Result |
- |------|---------|--------|
- | `pyproject.toml` | `mcp>=1.23.0` | Correct - has `ToolUseContent` |
- | `requirements.txt` | (missing) | Pulls old MCP via `gradio[mcp]` transitive dep |
-
- **Background:**
- - `ToolUseContent` was added in MCP spec **2025-11-25** via **SEP-1577 (Sampling With Tools)**
- - Our pyproject.toml correctly pins `mcp>=1.23.0` (for security fix GHSA-9h52-p55h-vw2f)
- - HuggingFace Spaces uses `requirements.txt`, NOT `pyproject.toml`
- - `gradio[mcp]>=6.0.0` pulls in MCP as a transitive dependency
- - Without an explicit pin, Gradio was pulling an older MCP version lacking `ToolUseContent`
-
- ---
-
- ## Fix
-
- Added explicit MCP pin to `requirements.txt`:
-
- ```diff
-  # UI (Gradio with MCP server support - 6.0 required for css in launch())
-  gradio[mcp]>=6.0.0
- +
- +# Security: Pin mcp to fix GHSA-9h52-p55h-vw2f and ensure ToolUseContent exists
- +mcp>=1.23.0
- ```
-
- Also synced ALL dependencies between `pyproject.toml` and `requirements.txt` to prevent future drift.
-
- ---
-
- ## Changes Made
-
- **Files modified:**
- - `requirements.txt` - Full sync with `pyproject.toml`:
-   - Added `mcp>=1.23.0` (root cause fix)
-   - Added `beautifulsoup4>=4.12` (was missing)
-   - Fixed `huggingface-hub>=0.24.0` (was 0.20.0)
-   - Added upper bound to `agent-framework-core>=1.0.0b251120,<2.0.0`
-   - Added sync header comment with date
-
- ---
-
- ## Prevention
-
- 1. **Sync header**: `requirements.txt` now has "Last synced: YYYY-MM-DD" comment
- 2. **CI check**: Consider adding a pre-commit hook to validate requirements.txt matches pyproject.toml
-
- ---
-
- ## References
-
- - [MCP Python SDK Releases](https://github.com/modelcontextprotocol/python-sdk/releases)
- - [MCP Spec 2025-11-25 - Sampling With Tools](https://modelcontextprotocol.io/specification/2025-11-25/client/sampling)
- - [GHSA-9h52-p55h-vw2f](https://github.com/advisories/GHSA-9h52-p55h-vw2f) - MCP security advisory
-
- ---
-
- ## Verification
-
- After fix:
- 1. Deploy to HuggingFace Spaces
- 2. Verify app starts without errors
- 3. Verify MCP server responds at `/gradio_api/mcp/`
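The Prevention section floats a pre-commit hook to catch pyproject/requirements drift. A minimal sketch of such a check — flag any dependency name present in one file but not the other; the file paths and the name-level comparison policy are assumptions, not an existing project script:

```python
#!/usr/bin/env python3
"""Fail if requirements.txt and pyproject.toml disagree on dependency names."""
import sys
import tomllib

from packaging.requirements import Requirement

def req_names(lines: list[str]) -> set[str]:
    names = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            names.add(Requirement(line).name.lower())
    return names

def main() -> int:
    with open("pyproject.toml", "rb") as f:
        pyproject = req_names(tomllib.load(f)["project"]["dependencies"])
    with open("requirements.txt") as f:
        requirements = req_names(f.readlines())
    drift = pyproject ^ requirements  # symmetric difference = drift in either file
    if drift:
        print(f"Dependency drift between pyproject.toml and requirements.txt: {sorted(drift)}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```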
docs/bugs/archive/P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md DELETED

@@ -1,144 +0,0 @@
- # P0 Bug Report: Orchestrator Dedup + Judge Failures
-
- ## Status
- - **Date:** 2025-11-29
- - **Priority:** P0 (Blocker - Simple mode broken on HF Spaces)
- - **Component:** `src/orchestrator.py`, `src/agent_factory/judges.py`
- - **Resolution:** FIXED in commits `5e761eb`, `2588375`
-
- ---
-
- ## Symptoms
-
- When running Simple mode (free tier) on HuggingFace Spaces:
-
- 1. **Judge always returns 0% confidence** → loops forever with "continue"
- 2. **Deduplication removes ALL evidence** after iteration 1
- 3. **Never synthesizes** → user sees infinite loop
-
- ### Example Output
-
- ```
- 📚 SEARCH_COMPLETE: Found 20 new sources (19 total)        ← Iteration 1 OK
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%)   ← FAIL: 0% = fallback
-
- 📚 SEARCH_COMPLETE: Found 12 new sources (11 total)        ← Iteration 2 BROKEN
- ...
- 📚 SEARCH_COMPLETE: Found 31 new sources (0 total)         ← 0 TOTAL = all removed!
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%)   ← Still failing
- ```
-
- ---
-
- ## Root Cause Analysis
-
- ### Bug 1: Semantic Deduplication Removes Old Evidence
-
- **File:** `src/orchestrator.py:213-219`
-
- ```python
- # URL dedup (correct)
- seen_urls = {e.citation.url for e in all_evidence}
- unique_new = [e for e in new_evidence if e.citation.url not in seen_urls]
- all_evidence.extend(unique_new)
-
- # BUG: Passes ALL evidence (including old) to semantic dedup
- all_evidence = await self._deduplicate_and_rank(all_evidence, query)
- ```
-
- **Problem:** The `deduplicate()` function checks each item against the vector store. Items from iteration 1 are ALREADY in the store. When re-checked in iteration 2+, they find THEMSELVES (distance ≈ 0) and are removed as "duplicates".
-
- **Result:** After iteration 1, evidence count drops to 0.
-
- ### Bug 2: HF Inference Judge Always Failing
-
- **File:** `src/agent_factory/judges.py:186-254`
-
- **Evidence:** Judge returns this every time:
- - `confidence: 0.0`
- - `recommendation: "continue"`
- - Next queries are just the original query with suffixes
-
- This is the `_create_fallback_assessment()` response, meaning:
- - The HF Inference API calls are failing
- - All 3 fallback models (Llama, Mistral, Zephyr) are failing
- - Likely due to rate limits, quota, or model availability
-
- ---
-
- ## The Fix
-
- ### Fix 1: Only Dedup NEW Evidence (not all_evidence)
-
- ```python
- # Before (broken)
- all_evidence.extend(unique_new)
- all_evidence = await self._deduplicate_and_rank(all_evidence, query)
-
- # After (fixed)
- # Only dedup the NEW evidence against the store
- if unique_new:
-     unique_new = await self._deduplicate_new_evidence(unique_new, query)
-     all_evidence.extend(unique_new)
- ```
-
- Or simpler - disable semantic dedup until we fix it properly:
-
- ```python
- # Disable broken semantic dedup
- # all_evidence = await self._deduplicate_and_rank(all_evidence, query)
- ```
-
- ### Fix 2: Handle HF Inference Failures Gracefully
-
- Option A: After N failed judge calls, force synthesize with available evidence
- Option B: Increase retry count or add longer backoff
- Option C: Fall back to MockJudgeHandler (which DOES work) after failures
-
- ```python
- # In _create_fallback_assessment, track failures
- if self._consecutive_failures >= 3:
-     # Force synthesis instead of infinite loop
-     return JudgeAssessment(
-         sufficient=True,  # STOP
-         confidence=0.1,
-         recommendation="synthesize",
-         ...
-     )
- ```
-
- ---
-
- ## Test Plan
-
- - [ ] Disable semantic dedup OR fix to only process new items
- - [ ] Verify evidence accumulates across iterations (not drops to 0)
- - [ ] Test HF Inference with fresh HF_TOKEN
- - [ ] If HF keeps failing, fall back to MockJudgeHandler
- - [ ] Verify "synthesize" is eventually reached
- - [ ] Deploy and test on HF Space
-
- ---
-
- ## Priority Justification
-
- **P0** because:
- - Simple mode (free tier) is the DEFAULT experience
- - Currently produces infinite loop with no output
- - Users see "confidence: 0%" and think tool is broken
- - Blocks hackathon demo for users without API keys
-
- ---
-
- ## Quick Workaround
-
- Disable semantic dedup by setting `enable_embeddings=False` in orchestrator creation:
-
- ```python
- orchestrator = create_orchestrator(
-     ...
-     enable_embeddings=False,  # Disable broken dedup
- )
- ```
-
- Or users can enter an OpenAI/Anthropic API key to bypass HF Inference issues.
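Bug 1's failure mode (items matching themselves in the vector store) implies a specific ordering: check each new item against the store before inserting it. A hedged sketch of the `_deduplicate_new_evidence()` shape Fix 1 names; the store API (`similarity_search`, `add`) and the distance threshold are assumptions, not the repo's actual interface:

```python
from typing import Any

async def deduplicate_new_evidence(
    store: Any, new_evidence: list[Any], threshold: float = 0.05
) -> list[Any]:
    """Keep only items that are not near-duplicates of already-stored evidence."""
    unique = []
    for item in new_evidence:
        # Query FIRST, against evidence from earlier iterations only.
        matches = await store.similarity_search(item.content, k=1)
        if matches and matches[0].distance < threshold:
            continue  # semantic duplicate of something already collected
        # Insert AFTER the check, so the item can never match itself later.
        await store.add(item)
        unique.append(item)
    return unique
```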
docs/bugs/archive/P0_REPR_BUG_ROOT_CAUSE_ANALYSIS.md DELETED
@@ -1,99 +0,0 @@
- # P0: Event Handling Implementation Spec
-
- **Status**: FIXED
- **Priority**: P0
- **Source of Truth**: `reference_repos/microsoft-agent-framework/python/samples/autogen-migration/orchestrations/04_magentic_one.py`
-
- ---
-
- ## Root Cause (One Sentence)
-
- We were extracting content from `MagenticAgentMessageEvent.message` — **the wrong event type** — instead of using `MagenticAgentDeltaEvent.text` as the sole source of streaming content.
-
- ---
-
- ## The Fix: Correct Event Handling Per Microsoft SSOT
-
- | Event Type | Correct Usage | What We Were Doing (Wrong) |
- |------------|---------------|----------------------------|
- | `MagenticAgentDeltaEvent` | **Extract `.text`** - This is the ONLY source of content | Partially used, not accumulated |
- | `MagenticAgentMessageEvent` | **Signal only** - Agent turn complete. IGNORE `.message` | Extracting `.message.text` (hits repr bug) |
- | `MagenticFinalResultEvent` | **Extract `.message.text`** - Final synthesis result | Correct |
-
- ---
-
- ## Implementation: Accumulator Pattern
-
- From Microsoft's `04_magentic_one.py` (lines 108-138):
-
- ```python
- # Microsoft's Pattern
- async for event in workflow.run_stream(task):
-     if isinstance(event, MagenticAgentDeltaEvent):
-         # STREAM CONTENT: Accumulate and display
-         if event.text:
-             print(event.text, end="", flush=True)
-
-     elif isinstance(event, MagenticAgentMessageEvent):
-         # SIGNAL ONLY: Agent done. Print newline. DO NOT read .message
-         print()
-
-     elif isinstance(event, MagenticFinalResultEvent):
-         # FINAL RESULT: Safe to read .message.text
-         print(event.message.text)
- ```
-
- ---
-
- ## Our Implementation (`src/orchestrators/advanced.py`)
-
- **Status**: ✅ IMPLEMENTED (lines 241-308)
-
- ```python
- # 1. Accumulate streaming content (ONLY source of truth)
- if isinstance(event, MagenticAgentDeltaEvent):
-     if event.text:
-         current_message_buffer += event.text
-         yield AgentEvent(type="streaming", message=event.text, ...)
-
- # 2. Use buffer on completion signal (IGNORE event.message)
- if isinstance(event, MagenticAgentMessageEvent):
-     text_content = current_message_buffer or "Action completed (Tool Call)"
-     yield AgentEvent(message=f"{agent_name}: {text_content[:200]}...", ...)
-     current_message_buffer = ""  # Reset for next agent
-
- # 3. Final result - safe to extract
- if isinstance(event, MagenticFinalResultEvent):
-     text = self._extract_text(event.message)
-     yield AgentEvent(type="complete", message=text, ...)
- ```
-
- ---
-
- ## Why This Eliminates the Repr Bug
-
- The repr bug occurs at `_magentic.py:1730`:
-
- ```python
- text = last.text or str(last)  # Falls back to repr() for tool-only messages
- ```
-
- By **never reading** `MagenticAgentMessageEvent.message.text`, we never hit this code path.
-
- **The repr bug is eliminated by correct implementation — no upstream fix required.**
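-
- A minimal sketch of why that fallback produces garbage (hypothetical stand-in class; illustrative only):
-
- ```python
- from dataclasses import dataclass, field
-
- @dataclass
- class ToolOnlyMessage:
-     """Stand-in for a chat message that carries tool calls but no text."""
-     text: str = ""
-     tool_calls: list = field(default_factory=list)
-
- last = ToolOnlyMessage(tool_calls=[{"name": "search_pubmed"}])
- text = last.text or str(last)  # "" is falsy, so this falls back to the repr
- print(text)  # ToolOnlyMessage(text='', tool_calls=[{'name': 'search_pubmed'}])
- ```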
-
- ---
-
- ## Verification Checklist
-
- - [x] `MagenticAgentDeltaEvent.text` used as sole content source
- - [x] `MagenticAgentMessageEvent` used as signal only (buffer consumed, not `.message`)
- - [x] `MagenticFinalResultEvent.message.text` extracted for final result
- - [x] Buffer reset on agent switch and completion
- - [x] Remove dead code path in `_process_event()` that still calls `_extract_text` on `MagenticAgentMessageEvent`
-
- ---
-
- ## Remaining Cleanup
-
- ✅ **DONE** - Dead code paths for `MagenticAgentMessageEvent` and `MagenticAgentDeltaEvent` have been removed from `_process_event()`. Comments now explain these events are handled by the Accumulator Pattern in `run()`.
 
docs/bugs/archive/P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS.md DELETED
@@ -1,59 +0,0 @@
- # P0 BUG: Simple Mode Synthesis Bypass (WILL BE FIXED BY UNIFIED ARCHITECTURE)
-
- **Status**: BLOCKED - Waiting for upstream PR #2566
- **Priority**: P0 (Demo-blocking)
- **Discovered**: 2025-12-01
- **GitHub Issue**: [#113](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/113)
-
- ---
-
- ## Current State
-
- **`simple.py` is DELETED.** This bug existed in the old Simple Mode code.
-
- The bug will NOT be fixed by restoring Simple Mode. Instead, it will be **automatically fixed** when we complete the unified architecture (after upstream PR #2566 merges).
-
- ---
-
- ## The Bug (Historical)
-
- When the HuggingFace Inference API failed, Simple Mode's `_should_synthesize()` ignored forced synthesis signals due to overly strict thresholds.
-
- ```text
- ✅ JUDGE_COMPLETE: Assessment: synthesize (confidence: 10%)
- 🔄 LOOPING: Gathering more evidence... ← BUG: Should have synthesized!
- ```
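-
- An illustrative reconstruction of the failure mode (names and thresholds are hypothetical, not the exact deleted code):
-
- ```python
- def _should_synthesize(assessment) -> bool:
-     # Buggy: gates on confidence even when the judge explicitly recommends
-     # synthesis, so a forced "synthesize" with low confidence (e.g. 10%)
-     # is ignored and the loop keeps gathering evidence.
-     return assessment.recommendation == "synthesize" and assessment.confidence >= 0.5
-
- # Correct behavior would honor the explicit signal unconditionally:
- def should_synthesize_fixed(assessment) -> bool:
-     return assessment.recommendation == "synthesize"
- ```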
-
- ---
-
- ## Why Unified Architecture Fixes This
-
- | Architecture | How Termination Works |
- |--------------|----------------------|
- | **Old (Simple Mode)** | Custom `_should_synthesize()` with buggy thresholds |
- | **New (Unified)** | Manager agent respects "SUFFICIENT EVIDENCE" signals |
-
- The Manager agent in Advanced Mode already works correctly. By completing the unified architecture with HuggingFace support, we inherit that correct behavior.
-
- **No need to patch `_should_synthesize()` because the code is deleted.**
-
- ---
-
- ## Path Forward
-
- 1. **Wait** for upstream PR #2566 to merge (fixes repr bug)
- 2. **Update** `agent-framework` dependency
- 3. **Verify** Advanced Mode + HuggingFace works
- 4. **Done** - This bug is gone (no `_should_synthesize()` thresholds)
-
- ---
-
- ## Related
-
- | Reference | Description |
- |-----------|-------------|
- | [ARCHITECTURE.md](../ARCHITECTURE.md) | Current state and unified plan |
- | [SPEC_16](../specs/SPEC_16_UNIFIED_CHAT_CLIENT_ARCHITECTURE.md) | Unified architecture spec |
- | [Issue #105](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/105) | GitHub tracking |
- | [Upstream #2562](https://github.com/microsoft/agent-framework/issues/2562) | Framework bug |
- | [Upstream PR #2566](https://github.com/microsoft/agent-framework/pull/2566) | Framework fix |
 
docs/bugs/archive/P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md DELETED
@@ -1,254 +0,0 @@
- # P0 Bug Report: Simple Mode Never Synthesizes
-
- ## Status
- - **Date:** 2025-11-29
- - **Priority:** P0 (Blocker - Simple mode produces useless output)
- - **Component:** `src/orchestrators/simple.py`, `src/agent_factory/judges.py`, `src/prompts/judge.py`
- - **Environment:** Simple mode **WITHOUT OpenAI key** (HuggingFace Inference free tier)
-
- ---
-
- ## Symptoms
-
- When running Simple mode with a real research question:
-
- 1. **Judge never recommends "synthesize"** even with 455 sources and 90% confidence
- 2. **Confidence drops to 0%** in late iterations (API failures or context overflow)
- 3. **Search derails** to tangential topics (bone health, muscle mass instead of libido)
- 4. **Max iterations reached** → User gets garbage output (just citations, no synthesis)
-
- ### Example Output (Real Run)
-
- ```
- 🔍 SEARCHING: What drugs improve female libido post-menopause?
- 📚 SEARCH_COMPLETE: Found 30 new sources (30 total)
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 70%) ← Never "synthesize"
-
- ... 8 more iterations ...
-
- 📚 SEARCH_COMPLETE: Found 10 new sources (429 total)
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%) ← API failure?
-
- 📚 SEARCH_COMPLETE: Found 26 new sources (455 total)
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%) ← Still failing
-
- ## Partial Analysis (Max Iterations Reached) ← GARBAGE OUTPUT
- ### Question
- What drugs improve female libido post-menopause?
- ### Status
- Maximum search iterations reached.
- ### Citations
- 1. [Tribulus terrestris and female reproductive...]
- 2. ...
- ---
- *Consider searching with more specific terms* ← NO SYNTHESIS AT ALL
- ```
-
- ---
-
- ## Root Cause Analysis
-
- ### Bug 1: Judge Never Says "sufficient=True"
-
- **File:** `src/prompts/judge.py:22-25`
-
- ```python
- 3. **Sufficiency**: Evidence is sufficient when:
-    - Combined scores >= 12 AND
-    - At least one specific drug candidate identified AND
-    - Clear mechanistic rationale exists
- ```
-
- **Problem:** The prompt is too conservative. With 455 sources spanning testosterone, DHEA, estrogen, oxytocin, etc., the judge should have identified candidates and said "synthesize". But:
-
- 1. LLM may not be extracting drug candidates from evidence properly
- 2. The "AND" conditions are too strict - evidence can be "good enough" without hitting all criteria
- 3. The recommendation "continue" seems to be the default state
-
- **Evidence:** Output shows 70-90% confidence but still "continue" - the judge is confident but never satisfied.
-
- ### Bug 2: Confidence Drops to 0% (Late Iteration Failures)
-
- **File:** `src/agent_factory/judges.py:150-183`
-
- The `_create_fallback_assessment()` returns:
- - `confidence: 0.0`
- - `recommendation: "continue"`
-
- **Problem:** In iterations 9-10, something failed:
- - Context too long (455 sources × ~1500 chars = 680K chars → token limit exceeded)
- - API rate limit hit
- - Network timeout
-
- **Evidence:** Confidence went from 80%→0%→0% in final iterations - this is the fallback response.
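-
- A minimal sketch of the fallback shape described above (field names taken from the doc; the return type is simplified to a dict for illustration):
-
- ```python
- def _create_fallback_assessment() -> dict:
-     # Any LLM failure (timeout, rate limit, context overflow) becomes a
-     # zero-confidence "continue", so failures silently extend the search
-     # loop instead of surfacing an error or forcing synthesis.
-     return {"sufficient": False, "confidence": 0.0, "recommendation": "continue"}
- ```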
84
-
85
- ### Bug 3: Search Derailment
86
-
87
- **Evidence from logs:**
88
- ```
89
- Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
90
- Next searches: testosterone therapy in postmenopausal women, mechanisms of testosterone...
91
- ```
92
-
93
- **Problem:** Judge's `next_search_queries` drift off-topic. "Bone health" and "muscle mass" are tangential to "female libido". The judge should stay focused on the original question.
94
-
95
- ### Bug 4: Partial Synthesis is Garbage
96
-
97
- **File:** `src/orchestrators/simple.py:432-470`
98
-
99
- ```python
100
- def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
101
- """Generate a partial synthesis when max iterations reached."""
102
- citations = "\n".join([...]) # Just citations
103
-
104
- return f"""## Partial Analysis (Max Iterations Reached)
105
- ### Question
106
- {query}
107
- ### Status
108
- Maximum search iterations reached. The evidence gathered may be incomplete.
109
- ### Evidence Collected
110
- Found {len(evidence)} sources.
111
- ### Citations
112
- {citations}
113
- ---
114
- *Consider searching with more specific terms*
115
- """
116
- ```
117
-
118
- **Problem:** When max iterations reached, we have 455 sources but output NO analysis. We should:
119
- 1. Force a synthesis call to the LLM
120
- 2. Or at minimum generate drug candidates/findings from the last good assessment
121
- 3. Not just dump citations and give up
122
-
123
- ---
124
-
125
- ## The Fix
126
-
127
- ### Fix 1: Lower the Bar for "synthesize"
128
-
129
- **Option A:** Change prompt to be less strict:
130
- ```python
131
- SYSTEM_PROMPT = """...
132
- 3. **Sufficiency**: Evidence is sufficient when:
133
- - Combined scores >= 10 (was 12) OR
134
- - Confidence >= 80% with drug candidates identified OR
135
- - 5+ iterations completed with 100+ sources
136
- """
137
- ```
138
-
139
- **Option B:** Add iteration-based heuristic in orchestrator:
140
- ```python
141
- # If we have lots of evidence and high confidence, force synthesis
142
- if iteration >= 5 and len(all_evidence) > 100 and assessment.confidence > 0.7:
143
- assessment.sufficient = True
144
- assessment.recommendation = "synthesize"
145
- ```
146
-
147
- ### Fix 2: Handle Context Overflow
148
-
149
- **File:** `src/agent_factory/judges.py`
150
-
151
- Before sending to LLM, cap evidence:
152
- ```python
153
- async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
154
- # Cap at 50 most recent/relevant to avoid token overflow
155
- if len(evidence) > 50:
156
- evidence = evidence[:50] # Or use embedding similarity to select best 50
157
- ```
158
-
159
- ### Fix 3: Keep Search Focused
160
-
161
- **File:** `src/prompts/judge.py`
162
-
163
- Add to prompt:
164
- ```python
165
- SYSTEM_PROMPT = """...
166
- ## Search Query Rules
167
-
168
- When suggesting next_search_queries:
169
- - Stay focused on the ORIGINAL question
170
- - Do NOT drift to tangential topics (e.g., don't search "bone health" for a libido question)
171
- - Refine existing good terms, don't explore random associations
172
- """
173
- ```
174
-
175
- ### Fix 4: Generate Real Synthesis on Max Iterations
176
-
177
- **File:** `src/orchestrators/simple.py`
178
-
179
- ```python
180
- def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
181
- """Generate a REAL synthesis when max iterations reached."""
182
-
183
- # Get the last assessment's data (if available)
184
- last_assessment = self.history[-1]["assessment"] if self.history else None
185
-
186
- drug_candidates = last_assessment.get("details", {}).get("drug_candidates", []) if last_assessment else []
187
- key_findings = last_assessment.get("details", {}).get("key_findings", []) if last_assessment else []
188
-
189
- drug_list = "\n".join([f"- **{d}**" for d in drug_candidates]) or "- See sources below for candidates"
190
- findings_list = "\n".join([f"- {f}" for f in key_findings[:5]]) or "- Review citations for findings"
191
-
192
- citations = "\n".join([
193
- f"{i + 1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
194
- for i, e in enumerate(evidence[:10])
195
- ])
196
-
197
- return f"""## Drug Repurposing Analysis (Partial)
198
-
199
- ### Question
200
- {query}
201
-
202
- ### Status
203
- ⚠️ Maximum iterations reached. Analysis based on {len(evidence)} sources.
204
-
205
- ### Drug Candidates Identified
206
- {drug_list}
207
-
208
- ### Key Findings
209
- {findings_list}
210
-
211
- ### Top Citations ({len(evidence)} sources)
212
- {citations}
213
-
214
- ---
215
- *Analysis may be incomplete. Consider refining query or adding API key for better results.*
216
- """
217
- ```
218
-
219
- ---
220
-
221
- ## Test Plan
222
-
223
- - [ ] Verify judge says "synthesize" within 5 iterations for good queries
224
- - [ ] Test with 500+ sources to ensure no token overflow
225
- - [ ] Verify search stays on-topic (no bone/muscle tangents for libido query)
226
- - [ ] Verify partial synthesis shows drug candidates (not just citations)
227
- - [ ] Test with MockJudgeHandler to confirm issue is in LLM behavior
228
- - [ ] Add unit test: `test_judge_synthesizes_with_good_evidence`
229
-
230
- ---
231
-
232
- ## Priority Justification
233
-
234
- **P0** because:
235
- - Simple mode is the DEFAULT for users without API keys
236
- - 455 sources found but ZERO useful output generated
237
- - User waited 10 iterations just to get a citation dump
238
- - Makes the tool look completely broken
239
- - Blocks hackathon demo effectiveness
240
-
241
- ---
242
-
243
- ## Immediate Workaround
244
-
245
- 1. Use **Advanced mode** (requires OpenAI key) - it has its own synthesis logic
246
- 2. Or use **fewer iterations** (MAX_ITERATIONS=3) to hit partial synthesis faster
247
- 3. Or manually review the citations (they ARE relevant, just not synthesized)
248
-
249
- ---
250
-
251
- ## Related Issues
252
-
253
- - `P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md` - Fixed dedup issue, but synthesis problem persists
254
- - `ACTIVE_BUGS.md` - Update when this is resolved
 
docs/bugs/archive/P0_SYNTHESIS_PROVIDER_MISMATCH.md DELETED
@@ -1,273 +0,0 @@
- # P0 - Systemic Provider Mismatch Across All Modes
-
- **Status:** RESOLVED
- **Priority:** P0 (Blocker for Free Tier/Demo)
- **Found:** 2025-11-30 (during Audit)
- **Resolved:** 2025-11-30
- **Component:** Multiple files across orchestrators, agents, services
-
- ## Resolution Summary
-
- The critical provider mismatch bug has been fixed by implementing auto-detection in `src/agent_factory/judges.py`.
- The `get_model()` function now checks for actual API key availability (`has_openai_key`, `has_anthropic_key`, `has_huggingface_key`)
- instead of relying on the static `settings.llm_provider` configuration.
-
- ### Fix Details
-
- - **Auto-Detection Implemented**: `get_model()` prioritizes OpenAI > Anthropic > HuggingFace based on *available keys*.
- - **Fail-Fast on No Keys**: If no API keys are configured, `get_model()` raises `ConfigurationError` with a clear message.
- - **HuggingFace Requires Token**: Free Tier via `HuggingFaceModel` requires `HF_TOKEN` (PydanticAI requirement).
- - **Synthesis Fallback**: When `get_model()` fails, synthesis gracefully falls back to the template.
- - **Audit Fixes Applied**:
-   - Replaced manual `os.getenv` checks with centralized `settings` properties in `src/app.py`.
-   - Added logging to `src/services/statistical_analyzer.py` (fixed silent `pass`).
-   - Narrowed exception handling in `src/tools/pubmed.py`.
-   - Optimized string search in `src/tools/code_execution.py`.
-
- ### Key Clarification
-
- The **Free Tier** in Simple Mode uses `HFInferenceJudgeHandler` (which uses `huggingface_hub.InferenceClient`)
- for judging - this does NOT require `HF_TOKEN`. However, synthesis via `get_model()` uses PydanticAI's
- `HuggingFaceModel` which DOES require `HF_TOKEN`. When no tokens are configured, synthesis falls back to
- the template-based summary (which is still useful).
-
- ### Verification
-
- - **Unit Tests**: 5 new TDD tests in `tests/unit/agent_factory/test_get_model_auto_detect.py` pass.
- - **All Tests**: 309 tests pass (`make check` succeeds).
- - **Regression Tests**: Fixed and verified `tests/unit/agent_factory/test_judges_factory.py`.
-
- ---
-
- ## Symptom (Archive)
-
- When running in "Simple Mode" (Free Tier / No API Key), the synthesis step fails to generate a narrative and falls back to a structured summary template. The user sees:
-
- ```text
- > ⚠️ Note: AI narrative synthesis unavailable. Showing structured summary.
- > _Error: OpenAIError_
- ```
-
- ## Affected Files (COMPREHENSIVE AUDIT)
-
- ### Files Calling `get_model()` Directly (8 locations)
-
- | File | Line | Context | Impact |
- |------|------|---------|--------|
- | `simple.py` | 547 | Synthesis step | Free Tier broken |
- | `statistical_analyzer.py` | 75 | Analysis agent | Free Tier broken |
- | `judge_agent_llm.py` | 18 | LLM Judge | Free Tier broken |
- | `graph/nodes.py` | 177 | LangGraph hypothesis | Free Tier broken |
- | `graph/nodes.py` | 249 | LangGraph synthesis | Free Tier broken |
- | `report_agent.py` | 45 | Report generation | Free Tier broken |
- | `hypothesis_agent.py` | 44 | Hypothesis generation | Free Tier broken |
- | `judges.py` | 100 | JudgeHandler default | OK (accepts param) |
-
- ### Files Hardcoding `OpenAIChatClient` (Architecturally OpenAI-Only)
-
- | File | Lines | Context |
- |------|-------|---------|
- | `advanced.py` | 100, 121 | Manager client |
- | `magentic_agents.py` | 29, 70, 129, 173 | All 4 agents |
- | `retrieval_agent.py` | 62 | Retrieval agent |
- | `code_executor_agent.py` | 52 | Code executor |
- | `llm_factory.py` | 42 | Factory default |
-
- **Note:** Advanced mode is architecturally locked to OpenAI via `agent_framework.openai.OpenAIChatClient`. This is by design - see `app.py:188-194` which falls back to Simple mode if no OpenAI key. However, users are not clearly informed of this limitation.
-
- ## Root Cause
-
- **Settings/Runtime Sync Gap - Two Separate Backend Selection Systems.**
-
- The codebase has **two independent** systems for selecting the LLM backend:
- 1. `settings.llm_provider` (config.py default: "openai")
- 2. `app.py` runtime detection via `os.getenv()` checks
-
- These are **never synchronized**, causing the Judge and Synthesis steps to use different backends.
-
- ### Detailed Call Chain
-
- 1. **`src/app.py:115-126`** (runtime detection):
-    ```python
-    # app.py bypasses settings entirely for JudgeHandler selection
-    elif os.getenv("OPENAI_API_KEY"):
-        judge_handler = JudgeHandler(model=None, domain=domain)
-    elif os.getenv("ANTHROPIC_API_KEY"):
-        judge_handler = JudgeHandler(model=None, domain=domain)
-    else:
-        judge_handler = HFInferenceJudgeHandler(domain=domain)  # Free Tier
-    ```
-    **Note:** This creates the correct handler but does NOT update `settings.llm_provider`.
-
- 2. **`src/orchestrators/simple.py:546-552`** (synthesis step):
-    ```python
-    from src.agent_factory.judges import get_model
-    agent: Agent[None, str] = Agent(model=get_model(), ...)  # <-- BUG!
-    ```
-    Synthesis calls `get_model()` directly instead of using the injected judge's model.
-
- 3. **`src/agent_factory/judges.py:56-78`** (`get_model()`):
-    ```python
-    def get_model() -> Any:
-        llm_provider = settings.llm_provider  # <-- Reads from settings (still "openai")
-        # ...
-        openai_provider = OpenAIProvider(api_key=settings.openai_api_key)  # <-- None!
-        return OpenAIChatModel(settings.openai_model, provider=openai_provider)
-    ```
-    **Result:** Creates OpenAI model with `api_key=None` → `OpenAIError`
-
- ### Why Free Tier Fails
-
- | Step | System Used | Backend Selected |
- |------|-------------|------------------|
- | JudgeHandler | `app.py` runtime | HFInferenceJudgeHandler ✅ |
- | Synthesis | `settings.llm_provider` | OpenAI (default) ❌ |
-
- The Judge works because app.py explicitly creates `HFInferenceJudgeHandler`.
- Synthesis fails because it calls `get_model()` which reads `settings.llm_provider = "openai"` (unchanged from default).
-
- ## Impact
-
- - **User Experience:** Free tier users (Demo users) never see the high-quality narrative synthesis, only the fallback.
- - **System Integrity:** The orchestrator ignores the runtime backend selection.
-
- ## Implemented Fix
-
- **Strategy: Fix `get_model()` to Auto-Detect Available Provider**
-
- ### Actual Implementation (Merged)
-
- **File:** `src/agent_factory/judges.py`
-
- This is the **single point of fix** that resolves all 7 broken `get_model()` call sites.
-
- ```python
- def get_model() -> Any:
-     """Get the LLM model based on available API keys.
-
-     Priority order:
-     1. OpenAI (if OPENAI_API_KEY set)
-     2. Anthropic (if ANTHROPIC_API_KEY set)
-     3. HuggingFace (if HF_TOKEN set)
-
-     Raises:
-         ConfigurationError: If no API keys are configured.
-
-     Note: settings.llm_provider is ignored in favor of actual key availability.
-     This ensures the model matches what app.py selected for JudgeHandler.
-     """
-     from src.utils.exceptions import ConfigurationError
-
-     # Priority 1: OpenAI (most common, best tool calling)
-     if settings.has_openai_key:
-         openai_provider = OpenAIProvider(api_key=settings.openai_api_key)
-         return OpenAIChatModel(settings.openai_model, provider=openai_provider)
-
-     # Priority 2: Anthropic
-     if settings.has_anthropic_key:
-         provider = AnthropicProvider(api_key=settings.anthropic_api_key)
-         return AnthropicModel(settings.anthropic_model, provider=provider)
-
-     # Priority 3: HuggingFace (requires HF_TOKEN)
-     if settings.has_huggingface_key:
-         model_name = settings.huggingface_model or "meta-llama/Llama-3.1-70B-Instruct"
-         hf_provider = HuggingFaceProvider(api_key=settings.hf_token)
-         return HuggingFaceModel(model_name, provider=hf_provider)
-
-     # No keys configured - fail fast with clear error
-     raise ConfigurationError(
-         "No LLM API key configured. Set one of: OPENAI_API_KEY, ANTHROPIC_API_KEY, or HF_TOKEN"
-     )
- ```
-
- **Why this works:**
- - Single fix location updates all 7 broken call sites
- - Matches app.py's detection logic (key availability, not settings.llm_provider)
- - HuggingFace works when HF_TOKEN is available
- - Raises clear error when no keys configured (callers can catch and fallback)
- - No changes needed to orchestrators, agents, or services
-
- ### What This Does NOT Fix (By Design)
-
- **Advanced Mode remains OpenAI-only.** The following files use `agent_framework.openai.OpenAIChatClient` which only supports OpenAI:
-
- - `advanced.py` (Manager + agents)
- - `magentic_agents.py` (SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent)
- - `retrieval_agent.py`, `code_executor_agent.py`
-
- This is **by design** - the Microsoft Agent Framework library (`agent-framework-core`) only provides `OpenAIChatClient`. To support other providers in Advanced mode would require:
- 1. Wait for `agent-framework` to add Anthropic/HuggingFace clients, OR
- 2. Write our own `ChatClient` implementations (significant effort)
-
- **The current app.py behavior is correct:** it falls back to Simple mode when no OpenAI key is present (lines 188-194). The UI message could be clearer about why.
-
- ## Test Plan (Implemented)
-
- ### Unit Tests (Verified Passing)
-
- ```python
- # tests/unit/agent_factory/test_get_model_auto_detect.py
-
- import pytest
-
- # Model classes asserted below (pydantic_ai module paths, matching the fix above)
- from pydantic_ai.models.anthropic import AnthropicModel
- from pydantic_ai.models.huggingface import HuggingFaceModel
- from pydantic_ai.models.openai import OpenAIChatModel
-
- from src.agent_factory.judges import get_model
- from src.utils.config import settings
- from src.utils.exceptions import ConfigurationError
-
- class TestGetModelAutoDetect:
-     """Test that get_model() auto-detects available providers."""
-
-     def test_returns_openai_when_key_present(self, monkeypatch):
-         """OpenAI key present → OpenAI model."""
-         monkeypatch.setattr(settings, "openai_api_key", "sk-test")
-         monkeypatch.setattr(settings, "anthropic_api_key", None)
-         monkeypatch.setattr(settings, "hf_token", None)
-         model = get_model()
-         assert isinstance(model, OpenAIChatModel)
-
-     def test_returns_anthropic_when_only_anthropic_key(self, monkeypatch):
-         """Only Anthropic key → Anthropic model."""
-         monkeypatch.setattr(settings, "openai_api_key", None)
-         monkeypatch.setattr(settings, "anthropic_api_key", "sk-ant-test")
-         monkeypatch.setattr(settings, "hf_token", None)
-         model = get_model()
-         assert isinstance(model, AnthropicModel)
-
-     def test_returns_huggingface_when_hf_token_present(self, monkeypatch):
-         """HF_TOKEN present (no paid keys) → HuggingFace model."""
-         monkeypatch.setattr(settings, "openai_api_key", None)
-         monkeypatch.setattr(settings, "anthropic_api_key", None)
-         monkeypatch.setattr(settings, "hf_token", "hf_test_token")
-         model = get_model()
-         assert isinstance(model, HuggingFaceModel)
-
-     def test_raises_error_when_no_keys(self, monkeypatch):
-         """No keys at all → ConfigurationError."""
-         monkeypatch.setattr(settings, "openai_api_key", None)
-         monkeypatch.setattr(settings, "anthropic_api_key", None)
-         monkeypatch.setattr(settings, "hf_token", None)
-         with pytest.raises(ConfigurationError) as exc_info:
-             get_model()
-         assert "No LLM API key configured" in str(exc_info.value)
-
-     def test_openai_takes_priority_over_anthropic(self, monkeypatch):
-         """Both keys present → OpenAI wins."""
-         monkeypatch.setattr(settings, "openai_api_key", "sk-test")
-         monkeypatch.setattr(settings, "anthropic_api_key", "sk-ant-test")
-         model = get_model()
-         assert isinstance(model, OpenAIChatModel)
- ```
-
- ### Full Test Suite
-
- ```bash
- $ make check
- # 309 passed in 238.16s (0:03:58)
- # All checks passed!
- ```
-
- ### Manual Verification
-
- 1. **Unset all API keys**: `unset OPENAI_API_KEY ANTHROPIC_API_KEY HF_TOKEN`
- 2. **Run app**: `uv run python -m src.app`
- 3. **Submit query**: "What drugs improve female libido?"
- 4. **Verify**: Synthesis falls back to template (shows `ConfigurationError` in logs, but user sees structured summary)
 
docs/bugs/archive/P1_ADVANCED_MODE_UNINTERPRETABLE_CHAIN_OF_THOUGHT.md DELETED
@@ -1,184 +0,0 @@
- # P1: Advanced Mode Exposes Uninterpretable Chain-of-Thought Events
-
- **Priority**: P1 (UX Degradation)
- **Component**: `src/orchestrators/advanced.py`
- **Status**: Resolved
- **Issue**: [#106](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/106)
- **PR**: [#107](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/pull/107)
- **Created**: 2025-12-01
- **Resolved**: 2025-12-01
-
- ## Summary
-
- The Advanced orchestrator exposes raw internal framework events from `agent-framework-core` directly to users. These events contain internal manager bookkeeping (task assignments, ledgers, instructions) that are:
-
- 1. Truncated mid-sentence at 200 characters
- 2. Use internal framework terminology (`user_task`, `task_ledger`, `instruction`)
- 3. Shown with misleading "JUDGING" event type
- 4. Not meaningful to end users
-
- ## Resolution
-
- Implemented "Smart Filter + Transform" logic in `src/orchestrators/advanced.py`:
-
- 1. **Filtered**: `task_ledger` and `instruction` events are now hidden.
- 2. **Transformed**: `user_task` events are mapped to `type="progress"` with a friendly "Manager assigning research task..." message.
- 3. **Smart Truncation**: Text is now truncated at sentence boundaries or word boundaries, preventing mid-word cuts.
-
- Verified with new unit tests in `tests/unit/orchestrators/test_advanced_events.py`.
-
- ## Example of Bad Output
-
- ```
- 🧠 **JUDGING**: Manager (user_task): Research sexual health and wellness interventions for: sildenafil mechanism ##...
-
- 🧠 **JUDGING**: Manager (task_ledger): We are working to address the following user request: Research sexual healt...
-
- 🧠 **JUDGING**: Manager (instruction): Conduct targeted searches on PubMed, ClinicalTrials.gov, and Europe PMC to ga...
- ```
-
- Users see:
- - Raw internal prompts being passed between manager and agents
- - Truncated text that cuts off mid-word ("healt...", "ga...")
- - Technical jargon ("task_ledger") with no context
- - All events labeled as "JUDGING" even when they're task assignments
-
- ## Root Cause Analysis
-
- ### The Chain of Issues
-
- | Location | Issue |
- |----------|-------|
- | `src/orchestrators/advanced.py:363-370` | `MagenticOrchestratorMessageEvent` raw events exposed without filtering |
- | `src/orchestrators/advanced.py:368` | `event.kind` values (`user_task`, `task_ledger`, `instruction`) are internal framework concepts |
- | `src/orchestrators/advanced.py:368` | Hard truncation: `text[:200]...` breaks mid-sentence |
- | `src/orchestrators/advanced.py:367` | All manager events mapped to `type="judging"` regardless of actual purpose |
- | `src/orchestrators/advanced.py:380` | Agent messages also truncated at 200 chars |
- | `src/utils/models.py:136` | `"judging": "🧠"` icon shown for all these internal events |
- | `src/app.py:248` | Events displayed verbatim via `event.to_markdown()` |
-
- ### Code Path
-
- ```
- agent-framework-core (Microsoft)
-
- MagenticOrchestratorMessageEvent(kind="task_ledger", message="...")
-
- advanced.py:_process_event() - NO FILTERING
-
- AgentEvent(type="judging", message=f"Manager ({event.kind}): {text[:200]}...")
-
- models.py:to_markdown() → "🧠 **JUDGING**: Manager (task_ledger): ..."
-
- app.py → Displayed to user verbatim
- ```
-
- ## Impact
-
- 1. **User Confusion**: Users see internal framework bookkeeping, not meaningful progress
- 2. **Truncated Gibberish**: 200-char limit cuts prompts mid-sentence, making them uninterpretable
- 3. **Misleading Labels**: "JUDGING" event type is wrong - these are task assignments
- 4. **No Actionable Info**: Users can't understand what the system is actually doing
-
- ## Proposed Solutions
-
- ### Option A: Filter Internal Events (Minimal)
-
- Skip internal manager events entirely - they're framework bookkeeping:
-
- ```python
- def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
-     if isinstance(event, MagenticOrchestratorMessageEvent):
-         # Skip internal framework bookkeeping events
-         if event.kind in ("user_task", "task_ledger", "instruction"):
-             return None  # Don't expose to users
-     # ... rest of handling
- ```
-
- **Pros**: Simple, removes noise
- **Cons**: Users lose visibility into manager activity
-
- ### Option B: Transform to User-Friendly Messages (Better UX)
-
- Map internal events to meaningful user messages:
-
- ```python
- MANAGER_EVENT_MESSAGES = {
-     "user_task": "Manager received research task",
-     "task_ledger": "Manager tracking task progress",
-     "instruction": "Manager assigning work to agent",
- }
-
- def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
-     if isinstance(event, MagenticOrchestratorMessageEvent):
-         if event.kind in MANAGER_EVENT_MESSAGES:
-             return AgentEvent(
-                 type="progress",  # Not "judging"!
-                 message=MANAGER_EVENT_MESSAGES[event.kind],
-                 iteration=iteration,
-             )
- ```
-
- **Pros**: Users see meaningful progress, correct event types
- **Cons**: More code, loses raw detail for debugging
-
- ### Option C: Smart Truncation + Verbose Mode
-
- 1. Truncate at sentence boundaries, not hard character limit
- 2. Add `verbose_mode` setting that shows full internal events for debugging
- 3. Use appropriate event types based on `event.kind`
-
- ```python
- def _smart_truncate(self, text: str, max_len: int = 200) -> str:
-     """Truncate at sentence boundary."""
-     if len(text) <= max_len:
-         return text
-     # Find last sentence boundary before limit
-     truncated = text[:max_len]
-     last_period = truncated.rfind(". ")
-     if last_period > max_len // 2:
-         return truncated[:last_period + 1]
-     return truncated.rsplit(" ", 1)[0] + "..."
- ```
-
- ### Recommended Approach
-
- **Combine Option A + B** (sketched after this list):
-
- 1. **Default**: Filter out `task_ledger` and `instruction` events (pure bookkeeping)
- 2. **Transform**: `user_task` → "Assigning research task to agents"
- 3. **Proper Types**: Use `"progress"` not `"judging"` for manager events
- 4. **Future**: Add verbose mode for debugging
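-
- A combined sketch of items 1-3 (event shape assumed from the table above; illustrative, not the exact code merged in PR #107):
-
- ```python
- from src.utils.models import AgentEvent  # path per "Files to Modify" below
-
- HIDDEN_KINDS = {"task_ledger", "instruction"}  # pure framework bookkeeping
- FRIENDLY_KINDS = {"user_task": "Manager assigning research task..."}
-
- def process_manager_event(event, iteration: int) -> AgentEvent | None:
-     """Filter or transform a manager event before it reaches the UI."""
-     if event.kind in HIDDEN_KINDS:
-         return None  # hidden entirely
-     if event.kind in FRIENDLY_KINDS:
-         # "progress", not "judging": these are task assignments
-         return AgentEvent(type="progress", message=FRIENDLY_KINDS[event.kind], iteration=iteration)
-     return None  # unknown kinds stay hidden in this sketch
- ```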
-
- ## Files to Modify
-
- 1. `src/orchestrators/advanced.py:361-410` - `_process_event()` method
- 2. `src/utils/models.py:107-123` - Add new event types if needed
- 3. `tests/unit/orchestrators/test_advanced_timeout.py` - Update assertions
-
- ## Related Issues
-
- - P0: Advanced Mode Timeout No Synthesis (FIXED in PR #104)
- - This P1 was discovered while testing the P0 fix
-
- ## Testing the Bug
-
- ```python
- import asyncio
- from src.orchestrators.advanced import AdvancedOrchestrator
-
- async def test():
-     orch = AdvancedOrchestrator(max_rounds=3)
-     async for event in orch.run("sildenafil mechanism"):
-         if "Manager" in event.message:
-             print(f"[{event.type}] {event.message}")
-             # You'll see uninterpretable output
-
- asyncio.run(test())
- ```
-
- ## References
-
- - Microsoft Agent Framework: https://github.com/microsoft/agent-framework
- - AgentEvent model: `src/utils/models.py:104`
- - Advanced orchestrator: `src/orchestrators/advanced.py`
 
docs/bugs/archive/P1_FREE_TIER_TOOL_EXECUTION_FAILURE.md DELETED
@@ -1,319 +0,0 @@
- # P1 Bug: Free Tier Tool Execution Failure
-
- **Date**: 2025-12-03
- **Status**: FIXED (PR fix/P1-free-tier-tool-execution)
- **Severity**: P1 (Critical - Free Tier Completely Broken)
- **Component**: HuggingFaceChatClient + Together.ai Routing + Tool Calling
- **Resolution**: Removed premature `__function_invoking_chat_client__ = True` marker from class body
-
- ---
-
- ## Executive Summary
-
- The Free Tier (HuggingFace) is fundamentally broken due to **multiple interacting issues** that cause tool calls to fail, resulting in garbage output, hallucinated results, and raw JSON appearing in the UI.
-
- **This is NOT a simple 7B model issue** - it's a chain of infrastructure and code problems.
-
- ---
-
- ## Symptoms
-
- Users on Free Tier see:
-
- 1. **Garbage tokens**: "oleon", "UrlParser", "MemoryWarning", "PostalCodes"
- 2. **Raw tool call XML tags**: `<tool_call>`, `</tool_call>` appearing as text
- 3. **Raw JSON tool calls**: `{"name": "search_pubmed", "arguments": {...}}`
- 4. **Hallucinated tool results**: Fake JSON responses that were never returned by actual tools:
-    ```json
-    {"response": "[{'title': 'Effect of Flibanserin...', ...}]"}
-    ```
- 5. **No actual database searches**: PubMed, ClinicalTrials.gov never queried
-
- ---
-
- ## Root Cause Analysis
-
- ### Cause 1: Model Routed to Third-Party Provider (Together.ai)
-
- **Discovery**: Qwen2.5-7B-Instruct is NOT served by native HuggingFace infrastructure.
-
- ```python
- # API response from HuggingFace:
- {
-     "inferenceProviderMapping": {
-         "together": {
-             "status": "live",
-             "providerId": "Qwen/Qwen2.5-7B-Instruct-Turbo"  # <-- TURBO variant!
-         },
-         "featherless-ai": {
-             "status": "live",
-             "providerId": "Qwen/Qwen2.5-7B-Instruct"
-         }
-     }
- }
- ```
-
- **Impact**:
- - Native HF-inference returns 404 for this model
- - All requests route through Together.ai
- - Together serves a "Turbo" variant, not the original
- - We cannot control how Together handles tool calling
-
- ### Cause 2: Qwen2.5 Uses XML-Style Tool Calling Format
-
- **Discovery**: The model's chat template instructs it to output tool calls in XML format:
-
- ```jinja
- For each function call, return a json object with function name and arguments
- within <tool_call></tool_call> XML tags:
- <tool_call>
- {"name": <function-name>, "arguments": <args-json-object>}
- </tool_call>
- ```
-
- **Impact** (a recovery sketch follows this list):
- - Model outputs `<tool_call>{"name":...}</tool_call>` as **text**
- - This text appears in `delta.content` (not `delta.tool_calls`)
- - Our streaming code yields this as visible text to the UI
- - When tool calling works correctly, the API parses this internally
- - When it fails, raw XML appears in output
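-
- When the XML-style call leaks into text, the payload can still be recovered with plain Python (a minimal sketch; the real fix is making the API parse it, not post-hoc scraping):
-
- ```python
- import json
- import re
-
- TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
-
- def extract_leaked_tool_calls(text: str) -> list[dict]:
-     """Parse {"name": ..., "arguments": ...} objects leaked as visible text."""
-     calls = []
-     for match in TOOL_CALL_RE.finditer(text):
-         try:
-             calls.append(json.loads(match.group(1)))
-         except json.JSONDecodeError:
-             pass  # malformed JSON: drop it rather than guess
-     return calls
- ```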
-
- ### Cause 3: Together.ai Turbo Inconsistent Tool Call Parsing
-
- **Discovery**: Together's serving of the Turbo model has inconsistent behavior:
-
- | Test Scenario | Tool Call Behavior |
- |---------------|-------------------|
- | Simple query, single tool | ✅ Parsed correctly to `tool_calls` |
- | Complex multi-agent prompt | ❌ Mixed: some parsed, some as text |
- | Multi-turn with tool results | ❌ Model hallucinates fake results |
-
- **Evidence from testing**:
- ```python
- # Simple test - WORKS:
- finish_reason: tool_calls
- content: None
- tool_calls: [ChatCompletionOutputToolCall(function=..., name='search_pubmed')]
-
- # Complex prompt - FAILS:
- TEXT[49]: '建档立标'  # Chinese garbage between tool calls
- TEXT[X]: '{"name": "search_preprints", ...}'  # Raw JSON as text
- ```
-
- ### Cause 4: Potential Code Bug - Premature Marker Setting
-
- **Discovery**: In `HuggingFaceChatClient`, we set a marker that may prevent tool execution wrapping:
-
- ```python
- @use_function_invocation  # Decorator checks marker BEFORE wrapping
- @use_observability
- @use_chat_middleware
- class HuggingFaceChatClient(BaseChatClient):
-     # This marker causes decorator to return early!
-     __function_invoking_chat_client__ = True  # <-- BUG?
- ```
-
- The `@use_function_invocation` decorator source:
- ```python
- def use_function_invocation(chat_client):
-     if getattr(chat_client, FUNCTION_INVOKING_CHAT_CLIENT_MARKER, False):
-         return chat_client  # EARLY RETURN - doesn't wrap methods!
-     # ... wrapping code never runs ...
- ```
-
- **Impact**: The decorator sees the marker as `True` and returns early without wrapping `get_response` and `get_streaming_response` with the function invocation handler.
-
- **Status**: NEEDS VERIFICATION - Testing shows methods have `__wrapped__` attribute, suggesting some decoration occurred. May be from other decorators.
-
- ### Cause 5: Model Hallucination Under Complexity
-
- **Discovery**: When the model fails to make proper API tool calls, it **simulates** tool use by outputting fake results:
-
- ```
- {"response": "[{'title': 'Effect of Flibanserin...'}]"}
- ```
-
- This is pure hallucination - no actual API calls were made. The model is trained to produce tool-like outputs, so when the API tool calling fails, it falls back to text-based simulation.
-
- ---
-
- ## Verification Steps
-
- ### Test 1: Direct InferenceClient (PASSES)
-
- ```python
- from huggingface_hub import InferenceClient
-
- client = InferenceClient(model='Qwen/Qwen2.5-7B-Instruct')
- response = client.chat_completion(
-     messages=[{'role': 'user', 'content': 'What is the weather?'}],
-     tools=[weather_tool],
-     tool_choice='auto',
- )
- # Result: tool_calls properly parsed, content=None
- ```
-
- ### Test 2: Complex Multi-Agent Prompt (FAILS)
-
- ```python
- # With our SearchAgent-style prompts:
- stream = client.chat_completion(
-     messages=[system_prompt, user_query],
-     tools=multiple_tools,
-     ...
- )
- # Result: Mix of text content AND tool_calls, garbage tokens appear
- ```
-
- ### Test 3: ChatAgent Single Tool (PARTIAL)
-
- ```python
- agent = ChatAgent(
-     chat_client=HuggingFaceChatClient(),
-     tools=[search_pubmed],
-     ...
- )
- result = await agent.run('Search for libido drugs')
- # Result: Tool call request made but function NOT executed (tool_calls=0)
- ```
-
- ---
-
- ## Impact Assessment
-
- | Aspect | Impact |
- |--------|--------|
- | Free Tier Users | **100% broken** - Cannot get any useful results |
- | Demo Quality | **Unprofessional** - Shows garbage/hallucinations |
- | User Trust | **Critical** - Appears completely broken |
- | Tool Execution | **Not working** - Tools never actually called |
-
- ---
-
- ## Fix Options
-
- ### Option 1: Remove Premature Marker (QUICK - Test First)
-
- **Location**: `src/clients/huggingface.py:43`
-
- ```python
- # REMOVE THIS LINE:
- __function_invoking_chat_client__ = True
- ```
-
- Let the `@use_function_invocation` decorator set the marker AFTER wrapping.
-
- **Risk**: Unknown - need to test if this actually enables tool execution.
-
- ### Option 2: Switch to Model with Native HF Support
-
- Find a model that runs on native HuggingFace infrastructure (not routed to third parties):
-
- | Model | Size | Native HF? | Tool Calling |
- |-------|------|------------|--------------|
- | `Qwen/Qwen2.5-3B-Instruct` | 3B | ❓ Test | ❓ |
- | `mistralai/Mistral-7B-Instruct-v0.3` | 7B | ❓ Test | ✅ |
- | `microsoft/Phi-3-mini-4k-instruct` | 3.8B | ❓ Test | Limited |
-
- ### Option 3: Simplify Free Tier to Single-Agent
-
- Remove multi-agent complexity for Free Tier:
- - Single ChatAgent with simpler prompt
- - Direct tool calls instead of MagenticBuilder workflow
- - Reduced prompt complexity
-
- ### Option 4: Streaming Content Filter (BAND-AID)
-
- Filter garbage from streaming output:
-
- ```python
- def should_stream_content(text: str) -> bool:
-     """Filter garbage from streaming."""
-     if text.strip().startswith('{"name":'):
-         return False  # Raw tool call JSON
-     if '</tool_call>' in text or '<tool_call>' in text:
-         return False  # XML tags
-     garbage = ["oleon", "UrlParser", "MemoryWarning", "建档立标"]
-     if any(g in text for g in garbage):
-         return False
-     return True
- ```
-
- **Note**: This hides symptoms but doesn't fix the underlying tool execution failure.
-
- ### Option 5: Use Together.ai Directly with Their SDK
-
- Bypass HuggingFace routing entirely:
- - Use Together's official SDK
- - May have better tool calling support
- - Requires new client implementation
-
- ---
-
- ## Files Involved
-
- | File | Role |
- |------|------|
- | `src/clients/huggingface.py` | Main HF client - has premature marker |
- | `src/clients/factory.py` | Client selection logic |
- | `src/agents/magentic_agents.py` | Agent definitions with tools |
- | `src/orchestrators/advanced.py` | Multi-agent workflow |
- | `src/agents/tools.py` | Tool function definitions |
-
- ---
-
- ## Recommended Action Plan
-
- ### Phase 1: Verify Code Bug (Immediate)
-
- 1. Remove `__function_invoking_chat_client__ = True` from HuggingFaceChatClient
- 2. Test if tool execution now works
- 3. If yes, verify no regressions with full test suite
-
- ### Phase 2: Provider Testing
-
- 1. Test which small models have native HF support
- 2. Evaluate Together.ai direct integration
- 3. Document provider routing for all candidate models
-
- ### Phase 3: Architecture Decision
-
- Based on Phase 1-2 results:
- - If code fix works: Deploy and monitor
- - If provider issues persist: Implement simplified single-agent mode
- - Consider hybrid: Simple mode for free, advanced for paid
-
- ---
-
- ## Relation to P2_7B_MODEL_GARBAGE_OUTPUT
-
- This P1 bug **supersedes** the P2 bug. The P2 doc incorrectly blamed the model capacity. The real issues are:
-
- 1. **Provider routing** (Together.ai Turbo, not native HF)
- 2. **Tool execution failure** (possible code bug)
- 3. **Model hallucination** (consequence of #2, not root cause)
-
- The P2 symptoms are downstream effects of this P1 root cause.
-
- ---
-
- ## Investigation Timeline
-
- | Time | Finding |
- |------|---------|
- | 16:00 | Started deep investigation per user request |
- | 16:10 | Found Qwen chat template uses XML-style tool_call |
- | 16:20 | Confirmed HF API parses tool calls correctly |
- | 16:30 | Discovered model routed to Together.ai, not native HF |
- | 16:35 | Found premature marker in HuggingFaceChatClient |
- | 16:40 | Verified ChatAgent makes tool requests but doesn't execute |
- | 16:45 | Documented complete root cause chain |
-
- ---
-
- ## References
-
- - [HuggingFace Inference Providers](https://huggingface.co/docs/inference-providers/index)
- - [Together.ai Function Calling](https://docs.together.ai/docs/function-calling)
- - [Qwen Function Calling Docs](https://qwen.readthedocs.io/en/latest/framework/function_call.html)
- - [TGI Tool Calling Issue #2375](https://github.com/huggingface/text-generation-inference/issues/2375)
 
docs/bugs/archive/P1_GRADIO_EXAMPLE_CLICK_AUTO_SUBMIT.md DELETED
@@ -1,273 +0,0 @@
- # P1: Gradio Example Click Auto-Submits Instead of Loading
-
- **Status:** FIXED (PR #120, merged 2025-12-03)
- **Priority:** P1 (High - UX breaks BYOK flow)
- **Discovered:** 2025-12-03
- **Component:** `src/app.py` (Gradio UI)
-
- ---
-
- ## Summary
-
- Clicking on example questions in the Gradio ChatInterface immediately starts the research agent instead of just loading the text into the input field. This prevents users from:
- 1. Entering their API key before starting the chat
- 2. Modifying the example query before submission
- 3. Understanding what's happening (chat starts without explicit action)
-
- ---
-
- ## Reproduction Steps
-
- 1. Open DeepBoner Gradio UI
- 2. **Before entering any API key**, click on an example like "What drugs improve female libido post-menopause?"
- 3. Observe: Chat immediately starts with Free Tier
- 4. Try to enter an OpenAI API key in the accordion
- 5. Try to submit a new query
- 6. **Result:** Confusing UX - the chat already ran, state is unclear
-
- ### Expected Behavior
-
- 1. Click example → text loads into input field
- 2. User can enter API key
- 3. User clicks submit → chat starts with their configured settings
-
- ---
-
- ## Root Cause Analysis
-
- ### Problem 1: Missing `run_examples_on_click=False`
-
- Gradio's `ChatInterface` has a parameter `run_examples_on_click` (added in [PR #10109](https://github.com/gradio-app/gradio/pull/10109), December 2024):
-
- | Value | Behavior |
- |-------|----------|
- | `True` (default) | Clicking example immediately runs the function |
- | `False` | Clicking example only populates the input field |
-
- **Our code** in `src/app.py:279-325` does NOT set this parameter:
-
- ```python
- demo = gr.ChatInterface(
-     fn=research_agent,
-     examples=[...],
-     # run_examples_on_click=False ← MISSING!
- )
- ```
-
- ### Problem 2: HuggingFace Spaces Default Overrides
-
- From [Gradio docs](https://www.gradio.app/docs/gradio/chatinterface):
-
- > `cache_examples`: The default option in HuggingFace Spaces is **True**.
- > `run_examples_on_click` has **no effect** if `cache_examples` is True.
-
- This means on HuggingFace Spaces:
- - `cache_examples` defaults to `True`
- - Even if we add `run_examples_on_click=False`, it would be **ignored**
- - We MUST explicitly set `cache_examples=False`
-
- ### ~~Problem 3: Example Data Overwrites User Settings~~ (CORRECTION: This is Actually Fine)
-
- Looking at lines 283-304:
-
- ```python
- examples=[
-     [
-         "What drugs improve female libido post-menopause?",
-         "sexual_health",
-         None,  # ← api_key set to None
-         None,  # ← api_key_state set to None
-     ],
-     ...
- ]
- ```
-
- **CORRECTION:** Per [Stack Overflow research](https://stackoverflow.com/questions/78584977/how-to-use-additional-inputs-and-examples-at-the-same-time):
-
- > "If you set None for some input in all examples then it will not display this column in example and example will not change current value for this input."
-
- Since ALL examples have `None` for api_key and api_key_state:
- - Those columns won't display in the examples table
- - **Clicking an example will NOT change the API key textbox**
- - User's API key is PRESERVED!
-
- The current example structure is actually **correct**. The only issue is auto-submit.
-
- ### Dead Code: api_key_state Never Updated (Non-Blocking)
-
- Line 258-259 has a comment suggesting a fix was attempted:
-
- ```python
- # BUG FIX: Add gr.State for API key persistence across example clicks
- api_key_state = gr.State("")
- ```
-
- This code is **dead** because:
- 1. The `gr.State` is initialized empty (`""`)
- 2. There's NO event handler (`.change()`) to update the state when the textbox changes
- 3. The value passed to `research_agent` is always `""`
- 4. In `_validate_inputs`: `(api_key or api_key_state or "")` - the State never contributes
-
- **However**, this is NOT blocking the fix. The fix works regardless of this dead code.
- We can clean it up in a separate PR after the fix is verified working.
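-
- For reference, a one-line sketch of the wiring that would make the state live if it were kept (`api_key_textbox` is a hypothetical handle to the textbox component; this is not current app code):
-
- ```python
- # Mirror the textbox into the session state whenever its value changes.
- api_key_textbox.change(fn=lambda v: v, inputs=api_key_textbox, outputs=api_key_state)
- ```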
-
- ---
-
- ## Architecture Implications
-
- ### BYOK Flow Broken
-
- The unified architecture (SPEC-16) relies on API key auto-detection:
-
- ```text
- User provides key?
- ├── YES → OpenAI backend (sk-...) or Anthropic backend (sk-ant-...)
- └── NO → HuggingFace Free Tier
- ```
-
- The example click bug forces users into Free Tier even if they intended to use their API key.
-
- ### Session State Confusion
-
- After an example auto-submits:
- 1. Chat history has content
- 2. User enters API key
- 3. User submits new query
- 4. **Question:** Does the new query use the new key? Is history preserved correctly?
-
- This creates ambiguous state that could lead to:
- - Inconsistent backend usage within a session
- - Confusion about which tier was used for which response
-
- ---
-
- ## Fix Implementation
-
- ### Required Changes to `src/app.py`
-
- ```python
- demo = gr.ChatInterface(
-     fn=research_agent,
-     title="🍆 DeepBoner",
-     description=description,
-     examples=[...],
-     additional_inputs_accordion=additional_inputs_accordion,
-     additional_inputs=[...],
-     # === FIX: Prevent auto-submit on example click ===
-     cache_examples=False,  # MUST be False for run_examples_on_click to work
-     run_examples_on_click=False,  # Load into input, don't auto-run
- )
- ```
-
- ### Why This Fix is Safe (No Optional Enhancements Needed)
-
- The current example structure with `None` values is **correct**:
- - API key textbox value is PRESERVED when clicking examples
- - Only the message textbox is populated
- - No restructuring of examples needed
-
- **The fix is minimal and surgical:**
- ```python
- cache_examples=False,
- run_examples_on_click=False,
- ```
-
- No other changes required.
-
- ---
-
- ## Testing
-
- ### Manual Test Cases
-
- 1. **Fresh load, click example:** Should only populate input, not start chat
- 2. **Enter API key, click example:** Query loads, API key preserved
- 3. **Click example, enter key, submit:** Should use the entered key
- 4. **Multiple example clicks:** Each should just replace input text
-
- ### Automated Test (if possible)
-
- ```python
- def test_example_click_does_not_auto_submit():
-     """Verify examples only populate input, not trigger function."""
-     # Would need Gradio testing utilities
-     pass
- ```
-
- ---
-
- ## Related Issues
-
- - [Gradio #10103](https://github.com/gradio-app/gradio/issues/10103): Original feature request for `run_examples_on_click`
- - [Gradio #10109](https://github.com/gradio-app/gradio/pull/10109): PR that implemented the parameter
- - SPEC-16: Unified Chat Client Architecture (relies on proper API key handling)
- - P2_ARCHITECTURAL_BYOK_GAPS.md (archived) - Related BYOK issues now fixed
-
- ---
-
- ## Priority Justification
-
- **P1 (High)** because:
- 1. Breaks the BYOK (Bring Your Own Key) user flow
- 2. Forces users into Free Tier unexpectedly
- 3. Creates confusing UX that may prevent demo adoption
- 4. Simple fix with clear solution path
-
- ---
-
- ## Files Affected
-
- - `src/app.py:279-325` - ChatInterface configuration
-
- ---
-
- ## Senior Review: Risk Assessment
-
- **Reviewed:** 2025-12-03
-
- ### Verification Performed
-
- 1. **Gradio Version Confirmed:** 6.0.1 (`uv pip show gradio`)
- 2. **Parameters Exist:** Both `run_examples_on_click` and `cache_examples` verified in `ChatInterface.__init__` signature
- 3. **No Hidden Gradio Usage:** Only `src/app.py` imports gradio (grep confirmed)
- 4. **No Event Handlers:** No `.change()`, `.click()`, `.submit()` events in app.py that could conflict
- 5. **Example Format Correct:** List-of-lists format matches `additional_inputs` order
-
- ### Potential Regressions Checked
-
- | Risk | Assessment | Mitigation |
- |------|------------|------------|
- | Cold start slower on HF Spaces | Low - examples aren't pre-cached, but they also don't run on click | None needed - acceptable tradeoff |
- | Progress bar issues | None - `gr.Progress()` issues only affect cached examples, we're disabling caching | N/A |
- | Example display changes | None - examples already appear below chatbot due to `additional_inputs` | N/A |
- | API key cleared on example click | **Verified SAFE** - `None` in all examples means input is preserved | N/A |
- | Dead State code causes issues | No - it's inert, just passes `""` always | Clean up in follow-up PR |
-
- ### Gotchas Investigated
-
- 1. **ViewFrame/hydration issues:** `ssr_mode=False` already set at line 339 - no conflict
- 2. **MCP server interaction:** MCP server (`mcp_server=True`) operates independently of examples - no conflict
- 3. **CSS injection:** Custom CSS only affects `.api-key-input` class - no conflict
- 4. **Accordion state:** `additional_inputs_accordion` unaffected by example behavior
-
- ### Confidence Level
-
- **HIGH** - This is a two-line, surgical fix that:
- - Uses documented, stable Gradio 6.0 parameters
- - Has no side effects on other components
- - Preserves existing example structure
- - Was explicitly designed for this use case (PR #10109)
-
- ### Recommended Approach
-
- 1. **Phase 1:** Add the two params, test manually on HF Spaces
- 2. **Phase 2:** (Optional) Clean up dead `api_key_state` code in follow-up PR
-
- ---
-
- ## References
-
- - [Gradio ChatInterface Docs](https://www.gradio.app/docs/gradio/chatinterface)
- - [Gradio Examples Behavior](https://www.gradio.app/guides/chatinterface-examples)
- - [PR #10109: run_examples_on_click](https://github.com/gradio-app/gradio/pull/10109)
- - [Stack Overflow: None values in examples](https://stackoverflow.com/questions/78584977/how-to-use-additional-inputs-and-examples-at-the-same-time)
docs/bugs/archive/P1_HUGGINGFACE_NOVITA_500_ERROR.md DELETED
@@ -1,133 +0,0 @@
# P1 BUG: HuggingFace Router 500 Error via Novita Provider

**Status**: ACTIVE - Upstream Infrastructure Issue
**Priority**: P1 (Free Tier Broken)
**Discovered**: 2025-12-02
**Related**: CLAUDE.md (Llama/Hyperbolic issue)

---

## Symptom

```
❌ **ERROR**: Workflow error: 500 Server Error: Internal Server Error for url:
https://router.huggingface.co/novita/v3/openai/chat/completions
```

Free tier users (no API key) cannot use the system.

---

## Stack Trace

```text
User (no API key)

src/clients/factory.py:get_chat_client()

src/clients/huggingface.py:HuggingFaceChatClient

Model: Qwen/Qwen2.5-72B-Instruct (from config.py)

huggingface_hub.InferenceClient

HuggingFace Router: router.huggingface.co

Routes to: NOVITA (third-party inference provider)

❌ Novita returns 500 Internal Server Error
```

---

## Root Cause

**HuggingFace doesn't host all models directly.** For some models, it routes requests to third-party inference providers:

| Model | Provider | Status |
|-------|----------|--------|
| Llama-3.1-70B | Hyperbolic | ❌ "staging mode" auth issues |
| Qwen2.5-72B | Novita | ❌ 500 Internal Server Error |

We switched from Llama to Qwen specifically to avoid Hyperbolic's issues. Now Novita is having its own problems.

**This is an upstream infrastructure issue - not a bug in our code.**

---

## Evidence

From the error URL:
```
https://router.huggingface.co/novita/v3/openai/chat/completions
                              ^^^^^^
                              Third-party provider in URL path
```

---

## Potential Fixes

### Option 1: Try a Different Model (Quick)
Find a model that HuggingFace hosts natively (not routed to partners):

```python
# Candidates to test:
# - mistralai/Mistral-7B-Instruct-v0.3
# - microsoft/Phi-3-mini-4k-instruct
# - google/gemma-2-9b-it
```

### Option 2: Add Fallback Logic (Robust)
```python
from huggingface_hub.utils import HfHubHTTPError

FALLBACK_MODELS = [
    "Qwen/Qwen2.5-72B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "microsoft/Phi-3-mini-4k-instruct",
]

async def get_response_with_fallback(...):
    for model in FALLBACK_MODELS:
        try:
            return await client.chat_completion(model=model, ...)
        except HfHubHTTPError as e:
            # Status code lives on the underlying HTTP response
            if e.response is not None and e.response.status_code == 500:
                continue
            raise
    raise AllModelsFailedError()  # custom app-level exception
```

### Option 3: Wait for Novita Fix (Passive)
500 errors are typically transient. Novita may fix their infrastructure.

---

## Verification

To check if the issue is resolved:
```bash
curl -X POST "https://router.huggingface.co/novita/v3/openai/chat/completions" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-72B-Instruct", "messages": [{"role": "user", "content": "hi"}]}'
```
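
The same check in Python, going through the router the way the app does (a minimal sketch; only `HF_TOKEN` is assumed to be set in the environment):

```python
import os

from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

client = InferenceClient(model="Qwen/Qwen2.5-72B-Instruct", token=os.environ["HF_TOKEN"])
try:
    resp = client.chat_completion(messages=[{"role": "user", "content": "hi"}], max_tokens=5)
    print("Provider healthy:", resp.choices[0].message.content)
except HfHubHTTPError as e:
    print("Still failing:", e)  # a 500 here means Novita is still down
```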

---

## Historical Context

From `CLAUDE.md`:
```
- **HuggingFace (Free Tier):** `Qwen/Qwen2.5-72B-Instruct`
  - Changed from Llama-3.1-70B (Dec 2025) due to HuggingFace routing Llama
    to Hyperbolic provider which has unreliable "staging mode" auth.
```

Now Qwen is being routed to Novita, continuing the pattern of unreliable third-party routing.

---

## Recommendation

**Short-term**: Switch to a model hosted natively by HuggingFace (test candidates above)
**Long-term**: Implement fallback model logic to handle provider outages gracefully

docs/bugs/archive/P1_HUGGINGFACE_ROUTER_401_HYPERBOLIC.md DELETED
@@ -1,62 +0,0 @@
# P1 Bug: HuggingFace Router 401 Unauthorized

**Severity**: P1 (High)
**Status**: RESOLVED
**Discovered**: 2025-12-01
**Resolved**: 2025-12-01
**Reporter**: Production user via HuggingFace Spaces

## Symptom

```
401 Client Error: Unauthorized for url:
https://router.huggingface.co/hyperbolic/v1/chat/completions
Invalid username or password.
```

## Root Cause

**The HF_TOKEN in `.env` and HuggingFace Spaces secrets was invalid/expired.**

Token `hf_ssayg...` failed `HfApi().whoami()` verification.

## Resolution

1. Generated new HF_TOKEN at https://huggingface.co/settings/tokens
2. Updated `.env` with new token: `hf_gZVBI...`
3. Updated HuggingFace Spaces secret with same token
4. Switched default model from `meta-llama/Llama-3.1-70B-Instruct` to `Qwen/Qwen2.5-72B-Instruct` (better reliability via HF router)

## Verification

```bash
uv run python -c "
import os
from huggingface_hub import InferenceClient, HfApi

token = os.environ['HF_TOKEN']  # Your valid token from .env
api = HfApi(token=token)
print(f'Token valid: {api.whoami()[\"name\"]}')

client = InferenceClient(model='Qwen/Qwen2.5-72B-Instruct', token=token)
response = client.chat_completion(messages=[{'role': 'user', 'content': '2+2=?'}], max_tokens=10)
print(f'Inference works: {response.choices[0].message.content}')
"
# Output:
# Token valid: VibecoderMcSwaggins
# Inference works: 4
```

## Lessons Learned

1. **First-principles debugging**: Before adding complex "fixes", verify basic assumptions (is the token actually valid?)
2. **Token expiration**: HuggingFace tokens can expire or become invalid. Always verify with `whoami()`.
3. **Model routing**: HuggingFace routes large models to partner providers (Hyperbolic, Novita). All require valid auth.

## Files Changed

- `src/utils/config.py`: Changed default model to `Qwen/Qwen2.5-72B-Instruct`
- `src/clients/huggingface.py`: Updated fallback model reference
- `src/agent_factory/judges.py`: Updated fallback model reference
- `src/orchestrators/langgraph_orchestrator.py`: Updated hardcoded model
- `CLAUDE.md`, `AGENTS.md`, `GEMINI.md`: Updated documentation

docs/bugs/archive/P1_NARRATIVE_SYNTHESIS_FALLBACK.md DELETED
@@ -1,185 +0,0 @@
# P1: Narrative Synthesis Falls Back to Template (SPEC_12 Not Taking Effect)

**Status**: Open
**Priority**: P1 - Major UX degradation
**Affects**: Simple mode, all deployments
**Root Cause**: LLM synthesis silently failing → template fallback
**Related**: SPEC_12 (implemented but not functioning)

---

## Problem Statement

SPEC_12 implemented LLM-based narrative synthesis, but users still see **template-formatted bullet points** instead of **prose paragraphs**:

### What Users See (Template Fallback)

```markdown
## Sexual Health Analysis

### Question
what medication for the best boners?

### Drug Candidates
- **tadalafil**
- **sildenafil**

### Key Findings
- Tadalafil improves erectile function

### Assessment
- **Mechanism Score**: 4/10
- **Clinical Evidence Score**: 6/10
```

### What They Should See (LLM Synthesis)

```markdown
### Executive Summary

Sildenafil demonstrates clinically meaningful efficacy for erectile dysfunction,
with strong evidence from multiple RCTs demonstrating improved erectile function...

### Background

Erectile dysfunction (ED) is a common male sexual health disorder...

### Evidence Synthesis

**Mechanism of Action**
Sildenafil works by inhibiting phosphodiesterase type 5 (PDE5)...
```

---

## Root Cause Analysis

### Location: `src/orchestrators/simple.py:555-564`

```python
try:
    agent = Agent(model=get_model(), output_type=str, system_prompt=system_prompt)
    result = await agent.run(user_prompt)
    narrative = result.output
except Exception as e:  # ← SILENT FALLBACK
    logger.warning("LLM synthesis failed, using template fallback", error=str(e))
    return self._generate_template_synthesis(query, evidence, assessment)
```

**The Problem**: When ANY exception occurs during LLM synthesis, it silently falls back to the template. Users see janky bullet points with no indication that the LLM call failed.

### Why Synthesis Fails

| Cause | Symptom | Frequency |
|-------|---------|-----------|
| No API key in deployment | HuggingFace Spaces | HIGH |
| API rate limiting | Heavy usage | MEDIUM |
| Token overflow | Long evidence lists | MEDIUM |
| Model mismatch | Wrong model ID | LOW |
| Network timeout | Slow connections | LOW |

---

## Evidence: LLM Synthesis WORKS When Configured

Local test with API key:
```python
# This works perfectly:
agent = Agent(model=get_model(), output_type=str, system_prompt=system_prompt)
result = await agent.run(user_prompt)
print(result.output)  # → Beautiful narrative prose!
```

Output:
```
### Executive Summary

Sildenafil demonstrates clinically meaningful efficacy for erectile dysfunction,
with one study (Smith, 2020; N=100) reporting improved erectile function...
```

---

## Impact

| Metric | Current | Expected |
|--------|---------|----------|
| Report quality | 3/10 (metadata dump) | 9/10 (professional prose) |
| User satisfaction | Low | High |
| Clinical utility | Limited | High |

The ENTIRE VALUE PROPOSITION of the research agent is the synthesized report. Template output defeats the purpose.

---

## Fix Options

### Option A: Surface Error to User (RECOMMENDED)

When LLM synthesis fails, don't silently fall back. Show the user what went wrong:

```python
except Exception as e:
    logger.error("LLM synthesis failed", error=str(e), exc_info=True)

    # Show error in report instead of silent fallback
    error_note = f"""
⚠️ **Note**: AI narrative synthesis unavailable.
Showing structured summary instead.

_Technical: {type(e).__name__}: {str(e)[:100]}_
"""
    template = self._generate_template_synthesis(query, evidence, assessment)
    return f"{error_note}\n\n{template}"
```

### Option B: HuggingFace Secrets Configuration

For HuggingFace Spaces deployment, add secrets:
- `OPENAI_API_KEY` → Required for synthesis
- `ANTHROPIC_API_KEY` → Alternative provider

### Option C: Graceful Degradation with Explanation

Add a banner explaining synthesis status (see the sketch below):
- ✅ "AI-synthesized narrative report" (when LLM works)
- ⚠️ "Structured summary (AI synthesis unavailable)" (fallback)

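A minimal sketch of that banner (helper name and wiring are assumptions; the flag would be set in the `try/except` shown above):

```python
def synthesis_banner(llm_synthesis_ok: bool) -> str:
    """Status line prepended to the report before it reaches the UI."""
    if llm_synthesis_ok:
        return "✅ _AI-synthesized narrative report_\n\n"
    return "⚠️ _Structured summary (AI synthesis unavailable)_\n\n"

# Usage at the end of the synthesis path:
# report = synthesis_banner(llm_ok) + report_body
```
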
---

## Diagnostic Steps

To determine why synthesis is failing in production (a combined check is sketched below):

1. **Review logs** for warning: `"LLM synthesis failed, using template fallback"`
2. **Verify API key**: Is `OPENAI_API_KEY` set in environment?
3. **Confirm model access**: Is `gpt-5` accessible with current API tier?
4. **Inspect rate limits**: Is the account quota exhausted?

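A quick script covering steps 2-4, using the same `pydantic_ai` stack as the synthesis path (the model id here is an assumption; substitute whatever `get_model()` resolves to):

```python
import asyncio
import os

from pydantic_ai import Agent


async def main() -> None:
    # Step 2: is a key configured at all?
    print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))

    # Steps 3-4: does a minimal synthesis-style call succeed?
    agent = Agent(model="openai:gpt-4o-mini", output_type=str)
    result = await agent.run("Reply with OK")
    print("LLM reachable:", result.output)  # raises on auth/quota problems


asyncio.run(main())
```
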
---

## Acceptance Criteria

- [ ] Users see narrative prose reports (not bullet points) when API key is configured
- [ ] When synthesis fails, user sees clear indication (not silent fallback)
- [ ] HuggingFace Spaces deployment has proper secrets configured
- [ ] Logging captures the specific exception for debugging

---

## Files to Modify

| File | Change |
|------|--------|
| `src/orchestrators/simple.py:555-580` | Add error surfacing in fallback |
| `src/app.py` | Add synthesis status indicator to UI |
| HuggingFace Spaces Settings | Add `OPENAI_API_KEY` secret |

---

## Test Plan

1. Run locally with API key → Should get narrative prose
2. Run locally WITHOUT API key → Should get template WITH error message
3. Deploy to HuggingFace with secrets → Should get narrative prose
4. Deploy to HuggingFace WITHOUT secrets → Should get template WITH warning

docs/bugs/archive/P1_NO_SYNTHESIS_FREE_TIER.md DELETED
@@ -1,165 +0,0 @@
# P1 Bug: No Synthesis Report in Free Tier (Premature Workflow Termination)

**Date**: 2025-12-04
**Status**: FIXED (PR fix/p1-forced-synthesis)
**Severity**: P1 (Critical UX - No usable output from research)
**Component**: `src/orchestrators/advanced.py`
**Affects**: Free Tier (HuggingFace) primarily, potentially Paid Tier

---

## Executive Summary

The workflow terminates without the ReportAgent ever producing a synthesis report. Users see search results and hypotheses streaming, but the final output is just "Research complete." with no actual research report. This is caused by the 7B Manager model failing to properly delegate to ReportAgent before workflow termination.

---

## Symptom

```text
📚 **SEARCH_COMPLETE**: searcher: [search results]
⏱️ **PROGRESS**: Round 1/5 (~3m 0s remaining)
🔬 **HYPOTHESIZING**: hypothesizer: [hypotheses]
⏱️ **PROGRESS**: Round 2/5 (~2m 15s remaining)
✅ **JUDGE_COMPLETE**: judge: [asks for more evidence]
⏱️ **PROGRESS**: Round 4/5 (~45s remaining)
Research complete.
Research complete.   ← NO SYNTHESIS REPORT!
```

The workflow runs through multiple agents (Search, Hypothesis, Judge) but never reaches the ReportAgent. The user receives no usable research report.

---

## Root Cause Analysis

### Primary Issue: Manager Model Failure

The `with_standard_manager()` in Microsoft Agent Framework uses the provided chat client (HuggingFace 7B model) to coordinate agents. The 7B model:

1. **Cannot follow complex multi-step instructions** - The manager prompt instructs: "When JudgeAgent says SUFFICIENT EVIDENCE → delegate to ReportAgent." The 7B model doesn't reliably follow this.

2. **Triggers premature termination** - The framework has `max_stall_count=3` and `max_reset_count=2`. If the manager keeps making the same delegation or gets confused, the workflow terminates.

3. **Emits final event without synthesis** - The framework sends `MagenticFinalResultEvent` or `WorkflowOutputEvent` without ReportAgent ever running.

### Secondary Issue: Duplicate Complete Events

Both `MagenticFinalResultEvent` and `WorkflowOutputEvent` are emitted when the workflow ends. The previous code handled both, yielding "Research complete." twice.

---

## The Fix

### 1. Track ReportAgent Execution (Forced Synthesis)

Add a `reporter_ran` flag that tracks whether ReportAgent produced output:

```python
reporter_ran = False  # P1 FIX: Track if ReportAgent produced output

# In MagenticAgentMessageEvent handler:
agent_name = (event.agent_id or "").lower()
if "report" in agent_name:
    reporter_ran = True
```

### 2. Force Synthesis on Final Event

If the workflow ends without ReportAgent running, force synthesis:

```python
if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
    if not reporter_ran:
        logger.warning("ReportAgent never ran - forcing synthesis")
        async for synth_event in self._force_synthesis(iteration):
            yield synth_event
    else:
        yield self._handle_final_event(event, iteration, last_streamed_length)
```

### 3. `_force_synthesis()` Method

Similar to `_handle_timeout()`, invokes ReportAgent directly:

```python
async def _force_synthesis(self, iteration: int) -> AsyncGenerator[AgentEvent, None]:
    """Force synthesis when workflow ends without ReportAgent running."""
    state = get_magentic_state()
    evidence_summary = await state.memory.get_context_summary()
    report_agent = create_report_agent(self._chat_client, domain=self.domain)

    yield AgentEvent(type="synthesizing", message="Synthesizing research findings...")

    synthesis_result = await report_agent.run(
        f"Synthesize research report from this evidence.\n\n{evidence_summary}"
    )

    yield AgentEvent(type="complete", message=synthesis_result.text)
```

### 4. Skip Duplicate Final Events

Prevent "Research complete." appearing twice:

```python
if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
    if final_event_received:
        continue  # Skip duplicate final events
    final_event_received = True
```

---

## Why This Is The Correct Architecture

| Alternative | Why Wrong |
|-------------|-----------|
| Improve manager prompt | 7B models have fundamental reasoning limitations |
| Use larger model for manager | Defeats "free tier" purpose |
| Wait for upstream fix | Framework may never change; we control our code |
| **Forced synthesis safety net** | ✅ Guarantees output regardless of manager behavior |

The `_force_synthesis()` pattern is a **defensive architecture**. It guarantees users always get a research report, even if:
- The manager model fails to delegate properly
- The workflow hits stall/reset limits
- Any unexpected termination occurs

---

## Files Modified

| File | Change |
|------|--------|
| `src/orchestrators/advanced.py` | Added `reporter_ran` tracking |
| `src/orchestrators/advanced.py` | Added `_force_synthesis()` method |
| `src/orchestrators/advanced.py` | Added duplicate final event skipping |
| `src/orchestrators/advanced.py` | Added forced synthesis in final event handler |
| `src/orchestrators/advanced.py` | Added forced synthesis in max rounds fallback |

---

## Test Plan

1. **Free Tier**: Run query, verify synthesis report is always generated (a test sketch follows below)
2. **Paid Tier**: Run query, verify no regression in OpenAI behavior
3. **Timeout**: Verify existing timeout synthesis still works
4. **Max Rounds**: Verify synthesis happens even at max rounds
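
One way case 1 could be automated (a sketch only - it assumes the chat client and search tools are mocked by fixtures so no network calls occur):

```python
import pytest

from src.orchestrators.advanced import AdvancedOrchestrator


@pytest.mark.asyncio
async def test_complete_event_always_emitted(mock_chat_client):  # fixture assumed
    orch = AdvancedOrchestrator(max_rounds=1)
    events = [event async for event in orch.run("dummy query")]
    # Whether or not ReportAgent ran, the safety net must yield a report
    assert any(e.type == "complete" for e in events)
```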

---

## Related

- P2 Duplicate Report Bug (separate issue, also fixed in this PR)
- P2 First Turn Timeout Bug (previously fixed)
- Manager model limitations are fundamental to 7B models
- OpenAI tier works because GPT-5 follows instructions better

---

## Lessons Learned

1. **Defensive architecture** - Don't trust upstream components to always behave correctly
2. **Tracking flags** - Simple boolean flags can enable powerful safety nets
3. **AI-native challenges** - When using AI models as infrastructure components, build in fallbacks for model failures
4. **Regression prevention** - This bug was likely introduced when we unified the architecture; comprehensive test coverage is critical

docs/bugs/archive/P1_SIMPLE_MODE_REMOVED_BREAKS_FREE_TIER_UX.md DELETED
@@ -1,61 +0,0 @@
# Free Tier (No API Key) - BLOCKED by Upstream #2562

**Status**: BLOCKED - Waiting for upstream PR #2566
**Priority**: P1
**Discovered**: 2025-12-01

---

## Problem

Free tier (no API key provided) shows garbage output:

```
📚 **SEARCH_COMPLETE**: searcher: <agent_framework._types.ChatMessage object at 0x7fd3f8617b10>
```

## Cause

**Upstream Bug #2562**: Microsoft Agent Framework produces `repr()` garbage for tool-call-only messages.

## Architecture

```
        User provides API key?

   NO (Free Tier)          YES (Paid Tier)
   ──────────────          ───────────────
   HuggingFace backend     OpenAI backend
   Qwen 2.5 72B (free)     GPT-5 (paid)

   SAME orchestration, different backends
   ONE codebase, not parallel universes
```

## Framework Stack

| Framework | Role |
|-----------|------|
| Microsoft Agent Framework | Multi-agent orchestration |
| Pydantic AI | Structured outputs & validation |

Both work TOGETHER. Not mutually exclusive.

## Fix

**Upstream PR #2566** will fix this.

Once merged:
1. `uv add agent-framework@latest`
2. Verify free tier works
3. Done

## What Was Deleted

`simple.py` (778 lines) was a SEPARATE orchestrator. It created a parallel universe. Now deleted. ONE orchestrator with different backends.

## Related

- [Issue #105](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/105)
- [Upstream #2562](https://github.com/microsoft/agent-framework/issues/2562)
- [Upstream PR #2566](https://github.com/microsoft/agent-framework/pull/2566)

docs/bugs/archive/P1_SYNTHESIS_BROKEN_KEY_FALLBACK.md DELETED
@@ -1,163 +0,0 @@
# P0 - Free Tier Synthesis Incorrectly Uses Server-Side API Keys

**Status:** RESOLVED
**Priority:** P0 (Breaks Free Tier Promise)
**Found:** 2025-11-30
**Resolved:** 2025-11-30
**Component:** `src/orchestrators/simple.py`, `src/agent_factory/judges.py`

## Resolution Summary

The architectural bug where Simple Mode synthesis incorrectly used server-side API keys has been fixed.
We implemented a dedicated `synthesize()` method in `HFInferenceJudgeHandler` that uses the free
HuggingFace Inference API, consistent with the judging phase.

### Fix Details

1. **New Feature**: Added `synthesize()` method to `HFInferenceJudgeHandler` (and `JudgeHandler` protocol).
   - Uses `huggingface_hub.InferenceClient.chat_completion` (Free Tier).
   - Mirrors the `assess()` logic for consistent free access.

2. **Orchestrator Logic Update** (see the sketch below):
   - `SimpleOrchestrator` now checks `if hasattr(self.judge, "synthesize")`.
   - If true (Free Tier), it calls `judge.synthesize()` directly, skipping `get_model()`/`pydantic_ai`.
   - If false (Paid Tier), it falls back to the existing `pydantic_ai` agent flow using `get_model()`.

3. **Test Coverage**:
   - Updated `tests/unit/orchestrators/test_simple_synthesis.py` to mock `judge.synthesize`.
   - Added new test case ensuring Free Tier path is taken when available.
   - Fixed integration tests to simulate Free Tier correctly.

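A condensed sketch of that dispatch (names taken from this doc; exact signatures may differ slightly in `simple.py`):

```python
if hasattr(self.judge, "synthesize"):
    # Free Tier: HFInferenceJudgeHandler synthesizes via the free HF Inference API
    narrative = await self.judge.synthesize(query, evidence, assessment)
else:
    # Paid Tier: existing pydantic_ai flow, keyed off server-side settings
    agent = Agent(model=get_model(), output_type=str, system_prompt=system_prompt)
    narrative = (await agent.run(user_prompt)).output
```
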
### Verification

- **Unit Tests**: `tests/unit/orchestrators/test_simple_synthesis.py` passed (7/7).
- **Integration**: `tests/integration/test_simple_mode_synthesis.py` passed.
- **Full Suite**: `make check` passed (310/310 tests).

---

## Symptom (Archive)

When using Simple Mode (Free Tier) without providing a user API key, users see:

```
> ⚠️ **Note**: AI narrative synthesis unavailable. Showing structured summary.
> _Error: OpenAIError_
```

This is confusing because the user didn't configure any OpenAI key - they expected Free Tier to work.

## Root Cause

**Architecture bug: Synthesis is decoupled from JudgeHandler selection.**

| Component | Paid Tier | Free Tier |
|-----------|-----------|-----------|
| Judge | `JudgeHandler` (uses `get_model()`) | `HFInferenceJudgeHandler` (free HF Inference) |
| Synthesis | `get_model()` | **BUG: Also uses `get_model()`** |

**Flow:**
1. User selects Simple mode, leaves API key empty
2. `app.py` correctly creates `HFInferenceJudgeHandler` for judging (works)
3. Search works (no keys needed for PubMed/ClinicalTrials/Europe PMC)
4. Judge works (HFInferenceJudgeHandler uses free HuggingFace inference)
5. **BUG:** Synthesis calls `get_model()` in `simple.py:547`
6. `get_model()` checks `settings.has_openai_key` → reads SERVER-SIDE env vars
7. If ANY server-side key is set (even broken), synthesis tries to use it
8. This VIOLATES the Free Tier promise - the user didn't provide a key!

**The bug is NOT about broken keys - it's about synthesis ignoring the Free Tier selection.**

## Impact

- **User Confusion**: User didn't provide a key, sees "OpenAIError"
- **Free Tier Perception**: Makes Free Tier seem broken when it's actually working (template synthesis is still useful)
- **Demo Quality**: Hackathon judges may think the app is broken

## Fix Options

### Option A: Remove/Fix Admin Key (Quick Fix for Hackathon)
Remove or update the `OPENAI_API_KEY` secret on HuggingFace Spaces.
- If removed: Free Tier works as designed (template synthesis)
- If fixed: OpenAI synthesis works

**Pros:** Instant fix, no code changes
**Cons:** Doesn't fix the underlying UX issue

### Option B: Better Error Message
Change the error message to be more user-friendly:

```python
# src/orchestrators/simple.py:569-573
error_note = (
    f"\n\n> ⚠️ **Note**: AI narrative synthesis unavailable. "
    f"Showing structured summary.\n"
    f"> _Tip: Provide your own API key for full synthesis._\n"
)
```

**Pros:** Clearer UX
**Cons:** Hides the real error for debugging

### Option C: Provider Fallback Chain (Best Long-term)
If the primary provider fails, try the next provider before falling back to template:

```python
def get_model_with_fallback() -> Any:
    """Try providers in order, return first that works."""
    from src.utils.exceptions import ConfigurationError

    providers = []
    if settings.has_openai_key:
        providers.append(("openai", lambda: OpenAIChatModel(...)))
    if settings.has_anthropic_key:
        providers.append(("anthropic", lambda: AnthropicModel(...)))
    if settings.has_huggingface_key:
        providers.append(("huggingface", lambda: HuggingFaceModel(...)))

    for name, factory in providers:
        try:
            return factory()
        except Exception as e:
            logger.warning(f"Provider {name} failed: {e}")
            continue

    raise ConfigurationError("No working LLM provider available")
```

**Pros:** Most robust, graceful degradation
**Cons:** More complex, may hide real errors

### Option D: Validate Key Before Using (Recommended)
Add key validation to `get_model()`:

```python
def get_model() -> Any:
    if settings.has_openai_key:
        # Quick validation - check key format
        key = settings.openai_api_key
        if not key or not key.startswith("sk-"):
            logger.warning("Invalid OpenAI key format, trying next provider")
        else:
            return OpenAIChatModel(...)
    # ... continue to next provider
```

**Pros:** Catches obviously invalid keys early
**Cons:** Can't catch quota/permission issues without an API call

## Recommended Action (Hackathon)

1. **Immediate**: Remove `OPENAI_API_KEY` from HuggingFace Space secrets, OR replace it with a valid key
2. **If key is valid**: Check if model `gpt-5` is accessible (may need to use `gpt-4o` instead)

## Test Plan

1. Remove all secrets from HuggingFace Space
2. Run Simple mode query
3. Verify: Search works, Judge works, Synthesis shows template (no error message)

## Related

- `docs/bugs/P0_SYNTHESIS_PROVIDER_MISMATCH.md` (RESOLVED - handles "no keys" case)
- This bug is specifically about the "key exists but broken" case

docs/bugs/archive/P2_7B_MODEL_GARBAGE_OUTPUT.md DELETED
@@ -1,266 +0,0 @@
# P2 Bug: 7B Model Produces Garbage Streaming Output

**Date**: 2025-12-02
**Status**: OPEN - Investigating
**Severity**: P2 (Major - Degrades User Experience)
**Component**: Free Tier / HuggingFace + Multi-Agent Orchestration

---

## Symptoms

When running a research query on Free Tier (Qwen2.5-7B-Instruct), the streaming output shows **garbage tokens** and **malformed tool calls** instead of coherent agent reasoning:

### Symptom A: Random Garbage Tokens
```text
📡 **STREAMING**: yarg
📡 **STREAMING**: PostalCodes
📡 **STREAMING**: FunctionFlags
📡 **STREAMING**: system
📡 **STREAMING**: Transferred to searcher, adopt the persona immediately.
```

### Symptom B: Raw Tool Call JSON in Text (NEW - 2025-12-03)
```text
📡 **STREAMING**:
oleon
{"name": "search_preprints", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
</tool_call>
system

UrlParser
{"name": "search_clinical_trials", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
```

The model is outputting:
1. **Garbage tokens**: "oleon", "UrlParser" - meaningless fragments
2. **Raw JSON tool calls**: `{"name": "search_preprints", ...}` - intended tool calls output as TEXT
3. **XML-style tags**: `</tool_call>` - model trying to use the wrong tool calling format
4. **"system" keyword**: Model confusing role markers with content

**Root Cause of Symptom B**: The 7B model is attempting to make tool calls but outputting them as **text content** instead of using the HuggingFace API's native `tool_calls` structure. The model may have been trained on a different tool calling format (XML-style like Claude's `<tool_call>` tags) and doesn't properly use the OpenAI-compatible JSON format.

The model outputs random tokens like "yarg", "PostalCodes", "FunctionFlags" instead of actual research reasoning.

---

## Reproduction Steps

1. Go to HuggingFace Spaces: https://huggingface.co/spaces/vcms/deepboner
2. Leave API key empty (Free Tier)
3. Click any example query or type a question
4. Click submit
5. Observe streaming output - garbage tokens appear

**Expected**: Coherent agent reasoning like "Searching PubMed for female libido treatments..."
**Actual**: Random tokens like "yarg", "PostalCodes"

---

## Root Cause Analysis

### Primary Cause: 7B Model Too Small for Multi-Agent Prompts

The Qwen2.5-7B-Instruct model has **insufficient reasoning capacity** for the complex multi-agent framework. The system requires the model to:

1. **Adopt agent personas** with specialized instructions
2. **Follow structured workflows** (Search → Judge → Hypothesis → Report)
3. **Make tool calls** (search_pubmed, search_clinical_trials, etc.)
4. **Generate JSON-formatted progress ledgers** for workflow control
5. **Understand manager instructions** and delegate appropriately

A 7B parameter model simply does not have the reasoning depth to handle this. Larger models (70B+) were originally intended, but those are routed to unreliable third-party providers (see `HF_FREE_TIER_ANALYSIS.md`).

### Technical Flow (Where Garbage Appears)

```
User Query

AdvancedOrchestrator.run()  [advanced.py:247]

workflow.run_stream(task)  [builds Magentic workflow]

MagenticAgentDeltaEvent emitted with event.text

Yields AgentEvent(type="streaming", message=event.text)  [advanced.py:314-319]

Gradio displays: "📡 **STREAMING**: {garbage}"
```

The garbage tokens are **raw model output**. The 7B model is:
- Not following the system prompt
- Outputting partial/incomplete token sequences
- Possibly attempting tool calls but formatting them incorrectly
- Hallucinating random words

### Evidence from Microsoft Reference Framework

The Microsoft Agent Framework's `_magentic.py` (lines 1717-1741) shows how agent invocation works:

```python
async for update in agent.run_stream(messages=self._chat_history):
    updates.append(update)
    await self._emit_agent_delta_event(ctx, update)
```

The framework passes through whatever the underlying chat client produces. If the model produces garbage, the framework streams it directly.

### Why Click Example vs Submit Shows Different Initial State

Both code paths go through the same `research_agent()` function in `app.py`. The difference:

- **Example click**: Immediately submits the query, so you see garbage quickly
- **Submit button click**: Shows the "Starting research (Advanced mode)" banner first, then garbage

Both ultimately produce the same garbage output from the 7B model.

---

## Impact Assessment

| Aspect | Impact |
|--------|--------|
| Free Tier Users | Cannot get usable research results |
| Demo Quality | Appears broken/unprofessional |
| Trust | Users may think the entire system is broken |
| Differentiation | Undermines "free tier works!" messaging |

---

## Potential Solutions

### Option 1: Switch to Better Small Model (Recommended - Quick Fix)

Find a small model that better handles complex instructions. Candidates:

| Model | Size | Tool Calling | Instruction Following |
|-------|------|--------------|----------------------|
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | Yes | Better |
| `microsoft/Phi-3-mini-4k-instruct` | 3.8B | Limited | Good |
| `google/gemma-2-9b-it` | 9B | Yes | Good |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | Yes | Better |

**Risk**: The 14B model might still be routed to third-party providers. Need to test each.

### Option 2: Simplify Free Tier Architecture

Create a **simpler single-agent mode** for Free Tier:
- Remove multi-agent coordination (Manager, multiple ChatAgents)
- Use a single direct query → search → synthesize flow
- Reduce prompt complexity significantly

**Pros**: More reliable with smaller models
**Cons**: Loses sophisticated multi-agent research capability

### Option 3: Output Filtering/Validation

Add a validation layer to detect and filter garbage output:

```python
def is_valid_streaming_token(text: str) -> bool:
    """Check if streaming token appears valid."""
    # Garbage patterns we've seen
    garbage_patterns = ["yarg", "PostalCodes", "FunctionFlags"]
    if any(g in text for g in garbage_patterns):
        return False
    # Check for minimum coherence (non-empty after stripping whitespace)
    return bool(text.strip())
```

**Pros:** Band-aid fix, quick to implement
**Cons:** Doesn't fix the root cause, will miss new garbage patterns

### Option 4: Graceful Degradation

Detect when model output is incoherent (a minimal sketch follows) and fall back to:
- Returning an error message
- Suggesting the user provide an API key
- Using a cached/templated response

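A minimal sketch of that fallback (the threshold and message wording are placeholders; the token counts would be accumulated in the streaming loop):

```python
def degradation_notice(total_tokens: int, garbage_tokens: int) -> str | None:
    """Return a user-facing fallback message when output is mostly incoherent."""
    if total_tokens == 0:
        return None
    if garbage_tokens / total_tokens > 0.5:  # placeholder threshold
        return (
            "⚠️ The free-tier model produced unreliable output for this query. "
            "Provide an OpenAI or Anthropic API key for higher-quality results."
        )
    return None
```
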
### Option 5: Prompt Engineering for 7B Models

Significantly simplify the agent prompts for 7B compatibility:
- Shorter system prompts
- More explicit step-by-step instructions
- Remove abstract concepts
- Use few-shot examples

### Option 6: Streaming Content Filter (For Symptom B)

Filter raw tool call JSON from streaming output:

```python
def should_stream_content(text: str) -> bool:
    """Filter garbage and raw tool calls from streaming."""
    # Don't stream raw JSON tool calls
    if text.strip().startswith('{"name":'):
        return False
    # Don't stream XML-style tool tags
    if '</tool_call>' in text or '<tool_call>' in text:
        return False
    # Don't stream garbage tokens (extend as needed)
    garbage = ["oleon", "UrlParser", "yarg", "PostalCodes", "FunctionFlags"]
    if any(g in text for g in garbage):
        return False
    return True
```

**Location**: `src/orchestrators/advanced.py` lines 315-322

This would prevent the raw tool call JSON from being shown to users, even if the model produces it.

---

## Recommended Action Plan

### Phase 1: Quick Fix (P2)
1. Test `mistralai/Mistral-7B-Instruct-v0.3` or `Qwen/Qwen2.5-14B-Instruct`
2. Verify they stay on HuggingFace native infrastructure (no third-party routing)
3. Evaluate output quality on sample queries

### Phase 2: Architecture Review (P3)
1. Consider a simplified single-agent mode for Free Tier
2. Design graceful degradation for when model output is invalid
3. Add an output validation layer

### Phase 3: Long-term (P4)
1. Consider a hybrid approach: simple mode for free tier, advanced for paid
2. Explore fine-tuning a small model specifically for research agent tasks

---

## Files Involved

| File | Relevance |
|------|-----------|
| `src/orchestrators/advanced.py` | Main orchestrator, streaming event handling |
| `src/clients/huggingface.py` | HuggingFace chat client adapter |
| `src/agents/magentic_agents.py` | Agent definitions and prompts |
| `src/app.py` | Gradio UI, event display |
| `src/utils/config.py` | Model configuration |

---

## Relation to Previous Bugs

- **P0 Repr Bug (RESOLVED)**: Fixed in PR #117 - Was about `<generator object>` appearing due to async generator mishandling
- **P1 HuggingFace Novita Error (RESOLVED)**: Fixed in PR #118 - Was about 72B models being routed to failing third-party providers

This P2 bug is **downstream** of the P1 fix - we fixed the 500 errors by switching to 7B, but now the 7B model doesn't produce quality output.

---

## Questions to Investigate

1. What models in the 7-20B range stay on HuggingFace native infrastructure?
2. Can we detect third-party routing before making the full request?
3. Is the chat template correct for Qwen2.5-7B? (Some models need specific formatting - a quick check is sketched below)
4. Are there HuggingFace serverless models specifically optimized for tool calling?

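For question 3, a quick local check of the chat template (requires `transformers`; Qwen's template should render ChatML-style `<|im_start|>`/`<|im_end|>` markers):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)  # compare against what the router actually sends
```
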
---

## References

- `HF_FREE_TIER_ANALYSIS.md` - Analysis of HuggingFace provider routing
- `CLAUDE.md` - Critical HuggingFace Free Tier section
- Microsoft Agent Framework `_magentic.py` - Reference implementation

docs/bugs/archive/P2_ADVANCED_MODE_COLD_START_NO_FEEDBACK.md DELETED
@@ -1,255 +0,0 @@
# P2: Advanced Mode Cold Start Has No User Feedback

**Priority**: P2 (UX Friction)
**Component**: `src/orchestrators/advanced.py`
**Status**: ✅ FIXED (All Phases Complete)
**Issue**: [#108](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/108)
**Created**: 2025-12-01

## Summary

When Advanced Mode starts, users experience three significant "dead zones" with no visual feedback:

1. **Initialization delay** (5-15 seconds): Between "STARTED" and "THINKING" events
2. **First LLM call delay** (10-30+ seconds): Between "THINKING" and first "PROGRESS" event
3. **Agent execution delay** (30-90+ seconds): After "PROGRESS" while SearchAgent executes

Users see the UI freeze with no indication of what's happening, leading to confusion about whether the system is working.

## Visual Timeline

```
🚀 STARTED: Starting research (Advanced mode)...
    │
    │ ← DEAD ZONE #1: 5-15 seconds of nothing
    │   - Loading LlamaIndex/ChromaDB
    │   - Initializing embedding service
    │   - Building 4 agents + manager
    │
⏳ THINKING: Multi-agent reasoning in progress...
    │
    │ ← DEAD ZONE #2: 10-30+ seconds of nothing
    │   - Manager agent's first OpenAI API call
    │   - Cold connection to OpenAI
    │
⏱️ PROGRESS: Manager assigning research task...
    │
    │ ← DEAD ZONE #3: 30-90+ seconds of nothing
    │   - SearchAgent executing PubMed/ClinicalTrials/EuropePMC queries
    │   - Embedding and storing results in ChromaDB
    │   - No streaming events during search execution
    │
📊 SEARCH_COMPLETE / PROGRESS: Round 1/5...
```

## Root Cause Analysis

### Dead Zone #1: Initialization (Lines 162-165)

```python
yield AgentEvent(type="started", ...)  # User sees this

# === BLOCKING OPERATIONS (no events yielded) ===
embedding_service = self._init_embedding_service()  # ChromaDB, embeddings
init_magentic_state(query, embedding_service)       # Shared state
workflow = self._build_workflow()                   # 4 agents + manager

yield AgentEvent(type="thinking", ...)  # User finally sees this
```

**What's happening:**
1. `_init_embedding_service()` → Loads LlamaIndex, connects to ChromaDB, initializes OpenAI embeddings
2. `init_magentic_state()` → Creates ResearchMemory, sets up context
3. `_build_workflow()` → Instantiates SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent, Manager

### Dead Zone #2: First LLM Call (Line 206)

```python
yield AgentEvent(type="thinking", ...)  # User sees this

async for event in workflow.run_stream(task):  # BLOCKING until first event
    # Manager makes first OpenAI call here
    # No events until manager responds and starts delegating
```

**What's happening:**
- Microsoft Agent Framework's manager agent receives the task
- Makes a synchronous(ish) call to OpenAI for orchestration planning
- Only after the response does it emit `MagenticOrchestratorMessageEvent`

### Dead Zone #3: Agent Execution (After PROGRESS event)

After "Manager assigning research task...", the SearchAgent executes but emits no events until it completes.

**What's happening:**
- SearchAgent receives the task from the manager
- Executes parallel queries to PubMed, ClinicalTrials.gov, Europe PMC
- Each result is embedded and stored in ChromaDB
- Only after ALL searches complete does it emit `MagenticAgentMessageEvent`

**Why no streaming:**
- The agent's internal tool calls (search APIs, embeddings) don't emit framework events
- Microsoft Agent Framework only emits events at agent message boundaries
- 3 databases × multiple queries × embedding each result = long silent period

**Potential fix:** Add progress callbacks to `SearchAgent` tools:
```python
# In search_agent.py - hypothetical
async def search_pubmed(query: str, on_progress: Callable = None):
    results = await pubmed_client.search(query)
    if on_progress:
        on_progress(f"Found {len(results)} PubMed results")
    # ... embed and store
```

## Impact

1. **User Confusion**: "Is it frozen? Should I refresh?"
2. **Perceived Slowness**: Dead time feels longer than active progress
3. **No Cancel Option**: Users can't abort during these zones
4. **Support Burden**: Users report "it's not working" when it's actually initializing

## Proposed Solutions

### Option A: Granular Initialization Events (Quick Win)

Add progress events during initialization:

```python
yield AgentEvent(type="started", ...)

yield AgentEvent(
    type="progress",
    message="Loading embedding service...",
    iteration=0,
)
embedding_service = self._init_embedding_service()

yield AgentEvent(
    type="progress",
    message="Initializing research memory...",
    iteration=0,
)
init_magentic_state(query, embedding_service)

yield AgentEvent(
    type="progress",
    message="Building agent team (Search, Judge, Hypothesis, Report)...",
    iteration=0,
)
workflow = self._build_workflow()

yield AgentEvent(type="thinking", ...)
```

**Pros**: Simple, immediate feedback
**Cons**: Still sequential, doesn't speed up actual time

### Option B: Parallel Initialization (Performance + UX)

Use `asyncio.gather()` for independent operations:

```python
yield AgentEvent(type="progress", message="Initializing agents...", iteration=0)

# These could potentially run in parallel
embedding_task = asyncio.create_task(self._init_embedding_service_async())
workflow_task = asyncio.create_task(self._build_workflow_async())

embedding_service, workflow = await asyncio.gather(embedding_task, workflow_task)
init_magentic_state(query, embedding_service)
```

**Pros**: Faster initialization, better UX
**Cons**: Need to verify thread safety, more complex

### Option C: Pre-warming / Singleton Services

Initialize expensive services once at app startup, not per-request:

```python
# In app.py startup
global_embedding_service = init_embedding_service()
global_workflow_template = build_workflow_template()

# In orchestrator
workflow = global_workflow_template.clone()  # Fast
```

**Pros**: Near-instant start after first request
**Cons**: Memory overhead, cold start on first request still slow

### Option D: Animated Progress Indicator (UI-Only)

Add a Gradio progress bar or spinner that animates during the dead zones:

```python
# In app.py
with gr.Blocks() as demo:
    progress = gr.Progress()

    async def research(query):
        progress(0.1, desc="Initializing...")
        # ...
        progress(0.2, desc="Building agents...")
```

**Pros**: User sees activity even if there is nothing to report
**Cons**: Doesn't solve the actual blocking, Gradio-specific

## Recommended Approach

**Phase 1 (Quick Win)**: Option A - Add granular events ✅ COMPLETE
**Phase 2 (Performance)**: Option C - Pre-warm services at startup ✅ COMPLETE
**Phase 3 (Polish)**: Option D - Gradio progress bar ✅ COMPLETE

## Related Considerations

### Parallel Agent Orchestration

The current Microsoft Agent Framework runs agents sequentially through the manager. True parallel execution would require:

1. Breaking out of the framework's `run_stream()` pattern
2. Implementing our own parallel task dispatch
3. Managing agent coordination manually

This is a larger architectural change (P1 scope) and should be tracked separately if desired.

## Files to Modify

1. `src/orchestrators/advanced.py:155-210` - Add initialization events in `run()` method
2. `src/utils/service_loader.py` - Pre-warming logic
3. `src/app.py` - Gradio progress integration

## Testing the Issue

```python
import asyncio
import time

from src.orchestrators.advanced import AdvancedOrchestrator


async def test():
    orch = AdvancedOrchestrator(max_rounds=3)
    start = time.time()
    async for event in orch.run("test query"):
        elapsed = time.time() - start
        print(f"[{elapsed:.1f}s] {event.type}: {event.message[:50]}...")
        if event.type == "complete":
            break


asyncio.run(test())
```

Expected output showing the gaps:
```
[0.0s] started: Starting research (Advanced mode)...
[8.2s] thinking: Multi-agent reasoning in progress...   ← 8 second gap!
[22.5s] progress: Manager assigning research task...    ← 14 second gap!
```

## References

- Advanced orchestrator: `src/orchestrators/advanced.py`
- Embedding service loader: `src/utils/service_loader.py`
- LlamaIndex RAG: `src/services/llamaindex_rag.py`
- Microsoft Agent Framework: `agent-framework-core`

docs/bugs/archive/P2_ARCHITECTURAL_BYOK_GAPS.md DELETED
@@ -1,100 +0,0 @@
# P2 Architectural: BYOK Gaps in Non-Critical Paths

**Date**: 2025-12-03
**Status**: ✅ RESOLVED
**Severity**: P2 (Architectural Debt)
**Component**: LLM Routing / BYOK Support
**Resolution**: Fixed end-to-end BYOK support in this PR

---

## Summary

Two code paths do NOT support BYOK (Bring Your Own Key) from Gradio:

1. **HierarchicalOrchestrator** - Doesn't receive the `api_key` parameter
2. **get_model() (PydanticAI)** - Only checks env vars, no BYOK

These are **latent bugs** - they don't affect the main user flow currently.

---

## Bug 1: HierarchicalOrchestrator Missing api_key

**Location**: `src/orchestrators/factory.py:61-64`

```python
if effective_mode == "hierarchical":
    from src.orchestrators.hierarchical import HierarchicalOrchestrator
    return HierarchicalOrchestrator(config=effective_config, domain=domain)
    # BUG: api_key is NOT passed to HierarchicalOrchestrator
```

**Impact**: If hierarchical mode were exposed in the UI, BYOK would not work.

**Current State**: Hierarchical mode is NOT exposed in the Gradio UI, so this is latent.

**Fix**: Pass `api_key` to HierarchicalOrchestrator when instantiating, as sketched below.
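
A one-line sketch of the pass-through (assuming the constructor accepts `api_key`):

```python
if effective_mode == "hierarchical":
    from src.orchestrators.hierarchical import HierarchicalOrchestrator
    # Forward the BYOK key alongside the existing arguments
    return HierarchicalOrchestrator(config=effective_config, domain=domain, api_key=api_key)
```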

---

## Bug 2: get_model() Doesn't Support BYOK

**Location**: `src/agent_factory/judges.py:62-91` (function `get_model()`)

```python
def get_model() -> Any:
    # Priority 1: OpenAI
    if settings.has_openai_key:  # Only checks ENV VAR
        ...
    # Priority 2: Anthropic
    if settings.has_anthropic_key:  # Only checks ENV VAR
        ...
    # Priority 3: HuggingFace
    if settings.has_huggingface_key:  # Only checks ENV VAR
        ...
```

**Impact**: PydanticAI-based components (judges, statistical analyzer) cannot use BYOK keys.

**Current State**: The main Advanced mode flow uses `get_chat_client()` (Microsoft Agent Framework), NOT `get_model()`. So this is latent.

**Fix**: Either:
1. Add an `api_key` parameter to `get_model()` (sketched below)
2. Or deprecate `get_model()` in favor of `get_chat_client()` everywhere
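
A sketch of option 1, reusing the key-prefix detection that `get_chat_client()` already performs (model ids and constructor kwargs here are placeholders, not the real signatures):

```python
def get_model(api_key: str | None = None) -> Any:
    if api_key:
        # BYOK first: route by key prefix, mirroring src/clients/factory.py
        if api_key.startswith("sk-ant-"):
            return AnthropicModel("claude-placeholder", api_key=api_key)
        if api_key.startswith("sk-"):
            return OpenAIChatModel("gpt-placeholder", api_key=api_key)
        if api_key.startswith("hf_"):
            return HuggingFaceModel("hf-placeholder", api_key=api_key)
    # Otherwise fall back to the existing env-var priority chain
    ...
```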

---

## Architecture Notes

The codebase has **TWO separate LLM routing systems**:

| System | Function | BYOK Support | Used By |
|--------|----------|--------------|---------|
| Microsoft Agent Framework | `get_chat_client()` | **YES** (key prefix detection) | Advanced mode (main flow) |
| PydanticAI | `get_model()` | **NO** (env vars only) | Judges, statistical analyzer |

This dual-system architecture creates confusion and maintenance burden.

---

## Recommendation

**Short-term**: Leave as-is (latent, not blocking)

**Long-term**: Unify on `get_chat_client()` and deprecate `get_model()` (see P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md for related cleanup)

---

## Test Results

- All 310 unit tests pass
- Main user flow (Gradio → Advanced) works with BYOK

---

## Related Documents

- `P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md` - Related architecture cleanup
- `src/clients/factory.py` - BYOK-capable factory (correct implementation)
- `src/agent_factory/judges.py` - Non-BYOK factory (needs fix)

docs/bugs/archive/P2_DUPLICATE_REPORT_CONTENT.md DELETED
@@ -1,151 +0,0 @@
# P2 Bug: Duplicate Report Content in Output

**Date**: 2025-12-03
**Status**: FIXED (PR fix/p2-double-bug-squash)
**Severity**: P2 (UX - Duplicate content confuses users)
**Component**: `src/orchestrators/advanced.py`
**Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)

---

## Executive Summary

This is a **confirmed stack bug**, NOT a model limitation. The duplicate report appears because:

1. Streaming events yield the full report content character-by-character
2. Final events (`MagenticFinalResultEvent`/`WorkflowOutputEvent`) contain the SAME content
3. No deduplication exists between streamed content and final event content
4. Both are appended to the output

---

## Symptom

The final research report appears **twice** in the UI output:
1. First as streaming content (with `📡 **STREAMING**:` prefix)
2. Then again as a complete event (without prefix)

---

## Root Cause

The `_process_event()` method handles final events but has **no access to buffer state**. The buffer was already cleared at line 337 before these events arrive.

```python
# Line 337: Buffer cleared
current_message_buffer = ""
continue

# Line 341: Final events processed WITHOUT buffer context
agent_event = self._process_event(event, iteration)  # No buffer info!
```

---

## The Fix (Consensus: Stateful Orchestrator Logic)

**Location**: `src/orchestrators/advanced.py` `run()` method

**Strategy**: Handle final events **inline in the run() loop** where buffer state exists. Track streaming volume to decide whether to re-emit content.

### Why This Is Correct

| Rejected Approach | Why Wrong |
|-------------------|-----------|
| UI-side string comparison | Wrong layer, fragile, treats the symptom |
| Stateless `_process_event` fix | No state = can't know if streaming occurred |
| **Stateful run() loop** | ✅ Only place with full lifecycle visibility |

The `run()` loop is the **single source of truth** for the request lifecycle. It "saw" the content stream out. It must decide whether to re-emit.

### Implementation

```python
# In run() method, add tracking variable after line 302:
last_streamed_length: int = 0

# Before clearing buffer at line 337, save its length:
last_streamed_length = len(current_message_buffer)
current_message_buffer = ""
continue

# Replace lines 340-345 with inline handling of final events:
if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
    final_event_received = True

    # DECISION: Did we stream substantial content?
    if last_streamed_length > 100:
        # YES: Final event is a SIGNAL, not a payload
        yield AgentEvent(
            type="complete",
            message="Research complete.",
            data={"iterations": iteration, "streamed_chars": last_streamed_length},
            iteration=iteration,
        )
    else:
        # NO: Final event must carry the payload (tool-only turn, cache hit)
        if isinstance(event, MagenticFinalResultEvent):
            text = self._extract_text(event.message) if event.message else "No result"
        else:  # WorkflowOutputEvent
            text = self._extract_text(event.data) if event.data else "Research complete"
        yield AgentEvent(
            type="complete",
            message=text,
            data={"iterations": iteration},
            iteration=iteration,
        )
    continue

# Keep existing fallback for other events:
agent_event = self._process_event(event, iteration)
```

### Why a Threshold of 100 Chars?

- `> 0` is too aggressive (might catch single-word streams)
- `> 500` is too conservative (might miss short but complete responses)
- `> 100` distinguishes "real content was streamed" from "just status messages"

---

## Edge Cases Handled

| Scenario | `last_streamed_length` | Action |
|----------|------------------------|--------|
115
- | Normal streaming report | 5000+ | Emit "Research complete." |
116
- | Tool call, no text | 0 | Emit full content from final event |
117
- | Very short response | 50 | Emit full content (fallback) |
118
- | Agent switch mid-stream | Reset on switch | Tracks only final agent |
119
-
120
- ---
121
-
122
- ## Files to Modify
123
-
124
- | File | Lines | Change |
125
- |------|-------|--------|
126
- | `src/orchestrators/advanced.py` | 296-345 | Add `last_streamed_length`, handle final events inline |
127
- | `src/orchestrators/advanced.py` | 532-552 | Optional: remove dead code from `_process_event()` |
128
-
129
- ---
130
-
131
- ## Test Plan
132
-
133
- 1. **Happy Path**: Run query, verify report appears ONCE (sketched below)
134
- 2. **Fallback**: Mock tool-only turn (no streaming), verify full content emitted
135
- 3. **Both Tiers**: Test Free Tier and Paid Tier
136
-
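- A hedged sketch of the happy-path check (the fixture and mocked workflow are assumptions; event shapes follow the implementation above):
-
- ```python
- import pytest
-
- @pytest.mark.asyncio
- async def test_report_emitted_once(orchestrator_with_mock_workflow):
-     # Hypothetical fixture: the mocked workflow streams >100 chars, then fires a final event
-     events = [e async for e in orchestrator_with_mock_workflow.run("test query")]
-     completes = [e for e in events if e.type == "complete"]
-     assert len(completes) == 1
-     # With substantial streaming, the final event is a signal, not a payload
-     assert completes[0].message == "Research complete."
- ```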
137
- ---
138
-
139
- ## Validation
140
-
141
- This fix was independently validated by two AI agents (Claude and Gemini) analyzing the architecture. Both concluded:
142
-
143
- > "The Stateful Orchestrator Fix is the correct engineering solution. The 'Source of Truth' is the Orchestrator's runtime state."
144
-
145
- ---
146
-
147
- ## Related
148
-
149
- - **Not related to model quality** - This is a stack bug
150
- - P1 Free Tier fix enabled streaming, exposing this bug
151
- - SPEC-17 Accumulator Pattern addressed repr bug but created this side effect
 
docs/bugs/archive/P2_EXECUTOR_COMPLETED_EVENT_UI_NOISE.md DELETED
@@ -1,351 +0,0 @@
1
- # P2 Bug: ExecutorCompletedEvent UI Noise
2
-
3
- **Status**: VALIDATED - Ready for Implementation
4
- **Discovered**: 2025-12-05
5
- **Senior Review**: 2025-12-05 (External agent audit confirmed analysis)
6
- **Severity**: P2 (UX noise, confusing but not blocking)
7
- **Component**: `src/orchestrators/advanced.py`
8
-
9
- ---
10
-
11
- ## Symptom
12
-
13
- After the report synthesis completes, extra events appear in the UI:
14
-
15
- ```text
16
- 📝 **SYNTHESIZING**: Synthesizing research findings...
17
- [...full report content...]
18
-
19
- 🧠 **JUDGING**: ManagerAgent: Action completed (Tool Call)
20
- ⏱️ **PROGRESS**: Step 11: ManagerAgent task completed
21
- ```
22
-
23
- The "JUDGING" and "PROGRESS" events appear AFTER the report is already displayed, creating confusion.
24
-
25
- ---
26
-
27
- ## Root Cause Analysis
28
-
29
- ### The Misunderstanding
30
-
31
- We're treating `ExecutorCompletedEvent` as a **UI event** when it's actually an **internal framework bookkeeping event**.
32
-
33
- ### Microsoft Agent Framework Design
34
-
35
- Looking at `agent_framework/_workflows/_executor.py` (lines 266-281):
36
-
37
- ```python
38
- # This is auto-emitted by the framework - NOT for UI consumption
39
- with _framework_event_origin():
40
- completed_event = ExecutorCompletedEvent(self.id, sent_messages if sent_messages else None)
41
- await context.add_event(completed_event)
42
- ```
43
-
44
- The framework emits `ExecutorCompletedEvent` automatically after every executor handler completes. This includes:
45
- - SearchAgent completing a search
46
- - JudgeAgent completing evaluation
47
- - ReportAgent completing synthesis
48
- - **ManagerAgent completing coordination** (this is the problem)
49
-
50
- ### What the MS Framework Sample Does
51
-
52
- From `samples/getting_started/workflows/orchestration/magentic.py`:
53
-
54
- ```python
55
- async for event in workflow.run_stream(task):
56
- if isinstance(event, AgentRunUpdateEvent):
57
- # Handle streaming with metadata
58
- props = event.data.additional_properties if event.data else None
59
- event_type = props.get("magentic_event_type") if props else None
60
- # ...
61
- elif isinstance(event, WorkflowOutputEvent):
62
- # Handle final output
63
- output = output_messages[-1].text
64
- ```
65
-
66
- They only handle:
67
- 1. `AgentRunUpdateEvent` - for streaming content (with `magentic_event_type` metadata)
68
- 2. `WorkflowOutputEvent` - for final output
69
-
70
- **They do NOT emit UI events for `ExecutorCompletedEvent`.**
71
-
72
- ### Our Problematic Code
73
-
74
- In `src/orchestrators/advanced.py`:
75
-
76
- ```python
77
- # Line 348-368: We emit UI events for EVERY ExecutorCompletedEvent
78
- if isinstance(event, ExecutorCompletedEvent):
79
- state.iteration += 1
80
-
81
- comp_event, prog_event = self._handle_completion_event(...)
82
- yield comp_event # <-- WRONG: UI event for internal framework event
83
- yield prog_event # <-- WRONG: More noise
84
- ```
85
-
86
- ### Why the Manager Fires a Completion Event
87
-
88
- The workflow execution order:
89
- 1. ReportAgent streams its output (`AgentRunUpdateEvent`)
90
- 2. ReportAgent handler completes → `ExecutorCompletedEvent(reporter)` (we display this)
91
- 3. Manager orchestrator handler completes → `ExecutorCompletedEvent(manager)` (we display this too!)
92
- 4. `WorkflowOutputEvent` (final)
93
-
94
- The Manager is also an executor in the framework. When it finishes coordinating (after ReportAgent returns), it fires its own `ExecutorCompletedEvent`. We're incorrectly emitting UI events for this.
95
-
96
- ---
97
-
98
- ## Impact
99
-
100
- 1. **User Confusion**: Extra "JUDGING: ManagerAgent" events after the report
101
- 2. **UX Noise**: Progress events that don't add value
102
- 3. **Incorrect Semantics**: Manager completions displayed as agent activity
103
- 4. **No Functional Bug**: The workflow completes correctly, just noisy
104
-
105
- ---
106
-
107
- ## The Fix
108
-
109
- ### Stop Emitting UI Events for ExecutorCompletedEvent
110
-
111
- Remove UI event emission for `ExecutorCompletedEvent` entirely. Keep internal state tracking only.
112
-
113
- **Before (buggy):**
114
-
115
- ```python
116
- if isinstance(event, ExecutorCompletedEvent):
117
- state.iteration += 1
118
- agent_name = getattr(event, "executor_id", "") or "unknown"
119
- if REPORTER_AGENT_ID in agent_name.lower():
120
- state.reporter_ran = True
121
-
122
- comp_event, prog_event = self._handle_completion_event(...)
123
- yield comp_event # <-- REMOVE: Emits UI noise
124
- yield prog_event # <-- REMOVE: Emits UI noise
125
- ```
126
-
127
- **After (correct):**
128
-
129
- ```python
130
- if isinstance(event, ExecutorCompletedEvent):
131
- # Internal state tracking only - NO UI events
132
- agent_name = getattr(event, "executor_id", "") or "unknown"
133
- if REPORTER_AGENT_ID in agent_name.lower():
134
- state.reporter_ran = True
135
- state.current_message_buffer = ""
136
- continue # Skip to next event - do not yield anything
137
- ```
138
-
139
- **Key changes:**
140
- 1. Remove `yield comp_event` and `yield prog_event`
141
- 2. Remove `state.iteration += 1` (iteration counter becomes meaningless without UI events)
142
- 3. Keep `state.reporter_ran` tracking (needed for fallback synthesis logic)
143
- 4. Add `continue` to skip to next event
144
-
145
- **Why this is correct:**
146
- - Aligns with MS framework design (their sample ignores `ExecutorCompletedEvent`)
147
- - Eliminates all completion noise including trailing "ManagerAgent" events
148
- - The streaming events (`AgentRunUpdateEvent`) already provide real-time feedback
149
- - `WorkflowOutputEvent` signals completion
150
-
151
- ### Additional Fix: Add Metadata Filtering to AgentRunUpdateEvent
152
-
153
- The senior review identified a gap: we're not filtering `AgentRunUpdateEvent` by `magentic_event_type`.
154
-
155
- **Current (incomplete):**
156
-
157
- ```python
158
- if isinstance(event, AgentRunUpdateEvent):
159
- if event.data and hasattr(event.data, "text") and event.data.text:
160
- yield AgentEvent(type="streaming", message=event.data.text)
161
- ```
162
-
163
- **Should be:**
164
-
165
- ```python
166
- if isinstance(event, AgentRunUpdateEvent):
167
- if event.data and hasattr(event.data, "text") and event.data.text:
168
- # Check metadata to filter internal orchestrator messages
169
- props = getattr(event.data, "additional_properties", None) or {}
170
- event_type = props.get("magentic_event_type")
171
- msg_kind = props.get("orchestrator_message_kind")
172
-
173
- # Filter out internal orchestrator messages (task_ledger, instruction)
174
- if event_type == MAGENTIC_EVENT_TYPE_ORCHESTRATOR:
175
- if msg_kind in ("task_ledger", "instruction"):
176
- continue # Skip internal coordination messages
177
-
178
- yield AgentEvent(type="streaming", message=event.data.text)
179
- ```
180
-
181
- **Why this matters:**
182
- - Prevents internal JSON blobs from being displayed
183
- - Filters out raw planning/instruction prompts not meant for users
184
- - Aligns with how MS sample consumes events
185
-
186
- ---
187
-
188
- ## Related Code Locations
189
-
190
- - `src/orchestrators/advanced.py` line 348-368: ExecutorCompletedEvent handling
191
- - `src/orchestrators/advanced.py` line 437-469: `_handle_completion_event` method
192
- - MS Framework: `python/packages/core/agent_framework/_workflows/_executor.py` line 277-281
193
- - MS Framework: `python/packages/core/agent_framework/_workflows/_magentic.py` line 1962-1976
194
-
195
- ---
196
-
197
- ## Related Issues
198
-
199
- - P2 Round Counter Semantic Mismatch (FIXED) - Changed display from "Round X/Y" to "Step N"
200
- - This bug explains why step count was confusing - we count internal events too
201
-
202
- ---
203
-
204
- ## Framework Event Architecture Deep Dive
205
-
206
- ### Event Categories in MS Agent Framework
207
-
208
- The framework has distinct event categories with different purposes:
209
-
210
- #### 1. Workflow Lifecycle Events (Framework-emitted, internal)
211
-
212
- | Event | Purpose | UI Relevant? |
213
- |-------|---------|--------------|
214
- | `WorkflowStartedEvent` | Run begins | No |
215
- | `WorkflowStatusEvent` | State transitions (IN_PROGRESS, IDLE, FAILED) | No |
216
- | `WorkflowFailedEvent` | Error with structured details | Maybe (errors) |
217
-
218
- #### 2. Superstep Events (Framework-emitted, internal)
219
-
220
- | Event | Purpose | UI Relevant? |
221
- |-------|---------|--------------|
222
- | `SuperStepStartedEvent` | Pregel superstep begins | No |
223
- | `SuperStepCompletedEvent` | Pregel superstep ends | No |
224
-
225
- #### 3. Executor Events (Framework-emitted automatically, internal)
226
-
227
- | Event | Purpose | UI Relevant? |
228
- |-------|---------|--------------|
229
- | `ExecutorInvokedEvent` | Handler starts | No |
230
- | `ExecutorCompletedEvent` | Handler completes | **NO** |
231
- | `ExecutorFailedEvent` | Handler errors | Maybe (errors) |
232
-
233
- #### 4. Application Events (User-code emitted via ctx.add_event, UI-facing)
234
-
235
- | Event | Purpose | UI Relevant? |
236
- |-------|---------|--------------|
237
- | `AgentRunUpdateEvent` | Streaming content | **YES** |
238
- | `AgentRunEvent` | Complete agent response | Yes |
239
- | `WorkflowOutputEvent` | Final workflow output | **YES** |
240
- | `RequestInfoEvent` | HITL request | Yes |
241
-
242
- ### Metadata Pattern in AgentRunUpdateEvent
243
-
244
- The MS framework uses `additional_properties` in `AgentRunUpdateEvent.data` for classification:
245
-
246
- ```python
247
- # Orchestrator message
248
- additional_properties={
249
- "magentic_event_type": "orchestrator_message",
250
- "orchestrator_message_kind": "user_task" | "task_ledger" | "instruction" | "notice",
251
- "orchestrator_id": "...",
252
- }
253
-
254
- # Agent streaming
255
- additional_properties={
256
- "magentic_event_type": "agent_delta",
257
- "agent_id": "searcher" | "judge" | ...,
258
- }
259
- ```
260
-
261
- ### What We Should Handle for UI
262
-
263
- 1. **`AgentRunUpdateEvent`** with metadata filtering:
264
- - `magentic_event_type: "agent_delta"` → Display agent streaming
265
- - `magentic_event_type: "orchestrator_message"` → Filter by `orchestrator_message_kind`:
266
- - `"user_task"` → Show (task assignment)
267
- - `"instruction"` → Filter out (internal)
268
- - `"task_ledger"` → Filter out (internal)
269
- - `"notice"` → Maybe show (warnings)
270
-
271
- 2. **`WorkflowOutputEvent`** → Final output (combined with rule 1 in the sketch below)
272
-
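- Putting the two rules together, the consuming loop reduces to roughly this shape (`extract_text` is an illustrative helper; attribute and constant names follow the snippets above):
-
- ```python
- async for event in workflow.run_stream(task):
-     if isinstance(event, AgentRunUpdateEvent) and event.data and getattr(event.data, "text", None):
-         props = getattr(event.data, "additional_properties", None) or {}
-         if props.get("magentic_event_type") == MAGENTIC_EVENT_TYPE_ORCHESTRATOR and props.get(
-             "orchestrator_message_kind"
-         ) in ("task_ledger", "instruction"):
-             continue  # internal coordination noise
-         yield AgentEvent(type="streaming", message=event.data.text)
-     elif isinstance(event, WorkflowOutputEvent):
-         yield AgentEvent(type="complete", message=extract_text(event))
-     # ExecutorCompletedEvent and other bookkeeping events fall through silently
- ```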
273
- ### What We Should NOT Handle for UI
274
-
275
- - `ExecutorCompletedEvent` - Internal bookkeeping
276
- - `ExecutorInvokedEvent` - Internal bookkeeping
277
- - `SuperStepStartedEvent/CompletedEvent` - Internal iteration
278
- - `WorkflowStatusEvent` - Internal state machine
279
-
280
- ---
281
-
282
- ## Required Import Changes
283
-
284
- **Current imports:**
285
-
286
- ```python
287
- from agent_framework import (
288
- MAGENTIC_EVENT_TYPE_ORCHESTRATOR,
289
- AgentRunUpdateEvent,
290
- ExecutorCompletedEvent, # Keep for internal tracking
291
- MagenticBuilder,
292
- WorkflowOutputEvent,
293
- )
294
- ```
295
-
296
- **Add these imports for metadata filtering:**
297
-
298
- ```python
299
- from agent_framework import (
300
- MAGENTIC_EVENT_TYPE_AGENT_DELTA, # For agent streaming detection
301
- ORCH_MSG_KIND_INSTRUCTION, # Filter internal messages
302
- ORCH_MSG_KIND_TASK_LEDGER, # Filter internal messages
303
- )
304
- ```
305
-
306
- ---
307
-
308
- ## Test Cases
309
-
310
- ```python
311
- def test_no_executor_completed_events_in_ui():
312
- """UI should not emit any events from ExecutorCompletedEvent."""
313
- # Run workflow to completion
314
- # Collect all yielded AgentEvent objects
315
- # Assert NONE have type "progress" with "task completed" message
316
- # Assert NONE have type matching completion patterns
317
- pass
318
-
319
- def test_internal_messages_filtered_from_streaming():
320
- """Internal orchestrator messages should be filtered from UI stream."""
321
- # Run workflow and collect all yielded events
322
- # Assert no events contain "task_ledger" content
323
- # Assert no events contain raw instruction prompts
324
- # Assert no JSON blobs in streaming output
325
- pass
326
-
327
- def test_reporter_ran_tracking_still_works():
328
- """Internal state.reporter_ran should still be set correctly."""
329
- # Run workflow to completion
330
- # Verify fallback synthesis is NOT triggered (reporter did run)
331
- # This ensures we didn't break internal tracking when removing UI events
332
- pass
333
- ```
334
-
335
- ---
336
-
337
- ## Why the Free Tier "Works"
338
-
339
- The user asked why the free tier seems to work despite expectations. The answer:
340
-
341
- 1. **The framework handles orchestration** - The MS Agent Framework manages the workflow (planning, progress tracking, agent coordination)
342
- 2. **The LLM just provides reasoning** - The model generates text, but the framework decides when to delegate, when to stop, etc.
343
- 3. **The "bugs" are in our UI layer** - The orchestration works correctly; we're just displaying internal events
344
-
345
- The free tier works because:
346
- - `MagenticBuilder` creates the workflow graph
347
- - `StandardMagenticManager` handles planning and progress evaluation
348
- - The framework routes messages between executors
349
- - The LLM quality affects answer quality, not workflow execution
350
-
351
- Our UI noise (trailing events) is a bug in how we consume framework events, not a framework bug.
 
docs/bugs/archive/P2_FIRST_TURN_TIMEOUT.md DELETED
@@ -1,160 +0,0 @@
1
- # P2 Bug: First Agent Turn Exceeds Workflow Timeout
2
-
3
- **Date**: 2025-12-03
4
- **Status**: FIXED (PR fix/p2-double-bug-squash)
5
- **Severity**: P2 (UX - Workflow always times out on complex queries)
6
- **Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py`
7
- **Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
8
-
9
- ---
10
-
11
- ## Executive Summary
12
-
13
- The search agent's first turn can exceed the 5-minute workflow timeout, causing:
14
- 1. `iterations=0` at timeout (no agent completed a turn)
15
- 2. `_handle_timeout()` synthesizes from partial evidence
16
- 3. Users get incomplete research results
17
-
18
- This is a **performance/architecture bug**, not a model issue.
19
-
20
- ---
21
-
22
- ## Symptom
23
-
24
- ```
25
- [warning] Workflow timed out iterations=0
26
- ```
27
-
28
- The workflow times out with `iterations=0` - meaning the first agent (search agent) never completed its turn before the 5-minute timeout.
29
-
30
- ---
31
-
32
- ## Root Cause
33
-
34
- The search agent's first turn is **extremely expensive**:
35
-
36
- ```
37
- Search Agent First Turn:
38
- ├── Manager assigns task
39
- ├── Search agent starts
40
- │ ├── Calls PubMed search tool (10 results)
41
- │ ├── Calls ClinicalTrials search tool (10 results)
42
- │ ├── Calls EuropePMC search tool (10 results)
43
- │ └── For EACH result (30 total):
44
- │ ├── Generate embedding (OpenAI API call)
45
- │ ├── Check for duplicates (ChromaDB query)
46
- │ └── Store in ChromaDB
47
-
48
- │ TOTAL: 30 results × (embedding + dedup + store) = 90+ API/DB operations
49
-
50
- └── Agent turn completes (if timeout hasn't fired)
51
- ```
52
-
53
- **The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget.
54
-
55
- ---
56
-
57
- ## Impact
58
-
59
- | Aspect | Impact |
60
- |--------|--------|
61
- | UX | Queries always timeout on first turn |
62
- | Research quality | Synthesis happens on partial evidence |
63
- | Confusion | `iterations=0` looks like nothing happened |
64
-
65
- ---
66
-
67
- ## The Fix (Consensus)
68
-
69
- **Reduce work per turn + increase timeout budget.**
70
-
71
- ### Implementation
72
-
73
- **1. Reduce results per tool (immediate)**
74
-
75
- `src/agents/search_agent.py` line 70:
76
- ```python
77
- # Change from 10 to 5
78
- result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
79
- ```
80
-
81
- **2. Increase workflow timeout (immediate)**
82
-
83
- `src/utils/config.py`:
84
- ```python
85
- advanced_timeout: float = Field(
86
- default=600.0, # Was 300.0 (5 min), now 10 min
87
- ge=60.0,
88
- le=900.0,
89
- description="Timeout for Advanced mode in seconds",
90
- )
91
- ```
92
-
93
- ### Why NOT Per-Turn Timeout
94
-
95
- **DANGER**: The SearchHandler uses `asyncio.gather()`:
96
-
97
- ```python
98
- # src/tools/search_handler.py line 163-164
99
- results = await asyncio.gather(*tasks, return_exceptions=True)
100
- ```
101
-
102
- This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` and the timeout fires, you get **zero results**, not partial results.
103
-
104
- ```python
105
- # DON'T DO THIS - yields nothing on timeout
106
- async with asyncio.timeout(60):
107
- result = await self._handler.execute(query) # Cancelled = zero results
108
- ```
109
-
110
- Per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work).
111
-
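- For contrast, a minimal sketch of what partial-result harvesting could look like once per-turn budgets exist (`asyncio.wait`, unlike `gather`, lets you keep the tasks that finished):
-
- ```python
- # Sketch only: assumes one asyncio.Task per search tool, created via asyncio.create_task
- done, pending = await asyncio.wait(tasks, timeout=60)
- for task in pending:
-     task.cancel()  # drop the stragglers past the per-turn budget
- partial = [t.result() for t in done if t.exception() is None]
- ```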
112
- ---
113
-
114
- ## Future Work (Streaming Evidence Ingestion)
115
-
116
- For proper fix, `SearchHandler.execute()` should:
117
- 1. Yield results as they arrive (async generator)
118
- 2. Support cancellation with partial results
119
- 3. Allow agent to return "what we have so far" on timeout
120
-
121
- ```python
122
- # Future architecture
123
- async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]:
124
- for tool in self.tools:
125
- async for evidence in tool.search_streaming(query):
126
- yield evidence # Can be cancelled at any point
127
- ```
128
-
129
- This is out of scope for the immediate fix.
130
-
131
- ---
132
-
133
- ## Test Plan
134
-
135
- 1. Run query with 10-minute timeout
136
- 2. Verify first agent turn completes before timeout
137
- 3. Verify `iterations >= 1` at workflow end
138
-
139
- ---
140
-
141
- ## Verification Data
142
-
143
- From diagnostic run:
144
- ```
145
- === RAW FRAMEWORK EVENTS ===
146
- MagenticAgentDeltaEvent: 284
147
- MagenticOrchestratorMessageEvent: 3
148
- ...
149
- NO MagenticAgentMessageEvent ← Agent never completed a turn!
150
-
151
- [warning] Workflow timed out iterations=0
152
- ```
153
-
154
- ---
155
-
156
- ## Related
157
-
158
- - P2 Duplicate Report Bug (separate issue, happens after successful completion)
159
- - `_handle_timeout()` correctly synthesizes, but with partial evidence
160
- - Not related to model quality - this is infrastructure/performance
 
docs/bugs/archive/P2_GRADIO_EXAMPLE_NOT_FILLING.md DELETED
@@ -1,68 +0,0 @@
1
- # P2 Bug Report: Third Example Not Filling Chat Box
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Priority:** P2 (UX issue)
6
- - **Component:** `src/app.py` - Gradio examples
7
- - **Resolution:** FIXED in commit `2ea01fd`
8
-
9
- ---
10
-
11
- ## Symptoms
12
-
13
- When clicking the third example in the Gradio UI:
14
- - **Example 1** (female libido): ✅ Fills chat box correctly
15
- - **Example 2** (ED alternatives): ✅ Fills chat box correctly
16
- - **Example 3** (HSDD testosterone): ❌ Does NOT fill chat box
17
-
18
- ### User Experience
19
- User clicks example → nothing happens → confusion
20
-
21
- ---
22
-
23
- ## Root Cause Hypothesis
24
-
25
- The third example contains parentheses and an abbreviation:
26
- ```
27
- "Testosterone therapy for HSDD (Hypoactive Sexual Desire Disorder)?"
28
- ```
29
-
30
- Possible causes:
31
- 1. **Parentheses** - Gradio may have parsing issues with `(...)` in example text
32
- 2. **Text length** - When expanded, this is the longest example
33
- 3. **Special characters** - The combination of abbreviation + parenthetical may confuse Gradio's example caching
34
-
35
- ---
36
-
37
- ## The Fix
38
-
39
- Simplify the example text - expand the abbreviation and remove parentheses:
40
-
41
- ```python
42
- # Before (broken)
43
- "Testosterone therapy for HSDD (Hypoactive Sexual Desire Disorder)?"
44
-
45
- # After (fixed)
46
- "Testosterone therapy for Hypoactive Sexual Desire Disorder?"
47
- ```
48
-
49
- This:
50
- 1. Removes problematic parentheses
51
- 2. Makes the text more readable (no cut-off abbreviation)
52
- 3. Spares users from needing to know what HSDD stands for
53
-
54
- ---
55
-
56
- ## Test Plan
57
-
58
- - [ ] Change example text in `src/app.py`
59
- - [ ] Deploy to HuggingFace Space
60
- - [ ] Verify all 3 examples fill chat box correctly
61
- - [ ] `make check` passes
62
-
63
- ---
64
-
65
- ## Related
66
-
67
- - Gradio ChatInterface example caching behavior
68
- - Similar to P0 example caching crash (but different manifestation)
 
docs/bugs/archive/P2_ROUND_COUNTER_SEMANTIC_MISMATCH.md DELETED
@@ -1,321 +0,0 @@
1
- # P2 Bug: Round Counter Semantic Mismatch
2
-
3
- **Status**: ✅ FIXED
4
- **Discovered**: 2025-12-05
5
- **Fixed**: 2025-12-05
6
- **Severity**: P2 (Display bug, confusing UX but not blocking)
7
- **Component**: `src/orchestrators/advanced.py`
8
- **Commit**: `40ca236c refactor(orchestrator): implement semantic progress tracking`
9
-
10
- ---
11
-
12
- ## Symptom
13
-
14
- Progress display shows impossible values like "Round 11/5":
15
-
16
- ```text
17
- ⏱️ **PROGRESS**: Round 11/5 (~0s remaining)
18
- ```
19
-
20
- This is confusing to users - how can we be on round 11 when max is 5?
21
-
22
- ---
23
-
24
- ## Root Cause Analysis
25
-
26
- ### The Semantic Mismatch
27
-
28
- Two different concepts are being conflated:
29
-
30
- | Concept | What It Means | Variable |
31
- |---------|---------------|----------|
32
- | **Workflow Round** | One orchestration cycle where manager delegates to agents | `self._max_rounds` (5) |
33
- | **Agent Completion** | One agent finishes its task | `state.iteration` (incremented on each `ExecutorCompletedEvent`) |
34
-
35
- ### The Bug
36
-
37
- ```python
38
- # Line 348: Increments on EVERY agent completion
39
- if isinstance(event, ExecutorCompletedEvent):
40
- state.iteration += 1
41
-
42
- # Line 467: Displays as if it's a workflow round
43
- message=f"Round {iteration}/{self._max_rounds} (~{est_display} remaining)"
44
- ```
45
-
46
- ### Why It Happens
47
-
48
- In a multi-agent workflow with 4 agents (searcher, hypothesizer, judge, reporter):
49
-
50
- - Each "round" involves the manager delegating to multiple agents
51
- - Each agent completion fires an `ExecutorCompletedEvent`
52
- - With 4+ agents, we see 4+ events per workflow round
53
-
54
- **Math**: 5 workflow rounds × 4 agents = 20+ agent completions, displayed as "Round 20/5"
55
-
56
- ---
57
-
58
- ## Evidence From Logs
59
-
60
- The session showed this progression:
61
-
62
- ```text
63
- Round 1/5 - First agent completed
64
- Round 2/5 - Second agent completed
65
- Round 3/5 - Third agent completed
66
- Round 4/5 - Fourth agent completed
67
- Round 5/5 - Fifth agent completed (still in workflow round 1!)
68
- Round 6/5 - Now exceeds max (workflow round 2 starting)
69
- ...
70
- Round 11/5 - Multiple workflow rounds have passed
71
- ```
72
-
73
- ---
74
-
75
- ## Impact
76
-
77
- 1. **User Confusion**: "Round 11/5" makes no sense
78
- 2. **Time Estimation Wrong**: `rounds_remaining = max(5 - 11, 0) = 0` → always shows "~0s remaining"
79
- 3. **No Actual Bug in Logic**: The workflow still runs correctly, just the display is wrong
80
-
81
- ---
82
-
83
- ## Proposed Fixes
84
-
85
- ### Option A: Rename to "Agent Step" (Quick Fix)
86
-
87
- Change the display to reflect what we're actually counting:
88
-
89
- ```python
90
- # Before
91
- message=f"Round {iteration}/{self._max_rounds} (~{est_display} remaining)"
92
-
93
- # After
94
- message=f"Agent step {iteration} (Round limit: {self._max_rounds})"
95
- ```
96
-
97
- **Pros**: Accurate, minimal code change
98
- **Cons**: Still doesn't track actual workflow rounds
99
-
100
- ### Option B: Track Actual Workflow Rounds (Proper Fix)
101
-
102
- Track workflow rounds separately from agent completions:
103
-
104
- ```python
105
- @dataclass
106
- class WorkflowState:
107
- iteration: int = 0 # Agent completions (for internal tracking)
108
- workflow_round: int = 0 # Actual orchestration rounds
109
- current_message_buffer: str = ""
110
- # ...
111
-
112
- # Increment workflow_round when manager delegates (different event type)
113
- # Display workflow_round in progress messages
114
- ```
115
-
116
- **Pros**: Semantically correct, accurate time estimates
117
- **Cons**: Requires understanding which event signals a new round
118
-
119
- ### Option C: Use Estimated Agent Count (Compromise)
120
-
121
- Estimate agents per round and display accordingly:
122
-
123
- ```python
124
- AGENTS_PER_ROUND = 4 # searcher, hypothesizer, judge, reporter
125
- estimated_round = (iteration // AGENTS_PER_ROUND) + 1
126
- message=f"Round ~{estimated_round}/{self._max_rounds}"
127
- ```
128
-
129
- **Pros**: Roughly accurate, no API research needed
130
- **Cons**: Estimation may be off if some agents are skipped
131
-
132
- ---
133
-
134
- ## Recommendation
135
-
136
- **Short-term**: Apply Option A (rename to "Agent step") - fixes the confusion immediately
137
-
138
- **Long-term**: Investigate Option B - determine which event signals a new workflow round in Microsoft Agent Framework
139
-
140
- ---
141
-
142
- ## Related Code
143
-
144
- ```python
145
- # src/orchestrators/advanced.py
146
-
147
- # Line 348: Where iteration is incremented
148
- if isinstance(event, ExecutorCompletedEvent):
149
- state.iteration += 1
150
-
151
- # Line 459-467: Where progress message is generated
152
- rounds_remaining = max(self._max_rounds - iteration, 0)
153
- est_seconds = rounds_remaining * 45
154
- progress_event = AgentEvent(
155
- type="progress",
156
- message=f"Round {iteration}/{self._max_rounds} (~{est_display} remaining)",
157
- iteration=iteration,
158
- )
159
- ```
160
-
161
- ---
162
-
163
- ## Test Case
164
-
165
- ```python
166
- def test_progress_display_never_exceeds_max_rounds():
167
- """Progress should show Round X/Y where X <= Y."""
168
- # Simulate 20 agent completions across 5 workflow rounds
169
- # Assert displayed round never exceeds max_rounds
170
- pass
171
- ```
172
-
173
- ---
174
-
175
- ## Additional Issues Found During Analysis
176
-
177
- ### Issue 2: Dead Code - Unused `_get_progress_message` Method
178
-
179
- ```python
180
- # Line 196-205: Method is defined but NEVER called
181
- def _get_progress_message(self, iteration: int) -> str:
182
- """Generate progress message with time estimation."""
183
- # ... logic duplicated in _handle_completion_event
184
- ```
185
-
186
- The same logic is duplicated inline in `_handle_completion_event` (lines 458-469).
187
-
188
- **Fix**: Either use the method or delete it.
189
-
190
- ### Issue 3: Hardcoded Constant
191
-
192
- ```python
193
- # Line 87: Class constant defined
194
- _EST_SECONDS_PER_ROUND: int = 45
195
-
196
- # Line 199: Uses constant (correct)
197
- est_seconds = rounds_remaining * self._EST_SECONDS_PER_ROUND
198
-
199
- # Line 460: Uses hardcoded 45 (inconsistent)
200
- est_seconds = rounds_remaining * 45
201
- ```
202
-
203
- **Fix**: Use `self._EST_SECONDS_PER_ROUND` consistently.
204
-
205
- ### Issue 4: Time Estimate Always Shows "~0s remaining"
206
-
207
- Since `iteration` quickly exceeds `max_rounds`:
208
-
209
- ```python
210
- rounds_remaining = max(self._max_rounds - iteration, 0)
211
- # When iteration=11, max_rounds=5: rounds_remaining = max(5-11, 0) = 0
212
- # est_seconds = 0 * 45 = 0
213
- # Display: "~0s remaining"
214
- ```
215
-
216
- The time estimate becomes useless after the first few agent completions.
217
-
218
- ---
219
-
220
- ## Complete Fix Recommendation
221
-
222
- 1. **Rename display** from "Round X/5" to "Agent step X"
223
- 2. **Delete dead code** - remove unused `_get_progress_message` method
224
- 3. **Use constant** - replace hardcoded `45` with `self._EST_SECONDS_PER_ROUND`
225
- 4. **Fix time estimate** - base it on agent steps, not workflow rounds
226
-
227
- ---
228
-
229
- ## Senior Review Findings (2025-12-05)
230
-
231
- **Reviewer**: External Gemini CLI Agent
232
- **Status**: CONFIRMED - Analysis accurate and sufficient
233
-
234
- ### Additional Nuances Identified
235
-
236
- 1. **Manager Agent Also Fires Events**: The Manager itself is an agent. If `ExecutorCompletedEvent` fires for Manager's turn completion PLUS sub-agents' completions, the count accelerates 2-3x faster per logical round. This explains why we saw 11 events for ~2-3 workflow rounds.
237
-
238
- 2. **Time Estimation Doubly Flawed**:
239
- - Not just bottoming out at 0
240
- - `_EST_SECONDS_PER_ROUND` (45s) is calibrated for a FULL workflow round, not a single agent step
241
- - If we counted agent steps correctly: 10 steps × 45s = 450s (way overestimated)
242
- - A full round of 4 agents might only take 60s total
243
-
244
- 3. **API Discovery - Can Track Actual Rounds**:
245
-
246
- ```python
247
- # These constants exist in agent_framework:
248
- ORCH_MSG_KIND_INSTRUCTION = 'instruction'
249
- ORCH_MSG_KIND_USER_TASK = 'user_task'
250
- ORCH_MSG_KIND_TASK_LEDGER = 'task_ledger'
251
- ORCH_MSG_KIND_NOTICE = 'notice'
252
- ```
253
-
254
- Counting `user_task` events from `MagenticOrchestratorMessageEvent` would align iteration with `max_rounds` 1:1, since this signals "Manager is beginning a new evaluation cycle."
255
-
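- A rough sketch of that counting approach (the `kind` attribute is an assumption; verify it against the installed framework version before relying on it):
-
- ```python
- if isinstance(event, MagenticOrchestratorMessageEvent):
-     if getattr(event, "kind", None) == ORCH_MSG_KIND_USER_TASK:
-         state.workflow_round += 1  # Manager starting a new evaluation cycle
- ```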
256
- ### Reviewer Recommendations
257
-
258
- 1. **Option A (Rename)**: APPROVED - Safest, most honest fix
259
- 2. **Option B (Track Workflow Rounds)**: DEFER - Requires verifying framework behavior across versions, risks brittleness
260
- 3. **Remove Denominator**: Display `Agent Step {iteration}` without `/5` to avoid confusion
261
- 4. **Delete Dead Code**: Confirmed `_get_progress_message` is never called
262
- 5. **Fix Constants**: Use `self._EST_SECONDS_PER_ROUND` consistently
263
-
264
- ### Review Status: ✅ PASSED - Ready for Implementation
265
-
266
- ---
267
-
268
- ## Resolution (2025-12-05)
269
-
270
- **Implemented**: Domain-driven semantic progress tracking
271
-
272
- ### What Was Done
273
-
274
- 1. **Deleted Dead Code**:
275
- - Removed unused `_get_progress_message` method
276
- - Removed unused `_EST_SECONDS_PER_ROUND` constant
277
-
278
- 2. **Added Semantic Agent Mapping** (`_get_agent_semantic_name`):
279
-
280
- ```python
281
- def _get_agent_semantic_name(self, agent_id: str) -> str:
282
- """Map internal agent ID to user-facing semantic name."""
283
- name = agent_id.lower()
284
- if SEARCHER_AGENT_ID in name:
285
- return "SearchAgent"
286
- if JUDGE_AGENT_ID in name:
287
- return "JudgeAgent"
288
- if HYPOTHESIZER_AGENT_ID in name:
289
- return "HypothesisAgent"
290
- if REPORTER_AGENT_ID in name:
291
- return "ReportAgent"
292
- return "ManagerAgent"
293
- ```
294
-
295
- 3. **Changed Progress Display**:
296
- - Before: `"Round {iteration}/{self._max_rounds} (~{est_display} remaining)"`
297
- - After: `"Step {iteration}: {semantic_name} task completed"`
298
-
299
- 4. **Changed Initial Thinking Message**:
300
- - Before: `"Multi-agent reasoning in progress (5 rounds max)... Estimated time: 3-5 minutes."`
301
- - After: `"Multi-agent reasoning in progress (Limit: 5 Manager rounds)... Allocating time for deep research..."`
302
-
303
- 5. **Updated Tests**: Changed test mocks to use domain-specific agent IDs (`searcher`, `judge`) instead of arbitrary strings.
304
-
305
- ### Result
306
-
307
- - Before: `⏱️ **PROGRESS**: Round 11/5 (~0s remaining)` (confusing, broken math)
308
- - After: `⏱️ **PROGRESS**: Step 11: ReportAgent task completed` (accurate, professional)
309
-
310
- ### Design Decision
311
-
312
- Rather than patching the counter display or trying to track "actual workflow rounds" (which requires deep framework integration), we chose **honest reporting**: Show exactly what happened (which agent completed) without making false promises about progress percentages or time estimates.
313
-
314
- This follows the Clean Code principle: "Don't lie to the user."
315
-
316
- ---
317
-
318
- ## References
319
-
320
- - SPEC-18: Agent Framework Core Upgrade (where ExecutorCompletedEvent was introduced)
321
- - Microsoft Agent Framework documentation on workflow rounds vs agent executions
 
docs/bugs/archive/P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md DELETED
@@ -1,23 +0,0 @@
1
- # P3: Ephemeral Memory Architecture (No Persistence)
2
-
3
- **Status:** OPEN
4
- **Priority:** P3 (Feature/Architecture Gap)
5
- **Found By:** Codebase Investigation
6
- **Date:** 2025-11-29
7
-
8
- ## Description
9
- The current `EmbeddingService` (`src/services/embeddings.py`) initializes an **in-memory** ChromaDB client (`chromadb.Client()`) and creates a random UUID-based collection for every new session.
10
-
11
- While `src/utils/config.py` defines a `chroma_db_path` for persistence, it is currently **ignored**.
12
-
13
- ## Impact
14
- 1. **No Long-Term Learning:** The agent cannot "remember" research from previous runs. Every time you restart the app, it starts from zero.
15
- 2. **Redundant Costs:** If a user researches "Diabetes" twice, the agent re-searches and re-embeds the same papers, wasting tokens and compute time.
16
-
17
- ## Technical Details
18
- - **Current:** `self._client = chromadb.Client()` (In-Memory)
19
- - **Required:** `self._client = chromadb.PersistentClient(path=settings.chroma_db_path)`
20
-
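- A slightly fuller sketch (the collection name must also become stable, since a per-session UUID defeats persistence; `get_or_create_collection` is ChromaDB's API for idempotent reuse):
-
- ```python
- import chromadb
-
- self._client = chromadb.PersistentClient(path=settings.chroma_db_path)
- # A stable name, not a per-session UUID, is what actually enables reuse
- self._collection = self._client.get_or_create_collection(name="evidence")
- ```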
21
- ## Recommendation
22
- For a "Hackathon Demo," this is **low priority** (ephemeral is fine).
23
- For a "Real Product," this is **critical** (users expect a library of research).
 
docs/bugs/archive/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md DELETED
@@ -1,150 +0,0 @@
1
- # P3: Missing Structured Cognitive Memory (Shared Blackboard)
2
-
3
- **Status:** OPEN
4
- **Priority:** P3 (Architecture/Enhancement)
5
- **Found By:** Deep Codebase Investigation
6
- **Date:** 2025-11-29
7
- **Spec:** [SPEC_07_LANGGRAPH_MEMORY_ARCH.md](../specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md)
8
-
9
- ## Executive Summary
10
-
11
- DeepBoner's `AdvancedOrchestrator` has **Data Memory** (vector store for papers) but lacks **Cognitive Memory** (structured state for hypotheses, conflicts, and research plan). This causes "context drift" on long runs and prevents intelligent conflict resolution.
12
-
13
- ---
14
-
15
- ## Current Architecture (What We Have)
16
-
17
- ### 1. MagenticState (`src/agents/state.py:18-91`)
18
- ```python
19
- class MagenticState(BaseModel):
20
- evidence: list[Evidence] = Field(default_factory=list)
21
- embedding_service: Any = None # ChromaDB connection
22
-
23
- def add_evidence(self, new_evidence: list[Evidence]) -> int: ...
24
- async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]: ...
25
- ```
26
- - **What it does:** Stores Evidence objects, URL-based deduplication, semantic search via embeddings.
27
- - **What it DOESN'T do:** Track hypotheses, conflicts, or research plan status.
28
-
29
- ### 2. EmbeddingService (`src/services/embeddings.py:29-180`)
30
- ```python
31
- self._client = chromadb.Client() # In-memory (Line 44)
32
- self._collection = self._client.create_collection(
33
- name=f"evidence_{uuid.uuid4().hex}", # Random name per session (Line 45-47)
34
- ...
35
- )
36
- ```
37
- - **What it does:** In-session semantic search/deduplication.
38
- - **Limitation:** New collection per session, no persistence despite `settings.chroma_db_path` existing.
39
-
40
- ### 3. AdvancedOrchestrator (`src/orchestrators/advanced.py:51-371`)
41
- - Uses Microsoft's `agent-framework-core` (MagenticBuilder)
42
- - State is implicit in chat history passed between agents
43
- - Manager decides next step by reading conversation, not structured state
44
-
45
- ---
46
-
47
- ## The Problem
48
-
49
- | Issue | Impact | Evidence |
50
- |-------|--------|----------|
51
- | **No Hypothesis Tracking** | Can't update hypothesis confidence systematically | `MagenticState` has no `hypotheses` field |
52
- | **No Conflict Detection** | Contradictory sources are ignored | No `conflicts` list to flag Source A vs Source B |
53
- | **Context Drift** | Manager forgets original query after 50+ messages | State lives only in chat, not structured object |
54
- | **No Plan State** | Can't pause/resume research | No `research_plan` or `next_step` tracking |
55
-
56
- ---
57
-
58
- ## The Solution: LangGraph State Graph (Nov 2025 Best Practice)
59
-
60
- ### Why LangGraph?
61
-
62
- Based on [comprehensive analysis](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025):
63
-
64
- 1. **Explicit State Schema:** TypedDict/Pydantic model that ALL agents read/write
65
- 2. **State Reducers:** `Annotated[List[X], operator.add]` for appending (not overwriting)
66
- 3. **HuggingFace Compatible:** Works with `langchain-huggingface` (Llama 3.1)
67
- 4. **Production-Ready:** MongoDB checkpointer for persistence, SQLite for dev
68
-
69
- ### Target Architecture
70
-
71
- ```python
72
- # src/agents/graph/state.py (IMPLEMENTED)
73
- from typing import Annotated, TypedDict, Literal
74
- import operator
75
- from pydantic import BaseModel, Field
76
- from langchain_core.messages import BaseMessage
77
-
78
- class Hypothesis(BaseModel):
79
- id: str
80
- statement: str
81
- status: Literal["proposed", "validating", "confirmed", "refuted"]
82
- confidence: float
83
- supporting_evidence_ids: list[str]
84
- contradicting_evidence_ids: list[str]
85
-
86
- class Conflict(BaseModel):
87
- id: str
88
- description: str
89
- source_a_id: str
90
- source_b_id: str
91
- status: Literal["open", "resolved"]
92
- resolution: str | None
93
-
94
- class ResearchState(TypedDict):
95
- query: str # Immutable original question
96
- hypotheses: Annotated[list[Hypothesis], operator.add]
97
- conflicts: Annotated[list[Conflict], operator.add]
98
- evidence_ids: Annotated[list[str], operator.add] # Links to ChromaDB
99
- messages: Annotated[list[BaseMessage], operator.add]
100
- next_step: Literal["search", "judge", "resolve", "synthesize", "finish"]
101
- iteration_count: int
102
- ```
103
-
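- To make the state concrete, wiring it into a graph would look roughly like this (node functions are placeholders; `StateGraph` is LangGraph's standard builder API):
-
- ```python
- from langgraph.graph import END, StateGraph
-
- graph = StateGraph(ResearchState)
- graph.add_node("search", search_node)  # placeholder node functions
- graph.add_node("judge", judge_node)
- graph.add_node("synthesize", synthesize_node)
- graph.set_entry_point("search")
- graph.add_edge("search", "judge")
- # Route on the structured next_step field instead of re-reading chat history
- graph.add_conditional_edges(
-     "judge",
-     lambda state: state["next_step"],
-     {"search": "search", "synthesize": "synthesize", "finish": END},
- )
- graph.add_edge("synthesize", END)
- app = graph.compile()
- ```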
104
- ---
105
-
106
- ## Implementation Dependencies
107
-
108
- | Package | Purpose | Install |
109
- |---------|---------|---------|
110
- | `langgraph>=0.2` | State graph framework | `uv add langgraph` |
111
- | `langchain>=0.3` | Base abstractions | `uv add langchain` |
112
- | `langchain-huggingface` | Llama 3.1 integration | `uv add langchain-huggingface` |
113
- | `langgraph-checkpoint-sqlite` | Dev persistence | `uv add langgraph-checkpoint-sqlite` |
114
-
115
- **Note:** MongoDB checkpointer (`langgraph-checkpoint-mongodb`) recommended for production per [MongoDB blog](https://www.mongodb.com/company/blog/product-release-announcements/powering-long-term-memory-for-agents-langgraph).
116
-
117
- ---
118
-
119
- ## Alternative Considered: Mem0
120
-
121
- [Mem0](https://mem0.ai/) specializes in long-term memory and [outperformed OpenAI by 26%](https://guptadeepak.com/the-ai-memory-wars-why-one-system-crushed-the-competition-and-its-not-openai/) in benchmarks. However:
122
-
123
- - **Mem0 excels at:** User personalization, cross-session memory
124
- - **LangGraph excels at:** Workflow orchestration, state machines
125
- - **Verdict:** Use LangGraph for orchestration + optionally add Mem0 for user-level memory later
126
-
127
- ---
128
-
129
- ## Quick Win (Separate from LangGraph)
130
-
131
- Enable ChromaDB persistence in `src/services/embeddings.py:44`:
132
- ```python
133
- # FROM:
134
- self._client = chromadb.Client() # In-memory
135
-
136
- # TO:
137
- self._client = chromadb.PersistentClient(path=settings.chroma_db_path)
138
- ```
139
-
140
- This alone gives cross-session evidence persistence (P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY fix).
141
-
142
- ---
143
-
144
- ## References
145
-
146
- - [LangGraph Multi-Agent Orchestration Guide 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
147
- - [Long-Term Agentic Memory with LangGraph](https://medium.com/@anil.jain.baba/long-term-agentic-memory-with-langgraph-824050b09852)
148
- - [LangGraph vs LangChain 2025](https://kanerika.com/blogs/langchain-vs-langgraph/)
149
- - [MongoDB + LangGraph Checkpointers](https://www.mongodb.com/company/blog/product-release-announcements/powering-long-term-memory-for-agents-langgraph)
150
- - [Mem0 + LangGraph Integration](https://datacouch.io/blog/build-smarter-ai-agents-mem0-langgraph-guide/)
 
docs/bugs/archive/P3_MAGENTIC_NO_TERMINATION_EVENT.md DELETED
@@ -1,177 +0,0 @@
1
- # P3 Bug Report: Advanced Mode Missing Termination Guarantee
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Priority:** P3 (Edge case, but confusing UX)
6
- - **Component:** `src/orchestrator_magentic.py`
7
- - **Resolution:** Fixed (Guarantee termination event)
8
-
9
- ---
10
-
11
- ## Symptoms
12
-
13
- In **Advanced (Magentic) mode** with OpenAI API key:
14
-
15
- 1. Workflow runs for many iterations (up to 10 rounds)
16
- 2. Agents search, judge, hypothesize repeatedly
17
- 3. Eventually... **nothing happens**
18
- - No "complete" event
19
- - No error message
20
- - UI just stops updating
21
-
22
- **User perception:** "Did it finish? Did it crash? What happened?"
23
-
24
- ### Observed Behavior
25
-
26
- When workflow hits `max_round_count=10`:
27
- - `workflow.run_stream(task)` iterator ends
28
- - NO `MagenticFinalResultEvent` is emitted by agent-framework
29
- - Our code yields nothing after the loop
30
- - User is left hanging
31
-
32
- ---
33
-
34
- ## Root Cause Analysis
35
-
36
- ### Code Path (`src/orchestrator_magentic.py:170-186`)
37
-
38
- ```python
39
- iteration = 0
40
- try:
41
- async for event in workflow.run_stream(task):
42
- agent_event = self._process_event(event, iteration)
43
- if agent_event:
44
- if isinstance(event, MagenticAgentMessageEvent):
45
- iteration += 1
46
- yield agent_event
47
- # BUG: NO FALLBACK HERE!
48
- # If loop ends without FinalResultEvent, user sees nothing
49
-
50
- except Exception as e:
51
- logger.error("Magentic workflow failed", error=str(e))
52
- yield AgentEvent(
53
- type="error",
54
- message=f"Workflow error: {e!s}",
55
- iteration=iteration,
56
- )
57
- # BUG: NO FINALLY BLOCK TO GUARANTEE TERMINATION EVENT
58
- ```
59
-
60
- ### Workflow Configuration (`src/orchestrator_magentic.py:110-116`)
61
-
62
- ```python
63
- .with_standard_manager(
64
- chat_client=manager_client,
65
- max_round_count=self._max_rounds, # 10 - can hit this limit
66
- max_stall_count=3, # If agents repeat 3x
67
- max_reset_count=2, # Workflow reset limit
68
- )
69
- ```
70
-
71
- ### Failure Modes
72
-
73
- | Scenario | What Happens | User Sees |
74
- |----------|--------------|-----------|
75
- | `MagenticFinalResultEvent` emitted | `_process_event` yields "complete" | Final report |
76
- | Max rounds (10) reached, no final event | Loop ends silently | **Nothing** |
77
- | `max_stall_count` triggered | Workflow ends | **Nothing** |
78
- | `max_reset_count` triggered | Workflow ends | **Nothing** |
79
- | OpenAI API error | Exception caught | Error message |
80
-
81
- ---
82
-
83
- ## The Fix
84
-
85
- Add guaranteed termination event after the loop:
86
-
87
- ```python
88
- iteration = 0
89
- final_event_received = False
90
-
91
- try:
92
- async for event in workflow.run_stream(task):
93
- agent_event = self._process_event(event, iteration)
94
- if agent_event:
95
- if isinstance(event, MagenticAgentMessageEvent):
96
- iteration += 1
97
- if agent_event.type == "complete":
98
- final_event_received = True
99
- yield agent_event
100
-
101
- except Exception as e:
102
- logger.error("Magentic workflow failed", error=str(e))
103
- yield AgentEvent(
104
- type="error",
105
- message=f"Workflow error: {e!s}",
106
- iteration=iteration,
107
- )
108
- final_event_received = True # Error is a form of termination
109
-
110
- finally:
111
- # GUARANTEE: Always emit termination event
112
- if not final_event_received:
113
- logger.warning(
114
- "Workflow ended without final event",
115
- iterations=iteration,
116
- )
117
- yield AgentEvent(
118
- type="complete",
119
- message=(
120
- f"Research completed after {iteration} agent rounds. "
121
- "Max iterations reached - results may be partial. "
122
- "Try a more specific query for better results."
123
- ),
124
- data={"iterations": iteration, "reason": "max_rounds_reached"},
125
- iteration=iteration,
126
- )
127
- ```
128
-
129
- ---
130
-
131
- ## Alternative: Increase Max Rounds
132
-
133
- The default `max_rounds=10` might be too low for complex queries.
134
-
135
- In `src/orchestrator_factory.py:52-53`:
136
- ```python
137
- return orchestrator_cls(
138
- max_rounds=config.max_iterations if config else 10, # Could increase to 15-20
139
- api_key=api_key,
140
- )
141
- ```
142
-
143
- **Trade-off:** More rounds = more API cost, but better chance of complete results.
144
-
145
- ---
146
-
147
- ## Test Plan
148
-
149
- - [ ] Add fallback yield after async for loop
150
- - [ ] Add `final_event_received` flag tracking
151
- - [ ] Log warning when fallback is used
152
- - [ ] Test with `max_rounds=2` to force hitting limit
153
- - [ ] Verify user always sees termination event
154
- - [ ] `make check` passes
155
-
156
- ---
157
-
158
- ## Related Files
159
-
160
- - `src/orchestrator_magentic.py` - Main fix location
161
- - `src/orchestrator_factory.py` - Max rounds configuration
162
- - `src/utils/models.py` - AgentEvent types
163
- - `docs/bugs/P2_MAGENTIC_THINKING_STATE.md` - Related UX issue (implemented)
164
-
165
- ---
166
-
167
- ## Priority Justification
168
-
169
- **P3** because:
170
- - Advanced mode is working for most queries
171
- - Only hits edge case when max rounds reached without synthesis
172
- - User CAN retry with different query
173
- - Not blocking hackathon demo (free tier Simple mode works)
174
-
175
- Would be P2 if:
176
- - This happened frequently
177
- - No workaround existed
 
docs/bugs/archive/P3_MODAL_INTEGRATION_REMOVAL.md DELETED
@@ -1,78 +0,0 @@
1
- # P3 Tech Debt: Modal Integration Removal
2
-
3
- **Date**: 2025-12-04
4
- **Status**: DONE
5
- **Severity**: P3 (Tech Debt - Not blocking functionality)
6
- **Component**: Multiple files
7
-
8
- ---
9
-
10
- ## Executive Summary
11
-
12
- Modal (a cloud function execution platform) is wired throughout the codebase even though the project decided against using it. The leftover integration creates confusion and dead code paths that should be cleaned up when time permits.
13
-
14
- ---
15
-
16
- ## Affected Files
17
-
18
- The following files contain Modal references:
19
-
20
- | File | Usage |
21
- |------|-------|
22
- | `src/utils/config.py` | `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET` settings |
23
- | `src/utils/service_loader.py` | Modal service initialization |
24
- | `src/services/llamaindex_rag.py` | Modal integration for RAG |
25
- | `src/agents/code_executor_agent.py` | Modal sandbox execution |
26
- | `src/utils/exceptions.py` | Modal-related exceptions |
27
- | `src/tools/code_execution.py` | Modal code execution tool |
28
- | `src/services/statistical_analyzer.py` | Modal statistical analysis |
29
- | `src/mcp_tools.py` | Modal MCP tool wrappers |
30
- | `src/agents/analysis_agent.py` | Modal analysis agent |
31
-
32
- ---
33
-
34
- ## Context
35
-
36
- Modal was originally integrated for:
37
- 1. **Code Execution Sandbox**: Running untrusted code in isolated containers
38
- 2. **Statistical Analysis**: Offloading heavy statistical computations
39
- 3. **LlamaIndex RAG**: Premium embeddings with persistent storage
40
-
41
- However, the project decided against Modal because:
42
- - It added infrastructure complexity
43
- - The Free Tier doesn't need cloud functions
44
- - The Paid Tier uses OpenAI directly
45
-
46
- ---
47
-
48
- ## Recommended Fix
49
-
50
- 1. Remove Modal-related code from all affected files
51
- 2. Remove `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` from config
52
- 3. Remove Modal from dependencies in `pyproject.toml`
53
- 4. Update any documentation referencing Modal
54
-
55
- ---
56
-
57
- ## Impact If Not Fixed
58
-
59
- - Confusion for new contributors
60
- - Dead code in production
61
- - Unnecessary dependencies
62
- - Config settings that do nothing
63
-
64
- ---
65
-
66
- ## Test Plan
67
-
68
- 1. Remove Modal code
69
- 2. Run `make check` to ensure no breakage
70
- 3. Verify Free Tier and Paid Tier still work
71
- 4. Search codebase for any remaining Modal references
72
-
73
- ---
74
-
75
- ## Related
76
-
77
- - `P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md` - Similar tech debt for Anthropic
78
- - ARCHITECTURE.md - Current architecture excludes Modal
 
docs/bugs/archive/P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md DELETED
@@ -1,160 +0,0 @@
- # P3 Tech Debt: Remove Anthropic Partial Wiring
-
- **Date**: 2025-12-03
- **Status**: DONE
- **Severity**: P3 (Tech Debt / Simplification)
- **Component**: Architecture / Provider Integration
-
- ---
-
- ## Summary
-
- Remove all Anthropic-related code, configuration, and references from the codebase. Anthropic is partially wired but **not fully threaded through the architecture**, creating confusion and half-implemented code paths.
-
- ---
-
- ## Rationale
-
- ### 1. Anthropic Does NOT Provide Embeddings
-
- Our architecture requires embeddings for:
- - RAG (LlamaIndex/ChromaDB)
- - Evidence deduplication
- - Semantic search
-
- Anthropic only provides chat completion, not embeddings. This means even with a working Anthropic chat client, users would need a **second provider** for embeddings, breaking the unified experience.
-
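- A minimal illustration of the split this forces, assuming the stock `anthropic` and `openai` SDKs (this is not project code):
-
- ```python
- # Hypothetical: even with Anthropic handling chat, embeddings for RAG
- # would still require a second provider's SDK and API key.
- from anthropic import Anthropic  # chat completion only -- no embeddings
- from openai import OpenAI  # pulled in solely for embeddings
-
- chat = Anthropic().messages.create(
-     model="claude-sonnet-4-5-20250929",
-     max_tokens=256,
-     messages=[{"role": "user", "content": "Summarize the evidence."}],
- )
- embedding = OpenAI().embeddings.create(
-     model="text-embedding-3-small",
-     input="evidence passage to index",
- )
- ```
-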
- ### 2. Partial Implementation Creates Confusion
-
- Current state:
- - `settings.anthropic_api_key` exists ✅
- - `settings.has_anthropic_key` property exists ✅
- - `settings.anthropic_model` configured ✅
- - `AnthropicChatClient` for agent_framework **DOES NOT EXIST** ❌
- - Code raises `NotImplementedError` when Anthropic detected ❌
-
- This half-state causes:
- - User confusion ("Why doesn't my Anthropic key work?")
- - Developer confusion ("Is Anthropic supported or not?")
- - Dead code paths that need maintenance
-
- ### 3. Unified Architecture Principle
-
- **Principle**: Only support providers that work **end-to-end** through the entire stack (see the sketch after the table):
-
- ```
- Provider Requirements:
- ├── Chat Completion (for agents) ✅ Required
- ├── Function/Tool Calling ✅ Required
- ├── Embeddings (for RAG) ✅ Required
- └── Streaming ✅ Required
- ```
-
- | Provider | Chat | Tools | Embeddings | Streaming | Status |
- |----------|------|-------|------------|-----------|--------|
- | OpenAI | ✅ | ✅ | ✅ | ✅ | **KEEP** |
- | HuggingFace | ✅ | ✅ | ✅ (local) | ✅ | **KEEP** |
- | Gemini | ✅ | ✅ | ✅ | ✅ | Future (Phase 4) |
- | Anthropic | ✅ | ✅ | ❌ | ✅ | **REMOVE** |
-
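- The keep/remove decision can be read straight off the table. A toy sketch of the rule (illustrative only; the project does not ship this matrix):
-
- ```python
- # Hypothetical capability matrix mirroring the table above; a provider
- # is kept only when it covers every required capability end-to-end.
- REQUIRED = {"chat", "tools", "embeddings", "streaming"}
- PROVIDERS = {
-     "openai": {"chat", "tools", "embeddings", "streaming"},
-     "huggingface": {"chat", "tools", "embeddings", "streaming"},  # embeddings run locally
-     "anthropic": {"chat", "tools", "streaming"},  # no embeddings API
- }
-
-
- def keep(provider: str) -> bool:
-     return REQUIRED <= PROVIDERS[provider]  # set-subset test
- ```
-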
- ---
-
- ## Files to Clean Up
-
- ### Configuration
- - [ ] `src/utils/config.py` - Remove `anthropic_api_key`, `anthropic_model`, `has_anthropic_key`
-
- ### Client Factory
- - [ ] `src/clients/factory.py` - Remove Anthropic detection and `NotImplementedError`
-
- ### Legacy Code (pydantic-ai based)
- - [ ] `src/utils/llm_factory.py` - Remove `AnthropicModel`, `AnthropicProvider` imports and handling
- - [ ] `src/agent_factory/judges.py` - Remove Anthropic model selection
-
- ### App/UI
- - [ ] `src/app.py` - Remove `has_anthropic_key` checks and "Anthropic from env" backend info
-
- ### Documentation
- - [ ] `CLAUDE.md` - Update LLM provider list
- - [ ] `AGENTS.md` - Update LLM provider list
- - [ ] `GEMINI.md` - Update LLM provider list
-
- ### Tests
- - [ ] `tests/unit/clients/test_chat_client_factory.py` - Remove Anthropic test cases
- - [ ] `tests/unit/utils/test_config.py` - Remove Anthropic config tests
-
- ---
-
- ## Code Snippets to Remove
-
- ### `src/utils/config.py`
- ```python
- # REMOVE these lines:
- anthropic_api_key: str | None = Field(default=None, description="Anthropic API key")
- anthropic_model: str = Field(
-     default="claude-sonnet-4-5-20250929", description="Anthropic model"
- )
-
- @property
- def has_anthropic_key(self) -> bool:
-     """Check if Anthropic API key is available."""
-     return bool(self.anthropic_api_key)
- ```
-
- ### `src/clients/factory.py`
- ```python
- # REMOVE these lines:
- if api_key.startswith("sk-ant-"):
-     normalized = "anthropic"
-
- if normalized == "anthropic":
-     raise NotImplementedError(
-         "Anthropic client not yet implemented. "
-         "Use OpenAI key (sk-...) or leave empty for free HuggingFace tier."
-     )
- ```
-
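- For contrast, a hedged sketch of the shape the detection can take once the Anthropic branch is gone. This is an assumption about the post-removal logic, not the actual `factory.py` contents:
-
- ```python
- # Hypothetical post-removal shape: OpenAI keys route to the paid tier,
- # anything else (including ignored sk-ant- keys) falls through to the
- # free HuggingFace tier.
- def detect_provider(api_key: str | None) -> str:
-     if api_key and api_key.startswith("sk-") and not api_key.startswith("sk-ant-"):
-         return "openai"
-     return "huggingface"
- ```
-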
- ### `src/app.py`
- ```python
- # REMOVE these lines:
- elif settings.has_anthropic_key:
-     backend_info = "Paid API (Anthropic from env)"
-
- has_anthropic = settings.has_anthropic_key
- has_paid_key = has_openai or has_anthropic or bool(user_api_key)
- # Change to:
- has_paid_key = has_openai or bool(user_api_key)
- ```
-
- ---
-
- ## Migration Notes
-
- ### For Users with Anthropic Keys
-
- If users have `ANTHROPIC_API_KEY` set in their environment:
- 1. It will be **silently ignored** (not an error; see the sketch below)
- 2. The system falls through to the HuggingFace free tier
- 3. Users should use `OPENAI_API_KEY` instead for the paid tier
-
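- A hedged sketch of a startup notice that would make the silent fallthrough visible to users. This check does not exist in the codebase; it is an illustrative assumption:
-
- ```python
- # Hypothetical startup check -- warn instead of silently ignoring the key.
- import os
-
- if os.getenv("ANTHROPIC_API_KEY"):
-     print(
-         "Note: ANTHROPIC_API_KEY is ignored. Set OPENAI_API_KEY for the "
-         "paid tier, or unset it to use the free HuggingFace tier."
-     )
- ```
-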
- ### Future Consideration
-
- If Anthropic adds an embeddings API in the future, we can re-add support. Until then, partial support creates more confusion than value.
-
- ---
-
- ## Definition of Done
-
- - [ ] All Anthropic references removed from `src/`
- - [ ] All Anthropic tests removed or updated
- - [ ] Documentation updated to reflect supported providers: OpenAI, HuggingFace, (future: Gemini)
- - [ ] `make check` passes (lint, typecheck, tests)
- - [ ] PR reviewed and merged
-
- ---
-
- ## Related Documents
-
- - `P2_7B_MODEL_GARBAGE_OUTPUT.md` - Current free tier model quality issues
- - `HF_FREE_TIER_ANALYSIS.md` - HuggingFace provider routing analysis
- - `CLAUDE.md` - Agent context with provider documentation