VibecoderMcSwaggins committed on
Commit 4878d51 · 1 Parent(s): d4d872a

Update ACTIVE_BUGS.md and archive completed bug documentation


- Consolidate active bug documentation, focusing on the P3 Progress Bar Positioning issue.
- Archive resolved documentation for various P0 and P2 bugs, including the ExecutorCompletedEvent UI noise and Round Counter Semantic Mismatch.
- Ensure ACTIVE_BUGS.md reflects the current status of ongoing issues and directs users to archived documentation for completed bugs.

Files changed (34)
  1. docs/bugs/ACTIVE_BUGS.md +9 -65
  2. docs/bugs/archive/AUDIT_FINDINGS_2025_11_30.md +0 -70
  3. docs/bugs/archive/GRADIO_EXAMPLE_VS_CHAT_ARROW_ANALYSIS.md +0 -147
  4. docs/bugs/archive/P0_ADVANCED_MODE_TIMEOUT_NO_SYNTHESIS.md +0 -307
  5. docs/bugs/archive/P0_AIFUNCTION_NOT_JSON_SERIALIZABLE.md +0 -225
  6. docs/bugs/archive/P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md +0 -173
  7. docs/bugs/archive/P0_MCP_TOOLUSECONTENT_MISSING.md +0 -88
  8. docs/bugs/archive/P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md +0 -144
  9. docs/bugs/archive/P0_REPR_BUG_ROOT_CAUSE_ANALYSIS.md +0 -99
  10. docs/bugs/archive/P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS.md +0 -59
  11. docs/bugs/archive/P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md +0 -254
  12. docs/bugs/archive/P0_SYNTHESIS_PROVIDER_MISMATCH.md +0 -273
  13. docs/bugs/archive/P1_ADVANCED_MODE_UNINTERPRETABLE_CHAIN_OF_THOUGHT.md +0 -184
  14. docs/bugs/archive/P1_FREE_TIER_TOOL_EXECUTION_FAILURE.md +0 -319
  15. docs/bugs/archive/P1_GRADIO_EXAMPLE_CLICK_AUTO_SUBMIT.md +0 -273
  16. docs/bugs/archive/P1_HUGGINGFACE_NOVITA_500_ERROR.md +0 -133
  17. docs/bugs/archive/P1_HUGGINGFACE_ROUTER_401_HYPERBOLIC.md +0 -62
  18. docs/bugs/archive/P1_NARRATIVE_SYNTHESIS_FALLBACK.md +0 -185
  19. docs/bugs/archive/P1_NO_SYNTHESIS_FREE_TIER.md +0 -165
  20. docs/bugs/archive/P1_SIMPLE_MODE_REMOVED_BREAKS_FREE_TIER_UX.md +0 -61
  21. docs/bugs/archive/P1_SYNTHESIS_BROKEN_KEY_FALLBACK.md +0 -163
  22. docs/bugs/archive/P2_7B_MODEL_GARBAGE_OUTPUT.md +0 -266
  23. docs/bugs/archive/P2_ADVANCED_MODE_COLD_START_NO_FEEDBACK.md +0 -255
  24. docs/bugs/archive/P2_ARCHITECTURAL_BYOK_GAPS.md +0 -100
  25. docs/bugs/archive/P2_DUPLICATE_REPORT_CONTENT.md +0 -151
  26. docs/bugs/archive/P2_EXECUTOR_COMPLETED_EVENT_UI_NOISE.md +0 -351
  27. docs/bugs/archive/P2_FIRST_TURN_TIMEOUT.md +0 -160
  28. docs/bugs/archive/P2_GRADIO_EXAMPLE_NOT_FILLING.md +0 -68
  29. docs/bugs/archive/P2_ROUND_COUNTER_SEMANTIC_MISMATCH.md +0 -321
  30. docs/bugs/archive/P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md +0 -23
  31. docs/bugs/archive/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md +0 -150
  32. docs/bugs/archive/P3_MAGENTIC_NO_TERMINATION_EVENT.md +0 -177
  33. docs/bugs/archive/P3_MODAL_INTEGRATION_REMOVAL.md +0 -78
  34. docs/bugs/archive/P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md +0 -160
docs/bugs/ACTIVE_BUGS.md CHANGED

@@ -1,83 +1,27 @@
  # Active Bugs

  > Last updated: 2025-12-06
- >
- > **Note:** Completed bug docs archived to `docs/bugs/archive/`
- > **See also:** [ARCHITECTURE.md](../ARCHITECTURE.md) for unified architecture plan

  ---

- ## Currently Active Bugs
-
- ### P3 - Progress Bar Positioning in ChatInterface
-
- **File:** `docs/bugs/P3_PROGRESS_BAR_POSITIONING.md`
- **Status:** OPEN - Low Priority UX Polish
-
- **Problem:** The `gr.Progress()` bar renders in a strange position when used inside ChatInterface, causing visual overlap with chat messages.
-
- **Recommended Fix:** Remove `gr.Progress()` entirely and rely on emoji status messages in chat output.
-
- ---
-
- ## Resolved Bugs (December 2025)
-
- All resolved bugs have been moved to `docs/bugs/archive/`. Summary:
-
- ### P0 Bugs (All FIXED)
- - **P0 MCP ToolUseContent Missing** - FIXED, requirements.txt missing `mcp>=1.23.0` pin (HF Spaces crashed)
- - **P0 Repr Bug** - FIXED in PR #117 via Accumulator Pattern
- - **P0 AIFunction Not JSON Serializable** - FIXED, full tool support for HuggingFace
- - **P0 HuggingFace Tool Calling Broken** - FIXED, history serialization + Accumulator Pattern
- - **P0 Simple Mode Forced Synthesis Bypass** - N/A, simple.py deleted (Unified Architecture)
- - **P0 Synthesis Provider Mismatch** - FIXED, auto-detect in judges.py
- - **P0 Advanced Mode Timeout No Synthesis** - FIXED, actual synthesis on timeout
-
- ### P1 Bugs (All FIXED)
- - **P1 No Synthesis Free Tier** - FIXED in PR fix/p1-forced-synthesis, forced synthesis safety net when ReportAgent doesn't run
- - **P1 Free Tier Tool Execution Failure** - FIXED in PR fix/P1-free-tier-tool-execution, removed premature marker
- - **P1 Gradio Example Click Auto-Submits** - FIXED in PR #120, prevents auto-submit on example click
- - **P1 HuggingFace Router 401 Hyperbolic** - FIXED, invalid token was root cause
- - **P1 HuggingFace Novita 500 Error** - SUPERSEDED, switched to 7B model
- - **P1 Advanced Mode Uninterpretable Chain-of-Thought** - FIXED in PR #107
- - **P1 Synthesis Broken Key Fallback** - FIXED in PR #103
-
- ### P2 Bugs (All FIXED)
-
- - **P2 ExecutorCompletedEvent UI Noise** - FIXED in PR #133, silenced internal framework events
- - **P2 Round Counter Semantic Mismatch** - FIXED in PR #132, semantic progress tracking
- - **P2 Duplicate Report Content** - FIXED in PR fix/p2-double-bug-squash, stateful deduplication in `run()` loop
- - **P2 First Turn Timeout** - FIXED in PR fix/p2-double-bug-squash, reduced results per tool (10→5), increased timeout (5→10 min)
- - **P2 7B Model Garbage Output** - SUPERSEDED by P1 Free Tier fix (root cause was premature marker, not model capacity)
- - **P2 Advanced Mode Cold Start No Feedback** - FIXED, all phases complete
- - **P2 Architectural BYOK Gaps** - FIXED, end-to-end BYOK support in PR #119
-
- ### P3 Tech Debt (All RESOLVED)
-
- - **P3 Remove Anthropic Partial Wiring** - DONE in PR #130, all Anthropic code removed
- - **P3 Remove Modal Integration** - DONE in PR #130, all Modal code removed (~1400 lines deleted)
+ ## P3 - Progress Bar Positioning in ChatInterface
+
+ **File:** [P3_PROGRESS_BAR_POSITIONING.md](./P3_PROGRESS_BAR_POSITIONING.md)
+ **Status:** OPEN
+ **Priority:** Low (cosmetic UX issue)
+
+ **Problem:** `gr.Progress()` conflicts with ChatInterface, causing the progress bar to float/overlap with chat messages.
+
+ **Fix:** Remove `gr.Progress()` entirely and rely on emoji status messages in chat output.

  ---

  ## How to Report Bugs

  1. Create `docs/bugs/P{N}_{SHORT_NAME}.md`
- 2. Include: Symptom, Root Cause, Fix Plan, Test Plan
- 3. Update this index
- 4. Priority: P0=blocker, P1=important, P2=UX, P3=edge case/tech debt
+ 2. Add entry to this file
+ 3. Priority: P0=blocker, P1=important, P2=UX, P3=cosmetic

  ---

- ## Archived Documentation
-
- The following have been moved to `docs/bugs/archive/`:
-
- - All resolved P0-P2 bug reports
- - Code quality audit findings (2025-11-30)
- - Gradio example vs chat arrow analysis
-
- Additional documentation moved:
-
- - `HF_FREE_TIER_ANALYSIS.md` → `docs/architecture/`
- - `TOOL_ANALYSIS_CRITICAL.md` → `docs/future-roadmap/`
- - `P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md` → `docs/future-roadmap/`
+ *Historical bugs are preserved in the [v0.1.0 release tag](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/releases/tag/v0.1.0).*
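The fix the new doc settles on — emoji status messages instead of `gr.Progress()` — boils down to streaming status strings from the chat callback itself. A minimal sketch of that pattern, assuming a generator-based `gr.ChatInterface` callback; the function body and status text here are illustrative, not the app's actual `research_agent()`:

```python
import time
from collections.abc import Iterator

import gradio as gr

def research_agent(message: str, history: list) -> Iterator[str]:
    # Each yield replaces the pending bot message, so appending lines gives a
    # growing status log inside the chat bubble -- no gr.Progress() widget
    # competing with ChatInterface for layout.
    status = "🚀 **STARTED**: Starting research..."
    yield status
    time.sleep(1)  # stand-in for real search work
    status += "\n⏳ **THINKING**: Gathering evidence..."
    yield status
    time.sleep(1)
    yield status + "\n✅ **COMPLETE**: Report ready."

demo = gr.ChatInterface(fn=research_agent)
```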
docs/bugs/archive/AUDIT_FINDINGS_2025_11_30.md DELETED

@@ -1,70 +0,0 @@
- # Code Quality Audit Findings - 2025-11-30
-
- **Auditor:** Senior Staff Engineer (Gemini)
- **Date:** 2025-11-30
- **Scope:** `src/` (services, tools, agents, orchestrators)
- **Focus:** Configuration validation, Error handling, Defensive programming anti-patterns
-
- ## Summary
-
- The codebase is generally clean and modern, but exhibits specific anti-patterns related to configuration management and defensive error handling. The most critical finding is the reliance on manual `os.getenv` calls and "silent default" fallbacks which obscure configuration errors, directly contributing to the `OpenAIError` observed in production.
-
- ## Findings
-
- ### 1. Defensive Pass Block (Silent Failure) - MEDIUM
- **File:** `src/services/statistical_analyzer.py:246-247`
- ```python
- try:
-     min_p = min(float(p) for p in p_values)
-     # ... logic ...
- except ValueError:
-     pass
- ```
- **Problem:** If p-values are found by regex but fail to parse, the error is swallowed silently. This makes debugging parser issues impossible.
- **Fix:** Replace `pass` with `logger.warning("Failed to parse p-values: %s", p_values)` to aid debugging.
-
- ### 2. Missing Pydantic Validation (Manual Config) - MEDIUM
- **File:** `src/tools/code_execution.py:75-76`
- ```python
- self.modal_token_id = os.getenv("MODAL_TOKEN_ID")
- self.modal_token_secret = os.getenv("MODAL_TOKEN_SECRET")
- ```
- **Problem:** Secrets are manually fetched from env vars, bypassing the centralized `Settings` validation.
- **Fix:** Move to `src/utils/config.py` in the `Settings` class and inject `settings` into `ModalCodeExecutor`.
-
- ### 3. Broad Exception Swallowing - MEDIUM
- **File:** `src/tools/pubmed.py:129-130`
- ```python
- except Exception:
-     continue  # Skip malformed articles
- ```
- **Problem:** Catching `Exception` hides potential bugs (like `NameError` or `TypeError` in our own code), not just malformed data.
- **Fix:** Catch specific exceptions (e.g., `(KeyError, AttributeError, TypeError)`) OR log the error before continuing: `logger.debug(f"Skipping malformed article {pmid}: {e}")`.
-
- ### 4. Missing Pydantic Validation (UI Layer) - LOW
- **File:** `src/app.py:115, 119`
- ```python
- elif os.getenv("OPENAI_API_KEY"):
-     # ...
- elif os.getenv("ANTHROPIC_API_KEY"):
- ```
- **Problem:** Application logic relies on raw environment variable checks to determine available backends, creating duplication and potential inconsistency with `config.py`.
- **Fix:** Centralize this logic in `src/utils/config.py` (e.g., `settings.has_openai`, `settings.has_anthropic`).
-
- ### 5. Try/Except for Flow Control - LOW
- **File:** `src/tools/code_execution.py:244-249`
- ```python
- try:
-     start_idx = text.index(start_marker) + len(start_marker)
-     # ...
- except ValueError:
-     return text.strip()
- ```
- **Problem:** Using exceptions for expected "not found" cases is slower and less explicit.
- **Fix:** Use `find()` which returns `-1` on failure.
-
- ## Action Plan
-
- 1. **Refactor Configuration:** Eliminate `os.getenv` in favor of `src/utils/config.py` `Settings` model.
- 2. **Fix Error Handling:** Remove empty `pass` blocks; add logging.
- 3. **Address P0 Bug:** Fix the `OpenAIError` in synthesis (caused by Finding #4/General Config issue) by injecting the correct model into the orchestrator.
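Finding #5's suggested rewrite is small enough to show in full. A hedged sketch of the `find()`-based version, reusing the `text`/`start_marker` names from the snippet above; the helper name and the `end_marker` parameter are illustrative, not the actual code in `code_execution.py`:

```python
def extract_between(text: str, start_marker: str, end_marker: str) -> str:
    """Return the text between two markers, falling back to the whole text."""
    start = text.find(start_marker)
    if start == -1:
        # Marker absent: an explicit branch instead of a ValueError handler.
        return text.strip()
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return text[start:].strip()
    return text[start:end].strip()
```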
docs/bugs/archive/GRADIO_EXAMPLE_VS_CHAT_ARROW_ANALYSIS.md DELETED

@@ -1,147 +0,0 @@
- # Gradio Example Click vs Chat Arrow - Code Path Analysis
-
- **Status**: ANALYZED - NOT A BUG (Same code path, different timing)
- **Priority**: N/A (Symptom of upstream repr bug)
- **Analyzed**: 2025-12-01
- **Related**: P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md
-
- ---
-
- ## Symptom Reported
-
- User observed two different outputs when:
- 1. **Clicking an Example** → Shows progress at 10%, "THINKING" message
- 2. **Clicking Chat Arrow** → Shows full 5 rounds with repr garbage
-
- User suspected divergent code paths from vestigial Simple Mode deletion.
-
- ---
-
- ## Analysis: NO DIVERGENT CODE PATHS
-
- ### Code Trace
-
- Both Example Click and Chat Arrow use **the exact same code path**:
-
- ```text
- User Action (Example OR Chat Arrow)
-
- app.py:research_agent()                ← SAME FUNCTION
-
- app.py:configure_orchestrator()        ← SAME FUNCTION (mode="advanced" always)
-
- factory.py:create_orchestrator()       ← SAME FUNCTION
-
- factory.py:_determine_mode()           ← ALWAYS returns "advanced"
-
- AdvancedOrchestrator                   ← SAME CLASS
-
- clients/factory.py:get_chat_client()   ← SAME FUNCTION
-
- HuggingFaceChatClient (no API key) OR OpenAIChatClient (with API key)
- ```
-
- ### Evidence from Code
-
- **app.py:279-325 - ChatInterface Setup:**
- ```python
- demo = gr.ChatInterface(
-     fn=research_agent,  # ← SAME FUNCTION FOR BOTH
-     examples=[
-         ["What drugs improve female libido post-menopause?", "sexual_health", None, None],
-         # ...
-     ],
-     # ...
- )
- ```
-
- **factory.py:76-90 - Mode Determination:**
- ```python
- def _determine_mode(explicit_mode: str | None) -> str:
-     if explicit_mode == "hierarchical":
-         return "hierarchical"
-     # "simple" is deprecated -> upgrade to "advanced"
-     # "magentic" is alias for "advanced"
-     return "advanced"  # ← ALWAYS ADVANCED
- ```
-
- ---
-
- ## Explanation of Visual Difference
-
- The difference the user observed is **timing**, not code paths:
-
- | Screenshot | When Captured | Content |
- |------------|---------------|---------|
- | Example Click | Mid-execution | Progress bar at 10%, "THINKING" |
- | Chat Arrow | After completion | Full 5 rounds with repr garbage |
-
- **Both show the same process at different stages.**
-
- The repr garbage (`<agent_framework._types.ChatMessage object at 0x...>`) appears in BOTH:
- - Example Click: Would show repr garbage if captured after completion
- - Chat Arrow: Shows repr garbage because it was captured after completion
-
- ---
-
- ## The Real Bug: Upstream repr Issue
-
- The repr garbage is the **upstream Microsoft Agent Framework bug** documented in:
- - `docs/bugs/P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md`
-
- **Root cause in upstream code:**
- ```python
- # agent_framework/_workflows/_magentic.py line ~1799
- text = last.text or str(last)  # BUG: str(last) gives repr for tool-only messages
- ```
-
- **Our workaround in advanced.py:**
- ```python
- def _extract_text(self, message: Any) -> str:
-     # Filter out repr strings
-     if isinstance(message, str) and message.startswith("<") and "object at" in message:
-         return ""
-     # ...
- ```
-
- ---
-
- ## Verification
-
- 1. **No vestigial Simple Mode code** - `simple.py` is deleted, not imported anywhere
- 2. **Factory always returns AdvancedOrchestrator** - verified in `factory.py:66-73`
- 3. **Same research_agent function** - Gradio routes both Example and Chat Arrow through it
-
- ---
-
- ## Conclusion
-
- **There are NO divergent code paths.** The unified architecture is correctly implemented:
-
- | Component | Status |
- |-----------|--------|
- | Simple Mode | ✅ DELETED (no vestigial code) |
- | Factory Pattern | ✅ Always returns AdvancedOrchestrator |
- | Chat Client Factory | ✅ Auto-selects HuggingFace (free) or OpenAI (paid) |
- | Example Click | ✅ Uses same `research_agent()` function |
- | Chat Arrow Click | ✅ Uses same `research_agent()` function |
-
- **The only bug is the upstream repr display issue**, which affects BOTH paths equally.
-
- ---
-
- ## Next Steps
-
- 1. **Wait for upstream fix** - [PR #2566](https://github.com/microsoft/agent-framework/pull/2566)
- 2. **Once merged**: `uv add agent-framework@latest`
- 3. **Test**: Verify both Example Click and Chat Arrow work identically
-
- ---
-
- ## References
-
- - `src/app.py` - Line 134-247 (`research_agent()`)
- - `src/app.py` - Line 279-325 (ChatInterface with examples)
- - `src/orchestrators/factory.py` - Line 43-73 (`create_orchestrator()`)
- - `src/clients/factory.py` - Line 15-76 (`get_chat_client()`)
- - `docs/bugs/P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md` - Upstream repr bug details
docs/bugs/archive/P0_ADVANCED_MODE_TIMEOUT_NO_SYNTHESIS.md DELETED

@@ -1,307 +0,0 @@
- # P0 - Advanced Mode Timeout Yields False "Synthesizing" Message
-
- **Status:** RESOLVED
- **Priority:** P0 (Blocker for Advanced/Magentic mode)
- **Found:** 2025-11-30 (Manual Testing)
- **Resolved:** 2025-11-30
- **Component:** `src/orchestrators/advanced.py`
-
- ## Resolution Summary
-
- The issue where Advanced Mode timeouts produced a fake synthesis message has been fully resolved.
- We implemented a robust fallback mechanism that synthesizes a report from collected evidence upon timeout.
-
- ### Fix Details
-
- 1. **Implemented `ResearchMemory.get_context_summary()`**:
-    - Added missing method to `src/services/research_memory.py`.
-    - Generates a structured summary of hypotheses and top 20 evidence items.
-    - Enables the ReportAgent to function even without a formal handoff from JudgeAgent.
-
- 2. **Fixed Factory Configuration**:
-    - Updated `src/orchestrators/factory.py` to use `settings.advanced_max_rounds` (default 5).
-    - Previously used global `max_iterations` (default 10), causing workflows to run 2x longer than intended and hitting timeouts.
-
- 3. **Implemented Timeout Synthesis Logic**:
-    - Updated `src/orchestrators/advanced.py` to catch `TimeoutError`.
-    - Now retrieves `get_context_summary()` from memory.
-    - Directly invokes `ReportAgent` to generate a final report from available evidence.
-    - Yields the actual report content instead of a static placeholder message.
-
- ### Verification
-
- - **Unit Tests**: `tests/unit/orchestrators/test_advanced_timeout.py` verifies:
-   - Timeout triggers synthesis (mocked ReportAgent is called).
-   - Factory correctly sets `max_rounds=5`.
- - **Manual Verification**:
-   - Confirmed logic flow via TDD.
-   - SearchAgent verbosity mitigated by reduced round count (5 rounds = ~20KB context vs 40KB+).
-
- ---
-
- ## Symptom (Archive)
-
- When using Advanced mode (Magentic/Multi-Agent) with an OpenAI API key, the workflow:
-
- 1. Starts correctly ("Starting research (Advanced mode)")
- 2. Shows "Multi-agent reasoning in progress (10 rounds max)"
- 3. Streams SearchAgent results successfully
- 4. Shows "Round 1/10" progress
- 5. Then hangs for ~5 minutes (timeout period)
- 6. Finally shows: **"Research timed out. Synthesizing available evidence..."**
- 7. **BUT NO SYNTHESIS OCCURS** - the output ends there
-
- User sees massive streaming output from SearchAgent but NO final research report.
-
- ## Observed Output
-
- ```text
- 🚀 **STARTED**: Starting research (Advanced mode): Clinical trials for PDE5 inhibitors alternatives?
- ⏳ **THINKING**: Multi-agent reasoning in progress (10 rounds max)...
- 🧠 **JUDGING**: Manager (user_task): Research sexual health and wellness interventions...
- 📡 **STREAMING**: [MASSIVE SearchAgent output - 10KB+ of clinical trial data]
- ⏱️ **PROGRESS**: Round 1/10 (~6m 45s remaining)
- 📚 **SEARCH_COMPLETE**: searcher: Below is a structured evidence dataset...
-
- Research timed out. Synthesizing available evidence...
- [END - Nothing more happens]
- ```
-
- ## Root Cause Analysis
-
- ### Bug Location: `src/orchestrators/advanced.py:254-261`
-
- ```python
- except TimeoutError:
-     logger.warning("Workflow timed out", iterations=iteration)
-     yield AgentEvent(
-         type="complete",
-         message="Research timed out. Synthesizing available evidence...",  # <-- LIE
-         data={"reason": "timeout", "iterations": iteration},
-         iteration=iteration,
-     )
- ```
-
- **The message is a lie.** It says "Synthesizing available evidence..." but:
- 1. No synthesis code is called
- 2. The `MagenticState` (containing gathered evidence) is never accessed
- 3. The `ReportAgent` is never invoked
- 4. User just sees the raw streaming output
-
- ### Secondary Issue: Workflow Never Progresses Past Round 1
-
- The SearchAgent produces a MASSIVE response (10KB+) in Round 1, but the workflow appears to stall and never delegate to:
- - HypothesisAgent
- - JudgeAgent
- - ReportAgent
-
- This suggests the Manager agent may be:
- 1. Overwhelmed by the verbose SearchAgent output
- 2. Stuck in a decision loop
- 3. Not receiving proper signals to delegate to next agent
-
- ### Configuration Issue: Wrong `max_rounds` Used
-
- **File:** `src/orchestrators/factory.py:93-97`
-
- ```python
- return orchestrator_cls(
-     max_rounds=effective_config.max_iterations,  # <-- Uses max_iterations (10)
-     api_key=api_key,
-     domain=domain,
- )
- ```
-
- The factory passes `max_iterations` (10) instead of using `settings.advanced_max_rounds` (5).
- This means timeout is more likely since workflows run longer.
-
- ## Impact
-
- - **User Experience:** After waiting 5+ minutes, users get NO useful output
- - **Demo Killer:** Advanced mode is effectively broken for external users
- - **Misleading UX:** Message claims synthesis is happening when it's not
-
- ## Proposed Fix
-
- ### Fix 1: Implement Actual Timeout Synthesis
-
- **File:** `src/orchestrators/advanced.py`
-
- ```python
- except TimeoutError:
-     logger.warning("Workflow timed out", iterations=iteration)
-
-     # ACTUALLY synthesize from gathered evidence
-     try:
-         from src.agents.state import get_magentic_state
-         from src.agents.magentic_agents import create_report_agent
-
-         state = get_magentic_state()
-         memory: ResearchMemory = state.memory
-
-         # Get evidence summary from memory
-         evidence_summary = await memory.get_context_summary()
-
-         # Create and invoke ReportAgent for synthesis
-         report_agent = create_report_agent(self._chat_client, domain=self.domain)
-         synthesis_result = await report_agent.invoke(
-             f"Synthesize research report from this evidence:\n{evidence_summary}"
-         )
-
-         yield AgentEvent(
-             type="complete",
-             message=synthesis_result,
-             data={"reason": "timeout_synthesis", "iterations": iteration},
-             iteration=iteration,
-         )
-     except Exception as synth_error:
-         logger.error("Timeout synthesis failed", error=str(synth_error))
-         yield AgentEvent(
-             type="complete",
-             message=(
-                 f"Research timed out after {iteration} rounds. "
-                 f"Evidence gathered but synthesis failed: {synth_error}"
-             ),
-             data={"reason": "timeout_synthesis_failed", "iterations": iteration},
-             iteration=iteration,
-         )
- ```
-
- ### Fix 2: Address SearchAgent Verbosity
-
- The SearchAgent is producing large outputs (~4KB per search, accumulating to 40KB+ over 10 rounds), which overwhelms the Manager's context window.
- Consider:
- 1. Limiting SearchAgent output length further (currently 300 chars/result)
- 2. Summarizing results before returning to Manager
- 3. Using structured output format instead of prose
-
- ### Fix 3: Use Correct max_rounds
-
- **File:** `src/orchestrators/factory.py`
-
- ```python
- # Use advanced-specific setting, not max_iterations
- return orchestrator_cls(
-     max_rounds=settings.advanced_max_rounds,  # 5 by default
-     api_key=api_key,
-     domain=domain,
- )
- ```
-
- ### Fix 4: Implement `get_context_summary` in ResearchMemory
-
- **File:** `src/services/research_memory.py`
-
- The `ResearchMemory` class is missing the `get_context_summary` method required by Fix 1.
-
- ```python
- async def get_context_summary(self) -> str:
-     """Generate a summary of all collected evidence for the final report."""
-     if not self.evidence_ids:
-         return "No evidence collected."
-
-     summary = [f"Research Query: {self.query}\n"]
-
-     # Add Hypotheses
-     if self.hypotheses:
-         summary.append("## Hypotheses")
-         for h in self.hypotheses:
-             summary.append(f"- {h.drug} -> {h.target}: {h.effect} (Conf: {h.confidence})")
-         summary.append("")
-
-     # Add Top Evidence (limit to avoid token overflow)
-     # We use get_all_evidence() but might need to summarize if too large
-     evidence = self.get_all_evidence()
-     summary.append(f"## Evidence ({len(evidence)} items)")
-
-     # Group by source for cleaner summary
-     for i, ev in enumerate(evidence[:20], 1):  # Limit to top 20 items
-         summary.append(f"{i}. {ev.citation.title} ({ev.citation.date})")
-         summary.append(f"   {ev.content[:200]}...")  # Brief snippet
-
-     return "\n".join(summary)
- ```
-
- ## Call Stack Trace
-
- ```
- app.py:research_agent()
-   → configure_orchestrator(mode="advanced")
-     → factory.py:create_orchestrator()
-       → AdvancedOrchestrator(max_rounds=10)  # Should be 5
-
-   → orchestrator.run(query)
-     → advanced.py:run()
-       → init_magentic_state(query)
-       → workflow = _build_workflow()  # MagenticBuilder
-       → async for event in workflow.run_stream(task):
-           # SearchAgent runs (accumulates 4KB+ per round)
-           # Manager receives, but never delegates further
-           # TimeoutError after 300 seconds
-       → except TimeoutError:
-           → yield AgentEvent(message="Synthesizing...")  # LIE - no synthesis
- ```
-
- ## Files to Modify
-
- | File | Change |
- |------|--------|
- | `src/orchestrators/advanced.py:254-261` | Implement actual synthesis on timeout |
- | `src/orchestrators/factory.py:93-97` | Use `settings.advanced_max_rounds` |
- | `src/services/research_memory.py` | Implement `get_context_summary()` method |
- | `src/agents/magentic_agents.py` | Consider limiting SearchAgent output |
-
- ## Test Plan
-
- ### Unit Tests
-
- ```python
- # tests/unit/orchestrators/test_advanced_timeout.py
-
- @pytest.mark.asyncio
- async def test_timeout_synthesizes_evidence():
-     """Timeout should produce synthesis, not empty message."""
-     orchestrator = AdvancedOrchestrator(
-         max_rounds=1,
-         timeout_seconds=0.1,  # Force immediate timeout
-         api_key="sk-test",
-     )
-
-     events = [e async for e in orchestrator.run("test query")]
-     complete_event = [e for e in events if e.type == "complete"][-1]
-
-     # Should contain synthesis, not just "timed out"
-     assert "Research timed out" not in complete_event.message or \
-         len(complete_event.message) > 100  # Actual content present
-
- @pytest.mark.asyncio
- async def test_factory_uses_advanced_max_rounds():
-     """Factory should use settings.advanced_max_rounds for advanced mode."""
-     orchestrator = create_orchestrator(
-         mode="advanced",
-         api_key="sk-test",
-     )
-     assert orchestrator._max_rounds == settings.advanced_max_rounds
- ```
-
- ### Manual Verification
-
- 1. Set `OPENAI_API_KEY` and run app
- 2. Select "Advanced" mode
- 3. Submit: "Clinical trials for PDE5 inhibitors alternatives?"
- 4. Wait for completion or timeout
- 5. **Verify:** Final output contains synthesized report (not just "timed out" message)
-
- ## Related Issues
-
- - This may be related to the SearchAgent being too verbose
- - The Magentic pattern expects agents to produce concise outputs
- - Microsoft Agent Framework's Manager may struggle with 10KB+ messages
-
- ## Priority Justification
-
- **P0 because:**
- 1. Advanced mode is a major selling point (multi-agent, deep research)
- 2. Users with paid API keys expect it to work
- 3. The current behavior is deceptive (claims synthesis, delivers nothing)
- 4. Demo credibility is destroyed when users wait 5min for nothing
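The report repeatedly refers to a `TimeoutError` raised after ~300 seconds but never shows how that budget is enforced around `workflow.run_stream()`. A minimal sketch of one way to do it with `asyncio.wait_for`; the helper name and the 300-second default are assumptions, not the orchestrator's actual code:

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Any

async def stream_with_timeout(
    stream: AsyncIterator[Any], timeout_seconds: float = 300.0
) -> AsyncIterator[Any]:
    """Yield events from `stream`; raise TimeoutError once the budget is spent."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_seconds
    while True:
        remaining = deadline - loop.time()
        try:
            # wait_for raises TimeoutError when `remaining` elapses (or is <= 0),
            # which the orchestrator's `except TimeoutError:` block then catches.
            event = await asyncio.wait_for(anext(stream), timeout=remaining)
        except StopAsyncIteration:
            return
        yield event
```

The consuming loop would then read `async for event in stream_with_timeout(workflow.run_stream(task)):` inside the same `try` that owns the `except TimeoutError:` fallback shown in Fix 1.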
docs/bugs/archive/P0_AIFUNCTION_NOT_JSON_SERIALIZABLE.md DELETED

@@ -1,225 +0,0 @@
- # P0 Bug: AIFunction Not JSON Serializable (Free Tier Broken)
-
- **Severity**: P0 (Critical) - Free Tier cannot perform research
- **Status**: RESOLVED
- **Discovered**: 2025-12-01
- **Resolved**: 2025-12-01
- **Reporter**: Production user via HuggingFace Spaces
-
- ## Symptom
-
- Every search round fails with:
- ```
- 📚 SEARCH_COMPLETE: searcher: Agent searcher: Error processing request -
- Object of type AIFunction is not JSON serializable
- ```
-
- Research never completes. Users see 5 rounds of the same error.
-
- ## Root Cause
-
- ### The Problem
-
- In `src/clients/huggingface.py` lines 82-103:
-
- ```python
- # Extract tool configuration
- tools = chat_options.tools if chat_options.tools else None  # AIFunction objects!
- ...
- call_fn = partial(
-     self._client.chat_completion,
-     messages=hf_messages,
-     tools=tools,  # <-- RAW AIFunction objects passed here
-     ...
- )
- ```
-
- The `chat_options.tools` contains `AIFunction` objects from Microsoft's agent-framework.
- When `requests` tries to serialize these for the HTTP request, it fails:
- ```
- TypeError: Object of type AIFunction is not JSON serializable
- ```
-
- ### Why This Happens
-
- 1. Microsoft's agent-framework defines tools as `AIFunction` objects
- 2. `ChatAgent` with tools passes them via `chat_options.tools`
- 3. Our `HuggingFaceChatClient` forwards them directly to `InferenceClient.chat_completion()`
- 4. `requests.post()` internally calls `json.dumps()` on the request body
- 5. `AIFunction` has no `__json__()` method and isn't a dict → TypeError
-
- ## Impact
-
- | Component | Impact |
- |-----------|--------|
- | Free Tier (HuggingFace) | **COMPLETELY BROKEN** |
- | Advanced Mode without API key | **Cannot do research** |
- | Paid Tier (OpenAI) | Unaffected (OpenAI handles AIFunction) |
-
- ## Professional Fix (Full Implementation)
-
- Qwen2.5-72B-Instruct **SUPPORTS** function calling via HuggingFace. The fix requires:
-
- 1. **Request Serialization**: Convert `AIFunction` → OpenAI-compatible JSON
- 2. **Response Parsing**: Convert HuggingFace `tool_calls` → Framework `FunctionCallContent`
-
- ### Part 1: Tool Serialization (`_convert_tools`)
-
- ```python
- def _convert_tools(self, tools: list[Any] | None) -> list[dict[str, Any]] | None:
-     """Convert AIFunction objects to OpenAI-compatible tool definitions.
-
-     AIFunction.to_dict() returns:
-         {'type': 'ai_function', 'name': '...', 'description': '...', 'input_model': {...}}
-
-     OpenAI/HuggingFace expects:
-         {'type': 'function', 'function': {'name': '...', 'description': '...', 'parameters': {...}}}
-     """
-     if not tools:
-         return None
-
-     json_tools = []
-     for tool in tools:
-         if hasattr(tool, 'to_dict'):
-             t_dict = tool.to_dict()
-             json_tools.append({
-                 "type": "function",
-                 "function": {
-                     "name": t_dict["name"],
-                     "description": t_dict.get("description", ""),
-                     "parameters": t_dict["input_model"]
-                 }
-             })
-         elif isinstance(tool, dict):
-             json_tools.append(tool)
-         else:
-             logger.warning(f"Skipping non-serializable tool: {type(tool)}")
-
-     return json_tools if json_tools else None
- ```
-
- ### Part 2: Response Parsing (Tool Calls → FunctionCallContent)
-
- When HuggingFace returns tool calls, we must convert them to the framework's format:
-
- ```python
- from agent_framework._types import FunctionCallContent
-
- # In _inner_get_response, after getting the response:
- choice = choices[0]
- message = choice.message
- message_content = message.content or ""
-
- # Parse tool calls if present
- contents: list[Any] = []
- if hasattr(message, 'tool_calls') and message.tool_calls:
-     for tc in message.tool_calls:
-         # HF returns: tc.id, tc.function.name, tc.function.arguments
-         contents.append(FunctionCallContent(
-             call_id=tc.id,
-             name=tc.function.name,
-             arguments=tc.function.arguments  # JSON string or dict
-         ))
-
- response_msg = ChatMessage(
-     role=cast(Any, message.role),
-     text=message_content,
-     contents=contents if contents else None
- )
- ```
-
- ### Verified Schema Mapping
-
- ```python
- # AIFunction.to_dict() output (verified 2025-12-01):
- {
-     "type": "ai_function",
-     "name": "search_pubmed",
-     "description": "Search PubMed for biomedical research papers...",
-     "input_model": {
-         "properties": {"query": {"title": "Query", "type": "string"}, ...},
-         "required": ["query"],
-         "type": "object"
-     }
- }
-
- # Mapped to OpenAI format:
- {
-     "type": "function",
-     "function": {
-         "name": "search_pubmed",
-         "description": "Search PubMed for biomedical research papers...",
-         "parameters": {
-             "properties": {"query": {"title": "Query", "type": "string"}, ...},
-             "required": ["query"],
-             "type": "object"
-         }
-     }
- }
- ```
-
- ## Call Stack Trace
-
- ```
- User Query (HuggingFace Spaces)
-
- src/app.py:research_agent()
-
- src/orchestrators/advanced.py:AdvancedOrchestrator.run()
-
- agent_framework.MagenticBuilder.run_stream()
-
- agent_framework.ChatAgent (SearchAgent with tools=[search_pubmed, ...])
-
- src/clients/huggingface.py:HuggingFaceChatClient._inner_get_response()
-   → chat_options.tools contains AIFunction objects
-
- huggingface_hub.InferenceClient.chat_completion(tools=tools)
-
- requests.post(json={..., "tools": [AIFunction, ...]})
-
- json.dumps() → TypeError: Object of type AIFunction is not JSON serializable
- ```
-
- ## Testing
-
- ```bash
- # Reproduce locally (remove OpenAI key)
- unset OPENAI_API_KEY
- uv run python -c "
- import asyncio
- from src.orchestrators.advanced import AdvancedOrchestrator
-
- async def test():
-     orch = AdvancedOrchestrator(max_rounds=2)
-     async for event in orch.run('testosterone benefits'):
-         print(f'[{event.type}] {str(event.message)[:50]}...')
-
- asyncio.run(test())
- "
-
- # Expected BEFORE fix: TypeError: Object of type AIFunction is not JSON serializable
- # Expected AFTER fix: Research completes with tool calls working
- ```
-
- ## Resolution
-
- Implemented full function calling support for HuggingFace client:
-
- 1. **Request Serialization**: Added `_convert_tools` to map `AIFunction` schemas to OpenAI-compatible JSON.
- 2. **Response Parsing (Sync)**: Added `_parse_tool_calls` to convert HF `tool_calls` to `FunctionCallContent`.
- 3. **Response Parsing (Async)**: Implemented tool call accumulator in `_inner_get_streaming_response` to handle partial tool call deltas and yield valid `FunctionCallContent` objects.
-
- ## Verification
-
- Verified with unit tests and manual simulation:
-
- 1. **Serialization**: Confirmed `AIFunction` -> JSON conversion works for `search_pubmed`.
- 2. **Streaming**: Verified that fragmented tool call deltas (e.g., `{"query":` then `"testosterone"}`) are correctly reassembled into a single `FunctionCallContent`.
- 3. **Integration**: Passed project-level `make check`.
-
- ## References
-
- - [HuggingFace Chat Completion - Function Calling](https://huggingface.co/docs/inference-providers/tasks/chat-completion)
- - [Qwen Function Calling](https://qwen.readthedocs.io/en/latest/framework/function_call.html)
- - [Microsoft Agent Framework - AIFunction](https://learn.microsoft.com/en-us/python/api/agent-framework-core/agent_framework.aifunction)
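The Resolution above mentions a "tool call accumulator" for partial streaming deltas but never shows one. A hedged sketch of the idea, assuming OpenAI-style streamed `tool_calls` deltas that carry an `index`, a one-time `id`/`function.name`, and argument JSON in fragments; the class and field names are illustrative, not the client's actual code:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCallAccumulator:
    """Reassemble fragmented streaming tool-call deltas into complete calls."""

    calls: dict[int, dict[str, str]] = field(default_factory=dict)

    def add_delta(self, delta: Any) -> None:
        for tc in getattr(delta, "tool_calls", None) or []:
            slot = self.calls.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
            if getattr(tc, "id", None):
                slot["id"] = tc.id  # arrives once, on the first fragment
            fn = getattr(tc, "function", None)
            if fn is not None and getattr(fn, "name", None):
                slot["name"] = fn.name
            if fn is not None and getattr(fn, "arguments", None):
                # Argument JSON arrives in pieces ('{"query":' then '"testosterone"}');
                # concatenate in arrival order and parse only when the stream ends.
                slot["arguments"] += fn.arguments

    def completed(self) -> list[dict[str, str]]:
        return [self.calls[i] for i in sorted(self.calls)]
```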
docs/bugs/archive/P0_HUGGINGFACE_TOOL_CALLING_BROKEN.md DELETED

@@ -1,173 +0,0 @@
- # P0 Bug: HuggingFace Free Tier Tool Calling Broken
-
- **Severity**: P0 (Critical) - Free Tier cannot perform multi-turn tool-based research
- **Status**: PARTIALLY RESOLVED - Bug #1 FIXED, Bug #2 requires upstream fix
- **Discovered**: 2025-12-01
- **Investigator**: Claude Code (Systematic First-Principles Analysis)
- **Last Updated**: 2025-12-01
-
- ## Executive Summary
-
- The HuggingFace Free Tier had two critical bugs preventing end-to-end tool-based research:
-
- 1. **Bug #1 (FIXED)**: Conversation history serialization missing `tool_calls` and `tool_call_id`
- 2. **Bug #2 (UPSTREAM)**: Microsoft Agent Framework produces repr strings instead of message text
-
- ## Current Status
-
- | Bug | Status | Location | Fix |
- |-----|--------|----------|-----|
- | #1 History Serialization | ✅ **FIXED** | `src/clients/huggingface.py` | Commit `809ad60` |
- | #2 Framework Repr Bug | ⏳ **UPSTREAM** | `agent_framework/_workflows/_magentic.py` | [Issue #2562](https://github.com/microsoft/agent-framework/issues/2562) |
-
- ---
-
- ## BUG #1: Conversation History Serialization ✅ FIXED
-
- ### What Was Wrong
- `_convert_messages()` didn't serialize `tool_calls` (for assistant messages) or `tool_call_id` (for tool messages).
-
- ### The Fix (Commit `809ad60`)
- Updated `_convert_messages()` in `src/clients/huggingface.py:71-121` to:
- 1. Extract `FunctionCallContent` from `msg.contents` → `tool_calls` array
- 2. Extract `FunctionResultContent` from `msg.contents` → `tool_call_id`
- 3. Properly format for HuggingFace/OpenAI API
-
- ### Verification
- ```python
- # Before fix: BadRequestError on multi-turn
- # After fix: Multi-turn conversations work
-
- # The message format is now correct:
- {
-     "role": "assistant",
-     "content": "",
-     "tool_calls": [{"id": "call_123", "type": "function", "function": {...}}]
- }
- ```
-
- ---
-
- ## BUG #2: Framework Message Corruption ⏳ UPSTREAM
-
- ### Symptom
- `MagenticAgentMessageEvent.message.text` contains:
- ```text
- '<agent_framework._types.ChatMessage object at 0x10c394210>'
- ```
-
- ### Root Cause (CONFIRMED)
- **File**: `agent_framework/_workflows/_magentic.py` line ~1799
-
- ```python
- async def _invoke_agent(self, ctx, ...) -> ChatMessage:
-     # ...
-     if messages and len(messages) > 0:
-         last: ChatMessage = messages[-1]
-         text = last.text or str(last)  # <-- BUG: str(last) gives repr!
-         msg = ChatMessage(role=role, text=text, author_name=author)
- ```
-
- **Why it happens**:
- 1. `ChatMessage.text` property only extracts `TextContent` items
- 2. Tool-call-only messages have empty `.text` (returns `""`)
- 3. `"" or str(last)` evaluates to `str(last)`
- 4. `ChatMessage` has no `__str__` method → default Python repr
-
- ### Impact Assessment
-
- | Aspect | Impact | Critical? |
- |--------|--------|-----------|
- | UI Display | Shows garbage instead of agent output | YES for UX |
- | Logging | Can't debug what agents did | YES for debugging |
- | Tool Execution | Tools ARE being called (middleware works) | NO - Works |
- | Research Completion | Manager may not track progress properly | MAYBE - Unclear |
-
- **Observed behavior**: Research loops often reach max rounds without synthesis. The Manager keeps saying "no progress" even though tools ARE being called. This COULD be:
- 1. The repr bug affecting Manager's understanding
- 2. Qwen 72B not handling tool message format well
- 3. Unrelated orchestration issue
-
- ### Upstream Issue Filed
- **GitHub Issue**: [microsoft/agent-framework#2562](https://github.com/microsoft/agent-framework/issues/2562)
-
- **Suggested fixes in issue**:
- 1. **Minimal**: `text = last.text or ""`
- 2. **Better UX**: Format tool calls for display
- 3. **Best**: Add `__str__` to `ChatMessage` class
-
- ### Workaround (Implemented in `advanced.py`)
- We modified `_extract_text()` in `advanced.py` to extract tool call names from `.contents` when text is empty or looks like a repr:
-
- ```python
- def _extract_text(self, message: Any) -> str:
-     # ... existing logic with repr filtering ...
-
-     # Workaround: Extract tool call info when text is repr/empty
-     if hasattr(message, "contents") and message.contents:
-         tool_names = [
-             f"[Tool: {c.name}]"
-             for c in message.contents
-             if hasattr(c, "name")  # FunctionCallContent
-         ]
-         if tool_names:
-             return " ".join(tool_names)
-
-     return ""
- ```
-
- **Decision**: Implemented locally to fix display and logging while we wait for upstream fix.
-
- ---
-
- ## Verification Matrix (Updated)
-
- | Component | Status | Notes |
- |-----------|--------|-------|
- | Tool Serialization | ✅ WORKS | `_convert_tools()` |
- | Tool Call Parsing | ✅ WORKS | `_parse_tool_calls()` |
- | History Serialization | ✅ **FIXED** | `_convert_messages()` |
- | Middleware Decorators | ✅ **FIXED** | `@use_function_invocation` etc. |
- | Event Display | ❌ UPSTREAM | Shows repr - framework bug |
- | End-to-End Research | ⚠️ UNCLEAR | Needs testing after upstream fix |
-
- ---
-
- ## Files Changed
-
- ### Fixed (Commit `809ad60`)
- - `src/clients/huggingface.py`
-   - `_convert_messages()` - Now serializes `tool_calls` and `tool_call_id`
-   - Added `@use_function_invocation`, `@use_observability`, `@use_chat_middleware` decorators
-   - Added `__function_invoking_chat_client__ = True` marker
-
- ### Also Fixed
- - `src/orchestrators/advanced.py` - `_extract_text()` now filters repr strings AND extracts tool call names
-
- ---
-
- ## Related Upstream Issues
-
- | Issue | Title | Status | Relevance |
- |-------|-------|--------|-----------|
- | [#2562](https://github.com/microsoft/agent-framework/issues/2562) | Repr string bug (OUR ISSUE) | OPEN | Direct cause |
- | [#1366](https://github.com/microsoft/agent-framework/issues/1366) | Thread corruption - unexecuted tool calls | OPEN | Same area |
- | [#2410](https://github.com/microsoft/agent-framework/issues/2410) | OpenAI client splits content/tool_calls | OPEN | Related bug |
-
- ---
-
- ## Next Steps
-
- 1. **Monitor**: Watch for response to [Issue #2562](https://github.com/microsoft/agent-framework/issues/2562)
- 2. **Test**: Run end-to-end research tests to see if Bug #2 actually blocks completion
- 3. **Optional**: Implement workaround in `_extract_text()` if display is critical
- 4. **Contribute**: Consider submitting PR to fix `_magentic.py` line 1799
-
- ---
-
- ## References
-
- - [HuggingFace Chat Completion API - Tool Use](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion)
- - [OpenAI Function Calling](https://platform.openai.com/docs/guides/function-calling)
- - [Microsoft Agent Framework Repository](https://github.com/microsoft/agent-framework)
- - [Our Upstream Issue #2562](https://github.com/microsoft/agent-framework/issues/2562)
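Bug #1's fix is described in three steps but only the resulting JSON is shown. A hedged sketch of what a `_convert_messages()` along those lines might look like, assuming framework `ChatMessage` objects whose `.contents` may hold `FunctionCallContent` / `FunctionResultContent` items; attribute names are assumptions and the sketch is not verified against commit `809ad60`:

```python
import json
from typing import Any

def convert_messages(messages: list[Any]) -> list[dict[str, Any]]:
    """Map framework ChatMessages to HuggingFace/OpenAI-style chat dicts."""
    out: list[dict[str, Any]] = []
    for msg in messages:
        entry: dict[str, Any] = {"role": str(msg.role), "content": msg.text or ""}
        tool_calls: list[dict[str, Any]] = []
        for content in getattr(msg, "contents", None) or []:
            kind = type(content).__name__
            if kind == "FunctionCallContent":
                args = content.arguments
                tool_calls.append({
                    "id": content.call_id,
                    "type": "function",
                    "function": {
                        "name": content.name,
                        "arguments": args if isinstance(args, str) else json.dumps(args),
                    },
                })
            elif kind == "FunctionResultContent":
                # Tool results become role="tool" messages keyed by tool_call_id.
                entry["role"] = "tool"
                entry["tool_call_id"] = content.call_id
                entry["content"] = str(content.result)
        if tool_calls:
            entry["tool_calls"] = tool_calls
        out.append(entry)
    return out
```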
docs/bugs/archive/P0_MCP_TOOLUSECONTENT_MISSING.md DELETED

@@ -1,88 +0,0 @@
- # P0 Bug: mcp.types.ToolUseContent AttributeError on HuggingFace Spaces
-
- **Status**: FIXED
- **Severity**: P0 (App completely broken)
- **Discovered**: 2025-12-04
- **Fixed**: 2025-12-04 (PR TBD)
-
- ---
-
- ## Symptom
-
- HuggingFace Spaces deployment crashes with:
-
- ```
- module 'mcp.types' has no attribute 'ToolUseContent'
- ```
-
- The app fails to start entirely. No functionality works.
-
- ---
-
- ## Root Cause
-
- **Dependency version mismatch between `pyproject.toml` and `requirements.txt`.**
-
- | File | MCP Pin | Result |
- |------|---------|--------|
- | `pyproject.toml` | `mcp>=1.23.0` | Correct - has `ToolUseContent` |
- | `requirements.txt` | (missing) | Pulls old MCP via `gradio[mcp]` transitive dep |
-
- **Background:**
- - `ToolUseContent` was added in MCP spec **2025-11-25** via **SEP-1577 (Sampling With Tools)**
- - Our pyproject.toml correctly pins `mcp>=1.23.0` (for security fix GHSA-9h52-p55h-vw2f)
- - HuggingFace Spaces uses `requirements.txt`, NOT `pyproject.toml`
- - `gradio[mcp]>=6.0.0` pulls in MCP as a transitive dependency
- - Without an explicit pin, Gradio was pulling an older MCP version lacking `ToolUseContent`
-
- ---
-
- ## Fix
-
- Added explicit MCP pin to `requirements.txt`:
-
- ```diff
-  # UI (Gradio with MCP server support - 6.0 required for css in launch())
-  gradio[mcp]>=6.0.0
- +
- +# Security: Pin mcp to fix GHSA-9h52-p55h-vw2f and ensure ToolUseContent exists
- +mcp>=1.23.0
- ```
-
- Also synced ALL dependencies between `pyproject.toml` and `requirements.txt` to prevent future drift.
-
- ---
-
- ## Changes Made
-
- **Files modified:**
- - `requirements.txt` - Full sync with `pyproject.toml`:
-   - Added `mcp>=1.23.0` (root cause fix)
-   - Added `beautifulsoup4>=4.12` (was missing)
-   - Fixed `huggingface-hub>=0.24.0` (was 0.20.0)
-   - Added upper bound to `agent-framework-core>=1.0.0b251120,<2.0.0`
-   - Added sync header comment with date
-
- ---
-
- ## Prevention
-
- 1. **Sync header**: `requirements.txt` now has "Last synced: YYYY-MM-DD" comment
- 2. **CI check**: Consider adding a pre-commit hook to validate requirements.txt matches pyproject.toml
-
- ---
-
- ## References
-
- - [MCP Python SDK Releases](https://github.com/modelcontextprotocol/python-sdk/releases)
- - [MCP Spec 2025-11-25 - Sampling With Tools](https://modelcontextprotocol.io/specification/2025-11-25/client/sampling)
- - [GHSA-9h52-p55h-vw2f](https://github.com/advisories/GHSA-9h52-p55h-vw2f) - MCP security advisory
-
- ---
-
- ## Verification
-
- After fix:
- 1. Deploy to HuggingFace Spaces
- 2. Verify app starts without errors
- 3. Verify MCP server responds at `/gradio_api/mcp/`
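The Prevention section floats a pre-commit hook to catch pyproject/requirements drift. A minimal sketch of such a check — flag any dependency name present in one file but not the other; the file paths and the name-level comparison policy are assumptions, not an existing project script:

```python
#!/usr/bin/env python3
"""Fail if requirements.txt and pyproject.toml disagree on dependency names."""
import sys
import tomllib

from packaging.requirements import Requirement

def req_names(lines: list[str]) -> set[str]:
    names = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            names.add(Requirement(line).name.lower())
    return names

def main() -> int:
    with open("pyproject.toml", "rb") as f:
        pyproject = req_names(tomllib.load(f)["project"]["dependencies"])
    with open("requirements.txt") as f:
        requirements = req_names(f.readlines())
    drift = pyproject ^ requirements  # symmetric difference = drift in either file
    if drift:
        print(f"Dependency drift between pyproject.toml and requirements.txt: {sorted(drift)}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```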
docs/bugs/archive/P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md DELETED

@@ -1,144 +0,0 @@
- # P0 Bug Report: Orchestrator Dedup + Judge Failures
-
- ## Status
- - **Date:** 2025-11-29
- - **Priority:** P0 (Blocker - Simple mode broken on HF Spaces)
- - **Component:** `src/orchestrator.py`, `src/agent_factory/judges.py`
- - **Resolution:** FIXED in commits `5e761eb`, `2588375`
-
- ---
-
- ## Symptoms
-
- When running Simple mode (free tier) on HuggingFace Spaces:
-
- 1. **Judge always returns 0% confidence** → loops forever with "continue"
- 2. **Deduplication removes ALL evidence** after iteration 1
- 3. **Never synthesizes** → user sees infinite loop
-
- ### Example Output
-
- ```
- 📚 SEARCH_COMPLETE: Found 20 new sources (19 total)        ← Iteration 1 OK
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%)   ← FAIL: 0% = fallback
-
- 📚 SEARCH_COMPLETE: Found 12 new sources (11 total)        ← Iteration 2 BROKEN
- ...
- 📚 SEARCH_COMPLETE: Found 31 new sources (0 total)         ← 0 TOTAL = all removed!
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%)   ← Still failing
- ```
-
- ---
-
- ## Root Cause Analysis
-
- ### Bug 1: Semantic Deduplication Removes Old Evidence
-
- **File:** `src/orchestrator.py:213-219`
-
- ```python
- # URL dedup (correct)
- seen_urls = {e.citation.url for e in all_evidence}
- unique_new = [e for e in new_evidence if e.citation.url not in seen_urls]
- all_evidence.extend(unique_new)
-
- # BUG: Passes ALL evidence (including old) to semantic dedup
- all_evidence = await self._deduplicate_and_rank(all_evidence, query)
- ```
-
- **Problem:** The `deduplicate()` function checks each item against the vector store. Items from iteration 1 are ALREADY in the store. When re-checked in iteration 2+, they find THEMSELVES (distance ≈ 0) and are removed as "duplicates".
-
- **Result:** After iteration 1, evidence count drops to 0.
-
- ### Bug 2: HF Inference Judge Always Failing
-
- **File:** `src/agent_factory/judges.py:186-254`
-
- **Evidence:** Judge returns this every time:
- - `confidence: 0.0`
- - `recommendation: "continue"`
- - Next queries are just the original query with suffixes
-
- This is the `_create_fallback_assessment()` response, meaning:
- - The HF Inference API calls are failing
- - All 3 fallback models (Llama, Mistral, Zephyr) are failing
- - Likely due to rate limits, quota, or model availability
-
- ---
-
- ## The Fix
-
- ### Fix 1: Only Dedup NEW Evidence (not all_evidence)
-
- ```python
- # Before (broken)
- all_evidence.extend(unique_new)
- all_evidence = await self._deduplicate_and_rank(all_evidence, query)
-
- # After (fixed)
- # Only dedup the NEW evidence against the store
- if unique_new:
-     unique_new = await self._deduplicate_new_evidence(unique_new, query)
-     all_evidence.extend(unique_new)
- ```
-
- Or simpler - disable semantic dedup until we fix it properly:
-
- ```python
- # Disable broken semantic dedup
- # all_evidence = await self._deduplicate_and_rank(all_evidence, query)
- ```
-
- ### Fix 2: Handle HF Inference Failures Gracefully
-
- Option A: After N failed judge calls, force synthesize with available evidence
- Option B: Increase retry count or add longer backoff
- Option C: Fall back to MockJudgeHandler (which DOES work) after failures
-
- ```python
- # In _create_fallback_assessment, track failures
- if self._consecutive_failures >= 3:
-     # Force synthesis instead of infinite loop
-     return JudgeAssessment(
-         sufficient=True,  # STOP
-         confidence=0.1,
-         recommendation="synthesize",
-         ...
-     )
- ```
-
- ---
-
- ## Test Plan
-
- - [ ] Disable semantic dedup OR fix to only process new items
- - [ ] Verify evidence accumulates across iterations (not drops to 0)
- - [ ] Test HF Inference with fresh HF_TOKEN
- - [ ] If HF keeps failing, fall back to MockJudgeHandler
- - [ ] Verify "synthesize" is eventually reached
- - [ ] Deploy and test on HF Space
-
- ---
-
- ## Priority Justification
-
- **P0** because:
- - Simple mode (free tier) is the DEFAULT experience
- - Currently produces infinite loop with no output
- - Users see "confidence: 0%" and think tool is broken
- - Blocks hackathon demo for users without API keys
-
- ---
-
- ## Quick Workaround
-
- Disable semantic dedup by setting `enable_embeddings=False` in orchestrator creation:
-
- ```python
- orchestrator = create_orchestrator(
-     ...
-     enable_embeddings=False,  # Disable broken dedup
- )
- ```
-
- Or users can enter an OpenAI/Anthropic API key to bypass HF Inference issues.
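Bug 1's failure mode (items matching themselves in the vector store) implies a specific ordering: check each new item against the store before inserting it. A hedged sketch of the `_deduplicate_new_evidence()` shape Fix 1 names; the store API (`similarity_search`, `add`) and the distance threshold are assumptions, not the repo's actual interface:

```python
from typing import Any

async def deduplicate_new_evidence(
    store: Any, new_evidence: list[Any], threshold: float = 0.05
) -> list[Any]:
    """Keep only items that are not near-duplicates of already-stored evidence."""
    unique = []
    for item in new_evidence:
        # Query FIRST, against evidence from earlier iterations only.
        matches = await store.similarity_search(item.content, k=1)
        if matches and matches[0].distance < threshold:
            continue  # semantic duplicate of something already collected
        # Insert AFTER the check, so the item can never match itself later.
        await store.add(item)
        unique.append(item)
    return unique
```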
docs/bugs/archive/P0_REPR_BUG_ROOT_CAUSE_ANALYSIS.md DELETED
@@ -1,99 +0,0 @@
- # P0: Event Handling Implementation Spec
-
- **Status**: FIXED
- **Priority**: P0
- **Source of Truth**: `reference_repos/microsoft-agent-framework/python/samples/autogen-migration/orchestrations/04_magentic_one.py`
-
- ---
-
- ## Root Cause (One Sentence)
-
- We were extracting content from `MagenticAgentMessageEvent.message` — **the wrong event type** — instead of using `MagenticAgentDeltaEvent.text` as the sole source of streaming content.
-
- ---
-
- ## The Fix: Correct Event Handling Per Microsoft SSOT
-
- | Event Type | Correct Usage | What We Were Doing (Wrong) |
- |------------|---------------|----------------------------|
- | `MagenticAgentDeltaEvent` | **Extract `.text`** - This is the ONLY source of content | Partially used, not accumulated |
- | `MagenticAgentMessageEvent` | **Signal only** - Agent turn complete. IGNORE `.message` | Extracting `.message.text` (hits repr bug) |
- | `MagenticFinalResultEvent` | **Extract `.message.text`** - Final synthesis result | Correct |
-
- ---
-
- ## Implementation: Accumulator Pattern
-
- From Microsoft's `04_magentic_one.py` (lines 108-138):
-
- ```python
- # Microsoft's Pattern
- async for event in workflow.run_stream(task):
-     if isinstance(event, MagenticAgentDeltaEvent):
-         # STREAM CONTENT: Accumulate and display
-         if event.text:
-             print(event.text, end="", flush=True)
-
-     elif isinstance(event, MagenticAgentMessageEvent):
-         # SIGNAL ONLY: Agent done. Print newline. DO NOT read .message
-         print()
-
-     elif isinstance(event, MagenticFinalResultEvent):
-         # FINAL RESULT: Safe to read .message.text
-         print(event.message.text)
- ```
-
- ---
-
- ## Our Implementation (`src/orchestrators/advanced.py`)
-
- **Status**: ✅ IMPLEMENTED (lines 241-308)
-
- ```python
- # 1. Accumulate streaming content (ONLY source of truth)
- if isinstance(event, MagenticAgentDeltaEvent):
-     if event.text:
-         current_message_buffer += event.text
-         yield AgentEvent(type="streaming", message=event.text, ...)
-
- # 2. Use buffer on completion signal (IGNORE event.message)
- if isinstance(event, MagenticAgentMessageEvent):
-     text_content = current_message_buffer or "Action completed (Tool Call)"
-     yield AgentEvent(message=f"{agent_name}: {text_content[:200]}...", ...)
-     current_message_buffer = ""  # Reset for next agent
-
- # 3. Final result - safe to extract
- if isinstance(event, MagenticFinalResultEvent):
-     text = self._extract_text(event.message)
-     yield AgentEvent(type="complete", message=text, ...)
- ```
-
- ---
-
- ## Why This Eliminates the Repr Bug
-
- The repr bug occurs at `_magentic.py:1730`:
-
- ```python
- text = last.text or str(last)  # Falls back to repr() for tool-only messages
- ```
-
- By **never reading** `MagenticAgentMessageEvent.message.text`, we never hit this code path.
-
- **The repr bug is eliminated by correct implementation — no upstream fix required.**
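-
- A minimal sketch of why that fallback produces garbage (hypothetical stand-in class; illustrative only):
-
- ```python
- from dataclasses import dataclass, field
-
- @dataclass
- class ToolOnlyMessage:
-     """Stand-in for a chat message that carries tool calls but no text."""
-     text: str = ""
-     tool_calls: list = field(default_factory=list)
-
- last = ToolOnlyMessage(tool_calls=[{"name": "search_pubmed"}])
- text = last.text or str(last)  # "" is falsy, so this falls back to the repr
- print(text)  # ToolOnlyMessage(text='', tool_calls=[{'name': 'search_pubmed'}])
- ```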
-
- ---
-
- ## Verification Checklist
-
- - [x] `MagenticAgentDeltaEvent.text` used as sole content source
- - [x] `MagenticAgentMessageEvent` used as signal only (buffer consumed, not `.message`)
- - [x] `MagenticFinalResultEvent.message.text` extracted for final result
- - [x] Buffer reset on agent switch and completion
- - [x] Remove dead code path in `_process_event()` that still calls `_extract_text` on `MagenticAgentMessageEvent`
-
- ---
-
- ## Remaining Cleanup
-
- ✅ **DONE** - Dead code paths for `MagenticAgentMessageEvent` and `MagenticAgentDeltaEvent` have been removed from `_process_event()`. Comments now explain these events are handled by the Accumulator Pattern in `run()`.
 
docs/bugs/archive/P0_SIMPLE_MODE_FORCED_SYNTHESIS_BYPASS.md DELETED
@@ -1,59 +0,0 @@
- # P0 BUG: Simple Mode Synthesis Bypass (WILL BE FIXED BY UNIFIED ARCHITECTURE)
-
- **Status**: BLOCKED - Waiting for upstream PR #2566
- **Priority**: P0 (Demo-blocking)
- **Discovered**: 2025-12-01
- **GitHub Issue**: [#113](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/113)
-
- ---
-
- ## Current State
-
- **`simple.py` is DELETED.** This bug existed in the old Simple Mode code.
-
- The bug will NOT be fixed by restoring Simple Mode. Instead, it will be **automatically fixed** when we complete the unified architecture (after upstream PR #2566 merges).
-
- ---
-
- ## The Bug (Historical)
-
- When the HuggingFace Inference API failed, Simple Mode's `_should_synthesize()` ignored forced synthesis signals due to overly strict thresholds.
-
- ```text
- ✅ JUDGE_COMPLETE: Assessment: synthesize (confidence: 10%)
- 🔄 LOOPING: Gathering more evidence... ← BUG: Should have synthesized!
- ```
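-
- An illustrative reconstruction of the failure mode (names and thresholds are hypothetical, not the exact deleted code):
-
- ```python
- def _should_synthesize(assessment) -> bool:
-     # Buggy: gates on confidence even when the judge explicitly recommends
-     # synthesis, so a forced "synthesize" with low confidence (e.g. 10%)
-     # is ignored and the loop keeps gathering evidence.
-     return assessment.recommendation == "synthesize" and assessment.confidence >= 0.5
-
- # Correct behavior would honor the explicit signal unconditionally:
- def should_synthesize_fixed(assessment) -> bool:
-     return assessment.recommendation == "synthesize"
- ```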
-
- ---
-
- ## Why Unified Architecture Fixes This
-
- | Architecture | How Termination Works |
- |--------------|----------------------|
- | **Old (Simple Mode)** | Custom `_should_synthesize()` with buggy thresholds |
- | **New (Unified)** | Manager agent respects "SUFFICIENT EVIDENCE" signals |
-
- The Manager agent in Advanced Mode already works correctly. By completing the unified architecture with HuggingFace support, we inherit that correct behavior.
-
- **No need to patch `_should_synthesize()` because the code is deleted.**
-
- ---
-
- ## Path Forward
-
- 1. **Wait** for upstream PR #2566 to merge (fixes repr bug)
- 2. **Update** `agent-framework` dependency
- 3. **Verify** Advanced Mode + HuggingFace works
- 4. **Done** - This bug is gone (no `_should_synthesize()` thresholds)
-
- ---
-
- ## Related
-
- | Reference | Description |
- |-----------|-------------|
- | [ARCHITECTURE.md](../ARCHITECTURE.md) | Current state and unified plan |
- | [SPEC_16](../specs/SPEC_16_UNIFIED_CHAT_CLIENT_ARCHITECTURE.md) | Unified architecture spec |
- | [Issue #105](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/105) | GitHub tracking |
- | [Upstream #2562](https://github.com/microsoft/agent-framework/issues/2562) | Framework bug |
- | [Upstream PR #2566](https://github.com/microsoft/agent-framework/pull/2566) | Framework fix |
 
docs/bugs/archive/P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md DELETED
@@ -1,254 +0,0 @@
- # P0 Bug Report: Simple Mode Never Synthesizes
-
- ## Status
- - **Date:** 2025-11-29
- - **Priority:** P0 (Blocker - Simple mode produces useless output)
- - **Component:** `src/orchestrators/simple.py`, `src/agent_factory/judges.py`, `src/prompts/judge.py`
- - **Environment:** Simple mode **WITHOUT OpenAI key** (HuggingFace Inference free tier)
-
- ---
-
- ## Symptoms
-
- When running Simple mode with a real research question:
-
- 1. **Judge never recommends "synthesize"** even with 455 sources and 90% confidence
- 2. **Confidence drops to 0%** in late iterations (API failures or context overflow)
- 3. **Search derails** to tangential topics (bone health, muscle mass instead of libido)
- 4. **Max iterations reached** → User gets garbage output (just citations, no synthesis)
-
- ### Example Output (Real Run)
-
- ```
- 🔍 SEARCHING: What drugs improve female libido post-menopause?
- 📚 SEARCH_COMPLETE: Found 30 new sources (30 total)
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 70%) ← Never "synthesize"
-
- ... 8 more iterations ...
-
- 📚 SEARCH_COMPLETE: Found 10 new sources (429 total)
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%) ← API failure?
-
- 📚 SEARCH_COMPLETE: Found 26 new sources (455 total)
- ✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%) ← Still failing
-
- ## Partial Analysis (Max Iterations Reached) ← GARBAGE OUTPUT
- ### Question
- What drugs improve female libido post-menopause?
- ### Status
- Maximum search iterations reached.
- ### Citations
- 1. [Tribulus terrestris and female reproductive...]
- 2. ...
- ---
- *Consider searching with more specific terms* ← NO SYNTHESIS AT ALL
- ```
-
- ---
-
- ## Root Cause Analysis
-
- ### Bug 1: Judge Never Says "sufficient=True"
-
- **File:** `src/prompts/judge.py:22-25`
-
- ```python
- 3. **Sufficiency**: Evidence is sufficient when:
-    - Combined scores >= 12 AND
-    - At least one specific drug candidate identified AND
-    - Clear mechanistic rationale exists
- ```
-
- **Problem:** The prompt is too conservative. With 455 sources spanning testosterone, DHEA, estrogen, oxytocin, etc., the judge should have identified candidates and said "synthesize". But:
-
- 1. LLM may not be extracting drug candidates from evidence properly
- 2. The "AND" conditions are too strict - evidence can be "good enough" without hitting all criteria
- 3. The recommendation "continue" seems to be the default state
-
- **Evidence:** Output shows 70-90% confidence but still "continue" - the judge is confident but never satisfied.
-
- ### Bug 2: Confidence Drops to 0% (Late Iteration Failures)
-
- **File:** `src/agent_factory/judges.py:150-183`
-
- The `_create_fallback_assessment()` returns:
- - `confidence: 0.0`
- - `recommendation: "continue"`
-
- **Problem:** In iterations 9-10, something failed:
- - Context too long (455 sources × ~1500 chars = 680K chars → token limit exceeded)
- - API rate limit hit
- - Network timeout
-
- **Evidence:** Confidence went from 80%→0%→0% in final iterations - this is the fallback response.
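-
- A minimal sketch of the fallback shape described above (field names taken from the doc; the return type is simplified to a dict for illustration):
-
- ```python
- def _create_fallback_assessment() -> dict:
-     # Any LLM failure (timeout, rate limit, context overflow) becomes a
-     # zero-confidence "continue", so failures silently extend the search
-     # loop instead of surfacing an error or forcing synthesis.
-     return {"sufficient": False, "confidence": 0.0, "recommendation": "continue"}
- ```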
84
-
85
- ### Bug 3: Search Derailment
86
-
87
- **Evidence from logs:**
88
- ```
89
- Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
90
- Next searches: testosterone therapy in postmenopausal women, mechanisms of testosterone...
91
- ```
92
-
93
- **Problem:** Judge's `next_search_queries` drift off-topic. "Bone health" and "muscle mass" are tangential to "female libido". The judge should stay focused on the original question.
94
-
95
- ### Bug 4: Partial Synthesis is Garbage
96
-
97
- **File:** `src/orchestrators/simple.py:432-470`
98
-
99
- ```python
100
- def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
101
- """Generate a partial synthesis when max iterations reached."""
102
- citations = "\n".join([...]) # Just citations
103
-
104
- return f"""## Partial Analysis (Max Iterations Reached)
105
- ### Question
106
- {query}
107
- ### Status
108
- Maximum search iterations reached. The evidence gathered may be incomplete.
109
- ### Evidence Collected
110
- Found {len(evidence)} sources.
111
- ### Citations
112
- {citations}
113
- ---
114
- *Consider searching with more specific terms*
115
- """
116
- ```
117
-
118
- **Problem:** When max iterations reached, we have 455 sources but output NO analysis. We should:
119
- 1. Force a synthesis call to the LLM
120
- 2. Or at minimum generate drug candidates/findings from the last good assessment
121
- 3. Not just dump citations and give up
122
-
123
- ---
124
-
125
- ## The Fix
126
-
127
- ### Fix 1: Lower the Bar for "synthesize"
128
-
129
- **Option A:** Change prompt to be less strict:
130
- ```python
131
- SYSTEM_PROMPT = """...
132
- 3. **Sufficiency**: Evidence is sufficient when:
133
- - Combined scores >= 10 (was 12) OR
134
- - Confidence >= 80% with drug candidates identified OR
135
- - 5+ iterations completed with 100+ sources
136
- """
137
- ```
138
-
139
- **Option B:** Add iteration-based heuristic in orchestrator:
140
- ```python
141
- # If we have lots of evidence and high confidence, force synthesis
142
- if iteration >= 5 and len(all_evidence) > 100 and assessment.confidence > 0.7:
143
- assessment.sufficient = True
144
- assessment.recommendation = "synthesize"
145
- ```
146
-
147
- ### Fix 2: Handle Context Overflow
148
-
149
- **File:** `src/agent_factory/judges.py`
150
-
151
- Before sending to LLM, cap evidence:
152
- ```python
153
- async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
154
- # Cap at 50 most recent/relevant to avoid token overflow
155
- if len(evidence) > 50:
156
- evidence = evidence[:50] # Or use embedding similarity to select best 50
157
- ```
158
-
159
- ### Fix 3: Keep Search Focused
160
-
161
- **File:** `src/prompts/judge.py`
162
-
163
- Add to prompt:
164
- ```python
165
- SYSTEM_PROMPT = """...
166
- ## Search Query Rules
167
-
168
- When suggesting next_search_queries:
169
- - Stay focused on the ORIGINAL question
170
- - Do NOT drift to tangential topics (e.g., don't search "bone health" for a libido question)
171
- - Refine existing good terms, don't explore random associations
172
- """
173
- ```
174
-
175
- ### Fix 4: Generate Real Synthesis on Max Iterations
176
-
177
- **File:** `src/orchestrators/simple.py`
178
-
179
- ```python
180
- def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
181
- """Generate a REAL synthesis when max iterations reached."""
182
-
183
- # Get the last assessment's data (if available)
184
- last_assessment = self.history[-1]["assessment"] if self.history else None
185
-
186
- drug_candidates = last_assessment.get("details", {}).get("drug_candidates", []) if last_assessment else []
187
- key_findings = last_assessment.get("details", {}).get("key_findings", []) if last_assessment else []
188
-
189
- drug_list = "\n".join([f"- **{d}**" for d in drug_candidates]) or "- See sources below for candidates"
190
- findings_list = "\n".join([f"- {f}" for f in key_findings[:5]]) or "- Review citations for findings"
191
-
192
- citations = "\n".join([
193
- f"{i + 1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
194
- for i, e in enumerate(evidence[:10])
195
- ])
196
-
197
- return f"""## Drug Repurposing Analysis (Partial)
198
-
199
- ### Question
200
- {query}
201
-
202
- ### Status
203
- ⚠️ Maximum iterations reached. Analysis based on {len(evidence)} sources.
204
-
205
- ### Drug Candidates Identified
206
- {drug_list}
207
-
208
- ### Key Findings
209
- {findings_list}
210
-
211
- ### Top Citations ({len(evidence)} sources)
212
- {citations}
213
-
214
- ---
215
- *Analysis may be incomplete. Consider refining query or adding API key for better results.*
216
- """
217
- ```
218
-
219
- ---
220
-
221
- ## Test Plan
222
-
223
- - [ ] Verify judge says "synthesize" within 5 iterations for good queries
224
- - [ ] Test with 500+ sources to ensure no token overflow
225
- - [ ] Verify search stays on-topic (no bone/muscle tangents for libido query)
226
- - [ ] Verify partial synthesis shows drug candidates (not just citations)
227
- - [ ] Test with MockJudgeHandler to confirm issue is in LLM behavior
228
- - [ ] Add unit test: `test_judge_synthesizes_with_good_evidence`
229
-
230
- ---
231
-
232
- ## Priority Justification
233
-
234
- **P0** because:
235
- - Simple mode is the DEFAULT for users without API keys
236
- - 455 sources found but ZERO useful output generated
237
- - User waited 10 iterations just to get a citation dump
238
- - Makes the tool look completely broken
239
- - Blocks hackathon demo effectiveness
240
-
241
- ---
242
-
243
- ## Immediate Workaround
244
-
245
- 1. Use **Advanced mode** (requires OpenAI key) - it has its own synthesis logic
246
- 2. Or use **fewer iterations** (MAX_ITERATIONS=3) to hit partial synthesis faster
247
- 3. Or manually review the citations (they ARE relevant, just not synthesized)
248
-
249
- ---
250
-
251
- ## Related Issues
252
-
253
- - `P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md` - Fixed dedup issue, but synthesis problem persists
254
- - `ACTIVE_BUGS.md` - Update when this is resolved
 
docs/bugs/archive/P0_SYNTHESIS_PROVIDER_MISMATCH.md DELETED
@@ -1,273 +0,0 @@
- # P0 - Systemic Provider Mismatch Across All Modes
-
- **Status:** RESOLVED
- **Priority:** P0 (Blocker for Free Tier/Demo)
- **Found:** 2025-11-30 (during Audit)
- **Resolved:** 2025-11-30
- **Component:** Multiple files across orchestrators, agents, services
-
- ## Resolution Summary
-
- The critical provider mismatch bug has been fixed by implementing auto-detection in `src/agent_factory/judges.py`.
- The `get_model()` function now checks for actual API key availability (`has_openai_key`, `has_anthropic_key`, `has_huggingface_key`)
- instead of relying on the static `settings.llm_provider` configuration.
-
- ### Fix Details
-
- - **Auto-Detection Implemented**: `get_model()` prioritizes OpenAI > Anthropic > HuggingFace based on *available keys*.
- - **Fail-Fast on No Keys**: If no API keys are configured, `get_model()` raises `ConfigurationError` with a clear message.
- - **HuggingFace Requires Token**: Free Tier via `HuggingFaceModel` requires `HF_TOKEN` (PydanticAI requirement).
- - **Synthesis Fallback**: When `get_model()` fails, synthesis gracefully falls back to the template.
- - **Audit Fixes Applied**:
-   - Replaced manual `os.getenv` checks with centralized `settings` properties in `src/app.py`.
-   - Added logging to `src/services/statistical_analyzer.py` (fixed silent `pass`).
-   - Narrowed exception handling in `src/tools/pubmed.py`.
-   - Optimized string search in `src/tools/code_execution.py`.
-
- ### Key Clarification
-
- The **Free Tier** in Simple Mode uses `HFInferenceJudgeHandler` (which uses `huggingface_hub.InferenceClient`)
- for judging - this does NOT require `HF_TOKEN`. However, synthesis via `get_model()` uses PydanticAI's
- `HuggingFaceModel` which DOES require `HF_TOKEN`. When no tokens are configured, synthesis falls back to
- the template-based summary (which is still useful).
-
- ### Verification
-
- - **Unit Tests**: 5 new TDD tests in `tests/unit/agent_factory/test_get_model_auto_detect.py` pass.
- - **All Tests**: 309 tests pass (`make check` succeeds).
- - **Regression Tests**: Fixed and verified `tests/unit/agent_factory/test_judges_factory.py`.
-
- ---
-
- ## Symptom (Archive)
-
- When running in "Simple Mode" (Free Tier / No API Key), the synthesis step fails to generate a narrative and falls back to a structured summary template. The user sees:
-
- ```text
- > ⚠️ Note: AI narrative synthesis unavailable. Showing structured summary.
- > _Error: OpenAIError_
- ```
-
- ## Affected Files (COMPREHENSIVE AUDIT)
-
- ### Files Calling `get_model()` Directly (8 locations)
-
- | File | Line | Context | Impact |
- |------|------|---------|--------|
- | `simple.py` | 547 | Synthesis step | Free Tier broken |
- | `statistical_analyzer.py` | 75 | Analysis agent | Free Tier broken |
- | `judge_agent_llm.py` | 18 | LLM Judge | Free Tier broken |
- | `graph/nodes.py` | 177 | LangGraph hypothesis | Free Tier broken |
- | `graph/nodes.py` | 249 | LangGraph synthesis | Free Tier broken |
- | `report_agent.py` | 45 | Report generation | Free Tier broken |
- | `hypothesis_agent.py` | 44 | Hypothesis generation | Free Tier broken |
- | `judges.py` | 100 | JudgeHandler default | OK (accepts param) |
-
- ### Files Hardcoding `OpenAIChatClient` (Architecturally OpenAI-Only)
-
- | File | Lines | Context |
- |------|-------|---------|
- | `advanced.py` | 100, 121 | Manager client |
- | `magentic_agents.py` | 29, 70, 129, 173 | All 4 agents |
- | `retrieval_agent.py` | 62 | Retrieval agent |
- | `code_executor_agent.py` | 52 | Code executor |
- | `llm_factory.py` | 42 | Factory default |
-
- **Note:** Advanced mode is architecturally locked to OpenAI via `agent_framework.openai.OpenAIChatClient`. This is by design - see `app.py:188-194` which falls back to Simple mode if no OpenAI key. However, users are not clearly informed of this limitation.
-
- ## Root Cause
-
- **Settings/Runtime Sync Gap - Two Separate Backend Selection Systems.**
-
- The codebase has **two independent** systems for selecting the LLM backend:
- 1. `settings.llm_provider` (config.py default: "openai")
- 2. `app.py` runtime detection via `os.getenv()` checks
-
- These are **never synchronized**, causing the Judge and Synthesis steps to use different backends.
-
- ### Detailed Call Chain
-
- 1. **`src/app.py:115-126`** (runtime detection):
-    ```python
-    # app.py bypasses settings entirely for JudgeHandler selection
-    elif os.getenv("OPENAI_API_KEY"):
-        judge_handler = JudgeHandler(model=None, domain=domain)
-    elif os.getenv("ANTHROPIC_API_KEY"):
-        judge_handler = JudgeHandler(model=None, domain=domain)
-    else:
-        judge_handler = HFInferenceJudgeHandler(domain=domain)  # Free Tier
-    ```
-    **Note:** This creates the correct handler but does NOT update `settings.llm_provider`.
-
- 2. **`src/orchestrators/simple.py:546-552`** (synthesis step):
-    ```python
-    from src.agent_factory.judges import get_model
-    agent: Agent[None, str] = Agent(model=get_model(), ...)  # <-- BUG!
-    ```
-    Synthesis calls `get_model()` directly instead of using the injected judge's model.
-
- 3. **`src/agent_factory/judges.py:56-78`** (`get_model()`):
-    ```python
-    def get_model() -> Any:
-        llm_provider = settings.llm_provider  # <-- Reads from settings (still "openai")
-        # ...
-        openai_provider = OpenAIProvider(api_key=settings.openai_api_key)  # <-- None!
-        return OpenAIChatModel(settings.openai_model, provider=openai_provider)
-    ```
-    **Result:** Creates OpenAI model with `api_key=None` → `OpenAIError`
-
- ### Why Free Tier Fails
-
- | Step | System Used | Backend Selected |
- |------|-------------|------------------|
- | JudgeHandler | `app.py` runtime | HFInferenceJudgeHandler ✅ |
- | Synthesis | `settings.llm_provider` | OpenAI (default) ❌ |
-
- The Judge works because app.py explicitly creates `HFInferenceJudgeHandler`.
- Synthesis fails because it calls `get_model()` which reads `settings.llm_provider = "openai"` (unchanged from default).
-
- ## Impact
-
- - **User Experience:** Free tier users (Demo users) never see the high-quality narrative synthesis, only the fallback.
- - **System Integrity:** The orchestrator ignores the runtime backend selection.
-
- ## Implemented Fix
-
- **Strategy: Fix `get_model()` to Auto-Detect Available Provider**
-
- ### Actual Implementation (Merged)
-
- **File:** `src/agent_factory/judges.py`
-
- This is the **single point of fix** that resolves all 7 broken `get_model()` call sites.
-
- ```python
- def get_model() -> Any:
-     """Get the LLM model based on available API keys.
-
-     Priority order:
-     1. OpenAI (if OPENAI_API_KEY set)
-     2. Anthropic (if ANTHROPIC_API_KEY set)
-     3. HuggingFace (if HF_TOKEN set)
-
-     Raises:
-         ConfigurationError: If no API keys are configured.
-
-     Note: settings.llm_provider is ignored in favor of actual key availability.
-     This ensures the model matches what app.py selected for JudgeHandler.
-     """
-     from src.utils.exceptions import ConfigurationError
-
-     # Priority 1: OpenAI (most common, best tool calling)
-     if settings.has_openai_key:
-         openai_provider = OpenAIProvider(api_key=settings.openai_api_key)
-         return OpenAIChatModel(settings.openai_model, provider=openai_provider)
-
-     # Priority 2: Anthropic
-     if settings.has_anthropic_key:
-         provider = AnthropicProvider(api_key=settings.anthropic_api_key)
-         return AnthropicModel(settings.anthropic_model, provider=provider)
-
-     # Priority 3: HuggingFace (requires HF_TOKEN)
-     if settings.has_huggingface_key:
-         model_name = settings.huggingface_model or "meta-llama/Llama-3.1-70B-Instruct"
-         hf_provider = HuggingFaceProvider(api_key=settings.hf_token)
-         return HuggingFaceModel(model_name, provider=hf_provider)
-
-     # No keys configured - fail fast with clear error
-     raise ConfigurationError(
-         "No LLM API key configured. Set one of: OPENAI_API_KEY, ANTHROPIC_API_KEY, or HF_TOKEN"
-     )
- ```
-
- **Why this works:**
- - Single fix location updates all 7 broken call sites
- - Matches app.py's detection logic (key availability, not settings.llm_provider)
- - HuggingFace works when HF_TOKEN is available
- - Raises clear error when no keys configured (callers can catch and fallback)
- - No changes needed to orchestrators, agents, or services
-
- ### What This Does NOT Fix (By Design)
-
- **Advanced Mode remains OpenAI-only.** The following files use `agent_framework.openai.OpenAIChatClient` which only supports OpenAI:
-
- - `advanced.py` (Manager + agents)
- - `magentic_agents.py` (SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent)
- - `retrieval_agent.py`, `code_executor_agent.py`
-
- This is **by design** - the Microsoft Agent Framework library (`agent-framework-core`) only provides `OpenAIChatClient`. To support other providers in Advanced mode would require:
- 1. Wait for `agent-framework` to add Anthropic/HuggingFace clients, OR
- 2. Write our own `ChatClient` implementations (significant effort)
-
- **The current app.py behavior is correct:** it falls back to Simple mode when no OpenAI key is present (lines 188-194). The UI message could be clearer about why.
-
- ## Test Plan (Implemented)
-
- ### Unit Tests (Verified Passing)
-
- ```python
- # tests/unit/agent_factory/test_get_model_auto_detect.py
-
- import pytest
-
- # Model classes asserted below (pydantic_ai module paths, matching the fix above)
- from pydantic_ai.models.anthropic import AnthropicModel
- from pydantic_ai.models.huggingface import HuggingFaceModel
- from pydantic_ai.models.openai import OpenAIChatModel
-
- from src.agent_factory.judges import get_model
- from src.utils.config import settings
- from src.utils.exceptions import ConfigurationError
-
- class TestGetModelAutoDetect:
-     """Test that get_model() auto-detects available providers."""
-
-     def test_returns_openai_when_key_present(self, monkeypatch):
-         """OpenAI key present → OpenAI model."""
-         monkeypatch.setattr(settings, "openai_api_key", "sk-test")
-         monkeypatch.setattr(settings, "anthropic_api_key", None)
-         monkeypatch.setattr(settings, "hf_token", None)
-         model = get_model()
-         assert isinstance(model, OpenAIChatModel)
-
-     def test_returns_anthropic_when_only_anthropic_key(self, monkeypatch):
-         """Only Anthropic key → Anthropic model."""
-         monkeypatch.setattr(settings, "openai_api_key", None)
-         monkeypatch.setattr(settings, "anthropic_api_key", "sk-ant-test")
-         monkeypatch.setattr(settings, "hf_token", None)
-         model = get_model()
-         assert isinstance(model, AnthropicModel)
-
-     def test_returns_huggingface_when_hf_token_present(self, monkeypatch):
-         """HF_TOKEN present (no paid keys) → HuggingFace model."""
-         monkeypatch.setattr(settings, "openai_api_key", None)
-         monkeypatch.setattr(settings, "anthropic_api_key", None)
-         monkeypatch.setattr(settings, "hf_token", "hf_test_token")
-         model = get_model()
-         assert isinstance(model, HuggingFaceModel)
-
-     def test_raises_error_when_no_keys(self, monkeypatch):
-         """No keys at all → ConfigurationError."""
-         monkeypatch.setattr(settings, "openai_api_key", None)
-         monkeypatch.setattr(settings, "anthropic_api_key", None)
-         monkeypatch.setattr(settings, "hf_token", None)
-         with pytest.raises(ConfigurationError) as exc_info:
-             get_model()
-         assert "No LLM API key configured" in str(exc_info.value)
-
-     def test_openai_takes_priority_over_anthropic(self, monkeypatch):
-         """Both keys present → OpenAI wins."""
-         monkeypatch.setattr(settings, "openai_api_key", "sk-test")
-         monkeypatch.setattr(settings, "anthropic_api_key", "sk-ant-test")
-         model = get_model()
-         assert isinstance(model, OpenAIChatModel)
- ```
-
- ### Full Test Suite
-
- ```bash
- $ make check
- # 309 passed in 238.16s (0:03:58)
- # All checks passed!
- ```
-
- ### Manual Verification
-
- 1. **Unset all API keys**: `unset OPENAI_API_KEY ANTHROPIC_API_KEY HF_TOKEN`
- 2. **Run app**: `uv run python -m src.app`
- 3. **Submit query**: "What drugs improve female libido?"
- 4. **Verify**: Synthesis falls back to template (shows `ConfigurationError` in logs, but user sees structured summary)
 
docs/bugs/archive/P1_ADVANCED_MODE_UNINTERPRETABLE_CHAIN_OF_THOUGHT.md DELETED
@@ -1,184 +0,0 @@
- # P1: Advanced Mode Exposes Uninterpretable Chain-of-Thought Events
-
- **Priority**: P1 (UX Degradation)
- **Component**: `src/orchestrators/advanced.py`
- **Status**: Resolved
- **Issue**: [#106](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/106)
- **PR**: [#107](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/pull/107)
- **Created**: 2025-12-01
- **Resolved**: 2025-12-01
-
- ## Summary
-
- The Advanced orchestrator exposes raw internal framework events from `agent-framework-core` directly to users. These events contain internal manager bookkeeping (task assignments, ledgers, instructions) that are:
-
- 1. Truncated mid-sentence at 200 characters
- 2. Use internal framework terminology (`user_task`, `task_ledger`, `instruction`)
- 3. Shown with misleading "JUDGING" event type
- 4. Not meaningful to end users
-
- ## Resolution
-
- Implemented "Smart Filter + Transform" logic in `src/orchestrators/advanced.py`:
-
- 1. **Filtered**: `task_ledger` and `instruction` events are now hidden.
- 2. **Transformed**: `user_task` events are mapped to `type="progress"` with a friendly "Manager assigning research task..." message.
- 3. **Smart Truncation**: Text is now truncated at sentence boundaries or word boundaries, preventing mid-word cuts.
-
- Verified with new unit tests in `tests/unit/orchestrators/test_advanced_events.py`.
-
- ## Example of Bad Output
-
- ```
- 🧠 **JUDGING**: Manager (user_task): Research sexual health and wellness interventions for: sildenafil mechanism ##...
-
- 🧠 **JUDGING**: Manager (task_ledger): We are working to address the following user request: Research sexual healt...
-
- 🧠 **JUDGING**: Manager (instruction): Conduct targeted searches on PubMed, ClinicalTrials.gov, and Europe PMC to ga...
- ```
-
- Users see:
- - Raw internal prompts being passed between manager and agents
- - Truncated text that cuts off mid-word ("healt...", "ga...")
- - Technical jargon ("task_ledger") with no context
- - All events labeled as "JUDGING" even when they're task assignments
-
- ## Root Cause Analysis
-
- ### The Chain of Issues
-
- | Location | Issue |
- |----------|-------|
- | `src/orchestrators/advanced.py:363-370` | `MagenticOrchestratorMessageEvent` raw events exposed without filtering |
- | `src/orchestrators/advanced.py:368` | `event.kind` values (`user_task`, `task_ledger`, `instruction`) are internal framework concepts |
- | `src/orchestrators/advanced.py:368` | Hard truncation: `text[:200]...` breaks mid-sentence |
- | `src/orchestrators/advanced.py:367` | All manager events mapped to `type="judging"` regardless of actual purpose |
- | `src/orchestrators/advanced.py:380` | Agent messages also truncated at 200 chars |
- | `src/utils/models.py:136` | `"judging": "🧠"` icon shown for all these internal events |
- | `src/app.py:248` | Events displayed verbatim via `event.to_markdown()` |
-
- ### Code Path
-
- ```
- agent-framework-core (Microsoft)
-
- MagenticOrchestratorMessageEvent(kind="task_ledger", message="...")
-
- advanced.py:_process_event() - NO FILTERING
-
- AgentEvent(type="judging", message=f"Manager ({event.kind}): {text[:200]}...")
-
- models.py:to_markdown() → "🧠 **JUDGING**: Manager (task_ledger): ..."
-
- app.py → Displayed to user verbatim
- ```
-
- ## Impact
-
- 1. **User Confusion**: Users see internal framework bookkeeping, not meaningful progress
- 2. **Truncated Gibberish**: 200-char limit cuts prompts mid-sentence, making them uninterpretable
- 3. **Misleading Labels**: "JUDGING" event type is wrong - these are task assignments
- 4. **No Actionable Info**: Users can't understand what the system is actually doing
-
- ## Proposed Solutions
-
- ### Option A: Filter Internal Events (Minimal)
-
- Skip internal manager events entirely - they're framework bookkeeping:
-
- ```python
- def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
-     if isinstance(event, MagenticOrchestratorMessageEvent):
-         # Skip internal framework bookkeeping events
-         if event.kind in ("user_task", "task_ledger", "instruction"):
-             return None  # Don't expose to users
-     # ... rest of handling
- ```
-
- **Pros**: Simple, removes noise
- **Cons**: Users lose visibility into manager activity
-
- ### Option B: Transform to User-Friendly Messages (Better UX)
-
- Map internal events to meaningful user messages:
-
- ```python
- MANAGER_EVENT_MESSAGES = {
-     "user_task": "Manager received research task",
-     "task_ledger": "Manager tracking task progress",
-     "instruction": "Manager assigning work to agent",
- }
-
- def _process_event(self, event: Any, iteration: int) -> AgentEvent | None:
-     if isinstance(event, MagenticOrchestratorMessageEvent):
-         if event.kind in MANAGER_EVENT_MESSAGES:
-             return AgentEvent(
-                 type="progress",  # Not "judging"!
-                 message=MANAGER_EVENT_MESSAGES[event.kind],
-                 iteration=iteration,
-             )
- ```
-
- **Pros**: Users see meaningful progress, correct event types
- **Cons**: More code, loses raw detail for debugging
-
- ### Option C: Smart Truncation + Verbose Mode
-
- 1. Truncate at sentence boundaries, not hard character limit
- 2. Add `verbose_mode` setting that shows full internal events for debugging
- 3. Use appropriate event types based on `event.kind`
-
- ```python
- def _smart_truncate(self, text: str, max_len: int = 200) -> str:
-     """Truncate at sentence boundary."""
-     if len(text) <= max_len:
-         return text
-     # Find last sentence boundary before limit
-     truncated = text[:max_len]
-     last_period = truncated.rfind(". ")
-     if last_period > max_len // 2:
-         return truncated[:last_period + 1]
-     return truncated.rsplit(" ", 1)[0] + "..."
- ```
-
- ### Recommended Approach
-
- **Combine Option A + B** (sketched after this list):
-
- 1. **Default**: Filter out `task_ledger` and `instruction` events (pure bookkeeping)
- 2. **Transform**: `user_task` → "Assigning research task to agents"
- 3. **Proper Types**: Use `"progress"` not `"judging"` for manager events
- 4. **Future**: Add verbose mode for debugging
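-
- A combined sketch of items 1-3 (event shape assumed from the table above; illustrative, not the exact code merged in PR #107):
-
- ```python
- from src.utils.models import AgentEvent  # path per "Files to Modify" below
-
- HIDDEN_KINDS = {"task_ledger", "instruction"}  # pure framework bookkeeping
- FRIENDLY_KINDS = {"user_task": "Manager assigning research task..."}
-
- def process_manager_event(event, iteration: int) -> AgentEvent | None:
-     """Filter or transform a manager event before it reaches the UI."""
-     if event.kind in HIDDEN_KINDS:
-         return None  # hidden entirely
-     if event.kind in FRIENDLY_KINDS:
-         # "progress", not "judging": these are task assignments
-         return AgentEvent(type="progress", message=FRIENDLY_KINDS[event.kind], iteration=iteration)
-     return None  # unknown kinds stay hidden in this sketch
- ```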
-
- ## Files to Modify
-
- 1. `src/orchestrators/advanced.py:361-410` - `_process_event()` method
- 2. `src/utils/models.py:107-123` - Add new event types if needed
- 3. `tests/unit/orchestrators/test_advanced_timeout.py` - Update assertions
-
- ## Related Issues
-
- - P0: Advanced Mode Timeout No Synthesis (FIXED in PR #104)
- - This P1 was discovered while testing the P0 fix
-
- ## Testing the Bug
-
- ```python
- import asyncio
- from src.orchestrators.advanced import AdvancedOrchestrator
-
- async def test():
-     orch = AdvancedOrchestrator(max_rounds=3)
-     async for event in orch.run("sildenafil mechanism"):
-         if "Manager" in event.message:
-             print(f"[{event.type}] {event.message}")
-             # You'll see uninterpretable output
-
- asyncio.run(test())
- ```
-
- ## References
-
- - Microsoft Agent Framework: https://github.com/microsoft/agent-framework
- - AgentEvent model: `src/utils/models.py:104`
- - Advanced orchestrator: `src/orchestrators/advanced.py`
 
docs/bugs/archive/P1_FREE_TIER_TOOL_EXECUTION_FAILURE.md DELETED
@@ -1,319 +0,0 @@
- # P1 Bug: Free Tier Tool Execution Failure
-
- **Date**: 2025-12-03
- **Status**: FIXED (PR fix/P1-free-tier-tool-execution)
- **Severity**: P1 (Critical - Free Tier Completely Broken)
- **Component**: HuggingFaceChatClient + Together.ai Routing + Tool Calling
- **Resolution**: Removed premature `__function_invoking_chat_client__ = True` marker from class body
-
- ---
-
- ## Executive Summary
-
- The Free Tier (HuggingFace) is fundamentally broken due to **multiple interacting issues** that cause tool calls to fail, resulting in garbage output, hallucinated results, and raw JSON appearing in the UI.
-
- **This is NOT a simple 7B model issue** - it's a chain of infrastructure and code problems.
-
- ---
-
- ## Symptoms
-
- Users on Free Tier see:
-
- 1. **Garbage tokens**: "oleon", "UrlParser", "MemoryWarning", "PostalCodes"
- 2. **Raw tool call XML tags**: `<tool_call>`, `</tool_call>` appearing as text
- 3. **Raw JSON tool calls**: `{"name": "search_pubmed", "arguments": {...}}`
- 4. **Hallucinated tool results**: Fake JSON responses that were never returned by actual tools:
-    ```json
-    {"response": "[{'title': 'Effect of Flibanserin...', ...}]"}
-    ```
- 5. **No actual database searches**: PubMed, ClinicalTrials.gov never queried
-
- ---
-
- ## Root Cause Analysis
-
- ### Cause 1: Model Routed to Third-Party Provider (Together.ai)
-
- **Discovery**: Qwen2.5-7B-Instruct is NOT served by native HuggingFace infrastructure.
-
- ```python
- # API response from HuggingFace:
- {
-     "inferenceProviderMapping": {
-         "together": {
-             "status": "live",
-             "providerId": "Qwen/Qwen2.5-7B-Instruct-Turbo"  # <-- TURBO variant!
-         },
-         "featherless-ai": {
-             "status": "live",
-             "providerId": "Qwen/Qwen2.5-7B-Instruct"
-         }
-     }
- }
- ```
-
- **Impact**:
- - Native HF-inference returns 404 for this model
- - All requests route through Together.ai
- - Together serves a "Turbo" variant, not the original
- - We cannot control how Together handles tool calling
-
- ### Cause 2: Qwen2.5 Uses XML-Style Tool Calling Format
-
- **Discovery**: The model's chat template instructs it to output tool calls in XML format:
-
- ```jinja
- For each function call, return a json object with function name and arguments
- within <tool_call></tool_call> XML tags:
- <tool_call>
- {"name": <function-name>, "arguments": <args-json-object>}
- </tool_call>
- ```
-
- **Impact** (a recovery sketch follows this list):
- - Model outputs `<tool_call>{"name":...}</tool_call>` as **text**
- - This text appears in `delta.content` (not `delta.tool_calls`)
- - Our streaming code yields this as visible text to the UI
- - When tool calling works correctly, the API parses this internally
- - When it fails, raw XML appears in output
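-
- When the XML-style call leaks into text, the payload can still be recovered with plain Python (a minimal sketch; the real fix is making the API parse it, not post-hoc scraping):
-
- ```python
- import json
- import re
-
- TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
-
- def extract_leaked_tool_calls(text: str) -> list[dict]:
-     """Parse {"name": ..., "arguments": ...} objects leaked as visible text."""
-     calls = []
-     for match in TOOL_CALL_RE.finditer(text):
-         try:
-             calls.append(json.loads(match.group(1)))
-         except json.JSONDecodeError:
-             pass  # malformed JSON: drop it rather than guess
-     return calls
- ```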
-
- ### Cause 3: Together.ai Turbo Inconsistent Tool Call Parsing
-
- **Discovery**: Together's serving of the Turbo model has inconsistent behavior:
-
- | Test Scenario | Tool Call Behavior |
- |---------------|-------------------|
- | Simple query, single tool | ✅ Parsed correctly to `tool_calls` |
- | Complex multi-agent prompt | ❌ Mixed: some parsed, some as text |
- | Multi-turn with tool results | ❌ Model hallucinates fake results |
-
- **Evidence from testing**:
- ```python
- # Simple test - WORKS:
- finish_reason: tool_calls
- content: None
- tool_calls: [ChatCompletionOutputToolCall(function=..., name='search_pubmed')]
-
- # Complex prompt - FAILS:
- TEXT[49]: '建档立标'  # Chinese garbage between tool calls
- TEXT[X]: '{"name": "search_preprints", ...}'  # Raw JSON as text
- ```
-
- ### Cause 4: Potential Code Bug - Premature Marker Setting
-
- **Discovery**: In `HuggingFaceChatClient`, we set a marker that may prevent tool execution wrapping:
-
- ```python
- @use_function_invocation  # Decorator checks marker BEFORE wrapping
- @use_observability
- @use_chat_middleware
- class HuggingFaceChatClient(BaseChatClient):
-     # This marker causes decorator to return early!
-     __function_invoking_chat_client__ = True  # <-- BUG?
- ```
-
- The `@use_function_invocation` decorator source:
- ```python
- def use_function_invocation(chat_client):
-     if getattr(chat_client, FUNCTION_INVOKING_CHAT_CLIENT_MARKER, False):
-         return chat_client  # EARLY RETURN - doesn't wrap methods!
-     # ... wrapping code never runs ...
- ```
-
- **Impact**: The decorator sees the marker as `True` and returns early without wrapping `get_response` and `get_streaming_response` with the function invocation handler.
-
- **Status**: NEEDS VERIFICATION - Testing shows methods have `__wrapped__` attribute, suggesting some decoration occurred. May be from other decorators.
-
- ### Cause 5: Model Hallucination Under Complexity
-
- **Discovery**: When the model fails to make proper API tool calls, it **simulates** tool use by outputting fake results:
-
- ```
- {"response": "[{'title': 'Effect of Flibanserin...'}]"}
- ```
-
- This is pure hallucination - no actual API calls were made. The model is trained to produce tool-like outputs, so when the API tool calling fails, it falls back to text-based simulation.
-
- ---
-
- ## Verification Steps
-
- ### Test 1: Direct InferenceClient (PASSES)
-
- ```python
- from huggingface_hub import InferenceClient
-
- client = InferenceClient(model='Qwen/Qwen2.5-7B-Instruct')
- response = client.chat_completion(
-     messages=[{'role': 'user', 'content': 'What is the weather?'}],
-     tools=[weather_tool],
-     tool_choice='auto',
- )
- # Result: tool_calls properly parsed, content=None
- ```
-
- ### Test 2: Complex Multi-Agent Prompt (FAILS)
-
- ```python
- # With our SearchAgent-style prompts:
- stream = client.chat_completion(
-     messages=[system_prompt, user_query],
-     tools=multiple_tools,
-     ...
- )
- # Result: Mix of text content AND tool_calls, garbage tokens appear
- ```
-
- ### Test 3: ChatAgent Single Tool (PARTIAL)
-
- ```python
- agent = ChatAgent(
-     chat_client=HuggingFaceChatClient(),
-     tools=[search_pubmed],
-     ...
- )
- result = await agent.run('Search for libido drugs')
- # Result: Tool call request made but function NOT executed (tool_calls=0)
- ```
-
- ---
-
- ## Impact Assessment
-
- | Aspect | Impact |
- |--------|--------|
- | Free Tier Users | **100% broken** - Cannot get any useful results |
- | Demo Quality | **Unprofessional** - Shows garbage/hallucinations |
- | User Trust | **Critical** - Appears completely broken |
- | Tool Execution | **Not working** - Tools never actually called |
-
- ---
-
- ## Fix Options
-
- ### Option 1: Remove Premature Marker (QUICK - Test First)
-
- **Location**: `src/clients/huggingface.py:43`
-
- ```python
- # REMOVE THIS LINE:
- __function_invoking_chat_client__ = True
- ```
-
- Let the `@use_function_invocation` decorator set the marker AFTER wrapping.
-
- **Risk**: Unknown - need to test if this actually enables tool execution.
-
- ### Option 2: Switch to Model with Native HF Support
-
- Find a model that runs on native HuggingFace infrastructure (not routed to third parties):
-
- | Model | Size | Native HF? | Tool Calling |
- |-------|------|------------|--------------|
- | `Qwen/Qwen2.5-3B-Instruct` | 3B | ❓ Test | ❓ |
- | `mistralai/Mistral-7B-Instruct-v0.3` | 7B | ❓ Test | ✅ |
- | `microsoft/Phi-3-mini-4k-instruct` | 3.8B | ❓ Test | Limited |
-
- ### Option 3: Simplify Free Tier to Single-Agent
-
- Remove multi-agent complexity for Free Tier:
- - Single ChatAgent with simpler prompt
- - Direct tool calls instead of MagenticBuilder workflow
- - Reduced prompt complexity
-
- ### Option 4: Streaming Content Filter (BAND-AID)
-
- Filter garbage from streaming output:
-
- ```python
- def should_stream_content(text: str) -> bool:
-     """Filter garbage from streaming."""
-     if text.strip().startswith('{"name":'):
-         return False  # Raw tool call JSON
-     if '</tool_call>' in text or '<tool_call>' in text:
-         return False  # XML tags
-     garbage = ["oleon", "UrlParser", "MemoryWarning", "建档立标"]
-     if any(g in text for g in garbage):
-         return False
-     return True
- ```
-
- **Note**: This hides symptoms but doesn't fix the underlying tool execution failure.
-
- ### Option 5: Use Together.ai Directly with Their SDK
-
- Bypass HuggingFace routing entirely:
- - Use Together's official SDK
- - May have better tool calling support
- - Requires new client implementation
-
- ---
-
- ## Files Involved
-
- | File | Role |
- |------|------|
- | `src/clients/huggingface.py` | Main HF client - has premature marker |
- | `src/clients/factory.py` | Client selection logic |
- | `src/agents/magentic_agents.py` | Agent definitions with tools |
- | `src/orchestrators/advanced.py` | Multi-agent workflow |
- | `src/agents/tools.py` | Tool function definitions |
-
- ---
-
- ## Recommended Action Plan
-
- ### Phase 1: Verify Code Bug (Immediate)
-
- 1. Remove `__function_invoking_chat_client__ = True` from HuggingFaceChatClient
- 2. Test if tool execution now works
- 3. If yes, verify no regressions with full test suite
-
- ### Phase 2: Provider Testing
-
- 1. Test which small models have native HF support
- 2. Evaluate Together.ai direct integration
- 3. Document provider routing for all candidate models
-
- ### Phase 3: Architecture Decision
-
- Based on Phase 1-2 results:
- - If code fix works: Deploy and monitor
- - If provider issues persist: Implement simplified single-agent mode
- - Consider hybrid: Simple mode for free, advanced for paid
-
- ---
-
- ## Relation to P2_7B_MODEL_GARBAGE_OUTPUT
-
- This P1 bug **supersedes** the P2 bug. The P2 doc incorrectly blamed the model capacity. The real issues are:
-
- 1. **Provider routing** (Together.ai Turbo, not native HF)
- 2. **Tool execution failure** (possible code bug)
- 3. **Model hallucination** (consequence of #2, not root cause)
-
- The P2 symptoms are downstream effects of this P1 root cause.
-
- ---
-
- ## Investigation Timeline
-
- | Time | Finding |
- |------|---------|
- | 16:00 | Started deep investigation per user request |
- | 16:10 | Found Qwen chat template uses XML-style tool_call |
- | 16:20 | Confirmed HF API parses tool calls correctly |
- | 16:30 | Discovered model routed to Together.ai, not native HF |
- | 16:35 | Found premature marker in HuggingFaceChatClient |
- | 16:40 | Verified ChatAgent makes tool requests but doesn't execute |
- | 16:45 | Documented complete root cause chain |
-
- ---
-
- ## References
-
- - [HuggingFace Inference Providers](https://huggingface.co/docs/inference-providers/index)
- - [Together.ai Function Calling](https://docs.together.ai/docs/function-calling)
- - [Qwen Function Calling Docs](https://qwen.readthedocs.io/en/latest/framework/function_call.html)
- - [TGI Tool Calling Issue #2375](https://github.com/huggingface/text-generation-inference/issues/2375)
 
docs/bugs/archive/P1_GRADIO_EXAMPLE_CLICK_AUTO_SUBMIT.md DELETED
@@ -1,273 +0,0 @@
- # P1: Gradio Example Click Auto-Submits Instead of Loading
-
- **Status:** FIXED (PR #120, merged 2025-12-03)
- **Priority:** P1 (High - UX breaks BYOK flow)
- **Discovered:** 2025-12-03
- **Component:** `src/app.py` (Gradio UI)
-
- ---
-
- ## Summary
-
- Clicking on example questions in the Gradio ChatInterface immediately starts the research agent instead of just loading the text into the input field. This prevents users from:
- 1. Entering their API key before starting the chat
- 2. Modifying the example query before submission
- 3. Understanding what's happening (chat starts without explicit action)
-
- ---
-
- ## Reproduction Steps
-
- 1. Open DeepBoner Gradio UI
- 2. **Before entering any API key**, click on an example like "What drugs improve female libido post-menopause?"
- 3. Observe: Chat immediately starts with Free Tier
- 4. Try to enter an OpenAI API key in the accordion
- 5. Try to submit a new query
- 6. **Result:** Confusing UX - the chat already ran, state is unclear
-
- ### Expected Behavior
-
- 1. Click example → text loads into input field
- 2. User can enter API key
- 3. User clicks submit → chat starts with their configured settings
-
- ---
-
- ## Root Cause Analysis
-
- ### Problem 1: Missing `run_examples_on_click=False`
-
- Gradio's `ChatInterface` has a parameter `run_examples_on_click` (added in [PR #10109](https://github.com/gradio-app/gradio/pull/10109), December 2024):
-
- | Value | Behavior |
- |-------|----------|
- | `True` (default) | Clicking example immediately runs the function |
- | `False` | Clicking example only populates the input field |
-
- **Our code** in `src/app.py:279-325` does NOT set this parameter:
-
- ```python
- demo = gr.ChatInterface(
-     fn=research_agent,
-     examples=[...],
-     # run_examples_on_click=False ← MISSING!
- )
- ```
-
- ### Problem 2: HuggingFace Spaces Default Overrides
-
- From [Gradio docs](https://www.gradio.app/docs/gradio/chatinterface):
-
- > `cache_examples`: The default option in HuggingFace Spaces is **True**.
- > `run_examples_on_click` has **no effect** if `cache_examples` is True.
-
- This means on HuggingFace Spaces:
- - `cache_examples` defaults to `True`
- - Even if we add `run_examples_on_click=False`, it would be **ignored**
- - We MUST explicitly set `cache_examples=False`
-
- ### ~~Problem 3: Example Data Overwrites User Settings~~ (CORRECTION: This is Actually Fine)
-
- Looking at lines 283-304:
-
- ```python
- examples=[
-     [
-         "What drugs improve female libido post-menopause?",
-         "sexual_health",
-         None,  # ← api_key set to None
-         None,  # ← api_key_state set to None
-     ],
-     ...
- ]
- ```
-
- **CORRECTION:** Per [Stack Overflow research](https://stackoverflow.com/questions/78584977/how-to-use-additional-inputs-and-examples-at-the-same-time):
-
- > "If you set None for some input in all examples then it will not display this column in example and example will not change current value for this input."
-
- Since ALL examples have `None` for api_key and api_key_state:
- - Those columns won't display in the examples table
- - **Clicking an example will NOT change the API key textbox**
- - User's API key is PRESERVED!
-
- The current example structure is actually **correct**. The only issue is auto-submit.
-
- ### Dead Code: api_key_state Never Updated (Non-Blocking)
-
- Line 258-259 has a comment suggesting a fix was attempted:
-
- ```python
- # BUG FIX: Add gr.State for API key persistence across example clicks
- api_key_state = gr.State("")
- ```
-
- This code is **dead** because:
- 1. The `gr.State` is initialized empty (`""`)
- 2. There's NO event handler (`.change()`) to update the state when the textbox changes
- 3. The value passed to `research_agent` is always `""`
- 4. In `_validate_inputs`: `(api_key or api_key_state or "")` - the State never contributes
-
- **However**, this is NOT blocking the fix. The fix works regardless of this dead code.
- We can clean it up in a separate PR after the fix is verified working.
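-
- For reference, a one-line sketch of the wiring that would make the state live if it were kept (`api_key_textbox` is a hypothetical handle to the textbox component; this is not current app code):
-
- ```python
- # Mirror the textbox into the session state whenever its value changes.
- api_key_textbox.change(fn=lambda v: v, inputs=api_key_textbox, outputs=api_key_state)
- ```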
-
- ---
-
- ## Architecture Implications
-
- ### BYOK Flow Broken
-
- The unified architecture (SPEC-16) relies on API key auto-detection:
-
- ```text
- User provides key?
- ├── YES → OpenAI backend (sk-...) or Anthropic backend (sk-ant-...)
- └── NO → HuggingFace Free Tier
- ```
-
- The example click bug forces users into Free Tier even if they intended to use their API key.
-
- ### Session State Confusion
-
- After an example auto-submits:
- 1. Chat history has content
- 2. User enters API key
- 3. User submits new query
- 4. **Question:** Does the new query use the new key? Is history preserved correctly?
-
- This creates ambiguous state that could lead to:
- - Inconsistent backend usage within a session
- - Confusion about which tier was used for which response
-
- ---
-
- ## Fix Implementation
-
- ### Required Changes to `src/app.py`
-
- ```python
- demo = gr.ChatInterface(
-     fn=research_agent,
-     title="🍆 DeepBoner",
-     description=description,
-     examples=[...],
-     additional_inputs_accordion=additional_inputs_accordion,
-     additional_inputs=[...],
-     # === FIX: Prevent auto-submit on example click ===
-     cache_examples=False,  # MUST be False for run_examples_on_click to work
-     run_examples_on_click=False,  # Load into input, don't auto-run
- )
- ```
-
- ### Why This Fix is Safe (No Optional Enhancements Needed)
-
- The current example structure with `None` values is **correct**:
- - API key textbox value is PRESERVED when clicking examples
- - Only the message textbox is populated
- - No restructuring of examples needed
-
- **The fix is minimal and surgical:**
- ```python
- cache_examples=False,
- run_examples_on_click=False,
- ```
-
- No other changes required.
-
- ---
-
- ## Testing
-
- ### Manual Test Cases
-
- 1. **Fresh load, click example:** Should only populate input, not start chat
- 2. **Enter API key, click example:** Query loads, API key preserved
- 3. **Click example, enter key, submit:** Should use the entered key
- 4. **Multiple example clicks:** Each should just replace input text
-
- ### Automated Test (if possible)
-
- ```python
- def test_example_click_does_not_auto_submit():
-     """Verify examples only populate input, not trigger function."""
-     # Would need Gradio testing utilities
-     pass
- ```
-
- ---
-
- ## Related Issues
-
- - [Gradio #10103](https://github.com/gradio-app/gradio/issues/10103): Original feature request for `run_examples_on_click`
- - [Gradio #10109](https://github.com/gradio-app/gradio/pull/10109): PR that implemented the parameter
- - SPEC-16: Unified Chat Client Architecture (relies on proper API key handling)
- - P2_ARCHITECTURAL_BYOK_GAPS.md (archived) - Related BYOK issues now fixed
-
- ---
-
- ## Priority Justification
-
- **P1 (High)** because:
- 1. Breaks the BYOK (Bring Your Own Key) user flow
- 2. Forces users into Free Tier unexpectedly
- 3. Creates confusing UX that may prevent demo adoption
- 4. Simple fix with clear solution path
-
- ---
-
- ## Files Affected
-
- - `src/app.py:279-325` - ChatInterface configuration
-
- ---
-
- ## Senior Review: Risk Assessment
-
- **Reviewed:** 2025-12-03
-
- ### Verification Performed
-
- 1. **Gradio Version Confirmed:** 6.0.1 (`uv pip show gradio`)
- 2. **Parameters Exist:** Both `run_examples_on_click` and `cache_examples` verified in `ChatInterface.__init__` signature
- 3. **No Hidden Gradio Usage:** Only `src/app.py` imports gradio (grep confirmed)
- 4. **No Event Handlers:** No `.change()`, `.click()`, `.submit()` events in app.py that could conflict
- 5. **Example Format Correct:** List-of-lists format matches `additional_inputs` order
-
- ### Potential Regressions Checked
-
- | Risk | Assessment | Mitigation |
- |------|------------|------------|
- | Cold start slower on HF Spaces | Low - examples aren't pre-cached, but they also don't run on click | None needed - acceptable tradeoff |
- | Progress bar issues | None - `gr.Progress()` issues only affect cached examples, we're disabling caching | N/A |
- | Example display changes | None - examples already appear below chatbot due to `additional_inputs` | N/A |
- | API key cleared on example click | **Verified SAFE** - `None` in all examples means input is preserved | N/A |
- | Dead State code causes issues | No - it's inert, just passes `""` always | Clean up in follow-up PR |
-
- ### Gotchas Investigated
-
- 1. **ViewFrame/hydration issues:** `ssr_mode=False` already set at line 339 - no conflict
- 2. **MCP server interaction:** MCP server (`mcp_server=True`) operates independently of examples - no conflict
- 3. **CSS injection:** Custom CSS only affects `.api-key-input` class - no conflict
- 4. **Accordion state:** `additional_inputs_accordion` unaffected by example behavior
-
- ### Confidence Level
-
- **HIGH** - This is a two-line, surgical fix that:
- - Uses documented, stable Gradio 6.0 parameters
- - Has no side effects on other components
- - Preserves existing example structure
- - Was explicitly designed for this use case (PR #10109)
-
- ### Recommended Approach
-
- 1. **Phase 1:** Add the two params, test manually on HF Spaces
- 2. **Phase 2:** (Optional) Clean up dead `api_key_state` code in follow-up PR
-
- ---
-
- ## References
-
- - [Gradio ChatInterface Docs](https://www.gradio.app/docs/gradio/chatinterface)
- - [Gradio Examples Behavior](https://www.gradio.app/guides/chatinterface-examples)
- - [PR #10109: run_examples_on_click](https://github.com/gradio-app/gradio/pull/10109)
- - [Stack Overflow: None values in examples](https://stackoverflow.com/questions/78584977/how-to-use-additional-inputs-and-examples-at-the-same-time)
docs/bugs/archive/P1_HUGGINGFACE_NOVITA_500_ERROR.md DELETED
@@ -1,133 +0,0 @@
# P1 BUG: HuggingFace Router 500 Error via Novita Provider

**Status**: ACTIVE - Upstream Infrastructure Issue
**Priority**: P1 (Free Tier Broken)
**Discovered**: 2025-12-02
**Related**: CLAUDE.md (Llama/Hyperbolic issue)

---

## Symptom

```
❌ **ERROR**: Workflow error: 500 Server Error: Internal Server Error for url:
https://router.huggingface.co/novita/v3/openai/chat/completions
```

Free tier users (no API key) cannot use the system.

---

## Stack Trace

```text
User (no API key)

src/clients/factory.py:get_chat_client()

src/clients/huggingface.py:HuggingFaceChatClient

Model: Qwen/Qwen2.5-72B-Instruct (from config.py)

huggingface_hub.InferenceClient

HuggingFace Router: router.huggingface.co

Routes to: NOVITA (third-party inference provider)

❌ Novita returns 500 Internal Server Error
```

---

## Root Cause

**HuggingFace doesn't host all models directly.** For some models, it routes requests to third-party inference providers:

| Model | Provider | Status |
|-------|----------|--------|
| Llama-3.1-70B | Hyperbolic | ❌ "staging mode" auth issues |
| Qwen2.5-72B | Novita | ❌ 500 Internal Server Error |

We switched from Llama to Qwen specifically to avoid Hyperbolic's issues. Now Novita is having its own problems.

**This is an upstream infrastructure issue - not a bug in our code.**

---

## Evidence

From the error URL:
```
https://router.huggingface.co/novita/v3/openai/chat/completions
                              ^^^^^^
                              Third-party provider in URL path
```

---

## Potential Fixes

### Option 1: Try a Different Model (Quick)
Find a model that HuggingFace hosts natively (not routed to partners):

```python
# Candidates to test:
# - mistralai/Mistral-7B-Instruct-v0.3
# - microsoft/Phi-3-mini-4k-instruct
# - google/gemma-2-9b-it
```

### Option 2: Add Fallback Logic (Robust)
```python
from huggingface_hub.utils import HfHubHTTPError

FALLBACK_MODELS = [
    "Qwen/Qwen2.5-72B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "microsoft/Phi-3-mini-4k-instruct",
]

async def get_response_with_fallback(...):
    for model in FALLBACK_MODELS:
        try:
            return await client.chat_completion(model=model, ...)
        except HfHubHTTPError as e:
            # Status code lives on the underlying HTTP response
            if e.response is not None and e.response.status_code == 500:
                continue
            raise
    raise AllModelsFailedError()  # custom app-level exception
```

### Option 3: Wait for Novita Fix (Passive)
500 errors are typically transient. Novita may fix their infrastructure.

---

## Verification

To check if the issue is resolved:
```bash
curl -X POST "https://router.huggingface.co/novita/v3/openai/chat/completions" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-72B-Instruct", "messages": [{"role": "user", "content": "hi"}]}'
```
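
The same check in Python, going through the router the way the app does (a minimal sketch; only `HF_TOKEN` is assumed to be set in the environment):

```python
import os

from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

client = InferenceClient(model="Qwen/Qwen2.5-72B-Instruct", token=os.environ["HF_TOKEN"])
try:
    resp = client.chat_completion(messages=[{"role": "user", "content": "hi"}], max_tokens=5)
    print("Provider healthy:", resp.choices[0].message.content)
except HfHubHTTPError as e:
    print("Still failing:", e)  # a 500 here means Novita is still down
```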

---

## Historical Context

From `CLAUDE.md`:
```
- **HuggingFace (Free Tier):** `Qwen/Qwen2.5-72B-Instruct`
  - Changed from Llama-3.1-70B (Dec 2025) due to HuggingFace routing Llama
    to Hyperbolic provider which has unreliable "staging mode" auth.
```

Now Qwen is being routed to Novita, continuing the pattern of unreliable third-party routing.

---

## Recommendation

**Short-term**: Switch to a model hosted natively by HuggingFace (test candidates above)
**Long-term**: Implement fallback model logic to handle provider outages gracefully

docs/bugs/archive/P1_HUGGINGFACE_ROUTER_401_HYPERBOLIC.md DELETED
@@ -1,62 +0,0 @@
# P1 Bug: HuggingFace Router 401 Unauthorized

**Severity**: P1 (High)
**Status**: RESOLVED
**Discovered**: 2025-12-01
**Resolved**: 2025-12-01
**Reporter**: Production user via HuggingFace Spaces

## Symptom

```
401 Client Error: Unauthorized for url:
https://router.huggingface.co/hyperbolic/v1/chat/completions
Invalid username or password.
```

## Root Cause

**The HF_TOKEN in `.env` and HuggingFace Spaces secrets was invalid/expired.**

Token `hf_ssayg...` failed `HfApi().whoami()` verification.

## Resolution

1. Generated new HF_TOKEN at https://huggingface.co/settings/tokens
2. Updated `.env` with new token: `hf_gZVBI...`
3. Updated HuggingFace Spaces secret with same token
4. Switched default model from `meta-llama/Llama-3.1-70B-Instruct` to `Qwen/Qwen2.5-72B-Instruct` (better reliability via HF router)

## Verification

```bash
uv run python -c "
import os
from huggingface_hub import InferenceClient, HfApi

token = os.environ['HF_TOKEN']  # Your valid token from .env
api = HfApi(token=token)
print(f'Token valid: {api.whoami()[\"name\"]}')

client = InferenceClient(model='Qwen/Qwen2.5-72B-Instruct', token=token)
response = client.chat_completion(messages=[{'role': 'user', 'content': '2+2=?'}], max_tokens=10)
print(f'Inference works: {response.choices[0].message.content}')
"
# Output:
# Token valid: VibecoderMcSwaggins
# Inference works: 4
```

## Lessons Learned

1. **First-principles debugging**: Before adding complex "fixes", verify basic assumptions (is the token actually valid?)
2. **Token expiration**: HuggingFace tokens can expire or become invalid. Always verify with `whoami()`.
3. **Model routing**: HuggingFace routes large models to partner providers (Hyperbolic, Novita). All require valid auth.

## Files Changed

- `src/utils/config.py`: Changed default model to `Qwen/Qwen2.5-72B-Instruct`
- `src/clients/huggingface.py`: Updated fallback model reference
- `src/agent_factory/judges.py`: Updated fallback model reference
- `src/orchestrators/langgraph_orchestrator.py`: Updated hardcoded model
- `CLAUDE.md`, `AGENTS.md`, `GEMINI.md`: Updated documentation

docs/bugs/archive/P1_NARRATIVE_SYNTHESIS_FALLBACK.md DELETED
@@ -1,185 +0,0 @@
# P1: Narrative Synthesis Falls Back to Template (SPEC_12 Not Taking Effect)

**Status**: Open
**Priority**: P1 - Major UX degradation
**Affects**: Simple mode, all deployments
**Root Cause**: LLM synthesis silently failing → template fallback
**Related**: SPEC_12 (implemented but not functioning)

---

## Problem Statement

SPEC_12 implemented LLM-based narrative synthesis, but users still see **template-formatted bullet points** instead of **prose paragraphs**:

### What Users See (Template Fallback)

```markdown
## Sexual Health Analysis

### Question
what medication for the best boners?

### Drug Candidates
- **tadalafil**
- **sildenafil**

### Key Findings
- Tadalafil improves erectile function

### Assessment
- **Mechanism Score**: 4/10
- **Clinical Evidence Score**: 6/10
```

### What They Should See (LLM Synthesis)

```markdown
### Executive Summary

Sildenafil demonstrates clinically meaningful efficacy for erectile dysfunction,
with strong evidence from multiple RCTs demonstrating improved erectile function...

### Background

Erectile dysfunction (ED) is a common male sexual health disorder...

### Evidence Synthesis

**Mechanism of Action**
Sildenafil works by inhibiting phosphodiesterase type 5 (PDE5)...
```

---

## Root Cause Analysis

### Location: `src/orchestrators/simple.py:555-564`

```python
try:
    agent = Agent(model=get_model(), output_type=str, system_prompt=system_prompt)
    result = await agent.run(user_prompt)
    narrative = result.output
except Exception as e:  # ← SILENT FALLBACK
    logger.warning("LLM synthesis failed, using template fallback", error=str(e))
    return self._generate_template_synthesis(query, evidence, assessment)
```

**The Problem**: When ANY exception occurs during LLM synthesis, it silently falls back to the template. Users see janky bullet points with no indication that the LLM call failed.

### Why Synthesis Fails

| Cause | Symptom | Frequency |
|-------|---------|-----------|
| No API key in deployment | HuggingFace Spaces | HIGH |
| API rate limiting | Heavy usage | MEDIUM |
| Token overflow | Long evidence lists | MEDIUM |
| Model mismatch | Wrong model ID | LOW |
| Network timeout | Slow connections | LOW |

---

## Evidence: LLM Synthesis WORKS When Configured

Local test with API key:
```python
# This works perfectly:
agent = Agent(model=get_model(), output_type=str, system_prompt=system_prompt)
result = await agent.run(user_prompt)
print(result.output)  # → Beautiful narrative prose!
```

Output:
```
### Executive Summary

Sildenafil demonstrates clinically meaningful efficacy for erectile dysfunction,
with one study (Smith, 2020; N=100) reporting improved erectile function...
```

---

## Impact

| Metric | Current | Expected |
|--------|---------|----------|
| Report quality | 3/10 (metadata dump) | 9/10 (professional prose) |
| User satisfaction | Low | High |
| Clinical utility | Limited | High |

The ENTIRE VALUE PROPOSITION of the research agent is the synthesized report. Template output defeats the purpose.

---

## Fix Options

### Option A: Surface Error to User (RECOMMENDED)

When LLM synthesis fails, don't silently fall back. Show the user what went wrong:

```python
except Exception as e:
    logger.error("LLM synthesis failed", error=str(e), exc_info=True)

    # Show error in report instead of silent fallback
    error_note = f"""
⚠️ **Note**: AI narrative synthesis unavailable.
Showing structured summary instead.

_Technical: {type(e).__name__}: {str(e)[:100]}_
"""
    template = self._generate_template_synthesis(query, evidence, assessment)
    return f"{error_note}\n\n{template}"
```

### Option B: HuggingFace Secrets Configuration

For HuggingFace Spaces deployment, add secrets:
- `OPENAI_API_KEY` → Required for synthesis
- `ANTHROPIC_API_KEY` → Alternative provider

### Option C: Graceful Degradation with Explanation

Add a banner explaining synthesis status (see the sketch below):
- ✅ "AI-synthesized narrative report" (when LLM works)
- ⚠️ "Structured summary (AI synthesis unavailable)" (fallback)

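A minimal sketch of that banner (helper name and wiring are assumptions; the flag would be set in the `try/except` shown above):

```python
def synthesis_banner(llm_synthesis_ok: bool) -> str:
    """Status line prepended to the report before it reaches the UI."""
    if llm_synthesis_ok:
        return "✅ _AI-synthesized narrative report_\n\n"
    return "⚠️ _Structured summary (AI synthesis unavailable)_\n\n"

# Usage at the end of the synthesis path:
# report = synthesis_banner(llm_ok) + report_body
```
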
---

## Diagnostic Steps

To determine why synthesis is failing in production (a combined check is sketched below):

1. **Review logs** for warning: `"LLM synthesis failed, using template fallback"`
2. **Verify API key**: Is `OPENAI_API_KEY` set in environment?
3. **Confirm model access**: Is `gpt-5` accessible with current API tier?
4. **Inspect rate limits**: Is the account quota exhausted?

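A quick script covering steps 2-4, using the same `pydantic_ai` stack as the synthesis path (the model id here is an assumption; substitute whatever `get_model()` resolves to):

```python
import asyncio
import os

from pydantic_ai import Agent


async def main() -> None:
    # Step 2: is a key configured at all?
    print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))

    # Steps 3-4: does a minimal synthesis-style call succeed?
    agent = Agent(model="openai:gpt-4o-mini", output_type=str)
    result = await agent.run("Reply with OK")
    print("LLM reachable:", result.output)  # raises on auth/quota problems


asyncio.run(main())
```
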
---

## Acceptance Criteria

- [ ] Users see narrative prose reports (not bullet points) when API key is configured
- [ ] When synthesis fails, user sees clear indication (not silent fallback)
- [ ] HuggingFace Spaces deployment has proper secrets configured
- [ ] Logging captures the specific exception for debugging

---

## Files to Modify

| File | Change |
|------|--------|
| `src/orchestrators/simple.py:555-580` | Add error surfacing in fallback |
| `src/app.py` | Add synthesis status indicator to UI |
| HuggingFace Spaces Settings | Add `OPENAI_API_KEY` secret |

---

## Test Plan

1. Run locally with API key → Should get narrative prose
2. Run locally WITHOUT API key → Should get template WITH error message
3. Deploy to HuggingFace with secrets → Should get narrative prose
4. Deploy to HuggingFace WITHOUT secrets → Should get template WITH warning

docs/bugs/archive/P1_NO_SYNTHESIS_FREE_TIER.md DELETED
@@ -1,165 +0,0 @@
# P1 Bug: No Synthesis Report in Free Tier (Premature Workflow Termination)

**Date**: 2025-12-04
**Status**: FIXED (PR fix/p1-forced-synthesis)
**Severity**: P1 (Critical UX - No usable output from research)
**Component**: `src/orchestrators/advanced.py`
**Affects**: Free Tier (HuggingFace) primarily, potentially Paid Tier

---

## Executive Summary

The workflow terminates without the ReportAgent ever producing a synthesis report. Users see search results and hypotheses streaming, but the final output is just "Research complete." with no actual research report. This is caused by the 7B Manager model failing to properly delegate to ReportAgent before workflow termination.

---

## Symptom

```text
📚 **SEARCH_COMPLETE**: searcher: [search results]
⏱️ **PROGRESS**: Round 1/5 (~3m 0s remaining)
🔬 **HYPOTHESIZING**: hypothesizer: [hypotheses]
⏱️ **PROGRESS**: Round 2/5 (~2m 15s remaining)
✅ **JUDGE_COMPLETE**: judge: [asks for more evidence]
⏱️ **PROGRESS**: Round 4/5 (~45s remaining)
Research complete.
Research complete.   ← NO SYNTHESIS REPORT!
```

The workflow runs through multiple agents (Search, Hypothesis, Judge) but never reaches the ReportAgent. The user receives no usable research report.

---

## Root Cause Analysis

### Primary Issue: Manager Model Failure

The `with_standard_manager()` in Microsoft Agent Framework uses the provided chat client (HuggingFace 7B model) to coordinate agents. The 7B model:

1. **Cannot follow complex multi-step instructions** - The manager prompt instructs: "When JudgeAgent says SUFFICIENT EVIDENCE → delegate to ReportAgent." The 7B model doesn't reliably follow this.

2. **Triggers premature termination** - The framework has `max_stall_count=3` and `max_reset_count=2`. If the manager keeps making the same delegation or gets confused, the workflow terminates.

3. **Emits final event without synthesis** - The framework sends `MagenticFinalResultEvent` or `WorkflowOutputEvent` without ReportAgent ever running.

### Secondary Issue: Duplicate Complete Events

Both `MagenticFinalResultEvent` and `WorkflowOutputEvent` are emitted when the workflow ends. The previous code handled both, yielding "Research complete." twice.

---

## The Fix

### 1. Track ReportAgent Execution (Forced Synthesis)

Add a `reporter_ran` flag that tracks whether ReportAgent produced output:

```python
reporter_ran = False  # P1 FIX: Track if ReportAgent produced output

# In MagenticAgentMessageEvent handler:
agent_name = (event.agent_id or "").lower()
if "report" in agent_name:
    reporter_ran = True
```

### 2. Force Synthesis on Final Event

If the workflow ends without ReportAgent running, force synthesis:

```python
if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
    if not reporter_ran:
        logger.warning("ReportAgent never ran - forcing synthesis")
        async for synth_event in self._force_synthesis(iteration):
            yield synth_event
    else:
        yield self._handle_final_event(event, iteration, last_streamed_length)
```

### 3. `_force_synthesis()` Method

Similar to `_handle_timeout()`, invokes ReportAgent directly:

```python
async def _force_synthesis(self, iteration: int) -> AsyncGenerator[AgentEvent, None]:
    """Force synthesis when workflow ends without ReportAgent running."""
    state = get_magentic_state()
    evidence_summary = await state.memory.get_context_summary()
    report_agent = create_report_agent(self._chat_client, domain=self.domain)

    yield AgentEvent(type="synthesizing", message="Synthesizing research findings...")

    synthesis_result = await report_agent.run(
        f"Synthesize research report from this evidence.\n\n{evidence_summary}"
    )

    yield AgentEvent(type="complete", message=synthesis_result.text)
```

### 4. Skip Duplicate Final Events

Prevent "Research complete." appearing twice:

```python
if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
    if final_event_received:
        continue  # Skip duplicate final events
    final_event_received = True
```

---

## Why This Is The Correct Architecture

| Alternative | Why Wrong |
|-------------|-----------|
| Improve manager prompt | 7B models have fundamental reasoning limitations |
| Use larger model for manager | Defeats "free tier" purpose |
| Wait for upstream fix | Framework may never change; we control our code |
| **Forced synthesis safety net** | ✅ Guarantees output regardless of manager behavior |

The `_force_synthesis()` pattern is a **defensive architecture**. It guarantees users always get a research report, even if:
- The manager model fails to delegate properly
- The workflow hits stall/reset limits
- Any unexpected termination occurs

---

## Files Modified

| File | Change |
|------|--------|
| `src/orchestrators/advanced.py` | Added `reporter_ran` tracking |
| `src/orchestrators/advanced.py` | Added `_force_synthesis()` method |
| `src/orchestrators/advanced.py` | Added duplicate final event skipping |
| `src/orchestrators/advanced.py` | Added forced synthesis in final event handler |
| `src/orchestrators/advanced.py` | Added forced synthesis in max rounds fallback |

---

## Test Plan

1. **Free Tier**: Run query, verify synthesis report is always generated (a test sketch follows below)
2. **Paid Tier**: Run query, verify no regression in OpenAI behavior
3. **Timeout**: Verify existing timeout synthesis still works
4. **Max Rounds**: Verify synthesis happens even at max rounds
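
One way case 1 could be automated (a sketch only - it assumes the chat client and search tools are mocked by fixtures so no network calls occur):

```python
import pytest

from src.orchestrators.advanced import AdvancedOrchestrator


@pytest.mark.asyncio
async def test_complete_event_always_emitted(mock_chat_client):  # fixture assumed
    orch = AdvancedOrchestrator(max_rounds=1)
    events = [event async for event in orch.run("dummy query")]
    # Whether or not ReportAgent ran, the safety net must yield a report
    assert any(e.type == "complete" for e in events)
```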

---

## Related

- P2 Duplicate Report Bug (separate issue, also fixed in this PR)
- P2 First Turn Timeout Bug (previously fixed)
- Manager model limitations are fundamental to 7B models
- OpenAI tier works because GPT-5 follows instructions better

---

## Lessons Learned

1. **Defensive architecture** - Don't trust upstream components to always behave correctly
2. **Tracking flags** - Simple boolean flags can enable powerful safety nets
3. **AI-native challenges** - When using AI models as infrastructure components, build in fallbacks for model failures
4. **Regression prevention** - This bug was likely introduced when we unified the architecture; comprehensive test coverage is critical

docs/bugs/archive/P1_SIMPLE_MODE_REMOVED_BREAKS_FREE_TIER_UX.md DELETED
@@ -1,61 +0,0 @@
# Free Tier (No API Key) - BLOCKED by Upstream #2562

**Status**: BLOCKED - Waiting for upstream PR #2566
**Priority**: P1
**Discovered**: 2025-12-01

---

## Problem

Free tier (no API key provided) shows garbage output:

```
📚 **SEARCH_COMPLETE**: searcher: <agent_framework._types.ChatMessage object at 0x7fd3f8617b10>
```

## Cause

**Upstream Bug #2562**: Microsoft Agent Framework produces `repr()` garbage for tool-call-only messages.

## Architecture

```
        User provides API key?

   NO (Free Tier)          YES (Paid Tier)
   ──────────────          ───────────────
   HuggingFace backend     OpenAI backend
   Qwen 2.5 72B (free)     GPT-5 (paid)

   SAME orchestration, different backends
   ONE codebase, not parallel universes
```

## Framework Stack

| Framework | Role |
|-----------|------|
| Microsoft Agent Framework | Multi-agent orchestration |
| Pydantic AI | Structured outputs & validation |

Both work TOGETHER. Not mutually exclusive.

## Fix

**Upstream PR #2566** will fix this.

Once merged:
1. `uv add agent-framework@latest`
2. Verify free tier works
3. Done

## What Was Deleted

`simple.py` (778 lines) was a SEPARATE orchestrator. It created a parallel universe. Now deleted. ONE orchestrator with different backends.

## Related

- [Issue #105](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/105)
- [Upstream #2562](https://github.com/microsoft/agent-framework/issues/2562)
- [Upstream PR #2566](https://github.com/microsoft/agent-framework/pull/2566)

docs/bugs/archive/P1_SYNTHESIS_BROKEN_KEY_FALLBACK.md DELETED
@@ -1,163 +0,0 @@
# P0 - Free Tier Synthesis Incorrectly Uses Server-Side API Keys

**Status:** RESOLVED
**Priority:** P0 (Breaks Free Tier Promise)
**Found:** 2025-11-30
**Resolved:** 2025-11-30
**Component:** `src/orchestrators/simple.py`, `src/agent_factory/judges.py`

## Resolution Summary

The architectural bug where Simple Mode synthesis incorrectly used server-side API keys has been fixed.
We implemented a dedicated `synthesize()` method in `HFInferenceJudgeHandler` that uses the free
HuggingFace Inference API, consistent with the judging phase.

### Fix Details

1. **New Feature**: Added `synthesize()` method to `HFInferenceJudgeHandler` (and `JudgeHandler` protocol).
   - Uses `huggingface_hub.InferenceClient.chat_completion` (Free Tier).
   - Mirrors the `assess()` logic for consistent free access.

2. **Orchestrator Logic Update** (see the sketch below):
   - `SimpleOrchestrator` now checks `if hasattr(self.judge, "synthesize")`.
   - If true (Free Tier), it calls `judge.synthesize()` directly, skipping `get_model()`/`pydantic_ai`.
   - If false (Paid Tier), it falls back to the existing `pydantic_ai` agent flow using `get_model()`.

3. **Test Coverage**:
   - Updated `tests/unit/orchestrators/test_simple_synthesis.py` to mock `judge.synthesize`.
   - Added new test case ensuring Free Tier path is taken when available.
   - Fixed integration tests to simulate Free Tier correctly.

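A condensed sketch of that dispatch (names taken from this doc; exact signatures may differ slightly in `simple.py`):

```python
if hasattr(self.judge, "synthesize"):
    # Free Tier: HFInferenceJudgeHandler synthesizes via the free HF Inference API
    narrative = await self.judge.synthesize(query, evidence, assessment)
else:
    # Paid Tier: existing pydantic_ai flow, keyed off server-side settings
    agent = Agent(model=get_model(), output_type=str, system_prompt=system_prompt)
    narrative = (await agent.run(user_prompt)).output
```
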
### Verification

- **Unit Tests**: `tests/unit/orchestrators/test_simple_synthesis.py` passed (7/7).
- **Integration**: `tests/integration/test_simple_mode_synthesis.py` passed.
- **Full Suite**: `make check` passed (310/310 tests).

---

## Symptom (Archive)

When using Simple Mode (Free Tier) without providing a user API key, users see:

```
> ⚠️ **Note**: AI narrative synthesis unavailable. Showing structured summary.
> _Error: OpenAIError_
```

This is confusing because the user didn't configure any OpenAI key - they expected Free Tier to work.

## Root Cause

**Architecture bug: Synthesis is decoupled from JudgeHandler selection.**

| Component | Paid Tier | Free Tier |
|-----------|-----------|-----------|
| Judge | `JudgeHandler` (uses `get_model()`) | `HFInferenceJudgeHandler` (free HF Inference) |
| Synthesis | `get_model()` | **BUG: Also uses `get_model()`** |

**Flow:**
1. User selects Simple mode, leaves API key empty
2. `app.py` correctly creates `HFInferenceJudgeHandler` for judging (works)
3. Search works (no keys needed for PubMed/ClinicalTrials/Europe PMC)
4. Judge works (HFInferenceJudgeHandler uses free HuggingFace inference)
5. **BUG:** Synthesis calls `get_model()` in `simple.py:547`
6. `get_model()` checks `settings.has_openai_key` → reads SERVER-SIDE env vars
7. If ANY server-side key is set (even broken), synthesis tries to use it
8. This VIOLATES the Free Tier promise - the user didn't provide a key!

**The bug is NOT about broken keys - it's about synthesis ignoring the Free Tier selection.**

## Impact

- **User Confusion**: User didn't provide a key, sees "OpenAIError"
- **Free Tier Perception**: Makes Free Tier seem broken when it's actually working (template synthesis is still useful)
- **Demo Quality**: Hackathon judges may think the app is broken

## Fix Options

### Option A: Remove/Fix Admin Key (Quick Fix for Hackathon)
Remove or update the `OPENAI_API_KEY` secret on HuggingFace Spaces.
- If removed: Free Tier works as designed (template synthesis)
- If fixed: OpenAI synthesis works

**Pros:** Instant fix, no code changes
**Cons:** Doesn't fix the underlying UX issue

### Option B: Better Error Message
Change the error message to be more user-friendly:

```python
# src/orchestrators/simple.py:569-573
error_note = (
    f"\n\n> ⚠️ **Note**: AI narrative synthesis unavailable. "
    f"Showing structured summary.\n"
    f"> _Tip: Provide your own API key for full synthesis._\n"
)
```

**Pros:** Clearer UX
**Cons:** Hides the real error for debugging

### Option C: Provider Fallback Chain (Best Long-term)
If the primary provider fails, try the next provider before falling back to template:

```python
def get_model_with_fallback() -> Any:
    """Try providers in order, return first that works."""
    from src.utils.exceptions import ConfigurationError

    providers = []
    if settings.has_openai_key:
        providers.append(("openai", lambda: OpenAIChatModel(...)))
    if settings.has_anthropic_key:
        providers.append(("anthropic", lambda: AnthropicModel(...)))
    if settings.has_huggingface_key:
        providers.append(("huggingface", lambda: HuggingFaceModel(...)))

    for name, factory in providers:
        try:
            return factory()
        except Exception as e:
            logger.warning(f"Provider {name} failed: {e}")
            continue

    raise ConfigurationError("No working LLM provider available")
```

**Pros:** Most robust, graceful degradation
**Cons:** More complex, may hide real errors

### Option D: Validate Key Before Using (Recommended)
Add key validation to `get_model()`:

```python
def get_model() -> Any:
    if settings.has_openai_key:
        # Quick validation - check key format
        key = settings.openai_api_key
        if not key or not key.startswith("sk-"):
            logger.warning("Invalid OpenAI key format, trying next provider")
        else:
            return OpenAIChatModel(...)
    # ... continue to next provider
```

**Pros:** Catches obviously invalid keys early
**Cons:** Can't catch quota/permission issues without an API call

## Recommended Action (Hackathon)

1. **Immediate**: Remove `OPENAI_API_KEY` from HuggingFace Space secrets, OR replace it with a valid key
2. **If key is valid**: Check if model `gpt-5` is accessible (may need to use `gpt-4o` instead)

## Test Plan

1. Remove all secrets from HuggingFace Space
2. Run Simple mode query
3. Verify: Search works, Judge works, Synthesis shows template (no error message)

## Related

- `docs/bugs/P0_SYNTHESIS_PROVIDER_MISMATCH.md` (RESOLVED - handles "no keys" case)
- This bug is specifically about the "key exists but broken" case

docs/bugs/archive/P2_7B_MODEL_GARBAGE_OUTPUT.md DELETED
@@ -1,266 +0,0 @@
# P2 Bug: 7B Model Produces Garbage Streaming Output

**Date**: 2025-12-02
**Status**: OPEN - Investigating
**Severity**: P2 (Major - Degrades User Experience)
**Component**: Free Tier / HuggingFace + Multi-Agent Orchestration

---

## Symptoms

When running a research query on Free Tier (Qwen2.5-7B-Instruct), the streaming output shows **garbage tokens** and **malformed tool calls** instead of coherent agent reasoning:

### Symptom A: Random Garbage Tokens
```text
📡 **STREAMING**: yarg
📡 **STREAMING**: PostalCodes
📡 **STREAMING**: FunctionFlags
📡 **STREAMING**: system
📡 **STREAMING**: Transferred to searcher, adopt the persona immediately.
```

### Symptom B: Raw Tool Call JSON in Text (NEW - 2025-12-03)
```text
📡 **STREAMING**:
oleon
{"name": "search_preprints", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
</tool_call>
system

UrlParser
{"name": "search_clinical_trials", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
```

The model is outputting:
1. **Garbage tokens**: "oleon", "UrlParser" - meaningless fragments
2. **Raw JSON tool calls**: `{"name": "search_preprints", ...}` - intended tool calls output as TEXT
3. **XML-style tags**: `</tool_call>` - model trying to use the wrong tool calling format
4. **"system" keyword**: Model confusing role markers with content

**Root Cause of Symptom B**: The 7B model is attempting to make tool calls but outputting them as **text content** instead of using the HuggingFace API's native `tool_calls` structure. The model may have been trained on a different tool calling format (XML-style like Claude's `<tool_call>` tags) and doesn't properly use the OpenAI-compatible JSON format.

The model outputs random tokens like "yarg", "PostalCodes", "FunctionFlags" instead of actual research reasoning.

---

## Reproduction Steps

1. Go to HuggingFace Spaces: https://huggingface.co/spaces/vcms/deepboner
2. Leave API key empty (Free Tier)
3. Click any example query or type a question
4. Click submit
5. Observe streaming output - garbage tokens appear

**Expected**: Coherent agent reasoning like "Searching PubMed for female libido treatments..."
**Actual**: Random tokens like "yarg", "PostalCodes"

---

## Root Cause Analysis

### Primary Cause: 7B Model Too Small for Multi-Agent Prompts

The Qwen2.5-7B-Instruct model has **insufficient reasoning capacity** for the complex multi-agent framework. The system requires the model to:

1. **Adopt agent personas** with specialized instructions
2. **Follow structured workflows** (Search → Judge → Hypothesis → Report)
3. **Make tool calls** (search_pubmed, search_clinical_trials, etc.)
4. **Generate JSON-formatted progress ledgers** for workflow control
5. **Understand manager instructions** and delegate appropriately

A 7B parameter model simply does not have the reasoning depth to handle this. Larger models (70B+) were originally intended, but those are routed to unreliable third-party providers (see `HF_FREE_TIER_ANALYSIS.md`).

### Technical Flow (Where Garbage Appears)

```
User Query

AdvancedOrchestrator.run()  [advanced.py:247]

workflow.run_stream(task)  [builds Magentic workflow]

MagenticAgentDeltaEvent emitted with event.text

Yields AgentEvent(type="streaming", message=event.text)  [advanced.py:314-319]

Gradio displays: "📡 **STREAMING**: {garbage}"
```

The garbage tokens are **raw model output**. The 7B model is:
- Not following the system prompt
- Outputting partial/incomplete token sequences
- Possibly attempting tool calls but formatting them incorrectly
- Hallucinating random words

### Evidence from Microsoft Reference Framework

The Microsoft Agent Framework's `_magentic.py` (lines 1717-1741) shows how agent invocation works:

```python
async for update in agent.run_stream(messages=self._chat_history):
    updates.append(update)
    await self._emit_agent_delta_event(ctx, update)
```

The framework passes through whatever the underlying chat client produces. If the model produces garbage, the framework streams it directly.

### Why Click Example vs Submit Shows Different Initial State

Both code paths go through the same `research_agent()` function in `app.py`. The difference:

- **Example click**: Immediately submits the query, so you see garbage quickly
- **Submit button click**: Shows the "Starting research (Advanced mode)" banner first, then garbage

Both ultimately produce the same garbage output from the 7B model.

---

## Impact Assessment

| Aspect | Impact |
|--------|--------|
| Free Tier Users | Cannot get usable research results |
| Demo Quality | Appears broken/unprofessional |
| Trust | Users may think the entire system is broken |
| Differentiation | Undermines "free tier works!" messaging |

---

## Potential Solutions

### Option 1: Switch to Better Small Model (Recommended - Quick Fix)

Find a small model that better handles complex instructions. Candidates:

| Model | Size | Tool Calling | Instruction Following |
|-------|------|--------------|----------------------|
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | Yes | Better |
| `microsoft/Phi-3-mini-4k-instruct` | 3.8B | Limited | Good |
| `google/gemma-2-9b-it` | 9B | Yes | Good |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | Yes | Better |

**Risk**: The 14B model might still be routed to third-party providers. Need to test each.

### Option 2: Simplify Free Tier Architecture

Create a **simpler single-agent mode** for Free Tier:
- Remove multi-agent coordination (Manager, multiple ChatAgents)
- Use a single direct query → search → synthesize flow
- Reduce prompt complexity significantly

**Pros**: More reliable with smaller models
**Cons**: Loses sophisticated multi-agent research capability

### Option 3: Output Filtering/Validation

Add a validation layer to detect and filter garbage output:

```python
def is_valid_streaming_token(text: str) -> bool:
    """Check if streaming token appears valid."""
    # Garbage patterns we've seen
    garbage_patterns = ["yarg", "PostalCodes", "FunctionFlags"]
    if any(g in text for g in garbage_patterns):
        return False
    # Check for minimum coherence (non-empty after stripping whitespace)
    return bool(text.strip())
```

**Pros:** Band-aid fix, quick to implement
**Cons:** Doesn't fix the root cause, will miss new garbage patterns

### Option 4: Graceful Degradation

Detect when model output is incoherent (a minimal sketch follows) and fall back to:
- Returning an error message
- Suggesting the user provide an API key
- Using a cached/templated response

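A minimal sketch of that fallback (the threshold and message wording are placeholders; the token counts would be accumulated in the streaming loop):

```python
def degradation_notice(total_tokens: int, garbage_tokens: int) -> str | None:
    """Return a user-facing fallback message when output is mostly incoherent."""
    if total_tokens == 0:
        return None
    if garbage_tokens / total_tokens > 0.5:  # placeholder threshold
        return (
            "⚠️ The free-tier model produced unreliable output for this query. "
            "Provide an OpenAI or Anthropic API key for higher-quality results."
        )
    return None
```
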
### Option 5: Prompt Engineering for 7B Models

Significantly simplify the agent prompts for 7B compatibility:
- Shorter system prompts
- More explicit step-by-step instructions
- Remove abstract concepts
- Use few-shot examples

### Option 6: Streaming Content Filter (For Symptom B)

Filter raw tool call JSON from streaming output:

```python
def should_stream_content(text: str) -> bool:
    """Filter garbage and raw tool calls from streaming."""
    # Don't stream raw JSON tool calls
    if text.strip().startswith('{"name":'):
        return False
    # Don't stream XML-style tool tags
    if '</tool_call>' in text or '<tool_call>' in text:
        return False
    # Don't stream garbage tokens (extend as needed)
    garbage = ["oleon", "UrlParser", "yarg", "PostalCodes", "FunctionFlags"]
    if any(g in text for g in garbage):
        return False
    return True
```

**Location**: `src/orchestrators/advanced.py` lines 315-322

This would prevent the raw tool call JSON from being shown to users, even if the model produces it.

---

## Recommended Action Plan

### Phase 1: Quick Fix (P2)
1. Test `mistralai/Mistral-7B-Instruct-v0.3` or `Qwen/Qwen2.5-14B-Instruct`
2. Verify they stay on HuggingFace native infrastructure (no third-party routing)
3. Evaluate output quality on sample queries

### Phase 2: Architecture Review (P3)
1. Consider a simplified single-agent mode for Free Tier
2. Design graceful degradation for when model output is invalid
3. Add an output validation layer

### Phase 3: Long-term (P4)
1. Consider a hybrid approach: simple mode for free tier, advanced for paid
2. Explore fine-tuning a small model specifically for research agent tasks

---

## Files Involved

| File | Relevance |
|------|-----------|
| `src/orchestrators/advanced.py` | Main orchestrator, streaming event handling |
| `src/clients/huggingface.py` | HuggingFace chat client adapter |
| `src/agents/magentic_agents.py` | Agent definitions and prompts |
| `src/app.py` | Gradio UI, event display |
| `src/utils/config.py` | Model configuration |

---

## Relation to Previous Bugs

- **P0 Repr Bug (RESOLVED)**: Fixed in PR #117 - Was about `<generator object>` appearing due to async generator mishandling
- **P1 HuggingFace Novita Error (RESOLVED)**: Fixed in PR #118 - Was about 72B models being routed to failing third-party providers

This P2 bug is **downstream** of the P1 fix - we fixed the 500 errors by switching to 7B, but now the 7B model doesn't produce quality output.

---

## Questions to Investigate

1. What models in the 7-20B range stay on HuggingFace native infrastructure?
2. Can we detect third-party routing before making the full request?
3. Is the chat template correct for Qwen2.5-7B? (Some models need specific formatting - a quick check is sketched below)
4. Are there HuggingFace serverless models specifically optimized for tool calling?

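For question 3, a quick local check of the chat template (requires `transformers`; Qwen's template should render ChatML-style `<|im_start|>`/`<|im_end|>` markers):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)  # compare against what the router actually sends
```
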
---

## References

- `HF_FREE_TIER_ANALYSIS.md` - Analysis of HuggingFace provider routing
- `CLAUDE.md` - Critical HuggingFace Free Tier section
- Microsoft Agent Framework `_magentic.py` - Reference implementation

docs/bugs/archive/P2_ADVANCED_MODE_COLD_START_NO_FEEDBACK.md DELETED
@@ -1,255 +0,0 @@
# P2: Advanced Mode Cold Start Has No User Feedback

**Priority**: P2 (UX Friction)
**Component**: `src/orchestrators/advanced.py`
**Status**: ✅ FIXED (All Phases Complete)
**Issue**: [#108](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/issues/108)
**Created**: 2025-12-01

## Summary

When Advanced Mode starts, users experience three significant "dead zones" with no visual feedback:

1. **Initialization delay** (5-15 seconds): Between "STARTED" and "THINKING" events
2. **First LLM call delay** (10-30+ seconds): Between "THINKING" and first "PROGRESS" event
3. **Agent execution delay** (30-90+ seconds): After "PROGRESS" while SearchAgent executes

Users see the UI freeze with no indication of what's happening, leading to confusion about whether the system is working.

## Visual Timeline

```
🚀 STARTED: Starting research (Advanced mode)...
    │
    │ ← DEAD ZONE #1: 5-15 seconds of nothing
    │   - Loading LlamaIndex/ChromaDB
    │   - Initializing embedding service
    │   - Building 4 agents + manager
    │
⏳ THINKING: Multi-agent reasoning in progress...
    │
    │ ← DEAD ZONE #2: 10-30+ seconds of nothing
    │   - Manager agent's first OpenAI API call
    │   - Cold connection to OpenAI
    │
⏱️ PROGRESS: Manager assigning research task...
    │
    │ ← DEAD ZONE #3: 30-90+ seconds of nothing
    │   - SearchAgent executing PubMed/ClinicalTrials/EuropePMC queries
    │   - Embedding and storing results in ChromaDB
    │   - No streaming events during search execution
    │
📊 SEARCH_COMPLETE / PROGRESS: Round 1/5...
```

## Root Cause Analysis

### Dead Zone #1: Initialization (Lines 162-165)

```python
yield AgentEvent(type="started", ...)  # User sees this

# === BLOCKING OPERATIONS (no events yielded) ===
embedding_service = self._init_embedding_service()  # ChromaDB, embeddings
init_magentic_state(query, embedding_service)       # Shared state
workflow = self._build_workflow()                   # 4 agents + manager

yield AgentEvent(type="thinking", ...)  # User finally sees this
```

**What's happening:**
1. `_init_embedding_service()` → Loads LlamaIndex, connects to ChromaDB, initializes OpenAI embeddings
2. `init_magentic_state()` → Creates ResearchMemory, sets up context
3. `_build_workflow()` → Instantiates SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent, Manager

### Dead Zone #2: First LLM Call (Line 206)

```python
yield AgentEvent(type="thinking", ...)  # User sees this

async for event in workflow.run_stream(task):  # BLOCKING until first event
    # Manager makes first OpenAI call here
    # No events until manager responds and starts delegating
```

**What's happening:**
- Microsoft Agent Framework's manager agent receives the task
- Makes a synchronous(ish) call to OpenAI for orchestration planning
- Only after the response does it emit `MagenticOrchestratorMessageEvent`

### Dead Zone #3: Agent Execution (After PROGRESS event)

After "Manager assigning research task...", the SearchAgent executes but emits no events until it completes.

**What's happening:**
- SearchAgent receives the task from the manager
- Executes parallel queries to PubMed, ClinicalTrials.gov, Europe PMC
- Each result is embedded and stored in ChromaDB
- Only after ALL searches complete does it emit `MagenticAgentMessageEvent`

**Why no streaming:**
- The agent's internal tool calls (search APIs, embeddings) don't emit framework events
- Microsoft Agent Framework only emits events at agent message boundaries
- 3 databases × multiple queries × embedding each result = long silent period

**Potential fix:** Add progress callbacks to `SearchAgent` tools:
```python
# In search_agent.py - hypothetical
async def search_pubmed(query: str, on_progress: Callable = None):
    results = await pubmed_client.search(query)
    if on_progress:
        on_progress(f"Found {len(results)} PubMed results")
    # ... embed and store
```

## Impact

1. **User Confusion**: "Is it frozen? Should I refresh?"
2. **Perceived Slowness**: Dead time feels longer than active progress
3. **No Cancel Option**: Users can't abort during these zones
4. **Support Burden**: Users report "it's not working" when it's actually initializing

## Proposed Solutions

### Option A: Granular Initialization Events (Quick Win)

Add progress events during initialization:

```python
yield AgentEvent(type="started", ...)

yield AgentEvent(
    type="progress",
    message="Loading embedding service...",
    iteration=0,
)
embedding_service = self._init_embedding_service()

yield AgentEvent(
    type="progress",
    message="Initializing research memory...",
    iteration=0,
)
init_magentic_state(query, embedding_service)

yield AgentEvent(
    type="progress",
    message="Building agent team (Search, Judge, Hypothesis, Report)...",
    iteration=0,
)
workflow = self._build_workflow()

yield AgentEvent(type="thinking", ...)
```

**Pros**: Simple, immediate feedback
**Cons**: Still sequential, doesn't speed up actual time

### Option B: Parallel Initialization (Performance + UX)

Use `asyncio.gather()` for independent operations:

```python
yield AgentEvent(type="progress", message="Initializing agents...", iteration=0)

# These could potentially run in parallel
embedding_task = asyncio.create_task(self._init_embedding_service_async())
workflow_task = asyncio.create_task(self._build_workflow_async())

embedding_service, workflow = await asyncio.gather(embedding_task, workflow_task)
init_magentic_state(query, embedding_service)
```

**Pros**: Faster initialization, better UX
**Cons**: Need to verify thread safety, more complex

### Option C: Pre-warming / Singleton Services

Initialize expensive services once at app startup, not per-request:

```python
# In app.py startup
global_embedding_service = init_embedding_service()
global_workflow_template = build_workflow_template()

# In orchestrator
workflow = global_workflow_template.clone()  # Fast
```

**Pros**: Near-instant start after first request
**Cons**: Memory overhead, cold start on first request still slow

### Option D: Animated Progress Indicator (UI-Only)

Add a Gradio progress bar or spinner that animates during the dead zones:

```python
# In app.py
with gr.Blocks() as demo:
    progress = gr.Progress()

    async def research(query):
        progress(0.1, desc="Initializing...")
        # ...
        progress(0.2, desc="Building agents...")
```

**Pros**: User sees activity even if there is nothing to report
**Cons**: Doesn't solve the actual blocking, Gradio-specific

## Recommended Approach

**Phase 1 (Quick Win)**: Option A - Add granular events ✅ COMPLETE
**Phase 2 (Performance)**: Option C - Pre-warm services at startup ✅ COMPLETE
**Phase 3 (Polish)**: Option D - Gradio progress bar ✅ COMPLETE

## Related Considerations

### Parallel Agent Orchestration

The current Microsoft Agent Framework runs agents sequentially through the manager. True parallel execution would require:

1. Breaking out of the framework's `run_stream()` pattern
2. Implementing our own parallel task dispatch
3. Managing agent coordination manually

This is a larger architectural change (P1 scope) and should be tracked separately if desired.

## Files to Modify

1. `src/orchestrators/advanced.py:155-210` - Add initialization events in `run()` method
2. `src/utils/service_loader.py` - Pre-warming logic
3. `src/app.py` - Gradio progress integration

## Testing the Issue

```python
import asyncio
import time

from src.orchestrators.advanced import AdvancedOrchestrator


async def test():
    orch = AdvancedOrchestrator(max_rounds=3)
    start = time.time()
    async for event in orch.run("test query"):
        elapsed = time.time() - start
        print(f"[{elapsed:.1f}s] {event.type}: {event.message[:50]}...")
        if event.type == "complete":
            break


asyncio.run(test())
```

Expected output showing the gaps:
```
[0.0s] started: Starting research (Advanced mode)...
[8.2s] thinking: Multi-agent reasoning in progress...   ← 8 second gap!
[22.5s] progress: Manager assigning research task...    ← 14 second gap!
```

## References

- Advanced orchestrator: `src/orchestrators/advanced.py`
- Embedding service loader: `src/utils/service_loader.py`
- LlamaIndex RAG: `src/services/llamaindex_rag.py`
- Microsoft Agent Framework: `agent-framework-core`

docs/bugs/archive/P2_ARCHITECTURAL_BYOK_GAPS.md DELETED
@@ -1,100 +0,0 @@
# P2 Architectural: BYOK Gaps in Non-Critical Paths

**Date**: 2025-12-03
**Status**: ✅ RESOLVED
**Severity**: P2 (Architectural Debt)
**Component**: LLM Routing / BYOK Support
**Resolution**: Fixed end-to-end BYOK support in this PR

---

## Summary

Two code paths do NOT support BYOK (Bring Your Own Key) from Gradio:

1. **HierarchicalOrchestrator** - Doesn't receive the `api_key` parameter
2. **get_model() (PydanticAI)** - Only checks env vars, no BYOK

These are **latent bugs** - they don't affect the main user flow currently.

---

## Bug 1: HierarchicalOrchestrator Missing api_key

**Location**: `src/orchestrators/factory.py:61-64`

```python
if effective_mode == "hierarchical":
    from src.orchestrators.hierarchical import HierarchicalOrchestrator
    return HierarchicalOrchestrator(config=effective_config, domain=domain)
    # BUG: api_key is NOT passed to HierarchicalOrchestrator
```

**Impact**: If hierarchical mode were exposed in the UI, BYOK would not work.

**Current State**: Hierarchical mode is NOT exposed in the Gradio UI, so this is latent.

**Fix**: Pass `api_key` to HierarchicalOrchestrator when instantiating, as sketched below.
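
A one-line sketch of the pass-through (assuming the constructor accepts `api_key`):

```python
if effective_mode == "hierarchical":
    from src.orchestrators.hierarchical import HierarchicalOrchestrator
    # Forward the BYOK key alongside the existing arguments
    return HierarchicalOrchestrator(config=effective_config, domain=domain, api_key=api_key)
```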

---

## Bug 2: get_model() Doesn't Support BYOK

**Location**: `src/agent_factory/judges.py:62-91` (function `get_model()`)

```python
def get_model() -> Any:
    # Priority 1: OpenAI
    if settings.has_openai_key:  # Only checks ENV VAR
        ...
    # Priority 2: Anthropic
    if settings.has_anthropic_key:  # Only checks ENV VAR
        ...
    # Priority 3: HuggingFace
    if settings.has_huggingface_key:  # Only checks ENV VAR
        ...
```

**Impact**: PydanticAI-based components (judges, statistical analyzer) cannot use BYOK keys.

**Current State**: The main Advanced mode flow uses `get_chat_client()` (Microsoft Agent Framework), NOT `get_model()`. So this is latent.

**Fix**: Either:
1. Add an `api_key` parameter to `get_model()` (sketched below)
2. Or deprecate `get_model()` in favor of `get_chat_client()` everywhere
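
A sketch of option 1, reusing the key-prefix detection that `get_chat_client()` already performs (model ids and constructor kwargs here are placeholders, not the real signatures):

```python
def get_model(api_key: str | None = None) -> Any:
    if api_key:
        # BYOK first: route by key prefix, mirroring src/clients/factory.py
        if api_key.startswith("sk-ant-"):
            return AnthropicModel("claude-placeholder", api_key=api_key)
        if api_key.startswith("sk-"):
            return OpenAIChatModel("gpt-placeholder", api_key=api_key)
        if api_key.startswith("hf_"):
            return HuggingFaceModel("hf-placeholder", api_key=api_key)
    # Otherwise fall back to the existing env-var priority chain
    ...
```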

---

## Architecture Notes

The codebase has **TWO separate LLM routing systems**:

| System | Function | BYOK Support | Used By |
|--------|----------|--------------|---------|
| Microsoft Agent Framework | `get_chat_client()` | **YES** (key prefix detection) | Advanced mode (main flow) |
| PydanticAI | `get_model()` | **NO** (env vars only) | Judges, statistical analyzer |

This dual-system architecture creates confusion and maintenance burden.

---

## Recommendation

**Short-term**: Leave as-is (latent, not blocking)

**Long-term**: Unify on `get_chat_client()` and deprecate `get_model()` (see P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md for related cleanup)

---

## Test Results

- All 310 unit tests pass
- Main user flow (Gradio → Advanced) works with BYOK

---

## Related Documents

- `P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md` - Related architecture cleanup
- `src/clients/factory.py` - BYOK-capable factory (correct implementation)
- `src/agent_factory/judges.py` - Non-BYOK factory (needs fix)

docs/bugs/archive/P2_DUPLICATE_REPORT_CONTENT.md DELETED
@@ -1,151 +0,0 @@
# P2 Bug: Duplicate Report Content in Output

**Date**: 2025-12-03
**Status**: FIXED (PR fix/p2-double-bug-squash)
**Severity**: P2 (UX - Duplicate content confuses users)
**Component**: `src/orchestrators/advanced.py`
**Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)

---

## Executive Summary

This is a **confirmed stack bug**, NOT a model limitation. The duplicate report appears because:

1. Streaming events yield the full report content character-by-character
2. Final events (`MagenticFinalResultEvent`/`WorkflowOutputEvent`) contain the SAME content
3. No deduplication exists between streamed content and final event content
4. Both are appended to the output

---

## Symptom

The final research report appears **twice** in the UI output:
1. First as streaming content (with `📡 **STREAMING**:` prefix)
2. Then again as a complete event (without prefix)

---

## Root Cause

The `_process_event()` method handles final events but has **no access to buffer state**. The buffer was already cleared at line 337 before these events arrive.

```python
# Line 337: Buffer cleared
current_message_buffer = ""
continue

# Line 341: Final events processed WITHOUT buffer context
agent_event = self._process_event(event, iteration)  # No buffer info!
```

---

## The Fix (Consensus: Stateful Orchestrator Logic)

**Location**: `src/orchestrators/advanced.py` `run()` method

**Strategy**: Handle final events **inline in the run() loop** where buffer state exists. Track streaming volume to decide whether to re-emit content.

### Why This Is Correct

| Rejected Approach | Why Wrong |
|-------------------|-----------|
| UI-side string comparison | Wrong layer, fragile, treats the symptom |
| Stateless `_process_event` fix | No state = can't know if streaming occurred |
| **Stateful run() loop** | ✅ Only place with full lifecycle visibility |

The `run()` loop is the **single source of truth** for the request lifecycle. It "saw" the content stream out. It must decide whether to re-emit.

### Implementation

```python
# In run() method, add tracking variable after line 302:
last_streamed_length: int = 0

# Before clearing buffer at line 337, save its length:
last_streamed_length = len(current_message_buffer)
current_message_buffer = ""
continue

# Replace lines 340-345 with inline handling of final events:
if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
    final_event_received = True

    # DECISION: Did we stream substantial content?
    if last_streamed_length > 100:
        # YES: Final event is a SIGNAL, not a payload
        yield AgentEvent(
            type="complete",
            message="Research complete.",
            data={"iterations": iteration, "streamed_chars": last_streamed_length},
            iteration=iteration,
        )
    else:
        # NO: Final event must carry the payload (tool-only turn, cache hit)
        if isinstance(event, MagenticFinalResultEvent):
            text = self._extract_text(event.message) if event.message else "No result"
        else:  # WorkflowOutputEvent
            text = self._extract_text(event.data) if event.data else "Research complete"
        yield AgentEvent(
            type="complete",
            message=text,
            data={"iterations": iteration},
            iteration=iteration,
        )
    continue

# Keep existing fallback for other events:
agent_event = self._process_event(event, iteration)
```

### Why a Threshold of 100 Chars?

- `> 0` is too aggressive (might catch single-word streams)
- `> 500` is too conservative (might miss short but complete responses)
- `> 100` distinguishes "real content was streamed" from "just status messages"

---

## Edge Cases Handled

| Scenario | `last_streamed_length` | Action |
|----------|------------------------|--------|
115
- | Normal streaming report | 5000+ | Emit "Research complete." |
116
- | Tool call, no text | 0 | Emit full content from final event |
117
- | Very short response | 50 | Emit full content (fallback) |
118
- | Agent switch mid-stream | Reset on switch | Tracks only final agent |
119
-
120
- ---
121
-
122
- ## Files to Modify
123
-
124
- | File | Lines | Change |
125
- |------|-------|--------|
126
- | `src/orchestrators/advanced.py` | 296-345 | Add `last_streamed_length`, handle final events inline |
127
- | `src/orchestrators/advanced.py` | 532-552 | Optional: remove dead code from `_process_event()` |
128
-
129
- ---
130
-
131
- ## Test Plan
132
-
133
- 1. **Happy Path**: Run query, verify report appears ONCE (sketched below)
134
- 2. **Fallback**: Mock tool-only turn (no streaming), verify full content emitted
135
- 3. **Both Tiers**: Test Free Tier and Paid Tier
136
-
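- A hedged sketch of the happy-path check (the fixture and mocked workflow are assumptions; event shapes follow the implementation above):
-
- ```python
- import pytest
-
- @pytest.mark.asyncio
- async def test_report_emitted_once(orchestrator_with_mock_workflow):
-     # Hypothetical fixture: the mocked workflow streams >100 chars, then fires a final event
-     events = [e async for e in orchestrator_with_mock_workflow.run("test query")]
-     completes = [e for e in events if e.type == "complete"]
-     assert len(completes) == 1
-     # With substantial streaming, the final event is a signal, not a payload
-     assert completes[0].message == "Research complete."
- ```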
137
- ---
138
-
139
- ## Validation
140
-
141
- This fix was independently validated by two AI agents (Claude and Gemini) analyzing the architecture. Both concluded:
142
-
143
- > "The Stateful Orchestrator Fix is the correct engineering solution. The 'Source of Truth' is the Orchestrator's runtime state."
144
-
145
- ---
146
-
147
- ## Related
148
-
149
- - **Not related to model quality** - This is a stack bug
150
- - P1 Free Tier fix enabled streaming, exposing this bug
151
- - SPEC-17 Accumulator Pattern addressed repr bug but created this side effect
 
docs/bugs/archive/P2_EXECUTOR_COMPLETED_EVENT_UI_NOISE.md DELETED
@@ -1,351 +0,0 @@
1
- # P2 Bug: ExecutorCompletedEvent UI Noise
2
-
3
- **Status**: VALIDATED - Ready for Implementation
4
- **Discovered**: 2025-12-05
5
- **Senior Review**: 2025-12-05 (External agent audit confirmed analysis)
6
- **Severity**: P2 (UX noise, confusing but not blocking)
7
- **Component**: `src/orchestrators/advanced.py`
8
-
9
- ---
10
-
11
- ## Symptom
12
-
13
- After the report synthesis completes, extra events appear in the UI:
14
-
15
- ```text
16
- 📝 **SYNTHESIZING**: Synthesizing research findings...
17
- [...full report content...]
18
-
19
- 🧠 **JUDGING**: ManagerAgent: Action completed (Tool Call)
20
- ⏱️ **PROGRESS**: Step 11: ManagerAgent task completed
21
- ```
22
-
23
- The "JUDGING" and "PROGRESS" events appear AFTER the report is already displayed, creating confusion.
24
-
25
- ---
26
-
27
- ## Root Cause Analysis
28
-
29
- ### The Misunderstanding
30
-
31
- We're treating `ExecutorCompletedEvent` as a **UI event** when it's actually an **internal framework bookkeeping event**.
32
-
33
- ### Microsoft Agent Framework Design
34
-
35
- Looking at `agent_framework/_workflows/_executor.py` (lines 266-281):
36
-
37
- ```python
38
- # This is auto-emitted by the framework - NOT for UI consumption
39
- with _framework_event_origin():
40
- completed_event = ExecutorCompletedEvent(self.id, sent_messages if sent_messages else None)
41
- await context.add_event(completed_event)
42
- ```
43
-
44
- The framework emits `ExecutorCompletedEvent` automatically after every executor handler completes. This includes:
45
- - SearchAgent completing a search
46
- - JudgeAgent completing evaluation
47
- - ReportAgent completing synthesis
48
- - **ManagerAgent completing coordination** (this is the problem)
49
-
50
- ### What the MS Framework Sample Does
51
-
52
- From `samples/getting_started/workflows/orchestration/magentic.py`:
53
-
54
- ```python
55
- async for event in workflow.run_stream(task):
56
- if isinstance(event, AgentRunUpdateEvent):
57
- # Handle streaming with metadata
58
- props = event.data.additional_properties if event.data else None
59
- event_type = props.get("magentic_event_type") if props else None
60
- # ...
61
- elif isinstance(event, WorkflowOutputEvent):
62
- # Handle final output
63
- output = output_messages[-1].text
64
- ```
65
-
66
- They only handle:
67
- 1. `AgentRunUpdateEvent` - for streaming content (with `magentic_event_type` metadata)
68
- 2. `WorkflowOutputEvent` - for final output
69
-
70
- **They do NOT emit UI events for `ExecutorCompletedEvent`.**
71
-
72
- ### Our Problematic Code
73
-
74
- In `src/orchestrators/advanced.py`:
75
-
76
- ```python
77
- # Line 348-368: We emit UI events for EVERY ExecutorCompletedEvent
78
- if isinstance(event, ExecutorCompletedEvent):
79
- state.iteration += 1
80
-
81
- comp_event, prog_event = self._handle_completion_event(...)
82
- yield comp_event # <-- WRONG: UI event for internal framework event
83
- yield prog_event # <-- WRONG: More noise
84
- ```
85
-
86
- ### Why the Manager Fires a Completion Event
87
-
88
- The workflow execution order:
89
- 1. ReportAgent streams its output (`AgentRunUpdateEvent`)
90
- 2. ReportAgent handler completes → `ExecutorCompletedEvent(reporter)` (we display this)
91
- 3. Manager orchestrator handler completes → `ExecutorCompletedEvent(manager)` (we display this too!)
92
- 4. `WorkflowOutputEvent` (final)
93
-
94
- The Manager is also an executor in the framework. When it finishes coordinating (after ReportAgent returns), it fires its own `ExecutorCompletedEvent`. We're incorrectly emitting UI events for this.
95
-
96
- ---
97
-
98
- ## Impact
99
-
100
- 1. **User Confusion**: Extra "JUDGING: ManagerAgent" events after the report
101
- 2. **UX Noise**: Progress events that don't add value
102
- 3. **Incorrect Semantics**: Manager completions displayed as agent activity
103
- 4. **No Functional Bug**: The workflow completes correctly, just noisy
104
-
105
- ---
106
-
107
- ## The Fix
108
-
109
- ### Stop Emitting UI Events for ExecutorCompletedEvent
110
-
111
- Remove UI event emission for `ExecutorCompletedEvent` entirely. Keep internal state tracking only.
112
-
113
- **Before (buggy):**
114
-
115
- ```python
116
- if isinstance(event, ExecutorCompletedEvent):
117
- state.iteration += 1
118
- agent_name = getattr(event, "executor_id", "") or "unknown"
119
- if REPORTER_AGENT_ID in agent_name.lower():
120
- state.reporter_ran = True
121
-
122
- comp_event, prog_event = self._handle_completion_event(...)
123
- yield comp_event # <-- REMOVE: Emits UI noise
124
- yield prog_event # <-- REMOVE: Emits UI noise
125
- ```
126
-
127
- **After (correct):**
128
-
129
- ```python
130
- if isinstance(event, ExecutorCompletedEvent):
131
- # Internal state tracking only - NO UI events
132
- agent_name = getattr(event, "executor_id", "") or "unknown"
133
- if REPORTER_AGENT_ID in agent_name.lower():
134
- state.reporter_ran = True
135
- state.current_message_buffer = ""
136
- continue # Skip to next event - do not yield anything
137
- ```
138
-
139
- **Key changes:**
140
- 1. Remove `yield comp_event` and `yield prog_event`
141
- 2. Remove `state.iteration += 1` (iteration counter becomes meaningless without UI events)
142
- 3. Keep `state.reporter_ran` tracking (needed for fallback synthesis logic)
143
- 4. Add `continue` to skip to next event
144
-
145
- **Why this is correct:**
146
- - Aligns with MS framework design (their sample ignores `ExecutorCompletedEvent`)
147
- - Eliminates all completion noise including trailing "ManagerAgent" events
148
- - The streaming events (`AgentRunUpdateEvent`) already provide real-time feedback
149
- - `WorkflowOutputEvent` signals completion
150
-
151
- ### Additional Fix: Add Metadata Filtering to AgentRunUpdateEvent
152
-
153
- The senior review identified a gap: we're not filtering `AgentRunUpdateEvent` by `magentic_event_type`.
154
-
155
- **Current (incomplete):**
156
-
157
- ```python
158
- if isinstance(event, AgentRunUpdateEvent):
159
- if event.data and hasattr(event.data, "text") and event.data.text:
160
- yield AgentEvent(type="streaming", message=event.data.text)
161
- ```
162
-
163
- **Should be:**
164
-
165
- ```python
166
- if isinstance(event, AgentRunUpdateEvent):
167
- if event.data and hasattr(event.data, "text") and event.data.text:
168
- # Check metadata to filter internal orchestrator messages
169
- props = getattr(event.data, "additional_properties", None) or {}
170
- event_type = props.get("magentic_event_type")
171
- msg_kind = props.get("orchestrator_message_kind")
172
-
173
- # Filter out internal orchestrator messages (task_ledger, instruction)
174
- if event_type == MAGENTIC_EVENT_TYPE_ORCHESTRATOR:
175
- if msg_kind in ("task_ledger", "instruction"):
176
- continue # Skip internal coordination messages
177
-
178
- yield AgentEvent(type="streaming", message=event.data.text)
179
- ```
180
-
181
- **Why this matters:**
182
- - Prevents internal JSON blobs from being displayed
183
- - Filters out raw planning/instruction prompts not meant for users
184
- - Aligns with how MS sample consumes events
185
-
186
- ---
187
-
188
- ## Related Code Locations
189
-
190
- - `src/orchestrators/advanced.py` line 348-368: ExecutorCompletedEvent handling
191
- - `src/orchestrators/advanced.py` line 437-469: `_handle_completion_event` method
192
- - MS Framework: `python/packages/core/agent_framework/_workflows/_executor.py` line 277-281
193
- - MS Framework: `python/packages/core/agent_framework/_workflows/_magentic.py` line 1962-1976
194
-
195
- ---
196
-
197
- ## Related Issues
198
-
199
- - P2 Round Counter Semantic Mismatch (FIXED) - Changed display from "Round X/Y" to "Step N"
200
- - This bug explains why step count was confusing - we count internal events too
201
-
202
- ---
203
-
204
- ## Framework Event Architecture Deep Dive
205
-
206
- ### Event Categories in MS Agent Framework
207
-
208
- The framework has distinct event categories with different purposes:
209
-
210
- #### 1. Workflow Lifecycle Events (Framework-emitted, internal)
211
-
212
- | Event | Purpose | UI Relevant? |
213
- |-------|---------|--------------|
214
- | `WorkflowStartedEvent` | Run begins | No |
215
- | `WorkflowStatusEvent` | State transitions (IN_PROGRESS, IDLE, FAILED) | No |
216
- | `WorkflowFailedEvent` | Error with structured details | Maybe (errors) |
217
-
218
- #### 2. Superstep Events (Framework-emitted, internal)
219
-
220
- | Event | Purpose | UI Relevant? |
221
- |-------|---------|--------------|
222
- | `SuperStepStartedEvent` | Pregel superstep begins | No |
223
- | `SuperStepCompletedEvent` | Pregel superstep ends | No |
224
-
225
- #### 3. Executor Events (Framework-emitted automatically, internal)
226
-
227
- | Event | Purpose | UI Relevant? |
228
- |-------|---------|--------------|
229
- | `ExecutorInvokedEvent` | Handler starts | No |
230
- | `ExecutorCompletedEvent` | Handler completes | **NO** |
231
- | `ExecutorFailedEvent` | Handler errors | Maybe (errors) |
232
-
233
- #### 4. Application Events (User-code emitted via ctx.add_event, UI-facing)
234
-
235
- | Event | Purpose | UI Relevant? |
236
- |-------|---------|--------------|
237
- | `AgentRunUpdateEvent` | Streaming content | **YES** |
238
- | `AgentRunEvent` | Complete agent response | Yes |
239
- | `WorkflowOutputEvent` | Final workflow output | **YES** |
240
- | `RequestInfoEvent` | HITL request | Yes |
241
-
242
- ### Metadata Pattern in AgentRunUpdateEvent
243
-
244
- The MS framework uses `additional_properties` in `AgentRunUpdateEvent.data` for classification:
245
-
246
- ```python
247
- # Orchestrator message
248
- additional_properties={
249
- "magentic_event_type": "orchestrator_message",
250
- "orchestrator_message_kind": "user_task" | "task_ledger" | "instruction" | "notice",
251
- "orchestrator_id": "...",
252
- }
253
-
254
- # Agent streaming
255
- additional_properties={
256
- "magentic_event_type": "agent_delta",
257
- "agent_id": "searcher" | "judge" | ...,
258
- }
259
- ```
260
-
261
- ### What We Should Handle for UI
262
-
263
- 1. **`AgentRunUpdateEvent`** with metadata filtering:
264
- - `magentic_event_type: "agent_delta"` → Display agent streaming
265
- - `magentic_event_type: "orchestrator_message"` → Filter by `orchestrator_message_kind`:
266
- - `"user_task"` → Show (task assignment)
267
- - `"instruction"` → Filter out (internal)
268
- - `"task_ledger"` → Filter out (internal)
269
- - `"notice"` → Maybe show (warnings)
270
-
271
- 2. **`WorkflowOutputEvent`** → Final output (combined with rule 1 in the sketch below)
272
-
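- Putting the two rules together, the consuming loop reduces to roughly this shape (`extract_text` is an illustrative helper; attribute and constant names follow the snippets above):
-
- ```python
- async for event in workflow.run_stream(task):
-     if isinstance(event, AgentRunUpdateEvent) and event.data and getattr(event.data, "text", None):
-         props = getattr(event.data, "additional_properties", None) or {}
-         if props.get("magentic_event_type") == MAGENTIC_EVENT_TYPE_ORCHESTRATOR and props.get(
-             "orchestrator_message_kind"
-         ) in ("task_ledger", "instruction"):
-             continue  # internal coordination noise
-         yield AgentEvent(type="streaming", message=event.data.text)
-     elif isinstance(event, WorkflowOutputEvent):
-         yield AgentEvent(type="complete", message=extract_text(event))
-     # ExecutorCompletedEvent and other bookkeeping events fall through silently
- ```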
273
- ### What We Should NOT Handle for UI
274
-
275
- - `ExecutorCompletedEvent` - Internal bookkeeping
276
- - `ExecutorInvokedEvent` - Internal bookkeeping
277
- - `SuperStepStartedEvent/CompletedEvent` - Internal iteration
278
- - `WorkflowStatusEvent` - Internal state machine
279
-
280
- ---
281
-
282
- ## Required Import Changes
283
-
284
- **Current imports:**
285
-
286
- ```python
287
- from agent_framework import (
288
- MAGENTIC_EVENT_TYPE_ORCHESTRATOR,
289
- AgentRunUpdateEvent,
290
- ExecutorCompletedEvent, # Keep for internal tracking
291
- MagenticBuilder,
292
- WorkflowOutputEvent,
293
- )
294
- ```
295
-
296
- **Add these imports for metadata filtering:**
297
-
298
- ```python
299
- from agent_framework import (
300
- MAGENTIC_EVENT_TYPE_AGENT_DELTA, # For agent streaming detection
301
- ORCH_MSG_KIND_INSTRUCTION, # Filter internal messages
302
- ORCH_MSG_KIND_TASK_LEDGER, # Filter internal messages
303
- )
304
- ```
305
-
306
- ---
307
-
308
- ## Test Cases
309
-
310
- ```python
311
- def test_no_executor_completed_events_in_ui():
312
- """UI should not emit any events from ExecutorCompletedEvent."""
313
- # Run workflow to completion
314
- # Collect all yielded AgentEvent objects
315
- # Assert NONE have type "progress" with "task completed" message
316
- # Assert NONE have type matching completion patterns
317
- pass
318
-
319
- def test_internal_messages_filtered_from_streaming():
320
- """Internal orchestrator messages should be filtered from UI stream."""
321
- # Run workflow and collect all yielded events
322
- # Assert no events contain "task_ledger" content
323
- # Assert no events contain raw instruction prompts
324
- # Assert no JSON blobs in streaming output
325
- pass
326
-
327
- def test_reporter_ran_tracking_still_works():
328
- """Internal state.reporter_ran should still be set correctly."""
329
- # Run workflow to completion
330
- # Verify fallback synthesis is NOT triggered (reporter did run)
331
- # This ensures we didn't break internal tracking when removing UI events
332
- pass
333
- ```
334
-
335
- ---
336
-
337
- ## Why the Free Tier "Works"
338
-
339
- The user asked why the free tier seems to work despite expectations. The answer:
340
-
341
- 1. **The framework handles orchestration** - The MS Agent Framework manages the workflow (planning, progress tracking, agent coordination)
342
- 2. **The LLM just provides reasoning** - The model generates text, but the framework decides when to delegate, when to stop, etc.
343
- 3. **The "bugs" are in our UI layer** - The orchestration works correctly; we're just displaying internal events
344
-
345
- The free tier works because:
346
- - `MagenticBuilder` creates the workflow graph
347
- - `StandardMagenticManager` handles planning and progress evaluation
348
- - The framework routes messages between executors
349
- - The LLM quality affects answer quality, not workflow execution
350
-
351
- Our UI noise (trailing events) is a bug in how we consume framework events, not a framework bug.
 
docs/bugs/archive/P2_FIRST_TURN_TIMEOUT.md DELETED
@@ -1,160 +0,0 @@
1
- # P2 Bug: First Agent Turn Exceeds Workflow Timeout
2
-
3
- **Date**: 2025-12-03
4
- **Status**: FIXED (PR fix/p2-double-bug-squash)
5
- **Severity**: P2 (UX - Workflow always times out on complex queries)
6
- **Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py`
7
- **Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
8
-
9
- ---
10
-
11
- ## Executive Summary
12
-
13
- The search agent's first turn can exceed the 5-minute workflow timeout, causing:
14
- 1. `iterations=0` at timeout (no agent completed a turn)
15
- 2. `_handle_timeout()` synthesizes from partial evidence
16
- 3. Users get incomplete research results
17
-
18
- This is a **performance/architecture bug**, not a model issue.
19
-
20
- ---
21
-
22
- ## Symptom
23
-
24
- ```
25
- [warning] Workflow timed out iterations=0
26
- ```
27
-
28
- The workflow times out with `iterations=0` - meaning the first agent (search agent) never completed its turn before the 5-minute timeout.
29
-
30
- ---
31
-
32
- ## Root Cause
33
-
34
- The search agent's first turn is **extremely expensive**:
35
-
36
- ```
37
- Search Agent First Turn:
38
- ├── Manager assigns task
39
- ├── Search agent starts
40
- │ ├── Calls PubMed search tool (10 results)
41
- │ ├── Calls ClinicalTrials search tool (10 results)
42
- │ ├── Calls EuropePMC search tool (10 results)
43
- │ └── For EACH result (30 total):
44
- │ ├── Generate embedding (OpenAI API call)
45
- │ ├── Check for duplicates (ChromaDB query)
46
- │ └── Store in ChromaDB
47
-
48
- │ TOTAL: 30 results × (embedding + dedup + store) = 90+ API/DB operations
49
-
50
- └── Agent turn completes (if timeout hasn't fired)
51
- ```
52
-
53
- **The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget.
54
-
55
- ---
56
-
57
- ## Impact
58
-
59
- | Aspect | Impact |
60
- |--------|--------|
61
- | UX | Queries always timeout on first turn |
62
- | Research quality | Synthesis happens on partial evidence |
63
- | Confusion | `iterations=0` looks like nothing happened |
64
-
65
- ---
66
-
67
- ## The Fix (Consensus)
68
-
69
- **Reduce work per turn + increase timeout budget.**
70
-
71
- ### Implementation
72
-
73
- **1. Reduce results per tool (immediate)**
74
-
75
- `src/agents/search_agent.py` line 70:
76
- ```python
77
- # Change from 10 to 5
78
- result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
79
- ```
80
-
81
- **2. Increase workflow timeout (immediate)**
82
-
83
- `src/utils/config.py`:
84
- ```python
85
- advanced_timeout: float = Field(
86
- default=600.0, # Was 300.0 (5 min), now 10 min
87
- ge=60.0,
88
- le=900.0,
89
- description="Timeout for Advanced mode in seconds",
90
- )
91
- ```
92
-
93
- ### Why NOT Per-Turn Timeout
94
-
95
- **DANGER**: The SearchHandler uses `asyncio.gather()`:
96
-
97
- ```python
98
- # src/tools/search_handler.py line 163-164
99
- results = await asyncio.gather(*tasks, return_exceptions=True)
100
- ```
101
-
102
- This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` and the timeout fires, you get **zero results**, not partial results.
103
-
104
- ```python
105
- # DON'T DO THIS - yields nothing on timeout
106
- async with asyncio.timeout(60):
107
- result = await self._handler.execute(query) # Cancelled = zero results
108
- ```
109
-
110
- Per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work).
111
-
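- For contrast, a minimal sketch of what partial-result harvesting could look like once per-turn budgets exist (`asyncio.wait`, unlike `gather`, lets you keep the tasks that finished):
-
- ```python
- # Sketch only: assumes one asyncio.Task per search tool, created via asyncio.create_task
- done, pending = await asyncio.wait(tasks, timeout=60)
- for task in pending:
-     task.cancel()  # drop the stragglers past the per-turn budget
- partial = [t.result() for t in done if t.exception() is None]
- ```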
112
- ---
113
-
114
- ## Future Work (Streaming Evidence Ingestion)
115
-
116
- For proper fix, `SearchHandler.execute()` should:
117
- 1. Yield results as they arrive (async generator)
118
- 2. Support cancellation with partial results
119
- 3. Allow agent to return "what we have so far" on timeout
120
-
121
- ```python
122
- # Future architecture
123
- async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]:
124
- for tool in self.tools:
125
- async for evidence in tool.search_streaming(query):
126
- yield evidence # Can be cancelled at any point
127
- ```
128
-
129
- This is out of scope for the immediate fix.
130
-
131
- ---
132
-
133
- ## Test Plan
134
-
135
- 1. Run query with 10-minute timeout
136
- 2. Verify first agent turn completes before timeout
137
- 3. Verify `iterations >= 1` at workflow end
138
-
139
- ---
140
-
141
- ## Verification Data
142
-
143
- From diagnostic run:
144
- ```
145
- === RAW FRAMEWORK EVENTS ===
146
- MagenticAgentDeltaEvent: 284
147
- MagenticOrchestratorMessageEvent: 3
148
- ...
149
- NO MagenticAgentMessageEvent ← Agent never completed a turn!
150
-
151
- [warning] Workflow timed out iterations=0
152
- ```
153
-
154
- ---
155
-
156
- ## Related
157
-
158
- - P2 Duplicate Report Bug (separate issue, happens after successful completion)
159
- - `_handle_timeout()` correctly synthesizes, but with partial evidence
160
- - Not related to model quality - this is infrastructure/performance
 
docs/bugs/archive/P2_GRADIO_EXAMPLE_NOT_FILLING.md DELETED
@@ -1,68 +0,0 @@
1
- # P2 Bug Report: Third Example Not Filling Chat Box
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Priority:** P2 (UX issue)
6
- - **Component:** `src/app.py` - Gradio examples
7
- - **Resolution:** FIXED in commit `2ea01fd`
8
-
9
- ---
10
-
11
- ## Symptoms
12
-
13
- When clicking the third example in the Gradio UI:
14
- - **Example 1** (female libido): ✅ Fills chat box correctly
15
- - **Example 2** (ED alternatives): ✅ Fills chat box correctly
16
- - **Example 3** (HSDD testosterone): ❌ Does NOT fill chat box
17
-
18
- ### User Experience
19
- User clicks example → nothing happens → confusion
20
-
21
- ---
22
-
23
- ## Root Cause Hypothesis
24
-
25
- The third example contains parentheses and an abbreviation:
26
- ```
27
- "Testosterone therapy for HSDD (Hypoactive Sexual Desire Disorder)?"
28
- ```
29
-
30
- Possible causes:
31
- 1. **Parentheses** - Gradio may have parsing issues with `(...)` in example text
32
- 2. **Text length** - When expanded, this is the longest example
33
- 3. **Special characters** - The combination of abbreviation + parenthetical may confuse Gradio's example caching
34
-
35
- ---
36
-
37
- ## The Fix
38
-
39
- Simplify the example text - expand the abbreviation and remove parentheses:
40
-
41
- ```python
42
- # Before (broken)
43
- "Testosterone therapy for HSDD (Hypoactive Sexual Desire Disorder)?"
44
-
45
- # After (fixed)
46
- "Testosterone therapy for Hypoactive Sexual Desire Disorder?"
47
- ```
48
-
49
- This:
50
- 1. Removes problematic parentheses
51
- 2. Makes the text more readable (no cut-off abbreviation)
52
- 3. Spares users from needing to know what HSDD stands for
53
-
54
- ---
55
-
56
- ## Test Plan
57
-
58
- - [ ] Change example text in `src/app.py`
59
- - [ ] Deploy to HuggingFace Space
60
- - [ ] Verify all 3 examples fill chat box correctly
61
- - [ ] `make check` passes
62
-
63
- ---
64
-
65
- ## Related
66
-
67
- - Gradio ChatInterface example caching behavior
68
- - Similar to P0 example caching crash (but different manifestation)
 
docs/bugs/archive/P2_ROUND_COUNTER_SEMANTIC_MISMATCH.md DELETED
@@ -1,321 +0,0 @@
1
- # P2 Bug: Round Counter Semantic Mismatch
2
-
3
- **Status**: ✅ FIXED
4
- **Discovered**: 2025-12-05
5
- **Fixed**: 2025-12-05
6
- **Severity**: P2 (Display bug, confusing UX but not blocking)
7
- **Component**: `src/orchestrators/advanced.py`
8
- **Commit**: `40ca236c refactor(orchestrator): implement semantic progress tracking`
9
-
10
- ---
11
-
12
- ## Symptom
13
-
14
- Progress display shows impossible values like "Round 11/5":
15
-
16
- ```text
17
- ⏱️ **PROGRESS**: Round 11/5 (~0s remaining)
18
- ```
19
-
20
- This is confusing to users - how can we be on round 11 when max is 5?
21
-
22
- ---
23
-
24
- ## Root Cause Analysis
25
-
26
- ### The Semantic Mismatch
27
-
28
- Two different concepts are being conflated:
29
-
30
- | Concept | What It Means | Variable |
31
- |---------|---------------|----------|
32
- | **Workflow Round** | One orchestration cycle where manager delegates to agents | `self._max_rounds` (5) |
33
- | **Agent Completion** | One agent finishes its task | `state.iteration` (incremented on each `ExecutorCompletedEvent`) |
34
-
35
- ### The Bug
36
-
37
- ```python
38
- # Line 348: Increments on EVERY agent completion
39
- if isinstance(event, ExecutorCompletedEvent):
40
- state.iteration += 1
41
-
42
- # Line 467: Displays as if it's a workflow round
43
- message=f"Round {iteration}/{self._max_rounds} (~{est_display} remaining)"
44
- ```
45
-
46
- ### Why It Happens
47
-
48
- In a multi-agent workflow with 4 agents (searcher, hypothesizer, judge, reporter):
49
-
50
- - Each "round" involves the manager delegating to multiple agents
51
- - Each agent completion fires an `ExecutorCompletedEvent`
52
- - With 4+ agents, we see 4+ events per workflow round
53
-
54
- **Math**: 5 workflow rounds × 4 agents = 20+ agent completions, displayed as "Round 20/5"
55
-
56
- ---
57
-
58
- ## Evidence From Logs
59
-
60
- The session showed this progression:
61
-
62
- ```text
63
- Round 1/5 - First agent completed
64
- Round 2/5 - Second agent completed
65
- Round 3/5 - Third agent completed
66
- Round 4/5 - Fourth agent completed
67
- Round 5/5 - Fifth agent completed (still in workflow round 1!)
68
- Round 6/5 - Now exceeds max (workflow round 2 starting)
69
- ...
70
- Round 11/5 - Multiple workflow rounds have passed
71
- ```
72
-
73
- ---
74
-
75
- ## Impact
76
-
77
- 1. **User Confusion**: "Round 11/5" makes no sense
78
- 2. **Time Estimation Wrong**: `rounds_remaining = max(5 - 11, 0) = 0` → always shows "~0s remaining"
79
- 3. **No Actual Bug in Logic**: The workflow still runs correctly, just the display is wrong
80
-
81
- ---
82
-
83
- ## Proposed Fixes
84
-
85
- ### Option A: Rename to "Agent Step" (Quick Fix)
86
-
87
- Change the display to reflect what we're actually counting:
88
-
89
- ```python
90
- # Before
91
- message=f"Round {iteration}/{self._max_rounds} (~{est_display} remaining)"
92
-
93
- # After
94
- message=f"Agent step {iteration} (Round limit: {self._max_rounds})"
95
- ```
96
-
97
- **Pros**: Accurate, minimal code change
98
- **Cons**: Still doesn't track actual workflow rounds
99
-
100
- ### Option B: Track Actual Workflow Rounds (Proper Fix)
101
-
102
- Track workflow rounds separately from agent completions:
103
-
104
- ```python
105
- @dataclass
106
- class WorkflowState:
107
- iteration: int = 0 # Agent completions (for internal tracking)
108
- workflow_round: int = 0 # Actual orchestration rounds
109
- current_message_buffer: str = ""
110
- # ...
111
-
112
- # Increment workflow_round when manager delegates (different event type)
113
- # Display workflow_round in progress messages
114
- ```
115
-
116
- **Pros**: Semantically correct, accurate time estimates
117
- **Cons**: Requires understanding which event signals a new round
118
-
119
- ### Option C: Use Estimated Agent Count (Compromise)
120
-
121
- Estimate agents per round and display accordingly:
122
-
123
- ```python
124
- AGENTS_PER_ROUND = 4 # searcher, hypothesizer, judge, reporter
125
- estimated_round = (iteration // AGENTS_PER_ROUND) + 1
126
- message=f"Round ~{estimated_round}/{self._max_rounds}"
127
- ```
128
-
129
- **Pros**: Roughly accurate, no API research needed
130
- **Cons**: Estimation may be off if some agents are skipped
131
-
132
- ---
133
-
134
- ## Recommendation
135
-
136
- **Short-term**: Apply Option A (rename to "Agent step") - fixes the confusion immediately
137
-
138
- **Long-term**: Investigate Option B - determine which event signals a new workflow round in Microsoft Agent Framework
139
-
140
- ---
141
-
142
- ## Related Code
143
-
144
- ```python
145
- # src/orchestrators/advanced.py
146
-
147
- # Line 348: Where iteration is incremented
148
- if isinstance(event, ExecutorCompletedEvent):
149
- state.iteration += 1
150
-
151
- # Line 459-467: Where progress message is generated
152
- rounds_remaining = max(self._max_rounds - iteration, 0)
153
- est_seconds = rounds_remaining * 45
154
- progress_event = AgentEvent(
155
- type="progress",
156
- message=f"Round {iteration}/{self._max_rounds} (~{est_display} remaining)",
157
- iteration=iteration,
158
- )
159
- ```
160
-
161
- ---
162
-
163
- ## Test Case
164
-
165
- ```python
166
- def test_progress_display_never_exceeds_max_rounds():
167
- """Progress should show Round X/Y where X <= Y."""
168
- # Simulate 20 agent completions across 5 workflow rounds
169
- # Assert displayed round never exceeds max_rounds
170
- pass
171
- ```
172
-
173
- ---
174
-
175
- ## Additional Issues Found During Analysis
176
-
177
- ### Issue 2: Dead Code - Unused `_get_progress_message` Method
178
-
179
- ```python
180
- # Line 196-205: Method is defined but NEVER called
181
- def _get_progress_message(self, iteration: int) -> str:
182
- """Generate progress message with time estimation."""
183
- # ... logic duplicated in _handle_completion_event
184
- ```
185
-
186
- The same logic is duplicated inline in `_handle_completion_event` (lines 458-469).
187
-
188
- **Fix**: Either use the method or delete it.
189
-
190
- ### Issue 3: Hardcoded Constant
191
-
192
- ```python
193
- # Line 87: Class constant defined
194
- _EST_SECONDS_PER_ROUND: int = 45
195
-
196
- # Line 199: Uses constant (correct)
197
- est_seconds = rounds_remaining * self._EST_SECONDS_PER_ROUND
198
-
199
- # Line 460: Uses hardcoded 45 (inconsistent)
200
- est_seconds = rounds_remaining * 45
201
- ```
202
-
203
- **Fix**: Use `self._EST_SECONDS_PER_ROUND` consistently.
204
-
205
- ### Issue 4: Time Estimate Always Shows "~0s remaining"
206
-
207
- Since `iteration` quickly exceeds `max_rounds`:
208
-
209
- ```python
210
- rounds_remaining = max(self._max_rounds - iteration, 0)
211
- # When iteration=11, max_rounds=5: rounds_remaining = max(5-11, 0) = 0
212
- # est_seconds = 0 * 45 = 0
213
- # Display: "~0s remaining"
214
- ```
215
-
216
- The time estimate becomes useless after the first few agent completions.
217
-
218
- ---
219
-
220
- ## Complete Fix Recommendation
221
-
222
- 1. **Rename display** from "Round X/5" to "Agent step X"
223
- 2. **Delete dead code** - remove unused `_get_progress_message` method
224
- 3. **Use constant** - replace hardcoded `45` with `self._EST_SECONDS_PER_ROUND`
225
- 4. **Fix time estimate** - base it on agent steps, not workflow rounds
226
-
227
- ---
228
-
229
- ## Senior Review Findings (2025-12-05)
230
-
231
- **Reviewer**: External Gemini CLI Agent
232
- **Status**: CONFIRMED - Analysis accurate and sufficient
233
-
234
- ### Additional Nuances Identified
235
-
236
- 1. **Manager Agent Also Fires Events**: The Manager itself is an agent. If `ExecutorCompletedEvent` fires for Manager's turn completion PLUS sub-agents' completions, the count accelerates 2-3x faster per logical round. This explains why we saw 11 events for ~2-3 workflow rounds.
237
-
238
- 2. **Time Estimation Doubly Flawed**:
239
- - Not just bottoming out at 0
240
- - `_EST_SECONDS_PER_ROUND` (45s) is calibrated for a FULL workflow round, not a single agent step
241
- - If we counted agent steps correctly: 10 steps × 45s = 450s (way overestimated)
242
- - A full round of 4 agents might only take 60s total
243
-
244
- 3. **API Discovery - Can Track Actual Rounds**:
245
-
246
- ```python
247
- # These constants exist in agent_framework:
248
- ORCH_MSG_KIND_INSTRUCTION = 'instruction'
249
- ORCH_MSG_KIND_USER_TASK = 'user_task'
250
- ORCH_MSG_KIND_TASK_LEDGER = 'task_ledger'
251
- ORCH_MSG_KIND_NOTICE = 'notice'
252
- ```
253
-
254
- Counting `user_task` events from `MagenticOrchestratorMessageEvent` would align iteration with `max_rounds` 1:1, since this signals "Manager is beginning a new evaluation cycle."
255
-
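- A rough sketch of that counting approach (the `kind` attribute is an assumption; verify it against the installed framework version before relying on it):
-
- ```python
- if isinstance(event, MagenticOrchestratorMessageEvent):
-     if getattr(event, "kind", None) == ORCH_MSG_KIND_USER_TASK:
-         state.workflow_round += 1  # Manager starting a new evaluation cycle
- ```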
256
- ### Reviewer Recommendations
257
-
258
- 1. **Option A (Rename)**: APPROVED - Safest, most honest fix
259
- 2. **Option B (Track Workflow Rounds)**: DEFER - Requires verifying framework behavior across versions, risks brittleness
260
- 3. **Remove Denominator**: Display `Agent Step {iteration}` without `/5` to avoid confusion
261
- 4. **Delete Dead Code**: Confirmed `_get_progress_message` is never called
262
- 5. **Fix Constants**: Use `self._EST_SECONDS_PER_ROUND` consistently
263
-
264
- ### Review Status: ✅ PASSED - Ready for Implementation
265
-
266
- ---
267
-
268
- ## Resolution (2025-12-05)
269
-
270
- **Implemented**: Domain-driven semantic progress tracking
271
-
272
- ### What Was Done
273
-
274
- 1. **Deleted Dead Code**:
275
- - Removed unused `_get_progress_message` method
276
- - Removed unused `_EST_SECONDS_PER_ROUND` constant
277
-
278
- 2. **Added Semantic Agent Mapping** (`_get_agent_semantic_name`):
279
-
280
- ```python
281
- def _get_agent_semantic_name(self, agent_id: str) -> str:
282
- """Map internal agent ID to user-facing semantic name."""
283
- name = agent_id.lower()
284
- if SEARCHER_AGENT_ID in name:
285
- return "SearchAgent"
286
- if JUDGE_AGENT_ID in name:
287
- return "JudgeAgent"
288
- if HYPOTHESIZER_AGENT_ID in name:
289
- return "HypothesisAgent"
290
- if REPORTER_AGENT_ID in name:
291
- return "ReportAgent"
292
- return "ManagerAgent"
293
- ```
294
-
295
- 3. **Changed Progress Display**:
296
- - Before: `"Round {iteration}/{self._max_rounds} (~{est_display} remaining)"`
297
- - After: `"Step {iteration}: {semantic_name} task completed"`
298
-
299
- 4. **Changed Initial Thinking Message**:
300
- - Before: `"Multi-agent reasoning in progress (5 rounds max)... Estimated time: 3-5 minutes."`
301
- - After: `"Multi-agent reasoning in progress (Limit: 5 Manager rounds)... Allocating time for deep research..."`
302
-
303
- 5. **Updated Tests**: Changed test mocks to use domain-specific agent IDs (`searcher`, `judge`) instead of arbitrary strings.
304
-
305
- ### Result
306
-
307
- - Before: `⏱️ **PROGRESS**: Round 11/5 (~0s remaining)` (confusing, broken math)
308
- - After: `⏱️ **PROGRESS**: Step 11: ReportAgent task completed` (accurate, professional)
309
-
310
- ### Design Decision
311
-
312
- Rather than patching the counter display or trying to track "actual workflow rounds" (which requires deep framework integration), we chose **honest reporting**: Show exactly what happened (which agent completed) without making false promises about progress percentages or time estimates.
313
-
314
- This follows the Clean Code principle: "Don't lie to the user."
315
-
316
- ---
317
-
318
- ## References
319
-
320
- - SPEC-18: Agent Framework Core Upgrade (where ExecutorCompletedEvent was introduced)
321
- - Microsoft Agent Framework documentation on workflow rounds vs agent executions
 
docs/bugs/archive/P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md DELETED
@@ -1,23 +0,0 @@
1
- # P3: Ephemeral Memory Architecture (No Persistence)
2
-
3
- **Status:** OPEN
4
- **Priority:** P3 (Feature/Architecture Gap)
5
- **Found By:** Codebase Investigation
6
- **Date:** 2025-11-29
7
-
8
- ## Description
9
- The current `EmbeddingService` (`src/services/embeddings.py`) initializes an **in-memory** ChromaDB client (`chromadb.Client()`) and creates a random UUID-based collection for every new session.
10
-
11
- While `src/utils/config.py` defines a `chroma_db_path` for persistence, it is currently **ignored**.
12
-
13
- ## Impact
14
- 1. **No Long-Term Learning:** The agent cannot "remember" research from previous runs. Every time you restart the app, it starts from zero.
15
- 2. **Redundant Costs:** If a user researches "Diabetes" twice, the agent re-searches and re-embeds the same papers, wasting tokens and compute time.
16
-
17
- ## Technical Details
18
- - **Current:** `self._client = chromadb.Client()` (In-Memory)
19
- - **Required:** `self._client = chromadb.PersistentClient(path=settings.chroma_db_path)`
20
-
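- A slightly fuller sketch (the collection name must also become stable, since a per-session UUID defeats persistence; `get_or_create_collection` is ChromaDB's API for idempotent reuse):
-
- ```python
- import chromadb
-
- self._client = chromadb.PersistentClient(path=settings.chroma_db_path)
- # A stable name, not a per-session UUID, is what actually enables reuse
- self._collection = self._client.get_or_create_collection(name="evidence")
- ```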
21
- ## Recommendation
22
- For a "Hackathon Demo," this is **low priority** (ephemeral is fine).
23
- For a "Real Product," this is **critical** (users expect a library of research).
 
docs/bugs/archive/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md DELETED
@@ -1,150 +0,0 @@
1
- # P3: Missing Structured Cognitive Memory (Shared Blackboard)
2
-
3
- **Status:** OPEN
4
- **Priority:** P3 (Architecture/Enhancement)
5
- **Found By:** Deep Codebase Investigation
6
- **Date:** 2025-11-29
7
- **Spec:** [SPEC_07_LANGGRAPH_MEMORY_ARCH.md](../specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md)
8
-
9
- ## Executive Summary
10
-
11
- DeepBoner's `AdvancedOrchestrator` has **Data Memory** (vector store for papers) but lacks **Cognitive Memory** (structured state for hypotheses, conflicts, and research plan). This causes "context drift" on long runs and prevents intelligent conflict resolution.
12
-
13
- ---
14
-
15
- ## Current Architecture (What We Have)
16
-
17
- ### 1. MagenticState (`src/agents/state.py:18-91`)
18
- ```python
19
- class MagenticState(BaseModel):
20
- evidence: list[Evidence] = Field(default_factory=list)
21
- embedding_service: Any = None # ChromaDB connection
22
-
23
- def add_evidence(self, new_evidence: list[Evidence]) -> int: ...
24
- async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]: ...
25
- ```
26
- - **What it does:** Stores Evidence objects, URL-based deduplication, semantic search via embeddings.
27
- - **What it DOESN'T do:** Track hypotheses, conflicts, or research plan status.
28
-
29
- ### 2. EmbeddingService (`src/services/embeddings.py:29-180`)
30
- ```python
31
- self._client = chromadb.Client() # In-memory (Line 44)
32
- self._collection = self._client.create_collection(
33
- name=f"evidence_{uuid.uuid4().hex}", # Random name per session (Line 45-47)
34
- ...
35
- )
36
- ```
37
- - **What it does:** In-session semantic search/deduplication.
38
- - **Limitation:** New collection per session, no persistence despite `settings.chroma_db_path` existing.
39
-
40
- ### 3. AdvancedOrchestrator (`src/orchestrators/advanced.py:51-371`)
41
- - Uses Microsoft's `agent-framework-core` (MagenticBuilder)
42
- - State is implicit in chat history passed between agents
43
- - Manager decides next step by reading conversation, not structured state
44
-
45
- ---
46
-
47
- ## The Problem
48
-
49
- | Issue | Impact | Evidence |
50
- |-------|--------|----------|
51
- | **No Hypothesis Tracking** | Can't update hypothesis confidence systematically | `MagenticState` has no `hypotheses` field |
52
- | **No Conflict Detection** | Contradictory sources are ignored | No `conflicts` list to flag Source A vs Source B |
53
- | **Context Drift** | Manager forgets original query after 50+ messages | State lives only in chat, not structured object |
54
- | **No Plan State** | Can't pause/resume research | No `research_plan` or `next_step` tracking |
55
-
56
- ---
57
-
58
- ## The Solution: LangGraph State Graph (Nov 2025 Best Practice)
59
-
60
- ### Why LangGraph?
61
-
62
- Based on [comprehensive analysis](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025):
63
-
64
- 1. **Explicit State Schema:** TypedDict/Pydantic model that ALL agents read/write
65
- 2. **State Reducers:** `Annotated[List[X], operator.add]` for appending (not overwriting)
66
- 3. **HuggingFace Compatible:** Works with `langchain-huggingface` (Llama 3.1)
67
- 4. **Production-Ready:** MongoDB checkpointer for persistence, SQLite for dev
68
-
69
- ### Target Architecture
70
-
71
- ```python
72
- # src/agents/graph/state.py (IMPLEMENTED)
73
- from typing import Annotated, TypedDict, Literal
74
- import operator
75
- from pydantic import BaseModel, Field
76
- from langchain_core.messages import BaseMessage
77
-
78
- class Hypothesis(BaseModel):
79
- id: str
80
- statement: str
81
- status: Literal["proposed", "validating", "confirmed", "refuted"]
82
- confidence: float
83
- supporting_evidence_ids: list[str]
84
- contradicting_evidence_ids: list[str]
85
-
86
- class Conflict(BaseModel):
87
- id: str
88
- description: str
89
- source_a_id: str
90
- source_b_id: str
91
- status: Literal["open", "resolved"]
92
- resolution: str | None
93
-
94
- class ResearchState(TypedDict):
95
- query: str # Immutable original question
96
- hypotheses: Annotated[list[Hypothesis], operator.add]
97
- conflicts: Annotated[list[Conflict], operator.add]
98
- evidence_ids: Annotated[list[str], operator.add] # Links to ChromaDB
99
- messages: Annotated[list[BaseMessage], operator.add]
100
- next_step: Literal["search", "judge", "resolve", "synthesize", "finish"]
101
- iteration_count: int
102
- ```
103
-
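- To make the state concrete, wiring it into a graph would look roughly like this (node functions are placeholders; `StateGraph` is LangGraph's standard builder API):
-
- ```python
- from langgraph.graph import END, StateGraph
-
- graph = StateGraph(ResearchState)
- graph.add_node("search", search_node)  # placeholder node functions
- graph.add_node("judge", judge_node)
- graph.add_node("synthesize", synthesize_node)
- graph.set_entry_point("search")
- graph.add_edge("search", "judge")
- # Route on the structured next_step field instead of re-reading chat history
- graph.add_conditional_edges(
-     "judge",
-     lambda state: state["next_step"],
-     {"search": "search", "synthesize": "synthesize", "finish": END},
- )
- graph.add_edge("synthesize", END)
- app = graph.compile()
- ```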
104
- ---
105
-
106
- ## Implementation Dependencies
107
-
108
- | Package | Purpose | Install |
109
- |---------|---------|---------|
110
- | `langgraph>=0.2` | State graph framework | `uv add langgraph` |
111
- | `langchain>=0.3` | Base abstractions | `uv add langchain` |
112
- | `langchain-huggingface` | Llama 3.1 integration | `uv add langchain-huggingface` |
113
- | `langgraph-checkpoint-sqlite` | Dev persistence | `uv add langgraph-checkpoint-sqlite` |
114
-
115
- **Note:** MongoDB checkpointer (`langgraph-checkpoint-mongodb`) recommended for production per [MongoDB blog](https://www.mongodb.com/company/blog/product-release-announcements/powering-long-term-memory-for-agents-langgraph).
116
-
117
- ---
118
-
119
- ## Alternative Considered: Mem0
120
-
121
- [Mem0](https://mem0.ai/) specializes in long-term memory and [outperformed OpenAI by 26%](https://guptadeepak.com/the-ai-memory-wars-why-one-system-crushed-the-competition-and-its-not-openai/) in benchmarks. However:
122
-
123
- - **Mem0 excels at:** User personalization, cross-session memory
124
- - **LangGraph excels at:** Workflow orchestration, state machines
125
- - **Verdict:** Use LangGraph for orchestration + optionally add Mem0 for user-level memory later
126
-
127
- ---
128
-
129
- ## Quick Win (Separate from LangGraph)
130
-
131
- Enable ChromaDB persistence in `src/services/embeddings.py:44`:
132
- ```python
133
- # FROM:
134
- self._client = chromadb.Client() # In-memory
135
-
136
- # TO:
137
- self._client = chromadb.PersistentClient(path=settings.chroma_db_path)
138
- ```
139
-
140
- This alone gives cross-session evidence persistence (P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY fix).
141
-
142
- ---
143
-
144
- ## References
145
-
146
- - [LangGraph Multi-Agent Orchestration Guide 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
147
- - [Long-Term Agentic Memory with LangGraph](https://medium.com/@anil.jain.baba/long-term-agentic-memory-with-langgraph-824050b09852)
148
- - [LangGraph vs LangChain 2025](https://kanerika.com/blogs/langchain-vs-langgraph/)
149
- - [MongoDB + LangGraph Checkpointers](https://www.mongodb.com/company/blog/product-release-announcements/powering-long-term-memory-for-agents-langgraph)
150
- - [Mem0 + LangGraph Integration](https://datacouch.io/blog/build-smarter-ai-agents-mem0-langgraph-guide/)
 
docs/bugs/archive/P3_MAGENTIC_NO_TERMINATION_EVENT.md DELETED
@@ -1,177 +0,0 @@
1
- # P3 Bug Report: Advanced Mode Missing Termination Guarantee
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Priority:** P3 (Edge case, but confusing UX)
6
- - **Component:** `src/orchestrator_magentic.py`
7
- - **Resolution:** Fixed (Guarantee termination event)
8
-
9
- ---
10
-
11
- ## Symptoms
12
-
13
- In **Advanced (Magentic) mode** with OpenAI API key:
14
-
15
- 1. Workflow runs for many iterations (up to 10 rounds)
16
- 2. Agents search, judge, hypothesize repeatedly
17
- 3. Eventually... **nothing happens**
18
- - No "complete" event
19
- - No error message
20
- - UI just stops updating
21
-
22
- **User perception:** "Did it finish? Did it crash? What happened?"
23
-
24
- ### Observed Behavior
25
-
26
- When workflow hits `max_round_count=10`:
27
- - `workflow.run_stream(task)` iterator ends
28
- - NO `MagenticFinalResultEvent` is emitted by agent-framework
29
- - Our code yields nothing after the loop
30
- - User is left hanging
31
-
32
- ---
33
-
34
- ## Root Cause Analysis
35
-
36
- ### Code Path (`src/orchestrator_magentic.py:170-186`)
37
-
38
- ```python
39
- iteration = 0
40
- try:
41
- async for event in workflow.run_stream(task):
42
- agent_event = self._process_event(event, iteration)
43
- if agent_event:
44
- if isinstance(event, MagenticAgentMessageEvent):
45
- iteration += 1
46
- yield agent_event
47
- # BUG: NO FALLBACK HERE!
48
- # If loop ends without FinalResultEvent, user sees nothing
49
-
50
- except Exception as e:
51
- logger.error("Magentic workflow failed", error=str(e))
52
- yield AgentEvent(
53
- type="error",
54
- message=f"Workflow error: {e!s}",
55
- iteration=iteration,
56
- )
57
- # BUG: NO FINALLY BLOCK TO GUARANTEE TERMINATION EVENT
58
- ```
59
-
60
- ### Workflow Configuration (`src/orchestrator_magentic.py:110-116`)
61
-
62
- ```python
63
- .with_standard_manager(
64
- chat_client=manager_client,
65
- max_round_count=self._max_rounds, # 10 - can hit this limit
66
- max_stall_count=3, # If agents repeat 3x
67
- max_reset_count=2, # Workflow reset limit
68
- )
69
- ```
70
-
71
- ### Failure Modes
72
-
73
- | Scenario | What Happens | User Sees |
74
- |----------|--------------|-----------|
75
- | `MagenticFinalResultEvent` emitted | `_process_event` yields "complete" | Final report |
76
- | Max rounds (10) reached, no final event | Loop ends silently | **Nothing** |
77
- | `max_stall_count` triggered | Workflow ends | **Nothing** |
78
- | `max_reset_count` triggered | Workflow ends | **Nothing** |
79
- | OpenAI API error | Exception caught | Error message |
80
-
81
- ---
82
-
83
- ## The Fix
84
-
85
- Add guaranteed termination event after the loop:
86
-
87
- ```python
88
- iteration = 0
89
- final_event_received = False
90
-
91
- try:
92
- async for event in workflow.run_stream(task):
93
- agent_event = self._process_event(event, iteration)
94
- if agent_event:
95
- if isinstance(event, MagenticAgentMessageEvent):
96
- iteration += 1
97
- if agent_event.type == "complete":
98
- final_event_received = True
99
- yield agent_event
100
-
101
- except Exception as e:
102
- logger.error("Magentic workflow failed", error=str(e))
103
- yield AgentEvent(
104
- type="error",
105
- message=f"Workflow error: {e!s}",
106
- iteration=iteration,
107
- )
108
- final_event_received = True # Error is a form of termination
109
-
110
- finally:
111
- # GUARANTEE: Always emit termination event
112
- if not final_event_received:
113
- logger.warning(
114
- "Workflow ended without final event",
115
- iterations=iteration,
116
- )
117
- yield AgentEvent(
118
- type="complete",
119
- message=(
120
- f"Research completed after {iteration} agent rounds. "
121
- "Max iterations reached - results may be partial. "
122
- "Try a more specific query for better results."
123
- ),
124
- data={"iterations": iteration, "reason": "max_rounds_reached"},
125
- iteration=iteration,
126
- )
127
- ```
128
-
129
- ---
130
-
131
- ## Alternative: Increase Max Rounds
132
-
133
- The default `max_rounds=10` might be too low for complex queries.
134
-
135
- In `src/orchestrator_factory.py:52-53`:
136
- ```python
137
- return orchestrator_cls(
138
- max_rounds=config.max_iterations if config else 10, # Could increase to 15-20
139
- api_key=api_key,
140
- )
141
- ```
142
-
143
- **Trade-off:** More rounds = more API cost, but better chance of complete results.
144
-
145
- ---
146
-
147
- ## Test Plan
148
-
149
- - [ ] Add fallback yield after async for loop
150
- - [ ] Add `final_event_received` flag tracking
151
- - [ ] Log warning when fallback is used
152
- - [ ] Test with `max_rounds=2` to force hitting limit
153
- - [ ] Verify user always sees termination event
154
- - [ ] `make check` passes
155
-
156
- ---
157
-
158
- ## Related Files
159
-
160
- - `src/orchestrator_magentic.py` - Main fix location
161
- - `src/orchestrator_factory.py` - Max rounds configuration
162
- - `src/utils/models.py` - AgentEvent types
163
- - `docs/bugs/P2_MAGENTIC_THINKING_STATE.md` - Related UX issue (implemented)
164
-
165
- ---
166
-
167
- ## Priority Justification
168
-
169
- **P3** because:
170
- - Advanced mode is working for most queries
171
- - Only hits edge case when max rounds reached without synthesis
172
- - User CAN retry with different query
173
- - Not blocking hackathon demo (free tier Simple mode works)
174
-
175
- Would be P2 if:
176
- - This happened frequently
177
- - No workaround existed
 
docs/bugs/archive/P3_MODAL_INTEGRATION_REMOVAL.md DELETED
@@ -1,78 +0,0 @@
1
- # P3 Tech Debt: Modal Integration Removal
2
-
3
- **Date**: 2025-12-04
4
- **Status**: DONE
5
- **Severity**: P3 (Tech Debt - Not blocking functionality)
6
- **Component**: Multiple files
7
-
8
- ---
9
-
10
- ## Executive Summary
11
-
12
- Modal (a cloud function execution platform) is wired throughout the codebase even though the project decided against using it. The leftover integration creates confusion and dead code paths that should be cleaned up when time permits.
13
-
14
- ---
15
-
16
- ## Affected Files
17
-
18
- The following files contain Modal references:
19
-
20
- | File | Usage |
21
- |------|-------|
22
- | `src/utils/config.py` | `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET` settings |
23
- | `src/utils/service_loader.py` | Modal service initialization |
24
- | `src/services/llamaindex_rag.py` | Modal integration for RAG |
25
- | `src/agents/code_executor_agent.py` | Modal sandbox execution |
26
- | `src/utils/exceptions.py` | Modal-related exceptions |
27
- | `src/tools/code_execution.py` | Modal code execution tool |
28
- | `src/services/statistical_analyzer.py` | Modal statistical analysis |
29
- | `src/mcp_tools.py` | Modal MCP tool wrappers |
30
- | `src/agents/analysis_agent.py` | Modal analysis agent |
31
-
32
- ---
33
-
34
- ## Context
35
-
36
- Modal was originally integrated for:
37
- 1. **Code Execution Sandbox**: Running untrusted code in isolated containers
38
- 2. **Statistical Analysis**: Offloading heavy statistical computations
39
- 3. **LlamaIndex RAG**: Premium embeddings with persistent storage
40
-
41
- However, the project decided against Modal because:
42
- - It added infrastructure complexity
43
- - The Free Tier doesn't need cloud functions
44
- - The Paid Tier uses OpenAI directly
45
-
46
- ---
47
-
48
- ## Recommended Fix
49
-
50
- 1. Remove Modal-related code from all affected files
51
- 2. Remove `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` from config
52
- 3. Remove Modal from dependencies in `pyproject.toml`
53
- 4. Update any documentation referencing Modal
54
-
55
- ---
56
-
57
- ## Impact If Not Fixed
58
-
59
- - Confusion for new contributors
60
- - Dead code in production
61
- - Unnecessary dependencies
62
- - Config settings that do nothing
63
-
64
- ---
65
-
66
- ## Test Plan
67
-
68
- 1. Remove Modal code
69
- 2. Run `make check` to ensure no breakage
70
- 3. Verify Free Tier and Paid Tier still work
71
- 4. Search codebase for any remaining Modal references
72
-
73
- ---
74
-
75
- ## Related
76
-
77
- - `P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md` - Similar tech debt for Anthropic
78
- - ARCHITECTURE.md - Current architecture excludes Modal
 
docs/bugs/archive/P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md DELETED
@@ -1,160 +0,0 @@
- # P3 Tech Debt: Remove Anthropic Partial Wiring
-
- **Date**: 2025-12-03
- **Status**: DONE
- **Severity**: P3 (Tech Debt / Simplification)
- **Component**: Architecture / Provider Integration
-
- ---
-
- ## Summary
-
- Remove all Anthropic-related code, configuration, and references from the codebase. Anthropic is partially wired but **not fully threaded through the architecture**, creating confusion and half-implemented code paths.
-
- ---
-
- ## Rationale
-
- ### 1. Anthropic Does NOT Provide Embeddings
-
- Our architecture requires embeddings for:
- - RAG (LlamaIndex/ChromaDB)
- - Evidence deduplication
- - Semantic search
-
- Anthropic only provides chat completion, not embeddings. This means even with a working Anthropic chat client, users would need a **second provider** for embeddings, breaking the unified experience.
-
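- A minimal illustration of the split this forces, assuming the stock `anthropic` and `openai` SDKs (this is not project code):
-
- ```python
- # Hypothetical: even with Anthropic handling chat, embeddings for RAG
- # would still require a second provider's SDK and API key.
- from anthropic import Anthropic  # chat completion only -- no embeddings
- from openai import OpenAI  # pulled in solely for embeddings
-
- chat = Anthropic().messages.create(
-     model="claude-sonnet-4-5-20250929",
-     max_tokens=256,
-     messages=[{"role": "user", "content": "Summarize the evidence."}],
- )
- embedding = OpenAI().embeddings.create(
-     model="text-embedding-3-small",
-     input="evidence passage to index",
- )
- ```
-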
- ### 2. Partial Implementation Creates Confusion
-
- Current state:
- - `settings.anthropic_api_key` exists ✅
- - `settings.has_anthropic_key` property exists ✅
- - `settings.anthropic_model` configured ✅
- - `AnthropicChatClient` for agent_framework **DOES NOT EXIST** ❌
- - Code raises `NotImplementedError` when Anthropic detected ❌
-
- This half-state causes:
- - User confusion ("Why doesn't my Anthropic key work?")
- - Developer confusion ("Is Anthropic supported or not?")
- - Dead code paths that need maintenance
-
- ### 3. Unified Architecture Principle
-
- **Principle**: Only support providers that work **end-to-end** through the entire stack (see the sketch after the table):
-
- ```
- Provider Requirements:
- ├── Chat Completion (for agents) ✅ Required
- ├── Function/Tool Calling ✅ Required
- ├── Embeddings (for RAG) ✅ Required
- └── Streaming ✅ Required
- ```
-
- | Provider | Chat | Tools | Embeddings | Streaming | Status |
- |----------|------|-------|------------|-----------|--------|
- | OpenAI | ✅ | ✅ | ✅ | ✅ | **KEEP** |
- | HuggingFace | ✅ | ✅ | ✅ (local) | ✅ | **KEEP** |
- | Gemini | ✅ | ✅ | ✅ | ✅ | Future (Phase 4) |
- | Anthropic | ✅ | ✅ | ❌ | ✅ | **REMOVE** |
-
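- The keep/remove decision can be read straight off the table. A toy sketch of the rule (illustrative only; the project does not ship this matrix):
-
- ```python
- # Hypothetical capability matrix mirroring the table above; a provider
- # is kept only when it covers every required capability end-to-end.
- REQUIRED = {"chat", "tools", "embeddings", "streaming"}
- PROVIDERS = {
-     "openai": {"chat", "tools", "embeddings", "streaming"},
-     "huggingface": {"chat", "tools", "embeddings", "streaming"},  # embeddings run locally
-     "anthropic": {"chat", "tools", "streaming"},  # no embeddings API
- }
-
-
- def keep(provider: str) -> bool:
-     return REQUIRED <= PROVIDERS[provider]  # set-subset test
- ```
-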
- ---
-
- ## Files to Clean Up
-
- ### Configuration
- - [ ] `src/utils/config.py` - Remove `anthropic_api_key`, `anthropic_model`, `has_anthropic_key`
-
- ### Client Factory
- - [ ] `src/clients/factory.py` - Remove Anthropic detection and `NotImplementedError`
-
- ### Legacy Code (pydantic-ai based)
- - [ ] `src/utils/llm_factory.py` - Remove `AnthropicModel`, `AnthropicProvider` imports and handling
- - [ ] `src/agent_factory/judges.py` - Remove Anthropic model selection
-
- ### App/UI
- - [ ] `src/app.py` - Remove `has_anthropic_key` checks and "Anthropic from env" backend info
-
- ### Documentation
- - [ ] `CLAUDE.md` - Update LLM provider list
- - [ ] `AGENTS.md` - Update LLM provider list
- - [ ] `GEMINI.md` - Update LLM provider list
-
- ### Tests
- - [ ] `tests/unit/clients/test_chat_client_factory.py` - Remove Anthropic test cases
- - [ ] `tests/unit/utils/test_config.py` - Remove Anthropic config tests
-
- ---
-
- ## Code Snippets to Remove
-
- ### `src/utils/config.py`
- ```python
- # REMOVE these lines:
- anthropic_api_key: str | None = Field(default=None, description="Anthropic API key")
- anthropic_model: str = Field(
-     default="claude-sonnet-4-5-20250929", description="Anthropic model"
- )
-
- @property
- def has_anthropic_key(self) -> bool:
-     """Check if Anthropic API key is available."""
-     return bool(self.anthropic_api_key)
- ```
-
- ### `src/clients/factory.py`
- ```python
- # REMOVE these lines:
- if api_key.startswith("sk-ant-"):
-     normalized = "anthropic"
-
- if normalized == "anthropic":
-     raise NotImplementedError(
-         "Anthropic client not yet implemented. "
-         "Use OpenAI key (sk-...) or leave empty for free HuggingFace tier."
-     )
- ```
-
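- For contrast, a hedged sketch of the shape the detection can take once the Anthropic branch is gone. This is an assumption about the post-removal logic, not the actual `factory.py` contents:
-
- ```python
- # Hypothetical post-removal shape: OpenAI keys route to the paid tier,
- # anything else (including ignored sk-ant- keys) falls through to the
- # free HuggingFace tier.
- def detect_provider(api_key: str | None) -> str:
-     if api_key and api_key.startswith("sk-") and not api_key.startswith("sk-ant-"):
-         return "openai"
-     return "huggingface"
- ```
-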
- ### `src/app.py`
- ```python
- # REMOVE these lines:
- elif settings.has_anthropic_key:
-     backend_info = "Paid API (Anthropic from env)"
-
- has_anthropic = settings.has_anthropic_key
- has_paid_key = has_openai or has_anthropic or bool(user_api_key)
- # Change to:
- has_paid_key = has_openai or bool(user_api_key)
- ```
-
- ---
-
- ## Migration Notes
-
- ### For Users with Anthropic Keys
-
- If users have `ANTHROPIC_API_KEY` set in their environment:
- 1. It will be **silently ignored** (not an error; see the sketch below)
- 2. The system falls through to the HuggingFace free tier
- 3. Users should use `OPENAI_API_KEY` instead for the paid tier
-
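- A hedged sketch of a startup notice that would make the silent fallthrough visible to users. This check does not exist in the codebase; it is an illustrative assumption:
-
- ```python
- # Hypothetical startup check -- warn instead of silently ignoring the key.
- import os
-
- if os.getenv("ANTHROPIC_API_KEY"):
-     print(
-         "Note: ANTHROPIC_API_KEY is ignored. Set OPENAI_API_KEY for the "
-         "paid tier, or unset it to use the free HuggingFace tier."
-     )
- ```
-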
- ### Future Consideration
-
- If Anthropic adds an embeddings API in the future, we can re-add support. Until then, partial support creates more confusion than value.
-
- ---
-
- ## Definition of Done
-
- - [ ] All Anthropic references removed from `src/`
- - [ ] All Anthropic tests removed or updated
- - [ ] Documentation updated to reflect supported providers: OpenAI, HuggingFace, (future: Gemini)
- - [ ] `make check` passes (lint, typecheck, tests)
- - [ ] PR reviewed and merged
-
- ---
-
- ## Related Documents
-
- - `P2_7B_MODEL_GARBAGE_OUTPUT.md` - Current free tier model quality issues
- - `HF_FREE_TIER_ANALYSIS.md` - HuggingFace provider routing analysis
- - `CLAUDE.md` - Agent context with provider documentation