P0 Bug Report: Simple Mode Never Synthesizes
Status
- Date: 2025-11-29
- Priority: P0 (Blocker - Simple mode produces useless output)
- Component:
src/orchestrators/simple.py,src/agent_factory/judges.py,src/prompts/judge.py - Environment: Simple mode WITHOUT OpenAI key (HuggingFace Inference free tier)
Symptoms
When running Simple mode with a real research question:
- Judge never recommends "synthesize" even with 455 sources and 90% confidence
- Confidence drops to 0% in late iterations (API failures or context overflow)
- Search derails to tangential topics (bone health, muscle mass instead of libido)
- Max iterations reached β User gets garbage output (just citations, no synthesis)
Example Output (Real Run)
π SEARCHING: What drugs improve female libido post-menopause?
π SEARCH_COMPLETE: Found 30 new sources (30 total)
β
JUDGE_COMPLETE: Assessment: continue (confidence: 70%) β Never "synthesize"
... 8 more iterations ...
π SEARCH_COMPLETE: Found 10 new sources (429 total)
β
JUDGE_COMPLETE: Assessment: continue (confidence: 0%) β API failure?
π SEARCH_COMPLETE: Found 26 new sources (455 total)
β
JUDGE_COMPLETE: Assessment: continue (confidence: 0%) β Still failing
## Partial Analysis (Max Iterations Reached) β GARBAGE OUTPUT
### Question
What drugs improve female libido post-menopause?
### Status
Maximum search iterations reached.
### Citations
1. [Tribulus terrestris and female reproductive...]
2. ...
---
*Consider searching with more specific terms* β NO SYNTHESIS AT ALL
Root Cause Analysis
Bug 1: Judge Never Says "sufficient=True"
File: src/prompts/judge.py:22-25
3. **Sufficiency**: Evidence is sufficient when:
- Combined scores >= 12 AND
- At least one specific drug candidate identified AND
- Clear mechanistic rationale exists
Problem: The prompt is too conservative. With 455 sources spanning testosterone, DHEA, estrogen, oxytocin, etc., the judge should have identified candidates and said "synthesize". But:
- LLM may not be extracting drug candidates from evidence properly
- The "AND" conditions are too strict - evidence can be "good enough" without hitting all criteria
- The recommendation "continue" seems to be the default state
Evidence: Output shows 70-90% confidence but still "continue" - the judge is confident but never satisfied.
Bug 2: Confidence Drops to 0% (Late Iteration Failures)
File: src/agent_factory/judges.py:150-183
The _create_fallback_assessment() returns:
confidence: 0.0recommendation: "continue"
Problem: In iterations 9-10, something failed:
- Context too long (455 sources Γ ~1500 chars = 680K chars β token limit exceeded)
- API rate limit hit
- Network timeout
Evidence: Confidence went from 80%β0%β0% in final iterations - this is the fallback response.
Bug 3: Search Derailment
Evidence from logs:
Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
Next searches: testosterone therapy in postmenopausal women, mechanisms of testosterone...
Problem: Judge's next_search_queries drift off-topic. "Bone health" and "muscle mass" are tangential to "female libido". The judge should stay focused on the original question.
Bug 4: Partial Synthesis is Garbage
File: src/orchestrators/simple.py:432-470
def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
"""Generate a partial synthesis when max iterations reached."""
citations = "\n".join([...]) # Just citations
return f"""## Partial Analysis (Max Iterations Reached)
### Question
{query}
### Status
Maximum search iterations reached. The evidence gathered may be incomplete.
### Evidence Collected
Found {len(evidence)} sources.
### Citations
{citations}
---
*Consider searching with more specific terms*
"""
Problem: When max iterations reached, we have 455 sources but output NO analysis. We should:
- Force a synthesis call to the LLM
- Or at minimum generate drug candidates/findings from the last good assessment
- Not just dump citations and give up
The Fix
Fix 1: Lower the Bar for "synthesize"
Option A: Change prompt to be less strict:
SYSTEM_PROMPT = """...
3. **Sufficiency**: Evidence is sufficient when:
- Combined scores >= 10 (was 12) OR
- Confidence >= 80% with drug candidates identified OR
- 5+ iterations completed with 100+ sources
"""
Option B: Add iteration-based heuristic in orchestrator:
# If we have lots of evidence and high confidence, force synthesis
if iteration >= 5 and len(all_evidence) > 100 and assessment.confidence > 0.7:
assessment.sufficient = True
assessment.recommendation = "synthesize"
Fix 2: Handle Context Overflow
File: src/agent_factory/judges.py
Before sending to LLM, cap evidence:
async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
# Cap at 50 most recent/relevant to avoid token overflow
if len(evidence) > 50:
evidence = evidence[:50] # Or use embedding similarity to select best 50
Fix 3: Keep Search Focused
File: src/prompts/judge.py
Add to prompt:
SYSTEM_PROMPT = """...
## Search Query Rules
When suggesting next_search_queries:
- Stay focused on the ORIGINAL question
- Do NOT drift to tangential topics (e.g., don't search "bone health" for a libido question)
- Refine existing good terms, don't explore random associations
"""
Fix 4: Generate Real Synthesis on Max Iterations
File: src/orchestrators/simple.py
def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
"""Generate a REAL synthesis when max iterations reached."""
# Get the last assessment's data (if available)
last_assessment = self.history[-1]["assessment"] if self.history else None
drug_candidates = last_assessment.get("details", {}).get("drug_candidates", []) if last_assessment else []
key_findings = last_assessment.get("details", {}).get("key_findings", []) if last_assessment else []
drug_list = "\n".join([f"- **{d}**" for d in drug_candidates]) or "- See sources below for candidates"
findings_list = "\n".join([f"- {f}" for f in key_findings[:5]]) or "- Review citations for findings"
citations = "\n".join([
f"{i + 1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
for i, e in enumerate(evidence[:10])
])
return f"""## Drug Repurposing Analysis (Partial)
### Question
{query}
### Status
β οΈ Maximum iterations reached. Analysis based on {len(evidence)} sources.
### Drug Candidates Identified
{drug_list}
### Key Findings
{findings_list}
### Top Citations ({len(evidence)} sources)
{citations}
---
*Analysis may be incomplete. Consider refining query or adding API key for better results.*
"""
Test Plan
- Verify judge says "synthesize" within 5 iterations for good queries
- Test with 500+ sources to ensure no token overflow
- Verify search stays on-topic (no bone/muscle tangents for libido query)
- Verify partial synthesis shows drug candidates (not just citations)
- Test with MockJudgeHandler to confirm issue is in LLM behavior
- Add unit test:
test_judge_synthesizes_with_good_evidence
Priority Justification
P0 because:
- Simple mode is the DEFAULT for users without API keys
- 455 sources found but ZERO useful output generated
- User waited 10 iterations just to get a citation dump
- Makes the tool look completely broken
- Blocks hackathon demo effectiveness
Immediate Workaround
- Use Advanced mode (requires OpenAI key) - it has its own synthesis logic
- Or use fewer iterations (MAX_ITERATIONS=3) to hit partial synthesis faster
- Or manually review the citations (they ARE relevant, just not synthesized)
Related Issues
P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md- Fixed dedup issue, but synthesis problem persistsACTIVE_BUGS.md- Update when this is resolved