DeepBoner / docs /bugs /P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md

VibecoderMcSwaggins

feat: add service loader + SPEC_06 + P0 bug report

9639483 about 1 month ago

P0 Bug Report: Simple Mode Never Synthesizes

Status

Date: 2025-11-29
Priority: P0 (Blocker - Simple mode produces useless output)
Component: src/orchestrators/simple.py, src/agent_factory/judges.py, src/prompts/judge.py
Environment: Simple mode WITHOUT OpenAI key (HuggingFace Inference free tier)

Symptoms

When running Simple mode with a real research question:

Judge never recommends "synthesize" even with 455 sources and 90% confidence
Confidence drops to 0% in late iterations (API failures or context overflow)
Search derails to tangential topics (bone health, muscle mass instead of libido)
Max iterations reached → User gets garbage output (just citations, no synthesis)

Example Output (Real Run)

🔍 SEARCHING: What drugs improve female libido post-menopause?
📚 SEARCH_COMPLETE: Found 30 new sources (30 total)
✅ JUDGE_COMPLETE: Assessment: continue (confidence: 70%)    ← Never "synthesize"

... 8 more iterations ...

📚 SEARCH_COMPLETE: Found 10 new sources (429 total)
✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%)     ← API failure?

📚 SEARCH_COMPLETE: Found 26 new sources (455 total)
✅ JUDGE_COMPLETE: Assessment: continue (confidence: 0%)     ← Still failing

## Partial Analysis (Max Iterations Reached)      ← GARBAGE OUTPUT
### Question
What drugs improve female libido post-menopause?
### Status
Maximum search iterations reached.
### Citations
1. [Tribulus terrestris and female reproductive...]
2. ...
---
*Consider searching with more specific terms*     ← NO SYNTHESIS AT ALL

Root Cause Analysis

Bug 1: Judge Never Says "sufficient=True"

File: src/prompts/judge.py:22-25

3. **Sufficiency**: Evidence is sufficient when:
   - Combined scores >= 12 AND
   - At least one specific drug candidate identified AND
   - Clear mechanistic rationale exists

Problem: The sufficiency criteria are too strict. With 455 sources covering testosterone, DHEA, estrogen, oxytocin, and more, the judge should have identified candidates and recommended "synthesize". Possible reasons it did not:

The LLM may not be extracting drug candidates from the evidence reliably
The "AND" conditions are too strict: evidence can be good enough without meeting all three criteria
"continue" appears to be the judge's default recommendation

Evidence: Output shows 70-90% confidence but still "continue" - the judge is confident but never satisfied.

Bug 2: Confidence Drops to 0% (Late Iteration Failures)

File: src/agent_factory/judges.py:150-183

The _create_fallback_assessment() returns:

confidence: 0.0
recommendation: "continue"

Problem: In iterations 9-10 the judge call failed and the fallback assessment was returned. The cause is not yet confirmed; the candidates are:

Context too long (455 sources × ~1500 chars ≈ 680K chars → token limit exceeded)
API rate limit hit
Network timeout

Evidence: Confidence went from 80%→0%→0% in final iterations - this is the fallback response.
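The context-overflow hypothesis can be sanity-checked with rough arithmetic. The sketch below reuses the ~1500 chars per source figure from above and assumes about 4 characters per token, which is a rule of thumb and not a measured value for the model behind the judge. The function name is hypothetical.

```python
# Rough context-size estimate for the judge prompt.
# ASSUMPTION: ~4 characters per token (rule of thumb, not measured).
CHARS_PER_TOKEN = 4


def estimate_prompt_tokens(num_sources: int, chars_per_source: int = 1500) -> int:
    """Estimate the tokens needed to send every source to the judge."""
    total_chars = num_sources * chars_per_source
    return total_chars // CHARS_PER_TOKEN


if __name__ == "__main__":
    for n in (30, 50, 455):
        print(f"{n} sources -> ~{estimate_prompt_tokens(n):,} tokens")
```

Under these assumptions, 455 sources come to roughly 170K tokens, while 50 sources stay under 20K. That is consistent with the judge working in early iterations and failing late, but it does not rule out rate limits or timeouts.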

Bug 3: Search Derailment

Evidence from logs:

Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
Next searches: testosterone therapy in postmenopausal women, mechanisms of testosterone...

Problem: Judge's next_search_queries drift off-topic. "Bone health" and "muscle mass" are tangential to "female libido". The judge should stay focused on the original question.

Bug 4: Partial Synthesis is Garbage

File: src/orchestrators/simple.py:432-470

def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
    """Generate a partial synthesis when max iterations reached."""
    citations = "\n".join([...])  # Just citations

    return f"""## Partial Analysis (Max Iterations Reached)
### Question
{query}
### Status
Maximum search iterations reached. The evidence gathered may be incomplete.
### Evidence Collected
Found {len(evidence)} sources.
### Citations
{citations}
---
*Consider searching with more specific terms*
"""

Problem: When max iterations are reached, we have 455 sources but output no analysis. Instead of dumping citations and giving up, we should:

Force a synthesis call to the LLM
Or, at minimum, report drug candidates and key findings from the last good assessment
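The first option, a forced synthesis call that degrades safely, might look like the sketch below. It is an illustration under stated assumptions, not the orchestrator's actual code: `synthesize_or_fallback`, `llm_synthesize`, and `fallback` are hypothetical names, evidence is reduced to plain strings, and the call is synchronous for brevity where the real code path is likely async.

```python
from typing import Callable, Optional


def synthesize_or_fallback(
    query: str,
    evidence_snippets: list[str],
    llm_synthesize: Optional[Callable[[str, list[str]], str]],
    fallback: Callable[[str, list[str]], str],
) -> str:
    """Try a forced LLM synthesis; use the template if it fails or is empty."""
    if llm_synthesize is not None:
        try:
            text = llm_synthesize(query, evidence_snippets)
            if text and text.strip():
                return text
        except Exception:
            # Broad on purpose: any failure (rate limit, timeout, overflow)
            # should degrade to the template instead of crashing the run.
            pass
    return fallback(query, evidence_snippets)
```

The template stays in place as the last resort, so a failing API still produces some output, but the user only sees it when the synthesis call itself cannot complete.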

The Fix

Fix 1: Lower the Bar for "synthesize"

Option A: Change prompt to be less strict:

SYSTEM_PROMPT = """...
3. **Sufficiency**: Evidence is sufficient when:
   - Combined scores >= 10 (was 12) OR
   - Confidence >= 80% with drug candidates identified OR
   - 5+ iterations completed with 100+ sources
"""

Option B: Add iteration-based heuristic in orchestrator:

# If we have lots of evidence and high confidence, force synthesis
if iteration >= 5 and len(all_evidence) > 100 and assessment.confidence > 0.7:
    assessment.sufficient = True
    assessment.recommendation = "synthesize"
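To make Option B unit-testable, the same check could be pulled into a pure function. This is a minimal sketch: `Assessment` is a hypothetical stand-in for `JudgeAssessment` with only the three fields used here, and `maybe_force_synthesis` is an assumed name. The thresholds mirror the snippet above.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Hypothetical stand-in for JudgeAssessment (only the fields used here)."""
    sufficient: bool
    confidence: float
    recommendation: str


def maybe_force_synthesis(
    assessment: Assessment,
    iteration: int,
    evidence_count: int,
    min_iterations: int = 5,
    min_sources: int = 100,
    min_confidence: float = 0.7,
) -> Assessment:
    """Override a 'continue' verdict once enough evidence has accumulated."""
    if (
        iteration >= min_iterations
        and evidence_count > min_sources
        and assessment.confidence > min_confidence
    ):
        return Assessment(True, assessment.confidence, "synthesize")
    return assessment
```

A fallback assessment (confidence 0.0) never passes the confidence check, so this heuristic only helps while the judge is still responding. It does not replace Fix 2.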

Fix 2: Handle Context Overflow

File: src/agent_factory/judges.py

Before sending to LLM, cap evidence:

async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
    # Cap the evidence to avoid token overflow. Slicing from the end keeps the
    # 50 most recent items (assumes evidence is appended in search order).
    if len(evidence) > 50:
        evidence = evidence[-50:]  # Or use embedding similarity to select the best 50
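A fixed item count does not bound the prompt size if individual sources are long, so a character budget could be added alongside it. The sketch below is one way to do that: `cap_evidence` is a hypothetical helper, evidence is represented as plain strings instead of `Evidence` objects, the `max_chars` default is an arbitrary placeholder, and items are assumed to be in append order (oldest first).

```python
def cap_evidence(
    items: list[str],
    max_items: int = 50,
    max_chars: int = 60_000,
) -> list[str]:
    """Keep the newest items that fit within both limits, preserving order."""
    kept: list[str] = []
    used = 0
    for text in reversed(items):  # newest first (assumes append order)
        if len(kept) >= max_items or used + len(text) > max_chars:
            break
        kept.append(text)
        used += len(text)
    kept.reverse()  # restore oldest-to-newest order
    return kept
```

Selecting by recency is the simplest policy. Ranking by relevance to the question, for example by embedding similarity, would likely keep better evidence at the cost of an extra scoring step.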

Fix 3: Keep Search Focused

File: src/prompts/judge.py

Add to prompt:

SYSTEM_PROMPT = """...
## Search Query Rules

When suggesting next_search_queries:
- Stay focused on the ORIGINAL question
- Do NOT drift to tangential topics (e.g., don't search "bone health" for a libido question)
- Refine existing good terms, don't explore random associations
"""

Fix 4: Generate Real Synthesis on Max Iterations

File: src/orchestrators/simple.py

def _generate_partial_synthesis(self, query: str, evidence: list[Evidence]) -> str:
    """Generate a REAL synthesis when max iterations reached."""

    # Use the most recent assessment that actually extracted something. The final
    # iterations may hold fallback assessments (confidence 0.0) with no details.
    details = {}
    for entry in reversed(self.history):
        candidate = (entry.get("assessment") or {}).get("details") or {}
        if candidate.get("drug_candidates") or candidate.get("key_findings"):
            details = candidate
            break

    drug_candidates = details.get("drug_candidates", [])
    key_findings = details.get("key_findings", [])

    drug_list = "\n".join([f"- **{d}**" for d in drug_candidates]) or "- See sources below for candidates"
    findings_list = "\n".join([f"- {f}" for f in key_findings[:5]]) or "- Review citations for findings"

    citations = "\n".join([
        f"{i + 1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
        for i, e in enumerate(evidence[:10])
    ])

    return f"""## Drug Repurposing Analysis (Partial)

### Question
{query}

### Status
⚠️ Maximum iterations reached. Analysis based on {len(evidence)} sources.

### Drug Candidates Identified
{drug_list}

### Key Findings
{findings_list}

### Top Citations (first {min(10, len(evidence))} of {len(evidence)} sources)
{citations}

---
*Analysis may be incomplete. Consider refining query or adding API key for better results.*
"""

Test Plan

Verify judge says "synthesize" within 5 iterations for good queries
Test with 500+ sources to ensure no token overflow
Verify search stays on-topic (no bone/muscle tangents for libido query)
Verify partial synthesis shows drug candidates (not just citations)
Test with MockJudgeHandler to confirm issue is in LLM behavior
Add unit test: test_judge_synthesizes_with_good_evidence
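For the on-topic check, a small helper could flag suggested queries that mention known tangents. This is a sketch only: `find_tangent_queries` is a hypothetical name, and the tangent list has to be maintained by hand per test question, so it catches known drift and nothing else. The sample queries are the complete ones from the logs above.

```python
def find_tangent_queries(queries: list[str], tangent_terms: list[str]) -> list[str]:
    """Return the queries that mention any known off-topic term."""
    lowered_terms = [t.lower() for t in tangent_terms]
    return [q for q in queries if any(t in q.lower() for t in lowered_terms)]


if __name__ == "__main__":
    logged = [
        "androgen therapy and bone health",
        "androgen therapy and muscle mass",
        "testosterone therapy in postmenopausal women",
    ]
    for q in find_tangent_queries(logged, ["bone health", "muscle mass"]):
        print(f"off-topic: {q}")
```

In a test, the assertion would be that this list is empty for every iteration of the libido query.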

Priority Justification

P0 because:

Simple mode is the DEFAULT for users without API keys
455 sources found but ZERO useful output generated
User waited 10 iterations just to get a citation dump
Makes the tool look completely broken
Blocks hackathon demo effectiveness

Immediate Workaround

Use Advanced mode (requires OpenAI key) - it has its own synthesis logic
Or use fewer iterations (MAX_ITERATIONS=3) to reach the partial output faster (still citations only until Fix 4 lands)
Or manually review the citations (they ARE relevant, just not synthesized)

Related Issues

P0_ORCHESTRATOR_DEDUP_AND_JUDGE_BUGS.md - Fixed dedup issue, but synthesis problem persists
ACTIVE_BUGS.md - Update when this is resolved