# SPEC 06: Simple Mode Synthesis Fix

## Priority: P0 (Blocker - Simple mode produces garbage output)

## Problem Statement

Simple mode (HuggingFace free tier) runs all 10 iterations and collects 455 sources, but then outputs only a citation dump with no actual synthesis. The user waits through the entire process only to see "Partial Analysis (Max Iterations Reached)" with no drug candidates and no analysis.

**Observed Behavior** (real run):
```
Iterations 1-8:  confidence 70-90%, recommendation="continue"  ← Never synthesizes
Iterations 9-10: confidence 0%                                 ← LLM context overflow
Final output:    Citation list only, no drug candidates, no analysis
```

---

## Research Context (November 2025 Best Practices)

This spec incorporates findings from current industry research on LLM-as-Judge, RAG systems, and multi-agent orchestration.

### LLM-as-Judge Biases ([Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge), [arXiv Survey](https://arxiv.org/abs/2411.15594))

| Bias | Description | Impact on Our System |
|------|-------------|---------------------|
| **Verbosity Bias** | LLM judges prefer longer, more detailed responses | Judge defaults to verbose "continue" explanations |
| **Position Bias** | Systematic preference based on order (primacy/recency) | Most recent evidence over-weighted |
| **Self-Preference Bias** | LLM favors outputs matching its own generation patterns | Defaults to "comfortable" pattern (continue) |

**Key Finding**: "Sophisticated judge models can align with human judgment up to 85%, which is actually higher than human-to-human agreement (81%)." However, this requires careful prompt design and debiasing.

### RAG Context Limits ([Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/), [TrueState](https://www.truestate.io/blog/lessons-from-rag))

> "Long context didn't kill retrieval. Bigger windows add cost and noise; **retrieval focuses attention where it matters.**"

**Key Finding**: RAG is **8-82× cheaper** than long-context approaches. Best practices are:
- **Diverse selection** over recency-only selection
- **Re-ranking** before sending to judge
- **Lost-in-the-middle mitigation** - put critical context at prompt edges

### Multi-Agent Termination ([LangGraph Guide](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025), [AWS Guidance](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/))

> "The planning agent evaluates whether output **fully satisfies task objectives**. If so, the workflow is **terminated early**."

**Key Finding**: Code-enforced termination criteria outperform LLM-decided termination. The pattern is:
1. LLM provides **scores only** (mechanism, clinical, drug candidates)
2. Code evaluates scores against **explicit thresholds**
3. Code decides synthesize vs continue

---

## Root Cause Analysis

### Bug 1: No Evidence Limit in Judge Prompt (CRITICAL)

**File:** `src/prompts/judge.py:62`

```python
# BROKEN: Sends ALL evidence to the LLM
evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
```

**Impact:**
- 455 sources × 1700 chars/source = **773,500 characters ≈ 193K tokens**
- HuggingFace Inference free tier limit: **~4K-8K tokens**
- Result: **Context overflow → LLM failure → fallback response → 0% confidence**

This explains why confidence dropped to 0% in iterations 9-10: the context became too large for the LLM.
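
A back-of-the-envelope check of that estimate (this assumes the common ~4 characters/token heuristic; exact counts depend on the model's tokenizer):

```python
# Rough size estimate for the unbounded judge prompt.
sources = 455
chars_per_source = 1700                    # formatted evidence block per source
total_chars = sources * chars_per_source   # 773,500 characters
approx_tokens = total_chars // 4           # ~193K tokens at ~4 chars/token
free_tier_limit = 8_000                    # upper end of the HF free-tier window
print(approx_tokens // free_tier_limit)    # ~24x over the limit
```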

### Bug 2: LLM Decides Both Scoring AND Recommendation (Anti-Pattern)

**Current Design:**
```python
# LLM does BOTH - subject to verbosity/self-preference bias
"Evaluate evidence... Respond with recommendation: 'continue' or 'synthesize'"
```

**Problem** (per 2025 research):
- LLM exhibits **self-preference bias** - defaults to its "comfortable" pattern
- "Be conservative" instruction triggers **verbosity bias** - prefers longer explanations for "continue"
- No **separation of concerns** - scoring and decision-making conflated

### Bug 3: No Diverse Evidence Selection

**Current Design:**
```python
# Just truncates to most recent - subject to position bias
capped_evidence = evidence[-30:]
```

**Problem** (per RAG research):
- **Position bias** - most recent ≠ most relevant
- **Lost-in-the-middle** - important early evidence ignored
- No **diversity** - may select 30 similar papers

### Bug 4: Prompt Encourages "Continue" Forever

**File:** `src/prompts/judge.py:22-32`

```python
## Sufficiency Criteria (TOO STRICT - requires ALL conditions)
- Combined scores >= 12 AND
- At least one specific drug candidate identified AND
- Clear mechanistic rationale exists

## Output Rules
- Be conservative: only recommend "synthesize" when truly confident  ← TRIGGERS VERBOSITY BIAS
```

### Bug 5: Search Derailment

**Evidence from logs:**
```
Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
```

Original question: "female libido post-menopause" → the judge drifts to tangentially related topics instead of refining the original query.

### Bug 6: Partial Synthesis is Garbage

**File:** `src/orchestrators/simple.py:432-470`

When the maximum iteration count is reached, the orchestrator outputs only a citation list with no analysis, drug candidates, or key findings.

---

## Solution Design

### Architecture Change: Separate Scoring from Decision

**Before (biased):**
```
User Question → LLM Judge → { scores, recommendation } → Orchestrator follows recommendation
```

**After (debiased, per 2025 best practices):**
```
User Question → LLM Judge → { scores only } → Code evaluates → Code decides synthesize/continue
```

This follows the [Spring AI LLM-as-Judge pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/): "Run agent in while loop with evaluator, until evaluator says output passes criteria" - but criteria are **code-enforced**, not LLM-decided.
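
A minimal sketch of the debiased loop, assuming the `should_synthesize()` helper and updated judge signature defined in the fixes below (not the actual `simple.py` implementation):

```python
# Sketch only: handler names and config fields are assumptions carried over
# from the fixes below, not verified against the real orchestrator code.
async def research_loop(query: str, search_handler, judge_handler, config) -> None:
    all_evidence: list[Evidence] = []
    for iteration in range(1, config.max_iterations + 1):
        all_evidence.extend(await search_handler.search(query))

        # The judge returns scores only; it no longer controls termination.
        assessment = await judge_handler.assess(
            question=query,
            evidence=all_evidence,
            iteration=iteration,
            max_iterations=config.max_iterations,
        )

        # Code-enforced decision (Fix 3) replaces "follow the LLM's recommendation".
        stop, reason = should_synthesize(
            assessment, iteration, config.max_iterations, len(all_evidence)
        )
        if stop:
            break  # proceed to synthesis, logging `reason`
```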

---

### Fix 1: Diverse Evidence Selection (Not Just Capping)

**File:** `src/prompts/judge.py`

```python
MAX_EVIDENCE_FOR_JUDGE = 30  # Keep under token limits

async def select_evidence_for_judge(
    evidence: list[Evidence],
    query: str,
    max_items: int = MAX_EVIDENCE_FOR_JUDGE,
) -> list[Evidence]:
    """
    Select diverse, relevant evidence for judge evaluation.

    Implements RAG best practices (November 2025):
    - Diversity selection over recency-only
    - Lost-in-the-middle mitigation
    - Relevance re-ranking

    References:
    - https://www.pinecone.io/learn/retrieval-augmented-generation/
    - https://www.truestate.io/blog/lessons-from-rag
    """
    if len(evidence) <= max_items:
        return evidence

    try:
        from src.utils.text_utils import select_diverse_evidence
        # Use embedding-based diversity selection
        return await select_diverse_evidence(evidence, n=max_items, query=query)
    except ImportError:
        # Fallback: mix of recent + early (lost-in-the-middle mitigation)
        early = evidence[:max_items // 3]           # First third
        recent = evidence[-(max_items * 2 // 3):]   # Last two-thirds
        return early + recent


def format_user_prompt(
    question: str,
    evidence: list[Evidence],
    iteration: int = 0,
    max_iterations: int = 10,
    total_evidence_count: int | None = None,
) -> str:
    """
    Format user prompt with selected evidence and iteration context.

    NOTE: Evidence should be pre-selected using select_evidence_for_judge().
    This function assumes evidence is already capped.
    """
    total_count = total_evidence_count or len(evidence)
    max_content_len = 1500

    def format_single_evidence(i: int, e: Evidence) -> str:
        content = e.content
        if len(content) > max_content_len:
            content = content[:max_content_len] + "..."
        return (
            f"### Evidence {i + 1}\n"
            f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
            f"**URL**: {e.citation.url}\n"
            f"**Content**:\n{content}"
        )

    evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])

    # Lost-in-the-middle mitigation: put critical context at START and END
    return f"""## Research Question (IMPORTANT - stay focused on this)
{question}

## Search Progress
- **Iteration**: {iteration}/{max_iterations}
- **Total evidence collected**: {total_count} sources
- **Evidence shown below**: {len(evidence)} diverse sources (selected for relevance)

## Available Evidence

{evidence_text}

## Your Task

Score this evidence for drug repurposing potential. Provide ONLY scores and extracted data.
DO NOT decide "synthesize" vs "continue" - that decision is made by the system.

## REMINDER: Original Question (stay focused)
{question}
"""
```

### Fix 2: Debiased Judge Prompt (Scoring Only)

**File:** `src/prompts/judge.py`

```python
SYSTEM_PROMPT = """You are an expert drug repurposing research judge.

Your task is to SCORE evidence from biomedical literature. You do NOT decide whether to
continue searching or synthesize - that decision is made by the orchestration system
based on your scores.

## Your Role: Scoring Only

You provide objective scores. The system decides next steps based on explicit thresholds.
This separation prevents bias in the decision-making process.

## Scoring Criteria

1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
   - 0-3: No clear mechanism, speculative
   - 4-6: Some mechanistic insight, but gaps exist
   - 7-10: Clear, well-supported mechanism of action

2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
   - 0-3: No clinical data, only theoretical
   - 4-6: Preclinical or early clinical data
   - 7-10: Strong clinical evidence (trials, meta-analyses)

3. **Drug Candidates**: List SPECIFIC drug names mentioned in the evidence
   - Only include drugs explicitly mentioned
   - Do NOT hallucinate or infer drug names
   - Include drug class if specific names aren't available (e.g., "SSRI antidepressants")

4. **Key Findings**: Extract 3-5 key findings from the evidence
   - Focus on findings relevant to the research question
   - Include mechanism insights and clinical outcomes

5. **Confidence (0.0-1.0)**: Your confidence in the scores
   - Based on evidence quality and relevance
   - Lower if evidence is tangential or low-quality

## Output Format

Return valid JSON with these fields:
- details.mechanism_score (int 0-10)
- details.mechanism_reasoning (string)
- details.clinical_evidence_score (int 0-10)
- details.clinical_reasoning (string)
- details.drug_candidates (list of strings)
- details.key_findings (list of strings)
- sufficient (boolean) - TRUE if scores suggest enough evidence
- confidence (float 0-1)
- recommendation ("continue" or "synthesize") - Your suggestion (system may override)
- next_search_queries (list) - If continuing, suggest FOCUSED queries
- reasoning (string)

## CRITICAL: Search Query Rules

When suggesting next_search_queries:
- STAY FOCUSED on the original research question
- Do NOT drift to tangential topics
- If question is about "female libido", do NOT suggest "bone health" or "muscle mass"
- Refine existing terms, don't explore random medical associations
- Example: "female libido post-menopause" β†’ "testosterone therapy female sexual dysfunction"
"""
```
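
The output fields above map onto the `JudgeAssessment` model used throughout the fixes below. A hypothetical Pydantic sketch of that model, inferred from the attribute access in this spec (the actual definition in `src/agent_factory` may differ):

```python
# Hypothetical reconstruction of the response model implied by the prompt's
# output format; field names mirror the assessment.details.* access in Fix 3.
from pydantic import BaseModel, Field


class AssessmentDetails(BaseModel):
    mechanism_score: int = Field(ge=0, le=10)
    mechanism_reasoning: str
    clinical_evidence_score: int = Field(ge=0, le=10)
    clinical_reasoning: str
    drug_candidates: list[str] = []
    key_findings: list[str] = []


class JudgeAssessment(BaseModel):
    details: AssessmentDetails
    sufficient: bool
    confidence: float = Field(ge=0.0, le=1.0)
    recommendation: str          # "continue" or "synthesize" - suggestion only
    next_search_queries: list[str] = []
    reasoning: str
```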

### Fix 3: Code-Enforced Termination Criteria

**File:** `src/orchestrators/simple.py`

```python
# Termination thresholds (code-enforced, not LLM-decided)
# Based on multi-agent orchestration best practices (November 2025)
# Reference: https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/

TERMINATION_CRITERIA = {
    "min_combined_score": 12,      # mechanism + clinical >= 12
    "min_score_with_volume": 10,   # >= 10 if 50+ sources
    "late_iteration_threshold": 8, # >= 8 in iterations 8+
    "max_evidence_threshold": 100, # Force synthesis with 100+ sources
    "emergency_iteration": 8,      # Last 2 iterations = emergency mode
    "min_confidence": 0.5,         # Minimum confidence for emergency synthesis
}


def should_synthesize(
    assessment: JudgeAssessment,
    iteration: int,
    max_iterations: int,
    evidence_count: int,
) -> tuple[bool, str]:
    """
    Code-enforced synthesis decision.

    Returns (should_synthesize, reason).

    This function implements the "explicit termination criteria" pattern
    from multi-agent orchestration best practices. The LLM provides scores,
    but CODE decides when to stop.

    Reference: https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025
    """
    combined_score = (
        assessment.details.mechanism_score +
        assessment.details.clinical_evidence_score
    )
    has_drug_candidates = len(assessment.details.drug_candidates) > 0
    confidence = assessment.confidence

    # Priority 1: LLM explicitly says sufficient with good scores
    if assessment.sufficient and assessment.recommendation == "synthesize":
        if combined_score >= 10:
            return True, "judge_approved"

    # Priority 2: High scores with drug candidates
    if combined_score >= TERMINATION_CRITERIA["min_combined_score"] and has_drug_candidates:
        return True, "high_scores_with_candidates"

    # Priority 3: Good scores with high evidence volume
    if combined_score >= TERMINATION_CRITERIA["min_score_with_volume"] and evidence_count >= 50:
        return True, "good_scores_high_volume"

    # Priority 4: Late iteration with acceptable scores (diminishing returns)
    is_late_iteration = iteration >= max_iterations - 2
    if is_late_iteration and combined_score >= TERMINATION_CRITERIA["late_iteration_threshold"]:
        return True, "late_iteration_acceptable"

    # Priority 5: Very high evidence count (enough to synthesize something)
    if evidence_count >= TERMINATION_CRITERIA["max_evidence_threshold"]:
        return True, "max_evidence_reached"

    # Priority 6: Emergency synthesis (avoid garbage output)
    if is_late_iteration and evidence_count >= 30 and confidence >= TERMINATION_CRITERIA["min_confidence"]:
        return True, "emergency_synthesis"

    return False, "continue_searching"
```

### Fix 4: Update Orchestrator Decision Phase

**File:** `src/orchestrators/simple.py`

```python
# In the run() method, replace the decision phase:

# === DECISION PHASE (Code-Enforced) ===
should_synth, reason = should_synthesize(
    assessment=assessment,
    iteration=iteration,
    max_iterations=self.config.max_iterations,
    evidence_count=len(all_evidence),
)

logger.info(
    "Synthesis decision",
    should_synthesize=should_synth,
    reason=reason,
    iteration=iteration,
    combined_score=assessment.details.mechanism_score + assessment.details.clinical_evidence_score,
    evidence_count=len(all_evidence),
    confidence=assessment.confidence,
)

if should_synth:
    # Log synthesis trigger reason for debugging
    if reason != "judge_approved":
        logger.info(f"Code-enforced synthesis triggered: {reason}")

    # Optional Analysis Phase
    async for event in self._run_analysis_phase(query, all_evidence, iteration):
        yield event

    yield AgentEvent(
        type="synthesizing",
        message=f"Evidence sufficient ({reason})! Preparing synthesis...",
        iteration=iteration,
    )

    # Generate final response
    final_response = self._generate_synthesis(query, all_evidence, assessment)

    yield AgentEvent(
        type="complete",
        message=final_response,
        data={
            "evidence_count": len(all_evidence),
            "iterations": iteration,
            "synthesis_reason": reason,
            "drug_candidates": assessment.details.drug_candidates,
            "key_findings": assessment.details.key_findings,
        },
        iteration=iteration,
    )
    return

else:
    # Need more evidence - prepare next queries
    current_queries = assessment.next_search_queries or [
        f"{query} mechanism of action",
        f"{query} clinical evidence",
    ]

    yield AgentEvent(
        type="looping",
        message=(
            f"Gathering more evidence (scores: {assessment.details.mechanism_score}+"
            f"{assessment.details.clinical_evidence_score}). "
            f"Next: {', '.join(current_queries[:2])}..."
        ),
        data={"next_queries": current_queries, "reason": reason},
        iteration=iteration,
    )
```

### Fix 5: Real Partial Synthesis

**File:** `src/orchestrators/simple.py`

```python
def _generate_partial_synthesis(
    self,
    query: str,
    evidence: list[Evidence],
) -> str:
    """
    Generate a REAL synthesis when max iterations reached.

    Even when forced to stop, we should provide:
    - Drug candidates (if any were found)
    - Key findings
    - Assessment scores
    - Actionable citations

    This is still better than a citation dump.
    """
    # Extract data from last assessment if available
    last_assessment = self.history[-1]["assessment"] if self.history else {}
    details = last_assessment.get("details", {})

    drug_candidates = details.get("drug_candidates", [])
    key_findings = details.get("key_findings", [])
    mechanism_score = details.get("mechanism_score", 0)
    clinical_score = details.get("clinical_evidence_score", 0)
    reasoning = last_assessment.get("reasoning", "Analysis incomplete due to iteration limit.")

    # Format drug candidates
    if drug_candidates:
        drug_list = "\n".join([f"- **{d}**" for d in drug_candidates[:5]])
    else:
        drug_list = "- *No specific drug candidates identified in evidence*\n- *Try a more specific query or add an API key for better analysis*"

    # Format key findings
    if key_findings:
        findings_list = "\n".join([f"- {f}" for f in key_findings[:5]])
    else:
        findings_list = "- *Key findings require further analysis*\n- *See citations below for relevant sources*"

    # Format citations (top 10)
    citations = "\n".join([
        f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
        f"({e.citation.source.upper()}, {e.citation.date})"
        for i, e in enumerate(evidence[:10])
    ])

    combined_score = mechanism_score + clinical_score

    return f"""## Drug Repurposing Analysis

### Research Question
{query}

### Status
Analysis based on {len(evidence)} sources across {len(self.history)} iterations.
Maximum iterations reached - results may be incomplete.

### Drug Candidates Identified
{drug_list}

### Key Findings
{findings_list}

### Evidence Quality Scores
| Criterion | Score | Interpretation |
|-----------|-------|----------------|
| Mechanism | {mechanism_score}/10 | {"Strong" if mechanism_score >= 7 else "Moderate" if mechanism_score >= 4 else "Limited"} mechanistic evidence |
| Clinical | {clinical_score}/10 | {"Strong" if clinical_score >= 7 else "Moderate" if clinical_score >= 4 else "Limited"} clinical support |
| Combined | {combined_score}/20 | {"Sufficient" if combined_score >= 12 else "Partial"} for synthesis |

### Analysis Summary
{reasoning}

### Top Citations ({len(evidence)} sources total)
{citations}

---
*For more complete analysis:*
- *Add an OpenAI or Anthropic API key for enhanced LLM analysis*
- *Try a more specific query (e.g., include drug names)*
- *Use Advanced mode for multi-agent research*
"""
```

### Fix 6: Update Judge Handler Signature

**File:** `src/orchestrators/base.py`

```python
class JudgeHandlerProtocol(Protocol):
    """Protocol for judge handler."""

    async def assess(
        self,
        question: str,
        evidence: list[Evidence],
        iteration: int = 0,           # NEW
        max_iterations: int = 10,     # NEW
    ) -> JudgeAssessment:
        """Assess evidence quality and provide scores."""
        ...
```

**File:** `src/agent_factory/judges.py`

Update all handlers (`JudgeHandler`, `HFInferenceJudgeHandler`, `MockJudgeHandler`) to:

```python
async def assess(
    self,
    question: str,
    evidence: list[Evidence],
    iteration: int = 0,
    max_iterations: int = 10,
) -> JudgeAssessment:
    """Assess evidence with iteration context."""
    # Select diverse evidence (not just truncate)
    selected_evidence = await select_evidence_for_judge(evidence, question)

    # Format prompt with iteration context
    user_prompt = format_user_prompt(
        question=question,
        evidence=selected_evidence,
        iteration=iteration,
        max_iterations=max_iterations,
        total_evidence_count=len(evidence),
    )

    # ... rest of implementation
```
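
The orchestrator's call site then passes iteration context through to the judge. A hedged sketch (attribute names follow the decision-phase code in Fix 4, not the verified orchestrator source):

```python
# Hypothetical call inside the simple orchestrator's iteration loop.
assessment = await self.judge_handler.assess(
    question=query,
    evidence=all_evidence,
    iteration=iteration,
    max_iterations=self.config.max_iterations,
)
```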

---

## Implementation Order

| Order | Fix | Priority | Impact |
|-------|-----|----------|--------|
| 1 | Diverse evidence selection | CRITICAL | Prevents token overflow + position bias |
| 2 | Code-enforced termination | CRITICAL | Guarantees synthesis before max iterations |
| 3 | Debiased judge prompt | HIGH | Removes verbosity/self-preference bias |
| 4 | Real partial synthesis | HIGH | Ensures useful output even on forced stop |
| 5 | Update handler signatures | MEDIUM | Enables iteration context |
| 6 | Update orchestrator | MEDIUM | Integrates all fixes |

---

## Files to Modify

| File | Changes |
|------|---------|
| `src/prompts/judge.py` | New `select_evidence_for_judge()`, updated `format_user_prompt()`, debiased `SYSTEM_PROMPT` |
| `src/orchestrators/simple.py` | New `should_synthesize()`, updated decision phase, real `_generate_partial_synthesis()` |
| `src/orchestrators/base.py` | Update `JudgeHandlerProtocol` signature |
| `src/agent_factory/judges.py` | Update all handlers with iteration params, use diverse selection |

---

## Test Plan

### Unit Tests

```python
# tests/unit/prompts/test_judge_prompt.py

@pytest.mark.asyncio
async def test_evidence_selection_diverse():
    """Verify evidence selection includes early and recent items."""
    evidence = [make_evidence(f"Paper {i}") for i in range(100)]
    selected = await select_evidence_for_judge(evidence, "test query", max_items=30)

    # Should include some early evidence (lost-in-the-middle mitigation)
    titles = [e.citation.title for e in selected]
    assert any("Paper 0" in t or "Paper 1" in t for t in titles)
    assert any("Paper 99" in t or "Paper 98" in t for t in titles)


def test_prompt_includes_question_at_edges():
    """Verify lost-in-the-middle mitigation."""
    evidence = [make_evidence("Test")]
    prompt = format_user_prompt("important question", evidence, iteration=5, max_iterations=10)

    # Question should appear at START and END of prompt
    lines = prompt.split("\n")
    assert "important question" in lines[1]  # Near start
    assert "important question" in lines[-2]  # Near end


# tests/unit/orchestrators/test_termination.py

def test_should_synthesize_high_scores():
    """High scores with drug candidates triggers synthesis."""
    assessment = make_assessment(mechanism=7, clinical=6, drug_candidates=["Metformin"])
    should_synth, reason = should_synthesize(assessment, iteration=3, max_iterations=10, evidence_count=50)

    assert should_synth is True
    assert reason == "high_scores_with_candidates"


def test_should_synthesize_late_iteration():
    """Late iteration with acceptable scores triggers synthesis."""
    assessment = make_assessment(mechanism=5, clinical=4, drug_candidates=[])
    should_synth, reason = should_synthesize(assessment, iteration=9, max_iterations=10, evidence_count=80)

    assert should_synth is True
    assert reason in ["late_iteration_acceptable", "emergency_synthesis"]


def test_should_not_synthesize_early_low_scores():
    """Early iteration with low scores continues searching."""
    assessment = make_assessment(mechanism=3, clinical=2, drug_candidates=[])
    should_synth, reason = should_synthesize(assessment, iteration=2, max_iterations=10, evidence_count=20)

    assert should_synth is False
    assert reason == "continue_searching"


def test_partial_synthesis_has_drug_candidates():
    """Partial synthesis includes extracted data."""
    orchestrator = Orchestrator(...)
    orchestrator.history = [{
        "assessment": {
            "details": {
                "drug_candidates": ["Testosterone", "DHEA"],
                "key_findings": ["Finding 1", "Finding 2"],
                "mechanism_score": 6,
                "clinical_evidence_score": 5,
            },
            "reasoning": "Good evidence found.",
        }
    }]

    result = orchestrator._generate_partial_synthesis("test", [make_evidence("Test")])

    assert "Testosterone" in result
    assert "DHEA" in result
    assert "Drug Candidates" in result
    assert "6/10" in result  # mechanism score
```
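
The tests above assume `make_evidence` and `make_assessment` fixtures. A minimal sketch of those helpers, assuming the models match the fields used throughout this spec (actual model constructors may differ):

```python
# Hypothetical test fixtures; adjust field names to the real models in src/.
def make_evidence(title: str) -> Evidence:
    return Evidence(
        content=f"Abstract text for {title}.",
        citation=Citation(
            title=title,
            url="https://example.org/paper",
            source="pubmed",
            date="2025-01-01",
        ),
    )


def make_assessment(mechanism: int, clinical: int, drug_candidates: list[str]) -> JudgeAssessment:
    return JudgeAssessment(
        details=AssessmentDetails(
            mechanism_score=mechanism,
            mechanism_reasoning="test",
            clinical_evidence_score=clinical,
            clinical_reasoning="test",
            drug_candidates=drug_candidates,
            key_findings=[],
        ),
        sufficient=False,
        confidence=0.6,
        recommendation="continue",
        next_search_queries=[],
        reasoning="test",
    )
```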

### Integration Tests

```python
# tests/integration/test_simple_mode_synthesis.py

@pytest.mark.asyncio
async def test_simple_mode_synthesizes_before_max_iterations():
    """Verify simple mode produces useful output with mocked judge."""
    # Mock judge to return good scores
    mock_judge = MockJudgeHandler()
    orchestrator = Orchestrator(
        search_handler=mock_search_handler,
        judge_handler=mock_judge,
    )

    events = []
    async for event in orchestrator.run("metformin diabetes mechanism"):
        events.append(event)

    # Must have synthesis with drug candidates
    complete_event = next(e for e in events if e.type == "complete")
    assert "Drug Candidates" in complete_event.message
    assert complete_event.data.get("synthesis_reason") is not None


@pytest.mark.asyncio
async def test_large_evidence_does_not_crash():
    """Verify 500 sources don't cause token overflow."""
    evidence = [make_evidence(f"Paper {i}") for i in range(500)]
    selected = await select_evidence_for_judge(evidence, "test query")

    # Should be capped
    assert len(selected) <= MAX_EVIDENCE_FOR_JUDGE

    # ~30 items at ~1.5K chars each keeps the prompt far below typical context limits
    prompt = format_user_prompt("test", selected, iteration=5, max_iterations=10, total_evidence_count=500)
    assert len(prompt) < 100_000  # loose upper bound; the unbounded prompt was ~770K chars
```

---

## Acceptance Criteria

- [ ] Evidence sent to judge is diverse-selected (not just truncated)
- [ ] Prompt includes question at START and END (lost-in-the-middle mitigation)
- [ ] Code-enforced `should_synthesize()` makes termination decision
- [ ] Synthesis triggered by iteration 8 with 50+ sources and scores >= 8
- [ ] Partial synthesis includes drug candidates and scores (not just citations)
- [ ] Search queries stay on-topic (judge prompt enforces focus)
- [ ] 500+ sources don't cause LLM crashes
- [ ] All existing tests pass

---

## Risk Assessment

| Risk | Mitigation |
|------|------------|
| Diverse selection misses critical evidence | Include relevance scoring in selection |
| Code-enforced thresholds too aggressive | Log all synthesis decisions for tuning |
| Prompt changes affect OpenAI/Anthropic differently | Test with all providers |
| Emergency synthesis produces low-quality output | Still better than citation dump |

---

## Success Metrics

| Metric | Before | After |
|--------|--------|-------|
| Synthesis rate | 0% | 90%+ |
| Average iterations to synthesis | 10 (max) | 5-7 |
| Drug candidates in output | Never | Always (if found) |
| LLM token overflow errors | Common | None |
| User-reported "useless output" | Frequent | Rare |

---

## References

- [LLM-as-a-Judge Guide - Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)
- [Survey on LLM-as-a-Judge - arXiv](https://arxiv.org/abs/2411.15594)
- [RAG Best Practices - Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/)
- [Lessons from RAG 2025 - TrueState](https://www.truestate.io/blog/lessons-from-rag)
- [LangGraph Multi-Agent Orchestration 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
- [Multi-Agent Orchestration on AWS](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/)
- [Spring AI LLM-as-Judge Pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/)