# SPEC 06: Simple Mode Synthesis Fix
## Priority: P0 (Blocker - Simple mode produces garbage output)
## Problem Statement
Simple mode (HuggingFace free tier) runs 10 iterations, collects 455 sources, but outputs only a citation dump with no actual synthesis. The user waits through the entire process just to see "Partial Analysis (Max Iterations Reached)" with no drug candidates or analysis.
**Observed Behavior** (real run):
```
Iterations 1-8: confidence 70-90%, recommendation="continue" ← Never synthesizes
Iteration 9-10: confidence 0% ← LLM context overflow
Final output: Citation list only, no drug candidates, no analysis
```
---
## Research Context (November 2025 Best Practices)
This spec incorporates findings from current industry research on LLM-as-Judge, RAG systems, and multi-agent orchestration.
### LLM-as-Judge Biases ([Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge), [arXiv Survey](https://arxiv.org/abs/2411.15594))
| Bias | Description | Impact on Our System |
|------|-------------|---------------------|
| **Verbosity Bias** | LLM judges prefer longer, more detailed responses | Judge defaults to verbose "continue" explanations |
| **Position Bias** | Systematic preference based on order (primacy/recency) | Most recent evidence over-weighted |
| **Self-Preference Bias** | LLM favors outputs matching its own generation patterns | Defaults to "comfortable" pattern (continue) |
**Key Finding**: "Sophisticated judge models can align with human judgment up to 85%, which is actually higher than human-to-human agreement (81%)." However, this requires careful prompt design and debiasing.
### RAG Context Limits ([Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/), [TrueState](https://www.truestate.io/blog/lessons-from-rag))
> "Long context didn't kill retrieval. Bigger windows add cost and noise; **retrieval focuses attention where it matters.**"
**Key Finding**: RAG is **8-82× cheaper** than long context approaches. Best practice is:
- **Diverse selection** over recency-only selection
- **Re-ranking** before sending to judge
- **Lost-in-the-middle mitigation** - put critical context at prompt edges
### Multi-Agent Termination ([LangGraph Guide](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025), [AWS Guidance](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/))
> "The planning agent evaluates whether output **fully satisfies task objectives**. If so, the workflow is **terminated early**."
**Key Finding**: Code-enforced termination criteria outperform LLM-decided termination. The pattern is:
1. LLM provides **scores only** (mechanism, clinical, drug candidates)
2. Code evaluates scores against **explicit thresholds**
3. Code decides synthesize vs continue
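The three steps above can be sketched in a few lines. This is a simplified illustration, not the project's implementation: the `Scores` dataclass and the single `MIN_COMBINED_SCORE` rule are assumptions made for the example, and the full rule set this spec proposes is listed under Fix 3.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical stand-in for the judge's scores-only output."""
    mechanism: int  # 0-10
    clinical: int  # 0-10
    drug_candidates: list[str]


# Assumed threshold for illustration only.
MIN_COMBINED_SCORE = 12


def decide(scores: Scores) -> str:
    """Code, not the LLM, maps scores to a decision."""
    combined = scores.mechanism + scores.clinical
    if combined >= MIN_COMBINED_SCORE and scores.drug_candidates:
        return "synthesize"
    return "continue"


print(decide(Scores(7, 6, ["metformin"])))  # synthesize
print(decide(Scores(7, 6, [])))             # continue
print(decide(Scores(4, 3, ["metformin"])))  # continue
```

Because the decision lives in `decide()`, the thresholds can be logged, tested, and tuned without touching the prompt.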
---
## Root Cause Analysis
### Bug 1: No Evidence Limit in Judge Prompt (CRITICAL)
**File:** `src/prompts/judge.py:62`
```python
# BROKEN: Sends ALL evidence to the LLM
evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
```
**Impact:**
- 455 sources × 1700 chars/source = **773,500 characters ≈ 193K tokens**
- HuggingFace Inference free tier limit: **~4K-8K tokens**
- Result: **Context overflow → LLM failure → fallback response → 0% confidence**
This explains why confidence dropped to 0% in iterations 9-10: the context became too large for the LLM.
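The arithmetic behind these figures can be reproduced with a rough estimate. The sketch below assumes about 4 characters per token, a common rule of thumb for English text rather than a measured value for any particular tokenizer; `estimate_tokens` is a hypothetical helper.

```python
# ASSUMPTION: ~4 characters per token (rule of thumb, not a tokenizer measurement).
CHARS_PER_TOKEN = 4


def estimate_tokens(num_sources: int, chars_per_source: int) -> int:
    """Estimate prompt tokens from the evidence count and average evidence length."""
    return (num_sources * chars_per_source) // CHARS_PER_TOKEN


uncapped = estimate_tokens(455, 1700)  # all evidence, as in the broken code path
capped = estimate_tokens(30, 1700)     # 30 items at the same average length

print(uncapped)  # 193375
print(capped)    # 12750
```

Under the same heuristic, 30 items at this length still come to roughly 12-13K tokens, which is above the ~4K-8K figure quoted above. Depending on the actual model limit, the item cap or the per-item content length may need to be lower than the values used later in this spec.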
### Bug 2: LLM Decides Both Scoring AND Recommendation (Anti-Pattern)
**Current Design:**
```python
# LLM does BOTH - subject to verbosity/self-preference bias
"Evaluate evidence... Respond with recommendation: 'continue' or 'synthesize'"
```
**Problem** (per 2025 research):
- LLM exhibits **self-preference bias** - defaults to its "comfortable" pattern
- "Be conservative" instruction triggers **verbosity bias** - prefers longer explanations for "continue"
- No **separation of concerns** - scoring and decision-making conflated
### Bug 3: No Diverse Evidence Selection
**Current Design:**
```python
# Just truncates to most recent - subject to position bias
capped_evidence = evidence[-30:]
```
**Problem** (per RAG research):
- **Position bias** - most recent ≠ most relevant
- **Lost-in-the-middle** - important early evidence ignored
- No **diversity** - may select 30 similar papers
### Bug 4: Prompt Encourages "Continue" Forever
**File:** `src/prompts/judge.py:22-32`
```python
## Sufficiency Criteria (TOO STRICT - requires ALL conditions)
- Combined scores >= 12 AND
- At least one specific drug candidate identified AND
- Clear mechanistic rationale exists
## Output Rules
- Be conservative: only recommend "synthesize" when truly confident ← TRIGGERS VERBOSITY BIAS
```
### Bug 5: Search Derailment
**Evidence from logs:**
```
Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
```
Original question: "female libido post-menopause" → Judge suggests tangentially related topics.
### Bug 6: Partial Synthesis is Garbage
**File:** `src/orchestrators/simple.py:432-470`
When max iterations reached, outputs only citations with no analysis, drug candidates, or key findings.
---
## Solution Design
### Architecture Change: Separate Scoring from Decision
**Before (biased):**
```
User Question → LLM Judge → { scores, recommendation } → Orchestrator follows recommendation
```
**After (debiased, per 2025 best practices):**
```
User Question → LLM Judge → { scores only } → Code evaluates → Code decides synthesize/continue
```
This follows the [Spring AI LLM-as-Judge pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/): "Run agent in while loop with evaluator, until evaluator says output passes criteria" - but criteria are **code-enforced**, not LLM-decided.
---
### Fix 1: Diverse Evidence Selection (Not Just Capping)
**File:** `src/prompts/judge.py`
```python
MAX_EVIDENCE_FOR_JUDGE = 30  # Keep under token limits


async def select_evidence_for_judge(
    evidence: list[Evidence],
    query: str,
    max_items: int = MAX_EVIDENCE_FOR_JUDGE,
) -> list[Evidence]:
    """
    Select diverse, relevant evidence for judge evaluation.

    Implements RAG best practices (November 2025):
    - Diversity selection over recency-only
    - Lost-in-the-middle mitigation
    - Relevance re-ranking

    References:
    - https://www.pinecone.io/learn/retrieval-augmented-generation/
    - https://www.truestate.io/blog/lessons-from-rag
    """
    if len(evidence) <= max_items:
        return evidence

    try:
        from src.utils.text_utils import select_diverse_evidence

        # Use embedding-based diversity selection
        return await select_diverse_evidence(evidence, n=max_items, query=query)
    except ImportError:
        # Fallback: mix of early + recent (lost-in-the-middle mitigation)
        early = evidence[: max_items // 3]  # First third of the budget
        # Fill the rest of the budget from the end. Computing the count from
        # len(early) avoids a zero-length negative slice when max_items is small.
        recent = evidence[-(max_items - len(early)):]
        return early + recent


def format_user_prompt(
    question: str,
    evidence: list[Evidence],
    iteration: int = 0,
    max_iterations: int = 10,
    total_evidence_count: int | None = None,
) -> str:
    """
    Format user prompt with selected evidence and iteration context.

    NOTE: Evidence should be pre-selected using select_evidence_for_judge().
    This function assumes evidence is already capped.
    """
    total_count = total_evidence_count or len(evidence)
    max_content_len = 1500

    def format_single_evidence(i: int, e: Evidence) -> str:
        content = e.content
        if len(content) > max_content_len:
            content = content[:max_content_len] + "..."
        return (
            f"### Evidence {i + 1}\n"
            f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
            f"**URL**: {e.citation.url}\n"
            f"**Content**:\n{content}"
        )

    evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])

    # Lost-in-the-middle mitigation: put critical context at START and END
    return f"""## Research Question (IMPORTANT - stay focused on this)
{question}
## Search Progress
- **Iteration**: {iteration}/{max_iterations}
- **Total evidence collected**: {total_count} sources
- **Evidence shown below**: {len(evidence)} diverse sources (selected for relevance)
## Available Evidence
{evidence_text}
## Your Task
Score this evidence for drug repurposing potential. Provide ONLY scores and extracted data.
DO NOT decide "synthesize" vs "continue" - that decision is made by the system.
## REMINDER: Original Question (stay focused)
{question}
"""
```
### Fix 2: Debiased Judge Prompt (Scoring Only)
**File:** `src/prompts/judge.py`
```python
SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
Your task is to SCORE evidence from biomedical literature. You do NOT decide whether to
continue searching or synthesize - that decision is made by the orchestration system
based on your scores.
## Your Role: Scoring Only
You provide objective scores. The system decides next steps based on explicit thresholds.
This separation prevents bias in the decision-making process.
## Scoring Criteria
1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
- 0-3: No clear mechanism, speculative
- 4-6: Some mechanistic insight, but gaps exist
- 7-10: Clear, well-supported mechanism of action
2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
- 0-3: No clinical data, only theoretical
- 4-6: Preclinical or early clinical data
- 7-10: Strong clinical evidence (trials, meta-analyses)
3. **Drug Candidates**: List SPECIFIC drug names mentioned in the evidence
- Only include drugs explicitly mentioned
- Do NOT hallucinate or infer drug names
- Include drug class if specific names aren't available (e.g., "SSRI antidepressants")
4. **Key Findings**: Extract 3-5 key findings from the evidence
- Focus on findings relevant to the research question
- Include mechanism insights and clinical outcomes
5. **Confidence (0.0-1.0)**: Your confidence in the scores
- Based on evidence quality and relevance
- Lower if evidence is tangential or low-quality
## Output Format
Return valid JSON with these fields:
- details.mechanism_score (int 0-10)
- details.mechanism_reasoning (string)
- details.clinical_evidence_score (int 0-10)
- details.clinical_reasoning (string)
- details.drug_candidates (list of strings)
- details.key_findings (list of strings)
- sufficient (boolean) - TRUE if scores suggest enough evidence
- confidence (float 0-1)
- recommendation ("continue" or "synthesize") - Your suggestion (system may override)
- next_search_queries (list) - If continuing, suggest FOCUSED queries
- reasoning (string)
## CRITICAL: Search Query Rules
When suggesting next_search_queries:
- STAY FOCUSED on the original research question
- Do NOT drift to tangential topics
- If question is about "female libido", do NOT suggest "bone health" or "muscle mass"
- Refine existing terms, don't explore random medical associations
- Example: "female libido post-menopause" → "testosterone therapy female sexual dysfunction"
"""
```
### Fix 3: Code-Enforced Termination Criteria
**File:** `src/orchestrators/simple.py`
```python
# Termination thresholds (code-enforced, not LLM-decided)
# Based on multi-agent orchestration best practices (November 2025)
# Reference: https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/
TERMINATION_CRITERIA = {
    "min_combined_score": 12,  # mechanism + clinical >= 12
    "min_score_with_volume": 10,  # >= 10 if 50+ sources
    "late_iteration_threshold": 8,  # >= 8 in iterations 8+
    "max_evidence_threshold": 100,  # Force synthesis with 100+ sources
    "emergency_iteration": 8,  # Last 2 iterations = emergency mode
    "min_confidence": 0.5,  # Minimum confidence for emergency synthesis
}


def should_synthesize(
    assessment: JudgeAssessment,
    iteration: int,
    max_iterations: int,
    evidence_count: int,
) -> tuple[bool, str]:
    """
    Code-enforced synthesis decision.

    Returns (should_synthesize, reason).

    This function implements the "explicit termination criteria" pattern
    from multi-agent orchestration best practices. The LLM provides scores,
    but CODE decides when to stop.

    Reference: https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025
    """
    combined_score = (
        assessment.details.mechanism_score
        + assessment.details.clinical_evidence_score
    )
    has_drug_candidates = len(assessment.details.drug_candidates) > 0
    confidence = assessment.confidence

    # Priority 1: LLM explicitly says sufficient with good scores
    if assessment.sufficient and assessment.recommendation == "synthesize":
        if combined_score >= 10:
            return True, "judge_approved"

    # Priority 2: High scores with drug candidates
    if combined_score >= TERMINATION_CRITERIA["min_combined_score"] and has_drug_candidates:
        return True, "high_scores_with_candidates"

    # Priority 3: Good scores with high evidence volume
    if combined_score >= TERMINATION_CRITERIA["min_score_with_volume"] and evidence_count >= 50:
        return True, "good_scores_high_volume"

    # Priority 4: Late iteration with acceptable scores (diminishing returns)
    is_late_iteration = iteration >= max_iterations - 2
    if is_late_iteration and combined_score >= TERMINATION_CRITERIA["late_iteration_threshold"]:
        return True, "late_iteration_acceptable"

    # Priority 5: Very high evidence count (enough to synthesize something)
    if evidence_count >= TERMINATION_CRITERIA["max_evidence_threshold"]:
        return True, "max_evidence_reached"

    # Priority 6: Emergency synthesis (avoid garbage output)
    if is_late_iteration and evidence_count >= 30 and confidence >= TERMINATION_CRITERIA["min_confidence"]:
        return True, "emergency_synthesis"

    return False, "continue_searching"
```
### Fix 4: Update Orchestrator Decision Phase
**File:** `src/orchestrators/simple.py`
```python
# In the run() method, replace the decision phase:

# === DECISION PHASE (Code-Enforced) ===
should_synth, reason = should_synthesize(
    assessment=assessment,
    iteration=iteration,
    max_iterations=self.config.max_iterations,
    evidence_count=len(all_evidence),
)

logger.info(
    "Synthesis decision",
    should_synthesize=should_synth,
    reason=reason,
    iteration=iteration,
    combined_score=assessment.details.mechanism_score + assessment.details.clinical_evidence_score,
    evidence_count=len(all_evidence),
    confidence=assessment.confidence,
)

if should_synth:
    # Log synthesis trigger reason for debugging
    if reason != "judge_approved":
        logger.info(f"Code-enforced synthesis triggered: {reason}")

    # Optional Analysis Phase
    async for event in self._run_analysis_phase(query, all_evidence, iteration):
        yield event

    yield AgentEvent(
        type="synthesizing",
        message=f"Evidence sufficient ({reason})! Preparing synthesis...",
        iteration=iteration,
    )

    # Generate final response
    final_response = self._generate_synthesis(query, all_evidence, assessment)

    yield AgentEvent(
        type="complete",
        message=final_response,
        data={
            "evidence_count": len(all_evidence),
            "iterations": iteration,
            "synthesis_reason": reason,
            "drug_candidates": assessment.details.drug_candidates,
            "key_findings": assessment.details.key_findings,
        },
        iteration=iteration,
    )
    return
else:
    # Need more evidence - prepare next queries
    current_queries = assessment.next_search_queries or [
        f"{query} mechanism of action",
        f"{query} clinical evidence",
    ]

    yield AgentEvent(
        type="looping",
        message=(
            f"Gathering more evidence (scores: {assessment.details.mechanism_score}+"
            f"{assessment.details.clinical_evidence_score}). "
            f"Next: {', '.join(current_queries[:2])}..."
        ),
        data={"next_queries": current_queries, "reason": reason},
        iteration=iteration,
    )
```
### Fix 5: Real Partial Synthesis
**File:** `src/orchestrators/simple.py`
```python
def _generate_partial_synthesis(
    self,
    query: str,
    evidence: list[Evidence],
) -> str:
    """
    Generate a REAL synthesis when max iterations reached.

    Even when forced to stop, we should provide:
    - Drug candidates (if any were found)
    - Key findings
    - Assessment scores
    - Actionable citations

    This is still better than a citation dump.
    """
    # Extract data from last assessment if available
    last_assessment = self.history[-1]["assessment"] if self.history else {}
    details = last_assessment.get("details", {})

    drug_candidates = details.get("drug_candidates", [])
    key_findings = details.get("key_findings", [])
    mechanism_score = details.get("mechanism_score", 0)
    clinical_score = details.get("clinical_evidence_score", 0)
    reasoning = last_assessment.get("reasoning", "Analysis incomplete due to iteration limit.")

    # Format drug candidates
    if drug_candidates:
        drug_list = "\n".join([f"- **{d}**" for d in drug_candidates[:5]])
    else:
        drug_list = "- *No specific drug candidates identified in evidence*\n- *Try a more specific query or add an API key for better analysis*"

    # Format key findings
    if key_findings:
        findings_list = "\n".join([f"- {f}" for f in key_findings[:5]])
    else:
        findings_list = "- *Key findings require further analysis*\n- *See citations below for relevant sources*"

    # Format citations (top 10)
    citations = "\n".join([
        f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
        f"({e.citation.source.upper()}, {e.citation.date})"
        for i, e in enumerate(evidence[:10])
    ])

    combined_score = mechanism_score + clinical_score

    return f"""## Drug Repurposing Analysis
### Research Question
{query}
### Status
Analysis based on {len(evidence)} sources across {len(self.history)} iterations.
Maximum iterations reached - results may be incomplete.
### Drug Candidates Identified
{drug_list}
### Key Findings
{findings_list}
### Evidence Quality Scores
| Criterion | Score | Interpretation |
|-----------|-------|----------------|
| Mechanism | {mechanism_score}/10 | {"Strong" if mechanism_score >= 7 else "Moderate" if mechanism_score >= 4 else "Limited"} mechanistic evidence |
| Clinical | {clinical_score}/10 | {"Strong" if clinical_score >= 7 else "Moderate" if clinical_score >= 4 else "Limited"} clinical support |
| Combined | {combined_score}/20 | {"Sufficient" if combined_score >= 12 else "Partial"} for synthesis |
### Analysis Summary
{reasoning}
### Top Citations ({len(evidence)} sources total)
{citations}
---
*For more complete analysis:*
- *Add an OpenAI or Anthropic API key for enhanced LLM analysis*
- *Try a more specific query (e.g., include drug names)*
- *Use Advanced mode for multi-agent research*
"""
```
### Fix 6: Update Judge Handler Signature
**File:** `src/orchestrators/base.py`
```python
from typing import Protocol


class JudgeHandlerProtocol(Protocol):
    """Protocol for judge handler."""

    async def assess(
        self,
        question: str,
        evidence: list[Evidence],
        iteration: int = 0,  # NEW
        max_iterations: int = 10,  # NEW
    ) -> JudgeAssessment:
        """Assess evidence quality and provide scores."""
        ...
```
**File:** `src/agent_factory/judges.py`
Update all handlers (`JudgeHandler`, `HFInferenceJudgeHandler`, `MockJudgeHandler`) to:
```python
async def assess(
    self,
    question: str,
    evidence: list[Evidence],
    iteration: int = 0,
    max_iterations: int = 10,
) -> JudgeAssessment:
    """Assess evidence with iteration context."""
    # Select diverse evidence (not just truncate)
    selected_evidence = await select_evidence_for_judge(evidence, question)

    # Format prompt with iteration context
    user_prompt = format_user_prompt(
        question=question,
        evidence=selected_evidence,
        iteration=iteration,
        max_iterations=max_iterations,
        total_evidence_count=len(evidence),
    )
    # ... rest of implementation
```
---
## Implementation Order
| Order | Fix | Priority | Impact |
|-------|-----|----------|--------|
| 1 | Diverse evidence selection | CRITICAL | Prevents token overflow + position bias |
| 2 | Code-enforced termination | CRITICAL | Guarantees synthesis before max iterations |
| 3 | Debiased judge prompt | HIGH | Removes verbosity/self-preference bias |
| 4 | Real partial synthesis | HIGH | Ensures useful output even on forced stop |
| 5 | Update handler signatures | MEDIUM | Enables iteration context |
| 6 | Update orchestrator | MEDIUM | Integrates all fixes |
---
## Files to Modify
| File | Changes |
|------|---------|
| `src/prompts/judge.py` | New `select_evidence_for_judge()`, updated `format_user_prompt()`, debiased `SYSTEM_PROMPT` |
| `src/orchestrators/simple.py` | New `should_synthesize()`, updated decision phase, real `_generate_partial_synthesis()` |
| `src/orchestrators/base.py` | Update `JudgeHandlerProtocol` signature |
| `src/agent_factory/judges.py` | Update all handlers with iteration params, use diverse selection |
---
## Test Plan
### Unit Tests
```python
# tests/unit/prompts/test_judge_prompt.py
@pytest.mark.asyncio
async def test_evidence_selection_diverse():
    """Verify evidence selection includes early and recent items."""
    evidence = [make_evidence(f"Paper {i}") for i in range(100)]
    selected = await select_evidence_for_judge(evidence, "test query", max_items=30)

    # Should include some early evidence (lost-in-the-middle mitigation)
    titles = [e.citation.title for e in selected]
    assert any("Paper 0" in t or "Paper 1" in t for t in titles)
    assert any("Paper 99" in t or "Paper 98" in t for t in titles)


def test_prompt_includes_question_at_edges():
    """Verify lost-in-the-middle mitigation."""
    evidence = [make_evidence("Test")]
    prompt = format_user_prompt("important question", evidence, iteration=5, max_iterations=10)

    # Question should appear at START and END of prompt
    lines = prompt.split("\n")
    assert "important question" in lines[1]  # Near start
    assert "important question" in lines[-2]  # Near end


# tests/unit/orchestrators/test_termination.py
def test_should_synthesize_high_scores():
    """High scores with drug candidates triggers synthesis."""
    assessment = make_assessment(mechanism=7, clinical=6, drug_candidates=["Metformin"])
    should_synth, reason = should_synthesize(assessment, iteration=3, max_iterations=10, evidence_count=50)
    assert should_synth is True
    assert reason == "high_scores_with_candidates"


def test_should_synthesize_late_iteration():
    """Late iteration with acceptable scores triggers synthesis."""
    assessment = make_assessment(mechanism=5, clinical=4, drug_candidates=[])
    should_synth, reason = should_synthesize(assessment, iteration=9, max_iterations=10, evidence_count=80)
    assert should_synth is True
    assert reason in ["late_iteration_acceptable", "emergency_synthesis"]


def test_should_not_synthesize_early_low_scores():
    """Early iteration with low scores continues searching."""
    assessment = make_assessment(mechanism=3, clinical=2, drug_candidates=[])
    should_synth, reason = should_synthesize(assessment, iteration=2, max_iterations=10, evidence_count=20)
    assert should_synth is False
    assert reason == "continue_searching"


def test_partial_synthesis_has_drug_candidates():
    """Partial synthesis includes extracted data."""
    orchestrator = Orchestrator(...)
    orchestrator.history = [{
        "assessment": {
            "details": {
                "drug_candidates": ["Testosterone", "DHEA"],
                "key_findings": ["Finding 1", "Finding 2"],
                "mechanism_score": 6,
                "clinical_evidence_score": 5,
            },
            "reasoning": "Good evidence found.",
        }
    }]
    result = orchestrator._generate_partial_synthesis("test", [make_evidence("Test")])

    assert "Testosterone" in result
    assert "DHEA" in result
    assert "Drug Candidates" in result
    assert "6/10" in result  # mechanism score
```
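The unit tests above call `make_evidence` and `make_assessment`, which this spec does not define. One possible shape for these helpers is sketched below using dataclass stand-ins. The field names mirror the attributes accessed elsewhere in this spec (`e.citation.title`, `assessment.details.mechanism_score`, and so on), but the stand-in classes, default values, and placeholder URL are assumptions; the project's real `Evidence` and `JudgeAssessment` models may be constructed differently.

```python
from dataclasses import dataclass, field
from typing import Optional


# Stand-in models (hypothetical): only the attributes used in this spec.
@dataclass
class Citation:
    title: str
    url: str = "https://example.org/paper"  # placeholder URL
    source: str = "pubmed"
    date: str = "2025-01-01"


@dataclass
class Evidence:
    content: str
    citation: Citation


@dataclass
class AssessmentDetails:
    mechanism_score: int
    clinical_evidence_score: int
    drug_candidates: list[str] = field(default_factory=list)
    key_findings: list[str] = field(default_factory=list)


@dataclass
class JudgeAssessment:
    details: AssessmentDetails
    sufficient: bool = False
    confidence: float = 0.8
    recommendation: str = "continue"


def make_evidence(title: str, content: str = "Placeholder abstract.") -> Evidence:
    """Build an Evidence item whose citation title is easy to assert on."""
    return Evidence(content=content, citation=Citation(title=title))


def make_assessment(
    mechanism: int,
    clinical: int,
    drug_candidates: Optional[list[str]] = None,
) -> JudgeAssessment:
    """Build an assessment with the given scores and neutral defaults."""
    details = AssessmentDetails(mechanism, clinical, list(drug_candidates or []))
    return JudgeAssessment(details=details)
```

The defaults `sufficient=False` and `recommendation="continue"` matter here: with them, the "judge_approved" branch of `should_synthesize()` cannot fire, so `test_should_synthesize_high_scores` exercises the code-enforced threshold it is meant to cover.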
### Integration Tests
```python
# tests/integration/test_simple_mode_synthesis.py
@pytest.mark.asyncio
async def test_simple_mode_synthesizes_before_max_iterations():
    """Verify simple mode produces useful output with mocked judge."""
    # Mock judge to return good scores
    mock_judge = MockJudgeHandler()
    orchestrator = Orchestrator(
        search_handler=mock_search_handler,
        judge_handler=mock_judge,
    )

    events = []
    async for event in orchestrator.run("metformin diabetes mechanism"):
        events.append(event)

    # Must have synthesis with drug candidates
    complete_event = next(e for e in events if e.type == "complete")
    assert "Drug Candidates" in complete_event.message
    assert complete_event.data.get("synthesis_reason") is not None


@pytest.mark.asyncio
async def test_large_evidence_does_not_crash():
    """Verify 500 sources don't cause token overflow."""
    evidence = [make_evidence(f"Paper {i}") for i in range(500)]
    selected = await select_evidence_for_judge(evidence, "test query")

    # Should be capped
    assert len(selected) <= MAX_EVIDENCE_FOR_JUDGE

    # Worst case is about 30 items x ~1,600 chars each (~48K chars);
    # the 100K bound below leaves headroom for prompt scaffolding.
    prompt = format_user_prompt("test", selected, iteration=5, max_iterations=10, total_evidence_count=500)
    assert len(prompt) < 100_000
```
---
## Acceptance Criteria
- [ ] Evidence sent to judge is diverse-selected (not just truncated)
- [ ] Prompt includes question at START and END (lost-in-the-middle mitigation)
- [ ] Code-enforced `should_synthesize()` makes termination decision
- [ ] Synthesis triggered by iteration 8 with 50+ sources and scores >= 8
- [ ] Partial synthesis includes drug candidates and scores (not just citations)
- [ ] Search queries stay on-topic (judge prompt enforces focus)
- [ ] 500+ sources don't cause LLM crashes
- [ ] All existing tests pass
---
## Risk Assessment
| Risk | Mitigation |
|------|------------|
| Diverse selection misses critical evidence | Include relevance scoring in selection |
| Code-enforced thresholds too aggressive | Log all synthesis decisions for tuning |
| Prompt changes affect OpenAI/Anthropic differently | Test with all providers |
| Emergency synthesis produces low-quality output | Still better than citation dump |
---
## Success Metrics
| Metric | Before | After |
|--------|--------|-------|
| Synthesis rate | 0% | 90%+ |
| Average iterations to synthesis | 10 (max) | 5-7 |
| Drug candidates in output | Never | Always (if found) |
| LLM token overflow errors | Common | None |
| User-reported "useless output" | Frequent | Rare |
---
## References
- [LLM-as-a-Judge Guide - Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)
- [Survey on LLM-as-a-Judge - arXiv](https://arxiv.org/abs/2411.15594)
- [RAG Best Practices - Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/)
- [Lessons from RAG 2025 - TrueState](https://www.truestate.io/blog/lessons-from-rag)
- [LangGraph Multi-Agent Orchestration 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
- [Multi-Agent Orchestration on AWS](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/)
- [Spring AI LLM-as-Judge Pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/)