# SPEC 06: Simple Mode Synthesis Fix
## Priority: P0 (Blocker - Simple mode produces garbage output)
## Problem Statement
Simple mode (HuggingFace free tier) runs 10 iterations and collects 455 sources, but outputs only a citation dump with no actual synthesis. The user waits through the entire process only to see "Partial Analysis (Max Iterations Reached)" with no drug candidates or analysis.
**Observed Behavior** (real run):
```
Iterations 1-8:  confidence 70-90%, recommendation="continue" → never synthesizes
Iterations 9-10: confidence 0% → LLM context overflow
Final output:    citation list only, no drug candidates, no analysis
```
---
## Research Context (November 2025 Best Practices)
This spec incorporates findings from current industry research on LLM-as-Judge, RAG systems, and multi-agent orchestration.
### LLM-as-Judge Biases ([Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge), [arXiv Survey](https://arxiv.org/abs/2411.15594))
| Bias | Description | Impact on Our System |
|------|-------------|---------------------|
| **Verbosity Bias** | LLM judges prefer longer, more detailed responses | Judge defaults to verbose "continue" explanations |
| **Position Bias** | Systematic preference based on order (primacy/recency) | Most recent evidence over-weighted |
| **Self-Preference Bias** | LLM favors outputs matching its own generation patterns | Defaults to "comfortable" pattern (continue) |
**Key Finding**: "Sophisticated judge models can align with human judgment up to 85%, which is actually higher than human-to-human agreement (81%)." However, this requires careful prompt design and debiasing.
### RAG Context Limits ([Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/), [TrueState](https://www.truestate.io/blog/lessons-from-rag))
> "Long context didn't kill retrieval. Bigger windows add cost and noise; **retrieval focuses attention where it matters.**"
**Key Finding**: RAG is **8-82× cheaper** than long-context approaches. Best practices are:
- **Diverse selection** over recency-only selection
- **Re-ranking** before sending to judge
- **Lost-in-the-middle mitigation** - put critical context at prompt edges
### Multi-Agent Termination ([LangGraph Guide](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025), [AWS Guidance](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/))
> "The planning agent evaluates whether output **fully satisfies task objectives**. If so, the workflow is **terminated early**."
**Key Finding**: Code-enforced termination criteria outperform LLM-decided termination. The pattern is:
1. LLM provides **scores only** (mechanism, clinical, drug candidates)
2. Code evaluates scores against **explicit thresholds**
3. Code decides synthesize vs continue
---
## Root Cause Analysis
### Bug 1: No Evidence Limit in Judge Prompt (CRITICAL)
**File:** `src/prompts/judge.py:62`
```python
# BROKEN: Sends ALL evidence to the LLM
evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
```
**Impact:**
- 455 sources × 1,700 chars/source = **773,500 characters ≈ 193K tokens**
- HuggingFace Inference free tier limit: **~4K-8K tokens**
- Result: **Context overflow → LLM failure → fallback response → 0% confidence**
This explains why confidence dropped to 0% in iterations 9-10: the context became too large for the LLM.
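The token estimate above uses the common ~4 characters/token heuristic; a quick back-of-envelope check (the 1,700 chars/source figure assumes ~1,500 content chars plus citation headers):
```python
# Sanity check of the overflow math, using the rough 4-chars-per-token heuristic.
sources = 455
chars_per_source = 1_700          # ~1,500 content chars + citation/header overhead
total_chars = sources * chars_per_source
approx_tokens = total_chars // 4  # heuristic, not a real tokenizer
print(f"{total_chars:,} chars ≈ {approx_tokens:,} tokens")
# 773,500 chars ≈ 193,375 tokens - roughly 25-50x the free-tier window
```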
### Bug 2: LLM Decides Both Scoring AND Recommendation (Anti-Pattern)
**Current Design:**
```python
# LLM does BOTH - subject to verbosity/self-preference bias
"Evaluate evidence... Respond with recommendation: 'continue' or 'synthesize'"
```
**Problem** (per 2025 research):
- LLM exhibits **self-preference bias** - defaults to its "comfortable" pattern
- "Be conservative" instruction triggers **verbosity bias** - prefers longer explanations for "continue"
- No **separation of concerns** - scoring and decision-making conflated
### Bug 3: No Diverse Evidence Selection
**Current Design:**
```python
# Just truncates to most recent - subject to position bias
capped_evidence = evidence[-30:]
```
**Problem** (per RAG research):
- **Position bias** - most recent ≠ most relevant
- **Lost-in-the-middle** - important early evidence ignored
- No **diversity** - may select 30 similar papers
### Bug 4: Prompt Encourages "Continue" Forever
**File:** `src/prompts/judge.py:22-32`
```python
## Sufficiency Criteria (TOO STRICT - requires ALL conditions)
- Combined scores >= 12 AND
- At least one specific drug candidate identified AND
- Clear mechanistic rationale exists
## Output Rules
- Be conservative: only recommend "synthesize" when truly confident  ← TRIGGERS VERBOSITY BIAS
```
### Bug 5: Search Derailment
**Evidence from logs:**
```
Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
```
Original question: "female libido post-menopause" → Judge suggests tangentially related topics.
### Bug 6: Partial Synthesis is Garbage
**File:** `src/orchestrators/simple.py:432-470`
When max iterations are reached, the output contains only citations - no analysis, no drug candidates, no key findings.
---
## Solution Design
### Architecture Change: Separate Scoring from Decision
**Before (biased):**
```
User Question → LLM Judge → { scores, recommendation } → Orchestrator follows recommendation
```
**After (debiased, per 2025 best practices):**
```
User Question → LLM Judge → { scores only } → Code evaluates → Code decides synthesize/continue
```
This follows the [Spring AI LLM-as-Judge pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/): "Run agent in while loop with evaluator, until evaluator says output passes criteria" - but criteria are **code-enforced**, not LLM-decided.
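A minimal sketch of that loop, assuming the `should_synthesize()` helper defined in Fix 3 below; `search`, `judge`, and the two synthesis calls are stand-ins for the existing handlers, not real signatures:
```python
# Sketch only: code-enforced evaluator loop (handler names are placeholders).
async def research_loop(question: str, max_iterations: int = 10) -> str:
    evidence, queries = [], [question]
    for iteration in range(1, max_iterations + 1):
        evidence += await search(queries)                      # gather more sources
        assessment = await judge.assess(                       # LLM returns scores only
            question, evidence, iteration, max_iterations
        )
        done, reason = should_synthesize(                      # code makes the decision
            assessment, iteration, max_iterations, len(evidence)
        )
        if done:
            return synthesize(question, evidence, assessment)  # early termination
        queries = assessment.next_search_queries or [question]
    return generate_partial_synthesis(question, evidence)      # never a bare citation dump
```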
---
### Fix 1: Diverse Evidence Selection (Not Just Capping)
**File:** `src/prompts/judge.py`
```python
MAX_EVIDENCE_FOR_JUDGE = 30  # Keep under token limits

async def select_evidence_for_judge(
    evidence: list[Evidence],
    query: str,
    max_items: int = MAX_EVIDENCE_FOR_JUDGE,
) -> list[Evidence]:
    """
    Select diverse, relevant evidence for judge evaluation.

    Implements RAG best practices (November 2025):
    - Diversity selection over recency-only
    - Lost-in-the-middle mitigation
    - Relevance re-ranking

    References:
    - https://www.pinecone.io/learn/retrieval-augmented-generation/
    - https://www.truestate.io/blog/lessons-from-rag
    """
    if len(evidence) <= max_items:
        return evidence
    try:
        from src.utils.text_utils import select_diverse_evidence

        # Use embedding-based diversity selection
        return await select_diverse_evidence(evidence, n=max_items, query=query)
    except ImportError:
        # Fallback: mix of early + recent (lost-in-the-middle mitigation)
        early = evidence[:max_items // 3]          # First third of the budget
        recent = evidence[-(max_items * 2 // 3):]  # Last two-thirds
        return early + recent

def format_user_prompt(
    question: str,
    evidence: list[Evidence],
    iteration: int = 0,
    max_iterations: int = 10,
    total_evidence_count: int | None = None,
) -> str:
    """
    Format user prompt with selected evidence and iteration context.

    NOTE: Evidence should be pre-selected using select_evidence_for_judge().
    This function assumes evidence is already capped.
    """
    total_count = total_evidence_count or len(evidence)
    max_content_len = 1500

    def format_single_evidence(i: int, e: Evidence) -> str:
        content = e.content
        if len(content) > max_content_len:
            content = content[:max_content_len] + "..."
        return (
            f"### Evidence {i + 1}\n"
            f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
            f"**URL**: {e.citation.url}\n"
            f"**Content**:\n{content}"
        )

    evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])

    # Lost-in-the-middle mitigation: put critical context at START and END
    return f"""## Research Question (IMPORTANT - stay focused on this)
{question}

## Search Progress
- **Iteration**: {iteration}/{max_iterations}
- **Total evidence collected**: {total_count} sources
- **Evidence shown below**: {len(evidence)} diverse sources (selected for relevance)

## Available Evidence
{evidence_text}

## Your Task
Score this evidence for drug repurposing potential. Provide ONLY scores and extracted data.
DO NOT decide "synthesize" vs "continue" - that decision is made by the system.

## REMINDER: Original Question (stay focused)
{question}
"""
```
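The `select_diverse_evidence()` helper imported above is not defined in this spec; a hypothetical sketch of what it could look like, using maximal marginal relevance (MMR) and assuming an `embed()` coroutine (not part of this spec) that returns unit-normalized vectors:
```python
# Hypothetical MMR-based implementation for src/utils/text_utils.py.
# embed() is an assumed helper returning unit-normalized embedding vectors.
import numpy as np

async def select_diverse_evidence(
    evidence: list[Evidence], n: int, query: str, lambda_: float = 0.7
) -> list[Evidence]:
    vecs = np.array(await embed([e.content for e in evidence]))  # shape (N, d)
    qvec = np.array((await embed([query]))[0])
    relevance = vecs @ qvec                  # cosine similarity to the query
    selected = [int(np.argmax(relevance))]   # seed with the most relevant item
    while len(selected) < min(n, len(evidence)):
        redundancy = (vecs @ vecs[selected].T).max(axis=1)  # similarity to picks
        mmr = lambda_ * relevance - (1 - lambda_) * redundancy
        mmr[selected] = -np.inf              # never re-pick an item
        selected.append(int(np.argmax(mmr)))
    return [evidence[i] for i in selected]
```
The `lambda_` knob trades relevance against diversity; ~0.7 favors relevance while still spreading picks across clusters.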
### Fix 2: Debiased Judge Prompt (Scoring Only)
**File:** `src/prompts/judge.py`
```python
SYSTEM_PROMPT = """You are an expert drug repurposing research judge.

Your task is to SCORE evidence from biomedical literature. You do NOT decide whether to
continue searching or synthesize - that decision is made by the orchestration system
based on your scores.

## Your Role: Scoring Only
You provide objective scores. The system decides next steps based on explicit thresholds.
This separation prevents bias in the decision-making process.

## Scoring Criteria
1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
   - 0-3: No clear mechanism, speculative
   - 4-6: Some mechanistic insight, but gaps exist
   - 7-10: Clear, well-supported mechanism of action
2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
   - 0-3: No clinical data, only theoretical
   - 4-6: Preclinical or early clinical data
   - 7-10: Strong clinical evidence (trials, meta-analyses)
3. **Drug Candidates**: List SPECIFIC drug names mentioned in the evidence
   - Only include drugs explicitly mentioned
   - Do NOT hallucinate or infer drug names
   - Include drug class if specific names aren't available (e.g., "SSRI antidepressants")
4. **Key Findings**: Extract 3-5 key findings from the evidence
   - Focus on findings relevant to the research question
   - Include mechanism insights and clinical outcomes
5. **Confidence (0.0-1.0)**: Your confidence in the scores
   - Based on evidence quality and relevance
   - Lower if evidence is tangential or low-quality

## Output Format
Return valid JSON with these fields:
- details.mechanism_score (int 0-10)
- details.mechanism_reasoning (string)
- details.clinical_evidence_score (int 0-10)
- details.clinical_reasoning (string)
- details.drug_candidates (list of strings)
- details.key_findings (list of strings)
- sufficient (boolean) - TRUE if scores suggest enough evidence
- confidence (float 0-1)
- recommendation ("continue" or "synthesize") - Your suggestion (system may override)
- next_search_queries (list) - If continuing, suggest FOCUSED queries
- reasoning (string)

## CRITICAL: Search Query Rules
When suggesting next_search_queries:
- STAY FOCUSED on the original research question
- Do NOT drift to tangential topics
- If question is about "female libido", do NOT suggest "bone health" or "muscle mass"
- Refine existing terms, don't explore random medical associations
- Example: "female libido post-menopause" → "testosterone therapy female sexual dysfunction"
"""
```
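For reference, the JSON contract above corresponds to a schema along these lines. This is a sketch for illustration only: the real `JudgeAssessment` model in the codebase is authoritative, and `AssessmentDetails` is an assumed name.
```python
# Illustrative only - field names mirror the Output Format section above;
# the actual JudgeAssessment model in the codebase is the source of truth.
from typing import Literal
from pydantic import BaseModel, Field

class AssessmentDetails(BaseModel):
    mechanism_score: int = Field(ge=0, le=10)
    mechanism_reasoning: str
    clinical_evidence_score: int = Field(ge=0, le=10)
    clinical_reasoning: str
    drug_candidates: list[str] = []
    key_findings: list[str] = []

class JudgeAssessment(BaseModel):
    details: AssessmentDetails
    sufficient: bool
    confidence: float = Field(ge=0.0, le=1.0)
    recommendation: Literal["continue", "synthesize"]
    next_search_queries: list[str] = []
    reasoning: str
```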
### Fix 3: Code-Enforced Termination Criteria
**File:** `src/orchestrators/simple.py`
```python
# Termination thresholds (code-enforced, not LLM-decided)
# Based on multi-agent orchestration best practices (November 2025)
# Reference: https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/
TERMINATION_CRITERIA = {
    "min_combined_score": 12,       # mechanism + clinical >= 12
    "min_score_with_volume": 10,    # >= 10 if 50+ sources
    "late_iteration_threshold": 8,  # >= 8 in the final iterations
    "max_evidence_threshold": 100,  # Force synthesis with 100+ sources
    "emergency_iteration": 8,       # Informational: late mode = last 2 iterations (code uses max_iterations - 2)
    "min_confidence": 0.5,          # Minimum confidence for emergency synthesis
}

def should_synthesize(
    assessment: JudgeAssessment,
    iteration: int,
    max_iterations: int,
    evidence_count: int,
) -> tuple[bool, str]:
    """
    Code-enforced synthesis decision. Returns (should_synthesize, reason).

    This function implements the "explicit termination criteria" pattern
    from multi-agent orchestration best practices. The LLM provides scores,
    but CODE decides when to stop.

    Reference: https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025
    """
    combined_score = (
        assessment.details.mechanism_score
        + assessment.details.clinical_evidence_score
    )
    has_drug_candidates = len(assessment.details.drug_candidates) > 0
    confidence = assessment.confidence

    # Priority 1: LLM explicitly says sufficient, with good scores
    if assessment.sufficient and assessment.recommendation == "synthesize":
        if combined_score >= 10:
            return True, "judge_approved"

    # Priority 2: High scores with drug candidates
    if combined_score >= TERMINATION_CRITERIA["min_combined_score"] and has_drug_candidates:
        return True, "high_scores_with_candidates"

    # Priority 3: Good scores with high evidence volume
    if combined_score >= TERMINATION_CRITERIA["min_score_with_volume"] and evidence_count >= 50:
        return True, "good_scores_high_volume"

    # Priority 4: Late iteration with acceptable scores (diminishing returns)
    is_late_iteration = iteration >= max_iterations - 2
    if is_late_iteration and combined_score >= TERMINATION_CRITERIA["late_iteration_threshold"]:
        return True, "late_iteration_acceptable"

    # Priority 5: Very high evidence count (enough to synthesize something)
    if evidence_count >= TERMINATION_CRITERIA["max_evidence_threshold"]:
        return True, "max_evidence_reached"

    # Priority 6: Emergency synthesis (avoid garbage output)
    if is_late_iteration and evidence_count >= 30 and confidence >= TERMINATION_CRITERIA["min_confidence"]:
        return True, "emergency_synthesis"

    return False, "continue_searching"
```
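Replaying a plausible iteration-8 state from the observed run against this helper shows the loop now terminating instead of looping; the constructors below follow the illustrative schema sketch from Fix 2:
```python
# A plausible iteration-8 state from the failing run, expressed with the
# illustrative JudgeAssessment/AssessmentDetails sketch above.
assessment = JudgeAssessment(
    details=AssessmentDetails(
        mechanism_score=7, mechanism_reasoning="...",
        clinical_evidence_score=6, clinical_reasoning="...",
        drug_candidates=["Testosterone"], key_findings=["..."],
    ),
    sufficient=False,            # judge still hedging, as observed
    confidence=0.8,
    recommendation="continue",   # the old design would have obeyed this
    reasoning="...",
)
print(should_synthesize(assessment, iteration=8, max_iterations=10, evidence_count=400))
# -> (True, 'high_scores_with_candidates'): code overrides the judge's "continue"
```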
### Fix 4: Update Orchestrator Decision Phase
**File:** `src/orchestrators/simple.py`
```python
# In the run() method, replace the decision phase:

# === DECISION PHASE (Code-Enforced) ===
should_synth, reason = should_synthesize(
    assessment=assessment,
    iteration=iteration,
    max_iterations=self.config.max_iterations,
    evidence_count=len(all_evidence),
)
logger.info(
    "Synthesis decision",
    should_synthesize=should_synth,
    reason=reason,
    iteration=iteration,
    combined_score=assessment.details.mechanism_score + assessment.details.clinical_evidence_score,
    evidence_count=len(all_evidence),
    confidence=assessment.confidence,
)

if should_synth:
    # Log synthesis trigger reason for debugging
    if reason != "judge_approved":
        logger.info(f"Code-enforced synthesis triggered: {reason}")

    # Optional Analysis Phase
    async for event in self._run_analysis_phase(query, all_evidence, iteration):
        yield event

    yield AgentEvent(
        type="synthesizing",
        message=f"Evidence sufficient ({reason})! Preparing synthesis...",
        iteration=iteration,
    )

    # Generate final response
    final_response = self._generate_synthesis(query, all_evidence, assessment)
    yield AgentEvent(
        type="complete",
        message=final_response,
        data={
            "evidence_count": len(all_evidence),
            "iterations": iteration,
            "synthesis_reason": reason,
            "drug_candidates": assessment.details.drug_candidates,
            "key_findings": assessment.details.key_findings,
        },
        iteration=iteration,
    )
    return
else:
    # Need more evidence - prepare next queries
    current_queries = assessment.next_search_queries or [
        f"{query} mechanism of action",
        f"{query} clinical evidence",
    ]
    yield AgentEvent(
        type="looping",
        message=(
            f"Gathering more evidence (scores: {assessment.details.mechanism_score}+"
            f"{assessment.details.clinical_evidence_score}). "
            f"Next: {', '.join(current_queries[:2])}..."
        ),
        data={"next_queries": current_queries, "reason": reason},
        iteration=iteration,
    )
```
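When the loop exhausts all iterations without `should_synthesize()` firing, the post-loop fallback should emit the real partial synthesis from Fix 5 rather than a citation dump; a sketch (the exact event plumbing in `run()` may differ):
```python
# After the loop: forced-stop fallback (sketch; exact fields may differ).
final_response = self._generate_partial_synthesis(query, all_evidence)
yield AgentEvent(
    type="complete",
    message=final_response,
    data={
        "evidence_count": len(all_evidence),
        "iterations": iteration,
        "synthesis_reason": "max_iterations_reached",
    },
    iteration=iteration,
)
```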
### Fix 5: Real Partial Synthesis
**File:** `src/orchestrators/simple.py`
```python
def _generate_partial_synthesis(
    self,
    query: str,
    evidence: list[Evidence],
) -> str:
    """
    Generate a REAL synthesis when max iterations are reached.

    Even when forced to stop, we should provide:
    - Drug candidates (if any were found)
    - Key findings
    - Assessment scores
    - Actionable citations

    This is still better than a citation dump.
    """
    # Extract data from the last assessment if available
    last_assessment = self.history[-1]["assessment"] if self.history else {}
    details = last_assessment.get("details", {})
    drug_candidates = details.get("drug_candidates", [])
    key_findings = details.get("key_findings", [])
    mechanism_score = details.get("mechanism_score", 0)
    clinical_score = details.get("clinical_evidence_score", 0)
    reasoning = last_assessment.get("reasoning", "Analysis incomplete due to iteration limit.")

    # Format drug candidates
    if drug_candidates:
        drug_list = "\n".join([f"- **{d}**" for d in drug_candidates[:5]])
    else:
        drug_list = (
            "- *No specific drug candidates identified in evidence*\n"
            "- *Try a more specific query or add an API key for better analysis*"
        )

    # Format key findings
    if key_findings:
        findings_list = "\n".join([f"- {f}" for f in key_findings[:5]])
    else:
        findings_list = (
            "- *Key findings require further analysis*\n"
            "- *See citations below for relevant sources*"
        )

    # Format citations (top 10)
    citations = "\n".join([
        f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
        f"({e.citation.source.upper()}, {e.citation.date})"
        for i, e in enumerate(evidence[:10])
    ])
    combined_score = mechanism_score + clinical_score

    return f"""## Drug Repurposing Analysis

### Research Question
{query}

### Status
Analysis based on {len(evidence)} sources across {len(self.history)} iterations.
Maximum iterations reached - results may be incomplete.

### Drug Candidates Identified
{drug_list}

### Key Findings
{findings_list}

### Evidence Quality Scores
| Criterion | Score | Interpretation |
|-----------|-------|----------------|
| Mechanism | {mechanism_score}/10 | {"Strong" if mechanism_score >= 7 else "Moderate" if mechanism_score >= 4 else "Limited"} mechanistic evidence |
| Clinical | {clinical_score}/10 | {"Strong" if clinical_score >= 7 else "Moderate" if clinical_score >= 4 else "Limited"} clinical support |
| Combined | {combined_score}/20 | {"Sufficient" if combined_score >= 12 else "Partial"} for synthesis |

### Analysis Summary
{reasoning}

### Top Citations ({len(evidence)} sources total)
{citations}

---
*For more complete analysis:*
- *Add an OpenAI or Anthropic API key for enhanced LLM analysis*
- *Try a more specific query (e.g., include drug names)*
- *Use Advanced mode for multi-agent research*
"""
```
### Fix 6: Update Judge Handler Signature
**File:** `src/orchestrators/base.py`
```python
class JudgeHandlerProtocol(Protocol):
    """Protocol for judge handler."""

    async def assess(
        self,
        question: str,
        evidence: list[Evidence],
        iteration: int = 0,        # NEW
        max_iterations: int = 10,  # NEW
    ) -> JudgeAssessment:
        """Assess evidence quality and provide scores."""
        ...
```
**File:** `src/agent_factory/judges.py`
Update all handlers (`JudgeHandler`, `HFInferenceJudgeHandler`, `MockJudgeHandler`) to:
```python
async def assess(
    self,
    question: str,
    evidence: list[Evidence],
    iteration: int = 0,
    max_iterations: int = 10,
) -> JudgeAssessment:
    """Assess evidence with iteration context."""
    # Select diverse evidence (not just truncate)
    selected_evidence = await select_evidence_for_judge(evidence, question)

    # Format prompt with iteration context
    user_prompt = format_user_prompt(
        question=question,
        evidence=selected_evidence,
        iteration=iteration,
        max_iterations=max_iterations,
        total_evidence_count=len(evidence),
    )
    # ... rest of implementation
```
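The orchestrator must also pass the new arguments at the call site; a minimal sketch of the updated judge call in `src/orchestrators/simple.py` (surrounding loop code abbreviated):
```python
# Inside the run() loop: forward iteration context to the judge (sketch).
assessment = await self.judge_handler.assess(
    question=query,
    evidence=all_evidence,
    iteration=iteration,                        # NEW: enables late-iteration logic
    max_iterations=self.config.max_iterations,  # NEW
)
```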
---
## Implementation Order
| Order | Fix | Priority | Impact |
|-------|-----|----------|--------|
| 1 | Diverse evidence selection | CRITICAL | Prevents token overflow + position bias |
| 2 | Code-enforced termination | CRITICAL | Guarantees synthesis before max iterations |
| 3 | Debiased judge prompt | HIGH | Removes verbosity/self-preference bias |
| 4 | Real partial synthesis | HIGH | Ensures useful output even on forced stop |
| 5 | Update handler signatures | MEDIUM | Enables iteration context |
| 6 | Update orchestrator | MEDIUM | Integrates all fixes |
---
## Files to Modify
| File | Changes |
|------|---------|
| `src/prompts/judge.py` | New `select_evidence_for_judge()`, updated `format_user_prompt()`, debiased `SYSTEM_PROMPT` |
| `src/orchestrators/simple.py` | New `should_synthesize()`, updated decision phase, real `_generate_partial_synthesis()` |
| `src/orchestrators/base.py` | Update `JudgeHandlerProtocol` signature |
| `src/agent_factory/judges.py` | Update all handlers with iteration params, use diverse selection |
---
## Test Plan
### Unit Tests
```python
# tests/unit/prompts/test_judge_prompt.py

@pytest.mark.asyncio
async def test_evidence_selection_diverse():
    """Verify evidence selection includes early and recent items."""
    evidence = [make_evidence(f"Paper {i}") for i in range(100)]
    selected = await select_evidence_for_judge(evidence, "test query", max_items=30)

    # Should include some early evidence (lost-in-the-middle mitigation).
    # Exact-match membership avoids "Paper 1" spuriously matching "Paper 10".
    titles = [e.citation.title for e in selected]
    assert any(t in {f"Paper {i}" for i in range(10)} for t in titles)       # early items
    assert any(t in {f"Paper {i}" for i in range(90, 100)} for t in titles)  # recent items

def test_prompt_includes_question_at_edges():
    """Verify lost-in-the-middle mitigation."""
    evidence = [make_evidence("Test")]
    prompt = format_user_prompt("important question", evidence, iteration=5, max_iterations=10)

    # Question should appear at START and END of prompt
    lines = prompt.split("\n")
    assert "important question" in lines[1]   # Near start
    assert "important question" in lines[-2]  # Near end

# tests/unit/orchestrators/test_termination.py

def test_should_synthesize_high_scores():
    """High scores with drug candidates trigger synthesis."""
    assessment = make_assessment(mechanism=7, clinical=6, drug_candidates=["Metformin"])
    should_synth, reason = should_synthesize(assessment, iteration=3, max_iterations=10, evidence_count=50)
    assert should_synth is True
    assert reason == "high_scores_with_candidates"

def test_should_synthesize_late_iteration():
    """Late iteration with acceptable scores triggers synthesis."""
    assessment = make_assessment(mechanism=5, clinical=4, drug_candidates=[])
    should_synth, reason = should_synthesize(assessment, iteration=9, max_iterations=10, evidence_count=80)
    assert should_synth is True
    assert reason in ["late_iteration_acceptable", "emergency_synthesis"]

def test_should_not_synthesize_early_low_scores():
    """Early iteration with low scores continues searching."""
    assessment = make_assessment(mechanism=3, clinical=2, drug_candidates=[])
    should_synth, reason = should_synthesize(assessment, iteration=2, max_iterations=10, evidence_count=20)
    assert should_synth is False
    assert reason == "continue_searching"

def test_partial_synthesis_has_drug_candidates():
    """Partial synthesis includes extracted data."""
    orchestrator = Orchestrator(...)
    orchestrator.history = [{
        "assessment": {
            "details": {
                "drug_candidates": ["Testosterone", "DHEA"],
                "key_findings": ["Finding 1", "Finding 2"],
                "mechanism_score": 6,
                "clinical_evidence_score": 5,
            },
            "reasoning": "Good evidence found.",
        }
    }]
    result = orchestrator._generate_partial_synthesis("test", [make_evidence("Test")])
    assert "Testosterone" in result
    assert "DHEA" in result
    assert "Drug Candidates" in result
    assert "6/10" in result  # mechanism score
```
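These tests lean on two factories that aren't defined in this spec; a hypothetical sketch (field and model names - `Citation` in particular - mirror how `e.citation` is used above and may need adjusting to the real models):
```python
# Hypothetical test factories; adjust field names to the real Evidence/Citation
# and JudgeAssessment models (AssessmentDetails follows the Fix 2 sketch).
def make_evidence(title: str) -> Evidence:
    return Evidence(
        content=f"Abstract text for {title}.",
        citation=Citation(
            title=title,
            url="https://example.org/paper",
            source="pubmed",
            date="2025-01-01",
        ),
    )

def make_assessment(mechanism: int, clinical: int, drug_candidates: list[str]) -> JudgeAssessment:
    return JudgeAssessment(
        details=AssessmentDetails(
            mechanism_score=mechanism,
            mechanism_reasoning="test",
            clinical_evidence_score=clinical,
            clinical_reasoning="test",
            drug_candidates=drug_candidates,
            key_findings=[],
        ),
        sufficient=False,
        confidence=0.7,
        recommendation="continue",
        next_search_queries=[],
        reasoning="test",
    )
```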
### Integration Tests
```python
# tests/integration/test_simple_mode_synthesis.py

@pytest.mark.asyncio
async def test_simple_mode_synthesizes_before_max_iterations():
    """Verify simple mode produces useful output with a mocked judge."""
    # Mock judge to return good scores
    mock_judge = MockJudgeHandler()
    orchestrator = Orchestrator(
        search_handler=mock_search_handler,
        judge_handler=mock_judge,
    )

    events = []
    async for event in orchestrator.run("metformin diabetes mechanism"):
        events.append(event)

    # Must have synthesis with drug candidates
    complete_event = next(e for e in events if e.type == "complete")
    assert "Drug Candidates" in complete_event.message
    assert complete_event.data.get("synthesis_reason") is not None

@pytest.mark.asyncio
async def test_large_evidence_does_not_crash():
    """Verify 500 sources don't cause token overflow."""
    evidence = [make_evidence(f"Paper {i}") for i in range(500)]
    selected = await select_evidence_for_judge(evidence, "test query")

    # Should be capped
    assert len(selected) <= MAX_EVIDENCE_FOR_JUDGE

    # 30 items x ~1,700 chars ≈ 51K chars; assert a generous upper bound
    prompt = format_user_prompt("test", selected, iteration=5, max_iterations=10, total_evidence_count=500)
    assert len(prompt) < 100_000  # Well under token limits
```
---
## Acceptance Criteria
- [ ] Evidence sent to judge is diverse-selected (not just truncated)
- [ ] Prompt includes question at START and END (lost-in-the-middle mitigation)
- [ ] Code-enforced `should_synthesize()` makes the termination decision
- [ ] Synthesis triggered by iteration 8 with 50+ sources and scores >= 8
- [ ] Partial synthesis includes drug candidates and scores (not just citations)
- [ ] Search queries stay on-topic (judge prompt enforces focus)
- [ ] 500+ sources don't cause LLM crashes
- [ ] All existing tests pass
---
## Risk Assessment
| Risk | Mitigation |
|------|------------|
| Diverse selection misses critical evidence | Include relevance scoring in selection |
| Code-enforced thresholds too aggressive | Log all synthesis decisions for tuning |
| Prompt changes affect OpenAI/Anthropic differently | Test with all providers |
| Emergency synthesis produces low-quality output | Still better than a citation dump |
---
## Success Metrics
| Metric | Before | After |
|--------|--------|-------|
| Synthesis rate | 0% | 90%+ |
| Average iterations to synthesis | 10 (max) | 5-7 |
| Drug candidates in output | Never | Always (if found) |
| LLM token overflow errors | Common | None |
| User-reported "useless output" | Frequent | Rare |
---
## References
- [LLM-as-a-Judge Guide - Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)
- [Survey on LLM-as-a-Judge - arXiv](https://arxiv.org/abs/2411.15594)
- [RAG Best Practices - Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/)
- [Lessons from RAG 2025 - TrueState](https://www.truestate.io/blog/lessons-from-rag)
- [LangGraph Multi-Agent Orchestration 2025 - Latenode](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
- [Multi-Agent Orchestration on AWS](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/)
- [Spring AI LLM-as-Judge Pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/)