# SPEC 06: Simple Mode Synthesis Fix

## Priority: P0 (Blocker - Simple mode produces garbage output)

## Problem Statement

Simple mode (HuggingFace free tier) runs all 10 iterations and collects 455 sources, but then outputs only a citation dump with no actual synthesis. The user waits through the entire process only to see "Partial Analysis (Max Iterations Reached)" with no drug candidates and no analysis.

**Observed Behavior** (real run):
```
Iterations 1-8:  confidence 70-90%, recommendation="continue"  ← Never synthesizes
Iterations 9-10: confidence 0%                                 ← LLM context overflow
Final output:    Citation list only, no drug candidates, no analysis
```

---

## Research Context (November 2025 Best Practices)

This spec incorporates findings from current industry research on LLM-as-Judge, RAG systems, and multi-agent orchestration.

### LLM-as-Judge Biases ([Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge), [arXiv Survey](https://arxiv.org/abs/2411.15594))

| Bias | Description | Impact on Our System |
|------|-------------|---------------------|
| **Verbosity Bias** | LLM judges prefer longer, more detailed responses | Judge defaults to verbose "continue" explanations |
| **Position Bias** | Systematic preference based on order (primacy/recency) | Most recent evidence over-weighted |
| **Self-Preference Bias** | LLM favors outputs matching its own generation patterns | Defaults to "comfortable" pattern (continue) |

**Key Finding**: "Sophisticated judge models can align with human judgment up to 85%, which is actually higher than human-to-human agreement (81%)." However, this requires careful prompt design and debiasing.

### RAG Context Limits ([Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/), [TrueState](https://www.truestate.io/blog/lessons-from-rag))

> "Long context didn't kill retrieval. Bigger windows add cost and noise; **retrieval focuses attention where it matters.**"

**Key Finding**: RAG is **8-82× cheaper** than long-context approaches. Best practices are:
- **Diverse selection** over recency-only selection
- **Re-ranking** before sending to judge
- **Lost-in-the-middle mitigation** - put critical context at prompt edges

### Multi-Agent Termination ([LangGraph Guide](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025), [AWS Guidance](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/))

> "The planning agent evaluates whether output **fully satisfies task objectives**. If so, the workflow is **terminated early**."

**Key Finding**: Code-enforced termination criteria outperform LLM-decided termination. The pattern is:
1. LLM provides **scores only** (mechanism, clinical, drug candidates)
2. Code evaluates scores against **explicit thresholds**
3. Code decides synthesize vs continue

---

## Root Cause Analysis

### Bug 1: No Evidence Limit in Judge Prompt (CRITICAL)

**File:** `src/prompts/judge.py:62`

```python
# BROKEN: Sends ALL evidence to the LLM
evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
```

**Impact:**
- 455 sources × 1700 chars/source = **773,500 characters ≈ 193K tokens**
- HuggingFace Inference free tier limit: **~4K-8K tokens**
- Result: **Context overflow → LLM failure → fallback response → 0% confidence**

This explains why confidence dropped to 0% in iterations 9-10: the context became too large for the LLM.
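
A back-of-the-envelope check of that estimate (this assumes the common ~4 characters/token heuristic; exact counts depend on the model's tokenizer):

```python
# Rough size estimate for the unbounded judge prompt.
sources = 455
chars_per_source = 1700                    # formatted evidence block per source
total_chars = sources * chars_per_source   # 773,500 characters
approx_tokens = total_chars // 4           # ~193K tokens at ~4 chars/token
free_tier_limit = 8_000                    # upper end of the HF free-tier window
print(approx_tokens // free_tier_limit)    # ~24x over the limit
```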

### Bug 2: LLM Decides Both Scoring AND Recommendation (Anti-Pattern)

**Current Design:**
```python
# LLM does BOTH - subject to verbosity/self-preference bias
"Evaluate evidence... Respond with recommendation: 'continue' or 'synthesize'"
```

**Problem** (per 2025 research):
- LLM exhibits **self-preference bias** - defaults to its "comfortable" pattern
- "Be conservative" instruction triggers **verbosity bias** - prefers longer explanations for "continue"
- No **separation of concerns** - scoring and decision-making conflated

### Bug 3: No Diverse Evidence Selection

**Current Design:**
```python
# Just truncates to most recent - subject to position bias
capped_evidence = evidence[-30:]
```

**Problem** (per RAG research):
- **Position bias** - most recent ≠ most relevant
- **Lost-in-the-middle** - important early evidence ignored
- No **diversity** - may select 30 similar papers

### Bug 4: Prompt Encourages "Continue" Forever

**File:** `src/prompts/judge.py:22-32`

```python
## Sufficiency Criteria (TOO STRICT - requires ALL conditions)
- Combined scores >= 12 AND
- At least one specific drug candidate identified AND
- Clear mechanistic rationale exists

## Output Rules
- Be conservative: only recommend "synthesize" when truly confident  ← TRIGGERS VERBOSITY BIAS
```

### Bug 5: Search Derailment

**Evidence from logs:**
```
Next searches: androgen therapy and bone health, androgen therapy and muscle mass...
```

Original question: "female libido post-menopause" → the judge drifts to tangentially related topics instead of refining the original query.

### Bug 6: Partial Synthesis is Garbage

**File:** `src/orchestrators/simple.py:432-470`

When the maximum iteration count is reached, the orchestrator outputs only a citation list with no analysis, drug candidates, or key findings.

---

## Solution Design

### Architecture Change: Separate Scoring from Decision

**Before (biased):**
```
User Question → LLM Judge → { scores, recommendation } → Orchestrator follows recommendation
```

**After (debiased, per 2025 best practices):**
```
User Question → LLM Judge → { scores only } → Code evaluates → Code decides synthesize/continue
```

This follows the [Spring AI LLM-as-Judge pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/): "Run agent in while loop with evaluator, until evaluator says output passes criteria" - but criteria are **code-enforced**, not LLM-decided.
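
A minimal sketch of the debiased loop, assuming the `should_synthesize()` helper and updated judge signature defined in the fixes below (not the actual `simple.py` implementation):

```python
# Sketch only: handler names and config fields are assumptions carried over
# from the fixes below, not verified against the real orchestrator code.
async def research_loop(query: str, search_handler, judge_handler, config) -> None:
    all_evidence: list[Evidence] = []
    for iteration in range(1, config.max_iterations + 1):
        all_evidence.extend(await search_handler.search(query))

        # The judge returns scores only; it no longer controls termination.
        assessment = await judge_handler.assess(
            question=query,
            evidence=all_evidence,
            iteration=iteration,
            max_iterations=config.max_iterations,
        )

        # Code-enforced decision (Fix 3) replaces "follow the LLM's recommendation".
        stop, reason = should_synthesize(
            assessment, iteration, config.max_iterations, len(all_evidence)
        )
        if stop:
            break  # proceed to synthesis, logging `reason`
```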

---

### Fix 1: Diverse Evidence Selection (Not Just Capping)

**File:** `src/prompts/judge.py`

```python
MAX_EVIDENCE_FOR_JUDGE = 30  # Keep under token limits

async def select_evidence_for_judge(
    evidence: list[Evidence],
    query: str,
    max_items: int = MAX_EVIDENCE_FOR_JUDGE,
) -> list[Evidence]:
    """
    Select diverse, relevant evidence for judge evaluation.

    Implements RAG best practices (November 2025):
    - Diversity selection over recency-only
    - Lost-in-the-middle mitigation
    - Relevance re-ranking

    References:
    - https://www.pinecone.io/learn/retrieval-augmented-generation/
    - https://www.truestate.io/blog/lessons-from-rag
    """
    if len(evidence) <= max_items:
        return evidence

    try:
        from src.utils.text_utils import select_diverse_evidence
        # Use embedding-based diversity selection
        return await select_diverse_evidence(evidence, n=max_items, query=query)
    except ImportError:
        # Fallback: mix of recent + early (lost-in-the-middle mitigation)
        early = evidence[:max_items // 3]           # First third
        recent = evidence[-(max_items * 2 // 3):]   # Last two-thirds
        return early + recent


def format_user_prompt(
    question: str,
    evidence: list[Evidence],
    iteration: int = 0,
    max_iterations: int = 10,
    total_evidence_count: int | None = None,
) -> str:
    """
    Format user prompt with selected evidence and iteration context.

    NOTE: Evidence should be pre-selected using select_evidence_for_judge().
    This function assumes evidence is already capped.
    """
    total_count = total_evidence_count or len(evidence)
    max_content_len = 1500

    def format_single_evidence(i: int, e: Evidence) -> str:
        content = e.content
        if len(content) > max_content_len:
            content = content[:max_content_len] + "..."
        return (
            f"### Evidence {i + 1}\n"
            f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
            f"**URL**: {e.citation.url}\n"
            f"**Content**:\n{content}"
        )

    evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])

    # Lost-in-the-middle mitigation: put critical context at START and END
    return f"""## Research Question (IMPORTANT - stay focused on this)
{question}

## Search Progress
- **Iteration**: {iteration}/{max_iterations}
- **Total evidence collected**: {total_count} sources
- **Evidence shown below**: {len(evidence)} diverse sources (selected for relevance)

## Available Evidence

{evidence_text}

## Your Task

Score this evidence for drug repurposing potential. Provide ONLY scores and extracted data.
DO NOT decide "synthesize" vs "continue" - that decision is made by the system.

## REMINDER: Original Question (stay focused)
{question}
"""
```

### Fix 2: Debiased Judge Prompt (Scoring Only)

**File:** `src/prompts/judge.py`

```python
SYSTEM_PROMPT = """You are an expert drug repurposing research judge.

Your task is to SCORE evidence from biomedical literature. You do NOT decide whether to
continue searching or synthesize - that decision is made by the orchestration system
based on your scores.

## Your Role: Scoring Only

You provide objective scores. The system decides next steps based on explicit thresholds.
This separation prevents bias in the decision-making process.

## Scoring Criteria

1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
   - 0-3: No clear mechanism, speculative
   - 4-6: Some mechanistic insight, but gaps exist
   - 7-10: Clear, well-supported mechanism of action

2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
   - 0-3: No clinical data, only theoretical
   - 4-6: Preclinical or early clinical data
   - 7-10: Strong clinical evidence (trials, meta-analyses)

3. **Drug Candidates**: List SPECIFIC drug names mentioned in the evidence
   - Only include drugs explicitly mentioned
   - Do NOT hallucinate or infer drug names
   - Include drug class if specific names aren't available (e.g., "SSRI antidepressants")

4. **Key Findings**: Extract 3-5 key findings from the evidence
   - Focus on findings relevant to the research question
   - Include mechanism insights and clinical outcomes

5. **Confidence (0.0-1.0)**: Your confidence in the scores
   - Based on evidence quality and relevance
   - Lower if evidence is tangential or low-quality

## Output Format

Return valid JSON with these fields:
- details.mechanism_score (int 0-10)
- details.mechanism_reasoning (string)
- details.clinical_evidence_score (int 0-10)
- details.clinical_reasoning (string)
- details.drug_candidates (list of strings)
- details.key_findings (list of strings)
- sufficient (boolean) - TRUE if scores suggest enough evidence
- confidence (float 0-1)
- recommendation ("continue" or "synthesize") - Your suggestion (system may override)
- next_search_queries (list) - If continuing, suggest FOCUSED queries
- reasoning (string)

## CRITICAL: Search Query Rules

When suggesting next_search_queries:
- STAY FOCUSED on the original research question
- Do NOT drift to tangential topics
- If question is about "female libido", do NOT suggest "bone health" or "muscle mass"
- Refine existing terms, don't explore random medical associations
- Example: "female libido post-menopause" β†’ "testosterone therapy female sexual dysfunction"
"""
```
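
The output fields above map onto the `JudgeAssessment` model used throughout the fixes below. A hypothetical Pydantic sketch of that model, inferred from the attribute access in this spec (the actual definition in `src/agent_factory` may differ):

```python
# Hypothetical reconstruction of the response model implied by the prompt's
# output format; field names mirror the assessment.details.* access in Fix 3.
from pydantic import BaseModel, Field


class AssessmentDetails(BaseModel):
    mechanism_score: int = Field(ge=0, le=10)
    mechanism_reasoning: str
    clinical_evidence_score: int = Field(ge=0, le=10)
    clinical_reasoning: str
    drug_candidates: list[str] = []
    key_findings: list[str] = []


class JudgeAssessment(BaseModel):
    details: AssessmentDetails
    sufficient: bool
    confidence: float = Field(ge=0.0, le=1.0)
    recommendation: str          # "continue" or "synthesize" - suggestion only
    next_search_queries: list[str] = []
    reasoning: str
```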

### Fix 3: Code-Enforced Termination Criteria

**File:** `src/orchestrators/simple.py`

```python
# Termination thresholds (code-enforced, not LLM-decided)
# Based on multi-agent orchestration best practices (November 2025)
# Reference: https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/

TERMINATION_CRITERIA = {
    "min_combined_score": 12,      # mechanism + clinical >= 12
    "min_score_with_volume": 10,   # >= 10 if 50+ sources
    "late_iteration_threshold": 8, # >= 8 in iterations 8+
    "max_evidence_threshold": 100, # Force synthesis with 100+ sources
    "emergency_iteration": 8,      # Last 2 iterations = emergency mode
    "min_confidence": 0.5,         # Minimum confidence for emergency synthesis
}


def should_synthesize(
    assessment: JudgeAssessment,
    iteration: int,
    max_iterations: int,
    evidence_count: int,
) -> tuple[bool, str]:
    """
    Code-enforced synthesis decision.

    Returns (should_synthesize, reason).

    This function implements the "explicit termination criteria" pattern
    from multi-agent orchestration best practices. The LLM provides scores,
    but CODE decides when to stop.

    Reference: https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025
    """
    combined_score = (
        assessment.details.mechanism_score +
        assessment.details.clinical_evidence_score
    )
    has_drug_candidates = len(assessment.details.drug_candidates) > 0
    confidence = assessment.confidence

    # Priority 1: LLM explicitly says sufficient with good scores
    if assessment.sufficient and assessment.recommendation == "synthesize":
        if combined_score >= 10:
            return True, "judge_approved"

    # Priority 2: High scores with drug candidates
    if combined_score >= TERMINATION_CRITERIA["min_combined_score"] and has_drug_candidates:
        return True, "high_scores_with_candidates"

    # Priority 3: Good scores with high evidence volume
    if combined_score >= TERMINATION_CRITERIA["min_score_with_volume"] and evidence_count >= 50:
        return True, "good_scores_high_volume"

    # Priority 4: Late iteration with acceptable scores (diminishing returns)
    is_late_iteration = iteration >= max_iterations - 2
    if is_late_iteration and combined_score >= TERMINATION_CRITERIA["late_iteration_threshold"]:
        return True, "late_iteration_acceptable"

    # Priority 5: Very high evidence count (enough to synthesize something)
    if evidence_count >= TERMINATION_CRITERIA["max_evidence_threshold"]:
        return True, "max_evidence_reached"

    # Priority 6: Emergency synthesis (avoid garbage output)
    if is_late_iteration and evidence_count >= 30 and confidence >= TERMINATION_CRITERIA["min_confidence"]:
        return True, "emergency_synthesis"

    return False, "continue_searching"
```

### Fix 4: Update Orchestrator Decision Phase

**File:** `src/orchestrators/simple.py`

```python
# In the run() method, replace the decision phase:

# === DECISION PHASE (Code-Enforced) ===
should_synth, reason = should_synthesize(
    assessment=assessment,
    iteration=iteration,
    max_iterations=self.config.max_iterations,
    evidence_count=len(all_evidence),
)

logger.info(
    "Synthesis decision",
    should_synthesize=should_synth,
    reason=reason,
    iteration=iteration,
    combined_score=assessment.details.mechanism_score + assessment.details.clinical_evidence_score,
    evidence_count=len(all_evidence),
    confidence=assessment.confidence,
)

if should_synth:
    # Log synthesis trigger reason for debugging
    if reason != "judge_approved":
        logger.info(f"Code-enforced synthesis triggered: {reason}")

    # Optional Analysis Phase
    async for event in self._run_analysis_phase(query, all_evidence, iteration):
        yield event

    yield AgentEvent(
        type="synthesizing",
        message=f"Evidence sufficient ({reason})! Preparing synthesis...",
        iteration=iteration,
    )

    # Generate final response
    final_response = self._generate_synthesis(query, all_evidence, assessment)

    yield AgentEvent(
        type="complete",
        message=final_response,
        data={
            "evidence_count": len(all_evidence),
            "iterations": iteration,
            "synthesis_reason": reason,
            "drug_candidates": assessment.details.drug_candidates,
            "key_findings": assessment.details.key_findings,
        },
        iteration=iteration,
    )
    return

else:
    # Need more evidence - prepare next queries
    current_queries = assessment.next_search_queries or [
        f"{query} mechanism of action",
        f"{query} clinical evidence",
    ]

    yield AgentEvent(
        type="looping",
        message=(
            f"Gathering more evidence (scores: {assessment.details.mechanism_score}+"
            f"{assessment.details.clinical_evidence_score}). "
            f"Next: {', '.join(current_queries[:2])}..."
        ),
        data={"next_queries": current_queries, "reason": reason},
        iteration=iteration,
    )
```

### Fix 5: Real Partial Synthesis

**File:** `src/orchestrators/simple.py`

```python
def _generate_partial_synthesis(
    self,
    query: str,
    evidence: list[Evidence],
) -> str:
    """
    Generate a REAL synthesis when max iterations reached.

    Even when forced to stop, we should provide:
    - Drug candidates (if any were found)
    - Key findings
    - Assessment scores
    - Actionable citations

    This is still better than a citation dump.
    """
    # Extract data from last assessment if available
    last_assessment = self.history[-1]["assessment"] if self.history else {}
    details = last_assessment.get("details", {})

    drug_candidates = details.get("drug_candidates", [])
    key_findings = details.get("key_findings", [])
    mechanism_score = details.get("mechanism_score", 0)
    clinical_score = details.get("clinical_evidence_score", 0)
    reasoning = last_assessment.get("reasoning", "Analysis incomplete due to iteration limit.")

    # Format drug candidates
    if drug_candidates:
        drug_list = "\n".join([f"- **{d}**" for d in drug_candidates[:5]])
    else:
        drug_list = "- *No specific drug candidates identified in evidence*\n- *Try a more specific query or add an API key for better analysis*"

    # Format key findings
    if key_findings:
        findings_list = "\n".join([f"- {f}" for f in key_findings[:5]])
    else:
        findings_list = "- *Key findings require further analysis*\n- *See citations below for relevant sources*"

    # Format citations (top 10)
    citations = "\n".join([
        f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
        f"({e.citation.source.upper()}, {e.citation.date})"
        for i, e in enumerate(evidence[:10])
    ])

    combined_score = mechanism_score + clinical_score

    return f"""## Drug Repurposing Analysis

### Research Question
{query}

### Status
Analysis based on {len(evidence)} sources across {len(self.history)} iterations.
Maximum iterations reached - results may be incomplete.

### Drug Candidates Identified
{drug_list}

### Key Findings
{findings_list}

### Evidence Quality Scores
| Criterion | Score | Interpretation |
|-----------|-------|----------------|
| Mechanism | {mechanism_score}/10 | {"Strong" if mechanism_score >= 7 else "Moderate" if mechanism_score >= 4 else "Limited"} mechanistic evidence |
| Clinical | {clinical_score}/10 | {"Strong" if clinical_score >= 7 else "Moderate" if clinical_score >= 4 else "Limited"} clinical support |
| Combined | {combined_score}/20 | {"Sufficient" if combined_score >= 12 else "Partial"} for synthesis |

### Analysis Summary
{reasoning}

### Top Citations ({len(evidence)} sources total)
{citations}

---
*For more complete analysis:*
- *Add an OpenAI or Anthropic API key for enhanced LLM analysis*
- *Try a more specific query (e.g., include drug names)*
- *Use Advanced mode for multi-agent research*
"""
```

### Fix 6: Update Judge Handler Signature

**File:** `src/orchestrators/base.py`

```python
class JudgeHandlerProtocol(Protocol):
    """Protocol for judge handler."""

    async def assess(
        self,
        question: str,
        evidence: list[Evidence],
        iteration: int = 0,           # NEW
        max_iterations: int = 10,     # NEW
    ) -> JudgeAssessment:
        """Assess evidence quality and provide scores."""
        ...
```

**File:** `src/agent_factory/judges.py`

Update all handlers (`JudgeHandler`, `HFInferenceJudgeHandler`, `MockJudgeHandler`) to:

```python
async def assess(
    self,
    question: str,
    evidence: list[Evidence],
    iteration: int = 0,
    max_iterations: int = 10,
) -> JudgeAssessment:
    """Assess evidence with iteration context."""
    # Select diverse evidence (not just truncate)
    selected_evidence = await select_evidence_for_judge(evidence, question)

    # Format prompt with iteration context
    user_prompt = format_user_prompt(
        question=question,
        evidence=selected_evidence,
        iteration=iteration,
        max_iterations=max_iterations,
        total_evidence_count=len(evidence),
    )

    # ... rest of implementation
```
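
The orchestrator's call site then passes iteration context through to the judge. A hedged sketch (attribute names follow the decision-phase code in Fix 4, not the verified orchestrator source):

```python
# Hypothetical call inside the simple orchestrator's iteration loop.
assessment = await self.judge_handler.assess(
    question=query,
    evidence=all_evidence,
    iteration=iteration,
    max_iterations=self.config.max_iterations,
)
```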

---

## Implementation Order

| Order | Fix | Priority | Impact |
|-------|-----|----------|--------|
| 1 | Diverse evidence selection | CRITICAL | Prevents token overflow + position bias |
| 2 | Code-enforced termination | CRITICAL | Guarantees synthesis before max iterations |
| 3 | Debiased judge prompt | HIGH | Removes verbosity/self-preference bias |
| 4 | Real partial synthesis | HIGH | Ensures useful output even on forced stop |
| 5 | Update handler signatures | MEDIUM | Enables iteration context |
| 6 | Update orchestrator | MEDIUM | Integrates all fixes |

---

## Files to Modify

| File | Changes |
|------|---------|
| `src/prompts/judge.py` | New `select_evidence_for_judge()`, updated `format_user_prompt()`, debiased `SYSTEM_PROMPT` |
| `src/orchestrators/simple.py` | New `should_synthesize()`, updated decision phase, real `_generate_partial_synthesis()` |
| `src/orchestrators/base.py` | Update `JudgeHandlerProtocol` signature |
| `src/agent_factory/judges.py` | Update all handlers with iteration params, use diverse selection |

---

## Test Plan

### Unit Tests

```python
# tests/unit/prompts/test_judge_prompt.py

@pytest.mark.asyncio
async def test_evidence_selection_diverse():
    """Verify evidence selection includes early and recent items."""
    evidence = [make_evidence(f"Paper {i}") for i in range(100)]
    selected = await select_evidence_for_judge(evidence, "test query", max_items=30)

    # Should include some early evidence (lost-in-the-middle mitigation)
    titles = [e.citation.title for e in selected]
    assert any("Paper 0" in t or "Paper 1" in t for t in titles)
    assert any("Paper 99" in t or "Paper 98" in t for t in titles)


def test_prompt_includes_question_at_edges():
    """Verify lost-in-the-middle mitigation."""
    evidence = [make_evidence("Test")]
    prompt = format_user_prompt("important question", evidence, iteration=5, max_iterations=10)

    # Question should appear at START and END of prompt
    lines = prompt.split("\n")
    assert "important question" in lines[1]  # Near start
    assert "important question" in lines[-2]  # Near end


# tests/unit/orchestrators/test_termination.py

def test_should_synthesize_high_scores():
    """High scores with drug candidates triggers synthesis."""
    assessment = make_assessment(mechanism=7, clinical=6, drug_candidates=["Metformin"])
    should_synth, reason = should_synthesize(assessment, iteration=3, max_iterations=10, evidence_count=50)

    assert should_synth is True
    assert reason == "high_scores_with_candidates"


def test_should_synthesize_late_iteration():
    """Late iteration with acceptable scores triggers synthesis."""
    assessment = make_assessment(mechanism=5, clinical=4, drug_candidates=[])
    should_synth, reason = should_synthesize(assessment, iteration=9, max_iterations=10, evidence_count=80)

    assert should_synth is True
    assert reason in ["late_iteration_acceptable", "emergency_synthesis"]


def test_should_not_synthesize_early_low_scores():
    """Early iteration with low scores continues searching."""
    assessment = make_assessment(mechanism=3, clinical=2, drug_candidates=[])
    should_synth, reason = should_synthesize(assessment, iteration=2, max_iterations=10, evidence_count=20)

    assert should_synth is False
    assert reason == "continue_searching"


def test_partial_synthesis_has_drug_candidates():
    """Partial synthesis includes extracted data."""
    orchestrator = Orchestrator(...)
    orchestrator.history = [{
        "assessment": {
            "details": {
                "drug_candidates": ["Testosterone", "DHEA"],
                "key_findings": ["Finding 1", "Finding 2"],
                "mechanism_score": 6,
                "clinical_evidence_score": 5,
            },
            "reasoning": "Good evidence found.",
        }
    }]

    result = orchestrator._generate_partial_synthesis("test", [make_evidence("Test")])

    assert "Testosterone" in result
    assert "DHEA" in result
    assert "Drug Candidates" in result
    assert "6/10" in result  # mechanism score
```
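
The tests above assume `make_evidence` and `make_assessment` fixtures. A minimal sketch of those helpers, assuming the models match the fields used throughout this spec (actual model constructors may differ):

```python
# Hypothetical test fixtures; adjust field names to the real models in src/.
def make_evidence(title: str) -> Evidence:
    return Evidence(
        content=f"Abstract text for {title}.",
        citation=Citation(
            title=title,
            url="https://example.org/paper",
            source="pubmed",
            date="2025-01-01",
        ),
    )


def make_assessment(mechanism: int, clinical: int, drug_candidates: list[str]) -> JudgeAssessment:
    return JudgeAssessment(
        details=AssessmentDetails(
            mechanism_score=mechanism,
            mechanism_reasoning="test",
            clinical_evidence_score=clinical,
            clinical_reasoning="test",
            drug_candidates=drug_candidates,
            key_findings=[],
        ),
        sufficient=False,
        confidence=0.6,
        recommendation="continue",
        next_search_queries=[],
        reasoning="test",
    )
```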

### Integration Tests

```python
# tests/integration/test_simple_mode_synthesis.py

@pytest.mark.asyncio
async def test_simple_mode_synthesizes_before_max_iterations():
    """Verify simple mode produces useful output with mocked judge."""
    # Mock judge to return good scores
    mock_judge = MockJudgeHandler()
    orchestrator = Orchestrator(
        search_handler=mock_search_handler,
        judge_handler=mock_judge,
    )

    events = []
    async for event in orchestrator.run("metformin diabetes mechanism"):
        events.append(event)

    # Must have synthesis with drug candidates
    complete_event = next(e for e in events if e.type == "complete")
    assert "Drug Candidates" in complete_event.message
    assert complete_event.data.get("synthesis_reason") is not None


@pytest.mark.asyncio
async def test_large_evidence_does_not_crash():
    """Verify 500 sources don't cause token overflow."""
    evidence = [make_evidence(f"Paper {i}") for i in range(500)]
    selected = await select_evidence_for_judge(evidence, "test query")

    # Should be capped
    assert len(selected) <= MAX_EVIDENCE_FOR_JUDGE

    # ~30 items at ~1.5K chars each keeps the prompt far below typical context limits
    prompt = format_user_prompt("test", selected, iteration=5, max_iterations=10, total_evidence_count=500)
    assert len(prompt) < 100_000  # loose upper bound; the unbounded prompt was ~770K chars
```

---

## Acceptance Criteria

- [ ] Evidence sent to judge is diverse-selected (not just truncated)
- [ ] Prompt includes question at START and END (lost-in-the-middle mitigation)
- [ ] Code-enforced `should_synthesize()` makes termination decision
- [ ] Synthesis triggered by iteration 8 with 50+ sources and scores >= 8
- [ ] Partial synthesis includes drug candidates and scores (not just citations)
- [ ] Search queries stay on-topic (judge prompt enforces focus)
- [ ] 500+ sources don't cause LLM crashes
- [ ] All existing tests pass

---

## Risk Assessment

| Risk | Mitigation |
|------|------------|
| Diverse selection misses critical evidence | Include relevance scoring in selection |
| Code-enforced thresholds too aggressive | Log all synthesis decisions for tuning |
| Prompt changes affect OpenAI/Anthropic differently | Test with all providers |
| Emergency synthesis produces low-quality output | Still better than citation dump |

---

## Success Metrics

| Metric | Before | After |
|--------|--------|-------|
| Synthesis rate | 0% | 90%+ |
| Average iterations to synthesis | 10 (max) | 5-7 |
| Drug candidates in output | Never | Always (if found) |
| LLM token overflow errors | Common | None |
| User-reported "useless output" | Frequent | Rare |

---

## References

- [LLM-as-a-Judge Guide - Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)
- [Survey on LLM-as-a-Judge - arXiv](https://arxiv.org/abs/2411.15594)
- [RAG Best Practices - Pinecone](https://www.pinecone.io/learn/retrieval-augmented-generation/)
- [Lessons from RAG 2025 - TrueState](https://www.truestate.io/blog/lessons-from-rag)
- [LangGraph Multi-Agent Orchestration 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
- [Multi-Agent Orchestration on AWS](https://aws.amazon.com/solutions/guidance/multi-agent-orchestration-on-aws/)
- [Spring AI LLM-as-Judge Pattern](https://spring.io/blog/2025/11/10/spring-ai-llm-as-judge-blog-post/)