Merge branch 'main' into dev
- docs/bugs/ACTIVE_BUGS.md +23 -12
- docs/bugs/P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md +23 -0
- docs/bugs/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md +148 -0
- docs/specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md +492 -0
- src/agent_factory/judges.py +27 -2
- src/orchestrators/base.py +9 -1
- src/orchestrators/simple.py +177 -22
- src/prompts/judge.py +97 -27
- tests/e2e/conftest.py +1 -1
- tests/integration/test_simple_mode_synthesis.py +147 -0
- tests/unit/orchestrators/test_termination.py +104 -0
- tests/unit/prompts/test_judge_prompt.py +61 -0
docs/bugs/ACTIVE_BUGS.md
CHANGED

```diff
@@ -4,29 +4,40 @@
 
 ## P0 - Blocker
 
-**File:** `P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md`
-
-1. Judge never recommends "synthesize" (prompt too conservative)
-2. Confidence drops to 0% in late iterations (context overflow / API failure)
-3. Search derails to tangential topics (bone health instead of libido)
-4. `_generate_partial_synthesis()` outputs garbage (just citations, no analysis)
+*(None - P0 bugs resolved)*
+
+---
+
+## P3 - Architecture/Enhancement
+
+### P3 - Missing Structured Cognitive Memory
+**File:** `P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md`
+**Spec:** [SPEC_07_LANGGRAPH_MEMORY_ARCH.md](../specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md)
+
+**Problem:** AdvancedOrchestrator uses chat-based state (context drift on long runs).
+**Solution:** Implement LangGraph StateGraph with explicit hypothesis/conflict tracking.
+**Status:** Spec complete, implementation pending.
+
+### P3 - Ephemeral Memory (No Persistence)
+**File:** `P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md`
+
+**Problem:** ChromaDB uses an in-memory client despite `settings.chroma_db_path` existing.
+**Solution:** Switch to `PersistentClient(path=settings.chroma_db_path)`.
+**Status:** Quick fix identified, not yet implemented.
 
 ---
 
 ## Resolved Bugs
 
+### ~~P0 - Simple Mode Never Synthesizes~~ FIXED
+**PR:** [#71](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/pull/71) (SPEC_06)
+**Commit**: `5cac97d` (2025-11-29)
+
+- Root cause: LLM-as-Judge recommendations were being IGNORED
+- Fix: code-enforced termination criteria (`_should_synthesize()`)
+- Added combined score thresholds, late-iteration logic, emergency fallback
+- Simple mode now synthesizes instead of spinning forever
+
 ### ~~P3 - Magentic Mode Missing Termination Guarantee~~ FIXED
 **Commit**: `d36ce3c` (2025-11-29)
```
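The termination fix described in the resolved-bug bullets can be sketched in a few lines. This is an illustrative reconstruction from those bullets, not the actual `src/orchestrators/simple.py` code; the threshold value and parameter names are assumptions:

```python
def should_synthesize(
    combined_score: float,
    iteration: int,
    max_iterations: int,
    score_threshold: float = 0.7,  # assumed value, not the real threshold
) -> bool:
    """Code-enforced termination: decide regardless of the judge's advice."""
    if combined_score >= score_threshold:
        return True  # evidence quality is good enough
    if iteration >= max_iterations - 1:
        return True  # late-iteration / emergency fallback: synthesize anyway
    return False


print(should_synthesize(0.85, iteration=2, max_iterations=10))  # True
```

The point of the fix is the unconditional late-iteration branch: even if the judge never recommends "synthesize", the orchestrator eventually does so on its own.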
docs/bugs/P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md
ADDED

# P3: Ephemeral Memory Architecture (No Persistence)

**Status:** OPEN
**Priority:** P3 (Feature/Architecture Gap)
**Found By:** Codebase Investigation
**Date:** 2025-11-29

## Description

The current `EmbeddingService` (`src/services/embeddings.py`) initializes an **in-memory** ChromaDB client (`chromadb.Client()`) and creates a random UUID-based collection for every new session.

While `src/utils/config.py` defines a `chroma_db_path` for persistence, it is currently **ignored**.

## Impact

1. **No Long-Term Learning:** The agent cannot "remember" research from previous runs. Every time you restart the app, it starts from zero.
2. **Redundant Costs:** If a user researches "Diabetes" twice, the agent re-searches and re-embeds the same papers, wasting tokens and compute time.

## Technical Details

- **Current:** `self._client = chromadb.Client()` (in-memory)
- **Required:** `self._client = chromadb.PersistentClient(path=settings.chroma_db_path)`

## Recommendation

For a "Hackathon Demo," this is **low priority** (ephemeral storage is fine).
For a "Real Product," this is **critical** (users expect a library of research).
docs/bugs/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md
ADDED

# P3: Missing Structured Cognitive Memory (Shared Blackboard)

**Status:** OPEN
**Priority:** P3 (Architecture/Enhancement)
**Found By:** Deep Codebase Investigation
**Date:** 2025-11-29
**Spec:** [SPEC_07_LANGGRAPH_MEMORY_ARCH.md](../specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md)

## Executive Summary

DeepBoner's `AdvancedOrchestrator` has **Data Memory** (a vector store for papers) but lacks **Cognitive Memory** (structured state for hypotheses, conflicts, and the research plan). This causes "context drift" on long runs and prevents intelligent conflict resolution.

---

## Current Architecture (What We Have)

### 1. MagenticState (`src/agents/state.py:18-91`)
```python
class MagenticState(BaseModel):
    evidence: list[Evidence] = Field(default_factory=list)
    embedding_service: Any = None  # ChromaDB connection

    def add_evidence(self, new_evidence: list[Evidence]) -> int: ...
    async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]: ...
```
- **What it does:** Stores Evidence objects, URL-based deduplication, semantic search via embeddings.
- **What it DOESN'T do:** Track hypotheses, conflicts, or research plan status.

### 2. EmbeddingService (`src/services/embeddings.py:29-180`)
```python
self._client = chromadb.Client()  # In-memory (line 44)
self._collection = self._client.create_collection(
    name=f"evidence_{uuid.uuid4().hex}",  # Random name per session (lines 45-47)
    ...
)
```
- **What it does:** In-session semantic search/deduplication.
- **Limitation:** New collection per session; no persistence despite `settings.chroma_db_path` existing.

### 3. AdvancedOrchestrator (`src/orchestrators/advanced.py:51-371`)
- Uses Microsoft's `agent-framework-core` (MagenticBuilder)
- State is implicit in the chat history passed between agents
- The manager decides the next step by reading the conversation, not structured state

---

## The Problem

| Issue | Impact | Evidence |
|-------|--------|----------|
| **No Hypothesis Tracking** | Can't update hypothesis confidence systematically | `MagenticState` has no `hypotheses` field |
| **No Conflict Detection** | Contradictory sources are ignored | No `conflicts` list to flag Source A vs Source B |
| **Context Drift** | Manager forgets the original query after 50+ messages | State lives only in chat, not a structured object |
| **No Plan State** | Can't pause/resume research | No `research_plan` or `next_step` tracking |

---

## The Solution: LangGraph State Graph (Nov 2025 Best Practice)

### Why LangGraph?

Based on [comprehensive analysis](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025):

1. **Explicit State Schema:** A TypedDict/Pydantic model that ALL agents read and write
2. **State Reducers:** `Annotated[list[X], operator.add]` for appending (not overwriting)
3. **HuggingFace Compatible:** Works with `langchain-huggingface` (Llama 3.1)
4. **Production-Ready:** MongoDB checkpointer for persistence, SQLite for dev

### Target Architecture

```python
# src/agents/graph/state.py (PROPOSED)
import operator
from typing import Annotated, Literal, TypedDict

from langchain_core.messages import BaseMessage


class Hypothesis(TypedDict):
    id: str
    statement: str
    status: Literal["proposed", "validating", "confirmed", "refuted"]
    confidence: float
    supporting_evidence_ids: list[str]
    contradicting_evidence_ids: list[str]


class Conflict(TypedDict):
    id: str
    description: str
    source_a_id: str
    source_b_id: str
    status: Literal["open", "resolved"]
    resolution: str | None


class ResearchState(TypedDict):
    query: str  # Immutable original question
    hypotheses: Annotated[list[Hypothesis], operator.add]
    conflicts: Annotated[list[Conflict], operator.add]
    evidence_ids: Annotated[list[str], operator.add]  # Links to ChromaDB
    messages: Annotated[list[BaseMessage], operator.add]
    next_step: Literal["search", "judge", "resolve", "synthesize", "finish"]
    iteration_count: int
```
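To make the schema concrete, here is a stdlib-only sketch of one way a judge step could derive `confidence` and `status` from the evidence links. The `update_confidence` helper and its thresholds are illustrative assumptions, not project code:

```python
from typing import Literal, TypedDict


class Hypothesis(TypedDict):
    id: str
    statement: str
    status: Literal["proposed", "validating", "confirmed", "refuted"]
    confidence: float
    supporting_evidence_ids: list[str]
    contradicting_evidence_ids: list[str]


def update_confidence(h: Hypothesis) -> Hypothesis:
    """Confidence = share of supporting evidence among all linked evidence."""
    support = len(h["supporting_evidence_ids"])
    total = support + len(h["contradicting_evidence_ids"])
    confidence = support / total if total else 0.0
    status = h["status"]
    if total >= 3 and confidence >= 0.8:  # assumed promotion threshold
        status = "confirmed"
    elif total >= 3 and confidence <= 0.2:  # assumed refutation threshold
        status = "refuted"
    return {**h, "confidence": confidence, "status": status}


h: Hypothesis = {
    "id": "h1",
    "statement": "Testosterone therapy improves libido",
    "status": "validating",
    "confidence": 0.0,
    "supporting_evidence_ids": ["e1", "e2", "e3"],
    "contradicting_evidence_ids": [],
}
print(update_confidence(h)["status"])  # three supporting, none against: "confirmed"
```

Whatever the real scoring ends up being, the key property is that it reads only the structured `Hypothesis` record, so any node can recompute it deterministically.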
---

## Implementation Dependencies

| Package | Purpose | Install |
|---------|---------|---------|
| `langgraph>=0.2` | State graph framework | `uv add langgraph` |
| `langchain>=0.3` | Base abstractions | `uv add langchain` |
| `langchain-huggingface` | Llama 3.1 integration | `uv add langchain-huggingface` |
| `langgraph-checkpoint-sqlite` | Dev persistence | `uv add langgraph-checkpoint-sqlite` |

**Note:** The MongoDB checkpointer (`langgraph-checkpoint-mongodb`) is recommended for production per the [MongoDB blog](https://www.mongodb.com/company/blog/product-release-announcements/powering-long-term-memory-for-agents-langgraph).

---

## Alternative Considered: Mem0

[Mem0](https://mem0.ai/) specializes in long-term memory and [outperformed OpenAI by 26%](https://guptadeepak.com/the-ai-memory-wars-why-one-system-crushed-the-competition-and-its-not-openai/) in benchmarks. However:

- **Mem0 excels at:** User personalization, cross-session memory
- **LangGraph excels at:** Workflow orchestration, state machines
- **Verdict:** Use LangGraph for orchestration, and optionally add Mem0 for user-level memory later

---

## Quick Win (Separate from LangGraph)

Enable ChromaDB persistence in `src/services/embeddings.py:44`:
```python
# FROM:
self._client = chromadb.Client()  # In-memory

# TO:
self._client = chromadb.PersistentClient(path=settings.chroma_db_path)
```

This alone gives cross-session evidence persistence (the P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY fix).

---

## References

- [LangGraph Multi-Agent Orchestration Guide 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
- [Long-Term Agentic Memory with LangGraph](https://medium.com/@anil.jain.baba/long-term-agentic-memory-with-langgraph-824050b09852)
- [LangGraph vs LangChain 2025](https://kanerika.com/blogs/langchain-vs-langgraph/)
- [MongoDB + LangGraph Checkpointers](https://www.mongodb.com/company/blog/product-release-announcements/powering-long-term-memory-for-agents-langgraph)
- [Mem0 + LangGraph Integration](https://datacouch.io/blog/build-smarter-ai-agents-mem0-langgraph-guide/)
docs/specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md
ADDED

# SPEC-07: Structured Cognitive Memory Architecture (LangGraph)

**Status:** APPROVED
**Priority:** HIGH (Strategic)
**Author:** DeepBoner Architecture Team
**Date:** 2025-11-29
**Last Updated:** 2025-11-29
**Related Bugs:** [P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY](../bugs/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md)

---

## 1. Executive Summary

Upgrade DeepBoner's "Advanced Mode" from chat-based coordination to a **State-Driven Cognitive Architecture** using LangGraph. This enables:
- Explicit hypothesis tracking with confidence scores
- Automatic conflict detection and resolution
- Persistent research state (pause/resume)
- Context-aware decision making over long runs

---

## 2. Problem Statement

### Current Architecture Limitations

The `AdvancedOrchestrator` (`src/orchestrators/advanced.py`) uses Microsoft's `agent-framework-core` with chat-based coordination:

```python
# Current: state is IMPLICIT (chat history)
workflow = (
    MagenticBuilder()
    .participants(searcher=..., judge=..., ...)
    .with_standard_manager(chat_client=..., max_round_count=10)
    .build()
)
```

| Problem | Root Cause | File Location |
|---------|------------|---------------|
| Context Drift | State lives only in chat messages | `advanced.py:126-132` |
| Conflict Blindness | No structured conflict tracking | `state.py` (no `conflicts` field) |
| No Hypothesis Management | `MagenticState` only tracks `evidence` | `state.py:21` |
| Can't Pause/Resume | No checkpointing mechanism | N/A |

### Evidence from Codebase

**MagenticState (`src/agents/state.py:18-26`):**
```python
class MagenticState(BaseModel):
    evidence: list[Evidence] = Field(default_factory=list)
    embedding_service: Any = None  # Just data, no cognitive state
```

**EmbeddingService (`src/services/embeddings.py:44-47`):**
```python
self._client = chromadb.Client()  # In-memory only
self._collection = self._client.create_collection(
    name=f"evidence_{uuid.uuid4().hex}",  # Random name = ephemeral
    ...
)
```

---

## 3. Solution: LangGraph State Graph

### Why LangGraph? (November 2025 Analysis)

Based on [comprehensive framework comparison](https://kanerika.com/blogs/langchain-vs-langgraph/):

| Feature | `agent-framework-core` (Current) | LangGraph (Proposed) |
|---------|----------------------------------|----------------------|
| State Management | Implicit (chat) | Explicit (TypedDict) |
| Loops/Branches | Limited | Native support |
| Checkpointing | None | SQLite/MongoDB |
| HuggingFace | Requires OpenAI format | Native `langchain-huggingface` |

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                         ResearchState                           │
│  ┌─────────────┬──────────────┬───────────────┬──────────────┐  │
│  │    query    │  hypotheses  │   conflicts   │  next_step   │  │
│  │   (string)  │    (list)    │    (list)     │    (enum)    │  │
│  └─────────────┴──────────────┴───────────────┴──────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                          StateGraph                             │
│                                                                 │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐                │
│   │  SEARCH  │────▶│  JUDGE   │────▶│ RESOLVE  │                │
│   │   Node   │     │   Node   │     │   Node   │                │
│   └──────────┘     └──────────┘     └──────────┘                │
│        ▲                │                │                      │
│        │                ▼                │                      │
│        │          ┌──────────┐           │                      │
│        └──────────│SUPERVISOR│◀──────────┘                      │
│                   │   Node   │                                  │
│                   └──────────┘                                  │
│                        │                                        │
│                        ▼                                        │
│                  ┌──────────┐                                   │
│                  │SYNTHESIZE│                                   │
│                  │   Node   │                                   │
│                  └──────────┘                                   │
└─────────────────────────────────────────────────────────────────┘
```

---

## 4. Technical Specification

### 4.1 State Schema

**File:** `src/agents/graph/state.py`

```python
"""Structured state for LangGraph research workflow."""
import operator
from typing import Annotated, Literal, TypedDict

from langchain_core.messages import BaseMessage


class Hypothesis(TypedDict):
    """A research hypothesis with evidence tracking."""
    id: str
    statement: str
    status: Literal["proposed", "validating", "confirmed", "refuted"]
    confidence: float  # 0.0 - 1.0
    supporting_evidence_ids: list[str]
    contradicting_evidence_ids: list[str]


class Conflict(TypedDict):
    """A detected contradiction between sources."""
    id: str
    description: str
    source_a_id: str
    source_b_id: str
    status: Literal["open", "resolved"]
    resolution: str | None


class ResearchState(TypedDict):
    """The cognitive state shared across all graph nodes.

    Uses Annotated with operator.add for list fields to enable
    additive updates (append) rather than replacement.
    """
    # Immutable context
    query: str

    # Cognitive state (the "blackboard")
    hypotheses: Annotated[list[Hypothesis], operator.add]
    conflicts: Annotated[list[Conflict], operator.add]

    # Evidence links (actual content in ChromaDB)
    evidence_ids: Annotated[list[str], operator.add]

    # Chat history (for LLM context)
    messages: Annotated[list[BaseMessage], operator.add]

    # Control flow
    next_step: Literal["search", "judge", "resolve", "synthesize", "finish"]
    iteration_count: int
    max_iterations: int
```
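The reducer annotations can be exercised without LangGraph installed. The sketch below mimics the merge LangGraph performs when a node returns a partial update; `merge_update` is an illustrative stand-in, not LangGraph API:

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class MiniState(TypedDict):
    """Cut-down ResearchState with one reducer-annotated field."""
    evidence_ids: Annotated[list[str], operator.add]


def merge_update(state: dict, update: dict) -> dict:
    """Apply each field's annotated reducer, as LangGraph would."""
    hints = get_type_hints(MiniState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducer = hints[key].__metadata__[0]  # e.g. operator.add
        merged[key] = reducer(state[key], value)  # append, don't overwrite
    return merged


state = {"evidence_ids": ["pmid:111"]}
state = merge_update(state, {"evidence_ids": ["pmid:222"]})
print(state["evidence_ids"])  # evidence accumulates across node runs
```

Without the reducer, a node returning `{"evidence_ids": [...]}` would replace the whole list; with `operator.add`, each node's findings are appended to the shared blackboard.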

### 4.2 Graph Nodes

Each node is a pure function: `(state: ResearchState) -> dict`

**File:** `src/agents/graph/nodes.py`

```python
"""Graph node implementations."""
import asyncio

from langchain_core.messages import AIMessage, HumanMessage

from src.agents.graph.state import ResearchState
from src.tools.pubmed import search_pubmed
from src.tools.clinicaltrials import search_clinicaltrials
from src.tools.europepmc import search_europepmc


async def search_node(state: ResearchState) -> dict:
    """Execute search across all sources.

    Returns a partial state update (additive via operator.add).
    """
    query = state["query"]
    # Reuse existing tools
    results = await asyncio.gather(
        search_pubmed(query),
        search_clinicaltrials(query),
        search_europepmc(query),
    )
    new_evidence_ids = [...]  # Store in ChromaDB, return IDs
    return {
        "evidence_ids": new_evidence_ids,
        "messages": [AIMessage(content=f"Found {len(new_evidence_ids)} papers")],
    }


async def judge_node(state: ResearchState) -> dict:
    """Evaluate evidence and update hypothesis confidence.

    Key responsibility: detect conflicts and flag them.
    """
    # LLM call to evaluate hypotheses against evidence
    # If a contradiction is found: add it to the conflicts list
    return {
        "hypotheses": updated_hypotheses,  # With new confidence scores
        "conflicts": new_conflicts,  # Any detected contradictions
        "messages": [...],
    }


async def resolve_node(state: ResearchState) -> dict:
    """Handle open conflicts via tie-breaker logic.

    Triggers targeted search or reasoning to resolve.
    """
    open_conflicts = [c for c in state["conflicts"] if c["status"] == "open"]
    # For each conflict: search for decisive evidence or make a judgment call
    return {
        "conflicts": resolved_conflicts,
        "messages": [...],
    }


async def synthesize_node(state: ResearchState) -> dict:
    """Generate the final research report.

    Only uses confirmed hypotheses and resolved conflicts.
    """
    confirmed = [h for h in state["hypotheses"] if h["status"] == "confirmed"]
    # Generate structured report
    return {
        "messages": [AIMessage(content=report_markdown)],
        "next_step": "finish",
    }


def supervisor_node(state: ResearchState) -> dict:
    """Route to the next node based on state.

    This is the "brain" - it uses an LLM to decide the next action
    based on STRUCTURED STATE (not just chat).
    """
    # Decision logic:
    # 1. If open conflicts exist -> "resolve"
    # 2. If hypotheses need more evidence -> "search"
    # 3. If evidence is sufficient -> "judge"
    # 4. If all hypotheses confirmed -> "synthesize"
    # 5. If max iterations reached -> "synthesize" (forced)
    return {"next_step": decided_step, "iteration_count": state["iteration_count"] + 1}
```
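The supervisor's five rules above are deterministic given structured state, so they can be sketched and tested without an LLM. `decide_next_step` is an illustrative name and the rule ordering (forced termination checked first, so the graph always halts) is an assumption; the real node would combine these checks with an LLM call:

```python
def decide_next_step(state: dict) -> str:
    """Apply the supervisor's routing rules to structured state."""
    if state["iteration_count"] >= state["max_iterations"]:
        return "synthesize"  # rule 5: forced termination
    if any(c["status"] == "open" for c in state["conflicts"]):
        return "resolve"  # rule 1: open conflicts take priority
    if not state["evidence_ids"]:
        return "search"  # rule 2: hypotheses need more evidence
    if state["hypotheses"] and all(
        h["status"] == "confirmed" for h in state["hypotheses"]
    ):
        return "synthesize"  # rule 4: everything confirmed
    return "judge"  # rule 3: evidence sufficient, evaluate it


state = {
    "iteration_count": 2,
    "max_iterations": 10,
    "conflicts": [{"status": "open"}],
    "hypotheses": [],
    "evidence_ids": ["pmid:111"],
}
print(decide_next_step(state))  # open conflict wins: "resolve"
```

Because the decision reads only `ResearchState` fields, the same query at the same state always routes the same way, which is exactly what the chat-based manager cannot guarantee.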
|
| 257 |
+
|
| 258 |
+
### 4.3 Graph Definition
|
| 259 |
+
|
| 260 |
+
**File:** `src/agents/graph/workflow.py`
|
| 261 |
+
|
| 262 |
+
```python
|
| 263 |
+
"""LangGraph workflow definition."""
|
| 264 |
+
from langgraph.graph import StateGraph, END
|
| 265 |
+
from langgraph.checkpoint.sqlite import SqliteSaver
|
| 266 |
+
|
| 267 |
+
from src.agents.graph.state import ResearchState
|
| 268 |
+
from src.agents.graph.nodes import (
|
| 269 |
+
search_node,
|
| 270 |
+
judge_node,
|
| 271 |
+
resolve_node,
|
| 272 |
+
synthesize_node,
|
| 273 |
+
supervisor_node,
|
| 274 |
+
)
|
| 275 |
+
|
| 276 |
+
|
| 277 |
+
def create_research_graph(checkpointer=None):
|
| 278 |
+
"""Build the research state graph.
|
| 279 |
+
|
| 280 |
+
Args:
|
| 281 |
+
checkpointer: Optional SqliteSaver/MongoDBSaver for persistence
|
| 282 |
+
"""
|
| 283 |
+
graph = StateGraph(ResearchState)
|
| 284 |
+
|
| 285 |
+
# Add nodes
|
| 286 |
+
graph.add_node("supervisor", supervisor_node)
|
| 287 |
+
graph.add_node("search", search_node)
|
| 288 |
+
graph.add_node("judge", judge_node)
|
| 289 |
+
graph.add_node("resolve", resolve_node)
|
| 290 |
+
graph.add_node("synthesize", synthesize_node)
|
| 291 |
+
|
| 292 |
+
# Define edges (supervisor routes based on state.next_step)
|
| 293 |
+
graph.add_edge("search", "supervisor")
|
| 294 |
+
graph.add_edge("judge", "supervisor")
|
| 295 |
+
graph.add_edge("resolve", "supervisor")
|
| 296 |
+
graph.add_edge("synthesize", END)
|
| 297 |
+
|
| 298 |
+
# Conditional routing from supervisor
|
| 299 |
+
graph.add_conditional_edges(
|
| 300 |
+
"supervisor",
|
| 301 |
+
lambda state: state["next_step"],
|
| 302 |
+
{
|
| 303 |
+
"search": "search",
|
| 304 |
+
"judge": "judge",
|
| 305 |
+
"resolve": "resolve",
|
| 306 |
+
"synthesize": "synthesize",
|
| 307 |
+
"finish": END,
|
| 308 |
+
},
|
| 309 |
+
)
|
| 310 |
+
|
| 311 |
+
# Entry point
|
| 312 |
+
graph.set_entry_point("supervisor")
|
| 313 |
+
|
| 314 |
+
return graph.compile(checkpointer=checkpointer)
|
| 315 |
+
```
|
| 316 |
+
|
| 317 |
+
### 4.4 Orchestrator Integration
|
| 318 |
+
|
| 319 |
+
**File:** `src/orchestrators/langgraph_orchestrator.py`
|
| 320 |
+
|
| 321 |
+
```python
|
| 322 |
+
"""LangGraph-based orchestrator with structured state."""
|
| 323 |
+
from collections.abc import AsyncGenerator
|
| 324 |
+
from langgraph.checkpoint.sqlite.aio import AsyncSqliteSaver
|
| 325 |
+
|
| 326 |
+
from src.agents.graph.workflow import create_research_graph
|
| 327 |
+
from src.agents.graph.state import ResearchState
|
| 328 |
+
from src.orchestrators.base import OrchestratorProtocol
|
| 329 |
+
from src.utils.models import AgentEvent
|
| 330 |
+
|
| 331 |
+
|
| 332 |
+
class LangGraphOrchestrator(OrchestratorProtocol):
|
| 333 |
+
"""State-driven research orchestrator using LangGraph."""
|
| 334 |
+
|
| 335 |
+
def __init__(
|
| 336 |
+
self,
|
| 337 |
+
max_iterations: int = 10,
|
| 338 |
+
checkpoint_path: str | None = None,
|
| 339 |
+
):
|
| 340 |
+
self._max_iterations = max_iterations
|
| 341 |
+
self._checkpoint_path = checkpoint_path
|
| 342 |
+
|
| 343 |
+
async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
|
| 344 |
+
"""Execute research workflow with structured state."""
|
| 345 |
+
# Setup checkpointer (SQLite for dev, MongoDB for prod)
|
| 346 |
+
checkpointer = None
|
| 347 |
+
if self._checkpoint_path:
|
| 348 |
+
checkpointer = AsyncSqliteSaver.from_conn_string(self._checkpoint_path)
|
| 349 |
+
|
| 350 |
+
graph = create_research_graph(checkpointer)
|
| 351 |
+
|
| 352 |
+
# Initialize state
|
| 353 |
+
initial_state: ResearchState = {
|
| 354 |
+
"query": query,
|
| 355 |
+
"hypotheses": [],
|
| 356 |
+
"conflicts": [],
|
| 357 |
+
"evidence_ids": [],
|
| 358 |
+
"messages": [],
|
| 359 |
+
"next_step": "search",
|
| 360 |
+
"iteration_count": 0,
|
| 361 |
+
"max_iterations": self._max_iterations,
|
| 362 |
+
}
|
| 363 |
+
|
| 364 |
+
yield AgentEvent(type="started", message=f"Starting research: {query}")
|
| 365 |
+
|
| 366 |
+
# Stream through graph
|
| 367 |
+
async for event in graph.astream(initial_state):
|
| 368 |
+
# Convert graph events to AgentEvents
|
| 369 |
+
yield self._convert_event(event)
|
| 370 |
+
```
|
| 371 |
+
|
| 372 |
+
---
|
| 373 |
+
|
| 374 |
+
## 5. Dependencies
|
| 375 |
+
|
| 376 |
+
### Required Packages
|
| 377 |
+
|
| 378 |
+
```toml
|
| 379 |
+
# pyproject.toml additions
|
| 380 |
+
[project.optional-dependencies]
|
| 381 |
+
langgraph = [
|
| 382 |
+
"langgraph>=0.2.50",
|
| 383 |
+
"langchain>=0.3.9",
|
| 384 |
+
"langchain-core>=0.3.21",
|
| 385 |
+
"langchain-huggingface>=0.1.2",
|
| 386 |
+
"langgraph-checkpoint-sqlite>=2.0.0",
|
| 387 |
+
]
|
| 388 |
+
```
|
| 389 |
+
|
| 390 |
+
### Installation
|
| 391 |
+
|
| 392 |
+
```bash
|
| 393 |
+
# Development
|
| 394 |
+
uv add langgraph langchain langchain-huggingface langgraph-checkpoint-sqlite
|
| 395 |
+
|
| 396 |
+
# Production (add MongoDB checkpointer)
|
| 397 |
+
uv add langgraph-checkpoint-mongodb
|
| 398 |
+
```
|

### HuggingFace Model Integration

```python
# Using Llama 3.1 via HuggingFace Inference API
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-3.1-70B-Instruct",
    task="text-generation",
    max_new_tokens=2048,
    huggingfacehub_api_token=settings.hf_token,
)
chat = ChatHuggingFace(llm=llm)
```

---

## 6. Implementation Plan (TDD)

### Phase 1: State Schema (2 hours)

1. Create `src/agents/graph/__init__.py`
2. Create `src/agents/graph/state.py` with TypedDict schemas
3. Write `tests/unit/graph/test_state.py`:
   - Test reducer behavior (operator.add)
   - Test state initialization
   - Test hypothesis/conflict type validation
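The reducer behavior Phase 1 tests can be sketched without LangGraph installed. The sketch below is illustrative, not the spec's actual `state.py`: the field names follow the `initial_state` shown earlier, and `apply_update` is a hypothetical helper that mimics how LangGraph merges a node's update into state, combining `Annotated` list fields with `operator.add` while overwriting plain fields.

```python
import operator
from typing import Annotated, TypedDict


class ResearchState(TypedDict):
    query: str
    hypotheses: Annotated[list[str], operator.add]    # merged via reducer
    evidence_ids: Annotated[list[str], operator.add]  # merged via reducer
    iteration_count: int                              # plain field: overwritten


def apply_update(state: dict, update: dict) -> dict:
    """Illustrative helper: Annotated list fields are combined with their
    reducer (operator.add); plain fields are simply replaced."""
    merged = dict(state)
    for key, value in update.items():
        if key in ("hypotheses", "evidence_ids"):
            merged[key] = operator.add(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"query": "q", "hypotheses": ["h1"], "evidence_ids": [], "iteration_count": 0}
state = apply_update(state, {"hypotheses": ["h2"], "iteration_count": 1})
# state["hypotheses"] == ["h1", "h2"]; state["iteration_count"] == 1
```

This is the property `tests/unit/graph/test_state.py` should pin down: concurrent node updates accumulate list fields instead of clobbering them.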

### Phase 2: Graph Nodes (4 hours)

1. Create `src/agents/graph/nodes.py`
2. Adapt existing tool calls (pubmed, clinicaltrials, europepmc)
3. Write `tests/unit/graph/test_nodes.py`:
   - Test each node in isolation (mock LLM)
   - Test state update format

### Phase 3: Workflow Graph (2 hours)

1. Create `src/agents/graph/workflow.py`
2. Wire up StateGraph with conditional edges
3. Write `tests/integration/graph/test_workflow.py`:
   - Test routing logic
   - Test end-to-end with mocked nodes
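The routing logic Phase 3 wires up can be sketched as a plain function. This is a hypothetical router, assuming the state fields from the `initial_state` above (`iteration_count`, `max_iterations`, `conflicts`, `next_step`) and the four node names from the acceptance criteria; in the real workflow it would be registered on the StateGraph via `add_conditional_edges`, which is omitted here.

```python
def route_next(state: dict) -> str:
    """Hypothetical conditional-edge router: the supervisor reads structured
    state fields, not chat history, to choose the next node."""
    if state["iteration_count"] >= state["max_iterations"]:
        return "synthesize"  # hard termination guarantee
    if state.get("conflicts"):
        return "resolve"  # conflicting hypotheses need resolution first
    return state.get("next_step", "search")


route_next({"iteration_count": 10, "max_iterations": 10})  # -> "synthesize"
route_next({"iteration_count": 1, "max_iterations": 10, "conflicts": ["c"]})  # -> "resolve"
```

Because the router only reads typed fields, `tests/integration/graph/test_workflow.py` can exercise every branch with plain dicts and no LLM.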

### Phase 4: Orchestrator (2 hours)

1. Create `src/orchestrators/langgraph_orchestrator.py`
2. Update `src/orchestrators/factory.py` to include "langgraph" mode
3. Update `src/app.py` UI dropdown
4. Write `tests/e2e/test_langgraph_mode.py`

### Phase 5: Gradio Integration (1 hour)

1. Add "God Mode" option to Gradio dropdown
2. Test streaming events
3. Verify checkpointing (pause/resume)

---

## 7. Migration Strategy

1. **Parallel Implementation:** Build as new mode alongside existing "simple" and "magentic"
2. **UI Dropdown:** Add "God Mode (Experimental)" option
3. **Feature Flag:** Use `settings.enable_langgraph_mode` to control availability
4. **Deprecation Path:** Once stable, deprecate "magentic" mode (Q1 2026)
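The feature-flag gate in item 3 might look like the following sketch. `Settings` and `resolve_mode` here are hypothetical stand-ins, not the project's actual config or factory code; the only assumption taken from the spec is the boolean `enable_langgraph_mode` flag.

```python
class Settings:
    # Hypothetical stand-in for the project's settings object;
    # `enable_langgraph_mode` is the flag named in item 3 above.
    enable_langgraph_mode: bool = False


def resolve_mode(requested: str, settings: Settings) -> str:
    """Hypothetical factory gate: quietly fall back to 'simple' while the
    experimental mode is behind the feature flag."""
    if requested == "langgraph" and not settings.enable_langgraph_mode:
        return "simple"
    return requested
```

Falling back rather than raising keeps the UI dropdown safe even if "God Mode (Experimental)" ships before the flag is enabled.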

---

## 8. Acceptance Criteria

- [ ] `ResearchState` TypedDict defined with all fields
- [ ] All 4 nodes (search, judge, resolve, synthesize) implemented
- [ ] Supervisor routing logic works based on structured state
- [ ] Checkpointing enables pause/resume
- [ ] Works with HuggingFace Inference API (no OpenAI required)
- [ ] Integration tests pass with mocked LLM
- [ ] E2E test passes with real API call

---

## 9. References

### Primary Sources
- [LangGraph Official Docs](https://docs.langchain.com/oss/python/langgraph)
- [LangGraph Persistence Guide](https://docs.langchain.com/oss/python/langgraph/persistence)
- [MongoDB + LangGraph Integration](https://www.mongodb.com/docs/atlas/ai-integrations/langgraph/)

### Research & Analysis
- [LangGraph Multi-Agent Orchestration 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
- [LangChain vs LangGraph Comparison](https://kanerika.com/blogs/langchain-vs-langgraph/)
- [Building Deep Research Agents](https://towardsdatascience.com/langgraph-101-lets-build-a-deep-research-agent/)
- [Mem0 + LangGraph Integration](https://blog.futuresmart.ai/ai-agents-memory-mem0-langgraph-agent-integration)
- [AI Memory Wars Benchmark](https://guptadeepak.com/the-ai-memory-wars-why-one-system-crushed-the-competition-and-its-not-openai/)
src/agent_factory/judges.py
CHANGED

@@ -19,6 +19,7 @@ from src.prompts.judge import (
     SYSTEM_PROMPT,
     format_empty_evidence_prompt,
     format_user_prompt,
+    select_evidence_for_judge,
 )
 from src.utils.config import settings
 from src.utils.models import AssessmentDetails, Evidence, JudgeAssessment

@@ -102,6 +103,8 @@ class JudgeHandler:
         self,
         question: str,
         evidence: list[Evidence],
+        iteration: int = 0,
+        max_iterations: int = 10,
     ) -> JudgeAssessment:
         """
         Assess evidence and determine if it's sufficient.

@@ -109,6 +112,8 @@ class JudgeHandler:
         Args:
             question: The user's research question
             evidence: List of Evidence objects from search
+            iteration: Current iteration number
+            max_iterations: Maximum allowed iterations

         Returns:
             JudgeAssessment with evaluation results

@@ -120,11 +125,20 @@ class JudgeHandler:
             "Starting evidence assessment",
             question=question[:100],
             evidence_count=len(evidence),
+            iteration=iteration,
         )

         # Format the prompt based on whether we have evidence
         if evidence:
-            user_prompt = format_user_prompt(question, evidence)
+            # Select diverse evidence using embeddings (if available)
+            selected_evidence = await select_evidence_for_judge(evidence, question)
+            user_prompt = format_user_prompt(
+                question,
+                selected_evidence,
+                iteration,
+                max_iterations,
+                total_evidence_count=len(evidence),
+            )
         else:
             user_prompt = format_empty_evidence_prompt(question)

@@ -218,6 +232,8 @@ class HFInferenceJudgeHandler:
         self,
         question: str,
         evidence: list[Evidence],
+        iteration: int = 0,
+        max_iterations: int = 10,
     ) -> JudgeAssessment:
         """
         Assess evidence using HuggingFace Inference API.

@@ -246,7 +262,14 @@ class HFInferenceJudgeHandler:
         # Format the user prompt
         if evidence:
-            user_prompt = format_user_prompt(question, evidence)
+            selected_evidence = await select_evidence_for_judge(evidence, question)
+            user_prompt = format_user_prompt(
+                question,
+                selected_evidence,
+                iteration,
+                max_iterations,
+                total_evidence_count=len(evidence),
+            )
         else:
             user_prompt = format_empty_evidence_prompt(question)

@@ -535,6 +558,8 @@ class MockJudgeHandler:
         self,
         question: str,
         evidence: list[Evidence],
+        iteration: int = 0,
+        max_iterations: int = 10,
     ) -> JudgeAssessment:
         """Return assessment based on actual evidence (demo mode)."""
         self.call_count += 1
src/orchestrators/base.py
CHANGED

@@ -40,12 +40,20 @@ class JudgeHandlerProtocol(Protocol):
     and MockJudgeHandler.
     """

-    async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
+    async def assess(
+        self,
+        question: str,
+        evidence: list[Evidence],
+        iteration: int = 0,
+        max_iterations: int = 10,
+    ) -> JudgeAssessment:
         """Assess whether collected evidence is sufficient.

         Args:
             question: The original research question
             evidence: List of evidence items to assess
+            iteration: Current iteration number
+            max_iterations: Maximum allowed iterations

         Returns:
             JudgeAssessment with sufficiency determination and next steps
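Because `JudgeHandlerProtocol` is a `typing.Protocol`, any handler whose `assess` method matches the widened signature satisfies it structurally, and the defaulted parameters keep older two-argument call sites working. A minimal sketch, with simplified names and a `dict` return standing in for `JudgeAssessment`:

```python
import asyncio
from typing import Any, Protocol


class JudgeLike(Protocol):
    # Simplified stand-in for JudgeHandlerProtocol.
    async def assess(
        self,
        question: str,
        evidence: list[Any],
        iteration: int = 0,
        max_iterations: int = 10,
    ) -> dict: ...


class StubJudge:
    # No inheritance needed: matching the method shape satisfies the Protocol.
    async def assess(self, question, evidence, iteration=0, max_iterations=10):
        return {"sufficient": True, "iteration": iteration}


judge: JudgeLike = StubJudge()  # structural match
result = asyncio.run(judge.assess("q", []))  # old-style 2-arg call still works
# result == {"sufficient": True, "iteration": 0}
```

The defaults are what let the change land without updating every caller at once.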
src/orchestrators/simple.py
CHANGED

@@ -12,7 +12,7 @@ from __future__ import annotations

 import asyncio
 from collections.abc import AsyncGenerator
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING, Any, ClassVar

 import structlog

@@ -42,6 +42,18 @@ class Orchestrator:
     Microsoft Agent Framework.
     """

+    # Termination thresholds (code-enforced, not LLM-decided)
+    TERMINATION_CRITERIA: ClassVar[dict[str, float]] = {
+        "min_combined_score": 12.0,  # mechanism + clinical >= 12
+        "min_score_with_volume": 10.0,  # >= 10 if 50+ sources
+        "min_evidence_for_volume": 50.0,  # Priority 3: evidence count threshold
+        "late_iteration_threshold": 8.0,  # >= 8 in iterations 8+
+        "max_evidence_threshold": 100.0,  # Force synthesis with 100+ sources
+        "emergency_iteration": 8.0,  # Last 2 iterations = emergency mode
+        "min_confidence": 0.5,  # Minimum confidence for emergency synthesis
+        "min_evidence_for_emergency": 30.0,  # Priority 6: min evidence for emergency
+    }
+
     def __init__(
         self,
         search_handler: SearchHandlerProtocol,

@@ -100,6 +112,7 @@ class Orchestrator:
         try:
             # Deduplicate using semantic similarity
             unique_evidence: list[Evidence] = await embeddings.deduplicate(evidence, threshold=0.85)
+
             logger.info(
                 "Deduplicated evidence",
                 before=len(evidence),

@@ -153,6 +166,65 @@ class Orchestrator:
             iteration=iteration,
         )

+    def _should_synthesize(
+        self,
+        assessment: JudgeAssessment,
+        iteration: int,
+        max_iterations: int,
+        evidence_count: int,
+    ) -> tuple[bool, str]:
+        """
+        Code-enforced synthesis decision.
+
+        Returns (should_synthesize, reason).
+        """
+        combined_score = (
+            assessment.details.mechanism_score + assessment.details.clinical_evidence_score
+        )
+        has_drug_candidates = len(assessment.details.drug_candidates) > 0
+        confidence = assessment.confidence
+
+        # Priority 1: LLM explicitly says sufficient with good scores
+        if assessment.sufficient and assessment.recommendation == "synthesize":
+            if combined_score >= 10:
+                return True, "judge_approved"
+
+        # Priority 2: High scores with drug candidates
+        if (
+            combined_score >= self.TERMINATION_CRITERIA["min_combined_score"]
+            and has_drug_candidates
+        ):
+            return True, "high_scores_with_candidates"
+
+        # Priority 3: Good scores with high evidence volume
+        if (
+            combined_score >= self.TERMINATION_CRITERIA["min_score_with_volume"]
+            and evidence_count >= self.TERMINATION_CRITERIA["min_evidence_for_volume"]
+        ):
+            return True, "good_scores_high_volume"
+
+        # Priority 4: Late iteration with acceptable scores (diminishing returns)
+        is_late_iteration = iteration >= max_iterations - 2
+        if (
+            is_late_iteration
+            and combined_score >= self.TERMINATION_CRITERIA["late_iteration_threshold"]
+        ):
+            return True, "late_iteration_acceptable"
+
+        # Priority 5: Very high evidence count (enough to synthesize something)
+        if evidence_count >= self.TERMINATION_CRITERIA["max_evidence_threshold"]:
+            return True, "max_evidence_reached"
+
+        # Priority 6: Emergency synthesis (avoid garbage output)
+        if (
+            is_late_iteration
+            and evidence_count >= self.TERMINATION_CRITERIA["min_evidence_for_emergency"]
+            and confidence >= self.TERMINATION_CRITERIA["min_confidence"]
+        ):
+            return True, "emergency_synthesis"
+
+        return False, "continue_searching"
+
     async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:  # noqa: PLR0915
         """
         Run the agent loop for a query.

@@ -252,7 +324,9 @@ class Orchestrator:
         )

         try:
-            assessment = await self.judge.assess(query, all_evidence)
+            assessment = await self.judge.assess(
+                query, all_evidence, iteration, self.config.max_iterations
+            )

             yield AgentEvent(
                 type="judge_complete",

@@ -279,15 +353,37 @@ class Orchestrator:
                 }
             )

-            # === DECISION PHASE ===
-            ...
+            # === DECISION PHASE (Code-Enforced) ===
+            should_synth, reason = self._should_synthesize(
+                assessment=assessment,
+                iteration=iteration,
+                max_iterations=self.config.max_iterations,
+                evidence_count=len(all_evidence),
+            )
+
+            logger.info(
+                "Synthesis decision",
+                should_synthesize=should_synth,
+                reason=reason,
+                iteration=iteration,
+                combined_score=assessment.details.mechanism_score
+                + assessment.details.clinical_evidence_score,
+                evidence_count=len(all_evidence),
+                confidence=assessment.confidence,
+            )
+
+            if should_synth:
+                # Log synthesis trigger reason for debugging
+                if reason != "judge_approved":
+                    logger.info(f"Code-enforced synthesis triggered: {reason}")
+
                 # Optional Analysis Phase
                 async for event in self._run_analysis_phase(query, all_evidence, iteration):
                     yield event

                 yield AgentEvent(
                     type="synthesizing",
-                    message="Evidence sufficient! Preparing synthesis...",
+                    message=f"Evidence sufficient ({reason})! Preparing synthesis...",
                     iteration=iteration,
                 )

@@ -300,6 +396,7 @@ class Orchestrator:
                     data={
                         "evidence_count": len(all_evidence),
                         "iterations": iteration,
+                        "synthesis_reason": reason,
                         "drug_candidates": assessment.details.drug_candidates,
                         "key_findings": assessment.details.key_findings,
                     },

@@ -317,10 +414,11 @@ class Orchestrator:
                 yield AgentEvent(
                     type="looping",
                     message=(
-                        f"..."
-                        f"..."
+                        f"Gathering more evidence (scores: {assessment.details.mechanism_score}"
+                        f"+{assessment.details.clinical_evidence_score}). "
+                        f"Next: {', '.join(current_queries[:2])}..."
                     ),
-                    data={"next_queries": current_queries},
+                    data={"next_queries": current_queries, "reason": reason},
                     iteration=iteration,
                 )

@@ -410,36 +508,93 @@ class Orchestrator:
         evidence: list[Evidence],
     ) -> str:
         """
-        Generate a ...
-
-        ...
-
-        Returns:
-            Formatted partial synthesis as markdown
+        Generate a REAL synthesis when max iterations reached.
+
+        Even when forced to stop, we should provide:
+        - Drug candidates (if any were found)
+        - Key findings
+        - Assessment scores
+        - Actionable citations
+
+        This is still better than a citation dump.
         """
+        # Extract data from last assessment if available
+        last_assessment = self.history[-1]["assessment"] if self.history else {}
+        details = last_assessment.get("details", {})
+
+        drug_candidates = details.get("drug_candidates", [])
+        key_findings = details.get("key_findings", [])
+        mechanism_score = details.get("mechanism_score", 0)
+        clinical_score = details.get("clinical_evidence_score", 0)
+        reasoning = last_assessment.get("reasoning", "Analysis incomplete due to iteration limit.")
+
+        # Format drug candidates
+        if drug_candidates:
+            drug_list = "\n".join([f"- **{d}**" for d in drug_candidates[:5]])
+        else:
+            drug_list = (
+                "- *No specific drug candidates identified in evidence*\n"
+                "- *Try a more specific query or add an API key for better analysis*"
+            )
+
+        # Format key findings
+        if key_findings:
+            findings_list = "\n".join([f"- {f}" for f in key_findings[:5]])
+        else:
+            findings_list = (
+                "- *Key findings require further analysis*\n"
+                "- *See citations below for relevant sources*"
+            )
+
+        # Format citations (top 10)
         citations = "\n".join(
             [
-                f"{i + 1}. [{e.citation.title}]({e.citation.url})..."
+                f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
+                f"({e.citation.source.upper()}, {e.citation.date})"
                 for i, e in enumerate(evidence[:10])
             ]
         )

+        combined_score = mechanism_score + clinical_score
+        mech_strength = (
+            "Strong" if mechanism_score >= 7 else "Moderate" if mechanism_score >= 4 else "Limited"
+        )
+        clin_strength = (
+            "Strong" if clinical_score >= 7 else "Moderate" if clinical_score >= 4 else "Limited"
+        )
+        comb_strength = "Sufficient" if combined_score >= 12 else "Partial"

-        return f"""...
+        return f"""## Drug Repurposing Analysis
+
+### Research Question
 {query}

 ### Status
-...
+Analysis based on {len(evidence)} sources across {len(self.history)} iterations.
+Maximum iterations reached - results may be incomplete.
+
+### Drug Candidates Identified
+{drug_list}
+
+### Key Findings
+{findings_list}
+
+### Evidence Quality Scores
+| Criterion | Score | Interpretation |
+|-----------|-------|----------------|
+| Mechanism | {mechanism_score}/10 | {mech_strength} mechanistic evidence |
+| Clinical | {clinical_score}/10 | {clin_strength} clinical support |
+| Combined | {combined_score}/20 | {comb_strength} for synthesis |

-### ...
+### Analysis Summary
+{reasoning}

-### Citations
+### Top Citations ({len(evidence)} sources total)
 {citations}

 ---
-*...
+*For more complete analysis:*
+- *Add an OpenAI or Anthropic API key for enhanced LLM analysis*
+- *Try a more specific query (e.g., include drug names)*
+- *Use Advanced mode for multi-agent research*
 """
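The priority ladder in `_should_synthesize` is plain threshold arithmetic, so individual rungs are easy to check in isolation. `decide` below is an illustrative helper covering only Priorities 2 and 4 with the same threshold values; it is not the actual method, which also consults the judge's recommendation, evidence volume, and confidence.

```python
# Threshold values copied from TERMINATION_CRITERIA above.
CRITERIA = {"min_combined_score": 12.0, "late_iteration_threshold": 8.0}


def decide(combined: float, has_candidates: bool, iteration: int, max_iter: int):
    # Priority 2: high combined score plus at least one drug candidate
    if combined >= CRITERIA["min_combined_score"] and has_candidates:
        return True, "high_scores_with_candidates"
    # Priority 4: within the last two iterations, accept a lower score
    if iteration >= max_iter - 2 and combined >= CRITERIA["late_iteration_threshold"]:
        return True, "late_iteration_acceptable"
    return False, "continue_searching"


# decide(13.0, True, 1, 10)  -> (True, "high_scores_with_candidates")
# decide(9.0, False, 9, 10)  -> (True, "late_iteration_acceptable")
# decide(5.0, False, 1, 10)  -> (False, "continue_searching")
```

Keeping the thresholds in a dict rather than inline literals is what makes this kind of table-driven unit test (see `tests/unit/orchestrators/test_termination.py`) cheap to write.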
src/prompts/judge.py
CHANGED

@@ -4,10 +4,16 @@ from src.utils.models import Evidence

 SYSTEM_PROMPT = """You are an expert drug repurposing research judge.

-Your task is to ...
-...
+Your task is to SCORE evidence from biomedical literature. You do NOT decide whether to
+continue searching or synthesize - that decision is made by the orchestration system
+based on your scores.

-## ...
+## Your Role: Scoring Only
+
+You provide objective scores. The system decides next steps based on explicit thresholds.
+This separation prevents bias in the decision-making process.
+
+## Scoring Criteria

 1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
    - 0-3: No clear mechanism, speculative

@@ -19,59 +25,123 @@ recommend drug candidates for a given condition.
    - 4-6: Preclinical or early clinical data
    - 7-10: Strong clinical evidence (trials, meta-analyses)

-3. **...
-   - ...
-   - ...
-   - ...
-
-## ...
-...
+3. **Drug Candidates**: List SPECIFIC drug names mentioned in the evidence
+   - Only include drugs explicitly mentioned
+   - Do NOT hallucinate or infer drug names
+   - Include drug class if specific names aren't available (e.g., "SSRI antidepressants")
+
+4. **Key Findings**: Extract 3-5 key findings from the evidence
+   - Focus on findings relevant to the research question
+   - Include mechanism insights and clinical outcomes
+
+5. **Confidence (0.0-1.0)**: Your confidence in the scores
+   - Based on evidence quality and relevance
+   - Lower if evidence is tangential or low-quality
+
+## Output Format
+
+Return valid JSON with these fields:
+- details.mechanism_score (int 0-10)
+- details.mechanism_reasoning (string)
+- details.clinical_evidence_score (int 0-10)
+- details.clinical_reasoning (string)
+- details.drug_candidates (list of strings)
+- details.key_findings (list of strings)
+- sufficient (boolean) - TRUE if scores suggest enough evidence
+- confidence (float 0-1)
+- recommendation ("continue" or "synthesize") - Your suggestion (system may override)
+- next_search_queries (list) - If continuing, suggest FOCUSED queries
+- reasoning (string)
+
+## CRITICAL: Search Query Rules
+
+When suggesting next_search_queries:
+- STAY FOCUSED on the original research question
+- Do NOT drift to tangential topics
+- If question is about "female libido", do NOT suggest "bone health" or "muscle mass"
+- Refine existing terms, don't explore random medical associations
 """

+MAX_EVIDENCE_FOR_JUDGE = 30  # Keep under token limits
+
+
+async def select_evidence_for_judge(
+    evidence: list[Evidence],
+    query: str,
+    max_items: int = MAX_EVIDENCE_FOR_JUDGE,
+) -> list[Evidence]:
+    """
+    Select diverse, relevant evidence for judge evaluation.
+
+    Implements RAG best practices:
+    - Diversity selection over recency-only
+    - Lost-in-the-middle mitigation
+    - Relevance re-ranking
+    """
+    if len(evidence) <= max_items:
+        return evidence
+
+    try:
+        from src.utils.text_utils import select_diverse_evidence
+
+        # Use embedding-based diversity selection
+        return await select_diverse_evidence(evidence, n=max_items, query=query)
+    except ImportError:
+        # Fallback: mix of recent + early (lost-in-the-middle mitigation)
+        early = evidence[: max_items // 3]  # First third
+        recent = evidence[-(max_items * 2 // 3) :]  # Last two-thirds
+        return early + recent
+
+
-def format_user_prompt(question: str, evidence: list[Evidence]) -> str:
+def format_user_prompt(
+    question: str,
+    evidence: list[Evidence],
+    iteration: int = 0,
+    max_iterations: int = 10,
+    total_evidence_count: int | None = None,
+) -> str:
     """
-    ...
+    Format user prompt with selected evidence and iteration context.

-    ...
-    ...
+    NOTE: Evidence should be pre-selected using select_evidence_for_judge().
+    This function assumes evidence is already capped.
     """
+    total_count = total_evidence_count or len(evidence)
     max_content_len = 1500

     def format_single_evidence(i: int, e: Evidence) -> str:
         content = e.content
         if len(content) > max_content_len:
             content = content[:max_content_len] + "..."
         return (
             f"### Evidence {i + 1}\n"
             f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
             f"**URL**: {e.citation.url}\n"
-            f"**Date**: {e.citation.date}\n"
             f"**Content**:\n{content}"
         )

     evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])

-    return f"""...
+    # Lost-in-the-middle mitigation: put critical context at START and END
+    return f"""## Research Question (IMPORTANT - stay focused on this)
 {question}

-## ...
+## Search Progress
+- **Iteration**: {iteration}/{max_iterations}
+- **Total evidence collected**: {total_count} sources
+- **Evidence shown below**: {len(evidence)} diverse sources (selected for relevance)
+
+## Available Evidence

 {evidence_text}

 ## Your Task

-...
-...
+Score this evidence for drug repurposing potential. Provide ONLY scores and extracted data.
+DO NOT decide "synthesize" vs "continue" - that decision is made by the system.
+
+## REMINDER: Original Question (stay focused)
+{question}
 """
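The `ImportError` fallback's slice arithmetic is worth a quick check: with the default cap of 30, a 50-item list keeps the first 10 items (`max_items // 3`) and the last 20 (`max_items * 2 // 3`), so both early and recent evidence survive. `fallback_select` below is an illustrative copy of that branch, not the actual function, with plain lists standing in for `Evidence` objects:

```python
def fallback_select(evidence: list, max_items: int = 30) -> list:
    # Mirrors the ImportError fallback in select_evidence_for_judge:
    # keep the first third of the cap plus the last two-thirds, so the
    # oldest evidence is not entirely "lost in the middle".
    if len(evidence) <= max_items:
        return evidence
    early = evidence[: max_items // 3]          # first 10 of a 30-item cap
    recent = evidence[-(max_items * 2 // 3) :]  # last 20
    return early + recent


picked = fallback_select(list(range(50)))
# 30 items: 0..9 plus 30..49
```

Items 10..29 are dropped in the fallback path; only the embedding-based `select_diverse_evidence` path considers them.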
tests/e2e/conftest.py
CHANGED

@@ -39,7 +39,7 @@ def mock_judge_handler():
     """Return a mock judge that always says 'synthesize'."""
     mock = MagicMock()

-    async def mock_assess(question, evidence):
+    async def mock_assess(question, evidence, iteration=1, max_iterations=10):
         return JudgeAssessment(
             sufficient=True,
             confidence=0.9,
tests/integration/test_simple_mode_synthesis.py
ADDED
@@ -0,0 +1,147 @@
+from unittest.mock import AsyncMock
+
+import pytest
+
+from src.orchestrators.simple import Orchestrator
+from src.utils.models import (
+    AssessmentDetails,
+    Citation,
+    Evidence,
+    JudgeAssessment,
+    OrchestratorConfig,
+    SearchResult,
+)
+
+
+def make_evidence(title: str) -> Evidence:
+    return Evidence(
+        content="content",
+        citation=Citation(title=title, url="http://test.com", date="2025", source="pubmed"),
+    )
+
+
+@pytest.mark.integration
+@pytest.mark.asyncio
+async def test_simple_mode_synthesizes_before_max_iterations():
+    """Verify simple mode produces useful output with a mocked judge."""
+    # Mock search to return evidence
+    mock_search = AsyncMock()
+    mock_search.execute.return_value = SearchResult(
+        query="test query",
+        evidence=[make_evidence(f"Paper {i}") for i in range(5)],
+        errors=[],
+        sources_searched=["pubmed"],
+        total_found=5,
+    )
+
+    # Mock judge to return GOOD scores eventually.
+    # A pure mock (rather than MockJudgeHandler) gives precise control over the scores.
+    mock_judge = AsyncMock()
+
+    # Iteration 1: low scores
+    assess_1 = JudgeAssessment(
+        details=AssessmentDetails(
+            mechanism_score=2,
+            mechanism_reasoning="reasoning is sufficient for valid model",
+            clinical_evidence_score=2,
+            clinical_reasoning="reasoning is sufficient for valid model",
+            drug_candidates=[],
+            key_findings=[],
+        ),
+        sufficient=False,
+        confidence=0.5,
+        recommendation="continue",
+        next_search_queries=["q2"],
+        reasoning="need more evidence to support conclusions about this topic",
+    )
+
+    # Iteration 2: high scores (should trigger synthesis)
+    assess_2 = JudgeAssessment(
+        details=AssessmentDetails(
+            mechanism_score=8,
+            mechanism_reasoning="reasoning is sufficient for valid model",
+            clinical_evidence_score=7,
+            clinical_reasoning="reasoning is sufficient for valid model",
+            drug_candidates=["MagicDrug"],
+            key_findings=["It works"],
+        ),
+        sufficient=False,  # Judge is conservative
+        confidence=0.9,
+        recommendation="continue",  # Judge still says continue (simulating bias)
+        next_search_queries=[],
+        reasoning="good scores but maybe more evidence needed technically",
+    )
+
+    mock_judge.assess.side_effect = [assess_1, assess_2]
+
+    orchestrator = Orchestrator(
+        search_handler=mock_search,
+        judge_handler=mock_judge,
+        config=OrchestratorConfig(max_iterations=5),
+    )
+
+    events = []
+    async for event in orchestrator.run("test query"):
+        events.append(event)
+        if event.type == "complete":
+            break
+
+    # Must have synthesis with drug candidates
+    complete_events = [e for e in events if e.type == "complete"]
+    assert len(complete_events) == 1
+    complete_event = complete_events[0]
+
+    assert "MagicDrug" in complete_event.message
+    assert "Drug Candidates" in complete_event.message
+    assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
+    assert complete_event.iteration == 2  # Should stop at iteration 2
+
+
+@pytest.mark.integration
+@pytest.mark.asyncio
+async def test_partial_synthesis_generation():
+    """Verify partial synthesis includes drug candidates even if max iterations reached."""
+    mock_search = AsyncMock()
+    mock_search.execute.return_value = SearchResult(
+        query="test", evidence=[], errors=[], sources_searched=["pubmed"], total_found=0
+    )
+
+    mock_judge = AsyncMock()
+    # Always return low scores but WITH candidates.
+    # Scores 3 + 3 = 6 < 8 (late threshold), so it should NOT synthesize early.
+    mock_judge.assess.return_value = JudgeAssessment(
+        details=AssessmentDetails(
+            mechanism_score=3,
+            mechanism_reasoning="reasoning is sufficient for valid model",
+            clinical_evidence_score=3,
+            clinical_reasoning="reasoning is sufficient for valid model",
+            drug_candidates=["PartialDrug"],
+            key_findings=["Partial finding"],
+        ),
+        sufficient=False,
+        confidence=0.5,
+        recommendation="continue",
+        next_search_queries=[],
+        reasoning="keep going to find more evidence about this topic please",
+    )
+
+    orchestrator = Orchestrator(
+        search_handler=mock_search,
+        judge_handler=mock_judge,
+        config=OrchestratorConfig(max_iterations=2),
+    )
+
+    events = []
+    async for event in orchestrator.run("test"):
+        events.append(event)
+
+    complete_events = [e for e in events if e.type == "complete"]
+    assert (
+        len(complete_events) == 1
+    ), f"Expected exactly one complete event, got {len(complete_events)}"
+    complete_event = complete_events[0]
+    assert complete_event.data.get("max_reached") is True
+
+    # The output message should contain the drug candidate from the last assessment
+    assert "PartialDrug" in complete_event.message
+    assert "Maximum iterations reached" in complete_event.message
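The first test above drives two iterations from a single mock via `side_effect`. As a quick reminder of the `unittest.mock` behavior it relies on, a list-valued `side_effect` on an `AsyncMock` yields one item per call and raises `StopIteration` once exhausted:

```python
import asyncio
from unittest.mock import AsyncMock

mock_judge = AsyncMock()
# Each await consumes the next item in order; a third call would raise StopIteration.
mock_judge.assess.side_effect = ["first assessment", "second assessment"]


async def main() -> tuple[str, str]:
    a = await mock_judge.assess("question", [])
    b = await mock_judge.assess("question", [])
    return a, b


results = asyncio.run(main())
print(results)  # → ('first assessment', 'second assessment')
```

This is why `mock_judge.assess.side_effect = [assess_1, assess_2]` lets the test script a low-score iteration followed by a high-score one.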
tests/unit/orchestrators/test_termination.py
ADDED
@@ -0,0 +1,104 @@
+from typing import Literal
+from unittest.mock import MagicMock
+
+import pytest
+
+from src.orchestrators.simple import Orchestrator
+from src.utils.models import AssessmentDetails, JudgeAssessment
+
+
+def make_assessment(
+    mechanism: int,
+    clinical: int,
+    drug_candidates: list[str],
+    sufficient: bool = False,
+    recommendation: Literal["continue", "synthesize"] = "continue",
+    confidence: float = 0.8,
+) -> JudgeAssessment:
+    return JudgeAssessment(
+        details=AssessmentDetails(
+            mechanism_score=mechanism,
+            mechanism_reasoning="reasoning is sufficient for testing purposes",
+            clinical_evidence_score=clinical,
+            clinical_reasoning="reasoning is sufficient for testing purposes",
+            drug_candidates=drug_candidates,
+            key_findings=["finding"],
+        ),
+        sufficient=sufficient,
+        confidence=confidence,
+        recommendation=recommendation,
+        next_search_queries=[],
+        reasoning="reasoning is sufficient for testing purposes",
+    )
+
+
+@pytest.fixture
+def orchestrator():
+    search = MagicMock()
+    judge = MagicMock()
+    return Orchestrator(search, judge)
+
+
+@pytest.mark.unit
+def test_should_synthesize_high_scores(orchestrator):
+    """High scores with drug candidates trigger synthesis."""
+    assessment = make_assessment(mechanism=7, clinical=6, drug_candidates=["Metformin"])
+
+    # _should_synthesize is private; the tests call it directly.
+    should_synth, reason = orchestrator._should_synthesize(
+        assessment, iteration=3, max_iterations=10, evidence_count=50
+    )
+
+    assert should_synth is True
+    assert reason == "high_scores_with_candidates"
+
+
+@pytest.mark.unit
+def test_should_synthesize_late_iteration(orchestrator):
+    """Late iteration with acceptable scores triggers synthesis."""
+    assessment = make_assessment(mechanism=5, clinical=4, drug_candidates=[])
+    should_synth, reason = orchestrator._should_synthesize(
+        assessment, iteration=9, max_iterations=10, evidence_count=80
+    )
+
+    assert should_synth is True
+    assert reason in ["late_iteration_acceptable", "emergency_synthesis"]
+
+
+@pytest.mark.unit
+def test_should_not_synthesize_early_low_scores(orchestrator):
+    """Early iteration with low scores continues searching."""
+    assessment = make_assessment(mechanism=3, clinical=2, drug_candidates=[])
+    should_synth, reason = orchestrator._should_synthesize(
+        assessment, iteration=2, max_iterations=10, evidence_count=20
+    )
+
+    assert should_synth is False
+    assert reason == "continue_searching"
+
+
+@pytest.mark.unit
+def test_judge_approved_overrides_all(orchestrator):
+    """If the judge explicitly says synthesize with good scores, do it."""
+    assessment = make_assessment(
+        mechanism=6, clinical=5, drug_candidates=[], sufficient=True, recommendation="synthesize"
+    )
+    should_synth, reason = orchestrator._should_synthesize(
+        assessment, iteration=2, max_iterations=10, evidence_count=20
+    )
+
+    assert should_synth is True
+    assert reason == "judge_approved"
+
+
+@pytest.mark.unit
+def test_max_evidence_threshold(orchestrator):
+    """Force synthesis once plenty of evidence has accumulated."""
+    assessment = make_assessment(mechanism=2, clinical=2, drug_candidates=[])
+    should_synth, reason = orchestrator._should_synthesize(
+        assessment, iteration=5, max_iterations=10, evidence_count=150
+    )
+
+    assert should_synth is True
+    assert reason == "max_evidence_reached"
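These tests pin down `_should_synthesize` only by its outcomes. A minimal sketch consistent with all five cases (the threshold values 12, 8, and 100 are guesses for illustration, not the real implementation in `src/orchestrators/simple.py`) could look like:

```python
from dataclasses import dataclass, field


@dataclass
class Details:  # stand-in for AssessmentDetails
    mechanism_score: int
    clinical_evidence_score: int
    drug_candidates: list[str] = field(default_factory=list)


@dataclass
class Assessment:  # stand-in for JudgeAssessment
    details: Details
    sufficient: bool = False
    recommendation: str = "continue"


def should_synthesize(
    a: Assessment, iteration: int, max_iterations: int, evidence_count: int
) -> tuple[bool, str]:
    """Termination heuristic; all thresholds are assumptions chosen to satisfy the tests."""
    combined = a.details.mechanism_score + a.details.clinical_evidence_score
    if a.sufficient and a.recommendation == "synthesize":
        return True, "judge_approved"  # trust an explicit judge sign-off
    if evidence_count >= 100:
        return True, "max_evidence_reached"  # context budget exhausted
    if combined >= 12 and a.details.drug_candidates:
        return True, "high_scores_with_candidates"
    if iteration >= max_iterations - 1 and combined >= 8:
        return True, "late_iteration_acceptable"
    return False, "continue_searching"
```

Layering the checks from strongest signal (explicit judge approval) to weakest (late-iteration fallback) keeps the first matching reason stable even when several conditions hold at once.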
tests/unit/prompts/test_judge_prompt.py
ADDED
@@ -0,0 +1,61 @@
+from unittest.mock import patch
+
+import pytest
+
+from src.prompts.judge import format_user_prompt, select_evidence_for_judge
+from src.utils.models import Citation, Evidence
+
+
+def make_evidence(title: str, content: str = "content") -> Evidence:
+    return Evidence(
+        content=content,
+        citation=Citation(title=title, url="http://test.com", date="2025", source="pubmed"),
+    )
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+async def test_evidence_selection_diverse():
+    """Verify evidence selection includes early and recent items (fallback logic)."""
+    # Create enough evidence to trigger selection
+    evidence = [make_evidence(f"Paper {i}") for i in range(100)]
+
+    # Mock select_diverse_evidence to raise ImportError to trigger fallback logic
+    with patch("src.utils.text_utils.select_diverse_evidence", side_effect=ImportError):
+        selected = await select_evidence_for_judge(evidence, "test query", max_items=30)
+
+    assert len(selected) == 30
+
+    # Should include some early evidence (lost-in-the-middle mitigation)
+    titles = [e.citation.title for e in selected]
+
+    # Check for start (Paper 0..9) - using set membership for clarity
+    early_papers = {f"Paper {i}" for i in range(10)}
+    has_early = any(title in early_papers for title in titles)
+    # Check for end (Paper 90..99)
+    late_papers = {f"Paper {i}" for i in range(90, 100)}
+    has_late = any(title in late_papers for title in titles)
+
+    assert has_early, "Should include early evidence"
+    assert has_late, "Should include recent evidence"
+
+
+@pytest.mark.unit
+def test_prompt_includes_question_at_edges():
+    """Verify lost-in-the-middle mitigation in prompt formatting."""
+    evidence = [make_evidence("Test Paper")]
+    question = "CRITICAL RESEARCH QUESTION"
+
+    prompt = format_user_prompt(question, evidence, iteration=5, max_iterations=10)
+
+    # Question should appear at START and END of prompt
+    lines = prompt.split("\n")
+
+    # Check start (first few lines)
+    start_content = "\n".join(lines[:10])
+    assert question in start_content
+
+    # Check end (last few lines)
+    end_content = "\n".join(lines[-10:])
+    assert question in end_content
+    assert "REMINDER: Original Question" in end_content
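The fallback exercised by the first test is pinned down only behaviorally: cap at `max_items` while keeping some head and some tail items. One plausible positional sketch (the head/middle/tail split is an assumption, not the real `select_evidence_for_judge` fallback):

```python
def positional_fallback(items: list, max_items: int = 30) -> list:
    """Keep the start and end of the list and thin the middle
    (lost-in-the-middle mitigation without any relevance scoring)."""
    if len(items) <= max_items:
        return list(items)
    head = tail = max_items // 3          # always keep the first and last thirds
    middle_budget = max_items - head - tail
    middle = items[head:-tail]
    step = max(1, len(middle) // middle_budget)
    sampled = middle[::step][:middle_budget]  # evenly thin the middle
    return items[:head] + sampled + items[-tail:]


selected = positional_fallback(list(range(100)))
print(len(selected), selected[0], selected[-1])  # → 30 0 99
```

With 100 inputs and `max_items=30` this keeps items 0-9 and 90-99 plus ten evenly spaced middle items, which is exactly the shape the `has_early`/`has_late` assertions check for.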