VibecoderMcSwaggins committed on
Commit 912218f · 2 Parent(s): d2ad6f5 3d25956

Merge branch 'main' into dev

docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -4,29 +4,40 @@
4
 
5
  ## P0 - Blocker
6
 
7
- ### P0 - Simple Mode Never Synthesizes
8
- **File:** `P0_SIMPLE_MODE_NEVER_SYNTHESIZES.md`
9
 
10
- **Symptom:** Simple mode finds 455 sources but outputs only citations (no synthesis).
11
 
12
- **Root Causes:**
13
- 1. Judge never recommends "synthesize" (prompt too conservative)
14
- 2. Confidence drops to 0% in late iterations (context overflow / API failure)
15
- 3. Search derails to tangential topics (bone health instead of libido)
16
- 4. `_generate_partial_synthesis()` outputs garbage (just citations, no analysis)
17
 
18
- **Status:** Documented, fix plan ready.
 
 
19
 
20
- ---
 
 
21
 
22
- ## P3 - Edge Case
 
23
 
24
- *(None)*
 
 
25
 
26
  ---
27
 
28
  ## Resolved Bugs
29
 
30
  ### ~~P3 - Magentic Mode Missing Termination Guarantee~~ FIXED
31
  **Commit**: `d36ce3c` (2025-11-29)
32
 
 
4
 
5
  ## P0 - Blocker
6
 
7
+ *(None - P0 bugs resolved)*
 
8
 
9
+ ---
10
 
11
+ ## P3 - Architecture/Enhancement
 
 
 
 
12
 
13
+ ### P3 - Missing Structured Cognitive Memory
14
+ **File:** `P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md`
15
+ **Spec:** [SPEC_07_LANGGRAPH_MEMORY_ARCH.md](../specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md)
16
 
17
+ **Problem:** AdvancedOrchestrator uses chat-based state (context drift on long runs).
18
+ **Solution:** Implement LangGraph StateGraph with explicit hypothesis/conflict tracking.
19
+ **Status:** Spec complete, implementation pending.
20
 
21
+ ### P3 - Ephemeral Memory (No Persistence)
22
+ **File:** `P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md`
23
 
24
+ **Problem:** ChromaDB uses in-memory client despite `settings.chroma_db_path` existing.
25
+ **Solution:** Switch to `PersistentClient(path=settings.chroma_db_path)`.
26
+ **Status:** Quick fix identified, not yet implemented.
27
 
28
  ---
29
 
30
  ## Resolved Bugs
31
 
32
+ ### ~~P0 - Simple Mode Never Synthesizes~~ FIXED
33
+ **PR:** [#71](https://github.com/The-Obstacle-Is-The-Way/DeepBoner/pull/71) (SPEC_06)
34
+ **Commit**: `5cac97d` (2025-11-29)
35
+
36
+ - Root cause: LLM-as-Judge recommendations were being IGNORED
37
+ - Fix: Code-enforced termination criteria (`_should_synthesize()`)
38
+ - Added combined score thresholds, late-iteration logic, emergency fallback
39
+ - Simple mode now synthesizes instead of spinning forever
40
+
41
  ### ~~P3 - Magentic Mode Missing Termination Guarantee~~ FIXED
42
  **Commit**: `d36ce3c` (2025-11-29)
43
 
docs/bugs/P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY.md ADDED
@@ -0,0 +1,23 @@
1
+ # P3: Ephemeral Memory Architecture (No Persistence)
2
+
3
+ **Status:** OPEN
4
+ **Priority:** P3 (Feature/Architecture Gap)
5
+ **Found By:** Codebase Investigation
6
+ **Date:** 2025-11-29
7
+
8
+ ## Description
9
+ The current `EmbeddingService` (`src/services/embeddings.py`) initializes an **in-memory** ChromaDB client (`chromadb.Client()`) and creates a random UUID-based collection for every new session.
10
+
11
+ While `src/utils/config.py` defines a `chroma_db_path` for persistence, it is currently **ignored**.
12
+
13
+ ## Impact
14
+ 1. **No Long-Term Learning:** The agent cannot "remember" research from previous runs. Every time you restart the app, it starts from zero.
15
+ 2. **Redundant Costs:** If a user researches "Diabetes" twice, the agent re-searches and re-embeds the same papers, wasting tokens and compute time.
16
+
17
+ ## Technical Details
18
+ - **Current:** `self._client = chromadb.Client()` (In-Memory)
19
+ - **Required:** `self._client = chromadb.PersistentClient(path=settings.chroma_db_path)`
20
+
21
+ ## Recommendation
22
+ For a "Hackathon Demo," this is **low priority** (ephemeral is fine).
23
+ For a "Real Product," this is **critical** (users expect a library of research).
docs/bugs/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md ADDED
@@ -0,0 +1,148 @@
1
+ # P3: Missing Structured Cognitive Memory (Shared Blackboard)
2
+
3
+ **Status:** OPEN
4
+ **Priority:** P3 (Architecture/Enhancement)
5
+ **Found By:** Deep Codebase Investigation
6
+ **Date:** 2025-11-29
7
+ **Spec:** [SPEC_07_LANGGRAPH_MEMORY_ARCH.md](../specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md)
8
+
9
+ ## Executive Summary
10
+
11
+ DeepBoner's `AdvancedOrchestrator` has **Data Memory** (vector store for papers) but lacks **Cognitive Memory** (structured state for hypotheses, conflicts, and research plan). This causes "context drift" on long runs and prevents intelligent conflict resolution.
12
+
13
+ ---
14
+
15
+ ## Current Architecture (What We Have)
16
+
17
+ ### 1. MagenticState (`src/agents/state.py:18-91`)
18
+ ```python
19
+ class MagenticState(BaseModel):
20
+ evidence: list[Evidence] = Field(default_factory=list)
21
+ embedding_service: Any = None # ChromaDB connection
22
+
23
+ def add_evidence(self, new_evidence: list[Evidence]) -> int: ...
24
+ async def search_related(self, query: str, n_results: int = 5) -> list[Evidence]: ...
25
+ ```
26
+ - **What it does:** Stores Evidence objects, URL-based deduplication, semantic search via embeddings.
27
+ - **What it DOESN'T do:** Track hypotheses, conflicts, or research plan status.
28
+
29
+ ### 2. EmbeddingService (`src/services/embeddings.py:29-180`)
30
+ ```python
31
+ self._client = chromadb.Client() # In-memory (Line 44)
32
+ self._collection = self._client.create_collection(
33
+ name=f"evidence_{uuid.uuid4().hex}", # Random name per session (Line 45-47)
34
+ ...
35
+ )
36
+ ```
37
+ - **What it does:** In-session semantic search/deduplication.
38
+ - **Limitation:** New collection per session, no persistence despite `settings.chroma_db_path` existing.
39
+
40
+ ### 3. AdvancedOrchestrator (`src/orchestrators/advanced.py:51-371`)
41
+ - Uses Microsoft's `agent-framework-core` (MagenticBuilder)
42
+ - State is implicit in chat history passed between agents
43
+ - Manager decides next step by reading conversation, not structured state
44
+
45
+ ---
46
+
47
+ ## The Problem
48
+
49
+ | Issue | Impact | Evidence |
50
+ |-------|--------|----------|
51
+ | **No Hypothesis Tracking** | Can't update hypothesis confidence systematically | `MagenticState` has no `hypotheses` field |
52
+ | **No Conflict Detection** | Contradictory sources are ignored | No `conflicts` list to flag Source A vs Source B |
53
+ | **Context Drift** | Manager forgets original query after 50+ messages | State lives only in chat, not structured object |
54
+ | **No Plan State** | Can't pause/resume research | No `research_plan` or `next_step` tracking |
55
+
56
+ ---
57
+
58
+ ## The Solution: LangGraph State Graph (Nov 2025 Best Practice)
59
+
60
+ ### Why LangGraph?
61
+
62
+ Based on [comprehensive analysis](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025):
63
+
64
+ 1. **Explicit State Schema:** TypedDict/Pydantic model that ALL agents read/write
65
+ 2. **State Reducers:** `Annotated[List[X], operator.add]` for appending (not overwriting)
66
+ 3. **HuggingFace Compatible:** Works with `langchain-huggingface` (Llama 3.1)
67
+ 4. **Production-Ready:** MongoDB checkpointer for persistence, SQLite for dev
68
+
69
+ ### Target Architecture
70
+
71
+ ```python
72
+ # src/agents/graph/state.py (PROPOSED)
73
+ from typing import Annotated, TypedDict, Literal
74
+ import operator
75
+
76
+ class Hypothesis(TypedDict):
77
+ id: str
78
+ statement: str
79
+ status: Literal["proposed", "validating", "confirmed", "refuted"]
80
+ confidence: float
81
+ supporting_evidence_ids: list[str]
82
+ contradicting_evidence_ids: list[str]
83
+
84
+ class Conflict(TypedDict):
85
+ id: str
86
+ description: str
87
+ source_a_id: str
88
+ source_b_id: str
89
+ status: Literal["open", "resolved"]
90
+ resolution: str | None
91
+
92
+ class ResearchState(TypedDict):
93
+ query: str # Immutable original question
94
+ hypotheses: Annotated[list[Hypothesis], operator.add]
95
+ conflicts: Annotated[list[Conflict], operator.add]
96
+ evidence_ids: Annotated[list[str], operator.add] # Links to ChromaDB
97
+ messages: Annotated[list[BaseMessage], operator.add]
98
+ next_step: Literal["search", "judge", "resolve", "synthesize", "finish"]
99
+ iteration_count: int
100
+ ```
101
+
102
+ ---
103
+
104
+ ## Implementation Dependencies
105
+
106
+ | Package | Purpose | Install |
107
+ |---------|---------|---------|
108
+ | `langgraph>=0.2` | State graph framework | `uv add langgraph` |
109
+ | `langchain>=0.3` | Base abstractions | `uv add langchain` |
110
+ | `langchain-huggingface` | Llama 3.1 integration | `uv add langchain-huggingface` |
111
+ | `langgraph-checkpoint-sqlite` | Dev persistence | `uv add langgraph-checkpoint-sqlite` |
112
+
113
+ **Note:** MongoDB checkpointer (`langgraph-checkpoint-mongodb`) recommended for production per [MongoDB blog](https://www.mongodb.com/company/blog/product-release-announcements/powering-long-term-memory-for-agents-langgraph).
114
+
115
+ ---
116
+
117
+ ## Alternative Considered: Mem0
118
+
119
+ [Mem0](https://mem0.ai/) specializes in long-term memory and [outperformed OpenAI by 26%](https://guptadeepak.com/the-ai-memory-wars-why-one-system-crushed-the-competition-and-its-not-openai/) in benchmarks. However:
120
+
121
+ - **Mem0 excels at:** User personalization, cross-session memory
122
+ - **LangGraph excels at:** Workflow orchestration, state machines
123
+ - **Verdict:** Use LangGraph for orchestration + optionally add Mem0 for user-level memory later
124
+
125
+ ---
126
+
127
+ ## Quick Win (Separate from LangGraph)
128
+
129
+ Enable ChromaDB persistence in `src/services/embeddings.py:44`:
130
+ ```python
131
+ # FROM:
132
+ self._client = chromadb.Client() # In-memory
133
+
134
+ # TO:
135
+ self._client = chromadb.PersistentClient(path=settings.chroma_db_path)
136
+ ```
137
+
138
+ This alone gives cross-session evidence persistence (P3_ARCHITECTURAL_GAP_EPHEMERAL_MEMORY fix).
139
+
140
+ ---
141
+
142
+ ## References
143
+
144
+ - [LangGraph Multi-Agent Orchestration Guide 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
145
+ - [Long-Term Agentic Memory with LangGraph](https://medium.com/@anil.jain.baba/long-term-agentic-memory-with-langgraph-824050b09852)
146
+ - [LangGraph vs LangChain 2025](https://kanerika.com/blogs/langchain-vs-langgraph/)
147
+ - [MongoDB + LangGraph Checkpointers](https://www.mongodb.com/company/blog/product-release-announcements/powering-long-term-memory-for-agents-langgraph)
148
+ - [Mem0 + LangGraph Integration](https://datacouch.io/blog/build-smarter-ai-agents-mem0-langgraph-guide/)
docs/specs/SPEC_07_LANGGRAPH_MEMORY_ARCH.md ADDED
@@ -0,0 +1,492 @@
1
+ # SPEC-07: Structured Cognitive Memory Architecture (LangGraph)
2
+
3
+ **Status:** APPROVED
4
+ **Priority:** HIGH (Strategic)
5
+ **Author:** DeepBoner Architecture Team
6
+ **Date:** 2025-11-29
7
+ **Last Updated:** 2025-11-29
8
+ **Related Bugs:** [P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY](../bugs/P3_ARCHITECTURAL_GAP_STRUCTURED_MEMORY.md)
9
+
10
+ ---
11
+
12
+ ## 1. Executive Summary
13
+
14
+ Upgrade DeepBoner's "Advanced Mode" from chat-based coordination to a **State-Driven Cognitive Architecture** using LangGraph. This enables:
15
+ - Explicit hypothesis tracking with confidence scores
16
+ - Automatic conflict detection and resolution
17
+ - Persistent research state (pause/resume)
18
+ - Context-aware decision making over long runs
19
+
20
+ ---
21
+
22
+ ## 2. Problem Statement
23
+
24
+ ### Current Architecture Limitations
25
+
26
+ The `AdvancedOrchestrator` (`src/orchestrators/advanced.py`) uses Microsoft's `agent-framework-core` with chat-based coordination:
27
+
28
+ ```python
29
+ # Current: State is IMPLICIT (chat history)
30
+ workflow = MagenticBuilder()
31
+ .participants(searcher=..., judge=..., ...)
32
+ .with_standard_manager(chat_client=..., max_round_count=10)
33
+ .build()
34
+ ```
35
+
36
+ | Problem | Root Cause | File Location |
37
+ |---------|------------|---------------|
38
+ | Context Drift | State lives only in chat messages | `advanced.py:126-132` |
39
+ | Conflict Blindness | No structured conflict tracking | `state.py` (no `conflicts` field) |
40
+ | No Hypothesis Management | `MagenticState` only tracks `evidence` | `state.py:21` |
41
+ | Can't Pause/Resume | No checkpointing mechanism | N/A |
42
+
43
+ ### Evidence from Codebase
44
+
45
+ **MagenticState (src/agents/state.py:18-26):**
46
+ ```python
47
+ class MagenticState(BaseModel):
48
+ evidence: list[Evidence] = Field(default_factory=list)
49
+ embedding_service: Any = None # Just data, no cognitive state
50
+ ```
51
+
52
+ **EmbeddingService (src/services/embeddings.py:44-47):**
53
+ ```python
54
+ self._client = chromadb.Client() # In-memory only
55
+ self._collection = self._client.create_collection(
56
+ name=f"evidence_{uuid.uuid4().hex}", # Random name = ephemeral
57
+ ...
58
+ )
59
+ ```
60
+
61
+ ---
62
+
63
+ ## 3. Solution: LangGraph State Graph
64
+
65
+ ### Why LangGraph? (November 2025 Analysis)
66
+
67
+ Based on [comprehensive framework comparison](https://kanerika.com/blogs/langchain-vs-langgraph/):
68
+
69
+ | Feature | `agent-framework-core` (Current) | LangGraph (Proposed) |
70
+ |---------|----------------------------------|----------------------|
71
+ | State Management | Implicit (chat) | Explicit (TypedDict) |
72
+ | Loops/Branches | Limited | Native support |
73
+ | Checkpointing | None | SQLite/MongoDB |
74
+ | HuggingFace | Requires OpenAI format | Native `langchain-huggingface` |
75
+
76
+ ### Architecture Overview
77
+
78
+ ```
79
+ ┌─────────────────────────────────────────────────────────────────┐
80
+ │ ResearchState │
81
+ │ ┌─────────────┬──────────────┬───────────────┬──────────────┐ │
82
+ │ │ query │ hypotheses │ conflicts │ next_step │ │
83
+ │ │ (string) │ (list) │ (list) │ (enum) │ │
84
+ │ └─────────────┴──────────────┴───────────────┴──────────────┘ │
85
+ └─────────────────────────────────────────────────────────────────┘
86
+
87
+
88
+ ┌─────────────────────────────────────────────────────────────────┐
89
+ │ StateGraph │
90
+ │ │
91
+ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
92
+ │ │ SEARCH │────▶│ JUDGE │────▶│ RESOLVE │ │
93
+ │ │ Node │ │ Node │ │ Node │ │
94
+ │ └──────────┘ └──────────┘ └──────────┘ │
95
+ │ ▲ │ │ │
96
+ │ │ ▼ │ │
97
+ │ │ ┌──────────┐ │ │
98
+ │ └──────────│SUPERVISOR│◀──────────┘ │
99
+ │ │ Node │ │
100
+ │ └──────────┘ │
101
+ │ │ │
102
+ │ ▼ │
103
+ │ ┌──────────┐ │
104
+ │ │SYNTHESIZE│ │
105
+ │ │ Node │ │
106
+ │ └──────────┘ │
107
+ └─────────────────────────────────────────────────────────────────┘
108
+ ```
109
+
110
+ ---
111
+
112
+ ## 4. Technical Specification
113
+
114
+ ### 4.1 State Schema
115
+
116
+ **File:** `src/agents/graph/state.py`
117
+
118
+ ```python
119
+ """Structured state for LangGraph research workflow."""
120
+ from typing import Annotated, TypedDict, Literal
121
+ import operator
122
+ from langchain_core.messages import BaseMessage
123
+
124
+
125
+ class Hypothesis(TypedDict):
126
+ """A research hypothesis with evidence tracking."""
127
+ id: str
128
+ statement: str
129
+ status: Literal["proposed", "validating", "confirmed", "refuted"]
130
+ confidence: float # 0.0 - 1.0
131
+ supporting_evidence_ids: list[str]
132
+ contradicting_evidence_ids: list[str]
133
+
134
+
135
+ class Conflict(TypedDict):
136
+ """A detected contradiction between sources."""
137
+ id: str
138
+ description: str
139
+ source_a_id: str
140
+ source_b_id: str
141
+ status: Literal["open", "resolved"]
142
+ resolution: str | None
143
+
144
+
145
+ class ResearchState(TypedDict):
146
+ """The cognitive state shared across all graph nodes.
147
+
148
+ Uses Annotated with operator.add for list fields to enable
149
+ additive updates (append) rather than replacement.
150
+ """
151
+ # Immutable context
152
+ query: str
153
+
154
+ # Cognitive state (the "blackboard")
155
+ hypotheses: Annotated[list[Hypothesis], operator.add]
156
+ conflicts: Annotated[list[Conflict], operator.add]
157
+
158
+ # Evidence links (actual content in ChromaDB)
159
+ evidence_ids: Annotated[list[str], operator.add]
160
+
161
+ # Chat history (for LLM context)
162
+ messages: Annotated[list[BaseMessage], operator.add]
163
+
164
+ # Control flow
165
+ next_step: Literal["search", "judge", "resolve", "synthesize", "finish"]
166
+ iteration_count: int
167
+ max_iterations: int
168
+ ```
169
+
170
+ ### 4.2 Graph Nodes
171
+
172
+ Each node is a pure function: `(state: ResearchState) -> dict`
173
+
174
+ **File:** `src/agents/graph/nodes.py`
175
+
176
+ ```python
177
+ """Graph node implementations."""
178
+ from langchain_core.messages import HumanMessage, AIMessage
179
+ from src.tools.pubmed import search_pubmed
180
+ from src.tools.clinicaltrials import search_clinicaltrials
181
+ from src.tools.europepmc import search_europepmc
182
+
183
+
184
+ async def search_node(state: ResearchState) -> dict:
185
+ """Execute search across all sources.
186
+
187
+ Returns partial state update (additive via operator.add).
188
+ """
189
+ query = state["query"]
190
+ # Reuse existing tools
191
+ results = await asyncio.gather(
192
+ search_pubmed(query),
193
+ search_clinicaltrials(query),
194
+ search_europepmc(query),
195
+ )
196
+ new_evidence_ids = [...] # Store in ChromaDB, return IDs
197
+ return {
198
+ "evidence_ids": new_evidence_ids,
199
+ "messages": [AIMessage(content=f"Found {len(new_evidence_ids)} papers")],
200
+ }
201
+
202
+
203
+ async def judge_node(state: ResearchState) -> dict:
204
+ """Evaluate evidence and update hypothesis confidence.
205
+
206
+ Key responsibility: Detect conflicts and flag them.
207
+ """
208
+ # LLM call to evaluate hypotheses against evidence
209
+ # If contradiction found: add to conflicts list
210
+ return {
211
+ "hypotheses": updated_hypotheses, # With new confidence scores
212
+ "conflicts": new_conflicts, # Any detected contradictions
213
+ "messages": [...],
214
+ }
215
+
216
+
217
+ async def resolve_node(state: ResearchState) -> dict:
218
+ """Handle open conflicts via tie-breaker logic.
219
+
220
+ Triggers targeted search or reasoning to resolve.
221
+ """
222
+ open_conflicts = [c for c in state["conflicts"] if c["status"] == "open"]
223
+ # For each conflict: search for decisive evidence or make judgment call
224
+ return {
225
+ "conflicts": resolved_conflicts,
226
+ "messages": [...],
227
+ }
228
+
229
+
230
+ async def synthesize_node(state: ResearchState) -> dict:
231
+ """Generate final research report.
232
+
233
+ Only uses confirmed hypotheses and resolved conflicts.
234
+ """
235
+ confirmed = [h for h in state["hypotheses"] if h["status"] == "confirmed"]
236
+ # Generate structured report
237
+ return {
238
+ "messages": [AIMessage(content=report_markdown)],
239
+ "next_step": "finish",
240
+ }
241
+
242
+
243
+ def supervisor_node(state: ResearchState) -> dict:
244
+ """Route to next node based on state.
245
+
246
+ This is the "brain" - uses LLM to decide next action
247
+ based on STRUCTURED STATE (not just chat).
248
+ """
249
+ # Decision logic:
250
+ # 1. If open conflicts exist -> "resolve"
251
+ # 2. If hypotheses need more evidence -> "search"
252
+ # 3. If evidence is sufficient -> "judge"
253
+ # 4. If all hypotheses confirmed -> "synthesize"
254
+ # 5. If max iterations -> "synthesize" (forced)
255
+ return {"next_step": decided_step, "iteration_count": state["iteration_count"] + 1}
256
+ ```
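The `supervisor_node` above only lists its decision logic as comments. Below is a minimal sketch of how those priorities could be enforced, assuming the `ResearchState` schema from section 4.1; the ordering and the evidence check are illustrative, not the final policy.

```python
# Sketch of the routing priorities described in the comments above (illustrative only).
def supervisor_node(state: ResearchState) -> dict:
    """Choose the next node from structured state instead of chat history."""
    next_iteration = state["iteration_count"] + 1

    if state["iteration_count"] >= state["max_iterations"]:
        decided_step = "synthesize"  # forced termination
    elif any(c["status"] == "open" for c in state["conflicts"]):
        decided_step = "resolve"
    elif not state["evidence_ids"]:
        decided_step = "search"  # nothing gathered yet
    elif any(h["status"] in ("proposed", "validating") for h in state["hypotheses"]):
        decided_step = "judge"
    elif state["hypotheses"] and all(h["status"] == "confirmed" for h in state["hypotheses"]):
        decided_step = "synthesize"
    else:
        decided_step = "search"  # default: gather more evidence

    return {"next_step": decided_step, "iteration_count": next_iteration}
```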
257
+
258
+ ### 4.3 Graph Definition
259
+
260
+ **File:** `src/agents/graph/workflow.py`
261
+
262
+ ```python
263
+ """LangGraph workflow definition."""
264
+ from langgraph.graph import StateGraph, END
265
+ from langgraph.checkpoint.sqlite import SqliteSaver
266
+
267
+ from src.agents.graph.state import ResearchState
268
+ from src.agents.graph.nodes import (
269
+ search_node,
270
+ judge_node,
271
+ resolve_node,
272
+ synthesize_node,
273
+ supervisor_node,
274
+ )
275
+
276
+
277
+ def create_research_graph(checkpointer=None):
278
+ """Build the research state graph.
279
+
280
+ Args:
281
+ checkpointer: Optional SqliteSaver/MongoDBSaver for persistence
282
+ """
283
+ graph = StateGraph(ResearchState)
284
+
285
+ # Add nodes
286
+ graph.add_node("supervisor", supervisor_node)
287
+ graph.add_node("search", search_node)
288
+ graph.add_node("judge", judge_node)
289
+ graph.add_node("resolve", resolve_node)
290
+ graph.add_node("synthesize", synthesize_node)
291
+
292
+ # Define edges (supervisor routes based on state.next_step)
293
+ graph.add_edge("search", "supervisor")
294
+ graph.add_edge("judge", "supervisor")
295
+ graph.add_edge("resolve", "supervisor")
296
+ graph.add_edge("synthesize", END)
297
+
298
+ # Conditional routing from supervisor
299
+ graph.add_conditional_edges(
300
+ "supervisor",
301
+ lambda state: state["next_step"],
302
+ {
303
+ "search": "search",
304
+ "judge": "judge",
305
+ "resolve": "resolve",
306
+ "synthesize": "synthesize",
307
+ "finish": END,
308
+ },
309
+ )
310
+
311
+ # Entry point
312
+ graph.set_entry_point("supervisor")
313
+
314
+ return graph.compile(checkpointer=checkpointer)
315
+ ```
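A usage sketch showing how checkpointing enables pause/resume, assuming `langgraph-checkpoint-sqlite >= 2.0` (where `from_conn_string` is a context manager) and the `create_research_graph` factory above; the thread id, file name, and query are placeholders.

```python
# Usage sketch: run the compiled graph with a SQLite checkpointer (illustrative values).
from langgraph.checkpoint.sqlite import SqliteSaver

with SqliteSaver.from_conn_string("research_checkpoints.db") as checkpointer:
    app = create_research_graph(checkpointer)
    config = {"configurable": {"thread_id": "session-001"}}  # one thread per research run

    initial_state = {
        "query": "example research question",
        "hypotheses": [],
        "conflicts": [],
        "evidence_ids": [],
        "messages": [],
        "next_step": "search",
        "iteration_count": 0,
        "max_iterations": 10,
    }
    result = app.invoke(initial_state, config)

# Re-invoking later with the same thread_id resumes from the stored checkpoint
# instead of restarting, which is what makes pause/resume possible.
```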
316
+
317
+ ### 4.4 Orchestrator Integration
318
+
319
+ **File:** `src/orchestrators/langgraph_orchestrator.py`
320
+
321
+ ```python
322
+ """LangGraph-based orchestrator with structured state."""
323
+ from collections.abc import AsyncGenerator
324
+ from langgraph.checkpoint.sqlite.aio import AsyncSqliteSaver
325
+
326
+ from src.agents.graph.workflow import create_research_graph
327
+ from src.agents.graph.state import ResearchState
328
+ from src.orchestrators.base import OrchestratorProtocol
329
+ from src.utils.models import AgentEvent
330
+
331
+
332
+ class LangGraphOrchestrator(OrchestratorProtocol):
333
+ """State-driven research orchestrator using LangGraph."""
334
+
335
+ def __init__(
336
+ self,
337
+ max_iterations: int = 10,
338
+ checkpoint_path: str | None = None,
339
+ ):
340
+ self._max_iterations = max_iterations
341
+ self._checkpoint_path = checkpoint_path
342
+
343
+ async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
344
+ """Execute research workflow with structured state."""
345
+ # Setup checkpointer (SQLite for dev, MongoDB for prod)
346
+ checkpointer = None
347
+ if self._checkpoint_path:
348
+ checkpointer = AsyncSqliteSaver.from_conn_string(self._checkpoint_path)
349
+
350
+ graph = create_research_graph(checkpointer)
351
+
352
+ # Initialize state
353
+ initial_state: ResearchState = {
354
+ "query": query,
355
+ "hypotheses": [],
356
+ "conflicts": [],
357
+ "evidence_ids": [],
358
+ "messages": [],
359
+ "next_step": "search",
360
+ "iteration_count": 0,
361
+ "max_iterations": self._max_iterations,
362
+ }
363
+
364
+ yield AgentEvent(type="started", message=f"Starting research: {query}")
365
+
366
+ # Stream through graph
367
+ async for event in graph.astream(initial_state):
368
+ # Convert graph events to AgentEvents
369
+ yield self._convert_event(event)
370
+ ```
371
+
372
+ ---
373
+
374
+ ## 5. Dependencies
375
+
376
+ ### Required Packages
377
+
378
+ ```toml
379
+ # pyproject.toml additions
380
+ [project.optional-dependencies]
381
+ langgraph = [
382
+ "langgraph>=0.2.50",
383
+ "langchain>=0.3.9",
384
+ "langchain-core>=0.3.21",
385
+ "langchain-huggingface>=0.1.2",
386
+ "langgraph-checkpoint-sqlite>=2.0.0",
387
+ ]
388
+ ```
389
+
390
+ ### Installation
391
+
392
+ ```bash
393
+ # Development
394
+ uv add langgraph langchain langchain-huggingface langgraph-checkpoint-sqlite
395
+
396
+ # Production (add MongoDB checkpointer)
397
+ uv add langgraph-checkpoint-mongodb
398
+ ```
399
+
400
+ ### HuggingFace Model Integration
401
+
402
+ ```python
403
+ # Using Llama 3.1 via HuggingFace Inference API
404
+ from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
405
+
406
+ llm = HuggingFaceEndpoint(
407
+ repo_id="meta-llama/Llama-3.1-70B-Instruct",
408
+ task="text-generation",
409
+ max_new_tokens=2048,
410
+ huggingfacehub_api_token=settings.hf_token,
411
+ )
412
+ chat = ChatHuggingFace(llm=llm)
413
+ ```
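A short usage sketch of the resulting chat model via the standard LangChain message interface; the prompt text is illustrative.

```python
# Usage sketch: ChatHuggingFace exposes the standard Runnable interface.
from langchain_core.messages import HumanMessage, SystemMessage

response = chat.invoke(
    [
        SystemMessage(content="You are a drug repurposing research judge."),
        HumanMessage(content="Score this evidence for mechanism and clinical support."),
    ]
)
print(response.content)  # text completion from the Llama 3.1 endpoint
```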
414
+
415
+ ---
416
+
417
+ ## 6. Implementation Plan (TDD)
418
+
419
+ ### Phase 1: State Schema (2 hours)
420
+
421
+ 1. Create `src/agents/graph/__init__.py`
422
+ 2. Create `src/agents/graph/state.py` with TypedDict schemas
423
+ 3. Write `tests/unit/graph/test_state.py`:
424
+ - Test reducer behavior (operator.add); see the sketch below
425
+ - Test state initialization
426
+ - Test hypothesis/conflict type validation
427
+
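A sketch of the reducer test mentioned in step 3, assuming the proposed `src/agents/graph/state.py` module exists; the file path and names follow this spec.

```python
# tests/unit/graph/test_state.py (sketch; module names assume the layout proposed above)
import operator
from typing import get_args, get_type_hints

from src.agents.graph.state import Hypothesis, ResearchState


def test_hypotheses_channel_declares_add_reducer():
    """hypotheses should be Annotated with operator.add so node updates append."""
    hints = get_type_hints(ResearchState, include_extras=True)
    assert get_args(hints["hypotheses"])[1] is operator.add


def test_hypothesis_is_a_plain_typed_dict():
    """Hypothesis instances are ordinary dicts matching the declared keys."""
    h: Hypothesis = {
        "id": "h1",
        "statement": "Drug A improves outcome B",
        "status": "proposed",
        "confidence": 0.4,
        "supporting_evidence_ids": [],
        "contradicting_evidence_ids": [],
    }
    assert h["status"] == "proposed"
```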
428
+ ### Phase 2: Graph Nodes (4 hours)
429
+
430
+ 1. Create `src/agents/graph/nodes.py`
431
+ 2. Adapt existing tool calls (pubmed, clinicaltrials, europepmc)
432
+ 3. Write `tests/unit/graph/test_nodes.py`:
433
+ - Test each node in isolation (mock LLM)
434
+ - Test state update format
435
+
436
+ ### Phase 3: Workflow Graph (2 hours)
437
+
438
+ 1. Create `src/agents/graph/workflow.py`
439
+ 2. Wire up StateGraph with conditional edges
440
+ 3. Write `tests/integration/graph/test_workflow.py`:
441
+ - Test routing logic
442
+ - Test end-to-end with mocked nodes
443
+
444
+ ### Phase 4: Orchestrator (2 hours)
445
+
446
+ 1. Create `src/orchestrators/langgraph_orchestrator.py`
447
+ 2. Update `src/orchestrators/factory.py` to include "langgraph" mode
448
+ 3. Update `src/app.py` UI dropdown
449
+ 4. Write `tests/e2e/test_langgraph_mode.py`
450
+
451
+ ### Phase 5: Gradio Integration (1 hour)
452
+
453
+ 1. Add "God Mode" option to Gradio dropdown
454
+ 2. Test streaming events
455
+ 3. Verify checkpointing (pause/resume)
456
+
457
+ ---
458
+
459
+ ## 7. Migration Strategy
460
+
461
+ 1. **Parallel Implementation:** Build as new mode alongside existing "simple" and "magentic"
462
+ 2. **UI Dropdown:** Add "God Mode (Experimental)" option
463
+ 3. **Feature Flag:** Use `settings.enable_langgraph_mode` to control availability (see the sketch below)
464
+ 4. **Deprecation Path:** Once stable, deprecate "magentic" mode (Q1 2026)
465
+
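A minimal sketch of the feature flag from item 3, assuming the project's settings object is a pydantic-settings `BaseSettings` subclass in `src/utils/config.py`; the base class and default values are assumptions.

```python
# src/utils/config.py (sketch): only the flag name comes from this spec; the rest is assumed.
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    chroma_db_path: str = "./chroma_db"       # referenced elsewhere in this spec
    enable_langgraph_mode: bool = False       # gate the "God Mode (Experimental)" option


settings = Settings()

# The orchestrator factory would then only expose the LangGraph mode when
# settings.enable_langgraph_mode is True.
```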
466
+ ---
467
+
468
+ ## 8. Acceptance Criteria
469
+
470
+ - [ ] `ResearchState` TypedDict defined with all fields
471
+ - [ ] All 4 nodes (search, judge, resolve, synthesize) implemented
472
+ - [ ] Supervisor routing logic works based on structured state
473
+ - [ ] Checkpointing enables pause/resume
474
+ - [ ] Works with HuggingFace Inference API (no OpenAI required)
475
+ - [ ] Integration tests pass with mocked LLM
476
+ - [ ] E2E test passes with real API call
477
+
478
+ ---
479
+
480
+ ## 9. References
481
+
482
+ ### Primary Sources
483
+ - [LangGraph Official Docs](https://docs.langchain.com/oss/python/langgraph)
484
+ - [LangGraph Persistence Guide](https://docs.langchain.com/oss/python/langgraph/persistence)
485
+ - [MongoDB + LangGraph Integration](https://www.mongodb.com/docs/atlas/ai-integrations/langgraph/)
486
+
487
+ ### Research & Analysis
488
+ - [LangGraph Multi-Agent Orchestration 2025](https://latenode.com/blog/langgraph-multi-agent-orchestration-complete-framework-guide-architecture-analysis-2025)
489
+ - [LangChain vs LangGraph Comparison](https://kanerika.com/blogs/langchain-vs-langgraph/)
490
+ - [Building Deep Research Agents](https://towardsdatascience.com/langgraph-101-lets-build-a-deep-research-agent/)
491
+ - [Mem0 + LangGraph Integration](https://blog.futuresmart.ai/ai-agents-memory-mem0-langgraph-agent-integration)
492
+ - [AI Memory Wars Benchmark](https://guptadeepak.com/the-ai-memory-wars-why-one-system-crushed-the-competition-and-its-not-openai/)
src/agent_factory/judges.py CHANGED
@@ -19,6 +19,7 @@ from src.prompts.judge import (
19
  SYSTEM_PROMPT,
20
  format_empty_evidence_prompt,
21
  format_user_prompt,
 
22
  )
23
  from src.utils.config import settings
24
  from src.utils.models import AssessmentDetails, Evidence, JudgeAssessment
@@ -102,6 +103,8 @@ class JudgeHandler:
102
  self,
103
  question: str,
104
  evidence: list[Evidence],
 
 
105
  ) -> JudgeAssessment:
106
  """
107
  Assess evidence and determine if it's sufficient.
@@ -109,6 +112,8 @@ class JudgeHandler:
109
  Args:
110
  question: The user's research question
111
  evidence: List of Evidence objects from search
 
 
112
 
113
  Returns:
114
  JudgeAssessment with evaluation results
@@ -120,11 +125,20 @@ class JudgeHandler:
120
  "Starting evidence assessment",
121
  question=question[:100],
122
  evidence_count=len(evidence),
 
123
  )
124
 
125
  # Format the prompt based on whether we have evidence
126
  if evidence:
127
- user_prompt = format_user_prompt(question, evidence)
128
  else:
129
  user_prompt = format_empty_evidence_prompt(question)
130
 
@@ -218,6 +232,8 @@ class HFInferenceJudgeHandler:
218
  self,
219
  question: str,
220
  evidence: list[Evidence],
 
 
221
  ) -> JudgeAssessment:
222
  """
223
  Assess evidence using HuggingFace Inference API.
@@ -246,7 +262,14 @@ class HFInferenceJudgeHandler:
246
 
247
  # Format the user prompt
248
  if evidence:
249
- user_prompt = format_user_prompt(question, evidence)
250
  else:
251
  user_prompt = format_empty_evidence_prompt(question)
252
 
@@ -535,6 +558,8 @@ class MockJudgeHandler:
535
  self,
536
  question: str,
537
  evidence: list[Evidence],
 
 
538
  ) -> JudgeAssessment:
539
  """Return assessment based on actual evidence (demo mode)."""
540
  self.call_count += 1
 
19
  SYSTEM_PROMPT,
20
  format_empty_evidence_prompt,
21
  format_user_prompt,
22
+ select_evidence_for_judge,
23
  )
24
  from src.utils.config import settings
25
  from src.utils.models import AssessmentDetails, Evidence, JudgeAssessment
 
103
  self,
104
  question: str,
105
  evidence: list[Evidence],
106
+ iteration: int = 0,
107
+ max_iterations: int = 10,
108
  ) -> JudgeAssessment:
109
  """
110
  Assess evidence and determine if it's sufficient.
 
112
  Args:
113
  question: The user's research question
114
  evidence: List of Evidence objects from search
115
+ iteration: Current iteration number
116
+ max_iterations: Maximum allowed iterations
117
 
118
  Returns:
119
  JudgeAssessment with evaluation results
 
125
  "Starting evidence assessment",
126
  question=question[:100],
127
  evidence_count=len(evidence),
128
+ iteration=iteration,
129
  )
130
 
131
  # Format the prompt based on whether we have evidence
132
  if evidence:
133
+ # Select diverse evidence using embeddings (if available)
134
+ selected_evidence = await select_evidence_for_judge(evidence, question)
135
+ user_prompt = format_user_prompt(
136
+ question,
137
+ selected_evidence,
138
+ iteration,
139
+ max_iterations,
140
+ total_evidence_count=len(evidence),
141
+ )
142
  else:
143
  user_prompt = format_empty_evidence_prompt(question)
144
 
 
232
  self,
233
  question: str,
234
  evidence: list[Evidence],
235
+ iteration: int = 0,
236
+ max_iterations: int = 10,
237
  ) -> JudgeAssessment:
238
  """
239
  Assess evidence using HuggingFace Inference API.
 
262
 
263
  # Format the user prompt
264
  if evidence:
265
+ selected_evidence = await select_evidence_for_judge(evidence, question)
266
+ user_prompt = format_user_prompt(
267
+ question,
268
+ selected_evidence,
269
+ iteration,
270
+ max_iterations,
271
+ total_evidence_count=len(evidence),
272
+ )
273
  else:
274
  user_prompt = format_empty_evidence_prompt(question)
275
 
 
558
  self,
559
  question: str,
560
  evidence: list[Evidence],
561
+ iteration: int = 0,
562
+ max_iterations: int = 10,
563
  ) -> JudgeAssessment:
564
  """Return assessment based on actual evidence (demo mode)."""
565
  self.call_count += 1
src/orchestrators/base.py CHANGED
@@ -40,12 +40,20 @@ class JudgeHandlerProtocol(Protocol):
40
  and MockJudgeHandler.
41
  """
42
 
43
- async def assess(self, question: str, evidence: list[Evidence]) -> JudgeAssessment:
44
  """Assess whether collected evidence is sufficient.
45
 
46
  Args:
47
  question: The original research question
48
  evidence: List of evidence items to assess
 
 
49
 
50
  Returns:
51
  JudgeAssessment with sufficiency determination and next steps
 
40
  and MockJudgeHandler.
41
  """
42
 
43
+ async def assess(
44
+ self,
45
+ question: str,
46
+ evidence: list[Evidence],
47
+ iteration: int = 0,
48
+ max_iterations: int = 10,
49
+ ) -> JudgeAssessment:
50
  """Assess whether collected evidence is sufficient.
51
 
52
  Args:
53
  question: The original research question
54
  evidence: List of evidence items to assess
55
+ iteration: Current iteration number
56
+ max_iterations: Maximum allowed iterations
57
 
58
  Returns:
59
  JudgeAssessment with sufficiency determination and next steps
src/orchestrators/simple.py CHANGED
@@ -12,7 +12,7 @@ from __future__ import annotations
12
 
13
  import asyncio
14
  from collections.abc import AsyncGenerator
15
- from typing import TYPE_CHECKING, Any
16
 
17
  import structlog
18
 
@@ -42,6 +42,18 @@ class Orchestrator:
42
  Microsoft Agent Framework.
43
  """
44
 
45
  def __init__(
46
  self,
47
  search_handler: SearchHandlerProtocol,
@@ -100,6 +112,7 @@ class Orchestrator:
100
  try:
101
  # Deduplicate using semantic similarity
102
  unique_evidence: list[Evidence] = await embeddings.deduplicate(evidence, threshold=0.85)
 
103
  logger.info(
104
  "Deduplicated evidence",
105
  before=len(evidence),
@@ -153,6 +166,65 @@ class Orchestrator:
153
  iteration=iteration,
154
  )
155
 
156
  async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]: # noqa: PLR0915
157
  """
158
  Run the agent loop for a query.
@@ -252,7 +324,9 @@ class Orchestrator:
252
  )
253
 
254
  try:
255
- assessment = await self.judge.assess(query, all_evidence)
 
 
256
 
257
  yield AgentEvent(
258
  type="judge_complete",
@@ -279,15 +353,37 @@ class Orchestrator:
279
  }
280
  )
281
 
282
- # === DECISION PHASE ===
283
- if assessment.sufficient and assessment.recommendation == "synthesize":
284
  # Optional Analysis Phase
285
  async for event in self._run_analysis_phase(query, all_evidence, iteration):
286
  yield event
287
 
288
  yield AgentEvent(
289
  type="synthesizing",
290
- message="Evidence sufficient! Preparing synthesis...",
291
  iteration=iteration,
292
  )
293
 
@@ -300,6 +396,7 @@ class Orchestrator:
300
  data={
301
  "evidence_count": len(all_evidence),
302
  "iterations": iteration,
 
303
  "drug_candidates": assessment.details.drug_candidates,
304
  "key_findings": assessment.details.key_findings,
305
  },
@@ -317,10 +414,11 @@ class Orchestrator:
317
  yield AgentEvent(
318
  type="looping",
319
  message=(
320
- f"Need more evidence. "
321
- f"Next searches: {', '.join(current_queries[:2])}..."
 
322
  ),
323
- data={"next_queries": current_queries},
324
  iteration=iteration,
325
  )
326
 
@@ -410,36 +508,93 @@ class Orchestrator:
410
  evidence: list[Evidence],
411
  ) -> str:
412
  """
413
- Generate a partial synthesis when max iterations reached.
414
 
415
- Args:
416
- query: The original question
417
- evidence: All collected evidence
 
 
418
 
419
- Returns:
420
- Formatted partial synthesis as markdown
421
  """
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
422
  citations = "\n".join(
423
  [
424
- f"{i + 1}. [{e.citation.title}]({e.citation.url}) ({e.citation.source.upper()})"
 
425
  for i, e in enumerate(evidence[:10])
426
  ]
427
  )
428
 
429
- return f"""## Partial Analysis (Max Iterations Reached)
 
 
 
 
 
 
 
430
 
431
- ### Question
 
 
432
  {query}
433
 
434
  ### Status
435
- Maximum search iterations reached. The evidence gathered may be incomplete.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
436
 
437
- ### Evidence Collected
438
- Found {len(evidence)} sources. Consider refining your query for more specific results.
439
 
440
- ### Citations
441
  {citations}
442
 
443
  ---
444
- *Consider searching with more specific terms or drug names.*
  """
 
12
 
13
  import asyncio
14
  from collections.abc import AsyncGenerator
15
+ from typing import TYPE_CHECKING, Any, ClassVar
16
 
17
  import structlog
18
 
 
42
  Microsoft Agent Framework.
43
  """
44
 
45
+ # Termination thresholds (code-enforced, not LLM-decided)
46
+ TERMINATION_CRITERIA: ClassVar[dict[str, float]] = {
47
+ "min_combined_score": 12.0, # mechanism + clinical >= 12
48
+ "min_score_with_volume": 10.0, # >= 10 if 50+ sources
49
+ "min_evidence_for_volume": 50.0, # Priority 3: evidence count threshold
50
+ "late_iteration_threshold": 8.0, # >= 8 in iterations 8+
51
+ "max_evidence_threshold": 100.0, # Force synthesis with 100+ sources
52
+ "emergency_iteration": 8.0, # Last 2 iterations = emergency mode
53
+ "min_confidence": 0.5, # Minimum confidence for emergency synthesis
54
+ "min_evidence_for_emergency": 30.0, # Priority 6: min evidence for emergency
55
+ }
56
+
57
  def __init__(
58
  self,
59
  search_handler: SearchHandlerProtocol,
 
112
  try:
113
  # Deduplicate using semantic similarity
114
  unique_evidence: list[Evidence] = await embeddings.deduplicate(evidence, threshold=0.85)
115
+
116
  logger.info(
117
  "Deduplicated evidence",
118
  before=len(evidence),
 
166
  iteration=iteration,
167
  )
168
 
169
+ def _should_synthesize(
170
+ self,
171
+ assessment: JudgeAssessment,
172
+ iteration: int,
173
+ max_iterations: int,
174
+ evidence_count: int,
175
+ ) -> tuple[bool, str]:
176
+ """
177
+ Code-enforced synthesis decision.
178
+
179
+ Returns (should_synthesize, reason).
180
+ """
181
+ combined_score = (
182
+ assessment.details.mechanism_score + assessment.details.clinical_evidence_score
183
+ )
184
+ has_drug_candidates = len(assessment.details.drug_candidates) > 0
185
+ confidence = assessment.confidence
186
+
187
+ # Priority 1: LLM explicitly says sufficient with good scores
188
+ if assessment.sufficient and assessment.recommendation == "synthesize":
189
+ if combined_score >= 10:
190
+ return True, "judge_approved"
191
+
192
+ # Priority 2: High scores with drug candidates
193
+ if (
194
+ combined_score >= self.TERMINATION_CRITERIA["min_combined_score"]
195
+ and has_drug_candidates
196
+ ):
197
+ return True, "high_scores_with_candidates"
198
+
199
+ # Priority 3: Good scores with high evidence volume
200
+ if (
201
+ combined_score >= self.TERMINATION_CRITERIA["min_score_with_volume"]
202
+ and evidence_count >= self.TERMINATION_CRITERIA["min_evidence_for_volume"]
203
+ ):
204
+ return True, "good_scores_high_volume"
205
+
206
+ # Priority 4: Late iteration with acceptable scores (diminishing returns)
207
+ is_late_iteration = iteration >= max_iterations - 2
208
+ if (
209
+ is_late_iteration
210
+ and combined_score >= self.TERMINATION_CRITERIA["late_iteration_threshold"]
211
+ ):
212
+ return True, "late_iteration_acceptable"
213
+
214
+ # Priority 5: Very high evidence count (enough to synthesize something)
215
+ if evidence_count >= self.TERMINATION_CRITERIA["max_evidence_threshold"]:
216
+ return True, "max_evidence_reached"
217
+
218
+ # Priority 6: Emergency synthesis (avoid garbage output)
219
+ if (
220
+ is_late_iteration
221
+ and evidence_count >= self.TERMINATION_CRITERIA["min_evidence_for_emergency"]
222
+ and confidence >= self.TERMINATION_CRITERIA["min_confidence"]
223
+ ):
224
+ return True, "emergency_synthesis"
225
+
226
+ return False, "continue_searching"
227
+
228
  async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]: # noqa: PLR0915
229
  """
230
  Run the agent loop for a query.
 
324
  )
325
 
326
  try:
327
+ assessment = await self.judge.assess(
328
+ query, all_evidence, iteration, self.config.max_iterations
329
+ )
330
 
331
  yield AgentEvent(
332
  type="judge_complete",
 
353
  }
354
  )
355
 
356
+ # === DECISION PHASE (Code-Enforced) ===
357
+ should_synth, reason = self._should_synthesize(
358
+ assessment=assessment,
359
+ iteration=iteration,
360
+ max_iterations=self.config.max_iterations,
361
+ evidence_count=len(all_evidence),
362
+ )
363
+
364
+ logger.info(
365
+ "Synthesis decision",
366
+ should_synthesize=should_synth,
367
+ reason=reason,
368
+ iteration=iteration,
369
+ combined_score=assessment.details.mechanism_score
370
+ + assessment.details.clinical_evidence_score,
371
+ evidence_count=len(all_evidence),
372
+ confidence=assessment.confidence,
373
+ )
374
+
375
+ if should_synth:
376
+ # Log synthesis trigger reason for debugging
377
+ if reason != "judge_approved":
378
+ logger.info(f"Code-enforced synthesis triggered: {reason}")
379
+
380
  # Optional Analysis Phase
381
  async for event in self._run_analysis_phase(query, all_evidence, iteration):
382
  yield event
383
 
384
  yield AgentEvent(
385
  type="synthesizing",
386
+ message=f"Evidence sufficient ({reason})! Preparing synthesis...",
387
  iteration=iteration,
388
  )
389
 
 
396
  data={
397
  "evidence_count": len(all_evidence),
398
  "iterations": iteration,
399
+ "synthesis_reason": reason,
400
  "drug_candidates": assessment.details.drug_candidates,
401
  "key_findings": assessment.details.key_findings,
402
  },
 
414
  yield AgentEvent(
415
  type="looping",
416
  message=(
417
+ f"Gathering more evidence (scores: {assessment.details.mechanism_score}"
418
+ f"+{assessment.details.clinical_evidence_score}). "
419
+ f"Next: {', '.join(current_queries[:2])}..."
420
  ),
421
+ data={"next_queries": current_queries, "reason": reason},
422
  iteration=iteration,
423
  )
424
 
 
508
  evidence: list[Evidence],
509
  ) -> str:
510
  """
511
+ Generate a REAL synthesis when max iterations reached.
512
 
513
+ Even when forced to stop, we should provide:
514
+ - Drug candidates (if any were found)
515
+ - Key findings
516
+ - Assessment scores
517
+ - Actionable citations
518
 
519
+ This is still better than a citation dump.
 
520
  """
521
+ # Extract data from last assessment if available
522
+ last_assessment = self.history[-1]["assessment"] if self.history else {}
523
+ details = last_assessment.get("details", {})
524
+
525
+ drug_candidates = details.get("drug_candidates", [])
526
+ key_findings = details.get("key_findings", [])
527
+ mechanism_score = details.get("mechanism_score", 0)
528
+ clinical_score = details.get("clinical_evidence_score", 0)
529
+ reasoning = last_assessment.get("reasoning", "Analysis incomplete due to iteration limit.")
530
+
531
+ # Format drug candidates
532
+ if drug_candidates:
533
+ drug_list = "\n".join([f"- **{d}**" for d in drug_candidates[:5]])
534
+ else:
535
+ drug_list = (
536
+ "- *No specific drug candidates identified in evidence*\n"
537
+ "- *Try a more specific query or add an API key for better analysis*"
538
+ )
539
+
540
+ # Format key findings
541
+ if key_findings:
542
+ findings_list = "\n".join([f"- {f}" for f in key_findings[:5]])
543
+ else:
544
+ findings_list = (
545
+ "- *Key findings require further analysis*\n"
546
+ "- *See citations below for relevant sources*"
547
+ )
548
+
549
+ # Format citations (top 10)
550
  citations = "\n".join(
551
  [
552
+ f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
553
+ f"({e.citation.source.upper()}, {e.citation.date})"
554
  for i, e in enumerate(evidence[:10])
555
  ]
556
  )
557
 
558
+ combined_score = mechanism_score + clinical_score
559
+ mech_strength = (
560
+ "Strong" if mechanism_score >= 7 else "Moderate" if mechanism_score >= 4 else "Limited"
561
+ )
562
+ clin_strength = (
563
+ "Strong" if clinical_score >= 7 else "Moderate" if clinical_score >= 4 else "Limited"
564
+ )
565
+ comb_strength = "Sufficient" if combined_score >= 12 else "Partial"
566
 
567
+ return f"""## Drug Repurposing Analysis
568
+
569
+ ### Research Question
570
  {query}
571
 
572
  ### Status
573
+ Analysis based on {len(evidence)} sources across {len(self.history)} iterations.
574
+ Maximum iterations reached - results may be incomplete.
575
+
576
+ ### Drug Candidates Identified
577
+ {drug_list}
578
+
579
+ ### Key Findings
580
+ {findings_list}
581
+
582
+ ### Evidence Quality Scores
583
+ | Criterion | Score | Interpretation |
584
+ |-----------|-------|----------------|
585
+ | Mechanism | {mechanism_score}/10 | {mech_strength} mechanistic evidence |
586
+ | Clinical | {clinical_score}/10 | {clin_strength} clinical support |
587
+ | Combined | {combined_score}/20 | {comb_strength} for synthesis |
588
 
589
+ ### Analysis Summary
590
+ {reasoning}
591
 
592
+ ### Top Citations ({len(evidence)} sources total)
593
  {citations}
594
 
595
  ---
596
+ *For more complete analysis:*
597
+ - *Add an OpenAI or Anthropic API key for enhanced LLM analysis*
598
+ - *Try a more specific query (e.g., include drug names)*
599
+ - *Use Advanced mode for multi-agent research*
600
  """
src/prompts/judge.py CHANGED
@@ -4,10 +4,16 @@ from src.utils.models import Evidence
4
 
5
  SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
6
 
7
- Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to
8
- recommend drug candidates for a given condition.
 
9
 
10
- ## Evaluation Criteria
 
 
 
 
 
11
 
12
  1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
13
  - 0-3: No clear mechanism, speculative
@@ -19,59 +25,123 @@ recommend drug candidates for a given condition.
19
  - 4-6: Preclinical or early clinical data
20
  - 7-10: Strong clinical evidence (trials, meta-analyses)
21
 
22
- 3. **Sufficiency**: Evidence is sufficient when:
23
- - Combined scores >= 12 AND
24
- - At least one specific drug candidate identified AND
25
- - Clear mechanistic rationale exists
26
-
27
- ## Output Rules
28
-
29
- - Always output valid JSON matching the schema
30
- - Be conservative: only recommend "synthesize" when truly confident
31
- - If continuing, suggest specific, actionable search queries
32
- Never hallucinate drug names or findings not in the evidence
  """
34
 
 
 
35
 
36
- def format_user_prompt(question: str, evidence: list[Evidence]) -> str:
 
 
 
 
37
  """
38
- Format the user prompt with question and evidence.
39
 
40
- Args:
41
- question: The user's research question
42
- evidence: List of Evidence objects from search
43
 
44
- Returns:
45
- Formatted prompt string
46
  """
 
47
  max_content_len = 1500
48
 
49
  def format_single_evidence(i: int, e: Evidence) -> str:
50
  content = e.content
51
  if len(content) > max_content_len:
52
  content = content[:max_content_len] + "..."
53
-
54
  return (
55
  f"### Evidence {i + 1}\n"
56
  f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
57
  f"**URL**: {e.citation.url}\n"
58
- f"**Date**: {e.citation.date}\n"
59
  f"**Content**:\n{content}"
60
  )
61
 
62
  evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
63
 
64
- return f"""## Research Question
 
65
  {question}
66
 
67
- ## Available Evidence ({len(evidence)} sources)
 
 
 
 
 
68
 
69
  {evidence_text}
70
 
71
  ## Your Task
72
 
73
- Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.
74
- Respond with a JSON object matching the JudgeAssessment schema.
 
 
 
75
  """
76
 
77
 
 
4
 
5
  SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
6
 
7
+ Your task is to SCORE evidence from biomedical literature. You do NOT decide whether to
8
+ continue searching or synthesize - that decision is made by the orchestration system
9
+ based on your scores.
10
 
11
+ ## Your Role: Scoring Only
12
+
13
+ You provide objective scores. The system decides next steps based on explicit thresholds.
14
+ This separation prevents bias in the decision-making process.
15
+
16
+ ## Scoring Criteria
17
 
18
  1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
19
  - 0-3: No clear mechanism, speculative
 
25
  - 4-6: Preclinical or early clinical data
26
  - 7-10: Strong clinical evidence (trials, meta-analyses)
27
 
28
+ 3. **Drug Candidates**: List SPECIFIC drug names mentioned in the evidence
29
+ - Only include drugs explicitly mentioned
30
+ - Do NOT hallucinate or infer drug names
31
+ - Include drug class if specific names aren't available (e.g., "SSRI antidepressants")
32
+
33
+ 4. **Key Findings**: Extract 3-5 key findings from the evidence
34
+ - Focus on findings relevant to the research question
35
+ - Include mechanism insights and clinical outcomes
36
+
37
+ 5. **Confidence (0.0-1.0)**: Your confidence in the scores
38
+ - Based on evidence quality and relevance
39
+ - Lower if evidence is tangential or low-quality
40
+
41
+ ## Output Format
42
+
43
+ Return valid JSON with these fields:
44
+ - details.mechanism_score (int 0-10)
45
+ - details.mechanism_reasoning (string)
46
+ - details.clinical_evidence_score (int 0-10)
47
+ - details.clinical_reasoning (string)
48
+ - details.drug_candidates (list of strings)
49
+ - details.key_findings (list of strings)
50
+ - sufficient (boolean) - TRUE if scores suggest enough evidence
51
+ - confidence (float 0-1)
52
+ - recommendation ("continue" or "synthesize") - Your suggestion (system may override)
53
+ - next_search_queries (list) - If continuing, suggest FOCUSED queries
54
+ - reasoning (string)
55
+
56
+ ## CRITICAL: Search Query Rules
57
+
58
+ When suggesting next_search_queries:
59
+ - STAY FOCUSED on the original research question
60
+ - Do NOT drift to tangential topics
61
+ - If question is about "female libido", do NOT suggest "bone health" or "muscle mass"
62
+ - Refine existing terms, don't explore random medical associations
63
  """
64
 
65
+ MAX_EVIDENCE_FOR_JUDGE = 30 # Keep under token limits
66
+
67
 
68
+ async def select_evidence_for_judge(
69
+ evidence: list[Evidence],
70
+ query: str,
71
+ max_items: int = MAX_EVIDENCE_FOR_JUDGE,
72
+ ) -> list[Evidence]:
73
  """
74
+ Select diverse, relevant evidence for judge evaluation.
75
 
76
+ Implements RAG best practices:
77
+ - Diversity selection over recency-only
78
+ - Lost-in-the-middle mitigation
79
+ - Relevance re-ranking
80
+ """
81
+ if len(evidence) <= max_items:
82
+ return evidence
83
+
84
+ try:
85
+ from src.utils.text_utils import select_diverse_evidence
86
+
87
+ # Use embedding-based diversity selection
88
+ return await select_diverse_evidence(evidence, n=max_items, query=query)
89
+ except ImportError:
90
+ # Fallback: mix of recent + early (lost-in-the-middle mitigation)
91
+ early = evidence[: max_items // 3] # First third
92
+ recent = evidence[-(max_items * 2 // 3) :] # Last two-thirds
93
+ return early + recent
94
+
95
+
96
+ def format_user_prompt(
97
+ question: str,
98
+ evidence: list[Evidence],
99
+ iteration: int = 0,
100
+ max_iterations: int = 10,
101
+ total_evidence_count: int | None = None,
102
+ ) -> str:
103
+ """
104
+ Format user prompt with selected evidence and iteration context.
105
 
106
+ NOTE: Evidence should be pre-selected using select_evidence_for_judge().
107
+ This function assumes evidence is already capped.
108
  """
109
+ total_count = total_evidence_count or len(evidence)
110
  max_content_len = 1500
111
 
112
  def format_single_evidence(i: int, e: Evidence) -> str:
113
  content = e.content
114
  if len(content) > max_content_len:
115
  content = content[:max_content_len] + "..."
 
116
  return (
117
  f"### Evidence {i + 1}\n"
118
  f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
119
  f"**URL**: {e.citation.url}\n"
 
120
  f"**Content**:\n{content}"
121
  )
122
 
123
  evidence_text = "\n\n".join([format_single_evidence(i, e) for i, e in enumerate(evidence)])
124
 
125
+ # Lost-in-the-middle mitigation: put critical context at START and END
126
+ return f"""## Research Question (IMPORTANT - stay focused on this)
127
  {question}
128
 
129
+ ## Search Progress
130
+ - **Iteration**: {iteration}/{max_iterations}
131
+ - **Total evidence collected**: {total_count} sources
132
+ - **Evidence shown below**: {len(evidence)} diverse sources (selected for relevance)
133
+
134
+ ## Available Evidence
135
 
136
  {evidence_text}
137
 
138
  ## Your Task
139
 
140
+ Score this evidence for drug repurposing potential. Provide ONLY scores and extracted data.
141
+ DO NOT decide "synthesize" vs "continue" - that decision is made by the system.
142
+
143
+ ## REMINDER: Original Question (stay focused)
144
+ {question}
145
  """
146
 
147
 
tests/e2e/conftest.py CHANGED
@@ -39,7 +39,7 @@ def mock_judge_handler():
39
  """Return a mock judge that always says 'synthesize'."""
40
  mock = MagicMock()
41
 
42
- async def mock_assess(question, evidence):
43
  return JudgeAssessment(
44
  sufficient=True,
45
  confidence=0.9,
 
39
  """Return a mock judge that always says 'synthesize'."""
40
  mock = MagicMock()
41
 
42
+ async def mock_assess(question, evidence, iteration=1, max_iterations=10):
43
  return JudgeAssessment(
44
  sufficient=True,
45
  confidence=0.9,
tests/integration/test_simple_mode_synthesis.py ADDED
@@ -0,0 +1,147 @@
+ from unittest.mock import AsyncMock
+
+ import pytest
+
+ from src.orchestrators.simple import Orchestrator
+ from src.utils.models import (
+     AssessmentDetails,
+     Citation,
+     Evidence,
+     JudgeAssessment,
+     OrchestratorConfig,
+     SearchResult,
+ )
+
+
+ def make_evidence(title: str) -> Evidence:
+     return Evidence(
+         content="content",
+         citation=Citation(title=title, url="http://test.com", date="2025", source="pubmed"),
+     )
+
+
+ @pytest.mark.integration
+ @pytest.mark.asyncio
+ async def test_simple_mode_synthesizes_before_max_iterations():
+     """Verify simple mode produces useful output with a mocked judge."""
+     # Mock search to return evidence
+     mock_search = AsyncMock()
+     mock_search.execute.return_value = SearchResult(
+         query="test query",
+         evidence=[make_evidence(f"Paper {i}") for i in range(5)],
+         errors=[],
+         sources_searched=["pubmed"],
+         total_found=5,
+     )
+
+     # Mock judge that eventually returns good scores. A pure mock (rather
+     # than MockJudgeHandler) controls the scores precisely.
+     mock_judge = AsyncMock()
+
+     # Iteration 1: low scores
+     assess_1 = JudgeAssessment(
+         details=AssessmentDetails(
+             mechanism_score=2,
+             mechanism_reasoning="reasoning is sufficient for valid model",
+             clinical_evidence_score=2,
+             clinical_reasoning="reasoning is sufficient for valid model",
+             drug_candidates=[],
+             key_findings=[],
+         ),
+         sufficient=False,
+         confidence=0.5,
+         recommendation="continue",
+         next_search_queries=["q2"],
+         reasoning="need more evidence to support conclusions about this topic",
+     )
+
+     # Iteration 2: high scores (should trigger synthesis)
+     assess_2 = JudgeAssessment(
+         details=AssessmentDetails(
+             mechanism_score=8,
+             mechanism_reasoning="reasoning is sufficient for valid model",
+             clinical_evidence_score=7,
+             clinical_reasoning="reasoning is sufficient for valid model",
+             drug_candidates=["MagicDrug"],
+             key_findings=["It works"],
+         ),
+         sufficient=False,  # Judge is conservative
+         confidence=0.9,
+         recommendation="continue",  # Judge still says continue (simulating bias)
+         next_search_queries=[],
+         reasoning="good scores but maybe more evidence needed technically",
+     )
+
+     mock_judge.assess.side_effect = [assess_1, assess_2]
+
+     orchestrator = Orchestrator(
+         search_handler=mock_search,
+         judge_handler=mock_judge,
+         config=OrchestratorConfig(max_iterations=5),
+     )
+
+     events = []
+     async for event in orchestrator.run("test query"):
+         events.append(event)
+         if event.type == "complete":
+             break
+
+     # Must have synthesis with drug candidates
+     complete_events = [e for e in events if e.type == "complete"]
+     assert len(complete_events) == 1
+     complete_event = complete_events[0]
+
+     assert "MagicDrug" in complete_event.message
+     assert "Drug Candidates" in complete_event.message
+     assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
+     assert complete_event.iteration == 2  # Should stop at iteration 2
+
+
+ @pytest.mark.integration
+ @pytest.mark.asyncio
+ async def test_partial_synthesis_generation():
+     """Verify partial synthesis includes drug candidates even if max iterations reached."""
+     mock_search = AsyncMock()
+     mock_search.execute.return_value = SearchResult(
+         query="test", evidence=[], errors=[], sources_searched=["pubmed"], total_found=0
+     )
+
+     mock_judge = AsyncMock()
+     # Always return low scores but WITH candidates.
+     # Scores 3+3 = 6 < 8 (late threshold), so it should NOT synthesize early.
+     mock_judge.assess.return_value = JudgeAssessment(
+         details=AssessmentDetails(
+             mechanism_score=3,
+             mechanism_reasoning="reasoning is sufficient for valid model",
+             clinical_evidence_score=3,
+             clinical_reasoning="reasoning is sufficient for valid model",
+             drug_candidates=["PartialDrug"],
+             key_findings=["Partial finding"],
+         ),
+         sufficient=False,
+         confidence=0.5,
+         recommendation="continue",
+         next_search_queries=[],
+         reasoning="keep going to find more evidence about this topic please",
+     )
+
+     orchestrator = Orchestrator(
+         search_handler=mock_search,
+         judge_handler=mock_judge,
+         config=OrchestratorConfig(max_iterations=2),
+     )
+
+     events = []
+     async for event in orchestrator.run("test"):
+         events.append(event)
+
+     complete_events = [e for e in events if e.type == "complete"]
+     assert (
+         len(complete_events) == 1
+     ), f"Expected exactly one complete event, got {len(complete_events)}"
+     complete_event = complete_events[0]
+     assert complete_event.data.get("max_reached") is True
+
+     # The output message should contain the drug candidate from the last assessment
+     assert "PartialDrug" in complete_event.message
+     assert "Maximum iterations reached" in complete_event.message
tests/unit/orchestrators/test_termination.py ADDED
@@ -0,0 +1,104 @@
+ from typing import Literal
+ from unittest.mock import MagicMock
+
+ import pytest
+
+ from src.orchestrators.simple import Orchestrator
+ from src.utils.models import AssessmentDetails, JudgeAssessment
+
+
+ def make_assessment(
+     mechanism: int,
+     clinical: int,
+     drug_candidates: list[str],
+     sufficient: bool = False,
+     recommendation: Literal["continue", "synthesize"] = "continue",
+     confidence: float = 0.8,
+ ) -> JudgeAssessment:
+     return JudgeAssessment(
+         details=AssessmentDetails(
+             mechanism_score=mechanism,
+             mechanism_reasoning="reasoning is sufficient for testing purposes",
+             clinical_evidence_score=clinical,
+             clinical_reasoning="reasoning is sufficient for testing purposes",
+             drug_candidates=drug_candidates,
+             key_findings=["finding"],
+         ),
+         sufficient=sufficient,
+         confidence=confidence,
+         recommendation=recommendation,
+         next_search_queries=[],
+         reasoning="reasoning is sufficient for testing purposes",
+     )
+
+
+ @pytest.fixture
+ def orchestrator():
+     search = MagicMock()
+     judge = MagicMock()
+     return Orchestrator(search, judge)
+
+
+ @pytest.mark.unit
+ def test_should_synthesize_high_scores(orchestrator):
+     """High scores with drug candidates trigger synthesis."""
+     assessment = make_assessment(mechanism=7, clinical=6, drug_candidates=["Metformin"])
+
+     # _should_synthesize is private; call it directly for unit testing.
+     should_synth, reason = orchestrator._should_synthesize(
+         assessment, iteration=3, max_iterations=10, evidence_count=50
+     )
+
+     assert should_synth is True
+     assert reason == "high_scores_with_candidates"
+
+
+ @pytest.mark.unit
+ def test_should_synthesize_late_iteration(orchestrator):
+     """Late iteration with acceptable scores triggers synthesis."""
+     assessment = make_assessment(mechanism=5, clinical=4, drug_candidates=[])
+     should_synth, reason = orchestrator._should_synthesize(
+         assessment, iteration=9, max_iterations=10, evidence_count=80
+     )
+
+     assert should_synth is True
+     assert reason in ["late_iteration_acceptable", "emergency_synthesis"]
+
+
+ @pytest.mark.unit
+ def test_should_not_synthesize_early_low_scores(orchestrator):
+     """Early iteration with low scores continues searching."""
+     assessment = make_assessment(mechanism=3, clinical=2, drug_candidates=[])
+     should_synth, reason = orchestrator._should_synthesize(
+         assessment, iteration=2, max_iterations=10, evidence_count=20
+     )
+
+     assert should_synth is False
+     assert reason == "continue_searching"
+
+
+ @pytest.mark.unit
+ def test_judge_approved_overrides_all(orchestrator):
+     """If the judge explicitly says synthesize with good scores, do it."""
+     assessment = make_assessment(
+         mechanism=6, clinical=5, drug_candidates=[], sufficient=True, recommendation="synthesize"
+     )
+     should_synth, reason = orchestrator._should_synthesize(
+         assessment, iteration=2, max_iterations=10, evidence_count=20
+     )
+
+     assert should_synth is True
+     assert reason == "judge_approved"
+
+
+ @pytest.mark.unit
+ def test_max_evidence_threshold(orchestrator):
+     """Force synthesis when evidence volume alone justifies it."""
+     assessment = make_assessment(mechanism=2, clinical=2, drug_candidates=[])
+     should_synth, reason = orchestrator._should_synthesize(
+         assessment, iteration=5, max_iterations=10, evidence_count=150
+     )
+
+     assert should_synth is True
+     assert reason == "max_evidence_reached"
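Taken together, these unit tests pin down the code-enforced termination criteria. A sketch of logic consistent with all five, assuming threshold values (HIGH >= 12 combined, LATE >= 8 combined, 100-item evidence cap) that the tests bound but do not fix exactly; only the reason strings are taken from the assertions:

# Sketch only: thresholds are assumptions; see src/orchestrators/simple.py.
def _should_synthesize(self, assessment, iteration, max_iterations, evidence_count):
    details = assessment.details
    combined = details.mechanism_score + details.clinical_evidence_score

    # 1. Trust an explicit judge approval.
    if assessment.sufficient and assessment.recommendation == "synthesize":
        return True, "judge_approved"

    # 2. Cap context growth: force synthesis once evidence piles up.
    if evidence_count >= 100:
        return True, "max_evidence_reached"

    # 3. Strong scores plus named candidates - stop searching.
    if combined >= 12 and details.drug_candidates:
        return True, "high_scores_with_candidates"

    # 4. Near the iteration budget, accept merely decent scores.
    if iteration >= max_iterations - 1:
        if combined >= 8:
            return True, "late_iteration_acceptable"
        if iteration >= max_iterations:
            return True, "emergency_synthesis"

    return False, "continue_searching"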
tests/unit/prompts/test_judge_prompt.py ADDED
@@ -0,0 +1,61 @@
+ from unittest.mock import patch
+
+ import pytest
+
+ from src.prompts.judge import format_user_prompt, select_evidence_for_judge
+ from src.utils.models import Citation, Evidence
+
+
+ def make_evidence(title: str, content: str = "content") -> Evidence:
+     return Evidence(
+         content=content,
+         citation=Citation(title=title, url="http://test.com", date="2025", source="pubmed"),
+     )
+
+
+ @pytest.mark.unit
+ @pytest.mark.asyncio
+ async def test_evidence_selection_diverse():
+     """Verify evidence selection includes early and recent items (fallback logic)."""
+     # Create enough evidence to trigger selection
+     evidence = [make_evidence(f"Paper {i}") for i in range(100)]
+
+     # Mock select_diverse_evidence to raise ImportError so the fallback path runs
+     with patch("src.utils.text_utils.select_diverse_evidence", side_effect=ImportError):
+         selected = await select_evidence_for_judge(evidence, "test query", max_items=30)
+
+     assert len(selected) == 30
+
+     # Should include some early evidence (lost-in-the-middle mitigation)
+     titles = [e.citation.title for e in selected]
+
+     # Check for start (Paper 0..9) - set membership keeps the check readable
+     early_papers = {f"Paper {i}" for i in range(10)}
+     has_early = any(title in early_papers for title in titles)
+     # Check for end (Paper 90..99)
+     late_papers = {f"Paper {i}" for i in range(90, 100)}
+     has_late = any(title in late_papers for title in titles)
+
+     assert has_early, "Should include early evidence"
+     assert has_late, "Should include recent evidence"
+
+
+ @pytest.mark.unit
+ def test_prompt_includes_question_at_edges():
+     """Verify lost-in-the-middle mitigation in prompt formatting."""
+     evidence = [make_evidence("Test Paper")]
+     question = "CRITICAL RESEARCH QUESTION"
+
+     prompt = format_user_prompt(question, evidence, iteration=5, max_iterations=10)
+
+     # Question should appear at START and END of prompt
+     lines = prompt.split("\n")
+
+     # Check start (first few lines)
+     start_content = "\n".join(lines[:10])
+     assert question in start_content
+
+     # Check end (last few lines)
+     end_content = "\n".join(lines[-10:])
+     assert question in end_content
+     assert "REMINDER: Original Question" in end_content