# Phase 3 Implementation Spec: Judge Vertical Slice

**Goal**: Implement the "Brain" of the agent — evaluating evidence quality.

**Philosophy**: "Structured Output or Bust."

**Estimated Effort**: 3-4 hours

**Prerequisite**: Phase 2 complete

---

## 1. The Slice Definition

This slice covers:

1. **Input**: Question + list of `Evidence`.
2. **Process**:
   - Construct a prompt from the question and evidence.
   - Call the LLM (via PydanticAI).
   - Parse the response into a `JudgeAssessment`.
3. **Output**: `JudgeAssessment` object.

**Files**:

- `src/utils/models.py`: Add Judge models
- `src/prompts/judge.py`: Prompt templates
- `src/agent_factory/judges.py`: Handler logic
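Once those pieces exist, the slice should be exercisable end to end, roughly like this (a sketch; the example question is illustrative, and a real run needs an LLM API key configured):

```python
import asyncio

from src.agent_factory.judges import JudgeHandler
from src.utils.models import Evidence  # Phase 2 model

async def main() -> None:
    handler = JudgeHandler()
    # Normally populated by the Phase 2 search tools; empty here for illustration.
    evidence: list[Evidence] = []
    assessment = await handler.assess(
        "Can metformin be repurposed for glioblastoma?", evidence
    )
    print(assessment.recommendation, assessment.reasoning)

asyncio.run(main())
```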
---

## 2. Models (`src/utils/models.py`)

Add these to the existing models file:
```python
from typing import Literal  # make sure this import exists in models.py


class DrugCandidate(BaseModel):
    """A potential drug repurposing candidate."""

    drug_name: str = Field(..., description="Name of the drug")
    original_indication: str = Field(..., description="What the drug was originally approved for")
    proposed_indication: str = Field(..., description="The new proposed use")
    mechanism: str = Field(..., description="Proposed mechanism of action")
    evidence_strength: Literal["weak", "moderate", "strong"] = Field(
        ...,
        description="Strength of supporting evidence",
    )


class JudgeAssessment(BaseModel):
    """The judge's assessment of the collected evidence."""

    sufficient: bool = Field(
        ...,
        description="Is there enough evidence to write a report?",
    )
    recommendation: Literal["continue", "synthesize"] = Field(
        ...,
        description="Should we search more or synthesize a report?",
    )
    reasoning: str = Field(
        ...,
        max_length=500,
        description="Explanation of the assessment",
    )
    overall_quality_score: int = Field(
        ...,
        ge=0,
        le=10,
        description="Overall quality of evidence (0-10)",
    )
    coverage_score: int = Field(
        ...,
        ge=0,
        le=10,
        description="How well the evidence covers the query (0-10)",
    )
    candidates: list[DrugCandidate] = Field(
        default_factory=list,
        description="Drug candidates identified in the evidence",
    )
    next_search_queries: list[str] = Field(
        default_factory=list,
        max_length=5,
        description="Suggested follow-up queries if more evidence is needed",
    )
    gaps: list[str] = Field(
        default_factory=list,
        description="Information gaps identified in the current evidence",
    )
```
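Because the constraints live on the model, Pydantic rejects out-of-range LLM output at parse time. A quick sanity check (not part of the slice itself):

```python
from pydantic import ValidationError

from src.utils.models import JudgeAssessment

try:
    JudgeAssessment(
        sufficient=True,
        recommendation="synthesize",
        reasoning="ok",
        overall_quality_score=12,  # violates le=10, so validation fails
        coverage_score=8,
    )
except ValidationError as exc:
    print(exc)  # reports that overall_quality_score must be <= 10
```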
---

## 3. Prompts (`src/prompts/judge.py`)

```python
"""Prompt templates for the Judge."""

from typing import List

from src.utils.models import Evidence

JUDGE_SYSTEM_PROMPT = """You are a biomedical research quality assessor specializing in drug repurposing.

Your job is to evaluate evidence retrieved from PubMed and web searches, and decide if:
1. There is SUFFICIENT evidence to write a research report
2. More searching is needed to fill gaps

## Evaluation Criteria

### For "sufficient" = True (ready to synthesize):
- At least 3 relevant pieces of evidence
- At least one peer-reviewed source (PubMed)
- Clear mechanism of action identified
- Drug candidates with at least "moderate" evidence strength

### For "sufficient" = False (continue searching):
- Fewer than 3 relevant pieces
- No clear drug candidates identified
- Major gaps in mechanism understanding
- All evidence is low quality

## Output Requirements
- Be STRICT. Only mark sufficient=True if evidence is genuinely adequate
- Always provide reasoning for your decision
- If continuing, suggest SPECIFIC, ACTIONABLE search queries
- Identify concrete gaps, not vague statements

## Important
- You are assessing DRUG REPURPOSING potential
- Focus on: mechanism of action, existing clinical data, safety profile
- Ignore marketing content or non-scientific sources"""

def format_evidence_for_prompt(evidence_list: List[Evidence]) -> str:
    """Format an evidence list into a string for the prompt."""
    if not evidence_list:
        return "NO EVIDENCE COLLECTED YET"

    formatted = []
    for i, ev in enumerate(evidence_list, 1):
        formatted.append(f"""
--- Evidence {i} ---
Source: {ev.citation.source.upper()}
Title: {ev.citation.title}
Date: {ev.citation.date}
URL: {ev.citation.url}
Content:
{ev.content[:1500]}
---""")
    return "\n".join(formatted)

def build_judge_user_prompt(question: str, evidence: List[Evidence]) -> str:
    """Build the user prompt for the judge."""
    evidence_text = format_evidence_for_prompt(evidence)
    return f"""## Research Question
{question}

## Collected Evidence ({len(evidence)} pieces)
{evidence_text}

## Your Task
Assess the evidence above and provide your structured assessment.
If evidence is insufficient, suggest 2-3 specific follow-up search queries."""
```
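A quick smoke test of the builders. The `Evidence`/`Citation` field names below are inferred from `format_evidence_for_prompt` and may need adjusting to the real Phase 2 models:

```python
from src.prompts.judge import build_judge_user_prompt
from src.utils.models import Citation, Evidence

# Hypothetical field values; the constructor signatures are assumptions.
ev = Evidence(
    content="Metformin inhibits mTOR signalling in glioma cell lines...",
    citation=Citation(
        source="pubmed",
        title="Metformin and glioma: a review",
        date="2023-05-01",
        url="https://example.org/placeholder",
    ),
)
print(build_judge_user_prompt("Can metformin be repurposed for glioblastoma?", [ev]))
```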
---

## 4. Handler (`src/agent_factory/judges.py`)

```python
"""Judge handler - evaluates evidence quality."""

from typing import List

import structlog
from pydantic_ai import Agent
from pydantic_ai.models.anthropic import AnthropicModel
from pydantic_ai.models.openai import OpenAIModel
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

from src.prompts.judge import JUDGE_SYSTEM_PROMPT, build_judge_user_prompt
from src.utils.config import settings
from src.utils.exceptions import JudgeError
from src.utils.models import Evidence, JudgeAssessment

logger = structlog.get_logger()


def get_llm_model():
    """Get the configured LLM model for PydanticAI."""
    if settings.llm_provider == "openai":
        return OpenAIModel(
            settings.llm_model,
            api_key=settings.get_api_key(),
        )
    elif settings.llm_provider == "anthropic":
        return AnthropicModel(
            settings.llm_model,
            api_key=settings.get_api_key(),
        )
    else:
        raise JudgeError(f"Unknown LLM provider: {settings.llm_provider}")


# Initialize the agent once at module load
judge_agent = Agent(
    model=get_llm_model(),
    result_type=JudgeAssessment,
    system_prompt=JUDGE_SYSTEM_PROMPT,
)


class JudgeHandler:
    """Handles evidence assessment using the LLM."""

    def __init__(self, agent: Agent | None = None):
        """
        Initialize the judge handler.

        Args:
            agent: Optional PydanticAI agent (for testing injection)
        """
        self.agent = agent or judge_agent
        self._call_count = 0

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type((TimeoutError, ConnectionError)),
        reraise=True,
    )
    async def assess(
        self,
        question: str,
        evidence: List[Evidence],
    ) -> JudgeAssessment:
        """
        Assess the quality and sufficiency of evidence.

        Args:
            question: The original research question
            evidence: List of Evidence objects to assess

        Returns:
            JudgeAssessment with decision and recommendations

        Raises:
            JudgeError: If assessment fails after retries
        """
        logger.info(
            "Starting evidence assessment",
            question=question[:100],
            evidence_count=len(evidence),
        )
        self._call_count += 1

        # Build the prompt
        user_prompt = build_judge_user_prompt(question, evidence)

        try:
            # Run the agent - PydanticAI handles structured output
            result = await self.agent.run(user_prompt)
            # result.data is already a JudgeAssessment (typed!)
            assessment = result.data
            logger.info(
                "Assessment complete",
                sufficient=assessment.sufficient,
                recommendation=assessment.recommendation,
                quality_score=assessment.overall_quality_score,
                candidates_found=len(assessment.candidates),
            )
            return assessment
        except Exception as e:
            logger.error("Judge assessment failed", error=str(e))
            raise JudgeError(f"Failed to assess evidence: {e}") from e

    async def should_continue(self, assessment: JudgeAssessment) -> bool:
        """
        Decide whether the search loop should continue based on the assessment.

        Returns:
            True if we should search more, False if we should stop
            (synthesize or give up).
        """
        return not assessment.sufficient and assessment.recommendation == "continue"

    @property
    def call_count(self) -> int:
        """Number of LLM calls made (for budget tracking)."""
        return self._call_count
```
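The handler imports `JudgeError` and `settings` from utility modules that earlier phases should provide. If they don't exist yet, minimal stand-ins might look like this (a sketch; the `Settings` field names are assumptions to be matched to your actual config):

```python
# src/utils/exceptions.py
class JudgeError(Exception):
    """Raised when the judge fails to produce an assessment."""


# src/utils/config.py -- pydantic-settings sketch; field names are assumptions
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    llm_provider: str = "openai"  # "openai" or "anthropic"
    llm_model: str = "gpt-4o-mini"
    openai_api_key: str = ""
    anthropic_api_key: str = ""

    def get_api_key(self) -> str:
        """Return the key for the configured provider."""
        if self.llm_provider == "anthropic":
            return self.anthropic_api_key
        return self.openai_api_key

settings = Settings()
```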
---

## 5. TDD Workflow

### Test File: `tests/unit/agent_factory/test_judges.py`

```python
"""Unit tests for JudgeHandler."""

import pytest
from unittest.mock import AsyncMock, MagicMock


class TestJudgeHandler:
    @pytest.mark.asyncio
    async def test_assess_returns_assessment(self):
        from src.agent_factory.judges import JudgeHandler
        from src.utils.models import JudgeAssessment

        # Mock the PydanticAI agent result
        mock_result = MagicMock()
        mock_result.data = JudgeAssessment(
            sufficient=True,
            recommendation="synthesize",
            reasoning="Good",
            overall_quality_score=8,
            coverage_score=8,
        )
        mock_agent = AsyncMock()
        mock_agent.run = AsyncMock(return_value=mock_result)

        handler = JudgeHandler(agent=mock_agent)
        result = await handler.assess("q", [])

        assert result.sufficient is True

    @pytest.mark.asyncio
    async def test_should_continue(self):
        from src.agent_factory.judges import JudgeHandler
        from src.utils.models import JudgeAssessment

        handler = JudgeHandler(agent=AsyncMock())

        # Continue case
        assess1 = JudgeAssessment(
            sufficient=False,
            recommendation="continue",
            reasoning="Need more",
            overall_quality_score=5,
            coverage_score=5,
        )
        assert await handler.should_continue(assess1) is True

        # Stop case
        assess2 = JudgeAssessment(
            sufficient=True,
            recommendation="synthesize",
            reasoning="Done",
            overall_quality_score=8,
            coverage_score=8,
        )
        assert await handler.should_continue(assess2) is False
```
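The two tests above cover the happy paths only. A failure-path case worth adding (it relies on `assess` wrapping unexpected errors in `JudgeError`, as implemented in Section 4):

```python
    # Add inside TestJudgeHandler:
    @pytest.mark.asyncio
    async def test_assess_wraps_agent_errors(self):
        from src.agent_factory.judges import JudgeHandler
        from src.utils.exceptions import JudgeError

        mock_agent = AsyncMock()
        mock_agent.run = AsyncMock(side_effect=RuntimeError("boom"))
        handler = JudgeHandler(agent=mock_agent)

        with pytest.raises(JudgeError):
            await handler.assess("q", [])
```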
---

## 6. Implementation Checklist

- [ ] Update `src/utils/models.py` with the Judge models
- [ ] Create `src/prompts/judge.py`
- [ ] Implement `src/agent_factory/judges.py`
- [ ] Write tests in `tests/unit/agent_factory/test_judges.py`
- [ ] Run `uv run pytest tests/unit/agent_factory/`
| ``` |