# SPEC 02: End-to-End Testing
## Priority: P1 (Validation Before Features)
## Problem Statement
We have 141 unit tests that verify individual components work, but **no test that proves the full pipeline produces useful research output**.
We don't know whether:
1. Simple mode produces a valid report
2. Advanced mode produces a valid report
3. The output is actually useful (has citations, mechanisms, etc.)
**Golden Rule**: Don't add features (OpenAlex, persistence) until we prove current features work.
## What We Need to Test
### Level 1: Smoke Test (Does it run?)
```python
@pytest.mark.asyncio
@pytest.mark.e2e
async def test_simple_mode_completes(mock_search_handler, mock_judge_handler):
"""Verify Simple mode runs without crashing."""
from src.orchestrator import Orchestrator
from src.utils.models import OrchestratorConfig
config = OrchestratorConfig(max_iterations=2)
orchestrator = Orchestrator(
search_handler=mock_search_handler,
judge_handler=mock_judge_handler,
config=config,
enable_analysis=False,
enable_embeddings=False,
)
events = []
async for event in orchestrator.run("test query"):
events.append(event)
# Must complete
assert any(e.type == "complete" for e in events)
# Must not error
assert not any(e.type == "error" for e in events)
```
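The assertions encode the streaming contract the rest of this spec leans on: `run()` is an async generator of typed events, and a healthy run emits a `complete` event and no `error` events.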
### Level 2: Structure Test (Is output valid?)
```python
@pytest.mark.asyncio
@pytest.mark.e2e
async def test_output_has_required_fields():
"""Verify output contains expected structure."""
result = await run_research("metformin for PCOS")
# Must have citations
assert len(result.citations) >= 1
# Must have some text
assert len(result.report) > 100
# Must mention the query topic
assert "metformin" in result.report.lower() or "pcos" in result.report.lower()
```
### Level 3: Quality Test (Is output useful?)
```python
@pytest.mark.asyncio
@pytest.mark.e2e
async def test_output_quality():
"""Verify output contains actionable research."""
result = await run_research("drugs for female libido")
# Should have PMIDs or NCT IDs
has_citations = any(
"PMID" in str(c) or "NCT" in str(c)
for c in result.citations
)
assert has_citations, "No real citations found"
# Should discuss mechanism
mechanism_words = ["mechanism", "pathway", "receptor", "target"]
has_mechanism = any(w in result.report.lower() for w in mechanism_words)
assert has_mechanism, "No mechanism discussion found"
```
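Both Level 2 and Level 3 call `run_research`, which this spec never defines. A minimal sketch of such a helper, assuming the Level 1 event API and that the terminal `complete` event exposes the report text and citations (the `ResearchResult` container and those field names are illustrative, not the project's actual models):
```python
from dataclasses import dataclass, field

from src.orchestrator import Orchestrator
from src.utils.models import OrchestratorConfig


@dataclass
class ResearchResult:
    """Illustrative container; the real pipeline may use its own model."""
    report: str = ""
    citations: list = field(default_factory=list)


async def run_research(query: str, search_handler=None, judge_handler=None) -> ResearchResult:
    """Run the full pipeline and fold its event stream into one result."""
    # Handlers default to None here; real tests would inject mocks or
    # real implementations, matching the Level 1 constructor.
    orchestrator = Orchestrator(
        search_handler=search_handler,
        judge_handler=judge_handler,
        config=OrchestratorConfig(max_iterations=2),
    )
    result = ResearchResult()
    async for event in orchestrator.run(query):
        if event.type == "complete":
            # Assumed: the terminal event carries the synthesized report
            # and the accumulated citations.
            result.report = getattr(event, "report", "")
            result.citations = list(getattr(event, "citations", []))
    return result
```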
## Test Strategy
### Mocking Strategy
For CI/fast tests, mock external APIs via pytest fixtures in `tests/e2e/conftest.py`:
```python
@pytest.fixture
def mock_search_handler():
"""Return a mock search handler that returns fake evidence."""
from unittest.mock import MagicMock
from src.utils.models import Citation, Evidence, SearchResult
async def mock_execute(query: str):
return SearchResult(
evidence=[
Evidence(
content="Study on test query showing positive results...",
citation=Citation(
source="pubmed",
title="Study on test query",
url="https://pubmed.example.com/123",
date="2024",
),
)
],
sources_searched=["pubmed", "clinicaltrials"],
)
mock = MagicMock()
mock.execute = mock_execute
    return mock

@pytest.fixture
def mock_judge_handler():
"""Return a mock judge that always says 'synthesize'."""
from unittest.mock import MagicMock
from src.utils.models import JudgeAssessment
async def mock_assess(evidence, query):
return JudgeAssessment(
sufficient=True,
reasoning="Mock: Evidence is sufficient",
suggested_refinements=[],
key_findings=["Finding 1", "Finding 2"],
evidence_gaps=[],
recommended_drugs=["MockDrug A", "MockDrug B"],
)
mock = MagicMock()
mock.assess = mock_assess
return mock
```
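A note on the design: plain `async def` functions assigned onto a `MagicMock` are awaitable as-is, so the fixtures stay simple; `unittest.mock.AsyncMock` would work equally well, but explicit coroutines keep the canned `Evidence` and `JudgeAssessment` payloads visible in one place.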
### Integration Tests (Real APIs)
For validation, run against real APIs (marked `@pytest.mark.integration`):
```python
@pytest.mark.asyncio
@pytest.mark.integration
@pytest.mark.slow
async def test_real_pubmed_search():
"""Integration test with real PubMed API."""
# Requires NCBI_API_KEY in env
...
```
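Fleshed out, such a test might look like the sketch below. `PubMedSearchHandler` and its import path are assumptions inferred from the mock handler's `execute(query) -> SearchResult` contract, not a confirmed project API:
```python
import os

import pytest


@pytest.mark.asyncio
@pytest.mark.integration
@pytest.mark.slow
@pytest.mark.skipif("NCBI_API_KEY" not in os.environ, reason="requires NCBI_API_KEY")
async def test_real_pubmed_search():
    # Hypothetical handler; any real implementation of the mock's
    # execute(query) -> SearchResult contract would slot in here.
    from src.tools.pubmed import PubMedSearchHandler  # assumed path

    handler = PubMedSearchHandler(api_key=os.environ["NCBI_API_KEY"])
    result = await handler.execute("metformin polycystic ovary syndrome")

    assert result.evidence, "expected at least one PubMed hit"
    assert "pubmed" in result.sources_searched
```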
## Test Matrix
| Mode | Mock | Real API | Status |
|------|------|----------|--------|
| Simple (Free) | βœ… Done | ⏳ Optional | βœ… IMPLEMENTED |
| Advanced (OpenAI) | βœ… Done | ⏳ Optional | βœ… IMPLEMENTED |
## Directory Structure
```
tests/
β”œβ”€β”€ unit/ # Existing 141 tests
β”œβ”€β”€ integration/ # Real API tests (existing)
└── e2e/ # NEW: Full pipeline tests
β”œβ”€β”€ conftest.py # E2E fixtures
β”œβ”€β”€ test_simple_mode.py # Simple mode E2E
└── test_advanced_mode.py # Magentic mode E2E
```
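One wiring detail the tree implies but does not show: the custom marks (`e2e`, `integration`, `slow`) must be registered, or pytest emits `PytestUnknownMarkWarning`. A sketch using the `pytest_configure` hook (an equivalent `pytest.ini`/`pyproject.toml` entry works too):
```python
# tests/conftest.py (sketch): register the custom marks used in this spec.
def pytest_configure(config):
    for mark, help_text in [
        ("e2e", "full-pipeline tests with mocked externals"),
        ("integration", "tests that hit real external APIs"),
        ("slow", "long-running tests excluded from the fast CI lane"),
    ]:
        config.addinivalue_line("markers", f"{mark}: {help_text}")
```
CI can then stay under the two-minute budget by selecting the fast lane, e.g. `pytest -m "e2e and not integration"`.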
## Acceptance Criteria
- [x] E2E test for Simple mode (mocked)
- [x] E2E test for Advanced mode (mocked)
- [x] Tests validate output structure
- [x] Tests run in CI (<2 minutes)
- [ ] At least one integration test with real API (existing in tests/integration/)
**Status: IMPLEMENTED** (commit b1d094d)
## Why Before OpenAlex?
1. **Prove current system works** before adding complexity
2. **Establish baseline** - what does "good output" look like?
3. **Catch regressions** - future changes won't break core functionality
4. **Confidence for hackathon** - we know the demo will produce something
## Related Issues
- #47: E2E Testing - Does Pipeline Actually Generate Useful Reports?
- #65: Demo timing (must fix first to make E2E tests practical)
## Files Created
1. `tests/e2e/conftest.py` - E2E fixtures (mock_search_handler, mock_judge_handler)
2. `tests/e2e/test_simple_mode.py` - Simple mode tests (2 tests)
3. `tests/e2e/test_advanced_mode.py` - Advanced mode tests (1 test, mocked workflow)