# SPEC 02: End-to-End Testing

## Priority: P1 (Validation Before Features)

## Problem Statement

We have 141 unit tests that verify individual components work, but **no test that proves the full pipeline produces useful research output**.

We don't know if:

1. Simple mode produces a valid report
2. Advanced mode produces a valid report
3. The output is actually useful (has citations, mechanisms, etc.)

**Golden Rule**: Don't add features (OpenAlex, persistence) until we prove current features work.

## What We Need to Test

### Level 1: Smoke Test (Does it run?)

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.e2e
async def test_simple_mode_completes(mock_search_handler, mock_judge_handler):
    """Verify Simple mode runs without crashing."""
    from src.orchestrator import Orchestrator
    from src.utils.models import OrchestratorConfig

    config = OrchestratorConfig(max_iterations=2)
    orchestrator = Orchestrator(
        search_handler=mock_search_handler,
        judge_handler=mock_judge_handler,
        config=config,
        enable_analysis=False,
        enable_embeddings=False,
    )

    # Drain the event stream so the pipeline runs to the end
    events = []
    async for event in orchestrator.run("test query"):
        events.append(event)

    # Must complete
    assert any(e.type == "complete" for e in events)
    # Must not error
    assert not any(e.type == "error" for e in events)
```

### Level 2: Structure Test (Is output valid?)

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.e2e
async def test_output_has_required_fields():
    """Verify output contains expected structure."""
    # run_research is the pipeline entry point under test; this spec does not define it
    result = await run_research("metformin for PCOS")

    # Must have citations
    assert len(result.citations) >= 1
    # Must have some text
    assert len(result.report) > 100
    # Must mention the query topic
    assert "metformin" in result.report.lower() or "pcos" in result.report.lower()
```
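
The structure test calls `run_research`, which this spec does not define. One possible shape for such a helper is sketched below, assuming the orchestrator yields events with `type`, `message`, and `data` attributes and that the final `complete` event carries the report. `ResearchResult`, `FakeEvent`, `FakeOrchestrator`, the `data["citations"]` key, and the placeholder citation string are all hypothetical stand-ins so the sketch runs on its own; the real event model may differ.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator


@dataclass
class ResearchResult:
    """Hypothetical result shape exposing the fields the tests read."""
    report: str
    citations: list = field(default_factory=list)


@dataclass
class FakeEvent:
    """Stand-in for the orchestrator's event objects (shape is assumed)."""
    type: str
    message: str = ""
    data: dict = field(default_factory=dict)


class FakeOrchestrator:
    """Stand-in for the real Orchestrator, only here to make the sketch runnable."""

    async def run(self, query: str) -> AsyncIterator[FakeEvent]:
        yield FakeEvent(type="searching", message=f"Searching for {query}")
        yield FakeEvent(
            type="complete",
            message=f"Report on {query}: ...",
            data={"citations": ["PMID: 12345678"]},  # placeholder, not a real reference
        )


async def run_research(query: str, orchestrator: Any = None) -> ResearchResult:
    """Drain the event stream and package the final event as a result."""
    orchestrator = orchestrator or FakeOrchestrator()
    final = None
    async for event in orchestrator.run(query):
        if event.type == "error":
            raise RuntimeError(f"Pipeline failed: {event.message}")
        if event.type == "complete":
            final = event
    if final is None:
        raise RuntimeError("Pipeline ended without a 'complete' event")
    return ResearchResult(
        report=final.message,
        citations=final.data.get("citations", []),
    )
```

Raising on a missing `complete` event keeps the Level 2 and Level 3 tests from silently asserting against an empty result.
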
### Level 3: Quality Test (Is output useful?)

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.e2e
async def test_output_quality():
    """Verify output contains actionable research."""
    result = await run_research("drugs for female libido")

    # Should have PMIDs or NCT IDs
    has_citations = any(
        "PMID" in str(c) or "NCT" in str(c)
        for c in result.citations
    )
    assert has_citations, "No real citations found"

    # Should discuss mechanism
    mechanism_words = ["mechanism", "pathway", "receptor", "target"]
    has_mechanism = any(w in result.report.lower() for w in mechanism_words)
    assert has_mechanism, "No mechanism discussion found"
```
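
The substring check for `"NCT"` also matches ordinary words such as "DYSFUNCTION", so a citation title alone could satisfy it. A stricter variant is sketched below, assuming citations stringify to text that includes the identifier. `has_real_identifier` is a hypothetical helper name, and the PMID pattern only assumes the label is followed by digits.

```python
import re

# "PMID" followed by an optional colon, optional whitespace, then digits.
PMID_RE = re.compile(r"\bPMID:?\s*\d+\b", re.IGNORECASE)
# ClinicalTrials.gov identifiers are "NCT" followed by eight digits.
NCT_RE = re.compile(r"\bNCT\d{8}\b", re.IGNORECASE)


def has_real_identifier(citation: object) -> bool:
    """Return True if the citation text contains a PMID or NCT identifier."""
    text = str(citation)
    return bool(PMID_RE.search(text) or NCT_RE.search(text))
```

In the quality test this would replace the inline condition: `any(has_real_identifier(c) for c in result.citations)`.
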

## Test Strategy

### Mocking Strategy

For CI/fast tests, mock external APIs via pytest fixtures in `tests/e2e/conftest.py`:

```python
import pytest


@pytest.fixture
def mock_search_handler():
    """Return a mock search handler that returns fake evidence."""
    from unittest.mock import MagicMock

    from src.utils.models import Citation, Evidence, SearchResult

    async def mock_execute(query: str):
        return SearchResult(
            evidence=[
                Evidence(
                    content="Study on test query showing positive results...",
                    citation=Citation(
                        source="pubmed",
                        title="Study on test query",
                        url="https://pubmed.example.com/123",
                        date="2024",
                    ),
                )
            ],
            sources_searched=["pubmed", "clinicaltrials"],
        )

    # Attach the coroutine function so `await handler.execute(...)` works
    mock = MagicMock()
    mock.execute = mock_execute
    return mock


@pytest.fixture
def mock_judge_handler():
    """Return a mock judge that always says 'synthesize'."""
    from unittest.mock import MagicMock

    from src.utils.models import JudgeAssessment

    async def mock_assess(evidence, query):
        return JudgeAssessment(
            sufficient=True,
            reasoning="Mock: Evidence is sufficient",
            suggested_refinements=[],
            key_findings=["Finding 1", "Finding 2"],
            evidence_gaps=[],
            recommended_drugs=["MockDrug A", "MockDrug B"],
        )

    # Attach the coroutine function so `await handler.assess(...)` works
    mock = MagicMock()
    mock.assess = mock_assess
    return mock
```
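
Assigning a plain `async def` to a `MagicMock` attribute works, but the calls are not recorded. If a test also needs to assert how the handler was called, `unittest.mock.AsyncMock` is one alternative. The sketch below uses a trimmed stand-in, `FakeAssessment`, in place of `JudgeAssessment` so it runs without the project installed; `make_mock_judge` and `demo` are illustrative names.

```python
import asyncio
from dataclasses import dataclass, field
from unittest.mock import AsyncMock, MagicMock


@dataclass
class FakeAssessment:
    """Stand-in for JudgeAssessment (fields trimmed for brevity)."""
    sufficient: bool
    reasoning: str
    key_findings: list = field(default_factory=list)


def make_mock_judge() -> MagicMock:
    """Build a judge whose `assess` is awaitable and records each await."""
    mock = MagicMock()
    mock.assess = AsyncMock(
        return_value=FakeAssessment(
            sufficient=True,
            reasoning="Mock: Evidence is sufficient",
        )
    )
    return mock


async def demo() -> FakeAssessment:
    judge = make_mock_judge()
    result = await judge.assess(["evidence"], "test query")
    # Fails with AssertionError if assess was not awaited exactly once with these args
    judge.assess.assert_awaited_once_with(["evidence"], "test query")
    return result
```

Running `asyncio.run(demo())` returns the canned assessment and checks the call in one pass.
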

### Integration Tests (Real APIs)

For validation, run against real APIs (marked `@pytest.mark.integration`):

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.integration
@pytest.mark.slow
async def test_real_pubmed_search():
    """Integration test with real PubMed API."""
    # Requires NCBI_API_KEY in env
    ...
```
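
Because the integration tests require `NCBI_API_KEY`, it can help to skip them automatically when the key is absent rather than let them fail. The following is a minimal sketch of a top-level `conftest.py` that does this with standard pytest hooks. The file placement, the `REQUIRED_ENV` tuple, the helper name `missing_integration_env`, and the marker descriptions are assumptions, not part of the existing suite.

```python
import os
from typing import Mapping

# Environment variables the integration tests need (assumed list)
REQUIRED_ENV = ("NCBI_API_KEY",)


def missing_integration_env(environ: Mapping[str, str] = os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not environ.get(name)]


def pytest_configure(config):
    # Register the custom markers so pytest does not warn about unknown marks
    config.addinivalue_line("markers", "e2e: full pipeline test with mocked handlers")
    config.addinivalue_line("markers", "integration: hits real external APIs")
    config.addinivalue_line("markers", "slow: long-running test")


def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily so the helper above works without pytest

    missing = missing_integration_env()
    if not missing:
        return
    skip = pytest.mark.skip(reason=f"missing env: {', '.join(missing)}")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)
```

With this in place, `pytest -m "not integration"` keeps CI runs on mocks only, while a plain `pytest` run on a machine without the key reports the integration tests as skipped.
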

## Test Matrix

| Mode | Mock | Real API | Status |
|------|------|----------|--------|
| Simple (Free) | ✅ Done | ⏳ Optional | ✅ IMPLEMENTED |
| Advanced (OpenAI) | ✅ Done | ⏳ Optional | ✅ IMPLEMENTED |

## Directory Structure

```
tests/
├── unit/                       # Existing 141 tests
├── integration/                # Real API tests (existing)
└── e2e/                        # NEW: Full pipeline tests
    ├── conftest.py             # E2E fixtures
    ├── test_simple_mode.py     # Simple mode E2E
    └── test_advanced_mode.py   # Magentic mode E2E
```

## Acceptance Criteria

- [x] E2E test for Simple mode (mocked)
- [x] E2E test for Advanced mode (mocked)
- [x] Tests validate output structure
- [x] Tests run in CI (<2 minutes)
- [ ] At least one integration test with real API (existing in tests/integration/)

**Status: IMPLEMENTED** (commit b1d094d)

## Why Before OpenAlex?

1. **Prove current system works** before adding complexity
2. **Establish baseline** - what does "good output" look like?
3. **Catch regressions** - future changes can't silently break core functionality
4. **Confidence for hackathon** - we know the demo will produce something

## Related Issues

- #47: E2E Testing - Does Pipeline Actually Generate Useful Reports?
- #65: Demo timing (must fix first to make E2E tests practical)

## Files Created

1. `tests/e2e/conftest.py` - E2E fixtures (mock_search_handler, mock_judge_handler)
2. `tests/e2e/test_simple_mode.py` - Simple mode tests (2 tests)
3. `tests/e2e/test_advanced_mode.py` - Advanced mode tests (1 test, mocked workflow)