# SPEC 02: End-to-End Testing
## Priority: P1 (Validation Before Features)
## Problem Statement
We have 141 unit tests that verify individual components work, but **no test that proves the full pipeline produces useful research output**.
We don't know if:
1. Simple mode produces a valid report
2. Advanced mode produces a valid report
3. The output is actually useful (has citations, mechanisms, etc.)
**Golden Rule**: Don't add features (OpenAlex, persistence) until we prove current features work.
## What We Need to Test
### Level 1: Smoke Test (Does it run?)
```python
import pytest


@pytest.mark.asyncio
@pytest.mark.e2e
async def test_simple_mode_completes(mock_search_handler, mock_judge_handler):
    """Verify Simple mode runs without crashing."""
    from src.orchestrator import Orchestrator
    from src.utils.models import OrchestratorConfig

    config = OrchestratorConfig(max_iterations=2)
    orchestrator = Orchestrator(
        search_handler=mock_search_handler,
        judge_handler=mock_judge_handler,
        config=config,
        enable_analysis=False,
        enable_embeddings=False,
    )

    events = []
    async for event in orchestrator.run("test query"):
        events.append(event)

    # Must complete
    assert any(e.type == "complete" for e in events)
    # Must not error
    assert not any(e.type == "error" for e in events)
```
### Level 2: Structure Test (Is output valid?)
```python
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_output_has_required_fields():
    """Verify output contains expected structure."""
    result = await run_research("metformin for PCOS")

    # Must have citations
    assert len(result.citations) >= 1
    # Must have some text
    assert len(result.report) > 100
    # Must mention the query topic
    assert "metformin" in result.report.lower() or "pcos" in result.report.lower()
```
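The Level 2 and Level 3 tests assume a `run_research` helper that drives the pipeline to completion and returns the final result; it is not defined in this spec. A minimal sketch of what it could look like, assuming the orchestrator's `complete` event carries a payload exposing `.report` and `.citations` (both assumptions):

```python
# Hypothetical helper assumed by the Level 2/3 tests; not part of the current
# codebase. The event payload shape (.data exposing .report and .citations)
# is an assumption.
from src.orchestrator import Orchestrator
from src.utils.models import OrchestratorConfig


async def run_research(query: str, **handlers):
    """Drive the orchestrator to completion and return the final result."""
    orchestrator = Orchestrator(config=OrchestratorConfig(max_iterations=2), **handlers)
    final = None
    async for event in orchestrator.run(query):
        if event.type == "complete":
            final = event.data  # assumed to expose .report and .citations
    assert final is not None, "Pipeline never emitted a 'complete' event"
    return final
```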
### Level 3: Quality Test (Is output useful?)
```python
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_output_quality():
    """Verify output contains actionable research."""
    result = await run_research("drugs for female libido")

    # Should have PMIDs or NCT IDs
    has_citations = any(
        "PMID" in str(c) or "NCT" in str(c)
        for c in result.citations
    )
    assert has_citations, "No real citations found"

    # Should discuss mechanism
    mechanism_words = ["mechanism", "pathway", "receptor", "target"]
    has_mechanism = any(w in result.report.lower() for w in mechanism_words)
    assert has_mechanism, "No mechanism discussion found"
```
## Test Strategy
### Mocking Strategy
For CI/fast tests, mock external APIs via pytest fixtures in `tests/e2e/conftest.py`:
```python
import pytest


@pytest.fixture
def mock_search_handler():
    """Return a mock search handler that returns fake evidence."""
    from unittest.mock import MagicMock

    from src.utils.models import Citation, Evidence, SearchResult

    async def mock_execute(query: str):
        return SearchResult(
            evidence=[
                Evidence(
                    content="Study on test query showing positive results...",
                    citation=Citation(
                        source="pubmed",
                        title="Study on test query",
                        url="https://pubmed.example.com/123",
                        date="2024",
                    ),
                )
            ],
            sources_searched=["pubmed", "clinicaltrials"],
        )

    mock = MagicMock()
    mock.execute = mock_execute
    return mock


@pytest.fixture
def mock_judge_handler():
    """Return a mock judge that always says 'synthesize'."""
    from unittest.mock import MagicMock

    from src.utils.models import JudgeAssessment

    async def mock_assess(evidence, query):
        return JudgeAssessment(
            sufficient=True,
            reasoning="Mock: Evidence is sufficient",
            suggested_refinements=[],
            key_findings=["Finding 1", "Finding 2"],
            evidence_gaps=[],
            recommended_drugs=["MockDrug A", "MockDrug B"],
        )

    mock = MagicMock()
    mock.assess = mock_assess
    return mock
```
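The `e2e`, `integration`, and `slow` markers used throughout this spec must be registered, or pytest emits `PytestUnknownMarkWarning`. If they are not already declared in `pytest.ini` or `pyproject.toml`, a minimal sketch registering them from a root `conftest.py` (the marker descriptions are ours):

```python
# conftest.py (project root): register the custom markers used in this spec.
# Skip this if the markers are already declared in pytest.ini/pyproject.toml.
def pytest_configure(config):
    config.addinivalue_line("markers", "e2e: full-pipeline tests with mocked externals")
    config.addinivalue_line("markers", "integration: tests that hit real external APIs")
    config.addinivalue_line("markers", "slow: long-running tests excluded from fast CI")
```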
### Integration Tests (Real APIs)
For validation, run against real APIs (marked `@pytest.mark.integration`):
```python
@pytest.mark.integration
@pytest.mark.slow
@pytest.mark.asyncio
async def test_real_pubmed_search():
    """Integration test with real PubMed API."""
    # Requires NCBI_API_KEY in env
    ...
```
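A fleshed-out version could follow the sketch below. The `PubMedSearchHandler` name and its import path are assumptions for illustration, not the actual handler class:

```python
import os

import pytest


@pytest.mark.integration
@pytest.mark.slow
@pytest.mark.asyncio
async def test_real_pubmed_search_returns_evidence():
    """Sketch: hit the real PubMed API and check the result shape."""
    if not os.environ.get("NCBI_API_KEY"):
        pytest.skip("NCBI_API_KEY not set")

    # Hypothetical import path and class name; adjust to the real handler.
    from src.tools.pubmed import PubMedSearchHandler

    handler = PubMedSearchHandler()
    result = await handler.execute("metformin for PCOS")

    assert result.evidence, "Expected at least one evidence item from PubMed"
    assert "pubmed" in result.sources_searched
```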
## Test Matrix
| Mode | Mock | Real API | Status |
|------|------|----------|--------|
| Simple (Free) | ✅ Done | ⏳ Optional | ✅ IMPLEMENTED |
| Advanced (OpenAI) | ✅ Done | ⏳ Optional | ✅ IMPLEMENTED |
## Directory Structure
```
tests/
├── unit/                      # Existing 141 tests
├── integration/               # Real API tests (existing)
└── e2e/                       # NEW: Full pipeline tests
    ├── conftest.py            # E2E fixtures
    ├── test_simple_mode.py    # Simple mode E2E
    └── test_advanced_mode.py  # Magentic mode E2E
```
## Acceptance Criteria
- [x] E2E test for Simple mode (mocked)
- [x] E2E test for Advanced mode (mocked)
- [x] Tests validate output structure
- [x] Tests run in CI (<2 minutes)
- [ ] At least one integration test with real API (existing in tests/integration/)
**Status: IMPLEMENTED** (commit b1d094d)
## Why Before OpenAlex?
1. **Prove current system works** before adding complexity
2. **Establish baseline** - what does "good output" look like?
3. **Catch regressions** - future changes won't break core functionality
4. **Confidence for hackathon** - we know the demo will produce something
## Related Issues
- #47: E2E Testing - Does Pipeline Actually Generate Useful Reports?
- #65: Demo timing (must fix first to make E2E tests practical)
## Files Created
1. `tests/e2e/conftest.py` - E2E fixtures (mock_search_handler, mock_judge_handler)
2. `tests/e2e/test_simple_mode.py` - Simple mode tests (2 tests)
3. `tests/e2e/test_advanced_mode.py` - Advanced mode tests (1 test, mocked workflow)