Commit · 9d02fee
Parent(s): 864d85d

docs: add specs for P0 termination fix and P1 E2E testing

SPEC_01: Demo Termination Fix
- Investigate if max_round_count actually works
- Add hard timeout (5 min) for hackathon
- Add round progress indicator
SPEC_02: E2E Testing
- Smoke tests (does it run?)
- Structure tests (is output valid?)
- Quality tests (is output useful?)
- Must pass BEFORE adding new features (OpenAlex, etc.)
docs/specs/SPEC_01_DEMO_TERMINATION.md
ADDED
@@ -0,0 +1,136 @@
# SPEC 01: Demo Termination & Timing Fix

## Priority: P0 (Hackathon Blocker)

## Problem Statement

From the user's perspective, Advanced (Magentic) mode runs indefinitely. The demo was manually terminated after ~10 minutes without reaching synthesis.

**Root Cause Hypothesis**: We're trusting `agent_framework.MagenticBuilder.max_round_count` to enforce termination, but:
1. We don't know how the framework counts "rounds"
2. Our `iteration` counter only tracks `MagenticAgentMessageEvent`, not all framework rounds
3. Manager coordination messages (JUDGING) happen between rounds and don't count

## Investigation Required

### Question 1: Does max_round_count actually work?

```python
# Current code (src/orchestrator_magentic.py:112)
.with_standard_manager(
    chat_client=manager_client,
    max_round_count=self._max_rounds,  # Default: 10
    max_stall_count=3,
    max_reset_count=2,
)
```

**Test**: Set `max_round_count=2` and verify termination.

### Question 2: What counts as a "round"?

From demo output:
- `JUDGING` (Manager) - many of these
- `SEARCH_COMPLETE` (Agent)
- `HYPOTHESIZING` (Agent)
- `JUDGE_COMPLETE` (Agent)
- `STREAMING` (Delta events)

Is one "round" = one full cycle of all agents? Or one agent message?

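One way to approach Questions 1 and 2 empirically is to tally events by type over a single run and compare the totals with the configured `max_round_count`. The sketch below is a minimal, self-contained illustration. `fake_stream`, `ManagerMessageEvent`, and `StreamingDeltaEvent` are stand-ins invented for this example, not the framework's real types (only the name `MagenticAgentMessageEvent` comes from this spec), and a real run would iterate over `workflow.run_stream(task)` instead.

```python
import asyncio
from collections import Counter
from typing import AsyncIterator


# Stand-in event classes for illustration only; the real framework types may differ.
class MagenticAgentMessageEvent: ...
class ManagerMessageEvent: ...  # hypothetical name
class StreamingDeltaEvent: ...  # hypothetical name


async def fake_stream() -> AsyncIterator[object]:
    """Hypothetical stream that mimics the mix of events seen in the demo."""
    for event in (
        ManagerMessageEvent(),
        StreamingDeltaEvent(),
        MagenticAgentMessageEvent(),
        ManagerMessageEvent(),
        MagenticAgentMessageEvent(),
    ):
        yield event


async def tally_event_types(stream: AsyncIterator[object]) -> Counter:
    """Count how many events of each class the stream yields."""
    counts: Counter = Counter()
    async for event in stream:
        counts[type(event).__name__] += 1
    return counts


counts = asyncio.run(tally_event_types(fake_stream()))
print(dict(counts))
```

If the number of agent messages stays at or below `max_round_count` while manager events keep growing, that would suggest the framework counts only agent turns as rounds; the tally alone cannot prove this, but it narrows the question.
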
### Question 3: Why no final synthesis?

The demo showed lots of evidence gathering but never reached `ReportAgent`. Either:
1. JudgeAgent never said "sufficient=True"
2. Framework terminated before synthesis (unlikely given time)
3. Something else broke the flow

## Proposed Solutions

### Option A: Add Hard Timeout (Recommended for Hackathon)

```python
# src/orchestrator_magentic.py
import asyncio
from typing import AsyncGenerator

async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
    # ...existing setup...

    DEMO_TIMEOUT_SECONDS = 300  # 5 minutes max

    try:
        # asyncio.timeout requires Python 3.11+
        async with asyncio.timeout(DEMO_TIMEOUT_SECONDS):
            async for event in workflow.run_stream(task):
                # ...existing processing...

    except TimeoutError:
        yield AgentEvent(
            type="complete",
            message="Research timed out. Synthesizing available evidence...",
            data={"reason": "timeout", "iterations": iteration},
            iteration=iteration,
        )
        # Attempt to synthesize whatever we have
```

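`asyncio.timeout` is only available on Python 3.11 and later. If the project has to run on an older interpreter, the same hard cap can be approximated by awaiting each item with `asyncio.wait_for` against a shared deadline. The following is a minimal sketch under that assumption; `iter_with_deadline` and `slow_events` are hypothetical names introduced for this example and are not part of the codebase.

```python
import asyncio
import time
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def iter_with_deadline(stream: AsyncIterator[T], timeout_s: float) -> AsyncIterator[T]:
    """Yield items from `stream` until `timeout_s` seconds have elapsed overall.

    Raises asyncio.TimeoutError once the deadline passes, including while
    waiting for the next item, so a stalled stream cannot hang the caller.
    """
    deadline = time.monotonic() + timeout_s
    iterator = stream.__aiter__()
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise asyncio.TimeoutError
        try:
            item = await asyncio.wait_for(iterator.__anext__(), timeout=remaining)
        except StopAsyncIteration:
            return
        yield item


async def slow_events() -> AsyncIterator[int]:
    """Hypothetical stand-in for a workflow stream that never finishes."""
    n = 0
    while True:
        await asyncio.sleep(0.05)
        yield n
        n += 1


async def demo():
    seen = []
    timed_out = False
    try:
        async for event in iter_with_deadline(slow_events(), timeout_s=0.3):
            seen.append(event)
    except asyncio.TimeoutError:
        # This is where a partial-result "complete" event would be emitted.
        timed_out = True
    return seen, timed_out


seen, timed_out = asyncio.run(demo())
print(f"received {len(seen)} events, timed_out={timed_out}")
```

The deadline is shared across the whole stream rather than reset per item, which matches the intent of a hard demo cap: many fast events cannot extend the total run time.
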
### Option B: Reduce max_rounds AND Add Progress

```python
# Lower the round count AND show which round we're on
max_round_count=5,  # Was 10
```

Plus yield round number:
```python
yield AgentEvent(
    type="progress",
    message=f"Round {round_num}/{max_rounds}...",
    iteration=round_num,
)
```

### Option C: Force Synthesis After N Evidence Items

```python
# In judge logic
if len(evidence) >= 20:
    return "synthesize"  # We have enough, stop searching
```

## Acceptance Criteria

- [ ] Demo completes in <5 minutes with visible progress
- [ ] User sees round count (e.g., "Round 3/5")
- [ ] Always produces SOME output (even if partial)
- [ ] Timeout prevents infinite running

## Test Plan

```python
import time

import pytest

from src.orchestrator_magentic import MagenticOrchestrator  # assumed import path


@pytest.mark.asyncio
async def test_magentic_terminates_within_timeout():
    """Verify demo completes in reasonable time."""
    orchestrator = MagenticOrchestrator(max_rounds=3)

    events = []
    start = time.time()

    async for event in orchestrator.run("simple test query"):
        events.append(event)
        # Note: this check only runs when an event arrives
        if time.time() - start > 120:  # 2 min max for test
            pytest.fail("Orchestrator did not terminate")

    # Must have a completion event
    assert any(e.type == "complete" for e in events)
```

## Related Issues

- #65: P1: Advanced Mode takes too long for hackathon demo
- #47: E2E Testing

## Files to Modify

1. `src/orchestrator_magentic.py` - Add timeout and progress
2. `src/app.py` - Display round progress in UI
3. `tests/unit/test_magentic_termination.py` - Add timeout test

docs/specs/SPEC_02_E2E_TESTING.md
ADDED
@@ -0,0 +1,157 @@
# SPEC 02: End-to-End Testing

## Priority: P1 (Validation Before Features)

## Problem Statement

We have 142 unit tests that verify individual components work, but **no test that proves the full pipeline produces useful research output**.

We don't know if:
1. Simple mode produces a valid report
2. Advanced mode produces a valid report
3. The output is actually useful (has citations, mechanisms, etc.)

**Golden Rule**: Don't add features (OpenAlex, persistence) until we prove current features work.

## What We Need to Test

### Level 1: Smoke Test (Does it run?)

```python
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_simple_mode_completes():
    """Verify Simple mode runs without crashing."""
    from src.orchestrator import Orchestrator

    # Mock the search tools to avoid real API calls
    orchestrator = create_test_orchestrator(mode="simple")

    events = []
    async for event in orchestrator.run("test query"):
        events.append(event)

    # Must complete
    assert any(e.type == "complete" for e in events)
    # Must not error
    assert not any(e.type == "error" for e in events)
```

### Level 2: Structure Test (Is output valid?)

```python
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_output_has_required_fields():
    """Verify output contains expected structure."""
    result = await run_research("metformin for PCOS")

    # Must have citations
    assert len(result.citations) >= 1

    # Must have some text
    assert len(result.report) > 100

    # Must mention the query topic
    assert "metformin" in result.report.lower() or "pcos" in result.report.lower()
```

### Level 3: Quality Test (Is output useful?)

```python
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_output_quality():
    """Verify output contains actionable research."""
    result = await run_research("drugs for female libido")

    # Should have PMIDs or NCT IDs
    has_citations = any(
        "PMID" in str(c) or "NCT" in str(c)
        for c in result.citations
    )
    assert has_citations, "No real citations found"

    # Should discuss mechanism
    mechanism_words = ["mechanism", "pathway", "receptor", "target"]
    has_mechanism = any(w in result.report.lower() for w in mechanism_words)
    assert has_mechanism, "No mechanism discussion found"
```

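The substring check in the Level 3 test looks for the literal text `PMID` or `NCT`, which a citation that carries only a PubMed URL would not contain. A somewhat more robust check could extract identifiers by pattern. The sketch below is an illustration under assumptions: it treats citations as plain strings, assumes PubMed URLs of the form `https://pubmed.ncbi.nlm.nih.gov/<digits>/` (as in the mock fixture in this spec), and assumes ClinicalTrials.gov identifiers of the form `NCT` followed by eight digits. `extract_ids` is a hypothetical helper, not an existing function in the project.

```python
import re

# PMID either as a labelled number ("PMID: 12345678") or inside a PubMed URL.
PMID_PATTERN = re.compile(
    r"(?:PMID:?\s*|pubmed\.ncbi\.nlm\.nih\.gov/)(\d{1,8})", re.IGNORECASE
)
# ClinicalTrials.gov registry numbers: "NCT" followed by exactly eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def extract_ids(text: str) -> dict[str, list[str]]:
    """Return the PMIDs and NCT IDs found in a citation string."""
    return {
        "pmid": PMID_PATTERN.findall(text),
        "nct": NCT_PATTERN.findall(text),
    }


samples = [
    "https://pubmed.ncbi.nlm.nih.gov/12345678/",
    "See PMID: 23456789 and trial NCT01234567",
    "https://example.org/blog/post",
]
results = [extract_ids(s) for s in samples]
for sample, found in zip(samples, results):
    print(sample, "->", found)
```

In the quality test, the assertion could then become "at least one citation yields a non-empty `pmid` or `nct` list", applied to `str(c)` for each citation.
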
## Test Strategy

### Mocking Strategy

For CI/fast tests, mock external APIs:

```python
@pytest.fixture
def mock_pubmed():
    """Return realistic but fake PubMed results."""
    return [
        Evidence(
            content="Metformin improves insulin sensitivity...",
            citation=Citation(
                source="pubmed",
                title="Metformin in PCOS: A Meta-Analysis",
                url="https://pubmed.ncbi.nlm.nih.gov/12345678/",
                date="2024",
            )
        )
    ]
```

### Integration Tests (Real APIs)

For validation, run against real APIs (marked `@pytest.mark.integration`):

```python
@pytest.mark.integration
@pytest.mark.slow
@pytest.mark.asyncio
async def test_real_pubmed_search():
    """Integration test with real PubMed API."""
    # Requires NCBI_API_KEY in env
    ...
```

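The tests in this spec use three custom markers (`e2e`, `integration`, `slow`). pytest warns about unregistered markers and rejects them under `--strict-markers`, so they typically need to be registered somewhere. One option, sketched below, is a `pytest_configure` hook in a `conftest.py`. The marker descriptions are illustrative wording, and whether the project already registers markers elsewhere (for example in `pyproject.toml`) is not known from this spec.

```python
# conftest.py (sketch)
CUSTOM_MARKERS = (
    "e2e: full-pipeline tests that run against mocked external APIs",
    "integration: tests that call real external APIs",
    "slow: tests that take noticeably longer than unit tests",
)


def pytest_configure(config):
    """Register the custom markers so pytest does not warn about unknown marks."""
    for marker in CUSTOM_MARKERS:
        config.addinivalue_line("markers", marker)
```

With the markers registered, the mocked suite could be selected with `pytest -m e2e`, and real-API tests excluded from CI with `pytest -m "not integration"`.
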
## Test Matrix

| Mode | Mock | Real API | Status |
|------|------|----------|--------|
| Simple (Free) | ✅ Need | ⏳ Optional | Not implemented |
| Advanced (OpenAI) | ✅ Need | ⏳ Optional | Not implemented |

## Directory Structure

```
tests/
├── unit/                      # Existing 142 tests
├── integration/               # Real API tests (existing)
└── e2e/                       # NEW: Full pipeline tests
    ├── conftest.py            # E2E fixtures
    ├── test_simple_mode.py    # Simple mode E2E
    └── test_advanced_mode.py  # Magentic mode E2E
```

## Acceptance Criteria

- [ ] E2E test for Simple mode (mocked)
- [ ] E2E test for Advanced mode (mocked)
- [ ] Tests validate output structure
- [ ] Tests run in CI (<2 minutes)
- [ ] At least one integration test with real API

## Why Before OpenAlex?

1. **Prove current system works** before adding complexity
2. **Establish baseline** - what does "good output" look like?
3. **Catch regressions** - future changes won't break core functionality
4. **Confidence for hackathon** - we know the demo will produce something

## Related Issues

- #47: E2E Testing - Does Pipeline Actually Generate Useful Reports?
- #65: Demo timing (must fix first to make E2E tests practical)

## Files to Create

1. `tests/e2e/conftest.py` - E2E fixtures and mocks
2. `tests/e2e/test_simple_mode.py` - Simple mode tests
3. `tests/e2e/test_advanced_mode.py` - Advanced mode tests