VibecoderMcSwaggins committed
Commit 9d02fee · 1 Parent(s): 864d85d

docs: add specs for P0 termination fix and P1 E2E testing


SPEC_01: Demo Termination Fix
- Investigate if max_round_count actually works
- Add hard timeout (5 min) for hackathon
- Add round progress indicator

SPEC_02: E2E Testing
- Smoke tests (does it run?)
- Structure tests (is output valid?)
- Quality tests (is output useful?)
- Must pass BEFORE adding new features (OpenAlex, etc.)

docs/specs/SPEC_01_DEMO_TERMINATION.md ADDED
# SPEC 01: Demo Termination & Timing Fix

## Priority: P0 (Hackathon Blocker)

## Problem Statement

Advanced (Magentic) mode runs indefinitely from the user's perspective. The demo was manually terminated after ~10 minutes without reaching synthesis.

**Root Cause Hypothesis**: We're trusting `agent_framework.MagenticBuilder.max_round_count` to enforce termination, but:
1. We don't know how the framework counts "rounds"
2. Our `iteration` counter only tracks `MagenticAgentMessageEvent`, not all framework rounds
3. Manager coordination messages (JUDGING) happen between rounds and don't count

## Investigation Required

### Question 1: Does max_round_count actually work?

```python
# Current code (src/orchestrator_magentic.py:112)
.with_standard_manager(
    chat_client=manager_client,
    max_round_count=self._max_rounds,  # Default: 10
    max_stall_count=3,
    max_reset_count=2,
)
```

**Test**: Set `max_round_count=2` and verify termination.

### Question 2: What counts as a "round"?

From demo output:
- `JUDGING` (Manager) - many of these
- `SEARCH_COMPLETE` (Agent)
- `HYPOTHESIZING` (Agent)
- `JUDGE_COMPLETE` (Agent)
- `STREAMING` (Delta events)

Is one "round" = one full cycle of all agents, or one agent message?

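One way to answer this empirically is to capture a run and tally event types. A minimal sketch (the event names come from the demo output above; the tally helper and the agent/manager split are new, not existing code):

```python
from collections import Counter

# Event names observed in the demo output above; JUDGING comes from the
# manager, the other events from worker agents
AGENT_EVENTS = {"SEARCH_COMPLETE", "HYPOTHESIZING", "JUDGE_COMPLETE"}

def summarize_run(event_types: list[str]) -> tuple[Counter, int]:
    """Tally all event types, and separately count only agent messages
    (what our `iteration` counter tracks) for comparison with
    max_round_count."""
    counts = Counter(event_types)
    agent_total = sum(n for name, n in counts.items() if name in AGENT_EVENTS)
    return counts, agent_total

# Example: a captured event-type sequence from one run
counts, agent_total = summarize_run(
    ["JUDGING", "SEARCH_COMPLETE", "JUDGING", "HYPOTHESIZING",
     "JUDGING", "JUDGE_COMPLETE", "STREAMING"]
)
# If the framework stops well before agent_total reaches max_round_count,
# manager turns likely count as rounds too.
```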
### Question 3: Why no final synthesis?

The demo showed lots of evidence gathering but never reached `ReportAgent`. Either:
1. JudgeAgent never said `sufficient=True`
2. Framework terminated before synthesis (unlikely given time)
3. Something else broke the flow

## Proposed Solutions

### Option A: Add Hard Timeout (Recommended for Hackathon)

```python
# src/orchestrator_magentic.py
import asyncio
from collections.abc import AsyncGenerator

async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
    # ...existing setup...

    DEMO_TIMEOUT_SECONDS = 300  # 5 minutes max

    try:
        # asyncio.timeout() requires Python 3.11+; when the deadline
        # passes it raises TimeoutError out of the async with block
        async with asyncio.timeout(DEMO_TIMEOUT_SECONDS):
            async for event in workflow.run_stream(task):
                ...  # existing processing
    except TimeoutError:
        yield AgentEvent(
            type="complete",
            message="Research timed out. Synthesizing available evidence...",
            data={"reason": "timeout", "iterations": iteration},
            iteration=iteration,
        )
        # Attempt to synthesize whatever we have
```

### Option B: Reduce max_rounds AND Add Progress

```python
# Lower the round count AND show which round we're on
max_round_count=5,  # Was 10
```

Plus yield round number:
```python
yield AgentEvent(
    type="progress",
    message=f"Round {round_num}/{max_rounds}...",
    iteration=round_num,
)
```

### Option C: Force Synthesis After N Evidence Items

```python
# In judge logic
if len(evidence) >= 20:
    return "synthesize"  # We have enough, stop searching
```

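A runnable sketch of Option C as a pure function (the function name, the `sufficient` flag, and the cap of 20 are assumptions for illustration, not current code):

```python
EVIDENCE_CAP = 20  # assumed threshold; tune for the demo

def judge_next_action(evidence: list, sufficient: bool) -> str:
    """Return the next workflow step: synthesize once the judge is
    satisfied OR the evidence cap is hit, so search can never loop
    forever."""
    if sufficient or len(evidence) >= EVIDENCE_CAP:
        return "synthesize"  # We have enough, stop searching
    return "search"
```

Keeping this as a pure function makes it trivially unit-testable, independent of the orchestration framework.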
## Acceptance Criteria

- [ ] Demo completes in <5 minutes with visible progress
- [ ] User sees round count (e.g., "Round 3/5")
- [ ] Always produces SOME output (even if partial)
- [ ] Timeout prevents infinite running

## Test Plan

```python
import time

import pytest

@pytest.mark.asyncio
async def test_magentic_terminates_within_timeout():
    """Verify demo completes in reasonable time."""
    orchestrator = MagenticOrchestrator(max_rounds=3)

    events = []
    start = time.time()

    async for event in orchestrator.run("simple test query"):
        events.append(event)
        if time.time() - start > 120:  # 2 min max for test
            pytest.fail("Orchestrator did not terminate")

    # Must have a completion event
    assert any(e.type == "complete" for e in events)
```

## Related Issues

- #65: P1: Advanced Mode takes too long for hackathon demo
- #47: E2E Testing

## Files to Modify

1. `src/orchestrator_magentic.py` - Add timeout and progress
2. `src/app.py` - Display round progress in UI
3. `tests/unit/test_magentic_termination.py` - Add timeout test
docs/specs/SPEC_02_E2E_TESTING.md ADDED
# SPEC 02: End-to-End Testing

## Priority: P1 (Validation Before Features)

## Problem Statement

We have 142 unit tests that verify individual components work, but **no test that proves the full pipeline produces useful research output**.

We don't know if:
1. Simple mode produces a valid report
2. Advanced mode produces a valid report
3. The output is actually useful (has citations, mechanisms, etc.)

**Golden Rule**: Don't add features (OpenAlex, persistence) until we prove current features work.

## What We Need to Test

### Level 1: Smoke Test (Does it run?)

```python
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_simple_mode_completes():
    """Verify Simple mode runs without crashing."""
    # Mock the search tools to avoid real API calls
    orchestrator = create_test_orchestrator(mode="simple")

    events = []
    async for event in orchestrator.run("test query"):
        events.append(event)

    # Must complete
    assert any(e.type == "complete" for e in events)
    # Must not error
    assert not any(e.type == "error" for e in events)
```

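The smoke test leans on a `create_test_orchestrator` helper that doesn't exist yet. A self-contained sketch of the shape it could return (the `FakeEvent` fields and event names here are assumptions modeled on the `AgentEvent` usage in SPEC 01):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class FakeEvent:
    """Stand-in for AgentEvent with just the fields the tests inspect."""
    type: str
    message: str = ""

class FakeOrchestrator:
    """Stand-in exposing the same async-iteration surface as the real
    orchestrator, but yielding canned events instead of calling APIs."""
    async def run(self, query: str):
        yield FakeEvent("search", f"searching: {query}")
        yield FakeEvent("complete", "done")

async def collect(orchestrator, query: str):
    return [e async for e in orchestrator.run(query)]

events = asyncio.run(collect(FakeOrchestrator(), "test query"))
```

With something like this behind `create_test_orchestrator`, the smoke test runs in milliseconds and never touches the network.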
### Level 2: Structure Test (Is output valid?)

```python
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_output_has_required_fields():
    """Verify output contains expected structure."""
    result = await run_research("metformin for PCOS")

    # Must have citations
    assert len(result.citations) >= 1

    # Must have some text
    assert len(result.report) > 100

    # Must mention the query topic
    assert "metformin" in result.report.lower() or "pcos" in result.report.lower()
```

### Level 3: Quality Test (Is output useful?)

```python
@pytest.mark.e2e
@pytest.mark.asyncio
async def test_output_quality():
    """Verify output contains actionable research."""
    result = await run_research("drugs for female libido")

    # Should have PMIDs or NCT IDs
    has_citations = any(
        "PMID" in str(c) or "NCT" in str(c)
        for c in result.citations
    )
    assert has_citations, "No real citations found"

    # Should discuss mechanism
    mechanism_words = ["mechanism", "pathway", "receptor", "target"]
    has_mechanism = any(w in result.report.lower() for w in mechanism_words)
    assert has_mechanism, "No mechanism discussion found"
```

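The PMID/NCT check could live in a small pure helper so the quality test and any future report validation share one definition. A sketch (`has_real_citation` and the regex patterns are hypothetical names, not existing code; PMIDs are 1-8 digit PubMed IDs, NCT numbers are "NCT" plus 8 digits):

```python
import re

# PubMed IDs ("PMID: 12345678") and ClinicalTrials.gov IDs ("NCT01234567")
PMID_RE = re.compile(r"PMID:?\s*\d{1,8}")
NCT_RE = re.compile(r"NCT\d{8}")

def has_real_citation(citations: list[str]) -> bool:
    """True if any citation string carries a PMID or NCT identifier."""
    return any(PMID_RE.search(c) or NCT_RE.search(c) for c in citations)
```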
## Test Strategy

### Mocking Strategy

For CI/fast tests, mock external APIs:

```python
@pytest.fixture
def mock_pubmed():
    """Return realistic but fake PubMed results."""
    return [
        Evidence(
            content="Metformin improves insulin sensitivity...",
            citation=Citation(
                source="pubmed",
                title="Metformin in PCOS: A Meta-Analysis",
                url="https://pubmed.ncbi.nlm.nih.gov/12345678/",
                date="2024",
            ),
        )
    ]
```

### Integration Tests (Real APIs)

For validation, run against real APIs (marked `@pytest.mark.integration`):

```python
@pytest.mark.integration
@pytest.mark.slow
async def test_real_pubmed_search():
    """Integration test with real PubMed API."""
    # Requires NCBI_API_KEY in env
    ...
```

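The custom markers used above (`e2e`, `integration`, `slow`) need registering or pytest will warn on every run. One way, assuming registration lives in the planned `tests/e2e/conftest.py` rather than `pytest.ini`, is the standard `pytest_configure` hook:

```python
# tests/e2e/conftest.py (sketch)
def pytest_configure(config):
    # Register the custom markers from this spec so pytest doesn't emit
    # PytestUnknownMarkWarning for them
    for marker in (
        "e2e: full-pipeline test",
        "integration: hits real external APIs",
        "slow: long-running test",
    ):
        config.addinivalue_line("markers", marker)
```

With markers registered, CI can select subsets, e.g. `pytest -m "e2e and not integration"`.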
## Test Matrix

| Mode | Mock | Real API | Status |
|------|------|----------|--------|
| Simple (Free) | ✅ Need | ⏳ Optional | Not implemented |
| Advanced (OpenAI) | ✅ Need | ⏳ Optional | Not implemented |

## Directory Structure

```
tests/
├── unit/                     # Existing 142 tests
├── integration/              # Real API tests (existing)
└── e2e/                      # NEW: Full pipeline tests
    ├── conftest.py           # E2E fixtures
    ├── test_simple_mode.py   # Simple mode E2E
    └── test_advanced_mode.py # Magentic mode E2E
```

## Acceptance Criteria

- [ ] E2E test for Simple mode (mocked)
- [ ] E2E test for Advanced mode (mocked)
- [ ] Tests validate output structure
- [ ] Tests run in CI (<2 minutes)
- [ ] At least one integration test with real API

## Why Before OpenAlex?

1. **Prove current system works** before adding complexity
2. **Establish baseline** - what does "good output" look like?
3. **Catch regressions** - future changes won't break core functionality
4. **Confidence for hackathon** - we know the demo will produce something

## Related Issues

- #47: E2E Testing - Does Pipeline Actually Generate Useful Reports?
- #65: Demo timing (must fix first to make E2E tests practical)

## Files to Create

1. `tests/e2e/conftest.py` - E2E fixtures and mocks
2. `tests/e2e/test_simple_mode.py` - Simple mode tests
3. `tests/e2e/test_advanced_mode.py` - Advanced mode tests