VibecoderMcSwaggins committed 25 days ago

Commit

9d02fee

1 Parent(s): 864d85d

docs: add specs for P0 termination fix and P1 E2E testing

SPEC_01: Demo Termination Fix
- Investigate if max_round_count actually works
- Add hard timeout (5 min) for hackathon
- Add round progress indicator

SPEC_02: E2E Testing
- Smoke tests (does it run?)
- Structure tests (is output valid?)
- Quality tests (is output useful?)
- Must pass BEFORE adding new features (OpenAlex, etc.)

Files changed (2)

docs/specs/SPEC_01_DEMO_TERMINATION.md +136 -0
docs/specs/SPEC_02_E2E_TESTING.md +157 -0

docs/specs/SPEC_01_DEMO_TERMINATION.md ADDED

	@@ -0,0 +1,136 @@

+# SPEC 01: Demo Termination & Timing Fix
+## Priority: P0 (Hackathon Blocker)
+## Problem Statement
+Advanced (Magentic) mode appears to run indefinitely from the user's perspective. The demo was manually terminated after ~10 minutes without reaching synthesis.
+**Root Cause Hypothesis**: We're trusting `agent_framework.MagenticBuilder.max_round_count` to enforce termination, but:
+1. We don't know how the framework counts "rounds"
+2. Our `iteration` counter only tracks `MagenticAgentMessageEvent`, not all framework rounds
+3. Manager coordination messages (JUDGING) happen between rounds and don't count
+## Investigation Required
+### Question 1: Does max_round_count actually work?
+```python
+# Current code (src/orchestrator_magentic.py:112)
+.with_standard_manager(
+    chat_client=manager_client,
+    max_round_count=self._max_rounds,  # Default: 10
+    max_stall_count=3,
+    max_reset_count=2,
+)
+```
+**Test**: Set `max_round_count=2` and verify termination.
+### Question 2: What counts as a "round"?
+From demo output:
+- `JUDGING` (Manager) - many of these
+- `SEARCH_COMPLETE` (Agent)
+- `HYPOTHESIZING` (Agent)
+- `JUDGE_COMPLETE` (Agent)
+- `STREAMING` (Delta events)
+Is one "round" = one full cycle of all agents? Or one agent message?
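One way to answer this empirically is to tally event types from a run made with a small `max_round_count` and see which tally matches the configured limit. The sketch below is a minimal helper under stated assumptions: events are reduced to the type labels seen in the demo output, `JUDGING` is treated as the only manager event, and `STREAMING` deltas are not counted as agent messages. The function name and the sample sequence are hypothetical; neither comes from the codebase or from a real run.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumption: labels as they appear in the demo output
MANAGER_TYPES = frozenset({"JUDGING"})
DELTA_TYPES = frozenset({"STREAMING"})


def tally_events(event_types: Iterable[str]) -> Dict[str, int]:
    """Split a stream of event-type labels into manager, agent, and delta counts."""
    counts = Counter(event_types)
    manager = sum(n for t, n in counts.items() if t in MANAGER_TYPES)
    deltas = sum(n for t, n in counts.items() if t in DELTA_TYPES)
    total = sum(counts.values())
    return {
        "manager_messages": manager,
        "agent_messages": total - manager - deltas,
        "streaming_deltas": deltas,
        "total_events": total,
    }


# Made-up sequence, for illustration only
sample = [
    "JUDGING", "STREAMING", "SEARCH_COMPLETE",
    "JUDGING", "HYPOTHESIZING",
    "JUDGING", "STREAMING", "JUDGE_COMPLETE",
]
summary = tally_events(sample)
print(summary)
```

If a run with `max_round_count=2` stops at `agent_messages == 2`, a round is probably one agent message; if it stops at a multiple of the agent count, a round is more likely one full cycle.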
+### Question 3: Why no final synthesis?
+The demo showed lots of evidence gathering but never reached `ReportAgent`. Either:
+1. JudgeAgent never said "sufficient=True"
+2. Framework terminated before synthesis (unlikely, given the run lasted ~10 minutes)
+3. Something else broke the flow
+## Proposed Solutions
+### Option A: Add Hard Timeout (Recommended for Hackathon)
+```python
+# src/orchestrator_magentic.py
+import asyncio
+from collections.abc import AsyncGenerator
+async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
+    # ...existing setup...
+    DEMO_TIMEOUT_SECONDS = 300  # 5 minutes max
+    try:
+        # asyncio.timeout() requires Python 3.11+
+        async with asyncio.timeout(DEMO_TIMEOUT_SECONDS):
+            async for event in workflow.run_stream(task):
+                ...  # existing processing (placeholder so the loop body is valid)
+    except TimeoutError:
+        yield AgentEvent(
+            type="complete",
+            message="Research timed out. Synthesizing available evidence...",
+            data={"reason": "timeout", "iterations": iteration},
+            iteration=iteration,
+        )
+        # Attempt to synthesize whatever we have
+```
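One caveat with Option A: `asyncio.timeout()` works by cancelling the current task, so if the existing processing yields events inside the `async with` block and the deadline passes while the generator is suspended at a `yield`, the cancellation may surface in the consumer rather than in the `except TimeoutError` branch. A variant that avoids this applies the remaining time budget to each fetch of the next event instead. The sketch below is self-contained and illustrative only: `stream_with_deadline` and the `("event" | "complete" | "timeout", payload)` tuples are hypothetical stand-ins for `AgentEvent`, and it assumes the source is an async iterator like the one `workflow.run_stream(task)` is used as above.

```python
from __future__ import annotations

import asyncio
import time
from collections.abc import AsyncIterator
from typing import Any


async def stream_with_deadline(
    source: AsyncIterator[Any], timeout_s: float
) -> AsyncIterator[tuple[str, Any]]:
    """Relay items from source, then yield exactly one terminal pair.

    The terminal pair is ("complete", n) if the source finished on its own,
    or ("timeout", n) if the overall deadline passed first; n is the number
    of items relayed.
    """
    deadline = time.monotonic() + timeout_s
    relayed = 0
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                yield ("timeout", relayed)
                return
            try:
                # Only the fetch is time-limited; no yield happens inside it
                item = await asyncio.wait_for(source.__anext__(), timeout=remaining)
            except StopAsyncIteration:
                yield ("complete", relayed)
                return
            except asyncio.TimeoutError:
                yield ("timeout", relayed)
                return
            relayed += 1
            yield ("event", item)
    finally:
        # Close the source if it supports it (async generators do)
        aclose = getattr(source, "aclose", None)
        if aclose is not None:
            await aclose()
```

On the `"timeout"` branch the caller still holds whatever evidence was relayed so far, which is the point at which a partial synthesis could be attempted.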
+### Option B: Reduce max_rounds AND Add Progress
+```python
+# Lower the round count AND show which round we're on
+max_round_count=5,  # Was 10
+```
+Plus, yield the round number:
+```python
+yield AgentEvent(
+    type="progress",
+    message=f"Round {round_num}/{max_rounds}...",
+    iteration=round_num,
+)
+```
+### Option C: Force Synthesis After N Evidence Items
+```python
+# In judge logic
+if len(evidence) >= 20:
+    return "synthesize"  # We have enough, stop searching
+```
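Option C can be written as a small pure function so the cap is unit-testable without running any agents. This is a sketch under assumptions: the `"search"` return value, the `judge_sufficient` flag, and the function name are hypothetical; only the threshold of 20 and the `"synthesize"` result come from the snippet above.

```python
from typing import Sequence

MAX_EVIDENCE_ITEMS = 20  # threshold proposed in Option C


def next_action(
    evidence: Sequence[object],
    judge_sufficient: bool,
    max_items: int = MAX_EVIDENCE_ITEMS,
) -> str:
    """Return "synthesize" once the judge is satisfied or the evidence cap is hit."""
    if judge_sufficient or len(evidence) >= max_items:
        return "synthesize"  # we have enough, stop searching
    return "search"  # hypothetical label for "keep gathering evidence"
```

With this shape, the cap acts as a backstop: synthesis is forced even if the judge never reports sufficiency.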
+## Acceptance Criteria
+- [ ] Demo completes in <5 minutes with visible progress
+- [ ] User sees round count (e.g., "Round 3/5")
+- [ ] Always produces SOME output (even if partial)
+- [ ] Timeout prevents infinite running
+## Test Plan
+```python
+import asyncio
+import pytest
+@pytest.mark.asyncio
+async def test_magentic_terminates_within_timeout():
+    """Verify demo completes in reasonable time."""
+    orchestrator = MagenticOrchestrator(max_rounds=3)
+    async def collect():
+        return [event async for event in orchestrator.run("simple test query")]
+    try:
+        # Enforce the limit even if the stream stalls without yielding events
+        events = await asyncio.wait_for(collect(), timeout=120)  # 2 min max for test
+    except asyncio.TimeoutError:
+        pytest.fail("Orchestrator did not terminate")
+    # Must have a completion event
+    assert any(e.type == "complete" for e in events)
+```
+## Related Issues
+- #65: P1: Advanced Mode takes too long for hackathon demo
+- #47: E2E Testing
+## Files to Modify
+1. `src/orchestrator_magentic.py` - Add timeout and progress
+2. `src/app.py` - Display round progress in UI
+3. `tests/unit/test_magentic_termination.py` - Add timeout test

docs/specs/SPEC_02_E2E_TESTING.md ADDED

	@@ -0,0 +1,157 @@

+# SPEC 02: End-to-End Testing
+## Priority: P1 (Validation Before Features)
+## Problem Statement
+We have 142 unit tests that verify individual components work, but **no test that proves the full pipeline produces useful research output**.
+We don't know if:
+1. Simple mode produces a valid report
+2. Advanced mode produces a valid report
+3. The output is actually useful (has citations, mechanisms, etc.)
+**Golden Rule**: Don't add features (OpenAlex, persistence) until we prove current features work.
+## What We Need to Test
+### Level 1: Smoke Test (Does it run?)
+```python
+@pytest.mark.e2e
+@pytest.mark.asyncio  # as in SPEC_01; not needed if pytest-asyncio runs in auto mode
+async def test_simple_mode_completes():
+    """Verify Simple mode runs without crashing."""
+    from src.orchestrator import Orchestrator
+    # Mock the search tools to avoid real API calls
+    orchestrator = create_test_orchestrator(mode="simple")
+    events = []
+    async for event in orchestrator.run("test query"):
+        events.append(event)
+    # Must complete
+    assert any(e.type == "complete" for e in events)
+    # Must not error
+    assert not any(e.type == "error" for e in events)
+```
+### Level 2: Structure Test (Is output valid?)
+```python
+@pytest.mark.e2e
+@pytest.mark.asyncio  # as in SPEC_01; not needed if pytest-asyncio runs in auto mode
+async def test_output_has_required_fields():
+    """Verify output contains expected structure."""
+    result = await run_research("metformin for PCOS")
+    # Must have citations
+    assert len(result.citations) >= 1
+    # Must have some text
+    assert len(result.report) > 100
+    # Must mention the query topic
+    assert "metformin" in result.report.lower() or "pcos" in result.report.lower()
+```
+### Level 3: Quality Test (Is output useful?)
+```python
+@pytest.mark.e2e
+@pytest.mark.asyncio  # as in SPEC_01; not needed if pytest-asyncio runs in auto mode
+async def test_output_quality():
+    """Verify output contains actionable research."""
+    result = await run_research("drugs for female libido")
+    # Should have PMIDs or NCT IDs
+    has_citations = any(
+        "PMID" in str(c) or "NCT" in str(c)
+        for c in result.citations
+    )
+    assert has_citations, "No real citations found"
+    # Should discuss mechanism
+    mechanism_words = ["mechanism", "pathway", "receptor", "target"]
+    has_mechanism = any(w in result.report.lower() for w in mechanism_words)
+    assert has_mechanism, "No mechanism discussion found"
+```
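The substring check above is case-sensitive and would miss a citation that carries only a PubMed URL, as the mock fixture's `Citation` does if `str(c)` renders just the fields shown there. A regex-based extractor is one hedged alternative. The sketch below assumes PMIDs are 1 to 8 digits and that ClinicalTrials.gov IDs are `NCT` followed by 8 digits; the function name is hypothetical and not part of the codebase.

```python
import re
from typing import Dict, List

# Matches "PMID: 12345678", "PMID 12345678", or a pubmed.ncbi.nlm.nih.gov URL
PMID_RE = re.compile(
    r"(?:PMID[:\s]*|pubmed\.ncbi\.nlm\.nih\.gov/)(\d{1,8})\b", re.IGNORECASE
)
# Matches ClinicalTrials.gov identifiers such as NCT01234567
NCT_RE = re.compile(r"\bNCT\d{8}\b", re.IGNORECASE)


def extract_citation_ids(text: str) -> Dict[str, List[str]]:
    """Return the PMIDs and NCT IDs found in text, in order of appearance."""
    return {
        "pmid": PMID_RE.findall(text),
        "nct": [match.upper() for match in NCT_RE.findall(text)],
    }


def has_real_citation(text: str) -> bool:
    """True if text contains at least one PMID or NCT ID."""
    ids = extract_citation_ids(text)
    return bool(ids["pmid"] or ids["nct"])
```

In the quality test, `has_real_citation(str(c))` could stand in for the `"PMID" in str(c) or "NCT" in str(c)` check.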
+## Test Strategy
+### Mocking Strategy
+For CI/fast tests, mock external APIs:
+```python
+@pytest.fixture
+def mock_pubmed():
+    """Return realistic but fake PubMed results."""
+    return [
+        Evidence(
+            content="Metformin improves insulin sensitivity...",
+            citation=Citation(
+                source="pubmed",
+                title="Metformin in PCOS: A Meta-Analysis",
+                url="https://pubmed.ncbi.nlm.nih.gov/12345678/",
+                date="2024",
+            )
+        )
+    ]
+```
+### Integration Tests (Real APIs)
+For validation, run against real APIs (marked `@pytest.mark.integration`):
+```python
+@pytest.mark.integration
+@pytest.mark.slow
+@pytest.mark.asyncio  # as in SPEC_01; not needed if pytest-asyncio runs in auto mode
+async def test_real_pubmed_search():
+    """Integration test with real PubMed API."""
+    # Requires NCBI_API_KEY in env
+    ...
+```
+## Test Matrix
+| Mode | Mock | Real API | Status |
+|------|------|----------|--------|
+| Simple (Free) | ✅ Need | ⏳ Optional | Not implemented |
+| Advanced (OpenAI) | ✅ Need | ⏳ Optional | Not implemented |
+## Directory Structure
+```
+tests/
+├── unit/           # Existing 142 tests
+├── integration/    # Real API tests (existing)
+└── e2e/            # NEW: Full pipeline tests
+    ├── conftest.py         # E2E fixtures
+    ├── test_simple_mode.py # Simple mode E2E
+    └── test_advanced_mode.py # Magentic mode E2E
+```
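The tests in this spec rely on `e2e`, `integration`, and `slow` markers. pytest emits a warning for unregistered markers and rejects them under `--strict-markers`, so they need to be registered somewhere. A minimal sketch of doing that in a `conftest.py` follows; placing it in `tests/conftest.py` is an assumption (the markers may already be registered in the project's pytest configuration), and the marker descriptions are illustrative.

```python
# tests/conftest.py (hypothetical location)

# One "name: description" line per custom marker used in the specs
MARKERS = (
    "e2e: full-pipeline tests that run with mocked external APIs",
    "integration: tests that call real external APIs",
    "slow: tests that may take noticeably longer than unit tests",
)


def pytest_configure(config) -> None:
    """Register custom markers so pytest does not warn about unknown marks."""
    for line in MARKERS:
        config.addinivalue_line("markers", line)
```

With the markers registered, CI can select the mocked suite with `pytest -m e2e` and leave out real API calls with `pytest -m "not integration"`.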
+## Acceptance Criteria
+- [ ] E2E test for Simple mode (mocked)
+- [ ] E2E test for Advanced mode (mocked)
+- [ ] Tests validate output structure
+- [ ] Tests run in CI (<2 minutes)
+- [ ] At least one integration test with real API
+## Why Before OpenAlex?
+1. **Prove current system works** before adding complexity
+2. **Establish baseline** - what does "good output" look like?
+3. **Catch regressions** - future changes that break core functionality get detected
+4. **Confidence for hackathon** - we know the demo will produce something
+## Related Issues
+- #47: E2E Testing - Does Pipeline Actually Generate Useful Reports?
+- #65: Demo timing (must fix first to make E2E tests practical)
+## Files to Create
+1. `tests/e2e/conftest.py` - E2E fixtures and mocks
+2. `tests/e2e/test_simple_mode.py` - Simple mode tests
+3. `tests/e2e/test_advanced_mode.py` - Advanced mode tests