VibecoderMcSwaggins committed on
Commit
15566c9
·
1 Parent(s): a539ae6

docs: Add PubMed Full-Text Retrieval Phase Documentation


- Introduce a new document detailing Phase 16: PubMed Full-Text Retrieval, outlining its objectives, prerequisites, success criteria, and implementation notes.
- This phase aims to enhance evidence quality by enabling full-text retrieval from PubMed, including structured sections and search integration.
- The document serves as a comprehensive guide for developers and stakeholders involved in the implementation of this feature.

Files changed (1)
  1. docs/development/testing.md +0 -139
docs/development/testing.md DELETED
@@ -1,139 +0,0 @@
1
- # Testing Strategy
2
- ## Ensuring DeepBoner Is Ironclad
3
-
4
- ---
5
-
6
- ## Overview
7
-
8
- Our testing strategy follows a strict **Pyramid of Reliability**:
9
- 1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
10
- 2. **Integration Tests**: Tool interactions & Agent loops (30% of tests)
11
- 3. **E2E / Regression Tests**: Full research workflows (10% of tests)
12
-
13
- **Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.
14
-
15
- ---
16
-
17
- ## 1. Unit Tests (Fast & Cheap)
18
-
19
- **Location**: `tests/unit/`
20
-
21
- Focus on individual components without external network calls. Mock everything.
22
-
23
- ### Key Test Cases
24
-
25
- #### Agent Logic
26
- - **Initialization**: Verify default config loads correctly.
27
- - **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
28
- - **Budget Checks**: Test `should_continue()` returns `False` when budget exceeded.
29
- - **Error Handling**: Test partial failure recovery (e.g., one tool fails, agent continues).
30
-
31
- #### Tools (Mocked)
32
- - **Parser Logic**: Feed raw XML/JSON to tool parsers and verify `Evidence` objects.
33
- - **Validation**: Ensure tools reject invalid queries (empty strings, etc.).
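The parser check described above could look roughly like the sketch below. The `Evidence` fields, the `parse_pubmed_xml` helper, and the trimmed XML fixture are all assumptions made for illustration; the deleted document does not show the project's real parser or data model.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical stand-in for the project's Evidence model."""
    pmid: str
    title: str
    abstract: str


def parse_pubmed_xml(raw_xml: str) -> list[Evidence]:
    """Hypothetical parser: turn PubMed-style XML into Evidence objects."""
    root = ET.fromstring(raw_xml.strip())
    results = []
    for article in root.iter("PubmedArticle"):
        results.append(
            Evidence(
                pmid=article.findtext(".//PMID", default=""),
                title=article.findtext(".//ArticleTitle", default=""),
                abstract=article.findtext(".//AbstractText", default=""),
            )
        )
    return results


# Trimmed, hand-written fixture (not a real PubMed record).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def test_parser_builds_evidence():
    evidence = parse_pubmed_xml(SAMPLE_XML)
    assert len(evidence) == 1
    assert evidence[0].pmid == "12345"
    assert evidence[0].title == "Example title"
```

Keeping the fixture as a string (or a file under `tests/fixtures/`) means the test never touches the network.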
34
-
35
- #### Judge Prompts
36
- - **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions.
37
- - **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.
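A variable-injection test might be sketched as follows. The template text and the `build_judge_prompt` helper are hypothetical; only the `{question}` and `{context}` placeholders come from the document.

```python
# Hypothetical template; the real judge prompt is not shown in the source.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Context: {context}\n"
    'Respond with JSON: {{"sufficient": <bool>, "reasoning": <string>}}'
)


def build_judge_prompt(question: str, context: str) -> str:
    """Fill the template. str.format raises KeyError if the template
    names a field that is not supplied."""
    return JUDGE_TEMPLATE.format(question=question, context=context)


def test_variables_are_injected():
    prompt = build_judge_prompt("Does X treat Y?", "Study A found ...")
    assert "Does X treat Y?" in prompt
    assert "Study A found ..." in prompt
    # No unfilled placeholders should survive formatting.
    assert "{question}" not in prompt and "{context}" not in prompt
```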
38
-
39
- ```python
40
- # Example: Testing State Logic
41
- def test_budget_stop():
42
-     state = ResearchState(tokens_used=50001, max_tokens=50000)
43
-     assert should_continue(state) is False
44
- ```
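The example above relies on `ResearchState` and `should_continue`, neither of which is defined in the deleted document. A minimal sketch that would make it runnable, assuming a dataclass-based state and assuming that usage exactly at the limit still counts as within budget:

```python
from dataclasses import dataclass


@dataclass
class ResearchState:
    """Hypothetical minimal state; the real class likely carries more fields."""
    tokens_used: int = 0
    max_tokens: int = 50_000
    iterations: int = 0

    def record_usage(self, tokens: int) -> None:
        """Add the tokens spent by one LLM call to the running total."""
        self.tokens_used += tokens


def should_continue(state: ResearchState) -> bool:
    """Continue only while usage has not exceeded the budget (assumed rule)."""
    return state.tokens_used <= state.max_tokens
```

With these definitions, `should_continue(ResearchState(tokens_used=50001, max_tokens=50000))` evaluates to `False`, matching the assertion in the example.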
45
-
46
- ---
47
-
48
- ## 2. Integration Tests (Realistic & Mocked I/O)
49
-
50
- **Location**: `tests/integration/`
51
-
52
- Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or **Replay** patterns to record/replay API calls to save money/time.
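To illustrate the record/replay idea without depending on a particular library, here is a hand-rolled sketch that uses only the standard library. It is not VCR.py's API; VCR.py applies the same idea at the HTTP layer. The `replay_or_record` helper and the JSON cassette layout are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def replay_or_record(cassette: Path, key: str, fetch: Callable[[], str]) -> str:
    """Return the saved response for `key`, calling `fetch` only on a miss."""
    saved = json.loads(cassette.read_text()) if cassette.exists() else {}
    if key not in saved:
        saved[key] = fetch()  # the only place a real API call would happen
        cassette.write_text(json.dumps(saved, indent=2))
    return saved[key]
```

The first run pays for the real call and writes the cassette; later runs read the file, so they are fast, free, and deterministic.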
53
-
54
- ### Key Test Cases
55
-
56
- #### Search Loop
57
- - **Iteration Flow**: Verify agent performs Search -> Judge -> Search loop.
58
- - **Tool Selection**: Verify correct tools are called based on judge output (mocked judge).
59
- - **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.
60
-
61
- #### MCP Server Integration
62
- - **Server Startup**: Verify MCP server starts and exposes tools.
63
- - **Client Connection**: Verify agent can call tools via MCP protocol.
64
-
65
- ```python
66
- # Example: Testing Search Loop with Mocked Tools
67
- async def test_search_loop_flow():
68
-     agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
69
-     report = await agent.run("test query")
70
-     assert agent.state.iterations > 0
71
-     assert len(report.sources) > 0
72
- ```
73
-
74
- ---
75
-
76
- ## 3. End-to-End (E2E) Tests (The "Real Deal")
77
-
78
- **Location**: `tests/e2e/`
79
-
80
- Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.
81
-
82
- ### Key Test Cases
83
-
84
- #### The "Golden Query"
85
- Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"*
86
- - **Success Criteria**:
87
-   - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
88
-   - Includes citations from PubMed.
89
-   - Completes within 3 iterations.
90
-   - JSON output matches schema.
91
-
92
- #### Deployment Smoke Test
93
- - **Gradio UI**: Verify UI launches and accepts input.
94
- - **Streaming**: Verify generator yields chunks (first chunk within 2s).
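The streaming criterion (first chunk within 2s) could be checked with a small timing helper like the one sketched here. `fake_stream` is a hypothetical stand-in for the app's generator; a real smoke test would pass the actual streaming function instead.

```python
import time
from typing import Iterator


def time_to_first_chunk(stream: Iterator[str]) -> tuple[str, float]:
    """Return the first chunk and the seconds spent waiting for it."""
    start = time.monotonic()
    first = next(stream)
    return first, time.monotonic() - start


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for the app's streaming generator."""
    yield "Searching PubMed..."
    yield "Judging evidence..."


def test_first_chunk_is_fast():
    chunk, elapsed = time_to_first_chunk(fake_stream())
    assert chunk  # a non-empty first chunk arrived
    assert elapsed < 2.0
```

`time.monotonic()` is used rather than `time.time()` so the measurement is unaffected by system clock adjustments.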
95
-
96
- ---
97
-
98
- ## 4. Tools & Config
99
-
100
- ### Pytest Configuration
101
- ```toml
102
- # pyproject.toml
103
- [tool.pytest.ini_options]
104
- markers = [
105
-     "unit: fast, isolated tests",
106
-     "integration: mocked network tests",
107
-     "e2e: real network tests (slow, expensive)"
108
- ]
109
- asyncio_mode = "auto"
110
- ```
111
-
112
- ### CI/CD Pipeline (GitHub Actions)
113
- 1. **Lint**: `ruff check .`
114
- 2. **Type Check**: `mypy .`
115
- 3. **Unit**: `pytest -m unit`
116
- 4. **Integration**: `pytest -m integration`
117
- 5. **E2E**: (Manual trigger only)
118
-
119
- ---
120
-
121
- ## 5. Anti-Hallucination Validation
122
-
123
- How do we test if the agent is lying?
124
-
125
- 1. **Citation Check**:
126
-    - Use a regex to verify that every `[PMID: 12345]`-style citation in the report matches an entry in the `Evidence` list.
127
-    - Fail if a citation is "orphaned" (a hallucinated ID).
128
-
129
- 2. **Negative Constraints**:
130
-    - Query for a fake disease (e.g., "Ligma syndrome"); the agent should return "No evidence found".
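The citation check could be sketched as below. The regex assumes citations appear exactly in the `[PMID: 12345]` form quoted above, and `find_orphaned_citations` is a hypothetical helper name; the set of evidence PMIDs would come from the agent's `Evidence` list.

```python
import re

# Matches "[PMID: 12345]" and captures the digits; whitespace after the colon is optional.
PMID_PATTERN = re.compile(r"\[PMID:\s*(\d+)\]")


def find_orphaned_citations(report: str, evidence_pmids: set[str]) -> set[str]:
    """Return PMIDs cited in the report that are absent from the evidence."""
    cited = set(PMID_PATTERN.findall(report))
    return cited - evidence_pmids


def test_no_orphaned_citations():
    report = "CoQ10 may help [PMID: 12345]. LDN may help [PMID: 67890]."
    assert find_orphaned_citations(report, {"12345", "67890"}) == set()
```

A non-empty result means the report cites at least one ID the agent never retrieved, which should fail the test.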
131
-
132
- ---
133
-
134
- ## Checklist for Implementation
135
-
136
- - [ ] Set up `tests/` directory structure
137
- - [ ] Configure `pytest` and `vcrpy`
138
- - [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
139
- - [ ] Write first unit test for `ResearchState`