# Testing Strategy
## Ensuring DeepBoner Is Ironclad

---

## Overview

Our testing strategy follows a strict **Pyramid of Reliability**:
1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
2. **Integration Tests**: Tool interactions & Agent loops (30% of tests)
3. **E2E / Regression Tests**: Full research workflows (10% of tests)

**Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.

---

## 1. Unit Tests (Fast & Cheap)

**Location**: `tests/unit/`

Focus on individual components without external network calls. Mock everything.

### Key Test Cases

#### Agent Logic
- **Initialization**: Verify default config loads correctly.
- **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
- **Budget Checks**: Test `should_continue()` returns `False` when budget exceeded.
- **Error Handling**: Test partial failure recovery (e.g., one tool fails, agent continues).

#### Tools (Mocked)
- **Parser Logic**: Feed raw XML/JSON to tool parsers and verify `Evidence` objects.
- **Validation**: Ensure tools reject invalid queries (empty strings, etc.).
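
As a sketch, a parser test can feed a canned XML snippet to the tool's parser and assert on the resulting `Evidence` objects. The `parse_pubmed_xml` helper and the `Evidence` fields below are illustrative assumptions, not the project's actual API:

```python
# Example: parser unit test against canned XML (parse_pubmed_xml and the
# Evidence fields are hypothetical stand-ins for the real tool's API)
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Evidence:
    pmid: str
    title: str

def parse_pubmed_xml(raw: str) -> list[Evidence]:
    root = ET.fromstring(raw)
    return [
        Evidence(pmid=a.findtext("PMID", ""), title=a.findtext("Title", ""))
        for a in root.iter("Article")
    ]

def test_parser_builds_evidence():
    raw = "<Set><Article><PMID>12345</PMID><Title>CoQ10 trial</Title></Article></Set>"
    evidence = parse_pubmed_xml(raw)
    assert evidence[0].pmid == "12345"
    assert "CoQ10" in evidence[0].title
```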

#### Judge Prompts
- **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions.
- **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.
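
A minimal injection check (the template string here is a placeholder, not the real judge prompt):

```python
# Example: verify template variables are injected (template text is illustrative)
JUDGE_TEMPLATE = "Question: {question}\n\nContext:\n{context}\n\nAnswer in JSON."

def test_variable_injection():
    prompt = JUDGE_TEMPLATE.format(question="q?", context="ctx")
    assert "q?" in prompt and "ctx" in prompt
    assert "{question}" not in prompt and "{context}" not in prompt
```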

```python
# Example: Testing State Logic
def test_budget_stop():
    state = ResearchState(tokens_used=50001, max_tokens=50000)
    assert should_continue(state) is False
```

---

## 2. Integration Tests (Realistic & Mocked I/O)

**Location**: `tests/integration/`

Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or a record/replay pattern to capture real API responses once and replay them in CI, saving both money and time.
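
VCR.py handles record/replay automatically for HTTP libraries; for tools that don't go through one, a hand-rolled cassette works the same way. The `Replay` class below is a sketch with assumed names, not project code:

```python
# Sketch of a record-once/replay-forever cassette (names are assumptions;
# VCR.py provides equivalent behavior for HTTP calls out of the box)
import json
from pathlib import Path

class Replay:
    def __init__(self, cassette: Path, live_fetch):
        self.cassette = cassette
        self.live_fetch = live_fetch  # only called on cache misses

    def fetch(self, key: str):
        cache = json.loads(self.cassette.read_text()) if self.cassette.exists() else {}
        if key not in cache:
            cache[key] = self.live_fetch(key)  # record the live response
            self.cassette.write_text(json.dumps(cache))
        return cache[key]  # replay on every later call
```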

### Key Test Cases

#### Search Loop
- **Iteration Flow**: Verify agent performs Search -> Judge -> Search loop.
- **Tool Selection**: Verify correct tools are called based on judge output (mocked judge).
- **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.

#### MCP Server Integration
- **Server Startup**: Verify MCP server starts and exposes tools.
- **Client Connection**: Verify agent can call tools via MCP protocol.

```python
# Example: Testing Search Loop with Mocked Tools
async def test_search_loop_flow():
    agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
    report = await agent.run("test query")
    assert agent.state.iterations > 0
    assert len(report.sources) > 0
```

---

## 3. End-to-End (E2E) Tests (The "Real Deal")

**Location**: `tests/e2e/`

Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.

### Key Test Cases

#### The "Golden Query"
Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"*
- **Success Criteria**:
  - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
  - Includes citations from PubMed.
  - Completes within 3 iterations.
  - JSON output matches schema.
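
The schema assertion can be a plain field/type check so the E2E test fails loudly on malformed output. The field names below are assumptions about the report JSON, not the actual schema:

```python
# Minimal report-schema check (field names are assumed; adjust to the real schema)
REQUIRED_FIELDS = {"candidates": list, "sources": list, "summary": str}

def schema_errors(report: dict) -> list[str]:
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in report:
            errors.append(f"missing field: {field}")
        elif not isinstance(report[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors
```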

#### Deployment Smoke Test
- **Gradio UI**: Verify UI launches and accepts input.
- **Streaming**: Verify generator yields chunks (first chunk within 2s).
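
The 2-second first-chunk budget can be measured directly on the async generator. The stub stream below stands in for the real agent's streaming interface:

```python
# Sketch: measure time-to-first-chunk on an async stream (the stream is a stub)
import asyncio
import time

async def time_to_first_chunk(stream) -> float:
    start = time.monotonic()
    await stream.__anext__()  # wait for the first yielded chunk
    return time.monotonic() - start

async def fake_stream():
    await asyncio.sleep(0.01)
    yield "chunk-1"

async def test_first_chunk_under_budget():
    latency = await time_to_first_chunk(fake_stream())
    assert latency < 2.0
```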

---

## 4. Tools & Config

### Pytest Configuration
```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: mocked network tests",
    "e2e: real network tests (slow, expensive)"
]
asyncio_mode = "auto"
```

### CI/CD Pipeline (GitHub Actions)
1. **Lint**: `ruff check .`
2. **Type Check**: `mypy .`
3. **Unit**: `pytest -m unit`
4. **Integration**: `pytest -m integration`
5. **E2E**: (Manual trigger only)

---

## 5. Anti-Hallucination Validation

How do we test if the agent is lying?

1. **Citation Check**:
   - Verify with a regex that every `[PMID: 12345]`-style citation in the report exists in the `Evidence` list.
   - Fail if a citation is "orphaned" (hallucinated ID).

2. **Negative Constraints**:
   - Test queries for fake diseases ("Ligma syndrome") -> Agent should return "No evidence found".
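
The citation check is a few lines of regex. A sketch, assuming citations follow the `[PMID: 12345]` format shown above:

```python
# Flag "orphaned" citations: PMIDs cited in the report with no backing Evidence
import re

PMID_PATTERN = re.compile(r"\[PMID:\s*(\d+)\]")

def orphaned_citations(report_text: str, evidence_pmids: set[str]) -> set[str]:
    cited = set(PMID_PATTERN.findall(report_text))
    return cited - evidence_pmids  # non-empty set means hallucinated citations
```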

---

## Checklist for Implementation

- [ ] Set up `tests/` directory structure
- [ ] Configure `pytest` and `vcrpy`
- [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
- [ ] Write first unit test for `ResearchState`