Spaces:
Running
Running
File size: 4,311 Bytes
b4aa4ad 5d12635 b4aa4ad |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 |
# Testing Strategy
## ensuring DeepBoner is Ironclad
---
## Overview
Our testing strategy follows a strict **Pyramid of Reliability**:
1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
2. **Integration Tests**: Tool interactions & Agent loops (30% of tests)
3. **E2E / Regression Tests**: Full research workflows (10% of tests)
**Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.
---
## 1. Unit Tests (Fast & Cheap)
**Location**: `tests/unit/`
Focus on individual components without external network calls. Mock everything.
### Key Test Cases
#### Agent Logic
- **Initialization**: Verify default config loads correctly.
- **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
- **Budget Checks**: Test `should_continue()` returns `False` when budget exceeded.
- **Error Handling**: Test partial failure recovery (e.g., one tool fails, agent continues).
#### Tools (Mocked)
- **Parser Logic**: Feed raw XML/JSON to tool parsers and verify `Evidence` objects.
- **Validation**: Ensure tools reject invalid queries (empty strings, etc.).
#### Judge Prompts
- **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions.
- **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.
```python
# Example: Testing State Logic
def test_budget_stop():
state = ResearchState(tokens_used=50001, max_tokens=50000)
assert should_continue(state) is False
```
---
## 2. Integration Tests (Realistic & Mocked I/O)
**Location**: `tests/integration/`
Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or **Replay** patterns to record/replay API calls to save money/time.
### Key Test Cases
#### Search Loop
- **Iteration Flow**: Verify agent performs Search -> Judge -> Search loop.
- **Tool Selection**: Verify correct tools are called based on judge output (mocked judge).
- **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.
#### MCP Server Integration
- **Server Startup**: Verify MCP server starts and exposes tools.
- **Client Connection**: Verify agent can call tools via MCP protocol.
```python
# Example: Testing Search Loop with Mocked Tools
async def test_search_loop_flow():
agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
report = await agent.run("test query")
assert agent.state.iterations > 0
assert len(report.sources) > 0
```
---
## 3. End-to-End (E2E) Tests (The "Real Deal")
**Location**: `tests/e2e/`
Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.
### Key Test Cases
#### The "Golden Query"
Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"*
- **Success Criteria**:
- Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
- Includes citations from PubMed.
- Completes within 3 iterations.
- JSON output matches schema.
#### Deployment Smoke Test
- **Gradio UI**: Verify UI launches and accepts input.
- **Streaming**: Verify generator yields chunks (first chunk within 2s).
---
## 4. Tools & Config
### Pytest Configuration
```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = [
"unit: fast, isolated tests",
"integration: mocked network tests",
"e2e: real network tests (slow, expensive)"
]
asyncio_mode = "auto"
```
### CI/CD Pipeline (GitHub Actions)
1. **Lint**: `ruff check .`
2. **Type Check**: `mypy .`
3. **Unit**: `pytest -m unit`
4. **Integration**: `pytest -m integration`
5. **E2E**: (Manual trigger only)
---
## 5. Anti-Hallucination Validation
How do we test if the agent is lying?
1. **Citation Check**:
- Regex verify that every `[PMID: 12345]` in the report exists in the `Evidence` list.
- Fail if a citation is "orphaned" (hallucinated ID).
2. **Negative Constraints**:
- Test queries for fake diseases ("Ligma syndrome") -> Agent should return "No evidence found".
---
## Checklist for Implementation
- [ ] Set up `tests/` directory structure
- [ ] Configure `pytest` and `vcrpy`
- [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
- [ ] Write first unit test for `ResearchState`
|