# Verification Specification

**Feature:** F005
**Generated from:** specs/F005-VERIFICATION_INPUT.json
**Generated:** 2026-03-27

---

## 1. Unit Tests

### 1.1 EpisodeResult (frozen dataclass)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_episode_result_creation | Happy path construction | `EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=5, error=None)` | All fields accessible, values match | happy |
| test_episode_result_frozen | Cannot mutate after creation | Attempt `result.correct = False` | `FrozenInstanceError` raised | edge |
| test_episode_result_with_error | Episode that failed | `EpisodeResult(episode_index=1, correct=False, total_reward=0.0, steps=0, error="connection error")` | `error` field is `"connection error"` | error |
| test_episode_result_error_default_none | Error field defaults to None | `EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=3)` | `error is None` | happy |

**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "EpisodeResult"`
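The rows above can be exercised against a minimal stand-in. The sketch below assumes `EpisodeResult` is a plain `@dataclass(frozen=True)` with the five fields named in the table. The field types and the helper `mutation_is_blocked` are illustrative assumptions, not the project's actual definitions.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class EpisodeResult:
    """Hypothetical stand-in; field names follow the table, types are assumed."""

    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None  # defaults to None when the episode succeeded


def mutation_is_blocked(result: EpisodeResult) -> bool:
    """Return True if assigning to a field raises FrozenInstanceError."""
    try:
        result.correct = False  # type: ignore[misc]
    except FrozenInstanceError:
        return True
    return False


ok = EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=3)
failed = EpisodeResult(
    episode_index=1, correct=False, total_reward=0.0, steps=0, error="connection error"
)
```

In a pytest test the same check would more likely be written with `pytest.raises(FrozenInstanceError)`; the helper form is used here only to keep the sketch free of third-party imports.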

### 1.2 EvaluationResult (frozen dataclass)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_evaluation_result_creation | Happy path with episodes | `EvaluationResult(success_rate=0.5, avg_reward=0.75, avg_steps=3.0, n_episodes=2, n_completed=2, episodes=[...])` | All fields match | happy |
| test_evaluation_result_frozen | Cannot mutate after creation | Attempt `result.success_rate = 1.0` | `FrozenInstanceError` raised | edge |
| test_evaluation_result_empty_episodes | Zero episodes edge case | `EvaluationResult(success_rate=0.0, avg_reward=0.0, avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])` | Valid construction, all zeros | edge |
| test_evaluation_result_partial_completion | Some episodes failed | `n_episodes=10, n_completed=7` | `n_completed < n_episodes` allowed | edge |
| test_evaluation_result_success_rate_bounds | Success rate between 0 and 1 | `success_rate=0.0` and `success_rate=1.0` | Both valid | edge |

**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "EvaluationResult"`

### 1.3 Policy Protocol

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_policy_protocol_compliance | Object with select_action satisfies Policy | Custom class with `select_action(obs) -> SQLAction` | `isinstance(obj, Policy)` or structural match | happy |
| test_policy_protocol_missing_method | Object without select_action | Plain object | Does NOT satisfy Protocol | error |

**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "Policy"`
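One way to make the `isinstance(obj, Policy)` check possible is a `typing.Protocol` decorated with `@runtime_checkable`. The sketch below assumes that approach. `SQLAction`, `AlwaysAnswer`, and `NotAPolicy` are hypothetical stand-ins introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class SQLAction:
    """Hypothetical stand-in; the real SQLAction may carry different fields."""

    action_type: str
    argument: str = ""


@runtime_checkable
class Policy(Protocol):
    """Structural type: any object with a select_action method qualifies."""

    def select_action(self, observation: Any) -> SQLAction: ...


class AlwaysAnswer:
    """Hypothetical policy that satisfies the protocol without inheriting from it."""

    def select_action(self, observation: Any) -> SQLAction:
        return SQLAction(action_type="ANSWER")


class NotAPolicy:
    """Has no select_action method, so it does not satisfy the protocol."""
```

Note that a runtime `isinstance` check against a protocol only confirms that the method exists. It does not validate the signature or return type, so a static type checker is still needed for the "structural match" half of the expectation.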

### 1.4 RandomPolicy.__init__

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_random_policy_default_seed | No seed provided | `RandomPolicy()` | Constructs successfully | happy |
| test_random_policy_with_seed | Explicit seed | `RandomPolicy(seed=42)` | Constructs successfully | happy |
| test_random_policy_none_seed | Explicit None seed | `RandomPolicy(seed=None)` | Constructs successfully | happy |

**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "random_policy_init or random_policy_default or random_policy_with_seed or random_policy_none"`

### 1.5 RandomPolicy.select_action

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_random_policy_explores_when_budget_gt_1 | Budget > 1 means exploration | Observation with `budget_remaining=10` | Returns SQLAction with `action_type` in `{DESCRIBE, SAMPLE, QUERY}` | happy |
| test_random_policy_answers_when_budget_eq_1 | Budget == 1 forces ANSWER | Observation with `budget_remaining=1` | Returns SQLAction with `action_type == "ANSWER"` | happy |
| test_random_policy_returns_sql_action | Return type is correct | Any valid observation | `isinstance(result, SQLAction)` | happy |
| test_random_policy_deterministic_with_seed | Same seed produces same actions | Two RandomPolicy(seed=42) with identical observations | Same sequence of actions | happy |
| test_random_policy_varies_without_seed | Different runs produce different actions (probabilistic) | Multiple calls without seed | Not all actions identical (run 50 times) | edge |
| test_random_policy_explores_all_action_types | Over many calls, all exploration types appear | Run 100 times with budget > 1 | DESCRIBE, SAMPLE, and QUERY each appear at least once | edge |

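**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "random_policy_explores or random_policy_answers or random_policy_returns or random_policy_deterministic or random_policy_varies"`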
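The budget rule and the seed behaviour in the table can be sketched as follows. This is a minimal illustration under stated assumptions: `Observation` and `SQLAction` are reduced to the fields the table mentions, the policy uses a private `random.Random` instance, and a budget of 1 or less is treated as forcing `ANSWER` (the table only specifies the `== 1` case).

```python
import random
from dataclasses import dataclass
from typing import Optional

# Action types named in the table for the exploration phase.
EXPLORATION_TYPES = ("DESCRIBE", "SAMPLE", "QUERY")


@dataclass
class SQLAction:
    """Hypothetical stand-in for the real action type."""

    action_type: str
    argument: str = ""


@dataclass
class Observation:
    """Hypothetical stand-in exposing only budget_remaining."""

    budget_remaining: int


class RandomPolicy:
    def __init__(self, seed: Optional[int] = None) -> None:
        # A private generator keeps the policy independent of global random state.
        self._rng = random.Random(seed)

    def select_action(self, observation: Observation) -> SQLAction:
        # ASSUMPTION: any budget <= 1 forces ANSWER; the spec states only == 1.
        if observation.budget_remaining <= 1:
            return SQLAction(action_type="ANSWER")
        return SQLAction(action_type=self._rng.choice(EXPLORATION_TYPES))


def action_sequence(policy: RandomPolicy, budget: int, n: int) -> list:
    """Collect the action types chosen over n calls at a fixed budget."""
    return [policy.select_action(Observation(budget)).action_type for _ in range(n)]
```

Two policies built with the same seed and fed identical observations then produce the same sequence, which is the property `test_random_policy_deterministic_with_seed` relies on.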

### 1.6 evaluate()

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_evaluate_happy_path | Run N episodes successfully | `evaluate(env, policy, n_episodes=5)` | Returns EvaluationResult with `n_episodes=5, n_completed=5` | happy |
| test_evaluate_returns_evaluation_result | Return type correct | Any valid call | `isinstance(result, EvaluationResult)` | happy |
| test_evaluate_default_n_episodes | Default is 100 | `evaluate(env, policy)` | `result.n_episodes == 100` | happy |
| test_evaluate_n_episodes_zero | Zero episodes | `evaluate(env, policy, n_episodes=0)` | `EvaluationResult` with all zeros, empty episodes list | edge |
| test_evaluate_negative_n_episodes | Negative episodes | `evaluate(env, policy, n_episodes=-1)` | Raises `ValueError` | error |
| test_evaluate_success_rate_calculation | Correct fraction | Policy that answers correctly 3 out of 5 times | `success_rate == 0.6` | happy |
| test_evaluate_avg_reward_calculation | Mean reward correct | Known rewards per episode | `avg_reward` matches manual calculation | happy |
| test_evaluate_avg_steps_calculation | Mean steps correct | Known steps per episode | `avg_steps` matches manual calculation | happy |
| test_evaluate_episodes_list_length | Per-episode breakdown | `n_episodes=5` | `len(result.episodes) == 5` | happy |
| test_evaluate_episode_indices | 0-based episode indices | `n_episodes=3` | `[e.episode_index for e in result.episodes] == [0, 1, 2]` | happy |
| test_evaluate_seed_determinism | Same seed produces same results | Two calls with `seed=42, n_episodes=10` | Both EvaluationResults have identical `success_rate, avg_reward, avg_steps` | happy |
| test_evaluate_seed_per_episode | Episode i uses seed+i | `seed=100, n_episodes=3` | env.reset called with seeds 100, 101, 102 (verify via mock) | happy |
| test_evaluate_no_seed_variation | No seed allows variation | Two calls without seed | Results may differ (non-deterministic) | edge |
| test_evaluate_n_episodes_one | Single episode | `n_episodes=1` | Valid result with 1 episode | edge |
| test_evaluate_large_n_episodes | Large run | `n_episodes=500` | Completes without error, correct counts | edge |

**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "test_evaluate"`

### 1.7 evaluate() -- Error Handling

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_evaluate_episode_exception_recorded | Exception during episode is caught | Policy that raises on episode 2 | Episode 2 has `correct=False, total_reward=0.0, steps=0, error=<message>` | error |
| test_evaluate_continues_after_exception | Failed episode does not stop evaluation | Exception on episode 1 of 5 | `n_episodes=5`, all 5 episodes in result | error |
| test_evaluate_n_completed_excludes_errors | n_completed counts only episodes that finished without an error | 2 out of 5 episodes raise | `n_completed == 3` | error |
| test_evaluate_averages_exclude_failed | avg_reward/avg_steps from completed episodes only | 3 completed with known values, 2 failed | Averages match only the 3 completed | error |
| test_evaluate_env_reset_exception | Exception during env.reset() | Mock env.reset() to raise on episode 3 | Episode 3 recorded with error, others complete | error |
| test_evaluate_policy_exception | Exception during select_action() | Mock policy.select_action() to raise | Episode recorded with error, evaluation continues | error |
| test_evaluate_env_step_exception | Exception during env.step() | Mock env.step() to raise | Episode recorded with error, evaluation continues | error |
| test_evaluate_all_episodes_fail | Every episode fails | Policy that always raises | `n_completed=0`, `success_rate=0.0`, `avg_reward=0.0`, `avg_steps=0.0` | error |

**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "exception or error or fail"`

### 1.8 evaluate() -- Progress Callback

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_evaluate_progress_callback_called | Callback receives updates | Mock callback, `n_episodes=5` | Callback called with `(1,5), (2,5), (3,5), (4,5), (5,5)` | happy |
| test_evaluate_no_callback | None callback is fine | `progress_callback=None` | No error | happy |
| test_evaluate_callback_receives_correct_total | Total matches n_episodes | `n_episodes=10` | Every callback call has `total=10` | happy |

**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "callback"`
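The behaviours listed in sections 1.6 to 1.8 (per-episode seeding, error capture, averages over completed episodes, and the progress callback) can be sketched in one loop. Everything about the environment interface here is an assumption: `reset(seed=...)` returning an observation, `step(action)` returning `(observation, reward, done, correct)`, and the `FakeEnv` and `FlakyPolicy` helpers are hypothetical. The sketch also assumes `success_rate` is divided by `n_episodes`, not by `n_completed`.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class EpisodeResult:
    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None


@dataclass(frozen=True)
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    n_episodes: int
    n_completed: int
    episodes: List[EpisodeResult]


def evaluate(
    env: Any,
    policy: Any,
    n_episodes: int = 100,
    seed: Optional[int] = None,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> EvaluationResult:
    if n_episodes < 0:
        raise ValueError("n_episodes must be non-negative")
    episodes: List[EpisodeResult] = []
    for i in range(n_episodes):
        try:
            # Episode i is seeded with seed + i when a base seed is given.
            obs = env.reset(seed=None if seed is None else seed + i)
            total_reward, steps, done, correct = 0.0, 0, False, False
            while not done:
                action = policy.select_action(obs)
                # ASSUMPTION: step() returns (observation, reward, done, correct).
                obs, reward, done, correct = env.step(action)
                total_reward += reward
                steps += 1
            episodes.append(EpisodeResult(i, correct, total_reward, steps))
        except Exception as exc:
            # A failed episode is recorded with zeroed metrics and the message.
            episodes.append(EpisodeResult(i, False, 0.0, 0, error=str(exc)))
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)
    completed = [e for e in episodes if e.error is None]
    n_completed = len(completed)
    return EvaluationResult(
        success_rate=sum(e.correct for e in episodes) / n_episodes if n_episodes else 0.0,
        avg_reward=sum(e.total_reward for e in completed) / n_completed if n_completed else 0.0,
        avg_steps=sum(e.steps for e in completed) / n_completed if n_completed else 0.0,
        n_episodes=n_episodes,
        n_completed=n_completed,
        episodes=episodes,
    )


class FakeEnv:
    """Hypothetical one-step environment that records the seeds it receives."""

    def __init__(self) -> None:
        self.seeds: List[Optional[int]] = []

    def reset(self, seed: Optional[int] = None) -> dict:
        self.seeds.append(seed)
        return {"budget_remaining": 1}

    def step(self, action: str):
        return {"budget_remaining": 0}, 1.0, True, action == "ANSWER"


class FlakyPolicy:
    """Hypothetical policy that raises on the given 0-based call numbers."""

    def __init__(self, fail_on=()) -> None:
        self.fail_on = set(fail_on)
        self.calls = 0

    def select_action(self, obs: dict) -> str:
        call, self.calls = self.calls, self.calls + 1
        if call in self.fail_on:
            raise RuntimeError("policy failure")
        return "ANSWER"
```

Because `FakeEnv` episodes last exactly one step, the call number seen by `FlakyPolicy` equals the episode index, which makes it easy to target specific episodes. Catching `Exception` (not `BaseException`) is a deliberate choice in this sketch so that `KeyboardInterrupt` still stops a long run.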

---

## 2. Integration Tests

### Flow: Full Evaluation with RandomPolicy

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Create SQLEnvironment with test DB and questions | Environment loads successfully | `len(env.questions) > 0` |
| 2 | Create `RandomPolicy(seed=42)` | Policy created | Object has `select_action` method |
| 3 | Call `evaluate(env, RandomPolicy(seed=42), n_episodes=10, seed=0)` | Returns EvaluationResult | `result.n_episodes == 10` |
| 4 | Verify all episodes recorded | Per-episode breakdown present | `len(result.episodes) == 10` |
| 5 | Verify aggregate metrics are consistent | success_rate matches manual count | `result.success_rate == sum(e.correct for e in result.episodes) / 10` |
| 6 | Verify avg_reward consistent | avg_reward matches manual mean | `result.avg_reward == mean([e.total_reward for e in result.episodes if e.error is None])` |
| 7 | Verify determinism | Repeat with same seed | Identical results |

**Run:** `uv run pytest tests/integration/test_evaluation_integration.py -v -k "full_evaluation"`
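Steps 5 and 6 compare floating-point aggregates with `==`, which can fail on rounding differences even when the implementation is correct. A tolerance-based check may be more robust. The helper below is a hypothetical sketch: `aggregates_are_consistent` is not part of the project, and it assumes only the attribute names used in the table.

```python
import math
from statistics import fmean
from types import SimpleNamespace


def aggregates_are_consistent(result) -> bool:
    """Recompute success_rate and avg_reward from the per-episode records."""
    completed = [e for e in result.episodes if e.error is None]
    # Success rate over all episodes, as in step 5 of the flow.
    expected_success = (
        sum(e.correct for e in result.episodes) / result.n_episodes
        if result.n_episodes
        else 0.0
    )
    # Mean reward over completed episodes only, as in step 6 of the flow.
    expected_reward = fmean(e.total_reward for e in completed) if completed else 0.0
    return math.isclose(result.success_rate, expected_success) and math.isclose(
        result.avg_reward, expected_reward
    )


def _episode(correct, total_reward, error=None):
    """Build a lightweight episode record for the demonstration below."""
    return SimpleNamespace(correct=correct, total_reward=total_reward, error=error)


# Hypothetical result: four episodes, the last one failed with an error.
demo = SimpleNamespace(
    n_episodes=4,
    success_rate=0.5,
    avg_reward=0.5,
    episodes=[
        _episode(True, 1.0),
        _episode(False, 0.0),
        _episode(True, 0.5),
        _episode(False, 0.0, error="boom"),
    ],
)
```

In a pytest suite the same comparison is commonly written with `pytest.approx`; `math.isclose` is used here to stay within the standard library.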

### Flow: Evaluation with Partial Failures

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Create environment and a policy that fails on specific episodes | Setup complete | -- |
| 2 | Call `evaluate(env, flaky_policy, n_episodes=5)` | Returns result with mix of successes and failures | `result.n_completed < result.n_episodes` |
| 3 | Inspect failed episodes | Have error field set | `any(e.error is not None for e in result.episodes)` |
| 4 | Inspect successful episodes | Have error=None | Completed episodes have `error is None` and valid metrics |

**Run:** `uv run pytest tests/integration/test_evaluation_integration.py -v -k "partial_failure"`

### Flow: Zero Episodes

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Call `evaluate(env, policy, n_episodes=0)` | Returns zero-state result | All aggregate values are 0.0, episodes list is empty |

**Run:** `uv run pytest tests/integration/test_evaluation_integration.py -v -k "zero_episodes"`

---

## 3. API Tests

No API endpoints are defined for F005. This section is intentionally empty.

---

## 4. E2E Tests

### Scenario: Single-Command Evaluation of Random Baseline

**Setup:** SQLEnvironment initialized with Spider-format test database and questions file.
**Actions:** Call `evaluate(env, RandomPolicy(seed=42), n_episodes=20, seed=0)` and inspect output.
**Expected:**
- Returns EvaluationResult with `n_episodes=20`
- `success_rate` is a float in [0.0, 1.0]
- `avg_reward` is a float
- `avg_steps` is a positive float
- `n_completed == 20` (no errors with valid env + RandomPolicy)
- All 20 EpisodeResult entries present with valid fields
- Deterministic: re-running with same seeds yields identical results

**Run:** `uv run pytest tests/e2e/test_evaluation_e2e.py -v`

### Scenario: Comparison of Two Policies

**Setup:** SQLEnvironment with test data.
**Actions:**
1. Evaluate RandomPolicy(seed=1) over 20 episodes
2. Evaluate an "always answer immediately" policy over 20 episodes
3. Compare results
**Expected:**
- Both return valid EvaluationResult
- Results are structurally comparable (same fields)
- Metrics differ between policies

**Run:** `uv run pytest tests/e2e/test_evaluation_e2e.py -v -k "comparison"`

---

## 5. Edge Cases Checklist

- [ ] n_episodes = 0 returns zero-valued EvaluationResult with empty episodes list
- [ ] n_episodes = -1 raises ValueError immediately
- [ ] n_episodes = 1 works correctly (single episode)
- [ ] All episodes fail -- n_completed=0, averages are 0.0, success_rate is 0.0
- [ ] Exception during env.reset() is caught and recorded
- [ ] Exception during policy.select_action() is caught and recorded
- [ ] Exception during env.step() is caught and recorded
- [ ] RandomPolicy with budget_remaining=1 always returns ANSWER
- [ ] RandomPolicy with budget_remaining > 1 never returns ANSWER
- [ ] Seed determinism: same seed + same n_episodes = identical EvaluationResult
- [ ] Per-episode seeding: episode i uses seed+i for env.reset()
- [ ] Progress callback receives (current, total) for each episode
- [ ] Progress callback=None does not cause errors
- [ ] EpisodeResult and EvaluationResult are frozen (immutable)
- [ ] Large n_episodes (500+) completes without memory issues
- [ ] success_rate is always in [0.0, 1.0]
- [ ] avg_reward and avg_steps computed only from completed (non-error) episodes

---

## 6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed` from `uv run pytest tests/unit/test_evaluation.py -v` |
| Integration | pytest output | `X passed` from `uv run pytest tests/integration/test_evaluation_integration.py -v` |
| E2E | pytest output | `X passed` from `uv run pytest tests/e2e/test_evaluation_e2e.py -v` |
| Edge cases | pytest output | All edge-case tests in checklist pass |
| Determinism | pytest output | Seed-based tests produce identical results across runs |