# Verification Specification
**Feature:** F005
**Generated from:** specs/F005-VERIFICATION_INPUT.json
**Generated:** 2026-03-27
---
## 1. Unit Tests
### 1.1 EpisodeResult (frozen dataclass)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_episode_result_creation | Happy path construction | `EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=5, error=None)` | All fields accessible, values match | happy |
| test_episode_result_frozen | Cannot mutate after creation | Attempt `result.correct = False` | `FrozenInstanceError` raised | edge |
| test_episode_result_with_error | Episode that failed | `EpisodeResult(episode_index=1, correct=False, total_reward=0.0, steps=0, error="connection error")` | `error` field is `"connection error"` | error |
| test_episode_result_error_default_none | Error field defaults to None | `EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=3)` | `error is None` | happy |
**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "episode_result"`
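A minimal sketch of the shape these tests assume. The field names come from the table above; the field types and the use of `@dataclass(frozen=True)` are inferred from the test descriptions, not taken from the actual module.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode outcome (field types are assumptions)."""
    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None  # None when the episode ran without raising


result = EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=3)

try:
    result.correct = False  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

`FrozenInstanceError` is a subclass of `AttributeError`, so a test may catch either, but asserting the specific type documents the intent more clearly.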
### 1.2 EvaluationResult (frozen dataclass)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_evaluation_result_creation | Happy path with episodes | `EvaluationResult(success_rate=0.5, avg_reward=0.75, avg_steps=3.0, n_episodes=2, n_completed=2, episodes=[...])` | All fields match | happy |
| test_evaluation_result_frozen | Cannot mutate after creation | Attempt `result.success_rate = 1.0` | `FrozenInstanceError` raised | edge |
| test_evaluation_result_empty_episodes | Zero episodes edge case | `EvaluationResult(success_rate=0.0, avg_reward=0.0, avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])` | Valid construction, all zeros | edge |
| test_evaluation_result_partial_completion | Some episodes failed | `n_episodes=10, n_completed=7` | `n_completed < n_episodes` allowed | edge |
| test_evaluation_result_success_rate_bounds | Success rate between 0 and 1 | `success_rate=0.0` and `success_rate=1.0` | Both valid | edge |
**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "evaluation_result"`
### 1.3 Policy Protocol
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_policy_protocol_compliance | Object with select_action satisfies Policy | Custom class with `select_action(obs) -> SQLAction` | `isinstance(obj, Policy)` is true (requires `Policy` to be `@runtime_checkable`), or structural match under a type checker | happy |
| test_policy_protocol_missing_method | Object without select_action | Plain object | Does NOT satisfy Protocol | error |
**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "Policy"`
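A sketch of how the protocol check could look, assuming `Policy` is declared with `typing.Protocol` and decorated with `@runtime_checkable`. The `SQLAction` class here is a hypothetical stand-in for the project's own type, and `AlwaysAnswer` is an illustrative policy, not part of F005.

```python
from typing import Any, Protocol, runtime_checkable


class SQLAction:
    """Hypothetical stand-in for the project's SQLAction type."""

    def __init__(self, action_type: str, argument: str = "") -> None:
        self.action_type = action_type
        self.argument = argument


@runtime_checkable
class Policy(Protocol):
    def select_action(self, observation: Any) -> SQLAction: ...


class AlwaysAnswer:
    """Satisfies Policy structurally; no inheritance is needed."""

    def select_action(self, observation: Any) -> SQLAction:
        return SQLAction("ANSWER")


has_method = isinstance(AlwaysAnswer(), Policy)
lacks_method = isinstance(object(), Policy)
```

A runtime `isinstance` check against a protocol only confirms that the method exists; it does not validate the signature or return type, so a static type checker is still needed for the full structural guarantee.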
### 1.4 RandomPolicy.__init__
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_random_policy_default_seed | No seed provided | `RandomPolicy()` | Constructs successfully | happy |
| test_random_policy_with_seed | Explicit seed | `RandomPolicy(seed=42)` | Constructs successfully | happy |
| test_random_policy_none_seed | Explicit None seed | `RandomPolicy(seed=None)` | Constructs successfully | happy |
**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "random_policy_init or random_policy_default or random_policy_with_seed or random_policy_none"`
### 1.5 RandomPolicy.select_action
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_random_policy_explores_when_budget_gt_1 | Budget > 1 means exploration | Observation with `budget_remaining=10` | Returns SQLAction with `action_type` in `{"DESCRIBE", "SAMPLE", "QUERY"}` | happy |
| test_random_policy_answers_when_budget_eq_1 | Budget == 1 forces ANSWER | Observation with `budget_remaining=1` | Returns SQLAction with `action_type == "ANSWER"` | happy |
| test_random_policy_returns_sql_action | Return type is correct | Any valid observation | `isinstance(result, SQLAction)` | happy |
| test_random_policy_deterministic_with_seed | Same seed produces same actions | Two RandomPolicy(seed=42) with identical observations | Same sequence of actions | happy |
| test_random_policy_varies_without_seed | Different runs produce different actions (probabilistic) | Multiple calls without seed | Not all actions identical (run 50 times) | edge |
| test_random_policy_explores_all_action_types | Over many calls, all exploration types appear | Run 100 times with budget > 1 | DESCRIBE, SAMPLE, and QUERY each appear at least once | edge |
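**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "random_policy_explores or random_policy_answers or random_policy_returns or random_policy_deterministic or random_policy_varies"`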
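The budget rule and the seed behavior tested above can be sketched as follows. `SQLAction` and `Observation` are hypothetical stand-ins that expose only the fields the tests mention, and treating any budget of 1 or less as "answer now" is an assumption (the spec only defines the `budget_remaining == 1` case).

```python
import random
from dataclasses import dataclass
from typing import Optional

EXPLORE_TYPES = ("DESCRIBE", "SAMPLE", "QUERY")


@dataclass(frozen=True)
class SQLAction:
    """Hypothetical stand-in; the real SQLAction may carry more fields."""
    action_type: str
    argument: str = ""


@dataclass
class Observation:
    """Hypothetical stand-in exposing only budget_remaining."""
    budget_remaining: int


class RandomPolicy:
    def __init__(self, seed: Optional[int] = None) -> None:
        # A private Random instance keeps the policy independent of the
        # global random state, which is what makes seeding reproducible.
        self._rng = random.Random(seed)

    def select_action(self, observation: Observation) -> SQLAction:
        if observation.budget_remaining > 1:
            return SQLAction(self._rng.choice(EXPLORE_TYPES))
        # Last step of the budget (assumed: <= 1): commit to an answer.
        return SQLAction("ANSWER")


obs = Observation(budget_remaining=10)
a, b = RandomPolicy(seed=42), RandomPolicy(seed=42)
same_sequence = [a.select_action(obs) for _ in range(20)] == [
    b.select_action(obs) for _ in range(20)
]
final = RandomPolicy(seed=42).select_action(Observation(budget_remaining=1))
```

Two policies built with the same seed draw from identical generator states, so `same_sequence` holds regardless of what other code does with the module-level `random` functions.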
### 1.6 evaluate()
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_evaluate_happy_path | Run N episodes successfully | `evaluate(env, policy, n_episodes=5)` | Returns EvaluationResult with `n_episodes=5, n_completed=5` | happy |
| test_evaluate_returns_evaluation_result | Return type correct | Any valid call | `isinstance(result, EvaluationResult)` | happy |
| test_evaluate_default_n_episodes | Default is 100 | `evaluate(env, policy)` | `result.n_episodes == 100` | happy |
| test_evaluate_n_episodes_zero | Zero episodes | `evaluate(env, policy, n_episodes=0)` | `EvaluationResult` with all zeros, empty episodes list | edge |
| test_evaluate_negative_n_episodes | Negative episodes | `evaluate(env, policy, n_episodes=-1)` | Raises `ValueError` | error |
| test_evaluate_success_rate_calculation | Correct fraction | Policy that answers correctly 3 out of 5 times | `success_rate == 0.6` | happy |
| test_evaluate_avg_reward_calculation | Mean reward correct | Known rewards per episode | `avg_reward` matches manual calculation | happy |
| test_evaluate_avg_steps_calculation | Mean steps correct | Known steps per episode | `avg_steps` matches manual calculation | happy |
| test_evaluate_episodes_list_length | Per-episode breakdown | `n_episodes=5` | `len(result.episodes) == 5` | happy |
| test_evaluate_episode_indices | 0-based episode indices | `n_episodes=3` | `[e.episode_index for e in result.episodes] == [0, 1, 2]` | happy |
| test_evaluate_seed_determinism | Same seed produces same results | Two calls with `seed=42, n_episodes=10` | Both EvaluationResults have identical `success_rate, avg_reward, avg_steps` | happy |
| test_evaluate_seed_per_episode | Episode i uses seed+i | `seed=100, n_episodes=3` | env.reset called with seeds 100, 101, 102 (verify via mock) | happy |
| test_evaluate_no_seed_variation | No seed allows variation | Two calls without seed | Both calls succeed; results are not required to match (non-deterministic) | edge |
| test_evaluate_n_episodes_one | Single episode | `n_episodes=1` | Valid result with 1 episode | edge |
| test_evaluate_large_n_episodes | Large run | `n_episodes=500` | Completes without error, correct counts | edge |
**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "test_evaluate"`
### 1.7 evaluate() -- Error Handling
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_evaluate_episode_exception_recorded | Exception during episode is caught | Policy that raises on episode 2 | Episode 2 has `correct=False, total_reward=0.0, steps=0, error=<message>` | error |
| test_evaluate_continues_after_exception | Failed episode does not stop evaluation | Exception on episode 1 of 5 | `n_episodes=5`, all 5 episodes in result | error |
| test_evaluate_n_completed_excludes_errors | n_completed counts only episodes that finished without an exception | 2 out of 5 episodes raise | `n_completed == 3` | error |
| test_evaluate_averages_exclude_failed | avg_reward/avg_steps from completed episodes only | 3 completed with known values, 2 failed | Averages match only the 3 completed | error |
| test_evaluate_env_reset_exception | Exception during env.reset() | Mock env.reset() to raise on episode 3 | Episode 3 recorded with error, others complete | error |
| test_evaluate_policy_exception | Exception during select_action() | Mock policy.select_action() to raise | Episode recorded with error, evaluation continues | error |
| test_evaluate_env_step_exception | Exception during env.step() | Mock env.step() to raise | Episode recorded with error, evaluation continues | error |
| test_evaluate_all_episodes_fail | Every episode fails | Policy that always raises | `n_completed=0`, `success_rate=0.0`, `avg_reward=0.0`, `avg_steps=0.0` | error |
**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "test_evaluate and (exception or error or fail)"`
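The behaviors listed in sections 1.6 and 1.7 can be sketched together in one loop. This is an illustration under stated assumptions, not the F005 implementation: the `env.step()` return contract `(observation, reward, done, correct)` is invented for the example, `StubEnv` and `StubPolicy` are hypothetical, and `success_rate` is assumed to divide by `n_episodes` (the spec only states that `avg_reward` and `avg_steps` exclude failed episodes).

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class EpisodeResult:
    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None


@dataclass(frozen=True)
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    n_episodes: int
    n_completed: int
    episodes: List[EpisodeResult]


def evaluate(env: Any, policy: Any, n_episodes: int = 100,
             seed: Optional[int] = None,
             progress_callback: Optional[Callable[[int, int], None]] = None,
             ) -> EvaluationResult:
    if n_episodes < 0:
        raise ValueError("n_episodes must be non-negative")
    episodes: List[EpisodeResult] = []
    for i in range(n_episodes):
        try:
            # Episode i is seeded with seed + i when a base seed is given.
            obs = env.reset(seed=None if seed is None else seed + i)
            total, steps, done, correct = 0.0, 0, False, False
            while not done:
                # ASSUMED contract: step() -> (obs, reward, done, correct).
                obs, reward, done, correct = env.step(policy.select_action(obs))
                total += reward
                steps += 1
            episodes.append(EpisodeResult(i, correct, total, steps))
        except Exception as exc:
            # Record the failure with zeroed metrics and keep evaluating.
            episodes.append(EpisodeResult(i, False, 0.0, 0, str(exc)))
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)
    completed = [e for e in episodes if e.error is None]
    n = len(completed)
    return EvaluationResult(
        # ASSUMPTION: success_rate is taken over all requested episodes.
        success_rate=sum(e.correct for e in episodes) / n_episodes if n_episodes else 0.0,
        avg_reward=sum(e.total_reward for e in completed) / n if n else 0.0,
        avg_steps=sum(e.steps for e in completed) / n if n else 0.0,
        n_episodes=n_episodes,
        n_completed=n,
        episodes=episodes,
    )


class StubEnv:
    """Hypothetical one-step environment whose reset fails on odd seeds."""

    def reset(self, seed: Optional[int] = None) -> None:
        if seed is not None and seed % 2:
            raise RuntimeError("reset failed")
        return None

    def step(self, action: Any):
        return None, 1.0, True, True


class StubPolicy:
    def select_action(self, observation: Any) -> str:
        return "ANSWER"


result = evaluate(StubEnv(), StubPolicy(), n_episodes=5, seed=0)
```

With `seed=0` the stub receives seeds 0 through 4, so episodes 1 and 3 are recorded with an error while the other three complete. Guarding every division against a zero denominator covers both the zero-episode case and the case where every episode fails.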
### 1.8 evaluate() -- Progress Callback
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_evaluate_progress_callback_called | Callback receives updates | Mock callback, `n_episodes=5` | Callback called with `(1,5), (2,5), (3,5), (4,5), (5,5)` | happy |
| test_evaluate_no_callback | None callback is fine | `progress_callback=None` | No error | happy |
| test_evaluate_callback_receives_correct_total | Total matches n_episodes | `n_episodes=10` | Every callback call has `total=10` | happy |
**Run:** `uv run pytest tests/unit/test_evaluation.py -v -k "callback"`
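One way to assert the exact call sequence is `unittest.mock.Mock` together with `call_args_list`. In this sketch, `run_with_progress` is a hypothetical stand-in for `evaluate()` reduced to its callback behavior; the positional `(current, total)` signature follows the table above.

```python
from typing import Callable, Optional
from unittest.mock import Mock, call


def run_with_progress(
    n_episodes: int,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> None:
    """Hypothetical stand-in for evaluate(): only the callback behavior."""
    for i in range(n_episodes):
        # ... the episode itself would run here ...
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)  # 1-based current, fixed total


callback = Mock()
run_with_progress(5, progress_callback=callback)

# The mock records each invocation in order, so the whole sequence can be
# compared in a single assertion.
expected_calls = [call(i, 5) for i in range(1, 6)]
sequence_matches = callback.call_args_list == expected_calls
```

Comparing the whole `call_args_list` checks ordering, count, and the `total` argument at once, which covers both `test_evaluate_progress_callback_called` and `test_evaluate_callback_receives_correct_total`.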
---
## 2. Integration Tests
### Flow: Full Evaluation with RandomPolicy
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Create SQLEnvironment with test DB and questions | Environment loads successfully | `len(env.questions) > 0` |
| 2 | Create `RandomPolicy(seed=42)` | Policy created | Object has `select_action` method |
| 3 | Call `evaluate(env, RandomPolicy(seed=42), n_episodes=10, seed=0)` | Returns EvaluationResult | `result.n_episodes == 10` |
| 4 | Verify all episodes recorded | Per-episode breakdown present | `len(result.episodes) == 10` |
| 5 | Verify aggregate metrics are consistent | success_rate matches manual count | `result.success_rate == pytest.approx(sum(e.correct for e in result.episodes) / 10)` |
| 6 | Verify avg_reward consistent | avg_reward matches manual mean | `result.avg_reward == pytest.approx(mean(e.total_reward for e in result.episodes if e.error is None))` |
| 7 | Repeat the call with the same seeds | Identical results | `success_rate`, `avg_reward`, and `avg_steps` are equal across both runs |
**Run:** `uv run pytest tests/integration/test_evaluation_integration.py -v -k "full_evaluation"`
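The consistency checks in steps 5 and 6 amount to recomputing the aggregates from the per-episode list. A hedged sketch of such a helper follows; `aggregates_consistent` is a hypothetical name, the `SimpleNamespace` objects stand in for real `EvaluationResult` and `EpisodeResult` instances, and dividing `success_rate` by `n_episodes` mirrors the `/ 10` in step 5.

```python
import math
from statistics import mean
from types import SimpleNamespace


def aggregates_consistent(result) -> bool:
    """Recompute the aggregates from result.episodes and compare."""
    completed = [e for e in result.episodes if e.error is None]
    expected_rate = (
        sum(e.correct for e in result.episodes) / result.n_episodes
        if result.n_episodes
        else 0.0
    )
    expected_reward = mean(e.total_reward for e in completed) if completed else 0.0
    return (
        math.isclose(result.success_rate, expected_rate)
        and math.isclose(result.avg_reward, expected_reward)
        and result.n_completed == len(completed)
    )


# Stand-in data: two completed episodes and one that raised.
episodes = [
    SimpleNamespace(correct=True, total_reward=1.0, error=None),
    SimpleNamespace(correct=False, total_reward=0.5, error=None),
    SimpleNamespace(correct=False, total_reward=0.0, error="boom"),
]
good = SimpleNamespace(success_rate=1 / 3, avg_reward=0.75,
                       n_episodes=3, n_completed=2, episodes=episodes)
bad = SimpleNamespace(success_rate=1 / 3, avg_reward=0.5,
                      n_episodes=3, n_completed=2, episodes=episodes)
```

Using a tolerance-based comparison instead of `==` keeps the check robust if the implementation and the test sum the rewards in a different order.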
### Flow: Evaluation with Partial Failures
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Create environment and a policy that fails on specific episodes | Setup complete | -- |
| 2 | Call `evaluate(env, flaky_policy, n_episodes=5)` | Returns result with mix of successes and failures | `result.n_completed < result.n_episodes` |
| 3 | Inspect failed episodes | Have error field set | `any(e.error is not None for e in result.episodes)` |
| 4 | Inspect successful episodes | Have error=None | Completed episodes have `error is None` and valid metrics |
**Run:** `uv run pytest tests/integration/test_evaluation_integration.py -v -k "partial_failure"`
### Flow: Zero Episodes
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Call `evaluate(env, policy, n_episodes=0)` | Returns zero-state result | All aggregate values are 0.0, episodes list is empty |
**Run:** `uv run pytest tests/integration/test_evaluation_integration.py -v -k "zero_episodes"`
---
## 3. API Tests
No API endpoints defined for F005. This section is intentionally empty.
---
## 4. E2E Tests
### Scenario: Single-Command Evaluation of Random Baseline
**Setup:** SQLEnvironment initialized with Spider-format test database and questions file.
**Actions:** Call `evaluate(env, RandomPolicy(seed=42), n_episodes=20, seed=0)` and inspect output.
**Expected:**
- Returns EvaluationResult with `n_episodes=20`
- `success_rate` is a float in [0.0, 1.0]
- `avg_reward` is a float
- `avg_steps` is a positive float
- `n_completed == 20` (no errors with valid env + RandomPolicy)
- All 20 EpisodeResult entries present with valid fields
- Deterministic: re-running with same seeds yields identical results
**Run:** `uv run pytest tests/e2e/test_evaluation_e2e.py -v`
### Scenario: Comparison of Two Policies
**Setup:** SQLEnvironment with test data.
**Actions:**
1. Evaluate RandomPolicy(seed=1) over 20 episodes
2. Evaluate an "always answer immediately" policy over 20 episodes
3. Compare results
**Expected:**
- Both return valid EvaluationResult
- Results are structurally comparable (same fields)
- At least one aggregate metric (for example `avg_steps`) differs between the two policies
**Run:** `uv run pytest tests/e2e/test_evaluation_e2e.py -v -k "comparison"`
---
## 5. Edge Cases Checklist
- [ ] n_episodes = 0 returns zero-valued EvaluationResult with empty episodes list
- [ ] n_episodes = -1 raises ValueError immediately
- [ ] n_episodes = 1 works correctly (single episode)
- [ ] All episodes fail -- n_completed=0, averages are 0.0, success_rate is 0.0
- [ ] Exception during env.reset() is caught and recorded
- [ ] Exception during policy.select_action() is caught and recorded
- [ ] Exception during env.step() is caught and recorded
- [ ] RandomPolicy with budget_remaining=1 always returns ANSWER
- [ ] RandomPolicy with budget_remaining > 1 never returns ANSWER
- [ ] Seed determinism: same seed + same n_episodes = identical EvaluationResult
- [ ] Per-episode seeding: episode i uses seed+i for env.reset()
- [ ] Progress callback receives (current, total) for each episode
- [ ] Progress callback=None does not cause errors
- [ ] EpisodeResult and EvaluationResult are frozen (immutable)
- [ ] Large n_episodes (500+) completes without memory issues
- [ ] success_rate is always in [0.0, 1.0]
- [ ] avg_reward and avg_steps computed only from completed (non-error) episodes
---
## 6. Evidence Requirements
| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed` from `uv run pytest tests/unit/test_evaluation.py -v` |
| Integration | pytest output | `X passed` from `uv run pytest tests/integration/test_evaluation_integration.py -v` |
| E2E | pytest output | `X passed` from `uv run pytest tests/e2e/test_evaluation_e2e.py -v` |
| Edge cases | pytest output | All edge-case tests in checklist pass |
| Determinism | pytest output | Seed-based tests produce identical results across runs |