# Verification Specification

**Feature:** F003
**Generated from:** specs/F003-VERIFICATION_INPUT.json
**Generated:** 2026-03-27

---

## 1. Unit Tests

### EpisodeContext (Type Extension)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_episode_context_has_gold_rows | New field exists and defaults | `EpisodeContext(...)` | `gold_rows` is `[]` | happy |
| test_episode_context_has_query_hashes | New field exists and defaults | `EpisodeContext(...)` | `query_hashes` is `set()` | happy |
| test_episode_context_has_best_progress | New field exists and defaults | `EpisodeContext(...)` | `best_progress` is `0.0` | happy |
| test_episode_context_has_cumulative_step_reward | New field exists and defaults | `EpisodeContext(...)` | `cumulative_step_reward` is `0.0` | happy |
| test_episode_context_has_cumulative_new_info_reward | New field exists and defaults | `EpisodeContext(...)` | `cumulative_new_info_reward` is `0.0` | happy |
| test_episode_context_gold_rows_accepts_tuples | Field stores tuple list | `gold_rows=[(1, "a"), (2, "b")]` | Stored correctly | happy |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "episode_context"`
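
The defaults in the table above can be illustrated with a minimal sketch. It assumes `EpisodeContext` is a dataclass and shows only the F003 tracking fields; the element type of `query_hashes` (`str`) is an assumption, and the real class presumably carries other fields that are omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeContext:
    # Only the F003 tracking fields are sketched; other fields are omitted.
    # Mutable defaults use default_factory so episodes never share state.
    gold_rows: list[tuple] = field(default_factory=list)
    query_hashes: set[str] = field(default_factory=set)  # element type assumed
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0
    cumulative_new_info_reward: float = 0.0


a, b = EpisodeContext(), EpisodeContext()
a.query_hashes.add("h1")  # must not leak into b
with_gold = EpisodeContext(gold_rows=[(1, "a"), (2, "b")])
```

Using `default_factory` rather than a literal `[]` or `set()` keeps each episode's tracking fields independent, which is also what the concurrent-episodes edge case relies on.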

---

### _cardinality_score

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_cardinality_exact_match | Same row count | `pred=[(1,),(2,)], gold=[(3,),(4,)]` | `1.0` | happy |
| test_cardinality_zero_pred | Empty prediction | `pred=[], gold=[(1,)]` | `0.0` | edge |
| test_cardinality_zero_gold | Empty gold | `pred=[(1,)], gold=[]` | `0.0` | edge |
| test_cardinality_both_empty | Both empty | `pred=[], gold=[]` | `1.0` (0/max(0,0,1)=0, 1-0=1) | edge |
| test_cardinality_pred_larger | More pred rows | `pred=[(i,) for i in range(10)], gold=[(1,)]` | `0.1` (1-9/10) | boundary |
| test_cardinality_gold_larger | More gold rows | `pred=[(1,)], gold=[(i,) for i in range(4)]` | `0.25` (1-3/4) | boundary |
| test_cardinality_returns_float_in_range | Any input | Various | Result in `[0.0, 1.0]` | invariant |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "cardinality"`
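
The expected values above follow from the formula hinted at in the table, `1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1)`. A minimal sketch, assuming that formula is the whole computation (the function name here drops the leading underscore and is illustrative):

```python
def cardinality_score(pred, gold):
    """Row-count similarity in [0.0, 1.0]; only lengths matter, not values."""
    diff = abs(len(pred) - len(gold))
    # The trailing 1 guards against division by zero when both are empty.
    return 1.0 - diff / max(len(pred), len(gold), 1)


same = cardinality_score([(1,), (2,)], [(3,), (4,)])
both_empty = cardinality_score([], [])
pred_larger = cardinality_score([(i,) for i in range(10)], [(1,)])
gold_larger = cardinality_score([(1,)], [(i,) for i in range(4)])
```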

---

### _value_overlap_score

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_value_overlap_identical | Same rows | `pred=[(1,"a")], gold=[(1,"a")]` | `1.0` | happy |
| test_value_overlap_disjoint | No shared values | `pred=[(1,"x")], gold=[(2,"y")]` | `0.0` | edge |
| test_value_overlap_partial | Some overlap | `pred=[(1,"a"),(2,"b")], gold=[(1,"a"),(3,"c")]` | Jaccard of `{"1","a","2","b"}` vs `{"1","a","3","c"}` = 2/6 ~ 0.333 | happy |
| test_value_overlap_empty_pred | No pred rows | `pred=[], gold=[(1,)]` | `0.0` | edge |
| test_value_overlap_empty_gold | No gold rows | `pred=[(1,)], gold=[]` | `0.0` | edge |
| test_value_overlap_both_empty | Both empty | `pred=[], gold=[]` | `0.0` (empty Jaccard) or `1.0` (by convention); the test pins whichever the implementation chooses | edge |
| test_value_overlap_stringifies_values | Mixed types | `pred=[(1, 2.5, None)], gold=[(1, 2.5, None)]` | `1.0` (all stringify to same) | edge |
| test_value_overlap_returns_float_in_range | Any input | Various | Result in `[0.0, 1.0]` | invariant |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "value_overlap"`
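
The Jaccard computation behind these cases can be sketched as follows. This assumes every cell is stringified with `str()` and pooled into one set per side, as the partial-overlap row suggests; returning `0.0` when both sides are empty is an assumption, since the table leaves that case open.

```python
def value_overlap_score(pred, gold):
    """Jaccard similarity over the stringified cell values of both results."""
    pred_vals = {str(v) for row in pred for v in row}
    gold_vals = {str(v) for row in gold for v in row}
    union = pred_vals | gold_vals
    if not union:
        return 0.0  # assumption: the spec allows 0.0 or 1.0 here
    return len(pred_vals & gold_vals) / len(union)


# Shared values are {"1", "a"}; the union has six values, so 2/6.
partial = value_overlap_score([(1, "a"), (2, "b")], [(1, "a"), (3, "c")])
mixed = value_overlap_score([(1, 2.5, None)], [(1, 2.5, None)])
```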

---

### _numeric_range_score

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_numeric_range_identical | Same numbers | `pred=[(10,)], gold=[(10,)]` | `1.0` | happy |
| test_numeric_range_no_numerics_in_gold | Only strings in gold | `pred=[("a",)], gold=[("b",)]` | `1.0` (spec: returns 1.0 if no numerics in gold) | edge |
| test_numeric_range_close_values | Near match | `pred=[(11,)], gold=[(10,)]` | ~ 0.59 (1/(1+log(1+1)), natural log) | happy |
| test_numeric_range_far_values | Very different | `pred=[(1000000,)], gold=[(1,)]` | Near 0.0 (~ 0.07 with natural log) | boundary |
| test_numeric_range_zero_distance | Exact match numerics | `pred=[(0,)], gold=[(0,)]` | `1.0` (1/(1+log(1+0))=1) | edge |
| test_numeric_range_negative_numbers | Negative values | `pred=[(-5,)], gold=[(5,)]` | Uses absolute difference `abs(-5 - 5) = 10` | edge |
| test_numeric_range_mixed_types | Some numeric some not | `pred=[(10,"a")], gold=[(10,"b")]` | Score based only on numeric columns | edge |
| test_numeric_range_empty_pred | No pred rows | `pred=[], gold=[(1,)]` | No exception raised; likely `0.0` | edge |
| test_numeric_range_returns_float_in_range | Any input | Various | Result in `[0.0, 1.0]` | invariant |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "numeric_range"`
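
The expected values above are consistent with a per-value score of `1 / (1 + log(1 + d))`, where `d` is the absolute difference and `log` is the natural logarithm (the ~0.59 figure only works out with base e). The sketch below covers that single-pair formula only; how the real `_numeric_range_score` pairs and aggregates values across rows and columns is not specified here, so `log_distance_score` is a hypothetical helper.

```python
import math


def log_distance_score(pred_value, gold_value):
    """Score one numeric pair: 1.0 for an exact match, decaying with distance."""
    distance = abs(pred_value - gold_value)
    return 1.0 / (1.0 + math.log(1.0 + distance))


exact = log_distance_score(10, 10)        # distance 0
close = log_distance_score(11, 10)        # distance 1
negative = log_distance_score(-5, 5)      # distance 10
far = log_distance_score(1_000_000, 1)    # distance 999999
```

Because the decay is logarithmic, the score falls quickly for small errors and flattens out for very large ones, which is worth keeping in mind when choosing assertion tolerances for the boundary tests.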

---

### _bin_progress

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_bin_progress_zero | Score 0.0 | `0.0` | `0.0` (below 0.125) | boundary |
| test_bin_progress_low | Score 0.124 | `0.124` | `0.0` | boundary |
| test_bin_progress_boundary_0125 | Score exactly 0.125 | `0.125` | `0.25` | boundary |
| test_bin_progress_mid_low | Score 0.3 | `0.3` | `0.25` (between 0.125 and 0.375) | happy |
| test_bin_progress_boundary_0375 | Score exactly 0.375 | `0.375` | `0.5` | boundary |
| test_bin_progress_mid | Score 0.5 | `0.5` | `0.5` (between 0.375 and 0.625) | happy |
| test_bin_progress_boundary_0625 | Score exactly 0.625 | `0.625` | `0.75` | boundary |
| test_bin_progress_mid_high | Score 0.7 | `0.7` | `0.75` | happy |
| test_bin_progress_boundary_0875 | Score exactly 0.875 | `0.875` | `1.0` | boundary |
| test_bin_progress_one | Score 1.0 | `1.0` | `1.0` | boundary |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "bin_progress"`
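
The table implies five bins with thresholds at 0.125, 0.375, 0.625, and 0.875, where a score exactly on a threshold moves to the upper bin. One way to sketch this, assuming nothing beyond the table, is a `bisect_right` lookup:

```python
import bisect

# Thresholds taken from the boundary rows of the table above.
THRESHOLDS = [0.125, 0.375, 0.625, 0.875]


def bin_progress(score):
    """Snap a raw progress score to one of 0.0, 0.25, 0.5, 0.75, 1.0."""
    # bisect_right counts thresholds <= score, so exact boundaries round up.
    return bisect.bisect_right(THRESHOLDS, score) * 0.25


binned = {s: bin_progress(s) for s in (0.0, 0.124, 0.125, 0.3, 0.5, 0.875, 1.0)}
```

A side effect of this formulation is that inputs outside `[0, 1]` land in the nearest end bin instead of raising, which is one possible answer to the out-of-range item in the edge cases checklist.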

---

### _layer1_operational

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_layer1_successful_query | exec_ok + step_cost | `action_type="QUERY", rows=[(1,)], error=None, new sql` | `+0.02 - 0.005 = +0.015` (plus possible new_info) | happy |
| test_layer1_successful_describe | exec_ok + step_cost | `action_type="DESCRIBE", rows=..., error=None` | `+0.02 - 0.005 = +0.015` | happy |
| test_layer1_successful_sample | exec_ok + step_cost | `action_type="SAMPLE", rows=..., error=None` | `+0.02 - 0.005 = +0.015` | happy |
| test_layer1_error_query | step_cost only | `error="some error", rows=None` | `-0.005` | error |
| test_layer1_new_info_reward | First unique SQL | `new sql hash, rows not None` | Includes `+0.01` new_info | happy |
| test_layer1_new_info_capped | Cap at 0.10 | Execute 11+ unique queries | `cumulative_new_info_reward` does not exceed `0.10` | boundary |
| test_layer1_repeat_penalty | Same SQL twice | Submit same SQL hash twice | Second call includes `-0.01` repeat | error |
| test_layer1_repeat_no_exec_ok | Repeated query skips exec_ok | Same SQL hash as before | No `+0.02` bonus | edge |
| test_layer1_step_cost_always_applied | Step cost on every call | Any action | Always includes `-0.005` | invariant |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "layer1"`
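
The Layer 1 rules above (step cost always, `exec_ok` on success, a capped `new_info` bonus, and a repeat penalty that replaces `exec_ok`) can be sketched as below. Several details are assumptions: the `Ctx` stand-in class, SHA-256 as the hash, recording the hash of failed queries, and handling only SQL-bearing actions. The constants come from the table.

```python
import hashlib
from dataclasses import dataclass, field

EXEC_OK, STEP_COST, REPEAT_PENALTY = 0.02, 0.005, 0.01
NEW_INFO, NEW_INFO_CAP = 0.01, 0.10


@dataclass
class Ctx:  # hypothetical stand-in for the relevant EpisodeContext fields
    query_hashes: set = field(default_factory=set)
    cumulative_new_info_reward: float = 0.0


def layer1_operational(ctx, sql, rows, error):
    reward = -STEP_COST  # applied on every call
    sql_hash = hashlib.sha256(sql.encode("utf-8")).hexdigest()  # hash choice assumed
    if sql_hash in ctx.query_hashes:
        return reward - REPEAT_PENALTY  # repeats earn no exec_ok or new_info
    ctx.query_hashes.add(sql_hash)
    if error is None:
        reward += EXEC_OK
    if rows is not None:
        # Grant at most what is left under the cap, never a negative bonus.
        bonus = max(0.0, min(NEW_INFO, NEW_INFO_CAP - ctx.cumulative_new_info_reward))
        ctx.cumulative_new_info_reward += bonus
        reward += bonus
    return reward


ctx = Ctx()
first = layer1_operational(ctx, "SELECT 1", [(1,)], None)
repeat = layer1_operational(ctx, "SELECT 1", [(1,)], None)
failed = layer1_operational(ctx, "SELECT nope", None, "no such column")
```

Under these assumptions the first call earns `-0.005 + 0.02 + 0.01`, the repeat earns `-0.005 - 0.01`, and the failed query earns only the step cost. Because the sums are floating point, tests should compare with a tolerance such as `pytest.approx`.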

---

### _layer2_progress

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_layer2_perfect_match | All sub-metrics = 1.0 | `rows == gold_rows` (exact match) | Binned 1.0, improvement from 0 = 1.0, scaled by 0.15 = `0.15` | happy |
| test_layer2_no_improvement | Same binned score as best | Second identical query | `0.0` (no improvement over best_progress) | edge |
| test_layer2_improvement_only | New bin > best | First query close, second closer | Reward = `(new_bin - best_progress) * 0.15` | happy |
| test_layer2_empty_gold_rows | Gold is empty | `ctx.gold_rows = []` | `0.0` | edge |
| test_layer2_weighted_average | Check weight formula | Known sub-metric values | `0.25*card + 0.50*overlap + 0.25*numeric` | happy |
| test_layer2_updates_best_progress | Mutates ctx | Query improves progress | `ctx.best_progress` updated to new bin | happy |
| test_layer2_does_not_downgrade_best | Worse query after good | Good query then bad query | `ctx.best_progress` stays at higher value | edge |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "layer2"`
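
The improvement-only behavior described above can be sketched with the sub-metric scores passed in directly. The weights (0.25, 0.50, 0.25), the 0.15 scale, and the bin thresholds come from this spec; the `ProgressState` class and the function name are hypothetical, and the empty-gold short circuit is left out because this sketch never sees the rows.

```python
import bisect
from dataclasses import dataclass

PROGRESS_SCALE = 0.15
THRESHOLDS = [0.125, 0.375, 0.625, 0.875]


@dataclass
class ProgressState:  # hypothetical stand-in for EpisodeContext.best_progress
    best_progress: float = 0.0


def layer2_from_submetrics(state, card, overlap, numeric):
    raw = 0.25 * card + 0.50 * overlap + 0.25 * numeric
    binned = bisect.bisect_right(THRESHOLDS, raw) * 0.25
    if binned <= state.best_progress:
        return 0.0  # no reward unless the best bin so far is beaten
    reward = (binned - state.best_progress) * PROGRESS_SCALE
    state.best_progress = binned  # only ever moves upward
    return reward


state = ProgressState()
mid_query = layer2_from_submetrics(state, 0.5, 0.5, 0.5)    # raw 0.5, bin 0.5
worse_query = layer2_from_submetrics(state, 0.1, 0.1, 0.1)  # bin 0.0, no reward
perfect = layer2_from_submetrics(state, 1.0, 1.0, 1.0)      # bin 1.0
```

Rewarding only the increase over `best_progress` means the Layer 2 total over an episode can never exceed `0.15`, regardless of how many queries are issued.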

---

### compute_step_reward

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_compute_reward_query_success | Layer 1 + Layer 2 combined | QUERY with valid rows, gold_rows set | Sum of L1 + L2, clamped | happy |
| test_compute_reward_query_error | Layer 1 only, no Layer 2 | QUERY with error | `-0.005` (step_cost only) | error |
| test_compute_reward_describe | Layer 1 only, no Layer 2 | DESCRIBE action | L1 signal only | happy |
| test_compute_reward_sample | Layer 1 only, no Layer 2 | SAMPLE action | L1 signal only | happy |
| test_compute_reward_clamp_upper | Cumulative capped at +0.5 | Many successful improving queries | Cumulative never exceeds `+0.5` | boundary |
| test_compute_reward_clamp_lower | Cumulative floored at -0.2 | Many errors in a row | Cumulative never goes below `-0.2` | boundary |
| test_compute_reward_clamp_returns_delta | Step reward reflects clamp | Cumulative at 0.49, next step would add 0.05 | Returns `0.01` (clamped to 0.5) | boundary |
| test_compute_reward_mutates_ctx | Updates tracking fields | Any call | `ctx.cumulative_step_reward` updated | happy |
| test_compute_reward_layer2_skipped_for_describe | No progress calc for non-QUERY | DESCRIBE with rows | Layer 2 not called | happy |
| test_compute_reward_layer2_skipped_when_rows_none | No progress calc on error | QUERY, rows=None | Layer 2 not called | edge |
| test_compute_reward_layer2_skipped_empty_gold | No progress with empty gold | QUERY, gold_rows=[] | Layer 2 returns 0.0 | edge |

**Run:** `uv run pytest tests/unit/test_reward.py -v -k "compute_reward"`
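
The clamp rows above describe clamping the cumulative total to `[-0.2, +0.5]` and returning the step reward as the resulting delta. A minimal sketch of that arithmetic, with a hypothetical helper name and a tuple return chosen for illustration:

```python
REWARD_MIN, REWARD_MAX = -0.2, 0.5


def clamp_step_reward(cumulative, raw_step):
    """Return (step_delta, new_cumulative) after clamping the running total."""
    new_total = min(REWARD_MAX, max(REWARD_MIN, cumulative + raw_step))
    return new_total - cumulative, new_total


# Cumulative at 0.49 and a raw step of 0.05: only 0.01 of headroom remains.
upper_delta, upper_total = clamp_step_reward(0.49, 0.05)
lower_delta, lower_total = clamp_step_reward(-0.199, -0.005)
free_delta, free_total = clamp_step_reward(0.1, 0.015)
```

Returning the delta rather than the raw step keeps the sum of reported step rewards equal to the clamped cumulative value, which is what `test_compute_reward_clamp_returns_delta` checks.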

---

## 2. Integration Tests

### Flow: Primary Reward Computation Through step()

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `env.reset(seed=42)` | Episode created, `gold_rows` populated from gold SQL | `ctx.gold_rows` is non-empty list of tuples |
| 2 | `env.step(DESCRIBE employees)` | Step reward from Layer 1 only | `observation.reward` is None (non-terminal), but internal reward tracked |
| 3 | `env.step(QUERY "SELECT COUNT(*) FROM employees")` | Layer 1 + Layer 2 computed | Progress score reflects cardinality/value/numeric comparison to gold |
| 4 | `env.step(QUERY same_sql_again)` | Repeat penalty applied | Lower reward than step 3 |
| 5 | `env.step(ANSWER correct_value)` | Terminal reward = 1.0 | `observation.done=True, observation.reward=1.0` |

**Run:** `uv run pytest tests/integration/test_reward_flow.py -v`

---

### Flow: SQL Error Handling

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `env.reset(seed=42)` | Episode active | Episode context initialized |
| 2 | `env.step(QUERY "SELECT nonexistent FROM employees")` | Error caught, step_cost only | Reward is `-0.005`, Layer 2 not computed |
| 3 | `env.step(QUERY valid_query)` | Normal reward resumes | Layer 1 + Layer 2 computed normally |

**Run:** `uv run pytest tests/integration/test_reward_flow.py -v -k "error"`

---

### Flow: Empty Gold Rows

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question whose gold SQL returns empty | `ctx.gold_rows == []` | gold_rows stored as empty list |
| 2 | `env.step(QUERY any_query)` | Layer 1 operates, Layer 2 returns 0.0 | Reward is Layer 1 signal only |

**Run:** `uv run pytest tests/integration/test_reward_flow.py -v -k "empty_gold"`

---

### Flow: Repeated Query Detection

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | `env.reset(seed=42)` | Fresh episode | `ctx.query_hashes` is empty |
| 2 | `env.step(QUERY "SELECT 1")` | Hash added, no repeat penalty | `ctx.query_hashes` has 1 entry |
| 3 | `env.step(QUERY "SELECT 1")` | Same hash detected, repeat penalty | Reward includes `-0.01`, no exec_ok |
| 4 | `env.step(QUERY "SELECT 2")` | New hash, no repeat penalty | Normal reward, `ctx.query_hashes` has 2 entries |

**Run:** `uv run pytest tests/integration/test_reward_flow.py -v -k "repeat"`
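
The detection flow above amounts to a set-membership check on a hash of the SQL text. The sketch below assumes SHA-256 and offers an optional whitespace normalization; both are illustrative choices, since this spec does not say which hash is used or whether SQL is normalized before hashing.

```python
import hashlib


def sql_hash(sql, normalize=False):
    """Hash SQL text; optionally collapse runs of whitespace first."""
    # Note: collapsing whitespace also alters spacing inside string literals.
    text = " ".join(sql.split()) if normalize else sql
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


seen = set()
repeats = []
for sql in ["SELECT 1", "SELECT 1", "SELECT 2"]:
    h = sql_hash(sql)
    repeats.append(h in seen)  # True only when this exact text was seen before
    seen.add(h)
```

Without normalization, `"SELECT  1"` and `"SELECT 1"` count as different queries; with `normalize=True` they collide, so the whitespace item in the edge cases checklist should be tested against whichever behavior the implementation adopts.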

---

## 3. API Tests

No API endpoints defined for F003. The reward system is internal server-side logic.

---

## 4. E2E Tests

### Scenario: Random Exploration Yields ~0.1 Cumulative Reward

**Setup:** Environment reset with a known question.
**Actions:** Execute 10 random DESCRIBE/SAMPLE/QUERY actions (no targeted queries).
**Expected:** Cumulative step reward is approximately 0.1 (within [0.0, 0.2]).

**Run:** `uv run pytest tests/e2e/test_reward_scenarios.py -v -k "random_exploration"`

---

### Scenario: Targeted Queries Yield ~0.3 Cumulative Reward

**Setup:** Environment reset with a known question.
**Actions:** Execute targeted queries that progressively approach the gold answer.
**Expected:** Cumulative step reward is approximately 0.3 (within [0.2, 0.5]).

**Run:** `uv run pytest tests/e2e/test_reward_scenarios.py -v -k "targeted_queries"`

---

### Scenario: Correct Answer Yields ~1.3 Total Reward

**Setup:** Environment reset with a known question.
**Actions:** Execute targeted queries, then ANSWER correctly.
**Expected:** Total reward (cumulative step + terminal 1.0) is approximately 1.3 (within [1.0, 1.5]).

**Run:** `uv run pytest tests/e2e/test_reward_scenarios.py -v -k "correct_answer"`

---

## 5. Edge Cases Checklist

- [ ] Null/None rows passed to compute_step_reward (SQL error case)
- [ ] Empty result rows from a valid query (e.g., `SELECT * FROM t WHERE 1=0`)
- [ ] Single-row gold vs multi-row prediction
- [ ] Multi-row gold vs single-row prediction
- [ ] Gold rows with only non-numeric values (numeric_range returns 1.0)
- [ ] Gold rows with mixed numeric and string columns
- [ ] Very large numeric values (boundary for log-distance formula)
- [ ] Negative numeric values in gold or prediction
- [ ] Float vs integer comparison in numeric range (e.g., `10` vs `10.0`)
- [ ] None/NULL values in result tuples (stringification for value_overlap)
- [ ] SQL strings that differ only by whitespace (decide whether their hashes differ or the SQL is normalized before hashing, and test that behavior)
- [ ] Cumulative new_info exactly at cap (0.10) -- next unique query gets 0
- [ ] Cumulative step reward exactly at clamp boundary (-0.2 or +0.5)
- [ ] Layer 2 called with pred_rows and gold_rows of different column counts
- [ ] _bin_progress with values outside [0, 1] (e.g., negative or > 1.0 from rounding)
- [ ] Concurrent episodes (if supported) -- each has independent tracking fields

---

## 6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `uv run pytest tests/unit/test_reward.py -v` shows `X passed` |
| Integration | pytest output | `uv run pytest tests/integration/test_reward_flow.py -v` shows `X passed` |
| E2E | pytest output | `uv run pytest tests/e2e/test_reward_scenarios.py -v` shows `X passed` |
| Reward calibration | Logged values | Random exploration ~0.1, targeted ~0.3, correct ~1.3 |
| Existing tests | pytest output | `uv run pytest tests/test_smoke.py -v` still passes (no regressions) |