# Verification Specification

**Feature:** F002
**Generated from:** specs/F002-VERIFICATION_INPUT.json
**Generated:** 2026-03-27

---
## 1. Unit Tests
### verify_answer (dispatcher)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_verify_integer_exact_match | Dispatches to integer comparer for exact match | `predicted="42", gold="42", answer_type="integer"` | `True` | happy |
| test_verify_float_within_tolerance | Dispatches to float comparer within 1% | `predicted="3.14", gold="3.15", answer_type="float"` | `True` | happy |
| test_verify_string_case_insensitive | Dispatches to string comparer ignoring case | `predicted="Alice", gold="alice", answer_type="string"` | `True` | happy |
| test_verify_list_order_insensitive | Dispatches to list comparer ignoring order | `predicted="a, b", gold="b, a", answer_type="list"` | `True` | happy |
| test_verify_none_type_falls_back_to_string | Falls back to string comparison when answer_type is None | `predicted="hello", gold="hello", answer_type=None` | `True` | fallback |
| test_verify_unknown_type_falls_back_to_string | Falls back to string comparison for unrecognized type | `predicted="foo", gold="foo", answer_type="table"` | `True` | fallback |
| test_verify_empty_predicted_returns_false | Empty string after strip returns False immediately | `predicted=" ", gold="42", answer_type="integer"` | `False` | edge |
| test_verify_none_predicted_returns_false | Handles None-like empty input | `predicted="", gold="42", answer_type=None` | `False` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "test_verify"`

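
The routing rules in the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `comparers` parameter and the three-argument comparer signature are hypothetical, introduced so that the dispatch logic can be shown with stand-in comparers that only record which branch was taken.

```python
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical comparer signature: (predicted, gold, gold_rows) -> bool.
Comparer = Callable[[str, str, Optional[Sequence[tuple]]], bool]


def verify_answer(
    predicted: Optional[str],
    gold: str,
    answer_type: Optional[str],
    gold_rows: Optional[Sequence[tuple]],
    comparers: Mapping[str, Comparer],  # assumption: injected for illustration
) -> bool:
    """Route to a type-specific comparer, falling back to "string"."""
    # Empty or whitespace-only predictions fail before any dispatch.
    if predicted is None or not predicted.strip():
        return False
    # None and unrecognized types both fall back to the string comparer.
    key = answer_type if answer_type in comparers else "string"
    return comparers[key](predicted.strip(), gold, gold_rows)


# Stand-in comparers that only record which one was selected.
calls: list[str] = []


def _recording(name: str) -> Comparer:
    def comparer(predicted, gold, gold_rows):
        calls.append(name)
        return True
    return comparer


comparers = {name: _recording(name) for name in ("integer", "float", "string", "list")}

verify_answer("42", "42", "integer", None, comparers)    # known type
verify_answer("foo", "foo", "table", None, comparers)    # unrecognized type
verify_answer("hello", "hello", None, None, comparers)   # missing type
blank_result = verify_answer(" ", "42", "integer", None, comparers)  # no dispatch
print(calls, blank_result)
```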
---
### _compare_integer
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_int_exact_match | Both sides are integers | `predicted="25", gold="25"` | `True` | happy |
| test_int_from_float_string | Coerces "25.0" via int(float(x)) | `predicted="25.0", gold="25"` | `True` | happy |
| test_int_mismatch | Different integers | `predicted="24", gold="25"` | `False` | happy |
| test_int_negative_values | Negative integers match | `predicted="-3", gold="-3"` | `True` | happy |
| test_int_negative_mismatch | Negative vs positive | `predicted="-3", gold="3"` | `False` | happy |
| test_int_zero | Zero matches zero | `predicted="0", gold="0"` | `True` | edge |
| test_int_large_value | Large integers | `predicted="999999999", gold="999999999"` | `True` | edge |
| test_int_non_numeric_returns_false | Non-numeric predicted returns False | `predicted="abc", gold="25"` | `False` | error |
| test_int_non_numeric_gold_returns_false | Non-numeric gold returns False | `predicted="25", gold="abc"` | `False` | error |
| test_int_empty_string_returns_false | Empty string returns False | `predicted="", gold="25"` | `False` | edge |
| test_int_whitespace_only_returns_false | Whitespace-only returns False | `predicted=" ", gold="25"` | `False` | edge |
| test_int_float_truncation | "25.9" coerced to 25 matches gold "25" | `predicted="25.9", gold="25"` | `True` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "test_int_"`

---
### _compare_float
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_float_exact_match | Identical float strings | `predicted="3.14", gold="3.14"` | `True` | happy |
| test_float_within_1pct_tolerance | Difference within 1% | `predicted="100.5", gold="100.0"` | `True` | happy |
| test_float_outside_1pct_tolerance | Difference exceeds 1% | `predicted="102.0", gold="100.0"` | `False` | happy |
| test_float_boundary_exactly_1pct | Exactly at 1% boundary | `predicted="101.0", gold="100.0"` | `True` | edge |
| test_float_just_over_1pct | Just past 1% boundary | `predicted="101.01", gold="100.0"` | `False` | edge |
| test_float_gold_zero_uses_absolute_tolerance | Gold is 0, uses 1e-9 absolute | `predicted="0.0000000001", gold="0"` | `True` | edge |
| test_float_gold_zero_fails_large_diff | Gold is 0, predicted too far | `predicted="0.001", gold="0"` | `False` | edge |
| test_float_negative_values | Negative floats within tolerance | `predicted="-99.5", gold="-100.0"` | `True` | happy |
| test_float_non_numeric_returns_false | Non-numeric predicted | `predicted="abc", gold="3.14"` | `False` | error |
| test_float_non_numeric_gold_returns_false | Non-numeric gold | `predicted="3.14", gold="abc"` | `False` | error |
| test_float_integer_strings | Integer strings as floats | `predicted="42", gold="42"` | `True` | edge |
| test_float_very_small_values | Very small but non-zero | `predicted="0.0001", gold="0.0001"` | `True` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "test_float_"`

---
### _compare_string
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_string_exact_match | Identical strings | `predicted="Alice", gold="Alice"` | `True` | happy |
| test_string_case_insensitive | Different casing | `predicted="ALICE", gold="alice"` | `True` | happy |
| test_string_whitespace_normalized | Leading, trailing, and repeated internal whitespace | `predicted="  Alice   Bob  ", gold="Alice Bob"` | `True` | happy |
| test_string_mismatch | Different strings | `predicted="Alice", gold="Bob"` | `False` | happy |
| test_string_empty_both | Both empty | `predicted="", gold=""` | `True` | edge |
| test_string_unicode | Unicode characters | `predicted="cafe\u0301", gold="cafe\u0301"` | `True` | edge |
| test_string_special_characters | Special characters match | `predicted="O'Brien", gold="O'Brien"` | `True` | edge |
| test_string_numeric_as_string | Numbers compared as strings | `predicted="42", gold="42"` | `True` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "test_string_"`

---
### _compare_list
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_list_same_order | Identical lists | `predicted="a, b, c", gold="a, b, c"` | `True` | happy |
| test_list_different_order | Reordered elements | `predicted="c, a, b", gold="a, b, c"` | `True` | happy |
| test_list_mismatch | Different elements | `predicted="a, b, d", gold="a, b, c"` | `False` | happy |
| test_list_extra_element | Predicted has extra | `predicted="a, b, c, d", gold="a, b, c"` | `False` | happy |
| test_list_missing_element | Predicted is missing one | `predicted="a, b", gold="a, b, c"` | `False` | happy |
| test_list_duplicates_matter | Duplicates on one side | `predicted="a, a, b", gold="a, b"` | Implementation-defined | edge |
| test_list_with_gold_rows | Uses gold_rows when provided | `predicted="a, b", gold="...", gold_rows=[("a",), ("b",)]` | `True` | happy |
| test_list_gold_rows_none_fallback | Falls back to string parsing when gold_rows is None | `predicted="a, b", gold="a, b", gold_rows=None` | `True` | fallback |
| test_list_empty | Both sides empty | `predicted="", gold=""` | Implementation-defined | edge |
| test_list_single_element | Single element lists | `predicted="only", gold="only"` | `True` | edge |
| test_list_whitespace_in_elements | Elements with whitespace | `predicted=" a , b ", gold="a, b"` | `True` | edge |
| test_list_case_sensitivity | Case handling in list elements | `predicted="Alice, Bob", gold="alice, bob"` | Implementation-defined | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "test_list_"`

---
### EpisodeContext.gold_rows field
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_episode_context_gold_rows_default | gold_rows defaults to None | `EpisodeContext(...)` | `gold_rows is None` | happy |
| test_episode_context_gold_rows_set | gold_rows can be set to list of tuples | `EpisodeContext(..., gold_rows=[(1,), (2,)])` | `gold_rows == [(1,), (2,)]` | happy |
| test_episode_context_gold_rows_empty_list | gold_rows can be empty list | `EpisodeContext(..., gold_rows=[])` | `gold_rows == []` | edge |

**Run:** `uv run pytest tests/test_verifier.py -v -k "episode_context"`

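
The three rows above distinguish `None` (no structured gold data) from `[]` (structured data with zero rows). A minimal sketch of such a field, assuming `EpisodeContext` is a dataclass; `gold_answer` and `answer_type` are hypothetical stand-ins for the fields elided as `...` in the table. Callers that branch on the field should test `is not None` rather than truthiness, otherwise an empty result set is treated as missing.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class EpisodeContext:
    # Hypothetical stand-in fields; the real class is not shown in this spec.
    gold_answer: str
    answer_type: Optional[str] = None
    # None means "no structured rows available"; [] means "zero rows".
    gold_rows: Optional[list[tuple[Any, ...]]] = None


default_ctx = EpisodeContext(gold_answer="25")
rows_ctx = EpisodeContext(gold_answer="1, 2", gold_rows=[(1,), (2,)])
empty_ctx = EpisodeContext(gold_answer="", gold_rows=[])

# Truthiness cannot tell the empty list from None; an identity check can.
has_rows_by_truthiness = bool(empty_ctx.gold_rows)
has_rows_by_identity = empty_ctx.gold_rows is not None
print(has_rows_by_truthiness, has_rows_by_identity)
```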
---
## 2. Integration Tests
### Flow: Primary answer verification through step()
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Agent sends ANSWER action with value string | step() dispatches to _handle_answer | `env.step(SQLAction(action_type="ANSWER", argument=value))` |
| 2 | _handle_answer calls verify_answer with predicted, gold, answer_type, gold_rows | verify_answer receives all four arguments | Correct reward returned in observation |
| 3 | verify_answer dispatches to type-specific comparer | Correct comparer chosen based on answer_type | `observation.reward == 1.0` for correct answers |
| 4 | Boolean result maps to reward | True -> 1.0, False -> 0.0 | `observation.done is True` |
### Flow: Integer answer through full environment
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset environment with question that has answer_type="integer" | Episode created with integer question | `observation.done is False` |
| 2 | Submit ANSWER with correct integer (possibly as float string) | verify_answer coerces and matches | `observation.reward == 1.0` |
### Flow: Float answer through full environment
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type="float" | Episode created with float question | `observation.done is False` |
| 2 | Submit ANSWER within 1% tolerance | verify_answer accepts within tolerance | `observation.reward == 1.0` |
### Flow: String answer through full environment
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type="string" | Episode created with string question | `observation.done is False` |
| 2 | Submit ANSWER with different casing/whitespace | verify_answer normalizes and matches | `observation.reward == 1.0` |
### Flow: List answer through full environment
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type="list" | Episode created with list question, gold_rows populated | `observation.done is False` |
| 2 | Submit ANSWER with reordered list | verify_answer compares as sets | `observation.reward == 1.0` |
### Flow: Fallback for missing answer_type
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type=None or missing | Episode created without explicit type | `observation.done is False` |
| 2 | Submit ANSWER matching gold exactly (modulo case/whitespace) | Falls back to string comparison | `observation.reward == 1.0` |
### Flow: Type coercion failure
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Reset with question that has answer_type="integer" | Episode created with integer question | `observation.done is False` |
| 2 | Submit ANSWER with non-numeric string | _compare_integer catches ValueError, returns False | `observation.reward == 0.0` |

**Run:** `uv run pytest tests/test_verifier_integration.py -v`

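
The shape shared by the flows above (submit an ANSWER, map the boolean result to a reward, end the episode) can be sketched with stand-ins. Everything here is hypothetical scaffolding: `FakeEnv`, the two-field `Observation`, the injected `verifier` callable, and the `"QUERY"` action type are assumptions for illustration and are not the project's environment classes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SQLAction:  # stand-in carrying only the two fields used in the flows
    action_type: str
    argument: str


@dataclass
class Observation:  # stand-in; the real observation has more fields
    reward: float
    done: bool


class FakeEnv:
    """Hypothetical environment that only models the ANSWER path."""

    def __init__(self, gold: str, verifier: Callable[[str, str], bool]):
        self._gold = gold
        self._verifier = verifier

    def step(self, action: SQLAction) -> Observation:
        if action.action_type == "ANSWER":
            correct = self._verifier(action.argument, self._gold)
            # True -> 1.0, False -> 0.0; an ANSWER always ends the episode.
            return Observation(reward=1.0 if correct else 0.0, done=True)
        # Any other action leaves the episode open in this sketch.
        return Observation(reward=0.0, done=False)


def string_verifier(predicted: str, gold: str) -> bool:
    # Simplified stand-in for the string fallback path.
    return predicted.strip().lower() == gold.strip().lower()


right = FakeEnv("Engineering", string_verifier).step(SQLAction("ANSWER", "engineering"))
wrong = FakeEnv("Engineering", string_verifier).step(SQLAction("ANSWER", "Sales"))
other = FakeEnv("Engineering", string_verifier).step(SQLAction("QUERY", "SELECT 1"))
print(right, wrong, other)
```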
---
## 3. API Tests
F002 defines no API endpoints. Answer verification is an internal server-side function called from the step() handler, so API-level behavior is covered by the integration tests in Section 2, which exercise it through the step() interface.

---
## 4. E2E Tests
### Scenario: Correct integer answer accepted
**Setup:** Environment initialized with a question whose gold answer is "25" and answer_type is "integer".
**Actions:** Agent submits ANSWER "25".
**Expected:** observation.done is True, observation.reward is 1.0.
### Scenario: Correct float answer accepted within tolerance
**Setup:** Environment initialized with a question whose gold answer is "3.14159" and answer_type is "float".
**Actions:** Agent submits ANSWER "3.14".
**Expected:** observation.done is True, observation.reward is 1.0 (within 1% tolerance).
### Scenario: Correct string answer accepted case-insensitively
**Setup:** Environment initialized with a question whose gold answer is "Engineering" and answer_type is "string".
**Actions:** Agent submits ANSWER "engineering".
**Expected:** observation.done is True, observation.reward is 1.0.
### Scenario: Correct list answer accepted order-insensitively
**Setup:** Environment initialized with a question whose gold answer is "alice, bob, charlie" and answer_type is "list".
**Actions:** Agent submits ANSWER "charlie, alice, bob".
**Expected:** observation.done is True, observation.reward is 1.0.
### Scenario: Wrong answer rejected
**Setup:** Environment initialized with any question.
**Actions:** Agent submits ANSWER with clearly wrong value.
**Expected:** observation.done is True, observation.reward is 0.0.
### Scenario: Backward compatibility -- no answer_type field
**Setup:** Environment initialized with a legacy question record that has no answer_type (or answer_type is None).
**Actions:** Agent submits ANSWER matching gold answer exactly.
**Expected:** observation.done is True, observation.reward is 1.0 (string fallback used).

**Run:** `uv run pytest tests/test_smoke.py tests/test_verifier_integration.py -v`

---
## 5. Edge Cases Checklist
- [ ] Empty string predicted (after strip) returns False immediately
- [ ] Whitespace-only predicted returns False
- [ ] Non-numeric string for integer comparison returns False (ValueError caught)
- [ ] Non-numeric string for float comparison returns False (ValueError caught)
- [ ] Gold value of "0" for float comparison uses absolute tolerance 1e-9
- [ ] Float boundary at exactly 1% tolerance (should pass)
- [ ] Float just over 1% tolerance (should fail)
- [ ] Integer coercion via int(float(x)) handles "25.0" -> 25
- [ ] Integer coercion truncates "25.9" -> 25
- [ ] List with gold_rows=None falls back to string parsing
- [ ] List with gold_rows provided uses structured comparison
- [ ] answer_type=None dispatches to string comparison
- [ ] Unknown answer_type (e.g., "table", "unknown") dispatches to string comparison
- [ ] Very large integer values (e.g., `999999999`; Python integers have no fixed MAX_INT)
- [ ] Unicode characters in string comparison
- [ ] Special characters in string comparison (quotes, apostrophes)
- [ ] Negative numbers for integer and float comparisons
- [ ] List with duplicate elements
- [ ] Single-element list
- [ ] Mixed whitespace in list elements
---
## 6. Evidence Requirements
| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `uv run pytest tests/test_verifier.py -v` -- `X passed` |
| Integration | pytest output | `uv run pytest tests/test_verifier_integration.py -v` -- `X passed` |
| E2E | pytest output via smoke tests | `uv run pytest tests/test_smoke.py -v` -- answer tests pass |
| Backward compat | pytest output | Existing `test_answer_ends_episode_without_budget_decrement` still passes |