	# Verification Specification

	Feature: F002
	Generated from: specs/F002-VERIFICATION_INPUT.json
	Generated: 2026-03-27

	---

	## 1. Unit Tests

	### verify_answer (dispatcher)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_verify_integer_exact_match \| Dispatches to integer comparer for exact match \| `predicted="42", gold="42", answer_type="integer"` \| `True` \| happy \|
	\| test_verify_float_within_tolerance \| Dispatches to float comparer within 1% \| `predicted="3.14", gold="3.15", answer_type="float"` \| `True` \| happy \|
	\| test_verify_string_case_insensitive \| Dispatches to string comparer ignoring case \| `predicted="Alice", gold="alice", answer_type="string"` \| `True` \| happy \|
	\| test_verify_list_order_insensitive \| Dispatches to list comparer ignoring order \| `predicted="a, b", gold="b, a", answer_type="list"` \| `True` \| happy \|
	\| test_verify_none_type_falls_back_to_string \| Falls back to string comparison when answer_type is None \| `predicted="hello", gold="hello", answer_type=None` \| `True` \| fallback \|
	\| test_verify_unknown_type_falls_back_to_string \| Falls back to string comparison for unrecognized type \| `predicted="foo", gold="foo", answer_type="table"` \| `True` \| fallback \|
	\| test_verify_empty_predicted_returns_false \| Empty string after strip returns False immediately \| `predicted=" ", gold="42", answer_type="integer"` \| `False` \| edge \|
	\| test_verify_none_predicted_returns_false \| Handles None-like empty input \| `predicted="", gold="42", answer_type=None` \| `False` \| edge \|

	Run: `uv run pytest tests/test_verifier.py -v -k "test_verify_"`
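The dispatch rules in the table above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `_sketch` names, the `_COMPARERS` registry, and the stand-in comparers are hypothetical, and only the argument list (`predicted`, `gold`, `answer_type`, `gold_rows`) comes from this spec.

```python
from typing import Callable, Dict, List, Optional, Tuple

Comparer = Callable[[str, str], bool]


def compare_string_sketch(predicted: str, gold: str) -> bool:
    """Stand-in string comparer: case-insensitive, whitespace-collapsed."""
    def canon(text: str) -> str:
        return " ".join(text.split()).lower()
    return canon(predicted) == canon(gold)


def compare_integer_sketch(predicted: str, gold: str) -> bool:
    """Stand-in integer comparer using int(float(x)) coercion."""
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, OverflowError):
        return False


# Hypothetical registry. Float and list comparers would be registered the same way.
_COMPARERS: Dict[str, Comparer] = {
    "integer": compare_integer_sketch,
    "string": compare_string_sketch,
}


def verify_answer_sketch(
    predicted: str,
    gold: str,
    answer_type: Optional[str] = None,
    gold_rows: Optional[List[Tuple]] = None,  # unused in this reduced sketch
) -> bool:
    predicted = predicted.strip()
    if not predicted:
        # Empty or whitespace-only answers are rejected before any dispatch.
        return False
    # None and unrecognized types both fall back to the string comparer.
    comparer = _COMPARERS.get(answer_type, compare_string_sketch)
    return comparer(predicted, gold)
```

Using a dictionary lookup with a default keeps the `None` and unknown-type fallbacks on a single code path, which matches the two fallback rows in the table.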

	---

	### _compare_integer

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_int_exact_match \| Both sides are integers \| `predicted="25", gold="25"` \| `True` \| happy \|
	\| test_int_from_float_string \| Coerces "25.0" via int(float(x)) \| `predicted="25.0", gold="25"` \| `True` \| happy \|
	\| test_int_mismatch \| Different integers \| `predicted="24", gold="25"` \| `False` \| happy \|
	\| test_int_negative_values \| Negative integers match \| `predicted="-3", gold="-3"` \| `True` \| happy \|
	\| test_int_negative_mismatch \| Negative vs positive \| `predicted="-3", gold="3"` \| `False` \| happy \|
	\| test_int_zero \| Zero matches zero \| `predicted="0", gold="0"` \| `True` \| edge \|
	\| test_int_large_value \| Large integers \| `predicted="999999999", gold="999999999"` \| `True` \| edge \|
	\| test_int_non_numeric_returns_false \| Non-numeric predicted returns False \| `predicted="abc", gold="25"` \| `False` \| error \|
	\| test_int_non_numeric_gold_returns_false \| Non-numeric gold returns False \| `predicted="25", gold="abc"` \| `False` \| error \|
	\| test_int_empty_string_returns_false \| Empty string returns False \| `predicted="", gold="25"` \| `False` \| edge \|
	\| test_int_whitespace_only_returns_false \| Whitespace-only returns False \| `predicted=" ", gold="25"` \| `False` \| edge \|
	\| test_int_float_truncation \| "25.9" coerced to 25 matches gold "25" \| `predicted="25.9", gold="25"` \| `True` \| edge \|

	Run: `uv run pytest tests/test_verifier.py -v -k "test_int_"`
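The coercion rows above depend on how `int(float(x))` behaves, so a short sketch may help. The function name is hypothetical, and catching `OverflowError` (raised for inputs such as `"inf"`) is an assumption that goes beyond the `ValueError` handling this spec mentions.

```python
def compare_integer_sketch(predicted: str, gold: str) -> bool:
    """Compare two strings as integers after int(float(x)) coercion."""
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, OverflowError):
        # ValueError: non-numeric, empty, or "nan" input.
        # OverflowError: "inf" (assumed to count as a mismatch here).
        return False


# int() truncates toward zero, so the fractional part is dropped on both signs.
truncated_positive = int(float("25.9"))
truncated_negative = int(float("-25.9"))

# Going through float loses precision above 2**53, so distinct large
# integers can compare equal under this coercion.
collides = compare_integer_sketch("9007199254740993", "9007199254740992")
```

Because truncation is not rounding, `"25.9"` matches gold `"25"` but not gold `"26"`. If that leniency is unwanted, the comparer would need an explicit check that the fractional part is zero.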

	---

	### _compare_float

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_float_exact_match \| Identical float strings \| `predicted="3.14", gold="3.14"` \| `True` \| happy \|
	\| test_float_within_1pct_tolerance \| Difference within 1% \| `predicted="100.5", gold="100.0"` \| `True` \| happy \|
	\| test_float_outside_1pct_tolerance \| Difference exceeds 1% \| `predicted="102.0", gold="100.0"` \| `False` \| happy \|
	\| test_float_boundary_exactly_1pct \| Exactly at 1% boundary \| `predicted="101.0", gold="100.0"` \| `True` \| edge \|
	\| test_float_just_over_1pct \| Just past 1% boundary \| `predicted="101.01", gold="100.0"` \| `False` \| edge \|
	\| test_float_gold_zero_uses_absolute_tolerance \| Gold is 0, uses 1e-9 absolute \| `predicted="0.0000000001", gold="0"` \| `True` \| edge \|
	\| test_float_gold_zero_fails_large_diff \| Gold is 0, predicted too far \| `predicted="0.001", gold="0"` \| `False` \| edge \|
	\| test_float_negative_values \| Negative floats within tolerance \| `predicted="-99.5", gold="-100.0"` \| `True` \| happy \|
	\| test_float_non_numeric_returns_false \| Non-numeric predicted \| `predicted="abc", gold="3.14"` \| `False` \| error \|
	\| test_float_non_numeric_gold_returns_false \| Non-numeric gold \| `predicted="3.14", gold="abc"` \| `False` \| error \|
	\| test_float_integer_strings \| Integer strings as floats \| `predicted="42", gold="42"` \| `True` \| edge \|
	\| test_float_very_small_values \| Very small but non-zero \| `predicted="0.0001", gold="0.0001"` \| `True` \| edge \|

	Run: `uv run pytest tests/test_verifier.py -v -k "test_float_"`
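The boundary rows above (101.0 passes, 101.01 fails against gold 100.0) are only both satisfied when the 1% tolerance is measured relative to the gold value. A minimal sketch under that assumption follows; the function and parameter names are hypothetical.

```python
import math


def compare_float_sketch(
    predicted: str,
    gold: str,
    rel_tol: float = 0.01,   # 1% of the gold value
    abs_tol: float = 1e-9,   # used only when gold is exactly zero
) -> bool:
    try:
        p, g = float(predicted), float(gold)
    except ValueError:
        return False
    if g == 0.0:
        # A relative tolerance is meaningless at zero, so fall back to absolute.
        return abs(p) <= abs_tol
    # Tolerance is scaled by the gold value only (asymmetric on purpose).
    return abs(p - g) <= rel_tol * abs(g)


# Contrast: math.isclose scales rel_tol by the larger of the two magnitudes,
# so it accepts the "just over 1%" case that the table expects to fail.
symmetric_accepts = math.isclose(101.01, 100.0, rel_tol=0.01)
```

The contrast matters when choosing an implementation: swapping the last line of the comparer for `math.isclose` would make `test_float_just_over_1pct` fail.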

	---

	### _compare_string

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_string_exact_match \| Identical strings \| `predicted="Alice", gold="Alice"` \| `True` \| happy \|
	\| test_string_case_insensitive \| Different casing \| `predicted="ALICE", gold="alice"` \| `True` \| happy \|
	\| test_string_whitespace_normalized \| Leading/trailing/extra whitespace \| `predicted=" Alice Bob ", gold="Alice Bob"` \| `True` \| happy \|
	\| test_string_mismatch \| Different strings \| `predicted="Alice", gold="Bob"` \| `False` \| happy \|
	\| test_string_empty_both \| Both empty \| `predicted="", gold=""` \| `True` \| edge \|
	\| test_string_unicode \| Unicode characters \| `predicted="cafe\u0301", gold="cafe\u0301"` \| `True` \| edge \|
	\| test_string_special_characters \| Special characters match \| `predicted="O'Brien", gold="O'Brien"` \| `True` \| edge \|
	\| test_string_numeric_as_string \| Numbers compared as strings \| `predicted="42", gold="42"` \| `True` \| edge \|

	Run: `uv run pytest tests/test_verifier.py -v -k "test_string_"`
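One way to satisfy the normalization rows above is to collapse whitespace and lowercase both sides before comparing. The sketch below does that. The function name and the optional `normalize_unicode` flag are hypothetical additions: the spec's Unicode row only compares identical strings, so whether composed and decomposed forms should match is left open here.

```python
import unicodedata


def compare_string_sketch(
    predicted: str,
    gold: str,
    normalize_unicode: bool = False,  # hypothetical option, not in the spec
) -> bool:
    def canon(text: str) -> str:
        if normalize_unicode:
            # NFC composes "e" + combining acute into a single code point.
            text = unicodedata.normalize("NFC", text)
        # split() with no arguments drops leading, trailing, and repeated whitespace.
        return " ".join(text.split()).lower()

    return canon(predicted) == canon(gold)


decomposed = "cafe\u0301"   # "e" followed by a combining acute accent
composed = "caf\u00e9"      # precomposed "é"
strict = compare_string_sketch(decomposed, composed)
lenient = compare_string_sketch(decomposed, composed, normalize_unicode=True)
```

Without normalization the two spellings of "café" compare as different strings, which may or may not be the intended behavior for gold answers coming out of a database.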

	---

	### _compare_list

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_list_same_order \| Identical lists \| `predicted="a, b, c", gold="a, b, c"` \| `True` \| happy \|
	\| test_list_different_order \| Reordered elements \| `predicted="c, a, b", gold="a, b, c"` \| `True` \| happy \|
	\| test_list_mismatch \| Different elements \| `predicted="a, b, d", gold="a, b, c"` \| `False` \| happy \|
	\| test_list_extra_element \| Predicted has extra \| `predicted="a, b, c, d", gold="a, b, c"` \| `False` \| happy \|
	\| test_list_missing_element \| Predicted is missing one \| `predicted="a, b", gold="a, b, c"` \| `False` \| happy \|
	\| test_list_duplicates_matter \| Duplicates in one side \| `predicted="a, a, b", gold="a, b"` \| Defined by impl \| edge \|
	\| test_list_with_gold_rows \| Uses gold_rows when provided \| `predicted="a, b", gold="...", gold_rows=[("a",), ("b",)]` \| `True` \| happy \|
	\| test_list_gold_rows_none_fallback \| Falls back to string parsing when gold_rows is None \| `predicted="a, b", gold="a, b", gold_rows=None` \| `True` \| fallback \|
	\| test_list_empty \| Both sides empty \| `predicted="", gold=""` \| Defined by impl \| edge \|
	\| test_list_single_element \| Single element lists \| `predicted="only", gold="only"` \| `True` \| edge \|
	\| test_list_whitespace_in_elements \| Elements with whitespace \| `predicted=" a , b ", gold="a, b"` \| `True` \| edge \|
	\| test_list_case_sensitivity \| Case handling in list elements \| `predicted="Alice, Bob", gold="alice, bob"` \| Defined by impl \| edge \|

	Run: `uv run pytest tests/test_verifier.py -v -k "test_list_"`
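Three rows above are marked "Defined by impl" (duplicates, both sides empty, case). The sketch below picks one possible answer for each so the consequences are visible: set semantics, so duplicates collapse, and case-insensitive elements. These choices, the function names, and the flattening of `gold_rows` cells are all assumptions rather than behavior this spec fixes.

```python
from typing import Iterable, Optional, Sequence, Set


def _canon(item: object) -> str:
    # Assumed element normalization: stringify, collapse whitespace, lowercase.
    return " ".join(str(item).split()).lower()


def _parse_items(text: str) -> Set[str]:
    # Comma-separated parsing; blank fragments are ignored.
    return {_canon(part) for part in text.split(",") if part.strip()}


def compare_list_sketch(
    predicted: str,
    gold: str,
    gold_rows: Optional[Iterable[Sequence[object]]] = None,
) -> bool:
    if gold_rows is not None:
        # Structured path: every cell of every row becomes one element (assumed).
        gold_items = {_canon(cell) for row in gold_rows for cell in row}
    else:
        # Fallback path: parse the gold string the same way as the prediction.
        gold_items = _parse_items(gold)
    return _parse_items(predicted) == gold_items
```

Under these choices `"a, a, b"` matches gold `"a, b"`, `"Alice, Bob"` matches `"alice, bob"`, and two empty strings match each other. An implementation that should treat duplicates as significant could compare `collections.Counter` objects instead of sets.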

	---

	### EpisodeContext.gold_rows field

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_episode_context_gold_rows_default \| gold_rows defaults to None \| `EpisodeContext(...)` \| `gold_rows is None` \| happy \|
	\| test_episode_context_gold_rows_set \| gold_rows can be set to list of tuples \| `EpisodeContext(..., gold_rows=[(1,), (2,)])` \| `gold_rows == [(1,), (2,)]` \| happy \|
	\| test_episode_context_gold_rows_empty_list \| gold_rows can be empty list \| `EpisodeContext(..., gold_rows=[])` \| `gold_rows == []` \| edge \|

	Run: `uv run pytest tests/test_verifier.py -v -k "episode_context"`
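The three rows above distinguish "no structured gold available" (`None`) from "structured gold is an empty result" (`[]`). A minimal sketch of a field with that behavior follows. `EpisodeContextSketch` is a hypothetical stand-in: the other fields of the real `EpisodeContext` are not listed in this spec and are deliberately left out.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class EpisodeContextSketch:
    # Other fields of the real EpisodeContext are omitted (not given in this spec).
    # Defaulting to None avoids a shared mutable default and keeps
    # "not provided" distinguishable from "provided but empty".
    gold_rows: Optional[List[Tuple[Any, ...]]] = None


default_ctx = EpisodeContextSketch()
populated_ctx = EpisodeContextSketch(gold_rows=[(1,), (2,)])
empty_ctx = EpisodeContextSketch(gold_rows=[])
```

Code that consumes the field should test `gold_rows is not None` rather than truthiness, since an empty list is falsy but still a valid value.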

	---

	## 2. Integration Tests

	### Flow: Primary answer verification through step()

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Agent sends ANSWER action with value string \| step() dispatches to _handle_answer \| `env.step(SQLAction(action_type="ANSWER", argument=value))` \|
	\| 2 \| _handle_answer calls verify_answer with predicted, gold, answer_type, gold_rows \| verify_answer receives all four arguments \| Correct reward returned in observation \|
	\| 3 \| verify_answer dispatches to type-specific comparer \| Correct comparer chosen based on answer_type \| `observation.reward == 1.0` for correct answers \|
	\| 4 \| Boolean result maps to reward \| True -> 1.0, False -> 0.0 \| `observation.done is True` \|

	### Flow: Integer answer through full environment

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Reset environment with question that has answer_type="integer" \| Episode created with integer question \| `observation.done is False` \|
	\| 2 \| Submit ANSWER with correct integer (possibly as float string) \| verify_answer coerces and matches \| `observation.reward == 1.0` \|

	### Flow: Float answer through full environment

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Reset with question that has answer_type="float" \| Episode created with float question \| `observation.done is False` \|
	\| 2 \| Submit ANSWER within 1% tolerance \| verify_answer accepts within tolerance \| `observation.reward == 1.0` \|

	### Flow: String answer through full environment

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Reset with question that has answer_type="string" \| Episode created with string question \| `observation.done is False` \|
	\| 2 \| Submit ANSWER with different casing/whitespace \| verify_answer normalizes and matches \| `observation.reward == 1.0` \|

	### Flow: List answer through full environment

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Reset with question that has answer_type="list" \| Episode created with list question, gold_rows populated \| `observation.done is False` \|
	\| 2 \| Submit ANSWER with reordered list \| verify_answer compares order-insensitively \| `observation.reward == 1.0` \|

	### Flow: Fallback for missing answer_type

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Reset with question that has answer_type=None or missing \| Episode created without explicit type \| `observation.done is False` \|
	\| 2 \| Submit ANSWER matching gold exactly (modulo case/whitespace) \| Falls back to string comparison \| `observation.reward == 1.0` \|

	### Flow: Type coercion failure

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Reset with question that has answer_type="integer" \| Episode created with integer question \| `observation.done is False` \|
	\| 2 \| Submit ANSWER with non-numeric string \| _compare_integer catches ValueError, returns False \| `observation.reward == 0.0` \|

	Run: `uv run pytest tests/test_verifier_integration.py -v`
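The reward mapping from the primary flow (True -> 1.0, False -> 0.0, episode ends either way) can be sketched without the real environment. Everything below is a hypothetical stand-in: `ObservationSketch`, `handle_answer_sketch`, and the injected `verify` callable only mirror the fields and the step order named in the flows above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ObservationSketch:
    done: bool
    reward: float


def handle_answer_sketch(
    predicted: str,
    gold: str,
    verify: Callable[[str, str], bool],
) -> ObservationSketch:
    correct = verify(predicted, gold)
    # An ANSWER action always terminates the episode; only the reward differs.
    return ObservationSketch(done=True, reward=1.0 if correct else 0.0)


def exact_match(predicted: str, gold: str) -> bool:
    # Trivial verifier used only to exercise the mapping.
    return predicted.strip() == gold.strip()


right = handle_answer_sketch("25", "25", exact_match)
wrong = handle_answer_sketch("24", "25", exact_match)
```

Passing the verifier in as a callable also makes it easy for an integration test to assert that the handler forwards its arguments unchanged.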

	---

	## 3. API Tests

	No API endpoints are defined for F002. Answer verification is an internal server-side function called within the step() handler. API-level testing is covered by the integration tests above (testing through the step() interface).

	---

	## 4. E2E Tests

	### Scenario: Correct integer answer accepted

	Setup: Environment initialized with a question whose gold answer is "25" and answer_type is "integer".
	Actions: Agent submits ANSWER "25".
	Expected: observation.done is True, observation.reward is 1.0.

	### Scenario: Correct float answer accepted within tolerance

	Setup: Environment initialized with a question whose gold answer is "3.14159" and answer_type is "float".
	Actions: Agent submits ANSWER "3.14".
	Expected: observation.done is True, observation.reward is 1.0 (within 1% tolerance).

	### Scenario: Correct string answer accepted case-insensitively

	Setup: Environment initialized with a question whose gold answer is "Engineering" and answer_type is "string".
	Actions: Agent submits ANSWER "engineering".
	Expected: observation.done is True, observation.reward is 1.0.

	### Scenario: Correct list answer accepted order-insensitively

	Setup: Environment initialized with a question whose gold answer is "alice, bob, charlie" and answer_type is "list".
	Actions: Agent submits ANSWER "charlie, alice, bob".
	Expected: observation.done is True, observation.reward is 1.0.

	### Scenario: Wrong answer rejected

	Setup: Environment initialized with any question.
	Actions: Agent submits ANSWER with clearly wrong value.
	Expected: observation.done is True, observation.reward is 0.0.

	### Scenario: Backward compatibility -- no answer_type field

	Setup: Environment initialized with a legacy question record that has no answer_type (or answer_type is None).
	Actions: Agent submits ANSWER matching gold answer exactly.
	Expected: observation.done is True, observation.reward is 1.0 (string fallback used).

	Run: `uv run pytest tests/test_smoke.py tests/test_verifier_integration.py -v`

	---

	## 5. Edge Cases Checklist

	- [ ] Empty string predicted (after strip) returns False immediately
	- [ ] Whitespace-only predicted returns False
	- [ ] Non-numeric string for integer comparison returns False (ValueError caught)
	- [ ] Non-numeric string for float comparison returns False (ValueError caught)
	- [ ] Gold value of "0" for float comparison uses absolute tolerance 1e-9
	- [ ] Float boundary at exactly 1% tolerance (should pass)
	- [ ] Float just over 1% tolerance (should fail)
	- [ ] Integer coercion via int(float(x)) handles "25.0" -> 25
	- [ ] Integer coercion truncates "25.9" -> 25
	- [ ] List with gold_rows=None falls back to string parsing
	- [ ] List with gold_rows provided uses structured comparison
	- [ ] answer_type=None dispatches to string comparison
	- [ ] Unknown answer_type (e.g., "table", "unknown") dispatches to string comparison
	- [ ] Very large integer values (e.g., 999999999; note that int(float(x)) is exact only up to 2**53)
	- [ ] Unicode characters in string comparison
	- [ ] Special characters in string comparison (quotes, apostrophes)
	- [ ] Negative numbers for integer and float comparisons
	- [ ] List with duplicate elements
	- [ ] Single-element list
	- [ ] Mixed whitespace in list elements

	---

	## 6. Evidence Requirements

	\| Category \| Evidence Type \| Example \|
	\|----------\|---------------\|---------\|
	\| Unit tests \| pytest output \| `uv run pytest tests/test_verifier.py -v` -- `X passed` \|
	\| Integration \| pytest output \| `uv run pytest tests/test_verifier_integration.py -v` -- `X passed` \|
	\| E2E \| pytest output via smoke tests \| `uv run pytest tests/test_smoke.py -v` -- answer tests pass \|
	\| Backward compat \| pytest output \| Existing `test_answer_ends_episode_without_budget_decrement` still passes \|