	# Feature Demo: F003 — Dense Reward System

	> Generated: 2026-03-28T06:07:34Z
	> Context source: spec + discovery only (implementation not read)
	> Feature entry: [FEATURES.json #F003](FEATURES.json)

	---

	## What This Feature Does

	Before this feature, agents only got a binary reward at the end of an episode, which made exploration hard to learn from. With F003, agents now get small, meaningful reward signals during non-terminal DESCRIBE/SAMPLE/QUERY steps, plus the final terminal correctness reward.

	From the user's perspective, random exploration should produce a low cumulative reward, targeted exploration should produce a higher one, and anti-gaming controls should prevent agents from farming reward through repeated or low-value actions.
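To make the contrast concrete, here is a minimal sketch of how an episode's total reward could be assembled from per-step shaping rewards plus the terminal correctness reward. The function name `episode_return` and every numeric value are illustrative assumptions; none of them is taken from the implementation.

```python
from typing import Sequence


def episode_return(step_rewards: Sequence[float], terminal_reward: float) -> float:
    """Sum the non-terminal shaping rewards and add the terminal reward."""
    return sum(step_rewards) + terminal_reward


# Hypothetical traces; the values are made up for illustration only.
sparse_trace = [0.0, 0.0, 0.0]    # before F003: no signal until the episode ends
dense_trace = [0.05, 0.10, 0.15]  # with F003: a small signal on each exploration step

sparse_total = episode_return(sparse_trace, terminal_reward=1.0)
dense_total = episode_return(dense_trace, terminal_reward=1.0)
```

In the sparse case the learner sees the same total whether or not the exploration steps were useful; in the dense case the intermediate values differ, which is what gives exploration something to learn from.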

	---

	## What Is Already Proven

	### Verified in This Demo Run

	- Happy-path SQL exploration smoke flow passes locally.
	- Non-SELECT query error handling passes locally.
	- Budget-exhaustion terminal reward behavior passes locally.
	- Clamp boundary unit tests for step-reward floor/ceiling pass locally.
	- Full smoke suite passes locally (25/25).

	### Previously Verified Evidence

	- `specs/FEATURES.json` records verifier-approved evidence for F003: `uv run --with pytest pytest tests/ -v` with `166 passed`.
	- `specs/F003-IMPLEMENTATION_SPEC.md` (Section 7, Step 3.2) records final verification evidence and verifier approval.
	- `specs/F003-VERIFICATION_SPEC.md` defines unit/integration/e2e scenarios and edge-case checklist used for this demo plan.

	---

	## What Still Needs User Verification

	- Run a real episode manually (`reset` → `DESCRIBE/SAMPLE/QUERY/ANSWER`) and inspect live `observation.reward` progression across steps.
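	- Confirm the training-facing calibration in your own workload and under your own runtime conditions: cumulative reward of roughly 0.1 for random exploration, roughly 0.3 for targeted exploration, and roughly 1.3 in total for an episode that ends with a correct answer.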
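One way to check the calibration figures is to compare observed cumulative rewards against the targets quoted above. The sketch below is a hypothetical helper: the `TARGETS` keys, the function name `within_target`, and the 50% tolerance are assumptions chosen for illustration, not part of the spec.

```python
# Calibration targets quoted in this document; the keys are hypothetical labels.
TARGETS = {"random": 0.1, "targeted": 0.3, "correct_total": 1.3}


def within_target(kind: str, observed: float, rel_tol: float = 0.5) -> bool:
    """Return True if an observed cumulative reward lies near its target.

    The tolerance is relative to the target value and is an assumed default.
    """
    target = TARGETS[kind]
    return abs(observed - target) <= rel_tol * target


# Example: average the cumulative reward from several random-policy episodes
# (values are made up) and compare against the "random" target.
random_episode_totals = [0.08, 0.12, 0.11]
random_mean = sum(random_episode_totals) / len(random_episode_totals)
random_ok = within_target("random", random_mean)
```

Pick a tolerance that matches how noisy your own runs are; a tight band is only meaningful when it is averaged over enough episodes.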

	---

	## Quickstart / Verification Steps

	> Run these commands to see the feature in action:

	```bash
	uv run --with pytest pytest tests/test_smoke.py -v -k "sample_and_query_success"
	uv run --with pytest pytest tests/test_smoke.py -v -k "query_rejects_non_select"
	uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward_clamp_upper or compute_reward_clamp_lower"
	```

	No extra setup was needed in this environment beyond project dependencies.

	---

	## Live Local Proof

	> This feature is internal server-side reward logic, and there is no direct end-user CLI command for the reward computation itself, so the strongest truthful local proof is targeted execution of the smoke and unit tests.

	### Run a happy-path exploration step flow

	This validates a representative non-terminal exploration path.

	```bash
	uv run --with pytest pytest tests/test_smoke.py -v -k "sample_and_query_success"
	```

	```text
	============================= test session starts ==============================
	platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpjnSgOs/bin/python
	cachedir: .pytest_cache
	rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
	configfile: pyproject.toml
	plugins: anyio-4.13.0
	collecting ... collected 25 items / 24 deselected / 1 selected

	tests/test_smoke.py::TestEnvironment::test_sample_and_query_success PASSED [100%]

	======================= 1 passed, 24 deselected in 3.79s =======================
	```

	Notice the targeted flow test passes, showing exploration/query behavior remains valid under dense reward integration.

	### Verify boundary clamping behavior

	This checks upper/lower clamp boundaries for cumulative step rewards.

	```bash
	uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward_clamp_upper or compute_reward_clamp_lower"
	```

	```text
	============================= test session starts ==============================
	platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmp91LChv/bin/python
	cachedir: .pytest_cache
	rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
	configfile: pyproject.toml
	plugins: anyio-4.13.0
	collecting ... collected 66 items / 64 deselected / 2 selected

	tests/unit/test_reward.py::TestComputeStepReward::test_compute_reward_clamp_upper PASSED [ 50%]
	tests/unit/test_reward.py::TestComputeStepReward::test_compute_reward_clamp_lower PASSED [100%]

	======================= 2 passed, 64 deselected in 4.58s =======================
	```

	This confirms reward accumulation boundaries are enforced at both extremes.
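The clamping that these tests exercise can be sketched as follows. This is a minimal illustration under stated assumptions: the names `clamp` and `accumulate` and the bounds `STEP_FLOOR` and `STEP_CEILING` are hypothetical, and the real floor and ceiling values are not given in this document.

```python
def clamp(value: float, floor: float, ceiling: float) -> float:
    """Restrict value to the closed interval [floor, ceiling]."""
    if floor > ceiling:
        raise ValueError("floor must not exceed ceiling")
    return max(floor, min(ceiling, value))


# Hypothetical bounds for the cumulative step reward (not from the spec).
STEP_FLOOR = -0.2
STEP_CEILING = 0.5


def accumulate(cumulative: float, step_reward: float) -> float:
    """Add a step reward to the running total, then clamp the result."""
    return clamp(cumulative + step_reward, STEP_FLOOR, STEP_CEILING)
```

Clamping the running total, rather than each individual step, caps how much an agent can gain or lose through exploration alone, whatever the number of steps.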

	---

	## Existing Evidence

	- `specs/F003-IMPLEMENTATION_SPEC.md` Section 7 includes recorded per-slice evidence for Layer 1, Layer 2, integration wiring, and full-suite verification.
	- `specs/FEATURES.json` includes approved verification evidence (`tests_run: 166`, `tests_passed: 166`).

	---

	## Manual Verification Checklist

	1. Start a fresh episode and run one `DESCRIBE` action.
	2. Run at least two distinct `QUERY` actions, then repeat one exact query.
	3. Confirm repeat behavior is less rewarding than first-time useful queries.
	4. Submit an invalid/non-SELECT query and confirm safe penalty behavior.
	5. End with `ANSWER` and verify terminal reward still follows correctness outcome.
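Step 3 of the checklist can be checked mechanically once the per-step rewards are written down. The sketch below assumes you record each step as an action, its argument, and the reward you observed; `StepRecord` and `repeats_less_rewarding` are hypothetical names, and the trace values are made up.

```python
from typing import NamedTuple


class StepRecord(NamedTuple):
    action: str    # e.g. "DESCRIBE", "SAMPLE", "QUERY", "ANSWER"
    argument: str  # table name, SQL text, or answer value
    reward: float  # reward observed for this step


def repeats_less_rewarding(trace: list[StepRecord]) -> bool:
    """Return True if every exact repeat of a QUERY earned strictly less
    than the first run of that same query."""
    first_reward: dict[str, float] = {}
    for step in trace:
        if step.action != "QUERY":
            continue
        if step.argument in first_reward:
            if step.reward >= first_reward[step.argument]:
                return False
        else:
            first_reward[step.argument] = step.reward
    return True


# Hand-recorded example trace with illustrative values.
trace = [
    StepRecord("DESCRIBE", "employees", 0.02),
    StepRecord("QUERY", "SELECT COUNT(*) FROM employees", 0.05),
    StepRecord("QUERY", "SELECT name FROM employees LIMIT 5", 0.04),
    StepRecord("QUERY", "SELECT COUNT(*) FROM employees", 0.0),
]
check_passed = repeats_less_rewarding(trace)
```

The comparison uses exact string equality, matching the checklist's "repeat one exact query"; whether the environment also treats queries that differ only in whitespace as repeats is not stated here.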

	---

	## Edge Cases Exercised

	### Invalid non-SELECT query is safely handled

	```bash
	uv run --with pytest pytest tests/test_smoke.py -v -k "query_rejects_non_select"
	```

	```text
	============================= test session starts ==============================
	platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpitwmJ8/bin/python
	cachedir: .pytest_cache
	rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
	configfile: pyproject.toml
	plugins: anyio-4.13.0
	collecting ... collected 25 items / 24 deselected / 1 selected

	tests/test_smoke.py::TestEnvironment::test_query_rejects_non_select PASSED [100%]

	======================= 1 passed, 24 deselected in 4.04s =======================
	```

	This matters because SQL errors and unsafe query patterns should not break the reward flow.
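For intuition, a guard of this kind could look like the sketch below. It is a deliberately rough illustration and an assumption about the approach: `is_select_only` is a hypothetical name, and the environment's actual validation logic is not shown in this document.

```python
import re


def is_select_only(sql: str) -> bool:
    """Rough guard: accept a single statement that begins with SELECT.

    Limitations of this sketch: it rejects CTEs that start with WITH, and it
    rejects any query containing a semicolon inside a string literal.
    """
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False
    return re.match(r"select\b", statement, re.IGNORECASE) is not None
```

A production guard would more likely rely on a SQL parser or on a read-only database connection than on string matching.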

	### Budget exhaustion keeps terminal reward contract

	```bash
	uv run --with pytest pytest tests/test_smoke.py -v -k "budget_exhaustion_sets_done_and_zero_reward"
	```

	```text
	============================= test session starts ==============================
	platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpRB9qch/bin/python
	cachedir: .pytest_cache
	rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
	configfile: pyproject.toml
	plugins: anyio-4.13.0
	collecting ... collected 25 items / 24 deselected / 1 selected

	tests/test_smoke.py::TestEnvironment::test_budget_exhaustion_sets_done_and_zero_reward PASSED [100%]

	======================= 1 passed, 24 deselected in 3.89s =======================
	```

	This matters because dense shaping must not corrupt terminal episode semantics.
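The contract named by the test (budget exhaustion sets `done` and a zero reward) can be sketched as below. `StepResult` and `apply_budget` are hypothetical names, and zeroing the shaping reward on the exhausting step is an assumption drawn from the test name rather than from the implementation.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    reward: float
    done: bool


def apply_budget(step_reward: float, budget_remaining: int) -> tuple[StepResult, int]:
    """Consume one unit of budget for a non-terminal step.

    If the budget runs out, end the episode with zero reward; otherwise pass
    the shaping reward through and return the remaining budget.
    """
    remaining = budget_remaining - 1
    if remaining <= 0:
        return StepResult(reward=0.0, done=True), 0
    return StepResult(reward=step_reward, done=False), remaining
```

Keeping the exhaustion case separate from the shaping path makes it clear that an episode which never reaches `ANSWER` cannot end on a positive terminal signal.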

	---

	## Test Evidence (Optional)

	> Supplementary proof that the feature works correctly across broader scenarios.

	| Test Suite | Tests | Status |
	|---|---|---|
	| Smoke suite (`tests/test_smoke.py`) | 25 | All passed |

	Representative command:

	```bash
	uv run --with pytest pytest tests/test_smoke.py -v
	```

	```text
	[... full smoke output ...]
	============================== 25 passed in 3.67s ==============================
	```

	---

	## Feature Links

	- Implementation spec: `specs/F003-IMPLEMENTATION_SPEC.md`
	- Verification spec: `specs/F003-VERIFICATION_SPEC.md`

	---

	Demo generated by `feature-demo` agent. Re-run with `/feature-demo F003` to refresh.