| # Feature Demo: F003 — Dense Reward System |
|
|
| > **Generated:** 2026-03-28T06:07:34Z |
| > **Context source:** spec + discovery only (implementation not read) |
| > **Feature entry:** [FEATURES.json #F003](FEATURES.json) |
|
|
| --- |
|
|
| ## What This Feature Does |
|
|
| Before this feature, agents received only a binary reward at the end of an episode, so intermediate exploration steps provided no direct feedback to learn from. With F003, agents also receive small, meaningful reward signals during non-terminal DESCRIBE/SAMPLE/QUERY steps, in addition to the final terminal correctness reward. |
|
|
| From the user's perspective, this means random exploration should produce low cumulative reward, targeted exploration should produce higher cumulative reward, and anti-gaming controls should prevent agents from farming rewards through repeated or low-value actions. |
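
As a rough illustration of the reward structure described above, the sketch below adds up per-step shaping rewards and then adds a terminal correctness reward. The function name `episode_return`, the individual step values, and the terminal value of 1.0 are assumptions made for illustration only; the implementation was not read for this demo.

```python
from typing import Sequence


def episode_return(
    step_rewards: Sequence[float],
    answered_correctly: bool,
    terminal_reward: float = 1.0,  # hypothetical value, not taken from the implementation
) -> float:
    """Sum hypothetical non-terminal step rewards and add the terminal reward."""
    shaping = sum(step_rewards)
    terminal = terminal_reward if answered_correctly else 0.0
    return shaping + terminal


# Made-up step rewards for one targeted episode, ended with and without a correct answer.
targeted = episode_return([0.1, 0.1, 0.1], answered_correctly=True)
wrong = episode_return([0.1, 0.1, 0.1], answered_correctly=False)
print(round(targeted, 2), round(wrong, 2))
```

The point of the sketch is only that shaping rewards and the terminal reward are separate contributions to the episode total.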
|
|
| --- |
|
|
| ## What Is Already Proven |
|
|
| ### Verified in This Demo Run |
|
|
| - Happy-path SQL exploration smoke flow passes locally. |
| - Non-SELECT query error handling passes locally. |
| - Budget-exhaustion terminal reward behavior passes locally. |
| - Clamp boundary unit tests for step-reward floor/ceiling pass locally. |
| - Full smoke suite passes locally (25/25). |
|
|
| ### Previously Verified Evidence |
|
|
| - `specs/FEATURES.json` records verifier-approved evidence for F003: `uv run --with pytest pytest tests/ -v` with `166 passed`. |
| - `specs/F003-IMPLEMENTATION_SPEC.md` (Section 7, Step 3.2) records final verification evidence and verifier approval. |
| - `specs/F003-VERIFICATION_SPEC.md` defines unit/integration/e2e scenarios and edge-case checklist used for this demo plan. |
|
|
| --- |
|
|
| ## What Still Needs User Verification |
|
|
| - Run a real episode manually (`reset` → `DESCRIBE/SAMPLE/QUERY/ANSWER`) and inspect live `observation.reward` progression across steps. |
| - Confirm training-facing calibration in your own workload (cumulative reward of ~0.1 for random exploration, ~0.3 for targeted exploration, and ~1.3 in total for an episode that ends with a correct answer) under your runtime conditions. |
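
One way to approach the calibration check is to collect episode totals per exploration style and compare their means with the targets quoted above. The sketch below is a minimal helper under stated assumptions: `check_calibration`, the `TARGETS` keys, the tolerance of 0.1, and the sample numbers are all hypothetical and are not part of the project.

```python
from statistics import mean

# Targets quoted in this document; the key names are hypothetical labels.
TARGETS = {"random": 0.1, "targeted": 0.3, "correct_total": 1.3}


def check_calibration(
    measured: dict[str, list[float]],
    tolerance: float = 0.1,  # hypothetical tolerance, choose your own
) -> dict[str, bool]:
    """Return, per label, whether the mean episode total is within tolerance of its target."""
    return {
        name: abs(mean(values) - TARGETS[name]) <= tolerance
        for name, values in measured.items()
    }


# Made-up episode totals, used purely to exercise the helper.
sample = {
    "random": [0.05, 0.12, 0.10],
    "targeted": [0.28, 0.35],
    "correct_total": [1.25, 1.32],
}
print(check_calibration(sample))
```

Replace `sample` with totals recorded from your own episodes; the helper makes no claim about how the environment exposes them.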
|
|
| --- |
|
|
| ## Quickstart / Verification Steps |
|
|
| > Run these commands to see the feature in action: |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_smoke.py -v -k "sample_and_query_success" |
| uv run --with pytest pytest tests/test_smoke.py -v -k "query_rejects_non_select" |
| uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward_clamp_upper or compute_reward_clamp_lower" |
| ``` |
|
|
| No extra setup was needed in this environment beyond project dependencies. |
|
|
| --- |
|
|
| ## Live Local Proof |
|
|
| > This feature is internal server-side reward logic (there is no direct end-user CLI command for reward computation itself), so the strongest truthful local proof is a targeted run of the smoke and unit tests. |
|
|
| ### Run a happy-path exploration step flow |
|
|
| This validates a representative non-terminal exploration path. |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_smoke.py -v -k "sample_and_query_success" |
| ``` |
|
|
| ```text |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpjnSgOs/bin/python |
| cachedir: .pytest_cache |
| rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system |
| configfile: pyproject.toml |
| plugins: anyio-4.13.0 |
| collecting ... collected 25 items / 24 deselected / 1 selected |
| |
| tests/test_smoke.py::TestEnvironment::test_sample_and_query_success PASSED [100%] |
| |
| ======================= 1 passed, 24 deselected in 3.79s ======================= |
| ``` |
|
|
| The targeted flow test passes, which shows that exploration and query behavior remain valid with dense rewards integrated. It does not display the reward values themselves. |
|
|
| ### Verify boundary clamping behavior |
|
|
| This checks upper/lower clamp boundaries for cumulative step rewards. |
|
|
| ```bash |
| uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward_clamp_upper or compute_reward_clamp_lower" |
| ``` |
|
|
| ```text |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmp91LChv/bin/python |
| cachedir: .pytest_cache |
| rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system |
| configfile: pyproject.toml |
| plugins: anyio-4.13.0 |
| collecting ... collected 66 items / 64 deselected / 2 selected |
| |
| tests/unit/test_reward.py::TestComputeStepReward::test_compute_reward_clamp_upper PASSED [ 50%] |
| tests/unit/test_reward.py::TestComputeStepReward::test_compute_reward_clamp_lower PASSED [100%] |
| |
| ======================= 2 passed, 64 deselected in 4.58s ======================= |
| ``` |
|
|
| This confirms reward accumulation boundaries are enforced at both extremes. |
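
To make the clamping idea concrete, here is a minimal sketch of a running total that is clamped after every step. The bounds of -0.2 and 0.5 and the names `clamp` and `accumulate_clamped` are assumptions for illustration; the actual bounds, and whether the real code clamps per step or on the cumulative total, were not verified in this demo.

```python
from typing import Iterable


def clamp(value: float, lower: float, upper: float) -> float:
    """Restrict value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def accumulate_clamped(
    step_rewards: Iterable[float],
    lower: float = -0.2,  # hypothetical floor
    upper: float = 0.5,   # hypothetical ceiling
) -> float:
    """Add each step reward to a running total, clamping the total after every step."""
    total = 0.0
    for reward in step_rewards:
        total = clamp(total + reward, lower, upper)
    return total


print(accumulate_clamped([0.3, 0.3, 0.3]))  # total is held at the ceiling
print(accumulate_clamped([-0.3, -0.3]))     # total is held at the floor
```

Clamping after every step, rather than once at the end, means an agent cannot bank reward above the ceiling to offset later penalties.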
|
|
| --- |
|
|
| ## Existing Evidence |
|
|
| - `specs/F003-IMPLEMENTATION_SPEC.md` Section 7 includes recorded per-slice evidence for Layer 1, Layer 2, integration wiring, and full-suite verification. |
| - `specs/FEATURES.json` includes approved verification evidence (`tests_run: 166`, `tests_passed: 166`). |
|
|
| --- |
|
|
| ## Manual Verification Checklist |
|
|
| 1. Start a fresh episode and run one `DESCRIBE` action. |
| 2. Run at least two distinct `QUERY` actions, then repeat one exact query. |
| 3. Confirm repeat behavior is less rewarding than first-time useful queries. |
| 4. Submit an invalid/non-SELECT query and confirm safe penalty behavior. |
| 5. End with `ANSWER` and verify terminal reward still follows correctness outcome. |
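
Steps 2 and 3 of the checklist rely on repeated queries earning less than first-time ones. A minimal sketch of that idea follows; the class `RepeatAwareReward`, its reward values, and the whitespace-only normalisation are hypothetical and are not the project's actual anti-gaming logic.

```python
class RepeatAwareReward:
    """Hypothetical tracker: first-time queries earn a base reward, exact repeats earn less."""

    def __init__(self, base: float = 0.05, repeat_reward: float = -0.01) -> None:
        self.base = base                    # assumed reward for a new query
        self.repeat_reward = repeat_reward  # assumed (negative) reward for a repeat
        self._seen: set[str] = set()

    def score(self, sql: str) -> float:
        # Collapse whitespace so trivially reformatted repeats are still detected.
        key = " ".join(sql.split())
        if key in self._seen:
            return self.repeat_reward
        self._seen.add(key)
        return self.base


tracker = RepeatAwareReward()
first = tracker.score("SELECT name FROM users")
second = tracker.score("SELECT COUNT(*) FROM users")
repeat = tracker.score("SELECT  name   FROM users")  # same query, different spacing
print(first, second, repeat)
```

When running the checklist manually, the corresponding observation is simply that the reward reported for the repeated query is lower than the reward for its first occurrence.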
|
|
| --- |
|
|
| ## Edge Cases Exercised |
|
|
| ### Invalid non-SELECT query is safely handled |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_smoke.py -v -k "query_rejects_non_select" |
| ``` |
|
|
| ```text |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpitwmJ8/bin/python |
| cachedir: .pytest_cache |
| rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system |
| configfile: pyproject.toml |
| plugins: anyio-4.13.0 |
| collecting ... collected 25 items / 24 deselected / 1 selected |
| |
| tests/test_smoke.py::TestEnvironment::test_query_rejects_non_select PASSED [100%] |
| |
| ======================= 1 passed, 24 deselected in 4.04s ======================= |
| ``` |
|
|
| This matters because SQL errors and unsafe query patterns must not break the reward flow. |
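
For readers unfamiliar with this kind of guard, the sketch below shows one conservative way to accept only single `SELECT` statements. The helper `is_select_only` is hypothetical and deliberately strict (it rejects any embedded semicolon, even inside a string literal); it is not the check used by the environment.

```python
def is_select_only(sql: str) -> bool:
    """Return True only for a single statement whose first keyword is SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        # Empty input, or more than one statement: reject.
        return False
    first_word = statement.split(None, 1)[0]
    return first_word.lower() == "select"


for query in ["SELECT * FROM t;", "DROP TABLE t", "SELECT 1; DROP TABLE t"]:
    print(query, "->", is_select_only(query))
```

A rejected query would then take the error path instead of being executed, which is the behavior the smoke test above exercises.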
|
|
| ### Budget exhaustion keeps terminal reward contract |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_smoke.py -v -k "budget_exhaustion_sets_done_and_zero_reward" |
| ``` |
|
|
| ```text |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpRB9qch/bin/python |
| cachedir: .pytest_cache |
| rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system |
| configfile: pyproject.toml |
| plugins: anyio-4.13.0 |
| collecting ... collected 25 items / 24 deselected / 1 selected |
| |
| tests/test_smoke.py::TestEnvironment::test_budget_exhaustion_sets_done_and_zero_reward PASSED [100%] |
| |
| ======================= 1 passed, 24 deselected in 3.89s ======================= |
| ``` |
|
|
| This matters because dense shaping must not corrupt terminal episode semantics. |
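
The contract implied by the test name (`done` set and zero reward on budget exhaustion) can be sketched as follows. `StepResult`, `finish_step`, and the way the budget is counted are assumptions for illustration; the real environment's types and fields may differ.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    reward: float
    done: bool


def finish_step(steps_used: int, budget: int, step_reward: float) -> StepResult:
    """Hypothetical rule: the step that exhausts the budget ends the episode with zero reward."""
    if steps_used >= budget:
        return StepResult(reward=0.0, done=True)
    return StepResult(reward=step_reward, done=False)


print(finish_step(steps_used=3, budget=10, step_reward=0.05))
print(finish_step(steps_used=10, budget=10, step_reward=0.05))
```

Under this reading, shaping rewards earned earlier in the episode do not change the terminal outcome when the budget runs out without an answer.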
|
|
| --- |
|
|
| ## Test Evidence (Optional) |
|
|
| > Supplementary proof that the feature works correctly across broader scenarios. |
|
|
| | Test Suite | Tests | Status | |
| |---|---|---| |
| | Smoke suite (`tests/test_smoke.py`) | 25 | All passed | |
|
|
| Representative command: |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_smoke.py -v |
| ``` |
|
|
| ```text |
| [... full smoke output ...] |
| ============================== 25 passed in 3.67s ============================== |
| ``` |
|
|
| --- |
|
|
| ## Feature Links |
|
|
| - Implementation spec: `specs/F003-IMPLEMENTATION_SPEC.md` |
| - Verification spec: `specs/F003-VERIFICATION_SPEC.md` |
|
|
| --- |
|
|
| *Demo generated by `feature-demo` agent. Re-run with `/feature-demo F003` to refresh.* |
|
|