	# Feature Demo: F005 — Green Agent Wrapper

	> Generated: 2026-03-28T00:10:42Z
	> Context source: spec + discovery only (implementation not read)
	> Feature entry: [FEATURES.json #F005](FEATURES.json)

	---

	## What This Feature Does

	This feature lets you evaluate a policy over many episodes in one call and get structured results back, instead of manually stepping episodes and aggregating outcomes yourself. It is designed to answer practical questions like: “How does policy X perform over 100 episodes?”

	From a user perspective, the key value is fast, repeatable comparison. You can use a built-in random baseline, run seeded evaluations for deterministic comparisons, and inspect both aggregate metrics and per-episode outcomes without losing the whole run if one episode fails.
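The flow described above can be sketched as a loop that runs each episode in isolation and then aggregates. This is a minimal sketch under stated assumptions, not the actual implementation (which this demo did not read). The aggregate field names mirror the result printed later in this document; `run_toy_episode`, `evaluate_sketch`, and the per-episode fields are hypothetical stand-ins.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EpisodeResult:
    # Per-episode fields are illustrative guesses, not the real schema.
    index: int
    reward: float
    steps: int
    success: bool
    error: Optional[str] = None


@dataclass
class EvaluationResult:
    n_episodes: int
    n_completed: int
    success_rate: float
    avg_reward: float
    avg_steps: float
    episodes: List[EpisodeResult] = field(default_factory=list)


def run_toy_episode(rng: random.Random, index: int) -> EpisodeResult:
    """Hypothetical stand-in for one environment episode."""
    if index == 2:
        raise RuntimeError("simulated environment failure")
    steps = rng.randint(1, 5)
    success = rng.random() < 0.5
    return EpisodeResult(index, 1.0 if success else 0.0, steps, success)


def evaluate_sketch(n_episodes: int, seed: int = 0) -> EvaluationResult:
    episodes: List[EpisodeResult] = []
    for i in range(n_episodes):
        rng = random.Random(seed + i)  # one derived seed per episode
        try:
            episodes.append(run_toy_episode(rng, i))
        except Exception as exc:
            # Record the failure and keep going so the run is not lost.
            episodes.append(EpisodeResult(i, 0.0, 0, False, error=str(exc)))
    done = [e for e in episodes if e.error is None]
    n = len(done)
    return EvaluationResult(
        n_episodes=n_episodes,
        n_completed=n,
        success_rate=sum(e.success for e in done) / n if n else 0.0,
        avg_reward=sum(e.reward for e in done) / n if n else 0.0,
        avg_steps=sum(e.steps for e in done) / n if n else 0.0,
        episodes=episodes,
    )


result = evaluate_sketch(5)
print(result.n_completed, len(result.episodes))  # 4 5
```

In this sketch the failed episode stays in `episodes` with its error message, while the aggregates are computed only over completed episodes. Whether the real `evaluate` averages the same way is not confirmed by this demo.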

	---

	## What Is Already Proven

	### Verified in This Demo Run

	- Public evaluation API imports successfully (`RandomPolicy`, `evaluate`, result types).
	- `evaluate(..., n_episodes=0)` returns a valid zero-valued result object.
	- Integration/determinism verification tests passed locally against real SQLEnvironment flow (2 passed).
	- Progress-callback verification passed locally (1 passed).
	- Full F005 evaluation test file passed locally (16 passed).

	### Previously Verified Evidence

	- `specs/FEATURES.json` records approved verification evidence for F005:
	  - Command: `uv run --with pytest pytest tests/test_evaluation.py -v`
	  - Result: 16 passed
	  - Verifier result: approved
	  - Timestamp: 2026-03-28T00:04:03Z
	- `specs/F005-IMPLEMENTATION_SPEC.md` Step 2.2 records full-project regression evidence (`116 passed, 1 skipped`) after integration coverage was added.

	---

	## What Still Needs User Verification

	None.

	---

	## Quickstart / Verification Steps

	> Run these commands to see the feature in action:

	```bash
	uv sync
	uv run python -c "from evaluation import evaluate; r=evaluate(None, None, n_episodes=0); print(r)"
	uv run --with pytest pytest tests/test_evaluation.py -v
	```

	Prerequisite: run from project root with dependencies available via `uv`.

	---

	## Live Local Proof

	### Load the evaluation API

	This confirms the user-facing evaluation surface is available from the package.

	```bash
	uv run python -c "from evaluation import RandomPolicy, evaluate, EpisodeResult, EvaluationResult; print('evaluation_api_import_ok')"
	```

	```
	evaluation_api_import_ok
	```

	Notice that all primary public symbols for F005 import cleanly.
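Of the imported symbols, `RandomPolicy` is the built-in random baseline. Its real interface is not shown in this demo, so the following is only a sketch of what a seedable random baseline could look like. The class name `RandomPolicySketch`, the `select_action` method, and the placeholder action names are all assumptions made for illustration.

```python
import random
from typing import Optional, Sequence


class RandomPolicySketch:
    """Hypothetical random baseline: picks uniformly among offered actions."""

    def __init__(self, seed: Optional[int] = None) -> None:
        # A private RNG keeps the policy reproducible without touching
        # the global random state.
        self._rng = random.Random(seed)

    def select_action(self, actions: Sequence[str]) -> str:
        # `select_action` is an assumed method name, not the real API.
        return self._rng.choice(actions)


actions = ["a", "b", "c"]  # placeholder action names
first = RandomPolicySketch(seed=7)
second = RandomPolicySketch(seed=7)
picks_first = [first.select_action(actions) for _ in range(5)]
picks_second = [second.select_action(actions) for _ in range(5)]
print(picks_first == picks_second)  # True
```

Two instances built with the same seed make the same choices, which is the property a baseline needs for the seeded comparisons this feature describes.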

	### Run evaluate() in zero-episode mode

	This demonstrates a documented boundary behavior of the evaluation call.

	```bash
	uv run python -c "from evaluation import evaluate; r=evaluate(None, None, n_episodes=0); print(f'n_episodes={r.n_episodes} n_completed={r.n_completed} success_rate={r.success_rate} avg_reward={r.avg_reward} avg_steps={r.avg_steps} episodes={len(r.episodes)}')"
	```

	```
	n_episodes=0 n_completed=0 success_rate=0.0 avg_reward=0.0 avg_steps=0.0 episodes=0
	```

	Notice that the function returns a clean structured result instead of failing on this edge input.

	### Verify real-environment integration and seeded determinism

	This checks the core happy-path flow with real environment integration and repeatable seeded behavior.

	```bash
	uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_integration_with_sql_environment or test_evaluate_integration_is_deterministic_with_seeds"
	```

	```
	============================= test session starts ==============================
	platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpxjssag/bin/python
	cachedir: .pytest_cache
	rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper
	configfile: pyproject.toml
	plugins: anyio-4.13.0
	collecting ... collected 16 items / 14 deselected / 2 selected

	tests/test_evaluation.py::test_evaluate_integration_with_sql_environment PASSED [ 50%]
	tests/test_evaluation.py::test_evaluate_integration_is_deterministic_with_seeds PASSED [100%]

	======================= 2 passed, 14 deselected in 4.29s =======================
	```

	Notice that both the real-environment integration test and the seeded-determinism test passed in this run.
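The idea behind a seeded-determinism check can be illustrated without the real environment: run the same seeded procedure twice and compare the outputs. This is a sketch only. `run_seeded` is a hypothetical stand-in, and deriving each episode's seed as `seed + i` is an assumption rather than the documented scheme.

```python
import random
from typing import List


def run_seeded(seed: int, n_episodes: int) -> List[float]:
    """Return one pseudo-random 'reward' per episode, derived from the seed."""
    rewards = []
    for i in range(n_episodes):
        rng = random.Random(seed + i)  # assumed per-episode seed derivation
        rewards.append(rng.random())
    return rewards


first_run = run_seeded(42, 3)
second_run = run_seeded(42, 3)
print(first_run == second_run)  # True
```

Using a fresh `random.Random` per episode, rather than the global generator, means one episode cannot change the random stream of the next. That makes per-episode results comparable across runs.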

	---

	## Existing Evidence

	- Verification spec reference: `specs/F005-VERIFICATION_SPEC.md`
	- Implementation-step evidence: `specs/F005-IMPLEMENTATION_SPEC.md` (Step 2.2)
	- Feature registry evidence: `specs/FEATURES.json` → `features[id=F005].verification_evidence`

	---

	## Manual Verification Checklist

	No additional manual verification required.

	---

	## Edge Cases Exercised

	### Zero and negative episode counts

	```bash
	uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_negative_episodes_raises or test_evaluate_zero_episodes_returns_zero_values"
	```

	```
	============================= test session starts ==============================
	platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpBSdLqD/bin/python
	cachedir: .pytest_cache
	rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper
	configfile: pyproject.toml
	plugins: anyio-4.13.0
	collecting ... collected 16 items / 14 deselected / 2 selected

	tests/test_evaluation.py::test_evaluate_zero_episodes_returns_zero_values PASSED [ 50%]
	tests/test_evaluation.py::test_evaluate_negative_episodes_raises PASSED [100%]

	======================= 2 passed, 14 deselected in 4.02s =======================
	```

	This matters because F005 must handle both boundary (`0`) and invalid (`-1`) episode requests predictably.
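One plausible way to get this behavior is an up-front check on the count, plus averages that tolerate an empty run. This is a sketch under assumptions: `check_episode_count` and `mean_or_zero` are hypothetical helpers, and `ValueError` is a guess, since the source only says that a negative count raises.

```python
from typing import Sequence


def check_episode_count(n_episodes: int) -> int:
    """Reject negative counts; zero is allowed and yields an empty run."""
    if n_episodes < 0:
        # ValueError is an assumption; the real exception type is not shown.
        raise ValueError(f"n_episodes must be >= 0, got {n_episodes}")
    return n_episodes


def mean_or_zero(values: Sequence[float]) -> float:
    """Average that returns 0.0 for an empty run instead of dividing by zero."""
    return sum(values) / len(values) if values else 0.0


print(mean_or_zero([]))          # 0.0
print(mean_or_zero([1.0, 0.0]))  # 0.5
try:
    check_episode_count(-1)
except ValueError as exc:
    print(exc)  # n_episodes must be >= 0, got -1
```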

	### Progress callback behavior during evaluation

	```bash
	uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_progress_callback_receives_episode_progress"
	```

	```
	============================= test session starts ==============================
	platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmp164LzQ/bin/python
	cachedir: .pytest_cache
	rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper
	configfile: pyproject.toml
	plugins: anyio-4.13.0
	collecting ... collected 16 items / 15 deselected / 1 selected

	tests/test_evaluation.py::test_evaluate_progress_callback_receives_episode_progress PASSED [100%]

	======================= 1 passed, 15 deselected in 3.78s =======================
	```

	This matters because progress visibility was an explicit anti-frustration requirement.
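A progress hook of this kind is commonly a callable invoked once per finished episode. The sketch below assumes a `(completed, total)` signature and a keyword named `progress_callback`. Neither is confirmed by this demo, and `evaluate_with_progress` is a hypothetical stand-in for the real call.

```python
from typing import Callable, List, Optional, Tuple


def evaluate_with_progress(
    n_episodes: int,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> int:
    """Run a placeholder loop and report progress after each episode."""
    for i in range(n_episodes):
        # The real episode rollout would happen here.
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)
    return n_episodes


seen: List[Tuple[int, int]] = []
evaluate_with_progress(3, lambda done, total: seen.append((done, total)))
print(seen)  # [(1, 3), (2, 3), (3, 3)]
```

Keeping the callback optional means callers who do not need progress output pay nothing for it.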

	---

	## Test Evidence (Optional)

	> Supplementary proof that the feature works correctly across all scenarios.
	> The Live Local Proof section above shows usage surfaces; this section shows broader verification coverage.

	| Test Suite | Tests | Status |
	|---|---|---|
	| F005 evaluation tests (`tests/test_evaluation.py`) | 16 | All passed |

	Representative command:

	```bash
	uv run --with pytest pytest tests/test_evaluation.py -v
	```

	Representative output summary:

	```
	============================== 16 passed in 4.05s ==============================
	```

	---

	## Feature Links

	- Implementation spec: `specs/F005-IMPLEMENTATION_SPEC.md`
	- Verification spec: `specs/F005-VERIFICATION_SPEC.md`

	---

	Demo generated by `feature-demo` agent. Re-run with `/feature-demo F005` to refresh.