| # Feature Demo: F005 — Green Agent Wrapper |
|
|
| > **Generated:** 2026-03-28T00:10:42Z |
| > **Context source:** spec + discovery only (implementation not read) |
| > **Feature entry:** [FEATURES.json #F005](FEATURES.json) |
|
|
| --- |
|
|
| ## What This Feature Does |
|
|
| This feature lets you evaluate a policy over many episodes in one call and get structured results back, instead of manually stepping episodes and aggregating outcomes yourself. It is designed to answer practical questions like: “How does policy X perform over 100 episodes?” |
|
|
| From a user perspective, the key value is fast, repeatable comparison. You can use a built-in random baseline, run seeded evaluations for deterministic comparisons, and inspect both aggregate metrics and per-episode outcomes without losing the whole run if one episode fails. |
|
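The behavior described above can be illustrated with a minimal, self-contained sketch. It is not the F005 implementation (which this demo did not read). The aggregate field names mirror the result fields printed later in this demo, but the `EpisodeResult` fields, the toy `CoinFlipEnv`, the `seed + i` seeding scheme, and the choice to average over completed episodes only are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EpisodeResult:
    # Hypothetical per-episode fields; the real schema may differ.
    episode_index: int
    success: bool
    reward: float
    steps: int
    error: Optional[str] = None


@dataclass
class EvaluationResult:
    # Aggregate field names follow the output shown in this demo.
    n_episodes: int
    n_completed: int
    success_rate: float
    avg_reward: float
    avg_steps: float
    episodes: list = field(default_factory=list)


class CoinFlipEnv:
    """Toy stand-in environment (hypothetical, not SQLEnvironment)."""

    def run_episode(self, policy, seed):
        rng = random.Random(seed)
        if seed % 5 == 4:
            # Simulate an occasional failing episode.
            raise RuntimeError("simulated environment failure")
        steps = rng.randint(1, 10)
        success = policy(rng)
        return success, (1.0 if success else 0.0), steps


def random_policy(rng):
    # Baseline policy: succeeds on a fair coin flip.
    return rng.random() < 0.5


def evaluate_sketch(env, policy, n_episodes, seed=0):
    episodes = []
    for i in range(n_episodes):
        try:
            success, reward, steps = env.run_episode(policy, seed + i)
            episodes.append(EpisodeResult(i, success, reward, steps))
        except Exception as exc:
            # Record the failure and keep going so one bad episode
            # does not discard the rest of the run.
            episodes.append(EpisodeResult(i, False, 0.0, 0, error=str(exc)))
    completed = [e for e in episodes if e.error is None]
    n = len(completed)
    return EvaluationResult(
        n_episodes=n_episodes,
        n_completed=n,
        success_rate=sum(e.success for e in completed) / n if n else 0.0,
        avg_reward=sum(e.reward for e in completed) / n if n else 0.0,
        avg_steps=sum(e.steps for e in completed) / n if n else 0.0,
        episodes=episodes,
    )


result = evaluate_sketch(CoinFlipEnv(), random_policy, n_episodes=10, seed=0)
print(result.n_episodes, result.n_completed, len(result.episodes))  # 10 8 10
```

In this toy run, the two simulated failures (episode seeds 4 and 9) are recorded as per-episode errors while the other eight episodes still contribute to the aggregates.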
|
| --- |
|
|
| ## What Is Already Proven |
|
|
| ### Verified in This Demo Run |
|
|
| - Public evaluation API imports successfully (`RandomPolicy`, `evaluate`, result types). |
| - `evaluate(..., n_episodes=0)` returns a valid zero-valued result object. |
| - Integration and determinism verification tests passed locally against the real `SQLEnvironment` flow (2 passed). |
| - Progress-callback verification passed locally (1 passed). |
| - Full F005 evaluation test file passed locally (16 passed). |
|
|
| ### Previously Verified Evidence |
|
|
| - `specs/FEATURES.json` records approved verification evidence for F005: |
|   - Command: `uv run --with pytest pytest tests/test_evaluation.py -v` |
|   - Result: 16 passed |
|   - Verifier result: approved |
|   - Timestamp: 2026-03-28T00:04:03Z |
| - `specs/F005-IMPLEMENTATION_SPEC.md` Step 2.2 records full-project regression evidence (`116 passed, 1 skipped`) after integration coverage was added. |
|
|
| --- |
|
|
| ## What Still Needs User Verification |
|
|
| None. |
|
|
| --- |
|
|
| ## Quickstart / Verification Steps |
|
|
| > Run these commands to see the feature in action: |
|
|
| ```bash |
| uv sync |
| uv run python -c "from evaluation import evaluate; r=evaluate(None, None, n_episodes=0); print(r)" |
| uv run --with pytest pytest tests/test_evaluation.py -v |
| ``` |
|
|
| Prerequisite: run from project root with dependencies available via `uv`. |
|
|
| --- |
|
|
| ## Live Local Proof |
|
|
| ### Load the evaluation API |
|
|
| This confirms the user-facing evaluation surface is available from the package. |
|
|
| ```bash |
| uv run python -c "from evaluation import RandomPolicy, evaluate, EpisodeResult, EvaluationResult; print('evaluation_api_import_ok')" |
| ``` |
|
|
| ``` |
| evaluation_api_import_ok |
| ``` |
|
|
| Notice that all primary public symbols for F005 import cleanly. |
|
|
| ### Run evaluate() in zero-episode mode |
|
|
| This demonstrates a documented boundary behavior of the evaluation call. |
|
|
| ```bash |
| uv run python -c "from evaluation import evaluate; r=evaluate(None, None, n_episodes=0); print(f'n_episodes={r.n_episodes} n_completed={r.n_completed} success_rate={r.success_rate} avg_reward={r.avg_reward} avg_steps={r.avg_steps} episodes={len(r.episodes)}')" |
| ``` |
|
|
| ``` |
| n_episodes=0 n_completed=0 success_rate=0.0 avg_reward=0.0 avg_steps=0.0 episodes=0 |
| ``` |
|
|
| Notice that the function returns a clean structured result instead of failing on this edge input. |
|
|
| ### Verify real-environment integration and seeded determinism |
|
|
| This checks the core happy-path flow with real environment integration and repeatable seeded behavior. |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_integration_with_sql_environment or test_evaluate_integration_is_deterministic_with_seeds" |
| ``` |
|
|
| ``` |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpxjssag/bin/python |
| cachedir: .pytest_cache |
| rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper |
| configfile: pyproject.toml |
| plugins: anyio-4.13.0 |
| collecting ... collected 16 items / 14 deselected / 2 selected |
| |
| tests/test_evaluation.py::test_evaluate_integration_with_sql_environment PASSED [ 50%] |
| tests/test_evaluation.py::test_evaluate_integration_is_deterministic_with_seeds PASSED [100%] |
| |
| ======================= 2 passed, 14 deselected in 4.29s ======================= |
| ``` |
|
|
| Notice that both the integration test and the seeded-determinism test passed in this run. |
|
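As a rough illustration of what seeded determinism means here, the sketch below gives each episode its own random number generator derived from a base seed. The function names and the `base_seed + i` scheme are hypothetical; the source only confirms that seeded evaluations are repeatable, not how seeds are assigned.

```python
import random


def run_seeded_episode(seed):
    """Toy episode whose outcome depends only on its own seeded RNG."""
    rng = random.Random(seed)  # private RNG; global random state is untouched
    steps = rng.randint(1, 15)
    reward = round(rng.random(), 6)
    return steps, reward


def evaluate_seeded(n_episodes, base_seed):
    # Assumed scheme: episode i uses base_seed + i.
    return [run_seeded_episode(base_seed + i) for i in range(n_episodes)]


first = evaluate_seeded(5, base_seed=42)
second = evaluate_seeded(5, base_seed=42)
shifted = evaluate_seeded(5, base_seed=43)

print(first == second)          # True: same seed, same outcomes
print(shifted[:4] == first[1:])  # True: seeds 43..46 overlap both runs
```

Using a per-episode `random.Random` instance rather than the module-level functions keeps results independent of whatever else has consumed the global random state.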
|
| --- |
|
|
| ## Existing Evidence |
|
|
| - Verification spec reference: `specs/F005-VERIFICATION_SPEC.md` |
| - Implementation-step evidence: `specs/F005-IMPLEMENTATION_SPEC.md` (Step 2.2) |
| - Feature registry evidence: `specs/FEATURES.json` → `features[id=F005].verification_evidence` |
|
|
| --- |
|
|
| ## Manual Verification Checklist |
|
|
| No additional manual verification required. |
|
|
| --- |
|
|
| ## Edge Cases Exercised |
|
|
| ### Zero and negative episode counts |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_negative_episodes_raises or test_evaluate_zero_episodes_returns_zero_values" |
| ``` |
|
|
| ``` |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpBSdLqD/bin/python |
| cachedir: .pytest_cache |
| rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper |
| configfile: pyproject.toml |
| plugins: anyio-4.13.0 |
| collecting ... collected 16 items / 14 deselected / 2 selected |
| |
| tests/test_evaluation.py::test_evaluate_zero_episodes_returns_zero_values PASSED [ 50%] |
| tests/test_evaluation.py::test_evaluate_negative_episodes_raises PASSED [100%] |
| |
| ======================= 2 passed, 14 deselected in 4.02s ======================= |
| ``` |
|
|
| This matters because F005 must handle both boundary (`0`) and invalid (`-1`) episode requests predictably. |
|
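The two boundary behaviors can be sketched as a small guard at the top of an evaluation function. This is an assumption-laden illustration: the source confirms only that `0` returns zero-valued results and that a negative count raises, so the `ValueError` type, the message, and the dictionary return shape are hypothetical.

```python
def evaluate_boundaries(n_episodes):
    """Illustrative guard for episode-count edge cases (not the real API)."""
    if n_episodes < 0:
        # Assumption: ValueError. The source only says negative counts raise.
        raise ValueError(f"n_episodes must be >= 0, got {n_episodes}")
    if n_episodes == 0:
        # Return explicit zeros instead of dividing by zero when averaging.
        return {
            "n_episodes": 0,
            "n_completed": 0,
            "success_rate": 0.0,
            "avg_reward": 0.0,
            "avg_steps": 0.0,
            "episodes": [],
        }
    raise NotImplementedError("episode loop omitted from this sketch")


zero = evaluate_boundaries(0)

try:
    evaluate_boundaries(-1)
    raised = False
except ValueError:
    raised = True

print(zero["n_episodes"], zero["success_rate"], raised)  # 0 0.0 True
```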
|
| ### Progress callback behavior during evaluation |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_evaluation.py -v -k "test_evaluate_progress_callback_receives_episode_progress" |
| ``` |
|
|
| ``` |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmp164LzQ/bin/python |
| cachedir: .pytest_cache |
| rootdir: /Users/hjerp/Projects/sql-env-F005-green-agent-wrapper |
| configfile: pyproject.toml |
| plugins: anyio-4.13.0 |
| collecting ... collected 16 items / 15 deselected / 1 selected |
| |
| tests/test_evaluation.py::test_evaluate_progress_callback_receives_episode_progress PASSED [100%] |
| |
| ======================= 1 passed, 15 deselected in 3.78s ======================= |
| ``` |
|
|
| This matters because progress visibility was an explicit anti-frustration requirement. |
|
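A progress callback of this kind could look like the following sketch. The `(completed, total)` callback signature and the `progress_callback` parameter name are assumptions; the demo confirms only that a callback receives episode progress, not its exact arguments.

```python
def evaluate_with_progress(n_episodes, progress_callback=None):
    """Toy loop that reports progress after each episode (hypothetical API)."""
    for i in range(n_episodes):
        # ... an episode would run here ...
        if progress_callback is not None:
            # Assumed signature: callback(completed_so_far, total).
            progress_callback(i + 1, n_episodes)
    return n_episodes


calls = []
evaluate_with_progress(
    3, progress_callback=lambda done, total: calls.append((done, total))
)
print(calls)  # [(1, 3), (2, 3), (3, 3)]
```

Making the callback optional keeps the default call silent while letting long runs report progress to a logger or progress bar.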
|
| --- |
|
|
| ## Test Evidence (Optional) |
|
|
| > Supplementary proof that the feature works correctly across all scenarios. |
| > The Live Local Proof section above shows usage surfaces; this section shows broader verification coverage. |
|
|
| | Test Suite | Tests | Status | |
| |---|---|---| |
| | F005 evaluation tests (`tests/test_evaluation.py`) | 16 | All passed | |
|
|
| Representative command: |
|
|
| ```bash |
| uv run --with pytest pytest tests/test_evaluation.py -v |
| ``` |
|
|
| Representative output summary: |
|
|
| ``` |
| ============================== 16 passed in 4.05s ============================== |
| ``` |
|
|
| --- |
|
|
| ## Feature Links |
|
|
| - Implementation spec: `specs/F005-IMPLEMENTATION_SPEC.md` |
| - Verification spec: `specs/F005-VERIFICATION_SPEC.md` |
|
|
| --- |
|
|
| *Demo generated by `feature-demo` agent. Re-run with `/feature-demo F005` to refresh.* |
|
|