| # Feature Demo: F006 — GRPO Training Pipeline |
|
|
| > **Generated:** 2026-03-28T07:42:55Z |
| > **Context source:** spec + discovery only (implementation not read) |
| > **Feature entry:** [FEATURES.json #F006](FEATURES.json) |
|
|
| --- |
|
|
| ## What This Feature Does |
|
|
| This feature gives you a single notebook workflow to train an SQLEnv policy with GRPO, then compare behavior before vs after training. The user-facing goal is simple: run one notebook and see whether the trained policy explores the database more strategically than a random baseline. |
|
|
| From a user perspective, success means the workflow is reproducible, the learning signal is visible, and the random-vs-trained comparison is easy to inspect in one place. |
|
|
| --- |
|
|
| ## What Is Already Proven |
|
|
| ### Verified in This Demo Run |
|
|
| - Confirmed the training extra can import TRL GRPO classes locally (`trl-grpo-import-ok`). |
| - Ran error-handling unit suite (`6 passed`) covering model-load failure, question-load failure modes, OOM guidance, and parse-fallback logging behavior. |
| - Ran notebook-oriented E2E smoke suite (`5 passed`) covering structure, difficulty filtering, training step execution, and transcript generation. |
| - Ran integration suite (`2 passed`) covering rollout + reward flow and unparseable-action recovery. |
| - Attempted to launch the notebook UI; the local environment currently lacks the `jupyter` binary (output captured below). |
|
|
| ### Previously Verified Evidence |
|
|
| - `FEATURES.json` (F006) records independent verification as **68/68 tests passed** with verifier result `approved` at `2026-03-28T07:37:20Z`. |
| - Implementation spec Section 7 records the full verification command passing, along with an earlier TRL import check. |
|
|
| --- |
|
|
| ## What Still Needs User Verification |
|
|
| - Open and run `notebooks/train_grpo.ipynb` interactively on a machine with Jupyter available. |
| - Validate the visual learning curve in the notebook output. |
| - Validate side-by-side transcript quality (random vs trained) with your preferred model/runtime. |
|
|
| --- |
|
|
| ## Quickstart / Verification Steps |
|
|
| > Run these commands to see the feature in action: |
|
|
| ```bash |
| uv sync --extra training |
| uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('trl-grpo-import-ok')" |
| uv run --with pytest pytest tests/e2e/test_training_e2e.py -v |
| ``` |
|
|
| If you want the interactive notebook UI, install Jupyter in your environment first. |
|
|
| --- |
|
|
| ## Live Local Proof |
|
|
| ### Attempt to Launch the Training Notebook UI |
|
|
| This is the user-facing entrypoint described in the spec. |
|
|
| ```bash |
| uv run jupyter notebook "notebooks/train_grpo.ipynb" --no-browser --port 8899 |
| ``` |
|
|
| ``` |
| error: Failed to spawn: `jupyter` |
| Caused by: No such file or directory (os error 2) |
| ``` |
|
|
| What to notice: the launch fails because this environment has no `jupyter` binary, not because of the notebook path, so interactive verification is handed off to the user. |
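
One possible workaround for the missing `jupyter` binary is uv's `--with` flag, which this document already uses to pull in pytest on demand. The sketch below was not run in this demo and rests on two assumptions: that adding the `jupyter` package ad hoc is acceptable for your environment, and that the notebook needs the `training` extra at runtime. The `launch_notebook` function name is hypothetical.

```bash
# Hypothetical helper: launch the training notebook with Jupyter supplied ad hoc.
# Assumption: `jupyter` is NOT a declared project dependency, so it is added via --with.
launch_notebook() {
  # Fail early with a clear message if uv itself is unavailable.
  if ! command -v uv >/dev/null 2>&1; then
    echo "uv not found on PATH" >&2
    return 1
  fi
  # Same notebook path and port as the attempt above.
  uv run --extra training --with jupyter \
    jupyter notebook "notebooks/train_grpo.ipynb" --no-browser --port 8899
}
```

Call `launch_notebook` from the repository root. If you would rather have Jupyter as a persistent dependency, install it in your environment instead.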
|
|
| ### Verify GRPO Training Dependencies Resolve Locally |
|
|
| ```bash |
| uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('trl-grpo-import-ok')" |
| ``` |
|
|
| ``` |
| trl-grpo-import-ok |
| ``` |
|
|
| What to notice: the TRL GRPO surface required by the notebook is available in this environment when using the `training` extra. |
|
|
| --- |
|
|
| ## Existing Evidence |
|
|
| - Source: `specs/FEATURES.json` (F006.verification_evidence) |
| - `tests_run: 68`, `tests_passed: 68`, `verifier_result: approved` |
| - Command recorded: `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v` |
|
|
| --- |
|
|
| ## Manual Verification Checklist |
|
|
| 1. Install notebook runtime (`jupyter`) and training deps (`uv sync --extra training`). |
| 2. Launch notebook: `jupyter notebook notebooks/train_grpo.ipynb`. |
| 3. Run all cells end-to-end. |
| 4. Confirm training completes without runtime errors. |
| 5. Confirm reward/learning curve is rendered. |
| 6. Confirm random vs trained transcript comparison appears and is readable. |
| 7. Confirm model artifacts are written to the configured output directory. |
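
Before starting the checklist above, a short preflight can confirm that the prerequisites from step 1 are in place. This is a minimal sketch, not part of the feature: the `check_tool` and `preflight` names are hypothetical, and it only checks that `uv` and `jupyter` are on `PATH` and that the notebook file exists relative to the current directory.

```bash
# Hypothetical helper: report whether a tool is on PATH; non-zero if missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Hypothetical helper: check the tools and the notebook file used by the checklist.
preflight() {
  rc=0
  check_tool uv || rc=1
  check_tool jupyter || rc=1
  if [ -f notebooks/train_grpo.ipynb ]; then
    echo "ok: notebooks/train_grpo.ipynb"
  else
    echo "missing: notebooks/train_grpo.ipynb (run from the repository root)"
    rc=1
  fi
  return "$rc"
}
```

Running `preflight` from the repository root prints one line per check and returns non-zero if anything is missing. It does not verify that the `training` extra is synced; `uv sync --extra training` still has to be run separately.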
|
|
| --- |
|
|
| ## Edge Cases Exercised |
|
|
| ### Error-path handling (bad model, missing/invalid questions, parse fallback) |
|
|
| ```bash |
| uv run --with pytest pytest tests/unit/test_error_handling.py -v |
| ``` |
|
|
| ``` |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpA8Pzif/bin/python |
| collecting ... collected 6 items |
| |
| tests/unit/test_error_handling.py::test_model_load_error_bad_name PASSED [ 16%] |
| tests/unit/test_error_handling.py::test_question_load_missing_file PASSED [ 33%] |
| tests/unit/test_error_handling.py::test_question_load_empty_file PASSED [ 50%] |
| tests/unit/test_error_handling.py::test_question_load_invalid_json PASSED [ 66%] |
| tests/unit/test_error_handling.py::test_oom_guidance PASSED [ 83%] |
| tests/unit/test_error_handling.py::test_action_parse_fallback_logged PASSED [100%] |
| |
| ============================== 6 passed in 4.68s =============================== |
| ``` |
|
|
| Why this matters: this verifies that the most important failure modes fail loudly rather than silently. |
|
|
| ### Unparseable action recovery in integration flow |
|
|
| ```bash |
| uv run --with pytest pytest tests/integration/test_training_pipeline.py -v |
| ``` |
|
|
| ``` |
| ============================= test session starts ============================== |
| platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpn3aEqJ/bin/python |
| collecting ... collected 2 items |
| |
| tests/integration/test_training_pipeline.py::test_training_pipeline_flow_with_reward_functions PASSED [ 50%] |
| tests/integration/test_training_pipeline.py::test_unparseable_action_recovers_and_episode_continues PASSED [100%] |
| |
| ============================== 2 passed in 3.87s =============================== |
| ``` |
|
|
| Why this matters: malformed model output does not crash the episode loop; training can continue. |
|
|
| ### Verification command mismatch in this environment (`--timeout` flag) |
|
|
| ```bash |
| uv run --with pytest pytest tests/e2e/test_training_e2e.py -v --timeout=300 |
| ``` |
|
|
| ``` |
| ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...] |
| pytest: error: unrecognized arguments: --timeout=300 |
| inifile: /Users/hjerp/Projects/sql-env/pyproject.toml |
| rootdir: /Users/hjerp/Projects/sql-env |
| ``` |
|
|
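| Why this matters: the spec-listed command assumes a timeout plugin (typically `pytest-timeout`, which provides `--timeout`) is installed. It is not available in this environment, so the suite was run without `--timeout` instead. |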
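
The `--timeout` option is normally supplied by the `pytest-timeout` plugin rather than by pytest itself. The sketch below assumes that is the plugin the spec intends and adds it ad hoc with a second `--with`. It was not run in this demo, and the `run_e2e_with_timeout` name is hypothetical.

```bash
# Hypothetical helper: run the E2E suite with a per-test timeout.
# Assumption: the spec's --timeout flag refers to the pytest-timeout plugin.
run_e2e_with_timeout() {
  timeout_s="${1:-300}"   # defaults to the 300 s used in the spec-listed command
  test_file="tests/e2e/test_training_e2e.py"
  # Guard against running from the wrong directory.
  if [ ! -f "$test_file" ]; then
    echo "$test_file not found; run from the repository root" >&2
    return 1
  fi
  uv run --with pytest --with pytest-timeout \
    pytest "$test_file" -v --timeout="$timeout_s"
}
```

`run_e2e_with_timeout` uses the 300-second default, and `run_e2e_with_timeout 60` tightens it. If the project is meant to support the spec command as written, declaring the plugin as a test dependency would be the more durable fix.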
|
|
| --- |
|
|
| ## Test Evidence (Optional) |
|
|
| > Supplementary proof that the feature works correctly across the tested scenarios. |
| > The Live Local Proof section above shows how to use the feature; this section shows it was tested. |
|
|
| | Test Suite | Tests | Status | |
| |---|---|---| |
| | Error handling unit tests | 6 | All passed | |
| | E2E training notebook smoke tests | 5 | All passed | |
| | Integration training pipeline tests | 2 | All passed | |
|
|
| Representative command (run in this demo): |
|
|
| ```bash |
| uv run --with pytest pytest tests/e2e/test_training_e2e.py -v |
| ``` |
|
|
| Result summary: |
|
|
| ``` |
| 5 passed in 3.83s |
| ``` |
|
|
| --- |
|
|
| ## Feature Links |
|
|
| - Implementation spec: `specs/F006-IMPLEMENTATION_SPEC.md` |
| - Verification spec: `specs/F006-VERIFICATION_SPEC.md` |
|
|
| --- |
|
|
| *Demo generated by `feature-demo` agent. Re-run with `/feature-demo F006` to refresh.* |
|
|