| # Verification Specification |
|
|
| **Feature:** F006 |
| **Generated from:** specs/F006-VERIFICATION_INPUT.json |
| **Generated:** 2026-03-27 |
| |
| --- |
| |
| ## 1. Unit Tests |
| |
| ### 1.1 GRPOConfig |
| |
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_grpo_config_defaults | All defaults are populated when only required fields given | `GRPOConfig(questions_path="q.json", db_dir="dbs/", output_dir="out/")` | `max_new_tokens=256, num_train_epochs=1, per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=5e-6, num_generations=4, step_budget=10, difficulty_filter=["easy","medium"], seed=42, logging_steps=10, model_name="Qwen/Qwen3-1.7B"` | happy | |
| | test_grpo_config_custom_values | Custom values override defaults | `GRPOConfig(model_name="gpt2", max_new_tokens=128, ...)` | Fields match custom values | happy | |
| | test_grpo_config_required_fields | Missing required fields raise error | `GRPOConfig()` (no questions_path, db_dir, output_dir) | `TypeError` or validation error | error | |
| | test_grpo_config_negative_batch_size | Zero or negative batch size is rejected | `per_device_train_batch_size=0` or `per_device_train_batch_size=-1` | Validation error or clear failure at training time | edge | |
| | test_grpo_config_negative_learning_rate | Negative learning rate | `learning_rate=-1.0` | Validation error | edge | |
| | test_grpo_config_empty_difficulty_filter | Empty difficulty filter list | `difficulty_filter=[]` | Empty training set or clear error | edge | |
| | test_grpo_config_seed_reproducibility | Two configs built with the same seed and arguments compare equal | Construct the config twice with `seed=42` | Identical configs | happy | |
|
|
| **Run:** `uv run pytest tests/unit/test_grpo_config.py -v` |
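
The defaults and validation cases above can be sketched as a dataclass. This is a minimal illustration, not the project's implementation: the field names and default values come from `test_grpo_config_defaults`, while validating in `__post_init__` (rather than at training time) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class GRPOConfig:
    # Required fields: omitting any of them raises TypeError.
    questions_path: str
    db_dir: str
    output_dir: str
    # Defaults as listed in test_grpo_config_defaults.
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4
    step_budget: int = 10
    difficulty_filter: list[str] = field(
        default_factory=lambda: ["easy", "medium"]
    )
    seed: int = 42
    logging_steps: int = 10

    def __post_init__(self) -> None:
        # Assumed policy: fail at construction time, not at training time.
        if self.per_device_train_batch_size <= 0:
            raise ValueError("per_device_train_batch_size must be positive")
        # learning_rate=0.0 stays legal (see the edge cases checklist).
        if self.learning_rate < 0:
            raise ValueError("learning_rate must be non-negative")
```

Because dataclasses generate `__eq__`, two configs built from the same arguments compare equal, which is what `test_grpo_config_seed_reproducibility` checks.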
|
|
| --- |
|
|
| ### 1.2 get_system_prompt (training/prompts.py) |
|
|
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_system_prompt_returns_string | Function returns non-empty string | (no arguments) | `isinstance(result, str) and len(result) > 0` | happy | |
| | test_system_prompt_mentions_action_types | Prompt documents all four action types | (no arguments) | Result contains "DESCRIBE", "SAMPLE", "QUERY", "ANSWER" | happy | |
| | test_system_prompt_is_deterministic | Multiple calls return identical string | (no arguments) | `get_system_prompt() == get_system_prompt()` | happy | |
| |
| **Run:** `uv run pytest tests/unit/test_prompts.py -v` |
|
|
| --- |
|
|
| ### 1.3 format_observation (training/prompts.py) |
| |
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_format_observation_happy | Formats a normal observation into user-turn string | `SQLObservation(question="Q?", schema_info="tables", result="25", error="", step_count=1, budget_remaining=9, action_history=["QUERY"], done=False, reward=None)` | Non-empty string containing question, result, and budget info | happy | |
| | test_format_observation_with_error | Error field is surfaced in formatted string | `SQLObservation(..., error="syntax error", result="")` | String contains "syntax error" or error indication | happy | |
| | test_format_observation_done_state | Terminal observation is properly formatted | `SQLObservation(..., done=True, reward=1.0)` | String includes reward/done indication | happy | |
| | test_format_observation_empty_result | Empty result is handled gracefully | `SQLObservation(..., result="", error="")` | Returns valid string without crashing | edge | |
| | test_format_observation_long_result | Very long result string | `SQLObservation(..., result="x" * 10000)` | Returns string (may be truncated); no crash | edge | |
|
|
| **Run:** `uv run pytest tests/unit/test_prompts.py -v` |
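
One way `format_observation` could satisfy the cases above is sketched below. The `SQLObservation` stand-in only mirrors the fields shown in `test_format_observation_happy`; the output layout and the `MAX_RESULT_CHARS` limit are hypothetical choices, not the project's.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SQLObservation:
    """Stand-in carrying the fields used in the table above."""
    question: str
    schema_info: str
    result: str = ""
    error: str = ""
    step_count: int = 0
    budget_remaining: int = 0
    action_history: list[str] = field(default_factory=list)
    done: bool = False
    reward: Optional[float] = None


MAX_RESULT_CHARS = 2000  # hypothetical truncation limit


def format_observation(obs: SQLObservation) -> str:
    """Render an observation as the text of a user turn."""
    lines = [f"Question: {obs.question}", f"Schema: {obs.schema_info}"]
    if obs.error:
        lines.append(f"Error: {obs.error}")
    elif obs.result:
        result = obs.result
        if len(result) > MAX_RESULT_CHARS:
            result = result[:MAX_RESULT_CHARS] + "... [truncated]"
        lines.append(f"Result: {result}")
    else:
        lines.append("Result: (empty)")
    lines.append(
        f"Step {obs.step_count}, budget remaining: {obs.budget_remaining}"
    )
    if obs.done:
        lines.append(f"Episode done. Reward: {obs.reward}")
    return "\n".join(lines)
```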
|
|
| --- |
|
|
| ### 1.4 parse_model_output (training/rollout.py) |
|
|
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_parse_describe | Parses DESCRIBE action | `"DESCRIBE employees"` | `SQLAction(action_type="DESCRIBE", argument="employees")` | happy | |
| | test_parse_sample | Parses SAMPLE action | `"SAMPLE departments"` | `SQLAction(action_type="SAMPLE", argument="departments")` | happy | |
| | test_parse_query | Parses QUERY action | `"QUERY SELECT COUNT(*) FROM employees"` | `SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employees")` | happy | |
| | test_parse_answer | Parses ANSWER action | `"ANSWER 42"` | `SQLAction(action_type="ANSWER", argument="42")` | happy | |
| | test_parse_case_insensitive | Case variations accepted | `"describe employees"` or `"Describe employees"` | Valid SQLAction with action_type="DESCRIBE" | edge | |
| | test_parse_with_colon_separator | Colon-separated format | `"QUERY: SELECT 1"` | `SQLAction(action_type="QUERY", argument="SELECT 1")` | edge | |
| | test_parse_garbage_fallback | Unparseable text falls back to QUERY | `"hello world random text"` | `SQLAction(action_type="QUERY", argument="hello world random text")` | error | |
| | test_parse_empty_string_fallback | Empty string falls back to QUERY | `""` | `SQLAction(action_type="QUERY", argument="")` | edge | |
| | test_parse_only_action_no_argument | Action keyword with no argument | `"DESCRIBE"` | Fallback or empty argument handled gracefully | edge | |
| | test_parse_multiline_output | Model output with multiple lines | `"Let me think...\nQUERY SELECT 1"` | Extracts QUERY action or falls back to QUERY with raw text | edge | |
| | test_parse_whitespace_padded | Leading/trailing whitespace | `" ANSWER 42 "` | `SQLAction(action_type="ANSWER", argument="42")` | edge | |
|
|
| **Run:** `uv run pytest tests/unit/test_rollout.py -v` |
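
The parsing rules in the table can be sketched with a single regular expression. This is a minimal illustration under stated assumptions: `SQLAction` is a stand-in dataclass, the first line that starts with an action keyword wins, and only the remainder of that line becomes the argument (so multi-line SQL would be cut short).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Keyword, optional colon, then the rest of the line as the argument.
_ACTION_RE = re.compile(
    r"^\s*(DESCRIBE|SAMPLE|QUERY|ANSWER)\b\s*:?\s*(.*)$",
    re.IGNORECASE,
)


@dataclass
class SQLAction:
    """Stand-in for the project's action type."""
    action_type: str
    argument: str


def parse_model_output(text: Optional[str]) -> SQLAction:
    """Parse model text into an action; fall back to QUERY with the raw text."""
    raw = (text or "").strip()
    for line in raw.splitlines():
        match = _ACTION_RE.match(line)
        if match:
            return SQLAction(match.group(1).upper(), match.group(2).strip())
    # Nothing recognizable: treat the whole output as a query.
    return SQLAction("QUERY", raw)
```

With this sketch a bare `"DESCRIBE"` yields an empty argument rather than a fallback, which is one of the two outcomes `test_parse_only_action_no_argument` allows.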
|
|
| --- |
|
|
| ### 1.5 reward_correctness (training/rewards.py) |
| |
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_correctness_correct_answer | Episode ended with correct answer | Completions with correct=True metadata | `[1.0]` | happy | |
| | test_correctness_wrong_answer | Episode ended with wrong answer | Completions with correct=False metadata | `[0.0]` | happy | |
| | test_correctness_no_answer | Episode timed out without answering | Completions with no answer metadata | `[0.0]` | edge | |
| | test_correctness_batch | Multiple episodes in batch | Mixed correct/wrong | `[1.0, 0.0, 1.0, 0.0]` matching per-episode correctness | happy | |
| | test_correctness_empty_batch | Empty completions list | `[]` | `[]` | edge | |
| | test_correctness_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy | |
|
|
| **Run:** `uv run pytest tests/unit/test_rewards.py -v` |
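
A minimal sketch of `reward_correctness`, assuming each completion is a dict carrying the `"correct"` key described in section 1.8. The `(completions, **kwargs) -> list[float]` shape follows the usual convention for TRL custom reward functions; how the metadata actually reaches the function is not specified here.

```python
from typing import Any


def reward_correctness(
    completions: list[dict[str, Any]], **kwargs: Any
) -> list[float]:
    """Return 1.0 per correct episode and 0.0 otherwise."""
    # A missing "correct" key (for example, a timed-out episode) scores 0.0.
    return [1.0 if c.get("correct") is True else 0.0 for c in completions]
```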
|
|
| --- |
|
|
| ### 1.6 reward_progress (training/rewards.py) |
| |
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_progress_full | Maximum progress (correct answer) | Completions with full progress metadata | Reward in `[0.0, 1.0]`, close to 1.0 | happy | |
| | test_progress_none | No progress toward answer | Completions with zero progress | `[0.0]` | happy | |
| | test_progress_partial | Partial progress | Completions with partial closeness | Reward in `(0.0, 1.0)` exclusive | happy | |
| | test_progress_normalized | Output is always in [0, 1] range | Various inputs | `all(0.0 <= r <= 1.0 for r in result)` | happy | |
| | test_progress_batch | Batch of varied progress | Multiple episodes | List of floats, length matches input | happy | |
| | test_progress_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy | |
|
|
| **Run:** `uv run pytest tests/unit/test_rewards.py -v` |
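
The normalization requirement (`test_progress_normalized`) can be met by clamping. The sketch below assumes each completion dict exposes a numeric `"progress"` value; treating missing, non-numeric, or NaN values as zero progress is an illustrative choice.

```python
import math
from typing import Any


def reward_progress(
    completions: list[dict[str, Any]], **kwargs: Any
) -> list[float]:
    """Return per-episode progress clamped to the range [0.0, 1.0]."""
    rewards = []
    for completion in completions:
        try:
            value = float(completion.get("progress", 0.0))
        except (TypeError, ValueError):
            value = 0.0  # assumed: unusable metadata counts as no progress
        if math.isnan(value):
            value = 0.0
        rewards.append(min(1.0, max(0.0, value)))
    return rewards
```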
|
|
| --- |
|
|
| ### 1.7 reward_operational (training/rewards.py) |
| |
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_operational_good_episode | All steps execute OK, discover new info, no repeats | Completions with exec_ok=True, new_info=True per step | Positive reward | happy | |
| | test_operational_all_errors | Every step has execution errors | Completions with exec_ok=False per step | Low or negative reward, strictly below the reward in test_operational_good_episode | error | |
| | test_operational_repeat_penalty | Episode with repeated identical actions | Completions with repeat=True per step | Lower reward than non-repeating | happy | |
| | test_operational_mixed_signals | Mix of good and bad steps | Varied step signals | Reward between extremes | happy | |
| | test_operational_single_step | Episode with only one step | Single step completions | Valid float returned | edge | |
| | test_operational_batch | Multiple episodes | Batch input | List of floats, length matches | happy | |
| | test_operational_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy | |
|
|
| **Run:** `uv run pytest tests/unit/test_rewards.py -v` |
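
The ordering these tests expect (good episode > mixed > all errors, and repeats scoring lower) can be illustrated by averaging a per-step score. Everything numeric here is hypothetical: the weights, the `"operational"` key holding a list of per-step signal dicts, and the 0.0 score for an episode with no steps.

```python
from typing import Any

# Hypothetical weights; the spec only fixes the relative ordering.
EXEC_OK_BONUS = 0.5
ERROR_PENALTY = 0.5
NEW_INFO_BONUS = 0.5
REPEAT_PENALTY = 0.5


def _step_score(step: dict[str, Any]) -> float:
    score = EXEC_OK_BONUS if step.get("exec_ok") else -ERROR_PENALTY
    if step.get("new_info"):
        score += NEW_INFO_BONUS
    if step.get("repeat"):
        score -= REPEAT_PENALTY
    return score


def reward_operational(
    completions: list[dict[str, Any]], **kwargs: Any
) -> list[float]:
    """Average the per-step operational score of each episode."""
    rewards = []
    for completion in completions:
        steps = completion.get("operational") or []
        if not steps:
            rewards.append(0.0)  # assumed neutral score for empty episodes
            continue
        total = sum(_step_score(step) for step in steps)
        rewards.append(float(total / len(steps)))
    return rewards
```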
|
|
| --- |
|
|
| ### 1.8 rollout_func (training/rollout.py) |
| |
| | Test | Description | Input | Expected | Category | |
| |------|-------------|-------|----------|----------| |
| | test_rollout_returns_completions | Returns list of dicts with expected keys | Single prompt, mock model/tokenizer | List of dicts with "content" and metadata keys | happy | |
| | test_rollout_batch_size | Output length matches input prompt count | N prompts | N completions returned | happy | |
| | test_rollout_episode_terminates | Episodes terminate within step_budget | Config with step_budget=5 | All episodes have <= 5 steps | happy | |
| | test_rollout_metadata_present | Completions include correctness, progress, operational metadata | Any valid input | Each completion dict has "correct", "progress", "operational" keys | happy | |
| | test_rollout_unparseable_action | Model generates gibberish, fallback fires | Mock model returning garbage tokens | Episode continues; no crash | error | |
| | test_rollout_truncation | Long history is truncated to the system prompt + last 3 observation-action pairs | Mock model, config with step_budget=20 | Context does not exceed token window | edge | |
| |
| **Run:** `uv run pytest tests/unit/test_rollout.py -v` |
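
The truncation rule checked by `test_rollout_truncation` can be sketched over a chat-style message list. This assumes a hypothetical layout: an optional leading `system` message followed by strictly alternating `user` (observation) and `assistant` (action) turns. Keeping a trailing observation that has no action yet is an added assumption.

```python
from typing import Any

Message = dict[str, Any]


def truncate_history(messages: list[Message], keep_pairs: int = 3) -> list[Message]:
    """Keep the system prompt plus the last `keep_pairs` observation-action pairs."""
    has_system = bool(messages) and messages[0]["role"] == "system"
    system = messages[:1] if has_system else []
    turns = messages[len(system):]
    # An odd number of turns means the last observation awaits an action.
    pending = turns[-1:] if len(turns) % 2 == 1 else []
    paired = turns[: len(turns) - len(pending)]
    # Guard keep_pairs == 0: paired[-0:] would return the whole list.
    kept = paired[-2 * keep_pairs:] if keep_pairs > 0 else []
    return system + kept + pending
```

Histories shorter than the limit pass through unchanged, so the function can be called before every generation step.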
|
|
| --- |
|
|
| ## 2. Integration Tests |
|
|
| ### Flow: End-to-End Training Episode |
|
|
| | Step | Action | Expected | Verification | |
| |------|--------|----------|--------------| |
| | 1 | Create GRPOConfig with test questions and mock DB | Config object created | Config fields match inputs | |
| | 2 | Load questions and filter by difficulty | Only easy+medium questions included | Assert filtered count < total if hard questions exist | |
| | 3 | Call rollout_func with a real SQLEnvironment and mock model | Completions returned with metadata | Each completion has "content" key | |
| | 4 | Pass completions to reward_correctness | Returns list[float] of 0.0/1.0 | Length matches batch size | |
| | 5 | Pass completions to reward_progress | Returns list[float] in [0,1] | Length matches batch size | |
| | 6 | Pass completions to reward_operational | Returns list[float] | Length matches batch size | |
|
|
| **Run:** `uv run pytest tests/integration/test_training_pipeline.py -v` |
|
|
| --- |
|
|
| ### Flow: Unparseable Action Recovery |
|
|
| | Step | Action | Expected | Verification | |
| |------|--------|----------|--------------| |
| | 1 | Mock model generates unparseable text | parse_model_output returns QUERY fallback | action_type == "QUERY", argument == raw text | |
| | 2 | SQLEnvironment.step receives fallback action | Returns error observation | observation.error is non-empty | |
| | 3 | Episode continues with next step | Step count increments, budget decreases | step_count > previous, budget_remaining < previous | |
| |
| **Run:** `uv run pytest tests/integration/test_training_pipeline.py -v` |
| |
| --- |
| |
| ### Flow: History Truncation |
| |
| | Step | Action | Expected | Verification | |
| |------|--------|----------|--------------| |
| | 1 | Run rollout with step_budget large enough to exceed token window | Truncation is triggered | History contains system prompt + last 3 observation-action pairs only | |
| | 2 | Episode completes normally after truncation | No crash; completions returned | Valid completion dicts in output | |
|
|
| **Run:** `uv run pytest tests/integration/test_training_pipeline.py -v` |
|
|
| --- |
|
|
| ## 3. API Tests |
|
|
| No API endpoints defined for F006. All interfaces are Python function calls. |
|
|
| --- |
|
|
| ## 4. E2E Tests |
|
|
| ### Scenario: Training Notebook Smoke Test |
|
|
| **Setup:** Test questions JSON with 2 easy questions, test SQLite database, tiny model (or mock). |
| **Actions:** |
| 1. Instantiate GRPOConfig with test paths and minimal hyperparameters (`num_train_epochs=1`, `per_device_train_batch_size=1`, `num_generations=2`). |
| 2. Load model and tokenizer (use smallest available model or mock). |
| 3. Create GRPOTrainer with reward functions. |
| 4. Run trainer.train() for a single step. |
| 5. Verify learning curve data is logged. |
| 6. Run comparison episodes (before/after). |
|
|
| **Expected:** |
| - Training completes without error. |
| - At least one metric is logged (e.g., loss or reward). |
| - Comparison episodes produce valid SQLObservation sequences. |
|
|
| **Run:** `uv run pytest tests/e2e/test_training_e2e.py -v --timeout=300` (the `--timeout` flag requires the `pytest-timeout` plugin) |
|
|
| --- |
|
|
| ### Scenario: Question Filtering by Difficulty |
|
|
| **Setup:** Questions file with easy, medium, and hard questions. |
| **Actions:** |
| 1. Create GRPOConfig with `difficulty_filter=["easy"]`. |
| 2. Load and filter questions. |
|
|
| **Expected:** Only easy questions are included in training set. |
|
|
| **Run:** `uv run pytest tests/e2e/test_training_e2e.py -v` |
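
The filtering step in this scenario might look like the sketch below. The `"difficulty"` key on each question dict and the case-insensitive match are assumptions; the spec only states that non-matching questions are excluded.

```python
from typing import Any


def filter_questions(
    questions: list[dict[str, Any]], difficulty_filter: list[str]
) -> list[dict[str, Any]]:
    """Keep only questions whose difficulty is in `difficulty_filter`."""
    allowed = {level.lower() for level in difficulty_filter}
    return [
        q for q in questions
        if str(q.get("difficulty", "")).lower() in allowed
    ]
```

An empty `difficulty_filter`, or a file containing only hard questions with `difficulty_filter=["easy"]`, yields an empty list; whether that should raise is left to the caller.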
|
|
| --- |
|
|
| ## 5. Error Handling Tests |
|
|
| ### ModelLoadError |
|
|
| | Test | Description | Trigger | Expected | |
| |------|-------------|---------|----------| |
| | test_model_load_error_bad_name | Invalid HuggingFace model name | `GRPOConfig(model_name="nonexistent/model-xyz-999")` | Fails fast; error message contains "nonexistent/model-xyz-999" | |
|
|
| ### ActionParseError (handled via fallback) |
|
|
| | Test | Description | Trigger | Expected | |
| |------|-------------|---------|----------| |
| | test_action_parse_fallback_logged | Unparseable action triggers warning log | Model outputs `"¯\_(ツ)_/¯"` | Warning logged; returns QUERY fallback | |
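
The log assertion can be made with the standard library alone. In this sketch the logger name `training.rollout`, the simplified `parse_with_fallback` helper, and the `ListHandler` class are all hypothetical; under pytest the `caplog` fixture would play the role of the handler.

```python
import logging

logger = logging.getLogger("training.rollout")  # hypothetical logger name

ACTION_TYPES = ("DESCRIBE", "SAMPLE", "QUERY", "ANSWER")


def parse_with_fallback(text: str) -> tuple[str, str]:
    """Return (action_type, argument); warn and fall back to QUERY."""
    keyword, _, rest = text.strip().partition(" ")
    if keyword.upper() in ACTION_TYPES:
        return keyword.upper(), rest.strip()
    logger.warning("unparseable action, falling back to QUERY: %r", text)
    return "QUERY", text.strip()


class ListHandler(logging.Handler):
    """Collects log messages so a test can assert on them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())
```

A test attaches a `ListHandler` to the logger, calls the parser with unparseable text, and checks both the returned fallback and the captured warning.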
|
|
| ### QuestionLoadError |
|
|
| | Test | Description | Trigger | Expected | |
| |------|-------------|---------|----------| |
| | test_question_load_missing_file | Questions path does not exist | `GRPOConfig(questions_path="/nonexistent/q.json")` | Fails fast; error message contains the path | |
| | test_question_load_empty_file | Questions file is empty JSON array | `questions.json` containing `[]` | Fails fast; clear error about empty questions | |
| | test_question_load_invalid_json | Questions file has invalid JSON | `questions.json` containing `{broken` | Fails fast; JSON parse error | |
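
A fail-fast loader covering the three cases above could be sketched as follows. The `QuestionLoadError` class shape, the `load_questions` name, and the message wording are assumptions; only the behaviors (path in the message, rejection of empty and invalid files) come from the table.

```python
import json
from pathlib import Path
from typing import Any


class QuestionLoadError(Exception):
    """Raised when the questions file cannot be used for training."""


def load_questions(path: str) -> list[dict[str, Any]]:
    """Load questions from a JSON array, failing fast on bad input."""
    file_path = Path(path)
    if not file_path.is_file():
        raise QuestionLoadError(f"Questions file not found: {file_path}")
    try:
        data = json.loads(file_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise QuestionLoadError(f"Invalid JSON in {file_path}: {exc}") from exc
    if not isinstance(data, list) or not data:
        raise QuestionLoadError(f"No questions found in {file_path}")
    return data
```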
|
|
| ### OOMError |
|
|
| | Test | Description | Trigger | Expected | |
| |------|-------------|---------|----------| |
| | test_oom_guidance | OOM during training prints guidance | (Cannot reliably trigger in test; verify message formatting only) | Error handler message mentions reducing `per_device_train_batch_size` or `num_generations` | |
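
Since the OOM condition itself cannot be triggered reliably, the test can target a pure formatting helper. Both function names below are hypothetical, and detecting OOM by looking for "out of memory" in the exception text is an assumption about how the handler recognizes the error.

```python
def is_oom_error(exc: BaseException) -> bool:
    """Heuristic: treat any error whose text mentions 'out of memory' as OOM."""
    return "out of memory" in str(exc).lower()


def format_oom_guidance(
    per_device_train_batch_size: int, num_generations: int
) -> str:
    """Build the guidance message shown when training runs out of memory."""
    return (
        "Out of memory during training. Try reducing "
        f"per_device_train_batch_size (currently {per_device_train_batch_size}) "
        f"or num_generations (currently {num_generations})."
    )
```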
|
|
| **Run:** `uv run pytest tests/unit/test_error_handling.py -v` |
|
|
| --- |
|
|
| ## 6. Edge Cases Checklist |
|
|
| - [ ] Null/None inputs to parse_model_output |
| - [ ] Empty string inputs to parse_model_output |
| - [ ] Empty completions list to all reward functions |
| - [ ] Single-element completions list to all reward functions |
| - [ ] Very large batch (100+ prompts) to rollout_func |
| - [ ] Questions file with only hard questions and difficulty_filter=["easy"] (zero matches) |
| - [ ] step_budget=1 (immediate budget exhaustion after one action) |
| - [ ] step_budget=0 (zero budget) |
| - [ ] Unicode characters in model output (e.g., CJK, emoji) |
| - [ ] Model output exceeding max_new_tokens |
| - [ ] learning_rate=0.0 (no weight updates) |
| - [ ] num_generations=1 (single completion per group, so group-relative advantages are all zero; expect a validation error or a no-op update) |
| - [ ] Concurrent calls to reward functions (thread safety) |
| - [ ] Database with no tables (empty schema) |
| - [ ] Database with very large tables (performance) |
|
|
| --- |
|
|
| ## 7. Evidence Requirements |
|
|
| | Category | Evidence Type | Example | |
| |----------|---------------|---------| |
| | Unit tests | pytest output | `X passed` | |
| | Integration | pytest output | `X passed` | |
| | Error handling | pytest output | `X passed` | |
| | E2E | pytest output + training metrics | `1 passed, loss=X.XX` | |
| | Reward functions | pytest output showing correct values | `reward_correctness: [1.0, 0.0]` | |
| | Parse fallback | pytest output + log capture | `WARNING: unparseable action, falling back to QUERY` | |
|
|