sql_env / specs /F006-VERIFICATION_SPEC.md
hjerpe's picture
Upload folder using huggingface_hub
5dd1bb4 verified
|
Raw
History Blame Contribute Delete
15.8 kB
# Verification Specification
**Feature:** F006
**Generated from:** specs/F006-VERIFICATION_INPUT.json
**Generated:** 2026-03-27
---
## 1. Unit Tests
### 1.1 GRPOConfig
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_grpo_config_defaults | All defaults are populated when only required fields given | `GRPOConfig(questions_path="q.json", db_dir="dbs/", output_dir="out/")` | `max_new_tokens=256, num_train_epochs=1, per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=5e-6, num_generations=4, step_budget=10, difficulty_filter=["easy","medium"], seed=42, logging_steps=10, model_name="Qwen/Qwen3-1.7B"` | happy |
| test_grpo_config_custom_values | Custom values override defaults | `GRPOConfig(model_name="gpt2", max_new_tokens=128, ...)` | Fields match custom values | happy |
| test_grpo_config_required_fields | Missing required fields raise error | `GRPOConfig()` (no questions_path, db_dir, output_dir) | `TypeError` or validation error | error |
| test_grpo_config_negative_batch_size | Negative or zero batch size | `per_device_train_batch_size=0` | Validation error or clear failure at training time | edge |
| test_grpo_config_negative_learning_rate | Negative learning rate | `learning_rate=-1.0` | Validation error | edge |
| test_grpo_config_empty_difficulty_filter | Empty difficulty filter list | `difficulty_filter=[]` | Empty training set or clear error | edge |
| test_grpo_config_seed_reproducibility | Same seed produces same config state | `seed=42` twice | Identical configs | happy |
**Run:** `uv run pytest tests/unit/test_grpo_config.py -v`
---
### 1.2 get_system_prompt (training/prompts.py)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_system_prompt_returns_string | Function returns non-empty string | None | `isinstance(result, str) and len(result) > 0` | happy |
| test_system_prompt_mentions_action_types | Prompt documents all four action types | None | Result contains "DESCRIBE", "SAMPLE", "QUERY", "ANSWER" | happy |
| test_system_prompt_is_deterministic | Multiple calls return identical string | None | `get_system_prompt() == get_system_prompt()` | happy |
**Run:** `uv run pytest tests/unit/test_prompts.py -v`
---
### 1.3 format_observation (training/prompts.py)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_format_observation_happy | Formats a normal observation into user-turn string | `SQLObservation(question="Q?", schema_info="tables", result="25", error="", step_count=1, budget_remaining=9, action_history=["QUERY"], done=False, reward=None)` | Non-empty string containing question, result, and budget info | happy |
| test_format_observation_with_error | Error field is surfaced in formatted string | `SQLObservation(..., error="syntax error", result="")` | String contains "syntax error" or error indication | happy |
| test_format_observation_done_state | Terminal observation is properly formatted | `SQLObservation(..., done=True, reward=1.0)` | String includes reward/done indication | happy |
| test_format_observation_empty_result | Empty result is handled gracefully | `SQLObservation(..., result="", error="")` | Returns valid string without crashing | edge |
| test_format_observation_long_result | Very long result string | `SQLObservation(..., result="x" * 10000)` | Returns string (may be truncated); no crash | edge |
**Run:** `uv run pytest tests/unit/test_prompts.py -v`
---
### 1.4 parse_model_output (training/rollout.py)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_parse_describe | Parses DESCRIBE action | `"DESCRIBE employees"` | `SQLAction(action_type="DESCRIBE", argument="employees")` | happy |
| test_parse_sample | Parses SAMPLE action | `"SAMPLE departments"` | `SQLAction(action_type="SAMPLE", argument="departments")` | happy |
| test_parse_query | Parses QUERY action | `"QUERY SELECT COUNT(*) FROM employees"` | `SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employees")` | happy |
| test_parse_answer | Parses ANSWER action | `"ANSWER 42"` | `SQLAction(action_type="ANSWER", argument="42")` | happy |
| test_parse_case_insensitive | Case variations accepted | `"describe employees"` or `"Describe employees"` | Valid SQLAction with action_type="DESCRIBE" | edge |
| test_parse_with_colon_separator | Colon-separated format | `"QUERY: SELECT 1"` | `SQLAction(action_type="QUERY", argument="SELECT 1")` | edge |
| test_parse_garbage_fallback | Unparseable text falls back to QUERY | `"hello world random text"` | `SQLAction(action_type="QUERY", argument="hello world random text")` | error |
| test_parse_empty_string_fallback | Empty string falls back to QUERY | `""` | `SQLAction(action_type="QUERY", argument="")` | edge |
| test_parse_only_action_no_argument | Action keyword with no argument | `"DESCRIBE"` | Fallback or empty argument handled gracefully | edge |
| test_parse_multiline_output | Model output with multiple lines | `"Let me think...\nQUERY SELECT 1"` | Extracts QUERY action or falls back to QUERY with raw text | edge |
| test_parse_whitespace_padded | Leading/trailing whitespace | `" ANSWER 42 "` | `SQLAction(action_type="ANSWER", argument="42")` | edge |
**Run:** `uv run pytest tests/unit/test_rollout.py -v`
---
### 1.5 reward_correctness (training/rewards.py)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_correctness_correct_answer | Episode ended with correct answer | Completions with correct=True metadata | `[1.0]` | happy |
| test_correctness_wrong_answer | Episode ended with wrong answer | Completions with correct=False metadata | `[0.0]` | happy |
| test_correctness_no_answer | Episode timed out without answering | Completions with no answer metadata | `[0.0]` | edge |
| test_correctness_batch | Multiple episodes in batch | Mixed correct/wrong | `[1.0, 0.0, 1.0, 0.0]` matching per-episode correctness | happy |
| test_correctness_empty_batch | Empty completions list | `[]` | `[]` | edge |
| test_correctness_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy |
**Run:** `uv run pytest tests/unit/test_rewards.py -v`
---
### 1.6 reward_progress (training/rewards.py)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_progress_full | Maximum progress (correct answer) | Completions with full progress metadata | Reward in `[0.0, 1.0]`, close to 1.0 | happy |
| test_progress_none | No progress toward answer | Completions with zero progress | `[0.0]` | happy |
| test_progress_partial | Partial progress | Completions with partial closeness | Reward in `(0.0, 1.0)` exclusive | happy |
| test_progress_normalized | Output is always in [0, 1] range | Various inputs | `all(0.0 <= r <= 1.0 for r in result)` | happy |
| test_progress_batch | Batch of varied progress | Multiple episodes | List of floats, length matches input | happy |
| test_progress_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy |
**Run:** `uv run pytest tests/unit/test_rewards.py -v`
---
### 1.7 reward_operational (training/rewards.py)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_operational_good_episode | All steps execute OK, discover new info, no repeats | Completions with exec_ok=True, new_info=True per step | Positive reward | happy |
| test_operational_all_errors | Every step has execution errors | Completions with exec_ok=False per step | Low/negative reward | error |
| test_operational_repeat_penalty | Episode with repeated identical actions | Completions with repeat=True per step | Lower reward than non-repeating | happy |
| test_operational_mixed_signals | Mix of good and bad steps | Varied step signals | Reward between extremes | happy |
| test_operational_single_step | Episode with only one step | Single step completions | Valid float returned | edge |
| test_operational_batch | Multiple episodes | Batch input | List of floats, length matches | happy |
| test_operational_trl_compatible | Return type is list[float] | Any valid input | `all(isinstance(r, float) for r in result)` | happy |
**Run:** `uv run pytest tests/unit/test_rewards.py -v`
---
### 1.8 rollout_func (training/rollout.py)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_rollout_returns_completions | Returns list of dicts with expected keys | Single prompt, mock model/tokenizer | List of dicts with "content" and metadata keys | happy |
| test_rollout_batch_size | Output length matches input prompt count | N prompts | N completions returned | happy |
| test_rollout_episode_terminates | Episodes terminate within step_budget | Config with step_budget=5 | All episodes have <= 5 steps | happy |
| test_rollout_metadata_present | Completions include correctness, progress, operational metadata | Any valid input | Each completion dict has "correct", "progress", "operational" keys | happy |
| test_rollout_unparseable_action | Model generates gibberish, fallback fires | Mock model returning garbage tokens | Episode continues; no crash | error |
| test_rollout_truncation | Long history is truncated to system + last 3 pairs | Mock model, config with step_budget=20 | Context does not exceed token window | edge |
**Run:** `uv run pytest tests/unit/test_rollout.py -v`
---
## 2. Integration Tests
### Flow: End-to-End Training Episode
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Create GRPOConfig with test questions and mock DB | Config object created | Config fields match inputs |
| 2 | Load questions and filter by difficulty | Only easy+medium questions included | Assert filtered count < total if hard questions exist |
| 3 | Call rollout_func with a real SQLEnvironment and mock model | Completions returned with metadata | Each completion has "content" key |
| 4 | Pass completions to reward_correctness | Returns list[float] of 0.0/1.0 | Length matches batch size |
| 5 | Pass completions to reward_progress | Returns list[float] in [0,1] | Length matches batch size |
| 6 | Pass completions to reward_operational | Returns list[float] | Length matches batch size |
**Run:** `uv run pytest tests/integration/test_training_pipeline.py -v`
---
### Flow: Unparseable Action Recovery
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Mock model generates unparseable text | parse_model_output returns QUERY fallback | action_type == "QUERY", argument == raw text |
| 2 | SQLEnvironment.step receives fallback action | Returns error observation | observation.error is non-empty |
| 3 | Episode continues with next step | Step count increments, budget decreases | step_count > previous, budget_remaining < previous |
**Run:** `uv run pytest tests/integration/test_training_pipeline.py -v`
---
### Flow: History Truncation
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Run rollout with step_budget large enough to exceed token window | Truncation is triggered | History contains system prompt + last 3 observation-action pairs only |
| 2 | Episode completes normally after truncation | No crash; completions returned | Valid completion dicts in output |
**Run:** `uv run pytest tests/integration/test_training_pipeline.py -v`
---
## 3. API Tests
No API endpoints defined for F006. All interfaces are Python function calls.
---
## 4. E2E Tests
### Scenario: Training Notebook Smoke Test
**Setup:** Test questions JSON with 2 easy questions, test SQLite database, tiny model (or mock).
**Actions:**
1. Instantiate GRPOConfig with test paths and minimal hyperparameters (1 epoch, batch_size=1, num_generations=2).
2. Load model and tokenizer (use smallest available model or mock).
3. Create GRPOTrainer with reward functions.
4. Run trainer.train() for a single step.
5. Verify learning curve data is logged.
6. Run comparison episodes (before/after).
**Expected:**
- Training completes without error.
- At least one metric is logged (loss, reward).
- Comparison episodes produce valid SQLObservation sequences.
**Run:** `uv run pytest tests/e2e/test_training_e2e.py -v --timeout=300`
---
### Scenario: Question Filtering by Difficulty
**Setup:** Questions file with easy, medium, and hard questions.
**Actions:**
1. Create GRPOConfig with `difficulty_filter=["easy"]`.
2. Load and filter questions.
**Expected:** Only easy questions are included in training set.
**Run:** `uv run pytest tests/e2e/test_training_e2e.py -v`
---
## 5. Error Handling Tests
### ModelLoadError
| Test | Description | Trigger | Expected |
|------|-------------|---------|----------|
| test_model_load_error_bad_name | Invalid HuggingFace model name | `GRPOConfig(model_name="nonexistent/model-xyz-999")` | Fails fast; error message contains "nonexistent/model-xyz-999" |
### ActionParseError (handled via fallback)
| Test | Description | Trigger | Expected |
|------|-------------|---------|----------|
| test_action_parse_fallback_logged | Unparseable action triggers warning log | Model outputs `"¯\_(ツ)_/¯"` | Warning logged; returns QUERY fallback |
### QuestionLoadError
| Test | Description | Trigger | Expected |
|------|-------------|---------|----------|
| test_question_load_missing_file | Questions path does not exist | `GRPOConfig(questions_path="/nonexistent/q.json")` | Fails fast; error message contains the path |
| test_question_load_empty_file | Questions file is empty JSON array | `questions.json` containing `[]` | Fails fast; clear error about empty questions |
| test_question_load_invalid_json | Questions file has invalid JSON | `questions.json` containing `{broken` | Fails fast; JSON parse error |
### OOMError
| Test | Description | Trigger | Expected |
|------|-------------|---------|----------|
| test_oom_guidance | OOM during training prints guidance | (Cannot reliably trigger in test; verify message formatting only) | Error handler message mentions reducing batch_size or num_generations |
**Run:** `uv run pytest tests/unit/test_error_handling.py -v`
---
## 6. Edge Cases Checklist
- [ ] Null/None inputs to parse_model_output
- [ ] Empty string inputs to parse_model_output
- [ ] Empty completions list to all reward functions
- [ ] Single-element completions list to all reward functions
- [ ] Very large batch (100+ prompts) to rollout_func
- [ ] Questions file with only hard questions and difficulty_filter=["easy"] (zero matches)
- [ ] step_budget=1 (immediate budget exhaustion after one action)
- [ ] step_budget=0 (zero budget)
- [ ] Unicode characters in model output (e.g., CJK, emoji)
- [ ] Model output exceeding max_new_tokens
- [ ] learning_rate=0.0 (no weight updates)
- [ ] num_generations=1 (minimum GRPO completions)
- [ ] Concurrent calls to reward functions (thread safety)
- [ ] Database with no tables (empty schema)
- [ ] Database with very large tables (performance)
---
## 7. Evidence Requirements
| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed` |
| Integration | pytest output | `X passed` |
| Error handling | pytest output | `X passed` |
| E2E | pytest output + training metrics | `1 passed, loss=X.XX` |
| Reward functions | pytest output showing correct values | `reward_correctness: [1.0, 0.0]` |
| Parse fallback | pytest output + log capture | `WARNING: unparseable action, falling back to QUERY` |