	# Verification Specification

	Feature: F006
	Generated from: specs/F006-VERIFICATION_INPUT.json
	Generated: 2026-03-27

	---

	## 1. Unit Tests

	### 1.1 GRPOConfig

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_grpo_config_defaults \| All defaults are populated when only required fields given \| `GRPOConfig(questions_path="q.json", db_dir="dbs/", output_dir="out/")` \| `max_new_tokens=256, num_train_epochs=1, per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=5e-6, num_generations=4, step_budget=10, difficulty_filter=["easy","medium"], seed=42, logging_steps=10, model_name="Qwen/Qwen3-1.7B"` \| happy \|
	\| test_grpo_config_custom_values \| Custom values override defaults \| `GRPOConfig(model_name="gpt2", max_new_tokens=128, ...)` \| Fields match custom values \| happy \|
	\| test_grpo_config_required_fields \| Missing required fields raise error \| `GRPOConfig()` (no questions_path, db_dir, output_dir) \| `TypeError` or validation error \| error \|
	\| test_grpo_config_negative_batch_size \| Negative or zero batch size \| `per_device_train_batch_size=0` \| Validation error or clear failure at training time \| edge \|
	\| test_grpo_config_negative_learning_rate \| Negative learning rate \| `learning_rate=-1.0` \| Validation error \| edge \|
	\| test_grpo_config_empty_difficulty_filter \| Empty difficulty filter list \| `difficulty_filter=[]` \| Empty training set or clear error \| edge \|
	\| test_grpo_config_seed_reproducibility \| Two configs built with the same seed are identical \| `seed=42` twice \| Identical configs \| happy \|

	Run: `uv run pytest tests/unit/test_grpo_config.py -v`
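
The defaults and validation rows above could be satisfied by a plain dataclass. The following is a minimal sketch, not the project's implementation: the use of `dataclass`, the `__post_init__` checks, and the choice of `ValueError` are all assumptions, and the field defaults are copied from the `test_grpo_config_defaults` row.

```python
from dataclasses import dataclass, field


@dataclass
class GRPOConfig:
    """Hypothetical sketch of the F006 config; the real class may differ."""

    # Required fields: they have no default, so omitting them raises TypeError.
    questions_path: str
    db_dir: str
    output_dir: str

    # Defaults copied from the test_grpo_config_defaults row.
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4
    step_budget: int = 10
    difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])
    seed: int = 42
    logging_steps: int = 10

    def __post_init__(self) -> None:
        # Assumed validation policy: reject the values flagged by the edge rows.
        if self.per_device_train_batch_size <= 0:
            raise ValueError("per_device_train_batch_size must be positive")
        if self.learning_rate < 0:
            raise ValueError("learning_rate must be non-negative")
```

Validating in `__post_init__` makes bad values fail at construction rather than at training time; the spec allows either behaviour for batch size, so this is one of two acceptable designs. A learning rate of exactly `0.0` is allowed here because the edge cases checklist treats it as a valid (no-update) setting.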

	---

	### 1.2 get_system_prompt (training/prompts.py)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_system_prompt_returns_string \| Function returns non-empty string \| None \| `isinstance(result, str) and len(result) > 0` \| happy \|
	\| test_system_prompt_mentions_action_types \| Prompt documents all four action types \| None \| Result contains "DESCRIBE", "SAMPLE", "QUERY", "ANSWER" \| happy \|
	\| test_system_prompt_is_deterministic \| Multiple calls return identical string \| None \| `get_system_prompt() == get_system_prompt()` \| happy \|

	Run: `uv run pytest tests/unit/test_prompts.py -v`

	---

	### 1.3 format_observation (training/prompts.py)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_format_observation_happy \| Formats a normal observation into user-turn string \| `SQLObservation(question="Q?", schema_info="tables", result="25", error="", step_count=1, budget_remaining=9, action_history=["QUERY"], done=False, reward=None)` \| Non-empty string containing question, result, and budget info \| happy \|
	\| test_format_observation_with_error \| Error field is surfaced in formatted string \| `SQLObservation(..., error="syntax error", result="")` \| String contains "syntax error" or error indication \| happy \|
	\| test_format_observation_done_state \| Terminal observation is properly formatted \| `SQLObservation(..., done=True, reward=1.0)` \| String includes reward/done indication \| happy \|
	\| test_format_observation_empty_result \| Empty result is handled gracefully \| `SQLObservation(..., result="", error="")` \| Returns valid string without crashing \| edge \|
	\| test_format_observation_long_result \| Very long result string \| `SQLObservation(..., result="x" * 10000)` \| Returns string (may be truncated); no crash \| edge \|

	Run: `uv run pytest tests/unit/test_prompts.py -v`
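
To make the expectations above concrete, here is one possible shape for `format_observation`. It is a sketch under assumptions: the `SQLObservation` stand-in only mirrors the fields listed in the `test_format_observation_happy` row, and the output layout and the `MAX_RESULT_CHARS` cap are hypothetical choices, not values from the project.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_RESULT_CHARS = 2000  # assumed cap; the spec only says results "may be truncated"


@dataclass
class SQLObservation:
    """Stand-in for the project's SQLObservation (fields taken from the table)."""

    question: str
    schema_info: str = ""
    result: str = ""
    error: str = ""
    step_count: int = 0
    budget_remaining: int = 0
    action_history: list[str] = field(default_factory=list)
    done: bool = False
    reward: Optional[float] = None


def format_observation(obs: SQLObservation) -> str:
    """Render an observation as a user-turn string."""
    result = obs.result
    if len(result) > MAX_RESULT_CHARS:
        result = result[:MAX_RESULT_CHARS] + " ...[truncated]"
    lines = [
        f"Question: {obs.question}",
        f"Schema: {obs.schema_info}",
        f"Result: {result or '(empty)'}",
    ]
    if obs.error:
        # Surface execution errors so the model can react to them.
        lines.append(f"Error: {obs.error}")
    lines.append(f"Step {obs.step_count}, budget remaining: {obs.budget_remaining}")
    if obs.done:
        lines.append(f"Episode done, reward: {obs.reward}")
    return "\n".join(lines)
```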

	---

	### 1.4 parse_model_output (training/rollout.py)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_parse_describe \| Parses DESCRIBE action \| `"DESCRIBE employees"` \| `SQLAction(action_type="DESCRIBE", argument="employees")` \| happy \|
	\| test_parse_sample \| Parses SAMPLE action \| `"SAMPLE departments"` \| `SQLAction(action_type="SAMPLE", argument="departments")` \| happy \|
	\| test_parse_query \| Parses QUERY action \| `"QUERY SELECT COUNT(*) FROM employees"` \| `SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employees")` \| happy \|
	\| test_parse_answer \| Parses ANSWER action \| `"ANSWER 42"` \| `SQLAction(action_type="ANSWER", argument="42")` \| happy \|
	\| test_parse_case_insensitive \| Case variations accepted \| `"describe employees"` or `"Describe employees"` \| Valid SQLAction with action_type="DESCRIBE" \| edge \|
	\| test_parse_with_colon_separator \| Colon-separated format \| `"QUERY: SELECT 1"` \| `SQLAction(action_type="QUERY", argument="SELECT 1")` \| edge \|
	\| test_parse_garbage_fallback \| Unparseable text falls back to QUERY \| `"hello world random text"` \| `SQLAction(action_type="QUERY", argument="hello world random text")` \| error \|
	\| test_parse_empty_string_fallback \| Empty string falls back to QUERY \| `""` \| `SQLAction(action_type="QUERY", argument="")` \| edge \|
	\| test_parse_only_action_no_argument \| Action keyword with no argument \| `"DESCRIBE"` \| Fallback or empty argument handled gracefully \| edge \|
	\| test_parse_multiline_output \| Model output with multiple lines \| `"Let me think...\nQUERY SELECT 1"` \| Extracts QUERY action or falls back to QUERY with raw text \| edge \|
	\| test_parse_whitespace_padded \| Leading/trailing whitespace \| `" ANSWER 42 "` \| `SQLAction(action_type="ANSWER", argument="42")` \| edge \|

	Run: `uv run pytest tests/unit/test_rollout.py -v`
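
The parsing rules in the table (case-insensitive keyword, optional colon, QUERY fallback) can be sketched with a single regular expression. This is an illustration only: the regex approach and the `SQLAction` stand-in are assumptions, and for multi-line output this sketch picks the "extract the action line" option that the table permits.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Keyword at the start of a line, an optional colon, then the argument.
_ACTION_RE = re.compile(
    r"^[ \t]*(DESCRIBE|SAMPLE|QUERY|ANSWER)\b[ \t]*:?[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass(frozen=True)
class SQLAction:
    """Stand-in for the project's SQLAction (only the two fields in the table)."""

    action_type: str
    argument: str


def parse_model_output(text: Optional[str]) -> SQLAction:
    """Parse raw model text into an action, falling back to QUERY."""
    raw = text or ""
    match = _ACTION_RE.search(raw)
    if match is None:
        # Fallback from the table: treat unparseable text as a QUERY.
        return SQLAction(action_type="QUERY", argument=raw.strip())
    return SQLAction(
        action_type=match.group(1).upper(),
        argument=match.group(2).strip(),
    )
```

With this sketch, a bare `"DESCRIBE"` yields an empty argument rather than a fallback, and `None` is treated like an empty string; both are choices the spec leaves open.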

	---

	### 1.5 reward_correctness (training/rewards.py)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_correctness_correct_answer \| Episode ended with correct answer \| Completions with correct=True metadata \| `[1.0]` \| happy \|
	\| test_correctness_wrong_answer \| Episode ended with wrong answer \| Completions with correct=False metadata \| `[0.0]` \| happy \|
	\| test_correctness_no_answer \| Episode timed out without answering \| Completions with no answer metadata \| `[0.0]` \| edge \|
	\| test_correctness_batch \| Multiple episodes in batch \| Mixed correct/wrong \| `[1.0, 0.0, 1.0, 0.0]` matching per-episode correctness \| happy \|
	\| test_correctness_empty_batch \| Empty completions list \| `[]` \| `[]` \| edge \|
	\| test_correctness_trl_compatible \| Return type is list[float] \| Any valid input \| `all(isinstance(r, float) for r in result)` \| happy \|

	Run: `uv run pytest tests/unit/test_rewards.py -v`
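
A reward function meeting these rows can be very small. The sketch below assumes that each completion is a dict carrying a boolean `"correct"` key (the key name comes from the rollout metadata row in section 1.8) and that extra keyword arguments passed by the trainer can be ignored; the real signature may differ.

```python
from typing import Any


def reward_correctness(completions: list[dict[str, Any]], **kwargs: Any) -> list[float]:
    """Return 1.0 for episodes that ended with a correct answer, else 0.0."""
    # A missing "correct" key (for example, a timed-out episode) counts as wrong.
    return [1.0 if c.get("correct") is True else 0.0 for c in completions]
```

For example, `reward_correctness([{"correct": True}, {"correct": False}, {}])` evaluates to `[1.0, 0.0, 0.0]`.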

	---

	### 1.6 reward_progress (training/rewards.py)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_progress_full \| Maximum progress (correct answer) \| Completions with full progress metadata \| Reward in `[0.0, 1.0]`, close to 1.0 \| happy \|
	\| test_progress_none \| No progress toward answer \| Completions with zero progress \| `[0.0]` \| happy \|
	\| test_progress_partial \| Partial progress \| Completions with partial closeness \| Reward in `(0.0, 1.0)` exclusive \| happy \|
	\| test_progress_normalized \| Output is always in [0, 1] range \| Various inputs \| `all(0.0 <= r <= 1.0 for r in result)` \| happy \|
	\| test_progress_batch \| Batch of varied progress \| Multiple episodes \| List of floats, length matches input \| happy \|
	\| test_progress_trl_compatible \| Return type is list[float] \| Any valid input \| `all(isinstance(r, float) for r in result)` \| happy \|

	Run: `uv run pytest tests/unit/test_rewards.py -v`
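
The normalization requirement can be enforced by clamping. This sketch assumes each completion dict carries a numeric `"progress"` value (key name taken from section 1.8); how that value is computed is outside the scope of this example.

```python
from typing import Any


def reward_progress(completions: list[dict[str, Any]], **kwargs: Any) -> list[float]:
    """Return per-episode progress clamped to the [0.0, 1.0] range."""
    rewards = []
    for completion in completions:
        # Missing progress metadata is treated as no progress.
        raw = float(completion.get("progress", 0.0))
        rewards.append(min(1.0, max(0.0, raw)))
    return rewards
```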

	---

	### 1.7 reward_operational (training/rewards.py)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_operational_good_episode \| All steps execute OK, discover new info, no repeats \| Completions with exec_ok=True, new_info=True per step \| Positive reward \| happy \|
	\| test_operational_all_errors \| Every step has execution errors \| Completions with exec_ok=False per step \| Low/negative reward \| error \|
	\| test_operational_repeat_penalty \| Episode with repeated identical actions \| Completions with repeat=True per step \| Lower reward than non-repeating \| happy \|
	\| test_operational_mixed_signals \| Mix of good and bad steps \| Varied step signals \| Reward between extremes \| happy \|
	\| test_operational_single_step \| Episode with only one step \| Single step completions \| Valid float returned \| edge \|
	\| test_operational_batch \| Multiple episodes \| Batch input \| List of floats, length matches \| happy \|
	\| test_operational_trl_compatible \| Return type is list[float] \| Any valid input \| `all(isinstance(r, float) for r in result)` \| happy \|

	Run: `uv run pytest tests/unit/test_rewards.py -v`
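
The ordering constraints in this table (good episode > mixed > all errors, repeats lower the score) can be met by averaging per-step signals. Everything numeric below is a placeholder: the four weights are illustrative, not the project's values, and storing per-step signal dicts under an `"operational"` key is an assumption.

```python
from typing import Any

# Placeholder weights for illustration only.
EXEC_OK_BONUS = 0.5
EXEC_ERROR_PENALTY = 0.5
NEW_INFO_BONUS = 0.5
REPEAT_PENALTY = 0.5


def reward_operational(completions: list[dict[str, Any]], **kwargs: Any) -> list[float]:
    """Average per-step operational signals into one float per episode."""
    rewards = []
    for completion in completions:
        steps = completion.get("operational") or []
        if not steps:
            # No step signals recorded: neutral reward.
            rewards.append(0.0)
            continue
        total = 0.0
        for step in steps:
            total += EXEC_OK_BONUS if step.get("exec_ok") else -EXEC_ERROR_PENALTY
            if step.get("new_info"):
                total += NEW_INFO_BONUS
            if step.get("repeat"):
                total -= REPEAT_PENALTY
        rewards.append(total / len(steps))
    return rewards
```

Averaging over steps keeps single-step and long episodes on the same scale, which is one way to satisfy the `test_operational_single_step` row.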

	---

	### 1.8 rollout_func (training/rollout.py)

	\| Test \| Description \| Input \| Expected \| Category \|
	\|------\|-------------\|-------\|----------\|----------\|
	\| test_rollout_returns_completions \| Returns list of dicts with expected keys \| Single prompt, mock model/tokenizer \| List of dicts with "content" and metadata keys \| happy \|
	\| test_rollout_batch_size \| Output length matches input prompt count \| N prompts \| N completions returned \| happy \|
	\| test_rollout_episode_terminates \| Episodes terminate within step_budget \| Config with step_budget=5 \| All episodes have <= 5 steps \| happy \|
	\| test_rollout_metadata_present \| Completions include correctness, progress, operational metadata \| Any valid input \| Each completion dict has "correct", "progress", "operational" keys \| happy \|
	\| test_rollout_unparseable_action \| Model generates gibberish, fallback fires \| Mock model returning garbage tokens \| Episode continues; no crash \| error \|
	\| test_rollout_truncation \| Long history is truncated to the system prompt plus the last 3 observation-action pairs \| Mock model, config with step_budget=20 \| Context does not exceed token window \| edge \|

	Run: `uv run pytest tests/unit/test_rollout.py -v`
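
The truncation rule checked by `test_rollout_truncation` can be sketched as a pure function over a chat-style message list. The helper name `truncate_history`, the `role`/`content` message shape, and the handling of a trailing observation that has no action yet are assumptions made for this example.

```python
def truncate_history(
    messages: list[dict[str, str]], keep_pairs: int = 3
) -> list[dict[str, str]]:
    """Keep the system message plus the last `keep_pairs` observation-action pairs.

    Assumes messages[0] is the system prompt and the remaining turns alternate
    user (observation) and assistant (action).
    """
    if not messages:
        return []
    system, turns = messages[0], messages[1:]
    # A trailing observation without its action is kept in addition to the pairs.
    pending = turns[-1:] if len(turns) % 2 == 1 else []
    paired = turns[: len(turns) - len(pending)]
    kept = paired[-2 * keep_pairs:] if keep_pairs > 0 else []
    return [system, *kept, *pending]
```

Truncating by pair count rather than by token count is simple to test; a token-based check would still be needed to confirm the context fits the model window.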

	---

	## 2. Integration Tests

	### Flow: End-to-End Training Episode

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Create GRPOConfig with test questions and mock DB \| Config object created \| Config fields match inputs \|
	\| 2 \| Load questions and filter by difficulty \| Only easy and medium questions are included \| Assert filtered count < total if hard questions exist \|
	\| 3 \| Call rollout_func with a real SQLEnvironment and mock model \| Completions returned with metadata \| Each completion has "content" key \|
	\| 4 \| Pass completions to reward_correctness \| Returns list[float] of 0.0/1.0 \| Length matches batch size \|
	\| 5 \| Pass completions to reward_progress \| Returns list[float] in [0,1] \| Length matches batch size \|
	\| 6 \| Pass completions to reward_operational \| Returns list[float] \| Length matches batch size \|

	Run: `uv run pytest tests/integration/test_training_pipeline.py -v`

	---

	### Flow: Unparseable Action Recovery

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Mock model generates unparseable text \| parse_model_output returns QUERY fallback \| action_type == "QUERY", argument == raw text \|
	\| 2 \| SQLEnvironment.step receives fallback action \| Returns error observation \| observation.error is non-empty \|
	\| 3 \| Episode continues with next step \| Step count increments, budget decreases \| step_count > previous, budget_remaining < previous \|

	Run: `uv run pytest tests/integration/test_training_pipeline.py -v`

	---

	### Flow: History Truncation

	\| Step \| Action \| Expected \| Verification \|
	\|------\|--------\|----------\|--------------\|
	\| 1 \| Run rollout with step_budget large enough to exceed token window \| Truncation is triggered \| History contains system prompt + last 3 observation-action pairs only \|
	\| 2 \| Episode completes normally after truncation \| No crash; completions returned \| Valid completion dicts in output \|

	Run: `uv run pytest tests/integration/test_training_pipeline.py -v`

	---

	## 3. API Tests

	No API endpoints defined for F006. All interfaces are Python function calls.

	---

	## 4. E2E Tests

	### Scenario: Training Notebook Smoke Test

	Setup: Test questions JSON with 2 easy questions, test SQLite database, tiny model (or mock).
	Actions:
	1. Instantiate GRPOConfig with test paths and minimal hyperparameters (1 epoch, batch_size=1, num_generations=2).
	2. Load model and tokenizer (use smallest available model or mock).
	3. Create GRPOTrainer with reward functions.
	4. Run trainer.train() for a single step.
	5. Verify learning curve data is logged.
	6. Run comparison episodes (before/after).

	Expected:
	- Training completes without error.
	- At least one metric is logged (loss, reward).
	- Comparison episodes produce valid SQLObservation sequences.

	Run: `uv run pytest tests/e2e/test_training_e2e.py -v --timeout=300`

	---

	### Scenario: Question Filtering by Difficulty

	Setup: Questions file with easy, medium, and hard questions.
	Actions:
	1. Create GRPOConfig with `difficulty_filter=["easy"]`.
	2. Load and filter questions.

	Expected: Only easy questions are included in training set.

	Run: `uv run pytest tests/e2e/test_training_e2e.py -v`
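
The filtering step in this scenario might look like the sketch below. It assumes each question is a dict with a `"difficulty"` key; the helper name `filter_questions` and that key name are hypothetical.

```python
from typing import Any


def filter_questions(
    questions: list[dict[str, Any]], difficulty_filter: list[str]
) -> list[dict[str, Any]]:
    """Return only the questions whose difficulty is in `difficulty_filter`."""
    allowed = set(difficulty_filter)
    return [q for q in questions if q.get("difficulty") in allowed]
```

An empty filter, or a filter that matches nothing, returns an empty list here; whether that should instead raise a clear error is left open by the edge cases checklist.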

	---

	## 5. Error Handling Tests

	### ModelLoadError

	\| Test \| Description \| Trigger \| Expected \|
	\|------\|-------------\|---------\|----------\|
	\| test_model_load_error_bad_name \| Invalid HuggingFace model name \| `GRPOConfig(model_name="nonexistent/model-xyz-999")` \| Fails fast; error message contains "nonexistent/model-xyz-999" \|

	### ActionParseError (handled via fallback)

	\| Test \| Description \| Trigger \| Expected \|
	\|------\|-------------\|---------\|----------\|
	\| test_action_parse_fallback_logged \| Unparseable action triggers warning log \| Model outputs `"¯\_(ツ)_/¯"` \| Warning logged; returns QUERY fallback \|

	### QuestionLoadError

	\| Test \| Description \| Trigger \| Expected \|
	\|------\|-------------\|---------\|----------\|
	\| test_question_load_missing_file \| Questions path does not exist \| `GRPOConfig(questions_path="/nonexistent/q.json")` \| Fails fast; error message contains the path \|
	\| test_question_load_empty_file \| Questions file is empty JSON array \| `questions.json` containing `[]` \| Fails fast; clear error about empty questions \|
	\| test_question_load_invalid_json \| Questions file has invalid JSON \| `questions.json` containing `{broken` \| Fails fast; JSON parse error \|
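
The three fail-fast cases above can be covered by one loader. This is a sketch under assumptions: `QuestionLoadError` is modelled as an exception class (the spec only uses it as a heading), and the helper name `load_questions` and the message wording are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


class QuestionLoadError(Exception):
    """Assumed exception type for question-loading failures."""


def load_questions(questions_path: str) -> list[dict[str, Any]]:
    """Load questions from a JSON file, failing fast with a clear message."""
    path = Path(questions_path)
    if not path.is_file():
        # Include the path so the failure is easy to diagnose.
        raise QuestionLoadError(f"Questions file not found: {questions_path}")
    try:
        questions = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise QuestionLoadError(f"Invalid JSON in {questions_path}: {exc}") from exc
    if not isinstance(questions, list) or not questions:
        raise QuestionLoadError(f"No questions found in {questions_path}")
    return questions
```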

	### OOMError

	\| Test \| Description \| Trigger \| Expected \|
	\|------\|-------------\|---------\|----------\|
	\| test_oom_guidance \| OOM during training prints guidance \| (Cannot reliably trigger in test; verify message formatting only) \| Error handler message mentions reducing batch_size or num_generations \|

	Run: `uv run pytest tests/unit/test_error_handling.py -v`

	---

	## 6. Edge Cases Checklist

	- [ ] Null/None inputs to parse_model_output
	- [ ] Empty string inputs to parse_model_output
	- [ ] Empty completions list to all reward functions
	- [ ] Single-element completions list to all reward functions
	- [ ] Very large batch (100+ prompts) to rollout_func
	- [ ] Questions file with only hard questions and difficulty_filter=["easy"] (zero matches)
	- [ ] step_budget=1 (immediate budget exhaustion after one action)
	- [ ] step_budget=0 (zero budget)
	- [ ] Unicode characters in model output (e.g., CJK, emoji)
	- [ ] Model output exceeding max_new_tokens
	- [ ] learning_rate=0.0 (no weight updates)
	- [ ] num_generations=1 (minimum GRPO completions)
	- [ ] Concurrent calls to reward functions (thread safety)
	- [ ] Database with no tables (empty schema)
	- [ ] Database with very large tables (performance)

	---

	## 7. Evidence Requirements

	\| Category \| Evidence Type \| Example \|
	\|----------\|---------------\|---------\|
	\| Unit tests \| pytest output \| `X passed` \|
	\| Integration \| pytest output \| `X passed` \|
	\| Error handling \| pytest output \| `X passed` \|
	\| E2E \| pytest output + training metrics \| `1 passed, loss=X.XX` \|
	\| Reward functions \| pytest output showing correct values \| `reward_correctness: [1.0, 0.0]` \|
	\| Parse fallback \| pytest output + log capture \| `WARNING: unparseable action, falling back to QUERY` \|