File size: 17,168 Bytes
5dd1bb4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 | # Verification Specification
**Feature:** F004
**Generated from:** specs/F004-VERIFICATION_INPUT.json
**Generated:** 2026-03-24
---
## 1. Unit Tests
### EnrichedQuestionRecord (Type)
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_record_all_fields_present | All 9 required fields populated | Full valid record dict | All fields accessible, no missing keys | happy |
| test_record_question_id_format | question_id matches `{db_id}_{split}_{index:03d}` | `"concert_singer_train_007"` | Passes regex `^[a-z_]+_(train|eval)_\d{3}$` | happy |
| test_record_question_id_invalid | Rejects malformed question_id | `"bad-id"` | Validation error or detectable as invalid | error |
| test_record_answer_type_enum | answer_type is one of allowed values | `"integer"`, `"float"`, `"string"`, `"list"`, `"table"` | Each accepted | happy |
| test_record_answer_type_invalid | Rejects unknown answer_type | `"boolean"` | Rejected or flagged | error |
| test_record_difficulty_enum | difficulty is one of allowed values | `"easy"`, `"medium"`, `"hard"` | Each accepted | happy |
| test_record_difficulty_invalid | Rejects unknown difficulty | `"extreme"` | Rejected or flagged | error |
| test_record_split_enum | split is one of allowed values | `"train"`, `"eval"` | Each accepted | happy |
| test_record_split_invalid | Rejects unknown split | `"test"` | Rejected or flagged | error |
| test_record_tables_involved_nonempty | tables_involved has at least one entry | `["students"]` | Accepted | happy |
| test_record_tables_involved_empty | Empty tables_involved is rejected | `[]` | Rejected or flagged by validation | edge |
| test_record_gold_sql_nonempty | gold_sql is a non-empty string | `"SELECT COUNT(*) FROM students"` | Accepted | happy |
| test_record_gold_sql_empty | Empty gold_sql is rejected | `""` | Rejected or flagged | edge |
| test_record_gold_answer_types | gold_answer can hold int, float, str, list, list-of-lists | `42`, `3.14`, `"Alice"`, `[1,2]`, `[[1,"a"]]` | Each stored and retrievable | happy |
**Run:** `pytest tests/test_f004_dataset.py::TestEnrichedQuestionRecord -v`
---
### classify_answer_type
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_classify_integer | Single integer answer | `42` | `"integer"` | happy |
| test_classify_float | Single float answer | `3.14` | `"float"` | happy |
| test_classify_string | Single string answer | `"Alice"` | `"string"` | happy |
| test_classify_list | Flat list (single column, multiple rows) | `[1, 2, 3]` | `"list"` | happy |
| test_classify_table | List of tuples/lists (multi-column result) | `[(1, "a"), (2, "b")]` | `"table"` | happy |
| test_classify_none | None/null answer | `None` | Defined behavior (error or specific type) | edge |
| test_classify_empty_list | Empty list | `[]` | `"list"` or defined behavior | edge |
| test_classify_single_row_tuple | Single-element tuple | `(42,)` | `"integer"` (unwrapped) or `"list"` | edge |
| test_classify_nested_single | Single-row multi-column | `[(1, "a")]` | `"table"` | edge |
| test_classify_boolean | Boolean answer | `True` | Defined fallback behavior | edge |
**Run:** `pytest tests/test_f004_dataset.py::TestClassifyAnswerType -v`
---
### extract_tables_involved
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_extract_single_table | Simple SELECT from one table | `"SELECT * FROM students"` | `["students"]` | happy |
| test_extract_join | JOIN with two tables | `"SELECT s.name FROM students s JOIN courses c ON s.id = c.student_id"` | `["courses", "students"]` (sorted) | happy |
| test_extract_subquery | Subquery referencing different table | `"SELECT * FROM students WHERE id IN (SELECT student_id FROM enrollments)"` | `["enrollments", "students"]` (sorted) | happy |
| test_extract_deduplication | Same table referenced multiple times | `"SELECT a.x, b.y FROM t1 a JOIN t1 b ON a.id = b.id"` | `["t1"]` (deduplicated) | happy |
| test_extract_case_insensitive | Mixed case SQL keywords | `"select * FROM Students"` | `["Students"]` or `["students"]` (consistent) | edge |
| test_extract_with_alias | Table alias should resolve to table name | `"SELECT s.name FROM students AS s"` | `["students"]` | edge |
| test_extract_multiple_joins | Three or more tables joined | `"SELECT * FROM a JOIN b ON a.id=b.id JOIN c ON b.id=c.id"` | `["a", "b", "c"]` (sorted) | happy |
| test_extract_empty_sql | Empty SQL string | `""` | `[]` or error | edge |
| test_extract_no_from | SQL without FROM clause | `"SELECT 1+1"` | `[]` | edge |
**Run:** `pytest tests/test_f004_dataset.py::TestExtractTablesInvolved -v`
---
### classify_difficulty
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_difficulty_easy_1_table | 1 table involved | `["students"]` | `"easy"` | happy |
| test_difficulty_easy_2_tables | 2 tables involved | `["students", "courses"]` | `"easy"` | happy |
| test_difficulty_medium_3_tables | 3 tables involved | `["a", "b", "c"]` | `"medium"` | happy |
| test_difficulty_hard_4_tables | 4 tables involved | `["a", "b", "c", "d"]` | `"hard"` | happy |
| test_difficulty_hard_many_tables | 6+ tables involved | `["a", "b", "c", "d", "e", "f"]` | `"hard"` | happy |
| test_difficulty_empty_tables | 0 tables (edge case) | `[]` | Defined behavior (error or `"easy"`) | edge |
**Run:** `pytest tests/test_f004_dataset.py::TestClassifyDifficulty -v`
---
### compute_gold_answer
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_compute_valid_select | Valid SELECT on real SQLite DB | `"SELECT COUNT(*) FROM singer"`, valid db_path | Integer result | happy |
| test_compute_multirow | Query returning multiple rows | `"SELECT * FROM singer LIMIT 3"`, valid db_path | List/table result | happy |
| test_compute_invalid_sql | Syntactically invalid SQL | `"SELCT * FORM x"`, valid db_path | Raises `sqlite3.Error` | error |
| test_compute_missing_table | SQL references non-existent table | `"SELECT * FROM nonexistent"`, valid db_path | Raises `sqlite3.Error` | error |
| test_compute_missing_db | Database file does not exist | `"SELECT 1"`, `/tmp/nonexistent.sqlite` | Raises `sqlite3.Error` or `FileNotFoundError` | error |
| test_compute_empty_result | Query returns no rows | `"SELECT * FROM singer WHERE 1=0"`, valid db_path | Empty result (e.g., `[]`) | edge |
| test_compute_null_result | Query returning NULL | `"SELECT NULL"`, valid db_path | `None` or defined null handling | edge |
**Run:** `pytest tests/test_f004_dataset.py::TestComputeGoldAnswer -v`
---
### assign_splits
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_assign_train_from_spider_train | Spider train split maps to train | Questions with `spider_split="train"` | `split="train"` | happy |
| test_assign_eval_from_spider_validation | Spider validation split maps to eval | Questions with `spider_split="validation"` | `split="eval"` | happy |
| test_assign_preserves_all_questions | No questions are dropped | 10 input questions | 10 output questions | happy |
| test_assign_mixed_splits | Mix of train and validation | 7 train + 3 validation | 7 train + 3 eval | happy |
| test_assign_all_train | All questions from train split | 5 train questions | All `split="train"` | edge |
| test_assign_all_eval | All questions from validation split | 5 validation questions | All `split="eval"` | edge |
**Run:** `pytest tests/test_f004_dataset.py::TestAssignSplits -v`
---
### download_spider_databases
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_download_creates_files | Download for known db_ids produces SQLite files | `["concert_singer"]`, temp dir | Dict mapping db_id to valid Path, file exists | happy |
| test_download_skips_existing | Existing database is not re-downloaded | Pre-existing file, same db_id | File unchanged, no download attempt | happy |
| test_download_unknown_db | Unknown db_id | `["nonexistent_db_xyz"]` | Raises `FileNotFoundError` | error |
| test_download_empty_list | Empty db_ids list | `[]` | Returns empty dict | edge |
| test_download_returns_correct_paths | Paths follow `{output_dir}/{db_id}/{db_id}.sqlite` | `["concert_singer"]` | Path matches expected pattern | happy |
**Run:** `pytest tests/test_f004_dataset.py::TestDownloadSpiderDatabases -v`
---
### load_spider_questions
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_load_returns_questions | Valid db_ids produce question dicts | `["concert_singer"]` | Non-empty list of dicts with `query` and `db_id` fields | happy |
| test_load_multiple_dbs | Multiple db_ids returns questions from all | `["concert_singer", "pets_1"]` | Questions from both databases present | happy |
| test_load_includes_both_splits | Both train and validation splits loaded | `["concert_singer"]` | Questions with both spider splits present | happy |
| test_load_connection_failure | Network unavailable (mocked) | Any db_ids, no network | Raises `ConnectionError` | error |
| test_load_empty_list | Empty db_ids list | `[]` | Returns empty list | edge |
**Run:** `pytest tests/test_f004_dataset.py::TestLoadSpiderQuestions -v`
---
### validate_dataset
| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_validate_clean_dataset | Valid dataset passes all checks | Well-formed dataset | Empty error list | happy |
| test_validate_missing_field | Record missing required field | Record without `gold_sql` | Error list includes missing field message | error |
| test_validate_duplicate_ids | Two records share same question_id | Duplicate `"concert_singer_train_001"` | Error list includes duplicate message | error |
| test_validate_gold_sql_fails | gold_sql that does not execute | Record with broken SQL | Error list includes SQL execution message | error |
| test_validate_gold_answer_mismatch | gold_answer does not match re-execution | Record with wrong gold_answer | Error list includes mismatch message | error |
| test_validate_difficulty_distribution | Distribution check ~40/40/20 | Dataset with extreme skew (100% easy) | Warning or error about distribution | edge |
| test_validate_clean_splits | No question appears in both splits | Dataset with clean splits | No split errors | happy |
| test_validate_cross_split_leak | Same question in train and eval | Duplicate across splits | Error detected | error |
**Run:** `pytest tests/test_f004_dataset.py::TestValidateDataset -v`
---
## 2. Integration Tests
### Flow: Primary Curation Pipeline
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Read db_list.json for target database IDs | Returns list of 10 db_ids | `len(db_ids) >= 10` |
| 2 | Download Spider SQLite databases | All db_ids have corresponding .sqlite files | All paths exist, files > 0 bytes |
| 3 | Load raw Spider questions from HuggingFace | Questions loaded for all target db_ids | `len(questions) > 0`, each has `db_id` and `query` |
| 4 | Compute gold_answer for each question | gold_answer populated, failed queries skipped | No None gold_answers in output; skip count logged |
| 5 | Classify answer_type for each question | answer_type is one of 5 valid values | All values in `{"integer","float","string","list","table"}` |
| 6 | Extract tables_involved from gold_sql | Each question has non-empty tables_involved | All lists non-empty, all entries are strings |
| 7 | Classify difficulty from tables_involved | difficulty is one of 3 valid values | All values in `{"easy","medium","hard"}` |
| 8 | Assign train/eval split | Each question has valid split | All values in `{"train","eval"}` |
| 9 | Generate question_id | All IDs unique and match format | Regex match, `len(set(ids)) == len(ids)` |
| 10 | Validate full dataset | validate_dataset returns empty error list | `len(errors) == 0` |
| 11 | Write output JSON files | questions_train.json and questions_eval.json exist | Files parseable as JSON, combined count >= 100 |
**Run:** `pytest tests/test_f004_integration.py::TestCurationPipeline -v`
---
### Flow: Gold SQL Execution Failure (Alternative)
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Provide question with broken gold_sql | gold_sql raises sqlite3.Error | Exception caught, not propagated |
| 2 | Check warning logged | Log contains db_id and error details | Log output includes warning |
| 3 | Question excluded from final dataset | Output does not contain the broken question | question_id absent from output |
| 4 | Remaining questions processed | Pipeline continues without interruption | Other questions have valid gold_answer |
**Run:** `pytest tests/test_f004_integration.py::TestGoldSqlFailure -v`
---
### Flow: Validate-Only Mode (Alternative)
| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Invoke script with `--validate` flag | Does not download or regenerate data | No network calls made |
| 2 | Load existing questions_train.json and questions_eval.json | Files read successfully | No FileNotFoundError |
| 3 | Locate SQLite databases in data/databases/ | All referenced databases found | All db_paths valid |
| 4 | Run validate_dataset() | Returns list of errors (may be empty) | Return type is list[str] |
| 5 | Exit code reflects validation result | 0 if valid, 1 if invalid | Process exit code matches |
**Run:** `pytest tests/test_f004_integration.py::TestValidateOnlyMode -v`
---
## 3. API Tests
No API endpoints defined for F004. This feature is a standalone curation script.
---
## 4. E2E Tests
### Scenario: Full Dataset Generation
**Setup:** Clean environment with no existing output files. Network access available. `data/questions/db_list.json` contains 10 target database IDs.
**Actions:**
1. Run the curation script end-to-end (no flags)
2. Wait for completion
**Expected:**
- `data/questions/questions_train.json` exists and contains valid JSON array
- `data/questions/questions_eval.json` exists and contains valid JSON array
- Combined question count >= 100
- All questions pass `validate_dataset()` with zero errors
- At least 8 distinct `database_name` values represented
- Train/eval split approximately 70/30 (+/- 10%)
- All three difficulty levels present
- All five answer_type values present (or at least 3)
**Run:** `python scripts/curate_dataset.py && python scripts/curate_dataset.py --validate`
---
### Scenario: Validate-Only on Pre-Generated Data
**Setup:** Output JSON files already exist from a prior run. SQLite databases present in `data/databases/`.
**Actions:**
1. Run the curation script with `--validate` flag only
**Expected:**
- No new files created or modified
- Validation output printed to stdout
- Exit code 0 if data is valid
**Run:** `python scripts/curate_dataset.py --validate`
---
### Scenario: Idempotent Re-Run
**Setup:** Output JSON files already exist from a prior run.
**Actions:**
1. Run the curation script again (full mode)
2. Compare output files
**Expected:**
- Output files are regenerated
- Same question count (deterministic for same input)
- Database files not re-downloaded (skip existing)
**Run:** `python scripts/curate_dataset.py`
---
## 5. Edge Cases Checklist
- [ ] Null/None gold_answer values handled gracefully
- [ ] Empty string gold_sql skipped or rejected
- [ ] SQL with unicode characters in table/column names
- [ ] Very large query results (1000+ rows) handled by compute_gold_answer
- [ ] Database file that exists but is corrupt (0 bytes or invalid SQLite)
- [ ] db_list.json missing or empty
- [ ] db_list.json with duplicate db_ids
- [ ] Question with gold_sql containing multiple statements (semicolons)
- [ ] Question where gold_sql returns different results on re-execution (non-deterministic)
- [ ] Tables_involved extraction with SQL using CTEs (WITH clause)
- [ ] Tables_involved extraction with SQL using UNION across different tables
- [ ] Extremely long gold_sql (> 1000 chars)
- [ ] Database with no tables (empty schema)
- [ ] Retry behavior on transient network failure during HuggingFace download
- [ ] Retry behavior on transient network failure during database download
- [ ] Concurrent access to same SQLite file (if parallelized)
- [ ] Output JSON file encoding (UTF-8) with special characters in question_text
---
## 6. Evidence Requirements
| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed, Y skipped` |
| Integration | pytest output | `X passed` |
| E2E | Script output + file inspection | `Generated 105 questions`, `Validation passed` |
| Output files | JSON structure inspection | `jq length questions_train.json` returns count |
| Skip handling | Log output | Warning messages for skipped questions |
| Validation | Exit code | `echo $?` returns `0` |
|