File size: 17,168 Bytes
5dd1bb4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
# Verification Specification

**Feature:** F004
**Generated from:** specs/F004-VERIFICATION_INPUT.json
**Generated:** 2026-03-24

---

## 1. Unit Tests

### EnrichedQuestionRecord (Type)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_record_all_fields_present | All 9 required fields populated | Full valid record dict | All fields accessible, no missing keys | happy |
| test_record_question_id_format | question_id matches `{db_id}_{split}_{index:03d}` | `"concert_singer_train_007"` | Passes regex `^[a-z_]+_(train|eval)_\d{3}$` | happy |
| test_record_question_id_invalid | Rejects malformed question_id | `"bad-id"` | Validation error or detectable as invalid | error |
| test_record_answer_type_enum | answer_type is one of allowed values | `"integer"`, `"float"`, `"string"`, `"list"`, `"table"` | Each accepted | happy |
| test_record_answer_type_invalid | Rejects unknown answer_type | `"boolean"` | Rejected or flagged | error |
| test_record_difficulty_enum | difficulty is one of allowed values | `"easy"`, `"medium"`, `"hard"` | Each accepted | happy |
| test_record_difficulty_invalid | Rejects unknown difficulty | `"extreme"` | Rejected or flagged | error |
| test_record_split_enum | split is one of allowed values | `"train"`, `"eval"` | Each accepted | happy |
| test_record_split_invalid | Rejects unknown split | `"test"` | Rejected or flagged | error |
| test_record_tables_involved_nonempty | tables_involved has at least one entry | `["students"]` | Accepted | happy |
| test_record_tables_involved_empty | Empty tables_involved is rejected | `[]` | Rejected or flagged by validation | edge |
| test_record_gold_sql_nonempty | gold_sql is a non-empty string | `"SELECT COUNT(*) FROM students"` | Accepted | happy |
| test_record_gold_sql_empty | Empty gold_sql is rejected | `""` | Rejected or flagged | edge |
| test_record_gold_answer_types | gold_answer can hold int, float, str, list, list-of-lists | `42`, `3.14`, `"Alice"`, `[1,2]`, `[[1,"a"]]` | Each stored and retrievable | happy |

**Run:** `pytest tests/test_f004_dataset.py::TestEnrichedQuestionRecord -v`

---

### classify_answer_type

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_classify_integer | Single integer answer | `42` | `"integer"` | happy |
| test_classify_float | Single float answer | `3.14` | `"float"` | happy |
| test_classify_string | Single string answer | `"Alice"` | `"string"` | happy |
| test_classify_list | Flat list (single column, multiple rows) | `[1, 2, 3]` | `"list"` | happy |
| test_classify_table | List of tuples/lists (multi-column result) | `[(1, "a"), (2, "b")]` | `"table"` | happy |
| test_classify_none | None/null answer | `None` | Defined behavior (error or specific type) | edge |
| test_classify_empty_list | Empty list | `[]` | `"list"` or defined behavior | edge |
| test_classify_single_row_tuple | Single-element tuple | `(42,)` | `"integer"` (unwrapped) or `"list"` | edge |
| test_classify_nested_single | Single-row multi-column | `[(1, "a")]` | `"table"` | edge |
| test_classify_boolean | Boolean answer | `True` | Defined fallback behavior | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestClassifyAnswerType -v`

---

### extract_tables_involved

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_extract_single_table | Simple SELECT from one table | `"SELECT * FROM students"` | `["students"]` | happy |
| test_extract_join | JOIN with two tables | `"SELECT s.name FROM students s JOIN courses c ON s.id = c.student_id"` | `["courses", "students"]` (sorted) | happy |
| test_extract_subquery | Subquery referencing different table | `"SELECT * FROM students WHERE id IN (SELECT student_id FROM enrollments)"` | `["enrollments", "students"]` (sorted) | happy |
| test_extract_deduplication | Same table referenced multiple times | `"SELECT a.x, b.y FROM t1 a JOIN t1 b ON a.id = b.id"` | `["t1"]` (deduplicated) | happy |
| test_extract_case_insensitive | Mixed case SQL keywords | `"select * FROM Students"` | `["Students"]` or `["students"]` (consistent) | edge |
| test_extract_with_alias | Table alias should resolve to table name | `"SELECT s.name FROM students AS s"` | `["students"]` | edge |
| test_extract_multiple_joins | Three or more tables joined | `"SELECT * FROM a JOIN b ON a.id=b.id JOIN c ON b.id=c.id"` | `["a", "b", "c"]` (sorted) | happy |
| test_extract_empty_sql | Empty SQL string | `""` | `[]` or error | edge |
| test_extract_no_from | SQL without FROM clause | `"SELECT 1+1"` | `[]` | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestExtractTablesInvolved -v`

---

### classify_difficulty

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_difficulty_easy_1_table | 1 table involved | `["students"]` | `"easy"` | happy |
| test_difficulty_easy_2_tables | 2 tables involved | `["students", "courses"]` | `"easy"` | happy |
| test_difficulty_medium_3_tables | 3 tables involved | `["a", "b", "c"]` | `"medium"` | happy |
| test_difficulty_hard_4_tables | 4 tables involved | `["a", "b", "c", "d"]` | `"hard"` | happy |
| test_difficulty_hard_many_tables | 6+ tables involved | `["a", "b", "c", "d", "e", "f"]` | `"hard"` | happy |
| test_difficulty_empty_tables | 0 tables (edge case) | `[]` | Defined behavior (error or `"easy"`) | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestClassifyDifficulty -v`

---

### compute_gold_answer

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_compute_valid_select | Valid SELECT on real SQLite DB | `"SELECT COUNT(*) FROM singer"`, valid db_path | Integer result | happy |
| test_compute_multirow | Query returning multiple rows | `"SELECT * FROM singer LIMIT 3"`, valid db_path | List/table result | happy |
| test_compute_invalid_sql | Syntactically invalid SQL | `"SELCT * FORM x"`, valid db_path | Raises `sqlite3.Error` | error |
| test_compute_missing_table | SQL references non-existent table | `"SELECT * FROM nonexistent"`, valid db_path | Raises `sqlite3.Error` | error |
| test_compute_missing_db | Database file does not exist | `"SELECT 1"`, `/tmp/nonexistent.sqlite` | Raises `sqlite3.Error` or `FileNotFoundError` | error |
| test_compute_empty_result | Query returns no rows | `"SELECT * FROM singer WHERE 1=0"`, valid db_path | Empty result (e.g., `[]`) | edge |
| test_compute_null_result | Query returning NULL | `"SELECT NULL"`, valid db_path | `None` or defined null handling | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestComputeGoldAnswer -v`

---

### assign_splits

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_assign_train_from_spider_train | Spider train split maps to train | Questions with `spider_split="train"` | `split="train"` | happy |
| test_assign_eval_from_spider_validation | Spider validation split maps to eval | Questions with `spider_split="validation"` | `split="eval"` | happy |
| test_assign_preserves_all_questions | No questions are dropped | 10 input questions | 10 output questions | happy |
| test_assign_mixed_splits | Mix of train and validation | 7 train + 3 validation | 7 train + 3 eval | happy |
| test_assign_all_train | All questions from train split | 5 train questions | All `split="train"` | edge |
| test_assign_all_eval | All questions from validation split | 5 validation questions | All `split="eval"` | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestAssignSplits -v`

---

### download_spider_databases

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_download_creates_files | Download for known db_ids produces SQLite files | `["concert_singer"]`, temp dir | Dict mapping db_id to valid Path, file exists | happy |
| test_download_skips_existing | Existing database is not re-downloaded | Pre-existing file, same db_id | File unchanged, no download attempt | happy |
| test_download_unknown_db | Unknown db_id | `["nonexistent_db_xyz"]` | Raises `FileNotFoundError` | error |
| test_download_empty_list | Empty db_ids list | `[]` | Returns empty dict | edge |
| test_download_returns_correct_paths | Paths follow `{output_dir}/{db_id}/{db_id}.sqlite` | `["concert_singer"]` | Path matches expected pattern | happy |

**Run:** `pytest tests/test_f004_dataset.py::TestDownloadSpiderDatabases -v`

---

### load_spider_questions

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_load_returns_questions | Valid db_ids produce question dicts | `["concert_singer"]` | Non-empty list of dicts with `query` and `db_id` fields | happy |
| test_load_multiple_dbs | Multiple db_ids returns questions from all | `["concert_singer", "pets_1"]` | Questions from both databases present | happy |
| test_load_includes_both_splits | Both train and validation splits loaded | `["concert_singer"]` | Questions with both spider splits present | happy |
| test_load_connection_failure | Network unavailable (mocked) | Any db_ids, no network | Raises `ConnectionError` | error |
| test_load_empty_list | Empty db_ids list | `[]` | Returns empty list | edge |

**Run:** `pytest tests/test_f004_dataset.py::TestLoadSpiderQuestions -v`

---

### validate_dataset

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_validate_clean_dataset | Valid dataset passes all checks | Well-formed dataset | Empty error list | happy |
| test_validate_missing_field | Record missing required field | Record without `gold_sql` | Error list includes missing field message | error |
| test_validate_duplicate_ids | Two records share same question_id | Duplicate `"concert_singer_train_001"` | Error list includes duplicate message | error |
| test_validate_gold_sql_fails | gold_sql that does not execute | Record with broken SQL | Error list includes SQL execution message | error |
| test_validate_gold_answer_mismatch | gold_answer does not match re-execution | Record with wrong gold_answer | Error list includes mismatch message | error |
| test_validate_difficulty_distribution | Distribution check ~40/40/20 | Dataset with extreme skew (100% easy) | Warning or error about distribution | edge |
| test_validate_clean_splits | No question appears in both splits | Dataset with clean splits | No split errors | happy |
| test_validate_cross_split_leak | Same question in train and eval | Duplicate across splits | Error detected | error |

**Run:** `pytest tests/test_f004_dataset.py::TestValidateDataset -v`

---

## 2. Integration Tests

### Flow: Primary Curation Pipeline

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Read db_list.json for target database IDs | Returns list of 10 db_ids | `len(db_ids) >= 10` |
| 2 | Download Spider SQLite databases | All db_ids have corresponding .sqlite files | All paths exist, files > 0 bytes |
| 3 | Load raw Spider questions from HuggingFace | Questions loaded for all target db_ids | `len(questions) > 0`, each has `db_id` and `query` |
| 4 | Compute gold_answer for each question | gold_answer populated, failed queries skipped | No None gold_answers in output; skip count logged |
| 5 | Classify answer_type for each question | answer_type is one of 5 valid values | All values in `{"integer","float","string","list","table"}` |
| 6 | Extract tables_involved from gold_sql | Each question has non-empty tables_involved | All lists non-empty, all entries are strings |
| 7 | Classify difficulty from tables_involved | difficulty is one of 3 valid values | All values in `{"easy","medium","hard"}` |
| 8 | Assign train/eval split | Each question has valid split | All values in `{"train","eval"}` |
| 9 | Generate question_id | All IDs unique and match format | Regex match, `len(set(ids)) == len(ids)` |
| 10 | Validate full dataset | validate_dataset returns empty error list | `len(errors) == 0` |
| 11 | Write output JSON files | questions_train.json and questions_eval.json exist | Files parseable as JSON, combined count >= 100 |

**Run:** `pytest tests/test_f004_integration.py::TestCurationPipeline -v`

---

### Flow: Gold SQL Execution Failure (Alternative)

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Provide question with broken gold_sql | gold_sql raises sqlite3.Error | Exception caught, not propagated |
| 2 | Check warning logged | Log contains db_id and error details | Log output includes warning |
| 3 | Question excluded from final dataset | Output does not contain the broken question | question_id absent from output |
| 4 | Remaining questions processed | Pipeline continues without interruption | Other questions have valid gold_answer |

**Run:** `pytest tests/test_f004_integration.py::TestGoldSqlFailure -v`

---

### Flow: Validate-Only Mode (Alternative)

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Invoke script with `--validate` flag | Does not download or regenerate data | No network calls made |
| 2 | Load existing questions_train.json and questions_eval.json | Files read successfully | No FileNotFoundError |
| 3 | Locate SQLite databases in data/databases/ | All referenced databases found | All db_paths valid |
| 4 | Run validate_dataset() | Returns list of errors (may be empty) | Return type is list[str] |
| 5 | Exit code reflects validation result | 0 if valid, 1 if invalid | Process exit code matches |

**Run:** `pytest tests/test_f004_integration.py::TestValidateOnlyMode -v`

---

## 3. API Tests

No API endpoints defined for F004. This feature is a standalone curation script.

---

## 4. E2E Tests

### Scenario: Full Dataset Generation

**Setup:** Clean environment with no existing output files. Network access available. `data/questions/db_list.json` contains 10 target database IDs.

**Actions:**
1. Run the curation script end-to-end (no flags)
2. Wait for completion

**Expected:**
- `data/questions/questions_train.json` exists and contains valid JSON array
- `data/questions/questions_eval.json` exists and contains valid JSON array
- Combined question count >= 100
- All questions pass `validate_dataset()` with zero errors
- At least 8 distinct `database_name` values represented
- Train/eval split approximately 70/30 (+/- 10%)
- All three difficulty levels present
- All five answer_type values present (or at least 3)

**Run:** `python scripts/curate_dataset.py && python scripts/curate_dataset.py --validate`

---

### Scenario: Validate-Only on Pre-Generated Data

**Setup:** Output JSON files already exist from a prior run. SQLite databases present in `data/databases/`.

**Actions:**
1. Run the curation script with `--validate` flag only

**Expected:**
- No new files created or modified
- Validation output printed to stdout
- Exit code 0 if data is valid

**Run:** `python scripts/curate_dataset.py --validate`

---

### Scenario: Idempotent Re-Run

**Setup:** Output JSON files already exist from a prior run.

**Actions:**
1. Run the curation script again (full mode)
2. Compare output files

**Expected:**
- Output files are regenerated
- Same question count (deterministic for same input)
- Database files not re-downloaded (skip existing)

**Run:** `python scripts/curate_dataset.py`

---

## 5. Edge Cases Checklist

- [ ] Null/None gold_answer values handled gracefully
- [ ] Empty string gold_sql skipped or rejected
- [ ] SQL with unicode characters in table/column names
- [ ] Very large query results (1000+ rows) handled by compute_gold_answer
- [ ] Database file that exists but is corrupt (0 bytes or invalid SQLite)
- [ ] db_list.json missing or empty
- [ ] db_list.json with duplicate db_ids
- [ ] Question with gold_sql containing multiple statements (semicolons)
- [ ] Question where gold_sql returns different results on re-execution (non-deterministic)
- [ ] Tables_involved extraction with SQL using CTEs (WITH clause)
- [ ] Tables_involved extraction with SQL using UNION across different tables
- [ ] Extremely long gold_sql (> 1000 chars)
- [ ] Database with no tables (empty schema)
- [ ] Retry behavior on transient network failure during HuggingFace download
- [ ] Retry behavior on transient network failure during database download
- [ ] Concurrent access to same SQLite file (if parallelized)
- [ ] Output JSON file encoding (UTF-8) with special characters in question_text

---

## 6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | `X passed, Y skipped` |
| Integration | pytest output | `X passed` |
| E2E | Script output + file inspection | `Generated 105 questions`, `Validation passed` |
| Output files | JSON structure inspection | `jq length questions_train.json` returns count |
| Skip handling | Log output | Warning messages for skipped questions |
| Validation | Exit code | `echo $?` returns `0` |