File size: 5,460 Bytes
5dd1bb4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 | # Feature Demo: F004 — Question Dataset Expansion
> **Generated:** 2026-03-24T21:07:31Z
> **Context source:** spec + discovery only (implementation not read)
> **Feature entry:** [FEATURES.json #F004](./FEATURES.json)
---
## What This Feature Does
Before this feature, training data came from a single database and could overfit to one schema. F004 expands that into a curated multi-database dataset so training and evaluation reflect more realistic SQL variety.
From a user perspective, this feels like a repeatable CLI workflow: generate enriched train/eval JSON once, then validate it quickly before downstream training. You get precomputed gold answers, answer types, difficulty labels, and deterministic splits.
---
## What Is Already Proven
### Verified in This Demo Run
- Ran full curation pipeline locally and observed generated outputs: 473 train + 203 eval (676 total).
- Ran `--validate` mode locally and observed successful validation for all 676 records.
- Verified split ratio and database coverage from generated artifacts (`train_ratio=0.6997`, `eval_ratio=0.3003`, `db_count=10`).
- Ran an invalid CLI input case (`--db-list` missing path) and captured the real failure output.
- Ran repository smoke tests (`21 passed`).
### Previously Verified Evidence
- `specs/FEATURES.json` (`verification_evidence` for F004): verifier approved, `uv run pytest tests/ -v`, 21/21 passed at `2026-03-24T21:04:54Z`.
- `specs/F004-IMPLEMENTATION_SPEC.md` (Step 2.3): prior validation evidence recorded for 676 curated records and ~70/30 split.
---
## What Still Needs User Verification
None for local CLI proof.
Optional product check: decide whether current MVP difficulty skew warnings are acceptable for your training goals.
---
## Quickstart / Verification Steps
> Run these commands to see the feature in action:
```bash
uv run python scripts/curate_questions.py
uv run python scripts/curate_questions.py --validate
```
Requires local Python/uv environment and access to existing project data directories.
---
## Live Local Proof
### Generate the Curated Train/Eval Datasets
This runs the user-facing curation pipeline end-to-end.
```bash
uv run python scripts/curate_questions.py
```
```
WARNING: Difficulty distribution off target: easy=91.72% (target 40%)
WARNING: Difficulty distribution off target: medium=7.40% (target 40%)
WARNING: Difficulty distribution off target: hard=0.89% (target 20%)
Prepared 10 databases in data/databases
Loaded 676 Spider questions
Curated 676 questions (skipped 0)
Validation passed
Wrote 473 train records to data/questions/questions_train.json
Wrote 203 eval records to data/questions/questions_eval.json
```
Notice the pipeline completes successfully and writes both split files.
### Validate Existing Curated Outputs
This is the fast re-check path users can run before training.
```bash
uv run python scripts/curate_questions.py --validate
```
```
WARNING: Difficulty distribution off target: easy=91.72% (target 40%)
WARNING: Difficulty distribution off target: medium=7.40% (target 40%)
WARNING: Difficulty distribution off target: hard=0.89% (target 20%)
Validation passed for 676 curated records
```
Notice validation passes while surfacing non-blocking MVP warnings.
---
## Existing Evidence
- F004 `verification_evidence` in `specs/FEATURES.json`: 21/21 smoke tests passed, verifier status `approved`.
- `specs/F004-IMPLEMENTATION_SPEC.md` Step 2.3: prior recorded split metrics (`473/203`) and validation pass.
---
## Manual Verification Checklist
1. Run full curation command and confirm both JSON files are written.
2. Run `--validate` and confirm exit succeeds with `Validation passed` message.
3. Confirm split counts are close to 70/30.
4. Confirm warnings (if any) match your accepted MVP quality bar.
---
## Edge Cases Exercised
### Boundary Check: Split Ratio and DB Coverage
```bash
uv run python -c "import json; from pathlib import Path; train=json.loads(Path('data/questions/questions_train.json').read_text()); eval_=json.loads(Path('data/questions/questions_eval.json').read_text()); total=len(train)+len(eval_); dbs=sorted({q['database_name'] for q in train+eval_}); print(f'train={len(train)} eval={len(eval_)} total={total} train_ratio={len(train)/total:.4f} eval_ratio={len(eval_)/total:.4f} db_count={len(dbs)}')"
```
```
train=473 eval=203 total=676 train_ratio=0.6997 eval_ratio=0.3003 db_count=10
```
This confirms the split target and multi-database coverage from actual artifacts.
### Error Case: Missing `--db-list` Path
```bash
uv run python scripts/curate_questions.py --db-list data/questions/does_not_exist.json
```
```
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'data/questions/does_not_exist.json'
```
This shows current behavior for invalid input path (real failure output captured).
---
## Test Evidence (Optional)
> Supplementary proof that the repository remains healthy.
| Test Suite | Tests | Status |
|---|---|---|
| `uv run pytest tests/ -v` | 21 | All passed |
Representative command run:
```bash
uv run pytest tests/ -v
```
Result summary: `============================== 21 passed in 8.48s ==============================`
---
## Feature Links
- Implementation spec: `specs/F004-IMPLEMENTATION_SPEC.md`
- Verification spec: `specs/F004-VERIFICATION_SPEC.md`
---
*Demo generated by `feature-demo` agent. Re-run with `/feature-demo F004` to refresh.*
|