nittygritty-zzy committed on
Commit 0e554cb · verified · 1 Parent(s): 9f85f8a

Add design doc: pipe-sql-training-reproduction-guide.md
docs/pipe-sql-training-reproduction-guide.md CHANGED
@@ -31,7 +31,7 @@ source .venv/Scripts/activate # Windows (Git Bash)
 ## Step 2: Install Dependencies
 
 ```bash
-# Install sqlglot in editable mode (puts training_data/ and finetuning/ on sys.path)
 uv pip install -e .
 
 # Install PyTorch with CUDA 12.6 support
@@ -77,21 +77,21 @@ This reads the 15,443 validated golden pairs (standard SQL ↔ pipe SQL) and gen
 
 ```bash
 # Full dataset (recommended for production training)
-python -m training_data.generate \
-  --golden-pairs validation_output/golden_pairs_consolidated.jsonl \
   --db-dir data/spider/database \
   --db-dir data/bird/train/train_databases \
   --db-dir data/bird/dev_20240627/dev_databases \
-  --output-dir training_data_output \
   --tool-calling --tool-ratio 0.3
 
 # Subset for quick iteration (add --limit)
-python -m training_data.generate \
-  --golden-pairs validation_output/golden_pairs_consolidated.jsonl \
   --db-dir data/spider/database \
   --db-dir data/bird/train/train_databases \
   --db-dir data/bird/dev_20240627/dev_databases \
-  --output-dir training_data_output \
   --tool-calling --tool-ratio 0.3 \
   --limit 2000
 ```
@@ -148,16 +148,16 @@ bash scripts/train.sh
 Validates the pipeline works end-to-end. Use a small dataset generated with `--limit 2000`:
 
 ```bash
-python -m finetuning.train \
   --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
-  --train-data training_data_output/train.jsonl \
-  --dev-data training_data_output/dev.jsonl \
   --max-seq-length 4096 \
   --per-device-train-batch-size 4 \
   --gradient-accumulation-steps 4 \
   --num-epochs 1 \
   --no-4bit \
-  --output-dir finetuning_output_smoke
 ```
 
 Expected: loss drops from ~2.1 to ~0.2, token accuracy rises to ~96%.
@@ -165,16 +165,16 @@ Expected: loss drops from ~2.1 to ~0.2, token accuracy rises to ~96%.
 #### 5b. Full 1.5B Training (recommended: full dataset, 2 epochs)
 
 ```bash
-python -m finetuning.train \
   --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
-  --train-data training_data_output/train.jsonl \
-  --dev-data training_data_output/dev.jsonl \
   --max-seq-length 4096 \
   --per-device-train-batch-size 4 \
   --gradient-accumulation-steps 8 \
   --num-epochs 2 \
   --no-4bit \
-  --output-dir finetuning_output_1.5b
 ```
 
 #### 5c. 7B QLoRA Training (recommended: full dataset, 2 epochs)
@@ -182,10 +182,10 @@ python -m finetuning.train \
 For the full-size model using 4-bit quantization to fit in 16 GB VRAM:
 
 ```bash
-python -m finetuning.train \
   --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
-  --train-data training_data_output/train.jsonl \
-  --dev-data training_data_output/dev.jsonl \
   --max-seq-length 4096 \
   --per-device-train-batch-size 1 \
   --gradient-accumulation-steps 32 \
@@ -194,7 +194,7 @@ python -m finetuning.train \
   --load-in-4bit \
   --save-steps 1000 \
   --eval-steps 1000 \
-  --output-dir finetuning_output_7b
 ```
 
 > **Important**: The lower learning rate (5e-5 vs default 2e-4) is critical for 7B stability. An earlier run with 2e-4 collapsed to NaN at epoch ~1.5. See the Troubleshooting section for details.
@@ -262,14 +262,14 @@ After training, merge the LoRA adapter into the base model for standalone infere
 
 ```bash
 # For 1.5B model
-python -m finetuning.train --merge \
   --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
-  --output-dir finetuning_output_1.5b
 
 # For 7B model
-python -m finetuning.train --merge \
   --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
-  --output-dir finetuning_output_7b
 ```
 
 The merged model is saved to `<output-dir>/merged/` and can be loaded directly with `AutoModelForCausalLM.from_pretrained()`.
@@ -340,9 +340,9 @@ The model weights went to NaN at step 680 and remained dead for the remaining ~6
 
 **Salvageable**: The **checkpoint-500** (before collapse) is still viable — eval_loss=0.224, accuracy=95.8%. To use it:
 ```bash
-python -m finetuning.train --merge \
   --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
-  --output-dir finetuning_output_7b \
   --checkpoint checkpoint-500
 ```
 
@@ -373,33 +373,38 @@ With 7.7x more data, we expect:
 
 ```
 sqlglot/
-├── validation_output/
-│   └── golden_pairs_consolidated.jsonl  # 15,443 validated (gold_sql, pipe_sql) pairs
-├── training_data/
-│   ├── __main__.py            # Entry: python -m training_data.generate
-│   ├── generate.py            # Main data generation pipeline
-│   ├── formatter.py           # Chat sample formatting (incremental trajectory)
-│   ├── tool_formatter.py      # Tool-calling sample generation
-│   ├── trajectory.py          # Pipe query → step decomposition
-│   ├── schema_extractor.py    # SQLite schema → text representation
-│   ├── tool_executor.py       # Simulated tool execution for training
-│   └── writer.py              # Train/dev split and JSONL output
-├── finetuning/
-│   ├── train.py               # Main fine-tuning script
-│   ├── config.py              # TrainConfig dataclass with CLI parsing
-│   └── data.py                # JSONL dataset loader
 ├── scripts/
 │   ├── setup_data.sh          # Downloads Spider 1.0
 │   ├── setup_bird_data.sh     # Downloads BIRD dev + train
 │   └── train.sh               # One-command data gen + training
-├── training_data_output/      # Generated training data (not committed)
-│   ├── train.jsonl
-│   ├── dev.jsonl
-│   └── stats.json
-├── finetuning_output/         # Training outputs (not committed)
-│   ├── checkpoint-*/          # Intermediate checkpoints
-│   ├── final/                 # Final LoRA adapter
-│   └── merged/                # Merged standalone model
 └── docs/design/
     ├── pipe-sql-fine-tuning-design-doc.md
     ├── pipe-sql-decompiler-design-doc.md
@@ -417,7 +422,7 @@ sqlglot/
 
 **Cause**: bitsandbytes 4-bit quantization produces BFloat16 parameters, which are incompatible with the FP16 AMP gradient scaler.
 
-**Fix**: The training script automatically detects this and uses `bf16=True` when `--load-in-4bit` is set on CUDA. If you see this error, ensure you're using the latest `finetuning/train.py`.
 
 ### Model Loading on CPU Instead of GPU
 
@@ -435,9 +440,9 @@ sqlglot/
 
 **Fix**: Always pass `--model-name` matching the model used for training:
 ```bash
-python -m finetuning.train --merge \
   --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
-  --output-dir finetuning_output
 ```
 
 ### 7B QLoRA Training Collapse (Loss → NaN)
@@ -449,10 +454,10 @@ python -m finetuning.train --merge \
 **Fix**: Apply all three mitigations:
 
 ```bash
-python -m finetuning.train \
   --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
-  --train-data training_data_output/train.jsonl \
-  --dev-data training_data_output/dev.jsonl \
   --max-seq-length 4096 \
   --per-device-train-batch-size 1 \
   --gradient-accumulation-steps 32 \
@@ -461,7 +466,7 @@ python -m finetuning.train \
   --load-in-4bit \
   --save-steps 500 \
   --eval-steps 500 \
-  --output-dir finetuning_output_7b
 ```
 
 Key changes from the failed run:
@@ -493,7 +498,7 @@ huggingface-cli login
 - [x] Training data generated from golden pairs
 - [x] Smoke test passed (1.5B, 1 epoch — loss 2.13→0.20, accuracy 96.1%)
 - [x] Full 1.5B training completed (3 epochs — eval_loss=0.191, accuracy 95.8%)
-- [x] 1.5B LoRA adapter merged (`finetuning_output/merged/`)
 - [ ] 7B QLoRA training — **collapsed at epoch 1.5** (checkpoint-500 salvageable, needs re-run with lower LR)
 - [ ] 7B LoRA adapter merged
 - [ ] Full dataset training (15K pairs) — pending 7B fix
 
 ## Step 2: Install Dependencies
 
 ```bash
+# Install sqlglot in editable mode (puts pipe_sql/ on sys.path)
 uv pip install -e .
 
 # Install PyTorch with CUDA 12.6 support
 
 
 ```bash
 # Full dataset (recommended for production training)
+python -m pipe_sql.training.generate \
+  --golden-pairs pipe_sql/validation_output/golden_pairs_consolidated.jsonl \
   --db-dir data/spider/database \
   --db-dir data/bird/train/train_databases \
   --db-dir data/bird/dev_20240627/dev_databases \
+  --output-dir pipe_sql/training_output \
   --tool-calling --tool-ratio 0.3
 
 # Subset for quick iteration (add --limit)
+python -m pipe_sql.training.generate \
+  --golden-pairs pipe_sql/validation_output/golden_pairs_consolidated.jsonl \
   --db-dir data/spider/database \
   --db-dir data/bird/train/train_databases \
   --db-dir data/bird/dev_20240627/dev_databases \
+  --output-dir pipe_sql/training_output \
   --tool-calling --tool-ratio 0.3 \
   --limit 2000
 ```
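Before kicking off a full generation run, it can help to spot-check the golden-pairs file. A minimal sketch using only the standard library, with a hypothetical helper `load_golden_pairs`; it assumes each JSONL record carries `gold_sql` and `pipe_sql` keys, as described in the project layout:

```python
import json

def load_golden_pairs(path):
    """Yield (gold_sql, pipe_sql) tuples from a JSONL file of validated pairs."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            yield record["gold_sql"], record["pipe_sql"]

# Quick self-check against an in-memory record mirroring the assumed schema:
sample = json.loads('{"gold_sql": "SELECT 1", "pipe_sql": "FROM (SELECT 1)"}')
pair = (sample["gold_sql"], sample["pipe_sql"])
print(pair)  # ('SELECT 1', 'FROM (SELECT 1)')
```

On the full file, `len(list(load_golden_pairs(...)))` should report 15,443.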
 
 Validates the pipeline works end-to-end. Use a small dataset generated with `--limit 2000`:
 
 ```bash
+python -m pipe_sql.finetuning.train \
   --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
+  --train-data pipe_sql/training_output/train.jsonl \
+  --dev-data pipe_sql/training_output/dev.jsonl \
   --max-seq-length 4096 \
   --per-device-train-batch-size 4 \
   --gradient-accumulation-steps 4 \
   --num-epochs 1 \
   --no-4bit \
+  --output-dir pipe_sql/finetuning_output_smoke
 ```
 
 Expected: loss drops from ~2.1 to ~0.2, token accuracy rises to ~96%.
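The loss trend can be checked programmatically from the checkpoint's `trainer_state.json` rather than the console log. A sketch assuming the standard Hugging Face `Trainer` state layout (a top-level `log_history` list whose training entries carry a `loss` key; eval entries carry `eval_loss` instead):

```python
import json

def loss_trajectory(state):
    """Extract (step, loss) points from a Trainer state dict's log_history."""
    return [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]

# Synthetic state mirroring the assumed layout:
state = {"log_history": [
    {"step": 10, "loss": 2.13},
    {"step": 200, "loss": 0.55, "learning_rate": 1e-4},
    {"step": 500, "eval_loss": 0.22},   # eval entry: no "loss" key, skipped
    {"step": 900, "loss": 0.20},
]}
points = loss_trajectory(state)
print(points[0], "->", points[-1])  # (10, 2.13) -> (900, 0.2)
```

On a real run, load the dict with `json.load()` from `<output-dir>/checkpoint-*/trainer_state.json` and confirm the last point is near 0.2.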
 
 #### 5b. Full 1.5B Training (recommended: full dataset, 2 epochs)
 
 ```bash
+python -m pipe_sql.finetuning.train \
   --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
+  --train-data pipe_sql/training_output/train.jsonl \
+  --dev-data pipe_sql/training_output/dev.jsonl \
   --max-seq-length 4096 \
   --per-device-train-batch-size 4 \
   --gradient-accumulation-steps 8 \
   --num-epochs 2 \
   --no-4bit \
+  --output-dir pipe_sql/finetuning_output_1.5b
 ```
 
 #### 5c. 7B QLoRA Training (recommended: full dataset, 2 epochs)
 
 For the full-size model using 4-bit quantization to fit in 16 GB VRAM:
 
 ```bash
+python -m pipe_sql.finetuning.train \
   --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
+  --train-data pipe_sql/training_output/train.jsonl \
+  --dev-data pipe_sql/training_output/dev.jsonl \
   --max-seq-length 4096 \
   --per-device-train-batch-size 1 \
   --gradient-accumulation-steps 32 \
 
   --load-in-4bit \
   --save-steps 1000 \
   --eval-steps 1000 \
+  --output-dir pipe_sql/finetuning_output_7b
 ```
 
 > **Important**: The lower learning rate (5e-5 vs default 2e-4) is critical for 7B stability. An earlier run with 2e-4 collapsed to NaN at epoch ~1.5. See the Troubleshooting section for details.
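For reference, the batch-size flags in these commands combine multiplicatively: each optimizer update sees per-device batch size times gradient-accumulation steps examples. A quick check of the three configurations in this guide:

```python
# Effective batch size = per-device batch size * gradient-accumulation steps
configs = {
    "1.5B smoke test": (4, 4),
    "1.5B full":       (4, 8),
    "7B QLoRA":        (1, 32),
}
effective = {name: per_device * accum for name, (per_device, accum) in configs.items()}
for name, size in effective.items():
    print(f"{name}: effective batch size {size}")  # 16, 32, 32
```

Note that the 7B run trades per-device batch size for accumulation to fit in 16 GB VRAM while keeping the same effective batch size (32) as the full 1.5B run.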
 
 
 ```bash
 # For 1.5B model
+python -m pipe_sql.finetuning.train --merge \
   --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
+  --output-dir pipe_sql/finetuning_output_1.5b
 
 # For 7B model
+python -m pipe_sql.finetuning.train --merge \
   --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
+  --output-dir pipe_sql/finetuning_output_7b
 ```
 
 The merged model is saved to `<output-dir>/merged/` and can be loaded directly with `AutoModelForCausalLM.from_pretrained()`.
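As a sketch of standalone inference with the merged weights (assumes `transformers` is installed; the directory path and the prompt wording are illustrative, and the real chat formatting comes from the tokenizer's chat template):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_merged(model_dir="pipe_sql/finetuning_output_1.5b/merged"):
    """Load the merged standalone model and its tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
    return tokenizer, model

def translate(tokenizer, model, sql):
    """Generate a pipe-SQL translation for one standard-SQL query."""
    messages = [{"role": "user", "content": f"Translate to pipe SQL:\n{sql}"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Usage would be `tok, model = load_merged(); print(translate(tok, model, "SELECT 1"))`.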
 
 
 **Salvageable**: The **checkpoint-500** (before collapse) is still viable — eval_loss=0.224, accuracy=95.8%. To use it:
 ```bash
+python -m pipe_sql.finetuning.train --merge \
   --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
+  --output-dir pipe_sql/finetuning_output_7b \
   --checkpoint checkpoint-500
 ```
 
 
 
 ```
 sqlglot/
+├── pipe_sql/
+│   ├── decompiler/            # Standard SQL → pipe SQL decompiler
+│   ├── validation/            # Validation loop runner
+│   ├── training/
+│   │   ├── __main__.py          # Entry: python -m pipe_sql.training.generate
+│   │   ├── generate.py          # Main data generation pipeline
+│   │   ├── formatter.py         # Chat sample formatting (incremental trajectory)
+│   │   ├── tool_formatter.py    # Tool-calling sample generation
+│   │   ├── trajectory.py        # Pipe query → step decomposition
+│   │   ├── schema_extractor.py  # SQLite schema → text representation
+│   │   ├── tool_executor.py     # Simulated tool execution for training
+│   │   └── writer.py            # Train/dev split and JSONL output
+│   ├── finetuning/
+│   │   ├── train.py             # Main fine-tuning script
+│   │   ├── config.py            # TrainConfig dataclass with CLI parsing
+│   │   └── data.py              # JSONL dataset loader
+│   ├── evaluation/            # Evaluation server + agent
+│   ├── validation_output/     # Validated golden pairs
+│   │   └── golden_pairs_consolidated.jsonl  # 15,443 validated (gold_sql, pipe_sql) pairs
+│   ├── training_output/       # Generated training data (not committed)
+│   │   ├── train.jsonl
+│   │   ├── dev.jsonl
+│   │   └── stats.json
+│   ├── finetuning_output/     # Training outputs (not committed)
+│   │   ├── checkpoint-*/      # Intermediate checkpoints
+│   │   ├── final/             # Final LoRA adapter
+│   │   └── merged/            # Merged standalone model
+│   └── output/                # Evaluation output (not committed)
 ├── scripts/
 │   ├── setup_data.sh          # Downloads Spider 1.0
 │   ├── setup_bird_data.sh     # Downloads BIRD dev + train
 │   └── train.sh               # One-command data gen + training
 └── docs/design/
     ├── pipe-sql-fine-tuning-design-doc.md
     ├── pipe-sql-decompiler-design-doc.md
 
 
 **Cause**: bitsandbytes 4-bit quantization produces BFloat16 parameters, which are incompatible with the FP16 AMP gradient scaler.
 
+**Fix**: The training script automatically detects this and uses `bf16=True` when `--load-in-4bit` is set on CUDA. If you see this error, ensure you're using the latest `pipe_sql/finetuning/train.py`.
 
 ### Model Loading on CPU Instead of GPU
 
 
 
 **Fix**: Always pass `--model-name` matching the model used for training:
 ```bash
+python -m pipe_sql.finetuning.train --merge \
   --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
+  --output-dir pipe_sql/finetuning_output
 ```
 
 ### 7B QLoRA Training Collapse (Loss → NaN)
 
 **Fix**: Apply all three mitigations:
 
 ```bash
+python -m pipe_sql.finetuning.train \
   --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
+  --train-data pipe_sql/training_output/train.jsonl \
+  --dev-data pipe_sql/training_output/dev.jsonl \
   --max-seq-length 4096 \
   --per-device-train-batch-size 1 \
   --gradient-accumulation-steps 32 \
 
   --load-in-4bit \
   --save-steps 500 \
   --eval-steps 500 \
+  --output-dir pipe_sql/finetuning_output_7b
 ```
 
 Key changes from the failed run:
 
 - [x] Training data generated from golden pairs
 - [x] Smoke test passed (1.5B, 1 epoch — loss 2.13→0.20, accuracy 96.1%)
 - [x] Full 1.5B training completed (3 epochs — eval_loss=0.191, accuracy 95.8%)
+- [x] 1.5B LoRA adapter merged (`pipe_sql/finetuning_output/merged/`)
 - [ ] 7B QLoRA training — **collapsed at epoch 1.5** (checkpoint-500 salvageable, needs re-run with lower LR)
 - [ ] 7B LoRA adapter merged
 - [ ] Full dataset training (15K pairs) — pending 7B fix