Add design doc: pipe-sql-training-reproduction-guide.md

docs/pipe-sql-training-reproduction-guide.md
## Step 2: Install Dependencies

```bash
# Install sqlglot in editable mode (puts pipe_sql/ on sys.path)
uv pip install -e .

# Install PyTorch with CUDA 12.6 support
```
This reads the 15,443 validated golden pairs (standard SQL → pipe SQL) and generates the training data:

```bash
# Full dataset (recommended for production training)
python -m pipe_sql.training.generate \
  --golden-pairs pipe_sql/validation_output/golden_pairs_consolidated.jsonl \
  --db-dir data/spider/database \
  --db-dir data/bird/train/train_databases \
  --db-dir data/bird/dev_20240627/dev_databases \
  --output-dir pipe_sql/training_output \
  --tool-calling --tool-ratio 0.3

# Subset for quick iteration (add --limit)
python -m pipe_sql.training.generate \
  --golden-pairs pipe_sql/validation_output/golden_pairs_consolidated.jsonl \
  --db-dir data/spider/database \
  --db-dir data/bird/train/train_databases \
  --db-dir data/bird/dev_20240627/dev_databases \
  --output-dir pipe_sql/training_output \
  --tool-calling --tool-ratio 0.3 \
  --limit 2000
```
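Before moving on to training, it is worth sanity-checking the generated split. A minimal sketch (the output path matches the `--output-dir` used above; the per-record schema is not assumed, so this only verifies that every line parses as JSON and reports counts):

```python
import json
from pathlib import Path

OUT = Path("pipe_sql/training_output")  # matches --output-dir above

def count_jsonl(path: Path) -> int:
    """Count records, checking that every non-blank line is valid JSON."""
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip() and json.loads(line) is not None)

for name in ("train.jsonl", "dev.jsonl"):
    if (OUT / name).exists():
        print(f"{name}: {count_jsonl(OUT / name)} records")

if (OUT / "stats.json").exists():
    print("stats keys:", sorted(json.loads((OUT / "stats.json").read_text(encoding="utf-8"))))
```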
Validates that the pipeline works end-to-end. Use a small dataset generated with `--limit 2000`:

```bash
python -m pipe_sql.finetuning.train \
  --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
  --train-data pipe_sql/training_output/train.jsonl \
  --dev-data pipe_sql/training_output/dev.jsonl \
  --max-seq-length 4096 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 4 \
  --num-epochs 1 \
  --no-4bit \
  --output-dir pipe_sql/finetuning_output_smoke
```

Expected: loss drops from ~2.1 to ~0.2, token accuracy rises to ~96%.
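The loss trend can be read back from the checkpoint metadata rather than scrolling console logs. A sketch assuming the script uses the Hugging Face Trainer, which writes a `trainer_state.json` containing a `log_history` into each checkpoint directory:

```python
import json
from pathlib import Path

def loss_trend(trainer_state_path: str) -> tuple[float, float]:
    """Return the (first, last) training loss logged in a Trainer state file."""
    state = json.loads(Path(trainer_state_path).read_text(encoding="utf-8"))
    losses = [entry["loss"] for entry in state["log_history"] if "loss" in entry]
    return losses[0], losses[-1]

# Usage against a smoke-run checkpoint (path illustrative):
# first, last = loss_trend("pipe_sql/finetuning_output_smoke/checkpoint-200/trainer_state.json")
# A healthy smoke run shows roughly 2.1 -> 0.2.
```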
#### 5b. Full 1.5B Training (recommended: full dataset, 2 epochs)

```bash
python -m pipe_sql.finetuning.train \
  --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
  --train-data pipe_sql/training_output/train.jsonl \
  --dev-data pipe_sql/training_output/dev.jsonl \
  --max-seq-length 4096 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 8 \
  --num-epochs 2 \
  --no-4bit \
  --output-dir pipe_sql/finetuning_output_1.5b
```

#### 5c. 7B QLoRA Training (recommended: full dataset, 2 epochs)
For the full-size model using 4-bit quantization to fit in 16 GB VRAM:

```bash
python -m pipe_sql.finetuning.train \
  --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
  --train-data pipe_sql/training_output/train.jsonl \
  --dev-data pipe_sql/training_output/dev.jsonl \
  --max-seq-length 4096 \
  --per-device-train-batch-size 1 \
  --gradient-accumulation-steps 32 \
  --load-in-4bit \
  --save-steps 1000 \
  --eval-steps 1000 \
  --output-dir pipe_sql/finetuning_output_7b
```

> **Important**: The lower learning rate (5e-5 vs default 2e-4) is critical for 7B stability. An earlier run with 2e-4 collapsed to NaN at epoch ~1.5. See the Troubleshooting section for details.
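A collapse like the one above only shows up in the logs after GPU hours are spent. A small guard can stop a run at the first non-finite loss; this sketch is illustrative and not part of `pipe_sql/finetuning/train.py` (the `NanGuard` hookup shown in comments uses the standard Hugging Face `TrainerCallback` API):

```python
import math

def loss_is_finite(logs: dict) -> bool:
    """True unless the logged training loss is NaN or +/-inf."""
    loss = logs.get("loss")
    return loss is None or math.isfinite(loss)

# Hooking it into the Hugging Face Trainer (illustrative):
#
# from transformers import TrainerCallback
#
# class NanGuard(TrainerCallback):
#     def on_log(self, args, state, control, logs=None, **kwargs):
#         if not loss_is_finite(logs or {}):
#             print(f"Non-finite loss at step {state.global_step}; stopping.")
#             control.should_training_stop = True
#
# trainer = Trainer(..., callbacks=[NanGuard()])
```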
After training, merge the LoRA adapter into the base model for standalone inference:

```bash
# For 1.5B model
python -m pipe_sql.finetuning.train --merge \
  --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
  --output-dir pipe_sql/finetuning_output_1.5b

# For 7B model
python -m pipe_sql.finetuning.train --merge \
  --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
  --output-dir pipe_sql/finetuning_output_7b
```

The merged model is saved to `<output-dir>/merged/` and can be loaded directly with `AutoModelForCausalLM.from_pretrained()`.
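Once merged, the checkpoint behaves like any standalone causal LM. A minimal loading sketch (the helper names and the prompt wording are illustrative, not part of the repository):

```python
def load_merged(path: str):
    """Load a merged checkpoint as a standalone causal LM (needs `transformers`)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
    return tok, model

def pipe_sql_prompt(standard_sql: str) -> list[dict]:
    """Chat-format request; the instruction wording here is illustrative."""
    return [{"role": "user", "content": f"Convert to pipe SQL: {standard_sql}"}]

# Usage (illustrative):
# tok, model = load_merged("pipe_sql/finetuning_output_1.5b/merged")
# inputs = tok.apply_chat_template(
#     pipe_sql_prompt("SELECT name FROM users WHERE age > 21"),
#     add_generation_prompt=True, return_tensors="pt",
# ).to(model.device)
# out = model.generate(inputs, max_new_tokens=256)
# print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```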
The model weights went to NaN at step 680 and remained dead for the remaining ~6…

**Salvageable**: The **checkpoint-500** (before collapse) is still viable → eval_loss=0.224, accuracy=95.8%. To use it:

```bash
python -m pipe_sql.finetuning.train --merge \
  --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
  --output-dir pipe_sql/finetuning_output_7b \
  --checkpoint checkpoint-500
```
With 7.7x more data, we expect: …

```
sqlglot/
├── pipe_sql/
│   ├── decompiler/                # Standard SQL → pipe SQL decompiler
│   ├── validation/                # Validation loop runner
│   ├── training/
│   │   ├── __main__.py            # Entry: python -m pipe_sql.training.generate
│   │   ├── generate.py            # Main data generation pipeline
│   │   ├── formatter.py           # Chat sample formatting (incremental trajectory)
│   │   ├── tool_formatter.py      # Tool-calling sample generation
│   │   ├── trajectory.py          # Pipe query → step decomposition
│   │   ├── schema_extractor.py    # SQLite schema → text representation
│   │   ├── tool_executor.py       # Simulated tool execution for training
│   │   └── writer.py              # Train/dev split and JSONL output
│   ├── finetuning/
│   │   ├── train.py               # Main fine-tuning script
│   │   ├── config.py              # TrainConfig dataclass with CLI parsing
│   │   └── data.py                # JSONL dataset loader
│   ├── evaluation/                # Evaluation server + agent
│   ├── validation_output/         # Validated golden pairs
│   │   └── golden_pairs_consolidated.jsonl  # 15,443 validated (gold_sql, pipe_sql) pairs
│   ├── training_output/           # Generated training data (not committed)
│   │   ├── train.jsonl
│   │   ├── dev.jsonl
│   │   └── stats.json
│   ├── finetuning_output/         # Training outputs (not committed)
│   │   ├── checkpoint-*/          # Intermediate checkpoints
│   │   ├── final/                 # Final LoRA adapter
│   │   └── merged/                # Merged standalone model
│   └── output/                    # Evaluation output (not committed)
├── scripts/
│   ├── setup_data.sh              # Downloads Spider 1.0
│   ├── setup_bird_data.sh         # Downloads BIRD dev + train
│   └── train.sh                   # One-command data gen + training
└── docs/design/
    ├── pipe-sql-fine-tuning-design-doc.md
    └── pipe-sql-decompiler-design-doc.md
```
**Cause**: bitsandbytes 4-bit quantization produces BFloat16 parameters, which are incompatible with the FP16 AMP gradient scaler.

**Fix**: The training script automatically detects this and uses `bf16=True` when `--load-in-4bit` is set on CUDA. If you see this error, ensure you're using the latest `pipe_sql/finetuning/train.py`.
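The detection boils down to choosing bf16 whenever 4-bit quantization is active on CUDA. A simplified sketch of that decision (not the literal code in `pipe_sql/finetuning/train.py`):

```python
def pick_precision(load_in_4bit: bool, cuda_available: bool) -> dict:
    """bitsandbytes 4-bit yields BFloat16 params, which the FP16 AMP gradient
    scaler rejects -- so 4-bit on CUDA must train with bf16 instead of fp16."""
    if load_in_4bit and cuda_available:
        return {"bf16": True, "fp16": False}
    return {"bf16": False, "fp16": cuda_available}

# e.g. transformers.TrainingArguments(
#     **pick_precision(args.load_in_4bit, torch.cuda.is_available()), ...)
```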
### Model Loading on CPU Instead of GPU
**Fix**: Always pass `--model-name` matching the model used for training:

```bash
python -m pipe_sql.finetuning.train --merge \
  --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
  --output-dir pipe_sql/finetuning_output
```

### 7B QLoRA Training Collapse (Loss → NaN)
**Fix**: Apply all three mitigations:

```bash
python -m pipe_sql.finetuning.train \
  --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
  --train-data pipe_sql/training_output/train.jsonl \
  --dev-data pipe_sql/training_output/dev.jsonl \
  --max-seq-length 4096 \
  --per-device-train-batch-size 1 \
  --gradient-accumulation-steps 32 \
  --load-in-4bit \
  --save-steps 500 \
  --eval-steps 500 \
  --output-dir pipe_sql/finetuning_output_7b
```

Key changes from the failed run:
- [x] Training data generated from golden pairs
- [x] Smoke test passed (1.5B, 1 epoch → loss 2.13→0.20, accuracy 96.1%)
- [x] Full 1.5B training completed (3 epochs → eval_loss=0.191, accuracy 95.8%)
- [x] 1.5B LoRA adapter merged (`pipe_sql/finetuning_output/merged/`)
- [ ] 7B QLoRA training → **collapsed at epoch 1.5** (checkpoint-500 salvageable, needs re-run with lower LR)
- [ ] 7B LoRA adapter merged
- [ ] Full dataset training (15K pairs) → pending 7B fix