# Pipe SQL Fine-Tuning: Reproduction Guide

This document describes how to reproduce the pipe SQL fine-tuning pipeline end-to-end, from a fresh clone of the repository to a trained model. It covers environment setup, data preparation, training data generation, and model fine-tuning.

For the design rationale behind this system, see [pipe-sql-fine-tuning-design-doc.md](pipe-sql-fine-tuning-design-doc.md).

---

## Prerequisites

- **GPU**: NVIDIA GPU with >=16 GB VRAM (tested on RTX 4080 16 GB)
- **NVIDIA Driver**: 525+ (CUDA 12.x compatible)
- **OS**: Windows 11 or Linux (commands below use bash; on Windows, use Git Bash or WSL)
- **uv**: Python package manager ([install guide](https://docs.astral.sh/uv/getting-started/installation/))
- **Disk**: ~15 GB for benchmark databases, ~15 GB for model weights (cached by HuggingFace)

---

## Step 1: Clone and Create Python Environment

```bash
git clone <repo-url>
cd sqlglot

# Create a Python 3.11 virtual environment
uv venv .venv --python 3.11
source .venv/Scripts/activate   # Windows (Git Bash)
# source .venv/bin/activate     # Linux/macOS
```

## Step 2: Install Dependencies

```bash
# Install sqlglot in editable mode (puts training_data/ and finetuning/ on sys.path)
uv pip install -e .

# Install PyTorch with CUDA 12.6 support
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install ML training stack
uv pip install transformers peft trl datasets bitsandbytes accelerate

# For Spider dataset download (Google Drive)
uv pip install gdown
```

**Verify CUDA**:
```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX 4080
```

> **Note**: PyTorch cu126 wheels bundle their own CUDA runtime. You do NOT need to upgrade your system CUDA toolkit; any NVIDIA driver >=525 works.

## Step 3: Download Benchmark Databases

Training data generation requires the SQLite databases from the Spider 1.0 and BIRD benchmarks, which are used to extract schemas.

```bash
# Spider 1.0 (~1 GB, downloads from Google Drive via gdown)
bash scripts/setup_data.sh

# BIRD dev + train sets (~9 GB, downloads via curl)
bash scripts/setup_bird_data.sh
```

**Verify**:
```bash
ls data/spider/database | wc -l                  # ~166 databases
ls data/bird/train/train_databases | wc -l       # ~70 databases
ls data/bird/dev_20240627/dev_databases | wc -l  # ~11 databases
```

## Step 4: Generate Training Data

This step reads the 15,443 validated golden pairs (standard SQL → pipe SQL) and generates incremental chat training samples. Each N-operator pipe query is decomposed into N training samples, so the model learns to emit one pipe operator at a time.

```bash
# Full dataset (recommended for production training)
python -m training_data.generate \
    --golden-pairs validation_output/golden_pairs_consolidated.jsonl \
    --db-dir data/spider/database \
    --db-dir data/bird/train/train_databases \
    --db-dir data/bird/dev_20240627/dev_databases \
    --output-dir training_data_output \
    --tool-calling --tool-ratio 0.3

# Subset for quick iteration (add --limit)
python -m training_data.generate \
    --golden-pairs validation_output/golden_pairs_consolidated.jsonl \
    --db-dir data/spider/database \
    --db-dir data/bird/train/train_databases \
    --db-dir data/bird/dev_20240627/dev_databases \
    --output-dir training_data_output \
    --tool-calling --tool-ratio 0.3 \
    --limit 2000
```

| Flag | Description |
|------|-------------|
| `--golden-pairs` | JSONL file with `{gold_sql, pipe_sql, db_id, question_id, question}` entries |
| `--db-dir` | Directories containing SQLite databases (repeatable) |
| `--tool-calling` | Also generate agentic tool-calling training samples |
| `--tool-ratio 0.3` | 30% of golden pairs get an additional tool-calling sample |
| `--limit N` | Process only the first N pairs (omit for the full dataset) |

**Expected output**:

| Input | Total Samples | Train (95%) | Dev (5%) | Tool-calling |
|-------|---------------|-------------|----------|--------------|
| `--limit 2000` | ~7,400 | ~6,900 | ~500 | ~580 |
| All 15,443 pairs | ~57,000 | ~54,000 | ~2,800 | ~4,600 |

Each golden pair produces ~3.7 training samples on average (trajectory-decomposition amplification). Output files: `train.jsonl`, `dev.jsonl`, and `stats.json`.

### Training Data Format

Each sample is a chat conversation in OpenAI format:

```json
{
  "messages": [
    {"role": "system", "content": "You are a SQL assistant that writes pipe SQL..."},
    {"role": "user", "content": "Question: ... Schema: ... Query so far: FROM t |> WHERE ..."},
    {"role": "assistant", "content": "|> AGGREGATE COUNT(*) AS cnt GROUP BY department"}
  ]
}
```
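
To sanity-check the generated files against this format, a short inspection script helps (a minimal sketch; the paths assume the default `--output-dir` from the command above):

```python
import json
from pathlib import Path

# Paths assume the default --output-dir used in Step 4.
out_dir = Path("training_data_output")

# Count samples per split; the totals should roughly match the table above.
for split in ("train.jsonl", "dev.jsonl"):
    with open(out_dir / split, encoding="utf-8") as f:
        print(split, sum(1 for _ in f), "samples")

# Peek at the first training sample and confirm the system/user/assistant roles.
with open(out_dir / "train.jsonl", encoding="utf-8") as f:
    sample = json.loads(next(f))
for msg in sample["messages"]:
    print(f"{msg['role']:>9}: {msg['content'][:60]}")
```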

## Step 5: Fine-Tune the Model

### Quick Start (One Command)

The `scripts/train.sh` wrapper handles data generation + training:

```bash
# Smoke test (~5 min, 1 epoch, 100 samples)
bash scripts/train.sh --smoke-test

# Full training (1.5B model, 3 epochs, ~2 hours)
bash scripts/train.sh
```

### Manual Training Commands

#### 5a. Smoke Test (1.5B, 1 epoch, small subset)

This validates that the pipeline works end-to-end. Use a small dataset generated with `--limit 2000`:

```bash
python -m finetuning.train \
    --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --train-data training_data_output/train.jsonl \
    --dev-data training_data_output/dev.jsonl \
    --max-seq-length 4096 \
    --per-device-train-batch-size 4 \
    --gradient-accumulation-steps 4 \
    --num-epochs 1 \
    --no-4bit \
    --output-dir finetuning_output_smoke
```

Expected: loss drops from ~2.1 to ~0.2, and token accuracy rises to ~96%.

#### 5b. Full 1.5B Training (recommended: full dataset, 2 epochs)

```bash
python -m finetuning.train \
    --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --train-data training_data_output/train.jsonl \
    --dev-data training_data_output/dev.jsonl \
    --max-seq-length 4096 \
    --per-device-train-batch-size 4 \
    --gradient-accumulation-steps 8 \
    --num-epochs 2 \
    --no-4bit \
    --output-dir finetuning_output_1.5b
```

#### 5c. 7B QLoRA Training (recommended: full dataset, 2 epochs)

For the full-size model, use 4-bit quantization (QLoRA) so training fits in 16 GB of VRAM:

```bash
python -m finetuning.train \
    --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
    --train-data training_data_output/train.jsonl \
    --dev-data training_data_output/dev.jsonl \
    --max-seq-length 4096 \
    --per-device-train-batch-size 1 \
    --gradient-accumulation-steps 32 \
    --learning-rate 5e-5 \
    --num-epochs 2 \
    --load-in-4bit \
    --save-steps 1000 \
    --eval-steps 1000 \
    --output-dir finetuning_output_7b
```

> **Important**: The lower learning rate (5e-5 vs. the default 2e-4) is critical for 7B stability. An earlier run with 2e-4 collapsed to NaN at epoch ~1.5. See the Troubleshooting section for details.

### Recommended Configurations

The tables below show recommended settings for both dataset sizes. With the full dataset (15,443 pairs → ~54K train samples), 2 epochs is optimal: 7.7x more data reduces the overfitting risk, and eval loss plateaus by epoch 2. With the smaller subset, 3 epochs compensates for the limited data.

**1.5B (float16, `--no-4bit`)**:

| Parameter | Subset (2K pairs) | Full (15K pairs) |
|-----------|-------------------|------------------|
| `--num-epochs` | 3 | **2** |
| `--per-device-train-batch-size` | 4 | 4 |
| `--gradient-accumulation-steps` | 8 | 8 |
| Effective batch size | 32 | 32 |
| Steps/epoch | ~215 | ~1,690 |
| Total steps | ~645 | ~3,380 |
| VRAM usage | ~7 GB | ~7 GB |
| Est. time (RTX 4080) | ~1h 44min | **~3.5 hours** |

**7B QLoRA (4-bit, `--load-in-4bit`)**:

| Parameter | Subset (2K pairs) | Full (15K pairs) |
|-----------|-------------------|------------------|
| `--num-epochs` | 2 | **2** |
| `--per-device-train-batch-size` | 1 | 1 |
| `--gradient-accumulation-steps` | 32 | **32** |
| `--learning-rate` | **5e-5** | **5e-5** |
| Effective batch size | 32 | **32** |
| `--save-steps` / `--eval-steps` | 500 | **1000** |
| Steps/epoch | ~429 | ~1,690 |
| Total steps | ~858 | ~3,380 |
| VRAM usage | ~12.5 GB | ~12.5 GB |
| Est. time (RTX 4080) | ~2 hours | **~17 hours** |

> **Note**: Earlier runs with `--learning-rate 2e-4` and `--gradient-accumulation-steps 16` over 3 epochs caused a training collapse at epoch ~1.5 (loss → NaN). The settings above reflect the corrected configuration.

> **Tip**: Run 1.5B first as a quick validation (~3.5h). If its eval loss improves over the subset baseline (0.191), the full dataset is working well; then kick off the 7B run overnight.

### Why 2 Epochs for Full Dataset?

With the 2K subset (3 epochs), we observed:
- Train loss 0.132 vs. eval loss 0.191: a gap of 0.059, indicating mild overfitting
- Eval loss plateaued between epochs 2 and 3

With 7.7x more training data, the model sees far more diverse examples per epoch. Two epochs provide sufficient coverage while avoiding diminishing returns: more data beats more epochs.

### Why grad_accum=32 for Full 7B?

Doubling gradient accumulation from 16 to 32 (effective batch size 32) halves the number of optimizer steps while keeping the total number of forward/backward passes identical. Each optimizer step then uses a lower-variance gradient estimate, giving more stable training. This doesn't change wall-clock time, but it produces better-calibrated updates.
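
As a cross-check of the step counts in the configuration tables above, the arithmetic is simply the number of training samples divided by the effective batch size (a minimal sketch using the approximate sample count from Step 4):

```python
import math

# Full-dataset figures from Step 4 (~54K train samples).
train_samples = 54_000
per_device_batch = 1      # 7B QLoRA setting; the 1.5B run uses 4
grad_accum = 32           # --gradient-accumulation-steps; the 1.5B run uses 8
num_epochs = 2

effective_batch = per_device_batch * grad_accum               # 1 * 32 = 32 (and 4 * 8 = 32)
steps_per_epoch = math.ceil(train_samples / effective_batch)  # ~1,688
total_steps = steps_per_epoch * num_epochs                    # ~3,376

print(effective_batch, steps_per_epoch, total_steps)          # matches ~1,690 / ~3,380 above
```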

### What the Trainer Does

1. Loads the base model (Qwen2.5-Coder) with LoRA adapters targeting all attention + MLP projections (r=16, alpha=32)
2. Applies a custom chat template with `{% generation %}` markers so loss is computed only on assistant responses (`assistant_only_loss=True`)
3. Uses gradient checkpointing to reduce VRAM usage
4. For QLoRA: uses bitsandbytes 4-bit NF4 quantization with bf16 compute
5. Saves checkpoints periodically, keeping only the 3 most recent
6. Restores the original Qwen chat template (with tool-call support) before saving the final adapter
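
In code, that setup corresponds roughly to the sketch below. This is not the actual `finetuning/train.py`; the target-module list, dataset loading, and exact `SFTConfig` fields are illustrative assumptions based on the list above and on the CLI flags used earlier.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base = "Qwen/Qwen2.5-Coder-7B-Instruct"

# QLoRA path: 4-bit NF4 weights with bf16 compute (item 4 above).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)
# (The real script also swaps in a chat template with {% generation %} markers here.)

# LoRA on the attention + MLP projections, r=16, alpha=32 (item 1).
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Assistant-only loss, gradient checkpointing, periodic checkpoints (items 2, 3, 5).
args = SFTConfig(
    output_dir="finetuning_output_7b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=5e-5,
    num_train_epochs=2,
    bf16=True,
    gradient_checkpointing=True,
    assistant_only_loss=True,
    eval_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    save_total_limit=3,
)

data = load_dataset("json", data_files={"train": "training_data_output/train.jsonl",
                                        "dev": "training_data_output/dev.jsonl"})

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["dev"],
    peft_config=lora,
    processing_class=tokenizer,
)
trainer.train()
```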

## Step 6: Merge LoRA Adapter

After training, merge the LoRA adapter into the base model for standalone inference:

```bash
# For 1.5B model
python -m finetuning.train --merge \
    --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --output-dir finetuning_output_1.5b

# For 7B model
python -m finetuning.train --merge \
    --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
    --output-dir finetuning_output_7b
```

The merged model is saved to `<output-dir>/merged/` and can be loaded directly with `AutoModelForCausalLM.from_pretrained()`.
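
For a quick smoke test of a merged model, something like the following can be used (a minimal sketch; the prompt is illustrative only, since the real system prompt and schema layout come from the training-data generator, and the path assumes the 1.5B output directory above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "finetuning_output_1.5b/merged"   # produced by the merge step above

tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(merged_dir, torch_dtype=torch.float16, device_map="auto")

# Illustrative prompt only; the actual system prompt and schema formatting are
# defined by training_data/, not by this guide.
messages = [
    {"role": "system", "content": "You are a SQL assistant that writes pipe SQL."},
    {"role": "user", "content": "Question: How many singers are there?\n"
                                "Schema: singer(singer_id, name, age)\n"
                                "Query so far: FROM singer"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```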

> **Important**: Always specify `--model-name` matching the model used for training. The default is the 7B model, so for 1.5B merges you must pass it explicitly.

---

## Training Results (Reference)

All results are from an RTX 4080 16 GB on the subset dataset (2K pairs → 7,358 samples).

### 1.5B Smoke Test (1 epoch, float16)

| Metric | Start | End |
|--------|-------|-----|
| Train loss | 2.126 | 0.200 |
| Token accuracy | 67.4% | 96.1% |
| Steps | – | 429 |
| Runtime | – | ~35 min |

Smooth training curve. No eval was configured (single-epoch validation run).

### 1.5B Full (3 epochs, float16)

| Metric | Start | End |
|--------|-------|-----|
| Train loss | 2.172 | 0.191 |
| Token accuracy | 66.9% | 97.7% |
| Best eval loss | – | **0.191** (step 500, epoch 2.3) |
| Eval token accuracy | – | 95.8% |
| Steps | – | 645 |
| Runtime | – | ~1h 44min |

Training converged well, with the best checkpoint at step 500. The final train loss (0.132 at step 630) vs. eval loss (0.191) gives a gap of 0.059, indicating mild overfitting in the third epoch. The LoRA adapter merged successfully.

### 7B QLoRA (3 epochs, 4-bit): FAILED (Training Collapse)

| Metric | Start | Best (step 500) | Collapse (step 680) |
|--------|-------|-----------------|---------------------|
| Train loss | 2.271 | 0.253 | **7.05 → NaN** |
| Token accuracy | 66.5% | 97.4% | **58.6% → 0.0%** |
| Eval loss | – | **0.224** | NaN (step 1000) |
| Eval token accuracy | – | 95.8% | 0.0% |
| Grad norm | 0.11 | 0.031 | **NaN** |
| Steps | – | 500/1287 | 680/1287 |

**What happened**: Training progressed normally through step 610 (epoch ~1.42), then collapsed catastrophically:

| Step | Epoch | Loss | Accuracy | Grad Norm |
|------|-------|------|----------|-----------|
| 610 | 1.42 | 0.25 | 96.6% | 0.24 |
| 620 | 1.45 | 0.87 | 90.3% | 0.92 |
| 630 | 1.47 | 2.15 | 72.6% | 2.47 |
| 640 | 1.49 | 2.77 | 67.8% | 1.66 |
| 650 | 1.52 | 3.52 | 55.9% | 1.66 |
| 660 | 1.54 | 3.70 | 45.0% | 1.84 |
| 670 | 1.56 | 3.95 | 54.6% | 0.86 |
| 680 | 1.59 | **7.05** | 58.6% | **NaN** |
| 690+ | 1.61+ | 0.0 | 0.0% | NaN |

The model weights went to NaN at step 680 and stayed dead for the remaining ~600 steps. The loss spike correlates with a gradient-norm explosion (0.24 → 2.47 over 20 steps).

**Likely causes**:
1. The learning rate (2e-4) was too aggressive for the 7B model
2. A per-device batch size of 1 (even with grad_accum=16) causes high gradient variance
3. Possible numerical instability in 4-bit quantization + bf16 compute

**Salvageable**: checkpoint-500 (saved before the collapse) is still viable, with eval_loss=0.224 and accuracy=95.8%. To use it:
```bash
python -m finetuning.train --merge \
    --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
    --output-dir finetuning_output_7b \
    --checkpoint checkpoint-500
```

**Recommended fixes for re-training** (see the Troubleshooting section below):
- Lower the learning rate to 5e-5
- Increase gradient accumulation to 32 (effective batch size 32)
- Add explicit gradient clipping (`max_grad_norm=0.5`)

### Full Dataset Expectations (15K pairs → ~57K samples)

With 7.7x more data, we expect:
- **Lower eval loss** than the 0.191 subset baseline (better generalization from more diverse examples)
- **Smaller train-eval gap** (less overfitting with 2 epochs on more data)
- **1.5B**: ~3.5 hours for 2 epochs
- **7B QLoRA**: ~17 hours for 2 epochs (best run overnight)
- **Important**: Use the reduced learning rate (5e-5) and higher grad_accum (32) for 7B to avoid the collapse observed in the subset run

### VRAM Budget (RTX 4080, 16 GB)

| Model | Quantization | Model VRAM | Training Overhead | Total |
|-------|--------------|------------|-------------------|-------|
| 1.5B | float16 | ~3 GB | ~4 GB | ~7 GB |
| 7B | QLoRA 4-bit | ~4.5 GB | ~8 GB | ~12.5 GB |

---

## Project Structure

```
sqlglot/
├── validation_output/
│   └── golden_pairs_consolidated.jsonl   # 15,443 validated (gold_sql, pipe_sql) pairs
├── training_data/
│   ├── __main__.py              # Entry: python -m training_data.generate
│   ├── generate.py              # Main data generation pipeline
│   ├── formatter.py             # Chat sample formatting (incremental trajectory)
│   ├── tool_formatter.py        # Tool-calling sample generation
│   ├── trajectory.py            # Pipe query → step decomposition
│   ├── schema_extractor.py      # SQLite schema → text representation
│   ├── tool_executor.py         # Simulated tool execution for training
│   └── writer.py                # Train/dev split and JSONL output
├── finetuning/
│   ├── train.py                 # Main fine-tuning script
│   ├── config.py                # TrainConfig dataclass with CLI parsing
│   └── data.py                  # JSONL dataset loader
├── scripts/
│   ├── setup_data.sh            # Downloads Spider 1.0
│   ├── setup_bird_data.sh       # Downloads BIRD dev + train
│   └── train.sh                 # One-command data gen + training
├── training_data_output/        # Generated training data (not committed)
│   ├── train.jsonl
│   ├── dev.jsonl
│   └── stats.json
├── finetuning_output/           # Training outputs (not committed)
│   ├── checkpoint-*/            # Intermediate checkpoints
│   ├── final/                   # Final LoRA adapter
│   └── merged/                  # Merged standalone model
└── docs/design/
    ├── pipe-sql-fine-tuning-design-doc.md
    ├── pipe-sql-decompiler-design-doc.md
    ├── pipe-sql-validation-loop-design-doc.md
    └── pipe-sql-training-reproduction-guide.md   # This file
```

---

## Troubleshooting

### BFloat16 / FP16 AMP Error with QLoRA

**Error**: `NotImplementedError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'`

**Cause**: bitsandbytes 4-bit quantization produces BFloat16 parameters, which are incompatible with the FP16 AMP gradient scaler.

**Fix**: The training script automatically detects this and uses `bf16=True` when `--load-in-4bit` is set on CUDA. If you see this error, ensure you're using the latest `finetuning/train.py`.
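
The selection logic amounts to something like the sketch below (an illustration of the idea, not the actual code in `finetuning/train.py`):

```python
import torch

def pick_precision(load_in_4bit: bool) -> dict:
    """Choose bf16 vs. fp16 trainer flags, mirroring the fix described above."""
    use_cuda = torch.cuda.is_available()
    # 4-bit quantized weights come out as bfloat16, so the FP16 AMP grad scaler
    # must be avoided; prefer bf16 whenever the GPU supports it.
    use_bf16 = use_cuda and load_in_4bit and torch.cuda.is_bf16_supported()
    return {"bf16": use_bf16, "fp16": use_cuda and not use_bf16}

print(pick_precision(load_in_4bit=True))   # e.g. {'bf16': True, 'fp16': False} on RTX 4080
```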

### Model Loading on CPU Instead of GPU

**Symptom**: Training is extremely slow; logs show "Using float32 on CPU" despite having a CUDA GPU.

**Cause**: When using `--no-4bit` on CUDA, an earlier version of the code was missing the `elif use_cuda` branch in `load_model_and_tokenizer()`.

**Fix**: The current code includes proper device detection for all CUDA modes (4-bit and float16).

### Wrong Base Model During Merge

**Symptom**: `RuntimeError` or size mismatch when running `--merge`.

**Cause**: The default `--model-name` is `Qwen/Qwen2.5-Coder-7B-Instruct`. If you trained the 1.5B model, you must specify the correct base model during merge.

**Fix**: Always pass `--model-name` matching the model used for training:
```bash
python -m finetuning.train --merge \
    --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --output-dir finetuning_output
```

### 7B QLoRA Training Collapse (Loss → NaN)

**Symptom**: Training loss spikes dramatically around epoch 1.4-1.6, the gradient norm explodes, and then all metrics go to NaN/0.0 for the remaining steps.

**Cause**: The combination of a high learning rate (2e-4), a small per-device batch size (1), and 4-bit quantization creates conditions for numerical instability. A single bad gradient update can cascade: once gradient norms exceed ~1.0, the model enters an irrecoverable divergence loop that ends in NaN weights.

**Fix**: Apply all three mitigations:

```bash
python -m finetuning.train \
    --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
    --train-data training_data_output/train.jsonl \
    --dev-data training_data_output/dev.jsonl \
    --max-seq-length 4096 \
    --per-device-train-batch-size 1 \
    --gradient-accumulation-steps 32 \
    --num-epochs 2 \
    --learning-rate 5e-5 \
    --load-in-4bit \
    --save-steps 500 \
    --eval-steps 500 \
    --output-dir finetuning_output_7b
```

Key changes from the failed run:

| Parameter | Failed Run | Recommended |
|-----------|------------|-------------|
| `--learning-rate` | 2e-4 (default) | **5e-5** |
| `--gradient-accumulation-steps` | 16 | **32** |
| `--num-epochs` | 3 | **2** |
| `max_grad_norm` | 1.0 (default) | **0.5** (if supported) |

**Recovery**: If training has already collapsed, the last good checkpoint before the spike is still usable. Check `trainer_state.json` in each checkpoint directory, look for the last one with normal loss values, and merge from there.
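
A short script like the one below can locate the last healthy checkpoint (a minimal sketch; it assumes the standard `trainer_state.json` layout written by the HF Trainer, with a `log_history` list of step records):

```python
import json
import math
from pathlib import Path

output_dir = Path("finetuning_output_7b")   # training output directory from above

# Walk checkpoints in step order and report each one's last logged train loss.
for ckpt in sorted(output_dir.glob("checkpoint-*"), key=lambda p: int(p.name.split("-")[1])):
    state = json.loads((ckpt / "trainer_state.json").read_text())
    losses = [rec["loss"] for rec in state.get("log_history", []) if "loss" in rec]
    last = losses[-1] if losses else float("nan")
    healthy = math.isfinite(last) and last < 1.0   # rough threshold for "normal" loss
    print(f"{ckpt.name}: last train loss = {last:.3f} ({'ok' if healthy else 'SUSPECT'})")
```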

### First Run Downloads Are Slow

The first time you run training, HuggingFace downloads the model weights (~3 GB for 1.5B, ~15 GB for 7B). Subsequent runs use the cached weights from `~/.cache/huggingface/`. For faster downloads, set a HuggingFace token:

```bash
huggingface-cli login
```

---

## Full Reproduction Checklist

- [x] Python 3.11 virtual environment created
- [x] PyTorch with CUDA support installed and verified
- [x] Spider 1.0 databases downloaded (~166 DBs)
- [x] BIRD databases downloaded (~81 DBs)
- [x] Training data generated from golden pairs
- [x] Smoke test passed (1.5B, 1 epoch: loss 2.13 → 0.20, accuracy 96.1%)
- [x] Full 1.5B training completed (3 epochs: eval_loss=0.191, accuracy 95.8%)
- [x] 1.5B LoRA adapter merged (`finetuning_output/merged/`)
- [ ] 7B QLoRA training: **collapsed at epoch 1.5** (checkpoint-500 salvageable, needs re-run with lower LR)
- [ ] 7B LoRA adapter merged
- [ ] Full dataset training (15K pairs): pending 7B fix