Commit aeefd70 (verified, parent 5f143d7) by Zheyuan Zhao: Add design doc `docs/pipe-sql-training-reproduction-guide.md`
# Pipe SQL Fine-Tuning: Reproduction Guide

This document describes how to reproduce the pipe SQL fine-tuning pipeline end-to-end, from a fresh clone of the repository to a trained model. It covers environment setup, data preparation, training data generation, and model fine-tuning.

For the design rationale behind this system, see [pipe-sql-fine-tuning-design-doc.md](pipe-sql-fine-tuning-design-doc.md).

---

## Prerequisites

- **GPU**: NVIDIA GPU with >=16 GB VRAM (tested on an RTX 4080 16 GB)
- **NVIDIA driver**: 525+ (CUDA 12.x compatible)
- **OS**: Windows 11 or Linux (commands below use bash; on Windows, use Git Bash or WSL)
- **uv**: Python package manager ([install guide](https://docs.astral.sh/uv/getting-started/installation/))
- **Disk**: ~15 GB for benchmark databases, ~15 GB for model weights (cached by HuggingFace)

---

## Step 1: Clone and Create Python Environment

```bash
git clone <repo-url>
cd sqlglot

# Create a Python 3.11 virtual environment
uv venv .venv --python 3.11
source .venv/Scripts/activate   # Windows (Git Bash)
# source .venv/bin/activate     # Linux/macOS
```

## Step 2: Install Dependencies

```bash
# Install sqlglot in editable mode (puts training_data/ and finetuning/ on sys.path)
uv pip install -e .

# Install PyTorch with CUDA 12.6 support
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install the ML training stack
uv pip install transformers peft trl datasets bitsandbytes accelerate

# For the Spider dataset download (Google Drive)
uv pip install gdown
```

**Verify CUDA**:
```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX 4080
```

> **Note**: PyTorch cu126 wheels bundle their own CUDA runtime. You do NOT need to upgrade your system CUDA toolkit; any NVIDIA driver >=525 works.

## Step 3: Download Benchmark Databases

Training data generation requires the SQLite databases from the Spider 1.0 and BIRD benchmarks to extract schemas.

```bash
# Spider 1.0 (~1 GB, downloads from Google Drive via gdown)
bash scripts/setup_data.sh

# BIRD dev + train sets (~9 GB, downloads via curl)
bash scripts/setup_bird_data.sh
```

**Verify**:
```bash
ls data/spider/database | wc -l                  # ~166 databases
ls data/bird/train/train_databases | wc -l       # ~70 databases
ls data/bird/dev_20240627/dev_databases | wc -l  # ~11 databases
```

## Step 4: Generate Training Data

This step reads the 15,443 validated golden pairs (standard SQL ↔ pipe SQL) and generates incremental chat training samples. Each N-operator pipe query is decomposed into N training samples, so the model learns to emit one pipe operator at a time.

```bash
# Full dataset (recommended for production training)
python -m training_data.generate \
    --golden-pairs validation_output/golden_pairs_consolidated.jsonl \
    --db-dir data/spider/database \
    --db-dir data/bird/train/train_databases \
    --db-dir data/bird/dev_20240627/dev_databases \
    --output-dir training_data_output \
    --tool-calling --tool-ratio 0.3

# Subset for quick iteration (add --limit)
python -m training_data.generate \
    --golden-pairs validation_output/golden_pairs_consolidated.jsonl \
    --db-dir data/spider/database \
    --db-dir data/bird/train/train_databases \
    --db-dir data/bird/dev_20240627/dev_databases \
    --output-dir training_data_output \
    --tool-calling --tool-ratio 0.3 \
    --limit 2000
```

| Flag | Description |
|------|-------------|
| `--golden-pairs` | JSONL file with `{gold_sql, pipe_sql, db_id, question_id, question}` entries |
| `--db-dir` | Directory containing SQLite databases (repeatable) |
| `--tool-calling` | Also generate agentic tool-calling training samples |
| `--tool-ratio 0.3` | 30% of golden pairs get an additional tool-calling sample |
| `--limit N` | Process only the first N pairs (omit for the full dataset) |

**Expected output**:

| Input | Total Samples | Train (95%) | Dev (5%) | Tool-calling |
|-------|--------------|-------------|----------|--------------|
| `--limit 2000` | ~7,400 | ~6,900 | ~500 | ~580 |
| All 15,443 pairs | ~57,000 | ~54,000 | ~2,800 | ~4,600 |

Each golden pair produces ~3.7 training samples on average (trajectory decomposition amplification). Output files: `train.jsonl`, `dev.jsonl`, `stats.json`.

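As a sanity check on the table above, the amplification factor and pair count line up with the reported total (a back-of-envelope estimate, not an exact count):

```python
# Back-of-envelope check of trajectory amplification (approximate).
pairs = 15_443
avg_samples_per_pair = 3.7
total = round(pairs * avg_samples_per_pair)
total  # ~57,000, consistent with the "All 15,443 pairs" row above
```
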
### Training Data Format

Each sample is a chat conversation in OpenAI format:

```json
{
  "messages": [
    {"role": "system", "content": "You are a SQL assistant that writes pipe SQL..."},
    {"role": "user", "content": "Question: ... Schema: ... Query so far: FROM t |> WHERE ..."},
    {"role": "assistant", "content": "|> AGGREGATE COUNT(*) AS cnt GROUP BY department"}
  ]
}
```

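The incremental trajectory behind these samples can be sketched as follows. This is a hypothetical simplification (the real decomposition lives in `training_data/trajectory.py` and parses the query properly rather than splitting on strings):

```python
def decompose(pipe_sql: str) -> list[dict]:
    """Split an N-operator pipe query into N incremental training samples."""
    parts = [p.strip() for p in pipe_sql.split("|>")]
    prefix, ops = parts[0], ["|> " + p for p in parts[1:]]
    samples = []
    for op in ops:
        # The model sees the query so far and must emit the next operator.
        samples.append({"query_so_far": prefix, "target": op})
        prefix = f"{prefix}\n{op}"
    return samples

samples = decompose("FROM t |> WHERE x > 1 |> AGGREGATE COUNT(*) AS cnt")
# Two pipe operators -> two samples, each targeting one "|> ..." step.
```
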
## Step 5: Fine-Tune the Model

### Quick Start (One Command)

The `scripts/train.sh` wrapper handles data generation + training:

```bash
# Smoke test (~5 min, 1 epoch, 100 samples)
bash scripts/train.sh --smoke-test

# Full training (1.5B model, 3 epochs, ~2 hours)
bash scripts/train.sh
```

### Manual Training Commands

#### 5a. Smoke Test (1.5B, 1 epoch, small subset)

Validates that the pipeline works end-to-end. Use a small dataset generated with `--limit 2000`:

```bash
python -m finetuning.train \
    --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --train-data training_data_output/train.jsonl \
    --dev-data training_data_output/dev.jsonl \
    --max-seq-length 4096 \
    --per-device-train-batch-size 4 \
    --gradient-accumulation-steps 4 \
    --num-epochs 1 \
    --no-4bit \
    --output-dir finetuning_output_smoke
```

Expected: loss drops from ~2.1 to ~0.2, and token accuracy rises to ~96%.

#### 5b. Full 1.5B Training (recommended: full dataset, 2 epochs)

```bash
python -m finetuning.train \
    --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --train-data training_data_output/train.jsonl \
    --dev-data training_data_output/dev.jsonl \
    --max-seq-length 4096 \
    --per-device-train-batch-size 4 \
    --gradient-accumulation-steps 8 \
    --num-epochs 2 \
    --no-4bit \
    --output-dir finetuning_output_1.5b
```

#### 5c. 7B QLoRA Training (recommended: full dataset, 2 epochs)

For the full-size model, using 4-bit quantization to fit in 16 GB VRAM:

```bash
python -m finetuning.train \
    --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
    --train-data training_data_output/train.jsonl \
    --dev-data training_data_output/dev.jsonl \
    --max-seq-length 4096 \
    --per-device-train-batch-size 1 \
    --gradient-accumulation-steps 32 \
    --learning-rate 5e-5 \
    --num-epochs 2 \
    --load-in-4bit \
    --save-steps 1000 \
    --eval-steps 1000 \
    --output-dir finetuning_output_7b
```

> **Important**: The lower learning rate (5e-5 vs the default 2e-4) is critical for 7B stability. An earlier run with 2e-4 collapsed to NaN at epoch ~1.5. See the Troubleshooting section for details.

### Recommended Configurations

The table below shows recommended settings for both dataset sizes. With the full dataset (15,443 pairs → ~54K train samples), 2 epochs is optimal: 7.7x more data reduces overfitting risk, and eval loss plateaus by epoch 2. With the smaller subset, 3 epochs compensates for the limited data.

**1.5B (float16, `--no-4bit`)**:

| Parameter | Subset (2K pairs) | Full (15K pairs) |
|-----------|-------------------|-------------------|
| `--num-epochs` | 3 | **2** |
| `--per-device-train-batch-size` | 4 | 4 |
| `--gradient-accumulation-steps` | 8 | 8 |
| Effective batch size | 32 | 32 |
| Steps/epoch | ~215 | ~1,690 |
| Total steps | ~645 | ~3,380 |
| VRAM usage | ~7 GB | ~7 GB |
| Est. time (RTX 4080) | ~1h 44min | **~3.5 hours** |

**7B QLoRA (4-bit, `--load-in-4bit`)**:

| Parameter | Subset (2K pairs) | Full (15K pairs) |
|-----------|-------------------|-------------------|
| `--num-epochs` | 2 | **2** |
| `--per-device-train-batch-size` | 1 | 1 |
| `--gradient-accumulation-steps` | 32 | **32** |
| `--learning-rate` | **5e-5** | **5e-5** |
| Effective batch size | 32 | **32** |
| `--save-steps` / `--eval-steps` | 500 | **1000** |
| Steps/epoch | ~429 | ~1,690 |
| Total steps | ~858 | ~3,380 |
| VRAM usage | ~12.5 GB | ~12.5 GB |
| Est. time (RTX 4080) | ~2 hours | **~17 hours** |

> **Note**: Earlier runs with `--learning-rate 2e-4` and `--gradient-accumulation-steps 16` over 3 epochs caused a training collapse at epoch ~1.5 (loss → NaN). The settings above reflect the corrected configuration.

> **Tip**: Run 1.5B first as a quick validation (~3.5h). If eval loss improves over the subset baseline (0.191), the full dataset is working well. Then kick off the 7B run overnight.

### Why 2 Epochs for the Full Dataset?

With the 2K subset (3 epochs), we observed:
- Train loss 0.132 vs eval loss 0.191: a gap of 0.059, indicating mild overfitting
- Eval loss plateaued between epochs 2 and 3

With 7.7x more training data, the model sees far more diverse examples per epoch. Two epochs provide sufficient coverage while avoiding diminishing returns. More data beats more epochs.

### Why grad_accum=32 for the Full 7B Run?

Doubling gradient accumulation from 16 to 32 (effective batch 32) halves the number of optimizer steps while keeping the total number of forward/backward passes identical. Each optimizer step then uses a lower-variance gradient estimate, giving more stable training. This doesn't change wall-clock time, but it produces better-calibrated updates.

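The step counts in the tables above follow directly from the effective batch size. A quick check (sample counts are approximate):

```python
import math

def steps_per_epoch(n_samples: int, per_device_batch: int, grad_accum: int) -> int:
    # One optimizer step consumes per_device_batch * grad_accum samples.
    effective_batch = per_device_batch * grad_accum
    return math.ceil(n_samples / effective_batch)

steps_per_epoch(54_000, 4, 8)   # 1.5B, full dataset: ~1,690 steps/epoch
steps_per_epoch(54_000, 1, 32)  # 7B QLoRA: same effective batch, same step count
```
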
### What the Trainer Does

1. Loads the base model (Qwen2.5-Coder) with LoRA adapters targeting all attention + MLP projections (r=16, alpha=32)
2. Applies a custom chat template with `{% generation %}` markers so loss is computed only on assistant responses (`assistant_only_loss=True`)
3. Uses gradient checkpointing to reduce VRAM usage
4. For QLoRA: uses bitsandbytes 4-bit NF4 quantization with bf16 compute
5. Saves checkpoints periodically, keeping the 3 most recent
6. Restores the original Qwen chat template (with tool-call support) before saving the final adapter

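Items 1 and 4 correspond roughly to the following configuration objects. This is a sketch with assumed shapes, not the actual code; the authoritative settings live in `finetuning/train.py` and `finetuning/config.py`:

```python
# Sketch of the LoRA and QLoRA configuration described above (assumed shapes).
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 1. LoRA adapters on all attention + MLP projections (r=16, alpha=32)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# 4. QLoRA path: 4-bit NF4 quantization with bf16 compute
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```
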
## Step 6: Merge LoRA Adapter

After training, merge the LoRA adapter into the base model for standalone inference:

```bash
# For the 1.5B model
python -m finetuning.train --merge \
    --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --output-dir finetuning_output_1.5b

# For the 7B model
python -m finetuning.train --merge \
    --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
    --output-dir finetuning_output_7b
```

The merged model is saved to `<output-dir>/merged/` and can be loaded directly with `AutoModelForCausalLM.from_pretrained()`.

> **Important**: Always specify `--model-name` matching the model used for training. The default is 7B, so for 1.5B merges you must pass it explicitly.

---

## Training Results (Reference)

All results are on an RTX 4080 16 GB with the subset dataset (2K pairs → 7,358 samples).

### 1.5B Smoke Test (1 epoch, float16)

| Metric | Start | End |
|--------|-------|-----|
| Train loss | 2.126 | 0.200 |
| Token accuracy | 67.4% | 96.1% |
| Steps | – | 429 |
| Runtime | – | ~35 min |

Smooth training curve. No eval was configured (single-epoch validation run).

### 1.5B Full (3 epochs, float16)

| Metric | Start | End |
|--------|-------|-----|
| Train loss | 2.172 | 0.191 |
| Token accuracy | 66.9% | 97.7% |
| Best eval loss | – | **0.191** (step 500, epoch 2.3) |
| Eval token accuracy | – | 95.8% |
| Steps | – | 645 |
| Runtime | – | ~1h 44min |

Training converged well, with the best checkpoint at step 500. Final train loss (0.132 at step 630) vs eval loss (0.191) shows a gap of 0.059, indicating mild overfitting in the third epoch. The LoRA adapter merged successfully.

### 7B QLoRA (3 epochs, 4-bit): FAILED (Training Collapse)

| Metric | Start | Best (step 500) | Collapse (step 680) |
|--------|-------|-----------------|---------------------|
| Train loss | 2.271 | 0.253 | **7.05 → NaN** |
| Token accuracy | 66.5% | 97.4% | **58.6% → 0.0%** |
| Eval loss | – | **0.224** | NaN (step 1000) |
| Eval token accuracy | – | 95.8% | 0.0% |
| Grad norm | 0.11 | 0.031 | **NaN** |
| Steps | – | 500/1287 | 680/1287 |

**What happened**: Training progressed normally through step 610 (epoch ~1.42), then collapsed catastrophically:

| Step | Epoch | Loss | Accuracy | Grad Norm |
|------|-------|------|----------|-----------|
| 610 | 1.42 | 0.25 | 96.6% | 0.24 |
| 620 | 1.45 | 0.87 | 90.3% | 0.92 |
| 630 | 1.47 | 2.15 | 72.6% | 2.47 |
| 640 | 1.49 | 2.77 | 67.8% | 1.66 |
| 650 | 1.52 | 3.52 | 55.9% | 1.66 |
| 660 | 1.54 | 3.70 | 45.0% | 1.84 |
| 670 | 1.56 | 3.95 | 54.6% | 0.86 |
| 680 | 1.59 | **7.05** | 58.6% | **NaN** |
| 690+ | 1.61+ | 0.0 | 0.0% | NaN |

The model weights went to NaN at step 680 and remained dead for the remaining ~600 steps. The loss spike correlates with a gradient norm explosion (0.24 → 2.47 over 20 steps).

**Likely causes**:
1. Learning rate (2e-4) too aggressive for the 7B model
2. Batch size of 1 (even with grad_accum=16) causes high gradient variance
3. Possible numerical instability in 4-bit quantization + bf16 compute

**Salvageable**: checkpoint-500 (taken before the collapse) is still viable: eval_loss=0.224, accuracy=95.8%. To use it:
```bash
python -m finetuning.train --merge \
    --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
    --output-dir finetuning_output_7b \
    --checkpoint checkpoint-500
```

**Recommended fixes for re-training** (see the Troubleshooting section below):
- Lower the learning rate to 5e-5
- Increase gradient accumulation to 32 (effective batch 32)
- Add explicit gradient clipping (`max_grad_norm=0.5`)

### Full Dataset Expectations (15K pairs → ~57K samples)

With 7.7x more data, we expect:
- **Lower eval loss** than the 0.191 subset baseline (better generalization from more diverse examples)
- **Smaller train-eval gap** (less overfitting with 2 epochs on more data)
- **1.5B**: ~3.5 hours for 2 epochs
- **7B QLoRA**: ~17 hours for 2 epochs (best run overnight)
- **Important**: Use the reduced learning rate (5e-5) and higher grad_accum (32) for 7B to avoid the collapse observed in the subset run

### VRAM Budget (RTX 4080, 16 GB)

| Model | Quantization | Model VRAM | Training Overhead | Total |
|-------|-------------|------------|-------------------|-------|
| 1.5B | float16 | ~3 GB | ~4 GB | ~7 GB |
| 7B | QLoRA 4-bit | ~4.5 GB | ~8 GB | ~12.5 GB |

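The ~4.5 GB model figure for the 4-bit 7B run is consistent with a back-of-envelope estimate. The parameter count below is approximate, and the overhead terms are assumptions:

```python
# Rough 4-bit weight footprint: ~0.5 bytes per parameter.
params = 7.6e9                   # Qwen2.5-Coder-7B, approximate parameter count
raw_gib = params * 0.5 / 2**30   # ~3.5 GiB of packed 4-bit weights
# NF4 block scales plus layers kept in higher precision (e.g. embeddings)
# plausibly account for the gap up to the observed ~4.5 GB.
```
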
---

## Project Structure

```
sqlglot/
├── validation_output/
│   └── golden_pairs_consolidated.jsonl  # 15,443 validated (gold_sql, pipe_sql) pairs
├── training_data/
│   ├── __main__.py           # Entry: python -m training_data.generate
│   ├── generate.py           # Main data generation pipeline
│   ├── formatter.py          # Chat sample formatting (incremental trajectory)
│   ├── tool_formatter.py     # Tool-calling sample generation
│   ├── trajectory.py         # Pipe query → step decomposition
│   ├── schema_extractor.py   # SQLite schema → text representation
│   ├── tool_executor.py      # Simulated tool execution for training
│   └── writer.py             # Train/dev split and JSONL output
├── finetuning/
│   ├── train.py              # Main fine-tuning script
│   ├── config.py             # TrainConfig dataclass with CLI parsing
│   └── data.py               # JSONL dataset loader
├── scripts/
│   ├── setup_data.sh         # Downloads Spider 1.0
│   ├── setup_bird_data.sh    # Downloads BIRD dev + train
│   └── train.sh              # One-command data gen + training
├── training_data_output/     # Generated training data (not committed)
│   ├── train.jsonl
│   ├── dev.jsonl
│   └── stats.json
├── finetuning_output/        # Training outputs (not committed)
│   ├── checkpoint-*/         # Intermediate checkpoints
│   ├── final/                # Final LoRA adapter
│   └── merged/               # Merged standalone model
└── docs/design/
    ├── pipe-sql-fine-tuning-design-doc.md
    ├── pipe-sql-decompiler-design-doc.md
    ├── pipe-sql-validation-loop-design-doc.md
    └── pipe-sql-training-reproduction-guide.md  # This file
```

---

## Troubleshooting

### BFloat16 / FP16 AMP Error with QLoRA

**Error**: `NotImplementedError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'`

**Cause**: bitsandbytes 4-bit quantization produces BFloat16 parameters, which are incompatible with the FP16 AMP gradient scaler.

**Fix**: The training script automatically detects this and uses `bf16=True` when `--load-in-4bit` is set on CUDA. If you see this error, make sure you're using the latest `finetuning/train.py`.

### Model Loading on CPU Instead of GPU

**Symptom**: Training is extremely slow; logs show "Using float32 on CPU" despite a CUDA GPU being available.

**Cause**: When using `--no-4bit` on CUDA, an earlier version of the code was missing the `elif use_cuda` branch in `load_model_and_tokenizer()`.

**Fix**: The current code includes proper device detection for all CUDA modes (4-bit and float16).

### Wrong Base Model During Merge

**Symptom**: `RuntimeError` or a size mismatch when running `--merge`.

**Cause**: The default `--model-name` is `Qwen/Qwen2.5-Coder-7B-Instruct`. If you trained the 1.5B model, you must specify the correct base model during the merge.

**Fix**: Always pass `--model-name` matching the model used for training:
```bash
python -m finetuning.train --merge \
    --model-name Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --output-dir finetuning_output
```

### 7B QLoRA Training Collapse (Loss → NaN)

**Symptom**: Training loss spikes dramatically around epoch 1.4–1.6, the gradient norm explodes, and then all metrics go to NaN/0.0 for the remaining steps.

**Cause**: The combination of a high learning rate (2e-4), a small per-device batch size (1), and 4-bit quantization creates conditions for numerical instability. A single bad gradient update can cascade: once gradient norms exceed ~1.0, the model enters an irrecoverable divergence loop that ends in NaN weights.

**Fix**: Apply all three mitigations:

```bash
python -m finetuning.train \
    --model-name Qwen/Qwen2.5-Coder-7B-Instruct \
    --train-data training_data_output/train.jsonl \
    --dev-data training_data_output/dev.jsonl \
    --max-seq-length 4096 \
    --per-device-train-batch-size 1 \
    --gradient-accumulation-steps 32 \
    --num-epochs 2 \
    --learning-rate 5e-5 \
    --load-in-4bit \
    --save-steps 500 \
    --eval-steps 500 \
    --output-dir finetuning_output_7b
```

Key changes from the failed run:

| Parameter | Failed Run | Recommended |
|-----------|-----------|-------------|
| `--learning-rate` | 2e-4 (default) | **5e-5** |
| `--gradient-accumulation-steps` | 16 | **32** |
| `--num-epochs` | 3 | **2** |
| `max_grad_norm` | 1.0 (default) | **0.5** (if supported) |

**Recovery**: If training has already collapsed, the last good checkpoint before the spike is still usable. Check `trainer_state.json` in each checkpoint directory, find the last one with normal loss values, and merge from there.

### First Run Downloads Are Slow

The first time you run training, HuggingFace downloads the model weights (~3 GB for 1.5B, ~15 GB for 7B). Subsequent runs use the cached weights from `~/.cache/huggingface/`. To authenticate and avoid download rate limits, log in with a HuggingFace token:

```bash
huggingface-cli login
```

---

## Full Reproduction Checklist

- [x] Python 3.11 virtual environment created
- [x] PyTorch with CUDA support installed and verified
- [x] Spider 1.0 databases downloaded (~166 DBs)
- [x] BIRD databases downloaded (~81 DBs)
- [x] Training data generated from golden pairs
- [x] Smoke test passed (1.5B, 1 epoch: loss 2.13 → 0.20, accuracy 96.1%)
- [x] Full 1.5B training completed (3 epochs: eval_loss=0.191, accuracy 95.8%)
- [x] 1.5B LoRA adapter merged (`finetuning_output/merged/`)
- [ ] 7B QLoRA training: **collapsed at epoch 1.5** (checkpoint-500 salvageable; needs a re-run with a lower LR)
- [ ] 7B LoRA adapter merged
- [ ] Full dataset training (15K pairs): pending the 7B fix