laterabhi committed afa8b1d · verified · 1 parent: 85cd109

Upload Blog.md with huggingface_hub

Blog.md added (+84 -0)

# GRPO Training for SQL Query Optimization

## Overview
This project fine-tunes `Qwen/Qwen2.5-0.5B-Instruct` with GRPO (Group Relative Policy Optimization),
a reinforcement learning algorithm, to optimize SQL queries against a DuckDB execution environment.

## Problem Statement
SQL query optimization is critical for database performance. This project trains an LLM
to automatically identify and fix SQL anti-patterns using RL with verifiable rewards.

## Approach

### Environment
- Uses the [SQL Query Optimization Environment](https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-)
- DuckDB-based execution environment with 5 tasks of increasing difficulty
- Tasks: basic anti-patterns, correlated subqueries, wildcard scans, implicit joins, window functions

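To make the correlated-subquery task concrete, here is a minimal sketch of a rewrite together with a result-equivalence check. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs with the standard library alone. The tables, data, and the `same_results` helper are hypothetical illustrations, and the environment's actual tasks and checks may differ.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 25.0), (3, 2, 40.0);
""")

# Anti-pattern: a correlated subquery evaluated once per customer row.
original = """
    SELECT c.name,
           (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS n_orders
    FROM customers c
"""

# Rewrite: a single LEFT JOIN with GROUP BY that keeps customers without orders.
optimized = """
    SELECT c.name, COUNT(o.id) AS n_orders
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
"""


def same_results(conn, sql_a, sql_b):
    """Compare two queries' results as sorted row lists, ignoring row order."""
    rows_a = sorted(conn.execute(sql_a).fetchall())
    rows_b = sorted(conn.execute(sql_b).fetchall())
    return rows_a == rows_b


print(same_results(conn, original, optimized))
```

Sorting both result sets before comparing treats them as multisets, so a rewrite is not penalized for returning the same rows in a different order. A rewrite that used an inner join instead would drop the customer with no orders and fail the check.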
### GRPO Training
- **Algorithm:** GRPO (Group Relative Policy Optimization)
- **Base Model:** `Qwen/Qwen2.5-0.5B-Instruct`
- **Episodes:** 100
- **Group Size:** 4 completions per prompt
- **Hardware:** Kaggle, 2× NVIDIA T4 GPUs

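The group size matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that normalization step in plain Python. The function name, the example rewards, and the use of the population standard deviation are assumptions made for illustration, not the training code used here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; libraries may differ
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for the 4 completions sampled for one prompt.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)

# A group where every completion gets the same reward yields all-zero advantages.
flat_advantages = group_relative_advantages([0.08] * 4)

print(advantages)
print(flat_advantages)
```

Completions above the group mean get positive advantages and those below get negative ones. When all rewards in a group are identical, every advantage is zero and the group contributes no learning signal.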
### Reward Function
The reward function combines multiple signals:
- `execution_speedup`: how much faster the optimized query runs than the original
- `result_correctness`: whether the optimized query returns results identical to the original's
- `issue_detection`: whether SQL anti-patterns were correctly identified
- `approval_correctness`: whether the approval flag is set correctly
- `summary_quality`: quality of the explanation
- `severity_labels`: correctness of the severity ratings

A bonus reward is added for correct issue detection even when SQL execution fails,
which provides a useful gradient signal for partial progress.

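One way to combine these signals is a weighted sum with a fallback path for failed executions. The sketch below is only an illustration: the post does not state the weights, the bonus size, or how the signals are scaled, so `WEIGHTS`, `DETECTION_BONUS`, and the assumption that every signal lies in [0, 1] are all hypothetical.

```python
# Hypothetical weights: the post lists the signals but not how they are weighted.
WEIGHTS = {
    "execution_speedup": 0.30,
    "result_correctness": 0.30,
    "issue_detection": 0.15,
    "approval_correctness": 0.10,
    "summary_quality": 0.10,
    "severity_labels": 0.05,
}
DETECTION_BONUS = 0.10  # hypothetical size of the partial-credit bonus

# Signals that can be scored without running the rewritten query.
TEXT_ONLY = ("issue_detection", "approval_correctness", "summary_quality", "severity_labels")


def combined_reward(signals, executed_ok):
    """Weighted sum of signals in [0, 1], with partial credit when execution fails."""
    if executed_ok:
        return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
    # Execution failed: speedup and correctness cannot be measured, so score only
    # the text-based signals and add a bonus scaled by issue detection.
    base = sum(WEIGHTS[name] * signals.get(name, 0.0) for name in TEXT_ONLY)
    return base + DETECTION_BONUS * signals.get("issue_detection", 0.0)


# A completion with invalid SQL that still names the right anti-patterns.
print(combined_reward({"issue_detection": 1.0}, executed_ok=False))
```

With a fallback like this, two completions that both fail to execute can still receive different rewards, which is what gives the group some variance to learn from.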
## Results

### Training Progress
| Metric | Value |
|--------|-------|
| Start avg (episodes 1-10) | 0.3090 |
| End avg (episodes 91-100) | 0.5962 |
| Relative improvement | +93% |

### Final Evaluation
| Task | Difficulty | Score |
|------|------------|-------|
| task_1_basic_antipatterns | easy | 0.7500 ✅ |
| task_2_correlated_subqueries | medium | 0.8313 ✅ |
| task_3_wildcard_scan | medium-hard | 0.9250 ✅ |
| task_4_implicit_join | hard | 0.6438 ✅ |
| task_5_window_functions | expert | 0.6250 ⚠️ |
| **Average** | | **0.7550** |

**Baseline (original query unchanged): 0.6300**

**Improvement over baseline: +0.1250 absolute (about +19.8% relative)**

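The summary figures can be recomputed from the reported numbers. The short check below uses only values from the two tables above and distinguishes the absolute gain over the baseline from the relative gain. The variable names are illustrative.

```python
# Per-task scores from the Final Evaluation table.
scores = {
    "task_1_basic_antipatterns": 0.7500,
    "task_2_correlated_subqueries": 0.8313,
    "task_3_wildcard_scan": 0.9250,
    "task_4_implicit_join": 0.6438,
    "task_5_window_functions": 0.6250,
}
baseline = 0.6300

average = sum(scores.values()) / len(scores)
absolute_gain = average - baseline        # difference in score
relative_gain = absolute_gain / baseline  # difference as a fraction of the baseline

# Training Progress table: first-10 vs last-10 episode averages.
start_avg, end_avg = 0.3090, 0.5962
training_gain = (end_avg - start_avg) / start_avg

print(f"average={average:.4f} absolute={absolute_gain:+.4f} relative={relative_gain:+.1%}")
print(f"training improvement={training_gain:+.1%}")
```

The gap of 0.1250 is a difference in score, while the gain relative to the 0.6300 baseline is roughly 19.8%. The training improvement is a relative figure, about 93%.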
### Training Curve
![Training Curve](grpo_results.png)

## Key Findings

1. **Reward variance is critical.** Early runs had flat rewards of 0.08 because the model
   generated invalid SQL. Fixing the prompt to include schema information created the reward
   variance GRPO needs to learn: when every completion in a group gets the same reward,
   the group-relative advantages are all zero.

2. **Prompt engineering matters for RL.** Explicitly telling the model to use only
   columns from the schema was the single most impactful fix.

3. **Partial credit helps.** Adding an issue-detection bonus gave the model a learning
   signal even when SQL execution failed.

4. **Task difficulty affects learning.** Harder tasks (implicit joins, window functions)
   consistently scored lower, suggesting curriculum learning could help.

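The schema fix in finding 2 can be pictured as a prompt builder that lists the allowed tables and columns before the query. This is a hedged sketch: the prompt wording, the `build_prompt` helper, and the example schema are hypothetical and are not the prompt used in training.

```python
def build_prompt(query, schema):
    """Render a prompt that lists the allowed tables and columns before the query."""
    schema_lines = [f"- {table}({', '.join(columns)})" for table, columns in schema.items()]
    return "\n".join([
        "You are a SQL optimization assistant.",
        "Use ONLY the tables and columns listed in this schema:",
        *schema_lines,
        "",
        "Optimize the following query:",
        query,
    ])


# Hypothetical schema and query for illustration.
schema = {
    "customers": ["id", "name"],
    "orders": ["id", "customer_id", "total"],
}
query = "SELECT * FROM orders WHERE total > 100"
prompt = build_prompt(query, schema)
print(prompt)
```

Stating the schema in the prompt gives the model the real column names to work from, which is the change the post credits with turning invalid SQL into executable queries.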
## Model
https://huggingface.co/laterabhi/grpo-sql-optimizer

## References
- [GRPO Paper - DeepSeekMath](https://arxiv.org/abs/2402.03300)
- [TRL Library](https://huggingface.co/docs/trl)
- [SQL Optimization Environment](https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-)
- [Qwen2.5 Model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)