# GRPO Training for SQL Query Optimization

## Overview
This project fine-tunes `Qwen/Qwen2.5-0.5B-Instruct` with GRPO (Group Relative Policy Optimization),
a reinforcement learning algorithm, to optimize SQL queries against a DuckDB execution environment.

## Problem Statement
SQL query optimization is critical for database performance. This project trains an LLM
to automatically identify and fix SQL anti-patterns using RL with verifiable rewards.

## Approach

### Environment
- Used [SQL Query Optimization Environment](https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-)
- DuckDB-based execution environment with 5 tasks of increasing difficulty
- Tasks: basic antipatterns, correlated subqueries, wildcard scans, implicit joins, window functions
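
To make one of these task types concrete, here is a minimal sketch of an implicit-join rewrite and a result-equivalence check. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs without extra dependencies. The tables, the data, and the `same_results` helper are hypothetical and are not taken from the environment itself.

```python
import sqlite3

def same_results(conn, original_sql, optimized_sql):
    """Return True if both queries produce the same rows, ignoring row order."""
    original = sorted(conn.execute(original_sql).fetchall())
    optimized = sorted(conn.execute(optimized_sql).fetchall())
    return original == optimized

# Hypothetical toy schema; the real environment defines its own tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Anti-pattern: comma-separated tables with the join condition in WHERE.
implicit_join = """
    SELECT c.name, o.total
    FROM customers c, orders o
    WHERE c.id = o.customer_id AND o.total > 20
"""

# Rewrite: explicit JOIN ... ON, with the filter kept in WHERE.
explicit_join = """
    SELECT c.name, o.total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.total > 20
"""

rows_match = same_results(conn, implicit_join, explicit_join)
```

A check of this kind is what makes the reward verifiable: a rewrite only counts if it returns the same rows as the original query.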

### GRPO Training
- **Algorithm:** GRPO (Group Relative Policy Optimization)
- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **Episodes:** 100
- **Group Size:** 4 completions per prompt
- **Hardware:** Kaggle GPU T4 x2
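
The core of GRPO is that each completion is scored relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalization for a group of 4, as described in the DeepSeekMath paper. The function name and the `eps` stabilizer are assumptions for illustration, not the training code used here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own group.

    `eps` is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt with different rewards:
# below-average completions get negative advantages, above-average positive.
varied = group_relative_advantages([0.2, 0.4, 0.6, 0.8])

# Four completions with identical rewards: every advantage is zero,
# so the group contributes no learning signal.
flat = group_relative_advantages([0.08, 0.08, 0.08, 0.08])
```

Because the advantage depends only on differences within the group, a prompt whose completions all score the same produces no gradient, regardless of how high or low that shared score is.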

### Reward Function
The reward function combines multiple signals:
- `execution_speedup`: How much faster the optimized query runs
- `result_correctness`: Whether the optimized query returns identical results
- `issue_detection`: Whether SQL anti-patterns were correctly identified
- `approval_correctness`: Whether the approval flag is set correctly
- `summary_quality`: Quality of the explanation
- `severity_labels`: Correctness of severity ratings

A bonus reward is added for correct issue detection even when SQL execution fails,
which provides a useful gradient signal for partial progress.
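
One way such a combined reward could be structured is sketched below. The component names come from the list above, but the weights, the bonus size, and the rule for zeroing execution-dependent components are all hypothetical. The source does not state how the signals are actually weighted.

```python
# Hypothetical weights: the components are named in the text, the numbers are not.
WEIGHTS = {
    "execution_speedup": 0.30,
    "result_correctness": 0.30,
    "issue_detection": 0.15,
    "approval_correctness": 0.10,
    "summary_quality": 0.10,
    "severity_labels": 0.05,
}
DETECTION_BONUS = 0.10  # hypothetical bonus applied when execution fails

def combined_reward(components, executed_ok):
    """Weighted sum of component scores in [0, 1], with partial credit on failure."""
    scores = dict(components)
    if not executed_ok:
        # A query that failed to run cannot earn speedup or correctness credit.
        scores["execution_speedup"] = 0.0
        scores["result_correctness"] = 0.0
    reward = sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())
    if not executed_ok:
        # Partial credit: reward correct issue detection even without a runnable query.
        reward += DETECTION_BONUS * scores.get("issue_detection", 0.0)
    return min(reward, 1.0)

failed_but_detected = combined_reward({"issue_detection": 1.0}, executed_ok=False)
failed_no_detection = combined_reward({"issue_detection": 0.0}, executed_ok=False)
```

Under these assumed numbers, two failed completions no longer score identically when only one of them names the right anti-patterns, which gives the group the reward spread that GRPO relies on.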

## Results

### Training Progress
| Metric | Value |
|--------|-------|
| Average reward, episodes 1-10 | 0.3090 |
| Average reward, episodes 91-100 | 0.5962 |
| Relative improvement | +93% |
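
The improvement figure is the relative change between the two window averages. The sketch below reproduces that arithmetic and shows how a window average could be computed from a per-episode reward log. The helper names are assumptions, and the per-episode rewards themselves are not included in this post.

```python
from statistics import mean

def window_mean(rewards, first_episode, last_episode):
    """Mean reward over a 1-indexed, inclusive range of episodes."""
    return mean(rewards[first_episode - 1:last_episode])

def relative_improvement(start_avg, end_avg):
    """Change from start to end, expressed as a fraction of the start value."""
    return (end_avg - start_avg) / start_avg

# Window averages reported in the table above.
gain = relative_improvement(0.3090, 0.5962)
print(f"{gain:+.1%}")  # prints +92.9%, which rounds to the reported +93%
```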

### Final Evaluation
| Task | Difficulty | Score |
|------|-----------|-------|
| task_1_basic_antipatterns | easy | 0.7500 ✅ |
| task_2_correlated_subqueries | medium | 0.8313 ✅ |
| task_3_wildcard_scan | medium-hard | 0.9250 ✅ |
| task_4_implicit_join | hard | 0.6438 ✅ |
| task_5_window_functions | expert | 0.6250 ⚠️ |
| **Average** | | **0.7550** |

**Baseline (original query unchanged): 0.6300**
**Improvement over baseline: +0.1250 absolute (about +19.8% relative)**
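
The average and the gain over baseline can be checked directly from the per-task scores. This short sketch also separates the absolute difference (0.1250 on the score scale) from the relative gain (the same difference as a share of the baseline), since the two are easy to conflate.

```python
# Per-task scores and baseline as reported in the table above.
scores = {
    "task_1_basic_antipatterns": 0.7500,
    "task_2_correlated_subqueries": 0.8313,
    "task_3_wildcard_scan": 0.9250,
    "task_4_implicit_join": 0.6438,
    "task_5_window_functions": 0.6250,
}
baseline = 0.6300

average = round(sum(scores.values()) / len(scores), 4)
absolute_gain = average - baseline        # difference on the 0-1 score scale
relative_gain = absolute_gain / baseline  # difference as a share of the baseline

print(f"average={average:.4f} absolute={absolute_gain:+.4f} relative={relative_gain:+.1%}")
```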

### Training Curve


## Key Findings

1. **Reward variance is critical:** Early runs had flat 0.08 rewards because the model
generated invalid SQL. Fixing the prompt to include schema information created the reward
variance that GRPO needs in order to learn.

2. **Prompt engineering matters for RL:** Explicitly telling the model to use only
columns from the schema was the single most impactful fix.

3. **Partial credit helps:** Adding an issue detection bonus gave the model a learning
signal even when SQL execution failed.

4. **Task difficulty affects learning:** Harder tasks (implicit joins, window functions)
consistently scored lower, suggesting curriculum learning could help.
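
The schema fix from the second finding can be sketched as a small prompt builder. The wording of the instruction, the function name, and the example schema are hypothetical. The source only states that the prompt included schema information and told the model to use only columns from it.

```python
def build_prompt(schema, query):
    """Build an optimization prompt that lists the schema and restricts column use.

    `schema` maps each table name to its list of column names (hypothetical format).
    """
    schema_lines = "\n".join(
        f"- {table}({', '.join(columns)})" for table, columns in schema.items()
    )
    return (
        "You are a SQL optimization assistant.\n"
        "Schema:\n"
        f"{schema_lines}\n"
        "Use only the tables and columns listed in the schema above.\n"
        "Query to optimize:\n"
        f"{query.strip()}\n"
    )

# Hypothetical schema and query for illustration.
prompt = build_prompt(
    {"customers": ["id", "name"], "orders": ["id", "customer_id", "total"]},
    "SELECT * FROM orders WHERE total > 20",
)
```

Grounding the prompt in the schema targets the failure described in the first finding: completions that reference nonexistent columns fail to execute and all collapse to the same low reward.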

## Model
https://huggingface.co/laterabhi/grpo-sql-optimizer

## References
- [GRPO Paper - DeepSeekMath](https://arxiv.org/abs/2402.03300)
- [TRL Library](https://huggingface.co/docs/trl)
- [SQL Optimization Environment](https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-)
- [Qwen2.5 Model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)