laterabhi committed 26 days ago

Commit afa8b1d (verified), 1 parent: 85cd109

Upload Blog.md with huggingface_hub

Blog.md +84 -0

	@@ -0,0 +1,84 @@

+# GRPO Training for SQL Query Optimization
+## Overview
+This project fine-tunes `Qwen/Qwen2.5-0.5B-Instruct` with GRPO (Group Relative Policy Optimization),
+a reinforcement learning algorithm, to optimize SQL queries against a DuckDB execution environment.
+## Problem Statement
+SQL query optimization is critical for database performance. This project trains an LLM
+to automatically identify and fix SQL anti-patterns using RL with verifiable rewards.
+## Approach
+### Environment
+- Used [SQL Query Optimization Environment](https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-)
+- DuckDB-based execution environment with 5 tasks of increasing difficulty
+- Tasks: basic antipatterns, correlated subqueries, wildcard scans, implicit joins, window functions
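
To make the idea of an execution environment concrete, here is a minimal sketch of checking an optimized query against the original: run both, compare the result sets, and time them. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs without extra dependencies. The toy schema, the two queries, and the `run_query` and `compare_queries` helpers are all hypothetical and are not taken from the linked environment.

```python
import sqlite3
import time


def run_query(con, sql):
    """Execute sql and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = con.execute(sql).fetchall()
    return rows, time.perf_counter() - start


def compare_queries(con, original_sql, optimized_sql):
    """Return (same_results, speedup) for an original/optimized query pair."""
    original_rows, original_time = run_query(con, original_sql)
    optimized_rows, optimized_time = run_query(con, optimized_sql)
    # Compare as sorted lists so row order is ignored but duplicates still count.
    same_results = sorted(original_rows) == sorted(optimized_rows)
    # Guard against a zero timer reading on very fast queries.
    speedup = original_time / max(optimized_time, 1e-9)
    return same_results, speedup


# Hypothetical toy schema, not one of the environment's five tasks.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
con.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "ada"), (2, "bob")])
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5)],
)

# Correlated subquery (anti-pattern) versus a join with GROUP BY.
# The two agree here because every customer has at least one order.
original = """
    SELECT c.name,
           (SELECT SUM(o.total) FROM orders o WHERE o.customer_id = c.id)
    FROM customers c
"""
optimized = """
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
"""
same_results, speedup = compare_queries(con, original, optimized)
print(same_results, round(speedup, 2))
```

On a table this small the timing is dominated by noise, so a real environment would presumably use larger data and repeated runs before trusting a speedup figure.
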
+### GRPO Training
+- **Algorithm:** GRPO (Group Relative Policy Optimization)
+- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
+- **Episodes:** 100
+- **Group Size:** 4 completions per prompt
+- **Hardware:** Kaggle GPU T4 x2
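
The "group relative" part of GRPO can be sketched in a few lines: each completion's reward is normalized against the mean and standard deviation of the other completions sampled for the same prompt. The snippet below is an illustrative sketch, not the training code used here. The function name, the `eps` term, the choice of population standard deviation, and the first set of reward values are assumptions, and implementations differ on such details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions (group size 4).
varied = group_relative_advantages([0.2, 0.4, 0.6, 0.8])  # made-up rewards
flat = group_relative_advantages([0.08, 0.08, 0.08, 0.08])  # identical rewards

print(varied)  # below-average completions get negative advantages
print(flat)    # identical rewards give all-zero advantages
```

When every completion in a group earns the same reward, all advantages are zero and the policy update has nothing to push on.
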
+### Reward Function
+The reward function combines multiple signals:
+- `execution_speedup`: How much faster the optimized query runs
+- `result_correctness`: Whether the optimized query returns identical results
+- `issue_detection`: Whether SQL anti-patterns were correctly identified
+- `approval_correctness`: Whether the approval flag is set correctly
+- `summary_quality`: Quality of the explanation
+- `severity_labels`: Correctness of severity ratings
+
+A bonus reward is added for correct issue detection even when SQL execution fails,
+which provides a useful gradient signal for partial progress.
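
One way to combine these signals is a weighted sum with a small bonus on failed executions. The sketch below is hypothetical: the post names the components but not their weights, so `WEIGHTS`, `DETECTION_BONUS`, and the `combined_reward` function are assumptions chosen only to illustrate the shape of such a reward.

```python
# Hypothetical weights: the post names the components but not their weighting.
WEIGHTS = {
    "execution_speedup": 0.30,
    "result_correctness": 0.30,
    "issue_detection": 0.15,
    "approval_correctness": 0.10,
    "summary_quality": 0.10,
    "severity_labels": 0.05,
}
DETECTION_BONUS = 0.1  # hypothetical size of the partial-credit bonus


def combined_reward(scores, executed):
    """Combine per-component scores in [0, 1] into one scalar reward in [0, 1]."""
    if not executed:
        # The SQL failed to run, so execution-based signals are zeroed out.
        scores = {**scores, "execution_speedup": 0.0, "result_correctness": 0.0}
    reward = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    if not executed:
        # Partial credit: still reward correctly identified anti-patterns.
        reward += DETECTION_BONUS * scores.get("issue_detection", 0.0)
    return min(reward, 1.0)


# A failed query that still identified the issues earns more than one that did not.
print(combined_reward({"issue_detection": 1.0}, executed=False))
print(combined_reward({"issue_detection": 0.0}, executed=False))
```

The point of the bonus is that two failed completions no longer receive the same reward, which gives the group-relative comparison something to distinguish.
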
+## Results
+### Training Progress
+| Metric | Value |
+|--------|-------|
+| Start avg (ep1-10) | 0.3090 |
+| End avg (ep91-100) | 0.5962 |
+| Improvement | +93% |
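
The start and end figures are averages over the first and last ten episodes, and the improvement is the relative change between them. A small sketch of that calculation follows. The helper names are hypothetical and the per-episode reward list is made up, since only the two averages are reported.

```python
def window_mean(rewards, first_episode, last_episode):
    """Mean reward over episodes first_episode..last_episode (1-indexed, inclusive)."""
    window = rewards[first_episode - 1:last_episode]
    return sum(window) / len(window)


def relative_improvement(start, end):
    """Fractional change from start to end."""
    return (end - start) / start


# Made-up per-episode rewards, only to show how the windows are taken.
example_rewards = [0.25] * 10 + [0.5] * 90
start_avg = window_mean(example_rewards, 1, 10)
end_avg = window_mean(example_rewards, 91, 100)

# Relative change between the two reported averages.
print(f"{relative_improvement(0.3090, 0.5962):+.0%}")  # +93%
```
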
+### Final Evaluation
+| Task | Difficulty | Score |
+|------|-----------|-------|
+| task_1_basic_antipatterns | easy | 0.7500 ✅ |
+| task_2_correlated_subqueries | medium | 0.8313 ✅ |
+| task_3_wildcard_scan | medium-hard | 0.9250 ✅ |
+| task_4_implicit_join | hard | 0.6438 ✅ |
+| task_5_window_functions | expert | 0.6250 ⚠️ |
+| **Average** | | **0.7550** |
+
+**Baseline (original query unchanged): 0.6300**
+
+**Improvement over baseline: +0.1250 absolute (about +19.8% relative)**
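
The summary figures can be recomputed from the per-task scores in the table. The gain over the baseline is 0.1250 on the 0 to 1 score scale, which is 12.5 points in absolute terms and about 19.8% relative to the baseline of 0.6300. The variable names below are just for illustration.

```python
scores = {
    "task_1_basic_antipatterns": 0.7500,
    "task_2_correlated_subqueries": 0.8313,
    "task_3_wildcard_scan": 0.9250,
    "task_4_implicit_join": 0.6438,
    "task_5_window_functions": 0.6250,
}
baseline = 0.6300

average = sum(scores.values()) / len(scores)
absolute_gain = average - baseline
relative_gain = absolute_gain / baseline

print(f"average:       {average:.4f}")         # 0.7550
print(f"absolute gain: {absolute_gain:+.4f}")  # +0.1250
print(f"relative gain: {relative_gain:+.1%}")  # +19.8%
```
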
+### Training Curve
+![Training Curve](grpo_results.png)
+## Key Findings
+1. **Reward variance is critical:** Early runs had flat rewards of 0.08 because the model
+   generated invalid SQL. GRPO scores each completion relative to the others in its group,
+   so identical rewards carry no learning signal. Adding schema information to the prompt
+   created the reward variance GRPO needs to learn.
+2. **Prompt engineering matters for RL:** Explicitly telling the model to use only
+   columns from the schema was the single most impactful fix.
+3. **Partial credit helps:** Adding an issue-detection bonus gave the model a learning
+   signal even when SQL execution failed.
+4. **Task difficulty affects learning:** The two hardest tasks (implicit joins, window
+   functions) scored lowest, at 0.6438 and 0.6250, although the easy task (0.7500) scored
+   below both medium tasks. Curriculum learning could help on the harder tasks.
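
The second finding, restricting the model to columns from the schema, can be illustrated with a small prompt builder. This is a hypothetical sketch: the actual prompt wording used in training is not given in the post, and `build_prompt`, the instruction text, and the example schema are assumptions.

```python
def build_prompt(schema, query):
    """Build an optimization prompt that lists the only columns the model may use."""
    lines = [
        "You are a SQL optimizer. Use only the tables and columns listed below.",
        "",
        "Schema:",
    ]
    for table, columns in schema.items():
        lines.append(f"- {table}({', '.join(columns)})")
    lines += ["", "Query to optimize:", query.strip()]
    return "\n".join(lines)


# Hypothetical schema and query, for illustration only.
prompt = build_prompt(
    {"customers": ["id", "name"], "orders": ["id", "customer_id", "total"]},
    "SELECT * FROM orders WHERE customer_id = 1",
)
print(prompt)
```

Listing the schema explicitly gives the model fewer ways to invent column names, which is the failure the post blames for the early invalid SQL.
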
+## Model
+https://huggingface.co/laterabhi/grpo-sql-optimizer
+## References
+- [GRPO Paper - DeepSeekMath](https://arxiv.org/abs/2402.03300)
+- [TRL Library](https://huggingface.co/docs/trl)
+- [SQL Optimization Environment](https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-)
+- [Qwen2.5 Model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)