Spaces:
Running
Running
File size: 3,914 Bytes
12bc11b 9a6ccae 12bc11b 9a6ccae | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | ---
title: Grpo Sql Optimizer
emoji: 🧠
colorFrom: pink
colorTo: red
sdk: static
pinned: false
---
# GRPO Training for SQL Query Optimization (DuckDB-Verifiable Rewards)
Fine-tuned `Qwen/Qwen2.5-0.5B-Instruct` using **GRPO (Group Relative Policy Optimization)** to optimize SQL queries with **verifiable rewards**: we **execute** the original + rewritten SQL against a real DuckDB database and score based on measured speedup and correctness.
- **Repo (source of truth):** https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-
- **Model:** https://huggingface.co/laterabhi/grpo-sql-optimizer
- **Space:** https://huggingface.co/spaces/laterabhi/grpo-sql-optimizer
---
## Why this matters
LLMs often generate SQL that is syntactically valid but **slow** (or subtly wrong) at scale. Classic training setups use heuristic scoring, which can be gamed. This project trains/evaluates SQL optimization with **execution-grounded feedback**.
---
## Environment (5 tasks, increasing difficulty)
We use the **SQL Query Optimization Environment** (OpenEnv compliant), backed by an in-memory DuckDB dataset:
- `users` (10k), `orders` (500k), `events` (1M), `products` (1k)
Tasks:
1. `task_1_basic_antipatterns` (easy)
2. `task_2_correlated_subqueries` (medium)
3. `task_3_wildcard_scan` (medium-hard)
4. `task_4_implicit_join` (hard)
5. `task_5_window_functions` (expert)
---
## Reward function (execution-grounded)
Composite reward in \[0, 1\], combining:
- **execution_speedup (35%)**: measured ratio `original_ms / optimized_ms` from DuckDB
- **result_correctness (20%)**: results match check (order-independent for large outputs)
- **issue_detection (25%)**: anti-pattern detection vs ground-truth keywords per task
- **approval_correctness (8%)**
- **summary_quality (7%)**
- **severity_labels (5%)**
This is designed to be **hard to game**: “fast but wrong” loses correctness; “verbose but slow” loses speedup.
---
## Training setup (GRPO)
- **Algorithm:** GRPO (group-relative policy optimization)
- **Base model:** `Qwen/Qwen2.5-0.5B-Instruct`
- **Group size:** 4 completions per prompt
- **Notebook:** Kaggle (linked in the repo README)
---
## Results (from the GitHub repo)
### Training progress (100 episodes)
| Metric | Value |
|--------|-------|
| Start avg (ep 1–10) | 0.3090 |
| End avg (ep 91–100) | 0.5962 |
| Improvement | **+93%** |
**Reward curve:**

### Final evaluation (per task)
| Task | Difficulty | Score |
|------|-----------|-------|
| task_1_basic_antipatterns | easy | **0.7500** ✅ |
| task_2_correlated_subqueries | medium | **0.8313** ✅ |
| task_3_wildcard_scan | medium-hard | **0.6563** ✅ |
| task_4_implicit_join | hard | **0.6563** ✅ |
| task_5_window_functions | expert | **0.6500** ✅ |
> **Task 5 note:** `task_5_window_functions` is the **expert** scenario, so it’s expected to be the lowest. This is not an error—just the hardest distribution.
### “Before / After” (environment-only, no API keys)
We also provide a reproducible before/after contrast:
- **Before:** suggestions present but `optimized_query` empty (no speedup/correctness signal)
- **After:** deterministic fallback policy with a real optimized query

---
## How to reproduce (locally)
```bash
git clone https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-.git
cd SQL-Query-Optimization-Environment-
pip install -r requirements.txt
# Baselines (fallback + optional LLM if HF_TOKEN set)
python baseline_runner.py
# Environment-only before/after (no API keys)
python training/eval_before_after.py --save-dir results |