# GRPO Training for SQL Query Optimization (DuckDB-Verifiable Rewards)

This project fine-tunes `Qwen/Qwen2.5-0.5B-Instruct` with **GRPO (Group Relative Policy Optimization)** to optimize SQL queries using **verifiable rewards**: we **execute** both the original and the rewritten SQL against a real DuckDB database and score them by measured speedup and correctness.

- **Repo (source of truth):** https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-
- **Model:** https://huggingface.co/laterabhi/grpo-sql-optimizer
- **Space:** https://huggingface.co/spaces/laterabhi/grpo-sql-optimizer

---

## Why this matters

LLMs often generate SQL that is syntactically valid but **slow** (or subtly wrong) at scale. Classic training setups rely on heuristic scoring, which can be gamed. This project trains and evaluates SQL optimization with **execution-grounded feedback**.

---

## Environment (5 tasks, increasing difficulty)

We use the **SQL Query Optimization Environment** (OpenEnv-compliant), backed by an in-memory DuckDB dataset:

- `users` (10k), `orders` (500k), `events` (1M), `products` (1k)

Tasks:

1. `task_1_basic_antipatterns` (easy)
2. `task_2_correlated_subqueries` (medium)
3. `task_3_wildcard_scan` (medium-hard)
4. `task_4_implicit_join` (hard)
5. `task_5_window_functions` (expert)

---

## Reward function (execution-grounded)

Composite reward in \[0, 1\], combining:

- **execution_speedup (35%)**: measured ratio `original_ms / optimized_ms` from DuckDB
- **result_correctness (20%)**: result-set match check (order-independent for large outputs)
- **issue_detection (25%)**: anti-pattern detection vs. ground-truth keywords per task
- **approval_correctness (8%)**
- **summary_quality (7%)**
- **severity_labels (5%)**

This is designed to be **hard to game**: “fast but wrong” loses correctness; “verbose but slow” loses speedup. (A sketch of the composition appears after the results tables below.)

---

## Training setup (GRPO)

- **Algorithm:** GRPO (Group Relative Policy Optimization); see the advantage sketch after the results below
- **Base model:** `Qwen/Qwen2.5-0.5B-Instruct`
- **Group size:** 4 completions per prompt
- **Notebook:** Kaggle (linked in the repo README)
- **Note:** `train.py` defaults to 200 episodes, but the reported curve/table below is from the 100-episode run described in the repo.

---

## Results (from the GitHub repo)

### Training progress (100 episodes)

| Metric | Value |
|--------|-------|
| Start avg (ep 1–10) | 0.3090 |
| End avg (ep 91–100) | 0.5962 |
| Improvement | **+93%** |

**Reward curve:**

![GRPO reward curve](https://raw.githubusercontent.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-/main/results/grpo_reward_curve.png)

### Final evaluation (per task)

| Task | Difficulty | Score |
|------|-----------|-------|
| task_1_basic_antipatterns | easy | **0.7500** ✅ |
| task_2_correlated_subqueries | medium | **0.8313** ✅ |
| task_3_wildcard_scan | medium-hard | **0.6563** ✅ |
| task_4_implicit_join | hard | **0.6563** ✅ |
| task_5_window_functions | expert | **0.6500** ✅ |

> **Task 5 note:** `task_5_window_functions` is the **expert** scenario, so it is expected to score lowest. This is not an error; it is simply the hardest distribution.
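For intuition, here is a minimal sketch of how the composite reward above could be assembled. The weights come from the bullet list in the reward section; the function and field names are hypothetical, and the authoritative scoring logic lives in the environment code in the repo.

```python
# Minimal sketch of the composite reward described above. Component names
# mirror the bullet list; helper signatures are hypothetical, and the
# authoritative scoring code lives in the environment repo.

WEIGHTS = {
    "execution_speedup": 0.35,
    "result_correctness": 0.20,
    "issue_detection": 0.25,
    "approval_correctness": 0.08,
    "summary_quality": 0.07,
    "severity_labels": 0.05,
}

def speedup_score(original_ms: float, optimized_ms: float) -> float:
    """Map the measured ratio original_ms / optimized_ms into [0, 1].

    The saturation point (2x speedup = full credit) is an illustrative
    choice, not necessarily what the environment uses.
    """
    if optimized_ms <= 0:
        return 0.0
    return min((original_ms / optimized_ms) / 2.0, 1.0)

def composite_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed to be in [0, 1]."""
    total = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    return max(0.0, min(1.0, total))
```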
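Similarly, a minimal sketch of the group-relative advantage at the heart of GRPO, assuming the common mean/std normalization within each group of 4 completions (the actual trainer lives in the Kaggle notebook linked from the repo):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Center and scale each completion's reward by its own group's
    statistics; here a group is the 4 completions sampled for one prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: environment rewards for 4 completions of the same prompt.
print(group_relative_advantages([0.31, 0.52, 0.48, 0.66]))
```

Completions that beat their group's mean get positive advantages, so the policy update pushes toward them without needing a learned value function.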
### “Before / After” (environment-only, no API keys)

We also provide a **reproducible** before/after contrast:

- **Before:** suggestions are present but `optimized_query` is empty (no speedup/correctness signal)
- **After:** a deterministic fallback policy produces a real optimized query

The table and chart are committed in the repo:

- `results/before_after_table.md`
- `results/before_after_chart.png`

![Before/After chart](https://raw.githubusercontent.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-/main/results/before_after_chart.png)

---

## How to reproduce (locally)

```bash
git clone https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-.git
cd SQL-Query-Optimization-Environment-

pip install -r requirements.txt

# Baselines (fallback + optional LLM if HF_TOKEN is set)
python baseline_runner.py

# Environment-only before/after (no API keys)
python training/eval_before_after.py --save-dir results
```
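To sanity-check the execution-grounded signal yourself, measuring a speedup ratio in DuckDB takes only a few lines. This is a toy illustration: the table and queries below are made up, and the real environment builds its own `users`/`orders`/`events`/`products` dataset and layers correctness checks on top.

```python
import time

import duckdb

# Toy stand-in for the environment's dataset; the real environment
# builds users/orders/events/products itself.
con = duckdb.connect(":memory:")
con.execute(
    "CREATE TABLE orders AS "
    "SELECT range AS id, range % 100 AS user_id FROM range(500000)"
)

def median_ms(sql: str, runs: int = 5) -> float:
    """Median wall-clock execution time of a query, in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        con.execute(sql).fetchall()
        samples.append((time.perf_counter() - start) * 1000)
    return sorted(samples)[runs // 2]

original = "SELECT * FROM orders WHERE user_id = 7"
optimized = "SELECT id FROM orders WHERE user_id = 7"
print("speedup:", median_ms(original) / median_ms(optimized))
```

Taking the median of several runs keeps a one-off timing spike from dominating the measured ratio, which matters when that ratio feeds directly into a training reward.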