# GRPO Training for SQL Query Optimization (DuckDB-Verifiable Rewards)

Fine-tuned `Qwen/Qwen2.5-0.5B-Instruct` using **GRPO (Group Relative Policy Optimization)** to optimize SQL queries with **verifiable rewards**: we **execute** the original and rewritten SQL against a real DuckDB database and score based on measured speedup and correctness.

- **Repo (source of truth):** https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-
- **Model:** https://huggingface.co/laterabhi/grpo-sql-optimizer
- **Space:** https://huggingface.co/spaces/laterabhi/grpo-sql-optimizer

---

## Why this matters

LLMs often generate SQL that is syntactically valid but **slow** (or subtly wrong) at scale. Classic training setups use heuristic scoring, which can be gamed. This project trains and evaluates SQL optimization with **execution-grounded feedback**.

---
## Environment (5 tasks, increasing difficulty)

We use the **SQL Query Optimization Environment** (OpenEnv-compliant), backed by an in-memory DuckDB dataset (a loading sketch follows the task list):

- `users` (10k rows), `orders` (500k rows), `events` (1M rows), `products` (1k rows)

Tasks:

1. `task_1_basic_antipatterns` (easy)
2. `task_2_correlated_subqueries` (medium)
3. `task_3_wildcard_scan` (medium-hard)
4. `task_4_implicit_join` (hard)
5. `task_5_window_functions` (expert)
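
The exact schemas and data generators live in the repo; the sketch below only shows the general pattern of building such an in-memory DuckDB dataset. Column names and value distributions here are assumptions, not the environment's real schema; only the row counts come from this README.

```python
import duckdb

# Purely in-memory; nothing touches disk.
con = duckdb.connect(":memory:")

# Illustrative schemas (assumed for this sketch); the environment's
# real tables and columns are defined in the repo.
con.execute("""
    CREATE TABLE users AS
    SELECT i AS user_id, 'user_' || i AS name, i % 50 AS country_id
    FROM range(10000) t(i)
""")
con.execute("""
    CREATE TABLE orders AS
    SELECT i AS order_id, i % 10000 AS user_id, random() * 500 AS amount
    FROM range(500000) t(i)
""")

print(con.execute("SELECT count(*) FROM orders").fetchone())  # (500000,)
```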
---
## Reward function (execution-grounded)

Composite reward in [0, 1], combining:

- **execution_speedup (35%)**: measured ratio `original_ms / optimized_ms` from DuckDB
- **result_correctness (20%)**: the rewritten query's results match the original's (order-independent for large outputs)
- **issue_detection (25%)**: anti-pattern detection vs. ground-truth keywords per task
- **approval_correctness (8%)**
- **summary_quality (7%)**
- **severity_labels (5%)**

This is designed to be **hard to game**: “fast but wrong” loses the correctness share; “verbose but slow” loses the speedup share. A sketch of the weighted combination follows.
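
The environment's actual reward code lives in the repo; this is a minimal sketch of the weighted combination using the weights listed above. The function name, field names, and clipping behavior are assumptions for illustration.

```python
# Weights taken from the list above; each component score (already in
# [0, 1]) comes from executing both queries in DuckDB and from checking
# the model's analysis against per-task ground truth.
WEIGHTS = {
    "execution_speedup": 0.35,
    "result_correctness": 0.20,
    "issue_detection": 0.25,
    "approval_correctness": 0.08,
    "summary_quality": 0.07,
    "severity_labels": 0.05,
}

def composite_reward(components: dict[str, float]) -> float:
    """Weighted sum of component scores, clipped to [0, 1]."""
    raw = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return max(0.0, min(1.0, raw))

# "Fast but wrong" keeps the speedup share but loses the correctness share:
print(composite_reward({"execution_speedup": 1.0, "result_correctness": 0.0,
                        "issue_detection": 0.8}))  # ~= 0.55
```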
---
## Training setup (GRPO)

- **Algorithm:** GRPO (Group Relative Policy Optimization)
- **Base model:** `Qwen/Qwen2.5-0.5B-Instruct`
- **Group size:** 4 completions per prompt
- **Notebook:** Kaggle (linked in the repo README)
- **Note:** `train.py` defaults to 200 episodes, but the reported curve and tables below are from the 100-episode run described in the repo.
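
The repo's `train.py` is the source of truth for the actual loop. Below is a minimal sketch of an equivalent setup using TRL's `GRPOTrainer` (an assumption: the repo may wire GRPO differently), with a trivial placeholder reward standing in for the DuckDB-grounded one above.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt standing in for the environment's five tasks.
train_dataset = Dataset.from_dict({
    "prompt": ["Rewrite this SQL to run faster: SELECT * FROM orders"],
})

def sql_reward(completions, **kwargs):
    # Placeholder: the real reward executes each rewritten query against
    # DuckDB and returns the composite score described above.
    return [1.0 if "select" in c.lower() else 0.0 for c in completions]

config = GRPOConfig(
    output_dir="grpo-sql-optimizer",
    num_generations=4,          # group size: 4 completions per prompt
    max_completion_length=512,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=sql_reward,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```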
---
## Results (from the GitHub repo)

### Training progress (100 episodes)

| Metric | Value |
|--------|-------|
| Start avg reward (ep 1–10) | 0.3090 |
| End avg reward (ep 91–100) | 0.5962 |
| Improvement | **+93%** |

**Reward curve:**

![Training Rewards](results/training_rewards.png)
### Final evaluation (per task)

| Task | Difficulty | Score |
|------|-----------|-------|
| task_1_basic_antipatterns | easy | **0.7500** ✅ |
| task_2_correlated_subqueries | medium | **0.8313** ✅ |
| task_3_wildcard_scan | medium-hard | **0.6563** ✅ |
| task_4_implicit_join | hard | **0.6563** ✅ |
| task_5_window_functions | expert | **0.6500** ✅ |

> **Task 5 note:** `task_5_window_functions` is the **expert** scenario, so it is expected to score lowest. This is not an error; it is simply the hardest distribution.
### “Before / After” (environment-only, no API keys)

We also provide a **reproducible** before/after contrast:

- **Before:** suggestions are present but `optimized_query` is empty (no speedup or correctness signal)
- **After:** a deterministic fallback policy that always emits a real optimized query (an illustrative sketch follows)
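
As an illustration of what a deterministic fallback can look like, here is a sketch with one rule-based rewrite. The rule and column list are hypothetical; the repo's actual fallback policy is the source of truth.

```python
def fallback_optimize(query: str) -> str:
    """Illustrative rule-based rewrite: replace a leading SELECT * with
    explicit columns, so the 'after' run always has a real optimized_query.
    (Hypothetical rule; see the repo for the actual fallback policy.)"""
    q = query.strip()
    if q.upper().startswith("SELECT *"):
        # Hypothetical column list for the users table.
        return "SELECT user_id, name" + q[len("SELECT *"):]
    return q

print(fallback_optimize("SELECT * FROM users WHERE country_id = 7"))
# SELECT user_id, name FROM users WHERE country_id = 7
```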
The table and chart are committed in the repo:

- `results/before_after_table.md`
- `results/before_after_chart.png`

![Before/After](results/before_after_chart.png)

---
## How to reproduce (locally)

```bash
git clone https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-.git
cd SQL-Query-Optimization-Environment-
pip install -r requirements.txt

# Baselines (fallback + optional LLM if HF_TOKEN is set)
python baseline_runner.py

# Environment-only before/after (no API keys)
python training/eval_before_after.py --save-dir results
```
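
To sanity-check the execution-grounded signal by hand, the sketch below measures `original_ms / optimized_ms` in DuckDB. The table, queries, and timing protocol are illustrative, not the environment's actual benchmark harness.

```python
import time
import duckdb

con = duckdb.connect(":memory:")
con.execute("CREATE TABLE orders AS "
            "SELECT i AS order_id, i % 10000 AS user_id, "
            "random() * 500 AS amount FROM range(500000) t(i)")

def time_ms(sql: str, runs: int = 5) -> float:
    """Median wall-clock time in milliseconds over a few runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        con.execute(sql).fetchall()   # force full materialization
        samples.append((time.perf_counter() - start) * 1000)
    return sorted(samples)[len(samples) // 2]

# Anti-pattern: casting the column in WHERE forces per-row conversion.
original = "SELECT * FROM orders WHERE CAST(user_id AS VARCHAR) = '42'"
optimized = "SELECT * FROM orders WHERE user_id = 42"
print(f"speedup: {time_ms(original) / time_ms(optimized):.2f}x")
```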