# GRPO Training for SQL Query Optimization (DuckDB-Verifiable Rewards)

This project fine-tunes `Qwen/Qwen2.5-0.5B-Instruct` with **GRPO (Group Relative Policy Optimization)** to optimize SQL queries using **verifiable rewards**: we **execute** both the original and the rewritten SQL against a real DuckDB database and score them by measured speedup and correctness.

- **Repo (source of truth):** https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-
- **Model:** https://huggingface.co/laterabhi/grpo-sql-optimizer
- **Space:** https://huggingface.co/spaces/laterabhi/grpo-sql-optimizer

---

## Why this matters

LLMs often generate SQL that is syntactically valid but **slow** (or subtly wrong) at scale. Classic training setups rely on heuristic scoring, which can be gamed. This project trains and evaluates SQL optimization with **execution-grounded feedback**.

---

## Environment (5 tasks, increasing difficulty)

We use the **SQL Query Optimization Environment** (OpenEnv-compliant), backed by an in-memory DuckDB dataset:

- `users` (10k), `orders` (500k), `events` (1M), `products` (1k)

Tasks:

1. `task_1_basic_antipatterns` (easy)
2. `task_2_correlated_subqueries` (medium)
3. `task_3_wildcard_scan` (medium-hard)
4. `task_4_implicit_join` (hard)
5. `task_5_window_functions` (expert)

---

## Reward function (execution-grounded)

Composite reward in \[0, 1\], combining:

- **execution_speedup (35%)**: measured ratio `original_ms / optimized_ms` from DuckDB
- **result_correctness (20%)**: result-set match check (order-independent for large outputs)
- **issue_detection (25%)**: anti-pattern detection vs. ground-truth keywords per task
- **approval_correctness (8%)**
- **summary_quality (7%)**
- **severity_labels (5%)**

This is designed to be **hard to game**: “fast but wrong” loses correctness; “verbose but slow” loses speedup. (A sketch of the composition appears after the results tables below.)

---

## Training setup (GRPO)

- **Algorithm:** GRPO (Group Relative Policy Optimization); see the advantage sketch after the results below
- **Base model:** `Qwen/Qwen2.5-0.5B-Instruct`
- **Group size:** 4 completions per prompt
- **Notebook:** Kaggle (linked in the repo README)
- **Note:** `train.py` defaults to 200 episodes, but the reported curve/table below is from the 100-episode run described in the repo.

---

## Results (from the GitHub repo)

### Training progress (100 episodes)

| Metric | Value |
|--------|-------|
| Start avg (ep 1–10) | 0.3090 |
| End avg (ep 91–100) | 0.5962 |
| Improvement | **+93%** |

**Reward curve:**

![GRPO reward curve](https://raw.githubusercontent.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-/main/results/grpo_reward_curve.png)

### Final evaluation (per task)

| Task | Difficulty | Score |
|------|-----------|-------|
| task_1_basic_antipatterns | easy | **0.7500** ✅ |
| task_2_correlated_subqueries | medium | **0.8313** ✅ |
| task_3_wildcard_scan | medium-hard | **0.6563** ✅ |
| task_4_implicit_join | hard | **0.6563** ✅ |
| task_5_window_functions | expert | **0.6500** ✅ |

> **Task 5 note:** `task_5_window_functions` is the **expert** scenario, so it is expected to score lowest. This is not an error; it is simply the hardest distribution.
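For intuition, here is a minimal sketch of how the composite reward above could be assembled. The weights come from the bullet list in the reward section; the function and field names are hypothetical, and the authoritative scoring logic lives in the environment code in the repo.

```python
# Minimal sketch of the composite reward described above. Component names
# mirror the bullet list; helper signatures are hypothetical, and the
# authoritative scoring code lives in the environment repo.

WEIGHTS = {
    "execution_speedup": 0.35,
    "result_correctness": 0.20,
    "issue_detection": 0.25,
    "approval_correctness": 0.08,
    "summary_quality": 0.07,
    "severity_labels": 0.05,
}

def speedup_score(original_ms: float, optimized_ms: float) -> float:
    """Map the measured ratio original_ms / optimized_ms into [0, 1].

    The saturation point (2x speedup = full credit) is an illustrative
    choice, not necessarily what the environment uses.
    """
    if optimized_ms <= 0:
        return 0.0
    return min((original_ms / optimized_ms) / 2.0, 1.0)

def composite_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed to be in [0, 1]."""
    total = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    return max(0.0, min(1.0, total))
```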
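Similarly, a minimal sketch of the group-relative advantage at the heart of GRPO, assuming the common mean/std normalization within each group of 4 completions (the actual trainer lives in the Kaggle notebook linked from the repo):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Center and scale each completion's reward by its own group's
    statistics; here a group is the 4 completions sampled for one prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: environment rewards for 4 completions of the same prompt.
print(group_relative_advantages([0.31, 0.52, 0.48, 0.66]))
```

Completions that beat their group's mean get positive advantages, so the policy update pushes toward them without needing a learned value function.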
### “Before / After” (environment-only, no API keys)

We also provide a **reproducible** before/after contrast:

- **Before:** suggestions are present but `optimized_query` is empty (no speedup/correctness signal)
- **After:** a deterministic fallback policy produces a real optimized query

The table and chart are committed in the repo:

- `results/before_after_table.md`
- `results/before_after_chart.png`

![Before/After chart](https://raw.githubusercontent.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-/main/results/before_after_chart.png)

---

## How to reproduce (locally)

```bash
git clone https://github.com/OfficialAbhinavSingh/SQL-Query-Optimization-Environment-.git
cd SQL-Query-Optimization-Environment-

pip install -r requirements.txt

# Baselines (fallback + optional LLM if HF_TOKEN is set)
python baseline_runner.py

# Environment-only before/after (no API keys)
python training/eval_before_after.py --save-dir results
```
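To sanity-check the execution-grounded signal yourself, measuring a speedup ratio in DuckDB takes only a few lines. This is a toy illustration: the table and queries below are made up, and the real environment builds its own `users`/`orders`/`events`/`products` dataset and layers correctness checks on top.

```python
import time

import duckdb

# Toy stand-in for the environment's dataset; the real environment
# builds users/orders/events/products itself.
con = duckdb.connect(":memory:")
con.execute(
    "CREATE TABLE orders AS "
    "SELECT range AS id, range % 100 AS user_id FROM range(500000)"
)

def median_ms(sql: str, runs: int = 5) -> float:
    """Median wall-clock execution time of a query, in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        con.execute(sql).fetchall()
        samples.append((time.perf_counter() - start) * 1000)
    return sorted(samples)[runs // 2]

original = "SELECT * FROM orders WHERE user_id = 7"
optimized = "SELECT id FROM orders WHERE user_id = 7"
print("speedup:", median_ms(original) / median_ms(optimized))
```

Taking the median of several runs keeps a one-off timing spike from dominating the measured ratio, which matters when that ratio feeds directly into a training reward.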