Spaces:
Sleeping
Sleeping
File size: 7,448 Bytes
90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea b83c8ad 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea b83c8ad 852b5ea b83c8ad 852b5ea b83c8ad 90fc756 b83c8ad 90fc756 b83c8ad 852b5ea b83c8ad 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea b83c8ad 852b5ea b83c8ad 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 b83c8ad 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea 90fc756 852b5ea | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 | ---
title: SQL Query Reviewer
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
tags:
- openenv
---
# SQL Query Reviewer
An OpenEnv environment where an AI agent reviews SQL queries for correctness, performance, and security β the same task thousands of engineers perform every day in code reviews, migration scripts, and ETL audits.
## Why This Matters
SQL bugs are among the most common and costly defects in production systems. A misplaced
keyword breaks an API. A missing WHERE clause on a DELETE wipes a table. An unparameterized
input opens a path to data exfiltration. A function call on an indexed column turns a
10ms query into a 30-second full table scan.
Today, these defects are caught by human reviewers who spend hours on repetitive pattern
matching during code reviews, migration audits, and ETL pipeline checks. This creates a
bottleneck β senior engineers are pulled from feature work to review SQL, and critical
issues still slip through.
This environment provides a standardized benchmark to train and evaluate AI agents on
exactly this task. Unlike toy benchmarks, every query reflects real patterns found in
production codebases β from typos that break APIs to injection vectors that expose user
data to race conditions that enable double-spending. The agent must identify issues,
suggest fixes, and know when to approve β just like a human code reviewer.
The environment provides rich per-step reward signals with severity-weighted partial
credit, making it directly suitable for GRPO and PPO training loops. The task bank spans
three difficulty levels with meaningful score variance, ensuring the benchmark
discriminates between agent capabilities.
## What The Environment Does
Each episode gives the agent:
- a SQL query (with realistic bugs drawn from production patterns)
- schema context when it matters (table definitions, column types, constraints)
- a short explanation of the query's intended purpose
The agent responds step by step with one of four actions:
| Action | Description |
|---|---|
| `identify_issue` | Flag a correctness, performance, or security problem |
| `suggest_fix` | Propose corrected SQL for a previously identified issue |
| `approve` | Mark the query as acceptable (ends episode) |
| `request_more_context` | Ask for additional schema information |
## Reward Design
Rewards are deterministic and shaped for partial progress throughout the trajectory:
- **Correct issue identification**: +0.10 to +0.45 scaled by issue severity, confidence, and discovery order
- **Valid fix suggestion**: +0.08 to +0.10 bonus
- **Confidence bonus**: up to +0.05 for high-confidence correct identifications
- **Discovery order bonus**: +0.04 for first issue found, diminishing for subsequent finds
- **False positive**: β0.10 penalty
- **Duplicate identification**: β0.02 penalty
- **Approving with missed issues**: β0.15 per missed issue
- **Complete correct approval**: +0.20
- **Request context when schema available**: β0.03 penalty (encourages using provided schema)
### Reward Properties for RL Training
- **Dense**: Every step returns a non-zero signal, enabling credit assignment
- **Bounded**: Per-step rewards in [-1.0, +0.45], episode scores in (0, 1)
- **Shaped**: Partial credit for partial coverage β no cliff between "found 2 of 3" and "found 3 of 3"
- **Deterministic**: Same actions always produce the same rewards (no randomness in grading)
- **Discriminative**: Hard tasks require multi-step reasoning; easy tasks reward quick identification
## Task Bank
The environment ships with **20 tasks** across three difficulty levels:
| Difficulty | Count | Examples | Score Range |
|---|---|---|---|
| Easy | 7 | Misspelled keywords, missing FROM, = NULL vs IS NULL, DELETE without WHERE, self-comparison | ~0.60β0.90 |
| Medium | 7 | SELECT *, missing LIMIT, correlated subqueries, function on indexed column, ORDER BY RAND() | ~0.30β0.65 |
| Hard | 6 | SQL injection, privilege escalation, PII leakage, self-join optimization, race conditions | ~0.15β0.45 |
Each ground truth issue includes 8-12 keywords and synonyms for robust fuzzy matching, plus
bigram matching to catch common two-word phrases LLMs use (e.g., "sql injection", "missing where").
Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`
## Action & Observation Spaces
**Action** (`SQLReviewAction`):
- `action_type`: identify_issue | suggest_fix | approve | request_more_context
- `issue_category`: syntax | performance | security | logic | style
- `issue_description`: concise statement of the problem
- `suggested_fix`: corrected SQL fragment
- `confidence`: float 0.0β1.0
**Observation** (`SQLReviewObservation`):
- `query`: the full SQL query text
- `schema_info`: dict of table β column definitions
- `context`: natural language description of query intent
- `issues_found_so_far`: previously identified issues this episode
- `remaining_actions`: steps left before episode ends
- `difficulty`: easy | medium | hard
- `feedback`: result of last action
## Repository Layout
```
.
βββ openenv.yaml
βββ models.py
βββ client.py
βββ inference.py β baseline agent (root directory)
βββ Dockerfile
βββ sql_query_reviewer/ β typed models and client package
βββ server/ β FastAPI environment server
β βββ environment.py β reset(), step(), state()
β βββ grader.py β deterministic scoring with bigram matching
β βββ reward.py β per-step reward with order bonus
β βββ app.py β HTTP routes
βββ tasks/ β 20 SQL query tasks (JSON)
βββ tests/ β pytest suite
```
## Local Development
```bash
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -e .[dev]
uvicorn server.app:app --reload --port 8000
```
Test the API:
```bash
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"easy_001"}'
curl http://localhost:8000/state
pytest
```
## Docker
```bash
docker build -t sql-query-reviewer .
docker run -p 8000:8000 sql-query-reviewer
```
## Inference
```bash
export ENV_BASE_URL=http://localhost:8000
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_xxx
python inference.py
```
The script runs all 20 tasks and emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.
## Hugging Face Spaces
This repo is Space-ready: HF YAML front matter in README, root Dockerfile, API on port 8000. Deploy with:
```bash
git remote add hf https://huggingface.co/spaces/<username>/sql-query-reviewer
git push hf main
```
## Usage Example
```python
from sql_query_reviewer import SQLReviewAction, SQLReviewEnv
with SQLReviewEnv(base_url="https://hellinferno-sql-query-reviewer.hf.space").sync() as env:
result = env.reset(task_id="easy_001")
result = env.step(SQLReviewAction(
action_type="identify_issue",
issue_category="syntax",
issue_description="SELCT is misspelled and should be SELECT",
suggested_fix="SELECT * FROM users WHERE id = 1;",
confidence=0.98,
))
print(result.reward)
print(result.observation.feedback)
```
## Author
**Hellinferno** β Solo participant, Meta PyTorch OpenEnv Hackathon 2026
|