sql-arena-env / README.md
sakthivarshans's picture
Updated Readme for better reference
608369f
---
title: SQLArenaEnv
sdk: docker
app_port: 8000
pinned: true
tags:
- openenv
- rl-environment
- sql
- reasoning
- multi-step
---
# SQLArenaEnv
> **The first OpenEnv compatible environment for training multi-step SQL reasoning agents β€” where exploration queries are first-class actions.**
[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
[![Tasks](https://img.shields.io/badge/tasks-50-green)]()
[![Difficulty](https://img.shields.io/badge/difficulty-easy→expert-orange)]()
---
## What Is SQLArenaEnv?
Most SQL benchmarks test single-shot generation β€” the model sees a question and must output the correct query in one attempt. Real-world SQL reasoning doesn't work that way. Analysts explore the data, run investigative queries, check schemas, and refine their approach before committing to a final answer.
**SQLArenaEnv makes exploration a first-class action.**
The agent is given a natural language question and a database. It can run up to 5 **explore** queries to investigate the schema and data β€” seeing real results β€” before submitting a final **submit** query that is scored. Agents that explore strategically outperform agents that blindly guess.
This tests the skill that actually matters: **reasoning about data, not just memorizing SQL syntax.**
---
## The Core Mechanic
```
Episode start
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Question: "Find customers who β”‚
β”‚ placed more than 1 completed order"β”‚
β”‚ Schema: customers, orders, β”‚
β”‚ order_items β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ EXPLORE action β”‚ ← up to 5 free exploration queries
β”‚ sql = "SELECT * β”‚ each returns real data rows
β”‚ FROM customers β”‚ small -0.02 cost per step
β”‚ LIMIT 5" β”‚ to discourage random fishing
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ (repeat up to 5 times)
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ SUBMIT action β”‚ ← final answer query
β”‚ sql = "SELECT β”‚ scored against reference solution
β”‚ c.name, COUNT(*) β”‚ 1.0 = correct
β”‚ FROM customers c β”‚ 0.4 = partial
β”‚ JOIN orders o ... β”‚ 0.0 = wrong
β”‚ HAVING COUNT > 1" β”‚ -0.1 = syntax error
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
---
## Task Library β€” 50 Tasks Across 4 Tiers
| Tier | Count | SQL Concepts Tested |
|------|-------|---------------------|
| **Easy** | 10 | SELECT, WHERE, ORDER BY, GROUP BY, basic aggregation |
| **Medium** | 15 | JOINs, HAVING, subqueries, LEFT JOIN, correlated filters |
| **Hard** | 15 | CTEs, window functions (RANK, LAG, PERCENT_RANK), multi-join analytics |
| **Expert** | 10 | Correlated subqueries, financial scoring, multi-CTE chains, complex aggregation |
All tasks use realistic domains: **e-commerce orders, HR systems, retail analytics, banking/finance.** Data is India-relevant (names, cities, currencies in INR).
---
## Reward Structure
```python
# Explore step
reward = -0.02 # small cost β€” discourages blind fishing
# Submit step
reward = 1.0 # correct answer (exact match)
reward = 0.4 # partial credit (right columns, wrong rows)
reward = 0.0 # wrong answer
reward = -0.1 # SQL syntax error
```
---
## Quick Start
### Install client
```bash
pip install git+https://huggingface.co/spaces/sakthivarshans/sql-arena-env
```
### Use in Python (async)
```python
import asyncio
from sql_arena_env import SQLArenaEnv, SQLArenaAction
async def main():
async with SQLArenaEnv(base_url="https://sakthivarshans-sql-arena-env.hf.space") as env:
# Start episode β€” random task
result = await env.reset()
obs = result.observation
print(f"Question: {obs.question}")
print(f"Schema: {obs.schema_info}")
# Explore the data
result = await env.step(SQLArenaAction(
sql="SELECT * FROM customers LIMIT 5",
query_type="explore"
))
print(f"Sample data: {result.observation.query_result}")
# Submit final answer
result = await env.step(SQLArenaAction(
sql="SELECT c.name, COUNT(*) as order_count FROM customers c JOIN orders o ON c.customer_id = o.customer_id WHERE o.status='completed' GROUP BY c.customer_id HAVING COUNT(*) > 1",
query_type="submit"
))
print(f"Correct: {result.observation.is_correct}")
print(f"Reward: {result.reward}")
print(f"Feedback: {result.observation.feedback}")
asyncio.run(main())
```
### Load a specific task
```python
result = await env.reset(task_id="hard_002") # specific task
result = await env.reset(difficulty="medium") # random from tier
result = await env.reset() # fully random
```
### Sync usage
```python
with SQLArenaEnv(base_url="http://localhost:8000").sync() as env:
result = env.reset(task_id="easy_001")
result = env.step(SQLArenaAction(sql="SELECT 1", query_type="submit"))
```
---
## HTTP API Reference
| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/health` | Liveness check |
| `POST` | `/reset` | Start new episode |
| `POST` | `/step` | Execute SQL action |
| `GET` | `/state` | Current episode state |
| `GET` | `/schema` | Action/observation schema |
| `WS` | `/ws` | WebSocket (persistent session, use for training) |
### Reset request body
```json
{ "task_id": "medium_001" }
{ "difficulty": "hard" }
{}
```
### Step request body
```json
{
"action": {
"sql": "SELECT name FROM customers LIMIT 5",
"query_type": "explore"
}
}
```
---
## Graders
Four deterministic graders, one per difficulty tier. Each runs 3 representative tasks.
```bash
python graders.py
```
```
easy : 1.0000 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
medium : 1.0000 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
hard : 1.0000 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
expert : 1.0000 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
overall : 1.0000 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
```
---
## Run Locally
### With Docker
```bash
docker build -t sql-arena-env:latest -f server/Dockerfile .
docker run -p 8000:8000 sql-arena-env:latest
curl http://localhost:8000/health
```
### Without Docker
```bash
pip install openenv-core
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Run graders
```bash
python graders.py
```
### Run inference script
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_..."
export SQLARENA_TASK="medium_001"
python inference.py
```
---
## Project Structure
```
sql_arena_env/
β”œβ”€β”€ __init__.py # Package exports
β”œβ”€β”€ models.py # SQLArenaAction, SQLArenaObservation
β”œβ”€β”€ client.py # Async typed client (SQLArenaEnv)
β”œβ”€β”€ tasks.py # 50 curated SQL tasks with schemas & solutions
β”œβ”€β”€ graders.py # 4 deterministic graders
β”œβ”€β”€ inference.py # Hackathon inference script
β”œβ”€β”€ openenv.yaml # OpenEnv manifest
β”œβ”€β”€ pyproject.toml # pip installable
└── server/
β”œβ”€β”€ app.py # FastAPI + WebSocket via create_app()
β”œβ”€β”€ sql_arena_environment.py # Core environment logic
β”œβ”€β”€ Dockerfile # openenv-base multi-stage build
└── requirements.txt
```
---
## Why SQLArenaEnv?
**The gap it fills:** Text to SQL benchmarks like Spider and BIRD measure single shot accuracy. No existing OpenEnv environment measures *multi step SQL reasoning* where the agent can gather information before committing. This is the benchmark that matches how SQL is actually used.
**Why exploration matters for RL training:** An agent that learns to run `SELECT * FROM table LIMIT 5` before attempting a complex GROUP BY query is learning a genuinely useful cognitive strategy, the same strategy a senior data analyst uses. Standard single shot SQL environments cannot teach this. SQLArenaEnv can.
**What improves with training:** GRPO/PPO agents trained on SQLArenaEnv learn to use explore steps strategically, they converge to running schema discovery queries first (`SELECT * FROM sqlite_master`), then sample queries, then submitting. This mirrors expert human behavior and transfers to real SQL tasks.
---
## Citation
```
SQLArenaEnv β€” Multi-step SQL Reasoning Environment for OpenEnv
OpenEnv Hackathon 2026 β€” Meta Γ— Hugging Face Γ— Scaler
```