---
title: SQLArenaEnv
sdk: docker
app_port: 8000
pinned: true
tags:
  - openenv
  - rl-environment
  - sql
  - reasoning
  - multi-step
---
# SQLArenaEnv

The first OpenEnv-compatible environment for training multi-step SQL reasoning agents, where exploration queries are first-class actions.
## What Is SQLArenaEnv?

Most SQL benchmarks test single-shot generation: the model sees a question and must output the correct query in one attempt. Real-world SQL reasoning doesn't work that way. Analysts explore the data, run investigative queries, check schemas, and refine their approach before committing to a final answer.
SQLArenaEnv makes exploration a first-class action.
The agent is given a natural language question and a database. It can run up to 5 `explore` queries to investigate the schema and data, seeing real results, before submitting a final `submit` query that is scored. Agents that explore strategically outperform agents that blindly guess.
This tests the skill that actually matters: reasoning about data, not just memorizing SQL syntax.
## The Core Mechanic

```
Episode start
      │
      ▼
┌───────────────────────────────────────┐
│ Question: "Find customers who         │
│  placed more than 1 completed order"  │
│ Schema: customers, orders,            │
│  order_items                          │
└──────────────┬────────────────────────┘
               │
    ┌──────────▼──────────┐
    │ EXPLORE action      │ ← up to 5 free exploration queries
    │ sql = "SELECT *     │   each returns real data rows
    │  FROM customers     │   small -0.02 cost per step
    │  LIMIT 5"           │   to discourage random fishing
    └──────────┬──────────┘
               │ (repeat up to 5 times)
               │
    ┌──────────▼──────────┐
    │ SUBMIT action       │ ← final answer query
    │ sql = "SELECT       │   scored against reference solution
    │  c.name, COUNT(*)   │    1.0 = correct
    │  FROM customers c   │    0.4 = partial
    │  JOIN orders o ...  │    0.0 = wrong
    │  HAVING COUNT > 1"  │   -0.1 = syntax error
    └─────────────────────┘
```
## Task Library: 50 Tasks Across 4 Tiers
| Tier | Count | SQL Concepts Tested |
|---|---|---|
| Easy | 10 | SELECT, WHERE, ORDER BY, GROUP BY, basic aggregation |
| Medium | 15 | JOINs, HAVING, subqueries, LEFT JOIN, correlated filters |
| Hard | 15 | CTEs, window functions (RANK, LAG, PERCENT_RANK), multi-join analytics |
| Expert | 10 | Correlated subqueries, financial scoring, multi-CTE chains, complex aggregation |
All tasks use realistic domains: e-commerce orders, HR systems, retail analytics, banking/finance. Data is India-relevant (names, cities, currencies in INR).
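To make the tier structure concrete, here is a sketch of what one task entry might look like. The field names and the example solution below are illustrative assumptions, not the actual schema used in `tasks.py`:

```python
# Hypothetical shape of one curated task entry.
# Field names are assumptions for illustration only.
TASK = {
    "task_id": "medium_001",
    "difficulty": "medium",
    "question": "Find customers who placed more than 1 completed order",
    "tables": ["customers", "orders", "order_items"],
    "solution_sql": (
        "SELECT c.name, COUNT(*) AS order_count "
        "FROM customers c JOIN orders o ON c.customer_id = o.customer_id "
        "WHERE o.status = 'completed' "
        "GROUP BY c.customer_id HAVING COUNT(*) > 1"
    ),
}
```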
## Reward Structure

```python
# Explore step
reward = -0.02   # small cost, discourages blind fishing

# Submit step
reward = 1.0     # correct answer (exact match)
reward = 0.4     # partial credit (right columns, wrong rows)
reward = 0.0     # wrong answer
reward = -0.1    # SQL syntax error
```
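The scoring rules above can be condensed into a single function. This is a sketch that mirrors the reward table, not the environment's actual implementation; the boolean flag names are assumptions standing in for the grader's internal signals:

```python
def step_reward(query_type, syntax_ok=True, columns_match=False, rows_match=False):
    """Illustrative reward logic mirroring the reward table above.

    The flags are hypothetical stand-ins for the environment's internal
    grading signals, not part of its real API.
    """
    if query_type == "explore":
        return -0.02          # flat exploration cost per step
    if not syntax_ok:
        return -0.1           # SQL syntax error
    if columns_match and rows_match:
        return 1.0            # exact match against the reference result
    if columns_match:
        return 0.4            # partial credit: right columns, wrong rows
    return 0.0                # wrong answer
```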
## Quick Start

### Install client

```bash
pip install git+https://huggingface.co/spaces/sakthivarshans/sql-arena-env
```
### Use in Python (async)

```python
import asyncio
from sql_arena_env import SQLArenaEnv, SQLArenaAction

async def main():
    async with SQLArenaEnv(base_url="https://sakthivarshans-sql-arena-env.hf.space") as env:
        # Start episode (random task)
        result = await env.reset()
        obs = result.observation
        print(f"Question: {obs.question}")
        print(f"Schema: {obs.schema_info}")

        # Explore the data
        result = await env.step(SQLArenaAction(
            sql="SELECT * FROM customers LIMIT 5",
            query_type="explore"
        ))
        print(f"Sample data: {result.observation.query_result}")

        # Submit final answer
        result = await env.step(SQLArenaAction(
            sql="SELECT c.name, COUNT(*) as order_count FROM customers c JOIN orders o ON c.customer_id = o.customer_id WHERE o.status='completed' GROUP BY c.customer_id HAVING COUNT(*) > 1",
            query_type="submit"
        ))
        print(f"Correct: {result.observation.is_correct}")
        print(f"Reward: {result.reward}")
        print(f"Feedback: {result.observation.feedback}")

asyncio.run(main())
```
### Load a specific task

```python
result = await env.reset(task_id="hard_002")   # specific task
result = await env.reset(difficulty="medium")  # random from tier
result = await env.reset()                     # fully random
```
### Sync usage

```python
with SQLArenaEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="easy_001")
    result = env.step(SQLArenaAction(sql="SELECT 1", query_type="submit"))
```
## HTTP API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Liveness check |
| POST | /reset | Start new episode |
| POST | /step | Execute SQL action |
| GET | /state | Current episode state |
| GET | /schema | Action/observation schema |
| WS | /ws | WebSocket (persistent session, use for training) |
### Reset request body

```json
{ "task_id": "medium_001" }
{ "difficulty": "hard" }
{}
```
### Step request body

```json
{
  "action": {
    "sql": "SELECT name FROM customers LIMIT 5",
    "query_type": "explore"
  }
}
```
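If you are driving the HTTP API directly rather than through the typed client, the request bodies above can be built with small helpers. This is a sketch; the helper names are mine, and only the body shapes come from the reference above:

```python
import json

def reset_body(task_id=None, difficulty=None):
    """Build a /reset body per the reference above; {} means fully random."""
    body = {}
    if task_id is not None:
        body["task_id"] = task_id
    if difficulty is not None:
        body["difficulty"] = difficulty
    return body

def step_body(sql, query_type="explore"):
    """Build a /step body; query_type is 'explore' or 'submit'."""
    if query_type not in ("explore", "submit"):
        raise ValueError("query_type must be 'explore' or 'submit'")
    return {"action": {"sql": sql, "query_type": query_type}}

# POST these with any HTTP client, e.g.:
#   requests.post(f"{base}/step", json=step_body("SELECT 1", "submit"))
print(json.dumps(step_body("SELECT name FROM customers LIMIT 5")))
```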
## Graders

Four deterministic graders, one per difficulty tier. Each runs 3 representative tasks.

```bash
python graders.py
```

```
easy    : 1.0000 ████████████████████
medium  : 1.0000 ████████████████████
hard    : 1.0000 ████████████████████
expert  : 1.0000 ████████████████████
overall : 1.0000 ████████████████████
```
## Run Locally

### With Docker

```bash
docker build -t sql-arena-env:latest -f server/Dockerfile .
docker run -p 8000:8000 sql-arena-env:latest
curl http://localhost:8000/health
```

### Without Docker

```bash
pip install openenv-core
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Run graders

```bash
python graders.py
```
### Run inference script

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_..."
export SQLARENA_TASK="medium_001"
python inference.py
```
## Project Structure

```
sql_arena_env/
├── __init__.py       # Package exports
├── models.py         # SQLArenaAction, SQLArenaObservation
├── client.py         # Async typed client (SQLArenaEnv)
├── tasks.py          # 50 curated SQL tasks with schemas & solutions
├── graders.py        # 4 deterministic graders
├── inference.py      # Hackathon inference script
├── openenv.yaml      # OpenEnv manifest
├── pyproject.toml    # pip installable
└── server/
    ├── app.py                    # FastAPI + WebSocket via create_app()
    ├── sql_arena_environment.py  # Core environment logic
    ├── Dockerfile                # openenv-base multi-stage build
    └── requirements.txt
```
## Why SQLArenaEnv?

**The gap it fills:** Text-to-SQL benchmarks like Spider and BIRD measure single-shot accuracy. No existing OpenEnv environment measures multi-step SQL reasoning where the agent can gather information before committing. This is the benchmark that matches how SQL is actually used.

**Why exploration matters for RL training:** An agent that learns to run `SELECT * FROM table LIMIT 5` before attempting a complex GROUP BY query is learning a genuinely useful cognitive strategy, the same strategy a senior data analyst uses. Standard single-shot SQL environments cannot teach this. SQLArenaEnv can.

**What improves with training:** GRPO/PPO agents trained on SQLArenaEnv learn to use explore steps strategically: they converge to running schema discovery queries first (`SELECT * FROM sqlite_master`), then sample queries, then submitting. This mirrors expert human behavior and transfers to real SQL tasks.
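The converged explore-then-submit pattern described above can be approximated by a scripted baseline. This is a toy sketch for intuition only; the function name and schedule are mine, not part of the package:

```python
def scripted_action(step_idx, final_sql, max_explores=3):
    """Toy fixed policy mirroring the expert pattern described above:
    schema discovery first, then data sampling, then the final answer.
    Returns an (action_type, sql) pair."""
    if step_idx == 0:
        # Discover tables via the SQLite catalog, as described above.
        return ("explore", "SELECT name, sql FROM sqlite_master WHERE type='table'")
    if step_idx < max_explores:
        # Peek at a few rows before committing to an answer.
        return ("explore", "SELECT * FROM customers LIMIT 5")
    return ("submit", final_sql)
```

A trained policy does this adaptively per task; the point of the environment is that this behavior is learnable from the reward signal alone.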
## Citation

SQLArenaEnv: Multi-step SQL Reasoning Environment for OpenEnv.
OpenEnv Hackathon 2026, Meta × Hugging Face × Scaler.