---
title: SQLArenaEnv
sdk: docker
app_port: 8000
pinned: true
tags:
  - openenv
  - rl-environment
  - sql
  - reasoning
  - multi-step
---
# SQLArenaEnv

> **The first OpenEnv-compatible environment for training multi-step SQL reasoning agents, where exploration queries are first-class actions.**

[Built on OpenEnv](https://github.com/meta-pytorch/OpenEnv)
---

## What Is SQLArenaEnv?

Most SQL benchmarks test single-shot generation: the model sees a question and must output the correct query in one attempt. Real-world SQL reasoning doesn't work that way. Analysts explore the data, run investigative queries, check schemas, and refine their approach before committing to a final answer.

**SQLArenaEnv makes exploration a first-class action.**

The agent is given a natural language question and a database. It can run up to 5 **explore** queries to investigate the schema and data (seeing real results) before submitting a final **submit** query that is scored. Agents that explore strategically outperform agents that blindly guess.

This tests the skill that actually matters: **reasoning about data, not just memorizing SQL syntax.**

---
## The Core Mechanic

```
Episode start
      │
      ▼
┌─────────────────────────────────────┐
│ Question: "Find customers who       │
│ placed more than 1 completed order" │
│ Schema: customers, orders,          │
│         order_items                 │
└───────────────┬─────────────────────┘
                │
     ┌──────────▼──────────┐
     │ EXPLORE action      │ ← up to 5 exploration queries
     │ sql = "SELECT *     │   each returns real data rows
     │ FROM customers      │   small -0.02 cost per step
     │ LIMIT 5"            │   to discourage random fishing
     └──────────┬──────────┘
                │ (repeat up to 5 times)
                │
     ┌──────────▼──────────┐
     │ SUBMIT action       │ ← final answer query
     │ sql = "SELECT       │   scored against reference solution
     │ c.name, COUNT(*)    │   1.0 = correct
     │ FROM customers c    │   0.4 = partial
     │ JOIN orders o ...   │   0.0 = wrong
     │ HAVING COUNT(*)>1"  │   -0.1 = syntax error
     └─────────────────────┘
```

---
## Task Library: 50 Tasks Across 4 Tiers

| Tier | Count | SQL Concepts Tested |
|------|-------|---------------------|
| **Easy** | 10 | SELECT, WHERE, ORDER BY, GROUP BY, basic aggregation |
| **Medium** | 15 | JOINs, HAVING, subqueries, LEFT JOIN, correlated filters |
| **Hard** | 15 | CTEs, window functions (RANK, LAG, PERCENT_RANK), multi-join analytics |
| **Expert** | 10 | Correlated subqueries, financial scoring, multi-CTE chains, complex aggregation |

All tasks use realistic domains: **e-commerce orders, HR systems, retail analytics, banking/finance.** Data is India-relevant (names, cities, currencies in INR).
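For a concrete sense of what a tier entry contains, here is a hypothetical task record. The real field names in `tasks.py` may differ, so treat this purely as an illustration of the question/schema/solution pairing:

```python
# Hypothetical shape of one task record; actual field names in tasks.py may differ.
task = {
    "task_id": "medium_001",
    "difficulty": "medium",
    "question": "Find customers who placed more than 1 completed order",
    "schema_info": "customers(customer_id, name, city), "
                   "orders(order_id, customer_id, status)",
    "solution_sql": (
        "SELECT c.name, COUNT(*) AS order_count "
        "FROM customers c JOIN orders o ON c.customer_id = o.customer_id "
        "WHERE o.status = 'completed' "
        "GROUP BY c.customer_id HAVING COUNT(*) > 1"
    ),
}
```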
---

## Reward Structure

```python
# Explore step
reward = -0.02   # small cost: discourages blind fishing

# Submit step
reward = 1.0     # correct answer (exact match)
reward = 0.4     # partial credit (right columns, wrong rows)
reward = 0.0     # wrong answer
reward = -0.1    # SQL syntax error
```
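Since explore steps carry a flat -0.02 cost, an episode's undiscounted return is just the submit reward minus 0.02 per explore query. A small helper (not part of the package, just the arithmetic from the values above) makes this concrete:

```python
EXPLORE_COST = -0.02
SUBMIT_REWARDS = {"correct": 1.0, "partial": 0.4, "wrong": 0.0, "syntax_error": -0.1}

def episode_return(n_explores: int, outcome: str) -> float:
    """Undiscounted return: per-step explore cost plus the final submit reward."""
    if not 0 <= n_explores <= 5:
        raise ValueError("at most 5 explore queries per episode")
    return round(n_explores * EXPLORE_COST + SUBMIT_REWARDS[outcome], 4)

# Three explores followed by a correct submit: 3 * -0.02 + 1.0 = 0.94
```

The upshot: a fully-explored correct episode (0.90) still beats a blind-guess partial credit (0.40), so exploration pays whenever it turns a partial answer into a correct one.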
---

## Quick Start

### Install client

```bash
pip install git+https://huggingface.co/spaces/sakthivarshans/sql-arena-env
```
### Use in Python (async)

```python
import asyncio

from sql_arena_env import SQLArenaEnv, SQLArenaAction

async def main():
    async with SQLArenaEnv(base_url="https://sakthivarshans-sql-arena-env.hf.space") as env:
        # Start episode: random task
        result = await env.reset()
        obs = result.observation
        print(f"Question: {obs.question}")
        print(f"Schema: {obs.schema_info}")

        # Explore the data
        result = await env.step(SQLArenaAction(
            sql="SELECT * FROM customers LIMIT 5",
            query_type="explore"
        ))
        print(f"Sample data: {result.observation.query_result}")

        # Submit final answer
        result = await env.step(SQLArenaAction(
            sql="SELECT c.name, COUNT(*) AS order_count "
                "FROM customers c "
                "JOIN orders o ON c.customer_id = o.customer_id "
                "WHERE o.status = 'completed' "
                "GROUP BY c.customer_id "
                "HAVING COUNT(*) > 1",
            query_type="submit"
        ))
        print(f"Correct: {result.observation.is_correct}")
        print(f"Reward: {result.reward}")
        print(f"Feedback: {result.observation.feedback}")

asyncio.run(main())
```
### Load a specific task

```python
result = await env.reset(task_id="hard_002")   # specific task
result = await env.reset(difficulty="medium")  # random task from a tier
result = await env.reset()                     # fully random
```
### Sync usage

```python
with SQLArenaEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="easy_001")
    result = env.step(SQLArenaAction(sql="SELECT 1", query_type="submit"))
```
---

## HTTP API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/health` | Liveness check |
| `POST` | `/reset` | Start new episode |
| `POST` | `/step` | Execute SQL action |
| `GET` | `/state` | Current episode state |
| `GET` | `/schema` | Action/observation schema |
| `WS` | `/ws` | WebSocket (persistent session, use for training) |

### Reset request body

Any one of the following:

```json
{ "task_id": "medium_001" }
```

```json
{ "difficulty": "hard" }
```

```json
{}
```

### Step request body

```json
{
  "action": {
    "sql": "SELECT name FROM customers LIMIT 5",
    "query_type": "explore"
  }
}
```
---

## Graders

Four deterministic graders, one per difficulty tier. Each runs 3 representative tasks.

```bash
python graders.py
```

```
easy    : 1.0000 ████████████████████
medium  : 1.0000 ████████████████████
hard    : 1.0000 ████████████████████
expert  : 1.0000 ████████████████████
overall : 1.0000 ████████████████████
```
---

## Run Locally

### With Docker

```bash
docker build -t sql-arena-env:latest -f server/Dockerfile .
docker run -p 8000:8000 sql-arena-env:latest
curl http://localhost:8000/health
```

### Without Docker

```bash
pip install openenv-core
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Run graders

```bash
python graders.py
```

### Run inference script

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_..."
export SQLARENA_TASK="medium_001"
python inference.py
```
---

## Project Structure

```
sql_arena_env/
├── __init__.py                   # Package exports
├── models.py                     # SQLArenaAction, SQLArenaObservation
├── client.py                     # Async typed client (SQLArenaEnv)
├── tasks.py                      # 50 curated SQL tasks with schemas & solutions
├── graders.py                    # 4 deterministic graders
├── inference.py                  # Hackathon inference script
├── openenv.yaml                  # OpenEnv manifest
├── pyproject.toml                # pip installable
└── server/
    ├── app.py                    # FastAPI + WebSocket via create_app()
    ├── sql_arena_environment.py  # Core environment logic
    ├── Dockerfile                # openenv-base multi-stage build
    └── requirements.txt
```
---

## Why SQLArenaEnv?

**The gap it fills:** Text-to-SQL benchmarks like Spider and BIRD measure single-shot accuracy. No existing OpenEnv environment measures *multi-step SQL reasoning* where the agent can gather information before committing. This is the benchmark that matches how SQL is actually used.

**Why exploration matters for RL training:** An agent that learns to run `SELECT * FROM table LIMIT 5` before attempting a complex GROUP BY query is learning a genuinely useful cognitive strategy, the same one a senior data analyst uses. Standard single-shot SQL environments cannot teach this. SQLArenaEnv can.

**What improves with training:** GRPO/PPO agents trained on SQLArenaEnv learn to use explore steps strategically: they converge to running schema-discovery queries first (`SELECT * FROM sqlite_master`), then sample queries, then submitting. This mirrors expert human behavior and transfers to real SQL tasks.
---

## Citation

```
SQLArenaEnv – Multi-step SQL Reasoning Environment for OpenEnv
OpenEnv Hackathon 2026 – Meta × Hugging Face × Scaler
```