
	---
	title: SQLArenaEnv
	sdk: docker
	app_port: 8000
	pinned: true
	tags:
	- openenv
	- rl-environment
	- sql
	- reasoning
	- multi-step
	---

	# SQLArenaEnv

	> The first OpenEnv-compatible environment for training multi-step SQL reasoning agents, where exploration queries are first-class actions.

	[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
	[![Tasks](https://img.shields.io/badge/tasks-50-green)]()
	[![Difficulty](https://img.shields.io/badge/difficulty-easy→expert-orange)]()

	---

	## What Is SQLArenaEnv?

	Most SQL benchmarks test single-shot generation — the model sees a question and must output the correct query in one attempt. Real-world SQL reasoning doesn't work that way. Analysts explore the data, run investigative queries, check schemas, and refine their approach before committing to a final answer.

	SQLArenaEnv makes exploration a first-class action.

	The agent is given a natural-language question and a database. It can run up to 5 `explore` queries to investigate the schema and data, seeing real results each time, before submitting a final `submit` query that is scored. Agents that explore strategically outperform agents that guess blindly.

	This tests the skill that actually matters: reasoning about data, not just memorizing SQL syntax.

	---

	## The Core Mechanic

	```
	Episode start
	               │
	               ▼
	┌─────────────────────────────────────┐
	│ Question: "Find customers who       │
	│ placed more than 1 completed order" │
	│ Schema: customers, orders,          │
	│         order_items                 │
	└──────────────┬──────────────────────┘
	               │
	    ┌──────────▼──────────┐
	    │ EXPLORE action      │ ← up to 5 free exploration queries
	    │ sql = "SELECT *     │   each returns real data rows
	    │   FROM customers    │   small -0.02 cost per step
	    │   LIMIT 5"          │   to discourage random fishing
	    └──────────┬──────────┘
	               │ (repeat up to 5 times)
	               │
	    ┌──────────▼──────────┐
	    │ SUBMIT action       │ ← final answer query
	    │ sql = "SELECT       │   scored against reference solution
	    │   c.name, COUNT(*)  │   1.0 = correct
	    │   FROM customers c  │   0.4 = partial
	    │   JOIN orders o ... │   0.0 = wrong
	    │   HAVING COUNT > 1" │  -0.1 = syntax error
	    └─────────────────────┘
	```
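The explore phase of this loop can be tried locally against an in-memory SQLite database. This is a minimal sketch, not the environment's implementation: the `ToyEpisode` class, its sample rows, and the decision to raise an error once the budget is spent are assumptions made for illustration. Only the 5-query budget and the -0.02 explore cost come from this README.

```python
import sqlite3

MAX_EXPLORE = 5        # explore budget stated in this README
EXPLORE_COST = -0.02   # per-explore reward stated in this README


class ToyEpisode:
    """Hypothetical stand-in for one episode; not the real server logic."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(
            """
            CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                                 customer_id INTEGER, status TEXT);
            INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
            INSERT INTO orders VALUES
                (1, 1, 'completed'), (2, 1, 'completed'), (3, 2, 'pending');
            """
        )
        self.explores_used = 0
        self.total_reward = 0.0

    def explore(self, sql):
        """Run an exploration query, charge the step cost, return the rows."""
        if self.explores_used >= MAX_EXPLORE:
            # Assumption: the toy refuses further explores once the budget is spent.
            raise RuntimeError("explore budget exhausted; submit instead")
        self.explores_used += 1
        self.total_reward += EXPLORE_COST
        return self.conn.execute(sql).fetchall()


episode = ToyEpisode()
print(episode.explore("SELECT * FROM customers LIMIT 5"))
print(episode.explore("SELECT DISTINCT status FROM orders"))
print(round(episode.total_reward, 2))
```

Tracking the cumulative reward next to the budget counter makes the trade-off visible: each extra look at the data costs a little, so an agent has a reason to stop exploring once it knows enough to answer.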

	---

	## Task Library — 50 Tasks Across 4 Tiers

	| Tier | Count | SQL Concepts Tested |
	|------|-------|---------------------|
	| Easy | 10 | SELECT, WHERE, ORDER BY, GROUP BY, basic aggregation |
	| Medium | 15 | JOINs, HAVING, subqueries, LEFT JOIN, correlated filters |
	| Hard | 15 | CTEs, window functions (RANK, LAG, PERCENT_RANK), multi-join analytics |
	| Expert | 10 | Correlated subqueries, financial scoring, multi-CTE chains, complex aggregation |

	All tasks use realistic domains: e-commerce orders, HR systems, retail analytics, and banking/finance. The data uses Indian names and cities, with amounts in INR.

	---

	## Reward Structure

	```python
	# Explore step
	reward = -0.02 # small cost — discourages blind fishing

	# Submit step
	reward = 1.0 # correct answer (exact match)
	reward = 0.4 # partial credit (right columns, wrong rows)
	reward = 0.0 # wrong answer
	reward = -0.1 # SQL syntax error
	```
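One way to turn a submitted query into these reward values is to compare its result set with the reference query's. The sketch below is an assumption about how such scoring could work, not the environment's actual grader: `score_submission` is a hypothetical helper, "exact match" is taken to mean the same rows regardless of order, "right columns" is taken to mean identical column names, and any SQLite error is treated as a syntax error.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Execute a query and return (column names, rows)."""
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description or []]
    return cols, cur.fetchall()


def score_submission(conn, submitted_sql, reference_sql):
    """Map a submitted query to the reward values listed above (sketch)."""
    try:
        cols, rows = run(conn, submitted_sql)
    except sqlite3.Error:
        return -0.1  # assumption: any SQLite error counts as a syntax error
    ref_cols, ref_rows = run(conn, reference_sql)
    if Counter(rows) == Counter(ref_rows):
        return 1.0  # assumption: row order is ignored
    if cols == ref_cols:
        return 0.4  # right columns, wrong rows
    return 0.0


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE t (name TEXT, n INTEGER);"
    "INSERT INTO t VALUES ('a', 1), ('b', 2);"
)
reference = "SELECT name FROM t WHERE n > 1"
print(score_submission(conn, "SELECT name FROM t WHERE n = 2", reference))  # 1.0
print(score_submission(conn, "SELECT name FROM t", reference))              # 0.4
print(score_submission(conn, "SELECT n FROM t", reference))                 # 0.0
print(score_submission(conn, "SELEC name FROM t", reference))               # -0.1
```

Comparing rows as a multiset tolerates different `ORDER BY` choices while still catching missing or duplicated rows; a task whose question asks for a specific ordering would need a stricter comparison.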

	---

	## Quick Start

	### Install client
	```bash
	pip install git+https://huggingface.co/spaces/sakthivarshans/sql-arena-env
	```

	### Use in Python (async)
	```python
	import asyncio
	from sql_arena_env import SQLArenaEnv, SQLArenaAction

	async def main():
	    async with SQLArenaEnv(base_url="https://sakthivarshans-sql-arena-env.hf.space") as env:
	        # Start an episode with a random task
	        result = await env.reset()
	        obs = result.observation
	        print(f"Question: {obs.question}")
	        print(f"Schema: {obs.schema_info}")

	        # Explore the data
	        result = await env.step(SQLArenaAction(
	            sql="SELECT * FROM customers LIMIT 5",
	            query_type="explore",
	        ))
	        print(f"Sample data: {result.observation.query_result}")

	        # Submit the final answer
	        result = await env.step(SQLArenaAction(
	            sql=(
	                "SELECT c.name, COUNT(*) AS order_count "
	                "FROM customers c "
	                "JOIN orders o ON c.customer_id = o.customer_id "
	                "WHERE o.status = 'completed' "
	                "GROUP BY c.customer_id "
	                "HAVING COUNT(*) > 1"
	            ),
	            query_type="submit",
	        ))
	        print(f"Correct: {result.observation.is_correct}")
	        print(f"Reward: {result.reward}")
	        print(f"Feedback: {result.observation.feedback}")

	asyncio.run(main())
	```

	### Load a specific task
	```python
	# Run inside an async function, with `env` already open
	result = await env.reset(task_id="hard_002")    # specific task
	result = await env.reset(difficulty="medium")   # random task from that tier
	result = await env.reset()                      # fully random
	```

	### Sync usage
	```python
	from sql_arena_env import SQLArenaEnv, SQLArenaAction

	with SQLArenaEnv(base_url="http://localhost:8000").sync() as env:
	    result = env.reset(task_id="easy_001")
	    result = env.step(SQLArenaAction(sql="SELECT 1", query_type="submit"))
	```

	---

	## HTTP API Reference

	| Method | Endpoint | Description |
	|--------|----------|-------------|
	| `GET` | `/health` | Liveness check |
	| `POST` | `/reset` | Start new episode |
	| `POST` | `/step` | Execute SQL action |
	| `GET` | `/state` | Current episode state |
	| `GET` | `/schema` | Action/observation schema |
	| `WS` | `/ws` | WebSocket (persistent session, use for training) |

	### Reset request body
	```json
	{ "task_id": "medium_001" }
	{ "difficulty": "hard" }
	{}
	```

	### Step request body
	```json
	{
	  "action": {
	    "sql": "SELECT name FROM customers LIMIT 5",
	    "query_type": "explore"
	  }
	}
	```
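These request bodies can be built and sent with the Python standard library alone. The following is a minimal sketch, assuming the server accepts `Content-Type: application/json` and replies with JSON. The helper names (`reset_payload`, `step_payload`, `post_json`) are hypothetical, and the response shape is not documented here, so it is returned as parsed JSON without interpretation.

```python
import json
import urllib.request


def reset_payload(task_id=None, difficulty=None):
    """Build a /reset body; an empty dict requests a random task."""
    body = {}
    if task_id is not None:
        body["task_id"] = task_id
    elif difficulty is not None:
        body["difficulty"] = difficulty
    return body


def step_payload(sql, query_type):
    """Build a /step body in the shape shown above."""
    if query_type not in ("explore", "submit"):
        raise ValueError("query_type must be 'explore' or 'submit'")
    return {"action": {"sql": sql, "query_type": query_type}}


def post_json(base_url, path, payload, timeout=30):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Print the bodies only; no network call is made here.
print(json.dumps(reset_payload(difficulty="hard")))
print(json.dumps(step_payload("SELECT name FROM customers LIMIT 5", "explore")))
```

With a server running locally, `post_json("http://localhost:8000", "/reset", reset_payload())` would send the empty-body variant. Since the table above recommends the WebSocket endpoint for persistent sessions, plain HTTP calls like these are probably best suited to smoke tests.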

	---

	## Graders

	Four deterministic graders, one per difficulty tier. Each runs 3 representative tasks.

	```bash
	python graders.py
	```

	```
	easy : 1.0000 ████████████████████
	medium : 1.0000 ████████████████████
	hard : 1.0000 ████████████████████
	expert : 1.0000 ████████████████████
	overall : 1.0000 ████████████████████
	```
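The report format above can be approximated in a few lines. This is a sketch only: `graders.py` is not reproduced in this README, so the aggregation (mean of per-task rewards clamped to [0, 1], with the overall score as the mean of tier scores), the helper names, and the sample rewards are all assumptions.

```python
from statistics import mean

BAR_WIDTH = 20  # the sample output above shows 20 blocks for a score of 1.0


def tier_score(task_rewards):
    """Average per-task rewards, clamped to the range [0, 1]."""
    return max(0.0, min(1.0, mean(task_rewards)))


def render_line(label, score):
    """Format one report line with a proportional block bar."""
    bar = "█" * round(score * BAR_WIDTH)
    return f"{label:<8}: {score:.4f} {bar}"


# Hypothetical per-task rewards, three tasks per tier.
rewards = {
    "easy": [1.0, 1.0, 1.0],
    "medium": [1.0, 0.4, 1.0],
}
scores = {tier: tier_score(values) for tier, values in rewards.items()}
scores["overall"] = mean(scores.values())
for label, score in scores.items():
    print(render_line(label, score))
```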

	---

	## Run Locally

	### With Docker
	```bash
	docker build -t sql-arena-env:latest -f server/Dockerfile .
	docker run -p 8000:8000 sql-arena-env:latest
	curl http://localhost:8000/health
	```

	### Without Docker
	```bash
	pip install openenv-core
	pip install -e .
	uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	### Run graders
	```bash
	python graders.py
	```

	### Run inference script
	```bash
	export API_BASE_URL="https://router.huggingface.co/v1"
	export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
	export HF_TOKEN="hf_..."
	export SQLARENA_TASK="medium_001"
	python inference.py
	```

	---

	## Project Structure

	```
	sql_arena_env/
	├── __init__.py          # Package exports
	├── models.py            # SQLArenaAction, SQLArenaObservation
	├── client.py            # Async typed client (SQLArenaEnv)
	├── tasks.py             # 50 curated SQL tasks with schemas & solutions
	├── graders.py           # 4 deterministic graders
	├── inference.py         # Hackathon inference script
	├── openenv.yaml         # OpenEnv manifest
	├── pyproject.toml       # pip installable
	└── server/
	    ├── app.py                    # FastAPI + WebSocket via create_app()
	    ├── sql_arena_environment.py  # Core environment logic
	    ├── Dockerfile                # openenv-base multi-stage build
	    └── requirements.txt
	```

	---

	## Why SQLArenaEnv?

	The gap it fills: Text-to-SQL benchmarks like Spider and BIRD measure single-shot accuracy. No existing OpenEnv environment measures multi-step SQL reasoning in which the agent can gather information before committing. SQLArenaEnv is designed to match how SQL is actually used.

	Why exploration matters for RL training: An agent that learns to run `SELECT * FROM table LIMIT 5` before attempting a complex GROUP BY query is learning a genuinely useful strategy, the same one a senior data analyst uses. Single-shot SQL environments cannot teach this; SQLArenaEnv can.

	What improves with training: GRPO/PPO agents trained on SQLArenaEnv learn to use explore steps strategically. They converge on running schema discovery queries first (`SELECT * FROM sqlite_master`), then sample queries, then submitting. This mirrors expert human behavior and transfers to real SQL tasks.
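The schema-first ordering described here can be written down as a fixed exploration plan. This is a minimal sketch, assuming a SQLite backend (as the `sqlite_master` query suggests); the `plan_explore_queries` helper and the two example tables are hypothetical, and a trained agent would learn such a plan rather than follow a hard-coded one.

```python
import sqlite3


def plan_explore_queries(conn, budget=5):
    """Return explore queries: schema discovery first, then one sample per table."""
    queries = ["SELECT name, sql FROM sqlite_master WHERE type = 'table'"]
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for table in tables:
        if len(queries) >= budget:
            break  # stay within the explore budget
        queries.append(f'SELECT * FROM "{table}" LIMIT 5')
    return queries


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    """
)
for query in plan_explore_queries(conn):
    print(query)
```

The helper reads table names straight from a local connection for brevity. Against the environment itself, the agent would issue the first query as an `explore` action and derive the sampling queries from the rows it gets back.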

	---

	## Citation

	```
	SQLArenaEnv — Multi-step SQL Reasoning Environment for OpenEnv
	OpenEnv Hackathon 2026 — Meta × Hugging Face × Scaler
	```