---
title: SQLArenaEnv
sdk: docker
app_port: 8000
pinned: true
tags:
  - openenv
  - rl-environment
  - sql
  - reasoning
  - multi-step
---

# SQLArenaEnv

> **The first OpenEnv compatible environment for training multi-step SQL reasoning agents — where exploration queries are first-class actions.**

[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
[![Tasks](https://img.shields.io/badge/tasks-50-green)]()
[![Difficulty](https://img.shields.io/badge/difficulty-easy→expert-orange)]()

---

## What Is SQLArenaEnv?

Most SQL benchmarks test single-shot generation — the model sees a question and must output the correct query in one attempt. Real-world SQL reasoning doesn't work that way. Analysts explore the data, run investigative queries, check schemas, and refine their approach before committing to a final answer.

**SQLArenaEnv makes exploration a first-class action.**

The agent is given a natural language question and a database. It can run up to 5 **explore** queries to investigate the schema and data — seeing real results — before submitting a final **submit** query that is scored. Agents that explore strategically outperform agents that blindly guess.

This tests the skill that actually matters: **reasoning about data, not just memorizing SQL syntax.**

---

## The Core Mechanic

```
Episode start
     │
     ▼
┌─────────────────────────────────────┐
│  Question: "Find customers who      │
│  placed more than 1 completed order"│
│  Schema: customers, orders,         │
│          order_items                │
└──────────────┬──────────────────────┘
               │
    ┌──────────▼──────────┐
    │   EXPLORE action     │  ← up to 5 free exploration queries
    │   sql = "SELECT *    │    each returns real data rows
    │   FROM customers     │    small -0.02 cost per step
    │   LIMIT 5"           │    to discourage random fishing
    └──────────┬───────────┘
               │  (repeat up to 5 times)
               │
    ┌──────────▼──────────┐
    │   SUBMIT action      │  ← final answer query
    │   sql = "SELECT      │    scored against reference solution
    │   c.name, COUNT(*)   │    1.0 = correct
    │   FROM customers c   │    0.4 = partial
    │   JOIN orders o ...  │    0.0 = wrong
    │   HAVING COUNT > 1"  │   -0.1 = syntax error
    └─────────────────────┘
```

---

## Task Library — 50 Tasks Across 4 Tiers

| Tier | Count | SQL Concepts Tested |
|------|-------|---------------------|
| **Easy** | 10 | SELECT, WHERE, ORDER BY, GROUP BY, basic aggregation |
| **Medium** | 15 | JOINs, HAVING, subqueries, LEFT JOIN, correlated filters |
| **Hard** | 15 | CTEs, window functions (RANK, LAG, PERCENT_RANK), multi-join analytics |
| **Expert** | 10 | Correlated subqueries, financial scoring, multi-CTE chains, complex aggregation |

All tasks use realistic domains: **e-commerce orders, HR systems, retail analytics, banking/finance.** Data is India-relevant (names, cities, currencies in INR).

---

## Reward Structure

```python
# Explore step
reward = -0.02   # small cost — discourages blind fishing

# Submit step
reward = 1.0     # correct answer (exact match)
reward = 0.4     # partial credit (right columns, wrong rows)
reward = 0.0     # wrong answer
reward = -0.1    # SQL syntax error
```

---

## Quick Start

### Install client
```bash
pip install git+https://huggingface.co/spaces/sakthivarshans/sql-arena-env
```

### Use in Python (async)
```python
import asyncio
from sql_arena_env import SQLArenaEnv, SQLArenaAction

async def main():
    async with SQLArenaEnv(base_url="https://sakthivarshans-sql-arena-env.hf.space") as env:

        # Start episode — random task
        result = await env.reset()
        obs = result.observation
        print(f"Question: {obs.question}")
        print(f"Schema:   {obs.schema_info}")

        # Explore the data
        result = await env.step(SQLArenaAction(
            sql="SELECT * FROM customers LIMIT 5",
            query_type="explore"
        ))
        print(f"Sample data: {result.observation.query_result}")

        # Submit final answer
        result = await env.step(SQLArenaAction(
            sql="SELECT c.name, COUNT(*) as order_count FROM customers c JOIN orders o ON c.customer_id = o.customer_id WHERE o.status='completed' GROUP BY c.customer_id HAVING COUNT(*) > 1",
            query_type="submit"
        ))
        print(f"Correct: {result.observation.is_correct}")
        print(f"Reward:  {result.reward}")
        print(f"Feedback: {result.observation.feedback}")

asyncio.run(main())
```

### Load a specific task
```python
result = await env.reset(task_id="hard_002")   # specific task
result = await env.reset(difficulty="medium")   # random from tier
result = await env.reset()                       # fully random
```

### Sync usage
```python
with SQLArenaEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="easy_001")
    result = env.step(SQLArenaAction(sql="SELECT 1", query_type="submit"))
```

---

## HTTP API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/health` | Liveness check |
| `POST` | `/reset` | Start new episode |
| `POST` | `/step` | Execute SQL action |
| `GET` | `/state` | Current episode state |
| `GET` | `/schema` | Action/observation schema |
| `WS` | `/ws` | WebSocket (persistent session, use for training) |

### Reset request body
```json
{ "task_id": "medium_001" }
{ "difficulty": "hard" }
{}
```

### Step request body
```json
{
  "action": {
    "sql": "SELECT name FROM customers LIMIT 5",
    "query_type": "explore"
  }
}
```

---

## Graders

Four deterministic graders, one per difficulty tier. Each runs 3 representative tasks.

```bash
python graders.py
```

```
easy      : 1.0000  ████████████████████
medium    : 1.0000  ████████████████████
hard      : 1.0000  ████████████████████
expert    : 1.0000  ████████████████████
overall   : 1.0000  ████████████████████
```

---

## Run Locally

### With Docker
```bash
docker build -t sql-arena-env:latest -f server/Dockerfile .
docker run -p 8000:8000 sql-arena-env:latest
curl http://localhost:8000/health
```

### Without Docker
```bash
pip install openenv-core
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Run graders
```bash
python graders.py
```

### Run inference script
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_..."
export SQLARENA_TASK="medium_001"
python inference.py
```

---

## Project Structure

```
sql_arena_env/
├── __init__.py              # Package exports
├── models.py                # SQLArenaAction, SQLArenaObservation
├── client.py                # Async typed client (SQLArenaEnv)
├── tasks.py                 # 50 curated SQL tasks with schemas & solutions
├── graders.py               # 4 deterministic graders
├── inference.py             # Hackathon inference script
├── openenv.yaml             # OpenEnv manifest
├── pyproject.toml           # pip installable
└── server/
    ├── app.py               # FastAPI + WebSocket via create_app()
    ├── sql_arena_environment.py  # Core environment logic
    ├── Dockerfile           # openenv-base multi-stage build
    └── requirements.txt
```

---

## Why SQLArenaEnv?

**The gap it fills:** Text to SQL benchmarks like Spider and BIRD measure single shot accuracy. No existing OpenEnv environment measures *multi step SQL reasoning* where the agent can gather information before committing. This is the benchmark that matches how SQL is actually used.

**Why exploration matters for RL training:** An agent that learns to run `SELECT * FROM table LIMIT 5` before attempting a complex GROUP BY query is learning a genuinely useful cognitive strategy, the same strategy a senior data analyst uses. Standard single shot SQL environments cannot teach this. SQLArenaEnv can.

**What improves with training:** GRPO/PPO agents trained on SQLArenaEnv learn to use explore steps strategically, they converge to running schema discovery queries first (`SELECT * FROM sqlite_master`), then sample queries, then submitting. This mirrors expert human behavior and transfers to real SQL tasks.

---

## Citation

```
SQLArenaEnv — Multi-step SQL Reasoning Environment for OpenEnv
OpenEnv Hackathon 2026 — Meta × Hugging Face × Scaler
```