---
title: NL2SQL Bench
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# NL2SQL-Bench

**Natural Language to SQL Analytics Environment for RL Training**

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [Python 3.10+](https://www.python.org) · [MIT License](LICENSE)

---
## What is NL2SQL-Bench?

NL2SQL-Bench is an OpenEnv-compliant RL training environment in which an AI agent iteratively writes and refines **SQLite queries** to answer natural-language business questions against a synthetic e-commerce database.

This fills a genuine gap in the OpenEnv ecosystem: no SQL query environment currently exists. Every data-driven company employs analysts who translate business questions into SQL. Training agents to do this well (and to recover from errors) is immediately valuable.

**Why it's a great RL domain:**

- Rewards are **100% deterministic**: no LLM-as-judge, no subjectivity
- Multi-turn episodes create a **dense reward signal** across the trajectory
- The error → fix → retry loop is a novel mechanic not present in existing environments
- Three clearly graduated difficulty levels challenge models across the full skill range

---
## Environment Description

The agent interacts with a **synthetic e-commerce SQLite database** containing ~150 customers, 64 products across 8 categories, ~600 orders, ~1000 order items, and ~400 reviews. The database is seeded deterministically (seed=42), so results are reproducible on any machine.

The agent receives a natural-language question and iteratively submits SQL queries. Each query is executed and graded against the ground truth, and the reward plus any error or result rows is fed back as the next observation, as sketched below.
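A minimal sketch of that loop in client code, assuming the `NL2SQLEnv` client and `NL2SQLAction` documented below; `propose_query` is a hypothetical policy callable, not part of this repo:

```python
from models import NL2SQLAction  # defined in this repo (see below)


async def run_episode(env, propose_query):
    """Drive one episode. propose_query is a hypothetical policy callable
    mapping (question, schema, last_error) to a SQL string, e.g. an LLM call."""
    result = await env.reset()
    while not result.done:
        obs = result.observation
        sql = propose_query(obs.question, obs.schema_context, obs.last_error)
        result = await env.step(NL2SQLAction(query=sql))
    return result.observation.score
```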
---

## Database Schema

```
categories(id, name)
products(id, name, category_id, price, stock_quantity)
customers(id, name, email, country, tier ∈ {bronze|silver|gold}, created_at)
orders(id, customer_id, status ∈ {pending|processing|shipped|delivered|cancelled},
       created_at, total_amount)
order_items(id, order_id, product_id, quantity, unit_price)
reviews(id, product_id, customer_id, rating ∈ 1-5, created_at)
```

All dates are ISO-8601 strings, sortable by text comparison. SQLite window functions and CTEs are fully supported.
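For example, a date-range filter against this schema can use plain string comparison (a sketch; the exact columns any given task expects are defined by its ground truth, not this example):

```sql
-- ISO-8601 text dates make range filters simple string comparisons
SELECT id, total_amount
FROM orders
WHERE status = 'delivered'
  AND created_at >= '2024-01-01'
  AND created_at <  '2025-01-01';
```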
---

## Action & Observation Space

### Action

```python
@dataclass
class NL2SQLAction(Action):
    query: str  # A SQLite SELECT query string
```

### Observation

```python
@dataclass
class NL2SQLObservation(Observation):
    question: str                # The NL question to answer
    schema_context: str          # Compact schema description
    task_name: str               # Active task identifier
    last_query: str              # SQL submitted on previous step
    last_result: List[Dict]      # Up to 10 result rows
    last_error: Optional[str]    # SQLite error string or None
    result_columns: List[str]    # Column names of last_result
    step: int                    # Current step (0 after reset)
    max_steps: int               # Maximum steps per episode
    done: bool                   # Episode ended?
    reward: Optional[float]      # Step reward in [0.0, 1.0]
    score: float                 # Cumulative normalised score
```
---

## Tasks & Expected Difficulty

### Task 1: `simple-filter` (easy)

Single-table SELECT queries with WHERE, ORDER BY, and LIMIT. Tests basic SQL fluency. Example questions:

- "List all gold-tier customers ordered by name alphabetically."
- "Return the top 5 most expensive products."

**Expected solve rate (frontier model, 5 steps):** ~80%
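One plausible solution to the second question (the exact ground-truth columns are an assumption here):

```sql
-- Top 5 most expensive products
SELECT id, name, price
FROM products
ORDER BY price DESC
LIMIT 5;
```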
### Task 2: `join-aggregation` (medium)

Multi-table JOINs with GROUP BY, HAVING, and aggregation functions. Example questions:

- "How many orders has each customer placed? Include customers with zero orders."
- "Which customers have spent more than $500 total on delivered orders?"

**Expected solve rate (frontier model, 5 steps):** ~55%
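One plausible shape for the first question; the LEFT JOIN is the part models commonly miss, since an INNER JOIN silently drops customers with zero orders:

```sql
-- Orders per customer, including customers who never ordered
SELECT c.id, c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;
```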
### Task 3: `analytics-window` (hard)

CTEs, window functions (DENSE_RANK, ROW_NUMBER, running SUM), and nested subqueries. Example questions:

- "Rank customers by total spending using DENSE_RANK."
- "Show monthly revenue and running total for delivered orders in 2024."

**Expected solve rate (frontier model, 5 steps):** ~30%
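One plausible shape for the second question (column names and month format are assumptions): a CTE aggregates revenue per month, then a window SUM produces the running total:

```sql
-- Monthly revenue and running total for delivered orders in 2024
WITH monthly AS (
    SELECT substr(created_at, 1, 7) AS month,   -- 'YYYY-MM'
           SUM(total_amount)        AS revenue
    FROM orders
    WHERE status = 'delivered'
      AND created_at LIKE '2024-%'
    GROUP BY month
)
SELECT month,
       revenue,
       SUM(revenue) OVER (ORDER BY month) AS running_total
FROM monthly
ORDER BY month;
```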
---

## Reward Function

Rewards are computed by deterministic comparison of the agent's result set against the ground truth:

| Component | Score | Description |
|---|---|---|
| `syntax_ok` | +0.10 | Query runs without a SQLite error |
| `columns_match` | +0.20 | Returned column names match the ground truth |
| `row_count_match` | +0.20 | Number of rows matches |
| `exact_match` | +0.50 | Full result set equals the ground truth |
| `step_penalty` | −0.05/step | Deducted per step beyond the first |

Final reward is clamped to `[0.0, 1.0]`. Order sensitivity matches the ground-truth query: ORDER BY queries require correct row ordering; others are order-agnostic.
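A minimal sketch of how these components could combine (illustrative only; the real implementation lives in `server/grader.py`, and the names here are assumptions):

```python
def grade(rows, truth_rows, cols, truth_cols, error, step, order_sensitive):
    """Illustrative sketch, not the actual grader.

    rows / truth_rows are value tuples here, not the dicts the API returns.
    """
    if error is not None:
        return 0.0                       # a failed query earns nothing
    reward = 0.10                        # syntax_ok
    if cols == truth_cols:
        reward += 0.20                   # columns_match
    if len(rows) == len(truth_rows):
        reward += 0.20                   # row_count_match
    if order_sensitive:
        exact = rows == truth_rows       # ORDER BY: ordering must match
    else:
        exact = sorted(rows) == sorted(truth_rows)
    if exact:
        reward += 0.50                   # exact_match
    reward -= 0.05 * step                # step_penalty: step 0 is the first try
    return max(0.0, min(1.0, reward))
```

Under this scheme a perfect query on the first step scores 0.10 + 0.20 + 0.20 + 0.50 = 1.00, while the same query on the fifth step scores 1.00 − 4 × 0.05 = 0.80.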
---

## Baseline Scores

Produced by the `inference.py` script using `Qwen/Qwen2.5-72B-Instruct` via the Hugging Face router:

| Task | Expected Score |
|---|---|
| `simple-filter` | ~0.70 |
| `join-aggregation` | ~0.45 |
| `analytics-window` | ~0.25 |
---

## Setup & Usage

### Prerequisites

- Python 3.10+
- Docker (for containerised deployment)
- A Hugging Face account + token
### Local Development (no Docker)

```bash
# Clone the repository
git clone https://huggingface.co/spaces/your-username/nl2sql-bench
cd nl2sql-bench

# Quick start
chmod +x scripts/run_local.sh
./scripts/run_local.sh

# Or manually:
python3 -m venv .venv && source .venv/bin/activate
pip install openenv-core fastapi "uvicorn[standard]" openai pydantic
export PYTHONPATH=".:server"
cd server && uvicorn app:app --reload --port 8000
```
### Test the Running Server

```bash
# Run smoke tests
chmod +x scripts/smoke_test.sh
./scripts/smoke_test.sh http://localhost:8000

# Run the full test suite
pip install pytest pytest-asyncio
PYTHONPATH=".:server" pytest tests/ -v
```
### Docker

```bash
# Build
docker build -t nl2sql-bench:latest .

# Run
docker run -p 7860:7860 nl2sql-bench:latest

# Test
./scripts/smoke_test.sh http://localhost:7860
```
### Pre-submission Validation

```bash
# Run the official validator (replace with your HF Space URL)
chmod +x pre_validation_script.sh
./pre_validation_script.sh https://your-username-nl2sql-bench.hf.space .
```
### Running the Baseline Inference

```bash
# Set mandatory variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_your_token_here"
export SPACE_URL="https://your-username-nl2sql-bench.hf.space"

python inference.py
```
### Using the Client Programmatically

```python
import asyncio

from client import NL2SQLEnv
from models import NL2SQLAction


async def main():
    async with NL2SQLEnv(base_url="http://localhost:8000") as env:
        result = await env.reset()
        print(result.observation.question)

        result = await env.step(NL2SQLAction(
            query="SELECT id, name FROM customers WHERE tier='gold' ORDER BY name"
        ))
        print(f"Reward: {result.reward:.2f}")
        print(f"Done: {result.done}")
        print(f"Error: {result.observation.last_error}")


asyncio.run(main())
```
---

## Project Structure

```
nl2sql-bench/
├── models.py              # NL2SQLAction, NL2SQLObservation, NL2SQLState
├── client.py              # NL2SQLEnv(HTTPEnvClient)
├── inference.py           # Baseline inference script (mandatory name)
├── openenv.yaml           # OpenEnv manifest
├── pyproject.toml
├── Dockerfile             # HF Spaces compatible (port 7860)
├── .env.example
├── server/
│   ├── app.py             # FastAPI entry point
│   ├── environment.py     # Core RL environment logic
│   ├── grader.py          # Deterministic reward computation
│   ├── requirements.txt
│   ├── db/
│   │   ├── schema.sql     # 6-table e-commerce schema
│   │   └── seed.py        # Deterministic data generator (seed=42)
│   └── tasks/
│       ├── base.py        # BaseTask + registry
│       ├── easy.py        # simple-filter (5 examples)
│       ├── medium.py      # join-aggregation (5 examples)
│       └── hard.py        # analytics-window (5 examples)
├── tests/
│   ├── conftest.py
│   └── test_all.py        # 30+ pytest tests
└── scripts/
    ├── run_local.sh       # Local dev server
    └── smoke_test.sh      # Endpoint smoke tests
```
---

## Design Decisions

**Why SQLite in-memory?** Zero runtime dependencies, deterministic, and it runs comfortably within the 2 vCPU / 8 GB constraint. The database loads in ~50 ms.
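The pattern, sketched under the assumption that `server/db/schema.sql` from the project structure above defines the tables:

```python
import pathlib
import sqlite3

# A fresh in-memory database per episode: no files, nothing to clean up
conn = sqlite3.connect(":memory:")
conn.executescript(pathlib.Path("server/db/schema.sql").read_text())
# server/db/seed.py would then insert its deterministic (seed=42) data
```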
**Why multi-turn (up to 5 steps)?** A single-shot SQL environment gives binary rewards. Multi-turn with error feedback gives the agent (and the GRPO trainer) a rich signal: the model learns not just to write SQL, but to debug and refine its queries.

**Why a step penalty?** Without it, an agent that accidentally gets the right answer on step 5 scores the same as one that gets it on step 1. The penalty creates pressure to solve efficiently, which is realistic.

**Why order-sensitive comparison for ORDER BY queries?** Business questions that say "rank by spending" expect a ranked output; order-agnostic comparison would give spurious credit.
---

## License

MIT. See [LICENSE](LICENSE).