Spaces:

UtkarshSatav
/

sql-env

Sleeping

App Files Files Community

sql-env / README.md

UtkarshSatav

Upload folder using huggingface_hub

41fadd7 verified about 2 months ago

preview code

raw

history blame contribute delete

13.1 kB

	---
	title: SQLEnv - SQL Query Writing Environment
	emoji: 🗃️
	colorFrom: blue
	colorTo: green
	sdk: docker
	pinned: false
	app_port: 7860
	tags:
	- openenv
	---

	# SQLEnv — SQL Query Writing Environment for AI Agents

	An OpenEnv-compatible reinforcement learning environment where AI agents learn to write correct SQL queries from natural language questions. Built on a realistic e-commerce database with deterministic, partial-credit grading across 3 difficulty levels.

	## Why This Environment Matters

	Text-to-SQL is a real-world task performed by millions of data analysts, engineers, and business users daily. Natural language interfaces to databases are a $2B+ market (Tableau Ask Data, ThoughtSpot, etc.), yet there is no standardized RL benchmark for training and evaluating text-to-SQL agents in the OpenEnv ecosystem.

	SQLEnv fills this gap by providing:
	- A realistic domain (e-commerce) that any evaluator can understand
	- Deterministic grading — same query always produces the same score
	- Rich reward signal — not just pass/fail, but 4-component partial credit
	- Meaningful difficulty progression — from simple WHERE clauses to window functions
	- Immediate practical value — can be used to train/evaluate text-to-SQL copilots

	Who benefits: AI/ML researchers training agents, data teams evaluating LLM SQL capability, educators teaching SQL with automated grading, and enterprises building natural language BI tools.

	## Interactive Demo

	Visit the [Live Space](https://huggingface.co/spaces/UtkarshSatav/sql-env) to try the environment directly in your browser:
	- Select a difficulty level (easy / medium / hard)
	- Read the natural language question
	- Write SQL queries and get instant graded feedback
	- See color-coded rewards and visual progress tracking

	## Environment Design

	### Episode Flow

	```
	Agent Environment
	\| \|
	\|-------- reset() ------------->\| Initialize DB, load task, return Q1
	\|<------- observation ----------\| Schema + question + 0 reward
	\| \|
	\|--- step(SQLAction(query)) --->\| Execute SQL, grade against ground truth
	\|<------- observation ----------\| Result + reward + feedback
	\| \|
	\| (retry or next question) \| Move on after perfect score or max attempts
	\| \|
	\|<------- done=true ------------\| All 5 questions answered
	```

	Each episode consists of 5 questions. The agent gets multiple attempts per question (3 for easy, 4 for medium, 5 for hard). A small step penalty (-0.02 per retry) encourages efficiency.

	### Action Space

	The agent submits a single SQL SELECT query:

	```json
	{
	"action": {
	"query": "SELECT name, age FROM customers WHERE age > 30 ORDER BY age DESC"
	}
	}
	```

	- Only SELECT statements are allowed (INSERT/UPDATE/DELETE/DROP are blocked)
	- Queries execute against an in-memory SQLite database
	- The agent sees the full database schema in every observation

	### Observation Space

	After each step, the agent receives:

	```json
	{
	"observation": {
	"task_name": "basic_select",
	"question": "Find the names and ages of all customers older than 30, sorted by age from highest to lowest.",
	"schema_description": "=== DATABASE SCHEMA ===\n\nTABLE: customers -- Customer information\n id INTEGER PRIMARY KEY\n name TEXT NOT NULL\n ...",
	"query_result": "name \| age\n--------------+----\nSuresh Menon \| 50\nKavita Joshi \| 45\n...",
	"error": "",
	"steps_remaining": 2,
	"question_index": 1,
	"total_questions": 5
	},
	"reward": 1.0,
	"done": false
	}
	```

	Key fields:
	- `question` — Natural language question to answer with SQL
	- `schema_description` — Full database schema with sample data and relationships
	- `query_result` — Formatted table output of the executed query (or error message)
	- `error` — SQL error string if query failed, empty otherwise
	- `steps_remaining` — Attempts left for this question
	- `metadata.feedback` — Human-readable explanation of what was right/wrong

	## Tasks (3 Difficulty Levels)

	### Task 1: `basic_select` (Easy)
	Simple SELECT queries with WHERE, ORDER BY, LIMIT, COUNT.

	Example questions:
	- "Find the names and ages of all customers older than 30, sorted by age descending"
	- "List all products in the Electronics category, sorted by price descending"
	- "How many orders have the status shipped?"

	Max attempts per question: 3

	### Task 2: `join_aggregate` (Medium)
	JOIN queries with GROUP BY, HAVING, and aggregate functions.

	Example questions:
	- "What is the average order total for each customer?"
	- "Which products have been ordered more than 2 times in total quantity?"
	- "List all customers who have never placed an order"

	Max attempts per question: 4

	### Task 3: `advanced_analytics` (Hard)
	Subqueries, CTEs, window functions, and complex multi-table analytics.

	Example questions:
	- "Find all customers whose total spending exceeds the average customer spending"
	- "Rank all products by revenue within each category using RANK()"
	- "Calculate month-over-month growth in order count using LAG()"

	Max attempts per question: 5

	## Reward Function

	The reward is NOT binary. It provides rich gradient signal across the full trajectory using 4 weighted components:

	```
	Total Reward = 0.1 × syntax + 0.2 × columns + 0.3 × rows + 0.4 × exact
	```

	\| Component \| Weight \| Score Range \| What It Measures \|
	\|---\|---\|---\|---\|
	\| Syntax \| 0.1 \| 0 or 1 \| Query parses and executes without SQL error \|
	\| Columns \| 0.2 \| 0.0–1.0 \| Fraction of expected column names present in result \|
	\| Rows \| 0.3 \| 0.0–1.0 \| Fraction of expected rows matching (position-aware for ordered queries) \|
	\| Exact \| 0.4 \| 0, 0.5, or 1 \| Full result set matches ground truth (0.5 if extra rows) \|

	Score examples:

	\| Scenario \| Reward \| Breakdown \|
	\|---\|---\|---\|
	\| Syntax error (`SELEC * FORM`) \| 0.00 \| All components zero \|
	\| Valid SQL, wrong columns \| 0.10 \| Syntax only \|
	\| Right columns, partial rows \| 0.40 \| Syntax + columns + partial rows \|
	\| Right columns, all rows, extra rows \| 0.80 \| Syntax + columns + rows + partial exact \|
	\| Perfect match \| 1.00 \| All components maximum \|

	Additional mechanics:
	- Step penalty: -0.02 per retry attempt (encourages getting it right the first time)
	- Deterministic: Same query always produces the same score
	- Handles edge cases: NULL values, float tolerance (±0.01), column reordering

	## Database

	Realistic e-commerce database with 5 interconnected tables:

	```
	┌──────────────┐ ┌──────────────┐ ┌──────────────┐
	│ customers │ │ orders │ │ products │
	│ (20 rows) │───<│ (30 rows) │ │ (15 rows) │
	│ name, email │ │ order_date │ │ name, price │
	│ age, city │ │ status │ │ category │
	└──────────────┘ └──────────────┘ └──────────────┘
	│
	┌──────────────┐ ┌──────────────┐
	│ order_items │───>│ reviews │
	│ (46 rows) │ │ (25 rows) │
	│ quantity │ │ rating 1-5 │
	└──────────────┘ └──────────────┘
	```

	Key data characteristics:
	- 4 product categories: Electronics, Clothing, Books, Home
	- 4 order statuses: pending, shipped, delivered, cancelled
	- 7 cities across India
	- 2 customers with no orders (for LEFT JOIN / NOT IN queries)
	- 5 customers spanning 3+ product categories (for complex analytics)
	- Prices in INR, dates in ISO format

	All data is deterministic — seeded identically on every `reset()`.

	## Baseline Scores

	Tested with Llama 3.3 70B (via Groq, temperature=0.3):

	\| Task \| Score \| Steps \| Notes \|
	\|---\|---\|---\|---\|
	\| `basic_select` \| 1.000 \| 6 \| Solved all 5 questions, 1 retry on Q3 \|
	\| `join_aggregate` \| 1.000 \| 8 \| Used all 8 steps, retried Q2 multiple times \|
	\| `advanced_analytics` \| 0.969 \| 8 \| Near-perfect, retries on RANK() and LAG() queries \|

	Also tested with Qwen 2.5 72B (via HuggingFace):

	\| Task \| Score \| Notes \|
	\|---\|---\|---\|
	\| `basic_select` \| 1.000 \| Perfect \|
	\| `join_aggregate` \| 0.967 \| Near-perfect \|
	\| `advanced_analytics` \| 0.623 \| Partial (API credits ran out mid-task) \|

	These scores demonstrate:
	- Easy task is solvable by current LLMs (validates environment works)
	- Hard task challenges even 70B models (meaningful benchmark)
	- Scores vary across runs and models (not a constant grader)

	## API Endpoints

	\| Endpoint \| Method \| Description \|
	\|---\|---\|---\|
	\| `/` \| GET \| Interactive Gradio playground \|
	\| `/health` \| GET \| Health check → `{"status": "healthy"}` \|
	\| `/reset` \| POST \| Reset environment, returns initial observation \|
	\| `/step` \| POST \| Submit SQL query, returns graded observation \|
	\| `/state` \| GET \| Current episode state (episode_id, step_count) \|
	\| `/schema` \| GET \| Action/observation JSON schemas \|
	\| `/docs` \| GET \| Interactive Swagger API documentation \|
	\| `/ws` \| WS \| WebSocket for persistent sessions \|

	## Setup & Usage

	### Quick Start (Local)

	```bash
	git clone https://github.com/UtkarshSatav/Scaler-Hackathon-SQL.env-.git
	cd Scaler-Hackathon-SQL.env-
	pip install openenv-core[core] fastapi uvicorn gradio openai
	uvicorn server.app:app --host 0.0.0.0 --port 7860
	```

	### Docker

	```bash
	docker build -t sql-env:latest -f Dockerfile .
	docker run -p 7860:7860 sql-env:latest
	```

	### Using the Client

	```python
	from client import SQLEnvClient
	from models import SQLAction

	with SQLEnvClient(base_url="http://localhost:7860") as env:
	result = env.reset()
	print(result.observation.question)

	result = env.step(SQLAction(query="SELECT * FROM customers"))
	print(f"Reward: {result.reward}")
	```

	### Running Inference

	```bash
	export API_BASE_URL="https://api.groq.com/openai/v1"
	export API_KEY="your_key"
	export MODEL_NAME="llama-3.3-70b-versatile"
	python inference.py
	```

	## Environment Variables

	\| Variable \| Default \| Description \|
	\|---\|---\|---\|
	\| `SQL_ENV_TASK` \| `basic_select` \| Task to load: basic_select, join_aggregate, advanced_analytics \|
	\| `SQL_ENV_MAX_STEPS` \| `15` \| Maximum total steps per episode \|
	\| `SQL_ENV_STEP_PENALTY` \| `0.02` \| Penalty per retry attempt \|
	\| `API_BASE_URL` \| `https://router.huggingface.co/v1` \| LLM API endpoint (for inference.py) \|
	\| `MODEL_NAME` \| `Qwen/Qwen2.5-72B-Instruct` \| Model for inference \|
	\| `HF_TOKEN` / `API_KEY` \| (required) \| API key for LLM calls \|

	## Project Structure

	```
	sql_env/
	├── inference.py # Mandatory baseline inference script
	├── Dockerfile # HF Spaces deployment container
	├── openenv.yaml # OpenEnv metadata
	├── README.md # This file
	├── models.py # SQLAction, SQLObservation (typed Pydantic models)
	├── client.py # SQLEnvClient for WebSocket connections
	├── data/
	│ ├── schema.sql # Database table definitions
	│ ├── seed.sql # Deterministic seed data
	│ └── tasks/
	│ ├── basic_select.json # 5 easy questions + ground truth
	│ ├── join_aggregate.json # 5 medium questions + ground truth
	│ └── advanced_analytics.json # 5 hard questions + ground truth
	├── server/
	│ ├── app.py # FastAPI + Gradio app
	│ ├── sql_env_environment.py # SQLEnvironment (reset/step/state)
	│ ├── database.py # SQLite management
	│ ├── graders.py # Multi-component reward function
	│ ├── gradio_ui.py # Interactive web UI
	│ └── Dockerfile # Alternative Dockerfile (openenv scaffold)
	└── tests/
	├── test_database.py # 8 tests
	├── test_graders.py # 13 tests
	├── test_environment.py # 15 tests
	├── test_inference.py # 8 tests
	└── test_server.py # 5 tests (49 total, all passing)
	```

	## Technical Decisions

	\| Decision \| Rationale \|
	\|---\|---\|
	\| SQLite (in-memory) \| Zero external deps, deterministic, resets instantly \|
	\| E-commerce domain \| Universally understood, naturally scales in complexity \|
	\| 4-component reward \| Provides gradient signal, not just binary pass/fail \|
	\| Multiple attempts per question \| Allows agent to learn from errors within an episode \|
	\| Fixed seed data \| Reproducible results — same data every reset() \|
	\| Schema in every observation \| No hidden information, agent sees full context \|