Spaces:

UtkarshSatav
/

sql-env

Sleeping

App Files Files Community

UtkarshSatav commited on Apr 3

Commit

41fadd7

verified ·

1 Parent(s): 1c5c280

Upload folder using huggingface_hub

Browse files

Files changed (1) hide show

README.md +262 -46

README.md CHANGED Viewed

@@ -12,45 +12,76 @@ tags:
 # SQLEnv — SQL Query Writing Environment for AI Agents
-An OpenEnv-compatible reinforcement learning environment where an AI agent learns to write correct SQL queries from natural language questions against a realistic e-commerce database.
-## Overview
-The agent receives a database schema and a natural language question, submits SQL queries, and gets graded with rich partial-credit scoring.
-**3 difficulty levels, 5 questions each:**
-| Task | Difficulty | SQL Features |
-|---|---|---|
-| `basic_select` | Easy | WHERE, ORDER BY, LIMIT, COUNT |
-| `join_aggregate` | Medium | JOIN, GROUP BY, HAVING, AVG, SUM |
-| `advanced_analytics` | Hard | Subqueries, RANK(), LAG(), PARTITION BY |
-## API Endpoints
-| Endpoint | Method | Description |
-|---|---|---|
-| `/health` | GET | Health check |
-| `/reset` | POST | Reset environment, get first question |
-| `/step` | POST | Submit SQL query, get graded result |
-| `/state` | GET | Current episode state |
-| `/docs` | GET | Interactive API documentation |
-## Action Space
 ```json
-{"action": {"query": "SELECT name, age FROM customers WHERE age > 30 ORDER BY age DESC"}}
 ```
-## Observation Space
 ```json
 {
   "observation": {
     "task_name": "basic_select",
-    "question": "Find all customers older than 30...",
-    "schema_description": "=== DATABASE SCHEMA === ...",
-    "query_result": "name | age ...",
     "error": "",
     "steps_remaining": 2,
     "question_index": 1,
@@ -61,47 +92,232 @@ The agent receives a database schema and a natural language question, submits SQ
 }
 ```
 ## Reward Function
-Multi-component partial credit scoring (0.0 to 1.0):
-| Component | Weight | What It Measures |
 |---|---|---|
-| Syntax | 0.1 | Query executes without error |
-| Columns | 0.2 | Expected columns present |
-| Rows | 0.3 | Expected rows match |
-| Exact | 0.4 | Full result set matches ground truth |
 ## Database
-Realistic e-commerce database with 5 tables:
-- **customers** (20 rows) - name, email, age, city, signup_date
-- **products** (15 rows) - name, category, price, stock
-- **orders** (30 rows) - customer_id, order_date, status, total_amount
-- **order_items** (46 rows) - order_id, product_id, quantity, unit_price
-- **reviews** (25 rows) - product_id, customer_id, rating, review_text
 ## Baseline Scores
-Tested with Llama 3.3 70B (via Groq):
-| Task | Score |
-|---|---|
-| basic_select | 1.000 |
-| join_aggregate | 1.000 |
-| advanced_analytics | 0.969 |
-## Setup
 ```bash
-pip install openenv-core
-cd sql_env
 uvicorn server.app:app --host 0.0.0.0 --port 7860
 ```
 ## Environment Variables
 | Variable | Default | Description |
 |---|---|---|
-| SQL_ENV_TASK | basic_select | Task to load |
-| SQL_ENV_MAX_STEPS | 15 | Max steps per episode |

 # SQLEnv — SQL Query Writing Environment for AI Agents
+An OpenEnv-compatible reinforcement learning environment where AI agents learn to write correct SQL queries from natural language questions. Built on a realistic e-commerce database with deterministic, partial-credit grading across 3 difficulty levels.
+## Why This Environment Matters
+**Text-to-SQL is a real-world task** performed by millions of data analysts, engineers, and business users daily. Natural language interfaces to databases are a $2B+ market (Tableau Ask Data, ThoughtSpot, etc.), yet there is no standardized RL benchmark for training and evaluating text-to-SQL agents in the OpenEnv ecosystem.
+SQLEnv fills this gap by providing:
+- A **realistic domain** (e-commerce) that any evaluator can understand
+- **Deterministic grading** — same query always produces the same score
+- **Rich reward signal** — not just pass/fail, but 4-component partial credit
+- **Meaningful difficulty progression** — from simple WHERE clauses to window functions
+- **Immediate practical value** — can be used to train/evaluate text-to-SQL copilots
+**Who benefits:** AI/ML researchers training agents, data teams evaluating LLM SQL capability, educators teaching SQL with automated grading, and enterprises building natural language BI tools.
+## Interactive Demo
+Visit the **[Live Space](https://huggingface.co/spaces/UtkarshSatav/sql-env)** to try the environment directly in your browser:
+- Select a difficulty level (easy / medium / hard)
+- Read the natural language question
+- Write SQL queries and get instant graded feedback
+- See color-coded rewards and visual progress tracking
+## Environment Design
+### Episode Flow
+```
+Agent                          Environment
+  |                                |
+  |-------- reset() ------------->|  Initialize DB, load task, return Q1
+  |<------- observation ----------|  Schema + question + 0 reward
+  |                                |
+  |--- step(SQLAction(query)) --->|  Execute SQL, grade against ground truth
+  |<------- observation ----------|  Result + reward + feedback
+  |                                |
+  |  (retry or next question)     |  Move on after perfect score or max attempts
+  |                                |
+  |<------- done=true ------------|  All 5 questions answered
+```
+Each episode consists of 5 questions. The agent gets multiple attempts per question (3 for easy, 4 for medium, 5 for hard). A small step penalty (-0.02 per retry) encourages efficiency.
+### Action Space
+The agent submits a single SQL SELECT query:
 ```json
+{
+  "action": {
+    "query": "SELECT name, age FROM customers WHERE age > 30 ORDER BY age DESC"
+  }
+}
 ```
+- Only SELECT statements are allowed (INSERT/UPDATE/DELETE/DROP are blocked)
+- Queries execute against an in-memory SQLite database
+- The agent sees the full database schema in every observation
+### Observation Space
+After each step, the agent receives:
 ```json
 {
   "observation": {
     "task_name": "basic_select",
+    "question": "Find the names and ages of all customers older than 30, sorted by age from highest to lowest.",
+    "schema_description": "=== DATABASE SCHEMA ===\n\nTABLE: customers -- Customer information\n  id INTEGER PRIMARY KEY\n  name TEXT NOT NULL\n  ...",
+    "query_result": "name         | age\n--------------+----\nSuresh Menon | 50\nKavita Joshi | 45\n...",
     "error": "",
     "steps_remaining": 2,
     "question_index": 1,
 }
 ```
+Key fields:
+- `question` — Natural language question to answer with SQL
+- `schema_description` — Full database schema with sample data and relationships
+- `query_result` — Formatted table output of the executed query (or error message)
+- `error` — SQL error string if query failed, empty otherwise
+- `steps_remaining` — Attempts left for this question
+- `metadata.feedback` — Human-readable explanation of what was right/wrong
+## Tasks (3 Difficulty Levels)
+### Task 1: `basic_select` (Easy)
+Simple SELECT queries with WHERE, ORDER BY, LIMIT, COUNT.
+**Example questions:**
+- "Find the names and ages of all customers older than 30, sorted by age descending"
+- "List all products in the Electronics category, sorted by price descending"
+- "How many orders have the status shipped?"
+**Max attempts per question:** 3
+### Task 2: `join_aggregate` (Medium)
+JOIN queries with GROUP BY, HAVING, and aggregate functions.
+**Example questions:**
+- "What is the average order total for each customer?"
+- "Which products have been ordered more than 2 times in total quantity?"
+- "List all customers who have never placed an order"
+**Max attempts per question:** 4
+### Task 3: `advanced_analytics` (Hard)
+Subqueries, CTEs, window functions, and complex multi-table analytics.
+**Example questions:**
+- "Find all customers whose total spending exceeds the average customer spending"
+- "Rank all products by revenue within each category using RANK()"
+- "Calculate month-over-month growth in order count using LAG()"
+**Max attempts per question:** 5
 ## Reward Function
+The reward is **NOT binary**. It provides rich gradient signal across the full trajectory using 4 weighted components:
+```
+Total Reward = 0.1 × syntax + 0.2 × columns + 0.3 × rows + 0.4 × exact
+```
+| Component | Weight | Score Range | What It Measures |
+|---|---|---|---|
+| **Syntax** | 0.1 | 0 or 1 | Query parses and executes without SQL error |
+| **Columns** | 0.2 | 0.0–1.0 | Fraction of expected column names present in result |
+| **Rows** | 0.3 | 0.0–1.0 | Fraction of expected rows matching (position-aware for ordered queries) |
+| **Exact** | 0.4 | 0, 0.5, or 1 | Full result set matches ground truth (0.5 if extra rows) |
+**Score examples:**
+| Scenario | Reward | Breakdown |
 |---|---|---|
+| Syntax error (`SELEC * FORM`) | 0.00 | All components zero |
+| Valid SQL, wrong columns | 0.10 | Syntax only |
+| Right columns, partial rows | 0.40 | Syntax + columns + partial rows |
+| Right columns, all rows, extra rows | 0.80 | Syntax + columns + rows + partial exact |
+| Perfect match | 1.00 | All components maximum |
+**Additional mechanics:**
+- **Step penalty:** -0.02 per retry attempt (encourages getting it right the first time)
+- **Deterministic:** Same query always produces the same score
+- **Handles edge cases:** NULL values, float tolerance (±0.01), column reordering
 ## Database
+Realistic e-commerce database with 5 interconnected tables:
+```
+┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  customers   │    │   orders     │    │  products    │
+│  (20 rows)   │───<│  (30 rows)   │    │  (15 rows)   │
+│  name, email │    │  order_date  │    │  name, price │
+│  age, city   │    │  status      │    │  category    │
+└──────────────┘    └──────────────┘    └──────────────┘
+                           │
+                    ┌──────────────┐    ┌──────────────┐
+                    │ order_items  │───>│   reviews    │
+                    │  (46 rows)   │    │  (25 rows)   │
+                    │  quantity    │    │  rating 1-5  │
+                    └──────────────┘    └──────────────┘
+```
+**Key data characteristics:**
+- 4 product categories: Electronics, Clothing, Books, Home
+- 4 order statuses: pending, shipped, delivered, cancelled
+- 7 cities across India
+- 2 customers with no orders (for LEFT JOIN / NOT IN queries)
+- 5 customers spanning 3+ product categories (for complex analytics)
+- Prices in INR, dates in ISO format
+All data is **deterministic** — seeded identically on every `reset()`.
 ## Baseline Scores
+Tested with **Llama 3.3 70B** (via Groq, temperature=0.3):
+| Task | Score | Steps | Notes |
+|---|---|---|---|
+| `basic_select` | **1.000** | 6 | Solved all 5 questions, 1 retry on Q3 |
+| `join_aggregate` | **1.000** | 8 | Used all 8 steps, retried Q2 multiple times |
+| `advanced_analytics` | **0.969** | 8 | Near-perfect, retries on RANK() and LAG() queries |
+Also tested with **Qwen 2.5 72B** (via HuggingFace):
+| Task | Score | Notes |
+|---|---|---|
+| `basic_select` | **1.000** | Perfect |
+| `join_aggregate` | **0.967** | Near-perfect |
+| `advanced_analytics` | **0.623** | Partial (API credits ran out mid-task) |
+These scores demonstrate:
+- Easy task is solvable by current LLMs (validates environment works)
+- Hard task challenges even 70B models (meaningful benchmark)
+- Scores vary across runs and models (not a constant grader)
+## API Endpoints
+| Endpoint | Method | Description |
+|---|---|---|
+| `/` | GET | Interactive Gradio playground |
+| `/health` | GET | Health check → `{"status": "healthy"}` |
+| `/reset` | POST | Reset environment, returns initial observation |
+| `/step` | POST | Submit SQL query, returns graded observation |
+| `/state` | GET | Current episode state (episode_id, step_count) |
+| `/schema` | GET | Action/observation JSON schemas |
+| `/docs` | GET | Interactive Swagger API documentation |
+| `/ws` | WS | WebSocket for persistent sessions |
+## Setup & Usage
+### Quick Start (Local)
 ```bash
+git clone https://github.com/UtkarshSatav/Scaler-Hackathon-SQL.env-.git
+cd Scaler-Hackathon-SQL.env-
+pip install openenv-core[core] fastapi uvicorn gradio openai
 uvicorn server.app:app --host 0.0.0.0 --port 7860
 ```
+### Docker
+```bash
+docker build -t sql-env:latest -f Dockerfile .
+docker run -p 7860:7860 sql-env:latest
+```
+### Using the Client
+```python
+from client import SQLEnvClient
+from models import SQLAction
+with SQLEnvClient(base_url="http://localhost:7860") as env:
+    result = env.reset()
+    print(result.observation.question)
+    result = env.step(SQLAction(query="SELECT * FROM customers"))
+    print(f"Reward: {result.reward}")
+```
+### Running Inference
+```bash
+export API_BASE_URL="https://api.groq.com/openai/v1"
+export API_KEY="your_key"
+export MODEL_NAME="llama-3.3-70b-versatile"
+python inference.py
+```
 ## Environment Variables
 | Variable | Default | Description |
 |---|---|---|
+| `SQL_ENV_TASK` | `basic_select` | Task to load: basic_select, join_aggregate, advanced_analytics |
+| `SQL_ENV_MAX_STEPS` | `15` | Maximum total steps per episode |
+| `SQL_ENV_STEP_PENALTY` | `0.02` | Penalty per retry attempt |
+| `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint (for inference.py) |
+| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model for inference |
+| `HF_TOKEN` / `API_KEY` | (required) | API key for LLM calls |
+## Project Structure
+```
+sql_env/
+├── inference.py                 # Mandatory baseline inference script
+├── Dockerfile                   # HF Spaces deployment container
+├── openenv.yaml                 # OpenEnv metadata
+├── README.md                    # This file
+├── models.py                    # SQLAction, SQLObservation (typed Pydantic models)
+├── client.py                    # SQLEnvClient for WebSocket connections
+├── data/
+│   ├── schema.sql               # Database table definitions
+│   ├── seed.sql                 # Deterministic seed data
+│   └── tasks/
+│       ├── basic_select.json    # 5 easy questions + ground truth
+│       ├── join_aggregate.json  # 5 medium questions + ground truth
+│       └── advanced_analytics.json  # 5 hard questions + ground truth
+├── server/
+│   ├── app.py                   # FastAPI + Gradio app
+│   ├── sql_env_environment.py   # SQLEnvironment (reset/step/state)
+│   ├── database.py              # SQLite management
+│   ├── graders.py               # Multi-component reward function
+│   ├── gradio_ui.py             # Interactive web UI
+│   └── Dockerfile               # Alternative Dockerfile (openenv scaffold)
+└── tests/
+    ├── test_database.py         # 8 tests
+    ├── test_graders.py          # 13 tests
+    ├── test_environment.py      # 15 tests
+    ├── test_inference.py        # 8 tests
+    └── test_server.py           # 5 tests (49 total, all passing)
+```
+## Technical Decisions
+| Decision | Rationale |
+|---|---|
+| SQLite (in-memory) | Zero external deps, deterministic, resets instantly |
+| E-commerce domain | Universally understood, naturally scales in complexity |
+| 4-component reward | Provides gradient signal, not just binary pass/fail |
+| Multiple attempts per question | Allows agent to learn from errors within an episode |
+| Fixed seed data | Reproducible results — same data every reset() |
+| Schema in every observation | No hidden information, agent sees full context |