Spaces: UtkarshSatav/sql-env


UtkarshSatav committed on Apr 3

Commit 08b82d0 · verified · 1 Parent(s): 6b8948a

Upload folder using huggingface_hub

Files changed (21)

.gitignore +13 -0
Dockerfile +25 -0
README.md +250 -5
__init__.py +8 -0
client.py +54 -0
data/schema.sql +49 -0
data/seed.sql +200 -0
data/tasks/advanced_analytics.json +101 -0
data/tasks/basic_select.json +75 -0
data/tasks/join_aggregate.json +93 -0
inference.py +265 -0
models.py +28 -0
openenv.yaml +6 -0
pyproject.toml +45 -0
server/Dockerfile +80 -0
server/__init__.py +1 -0
server/app.py +50 -0
server/database.py +214 -0
server/graders.py +214 -0
server/requirements.txt +4 -0
server/sql_env_environment.py +217 -0

.gitignore ADDED

	@@ -0,0 +1,13 @@

+__pycache__/
+*.pyc
+*.pyo
+*.egg-info/
+dist/
+build/
+.venv/
+venv/
+*.db
+*.sqlite
+.env
+.DS_Store
+uv.lock

Dockerfile ADDED

	@@ -0,0 +1,25 @@

+FROM python:3.11-slim
+WORKDIR /app
+# Install system dependencies
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends curl && \
+    rm -rf /var/lib/apt/lists/*
+# Copy requirements first for better caching
+COPY server/requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy project files
+COPY . .
+# Expose HF Spaces default port
+EXPOSE 7860
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
+    CMD curl -f http://localhost:7860/health || exit 1
+# Run the FastAPI server on port 7860 (HF Spaces default)
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
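
The `HEALTHCHECK` above probes `http://localhost:7860/health` with `curl -f`, using a 3 s timeout and 3 retries. A client that starts this container itself may want a comparable wait loop before sending requests. The following is a minimal sketch of that idea, not part of this commit: `http_ok` and `wait_until_healthy` are hypothetical helper names, and unlike Docker's check (which marks a container unhealthy after consecutive failures) this loop simply returns on the first success.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers without an HTTP error (similar to `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx, refused connections, and timeouts.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    retries: int = 3,
    interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `retries` times, sleeping `interval` seconds between attempts."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False
```

A caller could pass `lambda: http_ok("http://localhost:7860/health")` as the probe; the `sleep` parameter exists only so the loop can be exercised without real delays.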

README.md CHANGED

@@ -1,10 +1,255 @@
 ---
-title: Sql Env
-emoji: 📚
-colorFrom: gray
-colorTo: blue
+title: Sql Env Environment Server
+emoji: 🗜️
+colorFrom: blue
+colorTo: gray
 sdk: docker
 pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Sql Env Environment
+A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.
+## Quick Start
+The simplest way to use the Sql Env environment is through the `SqlEnv` class:
+```python
+from sql_env import SqlAction, SqlEnv
+# Create environment from Docker image (outside the try, so close() never runs on an unbound name)
+sql_envenv = SqlEnv.from_docker_image("sql_env-env:latest")
+try:
+    # Reset
+    result = sql_envenv.reset()
+    print(f"Reset: {result.observation.echoed_message}")
+    # Send multiple messages
+    messages = ["Hello, World!", "Testing echo", "Final message"]
+    for msg in messages:
+        result = sql_envenv.step(SqlAction(message=msg))
+        print(f"Sent: '{msg}'")
+        print(f"  → Echoed: '{result.observation.echoed_message}'")
+        print(f"  → Length: {result.observation.message_length}")
+        print(f"  → Reward: {result.reward}")
+finally:
+    # Always clean up
+    sql_envenv.close()
+```
+That's it! The `SqlEnv.from_docker_image()` method handles:
+- Starting the Docker container
+- Waiting for the server to be ready
+- Connecting to the environment
+- Container cleanup when you call `close()`
+## Building the Docker Image
+Before using the environment, you need to build the Docker image:
+```bash
+# From project root
+docker build -t sql_env-env:latest -f server/Dockerfile .
+```
+## Deploying to Hugging Face Spaces
+You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
+```bash
+# From the environment directory (where openenv.yaml is located)
+openenv push
+# Or specify options
+openenv push --namespace my-org --private
+```
+The `openenv push` command will:
+1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
+2. Prepare a custom build for Hugging Face Docker space (enables web interface)
+3. Upload to Hugging Face (ensuring you're logged in)
+### Prerequisites
+- Authenticate with Hugging Face: The command will prompt for login if not already authenticated
+### Options
+- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
+- `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
+- `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
+- `--private`: Deploy the space as private (default: public)
+### Examples
+```bash
+# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
+openenv push
+# Push to a specific repository
+openenv push --repo-id my-org/my-env
+# Push with a custom base image
+openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
+# Push as a private space
+openenv push --private
+# Combine options
+openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
+```
+After deployment, your space will be available at:
+`https://huggingface.co/spaces/<repo-id>`
+The deployed space includes:
+- **Web Interface** at `/web` - Interactive UI for exploring the environment
+- **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
+- **Health Check** at `/health` - Container health monitoring
+- **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
+## Environment Details
+### Action
+**SqlAction**: Contains a single field
+- `message` (str) - The message to echo back
+### Observation
+**SqlObservation**: Contains the echo response and metadata
+- `echoed_message` (str) - The message echoed back
+- `message_length` (int) - Length of the message
+- `reward` (float) - Reward based on message length (length × 0.1)
+- `done` (bool) - Always False for echo environment
+- `metadata` (dict) - Additional info like step count
+### Reward
+The reward is calculated as: `message_length × 0.1`
+- "Hi" → reward: 0.2
+- "Hello, World!" → reward: 1.3
+- Empty message → reward: 0.0
+## Advanced Usage
+### Connecting to an Existing Server
+If you already have a Sql Env environment server running, you can connect directly:
+```python
+from sql_env import SqlAction, SqlEnv
+# Connect to existing server
+sql_envenv = SqlEnv(base_url="<ENV_HTTP_URL_HERE>")
+# Use as normal
+result = sql_envenv.reset()
+result = sql_envenv.step(SqlAction(message="Hello!"))
+```
+Note: When connecting to an existing server, `sql_envenv.close()` will NOT stop the server.
+### Using the Context Manager
+The client supports context manager usage for automatic connection management:
+```python
+from sql_env import SqlAction, SqlEnv
+# Connect with context manager (auto-connects and closes)
+with SqlEnv(base_url="http://localhost:8000") as env:
+    result = env.reset()
+    print(f"Reset: {result.observation.echoed_message}")
+    # Multiple steps with low latency
+    for msg in ["Hello", "World", "!"]:
+        result = env.step(SqlAction(message=msg))
+        print(f"Echoed: {result.observation.echoed_message}")
+```
+The client uses WebSocket connections for:
+- **Lower latency**: No HTTP connection overhead per request
+- **Persistent session**: Server maintains your environment state
+- **Efficient for episodes**: Better for many sequential steps
+### Concurrent WebSocket Sessions
+The server supports multiple concurrent WebSocket connections. To enable this,
+modify `server/app.py` to use factory mode:
+```python
+# In server/app.py - use factory mode for concurrent sessions
+app = create_app(
+    SqlEnvironment,  # Pass class, not instance
+    SqlAction,
+    SqlObservation,
+    max_concurrent_envs=4,  # Allow 4 concurrent sessions
+)
+```
+Then multiple clients can connect simultaneously:
+```python
+from sql_env import SqlAction, SqlEnv
+from concurrent.futures import ThreadPoolExecutor
+def run_episode(client_id: int):
+    with SqlEnv(base_url="http://localhost:8000") as env:
+        result = env.reset()
+        for i in range(10):
+            result = env.step(SqlAction(message=f"Client {client_id}, step {i}"))
+        return client_id, result.observation.message_length
+# Run 4 episodes concurrently
+with ThreadPoolExecutor(max_workers=4) as executor:
+    results = list(executor.map(run_episode, range(4)))
+```
+## Development & Testing
+### Direct Environment Testing
+Test the environment logic directly without starting the HTTP server:
+```bash
+# From the project root
+python3 server/sql_env_environment.py
+```
+This verifies that:
+- Environment resets correctly
+- Step executes actions properly
+- State tracking works
+- Rewards are calculated correctly
+### Running Locally
+Run the server locally for development:
+```bash
+uvicorn server.app:app --reload
+```
+## Project Structure
+```
+sql_env/
+├── .dockerignore         # Docker build exclusions
+├── __init__.py            # Module exports
+├── README.md              # This file
+├── openenv.yaml           # OpenEnv manifest
+├── pyproject.toml         # Project metadata and dependencies
+├── uv.lock                # Locked dependencies (generated)
+├── client.py              # SqlEnv client
+├── models.py              # Action and Observation models
+└── server/
+    ├── __init__.py        # Server module exports
+    ├── sql_env_environment.py  # Core environment logic
+    ├── app.py             # FastAPI application (HTTP + WebSocket endpoints)
+    └── Dockerfile         # Container image definition
+```

__init__.py ADDED

	@@ -0,0 +1,8 @@

+"""SQL Query Writing Environment."""
+from .models import SQLAction, SQLObservation
+__all__ = [
+    "SQLAction",
+    "SQLObservation",
+]

client.py ADDED

	@@ -0,0 +1,54 @@

+"""SQL Query Writing Environment Client."""
+from typing import Dict
+from openenv.core import EnvClient
+from openenv.core.client_types import StepResult
+from openenv.core.env_server.types import State
+from .models import SQLAction, SQLObservation
+class SQLEnvClient(
+    EnvClient[SQLAction, SQLObservation, State]
+):
+    """
+    Client for the SQL Query Writing Environment.
+    Example:
+        >>> with SQLEnvClient(base_url="http://localhost:8000") as client:
+        ...     result = client.reset()
+        ...     print(result.observation.question)
+        ...     result = client.step(SQLAction(query="SELECT * FROM customers"))
+        ...     print(result.observation.query_result)
+    """
+    def _step_payload(self, action: SQLAction) -> Dict:
+        return {"query": action.query}
+    def _parse_result(self, payload: Dict) -> StepResult[SQLObservation]:
+        obs_data = payload.get("observation", {})
+        observation = SQLObservation(
+            task_name=obs_data.get("task_name", ""),
+            question=obs_data.get("question", ""),
+            schema_description=obs_data.get("schema_description", ""),
+            query_result=obs_data.get("query_result", ""),
+            error=obs_data.get("error", ""),
+            steps_remaining=obs_data.get("steps_remaining", 0),
+            question_index=obs_data.get("question_index", 0),
+            total_questions=obs_data.get("total_questions", 0),
+            done=payload.get("done", False),
+            reward=payload.get("reward"),
+            metadata=obs_data.get("metadata", {}),
+        )
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict) -> State:
+        return State(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+        )
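
In `_parse_result` above, most fields come from the nested `observation` object, while `done` and `reward` are read from the top level of the payload, and every lookup has a default. The sketch below mirrors that mapping with a plain dataclass so it runs without `openenv` installed. `ObservationSketch` and `parse_result` are hypothetical stand-ins covering only a subset of the fields, not the classes from this commit.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ObservationSketch:
    """Stand-in for SQLObservation with a subset of its fields."""

    question: str = ""
    query_result: str = ""
    error: str = ""
    steps_remaining: int = 0
    done: bool = False
    reward: Optional[float] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def parse_result(payload: Dict[str, Any]) -> ObservationSketch:
    """Map a server payload to an observation, tolerating missing keys."""
    obs = payload.get("observation", {})
    return ObservationSketch(
        question=obs.get("question", ""),
        query_result=obs.get("query_result", ""),
        error=obs.get("error", ""),
        steps_remaining=obs.get("steps_remaining", 0),
        # done and reward live at the top level of the payload, not in "observation".
        done=payload.get("done", False),
        reward=payload.get("reward"),
        metadata=obs.get("metadata", {}),
    )


empty = parse_result({})
full = parse_result(
    {
        "observation": {"question": "How many orders have the status 'shipped'?", "steps_remaining": 2},
        "reward": 1.0,
        "done": True,
    }
)
```

One consequence of this design is that an empty or malformed payload yields an observation with empty strings and `reward=None` rather than raising, so callers that need to detect a bad response have to check for that themselves.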

data/schema.sql ADDED

	@@ -0,0 +1,49 @@

+-- SQLEnv E-Commerce Database Schema
+-- Designed for text-to-SQL agent training with 3 difficulty levels
+CREATE TABLE IF NOT EXISTS customers (
+    id INTEGER PRIMARY KEY,
+    name TEXT NOT NULL,
+    email TEXT NOT NULL UNIQUE,
+    age INTEGER NOT NULL,
+    city TEXT NOT NULL,
+    signup_date TEXT NOT NULL  -- ISO format: YYYY-MM-DD
+);
+CREATE TABLE IF NOT EXISTS products (
+    id INTEGER PRIMARY KEY,
+    name TEXT NOT NULL,
+    category TEXT NOT NULL,  -- Electronics, Clothing, Books, Home
+    price REAL NOT NULL,
+    stock INTEGER NOT NULL DEFAULT 0
+);
+CREATE TABLE IF NOT EXISTS orders (
+    id INTEGER PRIMARY KEY,
+    customer_id INTEGER NOT NULL,
+    order_date TEXT NOT NULL,  -- ISO format: YYYY-MM-DD
+    status TEXT NOT NULL,      -- pending, shipped, delivered, cancelled
+    total_amount REAL NOT NULL,
+    FOREIGN KEY (customer_id) REFERENCES customers(id)
+);
+CREATE TABLE IF NOT EXISTS order_items (
+    id INTEGER PRIMARY KEY,
+    order_id INTEGER NOT NULL,
+    product_id INTEGER NOT NULL,
+    quantity INTEGER NOT NULL,
+    unit_price REAL NOT NULL,
+    FOREIGN KEY (order_id) REFERENCES orders(id),
+    FOREIGN KEY (product_id) REFERENCES products(id)
+);
+CREATE TABLE IF NOT EXISTS reviews (
+    id INTEGER PRIMARY KEY,
+    product_id INTEGER NOT NULL,
+    customer_id INTEGER NOT NULL,
+    rating INTEGER NOT NULL CHECK (rating >= 1 AND rating <= 5),
+    review_text TEXT NOT NULL,
+    review_date TEXT NOT NULL,  -- ISO format: YYYY-MM-DD
+    FOREIGN KEY (product_id) REFERENCES products(id),
+    FOREIGN KEY (customer_id) REFERENCES customers(id)
+);
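
The schema above declares `FOREIGN KEY` and `CHECK` constraints. In SQLite, `CHECK` constraints are enforced as soon as the table exists, but foreign keys are typically only enforced after `PRAGMA foreign_keys = ON` is issued on each connection. Whether `server/database.py` sets that pragma is not visible in this part of the diff. The sketch below demonstrates the difference on a trimmed-down copy of the schema; the helper names `orphan_order_accepted` and `rating_accepted` are hypothetical.

```python
import sqlite3

# Trimmed-down excerpt of the schema: just enough columns to exercise the constraints.
SCHEMA = """
CREATE TABLE IF NOT EXISTS customers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(id)
);
CREATE TABLE IF NOT EXISTS reviews (
    id INTEGER PRIMARY KEY,
    rating INTEGER NOT NULL CHECK (rating >= 1 AND rating <= 5)
);
"""


def orphan_order_accepted(enforce_fk: bool) -> bool:
    """Try to insert an order for a customer that does not exist."""
    conn = sqlite3.connect(":memory:")
    try:
        if enforce_fk:
            # Must be set per connection, before any transaction is open.
            conn.execute("PRAGMA foreign_keys = ON")
        conn.executescript(SCHEMA)
        try:
            conn.execute("INSERT INTO orders (id, customer_id) VALUES (1, 999)")
        except sqlite3.IntegrityError:
            return False
        return True
    finally:
        conn.close()


def rating_accepted(rating: int) -> bool:
    """Try to insert a review with the given rating; the CHECK needs no pragma."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        try:
            conn.execute("INSERT INTO reviews (id, rating) VALUES (1, ?)", (rating,))
        except sqlite3.IntegrityError:
            return False
        return True
    finally:
        conn.close()
```

With a deterministic seed file the foreign keys are satisfied by construction, so the pragma mainly matters if an agent's query is ever allowed to write to the database.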

data/seed.sql ADDED

	@@ -0,0 +1,200 @@

+-- SQLEnv Deterministic Seed Data
+-- Carefully crafted to support easy/medium/hard SQL tasks
+-- ============================================================
+-- CUSTOMERS (20 rows)
+-- ============================================================
+INSERT INTO customers (id, name, email, age, city, signup_date) VALUES
+(1,  'Aarav Sharma',    'aarav@example.com',    28, 'Mumbai',      '2023-03-15'),
+(2,  'Priya Patel',     'priya@example.com',    35, 'Delhi',       '2023-05-20'),
+(3,  'Rahul Kumar',     'rahul@example.com',    42, 'Bangalore',   '2023-01-10'),
+(4,  'Sneha Gupta',     'sneha@example.com',    24, 'Mumbai',      '2024-02-14'),
+(5,  'Vikram Singh',    'vikram@example.com',   31, 'Chennai',     '2023-07-01'),
+(6,  'Ananya Reddy',    'ananya@example.com',   29, 'Hyderabad',   '2023-09-05'),
+(7,  'Arjun Nair',      'arjun@example.com',    38, 'Bangalore',   '2023-04-18'),
+(8,  'Kavita Joshi',    'kavita@example.com',   45, 'Pune',        '2023-02-28'),
+(9,  'Deepak Verma',    'deepak@example.com',   33, 'Delhi',       '2024-01-05'),
+(10, 'Meera Iyer',      'meera@example.com',    27, 'Chennai',     '2023-11-12'),
+(11, 'Rohan Das',       'rohan@example.com',    36, 'Kolkata',     '2023-06-22'),
+(12, 'Pooja Mishra',    'pooja@example.com',    41, 'Pune',        '2023-08-15'),
+(13, 'Suresh Menon',    'suresh@example.com',   50, 'Mumbai',      '2023-03-30'),
+(14, 'Nisha Agarwal',   'nisha@example.com',    23, 'Delhi',       '2024-03-01'),
+(15, 'Amit Pandey',     'amit@example.com',     34, 'Bangalore',   '2023-10-10'),
+(16, 'Ritu Chopra',     'ritu@example.com',     30, 'Hyderabad',   '2023-12-25'),
+(17, 'Karan Malhotra',  'karan@example.com',    26, 'Mumbai',      '2024-01-20'),
+(18, 'Divya Saxena',    'divya@example.com',    39, 'Chennai',     '2023-05-05'),
+(19, 'Nikhil Bhat',     'nikhil@example.com',   32, 'Kolkata',     '2023-07-14'),
+(20, 'Swati Tiwari',    'swati@example.com',    44, 'Pune',        '2023-04-02');
+-- ============================================================
+-- PRODUCTS (15 rows, 4 categories)
+-- ============================================================
+INSERT INTO products (id, name, category, price, stock) VALUES
+(1,  'Wireless Headphones',   'Electronics', 2499.00,  50),
+(2,  'Smartphone Case',       'Electronics',  499.00, 200),
+(3,  'Bluetooth Speaker',     'Electronics', 3999.00,  30),
+(4,  'USB-C Cable',           'Electronics',  199.00, 500),
+(5,  'Cotton T-Shirt',        'Clothing',     599.00, 150),
+(6,  'Denim Jeans',           'Clothing',    1499.00,  80),
+(7,  'Running Shoes',         'Clothing',    2999.00,  40),
+(8,  'Winter Jacket',         'Clothing',    3499.00,  25),
+(9,  'Python Programming',    'Books',        449.00, 100),
+(10, 'Data Science Handbook',  'Books',       699.00,  60),
+(11, 'Mystery Novel',          'Books',       299.00, 120),
+(12, 'Cooking Recipes',        'Books',       399.00,  90),
+(13, 'Ceramic Mug Set',        'Home',        799.00,  70),
+(14, 'Desk Lamp',              'Home',       1299.00,  45),
+(15, 'Plant Pot',              'Home',        349.00, 110);
+-- ============================================================
+-- ORDERS (30 rows)
+-- ============================================================
+INSERT INTO orders (id, customer_id, order_date, status, total_amount) VALUES
+(1,  1,  '2024-01-15', 'delivered',  2998.00),
+(2,  1,  '2024-03-20', 'delivered',   499.00),
+(3,  2,  '2024-02-10', 'delivered',  4498.00),
+(4,  3,  '2024-01-05', 'delivered',  1798.00),
+(5,  3,  '2024-04-12', 'shipped',    3999.00),
+(6,  2,  '2024-03-01', 'delivered',   599.00),
+(7,  5,  '2024-02-28', 'delivered',  5997.00),
+(8,  5,  '2024-05-15', 'shipped',    1299.00),
+(9,  6,  '2024-01-20', 'delivered',   898.00),
+(10, 7,  '2024-03-10', 'delivered',  2499.00),
+(11, 7,  '2024-06-01', 'pending',     699.00),
+(12, 8,  '2024-02-14', 'delivered',  4998.00),
+(13, 8,  '2024-04-25', 'cancelled',  1499.00),
+(14, 9,  '2024-03-05', 'delivered',   848.00),
+(15, 10, '2024-01-30', 'delivered',  2999.00),
+(16, 10, '2024-05-20', 'shipped',     449.00),
+(17, 11, '2024-02-18', 'delivered',  1598.00),
+(18, 12, '2024-03-22', 'delivered',  3499.00),
+(19, 13, '2024-04-08', 'shipped',     798.00),
+(20, 13, '2024-01-12', 'delivered',  2499.00),
+(21, 11, '2024-05-01', 'pending',     599.00),
+(22, 15, '2024-02-05', 'delivered',  1748.00),
+(23, 15, '2024-06-10', 'pending',     299.00),
+(24, 16, '2024-03-15', 'delivered',   999.00),
+(25, 17, '2024-04-20', 'shipped',    2499.00),
+(26, 18, '2024-01-25', 'delivered',  4498.00),
+(27, 18, '2024-05-30', 'delivered',   699.00),
+(28, 19, '2024-02-22', 'delivered',  1299.00),
+(29, 19, '2024-06-05', 'cancelled',   399.00),
+(30, 20, '2024-03-28', 'delivered',  3798.00);
+-- ============================================================
+-- ORDER_ITEMS (46 rows)
+-- ============================================================
+INSERT INTO order_items (id, order_id, product_id, quantity, unit_price) VALUES
+-- Order 1: customer 1 bought headphones + smartphone case
+(1,  1,  1,  1, 2499.00),
+(2,  1,  2,  1,  499.00),
+-- Order 2: customer 1 bought smartphone case
+(3,  2,  2,  1,  499.00),
+-- Order 3: customer 2 bought headphones + jeans + smartphone case
+(4,  3,  1,  1, 2499.00),
+(5,  3,  6,  1, 1499.00),
+(6,  3,  2,  1,  499.00),
+-- Order 4: customer 3 bought desk lamp + smartphone case
+(7,  4,  14, 1, 1299.00),
+(8,  4,  2,  1,  499.00),
+-- Order 5: customer 3 bought bluetooth speaker
+(9,  5,  3,  1, 3999.00),
+-- Order 6: customer 2 bought t-shirt
+(10, 6,  5,  1,  599.00),
+-- Order 7: customer 5 bought running shoes x2
+(11, 7,  7,  2, 2999.00),
+-- Order 8: customer 5 bought desk lamp
+(12, 8,  14, 1, 1299.00),
+-- Order 9: customer 6 bought mug set + usb cable
+(13, 9,  13, 1,  799.00),
+(14, 9,  4,  1,  199.00),  -- with item 13 this sums to 998; orders.total_amount is deliberately set to 898
+-- Order 10: customer 7 bought headphones
+(15, 10, 1,  1, 2499.00),
+-- Order 11: customer 7 bought data science book
+(16, 11, 10, 1,  699.00),
+-- Order 12: customer 8 bought headphones x2
+(17, 12, 1,  2, 2499.00),
+-- Order 13: customer 8 bought jeans (cancelled)
+(18, 13, 6,  1, 1499.00),
+-- Order 14: customer 9 bought python book + plant pot
+(19, 14, 9,  1,  449.00),
+(20, 14, 15, 1,  349.00),
+-- Order 15: customer 10 bought running shoes
+(21, 15, 7,  1, 2999.00),
+-- Order 16: customer 10 bought python book
+(22, 16, 9,  1,  449.00),
+-- Order 17: customer 11 bought mug set x2
+(23, 17, 13, 2,  799.00),
+-- Order 18: customer 12 bought winter jacket
+(24, 18, 8,  1, 3499.00),
+-- Order 19: customer 13 bought mug set (shipped)
+(25, 19, 13, 1,  799.00),
+-- Order 20: customer 13 bought headphones
+(26, 20, 1,  1, 2499.00),
+-- Order 21: customer 11 bought t-shirt (pending)
+(27, 21, 5,  1,  599.00),
+-- Order 22: customer 15 bought python book + desk lamp
+(28, 22, 9,  1,  449.00),
+(29, 22, 14, 1, 1299.00),
+-- Order 23: customer 15 bought mystery novel (pending)
+(30, 23, 11, 1,  299.00),
+-- Order 24: customer 16 bought t-shirt + cooking book
+(31, 24, 5,  1,  599.00),
+(32, 24, 12, 1,  399.00),
+-- Order 25: customer 17 bought headphones (shipped)
+(33, 25, 1,  1, 2499.00),
+-- Order 26: customer 18 bought bluetooth speaker + smartphone case
+(34, 26, 3,  1, 3999.00),
+(35, 26, 2,  1,  499.00),
+-- Order 27: customer 18 bought data science book
+(36, 27, 10, 1,  699.00),
+-- Order 28: customer 19 bought desk lamp
+(37, 28, 14, 1, 1299.00),
+-- Order 29: customer 19 bought cooking book (cancelled)
+(38, 29, 12, 1,  399.00),
+-- Order 30: customer 20 bought running shoes + mug set
+(39, 30, 7,  1, 2999.00),
+(40, 30, 13, 1,  799.00),
+-- Extra items to diversify category coverage
+-- Order 3 (customer 2): add a book → Electronics + Clothing + Books
+(41, 3,  9,  1,  449.00),
+-- Order 4 (customer 3): add a t-shirt → Home + Electronics + Clothing
+(42, 4,  5,  1,  599.00),
+-- Order 4 (customer 3): add a book → Home + Electronics + Clothing + Books = 4 categories
+(43, 4,  11, 1,  299.00),
+-- Order 24 (customer 16): add a mug set → Clothing + Books + Home
+(44, 24, 13, 1,  799.00),
+-- Order 9 (customer 6): add a book → Home + Electronics + Books
+(45, 9,  12, 1,  399.00),
+-- Order 17 (customer 11): add a USB-C cable → with order 21, customer 11 spans Home + Clothing + Electronics
+(46, 17, 4,  1,  199.00);
+-- ============================================================
+-- REVIEWS (25 rows)
+-- ============================================================
+INSERT INTO reviews (id, product_id, customer_id, rating, review_text, review_date) VALUES
+(1,  1,  1,  5, 'Amazing sound quality, worth every penny!',          '2024-02-01'),
+(2,  1,  7,  4, 'Good headphones, battery could be better.',          '2024-03-25'),
+(3,  1,  13, 5, 'Best wireless headphones I have owned.',             '2024-02-15'),
+(4,  2,  1,  3, 'Decent case but feels a bit cheap.',                 '2024-04-01'),
+(5,  2,  3,  4, 'Good fit, protects the phone well.',                 '2024-02-10'),
+(6,  3,  18, 5, 'Incredible bass for the price!',                     '2024-02-20'),
+(7,  3,  5,  4, 'Great speaker, slightly heavy though.',              '2024-06-01'),
+(8,  5,  4,  4, 'Soft cotton, very comfortable.',                     '2024-03-15'),
+(9,  5,  16, 3, 'Okay quality, shrunk after washing.',                '2024-04-05'),
+(10, 6,  2,  5, 'Perfect fit, love the style!',                       '2024-03-10'),
+(11, 7,  5,  5, 'Super comfortable for running.',                     '2024-03-20'),
+(12, 7,  10, 4, 'Good shoes but sizing runs a bit large.',            '2024-02-15'),
+(13, 7,  20, 5, 'Best running shoes ever!',                           '2024-04-10'),
+(14, 8,  12, 4, 'Warm and stylish, great for winter.',                '2024-04-01'),
+(15, 9,  9,  5, 'Excellent book for learning Python.',                '2024-04-10'),
+(16, 9,  15, 4, 'Good content, some chapters feel rushed.',           '2024-03-01'),
+(17, 10, 7,  3, 'Covers basics well, lacks depth on ML topics.',      '2024-06-15'),
+(18, 11, 15, 4, 'Gripping plot, could not put it down!',             '2024-06-20'),
+(19, 13, 6,  5, 'Beautiful mugs, great as a gift set.',               '2024-02-10'),
+(20, 13, 11, 4, 'Nice mugs, one had a small chip.',                   '2024-03-05'),
+(21, 13, 20, 5, 'Excellent quality ceramic.',                         '2024-04-15'),
+(22, 14, 3,  4, 'Bright and adjustable, good for desk work.',         '2024-02-20'),
+(23, 14, 19, 5, 'Perfect desk lamp, love the design.',                '2024-03-10'),
+(24, 15, 9,  3, 'Looks nice but drainage hole is too small.',         '2024-04-15'),
+(25, 12, 16, 4, 'Great recipes, easy to follow instructions.',        '2024-04-20');
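
In the seed data above, `orders.total_amount` does not always equal the sum of `quantity * unit_price` over the order's items (order 9 carries an explicit comment about this, and the later "extra items" widen the gap). This matters when reading the task files, because some ground-truth queries aggregate `total_amount` while others aggregate order items. The following is a minimal sketch for listing such orders, run here against a two-order excerpt of the seed rows; `build_excerpt` and `mismatched_orders` are hypothetical helpers, not code from this commit.

```python
import sqlite3
from typing import List, Tuple


def build_excerpt() -> sqlite3.Connection:
    """Load orders 1 and 9 and their items from the seed data into memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, total_amount REAL NOT NULL);
        CREATE TABLE order_items (
            id INTEGER PRIMARY KEY,
            order_id INTEGER NOT NULL,
            quantity INTEGER NOT NULL,
            unit_price REAL NOT NULL
        );
        INSERT INTO orders VALUES (1, 2998.00), (9, 898.00);
        INSERT INTO order_items VALUES
            (1, 1, 1, 2499.00), (2, 1, 1, 499.00),
            (13, 9, 1, 799.00), (14, 9, 1, 199.00), (45, 9, 1, 399.00);
        """
    )
    return conn


def mismatched_orders(conn: sqlite3.Connection) -> List[Tuple[int, float, float]]:
    """Return (order_id, total_amount, items_total) where the two totals disagree."""
    query = """
        SELECT o.id, o.total_amount, SUM(oi.quantity * oi.unit_price) AS items_total
        FROM orders o
        JOIN order_items oi ON oi.order_id = o.id
        GROUP BY o.id
        HAVING ABS(o.total_amount - SUM(oi.quantity * oi.unit_price)) > 0.005
        ORDER BY o.id
    """
    return conn.execute(query).fetchall()


conn = build_excerpt()
mismatches = mismatched_orders(conn)
conn.close()
```

Running the same query against the full seeded database would show every order affected, which is useful context when a question's wording leaves open which revenue figure is meant.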

data/tasks/advanced_analytics.json ADDED

	@@ -0,0 +1,101 @@

+{
+  "task_name": "advanced_analytics",
+  "difficulty": "hard",
+  "description": "Subqueries, CTEs, window functions, and complex multi-table analytics",
+  "max_steps_per_question": 5,
+  "questions": [
+    {
+      "id": "hard_1",
+      "question": "Find all customers whose total spending across all orders exceeds the average total spending per customer. Show customer name and total spent, sorted by total spent from highest to lowest.",
+      "ground_truth_sql": "SELECT c.name, SUM(o.total_amount) as total_spent FROM customers c JOIN orders o ON c.id = o.customer_id GROUP BY c.id HAVING total_spent > (SELECT AVG(total_spent) FROM (SELECT SUM(total_amount) as total_spent FROM orders GROUP BY customer_id)) ORDER BY total_spent DESC",
+      "expected_columns": ["name", "total_spent"],
+      "expected_row_count": 9,
+      "expected_rows": [
+        ["Vikram Singh", 7296.0],
+        ["Kavita Joshi", 6497.0],
+        ["Rahul Kumar", 5797.0],
+        ["Divya Saxena", 5197.0],
+        ["Priya Patel", 5097.0],
+        ["Swati Tiwari", 3798.0],
+        ["Pooja Mishra", 3499.0],
+        ["Aarav Sharma", 3497.0],
+        ["Meera Iyer", 3448.0]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "hard_2",
+      "question": "Rank all products by their total revenue (quantity * unit_price from order_items) within each product category. Show category, product name, revenue, and the rank within the category. Sort by category alphabetically, then by rank.",
+      "ground_truth_sql": "SELECT p.category, p.name, SUM(oi.quantity * oi.unit_price) as revenue, RANK() OVER (PARTITION BY p.category ORDER BY SUM(oi.quantity * oi.unit_price) DESC) as category_rank FROM products p JOIN order_items oi ON p.id = oi.product_id GROUP BY p.id ORDER BY p.category, category_rank",
+      "expected_columns": ["category", "name", "revenue", "category_rank"],
+      "expected_row_count": 15,
+      "expected_rows": [
+        ["Books", "Python Programming", 1796.0, 1],
+        ["Books", "Data Science Handbook", 1398.0, 2],
+        ["Books", "Cooking Recipes", 1197.0, 3],
+        ["Books", "Mystery Novel", 598.0, 4],
+        ["Clothing", "Running Shoes", 11996.0, 1],
+        ["Clothing", "Winter Jacket", 3499.0, 2],
+        ["Clothing", "Denim Jeans", 2998.0, 3],
+        ["Clothing", "Cotton T-Shirt", 2396.0, 4],
+        ["Electronics", "Wireless Headphones", 17493.0, 1],
+        ["Electronics", "Bluetooth Speaker", 7998.0, 2],
+        ["Electronics", "Smartphone Case", 2495.0, 3],
+        ["Electronics", "USB-C Cable", 398.0, 4],
+        ["Home", "Desk Lamp", 5196.0, 1],
+        ["Home", "Ceramic Mug Set", 4794.0, 2],
+        ["Home", "Plant Pot", 349.0, 3]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "hard_3",
+      "question": "Calculate the month-over-month growth in order count for 2024. Show the month (as YYYY-MM), the number of orders that month, and the change from the previous month (NULL for the first month). Sort by month.",
+      "ground_truth_sql": "SELECT strftime('%Y-%m', order_date) as month, COUNT(*) as order_count, COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY strftime('%Y-%m', order_date)) as growth FROM orders GROUP BY month ORDER BY month",
+      "expected_columns": ["month", "order_count", "growth"],
+      "expected_row_count": 6,
+      "expected_rows": [
+        ["2024-01", 6, null],
+        ["2024-02", 6, 0],
+        ["2024-03", 7, 1],
+        ["2024-04", 4, -3],
+        ["2024-05", 4, 0],
+        ["2024-06", 3, -1]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "hard_4",
+      "question": "Find all customers who have purchased products from at least 3 different product categories. Show the customer name and the number of distinct categories they bought from, sorted by category count descending then name ascending.",
+      "ground_truth_sql": "SELECT c.name, COUNT(DISTINCT p.category) as category_count FROM customers c JOIN orders o ON c.id = o.customer_id JOIN order_items oi ON o.id = oi.order_id JOIN products p ON oi.product_id = p.id GROUP BY c.id HAVING category_count >= 3 ORDER BY category_count DESC, c.name ASC",
+      "expected_columns": ["name", "category_count"],
+      "expected_row_count": 5,
+      "expected_rows": [
+        ["Rahul Kumar", 4],
+        ["Ananya Reddy", 3],
+        ["Priya Patel", 3],
+        ["Ritu Chopra", 3],
+        ["Rohan Das", 3]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "hard_5",
+      "question": "List every product that has at least 2 reviews together with its average review rating. Show the category, product name, and average rating (rounded to 2 decimal places). Sort by category alphabetically, then by average rating descending.",
+      "ground_truth_sql": "SELECT p.category, p.name, ROUND(AVG(r.rating), 2) as avg_rating FROM products p JOIN reviews r ON p.id = r.product_id GROUP BY p.id HAVING COUNT(r.id) >= 2 ORDER BY p.category, avg_rating DESC",
+      "expected_columns": ["category", "name", "avg_rating"],
+      "expected_row_count": 8,
+      "expected_rows": [
+        ["Books", "Python Programming", 4.5],
+        ["Clothing", "Running Shoes", 4.67],
+        ["Clothing", "Cotton T-Shirt", 3.5],
+        ["Electronics", "Wireless Headphones", 4.67],
+        ["Electronics", "Bluetooth Speaker", 4.5],
+        ["Electronics", "Smartphone Case", 3.5],
+        ["Home", "Ceramic Mug Set", 4.67],
+        ["Home", "Desk Lamp", 4.5]
+      ],
+      "order_matters": true
+    }
+  ]
+}
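
Each question above carries `expected_rows` and an `order_matters` flag, and some expected cells are `null` or rounded floats. The actual comparison logic lives in `server/graders.py`, which is not shown in this part of the diff, so the following is only a sketch of how such a comparison could work. The function names and the `0.01` tolerance are assumptions chosen for illustration.

```python
import math
from typing import Any, Sequence


def cells_equal(a: Any, b: Any, tol: float = 0.01) -> bool:
    """Compare two cells: None only matches None, numbers match within `tol`."""
    if a is None or b is None:
        return a is None and b is None
    a_num = isinstance(a, (int, float)) and not isinstance(a, bool)
    b_num = isinstance(b, (int, float)) and not isinstance(b, bool)
    if a_num and b_num:
        return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
    return a == b


def row_equal(a: Sequence[Any], b: Sequence[Any], tol: float) -> bool:
    return len(a) == len(b) and all(cells_equal(x, y, tol) for x, y in zip(a, b))


def rows_match(
    actual: Sequence[Sequence[Any]],
    expected: Sequence[Sequence[Any]],
    order_matters: bool,
    tol: float = 0.01,
) -> bool:
    """Return True if the query result matches the expected rows."""
    if len(actual) != len(expected):
        return False
    if order_matters:
        return all(row_equal(a, e, tol) for a, e in zip(actual, expected))
    # Order-insensitive: match each expected row against one unused actual row.
    remaining = list(actual)
    for e in expected:
        for i, a in enumerate(remaining):
            if row_equal(a, e, tol):
                del remaining[i]
                break
        else:
            return False
    return True


# First three expected rows of hard_3, and the same rows as sqlite3 would return them.
expected = [["2024-01", 6, None], ["2024-02", 6, 0], ["2024-03", 7, 1]]
actual = [("2024-01", 6, None), ("2024-02", 6, 0), ("2024-03", 7, 1)]
```

Comparing tuples from `sqlite3` against JSON lists cell by cell avoids a type mismatch between `tuple` and `list`, and the tolerance keeps `ROUND(..., 2)` results from failing on floating-point noise.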

data/tasks/basic_select.json ADDED

	@@ -0,0 +1,75 @@

+{
+  "task_name": "basic_select",
+  "difficulty": "easy",
+  "description": "Simple SELECT queries with WHERE, ORDER BY, LIMIT",
+  "max_steps_per_question": 3,
+  "questions": [
+    {
+      "id": "easy_1",
+      "question": "Find the names and ages of all customers older than 30, sorted by age from highest to lowest.",
+      "ground_truth_sql": "SELECT name, age FROM customers WHERE age > 30 ORDER BY age DESC",
+      "expected_columns": ["name", "age"],
+      "expected_row_count": 13,
+      "expected_rows": [
+        ["Suresh Menon", 50],
+        ["Kavita Joshi", 45],
+        ["Swati Tiwari", 44],
+        ["Rahul Kumar", 42],
+        ["Pooja Mishra", 41],
+        ["Divya Saxena", 39],
+        ["Arjun Nair", 38],
+        ["Rohan Das", 36],
+        ["Priya Patel", 35],
+        ["Amit Pandey", 34],
+        ["Deepak Verma", 33],
+        ["Nikhil Bhat", 32],
+        ["Vikram Singh", 31]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "easy_2",
+      "question": "List all products in the 'Electronics' category, showing name and price, sorted by price from highest to lowest.",
+      "ground_truth_sql": "SELECT name, price FROM products WHERE category = 'Electronics' ORDER BY price DESC",
+      "expected_columns": ["name", "price"],
+      "expected_row_count": 4,
+      "expected_rows": [
+        ["Bluetooth Speaker", 3999.0],
+        ["Wireless Headphones", 2499.0],
+        ["Smartphone Case", 499.0],
+        ["USB-C Cable", 199.0]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "easy_3",
+      "question": "How many orders have the status 'shipped'?",
+      "ground_truth_sql": "SELECT COUNT(*) as shipped_count FROM orders WHERE status = 'shipped'",
+      "expected_columns": ["shipped_count"],
+      "expected_row_count": 1,
+      "expected_rows": [[5]],
+      "order_matters": false
+    },
+    {
+      "id": "easy_4",
+      "question": "What is the most expensive product? Show its name and price.",
+      "ground_truth_sql": "SELECT name, price FROM products ORDER BY price DESC LIMIT 1",
+      "expected_columns": ["name", "price"],
+      "expected_row_count": 1,
+      "expected_rows": [["Bluetooth Speaker", 3999.0]],
+      "order_matters": false
+    },
+    {
+      "id": "easy_5",
+      "question": "Find all customers from Mumbai who signed up after January 1, 2024. Show their name and signup date, sorted by signup date.",
+      "ground_truth_sql": "SELECT name, signup_date FROM customers WHERE city = 'Mumbai' AND signup_date > '2024-01-01' ORDER BY signup_date",
+      "expected_columns": ["name", "signup_date"],
+      "expected_row_count": 2,
+      "expected_rows": [
+        ["Karan Malhotra", "2024-01-20"],
+        ["Sneha Gupta", "2024-02-14"]
+      ],
+      "order_matters": true
+    }
+  ]
+}
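The `order_matters` flag in the task entries above decides whether a result set is compared as a sequence or as a multiset. A minimal sketch of that distinction, using the `easy_2` rows; `rows_equal` is a hypothetical helper written for illustration and is not the repository's grader:

```python
from collections import Counter
from typing import List, Sequence


def rows_equal(expected: List[Sequence], actual: List[Sequence], order_matters: bool) -> bool:
    """Compare two result sets as sequences (ordered) or as multisets (unordered)."""
    exp = [tuple(r) for r in expected]
    act = [tuple(r) for r in actual]
    if order_matters:
        return exp == act
    # Counter keeps duplicate rows, so [a, a] does not equal [a].
    return Counter(exp) == Counter(act)


expected = [
    ["Bluetooth Speaker", 3999.0],
    ["Wireless Headphones", 2499.0],
    ["Smartphone Case", 499.0],
    ["USB-C Cable", 199.0],
]
shuffled = list(reversed(expected))

print(rows_equal(expected, shuffled, order_matters=True))
print(rows_equal(expected, shuffled, order_matters=False))
```

With `order_matters` set to true, an `ORDER BY` in the wrong direction fails the comparison even though every row is present.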

data/tasks/join_aggregate.json ADDED Viewed

	@@ -0,0 +1,93 @@

+{
+  "task_name": "join_aggregate",
+  "difficulty": "medium",
+  "description": "JOIN queries with GROUP BY, HAVING, and aggregate functions",
+  "max_steps_per_question": 4,
+  "questions": [
+    {
+      "id": "med_1",
+      "question": "What is the average order total for each customer? Show customer name and average total (rounded to 2 decimal places), sorted by average total from highest to lowest.",
+      "ground_truth_sql": "SELECT c.name, ROUND(AVG(o.total_amount), 2) as avg_total FROM customers c JOIN orders o ON c.id = o.customer_id GROUP BY c.id ORDER BY avg_total DESC",
+      "expected_columns": ["name", "avg_total"],
+      "expected_row_count": 18,
+      "expected_rows": [
+        ["Swati Tiwari", 3798.0],
+        ["Vikram Singh", 3648.0],
+        ["Pooja Mishra", 3499.0],
+        ["Kavita Joshi", 3248.5],
+        ["Rahul Kumar", 2898.5],
+        ["Divya Saxena", 2598.5],
+        ["Priya Patel", 2548.5],
+        ["Karan Malhotra", 2499.0],
+        ["Aarav Sharma", 1748.5],
+        ["Meera Iyer", 1724.0],
+        ["Suresh Menon", 1648.5],
+        ["Arjun Nair", 1599.0],
+        ["Rohan Das", 1098.5],
+        ["Amit Pandey", 1023.5],
+        ["Ritu Chopra", 999.0],
+        ["Ananya Reddy", 898.0],
+        ["Nikhil Bhat", 849.0],
+        ["Deepak Verma", 848.0]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "med_2",
+      "question": "Which products have been ordered more than 2 times in total quantity? Show the product name and total quantity ordered, sorted by total quantity from highest to lowest.",
+      "ground_truth_sql": "SELECT p.name, SUM(oi.quantity) as total_qty FROM products p JOIN order_items oi ON p.id = oi.product_id GROUP BY p.id HAVING total_qty > 2 ORDER BY total_qty DESC",
+      "expected_columns": ["name", "total_qty"],
+      "expected_row_count": 8,
+      "expected_rows": [
+        ["Wireless Headphones", 7],
+        ["Ceramic Mug Set", 6],
+        ["Smartphone Case", 5],
+        ["Desk Lamp", 4],
+        ["Python Programming", 4],
+        ["Running Shoes", 4],
+        ["Cotton T-Shirt", 4],
+        ["Cooking Recipes", 3]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "med_3",
+      "question": "List all customers who have never placed an order. Show their name and email, sorted by name.",
+      "ground_truth_sql": "SELECT name, email FROM customers WHERE id NOT IN (SELECT DISTINCT customer_id FROM orders) ORDER BY name",
+      "expected_columns": ["name", "email"],
+      "expected_row_count": 2,
+      "expected_rows": [
+        ["Nisha Agarwal", "nisha@example.com"],
+        ["Sneha Gupta", "sneha@example.com"]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "med_4",
+      "question": "What is the total revenue per product category? Calculate revenue as quantity times unit_price from order_items. Show category and revenue (rounded to 2 decimal places), sorted by revenue from highest to lowest.",
+      "ground_truth_sql": "SELECT p.category, ROUND(SUM(oi.quantity * oi.unit_price), 2) as revenue FROM products p JOIN order_items oi ON p.id = oi.product_id GROUP BY p.category ORDER BY revenue DESC",
+      "expected_columns": ["category", "revenue"],
+      "expected_row_count": 4,
+      "expected_rows": [
+        ["Electronics", 28384.0],
+        ["Clothing", 20889.0],
+        ["Home", 10339.0],
+        ["Books", 4989.0]
+      ],
+      "order_matters": true
+    },
+    {
+      "id": "med_5",
+      "question": "Who are the top 3 customers by total spending? Show customer name and total amount spent across all orders, sorted by total spent from highest to lowest.",
+      "ground_truth_sql": "SELECT c.name, SUM(o.total_amount) as total_spent FROM customers c JOIN orders o ON c.id = o.customer_id GROUP BY c.id ORDER BY total_spent DESC LIMIT 3",
+      "expected_columns": ["name", "total_spent"],
+      "expected_row_count": 3,
+      "expected_rows": [
+        ["Vikram Singh", 7296.0],
+        ["Kavita Joshi", 6497.0],
+        ["Rahul Kumar", 5797.0]
+      ],
+      "order_matters": true
+    }
+  ]
+}
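Each question carries redundant fields (`expected_row_count`, `expected_columns`, `expected_rows`) that can drift apart when a task file is edited by hand. The following is a small consistency check, assuming only the field names shown in the task files; `check_task` is a hypothetical helper and not part of the repository:

```python
import json
from typing import List

# A trimmed copy of one question from join_aggregate.json.
TASK_JSON = """
{
  "task_name": "join_aggregate",
  "questions": [
    {
      "id": "med_3",
      "expected_columns": ["name", "email"],
      "expected_row_count": 2,
      "expected_rows": [
        ["Nisha Agarwal", "nisha@example.com"],
        ["Sneha Gupta", "sneha@example.com"]
      ],
      "order_matters": true
    }
  ]
}
"""


def check_task(task: dict) -> List[str]:
    """Return a list of human-readable inconsistencies found in a task dict."""
    problems = []
    for q in task["questions"]:
        rows = q["expected_rows"]
        if q["expected_row_count"] != len(rows):
            problems.append(
                f"{q['id']}: expected_row_count={q['expected_row_count']} "
                f"but {len(rows)} rows are listed"
            )
        width = len(q["expected_columns"])
        for i, row in enumerate(rows):
            if len(row) != width:
                problems.append(f"{q['id']}: row {i} has {len(row)} values, expected {width}")
    return problems


task = json.loads(TASK_JSON)
problems = check_task(task)
print(problems or "task file is internally consistent")
```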

inference.py ADDED Viewed

	@@ -0,0 +1,265 @@

+"""
+Inference Script for SQL Query Writing Environment
+===================================================
+MANDATORY — this file must be named `inference.py` and placed in the project root.
+Uses OpenAI Client for all LLM calls. Reads credentials from environment variables:
+    API_BASE_URL   The API endpoint for the LLM.
+    MODEL_NAME     The model identifier to use for inference.
+    HF_TOKEN       Your Hugging Face / API key.
+STDOUT FORMAT:
+    [START] task=<task_name> env=<benchmark> model=<model_name>
+    [STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+    [END]   success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...,rn>
+"""
+import os
+import sys
+import textwrap
+from typing import List, Optional
+from openai import OpenAI
+# ---------------------------------------------------------------------------
+# Environment — runs locally (no Docker needed for inference)
+# ---------------------------------------------------------------------------
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+os.environ.setdefault("SQL_ENV_TASK", "basic_select")
+from server.sql_env_environment import SQLEnvironment
+from models import SQLAction
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+if not API_KEY:
+    try:
+        from huggingface_hub import get_token
+        API_KEY = get_token()
+    except Exception:
+        pass
+API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+BENCHMARK = "sql_env"
+TASKS = ["basic_select", "join_aggregate", "advanced_analytics"]
+MAX_STEPS = 8
+TEMPERATURE = 0.3
+MAX_TOKENS = 512
+SUCCESS_SCORE_THRESHOLD = 0.1
+# ---------------------------------------------------------------------------
+# Logging helpers (MANDATORY stdout format)
+# ---------------------------------------------------------------------------
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    # Sanitize action: remove newlines, truncate for readability
+    action_clean = action.replace("\n", " ").replace("\r", "").strip()
+    if len(action_clean) > 200:
+        action_clean = action_clean[:200] + "..."
+    error_val = error.replace("\n", " ").replace("\r", "").strip() if error else "null"
+    done_val = str(done).lower()
+    print(
+        f"[STEP] step={step} action={action_clean} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+# ---------------------------------------------------------------------------
+# System prompt for the LLM
+# ---------------------------------------------------------------------------
+SYSTEM_PROMPT = textwrap.dedent("""
+    You are an expert SQL query writer. You are given a database schema and a
+    natural language question. Write a single SQL SELECT query that answers
+    the question exactly.
+    Rules:
+    - Write ONLY the SQL query, nothing else. No explanations, no markdown.
+    - Use only SELECT statements (no INSERT, UPDATE, DELETE, etc.)
+    - Match the requested column names and sorting exactly.
+    - Use standard SQL compatible with SQLite.
+    - If the question asks for rounding, use ROUND().
+    - If the question asks for sorting, include ORDER BY.
+    - Pay attention to whether results should be sorted ascending or descending.
+""").strip()
+def build_user_prompt(
+    question: str,
+    schema: str,
+    last_result: str,
+    last_error: str,
+    feedback: str,
+    attempt: int,
+) -> str:
+    """Build the prompt for the LLM."""
+    parts = [
+        f"DATABASE SCHEMA:\n{schema}\n",
+        f"QUESTION: {question}\n",
+    ]
+    if attempt > 1:
+        parts.append(f"PREVIOUS ATTEMPT RESULT:\n{last_result}\n")
+        if last_error:
+            parts.append(f"ERROR: {last_error}\n")
+        if feedback:
+            parts.append(f"FEEDBACK: {feedback}\n")
+        parts.append(
+            f"This is attempt {attempt}. Fix the query based on the feedback above.\n"
+        )
+    parts.append("Write the SQL query:")
+    return "\n".join(parts)
+def get_sql_from_model(
+    client: OpenAI,
+    question: str,
+    schema: str,
+    last_result: str,
+    last_error: str,
+    feedback: str,
+    attempt: int,
+) -> str:
+    """Call the LLM to generate a SQL query."""
+    user_prompt = build_user_prompt(
+        question, schema, last_result, last_error, feedback, attempt
+    )
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": user_prompt},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+            stream=False,
+        )
+        text = (completion.choices[0].message.content or "").strip()
+        # Clean up: remove markdown code blocks if present
+        if text.startswith("```"):
+            lines = text.split("\n")
+            # Drop every fence line (the opening ```sql and the closing ```)
+            lines = [l for l in lines if not l.strip().startswith("```")]
+            text = "\n".join(lines).strip()
+        return text if text else "SELECT 1"
+    except Exception as exc:
+        print(f"[DEBUG] Model request failed: {exc}", flush=True)
+        return "SELECT 1"
+def run_task(client: OpenAI, task_name: str) -> None:
+    """Run a single task and emit [START]/[STEP]/[END] logs."""
+    os.environ["SQL_ENV_TASK"] = task_name
+    env = SQLEnvironment()
+    obs = env.reset()
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    attempt_on_q = 0
+    log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
+    try:
+        last_result = ""
+        last_error = ""
+        feedback = ""
+        attempt_on_q = 1
+        for step in range(1, MAX_STEPS + 1):
+            if obs.done:
+                break
+            # Get SQL from model
+            sql_query = get_sql_from_model(
+                client=client,
+                question=obs.question,
+                schema=obs.schema_description,
+                last_result=last_result,
+                last_error=last_error,
+                feedback=feedback,
+                attempt=attempt_on_q,
+            )
+            # Step the environment
+            obs = env.step(SQLAction(query=sql_query))
+            reward = obs.reward
+            done = obs.done
+            error = obs.error if obs.error else None
+            rewards.append(reward)
+            steps_taken = step
+            log_step(
+                step=step,
+                action=sql_query,
+                reward=reward,
+                done=done,
+                error=error,
+            )
+            # Track state for retry prompting
+            last_result = obs.query_result
+            last_error = obs.error
+            feedback = obs.metadata.get("feedback", "")
+            # Track attempt number for current question
+            if reward >= 0.98:  # near-perfect, moved to next question
+                attempt_on_q = 1
+            else:
+                attempt_on_q += 1
+            if done:
+                break
+        # Calculate normalized score
+        max_possible = obs.total_questions  # 5 questions, max 1.0 each
+        if max_possible > 0:
+            score = sum(rewards) / max_possible
+        score = min(max(score, 0.0), 1.0)
+        success = score >= SUCCESS_SCORE_THRESHOLD
+    except Exception as exc:
+        print(f"[DEBUG] Task {task_name} error: {exc}", flush=True)
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+def main() -> None:
+    """Run inference on all 3 tasks."""
+    if not API_KEY:
+        print("[ERROR] HF_TOKEN or API_KEY environment variable is required.", flush=True)
+        sys.exit(1)
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    for task_name in TASKS:
+        run_task(client, task_name)
+        print("", flush=True)  # blank line between tasks
+if __name__ == "__main__":
+    main()
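The fence clean-up inside `get_sql_from_model` can be exercised on its own. Below is a minimal standalone sketch of that step; the function name `strip_code_fences` and the `FENCE` constant are introduced here for illustration and do not appear in `inference.py`:

```python
FENCE = "`" * 3  # three backticks, built indirectly so this example nests cleanly


def strip_code_fences(text: str) -> str:
    """Remove Markdown fence lines from a model reply, keeping the SQL inside."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop every line that is itself a fence; keep everything else.
        kept = [ln for ln in text.split("\n") if not ln.strip().startswith(FENCE)]
        text = "\n".join(kept).strip()
    return text


reply = FENCE + "sql\nSELECT name FROM customers\n" + FENCE
print(strip_code_fences(reply))             # SELECT name FROM customers
print(strip_code_fences("SELECT 1"))        # SELECT 1
```

A reply that contains only fence lines comes back as an empty string, which is the case `inference.py` covers by falling back to `SELECT 1`.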

models.py ADDED Viewed

	@@ -0,0 +1,28 @@

+"""
+Data models for the SQL Query Writing Environment.
+Defines the Action (agent submits SQL query) and Observation
+(agent sees schema, question, query results, reward) types.
+"""
+from openenv.core.env_server.types import Action, Observation
+from pydantic import Field
+class SQLAction(Action):
+    """Action for the SQL environment - a SQL query to execute."""
+    query: str = Field(..., description="SQL SELECT query to execute against the database")
+class SQLObservation(Observation):
+    """Observation from the SQL environment."""
+    task_name: str = Field(default="", description="Current task identifier (e.g., basic_select)")
+    question: str = Field(default="", description="Natural language question to answer with SQL")
+    schema_description: str = Field(default="", description="Human-readable database schema")
+    query_result: str = Field(default="", description="Result of the last executed query (or error)")
+    error: str = Field(default="", description="SQL error message if query failed, empty otherwise")
+    steps_remaining: int = Field(default=0, description="Number of attempts remaining for current question")
+    question_index: int = Field(default=0, description="Current question number (1-indexed)")
+    total_questions: int = Field(default=0, description="Total questions in this task")

openenv.yaml ADDED Viewed

	@@ -0,0 +1,6 @@

+spec_version: 1
+name: sql_env
+type: space
+runtime: fastapi
+app: server.app:app
+port: 7860

pyproject.toml ADDED Viewed

	@@ -0,0 +1,45 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-sql_env"
+version = "0.1.0"
+description = "SQL Env environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "openenv-core[core]>=0.2.2",
+    # Environment-specific dependencies
+    # Add all dependencies needed for your environment here
+    # Examples:
+    # "numpy>=1.19.0",
+    # "torch>=2.0.0",
+    # "gymnasium>=0.29.0",
+    # "openspiel>=1.0.0",
+    # "smolagents>=1.22.0,<2",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m sql_env.server.app
+server = "sql_env.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["sql_env", "sql_env.server"]
+package-dir = { "sql_env" = ".", "sql_env.server" = "server" }

server/Dockerfile ADDED Viewed

	@@ -0,0 +1,80 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=sql_env
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install dependencies using uv sync
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]

server/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+"""SQL environment server components."""

server/app.py ADDED Viewed

	@@ -0,0 +1,50 @@

+"""
+FastAPI application for the SQL Query Writing Environment.
+Endpoints:
+    - POST /reset: Reset the environment
+    - POST /step: Execute an action (SQL query)
+    - GET /state: Get current environment state
+    - GET /health: Health check
+    - WS /ws: WebSocket endpoint for persistent sessions
+Usage:
+    uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+"""
+try:
+    from openenv.core.env_server.http_server import create_app
+except Exception as e:
+    raise ImportError(
+        "openenv is required. Install with: pip install openenv-core"
+    ) from e
+try:
+    from ..models import SQLAction, SQLObservation
+    from .sql_env_environment import SQLEnvironment
+except (ImportError, ModuleNotFoundError):
+    from models import SQLAction, SQLObservation
+    from server.sql_env_environment import SQLEnvironment
+app = create_app(
+    SQLEnvironment,
+    SQLAction,
+    SQLObservation,
+    env_name="sql_env",
+    max_concurrent_envs=3,
+)
+def main(host: str = "0.0.0.0", port: int = 8000):
+    """Entry point for direct execution."""
+    import uvicorn
+    uvicorn.run(app, host=host, port=port)
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--port", type=int, default=8000)
+    args = parser.parse_args()
+    main(port=args.port)

server/database.py ADDED Viewed

	@@ -0,0 +1,214 @@

+"""
+SQLite database management for SQLEnv.
+Handles database initialization, query execution, and schema introspection.
+All operations use an in-memory SQLite database that is re-created on each
+environment reset, ensuring deterministic, isolated episodes.
+"""
+import os
+import sqlite3
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import List, Optional, Tuple
+DATA_DIR = Path(__file__).resolve().parent.parent / "data"
+SCHEMA_PATH = DATA_DIR / "schema.sql"
+SEED_PATH = DATA_DIR / "seed.sql"
+@dataclass
+class QueryResult:
+    """Result of executing a SQL query."""
+    columns: List[str] = field(default_factory=list)
+    rows: List[Tuple] = field(default_factory=list)
+    error: Optional[str] = None
+    row_count: int = 0
+    @property
+    def success(self) -> bool:
+        return self.error is None
+    def to_display_string(self, max_rows: int = 20) -> str:
+        """Format result as a readable table string."""
+        if self.error:
+            return f"ERROR: {self.error}"
+        if not self.columns:
+            return "(no results)"
+        # Calculate column widths
+        col_widths = [len(str(c)) for c in self.columns]
+        display_rows = self.rows[:max_rows]
+        for row in display_rows:
+            for i, val in enumerate(row):
+                col_widths[i] = max(col_widths[i], len(str(val)))
+        # Build table
+        header = " | ".join(
+            str(c).ljust(col_widths[i]) for i, c in enumerate(self.columns)
+        )
+        separator = "-+-".join("-" * w for w in col_widths)
+        lines = [header, separator]
+        for row in display_rows:
+            line = " | ".join(
+                str(val).ljust(col_widths[i]) for i, val in enumerate(row)
+            )
+            lines.append(line)
+        if len(self.rows) > max_rows:
+            lines.append(f"... ({len(self.rows) - max_rows} more rows)")
+        lines.append(f"\n({self.row_count} row{'s' if self.row_count != 1 else ''})")
+        return "\n".join(lines)
+class Database:
+    """
+    Manages an in-memory SQLite database for one episode.
+    Each call to `initialize()` creates a fresh database with the schema
+    and seed data, ensuring deterministic state across episodes.
+    """
+    def __init__(self):
+        self._conn: Optional[sqlite3.Connection] = None
+    def initialize(self) -> None:
+        """Create a fresh in-memory database with schema and seed data."""
+        self.close()
+        self._conn = sqlite3.connect(":memory:")
+        self._conn.execute("PRAGMA foreign_keys = ON")
+        schema_sql = SCHEMA_PATH.read_text()
+        self._conn.executescript(schema_sql)
+        seed_sql = SEED_PATH.read_text()
+        self._conn.executescript(seed_sql)
+        self._conn.commit()
+    def execute_query(self, sql: str, timeout_seconds: float = 5.0) -> QueryResult:
+        """
+        Execute a SQL query and return the result.
+        Intended for SELECT statements only. Statements that start with INSERT,
+        UPDATE, DELETE, DROP, ALTER, CREATE, TRUNCATE, or REPLACE are rejected.
+        Args:
+            sql: The SQL query string to execute.
+            timeout_seconds: Intended maximum execution time (currently unused).
+        Returns:
+            QueryResult with columns, rows, and potential error.
+        """
+        if self._conn is None:
+            return QueryResult(error="Database not initialized. Call reset() first.")
+        # Strip and normalize
+        stripped = sql.strip().rstrip(";").strip()
+        if not stripped:
+            return QueryResult(error="Empty query.")
+        # Block modification statements
+        first_word = stripped.split()[0].upper()
+        blocked = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE", "TRUNCATE", "REPLACE"}
+        if first_word in blocked:
+            return QueryResult(
+                error=f"Only SELECT queries are allowed. Got: {first_word}"
+            )
+        try:
+            cursor = self._conn.execute(stripped)
+            if cursor.description is None:
+                return QueryResult(error="Query did not return results.")
+            columns = [desc[0] for desc in cursor.description]
+            rows = cursor.fetchall()
+            return QueryResult(
+                columns=columns,
+                rows=rows,
+                row_count=len(rows),
+            )
+        except sqlite3.Error as e:
+            return QueryResult(error=str(e))
+    def get_schema_description(self) -> str:
+        """
+        Return a human-readable description of the database schema
+        including table structures and sample data.
+        """
+        schema_text = []
+        schema_text.append("=== DATABASE SCHEMA ===\n")
+        tables = [
+            ("customers", "Customer information"),
+            ("products", "Product catalog"),
+            ("orders", "Customer orders"),
+            ("order_items", "Items within each order"),
+            ("reviews", "Product reviews by customers"),
+        ]
+        if self._conn is None:
+            return "Database not initialized."
+        for table_name, description in tables:
+            schema_text.append(f"TABLE: {table_name} -- {description}")
+            # Get column info
+            cursor = self._conn.execute(f"PRAGMA table_info({table_name})")
+            columns = cursor.fetchall()
+            for col in columns:
+                # col: (cid, name, type, notnull, default_value, pk)
+                col_name = col[1]
+                col_type = col[2]
+                is_pk = " PRIMARY KEY" if col[5] else ""
+                is_nn = " NOT NULL" if col[3] else ""
+                schema_text.append(f"  {col_name} {col_type}{is_pk}{is_nn}")
+            # Get foreign keys
+            cursor = self._conn.execute(f"PRAGMA foreign_key_list({table_name})")
+            fks = cursor.fetchall()
+            for fk in fks:
+                schema_text.append(f"  FOREIGN KEY ({fk[3]}) REFERENCES {fk[2]}({fk[4]})")
+            # Show sample data (first 3 rows)
+            result = self.execute_query(f"SELECT * FROM {table_name} LIMIT 3")
+            if result.success and result.rows:
+                schema_text.append(f"  Sample data ({result.row_count} rows shown):")
+                for row in result.rows:
+                    schema_text.append(f"    {row}")
+            # Show total count
+            count_result = self.execute_query(
+                f"SELECT COUNT(*) FROM {table_name}"
+            )
+            if count_result.success and count_result.rows:
+                total = count_result.rows[0][0]
+                schema_text.append(f"  Total rows: {total}")
+            schema_text.append("")
+        # Add relationship summary
+        schema_text.append("=== RELATIONSHIPS ===")
+        schema_text.append("orders.customer_id -> customers.id")
+        schema_text.append("order_items.order_id -> orders.id")
+        schema_text.append("order_items.product_id -> products.id")
+        schema_text.append("reviews.product_id -> products.id")
+        schema_text.append("reviews.customer_id -> customers.id")
+        schema_text.append("")
+        schema_text.append("=== NOTES ===")
+        schema_text.append("- All dates are in ISO format (YYYY-MM-DD)")
+        schema_text.append("- Prices are in INR (Indian Rupees)")
+        schema_text.append("- Order status: pending, shipped, delivered, cancelled")
+        schema_text.append("- Product categories: Electronics, Clothing, Books, Home")
+        schema_text.append("- Ratings are integers from 1 to 5")
+        return "\n".join(schema_text)
+    def close(self) -> None:
+        """Close the database connection."""
+        if self._conn is not None:
+            self._conn.close()
+            self._conn = None
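`Database.execute_query` rejects writes by inspecting the first keyword, which does not cover a statement such as `WITH ... DELETE`. One possible complement, sketched here as an assumption rather than as what this commit does, is to switch the connection to SQLite's `query_only` mode once the seed data is loaded. The helper names `open_read_only_memory_db` and `try_query` are hypothetical:

```python
import sqlite3
from typing import List, Optional, Tuple


def open_read_only_memory_db(setup_sql: str) -> sqlite3.Connection:
    """Create an in-memory database, load it, then forbid further writes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    conn.commit()
    # From here on, SQLite itself refuses statements that would modify data.
    conn.execute("PRAGMA query_only = ON")
    return conn


def try_query(conn: sqlite3.Connection, sql: str) -> Tuple[List[tuple], Optional[str]]:
    """Run a statement and return (rows, error message or None)."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return [], str(exc)


conn = open_read_only_memory_db(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);"
    "INSERT INTO customers (name) VALUES ('Aarav Sharma'), ('Priya Patel');"
)

rows, err = try_query(conn, "SELECT COUNT(*) FROM customers")
# This statement starts with WITH, so a first-keyword check would let it through.
_, write_err = try_query(conn, "WITH x AS (SELECT 1) DELETE FROM customers")
rows_after, _ = try_query(conn, "SELECT COUNT(*) FROM customers")

print(rows, err)
print("write rejected:", write_err is not None)
print(rows_after)
```

The keyword check still gives the agent a clearer error message, so the two mechanisms can coexist.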

server/graders.py ADDED Viewed

	@@ -0,0 +1,214 @@

+"""
+Multi-component grading system for SQL query evaluation.
+Scores agent queries against ground truth with partial credit:
+  - syntax_score  (0.1): Query parses and executes without error
+  - column_score  (0.2): Fraction of expected columns present
+  - row_score     (0.3): Fraction of expected rows matching
+  - exact_score   (0.4): Full result set matches ground truth exactly
+Total reward per question is in [0.0, 1.0].
+"""
+from typing import Any, List, Optional, Tuple
+from .database import Database, QueryResult
+def _normalize_value(val: Any) -> Any:
+    """Normalize a value for comparison (handle float/int equivalence, None)."""
+    if val is None:
+        return None
+    if isinstance(val, float):
+        if val.is_integer():  # unlike int(val), safe for inf and NaN
+            return int(val)
+        return round(val, 2)
+    if isinstance(val, str):
+        return val.strip()
+    return val
+def _normalize_row(row: Tuple) -> Tuple:
+    """Normalize all values in a row."""
+    return tuple(_normalize_value(v) for v in row)
+def _normalize_column_name(col: str) -> str:
+    """Normalize column name for comparison (lowercase, strip)."""
+    return col.strip().lower()
+def grade_query(
+    db: Database,
+    agent_sql: str,
+    expected_columns: List[str],
+    expected_rows: List[List],
+    order_matters: bool = True,
+) -> dict:
+    """
+    Grade an agent's SQL query against expected results.
+    Args:
+        db: Active Database instance.
+        agent_sql: The SQL query submitted by the agent.
+        expected_columns: List of expected column names.
+        expected_rows: List of expected row values (list of lists).
+        order_matters: Whether row order affects scoring.
+    Returns:
+        Dictionary with:
+          - reward: float in [0.0, 1.0]
+          - syntax_score: float
+          - column_score: float
+          - row_score: float
+          - exact_score: float
+          - query_result: QueryResult object
+          - feedback: str describing what was right/wrong
+    """
+    result = db.execute_query(agent_sql)
+    # Component weights
+    W_SYNTAX = 0.1
+    W_COLUMN = 0.2
+    W_ROW = 0.3
+    W_EXACT = 0.4
+    # --- Syntax Score ---
+    if not result.success:
+        return {
+            "reward": 0.0,
+            "syntax_score": 0.0,
+            "column_score": 0.0,
+            "row_score": 0.0,
+            "exact_score": 0.0,
+            "query_result": result,
+            "feedback": f"SQL error: {result.error}",
+        }
+    syntax_score = 1.0
+    # --- Column Score ---
+    expected_cols_normalized = [_normalize_column_name(c) for c in expected_columns]
+    actual_cols_normalized = [_normalize_column_name(c) for c in result.columns]
+    if not expected_cols_normalized:
+        column_score = 1.0 if not actual_cols_normalized else 0.0
+    else:
+        matched_cols = sum(
+            1 for c in expected_cols_normalized if c in actual_cols_normalized
+        )
+        column_score = matched_cols / len(expected_cols_normalized)
+    # --- Row Score ---
+    expected_rows_normalized = [
+        _normalize_row(tuple(row)) for row in expected_rows
+    ]
+    actual_rows_normalized = [_normalize_row(row) for row in result.rows]
+    if not expected_rows_normalized:
+        row_score = 1.0 if not actual_rows_normalized else 0.0
+    else:
+        if order_matters:
+            # For ordered results, match position-by-position
+            matched_rows = 0
+            for i, expected_row in enumerate(expected_rows_normalized):
+                if i < len(actual_rows_normalized):
+                    if _rows_match(expected_row, actual_rows_normalized[i], expected_cols_normalized, actual_cols_normalized):
+                        matched_rows += 1
+            row_score = matched_rows / len(expected_rows_normalized)
+        else:
+            # For unordered results, check set membership
+            matched_rows = 0
+            remaining_actual = list(actual_rows_normalized)
+            for expected_row in expected_rows_normalized:
+                for j, actual_row in enumerate(remaining_actual):
+                    if _rows_match(expected_row, actual_row, expected_cols_normalized, actual_cols_normalized):
+                        matched_rows += 1
+                        remaining_actual.pop(j)
+                        break
+            row_score = matched_rows / len(expected_rows_normalized)
+    # --- Exact Score ---
+    exact_score = 0.0
+    if column_score == 1.0 and row_score == 1.0:
+        # Check exact match: same number of rows and all matched
+        if len(actual_rows_normalized) == len(expected_rows_normalized):
+            exact_score = 1.0
+        else:
+            # Extra rows returned — partial exact credit
+            exact_score = 0.5
+    # --- Total Reward ---
+    reward = (
+        W_SYNTAX * syntax_score
+        + W_COLUMN * column_score
+        + W_ROW * row_score
+        + W_EXACT * exact_score
+    )
+    reward = round(min(max(reward, 0.0), 1.0), 4)
+    # --- Feedback ---
+    feedback_parts = []
+    if syntax_score == 1.0:
+        feedback_parts.append("Query executed successfully.")
+    if column_score < 1.0:
+        missing = [c for c in expected_cols_normalized if c not in actual_cols_normalized]
+        feedback_parts.append(f"Missing columns: {missing}. Expected: {expected_cols_normalized}, Got: {actual_cols_normalized}")
+    if row_score < 1.0:
+        feedback_parts.append(
+            f"Row match: {row_score:.0%} ({round(row_score * len(expected_rows_normalized))}/{len(expected_rows_normalized)} rows correct). "
+            f"Expected {len(expected_rows_normalized)} rows, got {len(actual_rows_normalized)}."
+        )
+    if exact_score == 1.0:
+        feedback_parts.append("Perfect match!")
+    elif exact_score == 0.5:
+        feedback_parts.append(f"All expected rows found but got {len(actual_rows_normalized)} rows instead of {len(expected_rows_normalized)} (extra rows).")
+    return {
+        "reward": reward,
+        "syntax_score": syntax_score,
+        "column_score": column_score,
+        "row_score": row_score,
+        "exact_score": exact_score,
+        "query_result": result,
+        "feedback": " ".join(feedback_parts),
+    }
+def _rows_match(
+    expected_row: Tuple,
+    actual_row: Tuple,
+    expected_cols: List[str],
+    actual_cols: List[str],
+) -> bool:
+    """
+    Check if an actual row matches an expected row.
+    Handles column reordering: maps expected columns to actual column positions.
+    """
+    if len(expected_cols) != len(expected_row):
+        return False
+    # Build a mapping from expected column index to actual column index
+    col_map = {}
+    for i, ec in enumerate(expected_cols):
+        if ec in actual_cols:
+            col_map[i] = actual_cols.index(ec)
+        else:
+            return False  # Missing column
+    for exp_idx, act_idx in col_map.items():
+        if act_idx >= len(actual_row):
+            return False
+        exp_val = expected_row[exp_idx]
+        act_val = _normalize_value(actual_row[act_idx])
+        if exp_val != act_val:
+            # Try numeric comparison with tolerance
+            try:
+                if abs(float(exp_val) - float(act_val)) < 0.01:
+                    continue
+            except (TypeError, ValueError):
+                pass
+            return False
+    return True
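
As a rough illustration of how the component weights in grade_query combine, the sketch below reproduces the weighted sum and clamping for a few score combinations. The helper name combine_scores is hypothetical and not part of server/graders.py; only the four weights and the clamp-and-round step are taken from the code above.

```python
# Illustrative sketch only: mirrors the weights used in grade_query.
W_SYNTAX, W_COLUMN, W_ROW, W_EXACT = 0.1, 0.2, 0.3, 0.4


def combine_scores(syntax: float, column: float, row: float, exact: float) -> float:
    """Hypothetical helper: weighted sum, clamped to [0, 1], rounded to 4 places."""
    reward = (
        W_SYNTAX * syntax
        + W_COLUMN * column
        + W_ROW * row
        + W_EXACT * exact
    )
    return round(min(max(reward, 0.0), 1.0), 4)


# Every component is perfect.
perfect = combine_scores(1.0, 1.0, 1.0, 1.0)
# All expected rows found, but extra rows returned (exact_score = 0.5).
extra_rows = combine_scores(1.0, 1.0, 1.0, 0.5)
# Right columns, half of the expected rows, no exact credit.
half_rows = combine_scores(1.0, 1.0, 0.5, 0.0)

print(perfect, extra_rows, half_rows)
```

Note that a query which fails to execute never reaches this sum: grade_query returns a reward of 0.0 for it directly, so the 0.1 syntax weight is effectively a floor for any query that runs.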

server/requirements.txt ADDED

	@@ -0,0 +1,4 @@

+openenv-core[core]>=0.2.0
+fastapi>=0.115.0
+uvicorn>=0.24.0
+openai>=1.0.0

server/sql_env_environment.py ADDED

	@@ -0,0 +1,217 @@

+"""
+SQL Query Writing Environment.
+An AI agent receives a database schema and natural language question,
+then writes SQL queries to answer the question. The environment grades
+each query with partial-credit scoring and provides feedback.
+"""
+import json
+import os
+from pathlib import Path
+from uuid import uuid4
+from openenv.core.env_server.interfaces import Environment
+from openenv.core.env_server.types import State
+try:
+    from ..models import SQLAction, SQLObservation
+except ImportError:
+    from models import SQLAction, SQLObservation
+from .database import Database
+from .graders import grade_query
+TASKS_DIR = Path(__file__).resolve().parent.parent / "data" / "tasks"
+# Default task can be overridden via environment variable
+DEFAULT_TASK = os.getenv("SQL_ENV_TASK", "basic_select")
+MAX_TOTAL_STEPS = int(os.getenv("SQL_ENV_MAX_STEPS", "15"))
+STEP_PENALTY = float(os.getenv("SQL_ENV_STEP_PENALTY", "0.02"))
+def _load_task(task_name: str) -> dict:
+    """Load a task definition from JSON file."""
+    task_path = TASKS_DIR / f"{task_name}.json"
+    if not task_path.exists():
+        available = [f.stem for f in TASKS_DIR.glob("*.json")]
+        raise ValueError(
+            f"Task '{task_name}' not found. Available: {available}"
+        )
+    with open(task_path, encoding="utf-8") as f:
+        return json.load(f)
+class SQLEnvironment(Environment):
+    """
+    SQL Query Writing Environment.
+    The agent interacts with an e-commerce SQLite database by submitting
+    SQL queries to answer natural language questions. Each query is graded
+    with a multi-component reward function providing partial credit.
+    Episode flow:
+      1. reset() → loads task, initializes DB, returns first question
+      2. step(SQLAction) → executes query, grades it, returns observation
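+      3. A question advances on a perfect match or once its per-question
+         attempts are used up
+      4. Episode ends when all questions are answered or the total step
+         limit (SQL_ENV_MAX_STEPS) is reached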
+    """
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    def __init__(self):
+        self._db = Database()
+        self._state = State(episode_id=str(uuid4()), step_count=0)
+        self._task: dict = {}
+        self._questions: list = []
+        self._current_q_index: int = 0
+        self._q_steps_used: int = 0
+        self._max_steps_per_q: int = 3
+        self._total_steps: int = 0
+        self._rewards: list = []
+        self._schema_cache: str = ""
+        self._done: bool = False
+        self._last_feedback: str = ""
+    def reset(self) -> SQLObservation:
+        """
+        Reset the environment: initialize DB, load task, return first question.
+        """
+        self._db.initialize()
+        self._state = State(episode_id=str(uuid4()), step_count=0)
+        task_name = os.getenv("SQL_ENV_TASK", DEFAULT_TASK)
+        self._task = _load_task(task_name)
+        self._questions = self._task["questions"]
+        self._max_steps_per_q = self._task.get("max_steps_per_question", 3)
+        self._current_q_index = 0
+        self._q_steps_used = 0
+        self._total_steps = 0
+        self._rewards = []
+        self._done = False
+        self._last_feedback = ""
+        self._schema_cache = self._db.get_schema_description()
+        return self._make_observation(
+            reward=0.0,
+            query_result="",
+            error="",
+        )
+    def step(self, action: SQLAction) -> SQLObservation:  # type: ignore[override]
+        """
+        Execute the agent's SQL query, grade it, and return observation.
+        """
+        # Auto-reset if step called before reset (HTTP stateless mode)
+        if not self._questions:
+            self.reset()
+        if self._done or self._current_q_index >= len(self._questions):
+            self._done = True
+            return self._make_observation(
+                reward=0.0,
+                query_result="Episode is over. Call reset() to start a new episode.",
+                error="",
+            )
+        self._state.step_count += 1
+        self._total_steps += 1
+        self._q_steps_used += 1
+        # Get current question
+        question = self._questions[self._current_q_index]
+        # Grade the query
+        grade_result = grade_query(
+            db=self._db,
+            agent_sql=action.query,
+            expected_columns=question["expected_columns"],
+            expected_rows=question["expected_rows"],
+            order_matters=question.get("order_matters", True),
+        )
+        raw_reward = grade_result["reward"]
+        # Deduct STEP_PENALTY for each retry on this question (first attempt is free)
+        penalty = STEP_PENALTY * (self._q_steps_used - 1)
+        reward = max(raw_reward - penalty, 0.0)
+        reward = round(reward, 4)
+        self._rewards.append(reward)
+        self._last_feedback = grade_result["feedback"]
+        # Format query result for observation
+        query_result_str = grade_result["query_result"].to_display_string()
+        error_str = grade_result["query_result"].error or ""
+        # Check if we should move to next question
+        perfect = grade_result["exact_score"] == 1.0
+        out_of_attempts = self._q_steps_used >= self._max_steps_per_q
+        move_on = perfect or out_of_attempts
+        if move_on:
+            self._current_q_index += 1
+            self._q_steps_used = 0
+        # Check if episode is done
+        if self._current_q_index >= len(self._questions):
+            self._done = True
+        if self._total_steps >= MAX_TOTAL_STEPS:
+            self._done = True
+        return self._make_observation(
+            reward=reward,
+            query_result=query_result_str,
+            error=error_str,
+        )
+    @property
+    def state(self) -> State:
+        return self._state
+    def _make_observation(
+        self,
+        reward: float,
+        query_result: str,
+        error: str,
+    ) -> SQLObservation:
+        """Build an SQLObservation for the current state."""
+        if self._done or not self._questions or self._current_q_index >= len(self._questions):
+            # Episode finished or not started
+            return SQLObservation(
+                task_name=self._task.get("task_name", ""),
+                question="Episode complete.",
+                schema_description="",
+                query_result=query_result,
+                error=error,
+                steps_remaining=0,
+                question_index=len(self._questions),
+                total_questions=len(self._questions),
+                done=True,
+                reward=reward,
+                metadata={
+                    "feedback": self._last_feedback,
+                    "total_reward": round(sum(self._rewards), 4),
+                    "rewards": [round(r, 4) for r in self._rewards],
+                },
+            )
+        question = self._questions[self._current_q_index]
+        steps_remaining = self._max_steps_per_q - self._q_steps_used
+        return SQLObservation(
+            task_name=self._task.get("task_name", ""),
+            question=question["question"],
+            schema_description=self._schema_cache,
+            query_result=query_result,
+            error=error,
+            steps_remaining=steps_remaining,
+            question_index=self._current_q_index + 1,
+            total_questions=len(self._questions),
+            done=False,
+            reward=reward,
+            metadata={
+                "feedback": self._last_feedback,
+                "question_id": question["id"],
+                "difficulty": self._task.get("difficulty", ""),
+            },
+        )
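
The retry penalty and the question-advance rule in step() can be sketched in isolation as follows. This is a minimal sketch, not the environment itself: penalized_reward and should_advance are hypothetical names, the 0.02 penalty is the default of SQL_ENV_STEP_PENALTY, and the limit of 3 attempts is the fallback used when a task does not set max_steps_per_question.

```python
STEP_PENALTY = 0.02   # default of SQL_ENV_STEP_PENALTY in the source
MAX_ATTEMPTS = 3      # fallback for max_steps_per_question in the source


def penalized_reward(raw_reward: float, attempt: int,
                     step_penalty: float = STEP_PENALTY) -> float:
    """Hypothetical helper; attempt is 1-based within the current question."""
    penalty = step_penalty * (attempt - 1)   # the first attempt costs nothing
    return round(max(raw_reward - penalty, 0.0), 4)


def should_advance(exact_score: float, attempts_used: int,
                   max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Hypothetical helper: move on after a perfect answer or the last attempt."""
    return exact_score == 1.0 or attempts_used >= max_attempts


first_try = penalized_reward(1.0, attempt=1)    # no penalty applied
third_try = penalized_reward(1.0, attempt=3)    # two retries deducted
floored = penalized_reward(0.01, attempt=3)     # never drops below zero

print(first_try, third_try, floored)
```

Under these defaults a perfect answer on the third attempt still earns most of the credit, while a partial answer with extra rows (exact_score of 0.5) keeps the agent on the same question until its attempts run out.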