Spaces:

shreyas231219
/

MetaPytorch-Hackathon-2

Sleeping

App Files Files Community

MetaPytorch-Hackathon-2 / README.md

shreyas231219

Fix directory pointers in README instructions

eed7fd0 about 2 months ago

preview code

raw

history blame contribute delete

5.84 kB

metadata

title: SQL-Agent-RL
emoji: 🐳
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
tag: openenv

SQL / Data Cleaning Sandbox

A FastAPI OpenEnv environment for evaluating AI agents on realistic SQLite data tasks. Agents interact using SQL and Python to triage, clean, and normalize messy datasets across 6 diverse tasks.

Motivation

This environment targets data engineering and debugging workflows where an agent must:

inspect database state,
correct broken or inconsistent data,
calculate complex financial or system metrics,
migrate flat schemas into normalized tables,
and do so using incremental feedback.

It is designed for benchmarks with partial progress scoring and explicit penalties for destructive actions.

Action Space

Agents submit actions as JSON objects:

tool: sql or python
command: the SQL query or Python code to execute

Example:

{
  "tool": "sql",
  "command": "SELECT COUNT(*) FROM users WHERE email IS NULL"
}

Observation Space

Each environment response includes:

output: command output text
error: raw execution error or null
current_step: current step index
max_steps: allowed step budget
task_description: active task prompt
done: whether the episode finished
reward: partial reward for the step (includes potential late-task penalties)

Tasks

The environment provides six progressively difficult tasks, indexed as task1 through task6.

task1 — Data Triage (Easy)

Description: Compute total January 2024 revenue from the sales table.
Goal: Run a SQL aggregation that returns the exact total value.
Success Criteria: Reward 1.0 if the result matches 1000.00.

task2 — Data Cleaning (Medium)

Description: Clean the users table:
- Lowercase all emails.
- Remove duplicate emails (retain lowest id).
- Replace NULL ages with 0.
Reward Breakdown: 0.3 for Lowercase, 0.4 for No Duplicates, 0.3 for No NULLs.

task3 — Schema Migration (Hard)

Description: Normalize flat_orders into separate customers and orders tables.
Reward Breakdown:
- 0.2 for correct customers schema.
- 0.2 for correct orders schema.
- 0.6 for accurate data migration and referential integrity.

task4 — Incident Response (Advanced)

Description: Identify an IP address spamming 403 errors:
- Create a blocked_ips table.
- Move the offending IP into the blocklist.
- Prune the offending records from the master server_logs.
Reward Breakdown: 0.2 for table creation, 0.3 for correct IP identification, 0.5 for successful log pruning.
Penalty: Deductions occur if legitimate traffic logs are accidentally deleted.

task5 — Data Imputation & Revenue View (Advanced)

Description: Standardize corrupted date strings and calculate Life Time Value:
- Find and replace "NULL", "N/A", or empty strings in end_date_str with "2024-12-31".
- Create a view user_ltv calculating revenue using julianday() arithmetic.
Reward Breakdown: 0.3 for data cleaning, 0.3 for view creation, 0.4 for calculation accuracy.

task6 — JSON Analysis & Ranking (Expert)

Description: Extract nested JSON data and rank performance:
- Add a total_comp column to employees.
- Extract bonus_pct from a nested JSON string to compute total compensation.
- Create a view department_all_stars showing the top earner in each department with performance rating "A".
Reward Breakdown: 0.2 for schema mutation, 0.3 for JSON extraction accuracy, 0.5 for correct ranking logic.

Reward Mechanism

Each step is scored by the task-specific grader in server/environment.py.

The grader inspects the current database state and latest output.
Reward is clamped to the range 0.01 to 0.99.
Episodes end when the step count reaches max_steps or reward reaches 0.99.
Errors subtract 0.05 from the step reward.
Destructive or incorrect data modifications in advanced tasks result in score penalties.

Baseline Scores

Recent reference runs using robust capable LLMs (e.g., llama-3.3-70b-versatile via Groq) indicate the environment is reliably solvable but effectively differentiates between model reasoning capabilities on the later multi-step tasks.

Model	Task 1 (Easy)	Task 2 (Medium)	Task 3 (Hard)	Task 4 (Advanced)	Task 5 (Advanced)	Task 6 (Expert)
Llama-3.3-70B	~1.00	~1.00	~1.00	~0.99	~0.90	~0.99
Llama-3.1-8B	~0.99	~0.60	~0.40	~0.30	~0.10	~0.00

Note: Scores represent typical final-step partial-progress rewards. Simpler models often struggle to complete Schema Migration (Task 3) or JSON extraction windowing (Task 6), while advanced models can typically achieve near-perfect rewards within 3 to 6 execution steps per task.

Local Setup

Install Python dependencies

cd MetaPytorch-Hackathon-3
pip install -r server/requirements.txt
pip install -e .

Run the sandbox server locally

python -m uvicorn server.app:app --host 0.0.0.0 --port 7860

Run inference and evaluation

Ensure API_BASE_URL, MODEL_NAME, and HF_TOKEN (or OPENAI_API_KEY) are set.

cd MetaPytorch-Hackathon-3
python inference.py

Docker Setup

cd MetaPytorch-Hackathon-3

docker build -t sql-sandbox .
docker run -p 7860:7860 sql-sandbox

Project structure

client.py — OpenEnv client wrapper
models.py — action and observation models
openenv.yaml — environment manifest
inference.py — OpenAI baseline runner
inference_groq.py — Groq baseline runner
server/app.py — FastAPI app entrypoint
server/environment.py — task logic, grading, and reward mechanics

License

BSD-3-Clause