	---
	title: SQL-Agent-RL
	emoji: 🐳
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	pinned: false
	tags:
	  - openenv
	---
	# SQL / Data Cleaning Sandbox

	A FastAPI OpenEnv environment for evaluating AI agents on realistic SQLite data tasks.
	Agents interact using SQL and Python to triage, clean, and normalize messy datasets across 6 diverse tasks.

	## Motivation

	This environment targets data engineering and debugging workflows where an agent must:
	- inspect database state,
	- correct broken or inconsistent data,
	- calculate complex financial or system metrics,
	- migrate flat schemas into normalized tables,
	- and do so using incremental feedback.

	It is designed for benchmarks with partial progress scoring and explicit penalties for destructive actions.

	## Action Space

	Agents submit actions as JSON objects:
	- `tool`: `sql` or `python`
	- `command`: the SQL query or Python code to execute

	Example:

	```json
	{
	  "tool": "sql",
	  "command": "SELECT COUNT(*) FROM users WHERE email IS NULL"
	}
	```
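
As a minimal sketch of building such an action from Python, the snippet below serializes the two documented fields and rejects any tool other than `sql` or `python`. The `Action` class here is a hypothetical stand-in for illustration; the repository's own models live in `models.py` and may differ.

```python
import json
from dataclasses import asdict, dataclass

# The two tool names documented in the Action Space section.
ALLOWED_TOOLS = {"sql", "python"}


@dataclass
class Action:
    """Hypothetical helper mirroring the two documented action fields."""

    tool: str
    command: str

    def to_json(self) -> str:
        # Reject tools the environment does not document.
        if self.tool not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")
        return json.dumps(asdict(self))


payload = Action(
    tool="sql",
    command="SELECT COUNT(*) FROM users WHERE email IS NULL",
).to_json()
print(payload)
```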

	## Observation Space

	Each environment response includes:
	- `output`: command output text
	- `error`: raw execution error or `null`
	- `current_step`: current step index
	- `max_steps`: allowed step budget
	- `task_description`: active task prompt
	- `done`: whether the episode finished
	- `reward`: partial reward for the step (includes potential late-task penalties)
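
A response with these fields could be parsed on the agent side roughly as follows. This is a sketch only: the `Observation` dataclass and `parse_observation` helper are hypothetical names, the field types are inferred from the descriptions above, and the sample values are made up.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical mirror of the documented response fields."""

    output: str
    error: Optional[str]  # None when the command ran without error
    current_step: int
    max_steps: int
    task_description: str
    done: bool
    reward: float


def parse_observation(raw: str) -> Observation:
    # Assumes the response body is a flat JSON object with exactly these keys.
    return Observation(**json.loads(raw))


# Made-up sample response, for illustration only.
sample = json.dumps({
    "output": "3",
    "error": None,
    "current_step": 2,
    "max_steps": 10,
    "task_description": "Clean the users table.",
    "done": False,
    "reward": 0.3,
})
obs = parse_observation(sample)
steps_left = obs.max_steps - obs.current_step
print(steps_left, obs.done)
```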

	## Tasks

	The environment provides six progressively difficult tasks, indexed as `task1` through `task6`.

	### task1 — Data Triage (Easy)
	- Description: Compute total January 2024 revenue from the `sales` table.
	- Goal: Run a SQL aggregation that returns the exact total value.
	- Success Criteria: Full reward (`1.0` before the clamp described under Reward Mechanism) if the result matches `1000.00`.
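
The kind of aggregation this task asks for might look like the sketch below. The column names `sale_date` and `amount` and the sample rows are assumptions for illustration; the real `sales` schema should be inspected first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the real `sales` table may use different column names.
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (sale_date, amount) VALUES (?, ?)",
    [("2024-01-05", 400.0), ("2024-01-20", 600.0), ("2024-02-01", 250.0)],
)

# Half-open date range keeps every January 2024 row and excludes February.
total = conn.execute(
    "SELECT ROUND(SUM(amount), 2) FROM sales "
    "WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01'"
).fetchone()[0]
print(total)
```

The string comparison works here only because the assumed dates are stored in ISO `YYYY-MM-DD` form, which sorts lexicographically.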

	### task2 — Data Cleaning (Medium)
	- Description: Clean the `users` table:
	  - Lowercase all emails.
	  - Remove duplicate emails (retain lowest `id`).
	  - Replace NULL ages with `0`.
	- Reward Breakdown: `0.3` for lowercased emails, `0.4` for no duplicate emails, `0.3` for no NULL ages.
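
One possible sequence of statements for the three cleaning steps is sketched below against a made-up `users` table. The column names `email` and `age` are assumed from the task description, and lowercasing runs first so that case-variant duplicates collapse to the same value before deduplication.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema and sample rows, for illustration only.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (id, email, age) VALUES (?, ?, ?)",
    [
        (1, "Ann@Example.com", 30),
        (2, "ann@example.com", None),
        (3, "BOB@example.com", None),
    ],
)

# 1. Lowercase all emails.
conn.execute("UPDATE users SET email = LOWER(email)")
# 2. Keep only the lowest id for each email.
conn.execute(
    "DELETE FROM users WHERE id NOT IN "
    "(SELECT MIN(id) FROM users GROUP BY email)"
)
# 3. Replace NULL ages with 0.
conn.execute("UPDATE users SET age = 0 WHERE age IS NULL")

rows = conn.execute("SELECT id, email, age FROM users ORDER BY id").fetchall()
print(rows)
```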

	### task3 — Schema Migration (Hard)
	- Description: Normalize `flat_orders` into separate `customers` and `orders` tables.
	- Reward Breakdown:
	  - `0.2` for correct `customers` schema.
	  - `0.2` for correct `orders` schema.
	  - `0.6` for accurate data migration and referential integrity.
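
A normalization of this shape can be sketched as follows. Every column name in `flat_orders`, `customers`, and `orders` is an assumption here; the grader's expected schema is defined in `server/environment.py` and should take precedence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Assumed flat layout and sample rows, for illustration only.
CREATE TABLE flat_orders (
  order_id INTEGER, customer_name TEXT, customer_email TEXT,
  product TEXT, amount REAL
);
INSERT INTO flat_orders VALUES
  (1, 'Ann', 'ann@example.com', 'Widget', 10.0),
  (2, 'Ann', 'ann@example.com', 'Gadget', 25.0),
  (3, 'Bob', 'bob@example.com', 'Widget', 10.0);

-- Target tables: one row per customer, orders point back at customers.
CREATE TABLE customers (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES customers(id),
  product TEXT,
  amount REAL
);

-- Deduplicate customers, then attach each order to its customer by email.
INSERT INTO customers (name, email)
  SELECT DISTINCT customer_name, customer_email FROM flat_orders;
INSERT INTO orders (id, customer_id, product, amount)
  SELECT f.order_id, c.id, f.product, f.amount
  FROM flat_orders AS f
  JOIN customers AS c ON c.email = f.customer_email;
""")

n_customers = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
# Referential integrity check: orders whose customer_id matches no customer.
orphans = conn.execute(
    "SELECT COUNT(*) FROM orders AS o "
    "LEFT JOIN customers AS c ON c.id = o.customer_id WHERE c.id IS NULL"
).fetchone()[0]
print(n_customers, n_orders, orphans)
```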

	### task4 — Incident Response (Advanced)
	- Description: Identify an IP address spamming 403 errors:
	  - Create a `blocked_ips` table.
	  - Move the offending IP into the blocklist.
	  - Prune the offending records from the master `server_logs`.
	- Reward Breakdown: `0.2` for table creation, `0.3` for correct IP identification, `0.5` for successful log pruning.
	- Penalty: Deductions occur if legitimate traffic logs are accidentally deleted.
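
The three steps could be approached along these lines. The `server_logs` columns (`ip`, `status`), the `blocked_ips` layout, and the sample rows are assumptions, and so is the reading that "offending records" means every row from the offending IP. The `DELETE` is scoped to that single IP so that traffic from other addresses is left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema and sample rows, for illustration only.
conn.execute(
    "CREATE TABLE server_logs (id INTEGER PRIMARY KEY, ip TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO server_logs (ip, status) VALUES (?, ?)",
    [("10.0.0.9", 403)] * 5
    + [("10.0.0.1", 200), ("10.0.0.2", 403), ("10.0.0.1", 200)],
)

# Identify the IP with the most 403 responses.
offender = conn.execute(
    "SELECT ip FROM server_logs WHERE status = 403 "
    "GROUP BY ip ORDER BY COUNT(*) DESC LIMIT 1"
).fetchone()[0]

# Record it in the blocklist, then prune only that IP's rows.
conn.execute("CREATE TABLE IF NOT EXISTS blocked_ips (ip TEXT PRIMARY KEY)")
conn.execute("INSERT OR IGNORE INTO blocked_ips (ip) VALUES (?)", (offender,))
conn.execute("DELETE FROM server_logs WHERE ip = ?", (offender,))

blocked = conn.execute("SELECT ip FROM blocked_ips").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM server_logs").fetchone()[0]
print(offender, blocked, remaining)
```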

	### task5 — Data Imputation & Revenue View (Advanced)
	- Description: Standardize corrupted date strings and calculate lifetime value (LTV):
	  - Find and replace "NULL", "N/A", or empty strings in `end_date_str` with "2024-12-31".
	  - Create a view `user_ltv` calculating revenue using `julianday()` arithmetic.
	- Reward Breakdown: `0.3` for data cleaning, `0.3` for view creation, `0.4` for calculation accuracy.
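
To illustrate the `julianday()` arithmetic, the sketch below cleans the placeholder strings and then builds a `user_ltv` view. Only `end_date_str` and `user_ltv` come from the task description. The table name `subscriptions`, the columns `start_date_str` and `daily_rate`, and the formula (days active times a daily rate) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed table, columns, and sample rows, for illustration only.
conn.execute(
    "CREATE TABLE subscriptions ("
    "user_id INTEGER, start_date_str TEXT, end_date_str TEXT, daily_rate REAL)"
)
conn.executemany(
    "INSERT INTO subscriptions VALUES (?, ?, ?, ?)",
    [
        (1, "2024-12-01", "N/A", 2.0),
        (2, "2024-01-01", "2024-01-11", 1.5),
        (3, "2024-12-21", "", 1.0),
        (4, "2024-12-30", "NULL", 3.0),
    ],
)

# Replace the three documented placeholder strings with the default end date.
conn.execute(
    "UPDATE subscriptions SET end_date_str = '2024-12-31' "
    "WHERE end_date_str IN ('NULL', 'N/A', '')"
)

# julianday() returns fractional day numbers, so the difference is a day count.
conn.execute(
    "CREATE VIEW user_ltv AS "
    "SELECT user_id, "
    "SUM((julianday(end_date_str) - julianday(start_date_str)) * daily_rate) AS ltv "
    "FROM subscriptions GROUP BY user_id"
)

ltv = dict(conn.execute("SELECT user_id, ltv FROM user_ltv ORDER BY user_id").fetchall())
print(ltv)
```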

	### task6 — JSON Analysis & Ranking (Expert)
	- Description: Extract nested JSON data and rank performance:
	  - Add a `total_comp` column to `employees`.
	  - Extract `bonus_pct` from a nested JSON string to compute total compensation.
	  - Create a view `department_all_stars` showing the top earner in each department with performance rating "A".
	- Reward Breakdown: `0.2` for schema mutation, `0.3` for JSON extraction accuracy, `0.5` for correct ranking logic.
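
A rough sketch of the three steps follows. It relies on SQLite's `json_extract`, which needs the JSON functions to be available in the local SQLite build. The `employees` columns other than `total_comp`, the JSON path `$.comp.bonus_pct`, the compensation formula, and the reading of "top earner" as the highest `total_comp` among "A"-rated employees are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema, JSON layout, and sample rows, for illustration only.
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, "
    "department TEXT, salary REAL, rating TEXT, details TEXT)"
)
conn.executemany(
    "INSERT INTO employees (name, department, salary, rating, details) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Ann", "Eng", 100.0, "A", '{"comp": {"bonus_pct": 10}}'),
        ("Bob", "Eng", 120.0, "B", '{"comp": {"bonus_pct": 20}}'),
        ("Cy", "Eng", 90.0, "A", '{"comp": {"bonus_pct": 50}}'),
        ("Dee", "Ops", 80.0, "A", '{"comp": {"bonus_pct": 0}}'),
    ],
)

# Schema mutation, then fill the new column from the nested JSON value.
conn.execute("ALTER TABLE employees ADD COLUMN total_comp REAL")
conn.execute(
    "UPDATE employees SET total_comp = "
    "salary * (1 + json_extract(details, '$.comp.bonus_pct') / 100.0)"
)

# Top "A"-rated earner per department via a correlated MAX subquery.
conn.execute(
    "CREATE VIEW department_all_stars AS "
    "SELECT department, name, total_comp FROM employees AS e "
    "WHERE rating = 'A' AND total_comp = ("
    "  SELECT MAX(total_comp) FROM employees "
    "  WHERE department = e.department AND rating = 'A')"
)

stars = conn.execute(
    "SELECT department, name, total_comp FROM department_all_stars "
    "ORDER BY department"
).fetchall()
print(stars)
```

A window function such as `RANK() OVER (PARTITION BY department ...)` would express the same ranking; the correlated subquery is used here only because it also runs on older SQLite builds.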

	## Reward Mechanism

	Each step is scored by the task-specific grader in `server/environment.py`.
	- The grader inspects the current database state and latest output.
	- Reward is clamped to the range `0.01` to `0.99`.
	- Episodes end when the step count reaches `max_steps` or reward reaches `0.99`.
	- Errors subtract `0.05` from the step reward.
	- Destructive or incorrect data modifications in advanced tasks result in score penalties.
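
The clamping, error penalty, and termination rules above can be sketched as below. The function names are hypothetical, and applying the penalty before the clamp is an assumption; the actual order of operations is defined in `server/environment.py`.

```python
ERROR_PENALTY = 0.05
MIN_REWARD, MAX_REWARD = 0.01, 0.99


def shape_reward(raw_score: float, had_error: bool) -> float:
    """Apply the error penalty, then clamp (order is an assumption)."""
    score = raw_score - (ERROR_PENALTY if had_error else 0.0)
    return max(MIN_REWARD, min(MAX_REWARD, score))


def episode_done(current_step: int, max_steps: int, reward: float) -> bool:
    """An episode ends on step exhaustion or on reaching the reward ceiling."""
    return current_step >= max_steps or reward >= MAX_REWARD


print(shape_reward(1.0, had_error=False))
print(shape_reward(0.0, had_error=True))
print(episode_done(3, 10, 0.99))
```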

	## Baseline Scores

	Recent reference runs with capable LLMs (e.g., `llama-3.3-70b-versatile` via Groq) indicate that the environment is reliably solvable, while the later multi-step tasks still separate stronger models from weaker ones.

	| Model | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Task 4 (Advanced) | Task 5 (Advanced) | Task 6 (Expert) |
	|---|---|---|---|---|---|---|
	| Llama-3.3-70B | ~1.00 | ~1.00 | ~1.00 | ~0.99 | ~0.90 | ~0.99 |
	| Llama-3.1-8B | ~0.99 | ~0.60 | ~0.40 | ~0.30 | ~0.10 | ~0.00 |

	Note: Scores are typical final-step partial-progress rewards. Smaller models often fail to complete Schema Migration (Task 3) or the JSON extraction and ranking task (Task 6), while stronger models typically reach near-perfect rewards within 3 to 6 execution steps per task.

	## Local Setup

	### Install Python dependencies

	```bash
	cd MetaPytorch-Hackathon-3
	pip install -r server/requirements.txt
	pip install -e .
	```

	### Run the sandbox server locally

	```bash
	python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
	```

	## Run inference and evaluation

	Ensure `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` (or `OPENAI_API_KEY`) are set.

	```bash
	cd MetaPytorch-Hackathon-3
	python inference.py
	```

	## Docker Setup

	```bash
	cd MetaPytorch-Hackathon-3

	docker build -t sql-sandbox .
	docker run -p 7860:7860 sql-sandbox
	```

	## Project structure

	- `client.py` — OpenEnv client wrapper
	- `models.py` — action and observation models
	- `openenv.yaml` — environment manifest
	- `inference.py` — OpenAI baseline runner
	- `inference_groq.py` — Groq baseline runner
	- `server/app.py` — FastAPI app entrypoint
	- `server/environment.py` — task logic, grading, and reward mechanics

	## License

	BSD-3-Clause