Metadata-Version: 2.4
Name: sql-data-engineer-env
Version: 0.1.0
Summary: A real-world SQL data engineering environment for agent evaluation.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: openenv-core
Requires-Dist: fastapi
Requires-Dist: uvicorn
Requires-Dist: pydantic
Requires-Dist: openai
Requires-Dist: pandas

---
title: SQL Data Engineer Environment
emoji: 🗄️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
---

# OpenEnv: SQL Data Engineer Environment

Welcome to the **SQL Data Engineer Environment**, a robust, fully compliant baseline environment built for the Meta OpenEnv Hackathon. This project tests an AI agent's ability to interact natively with a live SQL engine to perform pragmatic data extraction, cleansing, and complex schema normalization tasks.

## Environment Description & Motivation

### Why SQL Data Engineering?

The industry needs reliable agents that can act as backend developers, database administrators, and data engineers. While many environments focus on web browsing or gaming, manipulating relational databases is a high-value, real-world task. This environment simulates authentic obstacles developers face:

- Analyzing undocumented database schemas.
- Cleansing noisy string data into strict scalar types.
- Restructuring and normalizing flat tables into relational architectures while rigorously preserving foreign-key constraints.

It is an excellent benchmark for gauging an LLM's structured reasoning and precise SQL generation capabilities.

---

## Space Definitions

The environment adheres strictly to the OpenEnv Pydantic specification, enabling seamless API integration.

### Observation Space

The observation space is tailored to provide dense context while remaining token-efficient:

- `goal` (string): The explicit task prompt dictating what the agent must achieve.
- `schema_dump` (string | null): The current DDL for all tables and views in the database.
Sent only when the schema has changed since the previous step; `null` otherwise.
- `result` (string): The standard output of the previously executed query (capped at 10 rows for SELECTs) or a `rowcount` confirmation for INSERTs/UPDATEs.
- `last_action_error` (boolean): Flag indicating whether the previous SQL action raised a syntax or engine error.
- `step` (integer): The current episode step count.

### Action Space

- `action_str` (string): The agent must return a JSON dictionary containing a single, syntactically correct SQLite query to be executed against the backend state.

---

## Tasks & Graders

Each episode challenges the agent with one of three tasks, each scored by a deterministic OpenEnv grader between `0.0` and `1.0`.

#### 1. Easy: Data Extraction (View Creation)

- **Goal**: Read a `customers` table, filter for rows with metrics above 1000.0, and construct a targeted `high_value_customers` SQL view.
- **Difficulty**: Easy. Tests basic SELECT syntax and DDL proficiency.
- **Grader**: `0.5` points if the correct view exists in `sqlite_master`; exact row and content matching grants the remaining `0.5`.

#### 2. Medium: Data Cleaning

- **Goal**: Coerce a messy `products` table. The agent must standardize categorical strings (e.g., converting 'ELEC' to 'ELECTRONICS') and extract numeric floats from dirty price strings (e.g., '$85.00' -> `85.0`) into a newly added float column.
- **Difficulty**: Medium. Tests string pattern matching and targeted UPDATE pipelines.
- **Grader**: Adding the column yields `0.3` points, correct categorical mapping grants up to `0.3`, and correctly extracted float prices yield `0.4`.

#### 3. Hard: Schema Normalization

- **Goal**: Normalize a completely flat `hospital_records` table into a structured three-table layout (`patients`, `doctors`, `appointments`). Data must be fully migrated and bound by primary/foreign key constraints.
- **Difficulty**: Hard.
Tests multi-step schema design reasoning and safe data-migration pipelines.
- **Grader**: A valid table signature yields `0.1` per table, correct row counts yield `0.1` each, and if a relational JOIN across the new tables perfectly rebuilds the original flat table, the final `0.3` is awarded.

### The Dense Reward Function

Scores are dense over the episode lifecycle: the grader runs on every `step(action)` call, and the reward is the score delta:

`Reward = Current_Score - Previous_Score`

*Note: A `-0.05` penalty is applied whenever `last_action_error` triggers, strongly discouraging hallucinated or malformed SQL loops.*

---

## Local Setup & Usage

To validate the OpenEnv schema, install the framework, and run the OpenAI-compatible baseline script locally:

```bash
# 1. Clone the repository and navigate inside
git clone <repository-url>
cd OpenEnv-SQL-Data-Engineer

# 2. Set up a standard Python virtual environment
python -m venv venv
source venv/bin/activate  # Or `venv\Scripts\activate` on Windows

# 3. Install core dependencies (FastAPI, Pydantic, OpenAI, OpenEnv)
pip install openenv openenv-core openai pydantic fastapi uvicorn requests

# 4. Verify OpenEnv schema compliance locally
openenv validate

# 5. Execute the baseline AI agent (make sure to set your key)
export OPENAI_API_KEY="your-api-key"
export MODEL_NAME="gpt-4o"
python inference.py
```

---

## Deployment Instructions

### Docker Container Build

The environment ships with a Hugging Face-compatible `Dockerfile` configured to launch on port 7860 as an unprivileged user.

```bash
docker build -t openenv-sql .
docker run -p 7860:7860 openenv-sql
```

### Deploying to Hugging Face Spaces

To finalize your Hackathon deployment and spin up the live inference API:

1. Create a new minimal **Docker Space** on Hugging Face.
2. Add `HF_TOKEN` to your Space repository secrets.
3.
   Push this directory to the Space via git:

   ```bash
   git remote add space https://huggingface.co/spaces/<user>/<space-name>
   git push space main
   ```

4. The environment URL will respond to ping checks and `/reset` on HF endpoints.

### Baseline Scores (Llama-3-8B-Instruct)

- **Easy Task:** 1.0 (Passed)
- **Medium Task:** 0.62 (Partial success; struggled with complex string casting)
- **Hard Task:** 0.2 (Challenging; requires stronger reasoning and longer context)
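The dense reward rule described in the grading section can be sketched in a few lines of Python. This is a minimal illustration of the `Reward = Current_Score - Previous_Score` delta with the `-0.05` error penalty; `shaped_reward` is a hypothetical helper name, not part of the environment's actual API:

```python
def shaped_reward(current_score: float, previous_score: float,
                  last_action_error: bool, penalty: float = 0.05) -> float:
    """Dense reward: score delta, minus a fixed penalty for malformed SQL."""
    reward = current_score - previous_score
    if last_action_error:
        reward -= penalty
    return reward

# Progress from 0.25 to 0.75 in grader score yields a positive delta.
print(shaped_reward(0.75, 0.25, False))  # -> 0.5
# No progress plus a SQL error yields only the penalty.
print(shaped_reward(0.5, 0.5, True))     # -> -0.05
```

Because the reward is a delta, an agent that makes no grader progress while emitting broken SQL accumulates a steadily negative return, which is what discourages malformed-query loops.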
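For intuition on the Medium (data cleaning) task, here is a minimal sketch against a toy in-memory SQLite database. The column names (`category`, `price`, `price_clean`) and sample rows are illustrative assumptions; in the real environment the actual schema arrives via `schema_dump`:

```python
import sqlite3

# Toy stand-in for the environment's messy `products` table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price TEXT)")
cur.executemany(
    "INSERT INTO products (category, price) VALUES (?, ?)",
    [("ELEC", "$85.00"), ("ELECTRONICS", "120.5"), ("ELEC", " $9.99 ")],
)

# 1. Add the new float column.
cur.execute("ALTER TABLE products ADD COLUMN price_clean REAL")

# 2. Standardize categorical strings.
cur.execute("UPDATE products SET category = 'ELECTRONICS' WHERE category = 'ELEC'")

# 3. Extract numeric floats from dirty price strings.
#    CAST stops at the first non-numeric character, so strip '$' and spaces first.
cur.execute("UPDATE products SET price_clean = CAST(REPLACE(TRIM(price), '$', '') AS REAL)")
conn.commit()

rows = cur.execute("SELECT category, price_clean FROM products ORDER BY id").fetchall()
print(rows)  # -> [('ELECTRONICS', 85.0), ('ELECTRONICS', 120.5), ('ELECTRONICS', 9.99)]
```

In the environment itself, each of the three statements above would be submitted as a separate `action_str`, letting the dense grader credit the column addition, the category mapping, and the price extraction independently.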