---
title: DataClean Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: OpenEnv data-cleaning environment for RL agents
tags:
  - openenv
  - docker
  - fastapi
  - data-cleaning
---

# DataClean Environment

An OpenEnv-compliant environment for training and evaluating AI agents on real-world data-quality cleaning tasks.

Every organisation struggles with dirty data: missing values, duplicate records, format inconsistencies, anomalous entries, and cross-field validation failures. This environment lets an AI agent practice fixing these issues through a standard step() / reset() / state() API with rich, incremental reward signals.


## Motivation

Data cleaning is often estimated to consume up to 80% of a data professional's time, so automating even a fraction of this work has enormous practical value. This environment:

- Models tasks that humans actually do every day (not games or toys)
- Provides a realistic, graded benchmark for evaluating LLM-based data agents
- Rewards partial progress, not just final correctness
- Scales from simple fixes (missing emails) to subtle cross-field audits (age vs birth-date mismatches)

## Environment Overview

| Property | Value |
|---|---|
| Domain | Data-quality analysis and cleaning |
| Action space | `fix_value`, `delete_row`, `fill_missing`, `flag_anomaly`, `submit`, `noop` |
| Observation space | Text table of current data + quality report + column stats + history |
| Reward range | 0.0 – 1.0 (continuous, per-step updates) |
| Episode length | 15 / 25 / 35 steps (easy / medium / hard) |
| Tasks | 3 (easy, medium, hard) |

## Action Space

| Action | Parameters | Description |
|---|---|---|
| `fix_value` | `row_index`, `column_name`, `new_value` | Overwrite a cell with the corrected value |
| `delete_row` | `row_index` | Remove a duplicate or invalid row |
| `fill_missing` | `row_index`, `column_name`, `new_value` | Fill an empty/null cell |
| `flag_anomaly` | `row_index`, `column_name` | Mark a cell as suspicious (partial credit) |
| `submit` | (none) | End the episode and finalise scoring |
| `noop` | (none) | Do nothing this step |

Actions are JSON objects:

```json
{"action_type": "fix_value", "row_index": 2, "column_name": "phone", "new_value": "555-0103"}
```
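For scripted agents it can help to build these payloads with a small helper that rejects unknown action types before they reach the server. This is a client-side convenience sketch, not part of the package; the set of action types comes from the table above.

```python
VALID_ACTIONS = {"fix_value", "delete_row", "fill_missing",
                 "flag_anomaly", "submit", "noop"}


def make_action(action_type: str, **params) -> dict:
    """Build an action payload in the JSON shape the environment expects."""
    if action_type not in VALID_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return {"action_type": action_type, **params}
```

For example, `make_action("fix_value", row_index=2, column_name="phone", new_value="555-0103")` reproduces the payload shown above.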

## Observation Space

Each observation contains:

| Field | Type | Description |
|---|---|---|
| `task_name` | string | Task identifier (easy/medium/hard) |
| `task_description` | string | Human-readable goal |
| `difficulty` | string | `easy` / `medium` / `hard` |
| `data_preview` | string | Current dataset as an aligned text table |
| `quality_report` | string | Auto-detected quality issues (hints, not answers) |
| `columns_info` | list[dict] | Per-column stats: name, total, empty, unique |
| `action_history` | list[string] | Log of recent actions and outcomes |
| `step_number` | int | Current step (1-based) |
| `max_steps` | int | Action budget |
| `current_score` | float | Running score 0.0–1.0 |
| `available_actions` | list[string] | Valid action types |

## Tasks

### Task 1: Easy – Customer Contact Cleanup

- Dataset: 10 customer records (name, email, phone, age, city)
- Issues (5): Missing email, invalid phone format, exact duplicate row, impossible age, malformed email
- Max steps: 15
- Expected difficulty: A capable LLM should score 0.6–1.0
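Detecting these issue types is straightforward rule-based work. A minimal sketch of such checks, assuming illustrative field names and formats (the regexes and the `555-0103` phone pattern are assumptions, not the environment's ground truth):

```python
import re

# Assumed illustrative formats; the task's actual validators may differ.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE_RE = re.compile(r"^\d{3}-\d{4}$")  # e.g. 555-0103


def find_contact_issues(row: dict) -> list[str]:
    """Flag the issue types listed for Task 1 in a single customer record."""
    issues = []
    if not row.get("email"):
        issues.append("missing email")
    elif not EMAIL_RE.match(row["email"]):
        issues.append("malformed email")
    if row.get("phone") and not PHONE_RE.match(row["phone"]):
        issues.append("invalid phone format")
    try:
        age = int(row.get("age") or 0)
    except ValueError:
        age = -1
    if not 0 < age < 120:
        issues.append("impossible age")
    return issues
```

An agent that runs checks like these over every row can then emit the matching `fix_value` / `fill_missing` actions.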

### Task 2: Medium – E-commerce Order Normalisation

- Dataset: 15 sales orders (order_id, customer, product, quantity, price, date, status)
- Issues (10): Mixed date formats (YYYY-MM-DD vs DD/MM/YYYY vs dots), inconsistent product codes, negative quantity, price formatting ($1,234.56 vs 1234.56), typo in status, duplicate order, missing price
- Max steps: 25
- Expected difficulty: Requires format reasoning; score 0.3–0.7
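The date-normalisation part of this task can be sketched with stdlib `datetime` parsing. This assumes the dotted format is DD.MM.YYYY and that YYYY-MM-DD is the target canonical form; the task's actual conventions may differ.

```python
from datetime import datetime


def normalize_date(raw: str) -> str:
    """Convert a date in one of the three observed formats to YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")
```

For example, `normalize_date("03/04/2024")` and `normalize_date("03.04.2024")` both yield `"2024-04-03"`, which the agent can submit via `fix_value`.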

### Task 3: Hard – Employee Records Audit

- Dataset: 20 HR records (emp_id, name, email, birth_date, age, department, dept_code, role, salary, start_date, manager_id)
- Issues (11): Cross-field age/birth-date mismatch, department/dept_code conflict, near-duplicate employees, anomalous salary for role, future dates, placeholder "NULL" name, negative salary, impossible start date, referential integrity violations
- Max steps: 35
- Expected difficulty: Challenges frontier models; score 0.1–0.5
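The age/birth-date cross-check is the kind of reasoning this task demands. A minimal sketch, assuming ISO-formatted birth dates and a supplied reference date (the environment's actual reference date is not specified here):

```python
from datetime import date


def age_mismatch(birth_date: str, claimed_age: int, today: date) -> bool:
    """Return True if the claimed age disagrees with the ISO birth date."""
    y, m, d = map(int, birth_date.split("-"))
    born = date(y, m, d)
    # Subtract one if the birthday has not yet occurred this year.
    actual = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return actual != claimed_age
```

A record claiming age 40 with `birth_date` 1990-06-15 would be flagged, while age 33 (as of early 2024) would pass.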

## Reward Function

The reward provides signal at every step, not just at episode end:

```
score = (issues_fixed / total_issues) - wrong_fix_penalty + efficiency_bonus
```

- Partial progress: Each correctly fixed issue adds 1/total_issues to the score
- Wrong-fix penalty: Changing a correct value to something wrong costs 0.05 per occurrence
- Efficiency bonus: Finishing early adds up to 0.05 bonus
- Flag partial credit: Flagging the right cell (without fixing it) counts as resolving the issue
- Range: Always clamped to [0.0, 1.0]
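Putting the formula and the bullets together, the scoring rule can be sketched as follows. The linear shape of the efficiency bonus is an assumption; the source only states it is "up to 0.05".

```python
def compute_score(issues_fixed: int, total_issues: int,
                  wrong_fixes: int, steps_used: int, max_steps: int) -> float:
    """Sketch of score = progress - wrong_fix_penalty + efficiency_bonus, clamped to [0, 1]."""
    progress = issues_fixed / total_issues
    penalty = 0.05 * wrong_fixes            # 0.05 per wrong fix
    # Assumed linear decay: full 0.05 bonus only when submitting immediately.
    bonus = 0.05 * (1 - steps_used / max_steps)
    return max(0.0, min(1.0, progress - penalty + bonus))
```

For instance, fixing 3 of 5 issues with one wrong fix at the step limit yields 0.6 − 0.05 = 0.55 under this sketch.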

## Setup & Usage

### Prerequisites

- Python 3.10+
- Docker (for containerised deployment)

### Install

```bash
pip install -r requirements.txt
pip install -e .
```

### Run locally

```bash
# Start the server
uvicorn dataclean_env.server.app:app --host 0.0.0.0 --port 7860 --reload

# In another terminal, test the health endpoint
curl http://localhost:7860/health
# {"status": "healthy"}
```

### Docker

```bash
# Build
docker build -t dataclean-env:latest .

# Run
docker run -d -p 7860:7860 dataclean-env:latest

# Test
curl http://localhost:7860/health
```

### Run inference

```bash
# Set environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-hf-token"
export ENV_BASE_URL="http://localhost:7860"

# Run baseline agent
python inference.py
```

## Baseline Scores

Scores obtained with a standard LLM agent using the inference script:

| Task | Score | Notes |
|---|---|---|
| Easy | ~0.70 | Most obvious issues fixed |
| Medium | ~0.40 | Format reasoning challenging |
| Hard | ~0.25 | Cross-field logic very difficult |
| **Average** | **~0.45** | |

(Scores vary by model. Frontier models score higher.)


## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check → `{"status": "healthy"}` |
| `/reset` | POST | Reset with `{"task_name": "easy\|medium\|hard"}` |
| `/step` | POST | Execute action JSON |
| `/state` | GET | Current episode metadata |
| `/ws` | WebSocket | Full session (primary OpenEnv protocol) |
| `/docs` | GET | OpenAPI documentation |
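The HTTP endpoints can be exercised with nothing but the standard library. A minimal sketch, assuming the default local port from "Run locally" and the payload shapes shown in the table (the bundled `dataclean_env.client` is the supported client; this is just an illustration):

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumes a locally running server


def make_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the endpoints above."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(make_request(path, payload)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(call("/reset", {"task_name": "easy"}))
    print(call("/step", {"action_type": "noop"}))
```

For full sessions, the `/ws` WebSocket endpoint is the primary OpenEnv protocol; this HTTP sketch is only for quick manual testing.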

## Project Structure

```
├── inference.py              # Baseline inference script (OpenAI client)
├── openenv.yaml              # OpenEnv manifest
├── Dockerfile                # Container definition
├── pyproject.toml            # Package metadata
├── requirements.txt          # Dependencies
├── README.md                 # This file
└── dataclean_env/
    ├── __init__.py           # Package exports
    ├── models.py             # Action, Observation, State (Pydantic)
    ├── client.py             # Sync HTTP client
    └── server/
        ├── __init__.py
        ├── app.py            # FastAPI server (HTTP + WebSocket)
        ├── environment.py    # Core environment logic
        └── tasks.py          # Task data and ground truth
```

## License

BSD 3-Clause