---
title: DataClean Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: OpenEnv data-cleaning environment for RL agents
tags:
  - openenv
  - docker
  - fastapi
  - data-cleaning
---

# DataClean Environment

An OpenEnv-compliant environment for training and evaluating AI agents on real-world data-quality cleaning tasks.

Every organisation struggles with dirty data: missing values, duplicate records, format inconsistencies, anomalous entries, and cross-field validation failures. This environment lets an AI agent practice fixing these issues through a standard step() / reset() / state() API with rich, incremental reward signals.


## Motivation

Data cleaning is often estimated to consume up to 80% of a data professional's time, so automating even a fraction of this work has enormous practical value. This environment:

- Models tasks that humans actually do every day (not games or toys)
- Provides a realistic, graded benchmark for evaluating LLM-based data agents
- Rewards partial progress, not just final correctness
- Scales from simple fixes (missing emails) to subtle cross-field audits (age vs birth-date mismatches)

## Environment Overview

| Property | Value |
|---|---|
| Domain | Data-quality analysis and cleaning |
| Action space | `fix_value`, `delete_row`, `fill_missing`, `flag_anomaly`, `submit`, `noop` |
| Observation space | Text table of current data + quality report + column stats + history |
| Reward range | 0.0 – 1.0 (continuous, per-step updates) |
| Episode length | 15 / 25 / 35 steps (easy / medium / hard) |
| Tasks | 3 (easy, medium, hard) |

## Action Space

| Action | Parameters | Description |
|---|---|---|
| `fix_value` | `row_index`, `column_name`, `new_value` | Overwrite a cell with the corrected value |
| `delete_row` | `row_index` | Remove a duplicate or invalid row |
| `fill_missing` | `row_index`, `column_name`, `new_value` | Fill an empty/null cell |
| `flag_anomaly` | `row_index`, `column_name` | Mark a cell as suspicious (partial credit) |
| `submit` | (none) | End the episode and finalise scoring |
| `noop` | (none) | Do nothing this step |

Actions are JSON objects:

```json
{"action_type": "fix_value", "row_index": 2, "column_name": "phone", "new_value": "555-0103"}
```
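For scripted agents it can help to build these payloads with a small helper that rejects unknown action types before they reach the server. This is a client-side convenience sketch, not part of the package; the set of action types comes from the table above.

```python
VALID_ACTIONS = {"fix_value", "delete_row", "fill_missing",
                 "flag_anomaly", "submit", "noop"}


def make_action(action_type: str, **params) -> dict:
    """Build an action payload in the JSON shape the environment expects."""
    if action_type not in VALID_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return {"action_type": action_type, **params}
```

For example, `make_action("fix_value", row_index=2, column_name="phone", new_value="555-0103")` reproduces the payload shown above.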

## Observation Space

Each observation contains:

| Field | Type | Description |
|---|---|---|
| `task_name` | string | Task identifier (easy/medium/hard) |
| `task_description` | string | Human-readable goal |
| `difficulty` | string | `easy` / `medium` / `hard` |
| `data_preview` | string | Current dataset as an aligned text table |
| `quality_report` | string | Auto-detected quality issues (hints, not answers) |
| `columns_info` | list[dict] | Per-column stats: name, total, empty, unique |
| `action_history` | list[string] | Log of recent actions and outcomes |
| `step_number` | int | Current step (1-based) |
| `max_steps` | int | Action budget |
| `current_score` | float | Running score 0.0–1.0 |
| `available_actions` | list[string] | Valid action types |

## Tasks

### Task 1: Easy – Customer Contact Cleanup

- Dataset: 10 customer records (name, email, phone, age, city)
- Issues (5): Missing email, invalid phone format, exact duplicate row, impossible age, malformed email
- Max steps: 15
- Expected difficulty: A capable LLM should score 0.6–1.0
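Detecting these issue types is straightforward rule-based work. A minimal sketch of such checks, assuming illustrative field names and formats (the regexes and the `555-0103` phone pattern are assumptions, not the environment's ground truth):

```python
import re

# Assumed illustrative formats; the task's actual validators may differ.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE_RE = re.compile(r"^\d{3}-\d{4}$")  # e.g. 555-0103


def find_contact_issues(row: dict) -> list[str]:
    """Flag the issue types listed for Task 1 in a single customer record."""
    issues = []
    if not row.get("email"):
        issues.append("missing email")
    elif not EMAIL_RE.match(row["email"]):
        issues.append("malformed email")
    if row.get("phone") and not PHONE_RE.match(row["phone"]):
        issues.append("invalid phone format")
    try:
        age = int(row.get("age") or 0)
    except ValueError:
        age = -1
    if not 0 < age < 120:
        issues.append("impossible age")
    return issues
```

An agent that runs checks like these over every row can then emit the matching `fix_value` / `fill_missing` actions.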

### Task 2: Medium – E-commerce Order Normalisation

- Dataset: 15 sales orders (order_id, customer, product, quantity, price, date, status)
- Issues (10): Mixed date formats (YYYY-MM-DD vs DD/MM/YYYY vs dots), inconsistent product codes, negative quantity, price formatting ($1,234.56 vs 1234.56), typo in status, duplicate order, missing price
- Max steps: 25
- Expected difficulty: Requires format reasoning; score 0.3–0.7
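The date-normalisation part of this task can be sketched with stdlib `datetime` parsing. This assumes the dotted format is DD.MM.YYYY and that YYYY-MM-DD is the target canonical form; the task's actual conventions may differ.

```python
from datetime import datetime


def normalize_date(raw: str) -> str:
    """Convert a date in one of the three observed formats to YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")
```

For example, `normalize_date("03/04/2024")` and `normalize_date("03.04.2024")` both yield `"2024-04-03"`, which the agent can submit via `fix_value`.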

### Task 3: Hard – Employee Records Audit

- Dataset: 20 HR records (emp_id, name, email, birth_date, age, department, dept_code, role, salary, start_date, manager_id)
- Issues (11): Cross-field age/birth-date mismatch, department/dept_code conflict, near-duplicate employees, anomalous salary for role, future dates, placeholder "NULL" name, negative salary, impossible start date, referential integrity violations
- Max steps: 35
- Expected difficulty: Challenges frontier models; score 0.1–0.5
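The age/birth-date cross-check is the kind of reasoning this task demands. A minimal sketch, assuming ISO-formatted birth dates and a supplied reference date (the environment's actual reference date is not specified here):

```python
from datetime import date


def age_mismatch(birth_date: str, claimed_age: int, today: date) -> bool:
    """Return True if the claimed age disagrees with the ISO birth date."""
    y, m, d = map(int, birth_date.split("-"))
    born = date(y, m, d)
    # Subtract one if the birthday has not yet occurred this year.
    actual = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return actual != claimed_age
```

A record claiming age 40 with `birth_date` 1990-06-15 would be flagged, while age 33 (as of early 2024) would pass.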

## Reward Function

The reward provides signal at every step, not just at episode end:

```
score = (issues_fixed / total_issues) - wrong_fix_penalty + efficiency_bonus
```

- Partial progress: Each correctly fixed issue adds 1/total_issues to the score
- Wrong-fix penalty: Changing a correct value to something wrong costs 0.05 per occurrence
- Efficiency bonus: Finishing early adds up to 0.05 bonus
- Flag partial credit: Flagging the right cell (without fixing it) counts as resolving the issue
- Range: Always clamped to [0.0, 1.0]
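Putting the formula and the bullets together, the scoring rule can be sketched as follows. The linear shape of the efficiency bonus is an assumption; the source only states it is "up to 0.05".

```python
def compute_score(issues_fixed: int, total_issues: int,
                  wrong_fixes: int, steps_used: int, max_steps: int) -> float:
    """Sketch of score = progress - wrong_fix_penalty + efficiency_bonus, clamped to [0, 1]."""
    progress = issues_fixed / total_issues
    penalty = 0.05 * wrong_fixes            # 0.05 per wrong fix
    # Assumed linear decay: full 0.05 bonus only when submitting immediately.
    bonus = 0.05 * (1 - steps_used / max_steps)
    return max(0.0, min(1.0, progress - penalty + bonus))
```

For instance, fixing 3 of 5 issues with one wrong fix at the step limit yields 0.6 − 0.05 = 0.55 under this sketch.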

## Setup & Usage

### Prerequisites

- Python 3.10+
- Docker (for containerised deployment)

### Install

```bash
pip install -r requirements.txt
pip install -e .
```

### Run locally

```bash
# Start the server
uvicorn dataclean_env.server.app:app --host 0.0.0.0 --port 7860 --reload

# In another terminal, test the health endpoint
curl http://localhost:7860/health
# {"status": "healthy"}
```

### Docker

```bash
# Build
docker build -t dataclean-env:latest .

# Run
docker run -d -p 7860:7860 dataclean-env:latest

# Test
curl http://localhost:7860/health
```

### Run inference

```bash
# Set environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-hf-token"
export ENV_BASE_URL="http://localhost:7860"

# Run baseline agent
python inference.py
```

## Baseline Scores

Scores obtained with a standard LLM agent using the inference script:

| Task | Score | Notes |
|---|---|---|
| Easy | ~0.70 | Most obvious issues fixed |
| Medium | ~0.40 | Format reasoning challenging |
| Hard | ~0.25 | Cross-field logic very difficult |
| **Average** | **~0.45** | |

(Scores vary by model. Frontier models score higher.)


## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check → `{"status": "healthy"}` |
| `/reset` | POST | Reset with `{"task_name": "easy\|medium\|hard"}` |
| `/step` | POST | Execute action JSON |
| `/state` | GET | Current episode metadata |
| `/ws` | WebSocket | Full session (primary OpenEnv protocol) |
| `/docs` | GET | OpenAPI documentation |
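The HTTP endpoints can be exercised with nothing but the standard library. A minimal sketch, assuming the default local port from "Run locally" and the payload shapes shown in the table (the bundled `dataclean_env.client` is the supported client; this is just an illustration):

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumes a locally running server


def make_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the endpoints above."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(make_request(path, payload)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(call("/reset", {"task_name": "easy"}))
    print(call("/step", {"action_type": "noop"}))
```

For full sessions, the `/ws` WebSocket endpoint is the primary OpenEnv protocol; this HTTP sketch is only for quick manual testing.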

## Project Structure

```
├── inference.py              # Baseline inference script (OpenAI client)
├── openenv.yaml              # OpenEnv manifest
├── Dockerfile                # Container definition
├── pyproject.toml            # Package metadata
├── requirements.txt          # Dependencies
├── README.md                 # This file
└── dataclean_env/
    ├── __init__.py           # Package exports
    ├── models.py             # Action, Observation, State (Pydantic)
    ├── client.py             # Sync HTTP client
    └── server/
        ├── __init__.py
        ├── app.py            # FastAPI server (HTTP + WebSocket)
        ├── environment.py    # Core environment logic
        └── tasks.py          # Task data and ground truth
```

## License

BSD 3-Clause