---
title: DataClean Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
short_description: OpenEnv data-cleaning environment for RL agents
tags:
  - openenv
  - docker
  - fastapi
  - data-cleaning
---
# DataClean Environment
An OpenEnv-compliant environment for training and evaluating AI agents on real-world data-quality cleaning tasks.
Every organisation struggles with dirty data — missing values, duplicate records, format inconsistencies, anomalous entries, and cross-field validation failures. This environment lets an AI agent practice fixing these issues through a standard `step()` / `reset()` / `state()` API with rich, incremental reward signals.
## Motivation
Data cleaning consumes up to 80% of a data professional's time. Automating even a fraction of this work has enormous practical value. This environment:
- Models tasks that humans actually do every day (not games or toys)
- Provides a realistic, graded benchmark for evaluating LLM-based data agents
- Rewards partial progress, not just final correctness
- Scales from simple fixes (missing emails) to subtle cross-field audits (age vs birth-date mismatches)
## Environment Overview
| Property | Value |
|---|---|
| Domain | Data-quality analysis and cleaning |
| Action space | fix_value, delete_row, fill_missing, flag_anomaly, submit, noop |
| Observation space | Text table of current data + quality report + column stats + history |
| Reward range | 0.0 – 1.0 (continuous, per-step updates) |
| Episode length | 15 / 25 / 35 steps (easy / medium / hard) |
| Tasks | 3 (easy, medium, hard) |
## Action Space
| Action | Parameters | Description |
|---|---|---|
| `fix_value` | `row_index`, `column_name`, `new_value` | Overwrite a cell with the corrected value |
| `delete_row` | `row_index` | Remove a duplicate or invalid row |
| `fill_missing` | `row_index`, `column_name`, `new_value` | Fill an empty/null cell |
| `flag_anomaly` | `row_index`, `column_name` | Mark a cell as suspicious (partial credit) |
| `submit` | — | End the episode and finalise scoring |
| `noop` | — | Do nothing this step |
Actions are JSON objects:

```json
{"action_type": "fix_value", "row_index": 2, "column_name": "phone", "new_value": "555-0103"}
```
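From Python, an agent can assemble such a payload with a small helper. This is an illustrative sketch (the `make_action` helper is hypothetical, not part of the package); the field names follow the action table above:

```python
import json

def make_action(action_type, **params):
    """Build an action payload matching the JSON schema above."""
    return {"action_type": action_type, **params}

# Construct the same fix_value action as the JSON example
action = make_action("fix_value", row_index=2, column_name="phone", new_value="555-0103")
payload = json.dumps(action)  # string sent as the body of a /step request
```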
## Observation Space
Each observation contains:
| Field | Type | Description |
|---|---|---|
| `task_name` | string | Task identifier (easy/medium/hard) |
| `task_description` | string | Human-readable goal |
| `difficulty` | string | easy / medium / hard |
| `data_preview` | string | Current dataset as an aligned text table |
| `quality_report` | string | Auto-detected quality issues (hints, not answers) |
| `columns_info` | list[dict] | Per-column stats: name, total, empty, unique |
| `action_history` | list[string] | Log of recent actions and outcomes |
| `step_number` | int | Current step (1-based) |
| `max_steps` | int | Action budget |
| `current_score` | float | Running score 0.0–1.0 |
| `available_actions` | list[string] | Valid action types |
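An agent can mine `columns_info` for obvious targets before touching the data. A minimal sketch, assuming each entry is a dict with the `name`, `total`, and `empty` keys listed above (the example observation fragment is hypothetical):

```python
def columns_with_missing(columns_info):
    """Return (column_name, missing_ratio) for every column with empty cells."""
    return [
        (col["name"], col["empty"] / col["total"])
        for col in columns_info
        if col["empty"] > 0
    ]

# Hypothetical columns_info fragment for illustration:
info = [
    {"name": "email", "total": 10, "empty": 2, "unique": 8},
    {"name": "age", "total": 10, "empty": 0, "unique": 9},
]
```

Feeding such a summary back into the agent's prompt can make `fill_missing` actions more targeted than re-reading the full `data_preview` each step.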
## Tasks
### Task 1: Easy — Customer Contact Cleanup
- Dataset: 10 customer records (name, email, phone, age, city)
- Issues (5): Missing email, invalid phone format, exact duplicate row, impossible age, malformed email
- Max steps: 15
- Expected difficulty: A capable LLM should score 0.6–1.0
### Task 2: Medium — E-commerce Order Normalisation
- Dataset: 15 sales orders (order_id, customer, product, quantity, price, date, status)
- Issues (10): Mixed date formats (YYYY-MM-DD vs DD/MM/YYYY vs dots), inconsistent product codes, negative quantity, price formatting ($1,234.56 vs 1234.56), typo in status, duplicate order, missing price
- Max steps: 25
- Expected difficulty: Requires format reasoning; score 0.3–0.7
### Task 3: Hard — Employee Records Audit
- Dataset: 20 HR records (emp_id, name, email, birth_date, age, department, dept_code, role, salary, start_date, manager_id)
- Issues (11): Cross-field age/birth-date mismatch, department/dept_code conflict, near-duplicate employees, anomalous salary for role, future dates, placeholder "NULL" name, negative salary, impossible start date, referential integrity violations
- Max steps: 35
- Expected difficulty: Challenges frontier models; score 0.1–0.5
## Reward Function
The reward provides signal at every step, not just at episode end:

```
score = (issues_fixed / total_issues) - wrong_fix_penalty + efficiency_bonus
```

- Partial progress: Each correctly fixed issue adds `1/total_issues` to the score
- Wrong-fix penalty: Changing a correct value to something wrong costs 0.05 per occurrence
- Efficiency bonus: Finishing early adds up to 0.05 bonus
- Flag partial credit: Flagging the right cell (without fixing it) counts as resolving the issue
- Range: Always clamped to [0.0, 1.0]
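The rule above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the environment's actual code; the 0.05 penalty constant and the clamping come from the list above:

```python
def compute_score(issues_fixed, total_issues, wrong_fixes, efficiency_bonus=0.0):
    """Illustrative version of the reward rule described above."""
    raw = issues_fixed / total_issues - 0.05 * wrong_fixes + efficiency_bonus
    return max(0.0, min(1.0, raw))  # always clamped to [0.0, 1.0]
```

Note that because wrong fixes only subtract a flat 0.05, an agent is never punished more for attempting a fix than it would gain by getting it right.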
## Setup & Usage
### Prerequisites
- Python 3.10+
- Docker (for containerised deployment)
### Install

```bash
pip install -r requirements.txt
pip install -e .
```
### Run locally

```bash
# Start the server
uvicorn dataclean_env.server.app:app --host 0.0.0.0 --port 7860 --reload

# In another terminal, test the health endpoint
curl http://localhost:7860/health
# {"status": "healthy"}
```
### Docker

```bash
# Build
docker build -t dataclean-env:latest .

# Run
docker run -d -p 7860:7860 dataclean-env:latest

# Test
curl http://localhost:7860/health
```
### Run inference

```bash
# Set environment variables
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-hf-token"
export ENV_BASE_URL="http://localhost:7860"

# Run baseline agent
python inference.py
```
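At its core, an episode is a reset-then-step loop over the HTTP API. The sketch below is an assumption-laden outline, not the shipped `inference.py`: `send` stands in for whatever client POSTs JSON to the server, and it assumes the step response carries `done` and `reward` fields, as is conventional for OpenEnv-style environments:

```python
def run_episode(send, choose_action, task_name="easy"):
    """Drive one episode through /reset and /step.

    `send(path, payload)` is any callable that POSTs JSON and returns the
    decoded response; `choose_action(obs)` maps an observation to an action.
    """
    obs = send("/reset", {"task_name": task_name})
    while not obs.get("done"):
        obs = send("/step", choose_action(obs))
    return obs.get("reward", 0.0)
```

Injecting the transport this way also makes the loop easy to exercise against a stub before pointing it at a live container.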
## Baseline Scores
Scores obtained with a standard LLM agent using the inference script:
| Task | Score | Notes |
|---|---|---|
| Easy | ~0.70 | Most obvious issues fixed |
| Medium | ~0.40 | Format reasoning challenging |
| Hard | ~0.25 | Cross-field logic very difficult |
| Average | ~0.45 | |
(Scores vary by model. Frontier models score higher.)
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check → `{"status": "healthy"}` |
| `/reset` | POST | Reset with `{"task_name": "easy \| medium \| hard"}` |
| `/step` | POST | Execute action JSON |
| `/state` | GET | Current episode metadata |
| `/ws` | WebSocket | Full session (primary OpenEnv protocol) |
| `/docs` | GET | OpenAPI documentation |
## Project Structure

```
├── inference.py           # Baseline inference script (OpenAI client)
├── openenv.yaml           # OpenEnv manifest
├── Dockerfile             # Container definition
├── pyproject.toml         # Package metadata
├── requirements.txt       # Dependencies
├── README.md              # This file
└── dataclean_env/
    ├── __init__.py        # Package exports
    ├── models.py          # Action, Observation, State (Pydantic)
    ├── client.py          # Sync HTTP client
    └── server/
        ├── __init__.py
        ├── app.py         # FastAPI server (HTTP + WebSocket)
        ├── environment.py # Core environment logic
        └── tasks.py       # Task data and ground truth
```
## License
BSD 3-Clause