---
title: OpenEnv Data Cleaning Environment
emoji: 🧼
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - fastapi
  - docker
  - openenv
  - data-cleaning
  - data-validation
---
## What this is
This repo is a Dockerized FastAPI application that serves an OpenEnv-style data cleaning environment:
- `POST /reset` to start a session on a task
- `POST /step` to take actions (fill missing, drop duplicates, standardize formats, etc.)
- `GET /health` for readiness checks
It is suitable for Hugging Face Spaces (Docker). Inference Endpoints are not ideal here because this is an interactive multi-endpoint environment, not a single model inference API.
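The three endpoints above can be exercised from Python as well as curl. A minimal stdlib-only client sketch (the helper names `step_payload` and `post_json` are illustrative, not part of the repo; it assumes the server is running on port 7860):

```python
import json
import urllib.request

BASE = "http://localhost:7860"

def reset_payload(task_id: str, session_id: str) -> dict:
    """Request body for POST /reset: start a session on a given task."""
    return {"task_id": task_id, "session_id": session_id}

def step_payload(session_id: str, action_type: str, params: dict) -> dict:
    """Request body for POST /step: apply one cleaning action."""
    return {"session_id": session_id,
            "action": {"action_type": action_type, "params": params}}

def post_json(path: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_demo() -> dict:
    """Start a session and take one step (requires the server to be up)."""
    post_json("/reset", reset_payload("easy_001", "demo"))
    return post_json("/step", step_payload(
        "demo", "fill_missing", {"column": "age", "strategy": "median"}))
```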
## Web UI (optional)

Open `/web` for a lightweight dashboard to reset/step and view the table preview.
## Real-world task
Simulates a common data engineering workflow: cleaning a dirty table so downstream analytics/ML won't break. Agents must iteratively apply safe transformations (imputation, deduplication, normalization, format standardization, range/outlier handling) and then submit.
## Tasks (3 levels, deterministic grading)
- easy_001: missing values + exact duplicates (customer table)
- medium_001: missing values + format inconsistencies + invalid ranges (employee table)
- hard_001: missing values + duplicates + mixed date/currency formats + cross-field constraints + outliers (sales table)
On submit, the grader returns a score in `[0.0, 1.0]` at `info.grade.final_score`.
## Action space

Each action is a JSON object: `{ "action_type": <enum>, "params": <dict> }`

Supported `action_type` values:
- `fill_missing(column, strategy in {mean, median, mode, forward_fill, backward_fill}, optional value)`
- `drop_duplicates(optional subset, optional keep)`
- `normalize_text(column, operations)`
- `standardize_format(column, format_type in {email, date, phone, currency, percentage})`
- `validate_range(column, optional min_value, optional max_value)`
- `detect_outliers(column, method in {iqr, zscore}, optional threshold)`
- `infer_values(column, method, optional reference_columns)`
- `flag_invalid(row_id, optional reason)`
- `revert_last_action` (no params)
- `submit` (no params)
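A tiny helper makes it easy to build actions matching this schema (the `action` function is a convenience sketch, not part of the environment's API):

```python
def action(action_type: str, **params) -> dict:
    """Build an action dict matching {"action_type": ..., "params": {...}}."""
    return {"action_type": action_type, "params": params}

# A few of the supported actions, built from the schema above:
fill = action("fill_missing", column="age", strategy="median")
dedup = action("drop_duplicates", subset=["email"], keep="first")
dates = action("standardize_format", column="hire_date", format_type="date")
done = action("submit")  # no params → empty params dict

print(fill)
# → {'action_type': 'fill_missing', 'params': {'column': 'age', 'strategy': 'median'}}
```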
## Observation space

Each `step()` returns an Observation including:
- `table_preview`: first N rows as JSON records
- `column_schema`: per-column type, null counts, unique counts, samples
- `detected_issues`: issue summaries (type/count/severity)
- `quality_metrics`: completeness/validity/consistency/uniqueness/overall (0..1)
- `issues_remaining`, `step_count`, `max_steps`, plus `last_action_result`
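An agent loop will typically condense these fields into a short status line for logging or prompting. A sketch using the field names above (the flat-dict response shape is an assumption):

```python
def summarize(obs: dict) -> str:
    """One-line summary of an Observation dict."""
    q = obs.get("quality_metrics", {})
    return (f"step {obs.get('step_count')}/{obs.get('max_steps')}: "
            f"overall quality {q.get('overall', 0.0):.2f}, "
            f"{obs.get('issues_remaining', '?')} issues remaining")

obs = {"step_count": 3, "max_steps": 50,
       "quality_metrics": {"overall": 0.72}, "issues_remaining": 4}
print(summarize(obs))  # → step 3/50: overall quality 0.72, 4 issues remaining
```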
## Reward shaping (dense signal)

The reward is a multi-component signal:
- Positive: quality improvement, issue resolution progress, schema validity
- Penalties: destructive changes, redundant actions, per-step cost
This gives partial-progress signal instead of only terminal success/failure.
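The shaping described above can be sketched as a weighted sum; the weights here are illustrative assumptions, not the environment's actual coefficients:

```python
def shaped_reward(quality_delta: float, issues_resolved: int,
                  destructive: bool, redundant: bool,
                  step_cost: float = 0.01) -> float:
    """Illustrative multi-component reward: positive terms for progress,
    penalties for destructive/redundant actions, plus a per-step cost."""
    reward = 1.0 * quality_delta       # quality improvement
    reward += 0.1 * issues_resolved    # issue resolution progress
    if destructive:
        reward -= 0.5                  # penalty: destructive change
    if redundant:
        reward -= 0.1                  # penalty: redundant action
    return reward - step_cost          # per-step cost

# A step that raises quality by 0.05 and resolves 2 issues earns a small
# positive reward even before the episode terminates.
r = shaped_reward(quality_delta=0.05, issues_resolved=2,
                  destructive=False, redundant=False)
```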
## Local run (Docker)

```bash
docker build -t datacleanser .
docker run --rm -p 7860:7860 datacleanser
```
Verify:

```bash
curl -fsS http://localhost:7860/health
curl -fsS http://localhost:7860/tasks
```
## API usage

Reset:

```bash
curl -fsS -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy_001","session_id":"demo"}'
```
Step:

```bash
curl -fsS -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo",
    "action": {
      "action_type": "fill_missing",
      "params": {"column": "age", "strategy": "median"}
    }
  }'
```
## Baseline agent (LLM) inference

The baseline script is `inference.py` (repo root). It uses an OpenAI-compatible API.
Required environment variables (per submission rules):
- `API_BASE_URL`: OpenAI-compatible endpoint base URL (optional if using the OpenAI default)
- `MODEL_NAME`: model id (e.g. `gpt-4.1-mini`, or your provider's model name)
- `OPENAI_API_KEY`: API key (preferred)
- `HF_TOKEN`: API key fallback (used if `OPENAI_API_KEY` is not set)
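The key-fallback rule above can be resolved in a few lines; this is a sketch of the precedence, not necessarily how `inference.py` implements it:

```python
import os

def resolve_api_key() -> str:
    """OPENAI_API_KEY is preferred; HF_TOKEN is the fallback."""
    key = os.environ.get("OPENAI_API_KEY") or os.environ.get("HF_TOKEN")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY or HF_TOKEN")
    return key

# API_BASE_URL is optional: None falls through to the provider default.
base_url = os.environ.get("API_BASE_URL")
model = os.environ.get("MODEL_NAME", "gpt-4.1-mini")
```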
Run all 3 tasks locally via Docker:

```bash
docker build -t datacleanser .
docker run --rm -p 7860:7860 datacleanser
```

Then, in another terminal (using a local Python environment with the dependencies installed, or inside the container):

```bash
docker exec -it $(docker ps -q --filter ancestor=datacleanser | head -n 1) \
  sh -lc 'API_BASE_URL="$API_BASE_URL" MODEL_NAME="$MODEL_NAME" OPENAI_API_KEY="$OPENAI_API_KEY" HF_TOKEN="$HF_TOKEN" python3 inference.py --all --out baseline_results.json'
```
## Hugging Face Spaces (Docker) deployment

- Create a Space → SDK: Docker
- Push these files to the Space repo: `Dockerfile`, `.dockerignore`, `requirements.txt`, `app.py`, `env/`, `agent/`, `data/` (optional; datasets are generated on startup), `README.md` (this file)
- The Space will build and start automatically on port 7860.
## Notes

- The server generates datasets on startup (see the `app.py` startup event).
- For baseline agent runs (outside Spaces), set `OPENAI_API_KEY` and use `inference.py`.