---
title: OpenEnv Data Cleaning Environment
emoji: 🧼
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - fastapi
  - docker
  - openenv
  - data-cleaning
  - data-validation
---

## What this is

This repo is a Dockerized FastAPI application that serves an OpenEnv-style data cleaning environment:

- `POST /reset` to start a session on a task
- `POST /step` to take actions (fill missing, drop duplicates, standardize formats, etc.)
- `GET /health` for readiness checks

It is suitable for Hugging Face Spaces (Docker). Inference Endpoints are not ideal here because this is an interactive multi-endpoint environment, not a single model inference API.

## Web UI (optional)

Open `/web` for a lightweight dashboard to reset/step and view the table preview.

## Real-world task

Simulates a common data engineering workflow: cleaning a dirty table so downstream analytics/ML won't break. Agents must iteratively apply safe transformations (imputation, deduplication, normalization, format standardization, range/outlier handling) and then submit.

## Tasks (3 levels, deterministic grading)

- `easy_001`: missing values + exact duplicates (customer table)
- `medium_001`: missing values + format inconsistencies + invalid ranges (employee table)
- `hard_001`: missing values + duplicates + mixed date/currency formats + cross-field constraints + outliers (sales table)

On submit, the grader returns a score in `[0.0, 1.0]` in `info.grade.final_score`.
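As a sketch, pulling the grade out of a submit response in Python (the field path follows this README; the `final_score` helper name is ours, not part of the repo):

```python
def final_score(step_response: dict) -> float:
    """Extract the grader's score from a submit response.

    The field path (info.grade.final_score) follows this README;
    the helper itself is illustrative, not part of the repo.
    """
    return step_response["info"]["grade"]["final_score"]

# Illustrative shape of a submit response:
resp = {"info": {"grade": {"final_score": 0.85}}}
print(final_score(resp))  # → 0.85
```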

## Action space

`Action = { "action_type": <enum>, "params": <dict> }`

Supported `action_type` values:

- `fill_missing` (`column`, `strategy` in {`mean`, `median`, `mode`, `forward_fill`, `backward_fill`}, optional `value`)
- `drop_duplicates` (optional `subset`, optional `keep`)
- `normalize_text` (`column`, `operations`)
- `standardize_format` (`column`, `format_type` in {`email`, `date`, `phone`, `currency`, `percentage`})
- `validate_range` (`column`, optional `min_value`, optional `max_value`)
- `detect_outliers` (`column`, `method` in {`iqr`, `zscore`}, optional `threshold`)
- `infer_values` (`column`, `method`, optional `reference_columns`)
- `flag_invalid` (`row_id`, optional `reason`)
- `revert_last_action` (no params)
- `submit` (no params)
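A minimal Python sketch for composing actions in this schema (the `build_action` helper is ours, not part of the repo; the allowed names mirror the list above):

```python
# Action types documented in this README.
ACTION_TYPES = {
    "fill_missing", "drop_duplicates", "normalize_text",
    "standardize_format", "validate_range", "detect_outliers",
    "infer_values", "flag_invalid", "revert_last_action", "submit",
}

def build_action(action_type: str, **params) -> dict:
    """Wrap an action_type and its params in the documented action schema."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type}")
    return {"action_type": action_type, "params": params}

fill = build_action("fill_missing", column="age", strategy="median")
# → {"action_type": "fill_missing", "params": {"column": "age", "strategy": "median"}}
done = build_action("submit")
# → {"action_type": "submit", "params": {}}
```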

## Observation space

Each `step()` returns an Observation including:

- `table_preview`: first N rows as JSON records
- `column_schema`: per-column type, null counts, unique counts, samples
- `detected_issues`: issue summaries (type/count/severity)
- `quality_metrics`: completeness/validity/consistency/uniqueness/overall (0..1)
- `issues_remaining`, `step_count`, `max_steps`, plus `last_action_result`

## Reward shaping (dense signal)

The reward is a multi-component `Reward` model:

- Positive: quality improvement, issue-resolution progress, schema validity
- Penalties: destructive changes, redundant actions, per-step cost

This gives a partial-progress signal instead of only terminal success/failure.
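As a toy illustration of this kind of shaping (the coefficients and function are ours; the env's actual reward model may weight components differently):

```python
def shaped_reward(prev_quality: float, new_quality: float,
                  issues_resolved: int, destructive: bool = False,
                  step_cost: float = 0.01) -> float:
    """Toy dense reward: quality delta + per-issue bonus - penalties.

    Illustrative only -- not the env's actual reward computation.
    """
    reward = (new_quality - prev_quality) + 0.05 * issues_resolved - step_cost
    if destructive:
        reward -= 0.2  # penalty for a destructive change (illustrative weight)
    return reward

print(round(shaped_reward(0.50, 0.60, issues_resolved=2), 3))  # → 0.19
```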

## Local run (Docker)

```bash
docker build -t datacleanser .
docker run --rm -p 7860:7860 datacleanser
```

Verify:

```bash
curl -fsS http://localhost:7860/health
curl -fsS http://localhost:7860/tasks
```

## API usage

Reset:

```bash
curl -fsS -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy_001","session_id":"demo"}'
```

Step:

```bash
curl -fsS -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo",
    "action": {
      "action_type": "fill_missing",
      "params": {"column": "age", "strategy": "median"}
    }
  }'
```

## Baseline agent (LLM) inference

The baseline script is `inference.py` (repo root). It uses an OpenAI-compatible API.

Required environment variables (per submission rules):

- `API_BASE_URL`: OpenAI-compatible endpoint base URL (optional if using the OpenAI default)
- `MODEL_NAME`: model id (e.g. `gpt-4.1-mini`, or your provider's model name)
- `OPENAI_API_KEY`: API key (preferred)
- `HF_TOKEN`: API key fallback (used if `OPENAI_API_KEY` is not set)
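The precedence above can be sketched as follows (a sketch only; `inference.py` may parse these settings differently):

```python
import os

def resolve_settings() -> dict:
    """Resolve baseline-agent settings, preferring OPENAI_API_KEY over HF_TOKEN."""
    return {
        "base_url": os.environ.get("API_BASE_URL"),  # None → provider default
        "model": os.environ.get("MODEL_NAME"),
        "api_key": os.environ.get("OPENAI_API_KEY") or os.environ.get("HF_TOKEN"),
    }
```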

Run all 3 tasks locally via Docker:

```bash
docker build -t datacleanser .
docker run --rm -p 7860:7860 datacleanser
```

Then, in another terminal, run `inference.py` from a local Python environment with the dependencies installed, or run it inside the container:

```bash
docker exec -it $(docker ps -q --filter ancestor=datacleanser | head -n 1) \
  sh -lc 'API_BASE_URL="$API_BASE_URL" MODEL_NAME="$MODEL_NAME" OPENAI_API_KEY="$OPENAI_API_KEY" HF_TOKEN="$HF_TOKEN" python3 inference.py --all --out baseline_results.json'
```

## Hugging Face Spaces (Docker) deployment

1. Create a Space → SDK: Docker
2. Push these files to the Space repo:
   - `Dockerfile`
   - `.dockerignore`
   - `requirements.txt`
   - `app.py`
   - `env/`, `agent/`, `data/` (optional; datasets are generated on startup)
   - `README.md` (this file)
3. The Space will build and start automatically on port 7860.

## Notes

- The server generates datasets on startup (see the `app.py` startup event).
- For baseline agent runs (outside Spaces), set `OPENAI_API_KEY` and use `inference.py`.