---
title: OpenEnv Data Cleaning Environment
emoji: 🧼
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - fastapi
  - docker
  - openenv
  - data-cleaning
  - data-validation
---

## What this is

This repo is a Dockerized FastAPI application that serves an OpenEnv-style data cleaning environment:

- `POST /reset` to start a session on a task
- `POST /step` to take actions (fill missing, drop duplicates, standardize formats, etc.)
- `GET /health` for readiness checks

It is suitable for Hugging Face Spaces (Docker). Inference Endpoints are not ideal here because this is an interactive multi-endpoint environment, not a single model inference API.

## Web UI (optional)

Open `/web` for a lightweight dashboard to reset/step and view the table preview.

## Real-world task

Simulates a common data engineering workflow: cleaning a dirty table so downstream analytics/ML won't break. Agents must iteratively apply safe transformations (imputation, deduplication, normalization, format standardization, range/outlier handling) and then submit.

## Tasks (3 levels, deterministic grading)

- `easy_001`: missing values + exact duplicates (customer table)
- `medium_001`: missing values + format inconsistencies + invalid ranges (employee table)
- `hard_001`: missing values + duplicates + mixed date/currency formats + cross-field constraints + outliers (sales table)

On submit, the grader returns a score in `[0.0, 1.0]` in `info.grade.final_score`.
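As a sketch, pulling the grade out of a submit response in Python (the field path follows this README; the `final_score` helper name is ours, not part of the repo):

```python
def final_score(step_response: dict) -> float:
    """Extract the grader's score from a submit response.

    The field path (info.grade.final_score) follows this README;
    the helper itself is illustrative, not part of the repo.
    """
    return step_response["info"]["grade"]["final_score"]

# Illustrative shape of a submit response:
resp = {"info": {"grade": {"final_score": 0.85}}}
print(final_score(resp))  # → 0.85
```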

## Action space

`Action = { "action_type": <enum>, "params": <dict> }`

Supported `action_type` values:

- `fill_missing` (`column`, `strategy` in {`mean`, `median`, `mode`, `forward_fill`, `backward_fill`}, optional `value`)
- `drop_duplicates` (optional `subset`, optional `keep`)
- `normalize_text` (`column`, `operations`)
- `standardize_format` (`column`, `format_type` in {`email`, `date`, `phone`, `currency`, `percentage`})
- `validate_range` (`column`, optional `min_value`, optional `max_value`)
- `detect_outliers` (`column`, `method` in {`iqr`, `zscore`}, optional `threshold`)
- `infer_values` (`column`, `method`, optional `reference_columns`)
- `flag_invalid` (`row_id`, optional `reason`)
- `revert_last_action` (no params)
- `submit` (no params)
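A minimal Python sketch for composing actions in this schema (the `build_action` helper is ours, not part of the repo; the allowed names mirror the list above):

```python
# Action types documented in this README.
ACTION_TYPES = {
    "fill_missing", "drop_duplicates", "normalize_text",
    "standardize_format", "validate_range", "detect_outliers",
    "infer_values", "flag_invalid", "revert_last_action", "submit",
}

def build_action(action_type: str, **params) -> dict:
    """Wrap an action_type and its params in the documented action schema."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type}")
    return {"action_type": action_type, "params": params}

fill = build_action("fill_missing", column="age", strategy="median")
# → {"action_type": "fill_missing", "params": {"column": "age", "strategy": "median"}}
done = build_action("submit")
# → {"action_type": "submit", "params": {}}
```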

## Observation space

Each `step()` returns an Observation including:

- `table_preview`: first N rows as JSON records
- `column_schema`: per-column type, null counts, unique counts, samples
- `detected_issues`: issue summaries (type/count/severity)
- `quality_metrics`: completeness/validity/consistency/uniqueness/overall (0..1)
- `issues_remaining`, `step_count`, `max_steps`, plus `last_action_result`

## Reward shaping (dense signal)

The reward is a multi-component `Reward` model:

- Positive: quality improvement, issue-resolution progress, schema validity
- Penalties: destructive changes, redundant actions, per-step cost

This gives a partial-progress signal instead of only terminal success/failure.
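As a toy illustration of this kind of shaping (the coefficients and function are ours; the env's actual reward model may weight components differently):

```python
def shaped_reward(prev_quality: float, new_quality: float,
                  issues_resolved: int, destructive: bool = False,
                  step_cost: float = 0.01) -> float:
    """Toy dense reward: quality delta + per-issue bonus - penalties.

    Illustrative only -- not the env's actual reward computation.
    """
    reward = (new_quality - prev_quality) + 0.05 * issues_resolved - step_cost
    if destructive:
        reward -= 0.2  # penalty for a destructive change (illustrative weight)
    return reward

print(round(shaped_reward(0.50, 0.60, issues_resolved=2), 3))  # → 0.19
```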

## Local run (Docker)

```bash
docker build -t datacleanser .
docker run --rm -p 7860:7860 datacleanser
```

Verify:

```bash
curl -fsS http://localhost:7860/health
curl -fsS http://localhost:7860/tasks
```

## API usage

Reset:

```bash
curl -fsS -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy_001","session_id":"demo"}'
```

Step:

```bash
curl -fsS -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo",
    "action": {
      "action_type": "fill_missing",
      "params": {"column": "age", "strategy": "median"}
    }
  }'
```

## Baseline agent (LLM) inference

The baseline script is `inference.py` (repo root). It uses an OpenAI-compatible API.

Required environment variables (per submission rules):

- `API_BASE_URL`: OpenAI-compatible endpoint base URL (optional if using the OpenAI default)
- `MODEL_NAME`: model id (e.g. `gpt-4.1-mini`, or your provider's model name)
- `OPENAI_API_KEY`: API key (preferred)
- `HF_TOKEN`: API key fallback (used if `OPENAI_API_KEY` is not set)
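The precedence above can be sketched as follows (a sketch only; `inference.py` may parse these settings differently):

```python
import os

def resolve_settings() -> dict:
    """Resolve baseline-agent settings, preferring OPENAI_API_KEY over HF_TOKEN."""
    return {
        "base_url": os.environ.get("API_BASE_URL"),  # None → provider default
        "model": os.environ.get("MODEL_NAME"),
        "api_key": os.environ.get("OPENAI_API_KEY") or os.environ.get("HF_TOKEN"),
    }
```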

Run all 3 tasks locally via Docker:

```bash
docker build -t datacleanser .
docker run --rm -p 7860:7860 datacleanser
```

Then, in another terminal, run `inference.py` from a local Python environment with the dependencies installed, or run it inside the container:

```bash
docker exec -it $(docker ps -q --filter ancestor=datacleanser | head -n 1) \
  sh -lc 'API_BASE_URL="$API_BASE_URL" MODEL_NAME="$MODEL_NAME" OPENAI_API_KEY="$OPENAI_API_KEY" HF_TOKEN="$HF_TOKEN" python3 inference.py --all --out baseline_results.json'
```

## Hugging Face Spaces (Docker) deployment

1. Create a Space → SDK: Docker
2. Push these files to the Space repo:
   - `Dockerfile`
   - `.dockerignore`
   - `requirements.txt`
   - `app.py`
   - `env/`, `agent/`, `data/` (optional; datasets are generated on startup)
   - `README.md` (this file)
3. The Space will build and start automatically on port 7860.

## Notes

- The server generates datasets on startup (see the `app.py` startup event).
- For baseline agent runs (outside Spaces), set `OPENAI_API_KEY` and use `inference.py`.