| --- |
| title: CleanOps Env |
| emoji: "🧹" |
| colorFrom: blue |
| colorTo: green |
| sdk: docker |
| app_port: 8000 |
| tags: |
| - openenv |
| - data-cleaning |
| - reinforcement-learning |
| --- |
| |
| # CleanOps OpenEnv |
|
|
| CleanOps is a real-world OpenEnv benchmark for evaluating AI agents on |
| operational data-cleaning workflows. Instead of solving a toy problem, the |
| agent has to inspect messy business tables, choose remediation operations, |
| escalate ambiguous records for human review, run downstream dry-run syncs, and |
| submit a cleaned dataset scored by deterministic graders. |
|
|
| The benchmark models the kind of cleanup work that sales ops, RevOps, support |
| ops, and data platform teams perform before loading data into CRMs, billing |
| systems, and analytics warehouses. |
|
|
| ## Live Links |
|
|
| - Hugging Face Space: [harsharajkumar273/cleanops-openenv](https://huggingface.co/spaces/harsharajkumar273/cleanops-openenv) |
| - Live App: [harsharajkumar273-cleanops-openenv.hf.space](https://harsharajkumar273-cleanops-openenv.hf.space/) |
| - GitHub Repository: [harsharajkumar/cleanops-openenv](https://github.com/harsharajkumar/cleanops-openenv) |
|
|
| ## Highlights |
|
|
| - Real-world benchmark: evaluates agents on CRM, order, subscription, and |
| payment cleanup rather than games. |
| - Full OpenEnv implementation: typed `Action`, `Observation`, and `State` |
| models plus `reset()`, `step()`, and `state()`. |
| - Human-in-the-loop realism: agents can request deterministic review responses |
| for ambiguous records. |
| - Downstream simulation: agents can run CRM or billing dry runs before submit. |
| - Cost-aware reward shaping: the environment rewards useful progress while |
| penalizing wasted review budget, repeated actions, and risky shortcuts. |
|
|
| ## What The Agent Does |
|
|
| In each episode, the agent: |
|
|
| 1. inspects noisy business tables and validation issues |
| 2. chooses from a typed catalog of cleaning operations |
| 3. requests review for ambiguous merges or broken references |
| 4. runs deterministic downstream dry runs against CRM or billing systems |
| 5. applies targeted fixes while avoiding destructive shortcuts |
| 6. submits the cleaned dataset for deterministic scoring |
|
|
| ## Task Suite |
|
|
| | Task ID | Difficulty | Description | |
| |---|---|---| |
| | `customer_contacts_easy` | Easy | Clean a CRM contacts export by normalizing names/emails/phones/states, handling one reviewable duplicate, and preparing the table for CRM import. | |
| | `orders_reconciliation_medium` | Medium | Clean an e-commerce order extract by standardizing dates, currency, amounts, statuses, and shipping states while preserving returned orders and checking downstream billing readiness. | |
| | `crm_migration_hard` | Hard | Repair a 3-table CRM migration extract with duplicate customers, broken foreign keys, ambiguous payment/customer linkages, review escalation, and CRM/billing dry-run checks. | |
|
|
| ## API |
|
|
| ### Local Python API |
|
|
| ```python |
| from cleanops_env import DataCleaningAction, LocalCleanOpsEnv |
| |
| env = LocalCleanOpsEnv() |
| observation = env.reset(task_id="customer_contacts_easy", seed=7) |
| |
| observation, reward, done, info = env.step( |
| DataCleaningAction( |
| action_type="apply_operation", |
| operation_id="easy_normalize_emails", |
| reasoning="Normalize emails before customer deduplication.", |
| ) |
| ) |
| |
| state = env.state() |
| ``` |
|
|
| ### OpenEnv Server API |
|
|
| ```bash |
| PYTHONPATH="$PWD" python -m server.app --host 0.0.0.0 --port 8000 |
| ``` |
|
|
| Then use the typed client: |
|
|
| ```python |
| from cleanops_env import CleanOpsEnvClient, DataCleaningAction |
| |
| with CleanOpsEnvClient(base_url="http://127.0.0.1:8000") as env: |
| result = env.reset(task_id="orders_reconciliation_medium", seed=7) |
| result = env.step( |
| DataCleaningAction( |
| action_type="inspect_table", |
| table_name="orders", |
| reasoning="Review order rows before cleaning.", |
| ) |
| ) |
| state = env.state() |
| ``` |
|
|
| ## Action Space |
|
|
| `DataCleaningAction` |
|
|
| | Field | Type | Meaning | |
| |---|---|---| |
| | `action_type` | `"inspect_table" \| "inspect_operation" \| "apply_operation" \| "request_review" \| "run_sync_dry_run" \| "submit"` | Selects the action family. | |
| | `table_name` | `str \| null` | Table to inspect when `action_type="inspect_table"`. | |
| | `operation_id` | `str \| null` | Cleaning operation to inspect/apply. | |
| | `entity_type`, `entity_id`, `reason_code` | `str \| null` | Structured review request fields for ambiguous entities. | |
| | `target_system` | `"crm" \| "billing" \| null` | Downstream system to test with a dry run. | |
| | `reasoning` | `str` | Optional trace text used by baseline scripts. | |
|
|
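To make the field table concrete, here is a minimal sketch of a stand-in model that mirrors it. `ActionSketch` and `missing_fields` are hypothetical names, not part of `cleanops_env`, and the mapping from action family to required fields is an assumption read off the "Meaning" column rather than the package's own validation.

```python
from dataclasses import dataclass
from typing import Literal, Optional

ActionType = Literal[
    "inspect_table",
    "inspect_operation",
    "apply_operation",
    "request_review",
    "run_sync_dry_run",
    "submit",
]
TargetSystem = Literal["crm", "billing"]


@dataclass
class ActionSketch:
    """Hypothetical stand-in mirroring the DataCleaningAction field table."""

    action_type: ActionType
    table_name: Optional[str] = None
    operation_id: Optional[str] = None
    entity_type: Optional[str] = None
    entity_id: Optional[str] = None
    reason_code: Optional[str] = None
    target_system: Optional[TargetSystem] = None
    reasoning: str = ""


# Assumed mapping: which fields each action family needs (inferred from the table).
REQUIRED_FIELDS = {
    "inspect_table": ["table_name"],
    "inspect_operation": ["operation_id"],
    "apply_operation": ["operation_id"],
    "request_review": ["entity_type", "entity_id", "reason_code"],
    "run_sync_dry_run": ["target_system"],
    "submit": [],
}


def missing_fields(action: ActionSketch) -> list[str]:
    """Return the assumed-required fields that the action leaves unset."""
    return [
        name
        for name in REQUIRED_FIELDS[action.action_type]
        if getattr(action, name) is None
    ]
```

A check like this could run on the agent side before calling `step()`, so that an incomplete action is caught before it costs a step; whether the environment itself rejects such actions is not stated here.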
| ## Observation Space |
|
|
| `DataCleaningObservation` includes: |
|
|
| | Field | Meaning | |
| |---|---| |
| | `quality_score`, `best_score`, `grader` | Deterministic score and score decomposition. | |
| | `review_budget_remaining`, `available_review_targets`, `pending_reviews`, `resolved_reviews` | Human-review queue state. | |
| | `supported_sync_targets`, `downstream_health`, `risk_cards`, `last_dry_run` | Downstream business-system simulation state. | |
| | `action_costs` | Estimated cost profile for the action families available in this benchmark. | |
| | `table_summaries`, `focus_table`, `available_operations`, `focus_operation` | Structured data/task context for the agent. | |
| | `validation_issues`, `issue_cards` | Current rule failures and remediation hints. | |
| | `recent_history`, `last_action_status`, `last_action_error` | Interaction trace and outcome details. | |
|
|
| ## Reward Function |
|
|
| Each step computes: |
|
|
| ```text |
| reward = |
| 1.00 * score_delta |
| + 0.35 * issue_count_delta |
| + 0.55 * downstream_health_delta |
| + inspection_bonus |
| + review_bonus |
| + step_penalty |
| + invalid_action_penalty |
| + no_op_penalty |
| + review_cost_penalty |
| + action_cost_penalty |
| + submit_bonus |
| ``` |
|
|
| This gives partial-progress credit throughout the trajectory while penalizing |
| invalid actions, repeated work, wasted review budget, and low-quality |
| submissions. |
|
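The formula above can be sketched as a small function. This is an illustration only: `RewardTerms` and `step_reward` are hypothetical names, and the sketch assumes that penalty terms are passed in as non-positive numbers, since the formula adds them rather than subtracting them. How the environment actually computes each term is not shown here.

```python
from dataclasses import dataclass


@dataclass
class RewardTerms:
    """Per-step reward terms; penalties are assumed to be <= 0."""

    score_delta: float = 0.0
    issue_count_delta: float = 0.0
    downstream_health_delta: float = 0.0
    inspection_bonus: float = 0.0
    review_bonus: float = 0.0
    step_penalty: float = 0.0
    invalid_action_penalty: float = 0.0
    no_op_penalty: float = 0.0
    review_cost_penalty: float = 0.0
    action_cost_penalty: float = 0.0
    submit_bonus: float = 0.0


def step_reward(t: RewardTerms) -> float:
    """Combine the terms with the weights from the reward formula."""
    weighted = (
        1.00 * t.score_delta
        + 0.35 * t.issue_count_delta
        + 0.55 * t.downstream_health_delta
    )
    unweighted = (
        t.inspection_bonus
        + t.review_bonus
        + t.step_penalty
        + t.invalid_action_penalty
        + t.no_op_penalty
        + t.review_cost_penalty
        + t.action_cost_penalty
        + t.submit_bonus
    )
    return weighted + unweighted
```

For example, with illustrative values `score_delta=0.10`, `issue_count_delta=0.20`, and `step_penalty=-0.01`, the sum is `0.10 + 0.07 - 0.01 = 0.16`.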
|
| ## System Design |
|
|
| - `cleanops_env/tasks.py`: task definitions, gold tables, operation catalog, |
| review cases, and sync-target support. |
| - `cleanops_env/graders.py`: deterministic table-quality grading and validation |
| checks. |
| - `cleanops_env/environment.py`: episode state, reward shaping, review queues, |
| dry-run simulation, and typed `step()` / `reset()` / `state()`. |
| - `server/app.py`: FastAPI/OpenEnv server plus the Hugging Face demo UI. |
| - `inference.py`: submission-ready baseline runner with structured logs. |
|
|
| ## Grading |
|
|
| Each task uses a deterministic grader that outputs a final score in `(0.0, 1.0)`, |
| computed as a weighted sum of three components: |
|
|
| - `cell_match_score` |
| - `key_recall_score` |
| - `validation_score` |
|
|
| Final score: |
|
|
| ```text |
| 0.55 * cell_match_score + 0.20 * key_recall_score + 0.25 * validation_score |
| ``` |
|
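As a worked example of the weighted sum, the sketch below computes the raw combination for hypothetical component values. `final_score` is an illustrative name, and any clamping the grader applies to keep scores inside `(0.0, 1.0)` is not modeled.

```python
def final_score(
    cell_match_score: float,
    key_recall_score: float,
    validation_score: float,
) -> float:
    """Raw weighted sum of the three grader components (weights sum to 1.0)."""
    return (
        0.55 * cell_match_score
        + 0.20 * key_recall_score
        + 0.25 * validation_score
    )


# Hypothetical component values: 0.55 * 1.0 + 0.20 * 0.5 + 0.25 * 0.8 = 0.85
example = final_score(1.0, 0.5, 0.8)
```

Because `cell_match_score` carries more than half the weight, a table with correct keys and passing validations but many wrong cell values still scores poorly.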
|
| ## Setup |
|
|
| ```bash |
| git clone https://github.com/harsharajkumar/cleanops-openenv.git |
| cd cleanops-openenv |
| python -m venv .venv |
| source .venv/bin/activate |
| pip install -e ".[dev]" |
| ``` |
|
|
| ## Validate |
|
|
| ```bash |
| openenv validate --verbose |
| pytest -q |
| ``` |
|
|
| ## Submission Inference Script |
|
|
| `inference.py` lives at the project root and follows the required stdout |
| contract: |
|
|
| ```text |
| [START] task=<task_name> env=<benchmark> model=<model_name> |
| [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null> |
| [END] success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...,rn> |
| ``` |
|
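A minimal sketch of formatters for these three line types follows. The helper names are hypothetical and are not taken from `inference.py`; the two-decimal formatting and the lowercase `true`/`false`/`null` literals are inferred from the placeholders in the contract above.

```python
from typing import Optional, Sequence


def format_start(task: str, env: str, model: str) -> str:
    """Build the [START] line."""
    return f"[START] task={task} env={env} model={model}"


def format_step(
    step: int,
    action: str,
    reward: float,
    done: bool,
    error: Optional[str] = None,
) -> str:
    """Build a [STEP] line; a missing error is printed as the literal null."""
    error_text = error if error is not None else "null"
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error_text}"
    )


def format_end(
    success: bool,
    steps: int,
    score: float,
    rewards: Sequence[float],
) -> str:
    """Build the [END] line with comma-separated per-step rewards."""
    rewards_text = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={rewards_text}"
    )
```

For instance, `format_step(1, "inspect_table", 0.05, False)` yields `[STEP] step=1 action=inspect_table reward=0.05 done=false error=null`.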
|
| Environment variables: |
|
|
| | Variable | Purpose | |
| |---|---| |
| | `API_BASE_URL` | OpenAI-compatible inference endpoint. Defaults to `https://router.huggingface.co/v1`. | |
| | `MODEL_NAME` | Model identifier. Defaults to `Qwen/Qwen2.5-72B-Instruct`. | |
| | `HF_TOKEN` | API key for the inference endpoint. | |
| | `LOCAL_IMAGE_NAME` | Optional local Docker image name used with `CleanOpsEnvClient.from_docker_image()`. | |
| | `TASK_NAME` | Task to run, or `all` for all tasks. Defaults to `all`. | |
|
|
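One way to read these variables with the documented defaults is sketched below. `InferenceConfig` and `load_config` are hypothetical names; the actual `inference.py` may structure this differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    """Settings read from the environment variables in the table above."""

    api_base_url: str
    model_name: str
    hf_token: Optional[str]
    local_image_name: Optional[str]
    task_name: str


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read settings from a mapping, applying the documented defaults."""
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        hf_token=env.get("HF_TOKEN"),  # no documented default
        local_image_name=env.get("LOCAL_IMAGE_NAME"),  # optional
        task_name=env.get("TASK_NAME", "all"),
    )
```

Passing the mapping as an argument keeps the function easy to exercise without touching the real process environment, for example `load_config({"TASK_NAME": "crm_migration_hard"})`.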
| ## Baselines |
|
|
| ### Deterministic Oracle Smoke Baseline |
|
|
| ```bash |
| PYTHONPATH="$PWD" python scripts/run_oracle_smoke.py |
| ``` |
|
|
| Expected local scores: |
|
|
| | Task ID | Score | Steps | Total Reward | |
| |---|---:|---:|---:| |
| | `customer_contacts_easy` | 0.9900 | 7 | 1.1280 | |
| | `orders_reconciliation_medium` | 0.9900 | 6 | 1.0325 | |
| | `crm_migration_hard` | 0.9900 | 8 | 1.2568 | |
| | Mean | 0.9900 | - | - | |
|
|
| ### OpenAI Baseline Agent |
|
|
| ```bash |
| export OPENAI_API_KEY="..." |
| export OPENAI_MODEL="gpt-4.1-mini" |
| export OPENAI_SEED=7 |
| PYTHONPATH="$PWD" python scripts/run_openai_baseline.py --output openai_baseline.json |
| ``` |
|
|
| ## Docker |
|
|
| ```bash |
| docker build -t cleanops-env:latest . |
| docker run --rm -p 8000:8000 cleanops-env:latest |
| curl http://127.0.0.1:8000/health |
| ``` |
|
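The container may take a moment to start serving, so a script that launches it might poll the `/health` endpoint instead of issuing a single `curl`. The sketch below uses only the standard library. `wait_for_health` is a hypothetical helper, and treating HTTP 200 as "healthy" is an assumption, since the response body of `/health` is not documented here.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(
    base_url: str,
    attempts: int = 10,
    delay: float = 0.5,
    timeout: float = 2.0,
) -> bool:
    """Poll <base_url>/health until it answers with HTTP 200 or attempts run out."""
    url = base_url.rstrip("/") + "/health"
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx status: retry below.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

After `docker run`, a call such as `wait_for_health("http://127.0.0.1:8000")` returns `True` once the server responds and `False` if it never does within the allotted attempts.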
|
| ## Project Structure |
|
|
| ```text |
| cleanops-openenv/ |
| ├── cleanops_env/ |
| ├── scripts/ |
| ├── server/ |
| ├── tests/ |
| ├── Dockerfile |
| ├── inference.py |
| ├── openenv.yaml |
| └── README.md |
| ``` |
|
|