title: CleanOps Env
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
  - openenv
  - data-cleaning
  - reinforcement-learning

CleanOps OpenEnv

CleanOps is a real-world OpenEnv benchmark for evaluating AI agents on operational data-cleaning workflows. Instead of solving a toy problem, the agent has to inspect messy business tables, choose remediation operations, escalate ambiguous records for human review, run downstream dry-run syncs, and submit a cleaned dataset scored by deterministic graders.

The benchmark models the kind of cleanup work that sales ops, RevOps, support ops, and data platform teams perform before loading data into CRMs, billing systems, and analytics warehouses.

Live Links

Highlights

  • Real-world benchmark: evaluates agents on CRM, order, subscription, and payment cleanup rather than games.
  • Full OpenEnv implementation: typed Action, Observation, and State models plus reset(), step(), and state().
  • Human-in-the-loop realism: agents can request deterministic review responses for ambiguous records.
  • Downstream simulation: agents can run CRM or billing dry runs before submitting.
  • Cost-aware reward shaping: the environment rewards useful progress while penalizing wasted review budget, repeated actions, and risky shortcuts.

What The Agent Does

On each episode, the agent:

  1. inspects noisy business tables and validation issues
  2. chooses from a typed catalog of cleaning operations
  3. requests review for ambiguous merges or broken references
  4. runs deterministic downstream dry runs against CRM or billing systems
  5. applies targeted fixes while avoiding destructive shortcuts
  6. submits the cleaned dataset for deterministic scoring

Task Suite

| Task ID | Difficulty | Description |
| --- | --- | --- |
| `customer_contacts_easy` | Easy | Clean a CRM contacts export by normalizing names/emails/phones/states, handling one reviewable duplicate, and preparing the table for CRM import. |
| `orders_reconciliation_medium` | Medium | Clean an e-commerce order extract by standardizing dates, currency, amounts, statuses, and shipping states while preserving returned orders and checking downstream billing readiness. |
| `crm_migration_hard` | Hard | Repair a 3-table CRM migration extract with duplicate customers, broken foreign keys, ambiguous payment/customer linkages, review escalation, and CRM/billing dry-run checks. |

API

Local Python API

from cleanops_env import DataCleaningAction, LocalCleanOpsEnv

env = LocalCleanOpsEnv()
observation = env.reset(task_id="customer_contacts_easy", seed=7)

observation, reward, done, info = env.step(
    DataCleaningAction(
        action_type="apply_operation",
        operation_id="easy_normalize_emails",
        reasoning="Normalize emails before customer deduplication.",
    )
)

state = env.state()

OpenEnv Server API

PYTHONPATH="$PWD" python -m server.app --host 0.0.0.0 --port 8000

Then use the typed client:

from cleanops_env import CleanOpsEnvClient, DataCleaningAction

with CleanOpsEnvClient(base_url="http://127.0.0.1:8000") as env:
    result = env.reset(task_id="orders_reconciliation_medium", seed=7)
    result = env.step(
        DataCleaningAction(
            action_type="inspect_table",
            table_name="orders",
            reasoning="Review order rows before cleaning.",
        )
    )
    state = env.state()

Action Space

DataCleaningAction

| Field | Type | Meaning |
| --- | --- | --- |
| `action_type` | `"inspect_table" \| "inspect_operation" \| "apply_operation" \| "request_review" \| "run_sync_dry_run" \| "submit"` | Selects the action family. |
| `table_name` | `str \| null` | Table to inspect when `action_type="inspect_table"`. |
| `operation_id` | `str \| null` | Cleaning operation to inspect/apply. |
| `entity_type`, `entity_id`, `reason_code` | `str \| null` | Structured review request fields for ambiguous entities. |
| `target_system` | `"crm" \| "billing" \| null` | Downstream system to test with a dry run. |
| `reasoning` | `str` | Optional trace text used by baseline scripts. |
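
As a rough illustration of how these fields combine, the sketch below builds one payload per action family as plain dictionaries. The helper `make_action`, the `EXPECTED_FIELDS` mapping, and the review values (`customer`, `cust_001`, `ambiguous_merge`) are hypothetical and inferred from the field descriptions above; the real `DataCleaningAction` model may validate differently.

```python
from typing import Optional

# Action families listed in the table above.
ACTION_TYPES = {
    "inspect_table", "inspect_operation", "apply_operation",
    "request_review", "run_sync_dry_run", "submit",
}

# Assumption: which optional fields each family needs. Inferred from the
# field descriptions, not copied from the library's own validation.
EXPECTED_FIELDS = {
    "inspect_table": ("table_name",),
    "inspect_operation": ("operation_id",),
    "apply_operation": ("operation_id",),
    "request_review": ("entity_type", "entity_id", "reason_code"),
    "run_sync_dry_run": ("target_system",),
    "submit": (),
}


def make_action(action_type: str, reasoning: str = "", **fields: Optional[str]) -> dict:
    """Build a dict payload shaped like DataCleaningAction (hypothetical helper)."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    missing = [name for name in EXPECTED_FIELDS[action_type] if not fields.get(name)]
    if missing:
        raise ValueError(f"{action_type} is missing fields: {missing}")
    if fields.get("target_system") not in (None, "crm", "billing"):
        raise ValueError("target_system must be 'crm', 'billing', or omitted")
    return {"action_type": action_type, "reasoning": reasoning, **fields}


# One example per family; identifiers not found in this README are placeholders.
examples = [
    make_action("inspect_table", table_name="orders"),
    make_action("apply_operation", operation_id="easy_normalize_emails"),
    make_action("request_review", entity_type="customer",
                entity_id="cust_001", reason_code="ambiguous_merge"),
    make_action("run_sync_dry_run", target_system="billing"),
    make_action("submit"),
]
```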

Observation Space

DataCleaningObservation includes:

| Field | Meaning |
| --- | --- |
| `quality_score`, `best_score`, `grader` | Deterministic score and score decomposition. |
| `review_budget_remaining`, `available_review_targets`, `pending_reviews`, `resolved_reviews` | Human-review queue state. |
| `supported_sync_targets`, `downstream_health`, `risk_cards`, `last_dry_run` | Downstream business-system simulation state. |
| `action_costs` | Estimated cost profile for the action families available in this benchmark. |
| `table_summaries`, `focus_table`, `available_operations`, `focus_operation` | Structured data/task context for the agent. |
| `validation_issues`, `issue_cards` | Current rule failures and remediation hints. |
| `recent_history`, `last_action_status`, `last_action_error` | Interaction trace and outcome details. |

Reward Function

Each step computes:

reward =
  1.00 * score_delta
+ 0.35 * issue_count_delta
+ 0.55 * downstream_health_delta
+ inspection_bonus
+ review_bonus
+ step_penalty
+ invalid_action_penalty
+ no_op_penalty
+ review_cost_penalty
+ action_cost_penalty
+ submit_bonus

This gives partial progress credit throughout the trajectory while penalizing invalid actions, repeated work, wasted review budget, and low-quality submissions.
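
The formula can be sketched in a few lines of Python. The `RewardTerms` container and `step_reward` function are hypothetical names, not part of `cleanops_env`. Because the formula adds every term, this sketch assumes penalties are stored as negative numbers; the example magnitudes are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RewardTerms:
    """Per-step reward components (hypothetical container, all default to 0)."""
    score_delta: float = 0.0
    issue_count_delta: float = 0.0
    downstream_health_delta: float = 0.0
    inspection_bonus: float = 0.0
    review_bonus: float = 0.0
    step_penalty: float = 0.0            # assumed to be <= 0
    invalid_action_penalty: float = 0.0  # assumed to be <= 0
    no_op_penalty: float = 0.0           # assumed to be <= 0
    review_cost_penalty: float = 0.0     # assumed to be <= 0
    action_cost_penalty: float = 0.0     # assumed to be <= 0
    submit_bonus: float = 0.0


def step_reward(t: RewardTerms) -> float:
    """Weighted sum of the three deltas plus the unweighted bonuses and penalties."""
    return (
        1.00 * t.score_delta
        + 0.35 * t.issue_count_delta
        + 0.55 * t.downstream_health_delta
        + t.inspection_bonus
        + t.review_bonus
        + t.step_penalty
        + t.invalid_action_penalty
        + t.no_op_penalty
        + t.review_cost_penalty
        + t.action_cost_penalty
        + t.submit_bonus
    )


# Illustrative step: 1.00 * 0.10 + 0.35 * 0.20 - 0.01 = 0.16
example = step_reward(
    RewardTerms(score_delta=0.10, issue_count_delta=0.20, step_penalty=-0.01)
)
```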

System Design

  • cleanops_env/tasks.py: task definitions, gold tables, operation catalog, review cases, and sync-target support.
  • cleanops_env/graders.py: deterministic table-quality grading and validation checks.
  • cleanops_env/environment.py: episode state, reward shaping, review queues, dry-run simulation, and typed step() / reset() / state().
  • server/app.py: FastAPI/OpenEnv server plus the Hugging Face demo UI.
  • inference.py: submission-ready baseline runner with structured logs.

Grading

Each task uses a deterministic grader that outputs a final score in (0.0, 1.0) from three components:

  • cell_match_score
  • key_recall_score
  • validation_score

Final score:

0.55 * cell_match_score + 0.20 * key_recall_score + 0.25 * validation_score
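
As a worked example of the weighting, the sketch below combines three component scores. `final_score` is a hypothetical helper, and it assumes each component lies in [0, 1]; any clamping the real grader applies to keep the result strictly inside (0.0, 1.0) is not modeled here.

```python
def final_score(cell_match_score: float, key_recall_score: float,
                validation_score: float) -> float:
    """Weighted combination of the three grader components (weights sum to 1.0)."""
    for name, value in (
        ("cell_match_score", cell_match_score),
        ("key_recall_score", key_recall_score),
        ("validation_score", validation_score),
    ):
        # Assumption: components are fractions between 0 and 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (
        0.55 * cell_match_score
        + 0.20 * key_recall_score
        + 0.25 * validation_score
    )


# 0.55 * 0.90 + 0.20 * 1.00 + 0.25 * 0.80 = 0.495 + 0.200 + 0.200 = 0.895
example = final_score(0.90, 1.00, 0.80)
```

Cell-level accuracy carries more than half of the weight, so a submission that keeps every key and passes validation but leaves many cells wrong still scores poorly.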

Setup

git clone https://github.com/harsharajkumar/cleanops-openenv.git
cd cleanops-openenv
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Validate

openenv validate --verbose
pytest -q

Submission Inference Script

inference.py lives at the project root and follows the required stdout contract:

[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...,rn>
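
A minimal sketch of emitting these lines from Python follows. The three formatter functions are hypothetical and are not taken from inference.py. They assume two-decimal rendering for rewards and scores (suggested by the `<0.00>` placeholders), lowercase booleans, and the literal `null` when there is no error.

```python
from typing import Optional, Sequence


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    # Booleans are lowercased and a missing error becomes the literal "null".
    error_text = error if error is not None else "null"
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error_text}"
    )


def format_end(success: bool, steps: int, score: float,
               rewards: Sequence[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={joined}"
    )


# The env name "cleanops" and the action string are placeholders.
lines = [
    format_start("customer_contacts_easy", "cleanops", "Qwen/Qwen2.5-72B-Instruct"),
    format_step(1, "inspect_table(orders)", 0.05, False),
    format_end(True, 2, 0.99, [0.05, 0.40]),
]
```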

Environment variables:

| Variable | Purpose |
| --- | --- |
| `API_BASE_URL` | OpenAI-compatible inference endpoint. Defaults to `https://router.huggingface.co/v1`. |
| `MODEL_NAME` | Model identifier. Defaults to `Qwen/Qwen2.5-72B-Instruct`. |
| `HF_TOKEN` | API key for the inference endpoint. |
| `LOCAL_IMAGE_NAME` | Optional local Docker image name used with `CleanOpsEnvClient.from_docker_image()`. |
| `TASK_NAME` | Task to run, or `all` for all tasks. Defaults to `all`. |
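
One way to read these variables with the documented defaults is sketched below. `InferenceConfig` and `load_config` are hypothetical names, not the code in inference.py, and treating `HF_TOKEN` and `LOCAL_IMAGE_NAME` as `None` when unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: Optional[str]
    local_image_name: Optional[str]
    task_name: str


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        hf_token=env.get("HF_TOKEN"),                  # no documented default
        local_image_name=env.get("LOCAL_IMAGE_NAME"),  # optional
        task_name=env.get("TASK_NAME", "all"),
    )


# Passing an explicit mapping keeps the example independent of the real shell.
config = load_config({"TASK_NAME": "crm_migration_hard"})
```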

Baselines

Deterministic Oracle Smoke Baseline

PYTHONPATH="$PWD" python scripts/run_oracle_smoke.py

Expected local scores:

| Task ID | Score | Steps | Total Reward |
| --- | --- | --- | --- |
| `customer_contacts_easy` | 0.9900 | 7 | 1.1280 |
| `orders_reconciliation_medium` | 0.9900 | 6 | 1.0325 |
| `crm_migration_hard` | 0.9900 | 8 | 1.2568 |
| Mean | 0.9900 | - | - |

OpenAI Baseline Agent

export OPENAI_API_KEY="..."
export OPENAI_MODEL="gpt-4.1-mini"
export OPENAI_SEED=7
PYTHONPATH="$PWD" python scripts/run_openai_baseline.py --output openai_baseline.json

Docker

docker build -t cleanops-env:latest .
docker run --rm -p 8000:8000 cleanops-env:latest
curl http://127.0.0.1:8000/health

Project Structure

cleanops-openenv/
├── cleanops_env/
├── scripts/
├── server/
├── tests/
├── Dockerfile
├── inference.py
├── openenv.yaml
└── README.md