	---
	title: CleanOps Env
	emoji: "🧹"
	colorFrom: blue
	colorTo: green
	sdk: docker
	app_port: 8000
	tags:
	- openenv
	- data-cleaning
	- reinforcement-learning
	---

	# CleanOps OpenEnv

	CleanOps is a real-world OpenEnv benchmark for evaluating AI agents on
	operational data-cleaning workflows. Instead of solving a toy problem, the
	agent has to inspect messy business tables, choose remediation operations,
	escalate ambiguous records for human review, run downstream dry-run syncs, and
	submit a cleaned dataset scored by deterministic graders.

	The benchmark models the kind of cleanup work that sales ops, RevOps, support
	ops, and data platform teams perform before loading data into CRMs, billing
	systems, and analytics warehouses.

	## Live Links

	- Hugging Face Space: [harsharajkumar273/cleanops-openenv](https://huggingface.co/spaces/harsharajkumar273/cleanops-openenv)
	- Live App: [harsharajkumar273-cleanops-openenv.hf.space](https://harsharajkumar273-cleanops-openenv.hf.space/)
	- GitHub Repository: [harsharajkumar/cleanops-openenv](https://github.com/harsharajkumar/cleanops-openenv)

	## Highlights

	- Real-world benchmark: evaluates agents on CRM, order, subscription, and
	payment cleanup rather than games.
	- Full OpenEnv implementation: typed `Action`, `Observation`, and `State`
	models plus `reset()`, `step()`, and `state()`.
	- Human-in-the-loop realism: agents can request deterministic review responses
	for ambiguous records.
	- Downstream simulation: agents can run CRM or billing dry runs before submit.
	- Cost-aware reward shaping: the environment rewards useful progress while
	penalizing wasted review budget, repeated actions, and risky shortcuts.

	## What The Agent Does

	In each episode, the agent:

	1. inspects noisy business tables and validation issues
	2. chooses from a typed catalog of cleaning operations
	3. requests review for ambiguous merges or broken references
	4. runs deterministic downstream dry runs against CRM or billing systems
	5. applies targeted fixes while avoiding destructive shortcuts
	6. submits the cleaned dataset for deterministic scoring

	## Task Suite

	| Task ID | Difficulty | Description |
	|---|---|---|
	| `customer_contacts_easy` | Easy | Clean a CRM contacts export by normalizing names/emails/phones/states, handling one reviewable duplicate, and preparing the table for CRM import. |
	| `orders_reconciliation_medium` | Medium | Clean an e-commerce order extract by standardizing dates, currency, amounts, statuses, and shipping states while preserving returned orders and checking downstream billing readiness. |
	| `crm_migration_hard` | Hard | Repair a 3-table CRM migration extract with duplicate customers, broken foreign keys, ambiguous payment/customer linkages, review escalation, and CRM/billing dry-run checks. |

	## API

	### Local Python API

	```python
	from cleanops_env import DataCleaningAction, LocalCleanOpsEnv

	env = LocalCleanOpsEnv()
	observation = env.reset(task_id="customer_contacts_easy", seed=7)

	observation, reward, done, info = env.step(
	    DataCleaningAction(
	        action_type="apply_operation",
	        operation_id="easy_normalize_emails",
	        reasoning="Normalize emails before customer deduplication.",
	    )
	)

	state = env.state()
	```

	### OpenEnv Server API

	```bash
	PYTHONPATH="$PWD" python -m server.app --host 0.0.0.0 --port 8000
	```

	Then use the typed client:

	```python
	from cleanops_env import CleanOpsEnvClient, DataCleaningAction

	with CleanOpsEnvClient(base_url="http://127.0.0.1:8000") as env:
	    result = env.reset(task_id="orders_reconciliation_medium", seed=7)
	    result = env.step(
	        DataCleaningAction(
	            action_type="inspect_table",
	            table_name="orders",
	            reasoning="Review order rows before cleaning.",
	        )
	    )
	    state = env.state()
	```

	## Action Space

	`DataCleaningAction`

	| Field | Type | Meaning |
	|---|---|---|
	| `action_type` | `"inspect_table" \| "inspect_operation" \| "apply_operation" \| "request_review" \| "run_sync_dry_run" \| "submit"` | Selects the action family. |
	| `table_name` | `str \| null` | Table to inspect when `action_type="inspect_table"`. |
	| `operation_id` | `str \| null` | Cleaning operation to inspect/apply. |
	| `entity_type`, `entity_id`, `reason_code` | `str \| null` | Structured review request fields for ambiguous entities. |
	| `target_system` | `"crm" \| "billing" \| null` | Downstream system to test with a dry run. |
	| `reasoning` | `str` | Optional trace text used by baseline scripts. |
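
The field table above can be read as a set of per-action requirements. The sketch below is one way to check an action before sending it. `ActionSketch` and `missing_fields` are hypothetical stand-ins, not the real `DataCleaningAction` model, and the "required field" rules are an assumption inferred from the Meaning column rather than confirmed behavior of the environment.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values copied from the Action Space table.
ACTION_TYPES = {
    "inspect_table", "inspect_operation", "apply_operation",
    "request_review", "run_sync_dry_run", "submit",
}
SYNC_TARGETS = {"crm", "billing"}

# Assumed mapping from action family to the fields it needs.
REQUIRED_FIELDS = {
    "inspect_table": ["table_name"],
    "inspect_operation": ["operation_id"],
    "apply_operation": ["operation_id"],
    "request_review": ["entity_type", "entity_id", "reason_code"],
    "run_sync_dry_run": ["target_system"],
    "submit": [],
}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for DataCleaningAction (illustration only)."""
    action_type: str
    table_name: Optional[str] = None
    operation_id: Optional[str] = None
    entity_type: Optional[str] = None
    entity_id: Optional[str] = None
    reason_code: Optional[str] = None
    target_system: Optional[str] = None
    reasoning: str = ""


def missing_fields(action: ActionSketch) -> List[str]:
    """Return the names of fields that look missing or invalid."""
    if action.action_type not in ACTION_TYPES:
        return ["action_type"]
    problems = [
        name for name in REQUIRED_FIELDS[action.action_type]
        if getattr(action, name) is None
    ]
    # A target that is set must be one of the two listed systems.
    if action.target_system is not None and action.target_system not in SYNC_TARGETS:
        problems.append("target_system")
    return problems


review = ActionSketch(action_type="request_review", entity_type="customer")
print(missing_fields(review))  # fields still to fill in before sending
```

A pre-flight check like this may help an agent avoid spending steps on actions the environment would reject, though the environment's own validation remains the source of truth.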

	## Observation Space

	`DataCleaningObservation` includes:

	| Field | Meaning |
	|---|---|
	| `quality_score`, `best_score`, `grader` | Deterministic score and score decomposition. |
	| `review_budget_remaining`, `available_review_targets`, `pending_reviews`, `resolved_reviews` | Human-review queue state. |
	| `supported_sync_targets`, `downstream_health`, `risk_cards`, `last_dry_run` | Downstream business-system simulation state. |
	| `action_costs` | Estimated cost profile for the action families available in this benchmark. |
	| `table_summaries`, `focus_table`, `available_operations`, `focus_operation` | Structured data/task context for the agent. |
	| `validation_issues`, `issue_cards` | Current rule failures and remediation hints. |
	| `recent_history`, `last_action_status`, `last_action_error` | Interaction trace and outcome details. |

	## Reward Function

	Each step computes:

	```text
	reward =
	1.00 * score_delta
	+ 0.35 * issue_count_delta
	+ 0.55 * downstream_health_delta
	+ inspection_bonus
	+ review_bonus
	+ step_penalty
	+ invalid_action_penalty
	+ no_op_penalty
	+ review_cost_penalty
	+ action_cost_penalty
	+ submit_bonus
	```

	This gives partial credit for progress throughout the trajectory while
	penalizing invalid actions, repeated work, wasted review budget, and
	low-quality submissions.
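
As a rough illustration of how the terms combine, the sketch below adds the three weighted deltas to whatever bonus and penalty terms apply on a step. The function name `step_reward` and the example numbers are hypothetical. Because every term in the formula is added, the sketch assumes penalties are passed as non-positive values. How the environment normalizes each delta is not shown here.

```python
# Unweighted terms named in the reward formula.
ADJUSTMENT_TERMS = (
    "inspection_bonus", "review_bonus", "step_penalty",
    "invalid_action_penalty", "no_op_penalty", "review_cost_penalty",
    "action_cost_penalty", "submit_bonus",
)


def step_reward(score_delta, issue_count_delta, downstream_health_delta, **adjustments):
    """Combine the weighted deltas with the additive bonus/penalty terms."""
    unknown = set(adjustments) - set(ADJUSTMENT_TERMS)
    if unknown:
        raise ValueError(f"unknown reward terms: {sorted(unknown)}")
    weighted = (
        1.00 * score_delta
        + 0.35 * issue_count_delta
        + 0.55 * downstream_health_delta
    )
    # Terms that are not supplied contribute nothing.
    return weighted + sum(adjustments.values())


# Hypothetical step: the score improves a little and a small step penalty applies.
example = step_reward(0.10, 0.20, 0.0, step_penalty=-0.01)
print(round(example, 4))
```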

	## System Design

	- `cleanops_env/tasks.py`: task definitions, gold tables, operation catalog,
	review cases, and sync-target support.
	- `cleanops_env/graders.py`: deterministic table-quality grading and validation
	checks.
	- `cleanops_env/environment.py`: episode state, reward shaping, review queues,
	dry-run simulation, and typed `step()` / `reset()` / `state()`.
	- `server/app.py`: FastAPI/OpenEnv server plus the Hugging Face demo UI.
	- `inference.py`: submission-ready baseline runner with structured logs.

	## Grading

	Each task uses a deterministic grader that outputs a final score in `(0.0, 1.0)`
	from three components:

	- `cell_match_score`
	- `key_recall_score`
	- `validation_score`

	Final score:

	```text
	0.55 * cell_match_score + 0.20 * key_recall_score + 0.25 * validation_score
	```
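
For a quick sanity check of the weights, the formula can be written as a small helper. `final_score` is a hypothetical name, and the component values in the example are made up. How the grader keeps the result inside the open interval is not stated here, so no clamping is shown.

```python
def final_score(cell_match_score, key_recall_score, validation_score):
    """Weighted sum of the three grader components (weights sum to 1.0)."""
    return (
        0.55 * cell_match_score
        + 0.20 * key_recall_score
        + 0.25 * validation_score
    )


# Hypothetical components: 0.495 + 0.16 + 0.25
print(round(final_score(0.9, 0.8, 1.0), 3))
```

Cell-level agreement with the gold tables carries more than half of the weight, so a run that passes every validation rule but leaves many cells wrong would still score poorly.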

	## Setup

	```bash
	git clone https://github.com/harsharajkumar/cleanops-openenv.git
	cd cleanops-openenv
	python -m venv .venv
	source .venv/bin/activate
	pip install -e ".[dev]"
	```

	## Validate

	```bash
	openenv validate --verbose
	pytest -q
	```

	## Submission Inference Script

	`inference.py` lives at the project root and follows the required stdout
	contract:

	```text
	[START] task=<task_name> env=<benchmark> model=<model_name>
	[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
	[END] success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...,rn>
	```
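
A minimal sketch of helpers that emit lines in this shape follows. The function names are hypothetical and are not taken from `inference.py`. Formatting each entry of `rewards` with two decimals is an assumption based on the `<0.00>` placeholders.

```python
def _flag(value):
    """Render a boolean as the lowercase true/false used in the contract."""
    return "true" if value else "false"


def format_start(task, env, model):
    return f"[START] task={task} env={env} model={model}"


def format_step(step, action, reward, done, error=None):
    # A missing error is printed as the literal string "null".
    error_text = "null" if error is None else error
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={_flag(done)} error={error_text}"
    )


def format_end(success, steps, score, rewards):
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={_flag(success)} steps={steps} score={score:.2f} rewards={joined}"


print(format_step(1, "inspect_table", 0.5, False))
print(format_end(True, 2, 0.99, [0.5, 0.25]))
```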

	Environment variables:

	| Variable | Purpose |
	|---|---|
	| `API_BASE_URL` | OpenAI-compatible inference endpoint. Defaults to `https://router.huggingface.co/v1`. |
	| `MODEL_NAME` | Model identifier. Defaults to `Qwen/Qwen2.5-72B-Instruct`. |
	| `HF_TOKEN` | API key for the inference endpoint. |
	| `LOCAL_IMAGE_NAME` | Optional local Docker image name used with `CleanOpsEnvClient.from_docker_image()`. |
	| `TASK_NAME` | Task to run, or `all` for all tasks. Defaults to `all`. |
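
The defaults in the table could be resolved along the following lines. This is a sketch, not the code in `inference.py`: `load_settings` and `tasks_to_run` are hypothetical helpers, and treating a missing `HF_TOKEN` or `LOCAL_IMAGE_NAME` as `None` is an assumption.

```python
import os
from typing import Dict, List, Mapping, Optional

# Defaults as documented in the table above.
DEFAULTS = {
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
    "TASK_NAME": "all",
}

# Task IDs from the Task Suite section.
TASK_IDS = (
    "customer_contacts_easy",
    "orders_reconciliation_medium",
    "crm_migration_hard",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Read the documented variables, falling back to the documented defaults."""
    settings: Dict[str, Optional[str]] = {
        name: env.get(name, default) for name, default in DEFAULTS.items()
    }
    # No default is documented for these two, so they may be absent.
    settings["HF_TOKEN"] = env.get("HF_TOKEN")
    settings["LOCAL_IMAGE_NAME"] = env.get("LOCAL_IMAGE_NAME")
    return settings


def tasks_to_run(task_name: str) -> List[str]:
    """Expand `all` into every task ID; otherwise run the single named task."""
    return list(TASK_IDS) if task_name == "all" else [task_name]


settings = load_settings({"TASK_NAME": "crm_migration_hard"})
print(tasks_to_run(settings["TASK_NAME"]))
```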

	## Baselines

	### Deterministic Oracle Smoke Baseline

	```bash
	PYTHONPATH="$PWD" python scripts/run_oracle_smoke.py
	```

	Expected local scores:

	| Task ID | Score | Steps | Total Reward |
	|---|---:|---:|---:|
	| `customer_contacts_easy` | 0.9900 | 7 | 1.1280 |
	| `orders_reconciliation_medium` | 0.9900 | 6 | 1.0325 |
	| `crm_migration_hard` | 0.9900 | 8 | 1.2568 |
	| Mean | 0.9900 | - | - |

	### OpenAI Baseline Agent

	```bash
	export OPENAI_API_KEY="..."
	export OPENAI_MODEL="gpt-4.1-mini"
	export OPENAI_SEED=7
	PYTHONPATH="$PWD" python scripts/run_openai_baseline.py --output openai_baseline.json
	```

	## Docker

	```bash
	docker build -t cleanops-env:latest .
	docker run --rm -p 8000:8000 cleanops-env:latest
	curl http://127.0.0.1:8000/health
	```

	## Project Structure

	```text
	cleanops-openenv/
	├── cleanops_env/
	├── scripts/
	├── server/
	├── tests/
	├── Dockerfile
	├── inference.py
	├── openenv.yaml
	└── README.md
	```