Spaces: harsharajkumar273/cleanops-openenv
harsharajkumar273 committed on Apr 10

Commit 7c2c5f2 (1 parent: 3493624)

Merge V2 review and dry-run mechanics

Files changed (8)

README.md +82 -94
cleanops_env/__init__.py +14 -1
cleanops_env/environment.py +397 -5
cleanops_env/models.py +114 -1
cleanops_env/tasks.py +86 -0
inference.py +20 -1
scripts/run_openai_baseline.py +16 -2
tests/test_environment.py +95 -0

README.md CHANGED

@@ -13,33 +13,52 @@ tags:
 # CleanOps OpenEnv
-CleanOps is a real-world OpenEnv benchmark where an agent cleans operational
-tabular data from CRM, order, subscription, and payment pipelines. The agent
-must inspect tables, choose remediation operations, avoid destructive shortcuts,
-and submit a cleaned dataset scored by deterministic graders.
-This is intentionally not a game or toy task. It models the kind of operational
-data cleanup that sales ops, RevOps, support ops, and data platform teams perform
-before loading systems-of-record or analytics warehouses.
-## Why This Environment Is Useful
-- Realistic domain: tabular data standardization, missing-value repair,
-  deduplication, and referential integrity fixes.
-- Deterministic programmatic graders: every task returns a reproducible
-  `0.0-1.0` score with interpretable components.
-- Dense reward shaping: reward is driven by score deltas, issue-count reduction,
-  inspection bonuses, step costs, no-op penalties, and submission bonuses.
-- Curriculum-ready tasks: one easy, one medium, and one hard task with increasing
-  schema complexity and cross-table dependencies.
 ## Task Suite
 | Task ID | Difficulty | Description |
 |---|---|---|
-| `customer_contacts_easy` | Easy | Clean a CRM contacts export by normalizing names/emails/phones/states, filling one missing state, and merging duplicate customers without dropping inactive accounts. |
-| `orders_reconciliation_medium` | Medium | Clean an e-commerce order extract by standardizing dates, currency, amounts, statuses, and shipping states while deduplicating repeated exports and preserving cancelled orders. |
-| `crm_migration_hard` | Hard | Repair a 3-table CRM migration extract by normalizing customer/subscription/payment fields, merging duplicate customer IDs, fixing foreign keys from email joins, and removing duplicate payment facts. |
 ## API
@@ -65,11 +84,10 @@ state = env.state()
 ### OpenEnv Server API
 ```bash
-cd /Users/harsharajkumar/Downloads/research_paper_simplifier-main/meta
 PYTHONPATH="$PWD" python -m server.app --host 0.0.0.0 --port 8000
 ```
-Then use the typed WebSocket client:
 ```python
 from cleanops_env import CleanOpsEnvClient, DataCleaningAction
@@ -92,30 +110,26 @@ with CleanOpsEnvClient(base_url="http://127.0.0.1:8000") as env:
 | Field | Type | Meaning |
 |---|---|---|
-| `action_type` | `"inspect_table" \| "inspect_operation" \| "apply_operation" \| "submit"` | Selects the action family. |
 | `table_name` | `str \| null` | Table to inspect when `action_type="inspect_table"`. |
 | `operation_id` | `str \| null` | Cleaning operation to inspect/apply. |
 | `reasoning` | `str` | Optional trace text used by baseline scripts. |
-| `metadata` | `dict` | OpenEnv metadata channel. |
 ## Observation Space
-`DataCleaningObservation` extends OpenEnv's typed `Observation` model and includes:
 | Field | Meaning |
 |---|---|
-| `task_id`, `task_title`, `difficulty`, `objective`, `dataset_context` | Task metadata and objective. |
 | `quality_score`, `best_score`, `grader` | Deterministic score and score decomposition. |
-| `remaining_steps`, `done`, `reward`, `reward_breakdown` | Episode and reward state. |
-| `table_summaries` | Compact per-table statistics and previews. |
-| `focus_table` | Full rows for the currently inspected table. |
-| `available_operations` | Typed catalog of cleaning actions and risk labels. |
-| `focus_operation` | Predicted row-level before/after diff for an inspected operation. |
 | `validation_issues`, `issue_cards` | Current rule failures and remediation hints. |
-| `recent_history`, `last_action_status`, `last_action_error`, `metadata` | Interaction trace and episode metadata. |
-`DataCleaningState` returns the current mutable tables, applied operations,
-inspection history, step count, and score state.
 ## Reward Function
@@ -123,26 +137,42 @@ Each step computes:
 ```text
 reward =
-  1.25 * score_delta
 + 0.35 * issue_count_delta
 + inspection_bonus
 + step_penalty
 + invalid_action_penalty
 + no_op_penalty
 + submit_bonus
 ```
-This gives partial progress credit throughout the trajectory and penalizes
-repeat/no-op actions, invalid operations, and low-quality premature submission.
 ## Grading
-Each task uses a deterministic grader that outputs a final score in `[0.0, 1.0]`
 from three components:
-- `cell_match_score`: exact canonicalized cell match against gold cleaned tables.
-- `key_recall_score`: entity/row identity quality after dedupe and row retention.
-- `validation_score`: fraction of unresolved data-quality checks eliminated.
 Final score:
@@ -153,7 +183,8 @@ Final score:
 ## Setup
 ```bash
-cd /Users/harsharajkumar/Downloads/research_paper_simplifier-main/meta
 python -m venv .venv
 source .venv/bin/activate
 pip install -e ".[dev]"
@@ -162,7 +193,6 @@ pip install -e ".[dev]"
 ## Validate
 ```bash
-cd /Users/harsharajkumar/Downloads/research_paper_simplifier-main/meta
 openenv validate --verbose
 pytest -q
 ```
@@ -188,92 +218,50 @@ Environment variables:
 | `LOCAL_IMAGE_NAME` | Optional local Docker image name used with `CleanOpsEnvClient.from_docker_image()`. |
 | `TASK_NAME` | Task to run, or `all` for all tasks. Defaults to `all`. |
-Example:
-```bash
-cd /Users/harsharajkumar/Downloads/research_paper_simplifier-main/meta
-export API_BASE_URL="https://router.huggingface.co/v1"
-export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
-export HF_TOKEN="..."
-PYTHONPATH="$PWD" python inference.py
-```
 ## Baselines
 ### Deterministic Oracle Smoke Baseline
 ```bash
-cd /Users/harsharajkumar/Downloads/research_paper_simplifier-main/meta
 PYTHONPATH="$PWD" python scripts/run_oracle_smoke.py
 ```
-Expected scores measured locally:
 | Task ID | Score | Steps | Total Reward |
 |---|---:|---:|---:|
-| `customer_contacts_easy` | 1.0000 | 7 | 1.1430 |
-| `orders_reconciliation_medium` | 1.0000 | 6 | 1.0222 |
-| `crm_migration_hard` | 1.0000 | 8 | 1.0827 |
-| Mean | 1.0000 | - | - |
 ### OpenAI Baseline Agent
 ```bash
-cd /Users/harsharajkumar/Downloads/research_paper_simplifier-main/meta
 export OPENAI_API_KEY="..."
 export OPENAI_MODEL="gpt-4.1-mini"
 export OPENAI_SEED=7
 PYTHONPATH="$PWD" python scripts/run_openai_baseline.py --output openai_baseline.json
 ```
-The OpenAI runner uses temperature `0`, fixed seed values, and the typed
-`DataCleaningAction` schema to produce reproducible rollouts.
 ## Docker
 ```bash
-cd /Users/harsharajkumar/Downloads/research_paper_simplifier-main/meta
 docker build -t cleanops-env:latest .
 docker run --rm -p 8000:8000 cleanops-env:latest
 curl http://127.0.0.1:8000/health
 ```
-## Hugging Face Spaces Deployment
-1. Create a new Docker Space.
-2. Upload this directory as the Space repo contents.
-3. Keep the README metadata frontmatter and `Dockerfile` at repo root.
-4. Ensure the Space has the `openenv` tag.
-5. If needed, push with the OpenEnv CLI:
-```bash
-cd /Users/harsharajkumar/Downloads/research_paper_simplifier-main/meta
-openenv push
-```
 ## Project Structure
 ```text
-meta/
 ├── cleanops_env/
-│   ├── client.py
-│   ├── environment.py
-│   ├── graders.py
-│   ├── local_env.py
-│   ├── models.py
-│   └── tasks.py
 ├── scripts/
-│   ├── run_openai_baseline.py
-│   └── run_oracle_smoke.py
 ├── server/
-│   ├── app.py
-│   └── Dockerfile
 ├── tests/
-│   └── test_environment.py
 ├── Dockerfile
 ├── inference.py
 ├── openenv.yaml
-├── pyproject.toml
-├── uv.lock
 └── README.md
 ```

 # CleanOps OpenEnv
+CleanOps is a real-world OpenEnv benchmark for evaluating AI agents on
+operational data-cleaning workflows. Instead of solving a toy problem, the
+agent has to inspect messy business tables, choose remediation operations,
+escalate ambiguous records for human review, run downstream dry-run syncs, and
+submit a cleaned dataset scored by deterministic graders.
+The benchmark models the kind of cleanup work that sales ops, RevOps, support
+ops, and data platform teams perform before loading data into CRMs, billing
+systems, and analytics warehouses.
+## Live Links
+- Hugging Face Space: [harsharajkumar273/cleanops-openenv](https://huggingface.co/spaces/harsharajkumar273/cleanops-openenv)
+- Live App: [harsharajkumar273-cleanops-openenv.hf.space](https://harsharajkumar273-cleanops-openenv.hf.space/)
+- GitHub Repository: [harsharajkumar/cleanops-openenv](https://github.com/harsharajkumar/cleanops-openenv)
+## Highlights
+- Real-world benchmark: evaluates agents on CRM, order, subscription, and
+  payment cleanup rather than games.
+- Full OpenEnv implementation: typed `Action`, `Observation`, and `State`
+  models plus `reset()`, `step()`, and `state()`.
+- Human-in-the-loop realism: agents can request deterministic review responses
+  for ambiguous records.
+- Downstream simulation: agents can run CRM or billing dry runs before submit.
+- Cost-aware reward shaping: the environment rewards useful progress while
+  penalizing wasted review budget, repeated actions, and risky shortcuts.
+## What The Agent Does
+On each episode, the agent:
+1. inspects noisy business tables and validation issues
+2. chooses from a typed catalog of cleaning operations
+3. requests review for ambiguous merges or broken references
+4. runs deterministic downstream dry runs against CRM or billing systems
+5. applies targeted fixes while avoiding destructive shortcuts
+6. submits the cleaned dataset for deterministic scoring
 ## Task Suite
 | Task ID | Difficulty | Description |
 |---|---|---|
+| `customer_contacts_easy` | Easy | Clean a CRM contacts export by normalizing names/emails/phones/states, handling one reviewable duplicate, and preparing the table for CRM import. |
+| `orders_reconciliation_medium` | Medium | Clean an e-commerce order extract by standardizing dates, currency, amounts, statuses, and shipping states while preserving returned orders and checking downstream billing readiness. |
+| `crm_migration_hard` | Hard | Repair a 3-table CRM migration extract with duplicate customers, broken foreign keys, ambiguous payment/customer linkages, review escalation, and CRM/billing dry-run checks. |
 ## API
 ### OpenEnv Server API
 ```bash
 PYTHONPATH="$PWD" python -m server.app --host 0.0.0.0 --port 8000
 ```
+Then use the typed client:
 ```python
 from cleanops_env import CleanOpsEnvClient, DataCleaningAction
 | Field | Type | Meaning |
 |---|---|---|
+| `action_type` | `"inspect_table" \| "inspect_operation" \| "apply_operation" \| "request_review" \| "run_sync_dry_run" \| "submit"` | Selects the action family. |
 | `table_name` | `str \| null` | Table to inspect when `action_type="inspect_table"`. |
 | `operation_id` | `str \| null` | Cleaning operation to inspect/apply. |
+| `entity_type`, `entity_id`, `reason_code` | `str \| null` | Structured review request fields for ambiguous entities. |
+| `target_system` | `"crm" \| "billing" \| null` | Downstream system to test with a dry run. |
 | `reasoning` | `str` | Optional trace text used by baseline scripts. |
 ## Observation Space
+`DataCleaningObservation` includes:
 | Field | Meaning |
 |---|---|
 | `quality_score`, `best_score`, `grader` | Deterministic score and score decomposition. |
+| `review_budget_remaining`, `available_review_targets`, `pending_reviews`, `resolved_reviews` | Human-review queue state. |
+| `supported_sync_targets`, `downstream_health`, `risk_cards`, `last_dry_run` | Downstream business-system simulation state. |
+| `action_costs` | Estimated cost profile for the action families available in this benchmark. |
+| `table_summaries`, `focus_table`, `available_operations`, `focus_operation` | Structured data/task context for the agent. |
 | `validation_issues`, `issue_cards` | Current rule failures and remediation hints. |
+| `recent_history`, `last_action_status`, `last_action_error` | Interaction trace and outcome details. |
 ## Reward Function
 ```text
 reward =
+  1.00 * score_delta
 + 0.35 * issue_count_delta
++ 0.55 * downstream_health_delta
 + inspection_bonus
++ review_bonus
 + step_penalty
 + invalid_action_penalty
 + no_op_penalty
++ review_cost_penalty
++ action_cost_penalty
 + submit_bonus
 ```
+This gives partial progress credit throughout the trajectory while penalizing
+invalid actions, repeated work, wasted review budget, and low-quality
+submission.
+## System Design
+- `cleanops_env/tasks.py`: task definitions, gold tables, operation catalog,
+  review cases, and sync-target support.
+- `cleanops_env/graders.py`: deterministic table-quality grading and validation
+  checks.
+- `cleanops_env/environment.py`: episode state, reward shaping, review queues,
+  dry-run simulation, and typed `step()` / `reset()` / `state()`.
+- `server/app.py`: FastAPI/OpenEnv server plus the Hugging Face demo UI.
+- `inference.py`: submission-ready baseline runner with structured logs.
 ## Grading
+Each task uses a deterministic grader that outputs a final score in `(0.0, 1.0)`
 from three components:
+- `cell_match_score`
+- `key_recall_score`
+- `validation_score`
 Final score:
 ## Setup
 ```bash
+git clone https://github.com/harsharajkumar/cleanops-openenv.git
+cd cleanops-openenv
 python -m venv .venv
 source .venv/bin/activate
 pip install -e ".[dev]"
 ## Validate
 ```bash
 openenv validate --verbose
 pytest -q
 ```
 | `LOCAL_IMAGE_NAME` | Optional local Docker image name used with `CleanOpsEnvClient.from_docker_image()`. |
 | `TASK_NAME` | Task to run, or `all` for all tasks. Defaults to `all`. |
 ## Baselines
 ### Deterministic Oracle Smoke Baseline
 ```bash
 PYTHONPATH="$PWD" python scripts/run_oracle_smoke.py
 ```
+Expected local scores:
 | Task ID | Score | Steps | Total Reward |
 |---|---:|---:|---:|
+| `customer_contacts_easy` | 0.9900 | 7 | 1.1280 |
+| `orders_reconciliation_medium` | 0.9900 | 6 | 1.0325 |
+| `crm_migration_hard` | 0.9900 | 8 | 1.2568 |
+| Mean | 0.9900 | - | - |
 ### OpenAI Baseline Agent
 ```bash
 export OPENAI_API_KEY="..."
 export OPENAI_MODEL="gpt-4.1-mini"
 export OPENAI_SEED=7
 PYTHONPATH="$PWD" python scripts/run_openai_baseline.py --output openai_baseline.json
 ```
 ## Docker
 ```bash
 docker build -t cleanops-env:latest .
 docker run --rm -p 8000:8000 cleanops-env:latest
 curl http://127.0.0.1:8000/health
 ```
 ## Project Structure
 ```text
+cleanops-openenv/
 ├── cleanops_env/
 ├── scripts/
 ├── server/
 ├── tests/
 ├── Dockerfile
 ├── inference.py
 ├── openenv.yaml
 └── README.md
 ```
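
Review note on the updated Reward Function section of the README: the formula can be sanity-checked with a small standalone sketch. This is not the environment's code. The function name `step_reward` and its parameter names are illustrative; only the three weights (1.00, 0.35, 0.55) and the additive form come from the README. Penalty terms are passed as negative numbers, and rounding to four decimals mirrors the pre-merge `reward_total` line in `environment.py`; whether the merged code still rounds that way is an assumption.

```python
def step_reward(
    score_delta: float,
    issue_count_delta: float,
    downstream_health_delta: float,
    inspection_bonus: float = 0.0,
    review_bonus: float = 0.0,
    step_penalty: float = 0.0,
    invalid_action_penalty: float = 0.0,
    no_op_penalty: float = 0.0,
    review_cost_penalty: float = 0.0,
    action_cost_penalty: float = 0.0,
    submit_bonus: float = 0.0,
) -> float:
    """Combine the README's reward terms; penalty arguments should be <= 0."""
    total = (
        1.00 * score_delta
        + 0.35 * issue_count_delta
        + 0.55 * downstream_health_delta
        + inspection_bonus
        + review_bonus
        + step_penalty
        + invalid_action_penalty
        + no_op_penalty
        + review_cost_penalty
        + action_cost_penalty
        + submit_bonus
    )
    # Rounding to 4 decimals is assumed from the pre-merge code.
    return round(total, 4)


# Example: a safe cleanup step that improves score, issues, and downstream health.
example = step_reward(
    score_delta=0.10,
    issue_count_delta=0.25,
    downstream_health_delta=0.20,
    step_penalty=-0.01,          # per-step efficiency penalty used in step()
    action_cost_penalty=-0.01,   # cost of "apply_operation:safe" in ACTION_COSTS
)
print(example)
```

Keeping every term as a named argument makes it easy to compare a logged `RewardBreakdown` against the documented weights.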

cleanops_env/__init__.py CHANGED

@@ -4,19 +4,32 @@ from cleanops_env.client import CleanOpsEnvClient
 from cleanops_env.environment import CleanOpsEnvironment
 from cleanops_env.local_env import LocalCleanOpsEnv
 from cleanops_env.models import (
     DataCleaningAction,
     DataCleaningObservation,
     DataCleaningState,
     RewardBreakdown,
 )
 __all__ = [
     "CleanOpsEnvClient",
     "CleanOpsEnvironment",
     "DataCleaningAction",
     "DataCleaningObservation",
     "DataCleaningState",
     "LocalCleanOpsEnv",
     "RewardBreakdown",
 ]

 from cleanops_env.environment import CleanOpsEnvironment
 from cleanops_env.local_env import LocalCleanOpsEnv
 from cleanops_env.models import (
+    ActionCostEntry,
     DataCleaningAction,
     DataCleaningObservation,
     DataCleaningState,
+    DownstreamHealth,
+    DryRunFinding,
+    DryRunReport,
+    PendingReview,
     RewardBreakdown,
+    ReviewResolution,
+    ReviewTarget,
 )
 __all__ = [
     "CleanOpsEnvClient",
     "CleanOpsEnvironment",
+    "ActionCostEntry",
     "DataCleaningAction",
     "DataCleaningObservation",
     "DataCleaningState",
+    "DownstreamHealth",
+    "DryRunFinding",
+    "DryRunReport",
     "LocalCleanOpsEnv",
+    "PendingReview",
     "RewardBreakdown",
+    "ReviewResolution",
+    "ReviewTarget",
 ]
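
Review note on the newly exported review models: the queue lifecycle they support can be pictured with a minimal standalone sketch. `PendingReviewSketch` is a reduced stand-in for `PendingReview`, `release_ready` is a hypothetical helper, and `"rev-1"` is a made-up ID. The diff shows that `ready_at_step` is set to the requesting step plus one and that each released review earns a 0.04 bonus; the release condition used here (`ready_at_step <= current_step`) is an assumption, since the body of `_release_ready_reviews` is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PendingReviewSketch:
    # Reduced stand-in for the exported PendingReview model.
    review_id: str
    requested_at_step: int
    ready_at_step: int


def release_ready(
    pending: List[PendingReviewSketch], current_step: int
) -> Tuple[List[PendingReviewSketch], List[PendingReviewSketch]]:
    """Split the queue into (released, still_pending) for the current step.

    Assumption: a review is released once current_step >= ready_at_step.
    """
    released = [r for r in pending if r.ready_at_step <= current_step]
    still_pending = [r for r in pending if r.ready_at_step > current_step]
    return released, still_pending


# A review requested at step 3 becomes ready at step 4 (requesting step + 1).
queue = [PendingReviewSketch("rev-1", requested_at_step=3, ready_at_step=4)]
released_now, queue_after = release_ready(queue, current_step=3)
released_next, queue_final = release_ready(queue_after, current_step=4)

# step() grants 0.04 per released review.
review_bonus = round(0.04 * len(released_next), 4)
```

Under these assumptions the agent pays the review cost on the requesting step and collects the bonus one step later, when the response is released.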

cleanops_env/environment.py CHANGED

@@ -9,18 +9,27 @@ from uuid import uuid4
 from openenv.core.env_server.interfaces import Environment
 from openenv.core.env_server.types import EnvironmentMetadata
-from cleanops_env.graders import build_table_summary, grade_tables
 from cleanops_env.models import (
     DataCleaningAction,
     DataCleaningObservation,
     DataCleaningState,
     OperationDetail,
     OperationSummary,
     RewardBreakdown,
     RowChange,
     TableView,
 )
 from cleanops_env.tasks import (
     TaskSpec,
     apply_operation_to_tables,
     clone_tables,
@@ -31,6 +40,28 @@ from cleanops_env.tasks import (
     sorted_rows,
 )
 class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservation, DataCleaningState]):
     """A realistic data-cleaning workflow environment with deterministic graders."""
@@ -46,6 +77,8 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
         self._focus_operation_detail: OperationDetail | None = None
         self._done = False
         self._initial_issue_count = max(1, len(self._grade.validation_issues))
         self._state = DataCleaningState(
             episode_id=str(uuid4()),
             step_count=0,
@@ -54,14 +87,22 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
             difficulty=self._task_spec.difficulty,
             requested_seed=None,
             max_steps=self._task_spec.max_steps,
             submitted=False,
             current_score=self._grade.score,
             best_score=self._grade.score,
             outstanding_issue_count=len(self._grade.validation_issues),
-            tables=clone_tables(self._task_spec.dirty_tables),
             applied_operation_ids=[],
             inspected_tables=[self._focus_table_name],
             inspected_operations=[],
             recent_history=[],
         )
@@ -81,6 +122,8 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
         self._done = False
         self._grade = grade_tables(self._task_spec, self._task_spec.dirty_tables)
         self._initial_issue_count = max(1, len(self._grade.validation_issues))
         self._state = DataCleaningState(
             episode_id=episode_id or str(uuid4()),
             step_count=0,
@@ -89,14 +132,22 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
             difficulty=self._task_spec.difficulty,
             requested_seed=normalized_seed,
             max_steps=self._task_spec.max_steps,
             submitted=False,
             current_score=self._grade.score,
             best_score=self._grade.score,
             outstanding_issue_count=len(self._grade.validation_issues),
-            tables=clone_tables(self._task_spec.dirty_tables),
             applied_operation_ids=[],
             inspected_tables=[self._focus_table_name],
             inspected_operations=[],
             recent_history=[f"reset -> loaded task {self._task_spec.task_id} ({self._task_spec.difficulty}) seed={normalized_seed}"],
         )
         return self._build_observation(
@@ -127,13 +178,20 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
         self._state.step_count += 1
         previous_score = self._state.current_score
         previous_issue_count = self._state.outstanding_issue_count
         invalid_action_penalty = 0.0
         noop_penalty = 0.0
         insight_bonus = 0.0
         submit_bonus = 0.0
         status_message = ""
         action_error: str | None = None
         if action.action_type == "inspect_table":
             table_name = normalize_whitespace(action.table_name or "")
@@ -189,36 +247,119 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
                     if self._task_spec.operations[operation_id].tables_affected:
                         self._focus_table_name = self._task_spec.operations[operation_id].tables_affected[0]
                     status_message = f"Applied '{operation_id}' to {affected_tables or 'current tables'}."
         elif action.action_type == "submit":
             self._state.submitted = True
             self._done = True
             status_message = "Submitted cleaned tables for grading."
         self._grade = grade_tables(self._task_spec, self._state.tables)
         self._state.current_score = self._grade.score
         self._state.best_score = max(self._state.best_score, self._grade.score)
         self._state.outstanding_issue_count = len(self._grade.validation_issues)
         quality_delta = round(self._state.current_score - previous_score, 4)
         issue_delta = round((previous_issue_count - self._state.outstanding_issue_count) / self._initial_issue_count, 4)
         efficiency_penalty = -0.01
         if action.action_type == "submit":
-            submit_bonus = round(0.4 * self._state.current_score, 4) if self._state.current_score >= 0.8 else round(-0.2 * (1.0 - self._state.current_score), 4)
         if self._state.step_count >= self._state.max_steps and not self._done:
             self._done = True
             self._state.submitted = False
             status_message = f"{status_message} Step budget exhausted; episode truncated.".strip()
-        reward_total = round(1.25 * quality_delta + 0.35 * issue_delta + insight_bonus + efficiency_penalty + invalid_action_penalty + noop_penalty + submit_bonus, 4)
         reward_breakdown = RewardBreakdown(
             quality_delta=quality_delta,
             issue_delta=issue_delta,
             insight_bonus=insight_bonus,
             efficiency_penalty=efficiency_penalty,
             invalid_action_penalty=invalid_action_penalty,
             noop_penalty=noop_penalty,
             submit_bonus=submit_bonus,
             total=reward_total,
         )
@@ -228,6 +369,10 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
             action_descriptor += f"[{action.operation_id}]"
         if action.table_name:
             action_descriptor += f"[{action.table_name}]"
         self._state.recent_history.append(f"step {self._state.step_count}: {action_descriptor} -> score={self._state.current_score:.4f}")
         self._state.recent_history = self._state.recent_history[-10:]
@@ -274,6 +419,18 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
             )
             for operation in sorted(self._task_spec.operations.values(), key=lambda op: op.operation_id)
         ]
         return DataCleaningObservation(
             task_id=self._task_spec.task_id,
             task_title=self._task_spec.title,
@@ -284,9 +441,18 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
             quality_score=self._state.current_score,
             best_score=self._state.best_score,
             remaining_steps=max(0, self._state.max_steps - self._state.step_count),
             table_summaries=summaries,
             focus_table=focus_table,
             available_operations=available_operations,
             focus_operation=self._focus_operation_detail,
             validation_issues=self._grade.validation_issues,
             issue_cards=list(self._task_spec.issue_cards),
@@ -301,6 +467,9 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
                 "episode_id": self._state.episode_id,
                 "requested_seed": self._state.requested_seed,
                 "applied_operation_ids": list(self._state.applied_operation_ids),
                 "submitted": self._state.submitted,
             },
         )
@@ -334,6 +503,229 @@ class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservatio
         random.Random(seed + sum(ord(char) for char in table_name)).shuffle(shuffled_rows)
         return shuffled_rows
     def _build_operation_detail(
         self,
         task_spec: TaskSpec,

 from openenv.core.env_server.interfaces import Environment
 from openenv.core.env_server.types import EnvironmentMetadata
+from cleanops_env.graders import build_table_summary, count_duplicate_groups, grade_tables
 from cleanops_env.models import (
+    ActionCostEntry,
     DataCleaningAction,
     DataCleaningObservation,
     DataCleaningState,
+    DownstreamHealth,
+    DryRunFinding,
+    DryRunReport,
     OperationDetail,
     OperationSummary,
+    PendingReview,
+    ReviewResolution,
+    ReviewTarget,
+    RiskCard,
     RewardBreakdown,
     RowChange,
     TableView,
 )
 from cleanops_env.tasks import (
+    ReviewCaseSpec,
     TaskSpec,
     apply_operation_to_tables,
     clone_tables,
     sorted_rows,
 )
+ACTION_COSTS: dict[str, float] = {
+    "inspect_table": 0.005,
+    "inspect_operation": 0.005,
+    "apply_operation:safe": 0.01,
+    "apply_operation:review": 0.015,
+    "apply_operation:destructive": 0.03,
+    "request_review": 0.025,
+    "run_sync_dry_run": 0.02,
+    "submit": 0.005,
+}
+ACTION_COST_DESCRIPTIONS: dict[str, str] = {
+    "inspect_table": "Low-cost inspection to understand current records.",
+    "inspect_operation": "Low-cost preview to inspect an operation before applying it.",
+    "apply_operation:safe": "Safe automated cleanup with low operational risk.",
+    "apply_operation:review": "Review-sensitive cleanup that should be used more deliberately.",
+    "apply_operation:destructive": "Destructive cleanup with higher business risk if applied incorrectly.",
+    "request_review": "Consumes limited human-review budget to resolve ambiguity safely.",
+    "run_sync_dry_run": "Runs a deterministic downstream system simulation before submit.",
+    "submit": "Low-cost finalization step after cleanup is complete.",
+}
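
Review note on the cost table above: `step()` later sets `action_cost_penalty = -self._estimate_action_cost(action)`, but that helper is outside the visible part of this diff. The sketch below shows one plausible lookup, assuming `apply_operation` keys are formed from the operation's risk label. The function name, the `risk_label` parameter, the fallback to `"safe"`, and the 0.0 default for unknown keys are all assumptions, not the committed implementation.

```python
from typing import Optional

# Copied from the ACTION_COSTS table added in this commit.
ACTION_COSTS = {
    "inspect_table": 0.005,
    "inspect_operation": 0.005,
    "apply_operation:safe": 0.01,
    "apply_operation:review": 0.015,
    "apply_operation:destructive": 0.03,
    "request_review": 0.025,
    "run_sync_dry_run": 0.02,
    "submit": 0.005,
}


def estimate_action_cost(action_type: str, risk_label: Optional[str] = None) -> float:
    """Hypothetical cost lookup keyed by action type and, for applies, risk label."""
    if action_type == "apply_operation":
        # Assumed fallback: treat an unlabeled operation as "safe".
        key = f"apply_operation:{risk_label or 'safe'}"
    else:
        key = action_type
    # Assumed default: unknown action families cost nothing.
    return ACTION_COSTS.get(key, 0.0)


# The penalty entering the reward is the negated cost.
destructive_penalty = -estimate_action_cost("apply_operation", "destructive")
review_penalty = -estimate_action_cost("request_review")
```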
 class CleanOpsEnvironment(Environment[DataCleaningAction, DataCleaningObservation, DataCleaningState]):
     """A realistic data-cleaning workflow environment with deterministic graders."""
         self._focus_operation_detail: OperationDetail | None = None
         self._done = False
         self._initial_issue_count = max(1, len(self._grade.validation_issues))
+        initial_tables = clone_tables(self._task_spec.dirty_tables)
+        initial_downstream_health = self._compute_downstream_health(self._task_spec, initial_tables, self._grade.validation_issues)
         self._state = DataCleaningState(
             episode_id=str(uuid4()),
             step_count=0,
             difficulty=self._task_spec.difficulty,
             requested_seed=None,
             max_steps=self._task_spec.max_steps,
+            review_budget_total=self._task_spec.review_budget,
+            review_budget_remaining=self._task_spec.review_budget,
             submitted=False,
             current_score=self._grade.score,
             best_score=self._grade.score,
             outstanding_issue_count=len(self._grade.validation_issues),
+            downstream_health=initial_downstream_health,
+            last_dry_run=None,
+            tables=initial_tables,
             applied_operation_ids=[],
             inspected_tables=[self._focus_table_name],
             inspected_operations=[],
+            requested_review_ids=[],
+            pending_reviews=[],
+            resolved_reviews=[],
+            dry_run_targets=[],
             recent_history=[],
         )
         self._done = False
         self._grade = grade_tables(self._task_spec, self._task_spec.dirty_tables)
         self._initial_issue_count = max(1, len(self._grade.validation_issues))
+        initial_tables = clone_tables(self._task_spec.dirty_tables)
+        initial_downstream_health = self._compute_downstream_health(self._task_spec, initial_tables, self._grade.validation_issues)
         self._state = DataCleaningState(
             episode_id=episode_id or str(uuid4()),
             step_count=0,
             difficulty=self._task_spec.difficulty,
             requested_seed=normalized_seed,
             max_steps=self._task_spec.max_steps,
+            review_budget_total=self._task_spec.review_budget,
+            review_budget_remaining=self._task_spec.review_budget,
             submitted=False,
             current_score=self._grade.score,
             best_score=self._grade.score,
             outstanding_issue_count=len(self._grade.validation_issues),
+            downstream_health=initial_downstream_health,
+            last_dry_run=None,
+            tables=initial_tables,
             applied_operation_ids=[],
             inspected_tables=[self._focus_table_name],
             inspected_operations=[],
+            requested_review_ids=[],
+            pending_reviews=[],
+            resolved_reviews=[],
+            dry_run_targets=[],
             recent_history=[f"reset -> loaded task {self._task_spec.task_id} ({self._task_spec.difficulty}) seed={normalized_seed}"],
         )
         return self._build_observation(
         self._state.step_count += 1
         previous_score = self._state.current_score
         previous_issue_count = self._state.outstanding_issue_count
+        previous_downstream_score = self._state.downstream_health.overall_health_score
         invalid_action_penalty = 0.0
         noop_penalty = 0.0
         insight_bonus = 0.0
+        review_bonus = 0.0
+        review_cost_penalty = 0.0
+        action_cost_penalty = 0.0
         submit_bonus = 0.0
         status_message = ""
         action_error: str | None = None
+        released_reviews = self._release_ready_reviews()
+        if released_reviews:
+            review_bonus = round(0.04 * len(released_reviews), 4)
         if action.action_type == "inspect_table":
             table_name = normalize_whitespace(action.table_name or "")
                     if self._task_spec.operations[operation_id].tables_affected:
                         self._focus_table_name = self._task_spec.operations[operation_id].tables_affected[0]
                     status_message = f"Applied '{operation_id}' to {affected_tables or 'current tables'}."
+        elif action.action_type == "request_review":
+            entity_type = normalize_whitespace(action.entity_type or "").lower()
+            entity_id = normalize_whitespace(action.entity_id or "")
+            reason_code = normalize_whitespace(action.reason_code or "")
+            review_case = self._find_review_case(entity_type, entity_id, reason_code)
+            if not entity_type or not entity_id or not reason_code:
+                invalid_action_penalty = -0.25
+                status_message = "request_review requires entity_type, entity_id, and reason_code."
+                action_error = status_message
+            elif review_case is None:
+                invalid_action_penalty = -0.2
+                status_message = f"No deterministic review case exists for {entity_type}:{entity_id} ({reason_code})."
+                action_error = status_message
+            elif review_case.review_id in self._state.requested_review_ids:
+                noop_penalty = -0.05
+                status_message = f"Review '{review_case.review_id}' was already requested."
+            elif self._state.review_budget_remaining <= 0:
+                invalid_action_penalty = -0.18
+                status_message = "No review budget remaining for this episode."
+                action_error = status_message
+            else:
+                self._state.review_budget_remaining -= 1
+                self._state.requested_review_ids.append(review_case.review_id)
+                self._state.pending_reviews.append(
+                    PendingReview(
+                        review_id=review_case.review_id,
+                        entity_type=review_case.entity_type,
+                        entity_id=review_case.entity_id,
+                        reason_code=review_case.reason_code,
+                        title=review_case.title,
+                        requested_at_step=self._state.step_count,
+                        ready_at_step=self._state.step_count + 1,
+                    )
+                )
+                review_cost_penalty = -0.02
+                status_message = (
+                    f"Queued review '{review_case.review_id}' for {review_case.entity_type} {review_case.entity_id}; "
+                    "response will be available on the next step."
+                )
+        elif action.action_type == "run_sync_dry_run":
+            target_system = action.target_system
+            if target_system is None:
+                invalid_action_penalty = -0.2
+                status_message = "run_sync_dry_run requires target_system."
+                action_error = status_message
+            elif target_system not in self._task_spec.sync_targets:
+                invalid_action_penalty = -0.2
+                status_message = f"Task '{self._task_spec.task_id}' does not support dry-run target '{target_system}'."
+                action_error = status_message
+            else:
+                self._state.last_dry_run = self._build_dry_run_report(target_system)
+                if target_system not in self._state.dry_run_targets:
+                    self._state.dry_run_targets.append(target_system)
+                    insight_bonus = max(insight_bonus, 0.01)
+                else:
+                    noop_penalty = min(noop_penalty, -0.01)
+                status_message = self._state.last_dry_run.summary
         elif action.action_type == "submit":
             self._state.submitted = True
             self._done = True
             status_message = "Submitted cleaned tables for grading."
+        action_cost_penalty = -self._estimate_action_cost(action)
         self._grade = grade_tables(self._task_spec, self._state.tables)
         self._state.current_score = self._grade.score
         self._state.best_score = max(self._state.best_score, self._grade.score)
         self._state.outstanding_issue_count = len(self._grade.validation_issues)
+        self._state.downstream_health = self._compute_downstream_health(self._task_spec, self._state.tables, self._grade.validation_issues)
         quality_delta = round(self._state.current_score - previous_score, 4)
         issue_delta = round((previous_issue_count - self._state.outstanding_issue_count) / self._initial_issue_count, 4)
+        downstream_health_delta = round(self._state.downstream_health.overall_health_score - previous_downstream_score, 4)
         efficiency_penalty = -0.01
         if action.action_type == "submit":
+            submission_health = round(0.65 * self._state.current_score + 0.35 * self._state.downstream_health.overall_health_score, 4)
+            submit_bonus = round(0.4 * submission_health, 4) if submission_health >= 0.82 else round(-0.2 * (1.0 - submission_health), 4)
         if self._state.step_count >= self._state.max_steps and not self._done:
             self._done = True
             self._state.submitted = False
             status_message = f"{status_message} Step budget exhausted; episode truncated.".strip()
+        if released_reviews:
+            release_note = ", ".join(review.review_id for review in released_reviews)
+            status_message = f"{status_message} Review response available: {release_note}.".strip()
+        reward_total = round(
+            1.0 * quality_delta
+            + 0.35 * issue_delta
+            + 0.55 * downstream_health_delta
+            + insight_bonus
+            + review_bonus
+            + efficiency_penalty
+            + invalid_action_penalty
+            + noop_penalty
+            + review_cost_penalty
+            + action_cost_penalty
+            + submit_bonus,
+            4,
+        )
         reward_breakdown = RewardBreakdown(
             quality_delta=quality_delta,
             issue_delta=issue_delta,
+            downstream_health_delta=downstream_health_delta,
             insight_bonus=insight_bonus,
+            review_bonus=review_bonus,
             efficiency_penalty=efficiency_penalty,
             invalid_action_penalty=invalid_action_penalty,
             noop_penalty=noop_penalty,
+            review_cost_penalty=review_cost_penalty,
+            action_cost_penalty=action_cost_penalty,
             submit_bonus=submit_bonus,
             total=reward_total,
         )
             action_descriptor += f"[{action.operation_id}]"
         if action.table_name:
             action_descriptor += f"[{action.table_name}]"
+        if action.entity_id:
+            action_descriptor += f"[{action.entity_id}]"
+        if action.target_system:
+            action_descriptor += f"[{action.target_system}]"
         self._state.recent_history.append(f"step {self._state.step_count}: {action_descriptor} -> score={self._state.current_score:.4f}")
         self._state.recent_history = self._state.recent_history[-10:]
             )
             for operation in sorted(self._task_spec.operations.values(), key=lambda op: op.operation_id)
         ]
+        available_review_targets = [
+            ReviewTarget(
+                review_id=review_case.review_id,
+                entity_type=review_case.entity_type,
+                entity_id=review_case.entity_id,
+                reason_code=review_case.reason_code,
+                title=review_case.title,
+                detail=review_case.detail,
+                recommended_operation_ids=list(review_case.recommended_operation_ids),
+            )
+            for review_case in sorted(self._task_spec.review_cases.values(), key=lambda case: case.review_id)
+        ]
         return DataCleaningObservation(
             task_id=self._task_spec.task_id,
             task_title=self._task_spec.title,
             quality_score=self._state.current_score,
             best_score=self._state.best_score,
             remaining_steps=max(0, self._state.max_steps - self._state.step_count),
+            review_budget_remaining=self._state.review_budget_remaining,
+            supported_sync_targets=list(self._task_spec.sync_targets),
+            downstream_health=self._state.downstream_health,
+            risk_cards=self._build_risk_cards(),
+            last_dry_run=self._state.last_dry_run,
+            action_costs=self._build_action_cost_entries(),
             table_summaries=summaries,
             focus_table=focus_table,
             available_operations=available_operations,
+            available_review_targets=available_review_targets,
+            pending_reviews=list(self._state.pending_reviews),
+            resolved_reviews=list(self._state.resolved_reviews),
             focus_operation=self._focus_operation_detail,
             validation_issues=self._grade.validation_issues,
             issue_cards=list(self._task_spec.issue_cards),
                 "episode_id": self._state.episode_id,
                 "requested_seed": self._state.requested_seed,
                 "applied_operation_ids": list(self._state.applied_operation_ids),
+                "review_budget_remaining": self._state.review_budget_remaining,
+                "requested_review_ids": list(self._state.requested_review_ids),
+                "dry_run_targets": list(self._state.dry_run_targets),
                 "submitted": self._state.submitted,
             },
         )
         random.Random(seed + sum(ord(char) for char in table_name)).shuffle(shuffled_rows)
         return shuffled_rows
+    def _find_review_case(self, entity_type: str, entity_id: str, reason_code: str) -> ReviewCaseSpec | None:
+        for review_case in self._task_spec.review_cases.values():
+            if (
+                review_case.entity_type == entity_type
+                and review_case.entity_id == entity_id
+                and review_case.reason_code == reason_code
+            ):
+                return review_case
+        return None
+    def _release_ready_reviews(self) -> list[ReviewResolution]:
+        if not self._state.pending_reviews:
+            return []
+        still_pending: list[PendingReview] = []
+        released: list[ReviewResolution] = []
+        for pending_review in self._state.pending_reviews:
+            if pending_review.ready_at_step > self._state.step_count:
+                still_pending.append(pending_review)
+                continue
+            review_case = self._task_spec.review_cases[pending_review.review_id]
+            released_review = ReviewResolution(
+                review_id=review_case.review_id,
+                entity_type=review_case.entity_type,
+                entity_id=review_case.entity_id,
+                reason_code=review_case.reason_code,
+                title=review_case.title,
+                resolution=review_case.resolution,
+                response_summary=review_case.response_summary,
+                evidence_summary=review_case.evidence_summary,
+                recommended_operation_ids=list(review_case.recommended_operation_ids),
+            )
+            self._state.resolved_reviews.append(released_review)
+            released.append(released_review)
+        self._state.pending_reviews = still_pending
+        return released
+    def _estimate_action_cost(self, action: DataCleaningAction) -> float:
+        if action.action_type == "apply_operation":
+            operation = self._task_spec.operations.get(normalize_whitespace(action.operation_id or ""))
+            if operation is None:
+                return ACTION_COSTS["apply_operation:safe"]
+            if operation.risk == "review":
+                return ACTION_COSTS["apply_operation:review"]
+            if operation.risk == "destructive":
+                return ACTION_COSTS["apply_operation:destructive"]
+            return ACTION_COSTS["apply_operation:safe"]
+        return ACTION_COSTS.get(action.action_type, 0.01)
+    def _build_action_cost_entries(self) -> list[ActionCostEntry]:
+        return [
+            ActionCostEntry(action_key=action_key, estimated_cost=estimated_cost, description=ACTION_COST_DESCRIPTIONS[action_key])
+            for action_key, estimated_cost in ACTION_COSTS.items()
+        ]
+    @staticmethod
+    def _open_metric(value: float) -> float:
+        return round(min(0.99, max(0.01, value)), 4)
+    def _compute_downstream_health(
+        self,
+        task_spec: TaskSpec,
+        tables: dict[str, list[dict[str, str]]],
+        validation_issues: list,
+    ) -> DownstreamHealth:
+        customers = tables.get("customers", [])
+        orders = tables.get("orders", [])
+        subscriptions = tables.get("subscriptions", [])
+        payments = tables.get("payments", [])
+        crm_rows = max(1, len(customers) + len(subscriptions))
+        billing_rows = max(1, len(orders) + len(subscriptions) + len(payments))
+        payment_rows = max(1, len(orders) + len(payments))
+        crm_issue_weight = sum(max(1, len(issue.row_ids)) for issue in validation_issues if issue.table_name in {"customers", "subscriptions"})
+        billing_issue_weight = sum(
+            max(1, len(issue.row_ids))
+            for issue in validation_issues
+            if issue.table_name in {"orders", "payments", "subscriptions"}
+            and (issue.code.startswith("foreign_key:") or issue.code.startswith("required:") or issue.code.startswith("unique:"))
+        )
+        payment_issue_weight = sum(
+            max(1, len(issue.row_ids))
+            for issue in validation_issues
+            if issue.table_name in {"orders", "payments"}
+        )
+        customer_duplicate_groups = count_duplicate_groups(task_spec, "customers", customers) if "customers" in task_spec.duplicate_identity_columns else 0
+        customer_rows = max(1, len(customers))
+        payment_duplicate_groups = count_duplicate_groups(task_spec, "payments", payments) if "payments" in task_spec.duplicate_identity_columns else 0
+        crm_sync_success_rate = self._open_metric(1.0 - (crm_issue_weight / max(2, crm_rows * 2)))
+        if not orders and not payments:
+            billing_link_integrity = 0.99
+            revenue_reporting_risk = 0.01
+        else:
+            billing_link_integrity = self._open_metric(1.0 - (billing_issue_weight / max(2, billing_rows * 2)))
+            revenue_reporting_risk = self._open_metric(min(0.99, (payment_issue_weight / max(2, payment_rows * 2)) + (payment_duplicate_groups / max(1, payment_rows))))
+        duplicate_contact_risk = self._open_metric(min(0.99, (customer_duplicate_groups / customer_rows) + 0.06 * sum(1 for issue in validation_issues if issue.code.startswith("unique:customers"))))
+        overall_health_score = self._open_metric(
+            (
+                crm_sync_success_rate
+                + billing_link_integrity
+                + (1.0 - duplicate_contact_risk)
+                + (1.0 - revenue_reporting_risk)
+            )
+            / 4.0
+        )
+        return DownstreamHealth(
+            crm_sync_success_rate=crm_sync_success_rate,
+            billing_link_integrity=billing_link_integrity,
+            duplicate_contact_risk=duplicate_contact_risk,
+            revenue_reporting_risk=revenue_reporting_risk,
+            overall_health_score=overall_health_score,
+        )
+    def _build_risk_cards(self) -> list[RiskCard]:
+        health = self._state.downstream_health
+        cards = [
+            RiskCard(
+                title="CRM import risk",
+                detail="Customer and subscription issues can block CRM migration syncs.",
+                severity="high" if health.crm_sync_success_rate < 0.8 else "medium" if health.crm_sync_success_rate < 0.92 else "low",
+                metric_name="crm_sync_success_rate",
+                current_value=health.crm_sync_success_rate,
+                recommended_action_ids=self._recommended_operation_ids_for_tables({"customers", "subscriptions"}),
+            ),
+            RiskCard(
+                title="Billing linkage risk",
+                detail="Broken foreign keys or missing IDs can mislink orders, subscriptions, and payments.",
+                severity="high" if health.billing_link_integrity < 0.8 else "medium" if health.billing_link_integrity < 0.92 else "low",
+                metric_name="billing_link_integrity",
+                current_value=health.billing_link_integrity,
+                recommended_action_ids=self._recommended_operation_ids_for_tables({"orders", "subscriptions", "payments"}),
+            ),
+            RiskCard(
+                title="Duplicate contact risk",
+                detail="Remaining duplicate customer identities can create bad merges downstream.",
+                severity="high" if health.duplicate_contact_risk > 0.3 else "medium" if health.duplicate_contact_risk > 0.12 else "low",
+                metric_name="duplicate_contact_risk",
+                current_value=health.duplicate_contact_risk,
+                recommended_action_ids=self._recommended_operation_ids_for_keyword("merge"),
+            ),
+            RiskCard(
+                title="Revenue reporting risk",
+                detail="Duplicate or mislinked payment and order facts can distort downstream reporting.",
+                severity="high" if health.revenue_reporting_risk > 0.3 else "medium" if health.revenue_reporting_risk > 0.12 else "low",
+                metric_name="revenue_reporting_risk",
+                current_value=health.revenue_reporting_risk,
+                recommended_action_ids=self._recommended_operation_ids_for_tables({"orders", "payments"}),
+            ),
+        ]
+        return cards
+    def _recommended_operation_ids_for_tables(self, table_names: set[str]) -> list[str]:
+        return [
+            operation.operation_id
+            for operation in sorted(self._task_spec.operations.values(), key=lambda op: op.operation_id)
+            if set(operation.tables_affected) & table_names
+        ][:4]
+    def _recommended_operation_ids_for_keyword(self, keyword: str) -> list[str]:
+        lowered = keyword.lower()
+        return [
+            operation.operation_id
+            for operation in sorted(self._task_spec.operations.values(), key=lambda op: op.operation_id)
+            if lowered in operation.operation_id.lower() or lowered in operation.title.lower()
+        ][:4]
+    def _build_dry_run_report(self, target_system: str) -> DryRunReport:
+        findings: list[DryRunFinding] = []
+        for issue in self._grade.validation_issues:
+            if target_system == "crm" and issue.table_name not in {"customers", "subscriptions"}:
+                continue
+            if target_system == "billing" and issue.table_name not in {"orders", "subscriptions", "payments"}:
+                continue
+            findings.append(
+                DryRunFinding(
+                    code=issue.code,
+                    severity=issue.severity,
+                    table_name=issue.table_name,
+                    row_ids=list(issue.row_ids),
+                    message=issue.message,
+                )
+            )
+        health = self._state.downstream_health
+        success_rate = health.crm_sync_success_rate if target_system == "crm" else health.billing_link_integrity
+        if target_system == "crm" and health.duplicate_contact_risk > 0.12:
+            findings.append(
+                DryRunFinding(
+                    code="risk:duplicate_contacts",
+                    severity="medium" if health.duplicate_contact_risk <= 0.3 else "high",
+                    table_name="customers",
+                    message="CRM dry run predicts duplicate-contact collisions after import.",
+                )
+            )
+        if target_system == "billing" and health.revenue_reporting_risk > 0.12:
+            findings.append(
+                DryRunFinding(
+                    code="risk:revenue_reporting",
+                    severity="medium" if health.revenue_reporting_risk <= 0.3 else "high",
+                    table_name="payments" if "payments" in self._state.tables else "orders",
+                    message="Billing dry run predicts mislinked or duplicated revenue facts.",
+                )
+            )
+        summary = (
+            f"Dry run for {target_system.upper()} found {len(findings)} blocking or risky findings; "
+            f"estimated success rate is {success_rate:.2f}."
+        )
+        return DryRunReport(
+            target_system=target_system,
+            success_rate=success_rate,
+            finding_count=len(findings),
+            findings=findings,
+            summary=summary,
+            generated_at_step=self._state.step_count,
+        )
     def _build_operation_detail(
         self,
         task_spec: TaskSpec,

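The reward shaping in the `cleanops_env/environment.py` hunk above can be checked by hand. The following is a minimal sketch, not the repository's code: it restates the submit bonus, the metric clamp, the composite health average, and the weighted reward sum as free functions. The function names (`submit_bonus`, `open_metric`, `overall_health`, `reward_total`) and the `additive_terms` parameter are hypothetical; only the constants are taken from the diff.

```python
def open_metric(value: float) -> float:
    """Clamp a metric into [0.01, 0.99], as _open_metric does in the diff."""
    return round(min(0.99, max(0.01, value)), 4)


def overall_health(crm: float, billing: float, dup_risk: float, rev_risk: float) -> float:
    """Average two success rates with the complements of two risk values."""
    return open_metric((crm + billing + (1.0 - dup_risk) + (1.0 - rev_risk)) / 4.0)


def submit_bonus(current_score: float, health_score: float) -> float:
    """Blend grader score and downstream health, then apply the 0.82 threshold."""
    submission_health = round(0.65 * current_score + 0.35 * health_score, 4)
    if submission_health >= 0.82:
        return round(0.4 * submission_health, 4)
    return round(-0.2 * (1.0 - submission_health), 4)


def reward_total(quality_delta: float, issue_delta: float,
                 downstream_health_delta: float, additive_terms: list[float]) -> float:
    """Weighted deltas plus the unweighted bonuses and penalties (hypothetical grouping)."""
    weighted = 1.0 * quality_delta + 0.35 * issue_delta + 0.55 * downstream_health_delta
    return round(weighted + sum(additive_terms), 4)


if __name__ == "__main__":
    print(submit_bonus(0.9, 0.9))   # blended health 0.9 clears the threshold
    print(submit_bonus(0.5, 0.5))   # blended health 0.5 falls below it
    print(reward_total(0.1, 0.2, 0.1, [-0.01, -0.02]))
```

Under these assumptions, a submission with both the grader score and downstream health at 0.9 earns a bonus of 0.36, while one at 0.5 on both is penalized by 0.1. Because every health metric passes through the clamp, `overall_health` never reports exactly 0 or 1.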
cleanops_env/models.py CHANGED

@@ -14,10 +14,14 @@ class RewardBreakdown(BaseModel):
     quality_delta: float = Field(default=0.0, description="Change in overall grader score after the action.")
     issue_delta: float = Field(default=0.0, description="Normalized change in outstanding validation issues.")
     insight_bonus: float = Field(default=0.0, description="Small positive reward for inspecting new assets.")
     efficiency_penalty: float = Field(default=0.0, description="Per-step penalty to discourage long episodes.")
     invalid_action_penalty: float = Field(default=0.0, description="Penalty for malformed or unsupported actions.")
     noop_penalty: float = Field(default=0.0, description="Penalty for no-op or repeated actions.")
     submit_bonus: float = Field(default=0.0, description="End-of-episode bonus based on final score.")
     total: float = Field(default=0.0, description="Final scalar reward returned.")
@@ -42,6 +46,94 @@ class IssueCard(BaseModel):
     recommended_operation_ids: list[str] = Field(default_factory=list, description="Operations likely to address the issue.")
 class TableSummary(BaseModel):
     """Compact summary of a table."""
@@ -102,9 +194,13 @@ class GradeBreakdown(BaseModel):
 class DataCleaningAction(Action):
     """Action model for the environment."""
-    action_type: Literal["inspect_table", "inspect_operation", "apply_operation", "submit"] = Field(..., description="Type of action to perform.")
     table_name: str | None = Field(default=None, description="Table to inspect when action_type=inspect_table.")
     operation_id: str | None = Field(default=None, description="Operation to inspect or apply when action_type is inspect_operation or apply_operation.")
     reasoning: str = Field(default="", description="Optional natural-language reasoning for debugging baselines.")
@@ -120,9 +216,18 @@ class DataCleaningObservation(Observation):
     quality_score: float = Field(default=0.0, description="Current deterministic grader score.")
     best_score: float = Field(default=0.0, description="Best score seen in the current episode.")
     remaining_steps: int = Field(default=0, description="How many actions remain before truncation.")
     table_summaries: list[TableSummary] = Field(default_factory=list, description="Compact summaries of all tables.")
     focus_table: TableView | None = Field(default=None, description="Detailed contents for the currently inspected table.")
     available_operations: list[OperationSummary] = Field(default_factory=list, description="Available cleaning actions.")
     focus_operation: OperationDetail | None = Field(default=None, description="Detailed preview for the currently inspected operation.")
     validation_issues: list[ValidationIssue] = Field(default_factory=list, description="Current unresolved validation issues.")
     issue_cards: list[IssueCard] = Field(default_factory=list, description="Aggregated issue cards with suggested next actions.")
@@ -141,12 +246,20 @@ class DataCleaningState(State):
     difficulty: Literal["easy", "medium", "hard"] = Field(..., description="Current task difficulty.")
     requested_seed: int | None = Field(default=None, description="Seed used when resetting the current episode.")
     max_steps: int = Field(..., description="Task step budget.")
     submitted: bool = Field(default=False, description="Whether submit was called.")
     current_score: float = Field(default=0.0, description="Current deterministic grader score.")
     best_score: float = Field(default=0.0, description="Best score achieved this episode.")
     outstanding_issue_count: int = Field(default=0, description="Number of unresolved validation issues.")
     tables: dict[str, list[dict[str, str]]] = Field(default_factory=dict, description="Current mutable table contents.")
     applied_operation_ids: list[str] = Field(default_factory=list, description="Operations already applied.")
     inspected_tables: list[str] = Field(default_factory=list, description="Tables inspected so far.")
     inspected_operations: list[str] = Field(default_factory=list, description="Operations inspected so far.")
     recent_history: list[str] = Field(default_factory=list, description="Recent action log.")

     quality_delta: float = Field(default=0.0, description="Change in overall grader score after the action.")
     issue_delta: float = Field(default=0.0, description="Normalized change in outstanding validation issues.")
+    downstream_health_delta: float = Field(default=0.0, description="Change in downstream operational health after the action.")
     insight_bonus: float = Field(default=0.0, description="Small positive reward for inspecting new assets.")
     efficiency_penalty: float = Field(default=0.0, description="Per-step penalty to discourage long episodes.")
     invalid_action_penalty: float = Field(default=0.0, description="Penalty for malformed or unsupported actions.")
     noop_penalty: float = Field(default=0.0, description="Penalty for no-op or repeated actions.")
+    review_bonus: float = Field(default=0.0, description="Positive reward when a queued review response becomes available.")
+    review_cost_penalty: float = Field(default=0.0, description="Small cost for consuming limited human-review budget.")
+    action_cost_penalty: float = Field(default=0.0, description="Cost-aware penalty attached to the chosen action.")
     submit_bonus: float = Field(default=0.0, description="End-of-episode bonus based on final score.")
     total: float = Field(default=0.0, description="Final scalar reward returned.")
     recommended_operation_ids: list[str] = Field(default_factory=list, description="Operations likely to address the issue.")
+class ReviewTarget(BaseModel):
+    """A reviewable entity that can be escalated to a human reviewer."""
+    review_id: str = Field(..., description="Stable review case identifier.")
+    entity_type: str = Field(..., description="Type of entity under review.")
+    entity_id: str = Field(..., description="Primary identifier for the reviewed entity.")
+    reason_code: str = Field(..., description="Why the review would be requested.")
+    title: str = Field(..., description="Short human-readable review title.")
+    detail: str = Field(..., description="Why this review matters.")
+    recommended_operation_ids: list[str] = Field(default_factory=list, description="Operations likely to be safe once review resolves.")
+class PendingReview(BaseModel):
+    """A queued review request awaiting a deterministic response."""
+    review_id: str = Field(..., description="Stable review case identifier.")
+    entity_type: str = Field(..., description="Type of entity under review.")
+    entity_id: str = Field(..., description="Primary identifier for the reviewed entity.")
+    reason_code: str = Field(..., description="Why the review was requested.")
+    title: str = Field(..., description="Short human-readable review title.")
+    requested_at_step: int = Field(..., description="Step index when the review was requested.")
+    ready_at_step: int = Field(..., description="First step on which the deterministic response becomes available.")
+class ReviewResolution(BaseModel):
+    """A resolved human-review response surfaced back to the agent."""
+    review_id: str = Field(..., description="Stable review case identifier.")
+    entity_type: str = Field(..., description="Type of entity under review.")
+    entity_id: str = Field(..., description="Primary identifier for the reviewed entity.")
+    reason_code: str = Field(..., description="Why the review was requested.")
+    title: str = Field(..., description="Short human-readable review title.")
+    resolution: str = Field(..., description="Deterministic review outcome label.")
+    response_summary: str = Field(..., description="What the reviewer concluded.")
+    evidence_summary: str = Field(..., description="Short explanation for the decision.")
+    recommended_operation_ids: list[str] = Field(default_factory=list, description="Operations that become safer after the review response.")
+class DryRunFinding(BaseModel):
+    """A deterministic downstream issue surfaced by a dry-run sync."""
+    code: str = Field(..., description="Stable machine-readable issue code.")
+    severity: Literal["low", "medium", "high"] = Field(..., description="Issue severity.")
+    table_name: str | None = Field(default=None, description="Table implicated by the dry-run finding.")
+    row_ids: list[str] = Field(default_factory=list, description="Primary-key values implicated by the finding.")
+    message: str = Field(..., description="Human-readable dry-run explanation.")
+class DryRunReport(BaseModel):
+    """A dry-run simulation result for a downstream business system."""
+    target_system: Literal["crm", "billing"] = Field(..., description="Which downstream system was tested.")
+    success_rate: float = Field(default=0.0, description="Deterministic estimate of how many records would import successfully.")
+    finding_count: int = Field(default=0, description="How many concrete blockers or risks were found.")
+    findings: list[DryRunFinding] = Field(default_factory=list, description="Structured findings from the simulated sync.")
+    summary: str = Field(default="", description="Short narrative summary of the dry-run result.")
+    generated_at_step: int = Field(default=0, description="Step on which the report was generated.")
+class DownstreamHealth(BaseModel):
+    """Operational health estimates for downstream systems."""
+    crm_sync_success_rate: float = Field(default=0.0, description="Estimated CRM import success rate.")
+    billing_link_integrity: float = Field(default=0.0, description="Estimated correctness of billing/customer linkages.")
+    duplicate_contact_risk: float = Field(default=0.0, description="Estimated risk that duplicate contacts still remain.")
+    revenue_reporting_risk: float = Field(default=0.0, description="Estimated risk of duplicate or mislinked revenue facts.")
+    overall_health_score: float = Field(default=0.0, description="Composite downstream health score used for reward shaping.")
+class RiskCard(BaseModel):
+    """A compact operational risk summary derived from downstream health."""
+    title: str = Field(..., description="Short risk title.")
+    detail: str = Field(..., description="Why this risk matters operationally.")
+    severity: Literal["low", "medium", "high"] = Field(..., description="Severity for UI and agent prioritization.")
+    metric_name: str = Field(..., description="Downstream metric represented by this card.")
+    current_value: float = Field(default=0.0, description="Current metric or risk value in [0, 1].")
+    recommended_action_ids: list[str] = Field(default_factory=list, description="Operations likely to improve this risk.")
+class ActionCostEntry(BaseModel):
+    """Estimated operational cost of taking an action."""
+    action_key: str = Field(..., description="Stable action or risk key.")
+    estimated_cost: float = Field(default=0.0, description="Relative action cost used in reward shaping.")
+    description: str = Field(default="", description="Why this action costs reviewer or system capacity.")
 class TableSummary(BaseModel):
     """Compact summary of a table."""
 class DataCleaningAction(Action):
     """Action model for the environment."""
+    action_type: Literal["inspect_table", "inspect_operation", "apply_operation", "request_review", "run_sync_dry_run", "submit"] = Field(..., description="Type of action to perform.")
     table_name: str | None = Field(default=None, description="Table to inspect when action_type=inspect_table.")
     operation_id: str | None = Field(default=None, description="Operation to inspect or apply when action_type is inspect_operation or apply_operation.")
+    entity_type: str | None = Field(default=None, description="Entity type to review when action_type=request_review.")
+    entity_id: str | None = Field(default=None, description="Entity identifier to review when action_type=request_review.")
+    target_system: Literal["crm", "billing"] | None = Field(default=None, description="Downstream system to simulate when action_type=run_sync_dry_run.")
+    reason_code: str | None = Field(default=None, description="Reason for escalating a review request.")
     reasoning: str = Field(default="", description="Optional natural-language reasoning for debugging baselines.")
     quality_score: float = Field(default=0.0, description="Current deterministic grader score.")
     best_score: float = Field(default=0.0, description="Best score seen in the current episode.")
     remaining_steps: int = Field(default=0, description="How many actions remain before truncation.")
+    review_budget_remaining: int = Field(default=0, description="How many human-review requests remain in the current episode.")
+    supported_sync_targets: list[str] = Field(default_factory=list, description="Downstream systems that can be tested with run_sync_dry_run.")
+    downstream_health: DownstreamHealth = Field(default_factory=DownstreamHealth, description="Current operational health estimates for downstream systems.")
+    risk_cards: list[RiskCard] = Field(default_factory=list, description="Operational risk summaries derived from downstream health.")
+    last_dry_run: DryRunReport | None = Field(default=None, description="Most recent downstream dry-run result, if any.")
+    action_costs: list[ActionCostEntry] = Field(default_factory=list, description="Estimated cost of each action family.")
     table_summaries: list[TableSummary] = Field(default_factory=list, description="Compact summaries of all tables.")
     focus_table: TableView | None = Field(default=None, description="Detailed contents for the currently inspected table.")
     available_operations: list[OperationSummary] = Field(default_factory=list, description="Available cleaning actions.")
+    available_review_targets: list[ReviewTarget] = Field(default_factory=list, description="Entities that can be escalated for deterministic review.")
+    pending_reviews: list[PendingReview] = Field(default_factory=list, description="Review requests that have been queued but not yet resolved.")
+    resolved_reviews: list[ReviewResolution] = Field(default_factory=list, description="Resolved review responses available to the agent.")
     focus_operation: OperationDetail | None = Field(default=None, description="Detailed preview for the currently inspected operation.")
     validation_issues: list[ValidationIssue] = Field(default_factory=list, description="Current unresolved validation issues.")
     issue_cards: list[IssueCard] = Field(default_factory=list, description="Aggregated issue cards with suggested next actions.")
     difficulty: Literal["easy", "medium", "hard"] = Field(..., description="Current task difficulty.")
     requested_seed: int | None = Field(default=None, description="Seed used when resetting the current episode.")
     max_steps: int = Field(..., description="Task step budget.")
+    review_budget_total: int = Field(default=0, description="Total number of review requests available in this task.")
+    review_budget_remaining: int = Field(default=0, description="Remaining number of review requests available in this task.")
     submitted: bool = Field(default=False, description="Whether submit was called.")
     current_score: float = Field(default=0.0, description="Current deterministic grader score.")
     best_score: float = Field(default=0.0, description="Best score achieved this episode.")
     outstanding_issue_count: int = Field(default=0, description="Number of unresolved validation issues.")
+    downstream_health: DownstreamHealth = Field(default_factory=DownstreamHealth, description="Current downstream operational health.")
+    last_dry_run: DryRunReport | None = Field(default=None, description="Most recent downstream dry-run result.")
     tables: dict[str, list[dict[str, str]]] = Field(default_factory=dict, description="Current mutable table contents.")
     applied_operation_ids: list[str] = Field(default_factory=list, description="Operations already applied.")
     inspected_tables: list[str] = Field(default_factory=list, description="Tables inspected so far.")
     inspected_operations: list[str] = Field(default_factory=list, description="Operations inspected so far.")
+    requested_review_ids: list[str] = Field(default_factory=list, description="Review cases already requested in this episode.")
+    pending_reviews: list[PendingReview] = Field(default_factory=list, description="Queued review requests awaiting deterministic responses.")
+    resolved_reviews: list[ReviewResolution] = Field(default_factory=list, description="Resolved review responses available to the agent.")
+    dry_run_targets: list[str] = Field(default_factory=list, description="Downstream targets that have already been dry-run in this episode.")
     recent_history: list[str] = Field(default_factory=list, description="Recent action log.")
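
Review note: the new `DataCleaningAction` fields are all optional at the model level, so a `request_review` with no `entity_id` would still parse. Below is a minimal sketch of a per-action completeness check. The field names come from the diff above, but the `REQUIRED_FIELDS` mapping and the `missing_fields` helper are hypothetical; which fields `environment.py` actually enforces is not shown here.

```python
from typing import Any

# Hypothetical mapping: field names are taken from DataCleaningAction,
# but which ones the environment really requires is an assumption.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "inspect_table": ("table_name",),
    "inspect_operation": ("operation_id",),
    "apply_operation": ("operation_id",),
    "request_review": ("entity_type", "entity_id"),
    "run_sync_dry_run": ("target_system",),
    "submit": (),
}


def missing_fields(action: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent or None for this action."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if action.get(name) is None]


# A dry run without a target is incomplete; a fully specified review is not.
print(missing_fields({"action_type": "run_sync_dry_run"}))
print(missing_fields({"action_type": "request_review", "entity_type": "customer", "entity_id": "CU101"}))
```

A check like this could run before `env.step` in a baseline script so malformed model replies are rejected without spending a step.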

cleanops_env/tasks.py CHANGED

@@ -98,6 +98,20 @@ class OperationSpec:
     transform: TransformFn
 @dataclass(frozen=True)
 class TaskSpec:
     task_id: str
@@ -106,6 +120,8 @@ class TaskSpec:
     objective: str
     dataset_context: str
     max_steps: int
     primary_keys: dict[str, str]
     duplicate_identity_columns: dict[str, tuple[str, ...]]
     dirty_tables: Tables
@@ -114,6 +130,7 @@ class TaskSpec:
     operations: dict[str, OperationSpec]
     solution_operation_ids: tuple[str, ...]
     issue_cards: tuple[IssueCard, ...]
 def clone_tables(tables: Tables) -> Tables:
@@ -353,6 +370,8 @@ def _task_from_solution(
     objective: str,
     dataset_context: str,
     max_steps: int,
     primary_keys: dict[str, str],
     duplicate_identity_columns: dict[str, tuple[str, ...]],
     dirty_tables: Tables,
@@ -360,6 +379,7 @@ def _task_from_solution(
     operations: dict[str, OperationSpec],
     solution_operation_ids: tuple[str, ...],
     issue_cards: tuple[IssueCard, ...],
 ) -> TaskSpec:
     gold_tables = clone_tables(dirty_tables)
     for operation_id in solution_operation_ids:
@@ -371,6 +391,8 @@ def _task_from_solution(
         objective=objective,
         dataset_context=dataset_context,
         max_steps=max_steps,
         primary_keys=primary_keys,
         duplicate_identity_columns=duplicate_identity_columns,
         dirty_tables=dirty_tables,
@@ -379,6 +401,7 @@ def _task_from_solution(
         operations=operations,
         solution_operation_ids=solution_operation_ids,
         issue_cards=issue_cards,
     )
@@ -418,6 +441,20 @@ def _build_easy_task() -> TaskSpec:
         IssueCard(title="A missing state value blocks validation", detail="One customer record has city information but no state code.", issue_codes=["required:customers.state"], recommended_operation_ids=["easy_fill_state_from_city"]),
         IssueCard(title="Duplicate customer identities exist", detail="Two rows refer to the same customer once emails are normalized.", issue_codes=["unique:customers.email"], recommended_operation_ids=["easy_merge_customers_by_email"]),
     )
     return _task_from_solution(
         task_id="customer_contacts_easy",
         title="Customer Contacts Standardization",
@@ -425,6 +462,8 @@ def _build_easy_task() -> TaskSpec:
         objective="Prepare a customer-contact export for CRM import by standardizing contact fields, filling one missing state, and merging duplicate customer rows without deleting valid inactive accounts.",
         dataset_context="This table simulates a weekly B2B CRM export that sales ops cleans before loading into a customer system.",
         max_steps=10,
         primary_keys={"customers": "customer_id"},
         duplicate_identity_columns={"customers": ("email",)},
         dirty_tables=dirty_tables,
@@ -432,6 +471,7 @@ def _build_easy_task() -> TaskSpec:
         operations=operations,
         solution_operation_ids=("easy_normalize_names", "easy_normalize_emails", "easy_normalize_phones", "easy_normalize_states", "easy_fill_state_from_city", "easy_merge_customers_by_email"),
         issue_cards=issue_cards,
     )
@@ -477,6 +517,20 @@ def _build_medium_task() -> TaskSpec:
         IssueCard(title="Shipping state labels are not canonical", detail="Downstream warehouse tools require two-letter state abbreviations.", issue_codes=["enum:orders.shipping_state"], recommended_operation_ids=["med_normalize_shipping_states"]),
         IssueCard(title="A duplicated order row exists", detail="One record is a second export copy of another order.", issue_codes=["unique:orders.order_id"], recommended_operation_ids=["med_dedupe_orders"]),
     )
     return _task_from_solution(
         task_id="orders_reconciliation_medium",
         title="E-commerce Order Reconciliation",
@@ -484,6 +538,8 @@ def _build_medium_task() -> TaskSpec:
         objective="Clean a transactional orders export by normalizing dates, money, statuses, and shipping states while deduplicating repeated order exports without deleting legitimate cancelled orders.",
         dataset_context="This table simulates a daily order extract from an e-commerce platform that revenue ops must reconcile before BI ingestion.",
         max_steps=12,
         primary_keys={"orders": "order_id"},
         duplicate_identity_columns={"orders": ("order_id",)},
         dirty_tables=dirty_tables,
@@ -491,6 +547,7 @@ def _build_medium_task() -> TaskSpec:
         operations=operations,
         solution_operation_ids=("med_normalize_dates", "med_normalize_currency_amounts", "med_normalize_order_statuses", "med_normalize_shipping_states", "med_dedupe_orders"),
         issue_cards=issue_cards,
     )
@@ -571,6 +628,32 @@ def _build_hard_task() -> TaskSpec:
         IssueCard(title="Subscription and payment facts use inconsistent formats", detail="Plans, statuses, dates, amounts, and currency values need canonicalization before loading.", issue_codes=["enum:subscriptions.plan_code", "enum:subscriptions.status", "pattern:subscriptions.renewal_date", "pattern:payments.amount", "enum:payments.payment_status", "pattern:payments.paid_at"], recommended_operation_ids=["hard_normalize_subscriptions", "hard_normalize_payments"]),
         IssueCard(title="Duplicate payment facts are present", detail="Two payment rows represent the same invoice settlement and one should be removed.", issue_codes=["unique:payments.customer_email+subscription_id+amount+paid_at"], recommended_operation_ids=["hard_remove_duplicate_payments"]),
     )
     return _task_from_solution(
         task_id="crm_migration_hard",
         title="CRM Migration Referential Cleanup",
@@ -578,6 +661,8 @@ def _build_hard_task() -> TaskSpec:
         objective="Repair a three-table CRM migration extract by standardizing customer, subscription, and payment data; merging duplicate customers; fixing foreign keys from email joins; and removing duplicate payment facts without dropping legitimate orphan-like child rows.",
         dataset_context="This dataset simulates a SaaS CRM and billing migration where a team must clean customer master data and child ledger references before import.",
         max_steps=18,
         primary_keys={"customers": "customer_id", "subscriptions": "subscription_id", "payments": "payment_id"},
         duplicate_identity_columns={"customers": ("email",), "subscriptions": ("subscription_id",), "payments": ("customer_email", "subscription_id", "amount", "paid_at")},
         dirty_tables=dirty_tables,
@@ -585,6 +670,7 @@ def _build_hard_task() -> TaskSpec:
         operations=operations,
         solution_operation_ids=("hard_normalize_customer_fields", "hard_merge_customers_by_email", "hard_normalize_subscriptions", "hard_repair_subscription_customer_refs", "hard_normalize_payments", "hard_repair_payment_customer_refs", "hard_remove_duplicate_payments"),
         issue_cards=issue_cards,
     )

     transform: TransformFn
+@dataclass(frozen=True)
+class ReviewCaseSpec:
+    review_id: str
+    entity_type: str
+    entity_id: str
+    reason_code: str
+    title: str
+    detail: str
+    resolution: str
+    response_summary: str
+    evidence_summary: str
+    recommended_operation_ids: tuple[str, ...] = ()
 @dataclass(frozen=True)
 class TaskSpec:
     task_id: str
     objective: str
     dataset_context: str
     max_steps: int
+    review_budget: int
+    sync_targets: tuple[str, ...]
     primary_keys: dict[str, str]
     duplicate_identity_columns: dict[str, tuple[str, ...]]
     dirty_tables: Tables
     operations: dict[str, OperationSpec]
     solution_operation_ids: tuple[str, ...]
     issue_cards: tuple[IssueCard, ...]
+    review_cases: dict[str, ReviewCaseSpec]
 def clone_tables(tables: Tables) -> Tables:
     objective: str,
     dataset_context: str,
     max_steps: int,
+    review_budget: int,
+    sync_targets: tuple[str, ...],
     primary_keys: dict[str, str],
     duplicate_identity_columns: dict[str, tuple[str, ...]],
     dirty_tables: Tables,
     operations: dict[str, OperationSpec],
     solution_operation_ids: tuple[str, ...],
     issue_cards: tuple[IssueCard, ...],
+    review_cases: dict[str, ReviewCaseSpec],
 ) -> TaskSpec:
     gold_tables = clone_tables(dirty_tables)
     for operation_id in solution_operation_ids:
         objective=objective,
         dataset_context=dataset_context,
         max_steps=max_steps,
+        review_budget=review_budget,
+        sync_targets=sync_targets,
         primary_keys=primary_keys,
         duplicate_identity_columns=duplicate_identity_columns,
         dirty_tables=dirty_tables,
         operations=operations,
         solution_operation_ids=solution_operation_ids,
         issue_cards=issue_cards,
+        review_cases=review_cases,
     )
         IssueCard(title="A missing state value blocks validation", detail="One customer record has city information but no state code.", issue_codes=["required:customers.state"], recommended_operation_ids=["easy_fill_state_from_city"]),
         IssueCard(title="Duplicate customer identities exist", detail="Two rows refer to the same customer once emails are normalized.", issue_codes=["unique:customers.email"], recommended_operation_ids=["easy_merge_customers_by_email"]),
     )
+    review_cases = {
+        "easy_customer_duplicate_review": ReviewCaseSpec(
+            review_id="easy_customer_duplicate_review",
+            entity_type="customer",
+            entity_id="C005",
+            reason_code="possible_duplicate",
+            title="Confirm duplicate customer merge",
+            detail="Alice Johnson appears twice with status conflicts after email normalization.",
+            resolution="merge_confirmed_keep_c001",
+            response_summary="Merge C005 into C001. Keep the active account record and preserve inactive customers elsewhere in the file.",
+            evidence_summary="Normalized emails match and both rows describe the same Nashville customer; C001 is the canonical CRM ID.",
+            recommended_operation_ids=("easy_merge_customers_by_email",),
+        )
+    }
     return _task_from_solution(
         task_id="customer_contacts_easy",
         title="Customer Contacts Standardization",
         objective="Prepare a customer-contact export for CRM import by standardizing contact fields, filling one missing state, and merging duplicate customer rows without deleting valid inactive accounts.",
         dataset_context="This table simulates a weekly B2B CRM export that sales ops cleans before loading into a customer system.",
         max_steps=10,
+        review_budget=1,
+        sync_targets=("crm",),
         primary_keys={"customers": "customer_id"},
         duplicate_identity_columns={"customers": ("email",)},
         dirty_tables=dirty_tables,
         operations=operations,
         solution_operation_ids=("easy_normalize_names", "easy_normalize_emails", "easy_normalize_phones", "easy_normalize_states", "easy_fill_state_from_city", "easy_merge_customers_by_email"),
         issue_cards=issue_cards,
+        review_cases=review_cases,
     )
         IssueCard(title="Shipping state labels are not canonical", detail="Downstream warehouse tools require two-letter state abbreviations.", issue_codes=["enum:orders.shipping_state"], recommended_operation_ids=["med_normalize_shipping_states"]),
         IssueCard(title="A duplicated order row exists", detail="One record is a second export copy of another order.", issue_codes=["unique:orders.order_id"], recommended_operation_ids=["med_dedupe_orders"]),
     )
+    review_cases = {
+        "med_returned_order_review": ReviewCaseSpec(
+            review_id="med_returned_order_review",
+            entity_type="order",
+            entity_id="O1005",
+            reason_code="preserve_operational_record",
+            title="Confirm whether returned order should be retained",
+            detail="Returned orders often look removable during cleanup, but finance may still require them.",
+            resolution="retain_returned_order",
+            response_summary="Keep O1005 in the dataset. Normalize it, but do not delete returned or cancelled orders for this reconciliation task.",
+            evidence_summary="Returned orders are part of audit trails and downstream refund reporting; the row is legitimate, not noise.",
+            recommended_operation_ids=("med_normalize_dates", "med_normalize_currency_amounts", "med_normalize_order_statuses"),
+        )
+    }
     return _task_from_solution(
         task_id="orders_reconciliation_medium",
         title="E-commerce Order Reconciliation",
         objective="Clean a transactional orders export by normalizing dates, money, statuses, and shipping states while deduplicating repeated order exports without deleting legitimate cancelled orders.",
         dataset_context="This table simulates a daily order extract from an e-commerce platform that revenue ops must reconcile before BI ingestion.",
         max_steps=12,
+        review_budget=1,
+        sync_targets=("crm", "billing"),
         primary_keys={"orders": "order_id"},
         duplicate_identity_columns={"orders": ("order_id",)},
         dirty_tables=dirty_tables,
         operations=operations,
         solution_operation_ids=("med_normalize_dates", "med_normalize_currency_amounts", "med_normalize_order_statuses", "med_normalize_shipping_states", "med_dedupe_orders"),
         issue_cards=issue_cards,
+        review_cases=review_cases,
     )
         IssueCard(title="Subscription and payment facts use inconsistent formats", detail="Plans, statuses, dates, amounts, and currency values need canonicalization before loading.", issue_codes=["enum:subscriptions.plan_code", "enum:subscriptions.status", "pattern:subscriptions.renewal_date", "pattern:payments.amount", "enum:payments.payment_status", "pattern:payments.paid_at"], recommended_operation_ids=["hard_normalize_subscriptions", "hard_normalize_payments"]),
         IssueCard(title="Duplicate payment facts are present", detail="Two payment rows represent the same invoice settlement and one should be removed.", issue_codes=["unique:payments.customer_email+subscription_id+amount+paid_at"], recommended_operation_ids=["hard_remove_duplicate_payments"]),
     )
+    review_cases = {
+        "hard_customer_merge_review": ReviewCaseSpec(
+            review_id="hard_customer_merge_review",
+            entity_type="customer",
+            entity_id="CU101",
+            reason_code="possible_duplicate",
+            title="Confirm duplicate customer merge",
+            detail="CU100 and CU101 normalize to the same email, but child tables disagree on which customer ID is canonical.",
+            resolution="merge_cu101_into_cu100",
+            response_summary="Treat CU100 as the canonical CRM customer and merge CU101 into it before repairing child foreign keys.",
+            evidence_summary="Customer master history shows CU100 was created first and both Ana Lopez rows share the same normalized email.",
+            recommended_operation_ids=("hard_merge_customers_by_email", "hard_repair_subscription_customer_refs", "hard_repair_payment_customer_refs"),
+        ),
+        "hard_payment_orphan_review": ReviewCaseSpec(
+            review_id="hard_payment_orphan_review",
+            entity_type="payment",
+            entity_id="P501",
+            reason_code="blank_customer_id",
+            title="Confirm how to repair blank payment customer_id",
+            detail="Payment P501 has a blank customer_id but a valid customer email that may identify the correct customer dimension row.",
+            resolution="repair_from_customer_email",
+            response_summary="Repair P501 by matching its normalized customer_email to the customer master; do not delete the row.",
+            evidence_summary="The billing export preserved ben.carter@example.com, so the customer foreign key can be restored deterministically.",
+            recommended_operation_ids=("hard_normalize_payments", "hard_repair_payment_customer_refs"),
+        ),
+    }
     return _task_from_solution(
         task_id="crm_migration_hard",
         title="CRM Migration Referential Cleanup",
         objective="Repair a three-table CRM migration extract by standardizing customer, subscription, and payment data; merging duplicate customers; fixing foreign keys from email joins; and removing duplicate payment facts without dropping legitimate orphan-like child rows.",
         dataset_context="This dataset simulates a SaaS CRM and billing migration where a team must clean customer master data and child ledger references before import.",
         max_steps=18,
+        review_budget=2,
+        sync_targets=("crm", "billing"),
         primary_keys={"customers": "customer_id", "subscriptions": "subscription_id", "payments": "payment_id"},
         duplicate_identity_columns={"customers": ("email",), "subscriptions": ("subscription_id",), "payments": ("customer_email", "subscription_id", "amount", "paid_at")},
         dirty_tables=dirty_tables,
         operations=operations,
         solution_operation_ids=("hard_normalize_customer_fields", "hard_merge_customers_by_email", "hard_normalize_subscriptions", "hard_repair_subscription_customer_refs", "hard_normalize_payments", "hard_repair_payment_customer_refs", "hard_remove_duplicate_payments"),
         issue_cards=issue_cards,
+        review_cases=review_cases,
     )
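
Review note: each task now carries a `review_cases` dict keyed by `review_id`, while the action only supplies `entity_type`, `entity_id`, and an optional `reason_code`. The sketch below shows one way such a request could be resolved to a case. `ReviewCase` is a trimmed stand-in for `ReviewCaseSpec`, and `find_review_case` is a hypothetical helper; the real matching logic lives in `environment.py`, which is not part of this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewCase:
    """Trimmed stand-in for ReviewCaseSpec; only the lookup fields are kept."""
    review_id: str
    entity_type: str
    entity_id: str
    reason_code: str


def find_review_case(
    cases: dict[str, ReviewCase],
    entity_type: str,
    entity_id: str,
    reason_code: Optional[str] = None,
) -> Optional[ReviewCase]:
    """Return the first case matching the entity; reason_code is checked only if given."""
    for case in cases.values():
        if case.entity_type != entity_type or case.entity_id != entity_id:
            continue
        if reason_code is not None and case.reason_code != reason_code:
            continue
        return case
    return None


# Identifiers copied from the hard task's review_cases in the diff above.
hard_cases = {
    "hard_customer_merge_review": ReviewCase("hard_customer_merge_review", "customer", "CU101", "possible_duplicate"),
    "hard_payment_orphan_review": ReviewCase("hard_payment_orphan_review", "payment", "P501", "blank_customer_id"),
}
match = find_review_case(hard_cases, "customer", "CU101", "possible_duplicate")
print(match.review_id if match else None)
```

Treating `reason_code` as optional in the lookup is an assumption that follows from its `str | None` type on the action model.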

inference.py CHANGED

@@ -33,12 +33,18 @@ SYSTEM_PROMPT = textwrap.dedent(
     You are a data-cleaning operations agent working in the CleanOps OpenEnv benchmark.
     Choose exactly one JSON action per turn using this schema:
     {
-      "action_type": "inspect_table" | "inspect_operation" | "apply_operation" | "submit",
       "table_name": string | null,
       "operation_id": string | null,
       "reasoning": string
     }
     Prefer safe/review operations that directly resolve current validation issues.
     Avoid destructive operations unless the task objective explicitly asks for deletions.
     Submit once quality_score is high and remaining validation issues are gone.
     Return only a single JSON object.
@@ -68,6 +74,15 @@ def build_observation_prompt(observation: DataCleaningObservation) -> str:
         "objective": observation.objective,
         "quality_score": observation.quality_score,
         "remaining_steps": observation.remaining_steps,
         "table_summaries": [summary.model_dump() for summary in observation.table_summaries],
         "focus_table": observation.focus_table.model_dump() if observation.focus_table else None,
         "focus_operation": observation.focus_operation.model_dump() if observation.focus_operation else None,
@@ -121,6 +136,10 @@ def action_to_string(action: DataCleaningAction) -> str:
         return f"inspect_operation({action.operation_id})"
     if action.action_type == "apply_operation":
         return f"apply_operation({action.operation_id})"
     return "submit()"

     You are a data-cleaning operations agent working in the CleanOps OpenEnv benchmark.
     Choose exactly one JSON action per turn using this schema:
     {
+      "action_type": "inspect_table" | "inspect_operation" | "apply_operation" | "request_review" | "run_sync_dry_run" | "submit",
       "table_name": string | null,
       "operation_id": string | null,
+      "entity_type": string | null,
+      "entity_id": string | null,
+      "target_system": "crm" | "billing" | null,
+      "reason_code": string | null,
       "reasoning": string
     }
     Prefer safe/review operations that directly resolve current validation issues.
+    Use request_review when the environment flags an ambiguous merge or repair decision.
+    Use run_sync_dry_run before submit on medium and hard tasks when downstream risk still looks material.
     Avoid destructive operations unless the task objective explicitly asks for deletions.
     Submit once quality_score is high and remaining validation issues are gone.
     Return only a single JSON object.
         "objective": observation.objective,
         "quality_score": observation.quality_score,
         "remaining_steps": observation.remaining_steps,
+        "review_budget_remaining": observation.review_budget_remaining,
+        "supported_sync_targets": observation.supported_sync_targets,
+        "downstream_health": observation.downstream_health.model_dump(),
+        "risk_cards": [risk_card.model_dump() for risk_card in observation.risk_cards],
+        "available_review_targets": [target.model_dump() for target in observation.available_review_targets],
+        "pending_reviews": [review.model_dump() for review in observation.pending_reviews],
+        "resolved_reviews": [review.model_dump() for review in observation.resolved_reviews],
+        "last_dry_run": observation.last_dry_run.model_dump() if observation.last_dry_run else None,
+        "action_costs": [entry.model_dump() for entry in observation.action_costs],
         "table_summaries": [summary.model_dump() for summary in observation.table_summaries],
         "focus_table": observation.focus_table.model_dump() if observation.focus_table else None,
         "focus_operation": observation.focus_operation.model_dump() if observation.focus_operation else None,
         return f"inspect_operation({action.operation_id})"
     if action.action_type == "apply_operation":
         return f"apply_operation({action.operation_id})"
+    if action.action_type == "request_review":
+        return f"request_review({action.entity_type},{action.entity_id},{action.reason_code})"
+    if action.action_type == "run_sync_dry_run":
+        return f"run_sync_dry_run({action.target_system})"
     return "submit()"
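
Review note: to see what the two new log formats look like for a model reply, here is a dict-based mirror of `action_to_string`. It is only a sketch: it works on the parsed JSON rather than on `DataCleaningAction`, and the `inspect_table(...)` branch is assumed, since that part of the function falls outside the hunk shown above.

```python
import json
from typing import Any


def action_to_string(action: dict[str, Any]) -> str:
    """Dict-based mirror of the log formatter in the diff above."""
    kind = action.get("action_type")
    if kind == "inspect_table":  # assumed format; not visible in the hunk
        return f"inspect_table({action.get('table_name')})"
    if kind == "inspect_operation":
        return f"inspect_operation({action.get('operation_id')})"
    if kind == "apply_operation":
        return f"apply_operation({action.get('operation_id')})"
    if kind == "request_review":
        return f"request_review({action.get('entity_type')},{action.get('entity_id')},{action.get('reason_code')})"
    if kind == "run_sync_dry_run":
        return f"run_sync_dry_run({action.get('target_system')})"
    return "submit()"


# Example reply following the JSON schema in SYSTEM_PROMPT.
reply = (
    '{"action_type": "request_review", "entity_type": "payment", '
    '"entity_id": "P501", "reason_code": "blank_customer_id", '
    '"reasoning": "Confirm the repair path before touching the row."}'
)
print(action_to_string(json.loads(reply)))
```

Note that the fall-through `return "submit()"` also catches any unrecognized `action_type`, so a typo in the model reply would be logged as a submit.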

scripts/run_openai_baseline.py CHANGED

@@ -23,13 +23,19 @@ SYSTEM_PROMPT = """You are a careful data-cleaning operations agent.
 Your job is to improve the current task score by choosing one JSON action at a time.
 Use only this JSON schema:
 {
-  "action_type": "inspect_table" | "inspect_operation" | "apply_operation" | "submit",
   "table_name": string | null,
   "operation_id": string | null,
   "reasoning": string
 }
 Rules:
 - Prefer safe/review operations that directly address unresolved validation issues.
 - Avoid destructive operations unless the objective explicitly asks for row deletion.
 - Call submit only when the data looks clean or there is 1 step left.
 - Return a single JSON object and no extra text."""
@@ -44,6 +50,15 @@ def compact_observation(observation: DataCleaningObservation) -> dict[str, Any]:
         "dataset_context": observation.dataset_context,
         "quality_score": observation.quality_score,
         "remaining_steps": observation.remaining_steps,
         "last_action_status": observation.last_action_status,
         "recent_history": observation.recent_history[-5:],
         "table_summaries": [summary.model_dump() for summary in observation.table_summaries],
@@ -148,4 +163,3 @@ def main() -> None:
 if __name__ == "__main__":
     main()

 Your job is to improve the current task score by choosing one JSON action at a time.
 Use only this JSON schema:
 {
+  "action_type": "inspect_table" | "inspect_operation" | "apply_operation" | "request_review" | "run_sync_dry_run" | "submit",
   "table_name": string | null,
   "operation_id": string | null,
+  "entity_type": string | null,
+  "entity_id": string | null,
+  "target_system": "crm" | "billing" | null,
+  "reason_code": string | null,
   "reasoning": string
 }
 Rules:
 - Prefer safe/review operations that directly address unresolved validation issues.
+- Use request_review when an ambiguous merge or foreign-key repair needs confirmation.
+- Use run_sync_dry_run before submit when downstream health is still weak.
 - Avoid destructive operations unless the objective explicitly asks for row deletion.
 - Call submit only when the data looks clean or there is 1 step left.
 - Return a single JSON object and no extra text."""
         "dataset_context": observation.dataset_context,
         "quality_score": observation.quality_score,
         "remaining_steps": observation.remaining_steps,
+        "review_budget_remaining": observation.review_budget_remaining,
+        "supported_sync_targets": observation.supported_sync_targets,
+        "downstream_health": observation.downstream_health.model_dump(),
+        "risk_cards": [risk_card.model_dump() for risk_card in observation.risk_cards],
+        "available_review_targets": [target.model_dump() for target in observation.available_review_targets],
+        "pending_reviews": [review.model_dump() for review in observation.pending_reviews],
+        "resolved_reviews": [review.model_dump() for review in observation.resolved_reviews],
+        "last_dry_run": observation.last_dry_run.model_dump() if observation.last_dry_run else None,
+        "action_costs": [entry.model_dump() for entry in observation.action_costs],
         "last_action_status": observation.last_action_status,
         "recent_history": observation.recent_history[-5:],
         "table_summaries": [summary.model_dump() for summary in observation.table_summaries],
 if __name__ == "__main__":
     main()

tests/test_environment.py CHANGED

@@ -11,6 +11,10 @@ def test_reset_step_state_api() -> None:
     observation = env.reset(task_id="customer_contacts_easy", seed=7)
     assert observation.task_id == "customer_contacts_easy"
     assert observation.requested_seed == 7
     assert observation.done is False
     assert observation.quality_score < 1.0
@@ -59,3 +63,94 @@ def test_seed_changes_visible_preview_rows() -> None:
     assert observation_seed_2.requested_seed == 2
     assert observation_seed_7.requested_seed == 7
     assert preview_seed_2 != preview_seed_7

     observation = env.reset(task_id="customer_contacts_easy", seed=7)
     assert observation.task_id == "customer_contacts_easy"
     assert observation.requested_seed == 7
+    assert observation.review_budget_remaining == 1
+    assert observation.supported_sync_targets == ["crm"]
+    assert len(observation.available_review_targets) == 1
+    assert 0.0 < observation.downstream_health.overall_health_score < 1.0
     assert observation.done is False
     assert observation.quality_score < 1.0
     assert observation_seed_2.requested_seed == 2
     assert observation_seed_7.requested_seed == 7
     assert preview_seed_2 != preview_seed_7
+def test_request_review_queues_and_releases_deterministic_response() -> None:
+    env = LocalCleanOpsEnv()
+    observation = env.reset(task_id="crm_migration_hard", seed=7)
+    assert observation.review_budget_remaining == 2
+    assert len(observation.pending_reviews) == 0
+    assert len(observation.resolved_reviews) == 0
+    observation, reward, done, info = env.step(
+        DataCleaningAction(
+            action_type="request_review",
+            entity_type="customer",
+            entity_id="CU101",
+            reason_code="possible_duplicate",
+            reasoning="Escalate the ambiguous Ana Lopez duplicate before merging.",
+        )
+    )
+    assert done is False
+    assert reward < 0.0
+    assert observation.review_budget_remaining == 1
+    assert len(observation.pending_reviews) == 1
+    assert len(observation.resolved_reviews) == 0
+    assert "response will be available on the next step" in observation.last_action_status
+    assert info["state"]["requested_review_ids"] == ["hard_customer_merge_review"]
+    observation, reward, done, _ = env.step(
+        DataCleaningAction(
+            action_type="inspect_table",
+            table_name="customers",
+            reasoning="Read the customer table again after the review response arrives.",
+        )
+    )
+    assert done is False
+    assert reward > 0.0
+    assert len(observation.pending_reviews) == 0
+    assert len(observation.resolved_reviews) == 1
+    resolved_review = observation.resolved_reviews[0]
+    assert resolved_review.review_id == "hard_customer_merge_review"
+    assert "hard_merge_customers_by_email" in resolved_review.recommended_operation_ids
+    assert "Review response available" in observation.last_action_status
+
+
+def test_run_sync_dry_run_surfaces_downstream_findings() -> None:
+    env = LocalCleanOpsEnv()
+    observation = env.reset(task_id="crm_migration_hard", seed=7)
+    starting_health = observation.downstream_health.overall_health_score
+    observation, reward, done, info = env.step(
+        DataCleaningAction(
+            action_type="run_sync_dry_run",
+            target_system="billing",
+            reasoning="Check whether the current migration state would break downstream billing.",
+        )
+    )
+    assert done is False
+    assert observation.last_dry_run is not None
+    assert observation.last_dry_run.target_system == "billing"
+    assert observation.last_dry_run.finding_count > 0
+    assert observation.last_dry_run.success_rate == observation.downstream_health.billing_link_integrity
+    assert "billing" in info["state"]["dry_run_targets"]
+    assert observation.downstream_health.overall_health_score == starting_health
+
+
+def test_duplicate_review_request_is_penalized() -> None:
+    env = LocalCleanOpsEnv()
+    env.reset(task_id="customer_contacts_easy", seed=7)
+    env.step(
+        DataCleaningAction(
+            action_type="request_review",
+            entity_type="customer",
+            entity_id="C005",
+            reason_code="possible_duplicate",
+            reasoning="Ask for confirmation once.",
+        )
+    )
+    observation, reward, done, _ = env.step(
+        DataCleaningAction(
+            action_type="request_review",
+            entity_type="customer",
+            entity_id="C005",
+            reason_code="possible_duplicate",
+            reasoning="Repeat the same review request.",
+        )
+    )
+    assert done is False
+    assert reward < 0.0
+    assert observation.review_budget_remaining == 0
+    assert len(observation.pending_reviews) == 0
+    assert len(observation.resolved_reviews) == 1
+    assert "already requested" in observation.last_action_status
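
The review tests above pin down a queue-then-release contract: a `request_review` action spends one unit of budget and lands in `pending_reviews`, the response moves to `resolved_reviews` on the following step, and repeating the same request is rejected without spending budget. A minimal sketch of that bookkeeping follows. It assumes a hypothetical `ReviewQueue` class written only for illustration; it is not the implementation in `cleanops_env/environment.py`, and it leaves out reward calculation entirely.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    """Hypothetical stand-in for the environment's review bookkeeping."""

    budget: int
    pending: list[str] = field(default_factory=list)
    resolved: list[str] = field(default_factory=list)
    requested: set[str] = field(default_factory=set)

    def step(self, review_id: str | None = None) -> str:
        """Advance one step, optionally requesting a review by id."""
        # Reviews queued on an earlier step are released before the
        # current action is handled.
        released = bool(self.pending)
        self.resolved.extend(self.pending)
        self.pending.clear()

        if review_id is None:
            # Any non-review action, such as inspecting a table.
            return "Review response available" if released else "ok"
        if review_id in self.requested:
            # Duplicate request: rejected, and no budget is spent.
            return "already requested"
        if self.budget == 0:
            return "no review budget remaining"

        self.budget -= 1
        self.requested.add(review_id)
        self.pending.append(review_id)
        return "response will be available on the next step"
```

Releasing pending reviews at the start of every step, before the action itself is examined, is what lets the duplicate-request test observe both a resolved review and an `already requested` status on the same step.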