Spaces: yashmarathe/data-cleaning-openenv

yashmarathe committed on Mar 28

Commit

f380b90

0 Parent(s):

feat: Data Cleaning RL Environment for OpenEnv Hackathon

Complete OpenEnv-compliant RL environment where an agent cleans tabular
datasets (iris/adult/credit-g from OpenML) using structured commands.

- datasets.py: load and cache OpenML datasets with sampling for large sets
- noise_injector.py: deterministic seeded noise (missing, type errors,
duplicates, outliers, schema violations) across 3 difficulty tiers
- models.py: typed Pydantic models for actions, observations, episode state
- server/environment.py: episode management, all 7 action types, rewards
- grader.py: RandomForest-based bracketed normalization grader
- server/app.py: FastAPI with /reset /step /state /tasks /grader /baseline /health
- baseline.py: heuristic agent with HTTP and internal execution modes
- client.py: async httpx client wrapper
- openenv.yaml, Dockerfile, requirements.txt, pyproject.toml, README.md

Files changed (19)

.gitignore +40 -0
data_cleaning_env/.dockerignore +11 -0
data_cleaning_env/README.md +212 -0
data_cleaning_env/__init__.py +12 -0
data_cleaning_env/baseline.py +253 -0
data_cleaning_env/client.py +99 -0
data_cleaning_env/datasets.py +90 -0
data_cleaning_env/grader.py +161 -0
data_cleaning_env/models.py +157 -0
data_cleaning_env/noise_injector.py +115 -0
data_cleaning_env/openenv.yaml +35 -0
data_cleaning_env/pyproject.toml +28 -0
data_cleaning_env/server/Dockerfile +23 -0
data_cleaning_env/server/__init__.py +1 -0
data_cleaning_env/server/app.py +282 -0
data_cleaning_env/server/environment.py +371 -0
data_cleaning_env/server/requirements.txt +8 -0
docs/brainstorms/2026-03-27-data-cleaning-env-requirements.md +143 -0
docs/plans/2026-03-27-001-feat-data-cleaning-openenv-environment-plan.md +955 -0

.gitignore ADDED

	@@ -0,0 +1,40 @@

+# Python
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+.Python
+*.egg
+*.egg-info/
+dist/
+build/
+.eggs/
+.venv/
+venv/
+env/
+.env
+# scikit-learn / OpenML cache
+scikit_learn_data/
+# Pytest
+.pytest_cache/
+.coverage
+htmlcov/
+# Ruff
+.ruff_cache/
+# IDEs
+.vscode/
+.idea/
+*.swp
+*.swo
+# OS
+.DS_Store
+Thumbs.db
+# Docker
+*.log

data_cleaning_env/.dockerignore ADDED

	@@ -0,0 +1,11 @@

+__pycache__/
+*.py[cod]
+*.egg-info/
+.eggs/
+dist/
+build/
+.env
+outputs/
+*.log
+.pytest_cache/
+.coverage

data_cleaning_env/README.md ADDED

	@@ -0,0 +1,212 @@

+# Data Cleaning OpenEnv
+An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) RL environment where an AI agent cleans tabular datasets using structured commands. The agent is rewarded for improving data quality — measured by column-level accuracy each step and by downstream ML model accuracy at episode end.
+Built for the [Meta PyTorch OpenEnv Hackathon x SST](https://www.scaler.com/school-of-technology) Round 1.
+---
+## Environment Description
+The agent receives a dirty tabular dataset and must apply cleaning operations to restore it toward the original ground-truth data. Each episode ends either when the agent issues a `done` action or the step limit is reached. The episode is then scored by training a RandomForest classifier on the cleaned data and measuring accuracy improvement over the dirty baseline.
+**Datasets** (from [OpenML](https://openml.org)):
+| Task   | Dataset   | OpenML ID | Noise Types                | Max Steps |
+|----|--------|---------|-----------------------|-----------|
+| easy   | iris      | 61      | Missing values (15% of numeric cols)                | 20        |
+| medium | adult     | 1590      | Missing values (20%), type errors, duplicate rows (3%)            | 40        |
+| hard   | credit-g  | 31        | Missing values (25%), type errors, duplicates (5%), outliers, schema violations | 60 |
+Noise injection is **deterministic** (seeded at 42), ensuring reproducibility.
+---
+## Action Space
+Actions are typed JSON objects. The `action_type` field is always required.
+| Action          | Required Fields                       | Description                          |
+|---------------------|-----------------|---------------------|
+| `fill_missing`          | `column`, `strategy`                    | Fill NaN values. Strategy: `mean\|median\|mode\|constant` (`constant` takes its value from the optional `constant_value` field) |
+| `drop_duplicates`       | —                    | Remove all duplicate rows            |
+| `fix_type`              | `column`, `dtype`                   | Coerce column dtype. dtype: `int\|float\|str`  |
+| `normalize`         | `column`                        | Z-score normalize a numeric column           |
+| `drop_outliers`         | `column`, `method`                     | Remove outliers. method: `iqr\|zscore`         |
+| `fix_schema_violation`  | `column`, `constraint`                   | Fix constraint. constraint: `non_negative\|clamp_range` |
+| `done`                  | —               | Signal episode completion                      |
+**Example action JSON:**
+```json
+{
+  "action_type": "fill_missing",
+  "column": "age",
+  "strategy": "median"
+}
+```
+---
+## Observation Space
+Each `step()` and `reset()` call returns an observation with the following fields:
+| Field            | Type                      | Description                 |
+|----------------|----------------|-----------------|
+| `task`           | `str`             | Task tier: `easy`, `medium`, or `hard`        |
+| `step`      | `int`             | Current step number (0-indexed)                  |
+| `max_steps`      | `int`             | Maximum steps in this episode            |
+| `columns`        | `List[str]`                     | All column names                 |
+| `column_issues`  | `Dict[str, ColumnIssues]`         | Per-column data quality issues                   |
+| `column_stats`   | `Dict[str, ColumnStats]`          | Per-column statistics (mean, std, null_count, …) |
+| `reward`         | `float`                      | Per-step reward from the last action             |
+| `done`           | `bool`                    | True if the episode has ended                    |
+**`ColumnIssues` fields:**
+- `missing_count` — number of NaN values
+- `missing_pct` — fraction of NaN values [0, 1]
+- `type_errors` — values that cannot be parsed as the expected dtype
+- `outlier_count` — values outside 1.5×IQR
+- `has_duplicates` — whether any duplicate rows exist in the dataset
+---
+## Reward Function
+- **Per-step reward:** Column-level accuracy delta relative to clean ground truth.
+  `reward = clip(new_accuracy - prev_accuracy, -0.1, +0.1)`
+  Invalid actions receive `-0.05`.
+- **Episode grader score:** Downstream RandomForest accuracy improvement, normalized:
+  `score = clip((agent_acc - dirty_acc) / (oracle_acc - dirty_acc), 0.0, 1.0)`
+  - `0.0` = no improvement over the dirty baseline
+  - `1.0` = restored to oracle (original) quality
+---
+## API Endpoints
+| Method | Path        | Description                      |
+|------|-------------|-------------|
+| POST   | `/reset`    | Start a new episode. Body: `{"task": "easy\|medium\|hard"}` |
+| POST   | `/step`     | Apply an action. Body: `{"episode_id": "...", "action": {…}}`|
+| GET    | `/state`    | Get episode metadata. Query: `?episode_id=…`                 |
+| GET    | `/tasks`    | List tasks and the full action schema                  |
+| POST   | `/grader`   | Grade a completed episode. Body: `{"episode_id": "..."}`     |
+| POST   | `/baseline` | Run the heuristic baseline agent across all 3 tasks          |
+| GET    | `/health`   | Liveness check                  |
+Interactive API docs available at `/docs` when the server is running.
+---
+## Setup
+### Local (Python)
+```bash
+cd data_cleaning_env
+# Install dependencies
+pip install -r server/requirements.txt
+# Run the server
+uvicorn server.app:app --host 0.0.0.0 --port 8000
+# In another terminal — run the baseline agent
+python baseline.py --url http://localhost:8000
+```
+### Docker
+```bash
+# Build from data_cleaning_env/
+docker build -f server/Dockerfile -t data-cleaning-env .
+# Run
+docker run -p 8000:8000 data-cleaning-env
+# Test
+curl http://localhost:8000/health
+curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task":"easy"}'
+```
+---
+## Baseline Scores
+Run the heuristic baseline agent (fill missing → fix types → drop duplicates → drop outliers on medium/hard → fix schema violations on hard → done):
+```
+Task easy:   ~0.70–0.85
+Task medium: ~0.40–0.60
+Task hard:   ~0.20–0.40
+```
+Actual scores depend on the dataset split and the RandomForest training run; both use a fixed seed (`random_state=42`), so scores are reproducible across runs.
+To reproduce:
+```bash
+python baseline.py --url http://localhost:8000
+```
+Or via the API:
+```bash
+curl -X POST http://localhost:8000/baseline
+```
+---
+## Quick Example
+```python
+import requests
+BASE = "http://localhost:8000"
+# Start a medium-difficulty episode
+resp = requests.post(f"{BASE}/reset", json={"task": "medium"}).json()
+episode_id = resp["state"]["episode_id"]
+obs = resp["observation"]
+print(f"Columns: {obs['columns']}")
+print(f"Issues: {obs['column_issues']}")
+# Fill missing values in the first column
+first_col = obs["columns"][0]
+resp = requests.post(f"{BASE}/step", json={
+    "episode_id": episode_id,
+    "action": {"action_type": "fill_missing", "column": first_col, "strategy": "median"}
+}).json()
+print(f"Reward: {resp['reward']}")
+# Grade the episode
+resp = requests.post(f"{BASE}/grader", json={"episode_id": episode_id}).json()
+print(f"Score: {resp['score']}")
+```
+---
+## Project Structure
+```
+data_cleaning_env/
+├── __init__.py           # Package exports
+├── models.py             # Pydantic: CleaningAction, Observation, EpisodeState
+├── datasets.py        # OpenML dataset loading and caching
+├── noise_injector.py     # Deterministic noise injection (3 task levels)
+├── grader.py          # Episode grader: sklearn RandomForest + bracketed score
+├── baseline.py           # Heuristic baseline agent
+├── client.py           # Async HTTP client wrapper
+├── openenv.yaml          # OpenEnv manifest
+├── pyproject.toml        # Package metadata
+└── server/
+    ├── __init__.py
+    ├── environment.py    # Core env logic (episodes, actions, rewards, observations)
+    ├── app.py            # FastAPI application
+    ├── requirements.txt  # Docker dependencies
+    └── Dockerfile        # Container image
+```
+---
+## License
+MIT
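
Review note on the README's per-step reward: the formula `reward = clip(new_accuracy - prev_accuracy, -0.1, +0.1)` can be sketched in a few lines. How accuracy is actually computed lives in `server/environment.py`, which is not quoted on this page, so the cell-matching `column_accuracy` helper below is an assumption made purely for illustration. Only the clip bounds and the `-0.05` invalid-action penalty are taken from the README.

```python
from typing import Sequence

INVALID_ACTION_PENALTY = -0.05  # value stated in the README
REWARD_CLIP = 0.1               # clip bound stated in the README


def column_accuracy(current: Sequence, clean: Sequence) -> float:
    """Hypothetical metric: fraction of cells equal to the ground truth."""
    if not clean:
        return 0.0
    matches = sum(1 for cur, ref in zip(current, clean) if cur == ref)
    return matches / len(clean)


def step_reward(prev_acc: float, new_acc: float, valid: bool = True) -> float:
    """Clipped accuracy delta, or a flat penalty for an invalid action."""
    if not valid:
        return INVALID_ACTION_PENALTY
    delta = new_acc - prev_acc
    return max(-REWARD_CLIP, min(REWARD_CLIP, delta))


clean = [5.1, 4.9, 4.7, 4.6]
dirty = [5.1, None, 4.7, None]
filled = [5.1, 4.9, 4.7, 4.8]  # one cell restored exactly, one not

prev = column_accuracy(dirty, clean)   # 2 of 4 cells match
new = column_accuracy(filled, clean)   # 3 of 4 cells match
print(step_reward(prev, new))          # delta of 0.25 is clipped to 0.1
```

The clip means a single very effective action earns the same reward as a moderately effective one, which keeps per-step rewards on a comparable scale across the three tasks.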

data_cleaning_env/__init__.py ADDED

	@@ -0,0 +1,12 @@

+"""
+Data Cleaning RL Environment for OpenEnv.
+An RL environment where agents clean tabular datasets using structured commands,
+graded by downstream ML model accuracy improvement.
+"""
+from models import CleaningAction, Observation, EpisodeState
+from client import DataCleaningEnvClient
+__all__ = ["CleaningAction", "Observation", "EpisodeState", "DataCleaningEnvClient"]
+__version__ = "1.0.0"

data_cleaning_env/baseline.py ADDED

	@@ -0,0 +1,253 @@

+"""
+Heuristic baseline agent for the Data Cleaning RL Environment.
+Strategy (per task):
+  1. Fill missing values with median for all columns that have missing data
+  2. Fix type errors (coerce to float) for all columns with type errors
+  3. Drop duplicate rows once
+  4. Drop outliers (IQR) for columns with many outliers (medium/hard tasks)
+  5. Fix schema violations (non_negative) for first 2 columns (hard task only)
+  6. Signal done
+Run standalone:
+    python baseline.py [--url http://localhost:8000]
+Or import and call from app.py:
+    from baseline import run_baseline_internal
+    scores = run_baseline_internal(env)
+"""
+from __future__ import annotations
+import argparse
+import json
+from typing import TYPE_CHECKING
+import requests
+if TYPE_CHECKING:
+    from server.environment import DataCleaningEnvironment
+BASE_URL_DEFAULT = "http://localhost:8000"
+def run_single_episode_http(task: str, base_url: str) -> float:
+    """Run one baseline episode via the HTTP API. Returns the grader score."""
+    # Reset
+    resp = requests.post(f"{base_url}/reset", json={"task": task}, timeout=30)
+    resp.raise_for_status()
+    data = resp.json()
+    episode_id = data["state"]["episode_id"]
+    obs = data["observation"]
+    issues = obs["column_issues"]
+    task_max_steps = obs["max_steps"]
+    steps_used = 0
+    def post_step(action_payload: dict) -> dict:
+        nonlocal steps_used
+        r = requests.post(
+            f"{base_url}/step",
+            json={"episode_id": episode_id, "action": action_payload},
+            timeout=30,
+        )
+        r.raise_for_status()
+        steps_used += 1
+        return r.json()
+    # 1. Fill missing values
+    for col, col_issues in issues.items():
+        if steps_used >= task_max_steps - 3:
+            break
+        if col_issues["missing_count"] > 0:
+            post_step(
+                {
+                    "action_type": "fill_missing",
+                    "column": col,
+                    "strategy": "median",
+                }
+            )
+    # 2. Fix type errors
+    for col, col_issues in issues.items():
+        if steps_used >= task_max_steps - 3:
+            break
+        if col_issues.get("type_errors", 0) > 0:
+            post_step(
+                {
+                    "action_type": "fix_type",
+                    "column": col,
+                    "dtype": "float",
+                }
+            )
+    # 3. Drop duplicates
+    if steps_used < task_max_steps - 2 and any(
+        c.get("has_duplicates", False) for c in issues.values()
+    ):
+        post_step({"action_type": "drop_duplicates"})
+    # 4. Drop outliers (medium and hard tasks)
+    if task in ("medium", "hard"):
+        for col, col_issues in issues.items():
+            if steps_used >= task_max_steps - 1:
+                break
+            if col_issues.get("outlier_count", 0) > 3:
+                post_step(
+                    {
+                        "action_type": "drop_outliers",
+                        "column": col,
+                        "method": "iqr",
+                    }
+                )
+    # 5. Fix schema violations (hard task only)
+    if task == "hard":
+        for col in list(issues.keys())[:2]:
+            if steps_used >= task_max_steps - 1:
+                break
+            post_step(
+                {
+                    "action_type": "fix_schema_violation",
+                    "column": col,
+                    "constraint": "non_negative",
+                }
+            )
+    # 6. Done
+    post_step({"action_type": "done"})
+    # Grade
+    resp = requests.post(
+        f"{base_url}/grader", json={"episode_id": episode_id}, timeout=60
+    )
+    resp.raise_for_status()
+    return float(resp.json()["score"])
+def run_baseline_http(base_url: str = BASE_URL_DEFAULT) -> dict[str, float]:
+    """Run baseline across all tasks via HTTP. Returns {task: score}."""
+    scores: dict[str, float] = {}
+    for task in ["easy", "medium", "hard"]:
+        score = run_single_episode_http(task, base_url)
+        scores[task] = round(score, 4)
+        print(f"  Task {task}: {scores[task]:.4f}")
+    return scores
+def run_baseline_internal(env: "DataCleaningEnvironment") -> dict[str, float]:
+    """
+    Run baseline directly against the environment object (no HTTP).
+    Used by the /baseline endpoint to avoid HTTP round-trips.
+    """
+    from models import (
+        ActionType,
+        CleaningAction,
+        DType,
+        FillStrategy,
+        OutlierMethod,
+        SchemaConstraint,
+    )
+    from grader import grade_episode
+    scores: dict[str, float] = {}
+    for task in ["easy", "medium", "hard"]:
+        obs, state = env.reset(task=task)
+        episode_id = state.episode_id
+        issues = obs.column_issues
+        max_steps = obs.max_steps
+        steps_used = 0
+        def do_step(action: CleaningAction) -> None:
+            nonlocal steps_used, obs
+            if steps_used >= max_steps - 1:
+                return
+            obs_new, _reward, _done = env.step(episode_id, action)
+            obs = obs_new
+            steps_used += 1
+        # 1. Fill missing
+        for col, col_issues in issues.items():
+            if steps_used >= max_steps - 3:
+                break
+            if col_issues.missing_count > 0:
+                do_step(
+                    CleaningAction(
+                        action_type=ActionType.fill_missing,
+                        column=col,
+                        strategy=FillStrategy.median,
+                    )
+                )
+        # 2. Fix type errors
+        for col, col_issues in issues.items():
+            if steps_used >= max_steps - 3:
+                break
+            if col_issues.type_errors > 0:
+                do_step(
+                    CleaningAction(
+                        action_type=ActionType.fix_type,
+                        column=col,
+                        dtype=DType.float,
+                    )
+                )
+        # 3. Drop duplicates
+        if steps_used < max_steps - 2 and any(
+            c.has_duplicates for c in issues.values()
+        ):
+            do_step(CleaningAction(action_type=ActionType.drop_duplicates))
+        # 4. Drop outliers (medium/hard)
+        if task in ("medium", "hard"):
+            for col, col_issues in issues.items():
+                if steps_used >= max_steps - 1:
+                    break
+                if col_issues.outlier_count > 3:
+                    do_step(
+                        CleaningAction(
+                            action_type=ActionType.drop_outliers,
+                            column=col,
+                            method=OutlierMethod.iqr,
+                        )
+                    )
+        # 5. Fix schema violations (hard only)
+        if task == "hard":
+            for col in list(issues.keys())[:2]:
+                if steps_used >= max_steps - 1:
+                    break
+                do_step(
+                    CleaningAction(
+                        action_type=ActionType.fix_schema_violation,
+                        column=col,
+                        constraint=SchemaConstraint.non_negative,
+                    )
+                )
+        # 6. Done
+        env.step(episode_id, CleaningAction(action_type=ActionType.done))
+        # Grade
+        ep = env.episodes[episode_id]
+        score = grade_episode(ep["current_df"], task, ep["target_col"])
+        scores[task] = round(score, 4)
+        print(f"  Task {task}: {scores[task]:.4f}")
+    return scores
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Run baseline heuristic agent")
+    parser.add_argument(
+        "--url",
+        default=BASE_URL_DEFAULT,
+        help=f"Base URL of the running server (default: {BASE_URL_DEFAULT})",
+    )
+    args = parser.parse_args()
+    print(f"Running baseline agent against {args.url} ...")
+    scores = run_baseline_http(args.url)
+    print(f"\nBaseline scores:\n{json.dumps(scores, indent=2)}")
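
Review note: `run_single_episode_http` and `run_baseline_internal` repeat the same action ordering and step-budget checks. One way to make that heuristic easy to unit-test is to plan the action list up front from the initial `column_issues`. The sketch below is a hypothetical refactoring, not code from this commit: `plan_actions` is an invented name, it takes the issues as plain dicts (the HTTP response shape), and it uses a single budget that reserves one step for `done` instead of the commit's separate `- 3`, `- 2`, and `- 1` margins.

```python
from typing import Any


def plan_actions(
    issues: dict[str, dict[str, Any]], task: str, max_steps: int
) -> list[dict[str, Any]]:
    """Return the ordered action payloads, always ending with `done`."""
    budget = max_steps - 1  # keep one step for the final `done`
    plan: list[dict[str, Any]] = []

    def add(action: dict[str, Any]) -> None:
        # Silently skip actions once the budget is exhausted.
        if len(plan) < budget:
            plan.append(action)

    for col, info in issues.items():
        if info.get("missing_count", 0) > 0:
            add({"action_type": "fill_missing", "column": col, "strategy": "median"})
    for col, info in issues.items():
        if info.get("type_errors", 0) > 0:
            add({"action_type": "fix_type", "column": col, "dtype": "float"})
    if any(info.get("has_duplicates", False) for info in issues.values()):
        add({"action_type": "drop_duplicates"})
    if task in ("medium", "hard"):
        for col, info in issues.items():
            if info.get("outlier_count", 0) > 3:
                add({"action_type": "drop_outliers", "column": col, "method": "iqr"})
    if task == "hard":
        for col in list(issues)[:2]:
            add({"action_type": "fix_schema_violation", "column": col,
                 "constraint": "non_negative"})

    plan.append({"action_type": "done"})
    return plan


issues = {
    "age": {"missing_count": 4, "type_errors": 0, "outlier_count": 5, "has_duplicates": True},
    "hours": {"missing_count": 0, "type_errors": 2, "outlier_count": 0, "has_duplicates": True},
}
for action in plan_actions(issues, task="medium", max_steps=40):
    print(action["action_type"], action.get("column", ""))
```

Both execution modes could then iterate over the same plan, which would keep the HTTP and in-process baselines from drifting apart.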

data_cleaning_env/client.py ADDED

	@@ -0,0 +1,99 @@

+"""
+Python client for the Data Cleaning RL Environment.
+Provides a lightweight async wrapper for local testing and integration
+with RL training frameworks.
+Usage (async):
+    import asyncio
+    from client import DataCleaningEnvClient
+    from models import CleaningAction, ActionType, FillStrategy
+    async def main():
+        client = DataCleaningEnvClient(base_url="http://localhost:8000")
+        result = await client.reset(task="easy")
+        episode_id = result["state"]["episode_id"]
+        action = CleaningAction(
+            action_type=ActionType.fill_missing,
+            column="sepallength",
+            strategy=FillStrategy.median,
+        )
+        result = await client.step(episode_id, action)
+        print(result)
+    asyncio.run(main())
+"""
+from __future__ import annotations
+from typing import Any
+try:
+    import httpx
+    _HAS_HTTPX = True
+except ImportError:
+    _HAS_HTTPX = False
+from models import CleaningAction
+class DataCleaningEnvClient:
+    """Async HTTP client for the Data Cleaning OpenEnv server."""
+    def __init__(self, base_url: str = "http://localhost:8000") -> None:
+        self.base_url = base_url.rstrip("/")
+    async def reset(self, task: str = "easy") -> dict[str, Any]:
+        """Start a new episode. Returns {observation, state}."""
+        return await self._post("/reset", {"task": task})
+    async def step(self, episode_id: str, action: CleaningAction) -> dict[str, Any]:
+        """Apply a cleaning action. Returns {observation, reward, done, info}."""
+        return await self._post(
+            "/step",
+            {
+                "episode_id": episode_id,
+                "action": action.model_dump(),
+            },
+        )
+    async def get_state(self, episode_id: str) -> dict[str, Any]:
+        """Get episode metadata."""
+        return await self._get(f"/state?episode_id={episode_id}")
+    async def grade(self, episode_id: str) -> dict[str, Any]:
+        """Grade the current episode. Returns {episode_id, task, score}."""
+        return await self._post("/grader", {"episode_id": episode_id})
+    async def get_tasks(self) -> dict[str, Any]:
+        """Get available tasks and action schema."""
+        return await self._get("/tasks")
+    async def baseline(self) -> dict[str, Any]:
+        """Trigger the baseline agent and return scores."""
+        return await self._post("/baseline", {})
+    async def health(self) -> dict[str, Any]:
+        """Liveness check."""
+        return await self._get("/health")
+    async def _post(self, path: str, payload: dict) -> dict[str, Any]:
+        if not _HAS_HTTPX:
+            raise ImportError(
+                "httpx is required for async HTTP. Install it: pip install httpx"
+            )
+        async with httpx.AsyncClient(base_url=self.base_url, timeout=60) as client:
+            resp = await client.post(path, json=payload)
+            resp.raise_for_status()
+            return resp.json()
+    async def _get(self, path: str) -> dict[str, Any]:
+        if not _HAS_HTTPX:
+            raise ImportError(
+                "httpx is required for async HTTP. Install it: pip install httpx"
+            )
+        async with httpx.AsyncClient(base_url=self.base_url, timeout=60) as client:
+            resp = await client.get(path)
+            resp.raise_for_status()
+            return resp.json()
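
Review note: `get_state` interpolates `episode_id` directly into the query string. That is fine for UUID-style IDs, but a value containing `&`, `#`, or spaces would produce a malformed URL. A minimal standard-library sketch of encoding the query explicitly follows; `build_state_path` is a hypothetical helper and not part of this commit.

```python
from urllib.parse import urlencode


def build_state_path(episode_id: str) -> str:
    """Build the /state path with a percent-encoded episode_id."""
    return "/state?" + urlencode({"episode_id": episode_id})


print(build_state_path("3f2a-77"))  # plain IDs pass through unchanged
print(build_state_path("a b&c"))    # reserved characters are escaped
```

With httpx the same effect should be available without a helper, by passing `params={"episode_id": episode_id}` to `client.get("/state", ...)`.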

data_cleaning_env/datasets.py ADDED

	@@ -0,0 +1,90 @@

+"""
+Dataset loading and caching for the Data Cleaning RL Environment.
+Uses sklearn.datasets.fetch_openml to load public OpenML datasets.
+Results are cached in memory after the first load to avoid repeated downloads.
+"""
+from __future__ import annotations
+import pandas as pd
+from sklearn.datasets import fetch_openml
+TASK_CONFIGS: dict[str, dict] = {
+    "easy": {
+        "data_id": 61,  # iris — 150 rows, 4 numeric features, 3-class
+        "name": "iris",
+        "target": "class",
+        "sample_size": None,  # use full dataset
+    },
+    "medium": {
+        "data_id": 1590,  # adult v2 — 48,842 rows, 15 features, binary class
+        "name": "adult",
+        "target": "class",
+        "sample_size": 2000,  # sample for grading speed
+    },
+    "hard": {
+        "data_id": 31,  # credit-g — 1,000 rows, 20 features, binary class
+        "name": "credit-g",
+        "target": "class",
+        "sample_size": None,  # use full dataset
+    },
+}
+# In-memory cache: task -> (clean_df, target_col)
+_DATASET_CACHE: dict[str, tuple[pd.DataFrame, str]] = {}
+def load_clean_dataset(task: str) -> tuple[pd.DataFrame, str]:
+    """
+    Load a clean OpenML dataset for the given task.
+    Returns:
+        (clean_df, target_col) — DataFrame with original (clean) data and the
+        name of the target column.
+    Raises:
+        ValueError: if task is not one of "easy", "medium", "hard".
+    """
+    if task not in TASK_CONFIGS:
+        raise ValueError(f"Unknown task '{task}'. Must be one of: {list(TASK_CONFIGS)}")
+    if task in _DATASET_CACHE:
+        df, target = _DATASET_CACHE[task]
+        return df.copy(), target
+    cfg = TASK_CONFIGS[task]
+    dataset = fetch_openml(
+        data_id=cfg["data_id"],
+        as_frame=True,
+        cache=True,
+        parser="auto",
+    )
+    df: pd.DataFrame = dataset.frame.copy()
+    # Rename target column to "class" for consistency if needed
+    target_col: str = cfg["target"]
+    if dataset.target_names and dataset.target_names[0] != target_col:
+        actual_target = dataset.target_names[0]
+        if actual_target in df.columns:
+            df = df.rename(columns={actual_target: target_col})
+    # Ensure target column exists; if it's in dataset.target but not frame, add it
+    if target_col not in df.columns and dataset.target is not None:
+        df[target_col] = dataset.target.values
+    # Sample for large datasets to keep grading fast
+    if cfg["sample_size"] is not None and len(df) > cfg["sample_size"]:
+        df = df.sample(n=cfg["sample_size"], random_state=42).reset_index(drop=True)
+    # Reset index for clean indexing
+    df = df.reset_index(drop=True)
+    _DATASET_CACHE[task] = (df.copy(), target_col)
+    return df.copy(), target_col
+def preload_all() -> None:
+    """Preload all datasets into cache. Call at server startup."""
+    for task in TASK_CONFIGS:
+        load_clean_dataset(task)

data_cleaning_env/grader.py ADDED

	@@ -0,0 +1,161 @@

+"""
+Grader for the Data Cleaning RL Environment.
+Trains a simple sklearn RandomForest on the agent-cleaned dataset and scores
+improvement using a bracketed normalization formula:
+    score = clip((agent_acc - dirty_acc) / (oracle_acc - dirty_acc), 0.0, 1.0)
+Oracle and dirty baselines are precomputed once at server startup.
+"""
+from __future__ import annotations
+import numpy as np
+import pandas as pd
+from sklearn.compose import ColumnTransformer
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.impute import SimpleImputer
+from sklearn.model_selection import train_test_split
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import LabelEncoder, StandardScaler
+# Precomputed at startup, keyed by task name
+_ORACLE_SCORES: dict[str, float] = {}
+_DIRTY_SCORES: dict[str, float] = {}
+def train_and_score(df: pd.DataFrame, target_col: str) -> float:
+    """
+    Train a RandomForest classifier on df and return test accuracy.
+    Returns 0.0 on any failure to avoid crashing the grader endpoint.
+    """
+    try:
+        if target_col not in df.columns:
+            return 0.0
+        df = df.copy()
+        X = df.drop(columns=[target_col])
+        y = df[target_col].astype(str)
+        # Encode target labels
+        le = LabelEncoder()
+        y_enc = le.fit_transform(y.fillna("__missing__"))
+        # Need at least 2 classes and enough samples
+        if len(np.unique(y_enc)) < 2 or len(df) < 10:
+            return 0.0
+        # Build column transformer
+        num_cols = X.select_dtypes(include="number").columns.tolist()
+        cat_cols = X.select_dtypes(exclude="number").columns.tolist()
+        transformers: list = []
+        if num_cols:
+            transformers.append(
+                (
+                    "num",
+                    Pipeline(
+                        [
+                            ("impute", SimpleImputer(strategy="median")),
+                            ("scale", StandardScaler()),
+                        ]
+                    ),
+                    num_cols,
+                )
+            )
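+        if cat_cols:
+            # RandomForest cannot fit raw strings, so ordinal-encode after imputing.
+            from sklearn.preprocessing import OrdinalEncoder
+            transformers.append(
+                (
+                    "cat",
+                    Pipeline(
+                        [
+                            ("impute", SimpleImputer(strategy="most_frequent")),
+                            ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
+                        ]
+                    ),
+                    cat_cols,
+                )
+            )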
+        if not transformers:
+            return 0.0
+        preprocessor = ColumnTransformer(transformers, remainder="drop")
+        clf = Pipeline(
+            [
+                ("pre", preprocessor),
+                (
+                    "clf",
+                    RandomForestClassifier(
+                        n_estimators=50,
+                        random_state=42,
+                        n_jobs=-1,
+                        max_depth=10,
+                    ),
+                ),
+            ]
+        )
+        # Stratified split
+        try:
+            X_train, X_test, y_train, y_test = train_test_split(
+                X,
+                y_enc,
+                test_size=0.2,
+                random_state=42,
+                stratify=y_enc,
+            )
+        except ValueError:
+            X_train, X_test, y_train, y_test = train_test_split(
+                X,
+                y_enc,
+                test_size=0.2,
+                random_state=42,
+            )
+        clf.fit(X_train, y_train)
+        return float(clf.score(X_test, y_test))
+    except Exception:
+        return 0.0
+def compute_oracle_and_dirty_baselines(
+    task: str,
+    clean_df: pd.DataFrame,
+    dirty_df: pd.DataFrame,
+    target_col: str,
+) -> None:
+    """
+    Precompute oracle and dirty baseline scores for a task.
+    Call once at server startup for each task.
+    """
+    _ORACLE_SCORES[task] = train_and_score(clean_df, target_col)
+    _DIRTY_SCORES[task] = train_and_score(dirty_df, target_col)
+def grade_episode(
+    episode_cleaned_df: pd.DataFrame,
+    task: str,
+    target_col: str,
+) -> float:
+    """
+    Grade a completed episode.
+    Returns a score in [0.0, 1.0].
+    """
+    agent_acc = train_and_score(episode_cleaned_df, target_col)
+    oracle = _ORACLE_SCORES.get(task, 1.0)
+    dirty = _DIRTY_SCORES.get(task, 0.0)
+    if oracle <= dirty:
+        return 1.0 if agent_acc >= oracle else 0.0
+    score = (agent_acc - dirty) / (oracle - dirty)
+    return float(np.clip(score, 0.0, 1.0))
+def get_baselines() -> dict[str, dict[str, float]]:
+    """Return the precomputed oracle and dirty scores for all tasks."""
+    return {
+        task: {
+            "oracle_accuracy": _ORACLE_SCORES.get(task),
+            "dirty_accuracy": _DIRTY_SCORES.get(task),
+        }
+        for task in ["easy", "medium", "hard"]
+    }
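
Review note: the normalization in `grade_episode` is easiest to check with concrete numbers. The sketch below reimplements only the arithmetic, with the accuracies passed in directly instead of read from `_ORACLE_SCORES` and `_DIRTY_SCORES`. `normalized_score` is a name chosen for this illustration and the example accuracies are made up. It also makes the degenerate branch visible: when the oracle is no better than the dirty baseline, any agent accuracy at or above the oracle scores 1.0.

```python
def normalized_score(agent_acc: float, dirty_acc: float, oracle_acc: float) -> float:
    """Map agent accuracy onto [0, 1] between the dirty and oracle baselines."""
    if oracle_acc <= dirty_acc:
        # No headroom to normalize over; fall back to a pass/fail check.
        return 1.0 if agent_acc >= oracle_acc else 0.0
    raw = (agent_acc - dirty_acc) / (oracle_acc - dirty_acc)
    return min(1.0, max(0.0, raw))


print(normalized_score(0.85, 0.80, 0.90))  # roughly halfway between the baselines
print(normalized_score(0.75, 0.80, 0.90))  # worse than dirty, clipped to 0.0
print(normalized_score(0.95, 0.80, 0.90))  # better than oracle, clipped to 1.0
print(normalized_score(0.00, 0.00, 0.00))  # degenerate case returns 1.0
```

Because `train_and_score` returns 0.0 on any exception, a pipeline failure during baseline precomputation would leave both baselines at 0.0 and route every episode through the degenerate branch, so it may be worth logging the exception there rather than discarding it.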

data_cleaning_env/models.py ADDED

	@@ -0,0 +1,157 @@

+"""
+Pydantic models for the Data Cleaning RL Environment.
+Defines the typed action, observation, and state structures used by the
+OpenEnv step()/reset()/state() API.
+"""
+from __future__ import annotations
+from enum import Enum
+from typing import Any, Dict, List, Optional
+from pydantic import BaseModel, Field
+# -----------------------
+# Action types
+# -----------------------
+class ActionType(str, Enum):
+    fill_missing = "fill_missing"
+    drop_duplicates = "drop_duplicates"
+    fix_type = "fix_type"
+    normalize = "normalize"
+    drop_outliers = "drop_outliers"
+    fix_schema_violation = "fix_schema_violation"
+    done = "done"
+class FillStrategy(str, Enum):
+    mean = "mean"
+    median = "median"
+    mode = "mode"
+    constant = "constant"
+class DType(str, Enum):
+    int = "int"
+    float = "float"
+    str = "str"
+class OutlierMethod(str, Enum):
+    iqr = "iqr"
+    zscore = "zscore"
+class SchemaConstraint(str, Enum):
+    non_negative = "non_negative"
+    clamp_range = "clamp_range"
+class CleaningAction(BaseModel):
+    """A single cleaning action issued by the agent."""
+    action_type: ActionType = Field(
+        ...,
+        description="Type of cleaning action to perform.",
+    )
+    column: Optional[str] = Field(
+        None,
+        description="Target column name. Required for all column-level actions.",
+    )
+    strategy: Optional[FillStrategy] = Field(
+        None,
+        description="Fill strategy for fill_missing action.",
+    )
+    dtype: Optional[DType] = Field(
+        None,
+        description="Target dtype for fix_type action.",
+    )
+    method: Optional[OutlierMethod] = Field(
+        None,
+        description="Outlier detection method for drop_outliers action.",
+    )
+    constraint: Optional[SchemaConstraint] = Field(
+        None,
+        description="Constraint type for fix_schema_violation action.",
+    )
+    constant_value: Optional[Any] = Field(
+        None,
+        description="Constant fill value for fill_missing with strategy=constant.",
+    )
+# ---------------
+# Observation
+# -------------------
+class ColumnIssues(BaseModel):
+    """Per-column data quality issues detected in the current state."""
+    missing_count: int = Field(..., description="Number of missing (NaN) values.")
+    missing_pct: float = Field(..., description="Fraction of missing values [0, 1].")
+    type_errors: int = Field(
+        ...,
+        description="Number of cells that cannot be parsed as the expected dtype.",
+    )
+    outlier_count: int = Field(
+        ...,
+        description="Number of outliers detected via IQR rule.",
+    )
+    has_duplicates: bool = Field(
+        ...,
+        description="True if the dataset currently contains duplicate rows.",
+    )
+class ColumnStats(BaseModel):
+    """Compact statistical summary for a column."""
+    mean: Optional[float] = None
+    std: Optional[float] = None
+    null_count: int = 0
+    unique_count: int = 0
+class Observation(BaseModel):
+    """Observation returned by reset() and step()."""
+    task: str = Field(..., description="Task tier: 'easy', 'medium', or 'hard'.")
+    step: int = Field(..., description="Current step number (0-indexed).")
+    max_steps: int = Field(..., description="Maximum steps allowed in this episode.")
+    columns: List[str] = Field(..., description="Column names in the dataset.")
+    column_issues: Dict[str, ColumnIssues] = Field(
+        ...,
+        description="Data quality issues per column.",
+    )
+    column_stats: Dict[str, ColumnStats] = Field(
+        ...,
+        description="Compact statistics per column.",
+    )
+    reward: float = Field(
+        ...,
+        description="Per-step reward from the most recent action.",
+    )
+    done: bool = Field(..., description="True if the episode has ended.")
+# ----------------
+# Episode state
+# --------------------
+class EpisodeState(BaseModel):
+    """Metadata about the current episode, returned by state()."""
+    episode_id: str = Field(..., description="Unique episode identifier (UUID).")
+    task: str = Field(..., description="Task tier: 'easy', 'medium', or 'hard'.")
+    step: int = Field(..., description="Current step number.")
+    max_steps: int = Field(..., description="Maximum steps allowed.")
+    score: Optional[float] = Field(
+        None,
+        description="Final grader score (0.0–1.0). Only set after /grader is called.",
+    )
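To illustrate how the `CleaningAction` fields combine, here is a small sketch that builds action payloads as plain dictionaries and checks the fields each action type is described as needing. `build_action` and `REQUIRED_FIELDS` are hypothetical names, and the column name is a placeholder. Note that the Pydantic model itself declares every field except `action_type` as optional, so this check is stricter than the model.

```python
import json

# Fields each action type is described as needing in the Field descriptions.
REQUIRED_FIELDS = {
    "fill_missing": ("column", "strategy"),
    "drop_duplicates": (),
    "fix_type": ("column", "dtype"),
    "normalize": ("column",),
    "drop_outliers": ("column", "method"),
    "fix_schema_violation": ("column", "constraint"),
    "done": (),
}


def build_action(action_type: str, **fields) -> dict:
    """Hypothetical helper: build a CleaningAction-shaped dict with a field check."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown action_type '{action_type}'")
    missing = [f for f in REQUIRED_FIELDS[action_type] if fields.get(f) is None]
    if missing:
        raise ValueError(f"{action_type} requires: {', '.join(missing)}")
    return {"action_type": action_type, **fields}


# "some_column" is a placeholder, not a real dataset column.
action = build_action("fill_missing", column="some_column", strategy="median")
payload = json.dumps(action)
```

A dictionary of this shape is what the `action` key of a step request would carry once serialized.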

data_cleaning_env/noise_injector.py ADDED

	@@ -0,0 +1,115 @@

+"""
+Deterministic noise injection for the Data Cleaning RL Environment.
+Each task tier injects a different combination and severity of noise into a
+clean dataset. The same seed always produces the same dirty dataset, ensuring
+reproducibility for judges and baseline evaluation.
+"""
+from __future__ import annotations
+import numpy as np
+import pandas as pd
+def inject_noise(df: pd.DataFrame, task: str, seed: int = 42) -> pd.DataFrame:
+    """
+    Inject noise into a clean DataFrame according to the task difficulty.
+    Args:
+        df: Clean source DataFrame (will not be modified in place).
+        task: One of "easy", "medium", "hard".
+        seed: RNG seed for reproducibility.
+    Returns:
+        A new dirty DataFrame.
+    """
+    if task == "easy":
+        return _inject_easy(df.copy(), seed)
+    elif task == "medium":
+        return _inject_medium(df.copy(), seed)
+    elif task == "hard":
+        return _inject_hard(df.copy(), seed)
+    else:
+        raise ValueError(f"Unknown task '{task}'. Must be one of: easy, medium, hard")
+def _inject_easy(dirty: pd.DataFrame, seed: int) -> pd.DataFrame:
+    """Easy: 15 percent missing values in numeric columns only."""
+    rng = np.random.default_rng(seed)
+    numeric_cols = dirty.select_dtypes(include="number").columns.tolist()
+    for col in numeric_cols:
+        mask = rng.random(len(dirty)) < 0.15
+        dirty.loc[mask, col] = np.nan
+    return dirty.reset_index(drop=True)
+def _inject_medium(dirty: pd.DataFrame, seed: int) -> pd.DataFrame:
+    """Medium: 20 percent missing + type errors in 2 numeric cols + 3 percent duplicate rows."""
+    rng = np.random.default_rng(seed)
+    # 1. Missing values in all columns
+    for col in dirty.columns:
+        mask = rng.random(len(dirty)) < 0.20
+        dirty.loc[mask, col] = np.nan
+    # 2. Type corruption: convert some non-null numeric cells to string
+    numeric_cols = dirty.select_dtypes(include="number").columns.tolist()
+    for col in numeric_cols[:2]:
+        mask = (rng.random(len(dirty)) < 0.05) & dirty[col].notna()
+        dirty[col] = dirty[col].astype(object)
+        dirty.loc[mask, col] = dirty.loc[mask, col].apply(
+            lambda x: f"err_{x}" if pd.notna(x) else x
+        )
+    # 3. Duplicate rows
+    n_dups = max(1, int(len(dirty) * 0.03))
+    dup_indices = rng.choice(len(dirty), size=n_dups, replace=True)
+    dup_rows = dirty.iloc[dup_indices]
+    dirty = pd.concat([dirty, dup_rows], ignore_index=True)
+    return dirty.reset_index(drop=True)
+def _inject_hard(dirty: pd.DataFrame, seed: int) -> pd.DataFrame:
+    """Hard: 25 percent missing + type errors + outliers + duplicates + schema violations."""
+    rng = np.random.default_rng(seed)
+    numeric_cols = dirty.select_dtypes(include="number").columns.tolist()
+    # 1. Missing values in all columns
+    for col in dirty.columns:
+        mask = rng.random(len(dirty)) < 0.25
+        dirty.loc[mask, col] = np.nan
+    # 2. Type corruption in 3 numeric columns
+    for col in numeric_cols[:3]:
+        mask = (rng.random(len(dirty)) < 0.07) & dirty[col].notna()
+        dirty[col] = dirty[col].astype(object)
+        dirty.loc[mask, col] = dirty.loc[mask, col].apply(
+            lambda x: f"err_{x}" if pd.notna(x) else x
+        )
+    # 3. Outliers: set 5 percent of numeric values to 10x their column max
+    for col in numeric_cols:
+        numeric_series = pd.to_numeric(dirty[col], errors="coerce")
+        col_max = numeric_series.max()
+        if pd.notna(col_max) and col_max != 0:
+            mask = (rng.random(len(dirty)) < 0.05) & dirty[col].notna()
+            dirty.loc[mask, col] = col_max * 10
+    # 4. Duplicate rows
+    n_dups = max(1, int(len(dirty) * 0.05))
+    dup_indices = rng.choice(len(dirty), size=n_dups, replace=True)
+    dup_rows = dirty.iloc[dup_indices]
+    dirty = pd.concat([dirty, dup_rows], ignore_index=True)
+    # 5. Schema violations: negate values in the first 2 numeric columns (assumed strictly positive)
+    pos_cols = numeric_cols[:2]
+    for col in pos_cols:
+        numeric_series = pd.to_numeric(dirty[col], errors="coerce")
+        mask = (rng.random(len(dirty)) < 0.05) & dirty[col].notna()
+        positive_vals = numeric_series[mask].abs()
+        dirty.loc[mask, col] = -positive_vals
+    return dirty.reset_index(drop=True)

data_cleaning_env/openenv.yaml ADDED

	@@ -0,0 +1,35 @@

+name: data-cleaning-env
+version: "1.0.0"
+description: >
+  RL environment for tabular data cleaning. An AI agent receives a dirty
+  OpenML dataset and issues structured cleaning commands (fill_missing,
+  fix_type, drop_duplicates, drop_outliers, fix_schema_violation, normalize).
+  Graded by downstream RandomForest accuracy improvement over a dirty baseline.
+author: Yash Marathe
+tags:
+  - data-engineering
+  - tabular
+  - real-world
+  - openml
+  - data-quality
+tasks:
+  - id: easy
+    description: "Fix missing values in the Iris dataset (15% of numeric values missing)"
+  - id: medium
+    description: "Fix missing values, type errors, and duplicates in Adult Income dataset (2k sample)"
+  - id: hard
+    description: "Fix missing values, type errors, duplicates, outliers, and schema violations in Credit-G dataset"
+server:
+  port: 8000
+  health_endpoint: /health
+endpoints:
+  reset: POST /reset
+  step: POST /step
+  state: GET /state
+  tasks: GET /tasks
+  grader: POST /grader
+  baseline: POST /baseline
+  health: GET /health

data_cleaning_env/pyproject.toml ADDED

	@@ -0,0 +1,28 @@

+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[project]
+name = "data-cleaning-env"
+version = "1.0.0"
+description = "OpenEnv RL environment for tabular data cleaning. Agent issues structured commands to clean dirty datasets from OpenML, graded by downstream ML accuracy."
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "fastapi>=0.115",
+    "uvicorn[standard]>=0.30",
+    "pydantic>=2.0",
+    "scikit-learn>=1.4",
+    "pandas>=2.0",
+    "numpy>=1.26",
+    "requests>=2.31",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0",
+    "httpx>=0.27",
+]
+[tool.hatch.build.targets.wheel]
+packages = ["data_cleaning_env"]

data_cleaning_env/server/Dockerfile ADDED

	@@ -0,0 +1,23 @@

+FROM python:3.11-slim
+WORKDIR /app
+# Install dependencies first (layer caching)
+COPY server/requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy source
+COPY . .
+# Pre-download and cache OpenML datasets at build time to avoid cold-start latency.
+# This also validates that all 3 datasets are reachable.
+RUN python -c "\
+import sys; sys.path.insert(0, '/app'); \
+from datasets import preload_all; \
+preload_all(); \
+print('Datasets cached successfully.')"
+EXPOSE 8000
+# Run from /app so that all module imports resolve against data_cleaning_env/
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]

data_cleaning_env/server/__init__.py ADDED

	@@ -0,0 +1 @@


+# server package

data_cleaning_env/server/app.py ADDED

	@@ -0,0 +1,282 @@

+"""
+FastAPI server for the Data Cleaning RL Environment.
+Implements the full OpenEnv standard API (reset/step/state) plus the
+hackathon-required endpoints (/tasks, /grader, /baseline, /health).
+"""
+from __future__ import annotations
+import sys
+import os
+_HERE = os.path.dirname(os.path.abspath(__file__))
+_ROOT = os.path.dirname(_HERE)
+if _ROOT not in sys.path:
+    sys.path.insert(0, _ROOT)
+from contextlib import asynccontextmanager
+from typing import Optional
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+from datasets import load_clean_dataset, preload_all
+from grader import compute_oracle_and_dirty_baselines, grade_episode
+from models import CleaningAction
+from noise_injector import inject_noise
+from server.environment import DataCleaningEnvironment
+# ---------------
+# Startup / lifespan
+# --------------
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """
+    Preload datasets and compute oracle/dirty baselines at startup.
+    Avoids cold-start latency on the first API call.
+    """
+    preload_all()
+    for task in ["easy", "medium", "hard"]:
+        clean_df, target_col = load_clean_dataset(task)
+        dirty_df = inject_noise(clean_df, task)
+        compute_oracle_and_dirty_baselines(task, clean_df, dirty_df, target_col)
+    yield
+app = FastAPI(
+    title="Data Cleaning OpenEnv",
+    description=(
+        "An OpenEnv RL environment where agents clean tabular datasets "
+        "using structured commands. Graded by downstream ML accuracy improvement."
+    ),
+    version="1.0.0",
+    lifespan=lifespan,
+)
+env = DataCleaningEnvironment()
+# ------------
+# Request/response models
+# -------------
+class ResetRequest(BaseModel):
+    task: Optional[str] = "easy"
+class StepRequest(BaseModel):
+    episode_id: str
+    action: CleaningAction
+class GraderRequest(BaseModel):
+    episode_id: str
+# --------------
+# Standard OpenEnv endpoints
+# ----------------------
+@app.post("/reset", summary="Start a new episode")
+async def reset(req: ResetRequest):
+    """
+    Initialize a new episode with the given task tier.
+    Body: {"task": "easy" | "medium" | "hard"}
+    Returns the initial observation and episode state.
+    """
+    try:
+        obs, state = env.reset(task=req.task or "easy")
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+    return {"observation": obs.model_dump(), "state": state.model_dump()}
+@app.post("/step", summary="Apply a cleaning action")
+async def step(req: StepRequest):
+    """
+    Apply one cleaning action to the active episode.
+    Returns the updated observation, step reward, and done flag.
+    """
+    try:
+        obs, reward, done = env.step(req.episode_id, req.action)
+    except KeyError:
+        raise HTTPException(
+            status_code=404,
+            detail=f"Episode '{req.episode_id}' not found.",
+        )
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+    return {
+        "observation": obs.model_dump(),
+        "reward": reward,
+        "done": done,
+        "info": {},
+    }
+@app.get("/state", summary="Get episode metadata")
+async def state(episode_id: str):
+    """Return metadata about an active episode."""
+    ep = env.episodes.get(episode_id)
+    if not ep:
+        raise HTTPException(
+            status_code=404,
+            detail=f"Episode '{episode_id}' not found.",
+        )
+    return {
+        "episode_id": episode_id,
+        "task": ep["task"],
+        "step": ep["step"],
+        "max_steps": ep["max_steps"],
+        "done": ep["done"],
+    }
+# -------------------------
+# Hackathon-required endpoints
+# -------------------------
+@app.get("/tasks", summary="List tasks and action schema")
+async def tasks():
+    """
+    Return the list of available tasks and the full action schema.
+    Required by the hackathon pre-submission checklist.
+    """
+    return {
+        "tasks": [
+            {
+                "id": "easy",
+                "description": (
+                    "Fix missing values in the Iris dataset. "
+                    "15% of numeric values are missing."
+                ),
+                "dataset": "iris (OpenML ID 61)",
+                "max_steps": 20,
+                "noise_types": ["missing_values"],
+            },
+            {
+                "id": "medium",
+                "description": (
+                    "Fix missing values, type errors, and duplicate rows "
+                    "in the Adult Income dataset (2,000-row sample)."
+                ),
+                "dataset": "adult (OpenML ID 1590, 2k sample)",
+                "max_steps": 40,
+                "noise_types": ["missing_values", "type_errors", "duplicates"],
+            },
+            {
+                "id": "hard",
+                "description": (
+                    "Fix missing values, type errors, duplicates, outliers, "
+                    "and schema violations in the Credit-G dataset."
+                ),
+                "dataset": "credit-g (OpenML ID 31)",
+                "max_steps": 60,
+                "noise_types": [
+                    "missing_values",
+                    "type_errors",
+                    "duplicates",
+                    "outliers",
+                    "schema_violations",
+                ],
+            },
+        ],
+        "action_schema": {
+            "action_type": {
+                "type": "string",
+                "required": True,
+                "values": [
+                    "fill_missing",
+                    "drop_duplicates",
+                    "fix_type",
+                    "normalize",
+                    "drop_outliers",
+                    "fix_schema_violation",
+                    "done",
+                ],
+            },
+            "column": {
+                "type": "string",
+                "required": "for column-level actions",
+                "description": "Target column name from the dataset.",
+            },
+            "strategy": {
+                "type": "string",
+                "required": "for fill_missing",
+                "values": ["mean", "median", "mode", "constant"],
+            },
+            "dtype": {
+                "type": "string",
+                "required": "for fix_type",
+                "values": ["int", "float", "str"],
+            },
+            "method": {
+                "type": "string",
+                "required": "for drop_outliers",
+                "values": ["iqr", "zscore"],
+            },
+            "constraint": {
+                "type": "string",
+                "required": "for fix_schema_violation",
+                "values": ["non_negative", "clamp_range"],
+            },
+            "constant_value": {
+                "type": "any",
+                "required": "for fill_missing with strategy=constant",
+                "description": "The constant value to fill missing cells with.",
+            },
+        },
+    }
+@app.post("/grader", summary="Grade a completed episode")
+async def grader(req: GraderRequest):
+    """
+    Compute the grader score for a completed (or ongoing) episode.
+    Score is in [0.0, 1.0]:
+    - 0.0 = no improvement over the dirty baseline
+    - 1.0 = dataset restored to oracle (original) quality
+    """
+    ep = env.episodes.get(req.episode_id)
+    if not ep:
+        raise HTTPException(
+            status_code=404,
+            detail=f"Episode '{req.episode_id}' not found.",
+        )
+    score = grade_episode(ep["current_df"], ep["task"], ep["target_col"])
+    return {
+        "episode_id": req.episode_id,
+        "task": ep["task"],
+        "score": score,
+    }
+@app.post("/baseline", summary="Run the baseline heuristic agent")
+async def baseline():
+    """
+    Run the built-in heuristic baseline agent through all 3 tasks and return scores.
+    The baseline uses a simple rule-based strategy: fill missing (median),
+    fix types, drop duplicates, drop outliers, then done.
+    """
+    from baseline import run_baseline_internal
+    scores = run_baseline_internal(env)
+    return {"baseline_scores": scores}
+@app.get("/health", summary="Health check")
+async def health():
+    """Liveness check -- returns 200 when the server is running."""
+    return {"status": "ok"}
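The request and response shapes above imply a simple client loop: reset, step until `done`, then grade. The sketch below expresses that loop against an injected `post` callable so it can run without a live server. `run_episode`, `Post`, and `fake_post` are hypothetical names, and the fake responses are made up to mirror the shapes returned by `/reset`, `/step`, and `/grader`.

```python
from typing import Callable

# Assumption: `post` sends a JSON body to a path and returns the decoded JSON reply.
Post = Callable[[str, dict], dict]


def run_episode(post: Post, task: str, actions: list[dict]) -> float:
    """Reset, apply actions until the episode is done, then request a score."""
    reset = post("/reset", {"task": task})
    episode_id = reset["state"]["episode_id"]
    for action in actions:
        result = post("/step", {"episode_id": episode_id, "action": action})
        if result["done"]:
            break
    return post("/grader", {"episode_id": episode_id})["score"]


def fake_post(path: str, body: dict) -> dict:
    """Stand-in server used only to exercise the loop; values are invented."""
    if path == "/reset":
        return {"observation": {}, "state": {"episode_id": "demo"}}
    if path == "/step":
        done = body["action"]["action_type"] == "done"
        return {"observation": {}, "reward": 0.0, "done": done, "info": {}}
    return {"episode_id": body["episode_id"], "task": "easy", "score": 0.0}


score = run_episode(
    fake_post,
    "easy",
    [{"action_type": "drop_duplicates"}, {"action_type": "done"}],
)
```

Against a running server, `post` could wrap an HTTP client call that sends the body as JSON to the server's base URL plus the path. The loop stops stepping once `done` is true because the server rejects further steps on a finished episode with a 400.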

data_cleaning_env/server/environment.py ADDED

	@@ -0,0 +1,371 @@

+"""
+Core environment logic for the Data Cleaning RL Environment.
+Manages episodes, applies cleaning actions, computes per-step rewards,
+and assembles observations.
+"""
+from __future__ import annotations
+import sys
+import os
+_HERE = os.path.dirname(os.path.abspath(__file__))
+_ROOT = os.path.dirname(_HERE)
+if _ROOT not in sys.path:
+    sys.path.insert(0, _ROOT)
+import uuid
+from typing import Any
+import numpy as np
+import pandas as pd
+from datasets import load_clean_dataset
+from noise_injector import inject_noise
+from models import (
+    ActionType,
+    CleaningAction,
+    ColumnIssues,
+    ColumnStats,
+    EpisodeState,
+    Observation,
+)
+MAX_STEPS: dict[str, int] = {
+    "easy": 20,
+    "medium": 40,
+    "hard": 60,
+}
+REWARD_CLIP = 0.1
+class DataCleaningEnvironment:
+    """
+    Manages multiple concurrent episodes.
+    Episodes are stored in-memory. Each reset() creates a fresh episode.
+    """
+    def __init__(self) -> None:
+        self.episodes: dict[str, dict[str, Any]] = {}
+    # --------------------
+    # Public API
+    # ----------------------
+    def reset(self, task: str = "easy") -> tuple[Observation, EpisodeState]:
+        """Initialize a new episode and return the initial observation."""
+        if task not in MAX_STEPS:
+            raise ValueError(
+                f"Unknown task '{task}'. Must be one of: {list(MAX_STEPS)}"
+            )
+        episode_id = str(uuid.uuid4())
+        clean_df, target_col = load_clean_dataset(task)
+        dirty_df = inject_noise(clean_df, task)
+        initial_accuracy = self._column_accuracy(dirty_df, clean_df)
+        self.episodes[episode_id] = {
+            "task": task,
+            "clean_df": clean_df,
+            "current_df": dirty_df.copy(),
+            "target_col": target_col,
+            "step": 0,
+            "max_steps": MAX_STEPS[task],
+            "done": False,
+            "prev_column_accuracy": initial_accuracy,
+        }
+        obs = self._make_observation(episode_id, reward=0.0)
+        state = EpisodeState(
+            episode_id=episode_id,
+            task=task,
+            step=0,
+            max_steps=MAX_STEPS[task],
+        )
+        return obs, state
+    def step(
+        self, episode_id: str, action: CleaningAction
+    ) -> tuple[Observation, float, bool]:
+        """
+        Apply an action to the current episode state.
+        Returns:
+            (observation, reward, done)
+        Raises:
+            KeyError: if episode_id is unknown.
+            ValueError: if the episode is already done.
+        """
+        ep = self.episodes[episode_id]  # raises KeyError if unknown
+        if ep["done"]:
+            raise ValueError(
+                "Episode already done. Call /reset to start a new episode."
+            )
+        reward = 0.0
+        df = ep["current_df"]
+        if action.action_type == ActionType.done:
+            ep["done"] = True
+            reward = 0.0
+        else:
+            try:
+                df = self._apply_action(df, action, ep)
+                ep["current_df"] = df
+                new_accuracy = self._column_accuracy(df, ep["clean_df"])
+                delta = new_accuracy - ep["prev_column_accuracy"]
+                reward = float(np.clip(delta, -REWARD_CLIP, REWARD_CLIP))
+                ep["prev_column_accuracy"] = new_accuracy
+            except Exception:
+                # Penalize invalid actions (e.g. unknown column); no-ops do not raise and earn a delta of 0
+                reward = -0.05
+        ep["step"] += 1
+        if ep["step"] >= ep["max_steps"]:
+            ep["done"] = True
+            # Note: this also discards any reward earned by the final action.
+            reward = 0.0
+        obs = self._make_observation(episode_id, reward=reward)
+        return obs, reward, ep["done"]
+    # --------------
+    # Action application
+    # ---------------------
+    def _apply_action(
+        self, df: pd.DataFrame, action: CleaningAction, ep: dict
+    ) -> pd.DataFrame:
+        """Apply a cleaning action and return the modified DataFrame."""
+        df = df.copy()
+        col = action.column
+        if action.action_type == ActionType.fill_missing:
+            df = self._fill_missing(df, col, action)
+        elif action.action_type == ActionType.drop_duplicates:
+            df = df.drop_duplicates().reset_index(drop=True)
+        elif action.action_type == ActionType.fix_type:
+            df = self._fix_type(df, col, action)
+        elif action.action_type == ActionType.normalize:
+            df = self._normalize(df, col)
+        elif action.action_type == ActionType.drop_outliers:
+            df = self._drop_outliers(df, col, action)
+        elif action.action_type == ActionType.fix_schema_violation:
+            df = self._fix_schema_violation(df, col, action, ep)
+        else:
+            raise ValueError(f"Unhandled action type: {action.action_type}")
+        return df
+    def _fill_missing(
+        self, df: pd.DataFrame, col: str, action: CleaningAction
+    ) -> pd.DataFrame:
+        if col not in df.columns:
+            raise ValueError(f"Column '{col}' not found.")
+        numeric = pd.to_numeric(df[col], errors="coerce")
+        strategy = action.strategy.value if action.strategy else "median"
+        if strategy == "mean":
+            fill_value = numeric.mean()
+        elif strategy == "median":
+            fill_value = numeric.median()
+        elif strategy == "mode":
+            mode_vals = numeric.mode()
+            fill_value = mode_vals.iloc[0] if not mode_vals.empty else 0
+        elif strategy == "constant":
+            fill_value = action.constant_value
+        else:
+            fill_value = numeric.median()
+        df[col] = numeric.fillna(fill_value)
+        return df
+    def _fix_type(
+        self, df: pd.DataFrame, col: str, action: CleaningAction
+    ) -> pd.DataFrame:
+        if col not in df.columns:
+            raise ValueError(f"Column '{col}' not found.")
+        dtype = action.dtype.value if action.dtype else "float"
+        if dtype in ("int", "float"):
+            coerced = pd.to_numeric(df[col], errors="coerce")
+            if dtype == "int":
+                df[col] = coerced.astype("Int64")
+            else:
+                df[col] = coerced.astype("float64")
+        else:
+            df[col] = df[col].astype(str)
+        return df
+    def _normalize(self, df: pd.DataFrame, col: str) -> pd.DataFrame:
+        if col not in df.columns:
+            raise ValueError(f"Column '{col}' not found.")
+        numeric = pd.to_numeric(df[col], errors="coerce")
+        mean = numeric.mean()
+        std = numeric.std()
+        if pd.isna(mean) or std == 0 or pd.isna(std):
+            return df
+        df[col] = (numeric - mean) / std
+        return df
+    def _drop_outliers(
+        self, df: pd.DataFrame, col: str, action: CleaningAction
+    ) -> pd.DataFrame:
+        if col not in df.columns:
+            raise ValueError(f"Column '{col}' not found.")
+        numeric = pd.to_numeric(df[col], errors="coerce")
+        method = action.method.value if action.method else "iqr"
+        if method == "iqr":
+            q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
+            iqr = q3 - q1
+            if iqr == 0:
+                return df
+            mask = numeric.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr) | numeric.isna()
+        else:  # zscore
+            mean, std = numeric.mean(), numeric.std()
+            if std == 0 or pd.isna(std):
+                return df
+            z = (numeric - mean) / std
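+            # Keep NaN rows, matching the IQR branch (NaN < 3 is False and would drop them).
+            mask = (z.abs() < 3) | numeric.isna()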
+        return df[mask].reset_index(drop=True)
+    def _fix_schema_violation(
+        self, df: pd.DataFrame, col: str, action: CleaningAction, ep: dict
+    ) -> pd.DataFrame:
+        if col not in df.columns:
+            raise ValueError(f"Column '{col}' not found.")
+        numeric = pd.to_numeric(df[col], errors="coerce")
+        constraint = action.constraint.value if action.constraint else "non_negative"
+        if constraint == "non_negative":
+            df[col] = numeric.clip(lower=0)
+        elif constraint == "clamp_range":
+            clean_col = pd.to_numeric(ep["clean_df"][col], errors="coerce")
+            lo, hi = clean_col.quantile(0.05), clean_col.quantile(0.95)
+            df[col] = numeric.clip(lo, hi)
+        return df
+    # ------------------
+    # Reward computation
+    # ---------------------------
+    def _column_accuracy(
+        self, current_df: pd.DataFrame, clean_df: pd.DataFrame
+    ) -> float:
+        """
+        Mean fraction of values matching clean ground truth, averaged across columns.
+        Rows are compared by position over the first min(len(current_df), len(clean_df)) rows.
+        """
+        scores: list[float] = []
+        common_cols = [c for c in clean_df.columns if c in current_df.columns]
+        n_rows = min(len(current_df), len(clean_df))
+        for col in common_cols:
+            cur = current_df[col].iloc[:n_rows].reset_index(drop=True)
+            cln = clean_df[col].iloc[:n_rows].reset_index(drop=True)
+            cur_num = pd.to_numeric(cur, errors="coerce")
+            cln_num = pd.to_numeric(cln, errors="coerce")
+            if cln_num.notna().mean() > 0.9:
+                both_valid = cur_num.notna() & cln_num.notna()
+                if both_valid.sum() == 0:
+                    scores.append(0.0)
+                    continue
+                match = (cur_num - cln_num).abs() < 1e-4
+                scores.append(float(match.mean()))
+            else:
+                scores.append(float((cur.astype(str) == cln.astype(str)).mean()))
+        return float(np.mean(scores)) if scores else 0.0
+    # ----------------------
+    # Observation assembly
+    # ---------------
+    def _make_observation(self, episode_id: str, reward: float) -> Observation:
+        ep = self.episodes[episode_id]
+        df = ep["current_df"]
+        clean_df = ep["clean_df"]
+        column_issues: dict[str, ColumnIssues] = {}
+        column_stats: dict[str, ColumnStats] = {}
+        n_dups = int(df.duplicated().sum())
+        for col in clean_df.columns:
+            if col not in df.columns:
+                continue
+            col_data = df[col]
+            numeric = pd.to_numeric(col_data, errors="coerce")
+            clean_numeric = pd.to_numeric(clean_df[col], errors="coerce")
+            # Type errors: cells that are non-null in the raw column but cannot be parsed as numeric
+            type_errs = 0
+            if clean_numeric.notna().mean() > 0.9:
+                raw_nulls = col_data.isna().sum()
+                numeric_nulls = numeric.isna().sum()
+                type_errs = max(0, int(numeric_nulls - raw_nulls))
+            # Outlier count via IQR
+            outlier_count = 0
+            if numeric.notna().sum() > 4:
+                q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
+                iqr = q3 - q1
+                if iqr > 0:
+                    outlier_count = int(
+                        ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
+                    )
+            column_issues[col] = ColumnIssues(
+                missing_count=int(col_data.isna().sum()),
+                missing_pct=round(float(col_data.isna().mean()), 4),
+                type_errors=type_errs,
+                outlier_count=outlier_count,
+                has_duplicates=n_dups > 0,
+            )
+            column_stats[col] = ColumnStats(
+                mean=(
+                    round(float(numeric.mean()), 4) if numeric.notna().any() else None
+                ),
+                std=(
+                    round(float(numeric.std()), 4)
+                    if numeric.notna().sum() > 1
+                    else None
+                ),
+                null_count=int(col_data.isna().sum()),
+                unique_count=int(col_data.nunique()),
+            )
+        return Observation(
+            task=ep["task"],
+            step=ep["step"],
+            max_steps=ep["max_steps"],
+            columns=list(clean_df.columns),
+            column_issues=column_issues,
+            column_stats=column_stats,
+            reward=reward,
+            done=ep["done"],
+        )
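The per-step reward in `step()` is the change in column accuracy, clipped to `REWARD_CLIP`. The sketch below reproduces that arithmetic on plain dictionaries of lists so the numbers can be followed by hand. It is a simplification, not the environment's code: it uses exact equality where `_column_accuracy` allows a 1e-4 tolerance on numeric columns, and `column_accuracy` and `step_reward` are hypothetical names.

```python
REWARD_CLIP = 0.1  # same value as the constant in environment.py


def column_accuracy(current: dict[str, list], clean: dict[str, list]) -> float:
    """Mean over shared columns of the fraction of cells equal to the clean value."""
    scores = []
    for col, clean_vals in clean.items():
        if col not in current:
            continue
        # zip stops at the shorter column, i.e. a positional comparison.
        pairs = list(zip(current[col], clean_vals))
        if pairs:
            scores.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0


def step_reward(prev_acc: float, new_acc: float) -> float:
    """Accuracy delta clipped to [-REWARD_CLIP, +REWARD_CLIP]."""
    return max(-REWARD_CLIP, min(REWARD_CLIP, new_acc - prev_acc))


clean = {"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]}
dirty = {"a": [1.0, None, 3.0, None], "b": [5.0, 6.0, None, 8.0]}
fixed = {"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, None, 8.0]}

before = column_accuracy(dirty, clean)  # (0.5 + 0.75) / 2
after = column_accuracy(fixed, clean)   # (1.0 + 0.75) / 2
reward = step_reward(before, after)     # raw delta 0.25, clipped to 0.1
```

Because the comparison is positional, an action that removes rows shifts every later row against the clean data, which can lower this accuracy even when the removed rows were genuinely bad.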

data_cleaning_env/server/requirements.txt ADDED

	@@ -0,0 +1,8 @@

+fastapi>=0.115
+uvicorn[standard]>=0.30
+pydantic>=2.0
+scikit-learn>=1.4
+pandas>=2.0
+numpy>=1.26
+requests>=2.31
+httpx>=0.27

docs/brainstorms/2026-03-27-data-cleaning-env-requirements.md ADDED

	@@ -0,0 +1,143 @@

+---
+date: 2026-03-27
+topic: data-cleaning-openenv
+---
+# Data Cleaning RL Environment (OpenEnv Hackathon Round 1)
+## Problem Frame
+AI agents have no standardized way to practice and be evaluated on real-world data cleaning tasks.
+This environment gives an RL agent a dirty tabular dataset and lets it iteratively apply structured
+cleaning actions, rewarded by how much it improves data quality — measured both incrementally
+(column-level accuracy per step) and holistically (downstream ML model accuracy at episode end).
+Target users: RL researchers and LLM training practitioners who want to train/evaluate agents on
+realistic, verifiable data engineering tasks.
+Competition constraint: must comply fully with the OpenEnv spec and be deployed to Hugging Face Spaces by 7 Apr 11:59 PM.
+---
+## Requirements
+- **R1.** The environment simulates a tabular data cleaning task: the agent receives a dirty dataset
+  as its observation and applies typed cleaning commands until the episode ends or a max-step limit
+  is reached.
+- **R2.** The action space consists of structured, typed commands:
+  - `fill_missing(column, strategy)` — strategy in {mean, median, mode, constant}
+  - `drop_duplicates()`
+  - `fix_type(column, dtype)` — dtype in {int, float, str, datetime}
+  - `normalize(column)` — z-score normalization
+  - `drop_outliers(column, method)` — method in {iqr, zscore}
+  - `fix_schema_violation(column, constraint)` — e.g., clamp to valid range
+  - `done()` — signal episode completion
+- **R3.** The observation at each step contains:
+  - Current dirty dataset (serialized as JSON records or column stats summary)
+  - List of detected issues per column (missing count, type errors, duplicate count, outlier count)
+  - Current step number and remaining steps
+- **R4.** Three task tiers, each using a different OpenML dataset with injected noise:
+  - **Task 1 (Easy):** Missing values only. OpenML dataset: `iris` or `wine`. Score: 0.0–1.0.
+  - **Task 2 (Medium):** Missing values + type errors + duplicate rows. OpenML dataset: `adult` (income). Score: 0.0–1.0.
+  - **Task 3 (Hard):** All of the above + outliers + schema violations (value range constraints). OpenML dataset: `credit-g` or `diabetes`. Score: 0.0–1.0.
+- **R5.** Reward function:
+  - **Per-step reward:** column-level accuracy delta against ground-truth clean dataset. Computed as
+    mean across columns of (correct values / total values). Range: [-0.1, +0.1] per step to
+    discourage harmful actions.
+  - **Episode reward (grader score):** train a simple sklearn classifier/regressor on the cleaned
+    dataset; score = accuracy on a held-out test split. Normalized to [0.0, 1.0] relative to
+    baseline dirty-data accuracy and oracle clean-data accuracy.
+- **R6.** Ground truth: each dirty dataset has a corresponding clean version (the original OpenML
+  dataset before noise injection). The noise injection script is deterministic (seeded RNG) and
+  reproducible.
+- **R7.** The environment server exposes the full OpenEnv spec:
+  - `POST /reset` — initialize episode, return initial observation
+  - `POST /step` — apply action, return (observation, reward, done, info)
+  - `GET /state` — return episode metadata (step count, episode id, task name)
+  - `GET /tasks` — return list of tasks and action schema
+  - `POST /grader` — run grader on completed episode, return score
+  - `POST /baseline` — run baseline inference script, return scores for all 3 tasks
+- **R8.** A baseline inference script (`baseline.py`) uses a simple heuristic agent (e.g., always
+  fill missing with median, drop duplicates, then done) as the reproducible baseline. Must complete
+  without error and produce deterministic scores.
+- **R9.** Deployment: working Dockerfile, deployed to Hugging Face Spaces, `openenv.yaml` manifest.
+- **R10.** README covers environment description, action/observation schema, setup instructions,
+  baseline scores.
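
The per-step reward in R5 can be illustrated with a small standard-library sketch. It assumes columns are held as plain Python lists and that "correct" means exact equality with the ground truth; the helper names `column_accuracy` and `step_reward` and the toy data are hypothetical, not part of the environment.

```python
from statistics import mean


def column_accuracy(current: dict, clean: dict) -> float:
    """Mean over columns of (values matching ground truth / total values)."""
    per_column = []
    for name, truth in clean.items():
        values = current[name]
        matches = sum(1 for got, want in zip(values, truth) if got == want)
        per_column.append(matches / len(truth))
    return mean(per_column)


def step_reward(prev_accuracy: float, new_accuracy: float, bound: float = 0.1) -> float:
    """Accuracy delta between two steps, clipped to [-bound, +bound]."""
    delta = new_accuracy - prev_accuracy
    return max(-bound, min(bound, delta))


# Toy data: None marks an injected missing value.
clean = {"age": [30, 41, 25, 52], "hours": [40, 38, 45, 40]}
dirty = {"age": [30, None, 25, None], "hours": [40, 38, None, 40]}
after_fill = {"age": [30, 41, 25, None], "hours": [40, 38, None, 40]}

before = column_accuracy(dirty, clean)       # (0.5 + 0.75) / 2 = 0.625
after = column_accuracy(after_fill, clean)   # (0.75 + 0.75) / 2 = 0.75
reward = step_reward(before, after)          # delta 0.125 is clipped to 0.1
```

The clip keeps any single action from dominating the return, which matches the stated intent of discouraging harmful actions while still giving a dense signal.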
+---
+## Success Criteria
+- HF Space returns HTTP 200 and responds to `reset()` (automated ping passes)
+- `openenv.yaml` validates against the OpenEnv spec validator
+- Docker image builds without error
+- Baseline script completes and produces scores for all 3 tasks
+- All 3 graders return scores strictly in [0.0, 1.0]
+- Grader scores are meaningfully differentiated across task tiers (harder = lower baseline score)
+- Episode reward correlates with actual data quality improvement (sanity test)
+---
+## Scope Boundaries
+- Not building a general-purpose data cleaning pipeline or production tool — this is a training environment only
+- No natural-language action space; actions are typed/structured only
+- No multi-table / relational joins in v1 (single DataFrame per episode)
+- No real external database connections; datasets are loaded from OpenML at environment startup and cached
+- No UI or visualization — headless server only
+- Downstream ML model is simple (sklearn LogisticRegression or RandomForest, not a deep model) to keep grading fast
+---
+## Key Decisions
+- **OpenML datasets:** Widely known, programmatically accessible via `openml` Python package, no API key needed for public datasets. Clean ground truth = original dataset; dirty = noise-injected copy.
+- **Noise injection is deterministic (seeded):** Ensures reproducibility for judges running the baseline script.
+- **Hybrid reward (per-step + episode):** Gives the agent a dense learning signal during training while reporting a realistic business metric (ML accuracy) as the final grader score.
+- **Structured action space:** Keeps grading deterministic and makes the `/tasks` endpoint action schema self-documenting with no ambiguity.
+- **Simple sklearn baseline model:** Balances realism of the metric with fast grading latency (< 2s per episode grading call).
+---
+## Dependencies / Assumptions
+- OpenML Python package (`openml`) is pip-installable and works inside Docker without external auth for public datasets
+- Chosen OpenML datasets (iris, adult, credit-g / diabetes) are stable and publicly available
+- sklearn is available for baseline model training in the grading step
+- OpenEnv CLI (`openenv`) is available for scaffolding and HF Space deployment
+- Hackathon submission window: 28 Mar – 7 Apr 2026
+---
+## Outstanding Questions
+### Resolve Before Planning
+- None — all product decisions are resolved.
+### Deferred to Planning
+- [Affects R5][Needs research] Exact normalization formula for episode reward: how to compute the
+  oracle clean-data accuracy upper bound (run on unmodified OpenML data) and dirty-data lower bound
+  (run on fully noise-injected data) to bracket the [0.0, 1.0] range correctly.
+- [Affects R7][Technical] Confirm `/grader` endpoint contract: does it accept episode history or
+  just the final cleaned dataset? Check OpenEnv spec RFC for grader interface.
+- [Affects R9][Technical] Confirm HF Spaces Dockerfile constraints (CPU-only, memory limits) to
+  ensure sklearn training fits within free-tier limits.
+- [Affects R4][Needs research] Verify OpenML dataset IDs and that noise injection levels produce
+  meaningfully different difficulty (e.g., 10% missing for easy, 30% + type errors for medium).
+---
+## Next Steps
+All blocking questions resolved.
+→ `/ce:plan` for structured implementation planning

docs/plans/2026-03-27-001-feat-data-cleaning-openenv-environment-plan.md ADDED Viewed

	@@ -0,0 +1,955 @@

+---
+title: "feat: Data Cleaning RL Environment (OpenEnv Hackathon Round 1)"
+type: feat
+status: active
+date: 2026-03-27
+origin: docs/brainstorms/2026-03-27-data-cleaning-env-requirements.md
+---
+# feat: Data Cleaning RL Environment (OpenEnv Hackathon Round 1)
+## Overview
+Build a fully compliant OpenEnv RL environment where an AI agent cleans tabular datasets by
+issuing structured commands. The environment uses OpenML public datasets with synthetically
+injected noise across three difficulty tiers. The grader rewards the agent based on downstream
+ML model accuracy improvement. Deployed to Hugging Face Spaces with Docker, passing all
+pre-submission validation checks.
+**Deadline:** 7 Apr 2026, 11:59 PM
+**Submission window opens:** 28 Mar 2026
+---
+## Problem Statement / Motivation
+There is no standardized RL environment for tabular data cleaning, despite it being one of the
+most time-consuming real-world tasks for data engineers. This environment gives agents a
+reproducible, objectively scorable task with dense intermediate rewards and a realistic final
+metric (downstream ML accuracy). It is both hackathon-viable and genuinely novel as a
+community environment.
+(see origin: `docs/brainstorms/2026-03-27-data-cleaning-env-requirements.md`)
+---
+## Architecture
+```
+data_cleaning_env/
+├── __init__.py                  # Export Action, Observation, DataCleaningEnv
+├── models.py                    # Pydantic: CleaningAction, Observation, EpisodeState
+├── client.py                    # DataCleaningEnv(EnvClient) — async + sync API
+├── noise_injector.py            # Deterministic noise injection (seeded RNG)
+├── grader.py                    # Episode grader — trains sklearn model, returns 0.0–1.0
+├── datasets.py                  # Load & cache OpenML datasets at startup
+├── baseline.py                  # Heuristic baseline agent script
+├── openenv.yaml                 # OpenEnv manifest
+├── README.md                    # Environment docs
+├── pyproject.toml               # Package metadata + dependencies
+└── server/
+    ├── environment.py           # DataCleaningEnvironment(Environment) — core logic
+    ├── app.py                   # FastAPI app: /reset /step /state /tasks /grader /baseline
+    ├── requirements.txt         # Docker dependencies
+    └── Dockerfile               # Container image
+```
+---
+## Proposed Solution
+### Phase 1: Environment Scaffold & Data Layer
+**Scaffold the OpenEnv project structure** using `openenv init data_cleaning_env`, then build
+the data layer.
+#### 1.1 Dataset Loading (`datasets.py`)
+Use `sklearn.datasets.fetch_openml` (no extra package needed — sklearn is already a dependency):
+```python
+# datasets.py
+from sklearn.datasets import fetch_openml
+import pandas as pd
+TASK_CONFIGS = {
+    "easy":   {"data_id": 61,   "name": "iris",     "target": "class"},
+    "medium": {"data_id": 1590, "name": "adult",    "target": "class"},
+    "hard":   {"data_id": 31,   "name": "credit-g", "target": "class"},
+}
+def load_clean_dataset(task: str) -> tuple[pd.DataFrame, str]:
+    """Returns (clean_df, target_col). Cached after first load."""
+    cfg = TASK_CONFIGS[task]
+    dataset = fetch_openml(data_id=cfg["data_id"], as_frame=True, cache=True)
+    df = dataset.frame.copy()
+    return df, cfg["target"]
+```
+**Dataset IDs confirmed:**
+- iris → OpenML ID 61 (150 rows, 4 numeric features, 3-class)
+- adult → OpenML ID 1590 (48,842 rows, 15 features, binary class) — use a 2,000-row sample for speed
+- credit-g → OpenML ID 31 (1,000 rows, 20 features, binary class)
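
The 2,000-row sample for `adult` needs to be seeded as well, otherwise the clean ground truth would differ between runs. A minimal sketch of that idea using only the standard library; the helper name `sample_row_indices` and the seed value 42 are assumptions for illustration (with pandas, `df.sample(n=2000, random_state=42)` is the analogous call).

```python
import random


def sample_row_indices(n_rows: int, sample_size: int, seed: int = 42) -> list:
    """Pick a reproducible subset of row positions; small datasets are kept whole."""
    if n_rows <= sample_size:
        return list(range(n_rows))
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return sorted(rng.sample(range(n_rows), sample_size))


adult_rows = sample_row_indices(48_842, 2_000)   # adult: subsampled
iris_rows = sample_row_indices(150, 2_000)       # iris: returned unchanged
```

Sorting the sampled positions preserves the original row order, which keeps later row-aligned comparisons against the clean data straightforward.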
+#### 1.2 Noise Injector (`noise_injector.py`)
+Deterministic noise injection with seeded RNG ensures judges reproduce the same dirty dataset:
+```python
+# noise_injector.py
+import numpy as np
+import pandas as pd
+def inject_noise(df: pd.DataFrame, task: str, seed: int = 42) -> pd.DataFrame:
+    rng = np.random.default_rng(seed)
+    dirty = df.copy()
+    if task == "easy":
+        # 15% missing values in numeric columns only
+        for col in dirty.select_dtypes(include="number").columns:
+            mask = rng.random(len(dirty)) < 0.15
+            dirty.loc[mask, col] = np.nan
+    elif task == "medium":
+        # 20% missing values (all columns)
+        for col in dirty.columns:
+            mask = rng.random(len(dirty)) < 0.20
+            dirty.loc[mask, col] = np.nan
+        # 5% type corruption: convert numeric to string "err_<value>"
+        for col in dirty.select_dtypes(include="number").columns[:2]:
+            mask = rng.random(len(dirty)) < 0.05
+            # Cast to object first so string sentinels can sit alongside numbers
+            dirty[col] = dirty[col].astype(object)
+            dirty.loc[mask, col] = dirty.loc[mask, col].apply(
+                lambda x: f"err_{x}" if pd.notna(x) else x
+            )
+        # Inject 3% duplicate rows
+        n_dups = int(len(dirty) * 0.03)
+        dup_rows = dirty.sample(n=n_dups, random_state=seed)
+        dirty = pd.concat([dirty, dup_rows], ignore_index=True)
+    elif task == "hard":
+        # All medium noise plus:
+        # 25% missing values
+        for col in dirty.columns:
+            mask = rng.random(len(dirty)) < 0.25
+            dirty.loc[mask, col] = np.nan
+        # Type errors in 3 columns
+        for col in dirty.select_dtypes(include="number").columns[:3]:
+            mask = rng.random(len(dirty)) < 0.07
+            # Cast to object first so string sentinels can sit alongside numbers
+            dirty[col] = dirty[col].astype(object)
+            dirty.loc[mask, col] = dirty.loc[mask, col].apply(
+                lambda x: f"err_{x}" if pd.notna(x) else x
+            )
+        # Outliers: set 5% of numeric values to 10x the column max
+        for col in dirty.select_dtypes(include="number").columns:
+            col_max = dirty[col].max()
+            if pd.notna(col_max):
+                mask = rng.random(len(dirty)) < 0.05
+                dirty.loc[mask, col] = col_max * 10
+        # Duplicate rows 5%
+        n_dups = int(len(dirty) * 0.05)
+        dup_rows = dirty.sample(n=n_dups, random_state=seed)
+        dirty = pd.concat([dirty, dup_rows], ignore_index=True)
+        # Schema violation: negative values in strictly positive columns
+        pos_cols = dirty.select_dtypes(include="number").columns[:2]
+        for col in pos_cols:
+            mask = rng.random(len(dirty)) < 0.05
+            dirty.loc[mask, col] = dirty.loc[mask, col].abs() * -1
+    return dirty
+```
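
One property of the seeded design is worth spelling out: every column draws its mask from the same generator in sequence, so the result is reproducible only if the column order is also stable. The sketch below shows this with the standard library's `random.Random` standing in for NumPy's generator; `make_missing_masks` and the column names are hypothetical.

```python
import random


def make_missing_masks(columns, n_rows, rate, seed=42):
    """Draw one boolean mask per column from a single seeded generator."""
    rng = random.Random(seed)
    masks = {}
    for col in columns:
        # Columns consume the random stream in order, so column order matters.
        masks[col] = [rng.random() < rate for _ in range(n_rows)]
    return masks


first = make_missing_masks(["sepal_length", "sepal_width"], n_rows=150, rate=0.15)
second = make_missing_masks(["sepal_length", "sepal_width"], n_rows=150, rate=0.15)
swapped = make_missing_masks(["sepal_width", "sepal_length"], n_rows=150, rate=0.15)
```

Here `first` and `second` are identical, while in `swapped` the mask that previously landed on `sepal_length` lands on `sepal_width`. Loading the dataset with a fixed column order is therefore part of the reproducibility guarantee.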
+---
+### Phase 2: Models & Action Space (`models.py`)
+Use Pydantic (already required by OpenEnv FastAPI stack):
+```python
+# models.py
+from pydantic import BaseModel
+from typing import Literal, Optional, List, Dict, Any
+from enum import Enum
+class FillStrategy(str, Enum):
+    mean = "mean"
+    median = "median"
+    mode = "mode"
+    constant = "constant"
+class OutlierMethod(str, Enum):
+    iqr = "iqr"
+    zscore = "zscore"
+class ActionType(str, Enum):
+    fill_missing = "fill_missing"
+    drop_duplicates = "drop_duplicates"
+    fix_type = "fix_type"
+    normalize = "normalize"
+    drop_outliers = "drop_outliers"
+    fix_schema_violation = "fix_schema_violation"
+    done = "done"
+class CleaningAction(BaseModel):
+    action_type: ActionType
+    column: Optional[str] = None        # required for column-level actions
+    strategy: Optional[FillStrategy] = None   # for fill_missing
+    dtype: Optional[str] = None         # for fix_type: "int", "float", "str"
+    method: Optional[OutlierMethod] = None    # for drop_outliers
+    constraint: Optional[str] = None    # for fix_schema_violation: e.g. "non_negative"
+    constant_value: Optional[Any] = None      # for fill_missing with strategy=constant
+class ColumnIssues(BaseModel):
+    missing_count: int
+    missing_pct: float
+    type_errors: int
+    outlier_count: int
+    has_duplicates: bool    # dataset-level, repeated per column for convenience
+class Observation(BaseModel):
+    task: str
+    step: int
+    max_steps: int
+    columns: List[str]
+    column_issues: Dict[str, ColumnIssues]
+    # Compact dataset representation: column stats (not full data — avoids huge payloads)
+    column_stats: Dict[str, Dict[str, Any]]  # {"col": {"mean": .., "null_count": ..}}
+    reward: float           # per-step reward from last action
+    done: bool
+class EpisodeState(BaseModel):
+    episode_id: str
+    task: str
+    step: int
+    max_steps: int
+    score: Optional[float] = None    # only set after grader runs
+```
+**Action schema for `/tasks` endpoint:**
+```json
+{
+  "tasks": ["easy", "medium", "hard"],
+  "action_schema": {
+    "action_type": "string (required): fill_missing | drop_duplicates | fix_type | normalize | drop_outliers | fix_schema_violation | done",
+    "column": "string (optional): target column name",
+    "strategy": "string (optional for fill_missing): mean | median | mode | constant",
+    "dtype": "string (optional for fix_type): int | float | str",
+    "method": "string (optional for drop_outliers): iqr | zscore",
+    "constraint": "string (optional for fix_schema_violation): non_negative | clamp_range",
+    "constant_value": "any (optional for fill_missing with strategy=constant)"
+  }
+}
+```
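
Every field except `action_type` is optional at the schema level, so which fields an action actually needs depends on its type. A minimal sketch of that mapping, assuming the requirements implied by the comments in `models.py`; the `REQUIRED_FIELDS` table and the `missing_fields` helper are illustrative and not part of the planned models.

```python
# Hypothetical per-action requirements, inferred from the comments in models.py.
REQUIRED_FIELDS = {
    "fill_missing": ["column", "strategy"],
    "drop_duplicates": [],
    "fix_type": ["column", "dtype"],
    "normalize": ["column"],
    "drop_outliers": ["column", "method"],
    "fix_schema_violation": ["column", "constraint"],
    "done": [],
}


def missing_fields(action: dict) -> list:
    """Return the names of fields an action payload still needs."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown action_type: {action_type!r}")
    missing = [f for f in REQUIRED_FIELDS[action_type] if action.get(f) is None]
    # A constant fill also needs the value to fill with.
    if (
        action_type == "fill_missing"
        and action.get("strategy") == "constant"
        and action.get("constant_value") is None
    ):
        missing.append("constant_value")
    return missing


ok = missing_fields({"action_type": "fill_missing", "column": "age", "strategy": "median"})
incomplete = missing_fields({"action_type": "drop_outliers", "column": "age"})
```

An agent could run a check like this before calling `/step` to avoid spending a step on an action that the environment would treat as invalid.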
+---
+### Phase 3: Core Environment Logic (`server/environment.py`)
+```python
+# server/environment.py
+import uuid
+import pandas as pd
+import numpy as np
+from datasets import load_clean_dataset
+from noise_injector import inject_noise
+from models import CleaningAction, Observation, EpisodeState, ActionType
+MAX_STEPS = {"easy": 20, "medium": 40, "hard": 60}
+class DataCleaningEnvironment:
+    def __init__(self):
+        self.episodes: dict[str, dict] = {}   # episode_id -> state dict
+    def reset(self, task: str = "easy") -> tuple[Observation, EpisodeState]:
+        episode_id = str(uuid.uuid4())
+        clean_df, target_col = load_clean_dataset(task)
+        dirty_df = inject_noise(clean_df, task)
+        self.episodes[episode_id] = {
+            "task": task,
+            "clean_df": clean_df,
+            "current_df": dirty_df.copy(),
+            "target_col": target_col,
+            "step": 0,
+            "max_steps": MAX_STEPS[task],
+            "done": False,
+            "prev_column_accuracy": self._column_accuracy(dirty_df, clean_df),
+        }
+        obs = self._make_observation(episode_id, reward=0.0)
+        state = EpisodeState(
+            episode_id=episode_id,
+            task=task,
+            step=0,
+            max_steps=MAX_STEPS[task],
+        )
+        return obs, state
+    def step(self, episode_id: str, action: CleaningAction) -> tuple[Observation, float, bool]:
+        ep = self.episodes[episode_id]
+        if ep["done"]:
+            raise ValueError("Episode already done. Call reset().")
+        df = ep["current_df"]
+        invalid_action = False
+        try:
+            df = self._apply_action(df, action, ep)
+            ep["current_df"] = df
+        except Exception:
+            invalid_action = True
+        ep["step"] += 1
+        if action.action_type == ActionType.done or ep["step"] >= ep["max_steps"]:
+            ep["done"] = True
+            reward = 0.0  # terminal step reward is 0; grader score is the final signal
+        elif invalid_action:
+            reward = -0.05  # penalize invalid actions instead of scoring an accuracy delta
+        else:
+            new_accuracy = self._column_accuracy(df, ep["clean_df"])
+            reward = float(np.clip(new_accuracy - ep["prev_column_accuracy"], -0.1, 0.1))
+            ep["prev_column_accuracy"] = new_accuracy
+        obs = self._make_observation(episode_id, reward=reward)
+        return obs, reward, ep["done"]
+    def _apply_action(self, df: pd.DataFrame, action: CleaningAction, ep: dict) -> pd.DataFrame:
+        col = action.column
+        if action.action_type == ActionType.fill_missing:
+            if col not in df.columns:
+                raise ValueError(f"Column {col} not found")
+            # Coerce column to numeric first if possible, then fill
+            df[col] = pd.to_numeric(df[col], errors="coerce")
+            strategy = action.strategy.value
+            # Assign the filled column back; in-place fillna on a column selection
+            # is not reliable under pandas copy-on-write
+            if strategy == "mean":
+                df[col] = df[col].fillna(df[col].mean())
+            elif strategy == "median":
+                df[col] = df[col].fillna(df[col].median())
+            elif strategy == "mode":
+                df[col] = df[col].fillna(df[col].mode().iloc[0])
+            elif strategy == "constant":
+                df[col] = df[col].fillna(action.constant_value)
+        elif action.action_type == ActionType.drop_duplicates:
+            df = df.drop_duplicates().reset_index(drop=True)
+        elif action.action_type == ActionType.fix_type:
+            if col not in df.columns:
+                raise ValueError(f"Column {col} not found")
+            dtype_map = {"int": "Int64", "float": "float64", "str": "str"}
+            df[col] = pd.to_numeric(df[col], errors="coerce").astype(dtype_map.get(action.dtype, "str"))
+        elif action.action_type == ActionType.normalize:
+            if col not in df.columns:
+                raise ValueError(f"Column {col} not found")
+            df[col] = pd.to_numeric(df[col], errors="coerce")
+            mean, std = df[col].mean(), df[col].std()
+            if std > 0:
+                df[col] = (df[col] - mean) / std
+        elif action.action_type == ActionType.drop_outliers:
+            if col not in df.columns:
+                raise ValueError(f"Column {col} not found")
+            numeric_col = pd.to_numeric(df[col], errors="coerce")
+            # Rows where the value is NaN compare as False and are dropped as well
+            if action.method.value == "iqr":
+                q1, q3 = numeric_col.quantile(0.25), numeric_col.quantile(0.75)
+                iqr = q3 - q1
+                mask = numeric_col.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
+            else:  # zscore
+                z = (numeric_col - numeric_col.mean()) / numeric_col.std()
+                mask = z.abs() < 3
+            df = df[mask].reset_index(drop=True)
+        elif action.action_type == ActionType.fix_schema_violation:
+            if col not in df.columns:
+                raise ValueError(f"Column {col} not found")
+            numeric_col = pd.to_numeric(df[col], errors="coerce")
+            if action.constraint == "non_negative":
+                df[col] = numeric_col.clip(lower=0)
+            elif action.constraint == "clamp_range":
+                # Clamp to [p5, p95] of the clean dataset for that column
+                clean_col = pd.to_numeric(ep["clean_df"][col], errors="coerce")
+                df[col] = numeric_col.clip(clean_col.quantile(0.05), clean_col.quantile(0.95))
+        return df
+    def _column_accuracy(self, current_df: pd.DataFrame, clean_df: pd.DataFrame) -> float:
+        """Mean fraction of values matching clean ground truth, per column."""
+        scores = []
+        # Align by index and columns
+        common_cols = [c for c in clean_df.columns if c in current_df.columns]
+        n_rows = min(len(current_df), len(clean_df))
+        for col in common_cols:
+            cur = current_df[col].iloc[:n_rows].reset_index(drop=True)
+            cln = clean_df[col].iloc[:n_rows].reset_index(drop=True)
+            try:
+                cur_num = pd.to_numeric(cur, errors="coerce")
+                cln_num = pd.to_numeric(cln, errors="coerce")
+                match = (cur_num - cln_num).abs() < 1e-6
+                scores.append(match.mean())
+            except Exception:
+                scores.append((cur == cln).mean())
+        return float(np.mean(scores)) if scores else 0.0
+    def _make_observation(self, episode_id: str, reward: float) -> Observation:
+        ep = self.episodes[episode_id]
+        df = ep["current_df"]
+        clean_df = ep["clean_df"]
+        column_issues = {}
+        column_stats = {}
+        n_dups = df.duplicated().sum()
+        for col in clean_df.columns:
+            if col not in df.columns:
+                continue
+            col_data = df[col]
+            numeric = pd.to_numeric(col_data, errors="coerce")
+            # count type errors: values that can't parse as numeric when clean column is numeric
+            clean_numeric = pd.to_numeric(clean_df[col], errors="coerce")
+            type_errs = 0
+            if clean_numeric.notna().mean() > 0.9:   # column is mostly numeric
+                type_errs = int(numeric.isna().sum() - col_data.isna().sum())
+                type_errs = max(0, type_errs)
+            # outlier count via IQR
+            outlier_count = 0
+            if numeric.notna().sum() > 4:
+                q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
+                iqr = q3 - q1
+                if iqr > 0:
+                    outlier_count = int(((numeric < q1 - 1.5*iqr) | (numeric > q3 + 1.5*iqr)).sum())
+            column_issues[col] = {
+                "missing_count": int(col_data.isna().sum()),
+                "missing_pct": round(float(col_data.isna().mean()), 3),
+                "type_errors": type_errs,
+                "outlier_count": outlier_count,
+                "has_duplicates": bool(n_dups > 0),
+            }
+            column_stats[col] = {
+                "mean": round(float(numeric.mean()), 4) if numeric.notna().any() else None,
+                # std needs at least two values; otherwise it is NaN
+                "std": round(float(numeric.std()), 4) if numeric.notna().sum() > 1 else None,
+                "null_count": int(col_data.isna().sum()),
+                "unique_count": int(col_data.nunique()),
+            }
+        return Observation(
+            task=ep["task"],
+            step=ep["step"],
+            max_steps=ep["max_steps"],
+            columns=list(clean_df.columns),
+            column_issues=column_issues,
+            column_stats=column_stats,
+            reward=reward,
+            done=ep["done"],
+        )
+```
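
Both `drop_outliers` with `method="iqr"` and the `outlier_count` field use the same 1.5 × IQR fence rule. A small worked example of that rule, written with the standard library instead of pandas; it assumes `statistics.quantiles(..., method="inclusive")` corresponds to the linear interpolation pandas uses by default, and the function names and readings are made up for illustration.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Values that fall outside the fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]


readings = [1, 2, 3, 4, 5, 6, 7, 8, 100]
bounds = iqr_bounds(readings)       # q1 = 3, q3 = 7, so the fences are (-3.0, 13.0)
flagged = find_outliers(readings)   # [100]
```

Because the injected outliers are ten times the column maximum, they sit far outside these fences, so the IQR rule should flag them as long as they do not make up a large share of the column.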
+---
+### Phase 4: Grader (`grader.py`)
+The grader trains a simple sklearn model on the cleaned dataset and scores using the bracketed
+normalization formula: `(agent_acc - dirty_acc) / (oracle_acc - dirty_acc)`.
+The oracle and dirty baselines are precomputed at server startup (deterministic, cached).
+```python
+# grader.py
+import numpy as np
+import pandas as pd
+from sklearn.compose import ColumnTransformer
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.impute import SimpleImputer
+from sklearn.model_selection import train_test_split
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler
+# Precomputed at startup, keyed by task
+_ORACLE_SCORES: dict[str, float] = {}
+_DIRTY_SCORES: dict[str, float] = {}
+def train_and_score(df: pd.DataFrame, target_col: str) -> float:
+    """Train RandomForest on df, return test accuracy. Returns 0.0 on failure."""
+    try:
+        X = df.drop(columns=[target_col])
+        y = df[target_col].astype(str)
+        # Encode target
+        le = LabelEncoder()
+        y_enc = le.fit_transform(y.fillna("missing"))
+        # Numeric pipeline: impute + scale
+        num_cols = X.select_dtypes(include="number").columns.tolist()
+        cat_cols = X.select_dtypes(exclude="number").columns.tolist()
+        transformers = []
+        if num_cols:
+            transformers.append(("num", Pipeline([
+                ("impute", SimpleImputer(strategy="median")),
+                ("scale", StandardScaler()),
+            ]), num_cols))
+        if cat_cols:
+            # Categorical pipeline: impute, then encode as integers so the forest can consume them
+            transformers.append(("cat", Pipeline([
+                ("impute", SimpleImputer(strategy="most_frequent")),
+                ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
+            ]), cat_cols))
+        preprocessor = ColumnTransformer(transformers, remainder="drop")
+        clf = Pipeline([
+            ("pre", preprocessor),
+            ("clf", RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)),
+        ])
+        X_train, X_test, y_train, y_test = train_test_split(
+            X, y_enc, test_size=0.2, random_state=42, stratify=y_enc
+        )
+        clf.fit(X_train, y_train)
+        return float(clf.score(X_test, y_test))
+    except Exception:
+        return 0.0
+def compute_oracle_and_dirty_baselines(task: str, clean_df: pd.DataFrame,
+                                       dirty_df: pd.DataFrame, target_col: str):
+    """Call once at startup."""
+    _ORACLE_SCORES[task] = train_and_score(clean_df, target_col)
+    _DIRTY_SCORES[task] = train_and_score(dirty_df, target_col)
+def grade_episode(episode_cleaned_df: pd.DataFrame, task: str, target_col: str) -> float:
+    """Returns score in [0.0, 1.0]."""
+    agent_acc = train_and_score(episode_cleaned_df, target_col)
+    oracle = _ORACLE_SCORES.get(task, 1.0)
+    dirty = _DIRTY_SCORES.get(task, 0.0)
+    if oracle <= dirty:
+        return float(agent_acc >= oracle)
+    score = (agent_acc - dirty) / (oracle - dirty)
+    return float(np.clip(score, 0.0, 1.0))
+```
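
A worked example of the bracketed normalization may help when sanity-checking grader output. The sketch below mirrors the arithmetic of `grade_episode` in plain Python; the function name `bracketed_score` and all accuracy values are made up for illustration and are not measured results.

```python
def bracketed_score(agent_acc: float, dirty_acc: float, oracle_acc: float) -> float:
    """Map agent accuracy onto [0, 1] between the dirty and oracle baselines."""
    if oracle_acc <= dirty_acc:
        # Degenerate bracket: noise did not hurt the model, so score is pass/fail.
        return float(agent_acc >= oracle_acc)
    score = (agent_acc - dirty_acc) / (oracle_acc - dirty_acc)
    return max(0.0, min(1.0, score))


halfway = bracketed_score(agent_acc=0.625, dirty_acc=0.5, oracle_acc=0.75)    # 0.125 / 0.25 = 0.5
worse = bracketed_score(agent_acc=0.25, dirty_acc=0.5, oracle_acc=0.75)       # below dirty, clipped to 0.0
better = bracketed_score(agent_acc=0.875, dirty_acc=0.5, oracle_acc=0.75)     # above oracle, clipped to 1.0
degenerate = bracketed_score(agent_acc=0.5, dirty_acc=0.5, oracle_acc=0.5)    # empty bracket, returns 1.0
```

The degenerate branch matters most for the easy tier: if the model trained on dirty data already matches the oracle, the bracket is empty and any agent that does no harm scores 1.0.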
+---
+### Phase 5: FastAPI Server (`server/app.py`)
+Implements all required endpoints: OpenEnv standard + hackathon-specific.
+```python
+# server/app.py
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+from typing import Optional
+from server.environment import DataCleaningEnvironment
+from grader import grade_episode
+from baseline import run_baseline
+from models import CleaningAction
+app = FastAPI(title="Data Cleaning OpenEnv", version="1.0.0")
+env = DataCleaningEnvironment()
+class ResetRequest(BaseModel):
+    task: Optional[str] = "easy"
+class StepRequest(BaseModel):
+    episode_id: str
+    action: CleaningAction
+class GraderRequest(BaseModel):
+    episode_id: str
+# --- Standard OpenEnv endpoints ---
+@app.post("/reset")
+async def reset(req: ResetRequest):
+    obs, state = env.reset(task=req.task)
+    return {"observation": obs.model_dump(), "state": state.model_dump()}
+@app.post("/step")
+async def step(req: StepRequest):
+    try:
+        obs, reward, done = env.step(req.episode_id, req.action)
+        return {"observation": obs.model_dump(), "reward": reward, "done": done, "info": {}}
+    except KeyError:
+        raise HTTPException(status_code=404, detail="Episode not found")
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+@app.get("/state")
+async def state(episode_id: str):
+    ep = env.episodes.get(episode_id)
+    if not ep:
+        raise HTTPException(status_code=404, detail="Episode not found")
+    return {"episode_id": episode_id, "task": ep["task"],
+            "step": ep["step"], "max_steps": ep["max_steps"], "done": ep["done"]}
+# --- Hackathon-required endpoints ---
+@app.get("/tasks")
+async def tasks():
+    return {
+        "tasks": [
+            {
+                "id": "easy",
+                "description": "Fix missing values in the Iris dataset (15% missing, numeric only).",
+                "max_steps": 20,
+            },
+            {
+                "id": "medium",
+                "description": "Fix missing values, type errors, and duplicates in the Adult Income dataset.",
+                "max_steps": 40,
+            },
+            {
+                "id": "hard",
+                "description": "Fix missing values, type errors, duplicates, outliers, and schema violations in the Credit-G dataset.",
+                "max_steps": 60,
+            },
+        ],
+        "action_schema": {
+            "action_type": "string (required): fill_missing | drop_duplicates | fix_type | normalize | drop_outliers | fix_schema_violation | done",
+            "column": "string (optional): target column name",
+            "strategy": "string (for fill_missing): mean | median | mode | constant",
+            "dtype": "string (for fix_type): int | float | str",
+            "method": "string (for drop_outliers): iqr | zscore",
+            "constraint": "string (for fix_schema_violation): non_negative | clamp_range",
+            "constant_value": "any (for fill_missing with strategy=constant)",
+        },
+    }
+@app.post("/grader")
+async def grader(req: GraderRequest):
+    ep = env.episodes.get(req.episode_id)
+    if not ep:
+        raise HTTPException(status_code=404, detail="Episode not found")
+    score = grade_episode(ep["current_df"], ep["task"], ep["target_col"])
+    return {"episode_id": req.episode_id, "task": ep["task"], "score": score}
+@app.post("/baseline")
+async def baseline():
+    scores = run_baseline()
+    return {"baseline_scores": scores}
+@app.get("/health")
+async def health():
+    return {"status": "ok"}
+```
+---
+### Phase 6: Baseline Inference Script (`baseline.py`)
+Heuristic agent: fill all missing with median, fix types (coerce to numeric), drop duplicates, drop outliers via IQR, then call `done`.
+```python
+# baseline.py
+"""
+Heuristic baseline agent for the Data Cleaning environment.
+Run: python baseline.py
+"""
+import requests
+import json
+BASE_URL = "http://localhost:8000"
+def run_single_episode(task: str) -> float:
+    # Reset
+    resp = requests.post(f"{BASE_URL}/reset", json={"task": task})
+    data = resp.json()
+    episode_id = data["state"]["episode_id"]
+    obs = data["observation"]
+    # Strategy: fill_missing → fix_type → drop_duplicates → drop_outliers → done
+    columns = obs["columns"]
+    issues = obs["column_issues"]
+    # 1. Fill missing values (median) for all columns with missing
+    for col, col_issues in issues.items():
+        if col_issues["missing_count"] > 0:
+            requests.post(f"{BASE_URL}/step", json={
+                "episode_id": episode_id,
+                "action": {"action_type": "fill_missing", "column": col, "strategy": "median"}
+            })
+    # 2. Fix type errors
+    for col, col_issues in issues.items():
+        if col_issues["type_errors"] > 0:
+            requests.post(f"{BASE_URL}/step", json={
+                "episode_id": episode_id,
+                "action": {"action_type": "fix_type", "column": col, "dtype": "float"}
+            })
+    # 3. Drop duplicates (once)
+    requests.post(f"{BASE_URL}/step", json={
+        "episode_id": episode_id,
+        "action": {"action_type": "drop_duplicates"}
+    })
+    # 4. Drop outliers for columns with many outliers
+    for col, col_issues in issues.items():
+        if col_issues["outlier_count"] > 5:
+            requests.post(f"{BASE_URL}/step", json={
+                "episode_id": episode_id,
+                "action": {"action_type": "drop_outliers", "column": col, "method": "iqr"}
+            })
+    # 5. Done
+    requests.post(f"{BASE_URL}/step", json={
+        "episode_id": episode_id,
+        "action": {"action_type": "done"}
+    })
+    # Grade
+    resp = requests.post(f"{BASE_URL}/grader", json={"episode_id": episode_id})
+    resp.raise_for_status()  # surface grader errors instead of a KeyError on "score"
+    return resp.json()["score"]
+def run_baseline() -> dict:
+    results = {}
+    for task in ["easy", "medium", "hard"]:
+        score = run_single_episode(task)
+        results[task] = round(score, 4)
+        print(f"  Task {task}: {results[task]:.4f}")
+    return results
+if __name__ == "__main__":
+    print("Running baseline agent...")
+    scores = run_baseline()
+    print(f"\nBaseline scores: {json.dumps(scores, indent=2)}")
+```
+---
+### Phase 7: OpenEnv Configuration Files
+#### `openenv.yaml`
+```yaml
+name: data-cleaning-env
+version: "1.0.0"
+description: "RL environment for tabular data cleaning. Agent issues structured commands to clean dirty datasets from OpenML. Graded by downstream ML model accuracy."
+author: "Yash Marathe"
+tags:
+  - data-engineering
+  - tabular
+  - real-world
+  - openml
+tasks:
+  - id: easy
+    description: "Fix missing values in Iris dataset"
+  - id: medium
+    description: "Fix missing values, type errors, duplicates in Adult Income dataset"
+  - id: hard
+    description: "Fix all noise types in Credit-G dataset including outliers and schema violations"
+server:
+  port: 8000
+  health_endpoint: /health
+```
+#### `server/Dockerfile`
+```dockerfile
+FROM python:3.11-slim
+WORKDIR /app
+COPY server/requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+# Pre-download datasets at build time (avoids cold-start latency)
+RUN python -c "from datasets import load_clean_dataset; \
+    [load_clean_dataset(t) for t in ['easy', 'medium', 'hard']]"
+EXPOSE 8000
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+#### `server/requirements.txt`
+```
+fastapi>=0.115
+uvicorn[standard]>=0.30
+pydantic>=2.0
+scikit-learn>=1.4
+pandas>=2.0
+numpy>=1.26
+openenv-core>=0.1
+requests>=2.31  # imported by baseline.py, which server/app.py calls for /baseline
+```
+---
+### Phase 8: Client (`client.py`)
+```python
+# client.py — standard OpenEnv EnvClient wrapper
+from openenv import EnvClient
+from models import CleaningAction, Observation
+class DataCleaningEnv(EnvClient):
+    async def reset(self, task: str = "easy") -> dict:
+        return await self._post("/reset", {"task": task})
+    async def step(self, episode_id: str, action: CleaningAction) -> dict:
+        return await self._post("/step", {
+            "episode_id": episode_id,
+            "action": action.model_dump()
+        })
+    async def get_state(self, episode_id: str) -> dict:
+        return await self._get(f"/state?episode_id={episode_id}")
+```
+---
+## Technical Considerations
+### Normalization Formula for Episode Score (Deferred to Planning — resolved here)
+```
+score = clip((agent_acc - dirty_acc) / (oracle_acc - dirty_acc), 0.0, 1.0)
+```
+- `oracle_acc` = RandomForest accuracy on original unmodified OpenML dataset
+- `dirty_acc` = RandomForest accuracy on fully noise-injected dataset
+- Precomputed once at startup using `datasets.py` + `noise_injector.py`
+- Edge case: if `oracle_acc == dirty_acc` (noise had no effect) the denominator is zero, so skip the division and return 1.0 if the agent matches the oracle accuracy, else 0.0
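The formula and its edge case can be sketched as a small pure function. This is a minimal illustration, not the contents of `grader.py`: the name `normalized_score`, the `eps` tolerance, and reading "matches oracle" as `agent_acc >= oracle_acc` are all assumptions.

```python
def normalized_score(agent_acc: float, dirty_acc: float, oracle_acc: float,
                     eps: float = 1e-9) -> float:
    """Bracketed normalization: 0.0 at the dirty baseline, 1.0 at the oracle."""
    gap = oracle_acc - dirty_acc
    if abs(gap) < eps:
        # Edge case from the plan: noise had no measurable effect, so the
        # ratio is undefined. Assumption: "matches oracle" means the agent
        # accuracy is at least the oracle accuracy (within eps).
        return 1.0 if agent_acc >= oracle_acc - eps else 0.0
    raw = (agent_acc - dirty_acc) / gap
    # clip(raw, 0.0, 1.0)
    return min(1.0, max(0.0, raw))
```

An agent that makes the data worse than the dirty baseline is clipped to 0.0, and one that happens to beat the oracle is clipped to 1.0, so the score always stays in the range required by R5.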
+### HF Spaces Constraints (Deferred — resolved here)
+- Free HF Spaces: 2 CPU cores, 16GB RAM — sufficient for RandomForest on these small datasets
+- adult dataset: use 2,000-row sample to keep grading under 2s
+- Set `n_estimators=50` and `n_jobs=-1` for speed
+- Datasets pre-downloaded at Docker build time via `RUN python -c "..."` step
+### Episode Memory
+- Episodes stored in-memory in server process (`self.episodes` dict)
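+- HF Spaces restarts the container periodically (assumed roughly daily; idle free Spaces also go to sleep), which wipes all in-memory episodes; acceptable for a hackathon env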
+- No persistence needed; each `reset()` creates a fresh episode
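A minimal sketch of such an in-memory registry is shown below. The class name `EpisodeStore` and its methods are hypothetical; only the `episodes` dict and the `task`, `current_df`, and `target_col` keys come from the plan.

```python
import uuid
from typing import Any, Dict


class EpisodeStore:
    """Hypothetical in-memory episode registry keyed by a UUID string."""

    def __init__(self) -> None:
        self.episodes: Dict[str, Dict[str, Any]] = {}

    def create(self, task: str, current_df: Any, target_col: str) -> str:
        # Every reset gets a fresh ID, so old episodes are never overwritten.
        episode_id = str(uuid.uuid4())
        self.episodes[episode_id] = {
            "task": task,
            "current_df": current_df,
            "target_col": target_col,
            "done": False,
        }
        return episode_id

    def get(self, episode_id: str) -> Dict[str, Any]:
        # A KeyError here is what the API layer would translate into a 404.
        return self.episodes[episode_id]
```

Because nothing is written to disk, a container restart simply empties the dict, which matches the "no persistence needed" decision above.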
+### Concurrent Safety
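+- FastAPI handlers are `async` and run on one event loop; the plain dict in `environment.py` is safe under single-process uvicorn as long as a handler reads and updates an episode without awaiting in between
+- For multi-worker deployments each worker would hold its own dict, so switch to shared, process-safe storage (not needed for hackathon)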
+---
+## System-Wide Impact
+### Interaction Graph
+1. Agent calls `POST /reset` → `DataCleaningEnvironment.reset()` → `load_clean_dataset()` → `inject_noise()` → `_make_observation()` → returns `Observation` + `EpisodeState`
+2. Agent calls `POST /step` → `env.step()` → `_apply_action()` → `_column_accuracy()` (per-step reward) → `_make_observation()` → returns updated `Observation`
+3. Agent calls `POST /grader` → `grade_episode()` → `train_and_score(cleaned_df)` → bracketed normalization → float score
+4. Agent calls `POST /baseline` → `run_baseline()` → runs heuristic agent through all 3 tasks → returns dict of scores
+### Error Propagation
+- Invalid `episode_id` → 404 HTTPException (does not crash server)
+- Invalid action (column not found, bad dtype) → `-0.05` reward penalty, observation still returned; action silently no-ops
+- `train_and_score` failure → returns `0.0` (grader never crashes)
+- Dataset download failure at startup → server fails to start; this is caught at Docker build time by the pre-download step
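The two "never crash" rules above can be sketched as follows. This is an illustration under stated assumptions: `ALLOWED_DTYPES`, `reward_for_action`, and `safe_grade` are hypothetical names, and the set of accepted dtypes is invented for the example. Only the `-0.05` penalty and the `0.0` fallback come from the plan.

```python
from typing import Callable, Dict, List

INVALID_ACTION_PENALTY = -0.05  # value taken from the plan
ALLOWED_DTYPES = {"float", "int", "str"}  # assumption, for illustration only


def reward_for_action(action: Dict[str, str], columns: List[str],
                      reward_if_valid: float) -> float:
    """Return the penalty for an invalid action, else the computed reward."""
    column = action.get("column")
    if column is not None and column not in columns:
        return INVALID_ACTION_PENALTY  # unknown column: action is a no-op
    if action.get("action_type") == "fix_type" \
            and action.get("dtype") not in ALLOWED_DTYPES:
        return INVALID_ACTION_PENALTY  # bad dtype: action is a no-op
    return reward_if_valid


def safe_grade(train_and_score: Callable[[], float]) -> float:
    """Run the grader callable; any failure scores 0.0 instead of raising."""
    try:
        return float(train_and_score())
    except Exception:
        return 0.0
```

Catching every exception in `safe_grade` is deliberately broad: it trades debuggability for the guarantee that `/grader` always returns a number.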
+### State Lifecycle
+- `reset()` creates a new episode entry. Episode lives in `self.episodes[episode_id]` until process restart.
+- Calling `step()` after `done=True` raises `ValueError` (surfaced as 400)
+- No partial state corruption risk: `_apply_action` returns a new DataFrame copy on success; original is only replaced on success
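The lifecycle rules can be sketched with a plain list of row dicts standing in for the DataFrame. The `Episode` class and its method names are hypothetical; the sketch only illustrates the copy-then-swap pattern and the `ValueError` after `done`.

```python
import copy
from typing import Callable, Dict, List

Row = Dict[str, object]


class Episode:
    """Hypothetical episode holder illustrating copy-on-success updates."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.done = False

    def step(self, action_fn: Callable[[List[Row]], List[Row]]) -> bool:
        if self.done:
            # The API layer would surface this as HTTP 400.
            raise ValueError("episode already finished")
        candidate = copy.deepcopy(self.rows)  # the action works on a copy
        try:
            candidate = action_fn(candidate)
        except Exception:
            return False  # failure: self.rows was never touched
        self.rows = candidate  # success: swap in the new data
        return True

    def finish(self) -> None:
        self.done = True
```

Even an action that mutates its input and then fails halfway leaves the stored rows intact, because the mutation only ever hits the copy.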
+---
+## Acceptance Criteria
+### Functional Requirements
+- [ ] R1: `POST /reset` with JSON body `{"task": "easy" | "medium" | "hard"}` returns `Observation` with `done=False`
+- [ ] R2: All 7 action types are implemented and affect the DataFrame correctly
+- [ ] R3: Observation includes column issues, column stats, step, reward, done
+- [ ] R4: Three tasks use iris (easy), adult-2k-sample (medium), credit-g (hard) with the defined noise levels
+- [ ] R5: Per-step reward is in `[-0.1, +0.1]`; episode grader score is in `[0.0, 1.0]`
+- [ ] R6: Noise injection with same seed produces identical dirty dataset on every run
+- [ ] R7: All required endpoints respond correctly: `/reset`, `/step`, `/state`, `/tasks`, `/grader`, `/baseline`, `/health`
+- [ ] R8: `baseline.py` runs end-to-end without error and prints scores for all 3 tasks
+- [ ] R9: `docker build` succeeds; container starts and `/health` returns 200
+- [ ] R10: README documents action schema, observation format, setup, and baseline scores
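The determinism property in R6 can be checked with a sketch like the one below. It is not the code in `noise_injector.py`; `inject_missing` is a hypothetical helper that works on a list of row dicts and shows only the key idea, which is a local RNG seeded per call rather than shared global random state.

```python
import random
from typing import Dict, List

Row = Dict[str, object]


def inject_missing(rows: List[Row], column: str, fraction: float,
                   seed: int) -> List[Row]:
    """Blank out a seeded random subset of one column; input is not mutated."""
    rng = random.Random(seed)  # local RNG: unaffected by other random calls
    n_missing = int(len(rows) * fraction)
    chosen = set(rng.sample(range(len(rows)), n_missing))
    return [
        {**row, column: None} if i in chosen else dict(row)
        for i, row in enumerate(rows)
    ]
```

Calling the function twice with the same seed yields identical dirty rows, which is the check an R6 test would make.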
+### Quality Gates
+- [ ] Grader scores differ meaningfully across tasks: easy > medium > hard for the heuristic baseline
+- [ ] Grader score for a perfect clean (oracle) = 1.0
+- [ ] Grader score for an untouched dirty dataset ≈ 0.0
+- [ ] Episode completes in < 5s for easy/medium tasks on a single CPU core
+- [ ] HF Space passes automated ping (HTTP 200 + valid `/reset` response)
+- [ ] `openenv.yaml` validates with `openenv validate`
+---
+## Success Metrics
+(see origin: `docs/brainstorms/2026-03-27-data-cleaning-env-requirements.md#success-criteria`)
+- HF Space automated ping passes
+- OpenEnv spec validator passes
+- Docker build succeeds
+- Baseline script completes with deterministic scores
+- All 3 graders return scores in [0.0, 1.0]
+- Harder tasks produce lower baseline scores (validates difficulty progression)
+---
+## Dependencies & Risks
+| Risk | Likelihood | Mitigation |
+|------|-----------|------|
+| OpenML API down at Docker build time | Low | Add retry logic; cache datasets in repo as CSV fallback |
+| adult dataset too large → grader > 2s | Medium | Use 2,000-row sample; set `n_estimators=50` |
+| HF Spaces memory limit exceeded | Low | RandomForest on <2k rows uses <200MB RAM |
+| OpenEnv spec changes before deadline | Low | Pin `openenv-core` version in requirements.txt |
+| Noise levels don't produce meaningful difficulty gap | Medium | Verify with a quick manual test run before submission |
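The first mitigation (retry, then fall back to a CSV cached in the repo) could look roughly like this. `load_with_retry` and its parameters are hypothetical; the caller would pass in a function that wraps `fetch_openml` and another that reads the cached CSV.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_retry(fetch: Callable[[], T], fallback: Callable[[], T],
                    retries: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `fetch` up to `retries` times with exponential backoff,
    then fall back to the local copy."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return fallback()
```

The `sleep` parameter exists only so tests can skip the real delay.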
+---
+## Implementation Order
+```
+Day 1 (Mar 27-28): Scaffold + Data Layer
+  ├── openenv init data_cleaning_env
+  ├── datasets.py (fetch_openml, cache)
+  ├── noise_injector.py (3 task levels, seeded)
+  └── Smoke test: verify 3 datasets load and noise is visible
+Day 2 (Mar 29-30): Core Environment + Models
+  ├── models.py (Pydantic models)
+  ├── server/environment.py (_apply_action, _column_accuracy, _make_observation)
+  └── Unit test: step through all 7 action types, verify reward sign
+Day 3 (Mar 31 - Apr 1): Grader + API
+  ├── grader.py (train_and_score, oracle/dirty precompute, grade_episode)
+  ├── server/app.py (all endpoints)
+  └── Integration test: full episode reset→step*N→grader for each task
+Day 4 (Apr 2-3): Baseline + Packaging
+  ├── baseline.py (heuristic agent, run_baseline)
+  ├── client.py (EnvClient wrapper)
+  ├── openenv.yaml, Dockerfile, requirements.txt
+  └── docker build && docker run test
+Day 5 (Apr 4-5): Deploy + README + Validation
+  ├── openenv push (HF Spaces deploy)
+  ├── README.md (action schema, observation format, setup, baseline scores)
+  ├── openenv validate
+  └── Submit via hackathon portal
+```
+---
+## Sources & References
+### Origin
+- **Origin document:** [`docs/brainstorms/2026-03-27-data-cleaning-env-requirements.md`](../brainstorms/2026-03-27-data-cleaning-env-requirements.md)
+  Key decisions carried forward: structured action space, OpenML datasets (iris/adult/credit-g),
+  hybrid reward (per-step column accuracy + episode ML accuracy), simple sklearn RandomForest grader.
+### External References
+- OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
+- OpenEnv docs: https://meta-pytorch.org/OpenEnv/
+- OpenML dataset IDs: iris=61, adult=1590, credit-g=31
+- `sklearn.datasets.fetch_openml`: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html
+- OpenEnv env structure: `openenv init` scaffold (see README: `envs/README.md`)
+- HF Spaces deployment: `openenv push` CLI command
+- Hackathon submission deadline: 7 Apr 2026, 11:59 PM