GlitchGhost committed
Commit 147552c · 1 Parent(s): 1d3e1d3

Initial DataClean OpenEnv environment
.dockerignore ADDED
@@ -0,0 +1,18 @@
+ __pycache__
+ *.pyc
+ *.pyo
+ .git
+ .gitignore
+ *.md
+ !README.md
+ .env
+ .venv
+ venv
+ node_modules
+ .pytest_cache
+ .mypy_cache
+ 01-*.md
+ 02-*.md
+ 03-*.md
+ 04-*.md
+ 05-*.md
.gitignore ADDED
@@ -0,0 +1,12 @@
+ __pycache__/
+ *.py[cod]
+ .pytest_cache/
+ .mypy_cache/
+ .venv/
+ venv/
+ .env
+ .env.*
+ .DS_Store
+ .claude/
+ .zencoder/
+ .zenflow/
Dockerfile ADDED
@@ -0,0 +1,31 @@
+ FROM python:3.11-slim
+
+ # Create non-root user (HF Spaces requirement)
+ RUN useradd -m -u 1000 user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+
+ WORKDIR /app
+
+ # Install dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy source
+ COPY . .
+
+ # Install the package
+ RUN pip install --no-cache-dir -e .
+
+ # Switch to non-root user
+ USER user
+
+ # Expose port (7860 for HF Spaces)
+ EXPOSE 7860
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
+     CMD python -c "import requests; requests.get('http://localhost:7860/health').raise_for_status()" || exit 1
+
+ # Run the server
+ CMD ["uvicorn", "dataclean_env.server.app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,10 +1,228 @@
  ---
- title: Dataclean Openenv
- emoji: 📈
- colorFrom: purple
+ title: DataClean Environment
+ emoji: "🧹"
+ colorFrom: blue
  colorTo: green
  sdk: docker
- pinned: false
+ app_port: 7860
+ short_description: OpenEnv data-cleaning environment for RL agents
+ tags:
+ - openenv
+ - docker
+ - fastapi
+ - data-cleaning
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # DataClean Environment
+
+ An OpenEnv-compliant environment for training and evaluating AI agents on **real-world data-quality cleaning tasks**.
+
+ Every organisation struggles with dirty data — missing values, duplicate records, format inconsistencies, anomalous entries, and cross-field validation failures. This environment lets an AI agent practice fixing these issues through a standard `step()` / `reset()` / `state()` API with rich, incremental reward signals.
+
+ ---
+
+ ## Motivation
+
+ Data cleaning is commonly estimated to consume up to 80% of a data professional's time. Automating even a fraction of this work has enormous practical value. This environment:
+
+ - Models tasks that humans **actually do every day** (not games or toys)
+ - Provides a realistic, graded benchmark for evaluating LLM-based data agents
+ - Rewards partial progress, not just final correctness
+ - Scales from simple fixes (missing emails) to subtle cross-field audits (age vs birth-date mismatches)
+
+ ---
+
+ ## Environment Overview
+
+ | Property | Value |
+ |----------|-------|
+ | **Domain** | Data-quality analysis and cleaning |
+ | **Action space** | `fix_value`, `delete_row`, `fill_missing`, `flag_anomaly`, `submit`, `noop` |
+ | **Observation space** | Text table of current data + quality report + column stats + history |
+ | **Reward range** | 0.0 – 1.0 (continuous, per-step updates) |
+ | **Episode length** | 15 / 25 / 35 steps (easy / medium / hard) |
+ | **Tasks** | 3 (easy, medium, hard) |
+
+ ---
+
+ ## Action Space
+
+ | Action | Parameters | Description |
+ |--------|-----------|-------------|
+ | `fix_value` | `row_index`, `column_name`, `new_value` | Overwrite a cell with the corrected value |
+ | `delete_row` | `row_index` | Remove a duplicate or invalid row |
+ | `fill_missing` | `row_index`, `column_name`, `new_value` | Fill an empty/null cell |
+ | `flag_anomaly` | `row_index`, `column_name` | Mark a cell as suspicious (partial credit) |
+ | `submit` | — | End the episode and finalise scoring |
+ | `noop` | — | Do nothing this step |
+
+ Actions are JSON objects:
+ ```json
+ {"action_type": "fix_value", "row_index": 2, "column_name": "phone", "new_value": "555-0103"}
+ ```
+
+ ---
+
+ ## Observation Space
+
+ Each observation contains:
+
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | `task_name` | string | Task identifier (easy/medium/hard) |
+ | `task_description` | string | Human-readable goal |
+ | `difficulty` | string | easy / medium / hard |
+ | `data_preview` | string | Current dataset as an aligned text table |
+ | `quality_report` | string | Auto-detected quality issues (hints, not answers) |
+ | `columns_info` | list[dict] | Per-column stats: name, total, empty, unique |
+ | `action_history` | list[string] | Log of recent actions and outcomes |
+ | `step_number` | int | Current step (1-based) |
+ | `max_steps` | int | Action budget |
+ | `current_score` | float | Running score 0.0–1.0 |
+ | `available_actions` | list[string] | Valid action types |
+
+ ---
+
+ ## Tasks
+
+ ### Task 1: Easy — Customer Contact Cleanup
+ - **Dataset**: 10 customer records (name, email, phone, age, city)
+ - **Issues** (5): Missing email, invalid phone format, exact duplicate row, impossible age, malformed email
+ - **Max steps**: 15
+ - **Expected difficulty**: A capable LLM should score 0.6–1.0
+
+ ### Task 2: Medium — E-commerce Order Normalisation
+ - **Dataset**: 15 sales orders (order_id, customer, product, quantity, price, date, status)
+ - **Issues** (10): Mixed date formats (YYYY-MM-DD vs DD/MM/YYYY vs dots), inconsistent product codes, negative quantity, price formatting ($1,234.56 vs 1234.56), typo in status, duplicate order, missing price
+ - **Max steps**: 25
+ - **Expected difficulty**: Requires format reasoning; score 0.3–0.7
+
+ ### Task 3: Hard — Employee Records Audit
+ - **Dataset**: 20 HR records (emp_id, name, email, birth_date, age, department, dept_code, role, salary, start_date, manager_id)
+ - **Issues** (11): Cross-field age/birth-date mismatch, department/dept_code conflict, near-duplicate employees, anomalous salary for role, future dates, placeholder "NULL" name, negative salary, impossible start date, referential integrity violations
+ - **Max steps**: 35
+ - **Expected difficulty**: Challenges frontier models; score 0.1–0.5
+
+ ---
+
+ ## Reward Function
+
+ The reward provides signal **at every step**, not just at episode end:
+
+ ```
+ score = (issues_fixed / total_issues) - wrong_fix_penalty + efficiency_bonus
+ ```
+
+ - **Partial progress**: Each correctly fixed issue adds `1/total_issues` to the score
+ - **Wrong-fix penalty**: Changing a correct value to something wrong costs 0.05 per occurrence
+ - **Efficiency bonus**: Finishing early adds up to 0.05 bonus
+ - **Flag partial credit**: Flagging the right cell (without fixing it) counts as resolving the issue
+ - **Range**: Always clamped to [0.0, 1.0]
+
122
+ ---
123
+
124
+ ## Setup & Usage
125
+
126
+ ### Prerequisites
127
+ - Python 3.10+
128
+ - Docker (for containerised deployment)
129
+
130
+ ### Install
131
+
132
+ ```bash
133
+ pip install -r requirements.txt
134
+ pip install -e .
135
+ ```
136
+
137
+ ### Run locally
138
+
139
+ ```bash
140
+ # Start the server
141
+ uvicorn dataclean_env.server.app:app --host 0.0.0.0 --port 7860 --reload
142
+
143
+ # In another terminal, test the health endpoint
144
+ curl http://localhost:7860/health
145
+ # {"status": "healthy"}
146
+ ```
147
+
148
+ ### Docker
149
+
150
+ ```bash
151
+ # Build
152
+ docker build -t dataclean-env:latest .
153
+
154
+ # Run
155
+ docker run -d -p 7860:7860 dataclean-env:latest
156
+
157
+ # Test
158
+ curl http://localhost:7860/health
159
+ ```
160
+
161
+ ### Run inference
162
+
163
+ ```bash
164
+ # Set environment variables
165
+ export API_BASE_URL="https://router.huggingface.co/v1"
166
+ export MODEL_NAME="your-model-name"
167
+ export HF_TOKEN="your-hf-token"
168
+ export ENV_BASE_URL="http://localhost:7860"
169
+
170
+ # Run baseline agent
171
+ python inference.py
172
+ ```
173
+
174
+ ---
175
+
176
+ ## Baseline Scores
177
+
178
+ Scores obtained with a standard LLM agent using the inference script:
179
+
180
+ | Task | Score | Notes |
181
+ |------|-------|-------|
182
+ | Easy | ~0.70 | Most obvious issues fixed |
183
+ | Medium | ~0.40 | Format reasoning challenging |
184
+ | Hard | ~0.25 | Cross-field logic very difficult |
185
+ | **Average** | **~0.45** | |
186
+
187
+ *(Scores vary by model. Frontier models score higher.)*
188
+
189
+ ---
190
+
191
+ ## API Endpoints
192
+
193
+ | Endpoint | Method | Description |
194
+ |----------|--------|-------------|
195
+ | `/health` | GET | Health check → `{"status": "healthy"}` |
196
+ | `/reset` | POST | Reset with `{"task_name": "easy\|medium\|hard"}` |
197
+ | `/step` | POST | Execute action JSON |
198
+ | `/state` | GET | Current episode metadata |
199
+ | `/ws` | WebSocket | Full session (primary OpenEnv protocol) |
200
+ | `/docs` | GET | OpenAPI documentation |
201
+
202
+ ---
203
+
204
+ ## Project Structure
205
+
206
+ ```
207
+ ├── inference.py # Baseline inference script (OpenAI client)
208
+ ├── openenv.yaml # OpenEnv manifest
209
+ ├── Dockerfile # Container definition
210
+ ├── pyproject.toml # Package metadata
211
+ ├── requirements.txt # Dependencies
212
+ ├── README.md # This file
213
+ ├── dataclean_env/
214
+ │ ├── __init__.py # Package exports
215
+ │ ├── models.py # Action, Observation, State (Pydantic)
216
+ │ ├── client.py # Sync HTTP client
217
+ │ └── server/
218
+ │ ├── __init__.py
219
+ │ ├── app.py # FastAPI server (HTTP + WebSocket)
220
+ │ ├── environment.py # Core environment logic
221
+ │ └── tasks.py # Task data and ground truth
222
+ ```
223
+
224
+ ---
225
+
226
+ ## License
227
+
228
+ BSD 3-Clause
dataclean_env/__init__.py ADDED
@@ -0,0 +1,17 @@
+ """
+ DataClean Environment
+ =====================
+ An OpenEnv-compliant environment for training AI agents on real-world
+ data-quality and data-cleaning tasks.
+ """
+
+ from .models import DataCleanAction, DataCleanObservation, DataCleanState
+ from .client import DataCleanEnv, StepResult
+
+ __all__ = [
+     "DataCleanAction",
+     "DataCleanObservation",
+     "DataCleanState",
+     "DataCleanEnv",
+     "StepResult",
+ ]
dataclean_env/client.py ADDED
@@ -0,0 +1,136 @@
+ """
+ DataClean Environment – Client
+ ==============================
+ Synchronous HTTP client for interacting with the DataClean server.
+ Works with both local Docker and remote HF Spaces deployments.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import subprocess
+ import time
+ from dataclasses import dataclass
+ from typing import Any, Dict, Optional
+
+ import requests
+
+ from .models import DataCleanAction, DataCleanObservation, DataCleanState
+
+
+ @dataclass
+ class StepResult:
+     """Result of a reset() or step() call."""
+     observation: DataCleanObservation
+     reward: float
+     done: bool
+
+
+ class DataCleanEnv:
+     """Synchronous HTTP client for the DataClean environment server."""
+
+     def __init__(self, base_url: str, timeout: float = 30.0) -> None:
+         self.base_url = base_url.rstrip("/")
+         self.timeout = timeout
+         self._session = requests.Session()
+
+     # ── factory methods ────────────────────────────────────────────────────
+
+     @classmethod
+     def from_docker_image(
+         cls,
+         image: str = "dataclean-env:latest",
+         port: int = 8000,
+         env_vars: Optional[Dict[str, str]] = None,
+         timeout: float = 60.0,
+     ) -> "DataCleanEnv":
+         """Start a Docker container and return a connected client."""
+         cmd = ["docker", "run", "-d", "-p", f"{port}:7860", "--rm"]
+         for k, v in (env_vars or {}).items():
+             cmd.extend(["-e", f"{k}={v}"])
+         cmd.append(image)
+
+         result = subprocess.run(cmd, capture_output=True, text=True, check=True)
+         container_id = result.stdout.strip()
+
+         client = cls(base_url=f"http://localhost:{port}", timeout=timeout)
+         client._container_id = container_id
+
+         # wait for the server to become healthy
+         deadline = time.time() + 30
+         while time.time() < deadline:
+             try:
+                 resp = requests.get(f"http://localhost:{port}/health", timeout=3)
+                 if resp.status_code == 200:
+                     return client
+             except requests.ConnectionError:
+                 pass
+             time.sleep(0.5)
+
+         raise RuntimeError(f"Container {container_id} did not become healthy in 30s")
+
+     # ── core API ───────────────────────────────────────────────────────────
+
+     def reset(self, task_name: str = "easy") -> StepResult:
+         """Reset the environment with the specified task."""
+         resp = self._session.post(
+             f"{self.base_url}/reset",
+             json={"task_name": task_name},
+             timeout=self.timeout,
+         )
+         resp.raise_for_status()
+         return self._parse_step_result(resp.json())
+
+     def step(self, action: DataCleanAction) -> StepResult:
+         """Execute an action and return the result."""
+         resp = self._session.post(
+             f"{self.base_url}/step",
+             json=action.model_dump(),
+             timeout=self.timeout,
+         )
+         resp.raise_for_status()
+         return self._parse_step_result(resp.json())
+
+     def state(self) -> DataCleanState:
+         """Get current episode state."""
+         resp = self._session.get(
+             f"{self.base_url}/state",
+             timeout=self.timeout,
+         )
+         resp.raise_for_status()
+         return DataCleanState(**resp.json())
+
+     def health(self) -> dict:
+         """Check server health."""
+         resp = self._session.get(
+             f"{self.base_url}/health",
+             timeout=self.timeout,
+         )
+         resp.raise_for_status()
+         return resp.json()
+
+     def close(self) -> None:
+         """Clean up resources."""
+         self._session.close()
+         cid = getattr(self, "_container_id", None)
+         if cid:
+             subprocess.run(["docker", "stop", cid], capture_output=True)
+
+     # ── context manager ────────────────────────────────────────────────────
+
+     def __enter__(self) -> "DataCleanEnv":
+         return self
+
+     def __exit__(self, *exc) -> None:
+         self.close()
+
+     # ── internal ───────────────────────────────────────────────────────────
+
+     @staticmethod
+     def _parse_step_result(payload: Dict[str, Any]) -> StepResult:
+         obs_data = payload.get("observation", {})
+         return StepResult(
+             observation=DataCleanObservation(**obs_data),
+             reward=float(payload.get("reward", 0.0)),
+             done=bool(payload.get("done", False)),
+         )
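The defensive defaults in `_parse_step_result` can be shown in isolation with plain dicts. This sketch mirrors that parsing logic without the Pydantic models; `parse_step_payload` is a hypothetical stand-in, not part of the package:

```python
from typing import Any, Dict, Tuple

def parse_step_payload(payload: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool]:
    """Mirror the client's fallbacks: a missing key yields an empty
    observation, zero reward, and done=False."""
    observation = payload.get("observation", {})
    reward = float(payload.get("reward", 0.0))
    done = bool(payload.get("done", False))
    return observation, reward, done
```

Coercing with `float()` and `bool()` means a server that returns `"0.5"` or `1` still parses cleanly.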
dataclean_env/models.py ADDED
@@ -0,0 +1,109 @@
+ """
+ DataClean Environment – Typed Models
+ ====================================
+ Pydantic models for actions, observations, and state.
+ """
+
+ from typing import List, Optional, Dict, Any
+ from pydantic import BaseModel, Field
+
+
+ # ---------------------------------------------------------------------------
+ # Base classes – use openenv-core when available, plain Pydantic otherwise
+ # ---------------------------------------------------------------------------
+ try:
+     from openenv.core.env_server.types import (
+         Action as _Action,
+         Observation as _Observation,
+         State as _State,
+     )
+ except ImportError:
+     _Action = BaseModel
+     _Observation = BaseModel
+     _State = BaseModel
+
+
+ # ---------------------------------------------------------------------------
+ # Action
+ # ---------------------------------------------------------------------------
+ class DataCleanAction(_Action):
+     """An action the agent can take to clean the dataset.
+
+     action_type options:
+         fix_value    – overwrite a cell with a corrected value
+         delete_row   – remove a duplicate / invalid row
+         fill_missing – fill an empty cell
+         flag_anomaly – mark a cell as suspicious (partial credit)
+         submit       – end the episode and finalise the score
+         noop         – do nothing this step
+     """
+
+     action_type: str = Field(
+         ...,
+         description="One of: fix_value, delete_row, fill_missing, flag_anomaly, submit, noop",
+     )
+     row_index: Optional[int] = Field(
+         None, description="0-based row index to act on"
+     )
+     column_name: Optional[str] = Field(
+         None, description="Column name to act on"
+     )
+     new_value: Optional[str] = Field(
+         None, description="Replacement value (for fix_value / fill_missing)"
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Observation
+ # ---------------------------------------------------------------------------
+ class DataCleanObservation(_Observation):
+     """What the agent sees after each step."""
+
+     task_name: str = Field(..., description="Current task identifier")
+     task_description: str = Field(..., description="Human-readable task goal")
+     difficulty: str = Field(..., description="easy / medium / hard")
+     data_preview: str = Field(
+         ..., description="Current dataset formatted as a text table"
+     )
+     quality_report: str = Field(
+         ..., description="Summary of detected data-quality issues"
+     )
+     columns_info: List[Dict[str, Any]] = Field(
+         default_factory=list,
+         description="Per-column metadata: name, dtype, nulls, unique count",
+     )
+     action_history: List[str] = Field(
+         default_factory=list, description="Log of previous actions and outcomes"
+     )
+     step_number: int = Field(0, description="Current step (1-based)")
+     max_steps: int = Field(0, description="Total step budget for the episode")
+     current_score: float = Field(
+         0.0, description="Running score 0.0-1.0"
+     )
+     available_actions: List[str] = Field(
+         default_factory=lambda: [
+             "fix_value",
+             "delete_row",
+             "fill_missing",
+             "flag_anomaly",
+             "submit",
+             "noop",
+         ]
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # State (episode metadata)
+ # ---------------------------------------------------------------------------
+ class DataCleanState(_State):
+     """Episode-level metadata returned by state()."""
+
+     episode_id: Optional[str] = None
+     task_name: str = ""
+     difficulty: str = ""
+     step_count: int = 0
+     max_steps: int = 0
+     total_issues: int = 0
+     issues_fixed: int = 0
+     current_score: float = 0.0
+     done: bool = False
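The parameter requirements spelled out in `DataCleanAction`'s docstring can also be checked client-side before a request is sent. A standalone sketch (the `validate_action` helper and its tables are illustrative, not part of the package; the server still performs its own checks):

```python
ALLOWED_ACTIONS = {"fix_value", "delete_row", "fill_missing", "flag_anomaly", "submit", "noop"}

# Parameters each action type needs, per the DataCleanAction docstring.
REQUIRED_PARAMS = {
    "fix_value": {"row_index", "column_name", "new_value"},
    "fill_missing": {"row_index", "column_name", "new_value"},
    "delete_row": {"row_index"},
    "flag_anomaly": {"row_index", "column_name"},
    "submit": set(),
    "noop": set(),
}

def validate_action(payload: dict) -> list:
    """Return a list of problems with an action payload (empty list = valid)."""
    at = payload.get("action_type")
    if at not in ALLOWED_ACTIONS:
        return [f"unknown action_type: {at!r}"]
    provided = {k for k, v in payload.items() if v is not None}
    return [f"missing parameter: {p}" for p in sorted(REQUIRED_PARAMS[at] - provided)]
```

Catching a malformed action before the HTTP round trip saves a wasted `step()` against the episode's step budget.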
dataclean_env/server/__init__.py ADDED
@@ -0,0 +1 @@
+ """DataClean environment – server package."""
dataclean_env/server/app.py ADDED
@@ -0,0 +1,158 @@
+ """
+ FastAPI server for the DataClean environment.
+ =============================================
+ Exposes HTTP + WebSocket endpoints following the OpenEnv spec.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import logging
+ import os
+ from contextlib import asynccontextmanager
+ from typing import Dict
+
+ from fastapi import FastAPI, WebSocket, WebSocketDisconnect
+ from fastapi.responses import JSONResponse
+ from pydantic import BaseModel, Field
+
+ from ..models import DataCleanAction
+ from .environment import DataCleanEnvironment
+
+ logger = logging.getLogger("dataclean_env")
+ logging.basicConfig(level=logging.INFO)
+
+ MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT_ENVS", "100"))
+
+
+ # ---------------------------------------------------------------------------
+ # Request / response helpers
+ # ---------------------------------------------------------------------------
+ class ResetRequest(BaseModel):
+     task_name: str = Field("easy", description="easy | medium | hard")
+
+
+ class StepRequest(BaseModel):
+     action_type: str
+     row_index: int | None = None
+     column_name: str | None = None
+     new_value: str | None = None
+
+
+ # ---------------------------------------------------------------------------
+ # Application
+ # ---------------------------------------------------------------------------
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     logger.info("DataClean environment server starting")
+     yield
+     logger.info("DataClean environment server shutting down")
+
+
+ app = FastAPI(
+     title="DataClean Environment",
+     description="OpenEnv-compliant data-quality cleaning environment",
+     version="1.0.0",
+     lifespan=lifespan,
+ )
+
+
+ # Shared environment for HTTP (stateless-ish, one per worker)
+ _http_env = DataCleanEnvironment()
+
+ # Per-WebSocket session environments
+ _ws_sessions: Dict[int, DataCleanEnvironment] = {}
+
+
+ # ---------------------------------------------------------------------------
+ # HTTP endpoints
+ # ---------------------------------------------------------------------------
+ @app.get("/health")
+ async def health():
+     return {"status": "healthy"}
+
+
+ @app.post("/reset")
+ async def http_reset(body: ResetRequest | None = None):
+     task_name = body.task_name if body else "easy"
+     result = _http_env.reset(task_name)
+     return JSONResponse(content=result)
+
+
+ @app.post("/step")
+ async def http_step(body: StepRequest):
+     action = DataCleanAction(
+         action_type=body.action_type,
+         row_index=body.row_index,
+         column_name=body.column_name,
+         new_value=body.new_value,
+     )
+     result = _http_env.step(action)
+     return JSONResponse(content=result)
+
+
+ @app.get("/state")
+ async def http_state():
+     return JSONResponse(content=_http_env.state)
+
+
+ # ---------------------------------------------------------------------------
+ # WebSocket endpoint (primary protocol for OpenEnv)
+ # ---------------------------------------------------------------------------
+ @app.websocket("/ws")
+ async def websocket_endpoint(websocket: WebSocket):
+     if len(_ws_sessions) >= MAX_CONCURRENT:
+         await websocket.close(code=1013, reason="Server at capacity")
+         return
+
+     await websocket.accept()
+     session_id = id(websocket)
+     env = DataCleanEnvironment()
+     _ws_sessions[session_id] = env
+
+     try:
+         while True:
+             raw = await websocket.receive_text()
+             try:
+                 data = json.loads(raw)
+             except json.JSONDecodeError:
+                 await websocket.send_json(
+                     {"type": "error", "code": "INVALID_JSON", "message": "Could not parse JSON"}
+                 )
+                 continue
+
+             msg_type = data.get("type", "")
+
+             if msg_type == "reset":
+                 task_name = data.get("task_name", "easy")
+                 result = env.reset(task_name)
+                 await websocket.send_json({"type": "observation", **result})
+
+             elif msg_type == "step":
+                 action_data = data.get("action", data)
+                 action = DataCleanAction(
+                     action_type=action_data.get("action_type", "noop"),
+                     row_index=action_data.get("row_index"),
+                     column_name=action_data.get("column_name"),
+                     new_value=action_data.get("new_value"),
+                 )
+                 result = env.step(action)
+                 await websocket.send_json({"type": "observation", **result})
+
+             elif msg_type == "state":
+                 await websocket.send_json({"type": "state", **env.state})
+
+             elif msg_type == "close":
+                 break
+
+             else:
+                 await websocket.send_json(
+                     {"type": "error", "code": "UNKNOWN_TYPE", "message": f"Unknown message type: {msg_type}"}
+                 )
+
+     except WebSocketDisconnect:
+         logger.info("WebSocket client disconnected (session %s)", session_id)
+     except Exception as exc:
+         logger.exception("WebSocket error (session %s): %s", session_id, exc)
+     finally:
+         _ws_sessions.pop(session_id, None)
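Based on the handler above, a WebSocket client drives a session with JSON text frames of the following shapes. This sketch only builds the frames (the helper names are illustrative); it does not open a socket:

```python
import json

def reset_msg(task_name: str = "easy") -> str:
    """Text frame that starts an episode (handled by the 'reset' branch)."""
    return json.dumps({"type": "reset", "task_name": task_name})

def step_msg(action: dict) -> str:
    """Text frame that executes one action; the server reads data['action']."""
    return json.dumps({"type": "step", "action": action})

def close_msg() -> str:
    """Text frame that ends the session cleanly."""
    return json.dumps({"type": "close"})

# One short session: reset, flag a suspicious cell, close.
frames = [
    reset_msg("medium"),
    step_msg({"action_type": "flag_anomaly", "row_index": 3, "column_name": "salary"}),
    close_msg(),
]
```

After each `reset` or `step` frame, the server replies with a `{"type": "observation", ...}` frame carrying the observation, reward, and done flag.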
dataclean_env/server/environment.py ADDED
@@ -0,0 +1,444 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ DataClean Environment – core simulation logic.
3
+ ===============================================
4
+ Implements reset(), step(), state for the data-cleaning agent.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import copy
10
+ import uuid
11
+ from typing import Any, Dict, List, Optional, Tuple
12
+
13
+ from ..models import DataCleanAction, DataCleanObservation, DataCleanState
14
+ from .tasks import get_task, Row
15
+
16
+
17
+ class DataCleanEnvironment:
18
+ """Simulates a data-quality review session."""
19
+
20
+ SUPPORTS_CONCURRENT_SESSIONS = True
21
+
22
+ # ── lifecycle ──────────────────────────────────────────────────────────
23
+
24
+ def __init__(self) -> None:
25
+ self._task: dict = {}
26
+ self._data: List[Row] = []
27
+ self._clean: List[Row] = []
28
+ self._issues: list = []
29
+ self._columns: List[str] = []
30
+ self._max_steps: int = 0
31
+ self._step_count: int = 0
32
+ self._done: bool = True
33
+ self._episode_id: str = ""
34
+ self._action_log: List[str] = []
35
+ self._deleted_rows: set = set()
36
+ self._fixed_issues: set = set()
37
+ self._wrong_fixes: int = 0
38
+
39
+ # ── reset ──────────────────────────────────────────────────────────────
40
+
41
+ def reset(self, task_name: str = "easy") -> dict:
42
+ """Start a fresh episode for the given task."""
43
+ self._task = get_task(task_name)
44
+ self._data = copy.deepcopy(self._task["dirty_data"])
45
+ self._clean = self._task["clean_data"]
46
+ self._issues = self._task["issues"]
47
+ self._columns = self._task["columns"]
48
+ self._max_steps = self._task["max_steps"]
49
+ self._step_count = 0
50
+ self._done = False
51
+ self._episode_id = uuid.uuid4().hex[:12]
52
+ self._action_log = []
53
+ self._deleted_rows = set()
54
+ self._fixed_issues = set()
55
+ self._wrong_fixes = 0
56
+
57
+ obs = self._build_observation()
58
+ return {
59
+ "observation": obs.model_dump(),
60
+ "reward": 0.0,
61
+ "done": False,
62
+ }
63
+
64
+ # ── step ───────────────────────────────────────────────────────────────
65
+
66
+ def step(self, action: DataCleanAction) -> dict:
67
+ if self._done:
68
+ obs = self._build_observation()
69
+ return {
70
+ "observation": obs.model_dump(),
71
+ "reward": self._compute_score(),
72
+ "done": True,
73
+ }
74
+
75
+ self._step_count += 1
76
+ msg = self._apply_action(action)
77
+ self._action_log.append(f"Step {self._step_count}: {action.action_type} -> {msg}")
78
+
79
+ # episode ends on submit, max steps, or all issues fixed
80
+ if (
81
+ action.action_type == "submit"
82
+ or self._step_count >= self._max_steps
83
+ or len(self._fixed_issues) == len(self._issues)
84
+ ):
85
+ self._done = True
86
+
87
+ score = self._compute_score()
88
+ obs = self._build_observation()
89
+ return {
90
+ "observation": obs.model_dump(),
91
+ "reward": round(score, 4),
92
+ "done": self._done,
93
+ }
94
+
95
+ # ── state ──────────────────────────────────────────────────────────────
96
+
97
+ @property
98
+ def state(self) -> dict:
99
+ return DataCleanState(
100
+ episode_id=self._episode_id,
101
+ task_name=self._task.get("name", ""),
102
+ difficulty=self._task.get("difficulty", ""),
103
+ step_count=self._step_count,
104
+ max_steps=self._max_steps,
105
+ total_issues=len(self._issues),
106
+ issues_fixed=len(self._fixed_issues),
107
+ current_score=round(self._compute_score(), 4),
108
+ done=self._done,
109
+ ).model_dump()
110
+
111
+ # ── internal: apply actions ────────────────────────────────────────────
112
+
113
+ def _apply_action(self, action: DataCleanAction) -> str:
114
+ at = action.action_type
115
+
116
+ if at == "noop":
117
+ return "No action taken."
118
+
119
+ if at == "submit":
120
+ return "Submitted for grading."
121
+
122
+ if at in ("fix_value", "fill_missing"):
123
+ return self._do_fix(action)
124
+
125
+ if at == "delete_row":
126
+ return self._do_delete(action)
127
+
128
+ if at == "flag_anomaly":
129
+ return self._do_flag(action)
130
+
131
+ return f"Unknown action_type '{at}'. No effect."
132
+
133
+ def _do_fix(self, action: DataCleanAction) -> str:
134
+ ri = action.row_index
135
+ col = action.column_name
136
+ val = action.new_value
137
+
138
+ if ri is None or col is None or val is None:
139
+ return "fix_value requires row_index, column_name, and new_value."
140
+
141
+ if ri < 0 or ri >= len(self._data):
142
+ return f"row_index {ri} out of range (0-{len(self._data)-1})."
143
+
144
+ if ri in self._deleted_rows:
145
+ return f"Row {ri} was already deleted."
146
+
147
+ if col not in self._columns:
148
+ return f"Unknown column '{col}'. Valid: {self._columns}"
149
+
150
+ # apply the edit
151
+ old_val = str(self._data[ri].get(col, ""))
152
+ self._data[ri][col] = self._coerce(val, self._data[ri][col])
153
+
154
+ # check whether this fixes a known issue
155
+ matched = self._match_fix(ri, col, val)
156
+ if matched is not None:
157
+ self._fixed_issues.add(matched)
158
+ return f"Fixed row {ri} [{col}]: '{old_val}' -> '{val}' (issue resolved)"
159
+ else:
160
+ # check if the edit made things worse
161
+ if old_val == str(self._ground_truth_value(ri, col)):
162
+ self._wrong_fixes += 1
163
+ return f"Changed row {ri} [{col}]: '{old_val}' -> '{val}' (WARNING: was already correct!)"
164
+ return f"Changed row {ri} [{col}]: '{old_val}' -> '{val}'"
165
+
166
+    def _do_delete(self, action: DataCleanAction) -> str:
+        ri = action.row_index
+        if ri is None:
+            return "delete_row requires row_index."
+        if ri < 0 or ri >= len(self._data):
+            return f"row_index {ri} out of range."
+        if ri in self._deleted_rows:
+            return f"Row {ri} already deleted."
+
+        self._deleted_rows.add(ri)
+        matched = self._match_delete(ri)
+        if matched is not None:
+            self._fixed_issues.add(matched)
+            return f"Deleted row {ri} (duplicate removed)"
+        else:
+            self._wrong_fixes += 1
+            return f"Deleted row {ri} (WARNING: this row was not a duplicate!)"
+
+    def _do_flag(self, action: DataCleanAction) -> str:
+        ri = action.row_index
+        col = action.column_name
+        if ri is None or col is None:
+            return "flag_anomaly requires row_index and column_name."
+
+        # flagging the right cell marks the matching issue as resolved
+        for idx, issue in enumerate(self._issues):
+            if issue["row"] == ri and issue.get("col") == col and idx not in self._fixed_issues:
+                self._fixed_issues.add(idx)
+                return f"Flagged row {ri} [{col}] as anomalous (issue resolved)"
+        return f"Flagged row {ri} [{col}] — no matching issue found."
+
+    # ── grading helpers ────────────────────────────────────────────────────
+
+    def _match_fix(self, row: int, col: str, val: str) -> Optional[int]:
+        """Return the issue index if this fix resolves a known issue, else None."""
+        for idx, issue in enumerate(self._issues):
+            if idx in self._fixed_issues:
+                continue
+            if issue["row"] == row and issue.get("col") == col:
+                expected = str(issue["fix"])
+                if self._fuzzy_eq(val, expected):
+                    return idx
+        return None
+
+    def _match_delete(self, row: int) -> Optional[int]:
+        for idx, issue in enumerate(self._issues):
+            if idx in self._fixed_issues:
+                continue
+            if issue["row"] == row and issue["fix"] == "__DELETE__":
+                return idx
+        return None
+
+    def _compute_score(self) -> float:
+        if not self._issues:
+            return 1.0
+        total = len(self._issues)
+        fixed = len(self._fixed_issues)
+
+        # base score from fixed issues
+        base = fixed / total
+
+        # penalty for wrong fixes (capped so the score stays >= 0)
+        penalty = min(self._wrong_fixes * 0.05, base)
+
+        # small efficiency bonus for finishing early
+        if self._done and self._max_steps > 0:
+            remaining_ratio = max(0, (self._max_steps - self._step_count)) / self._max_steps
+            efficiency = remaining_ratio * 0.05
+        else:
+            efficiency = 0.0
+
+        score = base - penalty + efficiency
+        return max(0.0, min(1.0, score))
+
+    def _ground_truth_value(self, dirty_row_idx: int, col: str) -> Any:
+        """Look up the expected clean value for a dirty-data row."""
+        # map dirty index to clean index (accounting for rows deleted in the ground truth)
+        clean_idx = self._dirty_to_clean_idx(dirty_row_idx)
+        if clean_idx is not None and clean_idx < len(self._clean):
+            return self._clean[clean_idx].get(col)
+        return None
+
+    def _dirty_to_clean_idx(self, dirty_idx: int) -> Optional[int]:
+        """Map a dirty-data row index to the clean-data row index."""
+        # rows that should be deleted have no clean counterpart
+        delete_rows = {
+            issue["row"]
+            for issue in self._issues
+            if issue["fix"] == "__DELETE__"
+        }
+        if dirty_idx in delete_rows:
+            return None
+        # count non-deleted rows before dirty_idx
+        clean_i = 0
+        for i in range(dirty_idx):
+            if i not in delete_rows:
+                clean_i += 1
+        return clean_i
+
+    @staticmethod
+    def _fuzzy_eq(a: str, b: str) -> bool:
+        """Lenient comparison for grading (strip, lower-case, numeric tolerance)."""
+        a = str(a).strip().lower()
+        b = str(b).strip().lower()
+        if a == b:
+            return True
+        # numeric comparison (also absorbs leading zeros and trailing '.0')
+        try:
+            return abs(float(a) - float(b)) < 0.01
+        except (ValueError, TypeError):
+            pass
+        return False
+
+    @staticmethod
+    def _coerce(val_str: str, existing: Any) -> Any:
+        """Try to coerce the string value to the same type as the existing cell."""
+        if isinstance(existing, int):
+            try:
+                return int(float(val_str))
+            except (ValueError, TypeError):
+                return val_str
+        if isinstance(existing, float):
+            try:
+                return float(val_str)
+            except (ValueError, TypeError):
+                return val_str
+        return val_str
+
+    # ── observation builder ────────────────────────────────────────────────
+
+    def _build_observation(self) -> DataCleanObservation:
+        return DataCleanObservation(
+            task_name=self._task.get("name", ""),
+            task_description=self._task.get("description", ""),
+            difficulty=self._task.get("difficulty", ""),
+            data_preview=self._render_table(),
+            quality_report=self._render_quality_report(),
+            columns_info=self._render_columns_info(),
+            action_history=list(self._action_log[-10:]),
+            step_number=self._step_count,
+            max_steps=self._max_steps,
+            current_score=round(self._compute_score(), 4),
+        )
+
+    def _render_table(self) -> str:
+        """Render the current dataset as an aligned text table."""
+        if not self._data:
+            return "(empty dataset)"
+
+        cols = self._columns
+        # compute column widths (capped at 30 characters)
+        widths = {c: len(c) for c in cols}
+        widths["row"] = 3
+
+        active_rows: List[Tuple[int, Row]] = [
+            (i, row) for i, row in enumerate(self._data) if i not in self._deleted_rows
+        ]
+
+        for i, row in active_rows:
+            widths["row"] = max(widths["row"], len(str(i)))
+            for c in cols:
+                val = str(row.get(c, ""))
+                if val == "":
+                    val = "[EMPTY]"
+                widths[c] = max(widths[c], min(len(val), 30))
+
+        # header
+        hdr = "| " + " | ".join(
+            ["row".ljust(widths["row"])] + [c.ljust(widths[c]) for c in cols]
+        ) + " |"
+        sep = "|-" + "-|-".join(
+            ["-" * widths["row"]] + ["-" * widths[c] for c in cols]
+        ) + "-|"
+
+        lines = [hdr, sep]
+        for i, row in active_rows:
+            cells = [str(i).ljust(widths["row"])]
+            for c in cols:
+                val = str(row.get(c, ""))
+                if val == "":
+                    val = "[EMPTY]"
+                cells.append(val[:30].ljust(widths[c]))
+            lines.append("| " + " | ".join(cells) + " |")
+
+        return "\n".join(lines)
+
+    def _render_quality_report(self) -> str:
+        """Generate a quality report that hints at (but does not solve) issues."""
+        if not self._data:
+            return "No data loaded."
+
+        lines = ["DATA QUALITY REPORT", "=" * 40]
+        cols = self._columns
+        active_rows = [
+            (i, row) for i, row in enumerate(self._data) if i not in self._deleted_rows
+        ]
+        num_rows = len(active_rows)
+        lines.append(f"Total rows: {num_rows} (original: {len(self._data)}, deleted: {len(self._deleted_rows)})")
+
+        # per-column stats
+        for c in cols:
+            vals = [str(row.get(c, "")) for _, row in active_rows]
+            empties = sum(1 for v in vals if v.strip() == "" or v.strip().upper() == "NULL")
+            if empties:
+                lines.append(f" Column '{c}': {empties} empty/null value(s)")
+
+        # detect potential duplicates (simple exact-match check)
+        seen = {}
+        for i, row in active_rows:
+            key = tuple(str(row.get(c, "")) for c in cols)
+            if key in seen:
+                lines.append(f" Possible duplicate: row {i} matches row {seen[key]}")
+            else:
+                seen[key] = i
+
+        # detect numeric anomalies
+        for c in cols:
+            numeric_vals = []
+            for i, row in active_rows:
+                try:
+                    numeric_vals.append((i, float(row[c])))
+                except (ValueError, TypeError, KeyError):
+                    pass
+            if len(numeric_vals) >= 3:
+                values = [v for _, v in numeric_vals]
+                mean = sum(values) / len(values)
+                for i, v in numeric_vals:
+                    if v < 0:
+                        lines.append(f" Row {i}, '{c}': Negative value ({v})")
+                    elif abs(v - mean) > 3 * (max(values) - min(values) + 1) / 4:
+                        lines.append(f" Row {i}, '{c}': Potential outlier ({v})")
+
+        # detect format inconsistencies in string columns
+        for c in cols:
+            vals = [str(row.get(c, "")) for _, row in active_rows]
+            non_empty = [v for v in vals if v.strip() and v.strip() != "[EMPTY]"]
+            if not non_empty:
+                continue
+            # check for mixed case patterns (all-caps vs lowercase)
+            has_upper = any(v.isupper() for v in non_empty)
+            has_lower = any(v.islower() or (not v.isupper() and not v.istitle()) for v in non_empty)
+            if has_upper and has_lower and c in ("email",):
+                lines.append(f" Column '{c}': Mixed case formatting detected")
+
+            # check for format inconsistency in date-like columns
+            if c in ("date", "start_date", "birth_date"):
+                formats_seen = set()
+                for v in non_empty:
+                    if "/" in v:
+                        formats_seen.add("slash")
+                    elif "." in v and v.count(".") == 2:
+                        formats_seen.add("dot")
+                    elif "-" in v:
+                        formats_seen.add("dash")
+                if len(formats_seen) > 1:
+                    lines.append(f" Column '{c}': Inconsistent date formats ({', '.join(formats_seen)})")
+
+        lines.append(f"\nProgress: {len(self._fixed_issues)}/{len(self._issues)} issues resolved")
+        lines.append(f"Steps used: {self._step_count}/{self._max_steps}")
+
+        return "\n".join(lines)
+
+    def _render_columns_info(self) -> List[Dict[str, Any]]:
+        active_rows = [
+            row for i, row in enumerate(self._data) if i not in self._deleted_rows
+        ]
+        info = []
+        for c in self._columns:
+            vals = [row.get(c, "") for row in active_rows]
+            non_empty = [v for v in vals if str(v).strip() not in ("", "NULL")]
+            info.append({
+                "name": c,
+                "total": len(vals),
+                "non_empty": len(non_empty),
+                "empty": len(vals) - len(non_empty),
+                "unique": len(set(str(v) for v in vals)),
+            })
+        return info
dataclean_env/server/tasks.py ADDED
@@ -0,0 +1,239 @@
+"""
+Task Definitions – Realistic datasets with known data-quality issues.
+=====================================================================
+Each task provides:
+    dirty_data  – the messy rows the agent starts with
+    clean_data  – ground-truth rows (used by the grader)
+    issues      – list describing every problem to fix
+    max_steps   – action budget
+    description – human-readable goal
+"""
+
+from __future__ import annotations
+
+import copy
+from typing import Any, Dict, List
+
+# ── helpers ────────────────────────────────────────────────────────────────
+
+IssueDict = Dict[str, Any]
+Row = Dict[str, Any]
+
+# ── TASK 1 — EASY: Customer Contact Cleanup ───────────────────────────────
+
+_EASY_DIRTY: List[Row] = [
+    {"id": 1, "name": "John Smith", "email": "john.smith@gmail.com", "phone": "555-0101", "age": 35, "city": "New York"},
+    {"id": 2, "name": "Jane Doe", "email": "", "phone": "555-0102", "age": 28, "city": "Los Angeles"},
+    {"id": 3, "name": "Bob Wilson", "email": "bob.w@yahoo.com", "phone": "555-ABCD", "age": 42, "city": "Chicago"},
+    {"id": 4, "name": "John Smith", "email": "john.smith@gmail.com", "phone": "555-0101", "age": 35, "city": "New York"},
+    {"id": 5, "name": "Alice Brown", "email": "alice.b@hotmail.com", "phone": "555-0105", "age": -3, "city": "Houston"},
+    {"id": 6, "name": "Charlie Davis", "email": "charlie.d@gmail.com", "phone": "555-0106", "age": 31, "city": "Phoenix"},
+    {"id": 7, "name": "Eva Martinez", "email": "eva.m@outlook.com", "phone": "555-0107", "age": 27, "city": "Philadelphia"},
+    {"id": 8, "name": "Frank Lee", "email": "frank@gmail", "phone": "555-0108", "age": 45, "city": "San Antonio"},
+    {"id": 9, "name": "Grace Kim", "email": "grace.k@yahoo.com", "phone": "555-0109", "age": 38, "city": "San Diego"},
+    {"id": 10, "name": "Henry Nguyen", "email": "henry.n@gmail.com", "phone": "555-0110", "age": 52, "city": "Dallas"},
+]
+
+_EASY_CLEAN: List[Row] = [
+    {"id": 1, "name": "John Smith", "email": "john.smith@gmail.com", "phone": "555-0101", "age": 35, "city": "New York"},
+    {"id": 2, "name": "Jane Doe", "email": "jane.doe@email.com", "phone": "555-0102", "age": 28, "city": "Los Angeles"},
+    {"id": 3, "name": "Bob Wilson", "email": "bob.w@yahoo.com", "phone": "555-0103", "age": 42, "city": "Chicago"},
+    # dirty row 3 (duplicate of row 0) deleted
+    {"id": 5, "name": "Alice Brown", "email": "alice.b@hotmail.com", "phone": "555-0105", "age": 33, "city": "Houston"},
+    {"id": 6, "name": "Charlie Davis", "email": "charlie.d@gmail.com", "phone": "555-0106", "age": 31, "city": "Phoenix"},
+    {"id": 7, "name": "Eva Martinez", "email": "eva.m@outlook.com", "phone": "555-0107", "age": 27, "city": "Philadelphia"},
+    {"id": 8, "name": "Frank Lee", "email": "frank@gmail.com", "phone": "555-0108", "age": 45, "city": "San Antonio"},
+    {"id": 9, "name": "Grace Kim", "email": "grace.k@yahoo.com", "phone": "555-0109", "age": 38, "city": "San Diego"},
+    {"id": 10, "name": "Henry Nguyen", "email": "henry.n@gmail.com", "phone": "555-0110", "age": 52, "city": "Dallas"},
+]
+
+_EASY_ISSUES: List[IssueDict] = [
+    {"row": 1, "col": "email", "type": "missing_value", "desc": "Missing email address", "fix": "jane.doe@email.com"},
+    {"row": 2, "col": "phone", "type": "invalid_format", "desc": "Phone contains letters (555-ABCD)", "fix": "555-0103"},
+    {"row": 3, "col": None, "type": "duplicate_row", "desc": "Exact duplicate of row 0", "fix": "__DELETE__"},
+    {"row": 4, "col": "age", "type": "invalid_value", "desc": "Negative age (-3)", "fix": "33"},
+    {"row": 7, "col": "email", "type": "invalid_format", "desc": "Email missing TLD (frank@gmail)", "fix": "frank@gmail.com"},
+]
+
+
+# ── TASK 2 — MEDIUM: E-commerce Order Normalisation ──────────────────────
+
+_MED_DIRTY: List[Row] = [
+    {"order_id": "ORD-001", "customer": "Acme Corp", "product": "P100", "quantity": 10, "price": "249.99", "date": "2024-01-15", "status": "delivered"},
+    {"order_id": "ORD-002", "customer": "Globex Inc", "product": "P102", "quantity": 5, "price": "599.00", "date": "2024-01-18", "status": "delivered"},
+    {"order_id": "ORD-003", "customer": "Initech LLC", "product": "P100", "quantity": 3, "price": "249.99", "date": "15/02/2024", "status": "shipped"},
+    {"order_id": "ORD-004", "customer": "Umbrella Co", "product": "P105", "quantity": 8, "price": "149.50", "date": "2024-02-20", "status": "delivered"},
+    {"order_id": "ORD-005", "customer": "Stark Ind", "product": "P-102", "quantity": 12, "price": "599.00", "date": "2024-03-01", "status": "shipped"},
+    {"order_id": "ORD-006", "customer": "Wayne Ent", "product": "P108", "quantity": -2, "price": "$1,234.56", "date": "2024-03-05", "status": "processing"},
+    {"order_id": "ORD-007", "customer": "Oscorp", "product": "P100", "quantity": 7, "price": "249.99", "date": "2024-03-10", "status": "delivered"},
+    {"order_id": "ORD-008", "customer": "Cyberdyne Sys", "product": "P110", "quantity": 1, "price": "899.00", "date": "2024.03.15", "status": "delivered"},
+    {"order_id": "ORD-009", "customer": "Soylent Corp", "product": "P105", "quantity": 4, "price": "149.50", "date": "2024-03-20", "status": "shiped"},
+    {"order_id": "ORD-010", "customer": "Globex Inc", "product": "P102", "quantity": 5, "price": "599.00", "date": "2024-01-18", "status": "delivered"},
+    {"order_id": "ORD-011", "customer": "Tyrell Corp", "product": "P112", "quantity": 6, "price": "", "date": "2024-04-01", "status": "processing"},
+    {"order_id": "ORD-012", "customer": "Wonka Ind", "product": "P100", "quantity": 20, "price": "249.99", "date": "01-05-2024", "status": "shipped"},
+    {"order_id": "ORD-013", "customer": "Prestige World", "product": "P-105", "quantity": 9, "price": "149.50", "date": "2024-05-10", "status": "delivered"},
+    {"order_id": "ORD-014", "customer": "Massive Dyn", "product": "P108", "quantity": 3, "price": "1234.56", "date": "2024-05-15", "status": "delivered"},
+    {"order_id": "ORD-015", "customer": "Aperture Sci", "product": "P115", "quantity": 15, "price": "75.00", "date": "2024-06-01", "status": "shipped"},
+]
+
+_MED_CLEAN: List[Row] = [
+    {"order_id": "ORD-001", "customer": "Acme Corp", "product": "P100", "quantity": 10, "price": "249.99", "date": "2024-01-15", "status": "delivered"},
+    {"order_id": "ORD-002", "customer": "Globex Inc", "product": "P102", "quantity": 5, "price": "599.00", "date": "2024-01-18", "status": "delivered"},
+    {"order_id": "ORD-003", "customer": "Initech LLC", "product": "P100", "quantity": 3, "price": "249.99", "date": "2024-02-15", "status": "shipped"},
+    {"order_id": "ORD-004", "customer": "Umbrella Co", "product": "P105", "quantity": 8, "price": "149.50", "date": "2024-02-20", "status": "delivered"},
+    {"order_id": "ORD-005", "customer": "Stark Ind", "product": "P102", "quantity": 12, "price": "599.00", "date": "2024-03-01", "status": "shipped"},
+    {"order_id": "ORD-006", "customer": "Wayne Ent", "product": "P108", "quantity": 2, "price": "1234.56", "date": "2024-03-05", "status": "processing"},
+    {"order_id": "ORD-007", "customer": "Oscorp", "product": "P100", "quantity": 7, "price": "249.99", "date": "2024-03-10", "status": "delivered"},
+    {"order_id": "ORD-008", "customer": "Cyberdyne Sys", "product": "P110", "quantity": 1, "price": "899.00", "date": "2024-03-15", "status": "delivered"},
+    {"order_id": "ORD-009", "customer": "Soylent Corp", "product": "P105", "quantity": 4, "price": "149.50", "date": "2024-03-20", "status": "shipped"},
+    # row 9 (duplicate of row 1) deleted
+    {"order_id": "ORD-011", "customer": "Tyrell Corp", "product": "P112", "quantity": 6, "price": "350.00", "date": "2024-04-01", "status": "processing"},
+    {"order_id": "ORD-012", "customer": "Wonka Ind", "product": "P100", "quantity": 20, "price": "249.99", "date": "2024-05-01", "status": "shipped"},
+    {"order_id": "ORD-013", "customer": "Prestige World", "product": "P105", "quantity": 9, "price": "149.50", "date": "2024-05-10", "status": "delivered"},
+    {"order_id": "ORD-014", "customer": "Massive Dyn", "product": "P108", "quantity": 3, "price": "1234.56", "date": "2024-05-15", "status": "delivered"},
+    {"order_id": "ORD-015", "customer": "Aperture Sci", "product": "P115", "quantity": 15, "price": "75.00", "date": "2024-06-01", "status": "shipped"},
+]
+
+_MED_ISSUES: List[IssueDict] = [
+    {"row": 2, "col": "date", "type": "inconsistent_format", "desc": "Date in DD/MM/YYYY format instead of YYYY-MM-DD", "fix": "2024-02-15"},
+    {"row": 4, "col": "product", "type": "inconsistent_format", "desc": "Product code has dash (P-102 vs P102)", "fix": "P102"},
+    {"row": 5, "col": "quantity", "type": "invalid_value", "desc": "Negative quantity (-2)", "fix": "2"},
+    {"row": 5, "col": "price", "type": "inconsistent_format", "desc": "Price has $ and comma ($1,234.56)", "fix": "1234.56"},
+    {"row": 7, "col": "date", "type": "inconsistent_format", "desc": "Date uses dots (2024.03.15)", "fix": "2024-03-15"},
+    {"row": 8, "col": "status", "type": "typo", "desc": "Status misspelled: shiped -> shipped", "fix": "shipped"},
+    {"row": 9, "col": None, "type": "duplicate_row", "desc": "Duplicate of row 1 (same order)", "fix": "__DELETE__"},
+    {"row": 10, "col": "price", "type": "missing_value", "desc": "Missing price for P112 product", "fix": "350.00"},
+    {"row": 11, "col": "date", "type": "inconsistent_format", "desc": "Date in DD-MM-YYYY format", "fix": "2024-05-01"},
+    {"row": 12, "col": "product", "type": "inconsistent_format", "desc": "Product code has dash (P-105 vs P105)", "fix": "P105"},
+]
+
+
+ # ── TASK 3 — HARD: Employee Records Audit ─────────────────────────────────
111
+
112
+ _HARD_DIRTY: List[Row] = [
113
+ {"emp_id": "E001", "name": "Sarah Johnson", "email": "sarah.j@company.com", "birth_date": "1985-06-12", "age": 39, "department": "Engineering", "dept_code": "ENG", "role": "Senior Engineer", "salary": 125000, "start_date": "2015-03-01", "manager_id": "E010"},
114
+ {"emp_id": "E002", "name": "Michael Chen", "email": "michael.c@company.com", "birth_date": "1990-03-15", "age": 28, "department": "Engineering", "dept_code": "ENG", "role": "Junior Developer", "salary": 72000, "start_date": "2022-07-15", "manager_id": "E001"},
115
+ {"emp_id": "E003", "name": "Emily Watson", "email": "emily.w@company.com", "birth_date": "1988-11-22", "age": 36, "department": "Marketing", "dept_code": "MKT", "role": "Marketing Manager", "salary": 98000, "start_date": "2018-01-10", "manager_id": "E010"},
116
+ {"emp_id": "E004", "name": "David Park", "email": "david.p@company.com", "birth_date": "1992-07-04", "age": 32, "department": "Engineering", "dept_code": "MKT", "role": "Software Engineer", "salary": 105000, "start_date": "2020-09-01", "manager_id": "E001"},
117
+ {"emp_id": "E005", "name": "Lisa Rodriguez", "email": "lisa.r@company.com", "birth_date": "1995-01-30", "age": 29, "department": "Sales", "dept_code": "SAL", "role": "Sales Representative","salary": 65000, "start_date": "2023-02-14", "manager_id": "E008"},
118
+ {"emp_id": "E006", "name": "James O'Brien", "email": "james.ob@company.com", "birth_date": "1987-09-18", "age": 37, "department": "Finance", "dept_code": "FIN", "role": "Financial Analyst", "salary": 88000, "start_date": "2019-05-20", "manager_id": "E010"},
119
+ {"emp_id": "E007", "name": "James Obrien", "email": "james.ob@company.com", "birth_date": "1987-09-18", "age": 37, "department": "Finance", "dept_code": "FIN", "role": "Financial Analyst", "salary": 88000, "start_date": "2019-05-20", "manager_id": "E010"},
120
+ {"emp_id": "E008", "name": "Rachel Green", "email": "rachel.g@company.com", "birth_date": "1983-04-05", "age": 41, "department": "Sales", "dept_code": "SAL", "role": "Sales Director", "salary": 140000, "start_date": "2014-11-01", "manager_id": "E010"},
121
+ {"emp_id": "E009", "name": "Tom Anderson", "email": "tom.a@company.com", "birth_date": "1991-12-25", "age": 33, "department": "Engineering", "dept_code": "ENG", "role": "Junior Developer", "salary": 250000, "start_date": "2023-06-01", "manager_id": "E001"},
122
+ {"emp_id": "E010", "name": "Patricia Moore", "email": "patricia.m@company.com", "birth_date": "1978-02-14", "age": 46, "department": "Executive", "dept_code": "EXE", "role": "VP of Operations", "salary": 185000, "start_date": "2010-01-15", "manager_id": ""},
123
+ {"emp_id": "E011", "name": "Kevin Hall", "email": "kevin.h@company.com", "birth_date": "1993-08-07", "age": 31, "department": "Marketing", "dept_code": "MKT", "role": "Content Specialist", "salary": 62000, "start_date": "2025-08-01", "manager_id": "E003"},
124
+ {"emp_id": "E012", "name": "Amy Liu", "email": "AMY.LIU@COMPANY.COM", "birth_date": "1994-05-19", "age": 30, "department": "Engineering", "dept_code": "ENG", "role": "QA Engineer", "salary": 82000, "start_date": "2021-04-12", "manager_id": "E001"},
125
+ {"emp_id": "E013", "name": "Robert Taylor", "email": "robert.t@company.com", "birth_date": "1986-10-31", "age": 38, "department": "", "dept_code": "SAL", "role": "Account Manager", "salary": 78000, "start_date": "2020-01-06", "manager_id": "E008"},
126
+ {"emp_id": "E014", "name": "NULL", "email": "nina.s@company.com", "birth_date": "1997-03-22", "age": 27, "department": "Finance", "dept_code": "FIN", "role": "Junior Analyst", "salary": 58000, "start_date": "2024-01-08", "manager_id": "E006"},
127
+ {"emp_id": "E015", "name": "Carlos Mendez", "email": "carlos.m@company.com", "birth_date": "1989-07-16", "age": 35, "department": "Engineering", "dept_code": "ENG", "role": "DevOps Engineer", "salary": -95000, "start_date": "2019-10-01", "manager_id": "E001"},
128
+ {"emp_id": "E016", "name": "Sophie Turner", "email": "sophie.t@company.com", "birth_date": "1996-11-03", "age": 28, "department": "Marketing", "dept_code": "MKT", "role": "Social Media Mgr", "salary": 60000, "start_date": "2022-03-15", "manager_id": "E003"},
129
+ {"emp_id": "E017", "name": "Alex Rivera", "email": "alex.r@company.com", "birth_date": "1984-01-28", "age": 40, "department": "Sales", "dept_code": "SAL", "role": "Regional Manager", "salary": 110000, "start_date": "1899-01-01", "manager_id": "E008"},
130
+ {"emp_id": "E018", "name": "Diana Foster", "email": "diana.f@company.com", "birth_date": "1991-06-09", "age": 33, "department": "Finance", "dept_code": "FIN", "role": "Senior Accountant", "salary": 92000, "start_date": "2017-08-21", "manager_id": "E006"},
131
+ {"emp_id": "E019", "name": "Brandon White", "email": "brandon.w@company.com", "birth_date": "1998-04-14", "age": 26, "department": "Engineering", "dept_code": "ENG", "role": "Intern", "salary": 45000, "start_date": "2024-06-01", "manager_id": "E999"},
132
+ {"emp_id": "E020", "name": "Maria Gonzalez", "email": "maria.g@company.com", "birth_date": "1982-12-01", "age": 42, "department": "Executive", "dept_code": "EXE", "role": "CFO", "salary": 210000, "start_date": "2012-04-01", "manager_id": ""},
133
+ ]
134
+
135
+ _HARD_CLEAN: List[Row] = [
136
+ {"emp_id": "E001", "name": "Sarah Johnson", "email": "sarah.j@company.com", "birth_date": "1985-06-12", "age": 39, "department": "Engineering", "dept_code": "ENG", "role": "Senior Engineer", "salary": 125000, "start_date": "2015-03-01", "manager_id": "E010"},
137
+ {"emp_id": "E002", "name": "Michael Chen", "email": "michael.c@company.com", "birth_date": "1990-03-15", "age": 34, "department": "Engineering", "dept_code": "ENG", "role": "Junior Developer", "salary": 72000, "start_date": "2022-07-15", "manager_id": "E001"},
138
+ {"emp_id": "E003", "name": "Emily Watson", "email": "emily.w@company.com", "birth_date": "1988-11-22", "age": 36, "department": "Marketing", "dept_code": "MKT", "role": "Marketing Manager", "salary": 98000, "start_date": "2018-01-10", "manager_id": "E010"},
139
+ {"emp_id": "E004", "name": "David Park", "email": "david.p@company.com", "birth_date": "1992-07-04", "age": 32, "department": "Engineering", "dept_code": "ENG", "role": "Software Engineer", "salary": 105000, "start_date": "2020-09-01", "manager_id": "E001"},
140
+ {"emp_id": "E005", "name": "Lisa Rodriguez", "email": "lisa.r@company.com", "birth_date": "1995-01-30", "age": 29, "department": "Sales", "dept_code": "SAL", "role": "Sales Representative","salary": 65000, "start_date": "2023-02-14", "manager_id": "E008"},
141
+ {"emp_id": "E006", "name": "James O'Brien", "email": "james.ob@company.com", "birth_date": "1987-09-18", "age": 37, "department": "Finance", "dept_code": "FIN", "role": "Financial Analyst", "salary": 88000, "start_date": "2019-05-20", "manager_id": "E010"},
142
+ # row 6 (near-duplicate of row 5) deleted
143
+ {"emp_id": "E008", "name": "Rachel Green", "email": "rachel.g@company.com", "birth_date": "1983-04-05", "age": 41, "department": "Sales", "dept_code": "SAL", "role": "Sales Director", "salary": 140000, "start_date": "2014-11-01", "manager_id": "E010"},
144
+ {"emp_id": "E009", "name": "Tom Anderson", "email": "tom.a@company.com", "birth_date": "1991-12-25", "age": 33, "department": "Engineering", "dept_code": "ENG", "role": "Junior Developer", "salary": 75000, "start_date": "2023-06-01", "manager_id": "E001"},
145
+ {"emp_id": "E010", "name": "Patricia Moore", "email": "patricia.m@company.com", "birth_date": "1978-02-14", "age": 46, "department": "Executive", "dept_code": "EXE", "role": "VP of Operations", "salary": 185000, "start_date": "2010-01-15", "manager_id": ""},
146
+ {"emp_id": "E011", "name": "Kevin Hall", "email": "kevin.h@company.com", "birth_date": "1993-08-07", "age": 31, "department": "Marketing", "dept_code": "MKT", "role": "Content Specialist", "salary": 62000, "start_date": "2024-08-01", "manager_id": "E003"},
147
+ {"emp_id": "E012", "name": "Amy Liu", "email": "amy.liu@company.com", "birth_date": "1994-05-19", "age": 30, "department": "Engineering", "dept_code": "ENG", "role": "QA Engineer", "salary": 82000, "start_date": "2021-04-12", "manager_id": "E001"},
148
+ {"emp_id": "E013", "name": "Robert Taylor", "email": "robert.t@company.com", "birth_date": "1986-10-31", "age": 38, "department": "Sales", "dept_code": "SAL", "role": "Account Manager", "salary": 78000, "start_date": "2020-01-06", "manager_id": "E008"},
149
+ {"emp_id": "E014", "name": "Nina Sharma", "email": "nina.s@company.com", "birth_date": "1997-03-22", "age": 27, "department": "Finance", "dept_code": "FIN", "role": "Junior Analyst", "salary": 58000, "start_date": "2024-01-08", "manager_id": "E006"},
150
+ {"emp_id": "E015", "name": "Carlos Mendez", "email": "carlos.m@company.com", "birth_date": "1989-07-16", "age": 35, "department": "Engineering", "dept_code": "ENG", "role": "DevOps Engineer", "salary": 95000, "start_date": "2019-10-01", "manager_id": "E001"},
151
+ {"emp_id": "E016", "name": "Sophie Turner", "email": "sophie.t@company.com", "birth_date": "1996-11-03", "age": 28, "department": "Marketing", "dept_code": "MKT", "role": "Social Media Mgr", "salary": 60000, "start_date": "2022-03-15", "manager_id": "E003"},
152
+ {"emp_id": "E017", "name": "Alex Rivera", "email": "alex.r@company.com", "birth_date": "1984-01-28", "age": 40, "department": "Sales", "dept_code": "SAL", "role": "Regional Manager", "salary": 110000, "start_date": "2016-09-01", "manager_id": "E008"},
153
+ {"emp_id": "E018", "name": "Diana Foster", "email": "diana.f@company.com", "birth_date": "1991-06-09", "age": 33, "department": "Finance", "dept_code": "FIN", "role": "Senior Accountant", "salary": 92000, "start_date": "2017-08-21", "manager_id": "E006"},
154
+ {"emp_id": "E019", "name": "Brandon White", "email": "brandon.w@company.com", "birth_date": "1998-04-14", "age": 26, "department": "Engineering", "dept_code": "ENG", "role": "Intern", "salary": 45000, "start_date": "2024-06-01", "manager_id": "E001"},
155
+ {"emp_id": "E020", "name": "Maria Gonzalez", "email": "maria.g@company.com", "birth_date": "1982-12-01", "age": 42, "department": "Executive", "dept_code": "EXE", "role": "CFO", "salary": 210000, "start_date": "2012-04-01", "manager_id": ""},
156
+ ]
157
+
158
+ _HARD_ISSUES: List[IssueDict] = [
159
+ {"row": 1, "col": "age", "type": "cross_field", "desc": "Age 28 inconsistent with birth_date 1990-03-15 (should be ~34)", "fix": "34"},
160
+ {"row": 3, "col": "dept_code", "type": "cross_field", "desc": "dept_code MKT but department is Engineering", "fix": "ENG"},
161
+ {"row": 6, "col": None, "type": "near_duplicate", "desc": "Near-duplicate of row 5 (James Obrien vs James O'Brien)", "fix": "__DELETE__"},
162
+ {"row": 8, "col": "salary", "type": "anomalous_value", "desc": "Salary $250k for Junior Developer (expected $60k-$85k)", "fix": "75000"},
163
+ {"row": 10, "col": "start_date", "type": "future_date", "desc": "Start date 2025-08-01 is in the future", "fix": "2024-08-01"},
164
+ {"row": 11, "col": "email", "type": "inconsistent_format", "desc": "Email in ALL CAPS vs lowercase convention", "fix": "amy.liu@company.com"},
165
+ {"row": 12, "col": "department", "type": "missing_value", "desc": "Department empty but dept_code is SAL", "fix": "Sales"},
166
+ {"row": 13, "col": "name", "type": "placeholder_value", "desc": "Name is literal 'NULL' string instead of real name", "fix": "Nina Sharma"},
167
+ {"row": 14, "col": "salary", "type": "invalid_value", "desc": "Negative salary (-95000)", "fix": "95000"},
168
+ {"row": 16, "col": "start_date", "type": "anomalous_value", "desc": "Start date 1899-01-01 is clearly wrong", "fix": "2016-09-01"},
169
+ {"row": 18, "col": "manager_id", "type": "referential", "desc": "manager_id E999 does not exist in employee list", "fix": "E001"},
170
+ ]
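The `fix` field in these issue records doubles as an executable ground truth: applying every fix to the dirty rows (with `__DELETE__` meaning row removal) should reproduce the clean dataset. A minimal sketch of that sanity check, using hypothetical toy rows rather than the real task data, and not necessarily the environment's actual scoring logic:

```python
import copy

def apply_issue_fixes(rows, issues):
    """Apply each issue's 'fix' to a copy of the dirty rows.
    Deletions are applied last, highest index first, so that the
    0-based row indices in the issue list stay valid throughout."""
    fixed = copy.deepcopy(rows)
    deletions = []
    for issue in issues:
        if issue["fix"] == "__DELETE__":
            deletions.append(issue["row"])
        else:
            fixed[issue["row"]][issue["col"]] = issue["fix"]
    for row in sorted(deletions, reverse=True):
        del fixed[row]
    return fixed

# Toy data illustrating the schema (not the real task rows).
dirty = [
    {"emp_id": "E001", "age": 28},
    {"emp_id": "E001", "age": 28},   # near-duplicate
]
issues = [
    {"row": 0, "col": "age", "type": "cross_field", "desc": "...", "fix": "34"},
    {"row": 1, "col": None, "type": "near_duplicate", "desc": "...", "fix": "__DELETE__"},
]
print(apply_issue_fixes(dirty, issues))  # [{'emp_id': 'E001', 'age': '34'}]
```

Applying deletions after value fixes, in descending row order, is what lets every issue refer to the original row numbering.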
171
+
172
+
173
+ # ── public registry ────────────────────────────────────────────────────────
174
+
175
+ TASKS = {
176
+ "easy": {
177
+ "name": "easy",
178
+ "title": "Customer Contact Cleanup",
179
+ "difficulty": "easy",
180
+ "description": (
181
+ "You are a data-quality analyst. A customer-contacts spreadsheet has "
182
+ "been imported with several obvious errors: missing emails, invalid "
183
+ "phone numbers, duplicate rows, and impossible ages. "
184
+ "Identify and fix every issue. Use the available actions to correct "
185
+ "each problem, then submit when you believe the data is clean."
186
+ ),
187
+ "dirty_data": _EASY_DIRTY,
188
+ "clean_data": _EASY_CLEAN,
189
+ "issues": _EASY_ISSUES,
190
+ "max_steps": 15,
191
+ "columns": ["id", "name", "email", "phone", "age", "city"],
192
+ },
193
+ "medium": {
194
+ "name": "medium",
195
+ "title": "E-commerce Order Normalisation",
196
+ "difficulty": "medium",
197
+ "description": (
198
+ "You are a data engineer preparing an orders export for a BI dashboard. "
199
+ "The dataset has mixed date formats (YYYY-MM-DD, DD/MM/YYYY, YYYY.MM.DD, DD-MM-YYYY), "
200
+ "inconsistent price formatting, product-code variants (P100 vs P-100), "
201
+ "a typo in a status field, a duplicate order, negative quantities, "
202
+ "and missing values. Normalise every field so the data is consistent, "
203
+ "then submit."
204
+ ),
205
+ "dirty_data": _MED_DIRTY,
206
+ "clean_data": _MED_CLEAN,
207
+ "issues": _MED_ISSUES,
208
+ "max_steps": 25,
209
+ "columns": ["order_id", "customer", "product", "quantity", "price", "date", "status"],
210
+ },
211
+ "hard": {
212
+ "name": "hard",
213
+ "title": "Employee Records Audit",
214
+ "difficulty": "hard",
215
+ "description": (
216
+ "You are auditing an HR database before a compliance review. "
217
+ "The employee records contain subtle cross-field inconsistencies "
218
+ "(age vs birth-date mismatches, department vs dept-code conflicts), "
219
+ "near-duplicate employees with slightly different name spellings, "
220
+ "anomalous salary values for the given role, future or impossible dates, "
221
+ "placeholder 'NULL' strings, ALL-CAPS email addresses, missing departments, "
222
+ "and referential-integrity violations (manager_id pointing to non-existent employees). "
223
+ "Find and fix all issues, then submit."
224
+ ),
225
+ "dirty_data": _HARD_DIRTY,
226
+ "clean_data": _HARD_CLEAN,
227
+ "issues": _HARD_ISSUES,
228
+ "max_steps": 35,
229
+ "columns": ["emp_id", "name", "email", "birth_date", "age", "department",
230
+ "dept_code", "role", "salary", "start_date", "manager_id"],
231
+ },
232
+ }
233
+
234
+
235
+ def get_task(name: str) -> dict:
236
+ """Return a deep copy of a task definition so mutations are isolated."""
237
+ if name not in TASKS:
238
+ raise ValueError(f"Unknown task '{name}'. Choose from: {list(TASKS.keys())}")
239
+ return copy.deepcopy(TASKS[name])
inference.py ADDED
@@ -0,0 +1,263 @@
1
+ """
2
+ Inference Script — DataClean Environment
3
+ =========================================
4
+ MANDATORY:
5
+ - Before submitting, ensure the following variables are defined:
6
+ API_BASE_URL The API endpoint for the LLM.
7
+ MODEL_NAME The model identifier to use for inference.
8
+ HF_TOKEN Your Hugging Face / API key.
9
+ - This script must be named `inference.py` and placed in the root directory.
10
+ - Uses the OpenAI client for all LLM calls.
11
+ """
12
+
13
+ import json
14
+ import os
15
+ import re
16
+ import sys
17
+ import textwrap
18
+ from typing import List, Optional
19
+
20
+ from openai import OpenAI
21
+
22
+ # ---------------------------------------------------------------------------
23
+ # Inline client (HTTP) so inference.py is self-contained
24
+ # ---------------------------------------------------------------------------
25
+ import requests
26
+
27
+
28
+ class _StepResult:
29
+ def __init__(self, observation: dict, reward: float, done: bool):
30
+ self.observation = observation
31
+ self.reward = reward
32
+ self.done = done
33
+
34
+
35
+ class _SimpleClient:
36
+ """Minimal sync HTTP client for the DataClean environment."""
37
+
38
+ def __init__(self, base_url: str):
39
+ self.base_url = base_url.rstrip("/")
40
+ self.s = requests.Session()
41
+
42
+ def reset(self, task_name: str = "easy") -> _StepResult:
43
+ r = self.s.post(f"{self.base_url}/reset", json={"task_name": task_name}, timeout=30)
44
+ r.raise_for_status()
45
+ d = r.json()
46
+ return _StepResult(d.get("observation", {}), float(d.get("reward", 0)), bool(d.get("done", False)))
47
+
48
+ def step(self, action: dict) -> _StepResult:
49
+ r = self.s.post(f"{self.base_url}/step", json=action, timeout=30)
50
+ r.raise_for_status()
51
+ d = r.json()
52
+ return _StepResult(d.get("observation", {}), float(d.get("reward", 0)), bool(d.get("done", False)))
53
+
54
+ def close(self):
55
+ self.s.close()
56
+
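Because `_SimpleClient` only needs an object with a `post` method returning something response-like, the same reset/step contract can be exercised without a live server. The sketch below mirrors the client with an injectable session and canned payloads; the names here are illustrative, not part of the real API:

```python
class FakeResponse:
    """Stands in for requests.Response: just raise_for_status() and json()."""
    def __init__(self, payload):
        self._payload = payload
    def raise_for_status(self):
        pass
    def json(self):
        return self._payload

class FakeSession:
    """Returns a canned environment payload for any POST."""
    def post(self, url, json=None, timeout=None):
        if url.endswith("/reset"):
            return FakeResponse({"observation": {"step": 0}, "reward": 0.0, "done": False})
        return FakeResponse({"observation": {"step": 1}, "reward": 0.25, "done": True})

class MiniClient:
    """Mirror of _SimpleClient's reset/step contract, session injected."""
    def __init__(self, base_url, session):
        self.base_url = base_url.rstrip("/")
        self.s = session
    def _call(self, path, payload):
        r = self.s.post(f"{self.base_url}{path}", json=payload, timeout=30)
        r.raise_for_status()
        d = r.json()
        return d.get("observation", {}), float(d.get("reward", 0)), bool(d.get("done", False))
    def reset(self, task_name="easy"):
        return self._call("/reset", {"task_name": task_name})
    def step(self, action):
        return self._call("/step", action)

client = MiniClient("http://example.test", FakeSession())
obs, reward, done = client.reset()
obs2, reward2, done2 = client.step({"action_type": "noop"})
```

The `.get(..., default)` coercions are the important detail: a server that omits `reward` or `done` still yields well-typed values on the client side.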
57
+
58
+ # ---------------------------------------------------------------------------
59
+ # Configuration
60
+ # ---------------------------------------------------------------------------
61
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
62
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
63
+ MODEL_NAME = os.getenv("MODEL_NAME")
64
+
65
+ # Where the DataClean env server is running
66
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
67
+
68
+ # Client-side step budgets; kept deliberately below the environment's own
+ # max_steps of 15/25/35 so the forced submit in run_task() always fits.
+ MAX_STEPS_PER_TASK = {"easy": 12, "medium": 20, "hard": 30}
69
+ TEMPERATURE = 0.1
70
+ MAX_TOKENS = 400
71
+
72
+ SYSTEM_PROMPT = textwrap.dedent("""\
73
+ You are an expert data-quality analyst. You are interacting with a data-cleaning
74
+ environment. Your goal is to identify and fix all data-quality issues.
75
+
76
+ After reviewing the data and quality report, respond with EXACTLY ONE action in
77
+ valid JSON format. Available actions:
78
+
79
+ 1. Fix a cell value:
80
+ {"action_type": "fix_value", "row_index": <int>, "column_name": "<col>", "new_value": "<corrected>"}
81
+
82
+ 2. Delete a duplicate/invalid row:
83
+ {"action_type": "delete_row", "row_index": <int>}
84
+
85
+ 3. Fill a missing value:
86
+ {"action_type": "fill_missing", "row_index": <int>, "column_name": "<col>", "new_value": "<value>"}
87
+
88
+ 4. Flag a suspicious cell (partial credit):
89
+ {"action_type": "flag_anomaly", "row_index": <int>, "column_name": "<col>"}
90
+
91
+ 5. Submit your work (ends the episode):
92
+ {"action_type": "submit"}
93
+
94
+ 6. Do nothing this step:
95
+ {"action_type": "noop"}
96
+
97
+ RULES:
98
+ - row_index is 0-based and refers to the ORIGINAL row number shown in the table.
99
+ - Respond ONLY with the JSON action. No explanations, no markdown, no extra text.
100
+ - Fix the most obvious/critical issues first.
101
+ - When all issues appear resolved, use submit.
102
+ - Dates should be in YYYY-MM-DD format.
103
+ - Prices should be plain numbers without $ or commas.
104
+ - Product codes should NOT have dashes (e.g., P102 not P-102).
105
+ - Emails should be lowercase.
106
+ """).strip()
107
+
108
+
109
+ # ---------------------------------------------------------------------------
110
+ # Helpers
111
+ # ---------------------------------------------------------------------------
112
+ # Matches the first flat (non-nested) JSON object. No flags needed: the
+ # [^}] class already spans newlines, so re.DOTALL would be redundant.
+ ACTION_JSON_RE = re.compile(r"\{[^}]+\}")
113
+
114
+
115
+ def parse_action(text: str) -> dict:
116
+ """Extract the first JSON object from the model response."""
117
+ if not text:
118
+ return {"action_type": "noop"}
119
+ m = ACTION_JSON_RE.search(text)
120
+ if m:
121
+ try:
122
+ obj = json.loads(m.group(0))
123
+ if "action_type" in obj:
124
+ return obj
125
+ except json.JSONDecodeError:
126
+ pass
127
+ return {"action_type": "noop"}
128
+
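In practice, models often wrap the JSON in prose despite the system prompt, which is exactly the case this fallback handles. A quick standalone illustration (this mirrors `parse_action` so it can run on its own):

```python
import json
import re

ACTION_RE = re.compile(r"\{[^}]+\}")  # first flat JSON object, as above

def extract_action(text):
    """Same contract as parse_action: any unparseable reply degrades to noop."""
    m = ACTION_RE.search(text or "")
    if m:
        try:
            obj = json.loads(m.group(0))
            if "action_type" in obj:
                return obj
        except json.JSONDecodeError:
            pass
    return {"action_type": "noop"}

chatty = ('Sure! Here is my action: {"action_type": "fix_value", '
          '"row_index": 3, "column_name": "age", "new_value": "34"}')
print(extract_action(chatty)["action_type"])                # fix_value
print(extract_action("Row 3 looks wrong.")["action_type"])  # noop
```

The noop fallback means a malformed model reply costs one step rather than crashing the episode loop.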
129
+
130
+ def build_user_prompt(obs: dict, step_num: int) -> str:
131
+ """Build the user prompt from the observation."""
132
+ parts = [
133
+ f"TASK: {obs.get('task_description', '')}",
134
+ f"DIFFICULTY: {obs.get('difficulty', '')}",
135
+ f"STEP: {step_num}/{obs.get('max_steps', '?')}",
136
+ f"CURRENT SCORE: {obs.get('current_score', 0.0)}",
137
+ "",
138
+ "CURRENT DATA:",
139
+ obs.get("data_preview", "(no data)"),
140
+ "",
141
+ obs.get("quality_report", ""),
142
+ ]
143
+ history = obs.get("action_history", [])
144
+ if history:
145
+ parts.append("")
146
+ parts.append("RECENT ACTIONS:")
147
+ for h in history[-5:]:
148
+ parts.append(f" {h}")
149
+
150
+ parts.append("")
151
+ parts.append("Respond with exactly one JSON action.")
152
+ return "\n".join(parts)
153
+
154
+
155
+ # ---------------------------------------------------------------------------
156
+ # Run one task
157
+ # ---------------------------------------------------------------------------
158
+ def run_task(
159
+ llm_client: OpenAI,
160
+ env_client: _SimpleClient,
161
+ task_name: str,
162
+ max_steps: int,
163
+ ) -> float:
164
+ """Run a single task and return the final score."""
165
+ print(f"\n{'='*60}")
166
+ print(f" TASK: {task_name.upper()}")
167
+ print(f"{'='*60}")
168
+
169
+ result = env_client.reset(task_name)
170
+ obs = result.observation
171
+ print(f" Task: {obs.get('task_description', '')[:80]}...")
172
+ print(f" Max steps: {max_steps}")
173
+
174
+ for step in range(1, max_steps + 1):
175
+ if result.done:
176
+ print(f" Episode done at step {step - 1}")
177
+ break
178
+
179
+ user_prompt = build_user_prompt(obs, step)
180
+ messages = [
181
+ {"role": "system", "content": SYSTEM_PROMPT},
182
+ {"role": "user", "content": user_prompt},
183
+ ]
184
+
185
+ try:
186
+ completion = llm_client.chat.completions.create(
187
+ model=MODEL_NAME,
188
+ messages=messages,
189
+ temperature=TEMPERATURE,
190
+ max_tokens=MAX_TOKENS,
191
+ stream=False,
192
+ )
193
+ response_text = completion.choices[0].message.content or ""
194
+ except Exception as exc:
195
+ print(f" Step {step}: LLM error ({exc}), using noop")
196
+ response_text = '{"action_type": "noop"}'
197
+
198
+ action = parse_action(response_text)
199
+ print(f" Step {step}: {action.get('action_type', '?')}", end="")
200
+ if action.get("row_index") is not None:
201
+ print(f" row={action['row_index']}", end="")
202
+ if action.get("column_name"):
203
+ print(f" col={action['column_name']}", end="")
204
+ if action.get("new_value"):
205
+ print(f" val={action['new_value']}", end="")
206
+
207
+ result = env_client.step(action)
208
+ obs = result.observation
209
+ print(f" -> reward={result.reward:.4f} done={result.done}")
210
+
211
+ if result.done:
212
+ break
213
+
214
+ # If agent never submitted, force submit
215
+ if not result.done:
216
+ result = env_client.step({"action_type": "submit"})
217
+
218
+ final_score = result.reward
219
+ print(f"\n FINAL SCORE ({task_name}): {final_score:.4f}")
220
+ return final_score
221
+
222
+
223
+ # ---------------------------------------------------------------------------
224
+ # Main
225
+ # ---------------------------------------------------------------------------
226
+ def main() -> None:
227
+ if not API_KEY:
228
+ print("ERROR: HF_TOKEN or API_KEY environment variable not set")
229
+ sys.exit(1)
230
+ if not MODEL_NAME:
231
+ print("ERROR: MODEL_NAME environment variable not set")
232
+ sys.exit(1)
233
+
234
+ print("DataClean Environment — Baseline Inference")
235
+ print(f" API: {API_BASE_URL}")
236
+ print(f" Model: {MODEL_NAME}")
237
+ print(f" Env: {ENV_BASE_URL}")
238
+
239
+ llm_client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
240
+ env_client = _SimpleClient(ENV_BASE_URL)
241
+
242
+ scores = {}
243
+ try:
244
+ for task_name in ["easy", "medium", "hard"]:
245
+ max_steps = MAX_STEPS_PER_TASK[task_name]
246
+ score = run_task(llm_client, env_client, task_name, max_steps)
247
+ scores[task_name] = score
248
+ finally:
249
+ env_client.close()
250
+
251
+ print(f"\n{'='*60}")
252
+ print(" FINAL RESULTS")
253
+ print(f"{'='*60}")
254
+ for name, score in scores.items():
255
+ bar = "#" * int(score * 40)
256
+ print(f" {name:8s}: {score:.4f} [{bar:<40s}]")
257
+ avg = sum(scores.values()) / len(scores) if scores else 0.0
258
+ print(f" {'AVERAGE':8s}: {avg:.4f}")
259
+ print(f"{'='*60}")
260
+
261
+
262
+ if __name__ == "__main__":
263
+ main()
openenv.yaml ADDED
@@ -0,0 +1,6 @@
1
+ spec_version: 1
2
+ name: dataclean_env
3
+ type: space
4
+ runtime: fastapi
5
+ app: dataclean_env.server.app:app
6
+ port: 7860
pyproject.toml ADDED
@@ -0,0 +1,27 @@
1
+ [build-system]
2
+ requires = ["setuptools>=68.0", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "dataclean-env"
7
+ version = "1.0.0"
8
+ description = "OpenEnv environment for training AI agents on real-world data-quality cleaning tasks"
9
+ readme = "README.md"
10
+ license = {text = "BSD-3-Clause"}
11
+ requires-python = ">=3.10"
12
+ dependencies = [
13
+ "fastapi>=0.104.0",
14
+ "uvicorn>=0.24.0",
15
+ "requests>=2.25.0",
16
+ "pydantic>=2.0.0",
17
+ "openai>=1.0.0",
18
+ ]
19
+
20
+ [project.optional-dependencies]
21
+ server = [
22
+ "fastapi>=0.104.0",
23
+ "uvicorn>=0.24.0",
24
+ ]
25
+
26
+ [tool.setuptools.packages.find]
27
+ include = ["dataclean_env*"]
requirements.txt ADDED
@@ -0,0 +1,6 @@
1
+ fastapi>=0.104.0
2
+ uvicorn[standard]>=0.24.0
3
+ requests>=2.25.0
4
+ pydantic>=2.0.0
5
+ openai>=1.0.0
6
+ websockets>=12.0