GlitchGhost committed on
Commit 2eeba69 · 0 Parent(s):

Initial DataClean OpenEnv Space

.dockerignore ADDED
@@ -0,0 +1,18 @@
+ __pycache__
+ *.pyc
+ *.pyo
+ .git
+ .gitignore
+ *.md
+ !README.md
+ .env
+ .venv
+ venv
+ node_modules
+ .pytest_cache
+ .mypy_cache
+ 01-*.md
+ 02-*.md
+ 03-*.md
+ 04-*.md
+ 05-*.md
.gitignore ADDED
@@ -0,0 +1,12 @@
+ __pycache__/
+ *.py[cod]
+ .pytest_cache/
+ .mypy_cache/
+ .venv/
+ venv/
+ .env
+ .env.*
+ .DS_Store
+ .claude/
+ .zencoder/
+ .zenflow/
01-PROBLEM-UNDERSTANDING.md ADDED
@@ -0,0 +1,235 @@
+ # Problem Understanding: Meta PyTorch x Scaler Hackathon
+
+ ## What Is This Hackathon About?
+
+ This hackathon asks you to **build a real-world simulation environment** that AI agents can learn from using Reinforcement Learning (RL). The environment must follow the **OpenEnv specification** -- a standardized framework by Meta PyTorch for creating, deploying, and using isolated execution environments for agentic RL training.
+
+ ---
+
+ ## The Core Ask (One Sentence)
+
+ > Build a complete OpenEnv-compliant environment that simulates a **real-world task** (not games/toys), with 3+ graded tasks, deploy it to Hugging Face Spaces, and provide a working inference script.
+
+ ---
+
+ ## Breaking Down the Problem Statement
+
+ ### 1. What Is an "OpenEnv Environment"?
+
+ An OpenEnv environment is a **containerized microservice** that exposes an RL interface via HTTP/WebSocket:
+
+ ```
+ Your AI Agent (Client)
+         |
+         |  step(action) / reset() / state()
+         |
+         v
+ Docker Container (Server)
+     FastAPI Server
+     Your Environment Logic
+ ```
+
+ Think of it like a **REST API that an AI agent interacts with** to learn. The agent:
+ 1. Observes the current state
+ 2. Takes an action
+ 3. Receives a reward signal
+ 4. Repeats until done
+
+ ### 2. What Does "Real-World Task" Mean?
+
+ The environment must simulate something **humans actually do at work or in life**. NOT games, NOT toy problems.
+
+ **Good examples (explicitly listed):**
+ - Email triage (sorting/prioritizing emails)
+ - Code review (reviewing pull requests)
+ - Data cleaning (fixing messy datasets)
+ - Scheduling (managing calendars/meetings)
+ - Customer support (answering tickets)
+ - Content moderation (classifying content)
+
+ **Bad examples (will get low scores or DQ):**
+ - Tic-tac-toe, chess, 2048 (games)
+ - CartPole, MountainCar (toy RL benchmarks)
+ - Number guessing (trivial)
+
+ ### 3. The Three APIs You Must Implement
+
+ | API | What It Does | Returns |
+ |-----|-------------|---------|
+ | `reset()` | Starts a new episode from scratch | Initial observation |
+ | `step(action)` | Agent takes an action, environment advances | observation, reward, done, info |
+ | `state()` | Returns current episode metadata | Current state |
+
+ ### 4. Typed Models Required (Pydantic)
+
+ You must define strongly-typed data models:
+
+ ```python
+ class MyAction(Action):            # What the agent can DO
+     ...
+
+ class MyObservation(Observation):  # What the agent can SEE
+     ...
+
+ class MyState(State):              # Episode metadata
+     ...
+ ```
+
+ ### 5. The Three Tasks (Easy -> Medium -> Hard)
+
+ You need **minimum 3 tasks** with increasing difficulty:
+
+ | Task | Difficulty | Score Range | What It Tests |
+ |------|-----------|-------------|---------------|
+ | Task 1 | Easy | 0.0 - 1.0 | Basic competence |
+ | Task 2 | Medium | 0.0 - 1.0 | Intermediate reasoning |
+ | Task 3 | Hard | 0.0 - 1.0 | Should challenge frontier models |
+
+ Each task has a **programmatic grader** that:
+ - Scores performance from 0.0 to 1.0
+ - Has clear, deterministic success/failure criteria
+ - Is reproducible (same input = same score)
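The grader properties listed above can be illustrated with a minimal sketch. It assumes a hypothetical data-cleaning task in which the agent submits a list of cleaned cell values and the grader compares it against a reference list; `grade_cleaning` and its arguments are illustrative names, not part of OpenEnv or the hackathon statement.

```python
from typing import Sequence


def grade_cleaning(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Return the fraction of cells the agent reproduced exactly.

    Hypothetical grader: it uses no randomness and no clock, so the same
    input always yields the same score, and the result stays in [0.0, 1.0].
    """
    if not expected:
        # Nothing to clean: only an empty submission counts as correct.
        return 1.0 if not predicted else 0.0
    # Compare position by position; zip stops at the shorter sequence.
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    # Dividing by the longer length penalizes missing and extra cells alike.
    return matches / max(len(predicted), len(expected))
```

Because the score is a ratio rather than a pass/fail flag, different submissions produce different scores, which also avoids the "always returns the same score" disqualification.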
+
+ ### 6. Meaningful Reward Function
+
+ The reward must:
+ - Provide signal **throughout** the episode (not just at the end)
+ - Reward **partial progress** (not just binary success/fail)
+ - Penalize **bad behavior** (infinite loops, destructive actions)
+
+ **Bad reward:** `return 1.0 if solved else 0.0` (sparse, no learning signal)
+
+ **Good reward:** Gives incremental feedback as the agent makes progress
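One way to read the "good reward" above is as a per-step share of the total work. The sketch below assumes a hypothetical task with a countable number of errors to fix; `shaped_reward`, its parameters, and the `-0.1` penalty are illustrative choices, not values taken from the hackathon rules.

```python
def shaped_reward(errors_before: int, errors_after: int, total_errors: int,
                  invalid_action: bool = False) -> float:
    """Hypothetical per-step reward for a task with a known error count."""
    if invalid_action:
        # Flat penalty for malformed or destructive actions (assumed value).
        return -0.1
    if total_errors <= 0:
        # Nothing to fix, so no step can earn or lose reward.
        return 0.0
    # Positive when the step removed errors, negative when it added some.
    return (errors_before - errors_after) / total_errors
```

Summed over an episode that fixes every error, these rewards add up to 1.0, so the episode total can still be compared with a 0.0-1.0 grader score.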
+
+ ### 7. Baseline Inference Script
+
+ - Named `inference.py` in the project root
+ - Uses **OpenAI client** (not raw HTTP)
+ - Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars
+ - Runs against all 3 tasks and produces reproducible scores
+ - Must complete in **< 20 minutes**
+ - Must run on **2 vCPU, 8GB RAM**
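The configuration requirements above could be wired up roughly as follows. This is a sketch under assumptions: the statement names the three variables but not how they map onto the client, so passing `HF_TOKEN` as the API key is a guess, and `load_settings` / `make_client` are hypothetical helper names.

```python
import os


def load_settings() -> dict:
    """Read the three environment variables named in the problem statement."""
    names = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        # Fail early with a clear message instead of a late HTTP error.
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "base_url": os.environ["API_BASE_URL"],
        "model": os.environ["MODEL_NAME"],
        "api_key": os.environ["HF_TOKEN"],  # assumption: token doubles as API key
    }


def make_client(settings: dict):
    """Build an OpenAI client pointed at the configured endpoint."""
    # Imported lazily so that load_settings() works without the package installed.
    from openai import OpenAI

    return OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
```

A baseline script would then call `make_client(load_settings())` once and reuse the client for all three tasks.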
+
+ ### 8. Deployment Requirements
+
+ - **Hugging Face Space** tagged with `openenv`
+ - **Working Dockerfile** (docker build + docker run must work)
+ - **README** with full documentation
+
+ ---
+
+ ## Scoring Breakdown (100 points total)
+
+ | Criterion | Weight | What Judges Look For |
+ |-----------|--------|---------------------|
+ | **Real-world utility** | 30% | Does it model a genuine task? Would anyone actually use this? |
+ | **Task & grader quality** | 25% | Well-defined tasks? Fair graders? Good difficulty progression? |
+ | **Environment design** | 20% | Clean state management? Good reward shaping? Sensible boundaries? |
+ | **Code quality & spec compliance** | 15% | OpenEnv spec compliant? Docker works? Clean code? |
+ | **Creativity & novelty** | 10% | Novel domain? Interesting reward design? Clever mechanics? |
+
+ ### How to Score High in Each Category
+
+ #### Real-world utility (30% -- BIGGEST weight)
+ - Score 26-30: "Fills a real gap, immediate value for the RL/agent community"
+ - Pick a domain where AI agents are **actually being deployed** or would benefit from training
+ - The task should feel like something a human knowledge worker does daily
+ - Ask: "Would a company pay for an agent that can do this?"
+
+ #### Task & grader quality (25%)
+ - 3+ tasks with clear difficulty range
+ - Graders produce float scores 0.0-1.0
+ - Graders are deterministic and reproducible
+ - Hard task genuinely challenges GPT-4 / Claude level models
+
+ #### Environment design (20%)
+ - `reset()` produces clean state (no leakage between episodes)
+ - Action/observation types are well-designed and documented
+ - Reward function provides varying signal (not sparse)
+ - Episode boundaries make sense
+
+ #### Code quality & spec compliance (15%)
+ - `openenv validate` passes
+ - `docker build && docker run` works
+ - HF Space deploys and responds
+ - Baseline script runs and reproduces scores
+
+ #### Creativity & novelty (10%)
+ - Domain not already in OpenEnv
+ - Interesting reward shaping
+ - Clever mechanics
+
+ ---
+
+ ## Disqualification Criteria (Instant Fail)
+
+ 1. Environment doesn't deploy or respond
+ 2. Plagiarized or trivially modified existing environments
+ 3. Graders that always return the same score
+ 4. No baseline inference script
+ 5. HF Space doesn't return 200 / respond to `reset()`
+ 6. Dockerfile doesn't build
+ 7. Baseline doesn't reproduce scores
+ 8. Fewer than 3 tasks
+
+ ---
+
+ ## Judging Process (3 Phases)
+
+ ### Phase 1: Automated Validation (Pass/Fail Gate)
+ - HF Space ping -> must return 200 and respond to `reset()`
+ - `openenv validate` on openenv.yaml, models, endpoints
+ - `docker build` on submitted repo
+ - Run inference script -> must complete without error
+ - Enumerate 3+ tasks, run graders, verify 0.0-1.0 scores
+
+ ### Phase 2: Agentic Evaluation (Scored)
+ - Re-run baseline agent
+ - Run a standard Open LLM agent (e.g., Nemotron 3 Super) against ALL environments
+ - Check score variance
+
+ ### Phase 3: Human Review (Top Submissions Only)
+ - Meta and Hugging Face engineers review for:
+   - Real-world utility
+   - Creativity
+   - Exploit checks (are graders gameable?)
+
+ ---
+
+ ## Key Infrastructure Constraints
+
+ | Constraint | Value |
+ |-----------|-------|
+ | Inference script runtime | < 20 minutes |
+ | Machine specs | 2 vCPU, 8GB RAM |
+ | LLM client | Must use OpenAI client |
+ | Environment variables | `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` |
+ | Script name | `inference.py` (root directory) |
+ | Deployment | HF Spaces with Docker |
+
+ ---
+
+ ## Summary: What You Need to Deliver
+
+ ```
+ your-project/
+ ├── inference.py          # Baseline inference script (OpenAI client)
+ ├── openenv.yaml          # Environment metadata
+ ├── models.py             # Pydantic Action, Observation, State models
+ ├── client.py             # HTTP/WebSocket client for the environment
+ ├── README.md             # Full documentation
+ ├── server/
+ │   ├── app.py            # FastAPI server
+ │   ├── environment.py    # Your environment logic
+ │   ├── Dockerfile        # Container definition
+ │   └── requirements.txt  # Dependencies
+ └── pyproject.toml        # Package definition
+ ```
+
+ Deliverables:
+ 1. Working HF Space (tagged `openenv`)
+ 2. Working Docker image
+ 3. 3+ tasks with graders (easy/medium/hard)
+ 4. Meaningful reward function
+ 5. Baseline inference script with reproducible scores
+ 6. README with full documentation
02-OPENENV-ARCHITECTURE.md ADDED
@@ -0,0 +1,452 @@
+ # OpenEnv Architecture: Deep Understanding
+
+ ## What Is OpenEnv?
+
+ OpenEnv is an **end-to-end framework** for creating, deploying, and using **isolated execution environments** for agentic RL training. It uses familiar Gymnasium-style APIs (`step()`, `reset()`, `state()`) but wraps them in a **production-ready microservice architecture**.
+
+ **PyPI package:** `openenv-core`
+ **GitHub:** https://github.com/meta-pytorch/OpenEnv
+ **License:** BSD 3-Clause
+
+ ---
+
+ ## Core Philosophy
+
+ > "RL environments should be like microservices"
+
+ Just as you don't run your database in the same process as your web server, OpenEnv separates the **environment** (server in Docker) from the **training code** (client).
+
+ | Traditional (Gym) | OpenEnv |
+ |-------------------|---------|
+ | Same process | Separate container |
+ | Python only | Any language (HTTP API) |
+ | Arrays and dicts | Type-safe Pydantic models |
+ | "Works on my machine" | Docker everywhere |
+ | Hard to scale | Deploy to K8s |
+ | Can crash your training | Isolated and secure |
+
+ ---
+
+ ## Architecture Overview
+
+ ```
+ ┌─────────────────────────────────────────────────┐
+ │  YOUR CODE (Client Side)                        │
+ │                                                 │
+ │  env = MyEnv(base_url="http://localhost:8000")  │
+ │  result = env.reset()       # Type-safe!        │
+ │  result = env.step(action)  # Type-safe!        │
+ │  state = env.state()        # Type-safe!        │
+ │                                                 │
+ └──────────────────┬──────────────────────────────┘
+
+                    │  WebSocket (/ws) - primary
+                    │  HTTP (/reset, /step, /state) - fallback
+
+ ┌──────────────────▼──────────────────────────────┐
+ │  DOCKER CONTAINER (Server Side)                 │
+ │                                                 │
+ │  FastAPI Server (app.py)                        │
+ │  ├── /ws      → WebSocket session handler       │
+ │  ├── /reset   → POST: Reset environment         │
+ │  ├── /step    → POST: Execute action            │
+ │  ├── /state   → GET: Get current state          │
+ │  ├── /health  → GET: Health check               │
+ │  ├── /docs    → GET: OpenAPI docs               │
+ │  └── /web     → GET: Interactive web UI         │
+ │                                                 │
+ │  Environment (environment.py)                   │
+ │    Your simulation logic lives here             │
+ │                                                 │
+ │  Models (models.py)                             │
+ │    Action, Observation, State definitions       │
+ │                                                 │
+ └─────────────────────────────────────────────────┘
+ ```
+
+ ---
+
+ ## The Three Core APIs
+
+ ### 1. `reset()` -> Initial Observation
+
+ Starts a new episode. Returns the initial observation the agent sees.
+
+ ```python
+ result = env.reset()
+ # result.observation -> MyObservation (typed)
+ # result.reward -> None (no reward on reset)
+ # result.done -> False (episode just started)
+ ```
+
+ **What happens server-side:**
+ - Clears all previous state
+ - Initializes a fresh episode
+ - Returns the starting observation
+
+ ### 2. `step(action)` -> StepResult
+
+ Agent takes an action. Environment advances and returns the result.
+
+ ```python
+ result = env.step(MyAction(field="value"))
+ # result.observation -> MyObservation (new state after action)
+ # result.reward -> float (0.0 to 1.0 typically)
+ # result.done -> bool (is episode over?)
+ ```
+
+ **What happens server-side:**
+ - Validates the action
+ - Updates environment state
+ - Computes reward
+ - Checks if episode is done
+ - Returns new observation
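The server-side behaviour of `reset()` and `step()` described above can be tried out in-process before any server exists. The sketch below is a stand-in that uses plain dataclasses rather than the OpenEnv base classes; `ToyTriageEnv`, `Obs`, `EpisodeState`, and the two-item dataset are all hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Obs:
    remaining: List[str]            # items still waiting for a label
    reward: Optional[float] = None  # None on reset, a float after each step
    done: bool = False


@dataclass
class EpisodeState:
    episode_id: Optional[str] = None
    step_count: int = 0


class ToyTriageEnv:
    """Stand-in environment: label each queued item as 'spam' or 'ok'."""

    LABELS = {"win money now": "spam", "meeting at 3pm": "ok"}

    def __init__(self) -> None:
        self._state = EpisodeState()
        self._queue: List[str] = []

    def reset(self) -> Obs:
        # Clear previous state and start a fresh episode.
        self._state = EpisodeState(episode_id=str(uuid.uuid4()))
        self._queue = list(self.LABELS)
        return Obs(remaining=list(self._queue))

    def step(self, label: str) -> Obs:
        # Validate the action before touching any state.
        if label not in ("spam", "ok"):
            raise ValueError(f"unknown label: {label!r}")
        if not self._queue:
            raise RuntimeError("episode is over; call reset()")
        # Update state, compute the reward, then check for termination.
        self._state.step_count += 1
        item = self._queue.pop(0)
        reward = 1.0 if self.LABELS[item] == label else 0.0
        return Obs(remaining=list(self._queue), reward=reward, done=not self._queue)

    @property
    def state(self) -> EpisodeState:
        return self._state
```

Calling `reset()` twice in a row yields a new `episode_id` and a zeroed `step_count`, which is the "no leakage between episodes" property judges look for.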
+
+ ### 3. `state()` -> State
+
+ Returns metadata about the current episode.
+
+ ```python
+ state = env.state()
+ # state.episode_id -> str
+ # state.step_count -> int
+ # Any custom state fields
+ ```
+
+ ---
+
+ ## The Type System (Pydantic Models)
+
+ OpenEnv uses **Pydantic BaseModel** for all data contracts. This gives you:
+ - Type validation at runtime
+ - Auto-generated JSON schemas
+ - IDE autocomplete
+ - Self-documenting code
+
+ ### Base Classes (from `openenv.core.env_server.types`)
+
+ ```python
+ from typing import Optional
+
+ from pydantic import BaseModel
+
+ class Action(BaseModel):
+     """Base class for all environment actions"""
+     pass
+
+ class Observation(BaseModel):
+     """Base class for all environment observations"""
+     pass
+
+ class State(BaseModel):
+     """Base class for episode state/metadata"""
+     episode_id: Optional[str] = None
+     step_count: int = 0
+ ```
+
+ ### StepResult (Client-Side)
+
+ ```python
+ from dataclasses import dataclass
+ from typing import Generic, Optional, TypeVar
+
+ ObsT = TypeVar("ObsT")  # Concrete observation type
+
+ @dataclass
+ class StepResult(Generic[ObsT]):
+     observation: ObsT        # What the agent sees
+     reward: Optional[float]  # Scalar reward (0.0-1.0)
+     done: bool = False       # Is episode finished?
+ ```
+
+ ### Your Custom Models
+
+ ```python
+ # models.py
+ from typing import List
+
+ from openenv.core import Action, Observation, State
+
+ class EmailTriageAction(Action):
+     email_id: str
+     category: str  # "urgent", "normal", "spam"
+     priority: int  # 1-5
+
+ class EmailTriageObservation(Observation):
+     email_subject: str
+     email_body: str
+     sender: str
+     available_categories: List[str]
+     emails_remaining: int
+
+ class EmailTriageState(State):
+     total_emails: int
+     correctly_categorized: int
+     task_name: str
+ ```
+
+ ---
+
+ ## Server-Side: The Environment Class
+
+ The `Environment` abstract base class defines what you must implement:
+
+ ```python
+ from openenv.core.env_server import Environment
+ from openenv.core.env_server.types import Action, Observation, State
+
+ class MyEnvironment(Environment):
+
+     SUPPORTS_CONCURRENT_SESSIONS = True  # Can handle multiple agents
+
+     def reset(self) -> Observation:
+         """Initialize a new episode. Return initial observation."""
+         # Clear state, set up fresh scenario
+         return MyObservation(...)
+
+     def step(self, action: Action) -> Observation:
+         """Execute action, update state, return new observation."""
+         # Validate action
+         # Update environment state
+         # Compute reward
+         # Check if done
+         return MyObservation(...)
+
+     @property
+     def state(self) -> State:
+         """Return current episode metadata."""
+         return self._state
+ ```
+
+ ### Creating the FastAPI App
+
+ ```python
+ # server/app.py
+ from openenv.core.env_server import create_fastapi_app
+ from .environment import MyEnvironment
+
+ env = MyEnvironment()
+ app = create_fastapi_app(env)  # Auto-creates all endpoints
+ ```
+
+ The `create_fastapi_app()` function automatically creates:
+ - WebSocket endpoint at `/ws`
+ - HTTP endpoints at `/reset`, `/step`, `/state`
+ - Health check at `/health`
+ - OpenAPI docs at `/docs`
+ - Web UI at `/web`
+
+ ---
+
+ ## Client-Side: The EnvClient Class
+
+ The client handles all communication. You extend `EnvClient`:
+
+ ```python
+ from openenv.core import EnvClient
+
+ class MyEnv(EnvClient):
+
+     def _step_payload(self, action: MyAction) -> dict:
+         """Convert typed action to JSON dict for wire transfer"""
+         return action.model_dump()
+
+     def _parse_result(self, payload: dict) -> StepResult:
+         """Parse JSON response into typed StepResult"""
+         return StepResult(
+             observation=MyObservation(**payload["observation"]),
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: dict) -> MyState:
+         """Parse JSON response into typed State"""
+         return MyState(**payload)
+ ```
+
+ ### Client Usage (Async -- Recommended)
+
+ ```python
+ import asyncio
+
+ async def main():
+     async with MyEnv(base_url="http://localhost:8000") as env:
+         result = await env.reset()
+         while not result.done:
+             action = my_policy(result.observation)
+             result = await env.step(action)
+
+ asyncio.run(main())
+ ```
+
+ ### Client Usage (Sync)
+
+ ```python
+ with MyEnv(base_url="http://localhost:8000").sync() as env:
+     result = env.reset()
+     while not result.done:
+         action = my_policy(result.observation)
+         result = env.step(action)
+ ```
+
+ ---
+
+ ## Communication Protocol
+
+ ### Primary: WebSocket (`/ws`)
+
+ - Persistent, bidirectional connection
+ - Each connection gets its own isolated environment instance
+ - No session ID management needed -- the connection IS the session
+ - Low overhead (~0.1ms per message vs ~10-50ms for HTTP)
+
+ **Message types (client -> server):**
+ ```json
+ {"type": "reset"}
+ {"type": "step", "action": {...}}
+ {"type": "state"}
+ {"type": "close"}
+ ```
+
+ **Response types (server -> client):**
+ ```json
+ {"type": "observation", "observation": {...}, "reward": 0.5, "done": false}
+ {"type": "state", "state": {...}}
+ {"type": "error", "code": "...", "message": "..."}
+ ```
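The message shapes above are plain JSON, so they can be built and checked without a WebSocket library. The sketch below assumes only the fields shown in the two listings; `encode_request` and `decode_response` are hypothetical helpers, and a real client would hand the resulting strings to whatever WebSocket connection it uses.

```python
import json
from typing import Any, Dict, Optional

REQUEST_TYPES = {"reset", "step", "state", "close"}


def encode_request(kind: str, action: Optional[Dict[str, Any]] = None) -> str:
    """Serialize one client -> server message."""
    if kind not in REQUEST_TYPES:
        raise ValueError(f"unknown message type: {kind!r}")
    message: Dict[str, Any] = {"type": kind}
    if kind == "step":
        if action is None:
            raise ValueError("a step message needs an action")
        message["action"] = action
    return json.dumps(message)


def decode_response(raw: str) -> Dict[str, Any]:
    """Parse one server -> client message, raising on error responses."""
    message = json.loads(raw)
    if message.get("type") == "error":
        raise RuntimeError(f"{message.get('code')}: {message.get('message')}")
    return message
```

Turning error responses into exceptions keeps the calling loop free of per-message type checks.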
+
+ ### Fallback: HTTP
+
+ - `POST /reset` -> Reset environment
+ - `POST /step` -> Execute action (body = action JSON)
+ - `GET /state` -> Get current state
+ - Stateless -- requires session management
+
+ ---
+
+ ## The openenv.yaml Manifest
+
+ Every environment needs an `openenv.yaml`:
+
+ ```yaml
+ spec_version: 1
+ name: my_environment
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+ ```
+
+ | Field | Description |
+ |-------|-------------|
+ | `spec_version` | OpenEnv spec version (currently `1`) |
+ | `name` | Environment name (snake_case) |
+ | `type` | Deployment type (`space` for HF Spaces) |
+ | `runtime` | Server runtime (`fastapi`) |
+ | `app` | ASGI app path (module:variable) |
+ | `port` | Server port (default `8000`) |
+
+ ---
+
+ ## Project Structure
+
+ When you run `openenv init my_env`, you get:
+
+ ```
+ my_env/
+ ├── __init__.py           # Package init
+ ├── models.py             # Action, Observation, State
+ ├── client.py             # EnvClient implementation
+ ├── openenv.yaml          # Environment manifest
+ ├── pyproject.toml        # Python package metadata
+ ├── README.md             # Documentation
+ ├── .dockerignore         # Docker ignore
+ ├── outputs/
+ │   ├── logs/             # Runtime logs
+ │   └── evals/            # Evaluation results
+ └── server/
+     ├── app.py            # FastAPI application
+     ├── environment.py    # Environment logic (YOU WRITE THIS)
+     ├── requirements.txt  # Server dependencies
+     └── Dockerfile        # Container definition
+ ```
+
+ ---
+
+ ## Deployment Options
+
+ ### 1. Local Development (Uvicorn)
+
+ ```bash
+ uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
+ ```
+
+ ### 2. Docker
+
+ ```bash
+ docker build -t my-env:latest -f server/Dockerfile .
+ docker run -d -p 8000:8000 my-env:latest
+ ```
+
+ ### 3. Hugging Face Spaces (Required for Hackathon)
+
+ ```bash
+ openenv push --repo-id username/my-env
+ ```
+
+ This creates a Docker-based HF Space that provides:
+ - **Server**: Running endpoint at `https://username-my-env.hf.space`
+ - **Repository**: Pip-installable package
+ - **Registry**: Docker image at `registry.hf.space/username-my-env:latest`
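The two addresses above follow the same pattern: the `/` in the repo id becomes a `-`. A small sketch of that mapping is shown below; `space_urls` is a hypothetical helper, and it covers only the simple case from the example, since Hugging Face may normalize names further (for example letter case), which this sketch does not attempt.

```python
def space_urls(repo_id: str) -> dict:
    """Derive the Space endpoint and registry image from 'owner/name'."""
    owner, _, name = repo_id.partition("/")
    if not owner or not name:
        raise ValueError(f"expected 'owner/name', got {repo_id!r}")
    slug = f"{owner}-{name}"  # pattern taken from the example above
    return {
        "server": f"https://{slug}.hf.space",
        "registry": f"registry.hf.space/{slug}:latest",
    }
```

The `server` value is the base URL a client or a `/health` check would target after deployment.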
+
+ ---
+
+ ## Scaling
+
+ | Setup | Max Concurrent Sessions | Best For |
+ |-------|------------------------|----------|
+ | Single container, WebSocket | ~100 per worker | Development |
+ | HF Spaces free tier | ~128 | Demos, hackathon |
+ | Local Docker, 8 workers | ~2,048 | Local training |
+ | Multi-node + load balancer | ~16,384 | Large-scale training |
+
+ Key environment variables:
+ - `WORKERS` -- Number of uvicorn worker processes (default: 4)
+ - `MAX_CONCURRENT_ENVS` -- Max WebSocket sessions per worker (default: 100)
+ - `PORT` -- Server port (default: 8000)
+
+ ---
+
+ ## Integration with Training Frameworks
+
+ OpenEnv is designed to plug into existing RL training frameworks:
+
+ | Framework | Integration |
+ |-----------|-------------|
+ | **TRL (Transformers RL)** | Official support, GRPO training example |
+ | **torchforge** | Featured example (BlackJack GRPO) |
+ | **Unsloth** | Google Colab example |
+ | **ART** | Integration example |
+ | **Oumi** | GRPO notebook example |
+
+ The training loop pattern:
+ ```python
+ # 1. Connect to environment
+ env = MyEnv(base_url="...")
+
+ # 2. Generate rollouts
+ rewards = []
+ result = env.reset()
+ for step in range(max_steps):
+     action = model.generate(result.observation)
+     result = env.step(action)
+     rewards.append(result.reward)
+     if result.done:  # Stop stepping once the episode has ended
+         break
+
+ # 3. Update model with rewards (GRPO, PPO, etc.)
+ trainer.update(rewards)
+ ```
+
+ ---
+
+ ## Validation
+
+ Run before submitting:
+ ```bash
+ openenv validate
+ ```
+
+ This checks:
+ - `openenv.yaml` is valid
+ - Typed models are properly defined
+ - `step()`, `reset()`, `state()` endpoints work
+ - Server responds to health checks
+ - Docker builds successfully
03-TUTORIALS-BREAKDOWN.md ADDED
@@ -0,0 +1,384 @@
+ # OpenEnv Tutorials: Line-by-Line Breakdown
+
+ ## Tutorial 01: Environments (Fundamentals)
+
+ **Source:** https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/01-environments.md
+
+ ### Key Takeaways
+
+ #### RL is Just a Loop
+ ```python
+ while not done:
+     observation = environment.observe()
+     action = policy.choose(observation)
+     reward = environment.step(action)
+     policy.learn(reward)
+ ```
+ This is the fundamental loop every OpenEnv environment must support. Your environment provides `observe` (via `reset()`/`step()` return values), accepts `action` (via `step()`), and returns `reward`.
+
+ #### Why OpenEnv Over Gym?
+
+ | Problem with Gym | OpenEnv Solution |
+ |-----------------|-----------------|
+ | `obs[0][3]` -- what is this? | `obs.info_state` -- typed, IDE knows |
+ | Runs in same process (crashes training) | Docker containers (isolated) |
+ | "Works on my machine" | Same container everywhere |
+ | Python only | HTTP API, any language |
+ | Cryptic numpy errors | Clear type validation errors |
+
+ #### Three Components Every Environment Has
+
+ ```
+ your_env/
+ ├── models.py           # Type-safe contracts (Action, Observation, State)
+ ├── client.py           # What users import (EnvClient implementation)
+ └── server/
+     ├── environment.py  # Your simulation logic
+     ├── app.py          # FastAPI server
+     └── Dockerfile      # Container definition
+ ```
+
+ **This is the pattern you MUST follow.**
+
+ #### Server-Side Abstractions
+
+ ```python
+ from abc import ABC, abstractmethod
+
+ class Environment(ABC):
+     @abstractmethod
+     def reset(self) -> Observation:
+         """Start new episode"""
+
+     @abstractmethod
+     def step(self, action: Action) -> Observation:
+         """Execute action, return observation"""
+
+     @property
+     def state(self) -> State:
+         """Get episode metadata"""
+ ```
+
+ #### Client-Side Abstractions
+
+ ```python
+ class HTTPEnvClient(ABC):
+     def reset(self) -> StepResult: ...         # HTTP POST /reset
+     def step(self, action) -> StepResult: ...  # HTTP POST /step
+     def state(self) -> State: ...              # HTTP GET /state
+ ```
+
+ Users never see HTTP details. They just call clean Python methods.
+
+ #### The Client Pattern (3 Methods to Implement)
+
+ 1. `_step_payload(action)` -- Convert your action to JSON dict
+ 2. `_parse_result(payload)` -- Parse JSON response to typed StepResult
+ 3. `_parse_state(payload)` -- Parse JSON response to typed State
+
+ That's all! The base class handles HTTP/WebSocket communication.
+
+ #### Building Your Own (5 Steps)
+
+ 1. **Define Types** (`models.py`) -- Action, Observation, State as Pydantic models
+ 2. **Implement Environment** (`server/environment.py`) -- reset(), step(), state
+ 3. **Create Client** (`client.py`) -- 3 conversion methods
+ 4. **Create Server** (`server/app.py`) -- `create_fastapi_app(env)`
+ 5. **Dockerize** (`server/Dockerfile`) -- Standard Python container
+
+ #### Example Environments to Study
+
+ 1. **`echo_env/`** -- Simplest possible (great starter template)
+ 2. **`openspiel_env/`** -- Wraps external library (6 games)
+ 3. **`coding_env/`** -- Complex real-world use case
+
+ ---
+
+ ## Tutorial 02: Deployment
+
+ **Source:** https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/02-deployment.md
+
+ ### Key Takeaways
+
+ #### HF Spaces = Triple Infrastructure
+
+ A single HF Space gives you THREE things:
+
+ | Component | What | Access |
+ |-----------|------|--------|
+ | **Server** | Running endpoint | `https://username-space-name.hf.space` |
+ | **Repository** | Pip-installable package | `pip install git+https://huggingface.co/spaces/...` |
+ | **Registry** | Docker image | `docker pull registry.hf.space/...` |
+
+ **This is why HF Spaces is required** -- one deployment gives you everything.
+
+ #### Primary Protocol is WebSocket
+
+ The client connects via **WebSocket** (`/ws`), NOT HTTP. This is important:
+ - Persistent connection for the entire episode
+ - No session ID management needed
+ - Connection = session (auto-cleanup)
+ - Much lower latency
+
+ ```python
+ # Async (recommended)
+ async with MyEnv(base_url="https://my-space.hf.space") as client:
+     result = await client.reset()
+     result = await client.step(MyAction(...))
+
+ # Sync (wrapper)
+ with MyEnv(base_url="https://my-space.hf.space").sync() as client:
+     result = client.reset()
+     result = client.step(MyAction(...))
+ ```
+
+ #### Available Endpoints
+
+ | Endpoint | Protocol | Description |
+ |----------|----------|-------------|
+ | `/ws` | WebSocket | Primary session endpoint |
+ | `/health` | HTTP GET | Health check (MUST return 200) |
+ | `/reset` | HTTP POST | Reset (stateless fallback) |
+ | `/step` | HTTP POST | Step (stateless fallback) |
+ | `/state` | HTTP GET | Current state |
+ | `/docs` | HTTP GET | OpenAPI docs |
+ | `/web` | HTTP GET | Interactive web UI |
+
+ #### Deployment Workflow
+
+ ```bash
+ # 1. Initialize environment
+ openenv init my_env
+ cd my_env
+
+ # 2. Develop locally
+ uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
+
+ # 3. Test health
+ curl http://localhost:8000/health
+ # {"status": "healthy"}
+
+ # 4. Deploy to HF Spaces
+ openenv push --repo-id username/my-env
+
+ # 5. Verify deployment
+ curl https://username-my-env.hf.space/health
+ ```
+
+ #### Dockerfile Template
+
+ ```dockerfile
+ FROM python:3.11-slim
+ WORKDIR /app
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+ COPY . .
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
+ ```
+
+ #### Environment Variables for Configuration
+
+ | Variable | Default | Purpose |
+ |----------|---------|---------|
+ | `WORKERS` | 4 | Uvicorn worker processes |
+ | `PORT` | 8000 | Server port |
+ | `HOST` | 0.0.0.0 | Bind address |
+ | `MAX_CONCURRENT_ENVS` | 100 | Max sessions |
+ | `ENABLE_WEB_INTERFACE` | Auto | Web UI toggle |
+
+ ---
+
+ ## Tutorial 03: Scaling
+
+ **Source:** https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/03-scaling.md
+
+ ### Key Takeaways
+
+ #### Why WebSocket Matters for Scaling
+
+ **HTTP approach:** Each episode step = new TCP connection (~10-50ms overhead)
+ **WebSocket approach:** Persistent connection, messages as lightweight frames (~0.1ms)
+
+ With HTTP, N parallel episodes = N containers.
+ With WebSocket, N parallel episodes = 1 container (up to limits).
+
+ #### Single Container Capacity
+
+ | Infrastructure | Max Concurrent | Batch/Core |
+ |---------------|----------------|------------|
+ | Local Uvicorn (8 cores) | 2,048 | 256 |
+ | Local Docker (8 cores) | 2,048 | 256 |
+ | HF Spaces free (2 cores) | 128 | 64 |
+
+ **For the hackathon on HF Spaces free tier:** Expect up to ~128 concurrent sessions.
+
+ #### Scaling Parameters
+
+ ```bash
+ # More workers = more CPU cores utilized
+ WORKERS=8 uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 8
+
+ # Docker with configuration
+ docker run -d -p 8000:8000 \
+     -e WORKERS=4 \
+     -e MAX_CONCURRENT_ENVS=100 \
+     my-env:latest
+ ```
+
+ #### When to Scale Horizontally
+
+ - Success rate drops below 95%
+ - P99 latency exceeds 2x expected step time
+ - Connection errors increase under load
231
+
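The three triggers above can be folded into one check. The following is only a sketch: `should_scale_out`, its parameters, and the nearest-rank p99 helper are hypothetical names chosen for illustration, not part of OpenEnv.

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty latency sample."""
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) computed with integer arithmetic, as a 1-based rank
    rank = max(1, -(-99 * len(ordered) // 100))
    return ordered[rank - 1]


def should_scale_out(
    successes: int,
    total: int,
    latencies_ms: Sequence[float],
    expected_step_ms: float,
    connection_errors_rising: bool,
) -> bool:
    """Return True when any of the three scaling triggers fires."""
    if total == 0 or not latencies_ms:
        return False  # nothing measured yet, so no basis for scaling
    success_rate = successes / total
    return (
        success_rate < 0.95                          # success rate below 95%
        or p99(latencies_ms) > 2 * expected_step_ms  # P99 above 2x expected step time
        or connection_errors_rising                  # errors increasing under load
    )
```

Feeding it the counters from a load test gives a single yes/no answer; the thresholds are the ones listed above.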
232
+ For the hackathon, a single container is sufficient. Focus on getting it working correctly, not on scaling.
233
+
234
+ ---
235
+
236
+ ## Tutorial 04: Training (GRPO with Wordle)
237
+
238
+ **Source:** https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/04-training.md
239
+
240
+ ### Key Takeaways
241
+
242
+ This tutorial shows a **complete training pipeline** using TRL + GRPO to train an LLM to play Wordle. While we don't need to train a model for the hackathon, this tutorial reveals critical patterns.
243
+
244
+ #### The Rollout Pattern
245
+
246
+ This is how agents interact with OpenEnv environments during training:
247
+
248
+ ```python
249
+ def rollout_once(env, model, tokenizer, max_turns):
+     rewards = []  # per-step rewards collected over the episode
+     result = env.reset()
+     observation = result.observation
+
+     for turn in range(max_turns):
+         if result.done:
+             break
+
+         # 1. Build prompt from observation
+         user_prompt = make_prompt(observation)
+
+         # 2. Generate model response
+         response = model.generate(user_prompt)
+
+         # 3. Parse action from response
+         action = parse_action(response)
+
+         # 4. Step environment
+         result = env.step(action)
+         observation = result.observation
+
+         # 5. Collect reward
+         rewards.append(result.reward)
+
+     return rewards
272
+ ```
273
+
274
+ **This is essentially what your `inference.py` must do** but using the OpenAI client instead of a local model.
275
+
276
+ #### Reward Decomposition
277
+
278
+ The Wordle training uses **multiple reward signals**:
279
+
280
+ ```python
281
+ reward_funcs = [
+     reward_correct,     # Did you guess the word? (0 or 1)
+     reward_greens,      # How many green letters? (0.0-1.0)
+     reward_yellows,     # How many yellow letters? (0.0-1.0)
+     reward_repetition,  # Penalty for repeating guesses (0.0-1.0)
+ ]
287
+ ```
288
+
289
+ **Lesson for your environment:** Break your reward into multiple meaningful components that provide signal throughout the episode. Don't just use a single binary reward.
290
+
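One way to fold several component signals into a single bounded score is a normalised weighted sum. The sketch below is an assumption made for illustration (the function name `combine_rewards`, the component names, and the weights are all hypothetical); it is not how the Wordle tutorial or TRL combines its reward functions.

```python
from typing import Dict


def combine_rewards(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of reward components, clamped to [0.0, 1.0]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Components missing from the dict count as 0.0
    weighted = sum(weights[name] * components.get(name, 0.0) for name in weights)
    return max(0.0, min(1.0, weighted / total_weight))


# Example: no full solve yet, but partial progress still produces signal
score = combine_rewards(
    {"correct": 0.0, "greens": 0.4, "yellows": 0.2},
    {"correct": 2.0, "greens": 1.0, "yellows": 1.0},
)
```

Because the partial components carry weight, an episode that never reaches the final goal still earns a non-zero score.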
291
+ #### Observation Design
292
+
293
+ Wordle wraps game state into structured observations:
294
+ - `prompt` -- The initial game description
295
+ - `messages` -- History of all moves and feedback
296
+ - `done` -- Is the game over?
297
+
298
+ **Lesson:** Your observations should contain everything the agent needs to make its next decision. Include history, current state, and available actions.
299
+
300
+ #### Action Design
301
+
302
+ Wordle uses a simple text action:
303
+ ```python
304
+ class TextArenaAction(Action):
+     message: str  # The guess word
306
+ ```
307
+
308
+ **Lesson:** Keep actions simple. The agent (LLM) will generate text. Parse that text into your action format.
309
+
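Parsing free-form LLM text into an action is where many episodes break, so a tolerant parser helps. The sketch below is a hypothetical `parse_action` that assumes actions are JSON objects and that falling back to `{"action_type": "noop"}` is acceptable for your environment; adapt both assumptions to your own action model.

```python
import json
import re
from typing import Any, Dict

# Matches a fenced block (three backticks, optional "json" tag) in the reply
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def parse_action(response_text: str) -> Dict[str, Any]:
    """Extract the first JSON object from an LLM reply, or fall back to a no-op."""
    # Prefer fenced blocks, then fall back to scanning the whole reply
    candidates = _FENCE.findall(response_text) + [response_text]
    for text in candidates:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            continue
        try:
            parsed = json.loads(text[start : end + 1])
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return {"action_type": "noop"}  # assumed safe default when nothing parses
```

Returning a no-op instead of raising keeps the rollout loop alive when the model produces malformed output; the environment can still penalise the wasted step.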
310
+ #### The Inference Script Pattern (from sample)
311
+
312
+ The sample inference script shows the exact pattern you need:
313
+
314
+ ```python
315
+ # 1. Set up OpenAI client
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+ # 2. Connect to environment
+ env = MyEnv.from_docker_image(image="my-env:latest", env_vars={...})
+
+ # 3. Reset
+ result = env.reset()
+ observation = result.observation
+
+ # 4. Loop
+ for step in range(MAX_STEPS):
+     if result.done:
+         break
+
+     # Build prompt from observation
+     user_prompt = build_prompt(observation)
+
+     # Call LLM
+     completion = client.chat.completions.create(
+         model=MODEL_NAME,
+         messages=[
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": user_prompt},
+         ],
+     )
+     response_text = completion.choices[0].message.content
+
+     # Parse action
+     action = parse_action(response_text)
+
+     # Step
+     result = env.step(action)
+     observation = result.observation
+
+ # 5. Report scores
+ print(f"Final reward: {result.reward}")
352
+ ```
353
+
354
+ ---
355
+
356
+ ## Cross-Tutorial Summary
357
+
358
+ ### The Complete Picture
359
+
360
+ ```
361
+ 1. DESIGN (Tutorial 01)
+    Define models, implement Environment, create client
+
+ 2. DEPLOY (Tutorial 02)
+    Dockerfile → Docker build → HF Spaces push
+
+ 3. SCALE (Tutorial 03)
+    Configure workers, tune concurrency (optional for hackathon)
+
+ 4. TRAIN/EVALUATE (Tutorial 04)
+    inference.py uses OpenAI client → calls your environment
+    Reports scores for each task
373
+ ```
374
+
375
+ ### Critical Patterns for the Hackathon
376
+
377
+ 1. **Follow the project structure exactly** -- `models.py`, `client.py`, `server/environment.py`, `server/app.py`, `openenv.yaml`
378
+ 2. **Use Pydantic models** for Action, Observation, State
379
+ 3. **WebSocket is primary** but HTTP endpoints must also work
380
+ 4. **`create_fastapi_app(env)`** handles server boilerplate
381
+ 5. **Reward must be granular** -- partial progress signals, not binary
382
+ 6. **`inference.py`** must use OpenAI client with env vars
383
+ 7. **Docker must build and run** cleanly
384
+ 8. **HF Space must respond** to health checks and `reset()`
04-SOLUTION-STRATEGY.md ADDED
@@ -0,0 +1,421 @@
1
+ # Solution Strategy: How to Win This Hackathon
2
+
3
+ ## Step-by-Step Approach
4
+
5
+ ---
6
+
7
+ ## Phase 1: Choose a Domain (MOST IMPORTANT -- 30% of score)
8
+
9
+ The domain choice is the single biggest factor. It must be:
10
+ - A **real-world task** humans actually do
11
+ - **Not** a game or toy problem
12
+ - Something an agent can meaningfully learn
13
+ - Novel (not already in OpenEnv)
14
+
15
+ ### Domain Selection Criteria
16
+
17
+ | Criterion | Weight | Question to Ask |
18
+ |-----------|--------|-----------------|
19
+ | Real utility | HIGH | Would a company pay for an agent that does this? |
20
+ | Novelty | HIGH | Is this already in OpenEnv? |
21
+ | Simulatable | HIGH | Can you faithfully simulate it with code? |
22
+ | Gradeable | HIGH | Can you programmatically score performance? |
23
+ | Learnable | MEDIUM | Can an LLM actually improve at this through RL? |
24
+ | Interesting | MEDIUM | Will judges find this compelling? |
25
+
26
+ ### Strong Domain Ideas (Ranked)
27
+
28
+ #### Tier 1: High Impact, Novel, Clearly Real-World
29
+
30
+ 1. **Data Cleaning / Wrangling**
31
+ - Agent receives messy CSV/JSON data with errors (wrong types, duplicates, missing values, inconsistent formats)
32
+ - Agent must identify and fix issues through a sequence of cleaning operations
33
+ - Easy: Fix obvious type mismatches. Medium: Deduplicate with fuzzy matching. Hard: Infer correct values from context
34
+ - Grader: Compare cleaned output to ground truth dataset
35
+ - Why it wins: Data scientists do this daily, and no OpenEnv environment covers it yet.
36
+
37
+ 2. **Incident Response / Alert Triage**
38
+ - Agent receives system monitoring alerts (CPU spikes, error logs, latency increases)
39
+ - Agent must diagnose the root cause and take remediation steps
40
+ - Easy: Obvious single-cause alert. Medium: Multi-signal correlation. Hard: Cascading failures
41
+ - Grader: Did agent identify correct root cause and take correct action?
42
+ - Why it wins: DevOps/SRE teams deal with this 24/7. Critical and novel.
43
+
44
+ 3. **Resume / Job Application Screening**
45
+ - Agent receives job descriptions + candidate resumes
46
+ - Agent must rank, shortlist, and provide structured assessments
47
+ - Easy: Clear match/no-match. Medium: Nuanced ranking. Hard: Edge cases with transferable skills
48
+ - Grader: Compare agent ranking to expert ranking (Kendall tau / precision@k)
49
+ - Why it wins: HR teams spend thousands of hours on this.
50
+
51
+ 4. **Database Query Optimization**
52
+ - Agent receives slow SQL queries + schema information
53
+ - Agent must rewrite queries or suggest index changes
54
+ - Easy: Simple index suggestion. Medium: Query rewrite. Hard: Multi-table join optimization
55
+ - Grader: Compare execution time improvement, correctness of results
56
+ - Why it wins: Every backend developer needs this.
57
+
58
+ 5. **Meeting Scheduler / Calendar Management**
59
+ - Agent receives scheduling requests + constraints (availability, time zones, priorities)
60
+ - Agent must find optimal meeting times and resolve conflicts
61
+ - Easy: 2-person meeting with clear slot. Medium: Multi-person with constraints. Hard: Rescheduling cascade
62
+ - Grader: Does the schedule satisfy all hard constraints? How many soft constraints met?
63
+ - Why it wins: Universal pain point. Clean state space.
64
+
65
+ 6. **Bug Report Triage**
66
+ - Agent receives bug reports with descriptions, stack traces, logs
67
+ - Agent must categorize severity, assign to team, identify likely component
68
+ - Easy: Obvious severity + clear component. Medium: Ambiguous severity. Hard: Duplicate detection + root cause
69
+ - Grader: Match against expert labels for severity, component, priority
70
+ - Why it wins: Engineering teams need this. Well-defined success criteria.
71
+
72
+ 7. **Document Review / Compliance Check**
73
+ - Agent reviews contracts/documents against a checklist of requirements
74
+ - Agent must flag missing clauses, non-compliant sections, inconsistencies
75
+ - Easy: Clear missing clause. Medium: Subtle non-compliance. Hard: Cross-reference between sections
76
+ - Grader: Precision/recall of flagged issues against expert annotations
77
+ - Why it wins: Legal/compliance is a huge industry pain point. Novel domain.
78
+
79
+ 8. **Email Triage & Response Draft**
80
+ - Agent categorizes incoming emails by urgency/topic, drafts appropriate responses
81
+ - Easy: Clear spam/important classification. Medium: Nuanced priority + draft reply. Hard: Multi-email thread summarization + action items
82
+ - Grader: Category accuracy + response quality scoring
83
+ - Why it wins: Explicitly mentioned in problem statement.
84
+
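Several of the graders above compare an agent's ranking with an expert ranking (Kendall tau, precision@k). A minimal sketch of both metrics follows; the function names are illustrative, and it assumes both rankings contain the same items with no ties.

```python
from itertools import combinations
from typing import Sequence


def precision_at_k(agent_ranking: Sequence[str], expert_ranking: Sequence[str], k: int) -> float:
    """Fraction of the agent's top-k items that also appear in the expert's top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    expert_top = set(expert_ranking[:k])
    hits = sum(1 for item in agent_ranking[:k] if item in expert_top)
    return hits / k


def kendall_tau(agent_ranking: Sequence[str], expert_ranking: Sequence[str]) -> float:
    """Kendall tau-a in [-1.0, 1.0] for two tie-free rankings of the same items."""
    if len(set(agent_ranking)) != len(agent_ranking) or set(agent_ranking) != set(expert_ranking):
        raise ValueError("rankings must contain the same unique items")
    n = len(agent_ranking)
    if n < 2:
        return 1.0
    expert_pos = {item: i for i, item in enumerate(expert_ranking)}
    concordant = discordant = 0
    # Each pair (a, b) has a ranked above b by the agent
    for a, b in combinations(agent_ranking, 2):
        if expert_pos[a] < expert_pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def ranking_score(agent_ranking: Sequence[str], expert_ranking: Sequence[str]) -> float:
    """Map tau from [-1, 1] onto the 0.0-1.0 range that graders must return."""
    return (kendall_tau(agent_ranking, expert_ranking) + 1.0) / 2.0
```

Rescaling tau to 0.0-1.0 keeps the grader within the required score range while still rewarding partially correct orderings.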
85
+ #### Tier 2: Good but Slightly Less Novel
86
+
87
+ 9. Content moderation
88
+ 10. Code review (PR review)
89
+ 11. Customer support ticket routing
90
+ 12. Data entry from unstructured documents
91
+
92
+ ### My Recommendation
93
+
94
+ **Go with Data Cleaning / Wrangling** or **Incident Response / Alert Triage**. They score highest on:
95
+ - Real-world utility (every company needs this)
96
+ - Novelty (not in OpenEnv)
97
+ - Gradeability (clear right answers)
98
+ - Difficulty progression (natural easy->hard)
99
+ - Interesting reward shaping (partial cleaning credit, partial diagnosis credit)
100
+
101
+ ---
102
+
103
+ ## Phase 2: Design the Environment
104
+
105
+ ### Action Space Design
106
+
107
+ Keep actions **simple and text-based** (LLMs generate text). Example for data cleaning:
108
+
109
+ ```python
110
+ class DataCleanAction(Action):
+     operation: str      # "fix_type", "remove_duplicate", "fill_missing", "standardize", "validate"
+     target_row: int     # Which row to act on
+     target_column: str  # Which column to act on
+     new_value: str      # The corrected value (if applicable)
+     reasoning: str      # Why this action (for debugging)
116
+ ```
117
+
118
+ ### Observation Space Design
119
+
120
+ Include everything the agent needs:
121
+
122
+ ```python
123
+ from typing import List
+
+ class DataCleanObservation(Observation):
+     task_description: str       # What the agent should do
+     current_data_sample: str    # Current state of the data (preview)
+     data_statistics: dict       # Column types, null counts, etc.
+     detected_issues: List[str]  # Known issues found so far
+     actions_taken: List[str]    # History of previous actions
+     remaining_budget: int       # Actions remaining
+     hint: str                   # Optional hint for the current issue
131
+ ```
132
+
133
+ ### State Design
134
+
135
+ ```python
136
+ class DataCleanState(State):
+     task_name: str
+     difficulty: str  # "easy", "medium", "hard"
+     total_issues: int
+     issues_fixed: int
+     current_score: float
142
+ ```
143
+
144
+ ### Reward Function Design (CRITICAL)
145
+
146
+ The reward must provide **continuous signal**:
147
+
148
+ ```python
149
+ def compute_reward(self) -> float:
+     # Base score: fraction of issues correctly fixed (0.0 to 1.0);
+     # guard against a task that defines no issues
+     fix_score = self.issues_fixed / self.total_issues if self.total_issues else 0.0
+
+     # Bonus for fixing issues efficiently (fewer actions)
+     efficiency_bonus = max(0, 1.0 - (self.steps_taken / self.max_steps)) * 0.1
+
+     # Penalty for introducing NEW errors
+     error_penalty = self.new_errors_introduced * 0.05
+
+     # Penalty for repetitive/useless actions
+     waste_penalty = self.wasted_actions * 0.02
+
+     reward = fix_score + efficiency_bonus - error_penalty - waste_penalty
+     return max(0.0, min(1.0, reward))  # Clamp to [0.0, 1.0]
164
+ ```
165
+
166
+ Key principles:
167
+ - **Incremental:** Each correct fix increases the reward
168
+ - **Penalizes bad behavior:** Wrong fixes, wasted actions
169
+ - **Efficiency bonus:** Solving faster is better
170
+ - **Always between 0.0 and 1.0**
171
+
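To see how the terms interact, here is the same formula as a standalone function with one worked case. The standalone signature is an assumption made so the arithmetic can be run in isolation; the coefficients are the ones from the reward function above.

```python
def compute_reward(
    issues_fixed: int,
    total_issues: int,
    steps_taken: int,
    max_steps: int,
    new_errors_introduced: int,
    wasted_actions: int,
) -> float:
    """Standalone version of the shaped reward, clamped to [0.0, 1.0]."""
    fix_score = issues_fixed / total_issues if total_issues else 0.0
    efficiency_bonus = max(0.0, 1.0 - steps_taken / max_steps) * 0.1
    error_penalty = new_errors_introduced * 0.05
    waste_penalty = wasted_actions * 0.02
    reward = fix_score + efficiency_bonus - error_penalty - waste_penalty
    return max(0.0, min(1.0, reward))


# 3 of 5 issues fixed after 6 of 20 steps, with 1 new error and 2 wasted actions:
# 0.60 (fixes) + 0.07 (efficiency) - 0.05 (new error) - 0.04 (waste), roughly 0.58
example = compute_reward(3, 5, 6, 20, 1, 2)
```

Note that the efficiency bonus is capped at 0.1, so it can break ties between agents but cannot outweigh an unfixed issue when `total_issues` is small.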
172
+ ---
173
+
174
+ ## Phase 3: Define the Three Tasks
175
+
176
+ ### Task Design Pattern
177
+
178
+ | Task | Difficulty | Data Complexity | Issues to Fix | Max Steps | What Makes It Hard |
179
+ |------|-----------|----------------|---------------|-----------|-------------------|
180
+ | Task 1 | Easy | Small (10 rows) | 3 obvious | 10 | Errors are obvious |
181
+ | Task 2 | Medium | Medium (50 rows) | 5 mixed | 20 | Some ambiguity, requires reasoning |
182
+ | Task 3 | Hard | Large (200 rows) | 8+ subtle | 30 | Requires context understanding, multi-step fixing |
183
+
184
+ ### Grader Design
185
+
186
+ Each grader must be:
187
+ - **Deterministic:** Same actions = same score
188
+ - **Scored 0.0-1.0:** Continuous, not binary
189
+ - **Fair:** Measures actual task completion
190
+
191
+ ```python
192
+ class TaskGrader:
+     def grade(self, final_state: DataCleanState) -> float:
+         """Score from 0.0 to 1.0"""
+         # Compare final data against ground truth.
+         # Assumes the final state exposes the cleaned dataset as `.data`;
+         # count_correct_fixes / count_new_errors are helpers you supply.
+         correct_fixes = count_correct_fixes(final_state.data, self.ground_truth)
+         total_issues = len(self.known_issues)
+         if total_issues == 0:
+             return 0.0  # nothing to grade
+
+         # Base score
+         score = correct_fixes / total_issues
+
+         # Deductions for new errors introduced
+         new_errors = count_new_errors(final_state.data, self.original_data)
+         score -= new_errors * 0.1
+
+         return max(0.0, min(1.0, score))
207
+ ```
208
+
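A self-contained version of this idea can help when prototyping. The sketch below makes several assumptions that are not in the original design: rows are dictionaries of strings, known issues are `(row_index, column_name)` cells, and no rows are added or deleted. The function name and signature are illustrative only.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Cell = Tuple[int, str]  # (row_index, column_name)


def grade(
    cleaned: List[Row],
    original: List[Row],
    ground_truth: List[Row],
    known_issues: List[Cell],
    new_error_penalty: float = 0.1,
) -> float:
    """Deterministic 0.0-1.0 score: fixed issue cells minus newly broken cells."""
    if not (len(cleaned) == len(original) == len(ground_truth)):
        raise ValueError("this sketch assumes no rows were added or deleted")
    issue_cells = set(known_issues)
    if not issue_cells:
        return 0.0
    # An issue counts as fixed when the cell now matches the ground truth
    correct = sum(1 for row, col in issue_cells if cleaned[row][col] == ground_truth[row][col])
    # A new error is any change to a cell that was not a known issue
    new_errors = sum(
        1
        for row_index, row in enumerate(original)
        for col in row
        if (row_index, col) not in issue_cells and cleaned[row_index][col] != row[col]
    )
    score = correct / len(issue_cells) - new_errors * new_error_penalty
    return max(0.0, min(1.0, score))
```

Because the function depends only on its arguments, the same final dataset always yields the same score, which is the determinism property required above. Handling deleted rows (for duplicate removal) would need row identifiers rather than positional indices.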
209
+ ---
210
+
211
+ ## Phase 4: Implement
212
+
213
+ ### File Structure
214
+
215
+ ```
216
+ my_env/
+ ├── inference.py              # Baseline inference (OpenAI client)
+ ├── openenv.yaml              # Environment manifest
+ ├── pyproject.toml            # Package definition
+ ├── README.md                 # Full documentation
+ ├── __init__.py               # Package init
+ ├── models.py                 # Action, Observation, State
+ ├── client.py                 # EnvClient implementation
+ ├── tasks/
+ │   ├── task_easy.json        # Easy task data
+ │   ├── task_medium.json      # Medium task data
+ │   └── task_hard.json        # Hard task data
+ ├── graders/
+ │   ├── __init__.py
+ │   └── grader.py             # Programmatic graders
+ └── server/
+     ├── app.py                # FastAPI app
+     ├── environment.py        # Environment logic
+     ├── Dockerfile            # Container
+     └── requirements.txt      # Dependencies
236
+ ```
237
+
238
+ ### Implementation Order
239
+
240
+ 1. **models.py** -- Define Action, Observation, State
241
+ 2. **server/environment.py** -- Core logic (reset, step, state)
242
+ 3. **tasks/** -- Create task data files
243
+ 4. **graders/grader.py** -- Implement scoring
244
+ 5. **server/app.py** -- Wire up FastAPI
245
+ 6. **client.py** -- Client implementation
246
+ 7. **server/Dockerfile** -- Containerize
247
+ 8. **inference.py** -- Baseline agent
248
+ 9. **openenv.yaml** -- Manifest
249
+ 10. **README.md** -- Documentation
250
+
251
+ ---
252
+
253
+ ## Phase 5: Inference Script
254
+
255
+ The inference script must follow the exact pattern from the sample:
256
+
257
+ ```python
258
+ import os
+ from openai import OpenAI
+
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ MODEL_NAME = os.getenv("MODEL_NAME")
+
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+ # `env` is a connected environment client and MAX_STEPS a step budget,
+ # both set up beforehand (e.g. env = MyEnv.from_docker_image(...))
+
+ # For each task (easy, medium, hard):
+ for task in ["easy", "medium", "hard"]:
+     # 1. Reset environment with task
+     result = env.reset(task=task)
+
+     # 2. Loop until done or max steps
+     for step in range(MAX_STEPS):
+         if result.done:
+             break
+
+         # 3. Build prompt from observation
+         prompt = build_prompt(result.observation)
+
+         # 4. Call LLM
+         completion = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[...],
+         )
+
+         # 5. Parse action from response
+         action = parse_action(completion.choices[0].message.content)
+
+         # 6. Step environment
+         result = env.step(action)
+
+     # 7. Report score
+     print(f"Task {task}: Score = {result.reward}")
294
+ ```
295
+
296
+ ### Critical Requirements
297
+ - Must use `OpenAI` client (not requests, not httpx)
298
+ - Must read `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from env vars
299
+ - Must complete in < 20 minutes on 2 vCPU / 8GB RAM
300
+ - Must produce reproducible scores
301
+ - Named `inference.py` at project root
302
+
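Since a missing variable otherwise surfaces as an obscure API error mid-run, it can help to validate the configuration up front. The sketch below is one possible approach; `InferenceConfig` and `load_config` are hypothetical names, and the default base URL is the one used in the script above.

```python
import os
from typing import List, Mapping, NamedTuple


class InferenceConfig(NamedTuple):
    api_base_url: str
    api_key: str
    model_name: str


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read the required variables and fail fast with a clear message."""
    api_base_url = env.get("API_BASE_URL", "https://router.huggingface.co/v1")
    api_key = env.get("HF_TOKEN") or env.get("API_KEY")
    model_name = env.get("MODEL_NAME")

    missing: List[str] = []
    if not api_key:
        missing.append("HF_TOKEN (or API_KEY)")
    if not model_name:
        missing.append("MODEL_NAME")
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return InferenceConfig(api_base_url, api_key, model_name)
```

Accepting a mapping instead of reading `os.environ` directly makes the function easy to test without touching the real environment, and keeps API keys out of the source.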
303
+ ---
304
+
305
+ ## Phase 6: Docker & Deployment
306
+
307
+ ### Dockerfile
308
+
309
+ ```dockerfile
310
+ FROM python:3.11-slim
311
+
312
+ WORKDIR /app
313
+
314
+ # Install dependencies
315
+ COPY server/requirements.txt .
316
+ RUN pip install --no-cache-dir -r requirements.txt
317
+
318
+ # Copy source
319
+ COPY . .
320
+
321
+ # Expose port
322
+ EXPOSE 8000
323
+
324
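+ # Health check (python:3.11-slim does not ship curl, so probe with Python's stdlib)
+ HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)" || exit 1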
327
+
328
+ # Run
329
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
330
+ ```
331
+
332
+ ### Test Locally
333
+
334
+ ```bash
335
+ # Build
336
+ docker build -t my-env:latest -f server/Dockerfile .
337
+
338
+ # Run
339
+ docker run -d -p 8000:8000 my-env:latest
340
+
341
+ # Test
342
+ curl http://localhost:8000/health
343
+ ```
344
+
345
+ ### Deploy to HF Spaces
346
+
347
+ ```bash
348
+ openenv push --repo-id username/my-env
349
+ ```
350
+
351
+ Or manually:
352
+ 1. Create a new Space on HF (Docker SDK)
353
+ 2. Push code to the Space repo
354
+ 3. Wait for build
355
+ 4. Test: `curl https://username-my-env.hf.space/health`
356
+
357
+ ---
358
+
359
+ ## Phase 7: README
360
+
361
+ Must include:
362
+ 1. **Environment description and motivation** -- What real-world task? Why?
363
+ 2. **Action space definitions** -- What actions can the agent take?
364
+ 3. **Observation space definitions** -- What does the agent see?
365
+ 4. **Task descriptions** -- Easy/Medium/Hard with expected difficulty
366
+ 5. **Setup instructions** -- How to install and run
367
+ 6. **Baseline scores** -- Reproducible results from inference.py
368
+
369
+ ---
370
+
371
+ ## Pre-Submission Checklist
372
+
373
+ - [ ] HF Space deploys and returns 200 on `/health`
374
+ - [ ] `reset()` works and returns valid observation
375
+ - [ ] `step()` works and returns valid step result
376
+ - [ ] `state()` works and returns valid state
377
+ - [ ] `openenv validate` passes
378
+ - [ ] `docker build` succeeds
379
+ - [ ] `docker run` starts cleanly
380
+ - [ ] `inference.py` runs without error
381
+ - [ ] `inference.py` produces scores for all 3 tasks
382
+ - [ ] All scores are 0.0-1.0
383
+ - [ ] Graders are deterministic (run twice, same scores)
384
+ - [ ] Inference completes in < 20 minutes
385
+ - [ ] Runs on 2 vCPU / 8GB RAM
386
+ - [ ] `openenv.yaml` is valid
387
+ - [ ] README has all required sections
388
+ - [ ] No hardcoded API keys
389
+ - [ ] No plagiarism
390
+
391
+ ---
392
+
393
+ ## Scoring Optimization Tips
394
+
395
+ ### Maximize Real-World Utility (30%)
396
+ - Choose a domain that IMMEDIATELY resonates: "Oh yes, I hate doing that manually"
397
+ - Describe the business impact in your README
398
+ - Show that your environment could genuinely be used to train useful agents
399
+
400
+ ### Maximize Task & Grader Quality (25%)
401
+ - Make easy task ACTUALLY easy (a naive agent should get > 0.3)
402
+ - Make hard task ACTUALLY hard (frontier models should struggle)
403
+ - Graders must never return a constant score: different agent behavior should produce different scores
404
+ - Test graders with random agents to verify score distribution
405
+
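The grader checks above can be automated with a small harness. This is a sketch under stated assumptions: `run_episode` is a hypothetical callable that plays one seeded episode (for example with a random agent) and returns the grader's score; `toy_random_agent_episode` is a stand-in used only to make the example runnable.

```python
import random
from typing import Callable, Dict, Iterable


def check_grader(run_episode: Callable[[int], float], seeds: Iterable[int]) -> Dict[str, bool]:
    """Run every seed twice and report basic grader health properties."""
    seed_list = list(seeds)
    first = [run_episode(seed) for seed in seed_list]
    second = [run_episode(seed) for seed in seed_list]
    return {
        "deterministic": first == second,                       # same seed, same score
        "in_range": all(0.0 <= score <= 1.0 for score in first),
        "non_constant": len(set(first)) > 1,                    # scores react to behavior
    }


def toy_random_agent_episode(seed: int) -> float:
    """Stand-in episode: a random agent fixes each of 5 issues with probability 0.3."""
    rng = random.Random(seed)
    total_issues = 5
    fixed = sum(1 for _ in range(total_issues) if rng.random() < 0.3)
    return fixed / total_issues


report = check_grader(toy_random_agent_episode, range(20))
```

A grader that fails `non_constant` matches the disqualification trigger of always returning the same score, so this check is worth running before submission.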
406
+ ### Maximize Environment Design (20%)
407
+ - `reset()` must produce COMPLETELY clean state (no leakage)
408
+ - Reward function should have clear signal at every step
409
+ - Action space should be complete but not overwhelming
410
+ - Episode should end at a natural stopping point
411
+
412
+ ### Maximize Code Quality (15%)
413
+ - Clean, typed code throughout
414
+ - Comprehensive error handling in server
415
+ - All models properly documented
416
+ - Tests if time allows
417
+
418
+ ### Maximize Creativity (10%)
419
+ - Pick a domain judges haven't seen
420
+ - Design an interesting reward function
421
+ - Add a clever mechanic (e.g., the agent can "ask for help" at a cost)
05-QUICK-REFERENCE.md ADDED
@@ -0,0 +1,127 @@
1
+ # Quick Reference: Everything at a Glance
2
+
3
+ ## The Equation
4
+
5
+ ```
6
+ Real-World Task + OpenEnv Spec + 3 Graded Tasks + Docker + HF Space + inference.py = Submission
7
+ ```
8
+
9
+ ---
10
+
11
+ ## OpenEnv Spec in 30 Seconds
12
+
13
+ ```python
14
+ # models.py -- Define your data contracts
+ class MyAction(Action):              # What the agent sends
+     ...
+ class MyObservation(Observation):    # What the agent receives
+     ...
+ class MyState(State):                # Episode metadata
+     ...
+
+ # server/environment.py -- Your core logic
+ class MyEnvironment(Environment):
+     def reset(self) -> Observation: ...         # Start fresh episode
+     def step(self, action) -> Observation: ...  # Process action
+     @property
+     def state(self) -> State: ...               # Episode metadata
+
+ # server/app.py -- One line
+ app = create_fastapi_app(MyEnvironment())
+
+ # client.py -- Three methods
+ class MyEnv(EnvClient):
+     def _step_payload(self, action) -> dict: ...
+     def _parse_result(self, payload) -> StepResult: ...
+     def _parse_state(self, payload) -> State: ...
+
+ # openenv.yaml -- Six fields (YAML, shown here as comments)
+ #   spec_version: 1
+ #   name: my_env
+ #   type: space
+ #   runtime: fastapi
+ #   app: server.app:app
+ #   port: 8000
45
+ ```
46
+
47
+ ---
48
+
49
+ ## Scoring Cheat Sheet
50
+
51
+ | Criterion | Weight | Key to High Score |
52
+ |-----------|--------|-------------------|
53
+ | Real-world utility | **30%** | Domain people actually struggle with |
54
+ | Task & grader quality | **25%** | 3 tasks, deterministic graders, 0.0-1.0, difficulty spread |
55
+ | Environment design | **20%** | Clean reset, rich rewards, good action/obs spaces |
56
+ | Code quality & spec | **15%** | OpenEnv compliant, Docker works, HF deploys |
57
+ | Creativity & novelty | **10%** | New domain, clever mechanics |
58
+
59
+ ---
60
+
61
+ ## Instant DQ Triggers
62
+
63
+ 1. Environment doesn't deploy or respond
64
+ 2. Plagiarized/trivially modified
65
+ 3. Graders always return same score
66
+ 4. No inference.py
67
+ 5. < 3 tasks
68
+
69
+ ---
70
+
71
+ ## Inference Script Template
72
+
73
+ ```python
74
+ import os
75
+ from openai import OpenAI
76
+
77
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
78
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
79
+ MODEL_NAME = os.getenv("MODEL_NAME")
80
+
81
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
82
+
83
+ # Must: Use OpenAI client, read env vars, < 20min, 2vCPU/8GB
84
+ # Must: Run all 3 tasks, produce reproducible scores 0.0-1.0
85
+ ```
86
+
87
+ ---
88
+
89
+ ## Docker Template
90
+
91
+ ```dockerfile
92
+ FROM python:3.11-slim
93
+ WORKDIR /app
94
+ COPY server/requirements.txt .
95
+ RUN pip install --no-cache-dir -r requirements.txt
96
+ COPY . .
97
+ EXPOSE 8000
98
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
99
+ ```
100
+
101
+ ---
102
+
103
+ ## Required Environment Variables
104
+
105
+ | Variable | Purpose |
106
+ |----------|---------|
107
+ | `API_BASE_URL` | LLM API endpoint |
108
+ | `MODEL_NAME` | Model identifier |
109
+ | `HF_TOKEN` | HuggingFace API key |
110
+
111
+ ---
112
+
113
+ ## Judging Phases
114
+
115
+ 1. **Automated** (pass/fail): Deploy? Spec compliant? Docker builds? Inference runs? 3+ tasks?
116
+ 2. **Agentic** (scored): Standard LLM agent runs against your env
117
+ 3. **Human** (top subs): Meta + HF engineers review utility, creativity, exploits
118
+
119
+ ---
120
+
121
+ ## Key Links
122
+
123
+ - OpenEnv GitHub: https://github.com/meta-pytorch/OpenEnv
124
+ - Hackathon: https://www.scaler.com/school-of-technology/meta-pytorch-hackathon
125
+ - HF Spaces Docs: https://huggingface.co/docs/hub/spaces
126
+ - OpenEnv PyPI: `pip install openenv-core`
127
+ - OpenEnv CLI: `openenv init`, `openenv push`, `openenv validate`
Dockerfile ADDED
@@ -0,0 +1,31 @@
1
+ FROM python:3.11-slim
2
+
3
+ # Create non-root user (HF Spaces requirement)
4
+ RUN useradd -m -u 1000 user
5
+ ENV HOME=/home/user \
6
+ PATH=/home/user/.local/bin:$PATH
7
+
8
+ WORKDIR /app
9
+
10
+ # Install dependencies
11
+ COPY requirements.txt .
12
+ RUN pip install --no-cache-dir -r requirements.txt
13
+
14
+ # Copy source
15
+ COPY . .
16
+
17
+ # Install the package
18
+ RUN pip install --no-cache-dir -e .
19
+
20
+ # Switch to non-root user
21
+ USER user
22
+
23
+ # Expose port (7860 for HF Spaces)
24
+ EXPOSE 7860
25
+
26
+ # Health check
27
+ HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
28
+ CMD python -c "import requests; requests.get('http://localhost:7860/health').raise_for_status()" || exit 1
29
+
30
+ # Run the server
31
+ CMD ["uvicorn", "dataclean_env.server.app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md ADDED
@@ -0,0 +1,228 @@
1
+ ---
2
+ title: DataClean Environment
3
+ emoji: "🧹"
4
+ colorFrom: blue
5
+ colorTo: green
6
+ sdk: docker
7
+ app_port: 7860
8
+ short_description: OpenEnv-compatible data cleaning environment for training and evaluating agent workflows.
9
+ tags:
10
+ - openenv
11
+ - docker
12
+ - fastapi
13
+ - data-cleaning
14
+ ---
15
+
16
+ # DataClean Environment
17
+
18
+ An OpenEnv-compliant environment for training and evaluating AI agents on **real-world data-quality cleaning tasks**.
19
+
20
+ Every organisation struggles with dirty data — missing values, duplicate records, format inconsistencies, anomalous entries, and cross-field validation failures. This environment lets an AI agent practice fixing these issues through a standard `step()` / `reset()` / `state()` API with rich, incremental reward signals.
21
+
22
+ ---
23
+
24
+ ## Motivation
25
+
26
+ Data cleaning is often estimated to consume up to 80% of a data professional's time. Automating even a fraction of this work has enormous practical value. This environment:
27
+
28
+ - Models tasks that humans **actually do every day** (not games or toys)
29
+ - Provides a realistic, graded benchmark for evaluating LLM-based data agents
30
+ - Rewards partial progress, not just final correctness
31
+ - Scales from simple fixes (missing emails) to subtle cross-field audits (age vs birth-date mismatches)
32
+
33
+ ---
34
+
35
+ ## Environment Overview
36
+
37
+ | Property | Value |
38
+ |----------|-------|
39
+ | **Domain** | Data-quality analysis and cleaning |
40
+ | **Action space** | `fix_value`, `delete_row`, `fill_missing`, `flag_anomaly`, `submit`, `noop` |
41
+ | **Observation space** | Text table of current data + quality report + column stats + history |
42
+ | **Reward range** | 0.0 – 1.0 (continuous, per-step updates) |
43
+ | **Episode length** | 15 / 25 / 35 steps (easy / medium / hard) |
44
+ | **Tasks** | 3 (easy, medium, hard) |
45
+
46
+ ---
47
+
48
+ ## Action Space
49
+
50
+ | Action | Parameters | Description |
51
+ |--------|-----------|-------------|
52
+ | `fix_value` | `row_index`, `column_name`, `new_value` | Overwrite a cell with the corrected value |
53
+ | `delete_row` | `row_index` | Remove a duplicate or invalid row |
54
+ | `fill_missing` | `row_index`, `column_name`, `new_value` | Fill an empty/null cell |
55
+ | `flag_anomaly` | `row_index`, `column_name` | Mark a cell as suspicious (partial credit) |
56
+ | `submit` | — | End the episode and finalise scoring |
57
+ | `noop` | — | Do nothing this step |
58
+
59
+ Actions are JSON objects:
60
+ ```json
61
+ {"action_type": "fix_value", "row_index": 2, "column_name": "phone", "new_value": "555-0103"}
62
+ ```
63
+
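On the agent side, it can be useful to check an action against the table above before sending it. The sketch below is an illustration only: `validate_action` and `REQUIRED_FIELDS` are hypothetical names derived from the documented parameters, and the server's own validation may differ.

```python
import json
from typing import Any, Dict, Tuple

# Required parameters per action type, taken from the action table
REQUIRED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "fix_value": ("row_index", "column_name", "new_value"),
    "delete_row": ("row_index",),
    "fill_missing": ("row_index", "column_name", "new_value"),
    "flag_anomaly": ("row_index", "column_name"),
    "submit": (),
    "noop": (),
}


def validate_action(raw: str) -> Dict[str, Any]:
    """Parse a JSON action and check that its required fields are present."""
    action = json.loads(raw)
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    missing = [field for field in REQUIRED_FIELDS[action_type] if field not in action]
    if missing:
        raise ValueError(f"{action_type} is missing fields: {', '.join(missing)}")
    return action


action = validate_action(
    '{"action_type": "fix_value", "row_index": 2, "column_name": "phone", "new_value": "555-0103"}'
)
```

Rejecting malformed actions locally avoids spending a step of the episode budget on a request the environment cannot apply.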
64
+ ---
65
+
66
+ ## Observation Space
67
+
68
+ Each observation contains:
69
+
70
+ | Field | Type | Description |
71
+ |-------|------|-------------|
72
+ | `task_name` | string | Task identifier (easy/medium/hard) |
73
+ | `task_description` | string | Human-readable goal |
74
+ | `difficulty` | string | easy / medium / hard |
75
+ | `data_preview` | string | Current dataset as an aligned text table |
76
+ | `quality_report` | string | Auto-detected quality issues (hints, not answers) |
77
+ | `columns_info` | list[dict] | Per-column stats: name, total, empty, unique |
78
+ | `action_history` | list[string] | Log of recent actions and outcomes |
79
+ | `step_number` | int | Current step (1-based) |
80
+ | `max_steps` | int | Action budget |
81
+ | `current_score` | float | Running score 0.0–1.0 |
82
+ | `available_actions` | list[string] | Valid action types |
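
An agent may want `data_preview` back as structured rows. The sketch below is an assumption-laden illustration, not part of the commit: it relies on the layout produced by `_render_table` in `dataclean_env/server/environment.py` (a pipe-delimited header line, one separator line, then one line per row, with `[EMPTY]` marking blank cells). The name `parse_preview` is hypothetical, and cell values that themselves contain `|` or were truncated at 30 characters are not recovered.

```python
from typing import Dict, List


def parse_preview(preview: str) -> List[Dict[str, str]]:
    """Parse the pipe-delimited text table into a list of row dicts."""
    lines = [ln for ln in preview.splitlines() if ln.strip()]
    if len(lines) < 2:
        # e.g. the "(empty dataset)" placeholder has no header/separator pair
        return []

    def split_cells(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    header = split_cells(lines[0])
    rows: List[Dict[str, str]] = []
    for line in lines[2:]:  # lines[1] is the |----| separator
        record = dict(zip(header, split_cells(line)))
        # Map the [EMPTY] marker back to an empty string.
        rows.append({k: ("" if v == "[EMPTY]" else v) for k, v in record.items()})
    return rows
```

The `row` column holds the original row index as a string, which is the value to pass as `row_index` after converting it with `int()`.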
+
+ ---
+
+ ## Tasks
+
+ ### Task 1: Easy — Customer Contact Cleanup
+ - **Dataset**: 10 customer records (name, email, phone, age, city)
+ - **Issues** (5): Missing email, invalid phone format, exact duplicate row, impossible age, malformed email
+ - **Max steps**: 15
+ - **Expected difficulty**: A capable LLM should score 0.6–1.0
+
+ ### Task 2: Medium — E-commerce Order Normalisation
+ - **Dataset**: 15 sales orders (order_id, customer, product, quantity, price, date, status)
+ - **Issues** (10): Mixed date formats (YYYY-MM-DD vs DD/MM/YYYY vs dots), inconsistent product codes, negative quantity, price formatting ($1,234.56 vs 1234.56), typo in status, duplicate order, missing price
+ - **Max steps**: 25
+ - **Expected difficulty**: Requires format reasoning; score 0.3–0.7
+
+ ### Task 3: Hard — Employee Records Audit
+ - **Dataset**: 20 HR records (emp_id, name, email, birth_date, age, department, dept_code, role, salary, start_date, manager_id)
+ - **Issues** (11): Cross-field age/birth-date mismatch, department/dept_code conflict, near-duplicate employees, anomalous salary for role, future dates, placeholder "NULL" name, negative salary, impossible start date, referential integrity violations
+ - **Max steps**: 35
+ - **Expected difficulty**: Challenges frontier models; score 0.1–0.5
+
+ ---
+
+ ## Reward Function
+
+ The reward provides signal **at every step**, not just at episode end:
+
+ ```
+ score = (issues_fixed / total_issues) - wrong_fix_penalty + efficiency_bonus
+ ```
+
+ - **Partial progress**: Each correctly fixed issue adds `1/total_issues` to the score
+ - **Wrong-fix penalty**: Changing a correct value to something wrong, or deleting a row that is not a duplicate, costs 0.05 per occurrence (capped so it never exceeds the progress earned)
+ - **Efficiency bonus**: Ending the episode early adds a bonus of up to 0.05, proportional to the unused step budget
+ - **Flag partial credit**: Flagging the right cell (without fixing it) counts as resolving the issue
+ - **Range**: Always clamped to [0.0, 1.0]
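
The formula can be sketched as a standalone function. This follows the `_compute_score` method in `dataclean_env/server/environment.py` from this commit; the standalone name `compute_score` and its argument list are illustrative only.

```python
def compute_score(
    issues_fixed: int,
    total_issues: int,
    wrong_fixes: int,
    steps_used: int,
    max_steps: int,
    done: bool,
) -> float:
    """Score in [0.0, 1.0]: progress minus wrong-fix penalty plus efficiency bonus."""
    if total_issues == 0:
        return 1.0

    # Each resolved issue contributes 1/total_issues.
    base = issues_fixed / total_issues

    # 0.05 per wrong fix, capped at the progress earned so far.
    penalty = min(wrong_fixes * 0.05, base)

    # The bonus applies only once the episode is over and scales with unused steps.
    efficiency = 0.0
    if done and max_steps > 0:
        efficiency = max(0, max_steps - steps_used) / max_steps * 0.05

    return max(0.0, min(1.0, base - penalty + efficiency))
```

For example, fixing 3 of 5 issues with one wrong fix and submitting after 10 of 15 steps gives 0.6 - 0.05 + (5/15) × 0.05 ≈ 0.567.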
+
+ ---
+
+ ## Setup & Usage
+
+ ### Prerequisites
+ - Python 3.10+
+ - Docker (for containerised deployment)
+
+ ### Install
+
+ ```bash
+ pip install -r requirements.txt
+ pip install -e .
+ ```
+
+ ### Run locally
+
+ ```bash
+ # Start the server
+ uvicorn dataclean_env.server.app:app --host 0.0.0.0 --port 7860 --reload
+
+ # In another terminal, test the health endpoint
+ curl http://localhost:7860/health
+ # {"status": "healthy"}
+ ```
+
+ ### Docker
+
+ ```bash
+ # Build
+ docker build -t dataclean-env:latest .
+
+ # Run
+ docker run -d -p 7860:7860 dataclean-env:latest
+
+ # Test
+ curl http://localhost:7860/health
+ ```
+
+ ### Run inference
+
+ ```bash
+ # Set environment variables
+ export API_BASE_URL="https://router.huggingface.co/v1"
+ export MODEL_NAME="your-model-name"
+ export HF_TOKEN="your-hf-token"
+ export ENV_BASE_URL="http://localhost:7860"
+
+ # Run baseline agent
+ python inference.py
+ ```
+
+ ---
+
+ ## Baseline Scores
+
+ Scores obtained with a standard LLM agent using the inference script:
+
+ | Task | Score | Notes |
+ |------|-------|-------|
+ | Easy | ~0.70 | Most obvious issues fixed |
+ | Medium | ~0.40 | Format reasoning challenging |
+ | Hard | ~0.25 | Cross-field logic very difficult |
+ | **Average** | **~0.45** | |
+
+ *(Scores vary by model. Frontier models score higher.)*
+
+ ---
+
+ ## API Endpoints
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/health` | GET | Health check → `{"status": "healthy"}` |
+ | `/reset` | POST | Reset with `{"task_name": "easy\|medium\|hard"}` |
+ | `/step` | POST | Execute action JSON |
+ | `/state` | GET | Current episode metadata |
+ | `/ws` | WebSocket | Full session (primary OpenEnv protocol) |
+ | `/docs` | GET | OpenAPI documentation |
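
For the `/ws` endpoint, the message envelopes can be sketched as below. The shapes are read off the handler in `dataclean_env/server/app.py` in this commit (`type` is one of `reset`, `step`, `state`, `close`; replies carry `type` set to `observation`, `state`, or `error`). The helper names are hypothetical, and opening the socket itself is left to whichever WebSocket client library is in use.

```python
import json
from typing import Any, Dict


def reset_message(task_name: str = "easy") -> str:
    """Text frame that starts a new episode."""
    return json.dumps({"type": "reset", "task_name": task_name})


def step_message(action: Dict[str, Any]) -> str:
    """Text frame that submits one action; the action dict goes under 'action'."""
    return json.dumps({"type": "step", "action": action})


def read_reply(raw: str) -> Dict[str, Any]:
    """Decode a server reply, raising if the server reported an error."""
    reply = json.loads(raw)
    if reply.get("type") == "error":
        raise RuntimeError(f"{reply.get('code')}: {reply.get('message')}")
    return reply
```

Each WebSocket connection gets its own environment instance on the server, whereas the HTTP endpoints share a single one, so concurrent agents should prefer `/ws`.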
+
+ ---
+
+ ## Project Structure
+
+ ```
+ ├── inference.py              # Baseline inference script (OpenAI client)
+ ├── openenv.yaml              # OpenEnv manifest
+ ├── Dockerfile                # Container definition
+ ├── pyproject.toml            # Package metadata
+ ├── requirements.txt          # Dependencies
+ ├── README.md                 # This file
+ ├── dataclean_env/
+ │   ├── __init__.py           # Package exports
+ │   ├── models.py             # Action, Observation, State (Pydantic)
+ │   ├── client.py             # Sync HTTP client
+ │   └── server/
+ │       ├── __init__.py
+ │       ├── app.py            # FastAPI server (HTTP + WebSocket)
+ │       ├── environment.py    # Core environment logic
+ │       └── tasks.py          # Task data and ground truth
+ ```
+
+ ---
+
+ ## License
+
+ BSD 3-Clause
dataclean_env/__init__.py ADDED
@@ -0,0 +1,17 @@
+ """
+ DataClean Environment
+ =====================
+ An OpenEnv-compliant environment for training AI agents on real-world
+ data-quality and data-cleaning tasks.
+ """
+
+ from .models import DataCleanAction, DataCleanObservation, DataCleanState
+ from .client import DataCleanEnv, StepResult
+
+ __all__ = [
+     "DataCleanAction",
+     "DataCleanObservation",
+     "DataCleanState",
+     "DataCleanEnv",
+     "StepResult",
+ ]
dataclean_env/client.py ADDED
@@ -0,0 +1,136 @@
+ """
+ DataClean Environment – Client
+ ================================
+ Synchronous HTTP client for interacting with the DataClean server.
+ Works with both local Docker and remote HF Spaces deployments.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import subprocess
+ import time
+ from dataclasses import dataclass
+ from typing import Any, Dict, Optional
+
+ import requests
+
+ from .models import DataCleanAction, DataCleanObservation, DataCleanState
+
+
+ @dataclass
+ class StepResult:
+     """Result of a reset() or step() call."""
+     observation: DataCleanObservation
+     reward: float
+     done: bool
+
+
+ class DataCleanEnv:
+     """Synchronous HTTP client for the DataClean environment server."""
+
+     def __init__(self, base_url: str, timeout: float = 30.0) -> None:
+         self.base_url = base_url.rstrip("/")
+         self.timeout = timeout
+         self._session = requests.Session()
+
+     # ── factory methods ────────────────────────────────────────────────────
+
+     @classmethod
+     def from_docker_image(
+         cls,
+         image: str = "dataclean-env:latest",
+         port: int = 8000,
+         env_vars: Optional[Dict[str, str]] = None,
+         timeout: float = 60.0,
+     ) -> "DataCleanEnv":
+         """Start a Docker container and return a connected client."""
+         cmd = ["docker", "run", "-d", "-p", f"{port}:7860", "--rm"]
+         for k, v in (env_vars or {}).items():
+             cmd.extend(["-e", f"{k}={v}"])
+         cmd.append(image)
+
+         result = subprocess.run(cmd, capture_output=True, text=True, check=True)
+         container_id = result.stdout.strip()
+
+         client = cls(base_url=f"http://localhost:{port}", timeout=timeout)
+         client._container_id = container_id
+
+         # wait for server to become healthy
+         deadline = time.time() + 30
+         while time.time() < deadline:
+             try:
+                 resp = requests.get(f"http://localhost:{port}/health", timeout=3)
+                 if resp.status_code == 200:
+                     return client
+             except requests.ConnectionError:
+                 pass
+             time.sleep(0.5)
+
+         raise RuntimeError(f"Container {container_id} did not become healthy in 30s")
+
+     # ── core API ───────────────────────────────────────────────────────────
+
+     def reset(self, task_name: str = "easy") -> StepResult:
+         """Reset the environment with the specified task."""
+         resp = self._session.post(
+             f"{self.base_url}/reset",
+             json={"task_name": task_name},
+             timeout=self.timeout,
+         )
+         resp.raise_for_status()
+         return self._parse_step_result(resp.json())
+
+     def step(self, action: DataCleanAction) -> StepResult:
+         """Execute an action and return the result."""
+         resp = self._session.post(
+             f"{self.base_url}/step",
+             json=action.model_dump(),
+             timeout=self.timeout,
+         )
+         resp.raise_for_status()
+         return self._parse_step_result(resp.json())
+
+     def state(self) -> DataCleanState:
+         """Get current episode state."""
+         resp = self._session.get(
+             f"{self.base_url}/state",
+             timeout=self.timeout,
+         )
+         resp.raise_for_status()
+         return DataCleanState(**resp.json())
+
+     def health(self) -> dict:
+         """Check server health."""
+         resp = self._session.get(
+             f"{self.base_url}/health",
+             timeout=self.timeout,
+         )
+         resp.raise_for_status()
+         return resp.json()
+
+     def close(self) -> None:
+         """Clean up resources."""
+         self._session.close()
+         cid = getattr(self, "_container_id", None)
+         if cid:
+             subprocess.run(["docker", "stop", cid], capture_output=True)
+
+     # ── context manager ────────────────────────────────────────────────────
+
+     def __enter__(self) -> "DataCleanEnv":
+         return self
+
+     def __exit__(self, *exc) -> None:
+         self.close()
+
+     # ── internal ───────────────────────────────────────────────────────────
+
+     @staticmethod
+     def _parse_step_result(payload: Dict[str, Any]) -> StepResult:
+         obs_data = payload.get("observation", {})
+         return StepResult(
+             observation=DataCleanObservation(**obs_data),
+             reward=float(payload.get("reward", 0.0)),
+             done=bool(payload.get("done", False)),
+         )
dataclean_env/models.py ADDED
@@ -0,0 +1,109 @@
+ """
+ Data Clean Environment - Typed Models
+ ======================================
+ Pydantic models for actions, observations, and state.
+ """
+
+ from typing import List, Optional, Dict, Any
+ from pydantic import BaseModel, Field
+
+
+ # ---------------------------------------------------------------------------
+ # Base classes – use openenv-core when available, plain Pydantic otherwise
+ # ---------------------------------------------------------------------------
+ try:
+     from openenv.core.env_server.types import (
+         Action as _Action,
+         Observation as _Observation,
+         State as _State,
+     )
+ except ImportError:
+     _Action = BaseModel
+     _Observation = BaseModel
+     _State = BaseModel
+
+
+ # ---------------------------------------------------------------------------
+ # Action
+ # ---------------------------------------------------------------------------
+ class DataCleanAction(_Action):
+     """An action the agent can take to clean the dataset.
+
+     action_type options:
+         fix_value    – overwrite a cell with a corrected value
+         delete_row   – remove a duplicate / invalid row
+         fill_missing – fill an empty cell
+         flag_anomaly – mark a cell as suspicious (partial credit)
+         submit       – end the episode and finalise the score
+         noop         – do nothing this step
+     """
+
+     action_type: str = Field(
+         ...,
+         description="One of: fix_value, delete_row, fill_missing, flag_anomaly, submit, noop",
+     )
+     row_index: Optional[int] = Field(
+         None, description="0-based row index to act on"
+     )
+     column_name: Optional[str] = Field(
+         None, description="Column name to act on"
+     )
+     new_value: Optional[str] = Field(
+         None, description="Replacement value (for fix_value / fill_missing)"
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Observation
+ # ---------------------------------------------------------------------------
+ class DataCleanObservation(_Observation):
+     """What the agent sees after each step."""
+
+     task_name: str = Field(..., description="Current task identifier")
+     task_description: str = Field(..., description="Human-readable task goal")
+     difficulty: str = Field(..., description="easy / medium / hard")
+     data_preview: str = Field(
+         ..., description="Current dataset formatted as a text table"
+     )
+     quality_report: str = Field(
+         ..., description="Summary of detected data-quality issues"
+     )
+     columns_info: List[Dict[str, Any]] = Field(
+         default_factory=list,
+         description="Per-column metadata: name, dtype, nulls, unique count",
+     )
+     action_history: List[str] = Field(
+         default_factory=list, description="Log of previous actions and outcomes"
+     )
+     step_number: int = Field(0, description="Current step (1-based)")
+     max_steps: int = Field(0, description="Total step budget for the episode")
+     current_score: float = Field(
+         0.0, description="Running score 0.0-1.0"
+     )
+     available_actions: List[str] = Field(
+         default_factory=lambda: [
+             "fix_value",
+             "delete_row",
+             "fill_missing",
+             "flag_anomaly",
+             "submit",
+             "noop",
+         ]
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # State (episode metadata)
+ # ---------------------------------------------------------------------------
+ class DataCleanState(_State):
+     """Episode-level metadata returned by state()."""
+
+     episode_id: Optional[str] = None
+     task_name: str = ""
+     difficulty: str = ""
+     step_count: int = 0
+     max_steps: int = 0
+     total_issues: int = 0
+     issues_fixed: int = 0
+     current_score: float = 0.0
+     done: bool = False
dataclean_env/server/__init__.py ADDED
@@ -0,0 +1 @@
+ """DataClean environment – server package."""
dataclean_env/server/app.py ADDED
@@ -0,0 +1,158 @@
+ """
+ FastAPI server for the DataClean environment.
+ =============================================
+ Exposes HTTP + WebSocket endpoints following the OpenEnv spec.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import logging
+ import os
+ from contextlib import asynccontextmanager
+ from typing import Dict
+
+ from fastapi import FastAPI, WebSocket, WebSocketDisconnect
+ from fastapi.responses import JSONResponse
+ from pydantic import BaseModel, Field
+
+ from ..models import DataCleanAction
+ from .environment import DataCleanEnvironment
+
+ logger = logging.getLogger("dataclean_env")
+ logging.basicConfig(level=logging.INFO)
+
+ MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT_ENVS", "100"))
+
+
+ # ---------------------------------------------------------------------------
+ # Request / response helpers
+ # ---------------------------------------------------------------------------
+ class ResetRequest(BaseModel):
+     task_name: str = Field("easy", description="easy | medium | hard")
+
+
+ class StepRequest(BaseModel):
+     action_type: str
+     row_index: int | None = None
+     column_name: str | None = None
+     new_value: str | None = None
+
+
+ # ---------------------------------------------------------------------------
+ # Application
+ # ---------------------------------------------------------------------------
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     logger.info("DataClean environment server starting")
+     yield
+     logger.info("DataClean environment server shutting down")
+
+
+ app = FastAPI(
+     title="DataClean Environment",
+     description="OpenEnv-compliant data-quality cleaning environment",
+     version="1.0.0",
+     lifespan=lifespan,
+ )
+
+
+ # Shared environment for HTTP (stateless-ish, one per worker)
+ _http_env = DataCleanEnvironment()
+
+ # Per-WebSocket session environments
+ _ws_sessions: Dict[int, DataCleanEnvironment] = {}
+
+
+ # ---------------------------------------------------------------------------
+ # HTTP endpoints
+ # ---------------------------------------------------------------------------
+ @app.get("/health")
+ async def health():
+     return {"status": "healthy"}
+
+
+ @app.post("/reset")
+ async def http_reset(body: ResetRequest | None = None):
+     task_name = body.task_name if body else "easy"
+     result = _http_env.reset(task_name)
+     return JSONResponse(content=result)
+
+
+ @app.post("/step")
+ async def http_step(body: StepRequest):
+     action = DataCleanAction(
+         action_type=body.action_type,
+         row_index=body.row_index,
+         column_name=body.column_name,
+         new_value=body.new_value,
+     )
+     result = _http_env.step(action)
+     return JSONResponse(content=result)
+
+
+ @app.get("/state")
+ async def http_state():
+     return JSONResponse(content=_http_env.state)
+
+
+ # ---------------------------------------------------------------------------
+ # WebSocket endpoint (primary protocol for OpenEnv)
+ # ---------------------------------------------------------------------------
+ @app.websocket("/ws")
+ async def websocket_endpoint(websocket: WebSocket):
+     if len(_ws_sessions) >= MAX_CONCURRENT:
+         await websocket.close(code=1013, reason="Server at capacity")
+         return
+
+     await websocket.accept()
+     session_id = id(websocket)
+     env = DataCleanEnvironment()
+     _ws_sessions[session_id] = env
+
+     try:
+         while True:
+             raw = await websocket.receive_text()
+             try:
+                 data = json.loads(raw)
+             except json.JSONDecodeError:
+                 await websocket.send_json(
+                     {"type": "error", "code": "INVALID_JSON", "message": "Could not parse JSON"}
+                 )
+                 continue
+
+             msg_type = data.get("type", "")
+
+             if msg_type == "reset":
+                 task_name = data.get("task_name", "easy")
+                 result = env.reset(task_name)
+                 await websocket.send_json({"type": "observation", **result})
+
+             elif msg_type == "step":
+                 action_data = data.get("action", data)
+                 action = DataCleanAction(
+                     action_type=action_data.get("action_type", "noop"),
+                     row_index=action_data.get("row_index"),
+                     column_name=action_data.get("column_name"),
+                     new_value=action_data.get("new_value"),
+                 )
+                 result = env.step(action)
+                 await websocket.send_json({"type": "observation", **result})
+
+             elif msg_type == "state":
+                 await websocket.send_json({"type": "state", **env.state})
+
+             elif msg_type == "close":
+                 break
+
+             else:
+                 await websocket.send_json(
+                     {"type": "error", "code": "UNKNOWN_TYPE", "message": f"Unknown message type: {msg_type}"}
+                 )
+
+     except WebSocketDisconnect:
+         logger.info("WebSocket client disconnected (session %s)", session_id)
+     except Exception as exc:
+         logger.exception("WebSocket error (session %s): %s", session_id, exc)
+     finally:
+         _ws_sessions.pop(session_id, None)
dataclean_env/server/environment.py ADDED
@@ -0,0 +1,444 @@
+ """
+ DataClean Environment – core simulation logic.
+ ===============================================
+ Implements reset(), step(), state for the data-cleaning agent.
+ """
+
+ from __future__ import annotations
+
+ import copy
+ import uuid
+ from typing import Any, Dict, List, Optional, Tuple
+
+ from ..models import DataCleanAction, DataCleanObservation, DataCleanState
+ from .tasks import get_task, Row
+
+
+ class DataCleanEnvironment:
+     """Simulates a data-quality review session."""
+
+     SUPPORTS_CONCURRENT_SESSIONS = True
+
+     # ── lifecycle ──────────────────────────────────────────────────────────
+
+     def __init__(self) -> None:
+         self._task: dict = {}
+         self._data: List[Row] = []
+         self._clean: List[Row] = []
+         self._issues: list = []
+         self._columns: List[str] = []
+         self._max_steps: int = 0
+         self._step_count: int = 0
+         self._done: bool = True
+         self._episode_id: str = ""
+         self._action_log: List[str] = []
+         self._deleted_rows: set = set()
+         self._fixed_issues: set = set()
+         self._wrong_fixes: int = 0
+
+     # ── reset ──────────────────────────────────────────────────────────────
+
+     def reset(self, task_name: str = "easy") -> dict:
+         """Start a fresh episode for the given task."""
+         self._task = get_task(task_name)
+         self._data = copy.deepcopy(self._task["dirty_data"])
+         self._clean = self._task["clean_data"]
+         self._issues = self._task["issues"]
+         self._columns = self._task["columns"]
+         self._max_steps = self._task["max_steps"]
+         self._step_count = 0
+         self._done = False
+         self._episode_id = uuid.uuid4().hex[:12]
+         self._action_log = []
+         self._deleted_rows = set()
+         self._fixed_issues = set()
+         self._wrong_fixes = 0
+
+         obs = self._build_observation()
+         return {
+             "observation": obs.model_dump(),
+             "reward": 0.0,
+             "done": False,
+         }
+
+     # ── step ───────────────────────────────────────────────────────────────
+
+     def step(self, action: DataCleanAction) -> dict:
+         if self._done:
+             obs = self._build_observation()
+             return {
+                 "observation": obs.model_dump(),
+                 "reward": self._compute_score(),
+                 "done": True,
+             }
+
+         self._step_count += 1
+         msg = self._apply_action(action)
+         self._action_log.append(f"Step {self._step_count}: {action.action_type} -> {msg}")
+
+         # episode ends on submit, max steps, or all issues fixed
+         if (
+             action.action_type == "submit"
+             or self._step_count >= self._max_steps
+             or len(self._fixed_issues) == len(self._issues)
+         ):
+             self._done = True
+
+         score = self._compute_score()
+         obs = self._build_observation()
+         return {
+             "observation": obs.model_dump(),
+             "reward": round(score, 4),
+             "done": self._done,
+         }
+
+     # ── state ──────────────────────────────────────────────────────────────
+
+     @property
+     def state(self) -> dict:
+         return DataCleanState(
+             episode_id=self._episode_id,
+             task_name=self._task.get("name", ""),
+             difficulty=self._task.get("difficulty", ""),
+             step_count=self._step_count,
+             max_steps=self._max_steps,
+             total_issues=len(self._issues),
+             issues_fixed=len(self._fixed_issues),
+             current_score=round(self._compute_score(), 4),
+             done=self._done,
+         ).model_dump()
+
+     # ── internal: apply actions ────────────────────────────────────────────
+
+     def _apply_action(self, action: DataCleanAction) -> str:
+         at = action.action_type
+
+         if at == "noop":
+             return "No action taken."
+
+         if at == "submit":
+             return "Submitted for grading."
+
+         if at in ("fix_value", "fill_missing"):
+             return self._do_fix(action)
+
+         if at == "delete_row":
+             return self._do_delete(action)
+
+         if at == "flag_anomaly":
+             return self._do_flag(action)
+
+         return f"Unknown action_type '{at}'. No effect."
+
+     def _do_fix(self, action: DataCleanAction) -> str:
+         ri = action.row_index
+         col = action.column_name
+         val = action.new_value
+
+         if ri is None or col is None or val is None:
+             return "fix_value requires row_index, column_name, and new_value."
+
+         if ri < 0 or ri >= len(self._data):
+             return f"row_index {ri} out of range (0-{len(self._data)-1})."
+
+         if ri in self._deleted_rows:
+             return f"Row {ri} was already deleted."
+
+         if col not in self._columns:
+             return f"Unknown column '{col}'. Valid: {self._columns}"
+
+         # apply the edit
+         old_val = str(self._data[ri].get(col, ""))
+         self._data[ri][col] = self._coerce(val, self._data[ri][col])
+
+         # check whether this fixes a known issue
+         matched = self._match_fix(ri, col, val)
+         if matched is not None:
+             self._fixed_issues.add(matched)
+             return f"Fixed row {ri} [{col}]: '{old_val}' -> '{val}' (issue resolved)"
+         else:
+             # check if the edit made things worse
+             if old_val == str(self._ground_truth_value(ri, col)):
+                 self._wrong_fixes += 1
+                 return f"Changed row {ri} [{col}]: '{old_val}' -> '{val}' (WARNING: was already correct!)"
+             return f"Changed row {ri} [{col}]: '{old_val}' -> '{val}'"
+
+     def _do_delete(self, action: DataCleanAction) -> str:
+         ri = action.row_index
+         if ri is None:
+             return "delete_row requires row_index."
+         if ri < 0 or ri >= len(self._data):
+             return f"row_index {ri} out of range."
+         if ri in self._deleted_rows:
+             return f"Row {ri} already deleted."
+
+         self._deleted_rows.add(ri)
+         matched = self._match_delete(ri)
+         if matched is not None:
+             self._fixed_issues.add(matched)
+             return f"Deleted row {ri} (duplicate removed)"
+         else:
+             self._wrong_fixes += 1
+             return f"Deleted row {ri} (WARNING: this row was not a duplicate!)"
+
+     def _do_flag(self, action: DataCleanAction) -> str:
+         ri = action.row_index
+         col = action.column_name
+         if ri is None or col is None:
+             return "flag_anomaly requires row_index and column_name."
+
+         # flagging the right cell marks the issue as resolved; as written this
+         # grants the same credit as a fix (no 0.5 weighting is applied)
+         for idx, issue in enumerate(self._issues):
+             if issue["row"] == ri and issue.get("col") == col and idx not in self._fixed_issues:
+                 self._fixed_issues.add(idx)
+                 return f"Flagged row {ri} [{col}] as anomalous (partial credit)"
+         return f"Flagged row {ri} [{col}] — no matching issue found."
+
+     # ── grading helpers ────────────────────────────────────────────────────
+
+     def _match_fix(self, row: int, col: str, val: str) -> Optional[int]:
+         """Return issue index if this fix resolves a known issue, else None."""
+         for idx, issue in enumerate(self._issues):
+             if idx in self._fixed_issues:
+                 continue
+             if issue["row"] == row and issue.get("col") == col:
+                 expected = str(issue["fix"])
+                 if self._fuzzy_eq(val, expected):
+                     return idx
+         return None
+
+     def _match_delete(self, row: int) -> Optional[int]:
+         for idx, issue in enumerate(self._issues):
+             if idx in self._fixed_issues:
+                 continue
+             if issue["row"] == row and issue["fix"] == "__DELETE__":
+                 return idx
+         return None
+
+     def _compute_score(self) -> float:
+         if not self._issues:
+             return 1.0
+         total = len(self._issues)
+         fixed = len(self._fixed_issues)
+
+         # base score from fixed issues
+         base = fixed / total
+
+         # penalty for wrong fixes (capped so score stays >= 0)
+         penalty = min(self._wrong_fixes * 0.05, base)
+
+         # small efficiency bonus if done early
+         if self._done and self._max_steps > 0:
+             remaining_ratio = max(0, (self._max_steps - self._step_count)) / self._max_steps
+             efficiency = remaining_ratio * 0.05
+         else:
+             efficiency = 0.0
+
+         score = base - penalty + efficiency
+         return max(0.0, min(1.0, score))
+
+     def _ground_truth_value(self, dirty_row_idx: int, col: str) -> Any:
+         """Look up the expected clean value for a dirty-data row."""
+         # map dirty index to clean index (accounting for deleted rows in ground truth)
+         clean_idx = self._dirty_to_clean_idx(dirty_row_idx)
+         if clean_idx is not None and clean_idx < len(self._clean):
+             return self._clean[clean_idx].get(col)
+         return None
+
+     def _dirty_to_clean_idx(self, dirty_idx: int) -> Optional[int]:
+         """Map a dirty-data row index to the clean-data row index."""
+         # find rows that should be deleted
+         delete_rows = {
+             issue["row"]
+             for issue in self._issues
+             if issue["fix"] == "__DELETE__"
+         }
+         # count non-deleted rows before dirty_idx
+         if dirty_idx in delete_rows:
+             return None
+         clean_i = 0
+         for i in range(dirty_idx):
+             if i not in delete_rows:
+                 clean_i += 1
+         return clean_i
+
+     @staticmethod
+     def _fuzzy_eq(a: str, b: str) -> bool:
+         """Lenient comparison for grading (strip, lower-case, numeric match within 0.01)."""
+         a = str(a).strip().lower()
+         b = str(b).strip().lower()
+         if a == b:
+             return True
+         # numeric comparison (also absorbs leading zeros and trailing ".0")
+         try:
+             return abs(float(a) - float(b)) < 0.01
+         except (ValueError, TypeError):
+             pass
+         return False
+
+     @staticmethod
+     def _coerce(val_str: str, existing: Any) -> Any:
+         """Try to coerce the string value to the same type as the existing cell."""
+         if isinstance(existing, int):
+             try:
+                 return int(float(val_str))
+             except (ValueError, TypeError):
+                 return val_str
+         if isinstance(existing, float):
+             try:
+                 return float(val_str)
+             except (ValueError, TypeError):
+                 return val_str
+         return val_str
+
294
+ # ── observation builder ────────────────────────────────────────────────
295
+
296
+ def _build_observation(self) -> DataCleanObservation:
297
+ return DataCleanObservation(
298
+ task_name=self._task.get("name", ""),
299
+ task_description=self._task.get("description", ""),
300
+ difficulty=self._task.get("difficulty", ""),
301
+ data_preview=self._render_table(),
302
+ quality_report=self._render_quality_report(),
303
+ columns_info=self._render_columns_info(),
304
+ action_history=list(self._action_log[-10:]),
305
+ step_number=self._step_count,
306
+ max_steps=self._max_steps,
307
+ current_score=round(self._compute_score(), 4),
308
+ )
309
+
310
+ def _render_table(self) -> str:
311
+ """Render the current dataset as an aligned text table."""
312
+ if not self._data:
313
+ return "(empty dataset)"
314
+
315
+ cols = self._columns
316
+ # compute column widths
317
+ widths = {c: len(c) for c in cols}
318
+ widths["row"] = 3
319
+
320
+ active_rows: List[Tuple[int, Row]] = [
321
+ (i, row) for i, row in enumerate(self._data) if i not in self._deleted_rows
322
+ ]
323
+
324
+ for i, row in active_rows:
325
+ widths["row"] = max(widths["row"], len(str(i)))
326
+ for c in cols:
327
+ val = str(row.get(c, ""))
328
+ if val == "":
329
+ val = "[EMPTY]"
330
+ widths[c] = max(widths[c], min(len(val), 30))
331
+
332
+ # header
333
+ hdr = "| " + " | ".join(
334
+ ["row".ljust(widths["row"])] + [c.ljust(widths[c]) for c in cols]
335
+ ) + " |"
336
+ sep = "|-" + "-|-".join(
337
+ ["-" * widths["row"]] + ["-" * widths[c] for c in cols]
338
+ ) + "-|"
339
+
340
+ lines = [hdr, sep]
341
+ for i, row in active_rows:
342
+ cells = [str(i).ljust(widths["row"])]
343
+ for c in cols:
344
+ val = str(row.get(c, ""))
345
+ if val == "":
346
+ val = "[EMPTY]"
347
+ cells.append(val[:30].ljust(widths[c]))
348
+ lines.append("| " + " | ".join(cells) + " |")
349
+
350
+ return "\n".join(lines)
351
+
352
+ def _render_quality_report(self) -> str:
353
+ """Generate a quality report that hints at (but does not solve) issues."""
354
+ if not self._data:
355
+ return "No data loaded."
356
+
357
+ lines = ["DATA QUALITY REPORT", "=" * 40]
358
+ cols = self._columns
359
+ active_rows = [
360
+ (i, row) for i, row in enumerate(self._data) if i not in self._deleted_rows
361
+ ]
362
+ num_rows = len(active_rows)
363
+ lines.append(f"Total rows: {num_rows} (original: {len(self._data)}, deleted: {len(self._deleted_rows)})")
364
+
365
+ # per-column stats
366
+ for c in cols:
367
+ vals = [str(row.get(c, "")) for _, row in active_rows]
368
+ empties = sum(1 for v in vals if v.strip() == "" or v.strip().upper() == "NULL")
369
+ unique = len(set(vals))
370
+ if empties:
371
+ lines.append(f" Column '{c}': {empties} empty/null value(s)")
372
+
373
+ # detect potential duplicates (simple exact-match check)
374
+ seen = {}
375
+ for i, row in active_rows:
376
+ key = tuple(str(row.get(c, "")) for c in cols)
377
+ if key in seen:
378
+ lines.append(f" Possible duplicate: row {i} matches row {seen[key]}")
379
+ else:
380
+ seen[key] = i
381
+
382
+ # detect numeric anomalies
383
+ for c in cols:
384
+ numeric_vals = []
385
+ for i, row in active_rows:
386
+ try:
387
+ numeric_vals.append((i, float(row[c])))
388
+ except (ValueError, TypeError, KeyError):
389
+ pass
390
+ if len(numeric_vals) >= 3:
391
+ values = [v for _, v in numeric_vals]
392
+ mean = sum(values) / len(values)
393
+ for i, v in numeric_vals:
394
+ if v < 0:
395
+ lines.append(f" Row {i}, '{c}': Negative value ({v})")
396
+ elif abs(v - mean) > 3 * (max(values) - min(values) + 1) / 4:
397
+ lines.append(f" Row {i}, '{c}': Potential outlier ({v})")
398
+
399
+ # detect format inconsistencies in string columns
400
+ for c in cols:
401
+ vals = [str(row.get(c, "")) for _, row in active_rows]
402
+ non_empty = [v for v in vals if v.strip() and v.strip() != "[EMPTY]"]
403
+ if not non_empty:
404
+ continue
405
+ # check for mixed case patterns (all-caps vs lowercase)
406
+ has_upper = any(v.isupper() for v in non_empty)
407
+ has_lower = any(v.islower() or (not v.isupper() and not v.istitle()) for v in non_empty)
408
+ if has_upper and has_lower and c in ("email",):
409
+ lines.append(f" Column '{c}': Mixed case formatting detected")
410
+
411
+ # check for format inconsistency in date-like columns
412
+ if c in ("date", "start_date", "birth_date"):
413
+ formats_seen = set()
414
+ for v in non_empty:
415
+ if "/" in v:
416
+ formats_seen.add("slash")
417
+ elif "." in v and v.count(".") == 2:
418
+ formats_seen.add("dot")
419
+ elif "-" in v:
420
+ formats_seen.add("dash")
421
+ if len(formats_seen) > 1:
422
+ lines.append(f" Column '{c}': Inconsistent date formats ({', '.join(sorted(formats_seen))})")
423
+
424
+ lines.append(f"\nProgress: {len(self._fixed_issues)}/{len(self._issues)} issues resolved")
425
+ lines.append(f"Steps used: {self._step_count}/{self._max_steps}")
426
+
427
+ return "\n".join(lines)
428
+
429
+ def _render_columns_info(self) -> List[Dict[str, Any]]:
430
+ active_rows = [
431
+ row for i, row in enumerate(self._data) if i not in self._deleted_rows
432
+ ]
433
+ info = []
434
+ for c in self._columns:
435
+ vals = [row.get(c, "") for row in active_rows]
436
+ non_empty = [v for v in vals if str(v).strip() not in ("", "NULL")]
437
+ info.append({
438
+ "name": c,
439
+ "total": len(vals),
440
+ "non_empty": len(non_empty),
441
+ "empty": len(vals) - len(non_empty),
442
+ "unique": len(set(str(v) for v in vals)),
443
+ })
444
+ return info
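The grading helpers at the top of this hunk (`_fuzzy_eq` and `_coerce`) can be tried out in isolation. The following is a minimal sketch, assuming the same logic is lifted out of the environment class into the hypothetical free functions `fuzzy_eq` and `coerce`; these names are illustrative only and are not part of the environment's API.

```python
from typing import Any


def fuzzy_eq(a: Any, b: Any) -> bool:
    """Hypothetical standalone version of the lenient comparison."""
    a = str(a).strip().lower()
    b = str(b).strip().lower()
    if a == b:
        return True
    try:
        # Numeric strings match when they differ by less than 0.01,
        # so "007" and "7" compare equal.
        return abs(float(a) - float(b)) < 0.01
    except (ValueError, TypeError):
        # At least one side is not numeric and the strings differ.
        return False


def coerce(val_str: str, existing: Any) -> Any:
    """Hypothetical standalone version of the cell-type coercion."""
    if isinstance(existing, int):
        try:
            return int(float(val_str))
        except (ValueError, TypeError):
            return val_str
    if isinstance(existing, float):
        try:
            return float(val_str)
        except (ValueError, TypeError):
            return val_str
    return val_str


if __name__ == "__main__":
    print(fuzzy_eq(" Shipped ", "shipped"))     # True
    print(fuzzy_eq("007", "7"))                 # True
    print(fuzzy_eq("1234.56", "$1,234.56"))     # False
    print(coerce("33", 35))                     # 33
    print(coerce("abc", 35))                    # abc
```

Note that the comparison does not strip currency symbols or thousands separators, which is why a price such as `$1,234.56` has to be rewritten as a plain number before it matches the ground truth.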
dataclean_env/server/tasks.py ADDED
@@ -0,0 +1,239 @@
1
+ """
2
+ Task Definitions – Realistic datasets with known data-quality issues.
3
+ =====================================================================
4
+ Each task provides:
5
+ dirty_data – the messy rows the agent starts with
6
+ clean_data – ground-truth rows (used by the grader)
7
+ issues – list describing every problem to fix
8
+ max_steps – action budget
9
+ description – human-readable goal
10
+ """
11
+
12
+ from __future__ import annotations
13
+ from typing import Any, Dict, List
14
+ import copy
15
+
16
+ # ── helpers ────────────────────────────────────────────────────────────────
17
+
18
+ IssueDict = Dict[str, Any]
19
+ Row = Dict[str, Any]
20
+
21
+ # ── TASK 1 — EASY: Customer Contact Cleanup ───────────────────────────────
22
+
23
+ _EASY_DIRTY: List[Row] = [
24
+ {"id": 1, "name": "John Smith", "email": "john.smith@gmail.com", "phone": "555-0101", "age": 35, "city": "New York"},
25
+ {"id": 2, "name": "Jane Doe", "email": "", "phone": "555-0102", "age": 28, "city": "Los Angeles"},
26
+ {"id": 3, "name": "Bob Wilson", "email": "bob.w@yahoo.com", "phone": "555-ABCD", "age": 42, "city": "Chicago"},
27
+ {"id": 4, "name": "John Smith", "email": "john.smith@gmail.com", "phone": "555-0101", "age": 35, "city": "New York"},
28
+ {"id": 5, "name": "Alice Brown", "email": "alice.b@hotmail.com", "phone": "555-0105", "age": -3, "city": "Houston"},
29
+ {"id": 6, "name": "Charlie Davis", "email": "charlie.d@gmail.com", "phone": "555-0106", "age": 31, "city": "Phoenix"},
30
+ {"id": 7, "name": "Eva Martinez", "email": "eva.m@outlook.com", "phone": "555-0107", "age": 27, "city": "Philadelphia"},
31
+ {"id": 8, "name": "Frank Lee", "email": "frank@gmail", "phone": "555-0108", "age": 45, "city": "San Antonio"},
32
+ {"id": 9, "name": "Grace Kim", "email": "grace.k@yahoo.com", "phone": "555-0109", "age": 38, "city": "San Diego"},
33
+ {"id": 10, "name": "Henry Nguyen", "email": "henry.n@gmail.com", "phone": "555-0110", "age": 52, "city": "Dallas"},
34
+ ]
35
+
36
+ _EASY_CLEAN: List[Row] = [
37
+ {"id": 1, "name": "John Smith", "email": "john.smith@gmail.com", "phone": "555-0101", "age": 35, "city": "New York"},
38
+ {"id": 2, "name": "Jane Doe", "email": "jane.doe@email.com", "phone": "555-0102", "age": 28, "city": "Los Angeles"},
39
+ {"id": 3, "name": "Bob Wilson", "email": "bob.w@yahoo.com", "phone": "555-0103", "age": 42, "city": "Chicago"},
40
+ # row 3 (id 4, exact duplicate of row 0) deleted
41
+ {"id": 5, "name": "Alice Brown", "email": "alice.b@hotmail.com", "phone": "555-0105", "age": 33, "city": "Houston"},
42
+ {"id": 6, "name": "Charlie Davis", "email": "charlie.d@gmail.com", "phone": "555-0106", "age": 31, "city": "Phoenix"},
43
+ {"id": 7, "name": "Eva Martinez", "email": "eva.m@outlook.com", "phone": "555-0107", "age": 27, "city": "Philadelphia"},
44
+ {"id": 8, "name": "Frank Lee", "email": "frank@gmail.com", "phone": "555-0108", "age": 45, "city": "San Antonio"},
45
+ {"id": 9, "name": "Grace Kim", "email": "grace.k@yahoo.com", "phone": "555-0109", "age": 38, "city": "San Diego"},
46
+ {"id": 10, "name": "Henry Nguyen", "email": "henry.n@gmail.com", "phone": "555-0110", "age": 52, "city": "Dallas"},
47
+ ]
48
+
49
+ _EASY_ISSUES: List[IssueDict] = [
50
+ {"row": 1, "col": "email", "type": "missing_value", "desc": "Missing email address", "fix": "jane.doe@email.com"},
51
+ {"row": 2, "col": "phone", "type": "invalid_format", "desc": "Phone contains letters (555-ABCD)", "fix": "555-0103"},
52
+ {"row": 3, "col": None, "type": "duplicate_row", "desc": "Exact duplicate of row 0", "fix": "__DELETE__"},
53
+ {"row": 4, "col": "age", "type": "invalid_value", "desc": "Negative age (-3)", "fix": "33"},
54
+ {"row": 7, "col": "email", "type": "invalid_format", "desc": "Email missing TLD (frank@gmail)", "fix": "frank@gmail.com"},
55
+ ]
56
+
57
+
58
+ # ── TASK 2 — MEDIUM: E-commerce Order Normalisation ──────────────────────
59
+
60
+ _MED_DIRTY: List[Row] = [
61
+ {"order_id": "ORD-001", "customer": "Acme Corp", "product": "P100", "quantity": 10, "price": "249.99", "date": "2024-01-15", "status": "delivered"},
62
+ {"order_id": "ORD-002", "customer": "Globex Inc", "product": "P102", "quantity": 5, "price": "599.00", "date": "2024-01-18", "status": "delivered"},
63
+ {"order_id": "ORD-003", "customer": "Initech LLC", "product": "P100", "quantity": 3, "price": "249.99", "date": "15/02/2024", "status": "shipped"},
64
+ {"order_id": "ORD-004", "customer": "Umbrella Co", "product": "P105", "quantity": 8, "price": "149.50", "date": "2024-02-20", "status": "delivered"},
65
+ {"order_id": "ORD-005", "customer": "Stark Ind", "product": "P-102", "quantity": 12, "price": "599.00", "date": "2024-03-01", "status": "shipped"},
66
+ {"order_id": "ORD-006", "customer": "Wayne Ent", "product": "P108", "quantity": -2, "price": "$1,234.56", "date": "2024-03-05", "status": "processing"},
67
+ {"order_id": "ORD-007", "customer": "Oscorp", "product": "P100", "quantity": 7, "price": "249.99", "date": "2024-03-10", "status": "delivered"},
68
+ {"order_id": "ORD-008", "customer": "Cyberdyne Sys", "product": "P110", "quantity": 1, "price": "899.00", "date": "2024.03.15", "status": "delivered"},
69
+ {"order_id": "ORD-009", "customer": "Soylent Corp", "product": "P105", "quantity": 4, "price": "149.50", "date": "2024-03-20", "status": "shiped"},
70
+ {"order_id": "ORD-010", "customer": "Globex Inc", "product": "P102", "quantity": 5, "price": "599.00", "date": "2024-01-18", "status": "delivered"},
71
+ {"order_id": "ORD-011", "customer": "Tyrell Corp", "product": "P112", "quantity": 6, "price": "", "date": "2024-04-01", "status": "processing"},
72
+ {"order_id": "ORD-012", "customer": "Wonka Ind", "product": "P100", "quantity": 20, "price": "249.99", "date": "01-05-2024", "status": "shipped"},
73
+ {"order_id": "ORD-013", "customer": "Prestige World", "product": "P-105", "quantity": 9, "price": "149.50", "date": "2024-05-10", "status": "delivered"},
74
+ {"order_id": "ORD-014", "customer": "Massive Dyn", "product": "P108", "quantity": 3, "price": "1234.56", "date": "2024-05-15", "status": "delivered"},
75
+ {"order_id": "ORD-015", "customer": "Aperture Sci", "product": "P115", "quantity": 15, "price": "75.00", "date": "2024-06-01", "status": "shipped"},
76
+ ]
77
+
78
+ _MED_CLEAN: List[Row] = [
79
+ {"order_id": "ORD-001", "customer": "Acme Corp", "product": "P100", "quantity": 10, "price": "249.99", "date": "2024-01-15", "status": "delivered"},
80
+ {"order_id": "ORD-002", "customer": "Globex Inc", "product": "P102", "quantity": 5, "price": "599.00", "date": "2024-01-18", "status": "delivered"},
81
+ {"order_id": "ORD-003", "customer": "Initech LLC", "product": "P100", "quantity": 3, "price": "249.99", "date": "2024-02-15", "status": "shipped"},
82
+ {"order_id": "ORD-004", "customer": "Umbrella Co", "product": "P105", "quantity": 8, "price": "149.50", "date": "2024-02-20", "status": "delivered"},
83
+ {"order_id": "ORD-005", "customer": "Stark Ind", "product": "P102", "quantity": 12, "price": "599.00", "date": "2024-03-01", "status": "shipped"},
84
+ {"order_id": "ORD-006", "customer": "Wayne Ent", "product": "P108", "quantity": 2, "price": "1234.56", "date": "2024-03-05", "status": "processing"},
85
+ {"order_id": "ORD-007", "customer": "Oscorp", "product": "P100", "quantity": 7, "price": "249.99", "date": "2024-03-10", "status": "delivered"},
86
+ {"order_id": "ORD-008", "customer": "Cyberdyne Sys", "product": "P110", "quantity": 1, "price": "899.00", "date": "2024-03-15", "status": "delivered"},
87
+ {"order_id": "ORD-009", "customer": "Soylent Corp", "product": "P105", "quantity": 4, "price": "149.50", "date": "2024-03-20", "status": "shipped"},
88
+ # row 9 (duplicate of row 1) deleted
89
+ {"order_id": "ORD-011", "customer": "Tyrell Corp", "product": "P112", "quantity": 6, "price": "350.00", "date": "2024-04-01", "status": "processing"},
90
+ {"order_id": "ORD-012", "customer": "Wonka Ind", "product": "P100", "quantity": 20, "price": "249.99", "date": "2024-05-01", "status": "shipped"},
91
+ {"order_id": "ORD-013", "customer": "Prestige World", "product": "P105", "quantity": 9, "price": "149.50", "date": "2024-05-10", "status": "delivered"},
92
+ {"order_id": "ORD-014", "customer": "Massive Dyn", "product": "P108", "quantity": 3, "price": "1234.56", "date": "2024-05-15", "status": "delivered"},
93
+ {"order_id": "ORD-015", "customer": "Aperture Sci", "product": "P115", "quantity": 15, "price": "75.00", "date": "2024-06-01", "status": "shipped"},
94
+ ]
95
+
96
+ _MED_ISSUES: List[IssueDict] = [
97
+ {"row": 2, "col": "date", "type": "inconsistent_format", "desc": "Date in DD/MM/YYYY format instead of YYYY-MM-DD", "fix": "2024-02-15"},
98
+ {"row": 4, "col": "product", "type": "inconsistent_format", "desc": "Product code has dash (P-102 vs P102)", "fix": "P102"},
99
+ {"row": 5, "col": "quantity", "type": "invalid_value", "desc": "Negative quantity (-2)", "fix": "2"},
100
+ {"row": 5, "col": "price", "type": "inconsistent_format", "desc": "Price has $ and comma ($1,234.56)", "fix": "1234.56"},
101
+ {"row": 7, "col": "date", "type": "inconsistent_format", "desc": "Date uses dots (2024.03.15)", "fix": "2024-03-15"},
102
+ {"row": 8, "col": "status", "type": "typo", "desc": "Status misspelled: shiped -> shipped", "fix": "shipped"},
103
+ {"row": 9, "col": None, "type": "duplicate_row", "desc": "Duplicate of row 1 (same order)", "fix": "__DELETE__"},
104
+ {"row": 10, "col": "price", "type": "missing_value", "desc": "Missing price for P112 product", "fix": "350.00"},
105
+ {"row": 11, "col": "date", "type": "inconsistent_format", "desc": "Date in DD-MM-YYYY format", "fix": "2024-05-01"},
106
+ {"row": 12, "col": "product", "type": "inconsistent_format", "desc": "Product code has dash (P-105 vs P105)", "fix": "P105"},
107
+ ]
108
+
109
+
110
+ # ── TASK 3 — HARD: Employee Records Audit ─────────────────────────────────
111
+
112
+ _HARD_DIRTY: List[Row] = [
113
+ {"emp_id": "E001", "name": "Sarah Johnson", "email": "sarah.j@company.com", "birth_date": "1985-06-12", "age": 39, "department": "Engineering", "dept_code": "ENG", "role": "Senior Engineer", "salary": 125000, "start_date": "2015-03-01", "manager_id": "E010"},
114
+ {"emp_id": "E002", "name": "Michael Chen", "email": "michael.c@company.com", "birth_date": "1990-03-15", "age": 28, "department": "Engineering", "dept_code": "ENG", "role": "Junior Developer", "salary": 72000, "start_date": "2022-07-15", "manager_id": "E001"},
115
+ {"emp_id": "E003", "name": "Emily Watson", "email": "emily.w@company.com", "birth_date": "1988-11-22", "age": 36, "department": "Marketing", "dept_code": "MKT", "role": "Marketing Manager", "salary": 98000, "start_date": "2018-01-10", "manager_id": "E010"},
116
+ {"emp_id": "E004", "name": "David Park", "email": "david.p@company.com", "birth_date": "1992-07-04", "age": 32, "department": "Engineering", "dept_code": "MKT", "role": "Software Engineer", "salary": 105000, "start_date": "2020-09-01", "manager_id": "E001"},
117
+ {"emp_id": "E005", "name": "Lisa Rodriguez", "email": "lisa.r@company.com", "birth_date": "1995-01-30", "age": 29, "department": "Sales", "dept_code": "SAL", "role": "Sales Representative", "salary": 65000, "start_date": "2023-02-14", "manager_id": "E008"},
118
+ {"emp_id": "E006", "name": "James O'Brien", "email": "james.ob@company.com", "birth_date": "1987-09-18", "age": 37, "department": "Finance", "dept_code": "FIN", "role": "Financial Analyst", "salary": 88000, "start_date": "2019-05-20", "manager_id": "E010"},
119
+ {"emp_id": "E007", "name": "James Obrien", "email": "james.ob@company.com", "birth_date": "1987-09-18", "age": 37, "department": "Finance", "dept_code": "FIN", "role": "Financial Analyst", "salary": 88000, "start_date": "2019-05-20", "manager_id": "E010"},
120
+ {"emp_id": "E008", "name": "Rachel Green", "email": "rachel.g@company.com", "birth_date": "1983-04-05", "age": 41, "department": "Sales", "dept_code": "SAL", "role": "Sales Director", "salary": 140000, "start_date": "2014-11-01", "manager_id": "E010"},
121
+ {"emp_id": "E009", "name": "Tom Anderson", "email": "tom.a@company.com", "birth_date": "1991-12-25", "age": 33, "department": "Engineering", "dept_code": "ENG", "role": "Junior Developer", "salary": 250000, "start_date": "2023-06-01", "manager_id": "E001"},
122
+ {"emp_id": "E010", "name": "Patricia Moore", "email": "patricia.m@company.com", "birth_date": "1978-02-14", "age": 46, "department": "Executive", "dept_code": "EXE", "role": "VP of Operations", "salary": 185000, "start_date": "2010-01-15", "manager_id": ""},
123
+ {"emp_id": "E011", "name": "Kevin Hall", "email": "kevin.h@company.com", "birth_date": "1993-08-07", "age": 31, "department": "Marketing", "dept_code": "MKT", "role": "Content Specialist", "salary": 62000, "start_date": "2025-08-01", "manager_id": "E003"},
124
+ {"emp_id": "E012", "name": "Amy Liu", "email": "AMY.LIU@COMPANY.COM", "birth_date": "1994-05-19", "age": 30, "department": "Engineering", "dept_code": "ENG", "role": "QA Engineer", "salary": 82000, "start_date": "2021-04-12", "manager_id": "E001"},
125
+ {"emp_id": "E013", "name": "Robert Taylor", "email": "robert.t@company.com", "birth_date": "1986-10-31", "age": 38, "department": "", "dept_code": "SAL", "role": "Account Manager", "salary": 78000, "start_date": "2020-01-06", "manager_id": "E008"},
126
+ {"emp_id": "E014", "name": "NULL", "email": "nina.s@company.com", "birth_date": "1997-03-22", "age": 27, "department": "Finance", "dept_code": "FIN", "role": "Junior Analyst", "salary": 58000, "start_date": "2024-01-08", "manager_id": "E006"},
127
+ {"emp_id": "E015", "name": "Carlos Mendez", "email": "carlos.m@company.com", "birth_date": "1989-07-16", "age": 35, "department": "Engineering", "dept_code": "ENG", "role": "DevOps Engineer", "salary": -95000, "start_date": "2019-10-01", "manager_id": "E001"},
128
+ {"emp_id": "E016", "name": "Sophie Turner", "email": "sophie.t@company.com", "birth_date": "1996-11-03", "age": 28, "department": "Marketing", "dept_code": "MKT", "role": "Social Media Mgr", "salary": 60000, "start_date": "2022-03-15", "manager_id": "E003"},
129
+ {"emp_id": "E017", "name": "Alex Rivera", "email": "alex.r@company.com", "birth_date": "1984-01-28", "age": 40, "department": "Sales", "dept_code": "SAL", "role": "Regional Manager", "salary": 110000, "start_date": "1899-01-01", "manager_id": "E008"},
130
+ {"emp_id": "E018", "name": "Diana Foster", "email": "diana.f@company.com", "birth_date": "1991-06-09", "age": 33, "department": "Finance", "dept_code": "FIN", "role": "Senior Accountant", "salary": 92000, "start_date": "2017-08-21", "manager_id": "E006"},
131
+ {"emp_id": "E019", "name": "Brandon White", "email": "brandon.w@company.com", "birth_date": "1998-04-14", "age": 26, "department": "Engineering", "dept_code": "ENG", "role": "Intern", "salary": 45000, "start_date": "2024-06-01", "manager_id": "E999"},
132
+ {"emp_id": "E020", "name": "Maria Gonzalez", "email": "maria.g@company.com", "birth_date": "1982-12-01", "age": 42, "department": "Executive", "dept_code": "EXE", "role": "CFO", "salary": 210000, "start_date": "2012-04-01", "manager_id": ""},
133
+ ]
134
+
135
+ _HARD_CLEAN: List[Row] = [
136
+ {"emp_id": "E001", "name": "Sarah Johnson", "email": "sarah.j@company.com", "birth_date": "1985-06-12", "age": 39, "department": "Engineering", "dept_code": "ENG", "role": "Senior Engineer", "salary": 125000, "start_date": "2015-03-01", "manager_id": "E010"},
137
+ {"emp_id": "E002", "name": "Michael Chen", "email": "michael.c@company.com", "birth_date": "1990-03-15", "age": 34, "department": "Engineering", "dept_code": "ENG", "role": "Junior Developer", "salary": 72000, "start_date": "2022-07-15", "manager_id": "E001"},
138
+ {"emp_id": "E003", "name": "Emily Watson", "email": "emily.w@company.com", "birth_date": "1988-11-22", "age": 36, "department": "Marketing", "dept_code": "MKT", "role": "Marketing Manager", "salary": 98000, "start_date": "2018-01-10", "manager_id": "E010"},
139
+ {"emp_id": "E004", "name": "David Park", "email": "david.p@company.com", "birth_date": "1992-07-04", "age": 32, "department": "Engineering", "dept_code": "ENG", "role": "Software Engineer", "salary": 105000, "start_date": "2020-09-01", "manager_id": "E001"},
140
+ {"emp_id": "E005", "name": "Lisa Rodriguez", "email": "lisa.r@company.com", "birth_date": "1995-01-30", "age": 29, "department": "Sales", "dept_code": "SAL", "role": "Sales Representative", "salary": 65000, "start_date": "2023-02-14", "manager_id": "E008"},
141
+ {"emp_id": "E006", "name": "James O'Brien", "email": "james.ob@company.com", "birth_date": "1987-09-18", "age": 37, "department": "Finance", "dept_code": "FIN", "role": "Financial Analyst", "salary": 88000, "start_date": "2019-05-20", "manager_id": "E010"},
142
+ # row 6 (near-duplicate of row 5) deleted
143
+ {"emp_id": "E008", "name": "Rachel Green", "email": "rachel.g@company.com", "birth_date": "1983-04-05", "age": 41, "department": "Sales", "dept_code": "SAL", "role": "Sales Director", "salary": 140000, "start_date": "2014-11-01", "manager_id": "E010"},
144
+ {"emp_id": "E009", "name": "Tom Anderson", "email": "tom.a@company.com", "birth_date": "1991-12-25", "age": 33, "department": "Engineering", "dept_code": "ENG", "role": "Junior Developer", "salary": 75000, "start_date": "2023-06-01", "manager_id": "E001"},
145
+ {"emp_id": "E010", "name": "Patricia Moore", "email": "patricia.m@company.com", "birth_date": "1978-02-14", "age": 46, "department": "Executive", "dept_code": "EXE", "role": "VP of Operations", "salary": 185000, "start_date": "2010-01-15", "manager_id": ""},
146
+ {"emp_id": "E011", "name": "Kevin Hall", "email": "kevin.h@company.com", "birth_date": "1993-08-07", "age": 31, "department": "Marketing", "dept_code": "MKT", "role": "Content Specialist", "salary": 62000, "start_date": "2024-08-01", "manager_id": "E003"},
147
+ {"emp_id": "E012", "name": "Amy Liu", "email": "amy.liu@company.com", "birth_date": "1994-05-19", "age": 30, "department": "Engineering", "dept_code": "ENG", "role": "QA Engineer", "salary": 82000, "start_date": "2021-04-12", "manager_id": "E001"},
148
+ {"emp_id": "E013", "name": "Robert Taylor", "email": "robert.t@company.com", "birth_date": "1986-10-31", "age": 38, "department": "Sales", "dept_code": "SAL", "role": "Account Manager", "salary": 78000, "start_date": "2020-01-06", "manager_id": "E008"},
149
+ {"emp_id": "E014", "name": "Nina Sharma", "email": "nina.s@company.com", "birth_date": "1997-03-22", "age": 27, "department": "Finance", "dept_code": "FIN", "role": "Junior Analyst", "salary": 58000, "start_date": "2024-01-08", "manager_id": "E006"},
150
+ {"emp_id": "E015", "name": "Carlos Mendez", "email": "carlos.m@company.com", "birth_date": "1989-07-16", "age": 35, "department": "Engineering", "dept_code": "ENG", "role": "DevOps Engineer", "salary": 95000, "start_date": "2019-10-01", "manager_id": "E001"},
151
+ {"emp_id": "E016", "name": "Sophie Turner", "email": "sophie.t@company.com", "birth_date": "1996-11-03", "age": 28, "department": "Marketing", "dept_code": "MKT", "role": "Social Media Mgr", "salary": 60000, "start_date": "2022-03-15", "manager_id": "E003"},
152
+ {"emp_id": "E017", "name": "Alex Rivera", "email": "alex.r@company.com", "birth_date": "1984-01-28", "age": 40, "department": "Sales", "dept_code": "SAL", "role": "Regional Manager", "salary": 110000, "start_date": "2016-09-01", "manager_id": "E008"},
153
+ {"emp_id": "E018", "name": "Diana Foster", "email": "diana.f@company.com", "birth_date": "1991-06-09", "age": 33, "department": "Finance", "dept_code": "FIN", "role": "Senior Accountant", "salary": 92000, "start_date": "2017-08-21", "manager_id": "E006"},
154
+ {"emp_id": "E019", "name": "Brandon White", "email": "brandon.w@company.com", "birth_date": "1998-04-14", "age": 26, "department": "Engineering", "dept_code": "ENG", "role": "Intern", "salary": 45000, "start_date": "2024-06-01", "manager_id": "E001"},
155
+ {"emp_id": "E020", "name": "Maria Gonzalez", "email": "maria.g@company.com", "birth_date": "1982-12-01", "age": 42, "department": "Executive", "dept_code": "EXE", "role": "CFO", "salary": 210000, "start_date": "2012-04-01", "manager_id": ""},
156
+ ]
157
+
158
+ _HARD_ISSUES: List[IssueDict] = [
159
+ {"row": 1, "col": "age", "type": "cross_field", "desc": "Age 28 inconsistent with birth_date 1990-03-15 (should be ~34)", "fix": "34"},
160
+ {"row": 3, "col": "dept_code", "type": "cross_field", "desc": "dept_code MKT but department is Engineering", "fix": "ENG"},
161
+ {"row": 6, "col": None, "type": "near_duplicate", "desc": "Near-duplicate of row 5 (James Obrien vs James O'Brien)", "fix": "__DELETE__"},
162
+ {"row": 8, "col": "salary", "type": "anomalous_value", "desc": "Salary $250k for Junior Developer (expected $60k-$85k)", "fix": "75000"},
163
+ {"row": 10, "col": "start_date", "type": "future_date", "desc": "Start date 2025-08-01 is in the future", "fix": "2024-08-01"},
164
+ {"row": 11, "col": "email", "type": "inconsistent_format", "desc": "Email in ALL CAPS vs lowercase convention", "fix": "amy.liu@company.com"},
165
+ {"row": 12, "col": "department", "type": "missing_value", "desc": "Department empty but dept_code is SAL", "fix": "Sales"},
166
+ {"row": 13, "col": "name", "type": "placeholder_value", "desc": "Name is literal 'NULL' string instead of real name", "fix": "Nina Sharma"},
167
+ {"row": 14, "col": "salary", "type": "invalid_value", "desc": "Negative salary (-95000)", "fix": "95000"},
168
+ {"row": 16, "col": "start_date", "type": "anomalous_value", "desc": "Start date 1899-01-01 is clearly wrong", "fix": "2016-09-01"},
169
+ {"row": 18, "col": "manager_id", "type": "referential", "desc": "manager_id E999 does not exist in employee list", "fix": "E001"},
170
+ ]
171
+
172
+
173
+ # ── public registry ────────────────────────────────────────────────────────
174
+
175
+ TASKS = {
176
+ "easy": {
177
+ "name": "easy",
178
+ "title": "Customer Contact Cleanup",
179
+ "difficulty": "easy",
180
+ "description": (
181
+ "You are a data-quality analyst. A customer-contacts spreadsheet has "
182
+ "been imported with several obvious errors: missing or malformed e-mails, invalid "
183
+ "phone numbers, duplicate rows, and impossible ages. "
184
+ "Identify and fix every issue. Use the available actions to correct "
185
+ "each problem, then submit when you believe the data is clean."
186
+ ),
187
+ "dirty_data": _EASY_DIRTY,
188
+ "clean_data": _EASY_CLEAN,
189
+ "issues": _EASY_ISSUES,
190
+ "max_steps": 15,
191
+ "columns": ["id", "name", "email", "phone", "age", "city"],
192
+ },
193
+ "medium": {
194
+ "name": "medium",
195
+ "title": "E-commerce Order Normalisation",
196
+ "difficulty": "medium",
197
+ "description": (
198
+ "You are a data engineer preparing an orders export for a BI dashboard. "
199
+ "The dataset has mixed date formats (YYYY-MM-DD, DD/MM/YYYY, YYYY.MM.DD, DD-MM-YYYY), "
200
+ "inconsistent price formatting, product-code variants (e.g. P102 vs P-102), "
201
+ "a typo in a status field, a duplicate order, negative quantities, "
202
+ "and missing values. Normalise every field so the data is consistent, "
203
+ "then submit."
204
+ ),
205
+ "dirty_data": _MED_DIRTY,
206
+ "clean_data": _MED_CLEAN,
207
+ "issues": _MED_ISSUES,
208
+ "max_steps": 25,
209
+ "columns": ["order_id", "customer", "product", "quantity", "price", "date", "status"],
210
+ },
211
+ "hard": {
212
+ "name": "hard",
213
+ "title": "Employee Records Audit",
214
+ "difficulty": "hard",
215
+ "description": (
216
+ "You are auditing an HR database before a compliance review. "
217
+ "The employee records contain subtle cross-field inconsistencies "
218
+ "(age vs birth-date mismatches, department vs dept-code conflicts), "
219
+ "near-duplicate employees with slightly different name spellings, "
220
+ "anomalous salary values for the given role, future or impossible dates, "
221
+ "placeholder 'NULL' strings, ALL-CAPS email addresses, missing departments, "
222
+ "and referential-integrity violations (manager_id pointing to non-existent employees). "
223
+ "Find and fix all issues, then submit."
224
+ ),
225
+ "dirty_data": _HARD_DIRTY,
226
+ "clean_data": _HARD_CLEAN,
227
+ "issues": _HARD_ISSUES,
228
+ "max_steps": 35,
229
+ "columns": ["emp_id", "name", "email", "birth_date", "age", "department",
230
+ "dept_code", "role", "salary", "start_date", "manager_id"],
231
+ },
232
+ }
233
+
234
+
235
+ def get_task(name: str) -> dict:
236
+ """Return a deep copy of a task definition so mutations are isolated."""
237
+ if name not in TASKS:
238
+ raise ValueError(f"Unknown task '{name}'. Choose from: {list(TASKS.keys())}")
239
+ return copy.deepcopy(TASKS[name])
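Because each task carries three hand-written structures (`dirty_data`, `clean_data`, and `issues`) that must agree with one another, a small consistency check can catch mistakes such as an off-by-one row index. The sketch below is one possible way to do this; `check_task` and the three-row sample data are hypothetical and not part of the repository. It assumes, as the task definitions above suggest, that `row` is a 0-based index into the dirty rows and that the clean rows simply omit every row whose fix is `__DELETE__`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
DELETE = "__DELETE__"


def check_task(dirty: List[Row], clean: List[Row],
               issues: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable inconsistencies (an empty list means none)."""
    deleted = {iss["row"] for iss in issues if iss["fix"] == DELETE}
    kept = [i for i in range(len(dirty)) if i not in deleted]
    if len(kept) != len(clean):
        return [f"expected {len(kept)} clean rows, found {len(clean)}"]

    # Map each surviving dirty-row index to its ground-truth row.
    clean_by_index = dict(zip(kept, clean))
    flagged = {(iss["row"], iss["col"]) for iss in issues if iss["fix"] != DELETE}
    problems: List[str] = []

    # Every listed fix must equal the ground-truth cell (compared as strings).
    for iss in issues:
        if iss["fix"] == DELETE:
            continue
        target = clean_by_index.get(iss["row"])
        if target is None:
            problems.append(f"row {iss['row']}: fix listed for a deleted or missing row")
        elif str(target.get(iss["col"])) != iss["fix"]:
            problems.append(f"row {iss['row']}, '{iss['col']}': fix does not match clean data")

    # Every dirty/clean difference must be covered by a listed issue.
    for i in kept:
        for col, val in dirty[i].items():
            if (i, col) not in flagged and val != clean_by_index[i].get(col):
                problems.append(f"row {i}, '{col}': differs from clean data but no issue is listed")
    return problems


# Hypothetical three-row sample in the same shape as the task data.
SAMPLE_DIRTY = [
    {"id": 1, "name": "John Smith", "age": 35},
    {"id": 4, "name": "John Smith", "age": 35},
    {"id": 5, "name": "Alice Brown", "age": -3},
]
SAMPLE_CLEAN = [
    {"id": 1, "name": "John Smith", "age": 35},
    {"id": 5, "name": "Alice Brown", "age": 33},
]
SAMPLE_ISSUES = [
    {"row": 1, "col": None, "type": "duplicate_row", "fix": DELETE},
    {"row": 2, "col": "age", "type": "invalid_value", "fix": "33"},
]

if __name__ == "__main__":
    print(check_task(SAMPLE_DIRTY, SAMPLE_CLEAN, SAMPLE_ISSUES))  # []
```

Running such a check against each entry returned by `get_task` in a unit test would be one way to keep the three structures in sync when tasks are edited.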
inference.py ADDED
@@ -0,0 +1,263 @@
1
+ """
2
+ Inference Script — DataClean Environment
3
+ =========================================
4
+ MANDATORY:
5
+ - Before submitting, ensure the following variables are defined:
6
+ API_BASE_URL The API endpoint for the LLM.
7
+ MODEL_NAME The model identifier to use for inference.
8
+ HF_TOKEN Your Hugging Face / API key.
9
+ - This script must be named `inference.py` and placed in the root directory.
10
+ - Uses OpenAI Client for all LLM calls.
11
+ """
12
+
13
+ import json
14
+ import os
15
+ import re
16
+ import sys
17
+ import textwrap
18
+ from typing import List, Optional
19
+
20
+ from openai import OpenAI
21
+
22
+ # ---------------------------------------------------------------------------
23
+ # Inline client (HTTP) so inference.py is self-contained
24
+ # ---------------------------------------------------------------------------
25
+ import requests
26
+
27
+
+ class _StepResult:
+     def __init__(self, observation: dict, reward: float, done: bool):
+         self.observation = observation
+         self.reward = reward
+         self.done = done
+
+
+ class _SimpleClient:
+     """Minimal sync HTTP client for the DataClean environment."""
+
+     def __init__(self, base_url: str):
+         self.base_url = base_url.rstrip("/")
+         self.s = requests.Session()
+
+     def reset(self, task_name: str = "easy") -> _StepResult:
+         r = self.s.post(f"{self.base_url}/reset", json={"task_name": task_name}, timeout=30)
+         r.raise_for_status()
+         d = r.json()
+         return _StepResult(d.get("observation", {}), float(d.get("reward", 0)), bool(d.get("done", False)))
+
+     def step(self, action: dict) -> _StepResult:
+         r = self.s.post(f"{self.base_url}/step", json=action, timeout=30)
+         r.raise_for_status()
+         d = r.json()
+         return _StepResult(d.get("observation", {}), float(d.get("reward", 0)), bool(d.get("done", False)))
+
+     def close(self):
+         self.s.close()
+
+
+ # ---------------------------------------------------------------------------
+ # Configuration
+ # ---------------------------------------------------------------------------
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ MODEL_NAME = os.getenv("MODEL_NAME")
+
+ # Where the DataClean env server is running
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
+
+ MAX_STEPS_PER_TASK = {"easy": 12, "medium": 20, "hard": 30}
+ TEMPERATURE = 0.1
+ MAX_TOKENS = 400
+
+ SYSTEM_PROMPT = textwrap.dedent("""\
+     You are an expert data-quality analyst. You are interacting with a data-cleaning
+     environment. Your goal is to identify and fix all data-quality issues.
+
+     After reviewing the data and quality report, respond with EXACTLY ONE action in
+     valid JSON format. Available actions:
+
+     1. Fix a cell value:
+        {"action_type": "fix_value", "row_index": <int>, "column_name": "<col>", "new_value": "<corrected>"}
+
+     2. Delete a duplicate/invalid row:
+        {"action_type": "delete_row", "row_index": <int>}
+
+     3. Fill a missing value:
+        {"action_type": "fill_missing", "row_index": <int>, "column_name": "<col>", "new_value": "<value>"}
+
+     4. Flag a suspicious cell (partial credit):
+        {"action_type": "flag_anomaly", "row_index": <int>, "column_name": "<col>"}
+
+     5. Submit your work (ends the episode):
+        {"action_type": "submit"}
+
+     6. Do nothing this step:
+        {"action_type": "noop"}
+
+     RULES:
+     - row_index is 0-based and refers to the ORIGINAL row number shown in the table.
+     - Respond ONLY with the JSON action. No explanations, no markdown, no extra text.
+     - Fix the most obvious/critical issues first.
+     - When all issues appear resolved, use submit.
+     - Dates should be in YYYY-MM-DD format.
+     - Prices should be plain numbers without $ or commas.
+     - Product codes should NOT have dashes (e.g., P102 not P-102).
+     - Emails should be lowercase.
+ """).strip()
+
+
+ # ---------------------------------------------------------------------------
+ # Helpers
+ # ---------------------------------------------------------------------------
+ ACTION_JSON_RE = re.compile(r"\{[^}]+\}", re.DOTALL)
+
+
+ def parse_action(text: str) -> dict:
+     """Extract the first JSON object from the model response."""
+     if not text:
+         return {"action_type": "noop"}
+     m = ACTION_JSON_RE.search(text)
+     if m:
+         try:
+             obj = json.loads(m.group(0))
+             if "action_type" in obj:
+                 return obj
+         except json.JSONDecodeError:
+             pass
+     return {"action_type": "noop"}
+
+
+ def build_user_prompt(obs: dict, step_num: int) -> str:
+     """Build the user prompt from the observation."""
+     parts = [
+         f"TASK: {obs.get('task_description', '')}",
+         f"DIFFICULTY: {obs.get('difficulty', '')}",
+         f"STEP: {step_num}/{obs.get('max_steps', '?')}",
+         f"CURRENT SCORE: {obs.get('current_score', 0.0)}",
+         "",
+         "CURRENT DATA:",
+         obs.get("data_preview", "(no data)"),
+         "",
+         obs.get("quality_report", ""),
+     ]
+     history = obs.get("action_history", [])
+     if history:
+         parts.append("")
+         parts.append("RECENT ACTIONS:")
+         for h in history[-5:]:
+             parts.append(f"  {h}")
+
+     parts.append("")
+     parts.append("Respond with exactly one JSON action.")
+     return "\n".join(parts)
+
+
+ # ---------------------------------------------------------------------------
+ # Run one task
+ # ---------------------------------------------------------------------------
+ def run_task(
+     llm_client: OpenAI,
+     env_client: _SimpleClient,
+     task_name: str,
+     max_steps: int,
+ ) -> float:
+     """Run a single task and return the final score."""
+     print(f"\n{'='*60}")
+     print(f"  TASK: {task_name.upper()}")
+     print(f"{'='*60}")
+
+     result = env_client.reset(task_name)
+     obs = result.observation
+     print(f"  Task: {obs.get('task_description', '')[:80]}...")
+     print(f"  Max steps: {max_steps}")
+
+     for step in range(1, max_steps + 1):
+         if result.done:
+             print(f"  Episode done at step {step - 1}")
+             break
+
+         user_prompt = build_user_prompt(obs, step)
+         messages = [
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": user_prompt},
+         ]
+
+         try:
+             completion = llm_client.chat.completions.create(
+                 model=MODEL_NAME,
+                 messages=messages,
+                 temperature=TEMPERATURE,
+                 max_tokens=MAX_TOKENS,
+                 stream=False,
+             )
+             response_text = completion.choices[0].message.content or ""
+         except Exception as exc:
+             print(f"  Step {step}: LLM error ({exc}), using noop")
+             response_text = '{"action_type": "noop"}'
+
+         action = parse_action(response_text)
+         print(f"  Step {step}: {action.get('action_type', '?')}", end="")
+         if action.get("row_index") is not None:
+             print(f" row={action['row_index']}", end="")
+         if action.get("column_name"):
+             print(f" col={action['column_name']}", end="")
+         if action.get("new_value"):
+             print(f" val={action['new_value']}", end="")
+
+         result = env_client.step(action)
+         obs = result.observation
+         print(f" -> reward={result.reward:.4f} done={result.done}")
+
+         if result.done:
+             break
+
+     # If agent never submitted, force submit
+     if not result.done:
+         result = env_client.step({"action_type": "submit"})
+
+     final_score = result.reward
+     print(f"\n  FINAL SCORE ({task_name}): {final_score:.4f}")
+     return final_score
+
+
+ # ---------------------------------------------------------------------------
+ # Main
+ # ---------------------------------------------------------------------------
+ def main() -> None:
+     if not API_KEY:
+         print("ERROR: HF_TOKEN or API_KEY environment variable not set")
+         sys.exit(1)
+     if not MODEL_NAME:
+         print("ERROR: MODEL_NAME environment variable not set")
+         sys.exit(1)
+
+     print("DataClean Environment — Baseline Inference")
+     print(f"  API:   {API_BASE_URL}")
+     print(f"  Model: {MODEL_NAME}")
+     print(f"  Env:   {ENV_BASE_URL}")
+
+     llm_client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+     env_client = _SimpleClient(ENV_BASE_URL)
+
+     scores = {}
+     try:
+         for task_name in ["easy", "medium", "hard"]:
+             max_steps = MAX_STEPS_PER_TASK[task_name]
+             score = run_task(llm_client, env_client, task_name, max_steps)
+             scores[task_name] = score
+     finally:
+         env_client.close()
+
+     print(f"\n{'='*60}")
+     print("  FINAL RESULTS")
+     print(f"{'='*60}")
+     for name, score in scores.items():
+         bar = "#" * int(score * 40)
+         print(f"  {name:8s}: {score:.4f}  [{bar:<40s}]")
+     avg = sum(scores.values()) / len(scores) if scores else 0.0
+     print(f"  {'AVERAGE':8s}: {avg:.4f}")
+     print(f"{'='*60}")
+
+
+ if __name__ == "__main__":
+     main()
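A note on `parse_action` in `inference.py`: the pattern `\{[^}]+\}` stops at the first `}`, so an action whose `new_value` contains a closing brace (or any nested object) fails to decode and falls back to `noop`. Below is a minimal sketch of a more tolerant extractor, assuming the same `noop` fallback is wanted. `extract_action` is a hypothetical name, not part of the committed script; it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json

_DECODER = json.JSONDecoder()
_NOOP = {"action_type": "noop"}


def extract_action(text: str) -> dict:
    """Return the first JSON object in `text` that carries an "action_type" key.

    Hypothetical replacement for parse_action: each "{" is tried as the start
    of a JSON value, so braces inside string values do not cut the match short.
    """
    pos = text.find("{") if text else -1
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and ignores
            # whatever trails it (prose, markdown fences, a second object).
            obj, _end = _DECODER.raw_decode(text, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "action_type" in obj:
            return obj
        pos = text.find("{", pos + 1)
    return dict(_NOOP)


if __name__ == "__main__":
    reply = 'Here you go: {"action_type": "fix_value", "row_index": 3, "column_name": "note", "new_value": "a } b"}'
    print(extract_action(reply))
```

Scanning candidate positions instead of matching a regex keeps the fallback behaviour of the original (anything unparseable becomes `noop`) while accepting values that happen to contain `}`.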
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+ spec_version: 1
+ name: dataclean_env
+ type: space
+ runtime: fastapi
+ app: dataclean_env.server.app:app
+ port: 7860
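The `port` value here matches the `ENV_BASE_URL` default of `http://localhost:7860` in `inference.py`. As a rough sketch of how the `app` and `port` fields could map to a local launch, the snippet below assumes the server is started with the uvicorn CLI; how OpenEnv itself consumes this file is not shown in the commit. `parse_flat_yaml` and `uvicorn_command` are hypothetical helpers, and the parser handles only flat `key: value` lines.

```python
OPENENV_YAML = """\
spec_version: 1
name: dataclean_env
type: space
runtime: fastapi
app: dataclean_env.server.app:app
port: 7860
"""


def parse_flat_yaml(text: str) -> dict:
    """Parse flat `key: value` lines; this is not a general YAML parser."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so "module.path:app" stays intact.
        key, _, value = line.partition(":")
        value = value.strip()
        cfg[key.strip()] = int(value) if value.isdigit() else value
    return cfg


def uvicorn_command(cfg: dict, host: str = "0.0.0.0") -> list:
    """Build an argv list for the uvicorn CLI (assumed launcher)."""
    return ["uvicorn", cfg["app"], "--host", host, "--port", str(cfg["port"])]


cfg = parse_flat_yaml(OPENENV_YAML)
print(" ".join(uvicorn_command(cfg)))
```

The resulting argv list can be passed to `subprocess.run` for a local smoke test before pointing `inference.py` at the server.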
pyproject.toml ADDED
@@ -0,0 +1,27 @@
+ [build-system]
+ requires = ["setuptools>=68.0", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "dataclean-env"
+ version = "1.0.0"
+ description = "OpenEnv environment for training AI agents on real-world data-quality cleaning tasks"
+ readme = "README.md"
+ license = {text = "BSD-3-Clause"}
+ requires-python = ">=3.10"
+ dependencies = [
+     "fastapi>=0.104.0",
+     "uvicorn>=0.24.0",
+     "requests>=2.25.0",
+     "pydantic>=2.0.0",
+     "openai>=1.0.0",
+ ]
+
+ [project.optional-dependencies]
+ server = [
+     "fastapi>=0.104.0",
+     "uvicorn>=0.24.0",
+ ]
+
+ [tool.setuptools.packages.find]
+ include = ["dataclean_env*"]
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ fastapi>=0.104.0
+ uvicorn[standard]>=0.24.0
+ requests>=2.25.0
+ pydantic>=2.0.0
+ openai>=1.0.0
+ websockets>=12.0
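The two dependency lists in this commit are not identical: `requirements.txt` adds `websockets` and the `standard` extra on `uvicorn`, neither of which `pyproject.toml` declares. Below is a small sketch of a check that surfaces such drift, assuming the specifiers are copied in by hand (the lists mirror the files above). `requirement_key` and `diff_requirements` are hypothetical helpers, and version bounds are deliberately ignored.

```python
import re

PYPROJECT_DEPS = [
    "fastapi>=0.104.0",
    "uvicorn>=0.24.0",
    "requests>=2.25.0",
    "pydantic>=2.0.0",
    "openai>=1.0.0",
]

REQUIREMENTS_TXT = """\
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
requests>=2.25.0
pydantic>=2.0.0
openai>=1.0.0
websockets>=12.0
"""

_NAME_RE = re.compile(r"\s*([A-Za-z0-9._-]+)(\[[^\]]*\])?")


def requirement_key(spec: str) -> str:
    """Reduce a specifier to 'name' or 'name[extras]', dropping version bounds."""
    m = _NAME_RE.match(spec)
    if m is None:
        raise ValueError(f"unrecognised requirement: {spec!r}")
    return m.group(1).lower() + (m.group(2) or "")


def diff_requirements(pyproject: list, requirements: list) -> tuple:
    """Return (only_in_pyproject, only_in_requirements) as sorted lists."""
    a = {requirement_key(s) for s in pyproject}
    b = {requirement_key(s) for s in requirements}
    return sorted(a - b), sorted(b - a)


req_lines = [ln for ln in REQUIREMENTS_TXT.splitlines() if ln.strip()]
only_pyproject, only_requirements = diff_requirements(PYPROJECT_DEPS, req_lines)
print("only in pyproject.toml:", only_pyproject)
print("only in requirements.txt:", only_requirements)
```

Treating `uvicorn` and `uvicorn[standard]` as different keys is a design choice here: the extra pulls in additional packages, so the mismatch is worth reporting rather than normalising away.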