sairaj2 committed on
Commit
3432ccd
·
1 Parent(s): 746496e

Add /web dashboard UI and update Docker setup


- Added interactive web dashboard at /web with session controls, step controls, live outputs
- Updated Dockerfile for better Hugging Face Spaces compatibility
- Updated README.md with UI documentation
- Added .dockerignore for cleaner Docker builds
- Updated requirements.txt with necessary dependencies
- Updated env/environment.py and env/reward.py for improved functionality
- Added scripts/ directory with validation script

Files changed (9)
  1. .dockerignore +22 -0
  2. Dockerfile +22 -22
  3. README.md +94 -976
  4. app.py +411 -9
  5. env/environment.py +22 -7
  6. env/reward.py +5 -5
  7. inference.py +64 -78
  8. requirements.txt +1 -1
  9. scripts/validate-submission.sh +62 -0
.dockerignore ADDED
@@ -0,0 +1,22 @@
+ .git
+ .gitignore
+
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ *.pyd
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+ .cache/
+
+ venv/
+ .venv/
+
+ *.ipynb
+
+ *.log
+ results_*.json
+
+ .DS_Store
+
Dockerfile CHANGED
@@ -1,35 +1,35 @@
- # OpenEnv Data Cleaning Environment
  FROM python:3.11-slim

- WORKDIR /app

- # Install system dependencies
- RUN apt-get update && apt-get install -y \
- gcc \
- g++ \
- && rm -rf /var/lib/apt/lists/*

- # Copy requirements first for better caching
- COPY requirements.txt .

- # Install Python dependencies
- RUN pip install --no-cache-dir -r requirements.txt

  # Copy application code
- COPY . .

- # Create data directory
- RUN mkdir -p data

- # Generate datasets on build
- RUN python -c "from env.tasks import TaskManager; tm = TaskManager(); tm.generate_datasets()"

- # Expose port
  EXPOSE 7860

- # Health check
- HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
- CMD python -c "import requests; requests.get('http://localhost:7860/health')" || exit 1

- # Run the application
- CMD ["python", "app.py"]
+ # OpenEnv Data Cleaning Environment (FastAPI)
  FROM python:3.11-slim

+ ENV PYTHONDONTWRITEBYTECODE=1 \
+ PYTHONUNBUFFERED=1 \
+ PIP_NO_CACHE_DIR=1 \
+ PORT=7860

+ WORKDIR /app

+ # Create non-root user
+ RUN useradd -m -u 10001 appuser

+ # Install Python dependencies first for better caching
+ COPY requirements.txt /app/requirements.txt
+ RUN python -m pip install --upgrade pip && \
+ pip install -r /app/requirements.txt

  # Copy application code
+ COPY . /app

+ # Ensure data directory exists and is writable (tasks may write datasets)
+ RUN mkdir -p /app/data && chown -R appuser:appuser /app

+ USER appuser

  EXPOSE 7860

+ # Health check (keep it simple; HF will also probe)
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+ CMD python -c "import os,requests; requests.get(f\"http://127.0.0.1:{os.getenv('PORT','7860')}/health\", timeout=2).raise_for_status()"

+ # Production server (FastAPI)
+ # Note: keep a single worker because sessions are held in-memory per process.
+ CMD ["sh", "-c", "uvicorn app:app --host 0.0.0.0 --port ${PORT} --proxy-headers --forwarded-allow-ips='*' --workers 1"]
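Reviewer note: the new HEALTHCHECK builds its URL from `PORT` and fails on any non-2xx response. The same probe can be sketched with only the standard library, which may be easier to test outside the container. This is an illustrative sketch, not code from the commit; the helper names `health_url` and `is_healthy` are hypothetical, and the `/health` path and the `7860` default are taken from the Dockerfile above.

```python
import http.client
import os
import urllib.request


def health_url(env=None):
    """Build the probe URL the way the HEALTHCHECK line does (PORT, default 7860)."""
    env = os.environ if env is None else env
    port = env.get("PORT", "7860")
    return f"http://127.0.0.1:{port}/health"


def is_healthy(url, timeout=2.0):
    """Return True only for a 2xx response; any network or HTTP error is unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (non-2xx), timeouts and refused connections
        # are all OSError subclasses; HTTPException covers malformed replies.
        return False
```

A container health command could call `is_healthy(health_url())` and exit non-zero when it returns `False`; whether that is preferable to the `requests` one-liner depends on whether `requests` is already a runtime dependency.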
README.md CHANGED
@@ -1,795 +1,113 @@
1
- # OpenEnv Data Cleaning Environment
2
-
3
- A production-grade OpenEnv environment for data cleaning and validation tasks that simulates real-world data engineering workflows.
4
-
5
- ## Table of Contents
6
-
7
- - [Real-World Problem](#real-world-problem)
8
- - [Solution Overview](#solution-overview)
9
- - [Architecture](#architecture)
10
- - [How It Works](#how-it-works)
11
- - [Action Space](#action-space)
12
- - [Observation Space](#observation-space)
13
- - [Reward System](#reward-system)
14
- - [Task Levels](#task-levels)
15
- - [Quick Start](#quick-start)
16
- - [Deployment](#deployment)
17
- - [API Reference](#api-reference)
18
- - [Running the Baseline Agent](#running-the-baseline-agent)
19
- - [Expected Scores](#expected-scores)
20
- - [Development Guide](#development-guide)
21
-
22
- ---
23
-
24
- ## Real-World Problem
25
-
26
- ### The Data Quality Crisis
27
-
28
- In modern data engineering, **60-80% of a data scientist's time** is spent on data cleaning and preparation. This critical but tedious process involves:
29
-
30
- ```mermaid
31
- graph TD
32
- A[Raw Data] --> B{Data Quality Issues}
33
- B --> C[Missing Values]
34
- B --> D[Duplicate Records]
35
- B --> E[Format Inconsistencies]
36
- B --> F[Invalid Values]
37
- B --> G[Cross-Field Errors]
38
-
39
- C --> H[Manual Cleaning]
40
- D --> H
41
- E --> H
42
- F --> H
43
- G --> H
44
-
45
- H --> I[Time Consuming]
46
- H --> J[Error Prone]
47
- H --> K[Not Scalable]
48
-
49
- I --> L[Business Impact]
50
- J --> L
51
- K --> L
52
-
53
- L --> M[Delayed Insights]
54
- L --> N[Wrong Decisions]
55
- L --> O[Lost Revenue]
56
- ```
57
-
58
- ### Industry Pain Points
59
-
60
- | Problem | Impact | Current Solution |
61
- | ---------------------- | ----------------------------- | ---------------------------------- |
62
- | **Missing Values** | 15-25% of datasets have gaps | Manual imputation, simple fill |
63
- | **Duplicates** | 5-10% redundant records | SQL dedup, pandas drop_duplicates |
64
- | **Format Issues** | 20-30% inconsistent formats | Regex, manual standardization |
65
- | **Invalid Data** | 10-15% out-of-range values | Business rules, validation scripts |
66
- | **Cross-Field Errors** | 5-10% logical inconsistencies | Custom validation logic |
67
-
68
- ### Why This Matters
69
-
70
- ```mermaid
71
- graph LR
72
- A[Poor Data Quality] --> B[Failed ML Models]
73
- A --> C[Incorrect Analytics]
74
- A --> D[Compliance Issues]
75
- A --> E[Customer Churn]
76
-
77
- B --> F[$4.2M Annual Loss]
78
- C --> F
79
- D --> F
80
- E --> F
81
-
82
- style F fill:#ff6b6b,stroke:#c92a2a
83
- ```
84
-
85
- **Key Statistics:**
86
-
87
- - **$12.9 million** - Average annual cost of poor data quality per organization (Gartner)
88
- - **40% of business initiatives** fail to achieve targets due to poor data quality
89
- - **27% of revenue** is lost due to inaccurate data in CRM systems
90
-
91
- ---
92
-
93
- ## Solution Overview
94
-
95
- ### What This Project Provides
96
-
97
- This OpenEnv environment creates a **standardized, reproducible benchmark** for evaluating AI agents on data cleaning tasks. It bridges the gap between academic research and production data engineering.
98
-
99
- ```mermaid
100
- graph TB
101
- subgraph "OpenEnv Data Cleaning Environment"
102
- A[Environment Interface] --> B[Action Space]
103
- A --> C[Observation Space]
104
- A --> D[Reward System]
105
-
106
- B --> E[10 Structured Actions]
107
- C --> F[Intelligent Feedback]
108
- D --> G[Multi-Component Scoring]
109
-
110
- E --> H[Agent Interaction]
111
- F --> H
112
- G --> H
113
-
114
- H --> I[Deterministic Grading]
115
- I --> J[Reproducible Results]
116
- end
117
-
118
- subgraph "Real-World Applications"
119
- K[Data Pipeline Automation]
120
- L[ETL Quality Assurance]
121
- M[ML Data Preparation]
122
- N[Compliance Validation]
123
- end
124
-
125
- J --> K
126
- J --> L
127
- J --> M
128
- J --> N
129
- ```
130
-
131
- ### Key Benefits
132
-
133
- 1. **Standardized Evaluation**: Compare different AI agents on the same tasks
134
- 2. **Realistic Scenarios**: Based on actual data engineering challenges
135
- 3. **Deterministic Grading**: Reproducible scoring for fair comparison
136
- 4. **Production Ready**: Docker deployment, REST API, scalable architecture
137
- 5. **Extensible**: Easy to add new tasks, actions, and metrics
138
-
139
- ---
140
-
141
- ## Architecture
142
-
143
- ### System Architecture
144
-
145
- ```mermaid
146
- graph TB
147
- subgraph "Client Layer"
148
- A[AI Agent]
149
- B[Human User]
150
- C[CI/CD Pipeline]
151
- end
152
-
153
- subgraph "API Layer"
154
- D[FastAPI Server]
155
- E[REST Endpoints]
156
- F[Session Management]
157
- end
158
-
159
- subgraph "Environment Layer"
160
- G[DataCleaningEnv]
161
- H[Task Manager]
162
- I[Reward Calculator]
163
- J[Grader]
164
- end
165
-
166
- subgraph "Data Layer"
167
- K[Dirty Datasets]
168
- L[Clean Datasets]
169
- M[Session State]
170
- end
171
-
172
- A --> D
173
- B --> D
174
- C --> D
175
-
176
- D --> E
177
- D --> F
178
-
179
- E --> G
180
- F --> G
181
-
182
- G --> H
183
- G --> I
184
- G --> J
185
-
186
- H --> K
187
- H --> L
188
- G --> M
189
-
190
- style D fill:#4dabf7,stroke:#1971c2
191
- style G fill:#51cf66,stroke:#2f9e44
192
- ```
193
-
194
- ### Component Interaction
195
-
196
- ```mermaid
197
- sequenceDiagram
198
- participant Agent
199
- participant API
200
- participant Env
201
- participant Reward
202
- participant Grader
203
-
204
- Agent->>API: POST /reset (task_id)
205
- API->>Env: reset(task_config)
206
- Env-->>API: Observation
207
- API-->>Agent: Initial State
208
-
209
- loop Until Done
210
- Agent->>API: POST /step (action)
211
- API->>Env: step(action)
212
- Env->>Reward: calculate_reward()
213
- Reward-->>Env: Reward
214
- Env-->>API: Observation, Reward, Done, Info
215
- API-->>Agent: Step Result
216
- end
217
-
218
- Agent->>API: POST /step (submit)
219
- API->>Env: step(submit)
220
- Env->>Grader: grade()
221
- Grader-->>Env: Final Score
222
- Env-->>API: Final Result
223
- API-->>Agent: Final Score
224
- ```
225
-
226
- ### Data Flow
227
-
228
- ```mermaid
229
- flowchart TD
230
- A[Dirty Dataset] --> B[Environment Reset]
231
- B --> C[Initial Observation]
232
-
233
- C --> D{Agent Decision}
234
- D --> E[Select Action]
235
-
236
- E --> F[Execute Action]
237
- F --> G[Update DataFrame]
238
-
239
- G --> H[Calculate Metrics]
240
- H --> I[Detect Issues]
241
-
242
- I --> J[Generate Observation]
243
- J --> K[Calculate Reward]
244
-
245
- K --> L{Done?}
246
- L -->|No| D
247
- L -->|Yes| M[Final Grading]
248
-
249
- M --> N[Score Calculation]
250
- N --> O[Results]
251
-
252
- style A fill:#ffd43b,stroke:#f59f00
253
- style O fill:#51cf66,stroke:#2f9e44
254
- ```
255
-
256
- ---
257
-
258
- ## How It Works
259
-
260
- ### Environment Lifecycle
261
-
262
- ```mermaid
263
- stateDiagram-v2
264
- [*] --> Initialized: Create Environment
265
-
266
- Initialized --> Ready: reset(task)
267
-
268
- Ready --> Running: step(action)
269
-
270
- Running --> Running: Action Executed
271
- Running --> Running: Reward Calculated
272
- Running --> Done: submit() or max_steps
273
-
274
- Done --> Ready: reset(new_task)
275
- Done --> [*]: Delete Session
276
-
277
- state Running {
278
- [*] --> ExecuteAction
279
- ExecuteAction --> UpdateState
280
- UpdateState --> CalculateMetrics
281
- CalculateMetrics --> DetectIssues
282
- DetectIssues --> GenerateObservation
283
- GenerateObservation --> CalculateReward
284
- CalculateReward --> [*]
285
- }
286
- ```
287
-
288
- ### Action Execution Flow
289
-
290
- ```mermaid
291
- flowchart LR
292
- A[Action Input] --> B{Validate Action}
293
- B -->|Invalid| C[Return Error]
294
- B -->|Valid| D[Execute Operation]
295
-
296
- D --> E[Update DataFrame]
297
- E --> F[Track Changes]
298
- F --> G[Update Issues]
299
-
300
- G --> H[Calculate Metrics]
301
- H --> I[Generate Result]
302
-
303
- I --> J[Return ActionResult]
304
-
305
- style C fill:#ff6b6b,stroke:#c92a2a
306
- style J fill:#51cf66,stroke:#2f9e44
307
- ```
308
-
309
- ### Reward Calculation
310
-
311
- ```mermaid
312
- graph TD
313
- A[Current State] --> B[Quality Metrics]
314
- A --> C[Issue Count]
315
- A --> D[Action History]
316
-
317
- B --> E[Quality Improvement]
318
- C --> F[Issue Resolution]
319
-
320
- E --> G[Reward Components]
321
- F --> G
322
- D --> H[Redundancy Check]
323
- D --> I[Destructive Check]
324
-
325
- H --> J[Penalties]
326
- I --> J
327
-
328
- G --> K[Weighted Sum]
329
- J --> K
330
-
331
- K --> L[Final Reward]
332
-
333
- style L fill:#4dabf7,stroke:#1971c2
334
- ```
335
-
336
- ---
337
-
338
- ## Action Space
339
-
340
- ### Available Actions
341
-
342
- ```mermaid
343
- graph TB
344
- subgraph "Data Cleaning Actions"
345
- A[fill_missing] --> A1[mean/median/mode]
346
- A --> A2[forward/backward fill]
347
- A --> A3[specific value]
348
-
349
- B[drop_duplicates] --> B1[subset columns]
350
- B --> B2[keep first/last]
351
-
352
- C[normalize_text] --> C1[lowercase]
353
- C --> C2[strip whitespace]
354
- C --> C3[remove special chars]
355
-
356
- D[standardize_format] --> D1[email]
357
- D --> D2[phone]
358
- D --> D3[date]
359
-
360
- E[validate_range] --> E1[min value]
361
- E --> E2[max value]
362
-
363
- F[detect_outliers] --> F1[IQR method]
364
- F --> F2[Z-score method]
365
-
366
- G[infer_values] --> G1[interpolation]
367
- G --> G2[regression]
368
- G --> G3[mode]
369
-
370
- H[flag_invalid] --> H1[row_id]
371
- H --> H2[reason]
372
- end
373
-
374
- subgraph "Control Actions"
375
- I[revert_last_action]
376
- J[submit]
377
- end
378
- ```
379
-
380
- ### Action Details
381
-
382
- | Action | Description | Parameters | Example |
383
- | -------------------- | ----------------------- | ------------------------- | ---------------------------------------------------------- |
384
- | `fill_missing` | Fill missing values | column, strategy/value | `{"column": "age", "strategy": "median"}` |
385
- | `drop_duplicates` | Remove duplicate rows | subset, keep | `{"subset": ["email"], "keep": "first"}` |
386
- | `normalize_text` | Normalize text data | column, operations | `{"column": "name", "operations": ["lowercase", "strip"]}` |
387
- | `standardize_format` | Standardize formats | column, format_type | `{"column": "email", "format_type": "email"}` |
388
- | `validate_range` | Fix out-of-range values | column, min, max | `{"column": "age", "min_value": 0, "max_value": 120}` |
389
- | `detect_outliers` | Handle outliers | column, method, threshold | `{"column": "salary", "method": "iqr"}` |
390
- | `infer_values` | Infer missing values | column, method | `{"column": "price", "method": "interpolation"}` |
391
- | `flag_invalid` | Flag invalid rows | row_id, reason | `{"row_id": 42, "reason": "Invalid email"}` |
392
- | `revert_last_action` | Undo last action | - | `{}` |
393
- | `submit` | Submit cleaned data | - | `{}` |
394
-
395
- ---
396
-
397
- ## Observation Space
398
-
399
- ### Observation Structure
400
-
401
- ```mermaid
402
- graph TB
403
- subgraph "Observation"
404
- A[table_preview] --> A1[First N rows]
405
- B[schema] --> B1[Column types]
406
- B --> B2[Null counts]
407
- B --> B3[Unique counts]
408
-
409
- C[detected_issues] --> C1[Issue type]
410
- C --> C2[Count]
411
- C --> C3[Severity]
412
-
413
- D[quality_metrics] --> D1[Completeness]
414
- D --> D2[Validity]
415
- D --> D3[Consistency]
416
- D --> D4[Uniqueness]
417
- D --> D5[Overall]
418
-
419
- E[step_count] --> E1[Current step]
420
- F[max_steps] --> F1[Step limit]
421
-
422
- G[last_action_result] --> G1[Success/failure]
423
- G --> G2[Message]
424
- G --> G3[Rows affected]
425
-
426
- H[task_info] --> H1[Level]
427
- H --> H2[Description]
428
- end
429
- ```
430
-
431
- ### Quality Metrics Explained
432
-
433
- ```mermaid
434
- graph LR
435
- A[Quality Metrics] --> B[Completeness]
436
- A --> C[Validity]
437
- A --> D[Consistency]
438
- A --> E[Uniqueness]
439
-
440
- B --> F[Non-null ratio]
441
- C --> G[Format compliance]
442
- D --> H[Type consistency]
443
- E --> I[1 - duplicate rate]
444
-
445
- F --> J[Overall Score]
446
- G --> J
447
- H --> J
448
- I --> J
449
-
450
- J --> K[30% B + 30% C + 20% D + 20% E]
451
- ```
452
-
453
  ---
454
-
455
- ## Reward System
456
-
457
- ### Multi-Component Reward
458
-
459
- ```mermaid
460
- graph TB
461
- subgraph "Positive Components"
462
- A[Quality Improvement] --> A1[× 0.4 weight]
463
- B[Issue Resolution] --> B1[× 0.4 weight]
464
- C[Schema Validity] --> C1[× 0.2 weight]
465
- end
466
-
467
- subgraph "Penalties"
468
- D[Destructive Changes] --> D1[× 0.5 penalty]
469
- E[Redundant Actions] --> E1[× 0.05 penalty]
470
- F[Step Cost] --> F1[× 0.01 penalty]
471
- end
472
-
473
- A1 --> G[Total Reward]
474
- B1 --> G
475
- C1 --> G
476
-
477
- D1 --> G
478
- E1 --> G
479
- F1 --> G
480
-
481
- G --> H[Final Score]
482
-
483
- style H fill:#4dabf7,stroke:#1971c2
484
- ```
485
-
486
- ### Reward Formula
487
-
488
- ```
489
- reward = (quality_improvement × 0.4)
490
- + (issue_resolution × 0.4)
491
- + (schema_validity × 0.2)
492
- - (destructive_penalty × 0.5)
493
- - (redundant_penalty × 0.05)
494
- - (step_penalty × 0.01)
495
- ```
496
-
497
- ### Grading System
498
-
499
- ```mermaid
500
- graph TD
501
- subgraph "Easy Task Weights"
502
- A1[Completeness: 40%]
503
- A2[Uniqueness: 30%]
504
- A3[Format: 20%]
505
- A4[Structure: 10%]
506
- end
507
-
508
- subgraph "Medium Task Weights"
509
- B1[Completeness: 25%]
510
- B2[Uniqueness: 20%]
511
- B3[Format: 25%]
512
- B4[Structure: 15%]
513
- B5[Consistency: 15%]
514
- end
515
-
516
- subgraph "Hard Task Weights"
517
- C1[Completeness: 20%]
518
- C2[Uniqueness: 15%]
519
- C3[Format: 20%]
520
- C4[Structure: 15%]
521
- C5[Consistency: 15%]
522
- C6[Cross-Field: 15%]
523
- end
524
- ```
525
-
526
  ---
527
 
528
- ## Task Levels
529
 
530
- ### Easy Task (easy_001)
531
 
532
- ```mermaid
533
- graph LR
534
- A[Customer Database] --> B[100 Records]
535
- B --> C[Missing Values]
536
- B --> D[Exact Duplicates]
537
 
538
- C --> E[Fill with median/mode]
539
- D --> F[Drop duplicates]
540
 
541
- E --> G[Clean Data]
542
- F --> G
543
-
544
- style A fill:#ffd43b,stroke:#f59f00
545
- style G fill:#51cf66,stroke:#2f9e44
546
- ```
547
 
548
- - **Dataset**: Customer database (100 records)
549
- - **Issues**: Missing values in age/email, 8 duplicate records
550
- - **Max Steps**: 30
551
- - **Focus**: Basic cleaning operations
552
 
553
- ### Medium Task (medium_001)
554
 
555
- ```mermaid
556
- graph LR
557
- A[Employee Records] --> B[120 Records]
558
- B --> C[Format Issues]
559
- B --> D[Validation Errors]
560
- B --> E[Missing Values]
561
 
562
- C --> F[Standardize formats]
563
- D --> G[Validate ranges]
564
- E --> H[Fill missing]
565
 
566
- F --> I[Clean Data]
567
- G --> I
568
- H --> I
569
 
570
- style A fill:#ffd43b,stroke:#f59f00
571
- style I fill:#51cf66,stroke:#2f9e44
572
- ```
573
 
574
- - **Dataset**: Employee records (120 records)
575
- - **Issues**: Inconsistent email/phone/date formats, out-of-range salary/age
576
- - **Max Steps**: 40
577
- - **Focus**: Format standardization, range validation
578
 
579
- ### Hard Task (hard_001)
580
 
581
- ```mermaid
582
- graph LR
583
- A[Sales Transactions] --> B[150 Records]
584
- B --> C[Cross-Field Errors]
585
- B --> D[Outliers]
586
- B --> E[Complex Missing]
587
- B --> F[Duplicates]
588
 
589
- C --> G[Validate relationships]
590
- D --> H[Detect outliers]
591
- E --> I[Infer values]
592
- F --> J[Remove duplicates]
 
 
 
 
 
 
593
 
594
- G --> K[Clean Data]
595
- H --> K
596
- I --> K
597
- J --> K
598
 
599
- style A fill:#ffd43b,stroke:#f59f00
600
- style K fill:#51cf66,stroke:#2f9e44
601
- ```
602
 
603
- - **Dataset**: Sales transactions (150 records)
604
- - **Issues**: End dates before start dates, total ≠ price × quantity, statistical outliers
605
- - **Max Steps**: 50
606
- - **Focus**: Cross-field validation, anomaly detection
 
607
 
608
- ---
609
 
610
- ## Quick Start
611
 
612
- ### Prerequisites
 
613
 
614
- - Python 3.11+
615
- - pip or Docker
616
 
617
- ### Installation
618
 
619
  ```bash
620
- # Clone repository
621
- git clone <repository-url>
622
- cd data-cleaning-env
623
-
624
- # Create virtual environment
625
- python -m venv venv
626
- source venv/bin/activate # Linux/Mac
627
- # or
628
- venv\Scripts\activate # Windows
629
-
630
- # Install dependencies
631
- pip install -r requirements.txt
632
-
633
- # Generate datasets
634
- python -c "from env.tasks import TaskManager; tm = TaskManager(); tm.generate_datasets()"
635
-
636
- # Start server
637
- python app.py
638
  ```
639
 
640
- ### Verify Installation
641
 
642
  ```bash
643
- # Check health
644
- curl http://localhost:7860/health
645
-
646
- # List tasks
647
- curl http://localhost:7860/tasks
648
- ```
649
-
650
- ---
651
-
652
- ## Deployment
653
-
654
- ### Docker Deployment
655
-
656
- ```bash
657
- # Build image
658
- docker build -t data-cleaning-env .
659
-
660
- # Run container
661
- docker run -p 7860:7860 data-cleaning-env
662
-
663
- # Run with environment variables
664
- docker run -p 7860:7860 \
665
- -e PORT=7860 \
666
- -e LOG_LEVEL=info \
667
- data-cleaning-env
668
- ```
669
-
670
- ### Docker Compose
671
-
672
- ```yaml
673
- version: "3.8"
674
- services:
675
- data-cleaning-env:
676
- build: .
677
- ports:
678
- - "7860:7860"
679
- environment:
680
- - PORT=7860
681
- - LOG_LEVEL=info
682
- healthcheck:
683
- test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
684
- interval: 30s
685
- timeout: 10s
686
- retries: 3
687
- ```
688
-
689
- ### Hugging Face Spaces
690
-
691
- 1. **Create Space**
692
- - Go to huggingface.co/spaces
693
- - Click "Create new Space"
694
- - Select "Docker" as SDK
695
- - Choose "Blank" template
696
-
697
- 2. **Upload Files**
698
-
699
- ```
700
- ├── env/
701
- ├── agent/
702
- ├── app.py
703
- ├── Dockerfile
704
- ├── requirements.txt
705
- └── README.md
706
- ```
707
-
708
- 3. **Configure Space**
709
- - Set `app.py` as main file
710
- - Set port to 7860
711
- - Add any required secrets
712
-
713
- 4. **Deploy**
714
- - Space will automatically build
715
- - Access via provided URL
716
-
717
- ### Cloud Deployment
718
-
719
- #### AWS ECS
720
-
721
- ```bash
722
- # Build and push to ECR
723
- aws ecr create-repository --repository-name data-cleaning-env
724
- docker tag data-cleaning-env:latest <account-id>.dkr.ecr.<region>.amazonaws.com/data-cleaning-env:latest
725
- docker push <account-id>.dkr.ecr.<region>.amazonaws.com/data-cleaning-env:latest
726
-
727
- # Create ECS task definition
728
- aws ecs register-task-definition --cli-input-json file://task-definition.json
729
-
730
- # Run task
731
- aws ecs run-task --cluster default --task-definition data-cleaning-env
732
- ```
733
-
734
- #### Google Cloud Run
735
-
736
- ```bash
737
- # Build and push to GCR
738
- gcloud builds submit --tag gcr.io/<project-id>/data-cleaning-env
739
-
740
- # Deploy to Cloud Run
741
- gcloud run deploy data-cleaning-env \
742
- --image gcr.io/<project-id>/data-cleaning-env \
743
- --platform managed \
744
- --port 7860
745
- ```
746
-
747
- ---
748
-
749
- ## API Reference
750
-
751
- ### Endpoints
752
-
753
- ```mermaid
754
- graph LR
755
- subgraph "Core Endpoints"
756
- A[GET /] --> A1[Root info]
757
- B[GET /health] --> B1[Health check]
758
- C[GET /tasks] --> C1[List tasks]
759
- D[GET /tasks/:id] --> D1[Task details]
760
- end
761
-
762
- subgraph "Environment Endpoints"
763
- E[POST /reset] --> E1[Reset env]
764
- F[POST /step] --> F1[Execute action]
765
- G[GET /state/:id] --> G1[Get state]
766
- H[GET /data/:id] --> H1[Get data]
767
- end
768
-
769
- subgraph "Session Endpoints"
770
- I[GET /sessions] --> I1[List sessions]
771
- J[DELETE /session/:id] --> J1[Delete session]
772
- K[GET /history/:id] --> K1[Get history]
773
- end
774
  ```
775
 
776
- ### Example Usage
777
 
778
- #### Reset Environment
779
 
780
  ```bash
781
- curl -X POST http://localhost:7860/reset \
782
  -H "Content-Type: application/json" \
783
- -d '{"task_id": "easy_001", "session_id": "test-session"}'
784
  ```
785
 
786
- #### Execute Action
787
 
788
  ```bash
789
- curl -X POST http://localhost:7860/step \
790
  -H "Content-Type: application/json" \
791
  -d '{
792
- "session_id": "test-session",
793
  "action": {
794
  "action_type": "fill_missing",
795
  "params": {"column": "age", "strategy": "median"}
@@ -797,244 +115,44 @@ curl -X POST http://localhost:7860/step \
797
  }'
798
  ```
799
 
800
- #### Get Current Data
801
 
802
- ```bash
803
- curl http://localhost:7860/data/test-session?rows=10
804
- ```
805
 
806
- ---
807
 
808
- ## Running the Baseline Agent
 
 
 
809
 
810
- ### Setup
811
 
812
  ```bash
813
- # Set OpenAI API key
814
- export OPENAI_API_KEY="your-api-key"
815
-
816
- # Run on easy task
817
- python agent/baseline.py easy
818
-
819
- # Run on medium task
820
- python agent/baseline.py medium
821
-
822
- # Run on hard task
823
- python agent/baseline.py hard
824
  ```
825
 
826
- ### Agent Strategy
827
-
828
- ```mermaid
829
- flowchart TD
830
- A[Start] --> B[Analyze Issues]
831
- B --> C[Prioritize High Severity]
832
-
833
- C --> D{Missing Values?}
834
- D -->|Yes| E[Fill Missing]
835
- D -->|No| F{Duplicates?}
836
-
837
- F -->|Yes| G[Drop Duplicates]
838
- F -->|No| H{Format Issues?}
839
-
840
- H -->|Yes| I[Standardize Format]
841
- H -->|No| J{Out of Range?}
842
-
843
- J -->|Yes| K[Validate Range]
844
- J -->|No| L{Outliers?}
845
-
846
- L -->|Yes| M[Detect Outliers]
847
- L -->|No| N{Quality Good?}
848
-
849
- N -->|Yes| O[Submit]
850
- N -->|No| P[Continue Cleaning]
851
-
852
- E --> Q[Next Iteration]
853
- G --> Q
854
- I --> Q
855
- K --> Q
856
- M --> Q
857
- P --> Q
858
-
859
- Q --> B
860
-
861
- style O fill:#51cf66,stroke:#2f9e44
862
- ```
863
-
864
- ### Agent Output Example
865
-
866
- ```
867
- ============================================================
868
- Task: Clean a customer database with missing values and duplicate records
869
- Level: easy
870
- Max Steps: 30
871
- ============================================================
872
-
873
- Step 1: fill_missing
874
- Params: {'column': 'age', 'strategy': 'median'}
875
- Reward: 0.2345
876
- Quality: 78.45%
877
- Issues: 18
878
-
879
- Step 2: fill_missing
880
- Params: {'column': 'email', 'strategy': 'mode'}
881
- Reward: 0.1892
882
- Quality: 85.23%
883
- Issues: 12
884
-
885
- Step 3: drop_duplicates
886
- Params: {}
887
- Reward: 0.3421
888
- Quality: 92.15%
889
- Issues: 4
890
-
891
- Step 4: submit
892
- Reward: 1.2500
893
- Quality: 95.80%
894
- Issues: 0
895
-
896
- ============================================================
897
- RESULTS:
898
- Total Steps: 4
899
- Total Reward: 2.0158
900
- Final Quality: 95.80%
901
- Final Score: 87.50%
902
- ============================================================
903
- ```
904
-
905
- ---
906
-
907
- ## Expected Scores
908
-
909
- ### Performance Benchmarks
910
-
911
- ```mermaid
912
- graph TB
913
- subgraph "Easy Task"
914
- A1[Agent Score: 75-90%]
915
- A2[Human Baseline: 95%]
916
- A3[Random Baseline: 40%]
917
- end
918
 
919
- subgraph "Medium Task"
920
- B1[Agent Score: 65-80%]
921
- B2[Human Baseline: 90%]
922
- B3[Random Baseline: 30%]
923
- end
924
-
925
- subgraph "Hard Task"
926
- C1[Agent Score: 55-75%]
927
- C2[Human Baseline: 85%]
928
- C3[Random Baseline: 20%]
929
- end
930
- ```
931
-
932
- ### Score Interpretation
933
-
934
- | Score Range | Rating | Description |
935
- | ----------- | ----------------- | ---------------------------------------------- |
936
- | 90-100% | Excellent | Near-perfect cleaning, minimal errors |
937
- | 80-89% | Good | Most issues resolved, minor mistakes |
938
- | 70-79% | Satisfactory | Core issues addressed, some gaps |
939
- | 60-69% | Needs Improvement | Basic cleaning done, significant issues remain |
940
- | <60% | Poor | Major issues unresolved |
941
-
942
- ---
943
-
944
- ## Development Guide
945
-
946
- ### Adding New Tasks
947
-
948
- 1. **Define Task Config** (env/tasks.py)
949
-
950
- ```python
951
- self.tasks['new_task'] = TaskConfig(
952
- task_id='new_task',
953
- task_level=TaskLevel.MEDIUM,
954
- description="Task description",
955
- dataset_path=str(self.data_dir / "new_dirty.csv"),
956
- expected_output_path=str(self.data_dir / "new_clean.csv"),
957
- max_steps=40,
958
- issues=[...]
959
- )
960
- ```
961
-
962
- 2. **Create Dataset Generators**
963
- ```python
964
- def _generate_new_dataset(self):
965
- # Generate clean data
966
- clean_df = pd.DataFrame(...)
967
- clean_df.to_csv(self.data_dir / "new_clean.csv", index=False)
968
-
969
- # Add realistic issues
970
- dirty_df = clean_df.copy()
971
- # Add missing values, duplicates, format issues, etc.
972
- dirty_df.to_csv(self.data_dir / "new_dirty.csv", index=False)
973
- ```
974
-
975
- ### Adding New Actions
976
-
977
- 1. **Add Action Type** (env/models.py)
978
-
979
- ```python
980
- class ActionType(str, Enum):
981
- NEW_ACTION = "new_action"
982
- ```
983
-
984
- 2. **Create Parameter Model** (env/models.py)
985
-
986
- ```python
987
- class NewActionParams(BaseModel):
988
- param1: str
989
- param2: Optional[int] = None
990
- ```
991
-
992
- 3. **Implement Action** (env/environment.py)
993
-
994
- ```python
995
- def _new_action(self, params: Dict) -> ActionResult:
996
- # Implementation
997
- return ActionResult(success=True, message="...", rows_affected=n)
998
- ```
999
-
1000
- 4. **Update Agent Prompt** (agent/baseline.py)
1001
- - Add action description to system prompt
1002
- - Include example usage
1003
-
1004
- ### Customizing Rewards
1005
-
1006
- Modify weights in `RewardCalculator.__init__()`:
1007
-
1008
- ```python
1009
- self.step_penalty = 0.01 # Penalty per step
1010
- self.destructive_penalty = 0.5 # Penalty for destructive changes
1011
- self.redundant_penalty = 0.05 # Penalty for redundant actions
1012
- self.quality_weight = 0.4 # Weight for quality improvement
1013
- self.issue_weight = 0.4 # Weight for issue resolution
1014
- self.schema_weight = 0.2 # Weight for schema validity
1015
  ```
1016
 
1017
- ---
1018
-
1019
- ## License
1020
-
1021
- MIT License
1022
-
1023
- ## Repository
1024
 
1025
- https://github.com/meta-pytorch/OpenEnv
 
 
 
 
 
 
 
 
1026
 
1027
- ---
1028
-
1029
- ## Citation
1030
 
1031
- If you use this environment in your research, please cite:
1032
-
1033
- ```bibtex
1034
- @software{openenv_data_cleaning,
1035
- title={OpenEnv Data Cleaning Environment},
1036
- author={OpenEnv Team},
1037
- year={2024},
1038
- url={https://github.com/meta-pytorch/OpenEnv}
1039
- }
1040
- ```
1
  ---
2
+ title: OpenEnv Data Cleaning Environment
3
+ emoji: 🧼
4
+ colorFrom: blue
5
+ colorTo: green
6
+ sdk: docker
7
+ app_port: 7860
8
+ pinned: false
9
+ tags:
10
+ - fastapi
11
+ - docker
12
+ - openenv
13
+ - data-cleaning
14
+ - data-validation
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
15
  ---
16
 
17
+ ## What this is
18
 
19
+ This repo is a **Dockerized FastAPI application** that serves an OpenEnv-style **data cleaning environment**:
20
 
21
+ - `POST /reset` to start a session on a task
22
+ - `POST /step` to take actions (fill missing, drop duplicates, standardize formats, etc.)
23
+ - `GET /health` for readiness checks
 
 
24
 
25
+ It is suitable for **Hugging Face Spaces (Docker)**. Inference Endpoints are not ideal here because this is an interactive multi-endpoint environment, not a single model inference API.
 
26
 
27
+ ## Web UI (optional)
 
 
 
 
 
28
 
29
+ Open `\/web` for a lightweight dashboard to reset/step and view the table preview.
 
 
 
30
 
31
+ ## Real-world task
32
 
33
+ Simulates a common data engineering workflow: **cleaning a dirty table** so downstream analytics/ML won’t break.
34
+ Agents must iteratively apply safe transformations (imputation, deduplication, normalization, format standardization, range/outlier handling) and then **submit**.
 
 
 
 
35
 
36
+ ## Tasks (3 levels, deterministic grading)
 
 
37
 
38
+ - **easy_001**: missing values + exact duplicates (customer table)
39
+ - **medium_001**: missing values + format inconsistencies + invalid ranges (employee table)
40
+ - **hard_001**: missing values + duplicates + mixed date/currency formats + cross-field constraints + outliers (sales table)
41
 
42
+ On `submit`, the grader returns a **score in [0.0, 1.0]** in `info.grade.final_score`.
 
 
43
 
44
+ ## Action space
 
 
 
45
 
46
+ `Action` = `{ "action_type": <enum>, "params": <dict> }`
47
 
48
+ Supported `action_type` values:
 
 
 
 
 
 
49
 
50
+ - `fill_missing` (`column`, `strategy` in {mean, median, mode, forward_fill, backward_fill}, optional `value`)
51
+ - `drop_duplicates` (optional `subset`, optional `keep`)
52
+ - `normalize_text` (`column`, `operations`)
53
+ - `standardize_format` (`column`, `format_type` in {email, date, phone, currency, percentage})
54
+ - `validate_range` (`column`, optional `min_value`, optional `max_value`)
55
+ - `detect_outliers` (`column`, `method` in {iqr, zscore}, optional `threshold`)
56
+ - `infer_values` (`column`, `method`, optional `reference_columns`)
57
+ - `flag_invalid` (`row_id`, optional `reason`)
58
+ - `revert_last_action` (no params)
59
+ - `submit` (no params)
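
As an illustration of the action schema listed above, here is a minimal Python sketch that assembles an `Action` payload and rejects unknown action types before anything is sent to `/step`. The helper name `build_action` and the local `ACTION_TYPES` set are hypothetical and exist only for this example; the server's own `Action` model in `env/models.py` remains the source of truth for validation.

```python
from typing import Any, Dict, Optional

# Action types copied from the list above; the server-side enum is authoritative.
ACTION_TYPES = {
    "fill_missing", "drop_duplicates", "normalize_text", "standardize_format",
    "validate_range", "detect_outliers", "infer_values", "flag_invalid",
    "revert_last_action", "submit",
}

# Actions documented above as taking no params.
NO_PARAM_ACTIONS = {"revert_last_action", "submit"}


def build_action(action_type: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a dict shaped like {"action_type": ..., "params": ...}."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if action_type in NO_PARAM_ACTIONS:
        # Drop any params supplied for actions that do not accept them.
        params = {}
    return {"action_type": action_type, "params": dict(params or {})}
```

Checking the action type client-side is only a convenience: it surfaces typos early, but it does not validate per-action parameters such as `strategy` or `format_type`.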
60
 
61
+ ## Observation space
 
 
 
62
 
63
+ Each `step()` returns an `Observation` including:
 
 
64
 
65
+ - `table_preview`: first N rows as JSON records
66
+ - `column_schema`: per-column type, null counts, unique counts, samples
67
+ - `detected_issues`: issue summaries (type/count/severity)
68
+ - `quality_metrics`: completeness/validity/consistency/uniqueness/overall (0..1)
69
+ - `issues_remaining`, `step_count`, `max_steps`, plus `last_action_result`
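
To make the observation fields above concrete, the following sketch condenses an observation (already decoded from JSON into a dict) into a one-line progress summary. The function name `summarize_observation` and the sample values are assumptions made for this example; only the field names come from the list above.

```python
from typing import Any, Dict


def summarize_observation(obs: Dict[str, Any]) -> str:
    """Build a one-line progress summary from an observation dict."""
    metrics = obs.get("quality_metrics") or {}
    overall = metrics.get("overall")
    overall_text = "n/a" if overall is None else f"{overall:.2f}"
    return (
        f"step {obs.get('step_count')}/{obs.get('max_steps')} | "
        f"issues remaining: {obs.get('issues_remaining')} | "
        f"overall quality: {overall_text}"
    )


# Hypothetical sample values, for illustration only.
sample = {
    "step_count": 3,
    "max_steps": 50,
    "issues_remaining": 4,
    "quality_metrics": {"overall": 0.5},
}
print(summarize_observation(sample))
```

A summary like this is handy for logging an agent's trajectory without dumping the full `table_preview` at every step.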
70
 
71
+ ## Reward shaping (dense signal)
72
 
73
+ Reward is a **multi-component** `Reward` model:
74
 
75
+ - **Positive**: quality improvement, issue resolution progress, schema validity
76
+ - **Penalties**: destructive changes, redundant actions, per-step cost
77
 
78
+ This gives partial-progress signal instead of only terminal success/failure.
 
79
 
80
+ ## Local run (Docker)
81
 
82
  ```bash
83
+ docker build -t datacleanser .
84
+ docker run --rm -p 7860:7860 datacleanser
85
  ```
86
 
87
+ Verify:
88
 
89
  ```bash
90
+ curl -fsS http://localhost:7860/health
91
+ curl -fsS http://localhost:7860/tasks
92
  ```
93
 
94
+ ## API usage
95
 
96
+ Reset:
97
 
98
  ```bash
99
+ curl -fsS -X POST http://localhost:7860/reset \
100
  -H "Content-Type: application/json" \
101
+ -d '{"task_id":"easy_001","session_id":"demo"}'
102
  ```
103
 
104
+ Step:
105
 
106
  ```bash
107
+ curl -fsS -X POST http://localhost:7860/step \
108
  -H "Content-Type: application/json" \
109
  -d '{
110
+ "session_id": "demo",
111
  "action": {
112
  "action_type": "fill_missing",
113
  "params": {"column": "age", "strategy": "median"}
114
  }
115
  }'
116
  ```
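
The same reset/step flow can be sketched in Python using only the standard library. This is a minimal sketch, assuming the server is reachable at `http://localhost:7860` as in the Docker commands in this README; `http_post` and `run_episode` are hypothetical names, and the transport is injectable so the flow can be exercised without a running server.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

BASE_URL = "http://localhost:7860"  # assumption: matches the local docker run port

Transport = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def http_post(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON body to the server and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(post: Transport = http_post, session_id: str = "demo") -> Dict[str, Any]:
    """Reset easy_001, apply one fill_missing step, then submit."""
    post("/reset", {"task_id": "easy_001", "session_id": session_id})
    post("/step", {
        "session_id": session_id,
        "action": {
            "action_type": "fill_missing",
            "params": {"column": "age", "strategy": "median"},
        },
    })
    # The submit response is returned so the caller can inspect the grade.
    return post("/step", {
        "session_id": session_id,
        "action": {"action_type": "submit", "params": {}},
    })
```

Going by the response shape in this change, the final score should then be available under `data.info.grade.final_score` of the returned dict, though that path is worth confirming against `/docs`.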
117
 
118
+ ## Baseline agent (LLM) inference
119
 
120
+ The baseline script is `inference.py` (repo root). It uses an **OpenAI-compatible** API.
 
 
121
 
122
+ Required environment variables (per submission rules):
123
 
124
+ - `API_BASE_URL`: OpenAI-compatible endpoint base URL (optional if using OpenAI default)
125
+ - `MODEL_NAME`: model id (e.g. `gpt-4.1-mini`, or your provider’s model name)
126
+ - `OPENAI_API_KEY`: API key (preferred)
127
+ - `HF_TOKEN`: API key fallback (used if `OPENAI_API_KEY` is not set)
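
The key precedence described above (prefer `OPENAI_API_KEY`, fall back to `HF_TOKEN`) can be expressed as a small helper. This is a sketch of the documented behaviour, not a copy of `inference.py`; the function name `resolve_api_key` is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer OPENAI_API_KEY; fall back to HF_TOKEN; None if neither is set."""
    # Empty strings are treated the same as unset variables.
    return env.get("OPENAI_API_KEY") or env.get("HF_TOKEN") or None
```

Treating an empty string as unset matters when variables are forwarded into a container, where an unset host variable may arrive as `""`.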
128
 
129
+ Run all 3 tasks locally via Docker:
130
 
131
  ```bash
132
+ docker build -t datacleanser .
133
+ docker run --rm -p 7860:7860 datacleanser
134
  ```
135
 
136
+ In another terminal, run the baseline either from a local Python environment with the dependencies installed, or inside the running container:
137
 
138
+ ```bash
139
+ docker exec -it -e API_BASE_URL="$API_BASE_URL" -e MODEL_NAME="$MODEL_NAME" -e OPENAI_API_KEY="$OPENAI_API_KEY" -e HF_TOKEN="$HF_TOKEN" \
140
+ "$(docker ps -q --filter ancestor=datacleanser | head -n 1)" python3 inference.py --all --out baseline_results.json
141
  ```
142
 
143
+ ## Hugging Face Spaces (Docker) deployment
 
 
 
 
 
 
144
 
145
+ 1. Create a Space → **SDK: Docker**
146
+ 2. Push these files to the Space repo:
147
+ - `Dockerfile`
148
+ - `.dockerignore`
149
+ - `requirements.txt`
150
+ - `app.py`
151
+ - `env/`, `agent/`, `data/` (optional; datasets are generated on startup)
152
+ - `README.md` (this file)
153
+ 3. The Space will build and start automatically on port **7860**.
154
 
155
+ ## Notes
 
 
156
 
157
+ - The server generates datasets on startup (see `app.py` startup event).
158
+ - For baseline agent runs (outside Spaces), set `OPENAI_API_KEY` and use `inference.py`.
 
 
 
 
 
 
 
 
app.py CHANGED
@@ -9,6 +9,8 @@ from pathlib import Path
9
 
10
  from fastapi import FastAPI, HTTPException, BackgroundTasks
11
  from fastapi.middleware.cors import CORSMiddleware
 
 
12
  from pydantic import BaseModel
13
  import uvicorn
14
 
@@ -39,6 +41,24 @@ app.add_middleware(
39
  environments: Dict[str, DataCleaningEnv] = {}
40
  task_manager = TaskManager()
41
 
42
 
43
  class ResetRequest(BaseModel):
44
  task_id: str
@@ -78,6 +98,382 @@ async def root():
78
  }
79
 
80
 
81
  @app.get("/health")
82
  async def health_check():
83
  """Health check endpoint"""
@@ -149,8 +545,8 @@ async def reset_environment(request: ResetRequest):
149
  message=f"Environment reset with task {request.task_id}",
150
  data={
151
  "session_id": session_id,
152
- "observation": observation.dict(),
153
- "state": env.state()
154
  }
155
  )
156
  except Exception as e:
@@ -177,11 +573,11 @@ async def step_environment(request: StepRequest):
177
  success=True,
178
  message="Action executed",
179
  data={
180
- "observation": observation.dict(),
181
- "reward": reward.dict(),
182
  "done": done,
183
- "info": info,
184
- "state": env.state()
185
  }
186
  )
187
  except Exception as e:
@@ -201,10 +597,16 @@ async def get_state(session_id: str):
201
  env = environments[session_id]
202
  return {
203
  "session_id": session_id,
204
- "state": env.state()
205
  }
206
 
207
 
 
 
 
 
 
 
208
  @app.get("/data/{session_id}")
209
  async def get_current_data(session_id: str, rows: int = 100):
210
  """Get current dataframe"""
@@ -221,7 +623,7 @@ async def get_current_data(session_id: str, rows: int = 100):
221
  "session_id": session_id,
222
  "rows": len(df),
223
  "columns": list(df.columns),
224
- "data": df.head(rows).to_dict('records')
225
  }
226
 
227
 
@@ -237,7 +639,7 @@ async def get_history(session_id: str):
237
  env = environments[session_id]
238
  return {
239
  "session_id": session_id,
240
- "history": env.get_history()
241
  }
242
 
243
 
 
9
 
10
  from fastapi import FastAPI, HTTPException, BackgroundTasks
11
  from fastapi.middleware.cors import CORSMiddleware
12
+ from fastapi.encoders import jsonable_encoder
13
+ from fastapi.responses import HTMLResponse
14
  from pydantic import BaseModel
15
  import uvicorn
16
 
 
41
  environments: Dict[str, DataCleaningEnv] = {}
42
  task_manager = TaskManager()
43
 
44
+ def _to_jsonable(obj: Any) -> Any:
45
+ """
46
+ Convert common non-JSON-native scalar types (notably numpy scalars) into plain python types.
47
+ """
48
+ # numpy scalar types have .item(); converting them avoids FastAPI/Pydantic serialization errors.
49
+ if hasattr(obj, "item") and callable(getattr(obj, "item")):
50
+ try:
51
+ return obj.item()
52
+ except Exception:
53
+ pass
54
+ if isinstance(obj, dict):
55
+ return {k: _to_jsonable(v) for k, v in obj.items()}
56
+ if isinstance(obj, list):
57
+ return [_to_jsonable(v) for v in obj]
58
+ if isinstance(obj, tuple):
59
+ return [_to_jsonable(v) for v in obj]
60
+ return obj
61
+
62
 
63
  class ResetRequest(BaseModel):
64
  task_id: str
 
98
  }
99
 
100
 
101
+ @app.get("/web", response_class=HTMLResponse)
102
+ async def web_ui():
103
+ html = """
104
+ <!doctype html>
105
+ <html lang="en">
106
+ <head>
107
+ <meta charset="utf-8" />
108
+ <meta name="viewport" content="width=device-width, initial-scale=1" />
109
+ <title>DataCleanser • OpenEnv</title>
110
+ <style>
111
+ :root {
112
+ --bg: #0b1020;
113
+ --panel: #0f1730;
114
+ --muted: #8aa0c6;
115
+ --text: #e8eeff;
116
+ --border: rgba(255,255,255,0.10);
117
+ --accent: #6aa2ff;
118
+ --good: #2dd4bf;
119
+ --bad: #fb7185;
120
+ --warn: #fbbf24;
121
+ --mono: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
122
+ --sans: ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial, "Apple Color Emoji", "Segoe UI Emoji";
123
+ }
124
+ * { box-sizing: border-box; }
125
+ body {
126
+ margin: 0; padding: 0;
127
+ font-family: var(--sans);
128
+ background: radial-gradient(900px 600px at 20% 0%, rgba(106,162,255,0.18), transparent 55%),
129
+ radial-gradient(900px 600px at 80% 0%, rgba(45,212,191,0.14), transparent 55%),
130
+ var(--bg);
131
+ color: var(--text);
132
+ }
133
+ .container { max-width: 1200px; margin: 0 auto; padding: 20px; }
134
+ .topbar {
135
+ display: flex; gap: 12px; align-items: center; justify-content: space-between;
136
+ padding: 14px 16px; border: 1px solid var(--border); border-radius: 14px;
137
+ background: linear-gradient(180deg, rgba(255,255,255,0.06), rgba(255,255,255,0.02));
138
+ backdrop-filter: blur(8px);
139
+ }
140
+ .brand { display: flex; gap: 10px; align-items: center; }
141
+ .badge { font-family: var(--mono); font-size: 12px; color: var(--muted); border: 1px solid var(--border); padding: 4px 8px; border-radius: 999px; }
142
+ h1 { font-size: 18px; margin: 0; letter-spacing: 0.2px; }
143
+ .grid { display: grid; grid-template-columns: 380px 1fr; gap: 14px; margin-top: 14px; }
144
+ .card {
145
+ border: 1px solid var(--border);
146
+ border-radius: 14px;
147
+ background: rgba(15,23,48,0.72);
148
+ backdrop-filter: blur(8px);
149
+ overflow: hidden;
150
+ }
151
+ .card h2 {
152
+ font-size: 13px; letter-spacing: 0.5px; text-transform: uppercase;
153
+ margin: 0; padding: 12px 14px; border-bottom: 1px solid var(--border); color: var(--muted);
154
+ }
155
+ .card .body { padding: 14px; }
156
+ label { display: block; font-size: 12px; color: var(--muted); margin-bottom: 6px; }
157
+ input, select, textarea {
158
+ width: 100%;
159
+ padding: 10px 10px;
160
+ border-radius: 10px;
161
+ border: 1px solid var(--border);
162
+ background: rgba(255,255,255,0.03);
163
+ color: var(--text);
164
+ outline: none;
165
+ }
166
+ textarea { min-height: 110px; font-family: var(--mono); font-size: 12px; }
167
+ .row { display: grid; grid-template-columns: 1fr 1fr; gap: 10px; }
168
+ .btnrow { display: flex; flex-wrap: wrap; gap: 10px; margin-top: 10px; }
169
+ button {
170
+ border: 1px solid var(--border);
171
+ background: rgba(255,255,255,0.06);
172
+ color: var(--text);
173
+ padding: 10px 12px;
174
+ border-radius: 12px;
175
+ cursor: pointer;
176
+ font-weight: 600;
177
+ }
178
+ button.primary { border-color: rgba(106,162,255,0.5); background: rgba(106,162,255,0.14); }
179
+ button.good { border-color: rgba(45,212,191,0.5); background: rgba(45,212,191,0.12); }
180
+ button.bad { border-color: rgba(251,113,133,0.5); background: rgba(251,113,133,0.10); }
181
+ button:disabled { opacity: 0.55; cursor: not-allowed; }
182
+ .pill {
183
+ display: inline-flex; align-items: center; gap: 8px;
184
+ padding: 6px 10px; border: 1px solid var(--border); border-radius: 999px;
185
+ color: var(--muted); font-size: 12px; font-family: var(--mono);
186
+ }
187
+ .statusDot { width: 8px; height: 8px; border-radius: 999px; background: var(--warn); }
188
+ .statusDot.ok { background: var(--good); }
189
+ .statusDot.err { background: var(--bad); }
190
+ .kvs { display: grid; grid-template-columns: repeat(3, 1fr); gap: 10px; }
191
+ .kv { border: 1px solid var(--border); border-radius: 12px; padding: 10px; background: rgba(255,255,255,0.02); }
192
+ .kv .k { color: var(--muted); font-size: 11px; margin-bottom: 6px; }
193
+ .kv .v { font-family: var(--mono); font-size: 13px; }
194
+ .mono { font-family: var(--mono); font-size: 12px; color: var(--muted); }
195
+ table { width: 100%; border-collapse: collapse; font-size: 12px; }
196
+ th, td { border-bottom: 1px solid var(--border); padding: 8px 8px; text-align: left; vertical-align: top; }
197
+ th { color: var(--muted); font-weight: 600; background: rgba(255,255,255,0.02); position: sticky; top: 0; }
198
+ .scroll { max-height: 520px; overflow: auto; }
199
+ .msg { white-space: pre-wrap; font-family: var(--mono); font-size: 12px; color: var(--muted); margin: 0; }
200
+ .foot { margin-top: 14px; color: var(--muted); font-size: 12px; }
201
+ a { color: var(--accent); text-decoration: none; }
202
+ @media (max-width: 980px) { .grid { grid-template-columns: 1fr; } }
203
+ </style>
204
+ </head>
205
+ <body>
206
+ <div class="container">
207
+ <div class="topbar">
208
+ <div class="brand">
209
+ <h1>DataCleanser • OpenEnv Environment</h1>
210
+ <span class="badge">HTTP API + in-memory sessions</span>
211
+ </div>
212
+ <div class="pill" title="Backend health">
213
+ <span id="healthDot" class="statusDot"></span>
214
+ <span id="healthText">checking /health…</span>
215
+ </div>
216
+ </div>
217
+
218
+ <div class="grid">
219
+ <div class="card">
220
+ <h2>Session</h2>
221
+ <div class="body">
222
+ <label>Session ID</label>
223
+ <input id="sessionId" value="demo" />
224
+ <div style="height:10px"></div>
225
+ <div class="row">
226
+ <div>
227
+ <label>Task</label>
228
+ <select id="taskSelect"></select>
229
+ </div>
230
+ <div>
231
+ <label>Difficulty filter</label>
232
+ <select id="levelFilter">
233
+ <option value="">all</option>
234
+ <option value="easy">easy</option>
235
+ <option value="medium">medium</option>
236
+ <option value="hard">hard</option>
237
+ </select>
238
+ </div>
239
+ </div>
240
+ <div class="btnrow">
241
+ <button class="primary" id="btnLoadTasks">Load tasks</button>
242
+ <button class="good" id="btnReset">Reset</button>
243
+ <button id="btnState">State</button>
244
+ <button class="bad" id="btnDelete">Delete session</button>
245
+ </div>
246
+ <div style="height:10px"></div>
247
+ <p class="msg" id="sessionMsg"></p>
248
+ </div>
249
+ </div>
250
+
251
+ <div class="card">
252
+ <h2>Step</h2>
253
+ <div class="body">
254
+ <div class="row">
255
+ <div>
256
+ <label>Action type</label>
257
+ <select id="actionType">
258
+ <option value="fill_missing">fill_missing</option>
259
+ <option value="drop_duplicates">drop_duplicates</option>
260
+ <option value="normalize_text">normalize_text</option>
261
+ <option value="standardize_format">standardize_format</option>
262
+ <option value="validate_range">validate_range</option>
263
+ <option value="detect_outliers">detect_outliers</option>
264
+ <option value="infer_values">infer_values</option>
265
+ <option value="flag_invalid">flag_invalid</option>
266
+ <option value="revert_last_action">revert_last_action</option>
267
+ <option value="submit">submit</option>
268
+ </select>
269
+ </div>
270
+ <div>
271
+ <label>Quick templates</label>
272
+ <select id="templateSelect">
273
+ <option value="">—</option>
274
+ <option value="{&quot;column&quot;:&quot;age&quot;,&quot;strategy&quot;:&quot;median&quot;}">fill_missing age median</option>
275
+ <option value="{&quot;column&quot;:&quot;email&quot;,&quot;format_type&quot;:&quot;email&quot;}">standardize_format email</option>
276
+ <option value="{&quot;column&quot;:&quot;phone&quot;,&quot;format_type&quot;:&quot;phone&quot;}">standardize_format phone</option>
277
+ <option value="{&quot;subset&quot;:[&quot;email&quot;],&quot;keep&quot;:&quot;first&quot;}">drop_duplicates by email</option>
278
+ </select>
279
+ </div>
280
+ </div>
281
+ <div style="height:10px"></div>
282
+ <label>Params (JSON)</label>
283
+ <textarea id="paramsJson">{}</textarea>
284
+ <div class="btnrow">
285
+ <button class="primary" id="btnStep">Step</button>
286
+ <button class="good" id="btnSubmit">Submit</button>
287
+ </div>
288
+ <div style="height:10px"></div>
289
+ <p class="msg" id="stepMsg"></p>
290
+ </div>
291
+ </div>
292
+ </div>
293
+
294
+ <div class="grid">
295
+ <div class="card">
296
+ <h2>Metrics</h2>
297
+ <div class="body">
298
+ <div class="kvs">
299
+ <div class="kv"><div class="k">Step</div><div class="v" id="kvStep">—</div></div>
300
+ <div class="kv"><div class="k">Issues remaining</div><div class="v" id="kvIssues">—</div></div>
301
+ <div class="kv"><div class="k">Reward (last)</div><div class="v" id="kvReward">—</div></div>
302
+ </div>
303
+ <div style="height:10px"></div>
304
+ <div class="kvs">
305
+ <div class="kv"><div class="k">Overall quality</div><div class="v" id="kvOverall">—</div></div>
306
+ <div class="kv"><div class="k">Completeness</div><div class="v" id="kvComplete">—</div></div>
307
+ <div class="kv"><div class="k">Uniqueness</div><div class="v" id="kvUnique">—</div></div>
308
+ </div>
309
+ <div style="height:10px"></div>
310
+ <p class="mono">Tip: Use <a href="/docs" target="_blank" rel="noreferrer">/docs</a> for full API schema.</p>
311
+ </div>
312
+ </div>
313
+
314
+ <div class="card">
315
+ <h2>Table preview</h2>
316
+ <div class="body">
317
+ <div class="scroll">
318
+ <table id="previewTable"></table>
319
+ </div>
320
+ </div>
321
+ </div>
322
+ </div>
323
+
324
+ <div class="card" style="margin-top:14px">
325
+ <h2>Issues & last action</h2>
326
+ <div class="body">
327
+ <pre class="msg" id="issuesBox">—</pre>
328
+ </div>
329
+ </div>
330
+
331
+ <div class="foot">
332
+ If the UI loads but actions fail, your session may not exist yet — click <b>Reset</b> first.
333
+ </div>
334
+ </div>
335
+
336
+ <script>
337
+ const $ = (id) => document.getElementById(id);
338
+
339
+ function setHealth(ok, text) {
340
+ const dot = $("healthDot");
341
+ dot.classList.remove("ok", "err");
342
+ dot.classList.add(ok ? "ok" : "err");
343
+ $("healthText").textContent = text;
344
+ }
345
+
346
+ async function api(method, path, body=null) {
347
+ const opts = { method, headers: { "Content-Type": "application/json" } };
348
+ if (body) opts.body = JSON.stringify(body);
349
+ const res = await fetch(path, opts);
350
+ const txt = await res.text();
351
+ let data = null;
352
+ try { data = txt ? JSON.parse(txt) : null; } catch { data = { raw: txt }; }
353
+ if (!res.ok) {
354
+ const msg = (data && (data.detail || data.message)) ? (data.detail || data.message) : txt;
355
+ throw new Error(`${res.status} ${res.statusText}: ${msg}`);
356
+ }
357
+ return data;
358
+ }
359
+
360
+ function renderPreview(obs) {
361
+ const table = $("previewTable");
362
+ const rows = (obs && obs.table_preview) ? obs.table_preview : [];
363
+ if (!rows.length) { table.innerHTML = "<tr><td class='mono'>No preview available</td></tr>"; return; }
364
+ const cols = Object.keys(rows[0]);
365
+ const thead = "<thead><tr>" + cols.map(c => `<th>${c}</th>`).join("") + "</tr></thead>";
366
+ const tbody = "<tbody>" + rows.slice(0, 30).map(r =>
367
+ "<tr>" + cols.map(c => `<td>${(r[c] === null || r[c] === undefined) ? "" : String(r[c])}</td>`).join("") + "</tr>"
368
+ ).join("") + "</tbody>";
369
+ table.innerHTML = thead + tbody;
370
+ }
371
+
372
+ function updateMetrics(data) {
373
+ const obs = data?.observation || data?.data?.observation;
374
+ const reward = data?.reward || data?.data?.reward;
375
+ if (!obs) return;
376
+ $("kvStep").textContent = `${obs.step_count ?? "—"} / ${obs.max_steps ?? "—"}`;
377
+ $("kvIssues").textContent = String(obs.issues_remaining ?? "—");
378
+ $("kvReward").textContent = reward ? String(reward.total ?? "—") : "—";
379
+ const qm = obs.quality_metrics || {};
380
+ $("kvOverall").textContent = (qm.overall ?? "—");
381
+ $("kvComplete").textContent = (qm.completeness ?? "—");
382
+ $("kvUnique").textContent = (qm.uniqueness ?? "—");
383
+
384
+ const issues = {
385
+ detected_issues: obs.detected_issues || [],
386
+ last_action_result: obs.last_action_result || null,
387
+ done: data?.done ?? data?.data?.done ?? null,
388
+ grade: data?.info?.grade || data?.data?.info?.grade || null
389
+ };
390
+ $("issuesBox").textContent = JSON.stringify(issues, null, 2);
391
+ renderPreview(obs);
392
+ }
393
+
394
+ async function checkHealth() {
395
+ try {
396
+ await api("GET", "/health");
397
+ setHealth(true, `healthy`);
398
+ } catch (e) {
399
+ setHealth(false, `unhealthy`);
400
+ }
401
+ }
402
+
403
+ async function loadTasks() {
404
+ $("sessionMsg").textContent = "Loading tasks…";
405
+ const level = $("levelFilter").value;
406
+ const q = level ? `?level=${encodeURIComponent(level)}` : "";
407
+ const out = await api("GET", `/tasks${q}`);
408
+ const tasks = out.tasks || [];
409
+ const sel = $("taskSelect");
410
+ sel.innerHTML = tasks.map(t => `<option value="${t.task_id}">${t.task_id} • ${t.task_level}</option>`).join("");
411
+ $("sessionMsg").textContent = tasks.length ? `Loaded ${tasks.length} tasks.` : "No tasks found.";
412
+ }
413
+
414
+ async function doReset() {
415
+ const task_id = $("taskSelect").value;
416
+ const session_id = $("sessionId").value.trim() || "demo";
417
+ $("sessionMsg").textContent = "Resetting…";
418
+ const out = await api("POST", "/reset", { task_id, session_id });
419
+ $("sessionMsg").textContent = out.message || "Reset ok.";
420
+ updateMetrics(out.data);
421
+ }
422
+
423
+ async function doState() {
424
+ const session_id = $("sessionId").value.trim() || "demo";
425
+ $("sessionMsg").textContent = "Fetching state…";
426
+ const out = await api("GET", `/state?session_id=${encodeURIComponent(session_id)}`);
427
+ $("sessionMsg").textContent = "State ok.";
428
+ $("issuesBox").textContent = JSON.stringify(out, null, 2);
429
+ }
430
+
431
+ async function doDelete() {
432
+ const session_id = $("sessionId").value.trim() || "demo";
433
+ $("sessionMsg").textContent = "Deleting session…";
434
+ const out = await api("DELETE", `/session/${encodeURIComponent(session_id)}`);
435
+ $("sessionMsg").textContent = out.message || "Deleted.";
436
+ }
437
+
438
+ async function doStep(forceSubmit=false) {
439
+ const session_id = $("sessionId").value.trim() || "demo";
440
+ const action_type = forceSubmit ? "submit" : $("actionType").value;
441
+ let params = {};
442
+ try { params = JSON.parse($("paramsJson").value || "{}"); } catch (e) {
443
+ $("stepMsg").textContent = "Params JSON is invalid.";
444
+ return;
445
+ }
446
+ if (action_type === "submit" || action_type === "revert_last_action") params = {};
447
+ $("stepMsg").textContent = "Stepping…";
448
+ const out = await api("POST", "/step", { session_id, action: { action_type, params }});
449
+ $("stepMsg").textContent = out.message || "Step ok.";
450
+ updateMetrics(out.data);
451
+ }
452
+
453
+ $("btnLoadTasks").addEventListener("click", () => loadTasks().catch(e => $("sessionMsg").textContent = e.message));
454
+ $("btnReset").addEventListener("click", () => doReset().catch(e => $("sessionMsg").textContent = e.message));
455
+ $("btnState").addEventListener("click", () => doState().catch(e => $("sessionMsg").textContent = e.message));
456
+ $("btnDelete").addEventListener("click", () => doDelete().catch(e => $("sessionMsg").textContent = e.message));
457
+ $("btnStep").addEventListener("click", () => doStep(false).catch(e => $("stepMsg").textContent = e.message));
458
+ $("btnSubmit").addEventListener("click", () => doStep(true).catch(e => $("stepMsg").textContent = e.message));
459
+
460
+ $("templateSelect").addEventListener("change", (e) => {
461
+ const v = e.target.value;
462
+ if (v) $("paramsJson").value = v;
463
+ });
464
+
465
+ (async function init() {
466
+ await checkHealth();
467
+ await loadTasks();
468
+ })();
469
+ setInterval(checkHealth, 5000);
470
+ </script>
471
+ </body>
472
+ </html>
473
+ """
474
+ return HTMLResponse(content=html)
475
+
476
+
477
  @app.get("/health")
478
  async def health_check():
479
  """Health check endpoint"""
 
545
  message=f"Environment reset with task {request.task_id}",
546
  data={
547
  "session_id": session_id,
548
+ "observation": _to_jsonable(observation.model_dump(mode="json")),
549
+ "state": _to_jsonable(jsonable_encoder(env.state()))
550
  }
551
  )
552
  except Exception as e:
 
573
  success=True,
574
  message="Action executed",
575
  data={
576
+ "observation": _to_jsonable(observation.model_dump(mode="json")),
577
+ "reward": _to_jsonable(reward.model_dump(mode="json")),
578
  "done": done,
579
+ "info": _to_jsonable(info),
580
+ "state": _to_jsonable(jsonable_encoder(env.state()))
581
  }
582
  )
583
  except Exception as e:
 
597
  env = environments[session_id]
598
  return {
599
  "session_id": session_id,
600
+ "state": _to_jsonable(jsonable_encoder(env.state()))
601
  }
602
 
603
 
604
+ @app.get("/state")
605
+ async def get_state_query(session_id: str):
606
+ """Get current environment state (query param form)"""
607
+ return await get_state(session_id)
608
+
609
+
610
  @app.get("/data/{session_id}")
611
  async def get_current_data(session_id: str, rows: int = 100):
612
  """Get current dataframe"""
 
623
  "session_id": session_id,
624
  "rows": len(df),
625
  "columns": list(df.columns),
626
+ "data": _to_jsonable(jsonable_encoder(df.head(rows).to_dict("records")))
627
  }
628
 
629
 
 
639
  env = environments[session_id]
640
  return {
641
  "session_id": session_id,
642
+ "history": _to_jsonable(jsonable_encoder(env.get_history()))
643
  }
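
The `_to_jsonable` helper added in this change unwraps objects exposing `.item()` and recurses into dicts, lists, and tuples. One case it does not appear to cover is `NaN`, which pandas uses for missing values and which strict JSON encoders reject. The sketch below is a hypothetical variant (`to_jsonable_strict`, not part of the commit) that maps non-finite floats to `None`; whether the app needs this depends on how its responses are encoded, which the diff does not show.

```python
import math
from typing import Any


def to_jsonable_strict(obj: Any) -> Any:
    """Convert nested values to JSON-safe Python types, mapping NaN/inf to None."""
    # Recurse into containers first; tuples become lists, as in the original helper.
    if isinstance(obj, dict):
        return {k: to_jsonable_strict(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable_strict(v) for v in obj]
    # Unwrap numpy-style scalars, which expose a callable .item().
    item = getattr(obj, "item", None)
    if callable(item):
        try:
            obj = item()
        except Exception:
            return obj
    # NaN and +/-inf are not valid JSON numbers; represent them as null.
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    return obj
```

Mapping missing values to `null` keeps the table preview serializable for tasks that contain empty cells, at the cost of no longer distinguishing `NaN` from an explicit `None`.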
644
 
645
 
env/environment.py CHANGED
@@ -181,17 +181,23 @@ class DataCleaningEnv:
181
  Returns:
182
  Dictionary containing current state information
183
  """
 
 
 
 
 
 
 
 
184
  return {
185
  'step_count': self.step_count,
186
  'max_steps': self.max_steps,
187
  'done': self.done,
188
  'total_rows': len(self.current_df) if self.current_df is not None else 0,
189
  'total_columns': len(self.current_df.columns) if self.current_df is not None else 0,
190
- 'current_issues': self.current_issues,
191
- 'initial_issues': self.initial_issues,
192
- 'quality_metrics': self.reward_calculator.calculate_quality_metrics(
193
- self.current_df
194
- ).dict() if self.current_df is not None else {},
195
  'history_length': len(self.history)
196
  }
197
 
@@ -495,7 +501,16 @@ class DataCleaningEnv:
495
 
496
  def _get_observation(self) -> Observation:
497
  """Generate observation from current state"""
498
- preview = self.current_df.head(self.preview_rows).to_dict('records')
 
 
 
 
 
 
 
 
 
499
 
500
  schema = []
501
  for col in self.current_df.columns:
@@ -506,7 +521,7 @@ class DataCleaningEnv:
506
  non_null_count=int(col_data.count()),
507
  null_count=int(col_data.isnull().sum()),
508
  unique_count=int(col_data.nunique()),
509
- sample_values=col_data.dropna().head(3).tolist()
510
  ))
511
 
512
  detected_issues = []
 
181
  Returns:
182
  Dictionary containing current state information
183
  """
184
+ current_issues = {k: int(v) for k, v in (self.current_issues or {}).items()}
185
+ initial_issues = {k: int(v) for k, v in (self.initial_issues or {}).items()}
186
+ quality_metrics = (
187
+ self.reward_calculator.calculate_quality_metrics(self.current_df).model_dump(mode="json")
188
+ if self.current_df is not None
189
+ else {}
190
+ )
191
+
192
  return {
193
  'step_count': self.step_count,
194
  'max_steps': self.max_steps,
195
  'done': self.done,
196
  'total_rows': len(self.current_df) if self.current_df is not None else 0,
197
  'total_columns': len(self.current_df.columns) if self.current_df is not None else 0,
198
+ 'current_issues': current_issues,
199
+ 'initial_issues': initial_issues,
200
+ 'quality_metrics': quality_metrics,
 
 
201
  'history_length': len(self.history)
202
  }
203
 
 
501
 
502
  def _get_observation(self) -> Observation:
503
  """Generate observation from current state"""
504
+ def _to_py(val: Any) -> Any:
505
+ # Convert numpy/pandas scalars into JSON-friendly python types
506
+ if isinstance(val, (np.generic,)):
507
+ return val.item()
508
+ return val
509
+
510
+ preview_raw = self.current_df.head(self.preview_rows).to_dict("records")
511
+ preview: List[Dict[str, Any]] = [
512
+ {k: _to_py(v) for k, v in row.items()} for row in preview_raw
513
+ ]
514
 
515
  schema = []
516
  for col in self.current_df.columns:
 
521
  non_null_count=int(col_data.count()),
522
  null_count=int(col_data.isnull().sum()),
523
  unique_count=int(col_data.nunique()),
524
+ sample_values=[_to_py(v) for v in col_data.dropna().head(3).tolist()]
525
  ))
526
 
527
  detected_issues = []
env/reward.py CHANGED
@@ -39,8 +39,8 @@ class RewardCalculator:
39
 
40
  def calculate_quality_metrics(self, df: pd.DataFrame) -> QualityMetrics:
41
  """Calculate comprehensive quality metrics"""
42
- completeness = calculate_completeness(df)
43
- uniqueness = calculate_uniqueness(df)
44
 
45
  validity_scores = []
46
  for col in df.columns:
@@ -52,11 +52,11 @@ class RewardCalculator:
52
  else:
53
  validity_scores.append(1.0)
54
 
55
- validity = np.mean(validity_scores) if validity_scores else 1.0
56
 
57
- consistency = self._calculate_consistency(df)
58
 
59
- overall = (
60
  completeness * 0.3 +
61
  validity * 0.3 +
62
  consistency * 0.2 +
 
39
 
40
  def calculate_quality_metrics(self, df: pd.DataFrame) -> QualityMetrics:
41
  """Calculate comprehensive quality metrics"""
42
+ completeness = float(calculate_completeness(df))
43
+ uniqueness = float(calculate_uniqueness(df))
44
 
45
  validity_scores = []
46
  for col in df.columns:
 
52
  else:
53
  validity_scores.append(1.0)
54
 
55
+ validity = float(np.mean(validity_scores)) if validity_scores else 1.0
56
 
57
+ consistency = float(self._calculate_consistency(df))
58
 
59
+ overall = float(
60
  completeness * 0.3 +
61
  validity * 0.3 +
62
  consistency * 0.2 +
inference.py CHANGED
@@ -1,26 +1,33 @@
  """
- Inference script for OpenEnv Data Cleaning Environment
- Uses OpenAI-compatible API to run baseline agent on data cleaning tasks
+ Baseline inference script for OpenEnv Data Cleaning Environment.
+
+ Requirements (submission):
+ - File name: inference.py at repo root
+ - Uses OpenAI client for all LLM calls
+ - Reads credentials/config from environment variables:
+   - OPENAI_API_KEY (preferred) or HF_TOKEN (fallback)
+   - API_BASE_URL (optional; OpenAI-compatible endpoint)
+   - MODEL_NAME (model identifier)
  """

  import os
  import json
  import logging
  import sys
- from typing import Dict, List, Optional, Any
+ from typing import Dict, List, Optional, Any, Tuple
  from openai import OpenAI

  from env.environment import DataCleaningEnv
  from env.models import Action, ActionType, Observation, TaskLevel
  from env.tasks import TaskManager

- logging.basicConfig(
-     level=logging.INFO,
-     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
- )
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
  logger = logging.getLogger(__name__)


+ DEFAULT_TASKS: Tuple[str, ...] = ("easy_001", "medium_001", "hard_001")
+
+
  class DataCleaningAgent:
      """
      Agent that uses OpenAI-compatible API to select data cleaning actions
@@ -31,15 +38,17 @@ class DataCleaningAgent:
          api_key: Optional[str] = None,
          model: Optional[str] = None,
          api_base_url: Optional[str] = None,
-         temperature: float = 0.1,
-         max_tokens: int = 1000
+         temperature: float = 0.0,
+         max_tokens: int = 900
      ):
-         self.api_key = api_key or os.getenv("OPENAI_API_KEY")
-         self.model = model or os.getenv("MODEL_NAME", "gpt-4")
+         self.api_key = api_key or os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")
+         self.model = model or os.getenv("MODEL_NAME", "gpt-4.1-mini")
          self.api_base_url = api_base_url or os.getenv("API_BASE_URL")

          if not self.api_key:
-             raise ValueError("OpenAI API key not provided. Set OPENAI_API_KEY environment variable.")
+             raise ValueError(
+                 "No API key found. Set OPENAI_API_KEY (preferred) or HF_TOKEN (fallback)."
+             )

          client_kwargs = {"api_key": self.api_key}
          if self.api_base_url:
@@ -53,7 +62,6 @@ class DataCleaningAgent:
          self.system_prompt = self._build_system_prompt()

      def _build_system_prompt(self) -> str:
-         """Build the system prompt for the agent"""
          return """You are an expert data cleaning agent. Your task is to analyze data quality issues and select the most appropriate cleaning actions.

  You will receive observations from a data cleaning environment containing:
@@ -117,7 +125,7 @@ Be efficient - each step has a small penalty. Focus on actions that improve data
                      {"role": "user", "content": user_message}
                  ],
                  temperature=self.temperature,
-                 max_tokens=self.max_tokens
+                 max_tokens=self.max_tokens,
              )

              response_text = response.choices[0].message.content
@@ -133,7 +141,7 @@ Be efficient - each step has a small penalty. Focus on actions that improve data
              return action

          except Exception as e:
-             logger.error(f"Error selecting action: {e}")
+             logger.error(f"Model request failed ({e}). Using fallback submit action.")
              return Action(action_type=ActionType.SUBMIT, params={})

      def _format_observation(self, observation: Observation) -> str:
@@ -340,80 +348,58 @@ def run_inference(


  def main():
-     """Main entry point for inference script"""
      import argparse

      parser = argparse.ArgumentParser(description="Run data cleaning agent inference")
-     parser.add_argument(
-         "task_level",
-         nargs="?",
-         default="easy",
-         choices=["easy", "medium", "hard"],
-         help="Task difficulty level (default: easy)"
-     )
-     parser.add_argument(
-         "--task-id",
-         type=str,
-         help="Specific task ID to run"
-     )
-     parser.add_argument(
-         "--model",
-         type=str,
-         default=None,
-         help="Model name (default: from MODEL_NAME env or gpt-4)"
-     )
-     parser.add_argument(
-         "--api-base",
-         type=str,
-         default=None,
-         help="API base URL (default: from API_BASE_URL env)"
-     )
-     parser.add_argument(
-         "--max-steps",
-         type=int,
-         default=None,
-         help="Override max steps"
-     )
-     parser.add_argument(
-         "--quiet",
-         action="store_true",
-         help="Suppress verbose output"
-     )
+     parser.add_argument("--task-id", type=str, help="Run a specific task id (e.g. easy_001)")
+     parser.add_argument("--all", action="store_true", help="Run easy_001, medium_001, hard_001")
+     parser.add_argument("--model", type=str, default=None, help="Overrides MODEL_NAME")
+     parser.add_argument("--api-base", type=str, default=None, help="Overrides API_BASE_URL")
+     parser.add_argument("--max-steps", type=int, default=None, help="Override max steps")
+     parser.add_argument("--quiet", action="store_true", help="Suppress verbose output")
+     parser.add_argument("--out", type=str, default="baseline_results.json", help="Output JSON path")

      args = parser.parse_args()

      # Generate datasets if needed
      task_manager = TaskManager()
      task_manager.generate_datasets()
-
-     # Determine task ID
-     if args.task_id:
-         task_id = args.task_id
+
+     if args.all:
+         task_ids = list(DEFAULT_TASKS)
+     elif args.task_id:
+         task_ids = [args.task_id]
      else:
-         tasks = task_manager.list_tasks(TaskLevel(args.task_level))
-         if not tasks:
-             print(f"No tasks found for level: {args.task_level}")
-             sys.exit(1)
-         task_id = tasks[0].task_id
-
+         task_ids = ["easy_001"]
+
+     results: Dict[str, Any] = {
+         "model_name": args.model or os.getenv("MODEL_NAME", "gpt-4.1-mini"),
+         "api_base_url": args.api_base or os.getenv("API_BASE_URL"),
+         "tasks": {},
+     }
+
      try:
-         result = run_inference(
-             task_id=task_id,
-             model=args.model,
-             api_base_url=args.api_base,
-             verbose=not args.quiet,
-             max_steps=args.max_steps
-         )
-
-         # Print summary
-         print(f"\nTask {task_id} completed with score: {result['grade'].get('final_score', 0):.2%}")
-
-         # Save results
-         output_file = f"results_{task_id}.json"
-         with open(output_file, 'w') as f:
-             json.dump(result, f, indent=2, default=str)
-         print(f"Results saved to {output_file}")
-
+         for task_id in task_ids:
+             result = run_inference(
+                 task_id=task_id,
+                 model=args.model,
+                 api_base_url=args.api_base,
+                 verbose=not args.quiet,
+                 max_steps=args.max_steps,
+             )
+             results["tasks"][task_id] = {
+                 "task_level": result.get("task_level"),
+                 "total_steps": result.get("total_steps"),
+                 "final_quality": result.get("final_quality"),
+                 "issues_remaining": result.get("issues_remaining"),
+                 "final_score": (result.get("grade") or {}).get("final_score", 0.0),
+             }
+
+         with open(args.out, "w") as f:
+             json.dump(results, f, indent=2, default=str)
+         print(json.dumps(results, indent=2, default=str))
+         print(f"\nSaved -> {args.out}")
+
      except Exception as e:
          logger.error(f"Error running inference: {e}")
          sys.exit(1)
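The `inference.py` changes above introduce two precedence rules: the API key is taken from an explicit argument, then `OPENAI_API_KEY`, then `HF_TOKEN`; and the task list is chosen by `--all`, then `--task-id`, then a default of `easy_001`. The sketch below isolates both rules as pure functions so they can be checked without the OpenAI client or the environment package. `resolve_api_key` and `select_task_ids` are hypothetical helper names introduced for illustration; the commit keeps this logic inline in `DataCleaningAgent.__init__` and `main()`.

```python
import os
from typing import List, Mapping, Optional, Tuple

# Same default task ids as DEFAULT_TASKS in the diff.
DEFAULT_TASKS: Tuple[str, ...] = ("easy_001", "medium_001", "hard_001")


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the first non-empty key: explicit, OPENAI_API_KEY, then HF_TOKEN."""
    key = explicit or env.get("OPENAI_API_KEY") or env.get("HF_TOKEN")
    if not key:
        raise ValueError(
            "No API key found. Set OPENAI_API_KEY (preferred) or HF_TOKEN (fallback)."
        )
    return key


def select_task_ids(run_all: bool = False, task_id: Optional[str] = None) -> List[str]:
    """Mirror the CLI precedence: --all wins, then --task-id, then the default."""
    if run_all:
        return list(DEFAULT_TASKS)
    if task_id:
        return [task_id]
    return ["easy_001"]


if __name__ == "__main__":
    # Passing a plain dict instead of os.environ keeps the demo deterministic.
    print(resolve_api_key(env={"HF_TOKEN": "hf_example"}))
    print(select_task_ids(run_all=True, task_id="hard_001"))
    print(select_task_ids(task_id="medium_001"))
```

Note that because `--all` is checked first, passing both `--all` and `--task-id` runs the three default tasks and ignores the specific id.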
requirements.txt CHANGED
@@ -5,7 +5,7 @@ pydantic>=2.0.0

  # Web framework
  fastapi>=0.100.0
- uvicorn>=0.23.0
+ uvicorn[standard]>=0.23.0
  python-multipart>=0.0.6

  # OpenAI API (for baseline agent)
scripts/validate-submission.sh ADDED
@@ -0,0 +1,62 @@
+ #!/usr/bin/env bash
+ #
+ # validate-submission.sh — lightweight submission validator
+ #
+ # Usage:
+ #   ./scripts/validate-submission.sh [ping_url] [repo_dir]
+ #
+ # Examples:
+ #   ./scripts/validate-submission.sh http://localhost:7860 .
+ #   ./scripts/validate-submission.sh https://<your-space>.hf.space .
+ #
+
+ set -euo pipefail
+
+ PING_URL="${1:-http://localhost:7860}"
+ REPO_DIR="${2:-.}"
+
+ echo "==> Repo dir: ${REPO_DIR}"
+ echo "==> Ping URL: ${PING_URL}"
+
+ echo "==> Checking required files..."
+ for f in Dockerfile requirements.txt app.py openenv.yaml inference.py README.md; do
+   test -f "${REPO_DIR}/${f}" || { echo "Missing ${f}"; exit 1; }
+ done
+
+ echo "==> Docker build..."
+ docker build -t datacleanser:validate "${REPO_DIR}"
+
+ echo "==> Docker run..."
+ CID="$(docker run -d -p 7860:7860 datacleanser:validate)"
+ cleanup() { docker rm -f "${CID}" >/dev/null 2>&1 || true; }
+ trap cleanup EXIT
+
+ echo "==> Waiting for /health..."
+ for i in {1..30}; do
+   if curl -fsS "http://localhost:7860/health" >/dev/null; then
+     break
+   fi
+   sleep 1
+ done
+
+ echo "==> Probing endpoints..."
+ curl -fsS "http://localhost:7860/health" | cat
+ echo
+ curl -fsS "http://localhost:7860/tasks" | head -c 400 || true
+ echo
+ curl -fsS -X POST "http://localhost:7860/reset" \
+   -H "Content-Type: application/json" \
+   -d '{"task_id":"easy_001","session_id":"validate"}' | head -c 400 || true
+ echo
+ curl -fsS "http://localhost:7860/state?session_id=validate" | head -c 400 || true
+ echo
+
+ echo "==> Optional: openenv validate (if installed)..."
+ if command -v openenv >/dev/null 2>&1; then
+   (cd "${REPO_DIR}" && openenv validate) || true
+ else
+   echo "openenv CLI not found; skipping openenv validate."
+ fi
+
+ echo "==> OK"
+
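The "Waiting for /health" loop in the script polls up to 30 times with a one-second pause and then falls through to the probes. A rough Python equivalent of that polling pattern is sketched below for anyone who wants the same check from a test suite. It is a sketch under assumptions: `http_ok` and `wait_until_healthy` are hypothetical names, not part of the repository, and unlike the shell loop this version returns an explicit `False` when the service never becomes healthy.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count as "not ready".
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    # The URL matches the script's default; nothing needs to be listening for
    # this demo to run, it simply reports False after two quick attempts.
    ready = wait_until_healthy(
        lambda: http_ok("http://localhost:7860/health"),
        attempts=2,
        delay=0.1,
    )
    print("healthy" if ready else "not healthy")
```

Taking the probe and the sleep function as parameters keeps the retry logic testable without a running container or real delays.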