---
title: Data Cleaning OpenEnv Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv
---

# Data Cleaning OpenEnv Environment

## Overview

This repository contains a real-world OpenEnv benchmark for interactive tabular data cleaning. The agent operates on messy employee-style datasets and must resolve common data preparation issues step by step: missing values, duplicate rows, wrong dtypes, inconsistent categorical values, and derived feature creation.

The implementation uses plain Python data structures instead of pandas so it stays lightweight for the hackathon constraints, Docker validation, and Hugging Face Spaces deployment.

The repository now follows the standard OpenEnv layout closely:

```text
openenv-data-cleaning/
├── client.py
├── models.py
├── openenv.yaml
├── pyproject.toml
├── server/
│   ├── app.py
│   ├── environment.py
│   └── requirements.txt
└── outputs/
    ├── evals/
    └── logs/
```

## Environment Summary

- Domain: tabular data cleaning and preparation
- Mode: simulation environment with `reset()`, `step()`, and `state()`
- API: FastAPI on port `7860`
- Tasks: `basic_cleaning`, `moderate_cleaning`, `full_pipeline`
- Difficulty curve: easy -> medium -> hard

## Action Space

| Action | Target | Required params | Validation rules |
| --- | --- | --- | --- |
| `fill_missing` | Specific column | `{"strategy": "mean" \| "median" \| "zero" \| "mode" \| "unknown"}` | Numeric columns allow `mean`, `median`, `zero`; categorical columns allow `mode`, `unknown`. |
| `drop_duplicates` | `__all__` | `{}` | Only valid when duplicate rows are still present. |
| `convert_dtype` | Specific column | `{"target_dtype": "int" \| "float" \| "str" \| "bool"}` | Target dtype must match the task configuration and values must be convertible. |
| `normalize_category` | Categorical column | `{}` | Only valid when case-only category inconsistencies remain. |
| `create_feature` | Registered feature name | `{"feature_name": "<name>"}` | Feature must be required by the task and its source column must already be clean enough to use. |

Invalid actions leave the dataset unchanged, emit `{"error": "invalid_action"}` in `info`, consume a step, and return a low reward of `0.01`.
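
As an illustration of the `fill_missing` rule in the table above, the check could look roughly like the sketch below. This is not the repository's code: `ALLOWED_STRATEGIES`, `validate_fill_missing`, and the `"numeric"` / `"categorical"` kind labels are hypothetical names chosen for the example.

```python
# Hypothetical sketch of the fill_missing validation rule; names are illustrative.
ALLOWED_STRATEGIES = {
    "numeric": {"mean", "median", "zero"},
    "categorical": {"mode", "unknown"},
}


def validate_fill_missing(column_kind: str, params: dict) -> dict:
    """Return an info dict: empty when the action is valid, an error marker otherwise."""
    strategy = params.get("strategy")
    if strategy in ALLOWED_STRATEGIES.get(column_kind, set()):
        return {}
    return {"error": "invalid_action"}


print(validate_fill_missing("numeric", {"strategy": "median"}))    # {}
print(validate_fill_missing("categorical", {"strategy": "mean"}))  # {'error': 'invalid_action'}
```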

## Observation and State Space

Every `reset()`, `step()`, and `state()` call returns the same typed observation payload:

| Field | Type | Description |
| --- | --- | --- |
| `data_preview` | `list[dict[str, Any]]` | First five rows of the current dataset |
| `columns` | `list[ColumnInfo]` | Per-column dtype, null count, and unique count |
| `pending_issues` | `list[Issue]` | Remaining fixable issues |
| `resolved_issues` | `list[Issue]` | Issues already credited as solved |
| `action_history` | `list[dict[str, Any]]` | Previous actions with reward and optional error |
| `quality_score` | `float` | Current quality score in `[0.0, 1.0]` |
| `steps_remaining` | `int` | Remaining episode budget |
| `total_rows` | `int` | Current number of rows |
| `total_issues_at_start` | `int` | Issues detected immediately after `reset()` |
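
The payload above can be mirrored on the client side with plain dataclasses, as in the sketch below. The top-level field names come from the table; the inner fields of `ColumnInfo` and `Issue` and the `progress` helper are assumptions for illustration, and the actual definitions in `models.py` may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ColumnInfo:
    name: str          # assumed field name
    dtype: str         # assumed field name
    null_count: int    # assumed field name
    unique_count: int  # assumed field name


@dataclass
class Issue:
    kind: str    # assumed, e.g. "missing_values"
    column: str  # assumed


@dataclass
class Observation:
    data_preview: list[dict[str, Any]]
    columns: list[ColumnInfo]
    pending_issues: list[Issue]
    resolved_issues: list[Issue]
    action_history: list[dict[str, Any]]
    quality_score: float
    steps_remaining: int
    total_rows: int
    total_issues_at_start: int


def progress(obs: Observation) -> float:
    """Fraction of the issues detected at reset that are now resolved."""
    if obs.total_issues_at_start == 0:
        return 1.0
    return len(obs.resolved_issues) / obs.total_issues_at_start


obs = Observation(
    data_preview=[{"age": 31, "salary": None}],
    columns=[ColumnInfo("age", "int", 0, 1), ColumnInfo("salary", "float", 1, 0)],
    pending_issues=[Issue("missing_values", "salary")],
    resolved_issues=[Issue("missing_values", "age")],
    action_history=[],
    quality_score=0.8,
    steps_remaining=5,
    total_rows=1,
    total_issues_at_start=2,
)
print(progress(obs))  # 0.5
```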

## Tasks

| Task | Difficulty | Rows | Main issue profile |
| --- | --- | --- | --- |
| `basic_cleaning` | Easy | 20 | Missing `age`, missing `salary` |
| `moderate_cleaning` | Medium | 50 | Missing `age`, missing `salary`, missing `years_exp`, duplicate rows, wrong `salary` dtype |
| `full_pipeline` | Hard | 100 | Missing values, duplicate rows, wrong `salary` and `rating` dtypes, inconsistent `city`, inconsistent `department`, required `age_group` feature |

The hardest task includes explicit dependency chains such as fixing missing salary values before dtype conversion and cleaning source columns before feature creation.
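
One way to picture these dependency chains is a prerequisite map keyed by `(action, target)` pairs. The sketch below is hypothetical: `PREREQUISITES` and `dependencies_satisfied` are illustrative names, and treating `age` as the source column of `age_group` is an assumption rather than something stated here.

```python
# Hypothetical prerequisite map: each (action, target) lists what must be done first.
PREREQUISITES = {
    ("convert_dtype", "salary"): [("fill_missing", "salary")],
    # Assumption: age_group is derived from the age column.
    ("create_feature", "age_group"): [("fill_missing", "age")],
}


def dependencies_satisfied(action: tuple, completed: set) -> bool:
    """True when every prerequisite of the action is already in the completed set."""
    return all(dep in completed for dep in PREREQUISITES.get(action, []))


completed = {("fill_missing", "salary")}
print(dependencies_satisfied(("convert_dtype", "salary"), completed))      # True
print(dependencies_satisfied(("create_feature", "age_group"), completed))  # False
```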

## Reward and Grading

Step reward:

```text
reward = (new_quality - old_quality) + ordering_bonus - 0.01
ordering_bonus = 0.05 if dependencies were already satisfied else 0.0
reward is then clamped to [0.01, 0.99]
```
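
The same formula written as a small Python function, as a sketch. `step_reward` is a hypothetical name, and the clamp bounds are assumed to be inclusive.

```python
def step_reward(old_quality: float, new_quality: float, dependencies_satisfied: bool) -> float:
    """Quality delta plus ordering bonus minus a fixed step cost, clamped to 0.01..0.99."""
    ordering_bonus = 0.05 if dependencies_satisfied else 0.0
    raw = (new_quality - old_quality) + ordering_bonus - 0.01
    return min(0.99, max(0.01, raw))


# Quality rises from 0.60 to 0.75 with dependencies satisfied: 0.15 + 0.05 - 0.01
print(round(step_reward(0.60, 0.75, True), 2))  # 0.19

# No quality change and no bonus gives -0.01, which is clamped up to the floor
print(step_reward(0.5, 0.5, False))  # 0.01
```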

Dataset quality score combines:

- Completeness: 40%
- Uniqueness: 30%
- Consistency: 30%
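
A minimal sketch of the weighted combination follows. How each component is measured is not described here, so the function takes the three component scores as inputs in `[0.0, 1.0]`; `QUALITY_WEIGHTS` and `quality_score` are hypothetical names.

```python
# Weights taken from the list above; component scores are assumed to lie in [0.0, 1.0].
QUALITY_WEIGHTS = {"completeness": 0.4, "uniqueness": 0.3, "consistency": 0.3}


def quality_score(completeness: float, uniqueness: float, consistency: float) -> float:
    """Weighted sum of the three component scores."""
    return (
        QUALITY_WEIGHTS["completeness"] * completeness
        + QUALITY_WEIGHTS["uniqueness"] * uniqueness
        + QUALITY_WEIGHTS["consistency"] * consistency
    )


# 0.4 * 0.9 + 0.3 * 1.0 + 0.3 * 0.5
print(round(quality_score(0.9, 1.0, 0.5), 2))  # 0.81
```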

Task grader:

```text
correctness = issues_fixed / total_issues
efficiency = max(0, 1 - steps_taken / (2 * total_issues))
penalty = wrong_actions * 0.05
score = 0.8 * correctness + 0.2 * efficiency - penalty
```

Grader scores are deterministic, clamped to `[0.0, 1.0]`, and rounded to two decimals.
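
The grader formula can be checked by hand with a short sketch. `grade` is a hypothetical name, and the function assumes `total_issues > 0`.

```python
def grade(issues_fixed: int, total_issues: int, steps_taken: int, wrong_actions: int) -> float:
    """Task score per the formula above, clamped to [0.0, 1.0] and rounded to two decimals."""
    correctness = issues_fixed / total_issues
    efficiency = max(0.0, 1 - steps_taken / (2 * total_issues))
    penalty = wrong_actions * 0.05
    score = 0.8 * correctness + 0.2 * efficiency - penalty
    return round(min(1.0, max(0.0, score)), 2)


# Every issue fixed in exactly one step each, no wrong actions:
# correctness = 1.0, efficiency = 0.5, score = 0.8 + 0.1
print(grade(issues_fixed=2, total_issues=2, steps_taken=2, wrong_actions=0))  # 0.9

# All five issues fixed, but in seven steps with two wrong actions:
# 0.8 + 0.2 * 0.3 - 0.10
print(grade(issues_fixed=5, total_issues=5, steps_taken=7, wrong_actions=2))  # 0.76
```

Under this formula, a run that fixes every issue with one step per issue and no wrong actions scores `0.90`, because efficiency is `0.5` when `steps_taken` equals `total_issues`.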

## Setup

### Python and install

The project requires Python `3.10+`. Python `3.11` is recommended.

```bash
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Run local checks

```bash
python test_env.py
openenv validate .
```

### Run the FastAPI app

```bash
uv run server
```

Equivalent direct command:

```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Run the baseline inference script

The hackathon evaluator expects these environment variables:

```bash
export HF_TOKEN=...
export API_BASE_URL=...
export MODEL_NAME=...
python inference.py
```

The script uses the OpenAI Python client and emits the required `[START]`, `[STEP]`, and `[END]` structured logs.

### Docker

```bash
docker build -t data-cleaning-env .
docker run -p 7860:7860 data-cleaning-env
```

## API Surface

- `GET /`
- `GET /health`
- `GET /metadata`
- `GET /tasks`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `POST /mcp`
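
A client can reach these endpoints with the standard library alone. The sketch below only builds the requests. The JSON body shapes (`task`, `action`, `action_type`, `target`, `params`) are assumptions for illustration, so check `GET /schema` on a running server for the real ones. `build_post` is a hypothetical helper.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port used elsewhere in this README


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payload shapes; the server's actual schema may differ.
reset_req = build_post("/reset", {"task": "basic_cleaning"})
step_req = build_post(
    "/step",
    {"action": {"action_type": "fill_missing", "target": "age", "params": {"strategy": "median"}}},
)

# To send a request against a running server:
# with urllib.request.urlopen(reset_req) as resp:
#     observation = json.load(resp)
```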

## Baseline Scores

Deterministic scripted benchmark from `test_env.py`:

- `basic_cleaning`: `0.90`
- `moderate_cleaning`: `0.90`
- `full_pipeline`: `0.90`

Model-based baseline from `inference.py`:

- `basic_cleaning`: `0.90`
- `moderate_cleaning`: `0.41`
- `full_pipeline`: `0.20`

These scores were produced on April 8, 2026 using `MODEL_NAME=Qwen/Qwen2.5-72B-Instruct` through the configured Hugging Face router. The run completed and emitted the required structured logs, but the provider returned HTTP `402` after the early steps, so the medium and hard tasks were penalized by fallback `parse_error` actions. For a stronger final baseline, top up credits or switch `API_BASE_URL` / `MODEL_NAME` to a provider with available quota and rerun `python inference.py`.

## Deployment

### Hugging Face Spaces

Deploy this repo as a Docker Space tagged with OpenEnv. After deployment, verify:

- the Space root responds with HTTP `200`
- `POST /reset` works on the live Space
- `openenv validate <space-url>` passes runtime validation

Recommended deploy command:

```bash
openenv push --repo-id kaustubhg73/data-cleaning-openenv --exclude .openenv-upload-ignore
```

Space link:

- https://huggingface.co/spaces/kaustubhg73/data-cleaning-openenv