---
title: DataQA Environment Server
emoji: "\U0001F50D"
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
---

# DataQA Environment

A two-phase OpenEnv RL environment for **Data Quality Assurance**: an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.

### Demo: Agent Trajectory Replay

```
EASY TASK (Step 2) — All 6 issues found + 5 fixes proposed
  Reward: 0.87 | Identify: 1.00 | Fix: 0.67
  ✓ row:4  name: empty → "David Kim"
  ✓ row:7  salary: "seventy-five thousand" → "75000"
  ✓ row:9  salary: "5000" → "73000"
  ✓ row:15 email: mismatch → "oscar.rivera@company.com"
  ✓ row:18 start_date: "2027-06-15" → "2022-01-19"
  ✓ row:21 duplicate row detected

HARD TASK — ML experiment metadata
  Step 1: Found 5/10, missed hard issues    → Reward: 0.69
  Step 2: Found 10/10 + 5 fixes proposed    → Reward: 0.77
  Issues requiring ML knowledge:
    • val_loss < train_loss (data leakage signal)
    • resnet18 using 42.5GB GPU (impossible)
    • 350 epochs on ImageNet in 30 min (impossible)
    • wav2vec2 at 98.5% accuracy (exceeds SOTA)

ALIGNMENT TASK — NVIDIA HelpSteer data (hardest)
  Step 1: Found 7/12, missed subtle issues  → Reward: 0.58
  Step 2: Found 12/12 + 3 fixes proposed    → Reward: 0.72
  Issues requiring deep reasoning:
    • Cerasus vs Prunus serrulata (wrong taxonomic name)
    • $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
    • "does NOT learn via backprop" then describes backprop (self-contradiction)
    • Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
    • "use bare except everywhere" rated helpfulness=3 (harmful advice)
    • [SYSTEM] prompt leaked in response (pipeline contamination)
```

> The interactive replay UI with color-coded dataset visualization is available on the HF Space.

## Motivation

Every ML engineer and data scientist spends significant time debugging data quality issues (missing values, type mismatches, logical inconsistencies, and subtle statistical anomalies) before data enters ML pipelines or production databases. This is a genuine, high-frequency human task that directly impacts model quality and business outcomes.

DataQA turns this into a **two-phase RL challenge**:
1. **Identify** β€” systematically inspect corrupted data and pinpoint every planted issue
2. **Fix** β€” propose corrected values by reasoning about schema, constraints, and context

This creates a rich multi-step decision problem where agents must explore datasets strategically, distinguish subtle anomalies from noise, and reason about what the correct data should be.

## Environment API

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start a new episode with a corrupted dataset |
| `/step` | POST | Submit identified issues + proposed fixes |
| `/state` | GET | Get current episode state |
| `/health` | GET | Health check |

## Tasks

| Task | Issues | Difficulty | Domain | Description |
|------|--------|-----------|--------|-------------|
| `easy` | 6 | Beginner | HR/Employee data (21 rows) | Nulls, wrong types, duplicates, out-of-range, email-name mismatch, future dates |
| `medium` | 8 | Intermediate | E-commerce orders (31 rows) | Inconsistent totals, invalid categories, duplicate keys, wrong date formats, invalid country codes, future-date deliveries |
| `hard` | 10 | Advanced | ML experiment metadata (31 rows) | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
| `alignment` | 12 | Expert | LLM alignment data (30 rows, NVIDIA HelpSteer) | See below |
| `moderation` | 10 | Expert | Content moderation (30 rows, OpenAI Moderation) | Mislabeled hate/violence, false positives on clean text, subset rule violations, label range errors |

**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.

### Alignment Task: LLM Training Data Quality (Expert)

Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)**: 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity, each on a 0-4 scale).

This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:

| Issue | Difficulty | Why It's Hard |
|---|---|---|
| Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym; sounds plausible, requires domain knowledge |
| Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
| Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion; trains confused models |
| Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics; most dangerous for training |
| Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
| Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
| Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
| Score inflation (helpfulness=4 for a 4-word answer) | 2.5 | Score-content mismatch requires understanding rating criteria |
| Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
| Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
| Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
| Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |

These issues are designed to challenge frontier models: they require factual recall, semantic reasoning, cross-row comparison, and an understanding of what makes training data harmful.

## Two-Phase Action Space

### Phase 1: Identify Issues

Submit issues in format: `row:<row_number>,col:<column_name>,issue:<issue_type>`

- `row_number`: 1-indexed data row position (after header)
- `column_name`: Exact column header name, lowercase
- `issue_type`: One of the supported types below

### Phase 2: Propose Fixes

Submit fixes in format: `row:<row_number>,col:<column_name>,fix:<corrected_value>`

The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.

Both phases can be submitted in the same step or across multiple steps.
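These action strings are plain comma-separated `key:value` triples, so they are straightforward to emit and parse. A minimal parser (a sketch, not the environment's internal implementation; it assumes fix values contain no commas):

```python
def parse_action(line: str) -> dict:
    """Parse 'row:4,col:name,issue:missing_value' or
    'row:7,col:salary,fix:75000' into a dict."""
    parts = dict(field.split(":", 1) for field in line.split(","))
    parsed = {"row": int(parts["row"]), "col": parts["col"]}
    if "issue" in parts:
        parsed["issue"] = parts["issue"]
    else:
        parsed["fix"] = parts["fix"]
    return parsed

print(parse_action("row:4,col:name,issue:missing_value"))
# {'row': 4, 'col': 'name', 'issue': 'missing_value'}
```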

**Supported Issue Types:**

| Type | Description | Example |
|------|-------------|---------|
| `missing_value` | Null, empty, or whitespace-only | Empty name field |
| `wrong_type` | Value doesn't match expected type | Salary as "seventy-five thousand" |
| `duplicate_row` | Exact duplicate or duplicate key | Two rows with same employee_id |
| `out_of_range` | Value outside valid range | Salary of 5000 when min is 50000 |
| `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
| `inconsistent_value` | Computed field mismatch, logical inconsistency | total != qty * price |
| `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
| `referential_integrity` | Foreign key violation | (available for custom tasks) |
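As a concrete example of the `inconsistent_value` type, here is a sketch of how an agent might flag rows where a computed column disagrees with its inputs. The column names (`qty`, `price`, `total`) follow the medium task's description, but the exact headers are an assumption:

```python
import csv
import io

# Toy order data: row 2's total is inconsistent (2 * 9.99 != 25.00).
raw = """order_id,qty,price,total
1,3,10.00,30.00
2,2,9.99,25.00
"""

def find_inconsistent_totals(csv_text: str, tol: float = 0.01) -> list[str]:
    """Return identify strings for rows where total != qty * price."""
    issues = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        expected = float(row["qty"]) * float(row["price"])
        if abs(expected - float(row["total"])) > tol:
            issues.append(f"row:{i},col:total,issue:inconsistent_value")
    return issues

print(find_inconsistent_totals(raw))
# ['row:2,col:total,issue:inconsistent_value']
```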

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `dataset_csv` | str | The corrupted dataset in CSV format |
| `schema_description` | str | Column types, ranges, and constraints |
| `validation_rules` | str | Business rules the data must satisfy |
| `task_description` | str | Task context and instructions |
| `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
| `num_issues_hint` | int | Exact count of planted issues |
| `max_steps` | int | Maximum attempts allowed |
| `done` | bool | Whether episode has terminated |
| `reward` | float | Best combined reward so far (0.0-1.0) |

**Observation Metadata** (per step):
- Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
- Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
- Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`

## Reward Function

### Combined Reward

```
combined_reward = 0.6 * identify_score + 0.4 * fix_score
```

If no fixes are submitted, `combined_reward = identify_score` (no penalty; backward compatible).

### Identify Score (Difficulty-Weighted F1)

Each planted issue has a **difficulty weight** (1.0-3.0):

| Weight | Category | Examples |
|--------|----------|----------|
| 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
| 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
| 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only |

- **Weighted Recall** = (sum of difficulty weights of issues found) / (sum of all difficulty weights)
- **Weighted Precision** = like standard precision, but each false positive is penalized in proportion to the average issue difficulty
- **Weighted F1** = harmonic mean of weighted precision and weighted recall
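A sketch of the weighted metrics. The exact false-positive penalty is not spelled out above; this version charges each FP the average difficulty of the planted issues, which is an assumption:

```python
def weighted_f1(found_weights: list[float], missed_weights: list[float],
                num_false_positives: int) -> float:
    """Difficulty-weighted F1 sketch (assumed FP penalty: average difficulty)."""
    total = sum(found_weights) + sum(missed_weights)
    if total == 0:
        return 0.0
    recall = sum(found_weights) / total                    # weighted recall
    avg = total / (len(found_weights) + len(missed_weights))
    denom = sum(found_weights) + num_false_positives * avg
    precision = sum(found_weights) / denom if denom else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)   # harmonic mean

# Finding a 3.0-difficulty and a 1.0-difficulty issue, missing a 2.0, no FPs:
print(round(weighted_f1([3.0, 1.0], [2.0], 0), 3))  # 0.8
```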

### Fix Score (Difficulty-Weighted Quality)

Each proposed fix is compared against the original clean value:

| Fix Quality | Score | Description |
|-------------|-------|-------------|
| Exact match | 1.0 | Case-insensitive, whitespace-stripped match |
| Numeric close | 0.8 | Within 1% of correct numeric value |
| Correct cell | 0.1 | Right location, wrong value |
| Non-issue cell | 0.0 | Fix targets a cell with no issue |

Fix score = (sum of best fix score per issue × difficulty weight) / (total difficulty weight)
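A sketch of the per-cell grading rubric above (the environment's actual comparison logic may differ in details such as numeric parsing; the 0.0 non-issue case depends on knowing the planted issue locations, so it is handled outside this function):

```python
def grade_fix(proposed: str, clean: str) -> float:
    """Grade one proposed fix against the original clean value:
    1.0 exact (case-insensitive, stripped), 0.8 numeric within 1%,
    0.1 right cell but wrong value."""
    p, c = str(proposed).strip().lower(), str(clean).strip().lower()
    if p == c:
        return 1.0
    try:
        pv, cv = float(p), float(c)
        if cv != 0 and abs(pv - cv) / abs(cv) <= 0.01:
            return 0.8
    except ValueError:
        pass
    return 0.1  # right location, wrong value

print(grade_fix("  75000 ", "75000"))       # 1.0
print(grade_fix("74500", "75000"))          # 0.8 (within 1%)
print(grade_fix("David Lim", "David Kim"))  # 0.1
```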

### Reward Properties

- **Per-step partial progress**: reward increases as more issues are found/fixed
- **Difficulty-aware**: finding subtle issues earns more than obvious ones
- **Penalizes bad behavior**: false positives reduce score, fixing non-issues earns nothing
- **Monotonically non-decreasing**: best score across all steps is the final reward
- **Always in [0.0, 1.0]**: meets hackathon requirement

### Episode Boundaries

- Each task allows up to 3 steps (attempts)
- Episode ends when F1 >= 0.999 (perfect identification) or max steps reached
- Agent receives detailed feedback after each step to improve on next attempt

## Baseline Scores

The baseline agent uses Qwen2.5-72B-Instruct via the HuggingFace Router:

| Task | Identify Score | Fix Score | Combined | Notes |
|------|---------------|-----------|----------|-------|
| `easy` | 0.7-1.0 | 0.5-0.9 | 0.6-1.0 | Most LLMs find obvious issues reliably |
| `medium` | 0.5-0.8 | 0.3-0.6 | 0.4-0.7 | Cross-column reasoning challenges models |
| `hard` | 0.3-0.6 | 0.2-0.4 | 0.3-0.5 | ML domain knowledge and subtle patterns |

Scores vary by model. The hard task is designed to challenge frontier models.

## Extensibility

### Custom Contamination Rules

```python
from dataqa_env import register_contamination_rule
from dataqa_env.server.tasks import PlantedIssue

def swap_digits(rows, header, col_idx, row_idx, rng):
    # Corrupt the cell by reversing its characters, e.g. "75000" -> "00057".
    val = rows[row_idx][col_idx]
    corrupted = val[::-1]
    issue = PlantedIssue(
        row=row_idx + 1, col=header[col_idx],  # issue keys use 1-indexed rows
        issue_type="format_violation",
        description=f"Digits swapped in {header[col_idx]}",
        difficulty=2.0,
    )
    return corrupted, issue

register_contamination_rule("swap_digits", swap_digits)
```

### Custom Tasks from Config

```python
from dataqa_env import create_task_from_config, register_task

task = create_task_from_config(
    task_id="custom",
    name="Custom Validation",
    description="Find quality issues in this dataset.",
    schema_description="id: int, name: str, score: int (0-100)",
    validation_rules="No missing values. Scores must be 0-100.",
    clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
    contaminations=[
        {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
        {"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
    ],
)
register_task("custom", lambda seed: task)
```

### Built-in Contamination Rules

| Rule | Effect | Default Difficulty |
|------|--------|--------------------|
| `missing_value` | Sets field to empty string | 1.0 |
| `whitespace_value` | Sets field to single space | 2.5 |
| `wrong_type_text` | Replaces with random text | 1.0 |
| `negative_value` | Negates numeric value | 1.0 |

## Setup & Quick Start

```bash
# Install
pip install -e .

# Run server locally
uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000

# Run inference (set your API credentials)
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=your-token \
python inference.py
```

## Docker

```bash
docker build -t dataqa-env .
docker run -p 8000:8000 dataqa-env
```

## Testing

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

118 tests covering:
- Task creation, corruption, and difficulty weights
- Issue key and fix parsing (standard, lenient, edge cases)
- F1, weighted reward, and fix quality computation
- Full environment lifecycle (identify-only and identify+fix)
- Combined reward calculation and weight verification
- Inference script parsing and prompt building
- Structured log format ([START], [STEP], [END])
- Score bounds (0.0-1.0), best-score monotonicity
- Extensibility API (custom rules, custom tasks)

## Validation

```bash
# OpenEnv spec validation
openenv validate .

# Pre-submission validation (requires HF Space URL)
./prevalidation_script.sh https://your-space.hf.space
```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | HuggingFace token / API key | - |
| `ENV_URL` | Environment server URL | `http://localhost:8000` |

## Architecture

```
dataqa_env/
├── __init__.py            # Public API + extensibility exports
├── models.py              # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
├── client.py              # EnvClient for WebSocket connections
├── server/
│   ├── environment.py     # Two-phase DataQAEnvironment (identify + fix + combined reward)
│   ├── tasks.py           # Task definitions + contamination rules + extensibility API
│   ├── app.py             # FastAPI server (via openenv-core create_app)
│   └── Dockerfile
tests/
├── test_tasks.py          # Task creation, corruption, difficulty weights
├── test_environment.py    # Identify scoring, fix grading, combined reward, lifecycle
├── test_inference.py      # LLM response parsing, fix parsing, prompt building, log format
└── test_extensibility.py  # Custom rules, custom tasks, registration API
inference.py               # Two-phase baseline agent (identify → fix)
openenv.yaml               # OpenEnv/HF Spaces spec
pyproject.toml             # Package metadata and dependencies
Dockerfile                 # Production container
```