Spaces:
Sleeping
Sleeping
File size: 19,738 Bytes
8ced877 0c216ef f5583f9 0c216ef 8ced877 0bb6f99 0c216ef 8ced877 0c216ef 6c1b2ac f5583f9 6c1b2ac f5583f9 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac f5583f9 6c1b2ac f5583f9 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac 0c216ef 6c1b2ac f5583f9 6c1b2ac f5583f9 6c1b2ac f5583f9 6c1b2ac f5583f9 6c1b2ac f5583f9 346b0b1 6c1b2ac f5583f9 6c1b2ac 346b0b1 6c1b2ac f5583f9 346b0b1 f5583f9 346b0b1 f5583f9 346b0b1 f5583f9 346b0b1 f5583f9 346b0b1 f5583f9 6c1b2ac f5583f9 346b0b1 f5583f9 346b0b1 0c216ef 346b0b1 0c216ef 346b0b1 0c216ef 346b0b1 f5583f9 0c216ef 346b0b1 f5583f9 346b0b1 6c1b2ac f5583f9 346b0b1 0c216ef 346b0b1 0c216ef 346b0b1 6c1b2ac 346b0b1 0c216ef 346b0b1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 | ---
title: DataQA Environment Server
emoji: "\U0001F50D"
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
---
# DataQA Environment
**A two-phase OpenEnv RL environment for Data Quality Assurance** β an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
## Why DataQA? The Moat
### 1. Solves a Real, High-Frequency Problem
Every ML team burns hours on data quality β missing values, type mismatches, logical inconsistencies, subtle statistical anomalies β before data enters training pipelines or production databases. DataQA turns this universal pain point into a graded RL environment. Unlike synthetic toy problems, **these are the exact data bugs that corrupt production ML models.**
### 2. Seven Diverse Domains, One Unified Interface
| Task | Domain | Issues | What Makes It Hard |
|------|--------|--------|--------------------|
| `easy` | HR / Employee data | 6 | Missing values, typos, format errors |
| `medium` | E-commerce orders | 8 | Cross-column math (`total != qty * price`), OCR errors |
| `hard` | ML experiment metadata | 10 | Data leakage detection, impossible GPU specs, SOTA violations |
| `alignment` | LLM fine-tuning data (NVIDIA HelpSteer) | 12 | Hallucinated citations, self-contradictions, toxic content scored as helpful |
| `coding` | Code instruction-response pairs | 10 | Logic bugs in "correct" code, `eval()` injection, language mismatches |
| `toolcalling` | Function-calling schemas | 10 | Hallucinated parameters, missing required args, name mismatches |
| `moderation` | Content moderation labels | 10 | Mislabeled hate speech, false positives on clean text |
**66 total planted issues** spanning tabular data, free-text, code, JSON schemas, and safety labels. No other OpenEnv submission covers this breadth with a single coherent reward function.
### 3. Two-Phase Reward β Identify Then Fix
Most data QA environments only ask "is there a bug?" DataQA goes further:
- **Phase 1 (Identify):** Find all issues β graded by difficulty-weighted F1
- **Phase 2 (Fix):** Propose the correct value β graded against the clean original with tiered scoring (exact match = 1.0, valid fix = 0.8, partial = 0.4, right cell wrong value = 0.1)
```
combined_reward = 0.6 * identify_score + 0.4 * fix_score
```
This creates a richer learning signal than binary classification. An agent that finds 8/10 issues and fixes 5 of them correctly gets meaningful partial credit β perfect for GRPO/RLHF training.
### 4. Difficulty-Weighted Scoring Rewards Deeper Reasoning
Each planted issue has a difficulty weight (1.0-3.0). Finding a hallucinated citation (3.0) earns triple the reward of finding an empty field (1.0). This incentivizes agents to develop genuine reasoning capabilities rather than pattern-matching surface-level errors.
### 5. Multi-Step Feedback Loop
Agents get 3 attempts per task with detailed per-step feedback:
- Which issues were correct (true positives) vs wrong (false positives)
- Which issues were missed (false negatives) with difficulty hints
- Fix quality scores with reasons
This enables the agent to **learn from its mistakes within a single episode** β a natural curriculum.
### 6. Fully Extensible
```python
# Add your own contamination rules
register_contamination_rule("swap_digits", my_swap_fn)
# Create tasks from any CSV
task = create_task_from_config(
task_id="custom", clean_csv="...",
contaminations=[{"rule": "missing_value", "row": 0, "col": 1}]
)
register_task("custom", lambda seed: task)
```
New domains can be added in minutes. The contamination engine is domain-agnostic.
---
## Demo: Agent Trajectory
```
HARD TASK β ML experiment metadata
Step 1: Found 5/10, missed hard issues β Reward: 0.69
Step 2: Found 10/10 + 5 fixes proposed β Reward: 0.77
Issues requiring ML knowledge:
β’ val_loss < train_loss (data leakage signal)
β’ resnet18 using 42.5GB GPU (impossible for 11M params)
β’ 350 epochs on ImageNet in 30 min (impossibly fast)
β’ wav2vec2 at 98.5% accuracy (exceeds SOTA)
ALIGNMENT TASK β NVIDIA HelpSteer data
Step 1: Found 7/12, missed subtle issues β Reward: 0.58
Step 2: Found 12/12 + 3 fixes proposed β Reward: 0.72
Issues requiring deep reasoning:
β’ Cerasus vs Prunus serrulata (wrong taxonomic name)
β’ $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
β’ Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
β’ Gender-biased advice rated helpfulness=4 (toxic content with inflated scores)
CODING TASK β Code instruction-response pairs
Issues requiring code understanding:
β’ Binary search off-by-one (lo=mid causes infinite loop) marked correct
β’ eval(uid) in Flask route β code injection vulnerability
β’ JavaScript response for a Python-labeled task
β’ Duplicate "merge sort" instruction across rows
```
> The interactive replay UI with color-coded dataset visualization is available on the HF Space.
## Environment API
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start a new episode with a corrupted dataset |
| `/step` | POST | Submit identified issues + proposed fixes |
| `/state` | GET | Get current episode state |
| `/health` | GET | Health check |
## Tasks
**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage). Expert tasks (alignment, coding, toolcalling, moderation) require domain expertise, semantic reasoning, and cross-row comparison.
### Alignment Task: LLM Training Data Quality (Expert)
Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)** β 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity on 0-4 scale).
This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:
| Issue | Difficulty | Why It's Hard |
|---|---|---|
| Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym β sounds plausible, requires domain knowledge |
| Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
| Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion β trains confused models |
| Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics β most dangerous for training |
| Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
| Toxic/biased response scored as helpful | 3.0 | Gender-biased stereotypes with helpfulness=4 β poisons alignment training |
| Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
| Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
| Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
| Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
| Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
| Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |
### Coding Task: Code Quality (Expert)
20-row dataset of code instruction-response pairs (Python algorithms, data structures, web, design patterns). 10 planted issues:
- Syntax errors in "correct" code (unbalanced parens)
- Logic bugs marked `is_correct=true` (binary search off-by-one infinite loop)
- Security vulnerabilities (`eval()` on user input) marked correct
- Language mismatches (JavaScript response labeled Python)
- Truncated code, difficulty label mismatches, duplicate instructions, wrong categories, missing test cases
### Tool-Calling Task: Function Schema Quality (Expert)
20-row dataset of function definitions with parameter schemas, example calls, and outputs. 10 planted issues:
- Function name mismatch between definition and example call
- Missing required parameters in example call
- Hallucinated parameters not in schema
- Type mismatches (string "high" for integer quality parameter)
- Invalid JSON, duplicate function names, misleading descriptions, wrong categories
### Moderation Task: Content Label Quality (Expert)
30-row dataset modeled on content moderation pipelines. 10 planted issues:
- Mislabeled hate speech and violence (unflagged toxic content)
- False positives on clean text (idioms flagged as hate)
- Subset rule violations (`hate_threatening` without `hate` flag)
- Out-of-range label values
## Two-Phase Action Space
### Phase 1: Identify Issues
Submit issues in format: `row:<row_number>,col:<column_name>,issue:<issue_type>`
- `row_number`: 1-indexed data row position (after header)
- `column_name`: Exact column header name, lowercase
- `issue_type`: One of the supported types below
### Phase 2: Propose Fixes
Submit fixes in format: `row:<row_number>,col:<column_name>,fix:<corrected_value>`
The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.
Both phases can be submitted in the same step or across multiple steps.
**Supported Issue Types:**
| Type | Description | Example |
|------|-------------|---------|
| `missing_value` | Null, empty, or whitespace-only | Empty name field |
| `wrong_type` | Value doesn't match expected type | Salary as "seventy-five thousand" |
| `duplicate_row` | Exact duplicate or duplicate key | Two rows with same employee_id |
| `out_of_range` | Value outside valid range | Salary of 5000 when min is 50000 |
| `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
| `inconsistent_value` | Computed field mismatch, logical inconsistency | total != qty * price |
| `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
| `referential_integrity` | Foreign key violation | (available for custom tasks) |
## Observation Space
| Field | Type | Description |
|-------|------|-------------|
| `dataset_csv` | str | The corrupted dataset in CSV format |
| `schema_description` | str | Column types, ranges, and constraints |
| `validation_rules` | str | Business rules the data must satisfy |
| `task_description` | str | Task context and instructions |
| `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
| `num_issues_hint` | int | Exact count of planted issues |
| `max_steps` | int | Maximum attempts allowed |
| `done` | bool | Whether episode has terminated |
| `reward` | float | Best combined reward so far (strict 0-1 range) |
**Observation Metadata** (per step):
- Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
- Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
- Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`
## Reward Function
### Combined Reward
```
combined_reward = 0.6 * identify_score + 0.4 * fix_score
```
If no fixes are submitted, `combined_reward = identify_score` (no penalty β backward compatible).
### Identify Score (Difficulty-Weighted F1)
Each planted issue has a **difficulty weight** (1.0-3.0):
| Weight | Category | Examples |
|--------|----------|----------|
| 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
| 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
| 2.5-3.0 | Hard | Data leakage, statistical outliers, hallucinated citations |
- **Weighted Recall** = (difficulty of found issues) / (total difficulty)
- **Weighted Precision** = penalizes false positives proportional to average difficulty
- **Weighted F1** = harmonic mean
### Fix Score (Tiered Grading by Issue Type)
Each proposed fix is graded with tiered scoring that gives partial credit for reasonable attempts:
| Fix Quality | Score | Description |
|-------------|-------|-------------|
| Exact match | 1.0 | Case-insensitive, whitespace-stripped match with clean value |
| Valid fix | 0.8 | Right type/range, addresses the issue (e.g., any non-empty value for missing field) |
| Partially valid | 0.4 | Reasonable attempt, right direction (e.g., numeric in right ballpark) |
| Right cell, wrong value | 0.1 | Targets correct cell but fix doesn't address the issue |
| Non-issue cell | 0.0 | Fix targets a cell with no issue |
Fix score = (sum of best fix score per issue x difficulty weight) / (total difficulty weight)
### Reward Properties
| Property | Detail |
|----------|--------|
| Range | Strict (0, 1) β 0.001 minimum, 0.999 maximum |
| Partial credit | Yes β per-issue, difficulty-weighted |
| Monotonic | Best score across all steps is final reward |
| Penalizes guessing | False positives reduce precision, fixing non-issues scores 0 |
| Multi-step improvement | Detailed feedback enables learning across attempts |
### Episode Boundaries
- Each task allows up to 3 steps (attempts)
- Episode ends when F1 >= 0.999 (perfect identification) or max steps reached
- Agent receives detailed feedback after each step to improve on next attempt
## Extensibility
### Custom Contamination Rules
```python
from dataqa_env import register_contamination_rule
from dataqa_env.server.tasks import PlantedIssue
def swap_digits(rows, header, col_idx, row_idx, rng):
val = rows[row_idx][col_idx]
corrupted = val[::-1]
issue = PlantedIssue(
row=row_idx + 1, col=header[col_idx],
issue_type="format_violation",
description=f"Digits swapped in {header[col_idx]}",
difficulty=2.0,
)
return corrupted, issue
register_contamination_rule("swap_digits", swap_digits)
```
### Custom Tasks from Config
```python
from dataqa_env import create_task_from_config, register_task
task = create_task_from_config(
task_id="custom",
name="Custom Validation",
description="Find quality issues in this dataset.",
schema_description="id: int, name: str, score: int (0-100)",
validation_rules="No missing values. Scores must be 0-100.",
clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
contaminations=[
{"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
{"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
],
)
register_task("custom", lambda seed: task)
```
### Built-in Contamination Rules
| Rule | Effect | Default Difficulty |
|------|--------|--------------------|
| `missing_value` | Sets field to empty string | 1.0 |
| `whitespace_value` | Sets field to single space | 2.5 |
| `wrong_type_text` | Replaces with random text | 1.0 |
| `negative_value` | Negates numeric value | 1.0 |
## Setup & Quick Start
```bash
# Install
pip install -e .
# Run server locally
uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
# Run inference (set your API credentials)
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=your-token \
python inference.py
```
## Docker
```bash
docker build -t dataqa-env .
docker run -p 8000:8000 dataqa-env
```
## Testing
```bash
pip install -e ".[dev]"
pytest tests/ -v
```
128 tests covering:
- Task creation, corruption, and difficulty weights for all 7 tasks
- Issue key and fix parsing (standard, lenient, edge cases)
- F1, weighted reward, and fix quality computation
- Full environment lifecycle (identify-only and identify+fix)
- Combined reward calculation and weight verification
- Inference script parsing and prompt building
- Structured log format ([START], [STEP], [END])
- Score bounds (strict 0-1), best-score monotonicity
- Extensibility API (custom rules, custom tasks)
- Moderation task determinism and label consistency
## Validation
```bash
# OpenEnv spec validation
openenv validate .
# Pre-submission validation (requires HF Space URL)
./prevalidation_script.sh https://your-space.hf.space
```
## Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | HuggingFace token / API key | - |
| `ENV_URL` | Environment server URL | `http://localhost:8000` |
## Architecture
```
dataqa_env/
βββ __init__.py # Public API + extensibility exports
βββ models.py # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
βββ client.py # EnvClient for WebSocket connections
βββ server/
β βββ environment.py # Two-phase DataQAEnvironment (identify + fix + combined reward)
β βββ tasks.py # 7 task definitions + contamination rules + extensibility API
β βββ gradio_ui.py # Interactive web UI with agent trajectory replay
β βββ app.py # FastAPI server (via openenv-core create_app)
β βββ Dockerfile
tests/
βββ test_tasks.py # Task creation, corruption, difficulty weights (all 7 tasks)
βββ test_environment.py # Identify scoring, fix grading, combined reward, lifecycle
βββ test_inference.py # LLM response parsing, fix parsing, prompt building, log format
βββ test_extensibility.py # Custom rules, custom tasks, registration API
inference.py # Two-phase baseline agent (identify then fix)
openenv.yaml # OpenEnv/HF Spaces spec
pyproject.toml # Package metadata and dependencies
Dockerfile # Production container
```
### Key Modules
**`dataqa_env/server/tasks.py`** β The core of the environment. Each task function (`create_task_easy`, `create_task_coding`, etc.) builds a clean CSV dataset, injects corruptions as `PlantedIssue` objects with row/col/type/difficulty, and returns a `Task` dataclass. The `TASK_REGISTRY` dict maps task IDs to factory functions. The extensibility API (`register_task`, `register_contamination_rule`, `create_task_from_config`) allows users to add domains without modifying source.
**`dataqa_env/server/environment.py`** β The `DataQAEnvironment` class inherits from OpenEnv's `Environment` base. `reset()` loads a task by ID and returns the corrupted CSV + schema. `step()` parses issue keys and fix proposals from the action, computes difficulty-weighted F1 for identification, grades fixes with tiered scoring by issue type, and returns combined reward with detailed feedback. Handles HTTP statelessness via auto-reset from `action.task_id`.
**`dataqa_env/models.py`** β Pydantic models for the OpenEnv interface. `DataQAAction` carries `issues: List[str]`, `fixes: List[str]`, and `task_id: str`. `DataQAObservation` carries the CSV, schema, rules, feedback, and scoring metadata. `DataQAState` tracks episode progress.
**`inference.py`** β Baseline LLM agent using OpenAI-compatible API. Runs all 7 tasks sequentially with 3 steps each. Lenient regex parsing handles case variations and delimiter differences in LLM output. Structured logging in `[START]/[STEP]/[END]` format for evaluation.
|