Spaces: varb15/dataqa-env


varb15 committed on Apr 8

Commit

6c1b2ac


1 Parent(s): c5b540e

Upload README.md with huggingface_hub


Files changed (1)

README.md +101 -275


@@ -12,340 +12,166 @@ tags:
 # DataQA Environment
-A two-phase OpenEnv RL environment for **Data Quality Assurance** — an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
-### Demo: Agent Trajectory Replay
-```
-EASY TASK (Step 2) — All 6 issues found + 5 fixes proposed
-  Reward: 0.87 | Identify: 1.00 | Fix: 0.67
-  ✓ row:4  name: empty → "David Kim"
-  ✓ row:7  salary: "seventy-five thousand" → "75000"
-  ✓ row:9  salary: "5000" → "73000"
-  ✓ row:15 email: mismatch → "oscar.rivera@company.com"
-  ✓ row:18 start_date: "2027-06-15" → "2022-01-19"
-  ✓ row:21 duplicate row detected
-HARD TASK — ML experiment metadata
-  Step 1: Found 5/10, missed hard issues    → Reward: 0.69
-  Step 2: Found 10/10 + 5 fixes proposed   → Reward: 0.77
-  Issues requiring ML knowledge:
-    • val_loss < train_loss (data leakage signal)
-    • resnet18 using 42.5GB GPU (impossible)
-    • 350 epochs on ImageNet in 30 min (impossible)
-    • wav2vec2 at 98.5% accuracy (exceeds SOTA)
-ALIGNMENT TASK — NVIDIA HelpSteer data (hardest)
-  Step 1: Found 7/12, missed subtle issues  → Reward: 0.58
-  Step 2: Found 12/12 + 3 fixes proposed   → Reward: 0.72
-  Issues requiring deep reasoning:
-    • Cerasus vs Prunus serrulata (wrong taxonomic name)
-    • $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
-    • "does NOT learn via backprop" then describes backprop (self-contradiction)
-    • Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
-    • "use bare except everywhere" rated helpfulness=3 (harmful advice)
-    • [SYSTEM] prompt leaked in response (pipeline contamination)
-```
-> The interactive replay UI with color-coded dataset visualization is available on the HF Space.
-## Motivation
-Every ML engineer and data scientist spends significant time debugging data quality issues — missing values, type mismatches, logical inconsistencies, and subtle statistical anomalies — before data enters ML pipelines or production databases. This is a genuine, high-frequency human task that directly impacts model quality and business outcomes.
-DataQA turns this into a **two-phase RL challenge**:
-1. **Identify** — systematically inspect corrupted data and pinpoint every planted issue
-2. **Fix** — propose corrected values by reasoning about schema, constraints, and context
-This creates a rich multi-step decision problem where agents must explore datasets strategically, distinguish subtle anomalies from noise, and reason about what the correct data should be.
-## Environment API
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/reset` | POST | Start a new episode with a corrupted dataset |
-| `/step` | POST | Submit identified issues + proposed fixes |
-| `/state` | GET | Get current episode state |
-| `/health` | GET | Health check |
-## Tasks
-| Task | Issues | Difficulty | Domain | Description |
-|------|--------|-----------|--------|-------------|
-| `easy` | 6 | Beginner | HR/Employee data (21 rows) | Nulls, wrong types, duplicates, out-of-range, email-name mismatch, future dates |
-| `medium` | 8 | Intermediate | E-commerce orders (31 rows) | Inconsistent totals, invalid categories, duplicate keys, wrong date formats, invalid country codes, future-date deliveries |
-| `hard` | 10 | Advanced | ML experiment metadata (31 rows) | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
-| `alignment` | 12 | Expert | LLM alignment data (30 rows, NVIDIA HelpSteer) | See below |
-| `moderation` | 10 | Expert | Content moderation (30 rows, OpenAI Moderation) | Mislabeled hate/violence, false positives on clean text, subset rule violations, label range errors |
-**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.
-### Alignment Task: LLM Training Data Quality (Expert)
-Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)** — 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity on 0-4 scale).
-This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:
-| Issue | Difficulty | Why It's Hard |
-|---|---|---|
-| Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym — sounds plausible, requires domain knowledge |
-| Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
-| Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion — trains confused models |
-| Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics — most dangerous for training |
-| Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
-| Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
-| Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
-| Score inflation (helpfulness=4 for a 4-word answer) | 2.5 | Score-content mismatch requires understanding rating criteria |
-| Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
-| Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
-| Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
-| Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |
-These issues are designed to challenge frontier models — they require factual recall, semantic reasoning, cross-row comparison, and understanding of what makes training data harmful.
-## Two-Phase Action Space
-### Phase 1: Identify Issues
-Submit issues in format: `row:<row_number>,col:<column_name>,issue:<issue_type>`
-- `row_number`: 1-indexed data row position (after header)
-- `column_name`: Exact column header name, lowercase
-- `issue_type`: One of the supported types below
-### Phase 2: Propose Fixes
-Submit fixes in format: `row:<row_number>,col:<column_name>,fix:<corrected_value>`
-The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.
-Both phases can be submitted in the same step or across multiple steps.
-**Supported Issue Types:**
-| Type | Description | Example |
-|------|-------------|---------|
-| `missing_value` | Null, empty, or whitespace-only | Empty name field |
-| `wrong_type` | Value doesn't match expected type | Salary as "seventy-five thousand" |
-| `duplicate_row` | Exact duplicate or duplicate key | Two rows with same employee_id |
-| `out_of_range` | Value outside valid range | Salary of 5000 when min is 50000 |
-| `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
-| `inconsistent_value` | Computed field mismatch, logical inconsistency | total != qty * price |
-| `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
-| `referential_integrity` | Foreign key violation | (available for custom tasks) |
-## Observation Space
-| Field | Type | Description |
-|-------|------|-------------|
-| `dataset_csv` | str | The corrupted dataset in CSV format |
-| `schema_description` | str | Column types, ranges, and constraints |
-| `validation_rules` | str | Business rules the data must satisfy |
-| `task_description` | str | Task context and instructions |
-| `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
-| `num_issues_hint` | int | Exact count of planted issues |
-| `max_steps` | int | Maximum attempts allowed |
-| `done` | bool | Whether episode has terminated |
-| `reward` | float | Best combined reward so far (0.0-1.0) |
-**Observation Metadata** (per step):
-- Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
-- Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
-- Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`
-## Reward Function
-### Combined Reward
 ```
 combined_reward = 0.6 * identify_score + 0.4 * fix_score
 ```
-If no fixes are submitted, `combined_reward = identify_score` (no penalty — backward compatible).
-### Identify Score (Difficulty-Weighted F1)
-Each planted issue has a **difficulty weight** (1.0-3.0):
-| Weight | Category | Examples |
-|--------|----------|----------|
-| 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
-| 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
-| 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only |
-- **Weighted Recall** = (difficulty of found issues) / (total difficulty)
-- **Weighted Precision** = penalizes false positives proportional to average difficulty
-- **Weighted F1** = harmonic mean
-### Fix Score (Difficulty-Weighted Quality)
-Each proposed fix is compared against the original clean value:
-| Fix Quality | Score | Description |
-|-------------|-------|-------------|
-| Exact match | 1.0 | Case-insensitive, whitespace-stripped match |
-| Numeric close | 0.8 | Within 1% of correct numeric value |
-| Correct cell | 0.1 | Right location, wrong value |
-| Non-issue cell | 0.0 | Fix targets a cell with no issue |
-Fix score = (sum of best fix score per issue × difficulty weight) / (total difficulty weight)
-### Reward Properties
-- **Per-step partial progress**: reward increases as more issues are found/fixed
-- **Difficulty-aware**: finding subtle issues earns more than obvious ones
-- **Penalizes bad behavior**: false positives reduce score, fixing non-issues earns nothing
-- **Monotonically non-decreasing**: best score across all steps is the final reward
-- **Always in [0.0, 1.0]**: meets hackathon requirement
-### Episode Boundaries
-- Each task allows up to 3 steps (attempts)
-- Episode ends when F1 >= 0.999 (perfect identification) or max steps reached
-- Agent receives detailed feedback after each step to improve on next attempt
-## Baseline Scores
-Baseline agent uses Qwen2.5-72B-Instruct via HuggingFace Router:
-| Task | Identify Score | Fix Score | Combined | Notes |
-|------|---------------|-----------|----------|-------|
-| `easy` | 0.7-1.0 | 0.5-0.9 | 0.6-1.0 | Most LLMs find obvious issues reliably |
-| `medium` | 0.5-0.8 | 0.3-0.6 | 0.4-0.7 | Cross-column reasoning challenges models |
-| `hard` | 0.3-0.6 | 0.2-0.4 | 0.3-0.5 | ML domain knowledge and subtle patterns |
-Scores vary by model. The hard task is designed to challenge frontier models.
-## Extensibility
-### Custom Contamination Rules
-```python
-from dataqa_env import register_contamination_rule
-from dataqa_env.server.tasks import PlantedIssue
-def swap_digits(rows, header, col_idx, row_idx, rng):
-    val = rows[row_idx][col_idx]
-    corrupted = val[::-1]
-    issue = PlantedIssue(
-        row=row_idx + 1, col=header[col_idx],
-        issue_type="format_violation",
-        description=f"Digits swapped in {header[col_idx]}",
-        difficulty=2.0,
-    )
-    return corrupted, issue
-register_contamination_rule("swap_digits", swap_digits)
-```
-### Custom Tasks from Config
-```python
-from dataqa_env import create_task_from_config, register_task
-task = create_task_from_config(
-    task_id="custom",
-    name="Custom Validation",
-    description="Find quality issues in this dataset.",
-    schema_description="id: int, name: str, score: int (0-100)",
-    validation_rules="No missing values. Scores must be 0-100.",
-    clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
-    contaminations=[
-        {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
-        {"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
-    ],
-)
-register_task("custom", lambda seed: task)
-```
-### Built-in Contamination Rules
-| Rule | Effect | Default Difficulty |
-|------|--------|--------------------|
-| `missing_value` | Sets field to empty string | 1.0 |
-| `whitespace_value` | Sets field to single space | 2.5 |
-| `wrong_type_text` | Replaces with random text | 1.0 |
-| `negative_value` | Negates numeric value | 1.0 |
-## Setup & Quick Start
 ```bash
-# Install
 pip install -e .
-# Run server locally
 uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
-# Run inference (set your API credentials)
 API_BASE_URL=https://router.huggingface.co/v1 \
 MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
 HF_TOKEN=your-token \
 python inference.py
 ```
-## Docker
-```bash
-docker build -t dataqa-env .
-docker run -p 8000:8000 dataqa-env
-```
 ## Testing
 ```bash
 pip install -e ".[dev]"
 pytest tests/ -v
 ```
-118 tests covering:
-- Task creation, corruption, and difficulty weights
-- Issue key and fix parsing (standard, lenient, edge cases)
-- F1, weighted reward, and fix quality computation
-- Full environment lifecycle (identify-only and identify+fix)
-- Combined reward calculation and weight verification
-- Inference script parsing and prompt building
-- Structured log format ([START], [STEP], [END])
-- Score bounds (0.0-1.0), best-score monotonicity
-- Extensibility API (custom rules, custom tasks)
-## Validation
-```bash
-# OpenEnv spec validation
-openenv validate .
-# Pre-submission validation (requires HF Space URL)
-./prevalidation_script.sh https://your-space.hf.space
-```
-## Environment Variables
-| Variable | Description | Default |
-|----------|-------------|---------|
-| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
-| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
-| `HF_TOKEN` | HuggingFace token / API key | - |
-| `ENV_URL` | Environment server URL | `http://localhost:8000` |
 ## Architecture
 ```
 dataqa_env/
-├── __init__.py            # Public API + extensibility exports
-├── models.py              # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
-├── client.py              # EnvClient for WebSocket connections
 ├── server/
-│   ├── environment.py     # Two-phase DataQAEnvironment (identify + fix + combined reward)
-│   ├── tasks.py           # Task definitions + contamination rules + extensibility API
-│   ├── app.py             # FastAPI server (via openenv-core create_app)
-│   └── Dockerfile
-tests/
-├── test_tasks.py          # Task creation, corruption, difficulty weights
-├── test_environment.py    # Identify scoring, fix grading, combined reward, lifecycle
-├── test_inference.py      # LLM response parsing, fix parsing, prompt building, log format
-└── test_extensibility.py  # Custom rules, custom tasks, registration API
 inference.py               # Two-phase baseline agent (identify → fix)
-openenv.yaml               # OpenEnv/HF Spaces spec
-pyproject.toml             # Package metadata and dependencies
-Dockerfile                 # Production container
 ```

 # DataQA Environment
+**A two-phase OpenEnv RL environment for Data Quality Assurance** — an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
+## Why DataQA? The Moat
+### 1. Solves a Real, High-Frequency Problem
+Every ML team burns hours on data quality — missing values, type mismatches, logical inconsistencies, subtle statistical anomalies — before data enters training pipelines or production databases. DataQA turns this universal pain point into a graded RL environment. Unlike synthetic toy problems, **these are the exact data bugs that corrupt production ML models.**
+### 2. Seven Diverse Domains, One Unified Interface
+| Task | Domain | Issues | What Makes It Hard |
+|------|--------|--------|--------------------|
+| `easy` | HR / Employee data | 6 | Missing values, typos, format errors |
+| `medium` | E-commerce orders | 8 | Cross-column math (`total != qty * price`), OCR errors |
+| `hard` | ML experiment metadata | 10 | Data leakage detection, impossible GPU specs, SOTA violations |
+| `alignment` | LLM fine-tuning data (NVIDIA HelpSteer) | 12 | Hallucinated citations, self-contradictions, toxic content scored as helpful |
+| `coding` | Code instruction-response pairs | 10 | Logic bugs in "correct" code, `eval()` injection, language mismatches |
+| `toolcalling` | Function-calling schemas | 10 | Hallucinated parameters, missing required args, name mismatches |
+| `moderation` | Content moderation labels | 10 | Mislabeled hate speech, false positives on clean text |
+**66 total planted issues** spanning tabular data, free-text, code, JSON schemas, and safety labels. No other OpenEnv submission covers this breadth with a single coherent reward function.
+### 3. Two-Phase Reward — Identify Then Fix
+Most data QA environments only ask "is there a bug?" DataQA goes further:
+- **Phase 1 (Identify):** Find all issues — graded by difficulty-weighted F1
+- **Phase 2 (Fix):** Propose the correct value — graded against the clean original with tiered scoring (exact match = 1.0, valid fix = 0.8, partial = 0.4, right cell wrong value = 0.1)
 ```
 combined_reward = 0.6 * identify_score + 0.4 * fix_score
 ```
+This creates a richer learning signal than binary classification. An agent that finds 8/10 issues and fixes 5 of them correctly gets meaningful partial credit — perfect for GRPO/RLHF training.
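As a rough illustration of the formula above, the following sketch blends two phase scores and clamps the result into the strict (0, 1) range listed under Reward Design. The function name and the example score values are hypothetical, and the clamp is assumed to be a plain min/max; the environment's actual code may differ.

```python
def combined_reward(identify_score: float, fix_score: float) -> float:
    """Blend the two phase scores with the 0.6 / 0.4 weights from the README formula."""
    raw = 0.6 * identify_score + 0.4 * fix_score
    # Assumption: the strict (0, 1) range is enforced with a simple clamp.
    return min(0.999, max(0.001, raw))


# Illustrative inputs only: an agent with identify_score 0.8 and fix_score 0.5.
example = combined_reward(identify_score=0.8, fix_score=0.5)
print(round(example, 3))  # 0.6 * 0.8 + 0.4 * 0.5 = 0.68
```

Under this reading, an agent that identifies well but fixes poorly still keeps most of its reward, since identification carries the larger weight.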
+### 4. Difficulty-Weighted Scoring Rewards Deeper Reasoning
+Each planted issue has a difficulty weight (1.0-3.0). Finding a hallucinated citation (3.0) earns triple the reward of finding an empty field (1.0). This incentivizes agents to develop genuine reasoning capabilities rather than pattern-matching surface-level errors.
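A minimal sketch of difficulty-weighted F1 may help make this concrete. Weighted recall follows the definition in the earlier README revision (difficulty of found issues divided by total difficulty). The precision term is an assumption: that revision only says false positives are penalized in proportion to average difficulty, so here each false positive is charged one average-difficulty unit. All names and the sample issue keys are hypothetical.

```python
from typing import Dict, Set


def weighted_recall(planted: Dict[str, float], reported: Set[str]) -> float:
    """Difficulty of the found issues divided by the total planted difficulty."""
    total = sum(planted.values())
    found = sum(w for key, w in planted.items() if key in reported)
    return found / total if total else 0.0


def weighted_f1(planted: Dict[str, float], reported: Set[str]) -> float:
    """Harmonic mean of weighted precision and weighted recall."""
    if not planted:
        return 0.0
    total = sum(planted.values())
    found = sum(w for key, w in planted.items() if key in reported)
    false_positives = len(reported - set(planted))
    # Assumption: each false positive costs one average-difficulty unit.
    avg_difficulty = total / len(planted)
    denom = found + false_positives * avg_difficulty
    precision = found / denom if denom else 0.0
    recall = found / total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical planted issues: two easy ones (1.0) and one hard one (3.0).
planted = {"row:4,col:name": 1.0, "row:7,col:salary": 1.0, "row:12,col:response": 3.0}
print(weighted_recall(planted, {"row:12,col:response"}))  # 3.0 / 5.0 = 0.6
print(weighted_recall(planted, {"row:4,col:name"}))       # 1.0 / 5.0 = 0.2
```

With these weights, finding only the hard issue recovers three times as much recall as finding only one easy issue, which is the incentive the paragraph describes.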
+### 5. Multi-Step Feedback Loop
+Agents get 3 attempts per task with detailed per-step feedback:
+- Which issues were correct (true positives) vs wrong (false positives)
+- Which issues were missed (false negatives) with difficulty hints
+- Fix quality scores with reasons
+This enables the agent to **learn from its mistakes within a single episode** — a natural curriculum.
+### 6. Fully Extensible
+```python
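+from dataqa_env import create_task_from_config, register_contamination_rule, register_task
+# Add your own contamination rules (my_swap_fn is a user-defined rule function)
+register_contamination_rule("swap_digits", my_swap_fn)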
+# Create tasks from any CSV
+task = create_task_from_config(
+    task_id="custom", clean_csv="...",
+    contaminations=[{"rule": "missing_value", "row": 0, "col": 1}]
+)
+register_task("custom", lambda seed: task)
+```
+New domains can be added in minutes. The contamination engine is domain-agnostic.
+---
+## Demo: Agent Trajectory
+```
+HARD TASK — ML experiment metadata
+  Step 1: Found 5/10, missed hard issues    → Reward: 0.69
+  Step 2: Found 10/10 + 5 fixes proposed   → Reward: 0.77
+  Issues requiring ML knowledge:
+    • val_loss < train_loss (data leakage signal)
+    • resnet18 using 42.5GB GPU (impossible for 11M params)
+    • 350 epochs on ImageNet in 30 min (impossibly fast)
+    • wav2vec2 at 98.5% accuracy (exceeds SOTA)
+ALIGNMENT TASK — NVIDIA HelpSteer data
+  Step 1: Found 7/12, missed subtle issues  → Reward: 0.58
+  Step 2: Found 12/12 + 3 fixes proposed   → Reward: 0.72
+  Issues requiring deep reasoning:
+    • Cerasus vs Prunus serrulata (wrong taxonomic name)
+    • $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
+    • Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
+    • Gender-biased advice rated helpfulness=4 (toxic content with inflated scores)
+CODING TASK — Code instruction-response pairs
+  Issues requiring code understanding:
+    • Binary search off-by-one (lo=mid causes infinite loop) marked correct
+    • eval(uid) in Flask route — code injection vulnerability
+    • JavaScript response for a Python-labeled task
+    • Duplicate "merge sort" instruction across rows
+```
+## Environment API
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/reset` | POST | Start a new episode with `{"task_id": "easy"}` |
+| `/step` | POST | Submit identified issues + proposed fixes |
+| `/state` | GET | Get current episode state |
+| `/health` | GET | Health check |
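The endpoints above could be exercised with a small standard-library client along these lines. The `/reset` body is taken from the table. The `/step` body is an assumption: the field names `issues` and `fixes` are inferred from the `DataQAAction (issues + fixes)` note in the Architecture section and may not match the real schema. The base URL assumes the local uvicorn server from Quick Start.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # assumed local server, as in Quick Start


def build_post(path: str, payload: dict) -> request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: request.Request) -> dict:
    """Send a request and decode the JSON response (needs a running server)."""
    with request.urlopen(req) as resp:
        return json.load(resp)


# Body documented in the endpoint table.
reset_req = build_post("/reset", {"task_id": "easy"})

# Hypothetical body: field names are an assumption, not a documented schema.
step_req = build_post(
    "/step",
    {
        "issues": ["row:4,col:name,issue:missing_value"],
        "fixes": ["row:4,col:name,fix:David Kim"],
    },
)
```

With the server running, `send(reset_req)` followed by `send(step_req)` would perform one reset and one step; check the actual request schema in `models.py` before relying on the `/step` payload shape.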
+## Action Format
+**Identify:** `row:<N>,col:<column>,issue:<type>` where type is one of: `missing_value`, `wrong_type`, `duplicate_row`, `out_of_range`, `format_violation`, `inconsistent_value`, `statistical_outlier`
+**Fix:** `row:<N>,col:<column>,fix:<corrected_value>`
+Both can be submitted in the same step or across multiple steps (3 steps max).
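The two line formats above can be produced and parsed with a few helper functions. This is only a sketch: the helper names are hypothetical, and the environment's own parser (described elsewhere as handling lenient and edge cases) is likely more forgiving than this one.

```python
ISSUE_TYPES = {
    "missing_value", "wrong_type", "duplicate_row", "out_of_range",
    "format_violation", "inconsistent_value", "statistical_outlier",
}


def format_issue(row: int, col: str, issue_type: str) -> str:
    """Render an Identify line, rejecting types outside the listed set."""
    if issue_type not in ISSUE_TYPES:
        raise ValueError(f"unsupported issue type: {issue_type}")
    return f"row:{row},col:{col},issue:{issue_type}"


def format_fix(row: int, col: str, value: str) -> str:
    """Render a Fix line carrying the proposed corrected value."""
    return f"row:{row},col:{col},fix:{value}"


def parse_action(line: str) -> dict:
    """Split an Identify or Fix line into its parts."""
    # Split at most twice so commas inside a fix value are preserved.
    row_part, col_part, rest = line.strip().split(",", 2)
    kind, _, value = rest.partition(":")
    return {
        "row": int(row_part.split(":", 1)[1]),
        "col": col_part.split(":", 1)[1],
        "kind": kind,    # "issue" or "fix"
        "value": value,  # issue type or corrected value
    }


print(format_issue(7, "salary", "wrong_type"))  # row:7,col:salary,issue:wrong_type
print(parse_action("row:15,col:email,fix:oscar.rivera@company.com"))
```

Splitting with a maximum of two cuts is a deliberate choice here: a corrected value such as a free-text response may itself contain commas.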
+## Reward Design
+| Property | Detail |
+|----------|--------|
+| Range | Strict (0, 1) — 0.001 minimum, 0.999 maximum |
+| Partial credit | Yes — per-issue, difficulty-weighted |
+| Monotonic | Best score across all steps is final reward |
+| Penalizes guessing | False positives reduce precision, fixing non-issues scores 0 |
+| Multi-step improvement | Detailed feedback enables learning across attempts |
+**Fix grading tiers** (by issue type):
+- Exact match with clean value → 1.0
+- Valid fix: right type/range, addresses the issue → 0.8
+- Partially valid: reasonable attempt, right direction → 0.4
+- Right cell, wrong value → 0.1
+- Non-issue cell → 0.0
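A partial sketch of the grading tiers follows. It covers only the tiers with unambiguous criteria: the exact-match rule (case-insensitive, whitespace-stripped) and the aggregation formula come from the earlier README revision shown in this diff, as does the "within 1%" numeric rule, which is used here as one assumed example of a 0.8 "valid fix". The 0.4 tier is omitted because its criteria are not spelled out. Function names are hypothetical.

```python
from typing import Dict, Optional


def grade_fix(proposed: str, clean: Optional[str]) -> float:
    """Score one proposed fix; clean is None when the cell had no planted issue."""
    if clean is None:
        return 0.0  # non-issue cell
    if proposed.strip().lower() == clean.strip().lower():
        return 1.0  # exact match with the clean value
    try:
        p, c = float(proposed), float(clean)
    except ValueError:
        return 0.1  # right cell, wrong (non-numeric) value
    # Assumption: a numeric value within 1% of the clean one counts as a valid fix.
    if c != 0 and abs(p - c) / abs(c) <= 0.01:
        return 0.8
    return 0.1  # right cell, wrong value


def fix_score(best_grades: Dict[str, float], difficulty: Dict[str, float]) -> float:
    """Difficulty-weighted average of the best grade per planted issue."""
    total = sum(difficulty.values())
    if not total:
        return 0.0
    return sum(best_grades.get(k, 0.0) * w for k, w in difficulty.items()) / total


print(grade_fix(" 75000 ", "75000"))    # 1.0 (exact after stripping)
print(grade_fix("73500", "73000"))      # 0.8 (about 0.7% off)
print(grade_fix("David", "David Kim"))  # 0.1 (right cell, wrong value)
```

Issues with no submitted fix contribute zero to `fix_score`, so skipping the hardest fixes costs proportionally more than skipping easy ones.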
+## Quick Start
 ```bash
 pip install -e .
 uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
+# Run baseline agent
 API_BASE_URL=https://router.huggingface.co/v1 \
 MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
 HF_TOKEN=your-token \
 python inference.py
 ```
 ## Testing
+128 tests covering task creation, reward computation, fix grading, environment lifecycle, inference parsing, and extensibility API.
 ```bash
 pip install -e ".[dev]"
 pytest tests/ -v
 ```
 ## Architecture
 ```
 dataqa_env/
+├── models.py              # DataQAAction (issues + fixes), DataQAObservation
 ├── server/
+│   ├── environment.py     # Two-phase grading engine (identify + fix + combined reward)
+│   ├── tasks.py           # 7 task definitions + contamination rules + extensibility API
+│   └── app.py             # FastAPI server (via openenv-core)
 inference.py               # Two-phase baseline agent (identify → fix)
+tests/                     # 128 tests
 ```