Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -12,340 +12,166 @@ tags:
|
|
| 12 |
|
| 13 |
# DataQA Environment
|
| 14 |
|
| 15 |
-
A two-phase OpenEnv RL environment for
|
| 16 |
|
| 17 |
-
##
|
| 18 |
|
| 19 |
-
|
| 20 |
-
EASY TASK (Step 2): All 6 issues found + 5 fixes proposed
|
| 21 |
-
Reward: 0.87 | Identify: 1.00 | Fix: 0.67
|
| 22 |
-
✓ row:4 name: empty → "David Kim"
|
| 23 |
-
✓ row:7 salary: "seventy-five thousand" → "75000"
|
| 24 |
-
✓ row:9 salary: "5000" → "73000"
|
| 25 |
-
✓ row:15 email: mismatch → "oscar.rivera@company.com"
|
| 26 |
-
✓ row:18 start_date: "2027-06-15" → "2022-01-19"
|
| 27 |
-
✓ row:21 duplicate row detected
|
| 28 |
-
|
| 29 |
-
HARD TASK: ML experiment metadata
|
| 30 |
-
Step 1: Found 5/10, missed hard issues → Reward: 0.69
|
| 31 |
-
Step 2: Found 10/10 + 5 fixes proposed → Reward: 0.77
|
| 32 |
-
Issues requiring ML knowledge:
|
| 33 |
-
• val_loss < train_loss (data leakage signal)
|
| 34 |
-
• resnet18 using 42.5GB GPU (impossible)
|
| 35 |
-
• 350 epochs on ImageNet in 30 min (impossible)
|
| 36 |
-
• wav2vec2 at 98.5% accuracy (exceeds SOTA)
|
| 37 |
-
|
| 38 |
-
ALIGNMENT TASK: NVIDIA HelpSteer data (hardest)
|
| 39 |
-
Step 1: Found 7/12, missed subtle issues → Reward: 0.58
|
| 40 |
-
Step 2: Found 12/12 + 3 fixes proposed → Reward: 0.72
|
| 41 |
-
Issues requiring deep reasoning:
|
| 42 |
-
• Cerasus vs Prunus serrulata (wrong taxonomic name)
|
| 43 |
-
• $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
|
| 44 |
-
• "does NOT learn via backprop" then describes backprop (self-contradiction)
|
| 45 |
-
• Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
|
| 46 |
-
• "use bare except everywhere" rated helpfulness=3 (harmful advice)
|
| 47 |
-
• [SYSTEM] prompt leaked in response (pipeline contamination)
|
| 48 |
-
```
|
| 49 |
-
|
| 50 |
-
> The interactive replay UI with color-coded dataset visualization is available on the HF Space.
|
| 51 |
-
|
| 52 |
-
## Motivation
|
| 53 |
|
| 54 |
-
Every ML
|
| 55 |
|
| 56 |
-
|
| 57 |
-
1. **Identify**: systematically inspect corrupted data and pinpoint every planted issue
|
| 58 |
-
2. **Fix**: propose corrected values by reasoning about schema, constraints, and context
|
| 59 |
|
| 60 |
-
|
| 61 |
|
| 62 |
-
|
| 63 |
|
| 64 |
-
|
| 65 |
-
|----------|--------|-------------|
|
| 66 |
-
| `/reset` | POST | Start a new episode with a corrupted dataset |
|
| 67 |
-
| `/step` | POST | Submit identified issues + proposed fixes |
|
| 68 |
-
| `/state` | GET | Get current episode state |
|
| 69 |
-
| `/health` | GET | Health check |
|
| 70 |
|
| 71 |
-
|
| 72 |
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
| `easy` | 6 | Beginner | HR/Employee data (21 rows) | Nulls, wrong types, duplicates, out-of-range, email-name mismatch, future dates |
|
| 76 |
-
| `medium` | 8 | Intermediate | E-commerce orders (31 rows) | Inconsistent totals, invalid categories, duplicate keys, wrong date formats, invalid country codes, future-date deliveries |
|
| 77 |
-
| `hard` | 10 | Advanced | ML experiment metadata (31 rows) | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
|
| 78 |
-
| `alignment` | 12 | Expert | LLM alignment data (30 rows, NVIDIA HelpSteer) | See below |
|
| 79 |
-
| `moderation` | 10 | Expert | Content moderation (30 rows, OpenAI Moderation) | Mislabeled hate/violence, false positives on clean text, subset rule violations, label range errors |
|
| 80 |
-
|
| 81 |
-
**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.
|
| 82 |
-
|
| 83 |
-
### Alignment Task: LLM Training Data Quality (Expert)
|
| 84 |
-
|
| 85 |
-
Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)**: 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity, each on a 0-4 scale).
|
| 86 |
-
|
| 87 |
-
This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:
|
| 88 |
-
|
| 89 |
-
| Issue | Difficulty | Why It's Hard |
|
| 90 |
-
|---|---|---|
|
| 91 |
-
| Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym; sounds plausible, requires domain knowledge |
|
| 92 |
-
| Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
|
| 93 |
-
| Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion; trains confused models |
|
| 94 |
-
| Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics; most dangerous for training |
|
| 95 |
-
| Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
|
| 96 |
-
| Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
|
| 97 |
-
| Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
|
| 98 |
-
| Score inflation (helpfulness=4 for a 4-word answer) | 2.5 | Score-content mismatch requires understanding rating criteria |
|
| 99 |
-
| Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
|
| 100 |
-
| Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
|
| 101 |
-
| Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
|
| 102 |
-
| Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |
|
| 103 |
-
|
| 104 |
-
These issues are designed to challenge frontier models: they require factual recall, semantic reasoning, cross-row comparison, and an understanding of what makes training data harmful.
|
| 105 |
-
|
| 106 |
-
## Two-Phase Action Space
|
| 107 |
-
|
| 108 |
-
### Phase 1: Identify Issues
|
| 109 |
-
|
| 110 |
-
Submit issues in format: `row:<row_number>,col:<column_name>,issue:<issue_type>`
|
| 111 |
-
|
| 112 |
-
- `row_number`: 1-indexed data row position (after header)
|
| 113 |
-
- `column_name`: Exact column header name, lowercase
|
| 114 |
-
- `issue_type`: One of the supported types below
|
| 115 |
-
|
| 116 |
-
### Phase 2: Propose Fixes
|
| 117 |
-
|
| 118 |
-
Submit fixes in format: `row:<row_number>,col:<column_name>,fix:<corrected_value>`
|
| 119 |
-
|
| 120 |
-
The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.
|
| 121 |
-
|
| 122 |
-
Both phases can be submitted in the same step or across multiple steps.
|
| 123 |
-
|
| 124 |
-
**Supported Issue Types:**
|
| 125 |
-
|
| 126 |
-
| Type | Description | Example |
|
| 127 |
-
|------|-------------|---------|
|
| 128 |
-
| `missing_value` | Null, empty, or whitespace-only | Empty name field |
|
| 129 |
-
| `wrong_type` | Value doesn't match expected type | Salary as "seventy-five thousand" |
|
| 130 |
-
| `duplicate_row` | Exact duplicate or duplicate key | Two rows with same employee_id |
|
| 131 |
-
| `out_of_range` | Value outside valid range | Salary of 5000 when min is 50000 |
|
| 132 |
-
| `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
|
| 133 |
-
| `inconsistent_value` | Computed field mismatch, logical inconsistency | total != qty * price |
|
| 134 |
-
| `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
|
| 135 |
-
| `referential_integrity` | Foreign key violation | (available for custom tasks) |
|
| 136 |
-
|
| 137 |
-
## Observation Space
|
| 138 |
-
|
| 139 |
-
| Field | Type | Description |
|
| 140 |
-
|-------|------|-------------|
|
| 141 |
-
| `dataset_csv` | str | The corrupted dataset in CSV format |
|
| 142 |
-
| `schema_description` | str | Column types, ranges, and constraints |
|
| 143 |
-
| `validation_rules` | str | Business rules the data must satisfy |
|
| 144 |
-
| `task_description` | str | Task context and instructions |
|
| 145 |
-
| `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
|
| 146 |
-
| `num_issues_hint` | int | Exact count of planted issues |
|
| 147 |
-
| `max_steps` | int | Maximum attempts allowed |
|
| 148 |
-
| `done` | bool | Whether episode has terminated |
|
| 149 |
-
| `reward` | float | Best combined reward so far (0.0-1.0) |
|
| 150 |
-
|
| 151 |
-
**Observation Metadata** (per step):
|
| 152 |
-
- Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
|
| 153 |
-
- Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
|
| 154 |
-
- Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`
|
| 155 |
-
|
| 156 |
-
## Reward Function
|
| 157 |
-
|
| 158 |
-
### Combined Reward
|
| 159 |
|
| 160 |
```
|
| 161 |
combined_reward = 0.6 * identify_score + 0.4 * fix_score
|
| 162 |
```
|
| 163 |
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
### Identify Score (Difficulty-Weighted F1)
|
| 167 |
-
|
| 168 |
-
Each planted issue has a **difficulty weight** (1.0-3.0):
|
| 169 |
|
| 170 |
-
|
| 171 |
-
|--------|----------|----------|
|
| 172 |
-
| 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
|
| 173 |
-
| 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
|
| 174 |
-
| 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only |
|
| 175 |
|
| 176 |
-
-
|
| 177 |
-
- **Weighted Precision** = penalizes false positives proportional to average difficulty
|
| 178 |
-
- **Weighted F1** = harmonic mean
|
| 179 |
|
| 180 |
-
###
|
| 181 |
|
| 182 |
-
|
|
|
|
|
|
|
|
|
|
| 183 |
|
| 184 |
-
|
| 185 |
-
|-------------|-------|-------------|
|
| 186 |
-
| Exact match | 1.0 | Case-insensitive, whitespace-stripped match |
|
| 187 |
-
| Numeric close | 0.8 | Within 1% of correct numeric value |
|
| 188 |
-
| Correct cell | 0.1 | Right location, wrong value |
|
| 189 |
-
| Non-issue cell | 0.0 | Fix targets a cell with no issue |
|
| 190 |
|
| 191 |
-
|
| 192 |
|
| 193 |
-
|
|
|
|
|
|
|
| 194 |
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
|
|
|
|
|
|
| 200 |
|
| 201 |
-
|
| 202 |
|
| 203 |
-
-
|
| 204 |
-
- Episode ends when F1 >= 0.999 (perfect identification) or max steps reached
|
| 205 |
-
- Agent receives detailed feedback after each step to improve on next attempt
|
| 206 |
|
| 207 |
-
##
|
| 208 |
|
| 209 |
-
|
| 210 |
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
|
| 217 |
-
|
| 218 |
|
| 219 |
-
|
| 220 |
|
| 221 |
-
##
|
| 222 |
|
| 223 |
-
```
|
| 224 |
-
from dataqa_env import register_contamination_rule
|
| 225 |
-
from dataqa_env.server.tasks import PlantedIssue
|
| 226 |
-
|
| 227 |
-
def swap_digits(rows, header, col_idx, row_idx, rng):
|
| 228 |
-
val = rows[row_idx][col_idx]
|
| 229 |
-
corrupted = val[::-1]
|
| 230 |
-
issue = PlantedIssue(
|
| 231 |
-
row=row_idx + 1, col=header[col_idx],
|
| 232 |
-
issue_type="format_violation",
|
| 233 |
-
description=f"Digits swapped in {header[col_idx]}",
|
| 234 |
-
difficulty=2.0,
|
| 235 |
-
)
|
| 236 |
-
return corrupted, issue
|
| 237 |
-
|
| 238 |
-
register_contamination_rule("swap_digits", swap_digits)
|
| 239 |
-
```
|
| 240 |
|
| 241 |
-
|
| 242 |
|
| 243 |
-
|
| 244 |
-
from dataqa_env import create_task_from_config, register_task
|
| 245 |
|
| 246 |
-
|
| 247 |
-
task_id="custom",
|
| 248 |
-
name="Custom Validation",
|
| 249 |
-
description="Find quality issues in this dataset.",
|
| 250 |
-
schema_description="id: int, name: str, score: int (0-100)",
|
| 251 |
-
validation_rules="No missing values. Scores must be 0-100.",
|
| 252 |
-
clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
|
| 253 |
-
contaminations=[
|
| 254 |
-
{"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
|
| 255 |
-
{"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
|
| 256 |
-
],
|
| 257 |
-
)
|
| 258 |
-
register_task("custom", lambda seed: task)
|
| 259 |
-
```
|
| 260 |
|
| 261 |
-
|
| 262 |
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
|
| 270 |
-
##
|
| 271 |
|
| 272 |
```bash
|
| 273 |
-
# Install
|
| 274 |
pip install -e .
|
| 275 |
-
|
| 276 |
-
# Run server locally
|
| 277 |
uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
|
| 278 |
|
| 279 |
-
# Run
|
| 280 |
API_BASE_URL=https://router.huggingface.co/v1 \
|
| 281 |
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
|
| 282 |
HF_TOKEN=your-token \
|
| 283 |
python inference.py
|
| 284 |
```
|
| 285 |
|
| 286 |
-
## Docker
|
| 287 |
-
|
| 288 |
-
```bash
|
| 289 |
-
docker build -t dataqa-env .
|
| 290 |
-
docker run -p 8000:8000 dataqa-env
|
| 291 |
-
```
|
| 292 |
-
|
| 293 |
## Testing
|
| 294 |
|
|
|
|
|
|
|
| 295 |
```bash
|
| 296 |
pip install -e ".[dev]"
|
| 297 |
pytest tests/ -v
|
| 298 |
```
|
| 299 |
|
| 300 |
-
118 tests covering:
|
| 301 |
-
- Task creation, corruption, and difficulty weights
|
| 302 |
-
- Issue key and fix parsing (standard, lenient, edge cases)
|
| 303 |
-
- F1, weighted reward, and fix quality computation
|
| 304 |
-
- Full environment lifecycle (identify-only and identify+fix)
|
| 305 |
-
- Combined reward calculation and weight verification
|
| 306 |
-
- Inference script parsing and prompt building
|
| 307 |
-
- Structured log format ([START], [STEP], [END])
|
| 308 |
-
- Score bounds (0.0-1.0), best-score monotonicity
|
| 309 |
-
- Extensibility API (custom rules, custom tasks)
|
| 310 |
-
|
| 311 |
-
## Validation
|
| 312 |
-
|
| 313 |
-
```bash
|
| 314 |
-
# OpenEnv spec validation
|
| 315 |
-
openenv validate .
|
| 316 |
-
|
| 317 |
-
# Pre-submission validation (requires HF Space URL)
|
| 318 |
-
./prevalidation_script.sh https://your-space.hf.space
|
| 319 |
-
```
|
| 320 |
-
|
| 321 |
-
## Environment Variables
|
| 322 |
-
|
| 323 |
-
| Variable | Description | Default |
|
| 324 |
-
|----------|-------------|---------|
|
| 325 |
-
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
|
| 326 |
-
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
|
| 327 |
-
| `HF_TOKEN` | HuggingFace token / API key | - |
|
| 328 |
-
| `ENV_URL` | Environment server URL | `http://localhost:8000` |
|
| 329 |
-
|
| 330 |
## Architecture
|
| 331 |
|
| 332 |
```
|
| 333 |
dataqa_env/
|
| 334 |
-
├──
|
| 335 |
-
├── models.py  # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
|
| 336 |
-
├── client.py  # EnvClient for WebSocket connections
|
| 337 |
├── server/
|
| 338 |
-
│   ├── environment.py  # Two-phase
|
| 339 |
-
│   ├── tasks.py  #
|
| 340 |
-
│
|
| 341 |
-
│   └── Dockerfile
|
| 342 |
-
tests/
|
| 343 |
-
├── test_tasks.py  # Task creation, corruption, difficulty weights
|
| 344 |
-
├── test_environment.py  # Identify scoring, fix grading, combined reward, lifecycle
|
| 345 |
-
├── test_inference.py  # LLM response parsing, fix parsing, prompt building, log format
|
| 346 |
-
└── test_extensibility.py  # Custom rules, custom tasks, registration API
|
| 347 |
inference.py  # Two-phase baseline agent (identify → fix)
|
| 348 |
-
|
| 349 |
-
pyproject.toml # Package metadata and dependencies
|
| 350 |
-
Dockerfile # Production container
|
| 351 |
```
|
|
|
|
| 12 |
|
| 13 |
# DataQA Environment
|
| 14 |
|
| 15 |
+
**A two-phase OpenEnv RL environment for Data Quality Assurance**: an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
|
| 16 |
|
| 17 |
+
## Why DataQA? The Moat
|
| 18 |
|
| 19 |
+
### 1. Solves a Real, High-Frequency Problem
|
| 20 |
|
| 21 |
+
Every ML team burns hours on data quality (missing values, type mismatches, logical inconsistencies, subtle statistical anomalies) before data enters training pipelines or production databases. DataQA turns this universal pain point into a graded RL environment. Unlike synthetic toy problems, **these are the exact data bugs that corrupt production ML models.**
|
| 22 |
|
| 23 |
+
### 2. Seven Diverse Domains, One Unified Interface
|
|
|
|
|
|
|
| 24 |
|
| 25 |
+
| Task | Domain | Issues | What Makes It Hard |
|
| 26 |
+
|------|--------|--------|--------------------|
|
| 27 |
+
| `easy` | HR / Employee data | 6 | Missing values, typos, format errors |
|
| 28 |
+
| `medium` | E-commerce orders | 8 | Cross-column math (`total != qty * price`), OCR errors |
|
| 29 |
+
| `hard` | ML experiment metadata | 10 | Data leakage detection, impossible GPU specs, SOTA violations |
|
| 30 |
+
| `alignment` | LLM fine-tuning data (NVIDIA HelpSteer) | 12 | Hallucinated citations, self-contradictions, toxic content scored as helpful |
|
| 31 |
+
| `coding` | Code instruction-response pairs | 10 | Logic bugs in "correct" code, `eval()` injection, language mismatches |
|
| 32 |
+
| `toolcalling` | Function-calling schemas | 10 | Hallucinated parameters, missing required args, name mismatches |
|
| 33 |
+
| `moderation` | Content moderation labels | 10 | Mislabeled hate speech, false positives on clean text |
|
| 34 |
|
| 35 |
+
**66 total planted issues** spanning tabular data, free-text, code, JSON schemas, and safety labels. No other OpenEnv submission covers this breadth with a single coherent reward function.
|
| 36 |
|
| 37 |
+
### 3. Two-Phase Reward: Identify Then Fix
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 38 |
|
| 39 |
+
Most data QA environments only ask "is there a bug?" DataQA goes further:
|
| 40 |
|
| 41 |
+
- **Phase 1 (Identify):** Find all issues, graded by difficulty-weighted F1
|
| 42 |
+
- **Phase 2 (Fix):** Propose the correct value, graded against the clean original with tiered scoring (exact match = 1.0, valid fix = 0.8, partial = 0.4, right cell wrong value = 0.1)
|
| 43 |
|
| 44 |
```
|
| 45 |
combined_reward = 0.6 * identify_score + 0.4 * fix_score
|
| 46 |
```
|
| 47 |
|
| 48 |
+
This creates a richer learning signal than binary classification. An agent that finds 8/10 issues and fixes 5 of them correctly still earns meaningful partial credit, which suits GRPO/RLHF training.
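As a worked check of the formula, the minimal sketch below recomputes the easy-task reward from the trajectory in the earlier revision of this README (Identify 1.00, Fix 0.67). The function name `combined_reward` is illustrative and is not necessarily the name used in `environment.py`.

```python
def combined_reward(identify_score: float, fix_score: float) -> float:
    """Weighted sum from this README: 60% identification, 40% fixing."""
    return 0.6 * identify_score + 0.4 * fix_score


# Easy-task figures reported in the earlier trajectory: Identify 1.00, Fix 0.67.
easy_reward = combined_reward(identify_score=1.00, fix_score=0.67)
print(round(easy_reward, 2))  # 0.87, matching the reported reward
```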
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
|
| 50 |
+
### 4. Difficulty-Weighted Scoring Rewards Deeper Reasoning
|
|
|
|
|
|
|
|
|
|
|
|
|
| 51 |
|
| 52 |
+
Each planted issue has a difficulty weight (1.0-3.0). Finding a hallucinated citation (3.0) carries three times the weight of finding an empty field (1.0) in the identify score. This incentivizes agents to develop genuine reasoning capabilities rather than pattern-matching surface-level errors.
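One way such a difficulty-weighted F1 could be computed is sketched below. It assumes recall is the weight of found issues over the total planted weight, and that each false positive costs the average issue weight. The function name and the exact penalty are assumptions for illustration, not the environment's confirmed implementation.

```python
def weighted_f1(found_weights, missed_weights, false_positives):
    """Difficulty-weighted F1 (illustrative sketch, not the real grader).

    found_weights:   difficulty weights of correctly reported issues
    missed_weights:  difficulty weights of planted issues the agent missed
    false_positives: number of reported issues that were not planted
    """
    all_weights = list(found_weights) + list(missed_weights)
    found = sum(found_weights)
    total = sum(all_weights)
    if total == 0 or found == 0:
        return 0.0
    avg_weight = total / len(all_weights)
    recall = found / total
    # Assumption: each false positive is charged the average issue weight.
    precision = found / (found + false_positives * avg_weight)
    return 2 * precision * recall / (precision + recall)


# Two easy issues found, one hard issue missed.
print(round(weighted_f1([1.0, 1.0], [3.0], 0), 3))  # 0.571
# One hard issue found, two easy issues missed.
print(round(weighted_f1([3.0], [1.0, 1.0], 0), 3))  # 0.75
```

Under these assumptions, finding the single hard issue scores higher than finding both easy ones, which is the incentive described above.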
|
|
|
|
|
|
|
| 53 |
|
| 54 |
+
### 5. Multi-Step Feedback Loop
|
| 55 |
|
| 56 |
+
Agents get 3 attempts per task with detailed per-step feedback:
|
| 57 |
+
- Which issues were correct (true positives) vs wrong (false positives)
|
| 58 |
+
- Which issues were missed (false negatives) with difficulty hints
|
| 59 |
+
- Fix quality scores with reasons
|
| 60 |
|
| 61 |
+
This enables the agent to **learn from its mistakes within a single episode**, which acts as a natural curriculum.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 62 |
|
| 63 |
+
### 6. Fully Extensible
|
| 64 |
|
| 65 |
+
```python
|
| 66 |
+
# Add your own contamination rules
|
| 67 |
+
register_contamination_rule("swap_digits", my_swap_fn)
|
| 68 |
|
| 69 |
+
# Create tasks from any CSV
|
| 70 |
+
task = create_task_from_config(
|
| 71 |
+
task_id="custom", clean_csv="...",
|
| 72 |
+
contaminations=[{"rule": "missing_value", "row": 0, "col": 1}]
|
| 73 |
+
)
|
| 74 |
+
register_task("custom", lambda seed: task)
|
| 75 |
+
```
|
| 76 |
|
| 77 |
+
New domains can be added in minutes. The contamination engine is domain-agnostic.
|
| 78 |
|
| 79 |
+
---
|
|
|
|
|
|
|
| 80 |
|
| 81 |
+
## Demo: Agent Trajectory
|
| 82 |
|
| 83 |
+
```
|
| 84 |
+
HARD TASK: ML experiment metadata
|
| 85 |
+
Step 1: Found 5/10, missed hard issues → Reward: 0.69
|
| 86 |
+
Step 2: Found 10/10 + 5 fixes proposed → Reward: 0.77
|
| 87 |
+
Issues requiring ML knowledge:
|
| 88 |
+
• val_loss < train_loss (data leakage signal)
|
| 89 |
+
• resnet18 using 42.5GB GPU (impossible for 11M params)
|
| 90 |
+
• 350 epochs on ImageNet in 30 min (impossibly fast)
|
| 91 |
+
• wav2vec2 at 98.5% accuracy (exceeds SOTA)
|
| 92 |
|
| 93 |
+
ALIGNMENT TASK: NVIDIA HelpSteer data
|
| 94 |
+
Step 1: Found 7/12, missed subtle issues → Reward: 0.58
|
| 95 |
+
Step 2: Found 12/12 + 3 fixes proposed → Reward: 0.72
|
| 96 |
+
Issues requiring deep reasoning:
|
| 97 |
+
• Cerasus vs Prunus serrulata (wrong taxonomic name)
|
| 98 |
+
• $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
|
| 99 |
+
• Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
|
| 100 |
+
• Gender-biased advice rated helpfulness=4 (toxic content with inflated scores)
|
| 101 |
+
|
| 102 |
+
CODING TASK: Code instruction-response pairs
|
| 103 |
+
Issues requiring code understanding:
|
| 104 |
+
• Binary search off-by-one (lo=mid causes infinite loop) marked correct
|
| 105 |
+
• eval(uid) in Flask route (code injection vulnerability)
|
| 106 |
+
• JavaScript response for a Python-labeled task
|
| 107 |
+
• Duplicate "merge sort" instruction across rows
|
| 108 |
+
```
|
| 109 |
|
| 110 |
+
## Environment API
|
| 111 |
|
| 112 |
+
| Endpoint | Method | Description |
|
| 113 |
+
|----------|--------|-------------|
|
| 114 |
+
| `/reset` | POST | Start a new episode with `{"task_id": "easy"}` |
|
| 115 |
+
| `/step` | POST | Submit identified issues + proposed fixes |
|
| 116 |
+
| `/state` | GET | Get current episode state |
|
| 117 |
+
| `/health` | GET | Health check |
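A minimal sketch of building the `/reset` call with only the standard library is shown below. The base URL `http://localhost:8000` is an assumption that matches the `uvicorn` command in this README, and the helper name `build_reset_request` is hypothetical. The `/step` payload schema is not shown in this README, so it is left out.

```python
import json
import urllib.request


def build_reset_request(base_url: str, task_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /reset request with a JSON body."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_reset_request("http://localhost:8000", "easy")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/reset
```

With the server running, passing `req` to `urllib.request.urlopen` would send it; the shape of the response body depends on the server and is not assumed here.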
|
| 118 |
|
| 119 |
+
## Action Format
|
| 120 |
|
| 121 |
+
**Identify:** `row:<N>,col:<column>,issue:<type>` where type is one of: `missing_value`, `wrong_type`, `duplicate_row`, `out_of_range`, `format_violation`, `inconsistent_value`, `statistical_outlier`
|
| 122 |
|
| 123 |
+
**Fix:** `row:<N>,col:<column>,fix:<corrected_value>`
|
| 124 |
|
| 125 |
+
Both can be submitted in the same step or across multiple steps (3 steps max).
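The two line formats above can be parsed with a few lines of Python. This is a sketch under stated assumptions: the helper name `parse_action` is hypothetical, and the real parser in the repository may be more lenient (the earlier README mentions lenient parsing in its tests).

```python
ISSUE_TYPES = {
    "missing_value", "wrong_type", "duplicate_row", "out_of_range",
    "format_violation", "inconsistent_value", "statistical_outlier",
}


def parse_action(line: str, kind: str) -> dict:
    """Parse 'row:<N>,col:<column>,<kind>:<value>', kind being 'issue' or 'fix'."""
    # Split at most twice so a fix value may itself contain commas.
    parts = line.strip().split(",", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated fields: {line!r}")
    parsed = {}
    for part, key in zip(parts, ("row", "col", kind)):
        name, sep, value = part.partition(":")
        if not sep or name.strip() != key:
            raise ValueError(f"expected '{key}:<value>' but got {part!r}")
        parsed[key] = value.strip()
    parsed["row"] = int(parsed["row"])
    if kind == "issue" and parsed["issue"] not in ISSUE_TYPES:
        raise ValueError(f"unsupported issue type: {parsed['issue']!r}")
    return parsed


print(parse_action("row:4,col:name,issue:missing_value", "issue"))
print(parse_action("row:18,col:start_date,fix:2022-01-19", "fix"))
```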
|
|
|
|
| 126 |
|
| 127 |
+
## Reward Design
|
| 128 |
|
| 129 |
+
| Property | Detail |
|
| 130 |
+
|----------|--------|
|
| 131 |
+
| Range | Strictly inside (0, 1): 0.001 minimum, 0.999 maximum |
|
| 132 |
+
| Partial credit | Yes: per-issue, difficulty-weighted |
|
| 133 |
+
| Monotonic | Best score across all steps is final reward |
|
| 134 |
+
| Penalizes guessing | False positives reduce precision, fixing non-issues scores 0 |
|
| 135 |
+
| Multi-step improvement | Detailed feedback enables learning across attempts |
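The range and monotonicity rows can be illustrated with a small tracker. This is a sketch only: the class name `BestRewardTracker` is hypothetical, and clamping before taking the running maximum is an assumption about the order of operations.

```python
class BestRewardTracker:
    """Keep the best clamped reward seen across the steps of one episode."""

    def __init__(self, low: float = 0.001, high: float = 0.999):
        self.low = low
        self.high = high
        self.best = low

    def update(self, raw_reward: float) -> float:
        # Clamp into [low, high], then keep the running maximum so the
        # reported reward never decreases within an episode.
        clamped = min(self.high, max(self.low, raw_reward))
        self.best = max(self.best, clamped)
        return self.best


tracker = BestRewardTracker()
print(tracker.update(0.69))  # 0.69
print(tracker.update(0.77))  # 0.77
print(tracker.update(0.50))  # 0.77 (a worse step does not lower the reward)
```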
|
| 136 |
|
| 137 |
+
**Fix grading tiers** (by issue type):
|
| 138 |
+
- Exact match with clean value → 1.0
|
| 139 |
+
- Valid fix: right type/range, addresses the issue → 0.8
|
| 140 |
+
- Partially valid: reasonable attempt, right direction → 0.4
|
| 141 |
+
- Right cell, wrong value → 0.1
|
| 142 |
+
- Non-issue cell → 0.0
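A simplified grader for these tiers might look like the sketch below. It covers only the tiers whose criteria are stated somewhere in this README: exact match (case-insensitive, whitespace-stripped, as in the earlier revision), a numeric value within 1% standing in for the 0.8 "valid fix" tier (also from the earlier revision), right cell with wrong value, and non-issue cell. The 0.4 tier is omitted because its criteria depend on the issue type and are not specified here. The function name `grade_fix` is hypothetical.

```python
def grade_fix(proposed: str, clean: str, is_issue_cell: bool) -> float:
    """Score one proposed fix against the clean value (illustrative sketch)."""
    if not is_issue_cell:
        return 0.0  # fix targets a cell with no planted issue
    if proposed.strip().lower() == clean.strip().lower():
        return 1.0  # exact match, ignoring case and surrounding whitespace
    try:
        proposed_num, clean_num = float(proposed), float(clean)
    except ValueError:
        return 0.1  # right cell, wrong non-numeric value
    # Assumption: "within 1%" is measured relative to the clean value.
    if clean_num != 0 and abs(proposed_num - clean_num) / abs(clean_num) <= 0.01:
        return 0.8
    return 0.1  # right cell, wrong numeric value


print(grade_fix(" David Kim ", "david kim", True))  # 1.0
print(grade_fix("73500", "73000", True))            # 0.8
print(grade_fix("5000", "73000", True))             # 0.1
```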
|
| 143 |
|
| 144 |
+
## Quick Start
|
| 145 |
|
| 146 |
```bash
|
|
|
|
| 147 |
pip install -e .
|
|
|
|
|
|
|
| 148 |
uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
|
| 149 |
|
| 150 |
+
# Run baseline agent
|
| 151 |
API_BASE_URL=https://router.huggingface.co/v1 \
|
| 152 |
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
|
| 153 |
HF_TOKEN=your-token \
|
| 154 |
python inference.py
|
| 155 |
```
|
| 156 |
|
| 157 |
## Testing
|
| 158 |
|
| 159 |
+
128 tests covering task creation, reward computation, fix grading, environment lifecycle, inference parsing, and extensibility API.
|
| 160 |
+
|
| 161 |
```bash
|
| 162 |
pip install -e ".[dev]"
|
| 163 |
pytest tests/ -v
|
| 164 |
```
|
| 165 |
|
| 166 |
## Architecture
|
| 167 |
|
| 168 |
```
|
| 169 |
dataqa_env/
|
| 170 |
+
├── models.py  # DataQAAction (issues + fixes), DataQAObservation
|
|
|
|
|
|
|
| 171 |
├── server/
|
| 172 |
+
│   ├── environment.py  # Two-phase grading engine (identify + fix + combined reward)
|
| 173 |
+
│   ├── tasks.py  # 7 task definitions + contamination rules + extensibility API
|
| 174 |
+
│   └── app.py  # FastAPI server (via openenv-core)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 175 |
inference.py  # Two-phase baseline agent (identify → fix)
|
| 176 |
+
tests/ # 128 tests
|
|
|
|
|
|
|
| 177 |
```
|