varb15 committed on
Commit 6c1b2ac · verified · 1 Parent(s): c5b540e

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +101 -275
README.md CHANGED
@@ -12,340 +12,166 @@ tags:
12
 
13
  # DataQA Environment
14
 
15
- A two-phase OpenEnv RL environment for **Data Quality Assurance**: an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
16
 
17
- ### Demo: Agent Trajectory Replay
18
 
19
- ```
- EASY TASK (Step 2): All 6 issues found + 5 fixes proposed
- Reward: 0.87 | Identify: 1.00 | Fix: 0.67
- ✓ row:4 name: empty → "David Kim"
- ✓ row:7 salary: "seventy-five thousand" → "75000"
- ✓ row:9 salary: "5000" → "73000"
- ✓ row:15 email: mismatch → "oscar.rivera@company.com"
- ✓ row:18 start_date: "2027-06-15" → "2022-01-19"
- ✓ row:21 duplicate row detected
-
- HARD TASK: ML experiment metadata
- Step 1: Found 5/10, missed hard issues → Reward: 0.69
- Step 2: Found 10/10 + 5 fixes proposed → Reward: 0.77
- Issues requiring ML knowledge:
- • val_loss < train_loss (data leakage signal)
- • resnet18 using 42.5GB GPU (impossible)
- • 350 epochs on ImageNet in 30 min (impossible)
- • wav2vec2 at 98.5% accuracy (exceeds SOTA)
-
- ALIGNMENT TASK: NVIDIA HelpSteer data (hardest)
- Step 1: Found 7/12, missed subtle issues → Reward: 0.58
- Step 2: Found 12/12 + 3 fixes proposed → Reward: 0.72
- Issues requiring deep reasoning:
- • Cerasus vs Prunus serrulata (wrong taxonomic name)
- • $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
- • "does NOT learn via backprop" then describes backprop (self-contradiction)
- • Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
- • "use bare except everywhere" rated helpfulness=3 (harmful advice)
- • [SYSTEM] prompt leaked in response (pipeline contamination)
- ```
49
-
50
- > The interactive replay UI with color-coded dataset visualization is available on the HF Space.
51
-
52
- ## Motivation
53
 
54
- Every ML engineer and data scientist spends significant time debugging data quality issues (missing values, type mismatches, logical inconsistencies, and subtle statistical anomalies) before data enters ML pipelines or production databases. This is a genuine, high-frequency human task that directly impacts model quality and business outcomes.
55
 
56
- DataQA turns this into a **two-phase RL challenge**:
57
- 1. **Identify**: systematically inspect corrupted data and pinpoint every planted issue
- 2. **Fix**: propose corrected values by reasoning about schema, constraints, and context
59
 
60
- This creates a rich multi-step decision problem where agents must explore datasets strategically, distinguish subtle anomalies from noise, and reason about what the correct data should be.
 
 
 
 
 
 
 
 
61
 
62
- ## Environment API
63
 
64
- | Endpoint | Method | Description |
65
- |----------|--------|-------------|
66
- | `/reset` | POST | Start a new episode with a corrupted dataset |
67
- | `/step` | POST | Submit identified issues + proposed fixes |
68
- | `/state` | GET | Get current episode state |
69
- | `/health` | GET | Health check |
70
 
71
- ## Tasks
72
 
73
- | Task | Issues | Difficulty | Domain | Description |
74
- |------|--------|-----------|--------|-------------|
75
- | `easy` | 6 | Beginner | HR/Employee data (21 rows) | Nulls, wrong types, duplicates, out-of-range, email-name mismatch, future dates |
76
- | `medium` | 8 | Intermediate | E-commerce orders (31 rows) | Inconsistent totals, invalid categories, duplicate keys, wrong date formats, invalid country codes, future-date deliveries |
77
- | `hard` | 10 | Advanced | ML experiment metadata (31 rows) | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
78
- | `alignment` | 12 | Expert | LLM alignment data (30 rows, NVIDIA HelpSteer) | See below |
79
- | `moderation` | 10 | Expert | Content moderation (30 rows, OpenAI Moderation) | Mislabeled hate/violence, false positives on clean text, subset rule violations, label range errors |
80
-
81
- **Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.
82
-
83
- ### Alignment Task: LLM Training Data Quality (Expert)
84
-
85
- Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)**: 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity on a 0-4 scale).
86
-
87
- This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:
88
-
89
- | Issue | Difficulty | Why It's Hard |
- |---|---|---|
- | Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym: sounds plausible, requires domain knowledge |
- | Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
- | Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion, which trains confused models |
- | Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics, the most dangerous case for training |
- | Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
- | Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
- | Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
- | Score inflation (helpfulness=4 for a 4-word answer) | 2.5 | Score-content mismatch requires understanding rating criteria |
- | Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
- | Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
- | Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
- | Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |
103
-
104
- These issues are designed to challenge frontier models β€” they require factual recall, semantic reasoning, cross-row comparison, and understanding of what makes training data harmful.
105
-
106
- ## Two-Phase Action Space
107
-
108
- ### Phase 1: Identify Issues
109
-
110
- Submit issues in format: `row:<row_number>,col:<column_name>,issue:<issue_type>`
111
-
112
- - `row_number`: 1-indexed data row position (after header)
113
- - `column_name`: Exact column header name, lowercase
114
- - `issue_type`: One of the supported types below
115
-
116
- ### Phase 2: Propose Fixes
117
-
118
- Submit fixes in format: `row:<row_number>,col:<column_name>,fix:<corrected_value>`
119
-
120
- The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.
121
-
122
- Both phases can be submitted in the same step or across multiple steps.
123
-
124
- **Supported Issue Types:**
125
-
126
- | Type | Description | Example |
127
- |------|-------------|---------|
128
- | `missing_value` | Null, empty, or whitespace-only | Empty name field |
129
- | `wrong_type` | Value doesn't match expected type | Salary as "seventy-five thousand" |
130
- | `duplicate_row` | Exact duplicate or duplicate key | Two rows with same employee_id |
131
- | `out_of_range` | Value outside valid range | Salary of 5000 when min is 50000 |
132
- | `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
133
- | `inconsistent_value` | Computed field mismatch, logical inconsistency | total != qty * price |
134
- | `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
135
- | `referential_integrity` | Foreign key violation | (available for custom tasks) |
136
-
137
- ## Observation Space
138
-
139
- | Field | Type | Description |
140
- |-------|------|-------------|
141
- | `dataset_csv` | str | The corrupted dataset in CSV format |
142
- | `schema_description` | str | Column types, ranges, and constraints |
143
- | `validation_rules` | str | Business rules the data must satisfy |
144
- | `task_description` | str | Task context and instructions |
145
- | `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
146
- | `num_issues_hint` | int | Exact count of planted issues |
147
- | `max_steps` | int | Maximum attempts allowed |
148
- | `done` | bool | Whether episode has terminated |
149
- | `reward` | float | Best combined reward so far (0.0-1.0) |
150
-
151
- **Observation Metadata** (per step):
152
- - Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
153
- - Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
154
- - Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`
155
-
156
- ## Reward Function
157
-
158
- ### Combined Reward
159
 
160
  ```
161
  combined_reward = 0.6 * identify_score + 0.4 * fix_score
162
  ```
163
 
164
- If no fixes are submitted, `combined_reward = identify_score` (no penalty, so identify-only agents remain backward compatible).
165
-
166
- ### Identify Score (Difficulty-Weighted F1)
167
-
168
- Each planted issue has a **difficulty weight** (1.0-3.0):
169
 
170
- | Weight | Category | Examples |
171
- |--------|----------|----------|
172
- | 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
173
- | 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
174
- | 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only |
175
 
176
- - **Weighted Recall** = (difficulty of found issues) / (total difficulty)
177
- - **Weighted Precision** = penalizes false positives proportional to average difficulty
178
- - **Weighted F1** = harmonic mean
179
 
180
- ### Fix Score (Difficulty-Weighted Quality)
181
 
182
- Each proposed fix is compared against the original clean value:
 
 
 
183
 
184
- | Fix Quality | Score | Description |
185
- |-------------|-------|-------------|
186
- | Exact match | 1.0 | Case-insensitive, whitespace-stripped match |
187
- | Numeric close | 0.8 | Within 1% of correct numeric value |
188
- | Correct cell | 0.1 | Right location, wrong value |
189
- | Non-issue cell | 0.0 | Fix targets a cell with no issue |
190
 
191
- Fix score = (sum of best fix score per issue × difficulty weight) / (total difficulty weight)
192
 
193
- ### Reward Properties
 
 
194
 
195
- - **Per-step partial progress**: reward increases as more issues are found/fixed
196
- - **Difficulty-aware**: finding subtle issues earns more than obvious ones
197
- - **Penalizes bad behavior**: false positives reduce score, fixing non-issues earns nothing
198
- - **Monotonically non-decreasing**: best score across all steps is the final reward
199
- - **Always in [0.0, 1.0]**: meets hackathon requirement
 
 
200
 
201
- ### Episode Boundaries
202
 
203
- - Each task allows up to 3 steps (attempts)
204
- - Episode ends when F1 >= 0.999 (perfect identification) or max steps reached
205
- - Agent receives detailed feedback after each step to improve on next attempt
206
 
207
- ## Baseline Scores
208
 
209
- Baseline agent uses Qwen2.5-72B-Instruct via HuggingFace Router:
 
 
 
 
 
 
 
 
210
 
211
- | Task | Identify Score | Fix Score | Combined | Notes |
212
- |------|---------------|-----------|----------|-------|
213
- | `easy` | 0.7-1.0 | 0.5-0.9 | 0.6-1.0 | Most LLMs find obvious issues reliably |
214
- | `medium` | 0.5-0.8 | 0.3-0.6 | 0.4-0.7 | Cross-column reasoning challenges models |
215
- | `hard` | 0.3-0.6 | 0.2-0.4 | 0.3-0.5 | ML domain knowledge and subtle patterns |
 
 
 
 
 
 
 
 
 
 
 
216
 
217
- Scores vary by model. The hard task is designed to challenge frontier models.
218
 
219
- ## Extensibility
 
 
 
 
 
220
 
221
- ### Custom Contamination Rules
222
 
223
- ```python
224
- from dataqa_env import register_contamination_rule
225
- from dataqa_env.server.tasks import PlantedIssue
226
-
227
- def swap_digits(rows, header, col_idx, row_idx, rng):
228
- val = rows[row_idx][col_idx]
229
- corrupted = val[::-1]
230
- issue = PlantedIssue(
231
- row=row_idx + 1, col=header[col_idx],
232
- issue_type="format_violation",
233
- description=f"Digits swapped in {header[col_idx]}",
234
- difficulty=2.0,
235
- )
236
- return corrupted, issue
237
-
238
- register_contamination_rule("swap_digits", swap_digits)
239
- ```
240
 
241
- ### Custom Tasks from Config
242
 
243
- ```python
244
- from dataqa_env import create_task_from_config, register_task
245
 
246
- task = create_task_from_config(
247
- task_id="custom",
248
- name="Custom Validation",
249
- description="Find quality issues in this dataset.",
250
- schema_description="id: int, name: str, score: int (0-100)",
251
- validation_rules="No missing values. Scores must be 0-100.",
252
- clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
253
- contaminations=[
254
- {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
255
- {"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
256
- ],
257
- )
258
- register_task("custom", lambda seed: task)
259
- ```
260
 
261
- ### Built-in Contamination Rules
 
 
 
 
 
 
262
 
263
- | Rule | Effect | Default Difficulty |
264
- |------|--------|--------------------|
265
- | `missing_value` | Sets field to empty string | 1.0 |
266
- | `whitespace_value` | Sets field to single space | 2.5 |
267
- | `wrong_type_text` | Replaces with random text | 1.0 |
268
- | `negative_value` | Negates numeric value | 1.0 |
269
 
270
- ## Setup & Quick Start
271
 
272
  ```bash
273
- # Install
274
  pip install -e .
275
-
276
- # Run server locally
277
  uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
278
 
279
- # Run inference (set your API credentials)
280
  API_BASE_URL=https://router.huggingface.co/v1 \
281
  MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
282
  HF_TOKEN=your-token \
283
  python inference.py
284
  ```
285
 
286
- ## Docker
287
-
288
- ```bash
289
- docker build -t dataqa-env .
290
- docker run -p 8000:8000 dataqa-env
291
- ```
292
-
293
  ## Testing
294
 
 
 
295
  ```bash
296
  pip install -e ".[dev]"
297
  pytest tests/ -v
298
  ```
299
 
300
- 118 tests covering:
301
- - Task creation, corruption, and difficulty weights
302
- - Issue key and fix parsing (standard, lenient, edge cases)
303
- - F1, weighted reward, and fix quality computation
304
- - Full environment lifecycle (identify-only and identify+fix)
305
- - Combined reward calculation and weight verification
306
- - Inference script parsing and prompt building
307
- - Structured log format ([START], [STEP], [END])
308
- - Score bounds (0.0-1.0), best-score monotonicity
309
- - Extensibility API (custom rules, custom tasks)
310
-
311
- ## Validation
312
-
313
- ```bash
314
- # OpenEnv spec validation
315
- openenv validate .
316
-
317
- # Pre-submission validation (requires HF Space URL)
318
- ./prevalidation_script.sh https://your-space.hf.space
319
- ```
320
-
321
- ## Environment Variables
322
-
323
- | Variable | Description | Default |
324
- |----------|-------------|---------|
325
- | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
326
- | `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
327
- | `HF_TOKEN` | HuggingFace token / API key | - |
328
- | `ENV_URL` | Environment server URL | `http://localhost:8000` |
329
-
330
  ## Architecture
331
 
332
  ```
333
  dataqa_env/
- ├── __init__.py # Public API + extensibility exports
- ├── models.py # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
- ├── client.py # EnvClient for WebSocket connections
  ├── server/
- │ ├── environment.py # Two-phase DataQAEnvironment (identify + fix + combined reward)
- │ ├── tasks.py # Task definitions + contamination rules + extensibility API
- │ ├── app.py # FastAPI server (via openenv-core create_app)
- │ └── Dockerfile
- tests/
- ├── test_tasks.py # Task creation, corruption, difficulty weights
- ├── test_environment.py # Identify scoring, fix grading, combined reward, lifecycle
- ├── test_inference.py # LLM response parsing, fix parsing, prompt building, log format
- └── test_extensibility.py # Custom rules, custom tasks, registration API
  inference.py # Two-phase baseline agent (identify → fix)
- openenv.yaml # OpenEnv/HF Spaces spec
- pyproject.toml # Package metadata and dependencies
- Dockerfile # Production container
351
  ```
 
12
 
13
  # DataQA Environment
14
 
15
+ **A two-phase OpenEnv RL environment for Data Quality Assurance**: an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
16
 
17
+ ## Why DataQA? The Moat

+ ### 1. Solves a Real, High-Frequency Problem

+ Every ML team burns hours on data quality (missing values, type mismatches, logical inconsistencies, subtle statistical anomalies) before data enters training pipelines or production databases. DataQA turns this universal pain point into a graded RL environment. Unlike synthetic toy problems, **these are the exact data bugs that corrupt production ML models.**
22
 
23
+ ### 2. Seven Diverse Domains, One Unified Interface
 
 
24
 
25
+ | Task | Domain | Issues | What Makes It Hard |
+ |------|--------|--------|--------------------|
+ | `easy` | HR / Employee data | 6 | Missing values, typos, format errors |
+ | `medium` | E-commerce orders | 8 | Cross-column math (`total != qty * price`), OCR errors |
+ | `hard` | ML experiment metadata | 10 | Data leakage detection, impossible GPU specs, SOTA violations |
+ | `alignment` | LLM fine-tuning data (NVIDIA HelpSteer) | 12 | Hallucinated citations, self-contradictions, toxic content scored as helpful |
+ | `coding` | Code instruction-response pairs | 10 | Logic bugs in "correct" code, `eval()` injection, language mismatches |
+ | `toolcalling` | Function-calling schemas | 10 | Hallucinated parameters, missing required args, name mismatches |
+ | `moderation` | Content moderation labels | 10 | Mislabeled hate speech, false positives on clean text |
34
 
35
+ **66 total planted issues** spanning tabular data, free-text, code, JSON schemas, and safety labels. No other OpenEnv submission covers this breadth with a single coherent reward function.
36
 
37
+ ### 3. Two-Phase Reward: Identify Then Fix

+ Most data QA environments only ask "is there a bug?" DataQA goes further:

+ - **Phase 1 (Identify):** Find all issues, graded by difficulty-weighted F1
+ - **Phase 2 (Fix):** Propose the correct value, graded against the clean original with tiered scoring (exact match = 1.0, valid fix = 0.8, partial = 0.4, right cell wrong value = 0.1)
43
 
44
  ```
45
  combined_reward = 0.6 * identify_score + 0.4 * fix_score
46
  ```
47
 
48
+ This creates a richer learning signal than binary classification. An agent that finds 8/10 issues and fixes 5 of them correctly gets meaningful partial credit, which suits GRPO/RLHF training.
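
As a minimal sketch of the formula above (not the environment's actual implementation), the blend can be written as a one-line function. The function name `combined_reward` is hypothetical; the sample inputs are the easy-task replay figures quoted in the previous README revision (Identify 1.00, Fix 0.67, Reward 0.87).

```python
def combined_reward(identify_score: float, fix_score: float) -> float:
    """Blend the two phase scores with the 0.6 / 0.4 weights stated in the README."""
    return 0.6 * identify_score + 0.4 * fix_score


# Easy-task replay figures from the earlier README revision.
reward = combined_reward(identify_score=1.00, fix_score=0.67)
print(round(reward, 2))
```

Because identification carries the larger weight, an agent that only identifies issues can still earn up to 0.6 under this formula, while the remaining 0.4 depends on fix quality.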
 
 
 
 
49
 
50
+ ### 4. Difficulty-Weighted Scoring Rewards Deeper Reasoning
 
 
 
 
51
 
52
+ Each planted issue has a difficulty weight (1.0-3.0). Finding a hallucinated citation (3.0) earns triple the reward of finding an empty field (1.0). This incentivizes agents to develop genuine reasoning capabilities rather than pattern-matching surface-level errors.
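
The following is a rough sketch of how a difficulty-weighted F1 could be computed. Weighted recall follows the definition in the previous README revision (difficulty of found issues divided by total difficulty). The precision formula is an assumption: the README only says false positives are penalized in proportion to average difficulty, so each false positive is charged one average-difficulty unit here. The function name and issue keys are hypothetical.

```python
from typing import Dict, Set


def weighted_f1(planted: Dict[str, float], reported: Set[str]) -> float:
    """planted maps issue keys to difficulty weights; reported is the agent's set of keys."""
    if not planted:
        return 0.0
    total = sum(planted.values())
    found = sum(weight for key, weight in planted.items() if key in reported)
    false_positives = len(reported - set(planted))
    avg_difficulty = total / len(planted)

    recall = found / total
    # Assumption: each false positive costs one average-difficulty unit.
    denom = found + false_positives * avg_difficulty
    precision = found / denom if denom else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


planted = {"row:4,col:name": 1.0, "row:9,col:citation": 3.0}
easy_only = weighted_f1(planted, {"row:4,col:name"})
hard_only = weighted_f1(planted, {"row:9,col:citation"})
print(easy_only < hard_only)
```

Under these assumptions, finding only the 3.0-weight issue scores higher than finding only the 1.0-weight issue, which is the incentive the section describes.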
 
 
53
 
54
+ ### 5. Multi-Step Feedback Loop
55
 
56
+ Agents get 3 attempts per task with detailed per-step feedback:
57
+ - Which issues were correct (true positives) vs wrong (false positives)
58
+ - Which issues were missed (false negatives) with difficulty hints
59
+ - Fix quality scores with reasons
60
 
61
+ This enables the agent to **learn from its mistakes within a single episode**, which acts as a natural curriculum.
 
 
 
 
 
62
 
63
+ ### 6. Fully Extensible
64
 
65
+ ```python
+ from dataqa_env import create_task_from_config, register_contamination_rule, register_task
+
+ # Add your own contamination rules (my_swap_fn is a rule function you define)
+ register_contamination_rule("swap_digits", my_swap_fn)

+ # Create tasks from any CSV
+ task = create_task_from_config(
+     task_id="custom", clean_csv="...",
+     contaminations=[{"rule": "missing_value", "row": 0, "col": 1}]
+ )
+ register_task("custom", lambda seed: task)
+ ```
76
 
77
+ New domains can be added in minutes. The contamination engine is domain-agnostic.
78
 
79
+ ---
 
 
80
 
81
+ ## Demo: Agent Trajectory
82
 
83
+ ```
+ HARD TASK: ML experiment metadata
+ Step 1: Found 5/10, missed hard issues → Reward: 0.69
+ Step 2: Found 10/10 + 5 fixes proposed → Reward: 0.77
+ Issues requiring ML knowledge:
+ • val_loss < train_loss (data leakage signal)
+ • resnet18 using 42.5GB GPU (impossible for 11M params)
+ • 350 epochs on ImageNet in 30 min (impossibly fast)
+ • wav2vec2 at 98.5% accuracy (exceeds SOTA)

+ ALIGNMENT TASK: NVIDIA HelpSteer data
+ Step 1: Found 7/12, missed subtle issues → Reward: 0.58
+ Step 2: Found 12/12 + 3 fixes proposed → Reward: 0.72
+ Issues requiring deep reasoning:
+ • Cerasus vs Prunus serrulata (wrong taxonomic name)
+ • $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
+ • Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
+ • Gender-biased advice rated helpfulness=4 (toxic content with inflated scores)
+
+ CODING TASK: Code instruction-response pairs
+ Issues requiring code understanding:
+ • Binary search off-by-one (lo=mid causes infinite loop) marked correct
+ • eval(uid) in Flask route: code injection vulnerability
+ • JavaScript response for a Python-labeled task
+ • Duplicate "merge sort" instruction across rows
+ ```
109
 
110
+ ## Environment API
111
 
112
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/reset` | POST | Start a new episode with `{"task_id": "easy"}` |
+ | `/step` | POST | Submit identified issues + proposed fixes |
+ | `/state` | GET | Get current episode state |
+ | `/health` | GET | Health check |
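
A small sketch of preparing the `/reset` call with only the standard library is shown here. The base URL assumes the local server from the Quick Start (`--port 8000`), and the helper name `build_reset_request` is hypothetical. The `/step` payload is not documented in this README, so it is deliberately left out.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: local uvicorn server from the Quick Start


def build_reset_request(task_id: str, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST /reset request carrying the documented JSON body."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return Request(
        f"{base_url}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reset_request("easy")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it once a server is running; the shape of the response is not described in this README.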
118
 
119
+ ## Action Format
120
 
121
+ **Identify:** `row:<N>,col:<column>,issue:<type>` where type is one of: `missing_value`, `wrong_type`, `duplicate_row`, `out_of_range`, `format_violation`, `inconsistent_value`, `statistical_outlier`

+ **Fix:** `row:<N>,col:<column>,fix:<corrected_value>`

+ Both can be submitted in the same step or across multiple steps (3 steps max).
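
To illustrate the two line formats, here is a sketch of a strict parser. It is not the environment's own parser (the earlier README mentions a more lenient one); the `Entry` type, `parse_entry` function, and regular expression are hypothetical names introduced for this example.

```python
import re
from typing import NamedTuple, Optional

ISSUE_TYPES = {
    "missing_value", "wrong_type", "duplicate_row", "out_of_range",
    "format_violation", "inconsistent_value", "statistical_outlier",
}


class Entry(NamedTuple):
    row: int
    col: str
    kind: str   # "issue" or "fix"
    value: str  # issue type, or the corrected value for a fix


# The final group is greedy so that fix values may themselves contain commas.
_PATTERN = re.compile(r"^row:(\d+),col:([^,]+),(issue|fix):(.*)$")


def parse_entry(line: str) -> Optional[Entry]:
    """Return a parsed Entry, or None if the line does not match either format."""
    match = _PATTERN.match(line.strip())
    if match is None:
        return None
    row, col, kind, value = match.groups()
    value = value.strip()
    if kind == "issue" and value not in ISSUE_TYPES:
        return None
    return Entry(int(row), col.strip(), kind, value)


print(parse_entry("row:7,col:salary,fix:75000"))
```

Rejecting unknown issue types up front keeps malformed submissions from being silently counted as false positives on the client side.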
 
126
 
127
+ ## Reward Design
 
 
 
 
 
 
 
 
 
 
 
 
 
128
 
129
+ | Property | Detail |
+ |----------|--------|
+ | Range | Strictly within (0, 1): 0.001 minimum, 0.999 maximum |
+ | Partial credit | Yes: per-issue, difficulty-weighted |
+ | Monotonic | Best score across all steps is final reward |
+ | Penalizes guessing | False positives reduce precision, fixing non-issues scores 0 |
+ | Multi-step improvement | Detailed feedback enables learning across attempts |

+ **Fix grading tiers** (by issue type):
+ - Exact match with clean value → 1.0
+ - Valid fix: right type/range, addresses the issue → 0.8
+ - Partially valid: reasonable attempt, right direction → 0.4
+ - Right cell, wrong value → 0.1
+ - Non-issue cell → 0.0
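
The tiers and the reward range can be combined in a short sketch. The aggregation (best tier per issue, weighted by difficulty, divided by total difficulty) follows the formula given in the previous README revision and is assumed to still apply. The tier keys, function names, and the idea of clamping as a separate step are hypothetical.

```python
from typing import Dict

# Tier values from the list above; the string keys are illustrative labels.
TIER_SCORES = {
    "exact": 1.0,
    "valid": 0.8,
    "partial": 0.4,
    "wrong_value": 0.1,
    "non_issue": 0.0,
}


def fix_score(difficulty: Dict[str, float], best_tier: Dict[str, str]) -> float:
    """Difficulty-weighted average of the best tier achieved for each planted issue."""
    total = sum(difficulty.values())
    if total == 0:
        return 0.0
    earned = sum(
        weight * TIER_SCORES[best_tier[key]]
        for key, weight in difficulty.items()
        if key in best_tier  # issues with no attempted fix earn nothing
    )
    return earned / total


def clamp_reward(reward: float, low: float = 0.001, high: float = 0.999) -> float:
    """Keep the reward inside the documented 0.001 to 0.999 range."""
    return max(low, min(high, reward))


score = fix_score({"r1": 1.0, "r2": 3.0}, {"r1": "exact", "r2": "partial"})
print(round(score, 2), clamp_reward(1.0))
```

In this example an exact fix on the easy issue and a partial fix on the hard one give (1.0 × 1.0 + 3.0 × 0.4) / 4.0 = 0.55, so most of the remaining credit sits with the harder issue.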
143
 
144
+ ## Quick Start
145
 
146
  ```bash
 
147
  pip install -e .
 
 
148
  uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
149
 
150
+ # Run baseline agent
151
  API_BASE_URL=https://router.huggingface.co/v1 \
152
  MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
153
  HF_TOKEN=your-token \
154
  python inference.py
155
  ```
156
 
 
 
 
 
 
 
 
157
  ## Testing
158
 
159
+ 128 tests covering task creation, reward computation, fix grading, environment lifecycle, inference parsing, and extensibility API.
160
+
161
  ```bash
162
  pip install -e ".[dev]"
163
  pytest tests/ -v
164
  ```

  ## Architecture
167
 
168
  ```
169
  dataqa_env/
+ ├── models.py # DataQAAction (issues + fixes), DataQAObservation
  ├── server/
+ │ ├── environment.py # Two-phase grading engine (identify + fix + combined reward)
+ │ ├── tasks.py # 7 task definitions + contamination rules + extensibility API
+ │ └── app.py # FastAPI server (via openenv-core)
  inference.py # Two-phase baseline agent (identify → fix)
+ tests/ # 128 tests
 
 
177
  ```