Commit · 1bd072d
Parent(s): 96d698c
update README with alignment task details and issue breakdown
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

README.md CHANGED

@@ -19,14 +19,14 @@ A two-phase OpenEnv RL environment for **Data Quality Assurance** – an LLM agent
```
EASY TASK (Step 2) – All 6 issues found + 5 fixes proposed
Reward: 0.87 | Identify: 1.00 | Fix: 0.67
✓ row:4 name: empty → "David Kim"
✓ row:7 salary: "seventy-five thousand" → "75000"
✓ row:9 salary: "5000" → "73000"
✓ row:15 email: mismatch → "oscar.rivera@company.com"
✓ row:18 start_date: "2027-06-15" → "2022-01-19"
✓ row:21 duplicate row detected

-HARD TASK
+HARD TASK – ML experiment metadata
Step 1: Found 5/10, missed hard issues → Reward: 0.69
Step 2: Found 10/10 + 5 fixes proposed → Reward: 0.77
Issues requiring ML knowledge:

@@ -34,6 +34,17 @@ HARD TASK (Step 1 → Step 2)
• resnet18 using 42.5GB GPU (impossible)
• 350 epochs on ImageNet in 30 min (impossible)
• wav2vec2 at 98.5% accuracy (exceeds SOTA)
+
+ALIGNMENT TASK – NVIDIA HelpSteer data (hardest)
+Step 1: Found 7/12, missed subtle issues → Reward: 0.58
+Step 2: Found 12/12 + 3 fixes proposed → Reward: 0.72
+Issues requiring deep reasoning:
+• Cerasus vs Prunus serrulata (wrong taxonomic name)
+• $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
+• "does NOT learn via backprop" then describes backprop (self-contradiction)
+• Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
+• "use bare except everywhere" rated helpfulness=3 (harmful advice)
+• [SYSTEM] prompt leaked in response (pipeline contamination)
```

> The interactive replay UI with color-coded dataset visualization is available on the HF Space.

@@ -64,10 +75,33 @@ This creates a rich multi-step decision problem where agents must explore datasets
| `easy` | 6 | Beginner | HR/Employee data (21 rows) | Nulls, wrong types, duplicates, out-of-range, email-name mismatch, future dates |
| `medium` | 8 | Intermediate | E-commerce orders (31 rows) | Inconsistent totals, invalid categories, duplicate keys, wrong date formats, invalid country codes, future-date deliveries |
| `hard` | 10 | Advanced | ML experiment metadata (31 rows) | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
-| `alignment` | 12 | Expert | LLM
+| `alignment` | 12 | Expert | LLM alignment data (30 rows, NVIDIA HelpSteer) | See below |

**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.

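To make those reasoning patterns concrete, here is a minimal sketch of rule-based detectors for one check of each kind. The column names (`qty`, `price`, `total`, `country`, `train_loss`, `val_loss`) are hypothetical stand-ins, and this is illustrative only, not the environment's actual validator:

```python
import pandas as pd

VALID_COUNTRIES = {"US", "DE", "FR", "JP"}  # assumed allowed code set

def find_issues(df: pd.DataFrame) -> list[dict]:
    issues = []
    # Medium, cross-column: total should equal qty * price (within rounding).
    bad_total = (df["qty"] * df["price"] - df["total"]).abs() > 0.01
    # Medium, set membership: country code must come from a known vocabulary.
    bad_country = ~df["country"].isin(VALID_COUNTRIES)
    # Hard, ML domain knowledge: val_loss below train_loss suggests leakage.
    leakage = df["val_loss"] < df["train_loss"]
    for col, mask, tag in [("total", bad_total, "inconsistent_total"),
                           ("country", bad_country, "invalid_country_code"),
                           ("val_loss", leakage, "data_leakage_signal")]:
        issues += [{"row": int(i), "column": col, "type": tag}
                   for i in df.index[mask]]
    return issues

df = pd.DataFrame({"qty": [2, 1], "price": [9.99, 5.0], "total": [19.98, 6.0],
                   "country": ["US", "XX"],
                   "train_loss": [0.8, 0.9], "val_loss": [1.1, 0.4]})
print(find_issues(df))  # row 1 trips all three checks
```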
+### Alignment Task: LLM Training Data Quality (Expert)
+
+Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)** – 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity on a 0-4 scale).
+
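For orientation, the underlying records can be inspected with the standard `datasets` loader. This is ordinary Hugging Face usage rather than part of this environment's API; the field names are HelpSteer's own:

```python
from datasets import load_dataset

# Load the real source dataset; the alignment task's 30 rows are drawn from it.
ds = load_dataset("nvidia/HelpSteer", split="train")
row = ds[0]
print(row["prompt"][:80])
print(row["response"][:80])
# Each pair carries five human scores on a 0-4 scale:
print({k: row[k] for k in
       ("helpfulness", "correctness", "coherence", "complexity", "verbosity")})
```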
+This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:
+
+| Issue | Difficulty | Why It's Hard |
+|---|---|---|
+| Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym – sounds plausible, requires domain knowledge |
+| Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
+| Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion – trains confused models |
+| Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics – most dangerous for training |
+| Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
+| Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
+| Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
+| Score inflation (helpfulness=4 for a 4-word answer) | 2.5 | Score-content mismatch requires understanding rating criteria |
+| Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
+| Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
+| Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
+| Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |
+
+These issues are designed to challenge frontier models – they require factual recall, semantic reasoning, cross-row comparison, and understanding of what makes training data harmful.
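The 2.0-2.5 rows at the bottom of the table are the ones a mechanical screen can plausibly catch; the 3.0 rows resist pattern matching by construction. A minimal sketch of such screens, with illustrative heuristics rather than the environment's grader:

```python
import re

TERMINALS = {".", "!", "?", '"', "'", ")"}  # crude sentence-final characters

def screen_pair(prompt: str, response: str) -> list[str]:
    """Cheap screens mirroring three of the 2.0-2.5 issues above."""
    flags = []
    if not prompt.strip():                    # whitespace-only prompt
        flags.append("whitespace_only_prompt")
    if re.search(r"\[SYSTEM\]", response):    # unstripped prompt template
        flags.append("leaked_system_prompt")
    stripped = response.rstrip()
    if stripped and stripped[-1] not in TERMINALS:  # cut mid-sentence?
        flags.append("possibly_truncated")
    return flags

print(screen_pair("   ", "[SYSTEM] You are a helpful AI... The answer is"))
# ['whitespace_only_prompt', 'leaked_system_prompt', 'possibly_truncated']
```

The semantic near-duplicate row sits in between: exact string matching misses a rephrased prompt, but embedding similarity catches it. A sketch assuming the `sentence-transformers` package is installed:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = "How do I reverse a list in Python?"
b = "What's the easiest way to invert the order of a Python list?"
emb = model.encode([a, b], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())  # high despite the rewording
# Any flagging threshold would need calibration; the README implies none.
```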
+
## Two-Phase Action Space

### Phase 1: Identify Issues
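The transcript at the top shows the shape of Phase 1 findings (`row:4 name: empty`, `row:21 duplicate row detected`). As a reading aid, here is one hypothetical way to model such a submission in plain Python; this excerpt does not show the environment's actual OpenEnv action schema, which may differ:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """Hypothetical Phase-1 'identify' record mirroring the transcript lines."""
    row: int      # row index as printed in the transcript, e.g. row:4
    column: str   # offending column, or "*" for whole-row issues
    issue: str    # short issue tag

phase1_action = [
    Finding(row=4, column="name", issue="empty"),
    Finding(row=7, column="salary", issue="wrong_type"),
    Finding(row=21, column="*", issue="duplicate_row"),
]
print(phase1_action)
```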
|