avanigupta Claude Opus 4.6 (1M context) committed on
Commit 1bd072d · 1 Parent(s): 96d698c

update README with alignment task details and issue breakdown


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. README.md +41 -7
README.md CHANGED
@@ -19,14 +19,14 @@ A two-phase OpenEnv RL environment for **Data Quality Assurance** — an LLM age
```
EASY TASK (Step 2) — All 6 issues found + 5 fixes proposed
Reward: 0.87 | Identify: 1.00 | Fix: 0.67
- ✓ row:4 name: empty → "David Kim" (fix correct)
- ✓ row:7 salary: "seventy-five thousand" → "75000" (fix correct)
- ✓ row:9 salary: "5000" → "73000" (fix correct)
- ✓ row:15 email: mismatch → "oscar.rivera@company.com" (fix correct)
- ✓ row:18 start_date: "2027-06-15" → "2022-01-19" (fix correct)
+ ✓ row:4 name: empty → "David Kim"
+ ✓ row:7 salary: "seventy-five thousand" → "75000"
+ ✓ row:9 salary: "5000" → "73000"
+ ✓ row:15 email: mismatch → "oscar.rivera@company.com"
+ ✓ row:18 start_date: "2027-06-15" → "2022-01-19"
✓ row:21 duplicate row detected

- HARD TASK (Step 1 → Step 2)
+ HARD TASK — ML experiment metadata
Step 1: Found 5/10, missed hard issues → Reward: 0.69
Step 2: Found 10/10 + 5 fixes proposed → Reward: 0.77
Issues requiring ML knowledge:
@@ -34,6 +34,17 @@ HARD TASK (Step 1 → Step 2)
• resnet18 using 42.5GB GPU (impossible)
• 350 epochs on ImageNet in 30 min (impossible)
• wav2vec2 at 98.5% accuracy (exceeds SOTA)
+
+ ALIGNMENT TASK — NVIDIA HelpSteer data (hardest)
+ Step 1: Found 7/12, missed subtle issues → Reward: 0.58
+ Step 2: Found 12/12 + 3 fixes proposed → Reward: 0.72
+ Issues requiring deep reasoning:
+ • Cerasus vs Prunus serrulata (wrong taxonomic name)
+ • $400.3M at Sotheby's vs $450.3M at Christie's (close but wrong)
+ • "does NOT learn via backprop" then describes backprop (self-contradiction)
+ • Fake Nature paper by "Dr. Sarah Chen" (hallucinated citation)
+ • "use bare except everywhere" rated helpfulness=3 (harmful advice)
+ • [SYSTEM] prompt leaked in response (pipeline contamination)
```

  > The interactive replay UI with color-coded dataset visualization is available on the HF Space.
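
The transcripts above report a single reward next to separate identify and fix components (e.g. `Reward: 0.87 | Identify: 1.00 | Fix: 0.67`). As a minimal sketch of how such a two-part score could be blended, here is one weighting that happens to reproduce the easy-task number; the environment's actual reward formula is not shown in this diff, so the weights and function name below are assumptions:

```python
def combined_reward(identify_score: float, fix_score: float,
                    w_identify: float = 0.6, w_fix: float = 0.4) -> float:
    """Blend the identify and fix components into one scalar reward.

    The 0.6/0.4 split is illustrative only; the environment's real
    weighting is not given in this README excerpt.
    """
    return w_identify * identify_score + w_fix * fix_score


# Easy-task numbers from the transcript above:
print(round(combined_reward(1.00, 0.67), 2))  # 0.87 under these assumed weights
```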
@@ -64,10 +75,33 @@ This creates a rich multi-step decision problem where agents must explore datase
| `easy` | 6 | Beginner | HR/Employee data (21 rows) | Nulls, wrong types, duplicates, out-of-range, email-name mismatch, future dates |
| `medium` | 8 | Intermediate | E-commerce orders (31 rows) | Inconsistent totals, invalid categories, duplicate keys, wrong date formats, invalid country codes, future-date deliveries |
| `hard` | 10 | Advanced | ML experiment metadata (31 rows) | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
- | `alignment` | 12 | Expert | LLM instruction-tuning data (25 rows) | Instruction-response mismatches, factual errors in "good" labels, hallucinated citations, harmful advice, language mismatches, truncated responses, duplicate instructions |
+ | `alignment` | 12 | Expert | LLM alignment data (30 rows, NVIDIA HelpSteer) | See below |

**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.

+ ### Alignment Task: LLM Training Data Quality (Expert)
+
+ Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)** — 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity on 0-4 scale).
+
+ This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:
+
+ | Issue | Difficulty | Why It's Hard |
+ |---|---|---|
+ | Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym — sounds plausible, requires domain knowledge |
+ | Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
+ | Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion — trains confused models |
+ | Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics — most dangerous for training |
+ | Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
+ | Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
+ | Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
+ | Score inflation (helpfulness=4 for a 4-word answer) | 2.5 | Score-content mismatch requires understanding rating criteria |
+ | Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
+ | Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
+ | Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
+ | Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |
+
+ These issues are designed to challenge frontier models — they require factual recall, semantic reasoning, cross-row comparison, and understanding of what makes training data harmful.
+
## Two-Phase Action Space

  ### Phase 1: Identify Issues
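
The transcripts above report each finding as a row index, a column, an issue description, and (in Phase 2) a proposed fix. Below is a hypothetical container for one such finding; the environment's real action schema is not part of this excerpt, so the class and field names are illustrative only:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """One reported data-quality issue (illustrative shape, not the repo's schema)."""
    row: int                             # e.g. 4
    column: str                          # e.g. "name"
    issue: str                           # e.g. "empty value"
    proposed_fix: Optional[str] = None   # e.g. "David Kim" (Phase 2 only)


print(Finding(row=4, column="name", issue="empty value", proposed_fix="David Kim"))
```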
 
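The difficulty-progression note names two concrete checks: `total != qty * price` for the e-commerce task and `val_loss < train_loss` as a data-leakage signal for the ML-metadata task. A rough sketch of those rules over plain dict rows; the column names are assumptions, not the bundled datasets' actual schema:

```python
def inconsistent_total(row: dict, tolerance: float = 0.01) -> bool:
    """Cross-column check: the stated total disagrees with qty * unit price."""
    return abs(row["qty"] * row["unit_price"] - row["total"]) > tolerance


def leakage_signal(row: dict) -> bool:
    """Domain check: validation loss below training loss suggests data leakage."""
    return row["val_loss"] < row["train_loss"]


# Illustrative rows, not taken from the bundled datasets.
order = {"qty": 3, "unit_price": 19.99, "total": 45.00}
experiment = {"train_loss": 0.42, "val_loss": 0.18}

print(inconsistent_total(order))   # True  -> total should be 59.97
print(leakage_signal(experiment))  # True  -> classic leakage signature
```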
 
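Several of the alignment issues listed above are pipeline artifacts that cheap text heuristics can surface: a whitespace-only prompt, a leaked `[SYSTEM]` template, a response cut mid-sentence, or a top helpfulness score on a few-word answer. A sketch of such checks, assuming simple `prompt`/`response` string fields plus a numeric `helpfulness` score; the thresholds and issue tags are illustrative, not the environment's actual detectors:

```python
def audit_pair(row: dict) -> list[str]:
    """Flag pipeline-level problems in a prompt/response pair.

    Heuristics only: subtle factual errors and hallucinated citations
    still require model- or human-level reasoning to catch.
    """
    issues = []
    prompt, response = row["prompt"], row["response"]

    if not prompt.strip():
        issues.append("whitespace_only_prompt")
    if "[SYSTEM]" in response:
        issues.append("leaked_system_prompt")
    # Truncation heuristic: response ends without sentence-final punctuation.
    stripped = response.strip()
    if stripped and not stripped.endswith((".", "!", "?", '"', "'")):
        issues.append("possible_truncation")
    # Score-content mismatch: top helpfulness on a very short answer.
    if row.get("helpfulness", 0) >= 4 and len(response.split()) < 10:
        issues.append("suspicious_score_inflation")
    return issues


print(audit_pair({
    "prompt": "Explain photosynthesis.",
    "response": "Photosynthesis converts light into",
    "helpfulness": 4,
}))  # ['possible_truncation', 'suspicious_score_inflation']
```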