Rockerleo committed on
Commit
1e82f9d
·
verified ·
1 Parent(s): a744b64

Upload folder using huggingface_hub

Files changed (4)
  1. ARCHITECTURE.md +124 -0
  2. README.md +147 -138
  3. openenv.yaml +76 -13
  4. pyproject.toml +3 -2
ARCHITECTURE.md ADDED
@@ -0,0 +1,124 @@
1
+ # Architecture
2
+
3
+ ## System Overview
4
+
5
+ ```
6
+ Agent (inference.py)
7
+
8
+ │ POST /reset, POST /step
9
+
10
+ FastAPI Server (app.py)
11
+
12
+ │ reset(), step()
13
+
14
+ MLOpsEnvironment (mlops_environment.py)
15
+
16
+ ├── ArtifactGenerator (artifact_generator.py)
17
+ │ └── BUG_CATALOGUE: 9 bug specs across 3 tiers
18
+ │ └── Procedural generation: config, logs, stats, code, eval, model card
19
+
20
+ ├── Sanity Check Engine (artifact_generator.py)
21
+ │ └── 8 computed diagnostics grounded in generated artifacts
22
+
23
+ ├── Grader (_handle_submit)
24
+ │ └── 4-component scoring: category + file + field + fix
25
+
26
+ └── Models (models.py)
27
+ └── MLOpsAction, MLOpsObservation, MLOpsState, ArtifactMeta
28
+ ```
29
+
30
+ ## Data Flow
31
+
32
+ ### Episode Lifecycle
33
+
34
+ ```
35
+ 1. reset(task_id, seed)
36
+ ├── Random(seed) selects bug from task pool
37
+ ├── ArtifactGenerator creates 6 consistent artifacts with planted fault
38
+ └── Returns: MLOpsObservation with task description + artifact metadata
39
+
40
+ 2. step(action) × N
41
+ ├── read_* actions → return artifact content (reward: +0.02 new, -0.02 duplicate)
42
+ ├── run_sanity_check → compute diagnostic from artifacts (reward: +0.01 new)
43
+ ├── query_artifact → return specific field via dot notation
44
+ └── submit_diagnosis → grade against ground truth (terminal)
45
+
46
+ 3. Grading (_handle_submit)
47
+ ├── Compare 4 components against BugSpec ground truth
48
+ ├── Apply hard task penalty if score < 0.70
49
+ └── Return: score ∈ [0.01, 0.99], breakdown, ground truth
50
+ ```
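The `query_artifact` step in the lifecycle above returns a specific field via dot notation. A minimal sketch of how such a lookup could work on a parsed artifact follows; the function body, the error handling, and the sample config are assumptions for illustration, not the project's actual code.

```python
from typing import Any, Dict


def query_artifact(artifact: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted path such as 'optimizer.lr' inside a nested dict."""
    node: Any = artifact
    for key in path.split("."):
        # Stop as soon as the path leaves the nested-dict structure.
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"field not found: {path!r} (stopped at {key!r})")
        node = node[key]
    return node


# Hypothetical parsed config.yaml contents, for illustration only.
config = {"optimizer": {"name": "SGD", "lr": 50.0}, "batch_size": 64}
lr = query_artifact(config, "optimizer.lr")
```

Raising on a missing key, rather than returning `None`, keeps a mistyped path distinguishable from a field whose value is genuinely null.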
51
+
52
+ ### Determinism Guarantees
53
+
54
+ - `random.Random(seed)` for bug selection and artifact variation
55
+ - `np.random.RandomState(seed)` for numeric distributions
56
+ - No external state, no network calls during generation
57
+ - Same (task_id, seed) always produces identical episode
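The determinism guarantees listed above can be illustrated with a small sketch. It assumes bug selection draws from a per-task pool using a private `random.Random(seed)` instance; `BUG_POOLS` and `select_bug` are hypothetical names, and the pools are copied from the task definitions in `openenv.yaml`.

```python
import random

# Bug pools as listed in openenv.yaml; the dict name itself is hypothetical.
BUG_POOLS = {
    "easy": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"],
    "medium": ["data_leakage_scaler", "data_leakage_overlap", "wrong_split_ratio"],
    "hard": ["label_encoder_mismatch", "silent_metric_swap", "tokenizer_version_drift"],
}


def select_bug(task_id: str, seed: int) -> str:
    """Pick a bug for the episode using an RNG that is private to this call."""
    rng = random.Random(seed)  # does not read or modify the global random state
    return rng.choice(BUG_POOLS[task_id])


first = select_bug("hard", 42)
random.seed(999)  # disturbing the global RNG has no effect on the selection
second = select_bug("hard", 42)
```

Because the generator is constructed from the seed on every call, the same `(task_id, seed)` pair yields the same bug regardless of what other code has done with the module-level `random` functions.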
58
+
59
+ ## Component Responsibilities
60
+
61
+ ### app.py — API Layer
62
+ - FastAPI server on port 7860
63
+ - REST endpoints: `/reset`, `/step`, `/state`, `/health`, `/tasks`
64
+ - WebSocket endpoint: `/ws` for streaming interaction
65
+ - Stateless request handling; delegates to MLOpsEnvironment
66
+
67
+ ### mlops_environment.py — Core Logic
68
+ - Episode state management (step count, artifacts read, score)
69
+ - Action routing to handlers
70
+ - Grading logic with 4-component scoring
71
+ - `grade_task()` standalone grader for OpenEnv validation
72
+
73
+ ### artifact_generator.py — Content Generation
74
+ - `BugSpec` dataclass: category, file, field, gold_fix, difficulty
75
+ - `BUG_CATALOGUE`: 9 bug specifications
76
+ - `ArtifactGenerator`: produces 6 artifacts per episode
77
+ - `run_sanity_check()`: 8 computed diagnostic checks
78
+
79
+ ### models.py — Data Models
80
+ - `MLOpsAction`: 8 action types with typed parameters
81
+ - `MLOpsObservation`: full agent observation per step
82
+ - `MLOpsState`: internal state for debugging/RL harness
83
+ - `ArtifactMeta`: artifact metadata (name, description, size hint)
84
+
85
+ ### inference.py — Baseline Agent
86
+ - LLM-powered agent using Gemini via OpenAI-compatible API
87
+ - Investigation phase: reads artifacts, runs sanity checks
88
+ - Diagnosis phase: submits structured diagnosis
89
+ - Fallback logic for unparseable LLM output
90
+ - Rate limiting with exponential backoff
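The last item in the `inference.py` list, rate limiting with exponential backoff, can be sketched as a generic retry helper. This is an assumption about the general shape only; `call_with_backoff`, its parameters, and the injected `sleep` function are hypothetical and not taken from `inference.py`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with delays of base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Usage with a fake call that fails twice, and a recorded (not real) sleep.
calls = {"n": 0}
delays = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A production version would usually catch only the client's rate-limit error rather than every exception.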
91
+
92
+ ### client.py — Client Library
93
+ - `MLOpsDebugEnv`: async httpx client
94
+ - `SyncMLOpsDebugEnv`: synchronous wrapper
95
+ - Context manager support for connection lifecycle
96
+
97
+ ## API Endpoints
98
+
99
+ | Method | Path | Description |
100
+ |--------|------|-------------|
101
+ | GET | `/` | API info |
102
+ | GET | `/health` | Health check |
103
+ | GET | `/tasks` | List available tasks |
104
+ | POST | `/reset` | Start new episode |
105
+ | POST | `/step` | Execute action |
106
+ | GET | `/state` | Current episode state |
107
+ | GET | `/openenv/state` | OpenEnv framework state |
108
+ | WS | `/ws` | WebSocket interface |
109
+
110
+ ## Reward Architecture
111
+
112
+ The reward function has two layers:
113
+
114
+ **Per-step (dense):** Encourages systematic investigation
115
+ - New artifact read: +0.02 (explore broadly)
116
+ - Duplicate read: -0.02 (don't brute force)
117
+ - New sanity check: +0.01 (use diagnostics)
118
+
119
+ **Terminal (graded):** Evaluates diagnosis quality
120
+ - 4 independent components sum to max 1.0
121
+ - Keyword/substring matching (no LLM judge)
122
+ - Hard task asymmetric penalty (1.5x on missed components)
123
+
124
+ This two-layer design means an agent that investigates thoroughly but diagnoses wrong still earns per-step rewards, while an agent that submits immediately with a lucky guess earns terminal reward but misses exploration bonuses.
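The per-step layer described above can be sketched as a small pure function. The reward values come from the list; everything else is an assumption: which action types count as artifact reads, the zero reward for a repeated sanity check, and the names `READ_ACTIONS` and `step_reward` are all hypothetical.

```python
from typing import Optional, Set, Tuple

# Assumption: these five actions are the artifact reads that earn +0.02 / -0.02.
READ_ACTIONS = {
    "read_config",
    "read_logs",
    "check_dataset_stats",
    "inspect_preprocessing",
    "read_eval_results",
}


def step_reward(
    action_type: str,
    detail: Optional[str],
    seen_reads: Set[Tuple[str, Optional[str]]],
    seen_checks: Set[str],
) -> float:
    """Return the dense reward for one non-terminal action and update the seen sets."""
    if action_type in READ_ACTIONS:
        key = (action_type, detail)  # detail is the optional log filter
        if key in seen_reads:
            return -0.02  # duplicate read
        seen_reads.add(key)
        return 0.02  # new artifact read
    if action_type == "run_sanity_check" and detail is not None:
        if detail in seen_checks:
            return 0.0  # assumption: repeating a check is neither rewarded nor penalized
        seen_checks.add(detail)
        return 0.01  # new sanity check
    return 0.0


reads, checks = set(), set()
trace = [
    step_reward("read_logs", None, reads, checks),
    step_reward("read_logs", None, reads, checks),
    step_reward("run_sanity_check", "loss_trajectory", reads, checks),
    step_reward("read_config", None, reads, checks),
]
```

Keying reads on the `(action, filter)` pair follows the README's "same artifact with same filter" wording, so a differently filtered log read still counts as new.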
README.md CHANGED
@@ -14,73 +14,81 @@ pinned: false
14
  [![Python 3.11](https://img.shields.io/badge/python-3.11-green)](https://www.python.org)
15
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
16
 
17
- ## Latest Baseline Scores
18
 
19
- | Task | Score |
20
- |------|-------|
21
- | Easy | 0.91 |
22
- | Medium | 0.85 |
23
- | Hard | 1.00 |
24
- | **Average** | **0.92** |
25
 
26
- *Tested with Gemini 2.5 Flash + Gemini 3.1 Pro Preview fallback for hard task*
27
 
28
- An **OpenEnv-compatible reinforcement learning environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run.
 
 
 
 
 
 
29
 
30
  ---
31
 
32
- ## What Is This?
33
 
34
- Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown. An engineer must systematically investigate (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause.
35
 
36
- This environment simulates that investigation. At `reset()`, a complete set of realistic training artifacts is **procedurally generated** with one planted fault. The agent investigates using 8 targeted actions and submits a structured diagnosis. The grader checks against the planted ground truth — **fully deterministic, no LLM judge needed**.
 
 
 
 
 
 
 
 
 
37
 
38
- **9 distinct bug types across 3 tasks. Every episode can have a different bug. Scores vary continuously 0.0 → 1.0 based on diagnosis precision.**
39
 
40
  ---
41
 
42
- ## Environment Design
43
-
44
- ### Procedural Artifact Generation
45
 
46
- Every episode generates 6 realistic training artifacts from scratch:
47
 
48
- | Artifact | Contents |
49
- |---|---|
50
- | `config.yaml` | Model arch, optimizer, LR, batch size, scheduler, augmentation |
51
- | `train.log` | Epoch-by-epoch loss/accuracy/gradient norms with realistic timestamps |
52
- | `dataset_stats.json` | Split sizes, class distribution, overlap counts, feature statistics |
53
- | `preprocessing.py` | Full sklearn/PyTorch preprocessing pipeline code |
54
- | `eval_results.json` | Final val/test metrics with hardware info |
55
- | `model_card.json` | Architecture summary, tokenizer version, preprocessing config |
56
 
57
- Artifacts are **internally consistent** — config matches logs, dataset stats match preprocessing code — except for the one planted fault. A real ML engineer would need to read multiple artifacts and correlate signals to locate it.
58
 
59
  ---
60
 
61
- ## Action Space
62
 
63
  ```python
64
  class MLOpsAction(BaseModel):
65
  action_type: Literal[
66
- "read_config", # Full config.yaml
67
- "read_logs", # Training logs (filterable: keyword or "epoch:N-M")
68
- "check_dataset_stats", # Split sizes, class distribution, overlap counts
69
- "inspect_preprocessing",# Full preprocessing pipeline code
70
- "read_eval_results", # Final val/test metrics
71
- "run_sanity_check", # Computed diagnostic (see types below)
72
- "query_artifact", # Specific field from any artifact (dot notation)
73
- "submit_diagnosis", # Final answer — triggers grading
74
  ]
75
-
76
- # Sanity check types:
77
- # label_consistency | data_leakage | gradient_norms | class_balance
78
- # feature_statistics | encoder_version_match | loss_trajectory | metric_gap_analysis
79
-
80
- # submit_diagnosis fields:
81
- # failure_category | root_cause_file | root_cause_field | diagnosis | proposed_fix
82
  ```
83
 
 
 
 
84
  ---
85
 
86
  ## Observation Space
@@ -91,8 +99,8 @@ class MLOpsObservation(BaseModel):
91
  task_description: str # Full task brief with investigation strategy
92
  run_id: str # Unique run identifier
93
  run_summary: Dict[str, Any] # Model, dataset, training status
94
- available_artifacts: List[ArtifactMeta] # What can be read
95
- artifacts_read: List[str] # Investigation progress
96
  last_action_result: Dict[str, Any] # Full content of last action
97
  step_count: int
98
  max_steps: int
@@ -102,84 +110,85 @@ class MLOpsObservation(BaseModel):
102
 
103
  ---
104
 
105
- ## Tasks
106
 
107
- ### Task 1 — Config Error Diagnosis `(easy)`
108
 
109
  **Bug pool (one picked randomly per episode):**
110
- - `exploding_lr` — `learning_rate: 50.0` causes loss NaN by epoch 3
111
- - `wrong_optimizer` — `SGD(momentum=0.99)` causes oscillation with no convergence
112
- - `batch_size_overflow` — `batch_size: 4096` exceeds dataset size, val accuracy 99.9% trivially
113
-
114
- **Signal:** Visible immediately in training logs. Loss curve or accuracy values are obviously wrong.
115
 
116
- **Optimal strategy:** `read_logs` → `run_sanity_check(loss_trajectory)` → `read_config` → `submit_diagnosis`
117
 
118
- Max steps: **20** | Expected baseline score: ~0.42
119
-
120
- ---
121
-
122
- ### Task 2 — Data Leakage Detection `(medium)`
123
 
124
  **Bug pool:**
125
  - `data_leakage_scaler` — `StandardScaler.fit_transform(X_full)` called before train/val split
126
- - `data_leakage_overlap` — `train_test_split(random_state=None)` produces non-deterministic overlapping splits
127
- - `wrong_split_ratio` — `test_size=0.8` trains on 20% and evaluates on 80% (inverted)
128
 
129
- **Signal:** Val accuracy suspiciously high from epoch 1 in logs; val/test gap in eval results; sample overlap count in dataset stats.
130
 
131
- **Optimal strategy:** `read_logs` → `read_eval_results` → `run_sanity_check(data_leakage)` → `inspect_preprocessing` → `submit_diagnosis`
132
-
133
- Max steps: **30** | Expected baseline score: ~0.28
134
-
135
- ---
136
-
137
- ### Task 3 — Silent Evaluation Bug `(hard)`
138
 
139
  **Bug pool:**
140
- - `label_encoder_mismatch` — Train/eval use different `LabelEncoder.fit()` orderings → silent wrong predictions
141
- - `silent_metric_swap` — `val_accuracy` and `test_accuracy` assignments are swapped in eval code
142
- `tokenizer_version_drift` — Training uses tokenizer v2, eval uses v1 → 847 tokens map to `[UNK]`
143
-
144
- **Signal:** Training logs look completely normal. Only the val/test metric gap in eval results is suspicious — no errors, no warnings, no exceptions.
145
-
146
- **Asymmetric penalty:** Missing a silent evaluation bug (which would affect production predictions) is penalized 1.5× — mirroring real incident severity weighting.
147
 
148
- **Optimal strategy:** `read_eval_results` → `run_sanity_check(metric_gap_analysis)` → `inspect_preprocessing` → `run_sanity_check(label_consistency OR encoder_version_match)` → `submit_diagnosis`
149
 
150
- Max steps: **40** | Expected baseline score: ~0.15
151
 
152
  ---
153
 
154
- ## Reward Function
155
 
156
- **Dense per-step rewards** (not sparse):
157
 
158
  ```
159
- +0.02 First time reading an artifact (rewards systematic exploration)
160
- -0.02 Reading same artifact with same filter again (penalizes brute force)
161
- +0.01 Running a new sanity check (rewards diagnostic reasoning)
162
-
163
- At submit_diagnosis:
164
- +0.15 Correct failure_category (config_error / data_leakage / evaluation_bug / ...)
165
- +0.25 Correct root_cause_file (exact match)
166
- +0.30 Correct root_cause_field (substring match, case-insensitive)
167
- +0.30 Correct proposed_fix (keyword overlap with gold fix)
168
-
169
- Task 3 modifier: if score < 0.70, additional 0.5× penalty on missed components
 
 
 
170
  ```
171
 
172
- **Score spectrum** (verified):
 
 
173
  ```
174
- All wrong → 0.00
175
- Category only → 0.10–0.15
176
- Category + file → 0.35–0.40
177
- Category + file + field → 0.65
178
- Perfect diagnosis → 0.90–1.00
179
  ```
180
 
181
  ---
182
 
 
 
 
 
 
 
 
 
 
 
 
 
183
  ## Setup & Usage
184
 
185
  ### Docker (recommended)
@@ -200,27 +209,19 @@ uvicorn app:app --host 0.0.0.0 --port 7860
200
  ### Python Client
201
 
202
  ```python
203
- # Sync usage
204
  from client import MLOpsDebugEnv
205
  from models import MLOpsAction
206
 
207
  with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
208
  obs = env.reset(task_id="hard", seed=1)
209
- print(obs.task_description)
210
 
211
- # Investigate systematically
212
  r = env.step(MLOpsAction(action_type="read_eval_results"))
213
- print(r.observation.last_action_result["content"])
214
-
215
- r = env.step(MLOpsAction(
216
- action_type="run_sanity_check",
217
- sanity_check_type="metric_gap_analysis"
218
- ))
219
- # Reveals val/test gap anomaly
220
-
221
  r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
222
- # Shows the buggy pipeline code
223
 
 
224
  r = env.step(MLOpsAction(
225
  action_type="submit_diagnosis",
226
  failure_category="label_mismatch",
@@ -232,63 +233,71 @@ with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
232
  print(f"Score: {r.info['score']}")
233
  ```
234
 
235
- ---
236
-
237
- ## Baseline Inference Script
238
 
239
  ```bash
240
- export API_BASE_URL="https://router.huggingface.co/v1"
241
- export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
242
- export HF_TOKEN="hf_your_token_here"
243
  export ENV_BASE_URL="http://localhost:7860"
244
-
245
- python inference.py # all 3 tasks, seed=42
246
  python inference.py --task easy --seed 42
247
  ```
248
 
249
- **Output format:**
250
  ```
251
- [START] task=easy env=mlops-debug-env model=Qwen/Qwen2.5-72B-Instruct
252
  [STEP] step=1 action=read_logs reward=0.02 done=false error=null
253
  [STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
254
  [STEP] step=3 action=read_config reward=0.02 done=false error=null
255
- [STEP] step=4 action=submit_diagnosis reward=0.95 done=true error=null
256
- [END] success=true steps=4 rewards=0.02,0.01,0.02,0.95
257
  ```
258
 
259
- **Baseline scores** (Qwen2.5-72B-Instruct, seed=42):
260
 
261
- | Task | Score | Notes |
262
- |---|---|---|
263
- | easy | ~0.42 | Gets category right, struggles with exact field name |
264
- | medium | ~0.28 | Often identifies leakage but misidentifies exact mechanism |
265
- | hard | ~0.15 | Silent bugs with normal training logs are genuinely hard |
266
 
267
- ---
268
 
269
- ## Why This Environment
270
 
271
- **Real problem.** Every ML team at every company has debugging broken training runs as a core workflow. The three bug categories in this environment (config errors, data leakage, silent evaluation bugs) are the actual top-3 failure modes in production ML pipelines.
272
 
273
- **Deterministic grading.** The planted bug is ground truth. Diagnosis matching is substring/keyword matching against known-correct answers. Zero subjectivity, zero LLM-as-judge, reproducible across runs.
274
 
275
- **Genuinely hard for frontier models.** Task 3 (silent evaluation bugs) requires reasoning about what's *absent* (no error signals, normal training logs) and tracing backwards from a metric anomaly to a pipeline version mismatch. State-of-the-art models score ~0.15 without careful prompting.
276
 
277
- **Seed-based reproducibility.** `reset(seed=42)` always produces the same bug, same artifacts, same grading. Baseline scores are reproducible to 4 decimal places.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
278
 
279
  ---
280
 
281
  ## Environment Variables
282
 
283
- | Variable | Description |
284
- |---|---|
285
- | `API_BASE_URL` | LLM API endpoint (OpenAI-compatible) |
286
- | `MODEL_NAME` | Model identifier |
287
- | `HF_TOKEN` | Hugging Face / API token |
288
- | `ENV_BASE_URL` | Environment server URL (default: `http://localhost:7860`) |
289
 
290
  ---
291
 
292
  ## License
293
 
294
- MIT — see LICENSE
 
14
  [![Python 3.11](https://img.shields.io/badge/python-3.11-green)](https://www.python.org)
15
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
16
 
17
+ An **OpenEnv-compatible RL environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run. Built for the **Meta PyTorch Hackathon x Scaler School of Technology**.
18
 
19
+ ---
20
+
21
+ ## The Real-World Problem
22
+
23
+ Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown.
 
24
 
25
+ A senior engineer must systematically investigate (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause. **This is the #1 time sink in production ML operations**, and it's a skill that separates junior from senior ML engineers.
26
 
27
+ This environment simulates that investigation workflow. It's not a toy problem; it models the **actual top-3 failure modes** from production ML pipelines:
28
+
29
+ | Failure Mode | Real-World Frequency | Environment Task |
30
+ |---|---|---|
31
+ | Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
32
+ | Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
33
+ | Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |
34
 
35
  ---
36
 
37
+ ## How It Works
38
 
39
+ At `reset()`, a complete set of **6 realistic training artifacts** is procedurally generated with one planted fault. The agent investigates using **8 structured actions** and submits a diagnosis. The grader checks against ground truth: **fully deterministic, no LLM judge**.
40
 
41
+ ```
42
+ reset(task_id="hard", seed=42)
43
+
44
+ ├── Generates: config.yaml, train.log, dataset_stats.json,
45
+ │ preprocessing.py, eval_results.json, model_card.json
46
+
47
+ ├── Plants: one bug from the task's 3-bug pool
48
+
49
+ └── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]
50
+ ```
51
 
52
+ **9 distinct bug types across 3 difficulty tiers. Every episode can have a different bug. Scores vary continuously based on diagnosis precision.**
53
 
54
  ---
55
 
56
+ ## Procedural Artifact Generation
 
 
57
 
58
+ Every episode generates 6 internally-consistent training artifacts from scratch:
59
 
60
+ | Artifact | Contents | Role in Investigation |
61
+ |---|---|---|
62
+ | `config.yaml` | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
63
+ | `train.log` | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
64
+ | `dataset_stats.json` | Split sizes, class distribution, overlap counts | Detect data issues |
65
+ | `preprocessing.py` | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
66
+ | `eval_results.json` | Final val/test metrics with hardware info | Quantify metric gaps |
67
+ | `model_card.json` | Architecture summary, tokenizer version | Cross-reference versions |
68
 
69
+ Artifacts are **internally consistent** — config matches logs, dataset stats match preprocessing code — except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.
70
 
71
  ---
72
 
73
+ ## Action Space (8 actions)
74
 
75
  ```python
76
  class MLOpsAction(BaseModel):
77
  action_type: Literal[
78
+ "read_config", # Full training configuration
79
+ "read_logs", # Training logs (filterable: keyword or "epoch:N-M")
80
+ "check_dataset_stats", # Split sizes, class distribution, overlap counts
81
+ "inspect_preprocessing", # Full preprocessing pipeline code
82
+ "read_eval_results", # Final val/test metrics
83
+ "run_sanity_check", # Computed diagnostic check (8 types)
84
+ "query_artifact", # Specific field from any artifact (dot notation)
85
+ "submit_diagnosis", # Final answer — triggers grading
86
  ]
 
 
 
 
 
 
 
87
  ```
88
 
89
+ **Sanity check types** (computed diagnostics, not just artifact reads):
90
+ `label_consistency` | `data_leakage` | `gradient_norms` | `class_balance` | `feature_statistics` | `encoder_version_match` | `loss_trajectory` | `metric_gap_analysis`
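To make "computed diagnostics" concrete, here is a hedged sketch of what a `metric_gap_analysis` check might compute from the evaluation results. The `val_accuracy` and `test_accuracy` field names appear elsewhere in this README; the function signature, the 0.10 threshold, and the return shape are assumptions rather than the project's implementation.

```python
from typing import Any, Dict


def metric_gap_analysis(eval_results: Dict[str, float], threshold: float = 0.10) -> Dict[str, Any]:
    """Compare validation and test accuracy and flag an unusually large gap."""
    val = eval_results["val_accuracy"]
    test = eval_results["test_accuracy"]
    gap = val - test
    return {
        "val_accuracy": val,
        "test_accuracy": test,
        "gap": round(gap, 4),
        # Hypothetical rule: any absolute gap above the threshold is worth investigating.
        "suspicious": abs(gap) > threshold,
    }


# Illustrative numbers only, not output from the environment.
report = metric_gap_analysis({"val_accuracy": 0.97, "test_accuracy": 0.61})
healthy = metric_gap_analysis({"val_accuracy": 0.90, "test_accuracy": 0.88})
```

Unlike a plain artifact read, the result is derived: the agent gets a verdict and a number it can cite in the diagnosis instead of raw metrics.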
91
+
92
  ---
93
 
94
  ## Observation Space
 
99
  task_description: str # Full task brief with investigation strategy
100
  run_id: str # Unique run identifier
101
  run_summary: Dict[str, Any] # Model, dataset, training status
102
+ available_artifacts: List[ArtifactMeta] # What can be read (name, description, size)
103
+ artifacts_read: List[str] # Investigation progress tracking
104
  last_action_result: Dict[str, Any] # Full content of last action
105
  step_count: int
106
  max_steps: int
 
110
 
111
  ---
112
 
113
+ ## Tasks & Difficulty Progression
114
 
115
+ ### Task 1 — Config Error Diagnosis `(easy)` | 20 steps max
116
 
117
  **Bug pool (one picked randomly per episode):**
118
+ - `exploding_lr` — `learning_rate: 50.0` causes loss to diverge to NaN by epoch 3
119
+ - `wrong_optimizer` — `SGD(momentum=0.99)` causes loss oscillation with no convergence
120
+ - `batch_size_overflow` — `batch_size: 4096` exceeds dataset size, trivial overfitting
 
 
121
 
122
+ **Signal strength:** High. Symptoms visible immediately in training logs.
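Why a learning rate of 50.0 shows up so clearly in the logs can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which is an illustrative stand-in and not the environment's model; `gradient_descent` is a hypothetical helper.

```python
from typing import List


def gradient_descent(lr: float, steps: int = 5, x0: float = 1.0) -> List[float]:
    """Minimise f(x) = x**2 and return the loss after each update."""
    x = x0
    losses = []
    for _ in range(steps):
        x -= lr * 2 * x  # the gradient of x**2 is 2x
        losses.append(x * x)
    return losses


stable = gradient_descent(lr=0.1)     # each step scales x by 0.8, so the loss shrinks
diverging = gradient_descent(lr=50.0)  # each step scales x by -99, so the loss explodes
```

With `lr=50.0` the loss grows by a factor of 99² per step. Unbounded growth of this kind is what eventually overflows to NaN in a real training run.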
123
 
124
+ ### Task 2 — Data Leakage Detection `(medium)` | 30 steps max
 
 
 
 
125
 
126
  **Bug pool:**
127
  - `data_leakage_scaler` — `StandardScaler.fit_transform(X_full)` called before train/val split
128
+ - `data_leakage_overlap` — `train_test_split(random_state=None)` produces overlapping splits
129
+ - `wrong_split_ratio` — `test_size=0.8` trains on 20% and evaluates on 80%
130
 
131
+ **Signal strength:** Medium. Requires correlating val accuracy anomaly in logs with preprocessing code.
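The `data_leakage_scaler` bug is easiest to see by comparing scaler statistics fitted before and after the split. The sketch below uses only the standard library, with population standard deviation as scikit-learn's `StandardScaler` does; the data values and helper names are made up for illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_scaler(values: List[float]) -> Tuple[float, float]:
    """Return the (mean, population std) a standard scaler would learn."""
    return mean(values), pstdev(values)


def transform(values: List[float], mu: float, sigma: float) -> List[float]:
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0]
val = [100.0, 110.0]  # deliberately different distribution

# Leaky: statistics fitted on train + val, as fit_transform(X_full) before the split would do.
leaky_mu, leaky_sigma = fit_scaler(train + val)

# Correct: statistics fitted on the training split only, then reused for validation.
clean_mu, clean_sigma = fit_scaler(train)

leaky_train = transform(train, leaky_mu, leaky_sigma)
clean_train = transform(train, clean_mu, clean_sigma)
```

In the leaky version the training features already encode the validation set's mean and spread, which is how validation accuracy can look inflated without any error being raised.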
132
 
133
+ ### Task 3 — Silent Evaluation Bug `(hard)` | 40 steps max
 
 
 
 
 
 
134
 
135
  **Bug pool:**
136
+ - `label_encoder_mismatch` — Train/eval use different `LabelEncoder.fit()` orderings
137
+ - `silent_metric_swap` — `val_accuracy` and `test_accuracy` assignments swapped in eval code
138
+ - `tokenizer_version_drift` — Training uses tokenizer v2, eval uses v1 (847 tokens map to `[UNK]`)
 
 
 
 
139
 
140
+ **Signal strength:** Low. Training logs look completely normal. Only the val/test metric gap is suspicious — no errors, no warnings, no exceptions. Requires reasoning about what's *absent*.
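A short sketch shows how a `label_encoder_mismatch` produces wrong predictions without raising anything. It imitates the sorted-class ordering that scikit-learn's `LabelEncoder` uses; the label sets and the helper name are hypothetical.

```python
from typing import Dict, Iterable


def fit_label_encoder(labels: Iterable[str]) -> Dict[str, int]:
    """Map each distinct label to its index in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


# The training pipeline fitted the encoder on one label set...
train_encoder = fit_label_encoder(["cat", "dog", "fish"])  # cat=0, dog=1, fish=2
# ...while the evaluation pipeline re-fitted it on a different one.
eval_encoder = fit_label_encoder(["ant", "cat", "dog"])    # ant=0, cat=1, dog=2

predicted_index = train_encoder["dog"]  # the model correctly predicts "dog" as index 1

# Decoding with the evaluation-side mapping silently yields a different class.
eval_decoder = {i: label for label, i in eval_encoder.items()}
decoded = eval_decoder[predicted_index]
```

Every index is still valid in both mappings, so nothing fails. The only symptom is a drop in test accuracy, which matches the low signal strength described above.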
141
 
142
+ **Asymmetric penalty:** Missing a silent evaluation bug is penalized 1.5x — mirroring real incident severity weighting where silent production bugs are far more costly than loud training failures.
143
 
144
  ---
145
 
146
+ ## Reward Design
147
 
148
+ **Dense per-step rewards** (not sparse — provides learning signal throughout the episode):
149
 
150
  ```
151
+ Investigation phase:
152
+ +0.02 First time reading an artifact (rewards systematic exploration)
153
+ -0.02 Re-reading same artifact+filter (penalizes brute force)
154
+ +0.01 Running a new sanity check (rewards diagnostic reasoning)
155
+
156
+ Diagnosis grading (4 independent components):
157
+ +0.15 Correct failure_category (what kind of bug?)
158
+ +0.25 Correct root_cause_file (which file contains it?)
159
+ +0.30 Correct root_cause_field (which parameter/function?)
160
+ +0.30 Correct proposed_fix (keyword overlap with gold fix)
161
+
162
+ Task 3 modifier:
163
+ If score < 0.70 → additional 0.5x penalty on missed components
164
+ (silent bugs reaching production are more costly than loud failures)
165
  ```
166
 
167
+ **Why dense rewards?** Sparse terminal-only rewards make it impossible to distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." Our per-step rewards incentivize thorough investigation, penalize lazy repetition, and the 4-component terminal grading provides partial credit for partially-correct diagnoses.
168
+
169
+ **Score spectrum:**
170
  ```
171
+ No investigation, wrong diagnosis → 0.01
172
+ Category only correct → 0.10–0.15
173
+ Category + file correct → 0.35–0.40
174
+ Category + file + field correct → 0.65
175
+ Perfect diagnosis → 0.90–0.99
176
  ```
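The four terminal components can be sketched as a pure function. The weights are the ones listed under Reward Design, and the clamp follows the `[0.01, 0.99]` reward range. The matching rules are assumptions: exact match for category and file, gold field as a case-insensitive substring, and fix credit proportional to the share of gold-fix words present. The Task 3 modifier is left out because its exact form is not specified here, so the numbers below are not expected to reproduce the score spectrum exactly.

```python
from typing import Dict


def keyword_overlap(proposed: str, gold: str) -> float:
    """Fraction of the gold fix's words that also appear in the proposed fix."""
    gold_words = set(gold.lower().split())
    if not gold_words:
        return 0.0
    return len(gold_words & set(proposed.lower().split())) / len(gold_words)


def grade(diagnosis: Dict[str, str], truth: Dict[str, str]) -> float:
    """Score a diagnosis against the planted ground truth (hypothetical rules)."""
    score = 0.0
    if diagnosis.get("failure_category") == truth["failure_category"]:
        score += 0.15
    if diagnosis.get("root_cause_file") == truth["root_cause_file"]:
        score += 0.25
    if truth["root_cause_field"].lower() in diagnosis.get("root_cause_field", "").lower():
        score += 0.30
    score += 0.30 * keyword_overlap(diagnosis.get("proposed_fix", ""), truth["gold_fix"])
    # Clamp into the documented reward range and round to 4 decimal places.
    return round(min(max(score, 0.01), 0.99), 4)


# Illustrative ground truth; the gold_fix wording is invented for this example.
truth = {
    "failure_category": "config_error",
    "root_cause_file": "config.yaml",
    "root_cause_field": "learning_rate",
    "gold_fix": "reduce learning_rate to 0.001",
}
perfect = grade({**truth, "proposed_fix": "Reduce learning_rate to 0.001"}, truth)
category_only = grade({"failure_category": "config_error"}, truth)
nothing = grade({}, truth)
```

Since every rule is string comparison on planted values, the same diagnosis always receives the same score, which is the property the deterministic grading claim relies on.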
177
 
178
  ---
179
 
180
+ ## Baseline Scores
181
+
182
+ | Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
183
+ |---|---|---|
184
+ | Easy | ~0.42 | ~0.91 |
185
+ | Medium | ~0.28 | ~0.85 |
186
+ | Hard | ~0.15 | ~0.92 |
187
+
188
+ The baseline agent (no task-specific prompting) struggles significantly on medium and hard tasks, confirming meaningful difficulty progression.
189
+
190
+ ---
191
+
192
  ## Setup & Usage
193
 
194
  ### Docker (recommended)
 
209
  ### Python Client
210
 
211
  ```python
 
212
  from client import MLOpsDebugEnv
213
  from models import MLOpsAction
214
 
215
  with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
216
  obs = env.reset(task_id="hard", seed=1)
 
217
 
218
+ # Investigate
219
  r = env.step(MLOpsAction(action_type="read_eval_results"))
220
+ r = env.step(MLOpsAction(action_type="run_sanity_check",
221
+ sanity_check_type="metric_gap_analysis"))
 
 
 
 
 
 
222
  r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
 
223
 
224
+ # Diagnose
225
  r = env.step(MLOpsAction(
226
  action_type="submit_diagnosis",
227
  failure_category="label_mismatch",
 
233
  print(f"Score: {r.info['score']}")
234
  ```
235
 
236
+ ### Inference Script
 
 
237
 
238
  ```bash
239
+ export GEMINI_API_KEY="your_key"
 
 
240
  export ENV_BASE_URL="http://localhost:7860"
241
+ python inference.py # all 3 tasks
 
242
  python inference.py --task easy --seed 42
243
  ```
244
 
245
+ **Output format (OpenEnv standard):**
246
  ```
247
+ [START] task=easy env=mlops-debug-env model=gemini-2.5-flash
248
  [STEP] step=1 action=read_logs reward=0.02 done=false error=null
249
  [STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
250
  [STEP] step=3 action=read_config reward=0.02 done=false error=null
251
+ [STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
252
+ [END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
253
  ```
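The log lines above are simple enough to parse for downstream analysis. The sketch below assumes every line is a bracketed tag followed by whitespace-separated `key=value` pairs whose values contain no spaces; `parse_log_line` is a hypothetical helper, not part of the repository.

```python
from typing import Dict, Tuple


def parse_log_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split '[TAG] k1=v1 k2=v2' into ('TAG', {'k1': 'v1', 'k2': 'v2'})."""
    tag, _, rest = line.partition("]")
    fields = dict(pair.split("=", 1) for pair in rest.split())
    return tag.strip().lstrip("["), fields


step_tag, step = parse_log_line(
    "[STEP] step=1 action=read_logs reward=0.02 done=false error=null"
)
end_tag, end = parse_log_line(
    "[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91"
)

# The rewards field is a comma-separated list with one entry per step.
rewards = [float(r) for r in end["rewards"].split(",")]
```

Values are kept as strings so the caller decides how to interpret `null`, booleans, and numbers. An error message containing spaces would need a stricter format than this sketch handles.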
254
 
255
+ ---
256
 
257
+ ## Design Decisions
 
 
 
 
258
 
259
+ **Why MLOps debugging?** Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark — it models a real workflow.
260
 
261
+ **Why procedural generation?** Fixed bug scenarios would let agents memorize answers. Our seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.
262
 
263
+ **Why deterministic grading?** LLM-as-judge introduces variance and bias. Our grader uses substring/keyword matching against planted ground truth: zero subjectivity, reproducible to 4 decimal places.
264
 
265
+ **Why asymmetric penalties?** In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.
266
 
267
+ **Why 8 sanity check types?** Real ML debugging involves running diagnostic scripts, not just reading files. Our computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.
268
 
269
+ ---
270
+
271
+ ## Project Structure
272
+
273
+ ```
274
+ MLops-Openenvhack/
275
+ ├── app.py # FastAPI server (REST + WebSocket)
276
+ ├── mlops_environment.py # Core environment: reset/step/grading
277
+ ├── artifact_generator.py # Procedural artifact + bug generation
278
+ ├── models.py # Pydantic models (Action, Observation, State)
279
+ ├── inference.py # LLM baseline agent
280
+ ├── client.py # Python client library (async + sync)
281
+ ├── openenv_state.py # Global state singleton
282
+ ├── openenv.yaml # OpenEnv specification
283
+ ├── Dockerfile # Container configuration
284
+ ├── requirements.txt # Python dependencies
285
+ └── server/ # HF Space deployment copy
286
+ ```
287
 
288
  ---
289
 
290
  ## Environment Variables
291
 
292
+ | Variable | Required | Default | Description |
293
+ |---|---|---|---|
294
+ | `GEMINI_API_KEY` | Yes (for inference) | — | Gemini API key for baseline agent |
295
+ | `MODEL_NAME` | No | `gemini-2.5-flash` | LLM model identifier |
296
+ | `API_BASE_URL` | No | Gemini endpoint | OpenAI-compatible API base URL |
297
+ | `ENV_BASE_URL` | No | `http://localhost:7860` | Environment server URL |
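One way the variables in this table could be read is shown below. The defaults are the ones the table states; the `load_settings` function, the returned dict keys, and the choice to leave `API_BASE_URL` as `None` when unset (the table only says "Gemini endpoint") are assumptions.

```python
import os
from typing import Any, Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect inference settings from environment variables, applying documented defaults."""
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required to run the baseline agent")
    return {
        "api_key": api_key,
        "model_name": env.get("MODEL_NAME", "gemini-2.5-flash"),
        # Left as None when unset; the caller is assumed to fall back to the Gemini endpoint.
        "api_base_url": env.get("API_BASE_URL"),
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:7860"),
    }


# Passing a plain dict instead of os.environ makes the behaviour easy to check.
settings = load_settings({"GEMINI_API_KEY": "example-key"})
```

Failing fast on the one required variable gives a clearer error than a rejected API call several steps into an episode.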
298
 
299
  ---
300
 
301
  ## License
302
 
303
+ MIT
openenv.yaml CHANGED
@@ -5,51 +5,114 @@ description: >
5
  investigating a broken training run. The environment procedurally generates
6
  realistic training artifacts (logs, configs, preprocessing code, eval results)
7
  with one planted fault. The agent must systematically investigate and submit
8
- a structured diagnosis. Three tasks: config error (easy) → data leakage (medium)
9
- → silent evaluation bug (hard). All graders are fully deterministic.
10
- author: Mohit Goyal
11
  license: MIT
12
- tags: [openenv, rl, mlops, debugging, machine-learning, agents]
 
 
 
 
 
 
 
13
  tasks:
14
  - id: easy
15
  name: Config Error Diagnosis
16
  difficulty: easy
17
  max_steps: 20
18
  bug_pool: [exploding_lr, wrong_optimizer, batch_size_overflow]
19
- reward_range: [0.0, 1.0]
 
 
 
 
20
  - id: medium
21
  name: Data Leakage Detection
22
  difficulty: medium
23
  max_steps: 30
24
  bug_pool: [data_leakage_scaler, data_leakage_overlap, wrong_split_ratio]
25
- reward_range: [0.0, 1.0]
 
 
 
 
 
26
  - id: hard
27
  name: Silent Evaluation Bug
28
  difficulty: hard
29
  max_steps: 40
30
  bug_pool: [label_encoder_mismatch, silent_metric_swap, tokenizer_version_drift]
31
- reward_range: [0.0, 1.0]
32
  asymmetric_penalty: true
 
 
 
 
 
 
33
  action_space:
34
  type: discrete_structured
35
- actions: [read_config, read_logs, check_dataset_stats, inspect_preprocessing,
36
- read_eval_results, run_sanity_check, query_artifact, submit_diagnosis]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
37
  observation_space:
38
  type: structured_text
39
- fields: [task_id, run_summary, available_artifacts, artifacts_read,
40
- last_action_result, step_count, max_steps, done, messages]
 
 
 
 
 
 
 
 
 
 
 
41
  reward:
42
  type: dense_and_terminal
43
- per_step: "+0.02 new artifact read, -0.02 duplicate read, +0.01 new sanity check"
44
- terminal: "0.15 category + 0.25 file + 0.30 field + 0.30 fix. Hard task 1.5x penalty."
 
 
 
 
 
 
 
 
 
45
  api:
46
  reset: POST /reset
47
  step: POST /step
48
  state: GET /state
49
  health: GET /health
 
 
50
  websocket: /ws
 
51
  runtime:
52
  port: 7860
53
  workers: 1
54
  framework: fastapi
55
  python: "3.11"
 
 
5
  investigating a broken training run. The environment procedurally generates
6
  realistic training artifacts (logs, configs, preprocessing code, eval results)
7
  with one planted fault. The agent must systematically investigate and submit
8
+ a structured diagnosis. Three tasks: config error (easy) -> data leakage (medium)
9
+ -> silent evaluation bug (hard). All graders are fully deterministic.
10
+ author: Code Clashers
11
  license: MIT
12
+ tags: [openenv, rl, mlops, debugging, machine-learning, agents, pytorch]
13
+
14
+ grading:
15
+ type: deterministic
16
+ judge: none
17
+ method: keyword_and_substring_matching
18
+ reproducible: true
19
+
20
  tasks:
21
  - id: easy
22
  name: Config Error Diagnosis
23
  difficulty: easy
24
  max_steps: 20
25
  bug_pool: [exploding_lr, wrong_optimizer, batch_size_overflow]
26
+ reward_range: [0.01, 0.99]
27
+ description: >
28
+ Diagnose a training failure caused by a hyperparameter misconfiguration.
29
+ Symptoms are visible in training logs (loss explosion, oscillation, trivial overfitting).
30
+
31
  - id: medium
32
  name: Data Leakage Detection
33
  difficulty: medium
34
  max_steps: 30
35
  bug_pool: [data_leakage_scaler, data_leakage_overlap, wrong_split_ratio]
36
+ reward_range: [0.01, 0.99]
37
+ description: >
38
+ Identify data leakage in the preprocessing pipeline. Val accuracy is suspiciously
39
+ high from epoch 1, but test performance tells a different story. Requires correlating
40
+ logs, eval results, and preprocessing code.
41
+
42
  - id: hard
43
  name: Silent Evaluation Bug
44
  difficulty: hard
45
  max_steps: 40
46
  bug_pool: [label_encoder_mismatch, silent_metric_swap, tokenizer_version_drift]
47
+ reward_range: [0.01, 0.99]
48
  asymmetric_penalty: true
49
+ penalty_multiplier: 1.5
50
+ description: >
51
+ Find a silent bug in the evaluation pipeline. Training logs look completely normal.
52
+ No errors, no warnings. Only a val/test metric gap reveals the issue. Requires
53
+ reasoning about what is absent rather than what is present.
54
+
55
  action_space:
56
  type: discrete_structured
57
+ actions:
58
+ - read_config
59
+ - read_logs
60
+ - check_dataset_stats
61
+ - inspect_preprocessing
62
+ - read_eval_results
63
+ - run_sanity_check
64
+ - query_artifact
65
+ - submit_diagnosis
66
+ sanity_check_types:
67
+ - label_consistency
68
+ - data_leakage
69
+ - gradient_norms
70
+ - class_balance
71
+ - feature_statistics
72
+ - encoder_version_match
73
+ - loss_trajectory
74
+ - metric_gap_analysis
75
+
76
  observation_space:
77
  type: structured_text
78
+ fields:
79
+ - task_id
80
+ - task_description
81
+ - run_id
82
+ - run_summary
83
+ - available_artifacts
84
+ - artifacts_read
85
+ - last_action_result
86
+ - step_count
87
+ - max_steps
88
+ - done
89
+ - messages
90
+
91
  reward:
92
  type: dense_and_terminal
93
+ per_step:
94
+ new_artifact_read: +0.02
95
+ duplicate_read: -0.02
96
+ new_sanity_check: +0.01
97
+ terminal:
98
+ failure_category: +0.15
99
+ root_cause_file: +0.25
100
+ root_cause_field: +0.30
101
+ proposed_fix: +0.30
102
+ hard_task_penalty: "if score < 0.70, additional 0.5x on missed components"
103
+
104
  api:
105
  reset: POST /reset
106
  step: POST /step
107
  state: GET /state
108
  health: GET /health
109
+ tasks: GET /tasks
110
+ openenv_state: GET /openenv/state
111
  websocket: /ws
112
+
113
  runtime:
114
  port: 7860
115
  workers: 1
116
  framework: fastapi
117
  python: "3.11"
118
+ container: docker
pyproject.toml CHANGED
@@ -5,7 +5,7 @@ description = "MLOps Pipeline Debugger - OpenEnv-compatible RL environment for M
5
  readme = "README.md"
6
  requires-python = ">=3.11"
7
  license = {text = "MIT"}
8
- authors = [{name = "MLOps Team"}]
9
 
10
  dependencies = [
11
  "fastapi>=0.115.0",
@@ -21,7 +21,8 @@ dependencies = [
21
  ]
22
 
23
  [project.scripts]
24
- server = "uvicorn:main"
 
25
 
26
  [project.optional-dependencies]
27
  dev = [
 
5
  readme = "README.md"
6
  requires-python = ">=3.11"
7
  license = {text = "MIT"}
8
+ authors = [{name = "Code Clashers"}]
9
 
10
  dependencies = [
11
  "fastapi>=0.115.0",
 
21
  ]
22
 
23
  [project.scripts]
24
+ mlops-server = "uvicorn:main"
25
+ mlops-infer = "inference:main"
26
 
27
  [project.optional-dependencies]
28
  dev = [