Upload folder using huggingface_hub
- ARCHITECTURE.md +124 -0
- README.md +147 -138
- openenv.yaml +76 -13
- pyproject.toml +3 -2
ARCHITECTURE.md
ADDED
|
@@ -0,0 +1,124 @@
| 1 |
+
# Architecture
|
| 2 |
+
|
| 3 |
+
## System Overview
|
| 4 |
+
|
| 5 |
+
```
|
| 6 |
+
Agent (inference.py)
|
| 7 |
+
│
|
| 8 |
+
│ POST /reset, POST /step
|
| 9 |
+
▼
|
| 10 |
+
FastAPI Server (app.py)
|
| 11 |
+
│
|
| 12 |
+
│ reset(), step()
|
| 13 |
+
▼
|
| 14 |
+
MLOpsEnvironment (mlops_environment.py)
|
| 15 |
+
│
|
| 16 |
+
├── ArtifactGenerator (artifact_generator.py)
|
| 17 |
+
│ ├── BUG_CATALOGUE: 9 bug specs across 3 tiers
|
| 18 |
+
│ └── Procedural generation: config, logs, stats, code, eval, model card
|
| 19 |
+
│
|
| 20 |
+
├── Sanity Check Engine (artifact_generator.py)
|
| 21 |
+
│ └── 8 computed diagnostics grounded in generated artifacts
|
| 22 |
+
│
|
| 23 |
+
├── Grader (_handle_submit)
|
| 24 |
+
│ └── 4-component scoring: category + file + field + fix
|
| 25 |
+
│
|
| 26 |
+
└── Models (models.py)
|
| 27 |
+
└── MLOpsAction, MLOpsObservation, MLOpsState, ArtifactMeta
|
| 28 |
+
```
|
| 29 |
+
|
| 30 |
+
## Data Flow
|
| 31 |
+
|
| 32 |
+
### Episode Lifecycle
|
| 33 |
+
|
| 34 |
+
```
|
| 35 |
+
1. reset(task_id, seed)
|
| 36 |
+
├── Random(seed) selects bug from task pool
|
| 37 |
+
├── ArtifactGenerator creates 6 consistent artifacts with planted fault
|
| 38 |
+
└── Returns: MLOpsObservation with task description + artifact metadata
|
| 39 |
+
|
| 40 |
+
2. step(action) × N
|
| 41 |
+
├── read_* actions → return artifact content (reward: +0.02 new, -0.02 duplicate)
|
| 42 |
+
├── run_sanity_check → compute diagnostic from artifacts (reward: +0.01 new)
|
| 43 |
+
├── query_artifact → return specific field via dot notation
|
| 44 |
+
└── submit_diagnosis → grade against ground truth (terminal)
|
| 45 |
+
|
| 46 |
+
3. Grading (_handle_submit)
|
| 47 |
+
├── Compare 4 components against BugSpec ground truth
|
| 48 |
+
├── Apply hard task penalty if score < 0.70
|
| 49 |
+
└── Return: score ∈ [0.01, 0.99], breakdown, ground truth
|
| 50 |
+
```
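The `query_artifact` action in the lifecycle above returns a single field addressed by dot notation. A minimal sketch of that kind of lookup follows, assuming artifacts are held as nested dictionaries. The helper name `get_by_path` and the sample `config` contents are hypothetical and are not taken from the repository.

```python
from typing import Any, Dict


def get_by_path(artifact: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted path such as "optimizer.lr" inside a nested dict."""
    node: Any = artifact
    for key in path.split("."):
        # Stop with a clear error as soon as one path segment is missing.
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"field not found: {path!r} (failed at {key!r})")
        node = node[key]
    return node


# Hypothetical parsed config artifact, for illustration only.
config = {"optimizer": {"name": "sgd", "lr": 50.0}, "batch_size": 64}
lr = get_by_path(config, "optimizer.lr")
```

Raising `KeyError` on a missing segment is one possible design; an environment could equally return an error payload in the observation instead.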
|
| 51 |
+
|
| 52 |
+
### Determinism Guarantees
|
| 53 |
+
|
| 54 |
+
- `random.Random(seed)` for bug selection and artifact variation
|
| 55 |
+
- `np.random.RandomState(seed)` for numeric distributions
|
| 56 |
+
- No external state, no network calls during generation
|
| 57 |
+
- Same (task_id, seed) always produces identical episode
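The determinism guarantee above can be illustrated with a small sketch. It assumes bug selection draws from a per-task pool using a locally constructed `random.Random(seed)`; the `TASK_POOLS` mapping and `select_bug` function are hypothetical names, with pool contents copied from the bug names listed for each task.

```python
import random
from typing import Dict, List

# Hypothetical pools; the bug names match those listed for each task.
TASK_POOLS: Dict[str, List[str]] = {
    "easy": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"],
    "medium": ["data_leakage_scaler", "data_leakage_overlap", "wrong_split_ratio"],
    "hard": ["label_encoder_mismatch", "silent_metric_swap", "tokenizer_version_drift"],
}


def select_bug(task_id: str, seed: int) -> str:
    """Pick one bug for the episode using a generator private to this call."""
    rng = random.Random(seed)  # local generator: the global random state is untouched
    return rng.choice(TASK_POOLS[task_id])


first = select_bug("hard", 42)
second = select_bug("hard", 42)
```

Because the generator is created from the seed on every call and no global state is involved, `first` and `second` are always the same bug.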
|
| 58 |
+
|
| 59 |
+
## Component Responsibilities
|
| 60 |
+
|
| 61 |
+
### app.py — API Layer
|
| 62 |
+
- FastAPI server on port 7860
|
| 63 |
+
- REST endpoints: `/reset`, `/step`, `/state`, `/health`, `/tasks`
|
| 64 |
+
- WebSocket endpoint: `/ws` for streaming interaction
|
| 65 |
+
- Stateless request handling; delegates to MLOpsEnvironment
|
| 66 |
+
|
| 67 |
+
### mlops_environment.py — Core Logic
|
| 68 |
+
- Episode state management (step count, artifacts read, score)
|
| 69 |
+
- Action routing to handlers
|
| 70 |
+
- Grading logic with 4-component scoring
|
| 71 |
+
- `grade_task()` standalone grader for OpenEnv validation
|
| 72 |
+
|
| 73 |
+
### artifact_generator.py — Content Generation
|
| 74 |
+
- `BugSpec` dataclass: category, file, field, gold_fix, difficulty
|
| 75 |
+
- `BUG_CATALOGUE`: 9 bug specifications
|
| 76 |
+
- `ArtifactGenerator`: produces 6 artifacts per episode
|
| 77 |
+
- `run_sanity_check()`: 8 computed diagnostic checks
|
| 78 |
+
|
| 79 |
+
### models.py — Data Models
|
| 80 |
+
- `MLOpsAction`: 8 action types with typed parameters
|
| 81 |
+
- `MLOpsObservation`: full agent observation per step
|
| 82 |
+
- `MLOpsState`: internal state for debugging/RL harness
|
| 83 |
+
- `ArtifactMeta`: artifact metadata (name, description, size hint)
|
| 84 |
+
|
| 85 |
+
### inference.py — Baseline Agent
|
| 86 |
+
- LLM-powered agent using Gemini via OpenAI-compatible API
|
| 87 |
+
- Investigation phase: reads artifacts, runs sanity checks
|
| 88 |
+
- Diagnosis phase: submits structured diagnosis
|
| 89 |
+
- Fallback logic for unparseable LLM output
|
| 90 |
+
- Rate limiting with exponential backoff
|
| 91 |
+
|
| 92 |
+
### client.py — Client Library
|
| 93 |
+
- `MLOpsDebugEnv`: async httpx client
|
| 94 |
+
- `SyncMLOpsDebugEnv`: synchronous wrapper
|
| 95 |
+
- Context manager support for connection lifecycle
|
| 96 |
+
|
| 97 |
+
## API Endpoints
|
| 98 |
+
|
| 99 |
+
| Method | Path | Description |
|
| 100 |
+
|--------|------|-------------|
|
| 101 |
+
| GET | `/` | API info |
|
| 102 |
+
| GET | `/health` | Health check |
|
| 103 |
+
| GET | `/tasks` | List available tasks |
|
| 104 |
+
| POST | `/reset` | Start new episode |
|
| 105 |
+
| POST | `/step` | Execute action |
|
| 106 |
+
| GET | `/state` | Current episode state |
|
| 107 |
+
| GET | `/openenv/state` | OpenEnv framework state |
|
| 108 |
+
| WS | `/ws` | WebSocket interface |
|
| 109 |
+
|
| 110 |
+
## Reward Architecture
|
| 111 |
+
|
| 112 |
+
The reward function has two layers:
|
| 113 |
+
|
| 114 |
+
**Per-step (dense):** Encourages systematic investigation
|
| 115 |
+
- New artifact read: +0.02 (explore broadly)
|
| 116 |
+
- Duplicate read: -0.02 (don't brute force)
|
| 117 |
+
- New sanity check: +0.01 (use diagnostics)
|
| 118 |
+
|
| 119 |
+
**Terminal (graded):** Evaluates diagnosis quality
|
| 120 |
+
- 4 independent components sum to max 1.0
|
| 121 |
+
- Keyword/substring matching (no LLM judge)
|
| 122 |
+
- Hard task asymmetric penalty (1.5x on missed components)
|
| 123 |
+
|
| 124 |
+
This two-layer design means an agent that investigates thoroughly but diagnoses wrong still earns per-step rewards, while an agent that submits immediately with a lucky guess earns terminal reward but misses exploration bonuses.
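The per-step layer described above can be sketched as a small tracker. This is an illustration under stated assumptions, not the environment's actual code: the class name `StepRewardTracker` is hypothetical, reads are keyed on artifact plus filter, and a repeated sanity check is assumed to earn 0.0 because its reward is not specified here.

```python
from typing import Optional, Set, Tuple


class StepRewardTracker:
    """Tracks what the agent has already done and returns dense rewards."""

    def __init__(self) -> None:
        self.reads_seen: Set[Tuple[str, Optional[str]]] = set()
        self.checks_seen: Set[str] = set()

    def read(self, artifact: str, log_filter: Optional[str] = None) -> float:
        key = (artifact, log_filter)
        if key in self.reads_seen:
            return -0.02  # duplicate read of the same artifact and filter
        self.reads_seen.add(key)
        return 0.02  # first read

    def sanity_check(self, check_type: str) -> float:
        if check_type in self.checks_seen:
            return 0.0  # assumption: repeats earn nothing
        self.checks_seen.add(check_type)
        return 0.01  # new sanity check


tracker = StepRewardTracker()
rewards = [
    tracker.read("train.log"),
    tracker.read("train.log"),
    tracker.sanity_check("metric_gap_analysis"),
]
```

Keeping the tracker separate from the grader mirrors the two-layer split: exploration rewards accumulate independently of the terminal score.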
|
README.md
CHANGED
|
@@ -14,73 +14,81 @@ pinned: false
|
|
| 14 |
[](https://www.python.org)
|
| 15 |
[](LICENSE)
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
| **Average** | **0.92** |
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
|
| 30 |
---
|
| 31 |
|
| 32 |
-
##
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
-
**9 distinct bug types across 3
|
| 39 |
|
| 40 |
---
|
| 41 |
|
| 42 |
-
##
|
| 43 |
-
|
| 44 |
-
### Procedural Artifact Generation
|
| 45 |
|
| 46 |
-
Every episode generates 6
|
| 47 |
|
| 48 |
-
| Artifact | Contents |
|
| 49 |
-
|---|---|
|
| 50 |
-
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler
|
| 51 |
-
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms
|
| 52 |
-
| `dataset_stats.json` | Split sizes, class distribution, overlap counts
|
| 53 |
-
| `preprocessing.py` | Full sklearn/PyTorch
|
| 54 |
-
| `eval_results.json` | Final val/test metrics with hardware info |
|
| 55 |
-
| `model_card.json` | Architecture summary, tokenizer version
|
| 56 |
|
| 57 |
-
Artifacts are **internally consistent** — config matches logs, dataset stats match preprocessing code — except for the one planted fault.
|
| 58 |
|
| 59 |
---
|
| 60 |
|
| 61 |
-
## Action Space
|
| 62 |
|
| 63 |
```python
|
| 64 |
class MLOpsAction(BaseModel):
|
| 65 |
action_type: Literal[
|
| 66 |
-
"read_config",
|
| 67 |
-
"read_logs",
|
| 68 |
-
"check_dataset_stats",
|
| 69 |
-
"inspect_preprocessing",# Full preprocessing pipeline code
|
| 70 |
-
"read_eval_results",
|
| 71 |
-
"run_sanity_check",
|
| 72 |
-
"query_artifact",
|
| 73 |
-
"submit_diagnosis",
|
| 74 |
]
|
| 75 |
-
|
| 76 |
-
# Sanity check types:
|
| 77 |
-
# label_consistency | data_leakage | gradient_norms | class_balance
|
| 78 |
-
# feature_statistics | encoder_version_match | loss_trajectory | metric_gap_analysis
|
| 79 |
-
|
| 80 |
-
# submit_diagnosis fields:
|
| 81 |
-
# failure_category | root_cause_file | root_cause_field | diagnosis | proposed_fix
|
| 82 |
```
|
| 83 |
|
|
|
|
|
|
|
|
|
|
| 84 |
---
|
| 85 |
|
| 86 |
## Observation Space
|
|
@@ -91,8 +99,8 @@ class MLOpsObservation(BaseModel):
|
|
| 91 |
task_description: str # Full task brief with investigation strategy
|
| 92 |
run_id: str # Unique run identifier
|
| 93 |
run_summary: Dict[str, Any] # Model, dataset, training status
|
| 94 |
-
available_artifacts: List[ArtifactMeta] # What can be read
|
| 95 |
-
artifacts_read: List[str] # Investigation progress
|
| 96 |
last_action_result: Dict[str, Any] # Full content of last action
|
| 97 |
step_count: int
|
| 98 |
max_steps: int
|
|
@@ -102,84 +110,85 @@ class MLOpsObservation(BaseModel):
|
|
| 102 |
|
| 103 |
---
|
| 104 |
|
| 105 |
-
## Tasks
|
| 106 |
|
| 107 |
-
### Task 1 — Config Error Diagnosis `(easy)`
|
| 108 |
|
| 109 |
**Bug pool (one picked randomly per episode):**
|
| 110 |
-
- `exploding_lr` — `learning_rate: 50.0` causes loss
|
| 111 |
-
- `wrong_optimizer` — `SGD(momentum=0.99)` causes oscillation with no convergence
|
| 112 |
-
- `batch_size_overflow` — `batch_size: 4096` exceeds dataset size,
|
| 113 |
-
|
| 114 |
-
**Signal:** Visible immediately in training logs. Loss curve or accuracy values are obviously wrong.
|
| 115 |
|
| 116 |
-
**
|
| 117 |
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
---
|
| 121 |
-
|
| 122 |
-
### Task 2 — Data Leakage Detection `(medium)`
|
| 123 |
|
| 124 |
**Bug pool:**
|
| 125 |
- `data_leakage_scaler` — `StandardScaler.fit_transform(X_full)` called before train/val split
|
| 126 |
-
- `data_leakage_overlap` — `train_test_split(random_state=None)` produces
|
| 127 |
-
- `wrong_split_ratio` — `test_size=0.8` trains on 20% and evaluates on 80%
|
| 128 |
|
| 129 |
-
**Signal:**
|
| 130 |
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
Max steps: **30** | Expected baseline score: ~0.28
|
| 134 |
-
|
| 135 |
-
---
|
| 136 |
-
|
| 137 |
-
### Task 3 — Silent Evaluation Bug `(hard)`
|
| 138 |
|
| 139 |
**Bug pool:**
|
| 140 |
-
- `label_encoder_mismatch` — Train/eval use different `LabelEncoder.fit()` orderings
|
| 141 |
-
- `silent_metric_swap` — `val_accuracy` and `test_accuracy` assignments
|
| 142 |
-
- `tokenizer_version_drift` — Training uses tokenizer v2, eval uses v1
|
| 143 |
-
|
| 144 |
-
**Signal:** Training logs look completely normal. Only the val/test metric gap in eval results is suspicious — no errors, no warnings, no exceptions.
|
| 145 |
-
|
| 146 |
-
**Asymmetric penalty:** Missing a silent evaluation bug (which would affect production predictions) is penalized 1.5× — mirroring real incident severity weighting.
|
| 147 |
|
| 148 |
-
**
|
| 149 |
|
| 150 |
-
|
| 151 |
|
| 152 |
---
|
| 153 |
|
| 154 |
-
## Reward
|
| 155 |
|
| 156 |
-
**Dense per-step rewards** (not sparse):
|
| 157 |
|
| 158 |
```
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
+0.
|
| 166 |
-
+0.
|
| 167 |
-
+0.30 Correct
|
| 168 |
-
|
| 169 |
-
|
|
|
|
|
|
|
|
|
|
| 170 |
```
|
| 171 |
|
| 172 |
-
**
|
|
|
|
|
|
|
| 173 |
```
|
| 174 |
-
|
| 175 |
-
Category only
|
| 176 |
-
Category + file
|
| 177 |
-
Category + file + field →
|
| 178 |
-
Perfect diagnosis
|
| 179 |
```
|
| 180 |
|
| 181 |
---
|
| 182 |
|
| 183 |
## Setup & Usage
|
| 184 |
|
| 185 |
### Docker (recommended)
|
|
@@ -200,27 +209,19 @@ uvicorn app:app --host 0.0.0.0 --port 7860
|
|
| 200 |
### Python Client
|
| 201 |
|
| 202 |
```python
|
| 203 |
-
# Sync usage
|
| 204 |
from client import MLOpsDebugEnv
|
| 205 |
from models import MLOpsAction
|
| 206 |
|
| 207 |
with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
|
| 208 |
obs = env.reset(task_id="hard", seed=1)
|
| 209 |
-
print(obs.task_description)
|
| 210 |
|
| 211 |
-
# Investigate
|
| 212 |
r = env.step(MLOpsAction(action_type="read_eval_results"))
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
r = env.step(MLOpsAction(
|
| 216 |
-
action_type="run_sanity_check",
|
| 217 |
-
sanity_check_type="metric_gap_analysis"
|
| 218 |
-
))
|
| 219 |
-
# Reveals val/test gap anomaly
|
| 220 |
-
|
| 221 |
r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
|
| 222 |
-
# Shows the buggy pipeline code
|
| 223 |
|
|
|
|
| 224 |
r = env.step(MLOpsAction(
|
| 225 |
action_type="submit_diagnosis",
|
| 226 |
failure_category="label_mismatch",
|
|
@@ -232,63 +233,71 @@ with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
|
|
| 232 |
print(f"Score: {r.info['score']}")
|
| 233 |
```
|
| 234 |
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
## Baseline Inference Script
|
| 238 |
|
| 239 |
```bash
|
| 240 |
-
export
|
| 241 |
-
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
|
| 242 |
-
export HF_TOKEN="hf_your_token_here"
|
| 243 |
export ENV_BASE_URL="http://localhost:7860"
|
| 244 |
-
|
| 245 |
-
python inference.py # all 3 tasks, seed=42
|
| 246 |
python inference.py --task easy --seed 42
|
| 247 |
```
|
| 248 |
|
| 249 |
-
**Output format:**
|
| 250 |
```
|
| 251 |
-
[START] task=easy env=mlops-debug-env model=
|
| 252 |
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
|
| 253 |
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
|
| 254 |
[STEP] step=3 action=read_config reward=0.02 done=false error=null
|
| 255 |
-
[STEP] step=4 action=submit_diagnosis reward=0.
|
| 256 |
-
[END] success=true steps=4 rewards=0.02,0.01,0.02,0.
|
| 257 |
```
|
| 258 |
|
| 259 |
-
|
| 260 |
|
| 261 |
-
|
| 262 |
-
|---|---|---|
|
| 263 |
-
| easy | ~0.42 | Gets category right, struggles with exact field name |
|
| 264 |
-
| medium | ~0.28 | Often identifies leakage but misidentifies exact mechanism |
|
| 265 |
-
| hard | ~0.15 | Silent bugs with normal training logs are genuinely hard |
|
| 266 |
|
| 267 |
-
-
|
| 268 |
|
| 269 |
-
|
| 270 |
|
| 271 |
-
**
|
| 272 |
|
| 273 |
-
**
|
| 274 |
|
| 275 |
-
**
|
| 276 |
|
| 277 |
-
|
| 278 |
|
| 279 |
---
|
| 280 |
|
| 281 |
## Environment Variables
|
| 282 |
|
| 283 |
-
| Variable | Description |
|
| 284 |
-
|---|---|
|
| 285 |
-
| `
|
| 286 |
-
| `MODEL_NAME` |
|
| 287 |
-
| `
|
| 288 |
-
| `ENV_BASE_URL` |
|
| 289 |
|
| 290 |
---
|
| 291 |
|
| 292 |
## License
|
| 293 |
|
| 294 |
-
MIT
|
|
|
|
| 14 |
[](https://www.python.org)
|
| 15 |
[](LICENSE)
|
| 16 |
|
| 17 |
+
An **OpenEnv-compatible RL environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run. Built for the **Meta PyTorch Hackathon x Scaler School of Technology**.
|
| 18 |
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
## The Real-World Problem
|
| 22 |
+
|
| 23 |
+
Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown.
|
|
|
|
| 24 |
|
| 25 |
+
A senior engineer must systematically investigate — reading logs, checking configs, inspecting preprocessing code, running sanity checks — to find the root cause. **This is the #1 time sink in production ML operations**, and it's a skill that separates junior from senior ML engineers.
|
| 26 |
|
| 27 |
+
This environment simulates that investigation workflow. It's not a toy problem — it models the **actual top-3 failure modes** from production ML pipelines:
|
| 28 |
+
|
| 29 |
+
| Failure Mode | Real-World Frequency | Environment Task |
|
| 30 |
+
|---|---|---|
|
| 31 |
+
| Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
|
| 32 |
+
| Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
|
| 33 |
+
| Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |
|
| 34 |
|
| 35 |
---
|
| 36 |
|
| 37 |
+
## How It Works
|
| 38 |
|
| 39 |
+
At `reset()`, a complete set of **6 realistic training artifacts** is procedurally generated with one planted fault. The agent investigates using **8 structured actions** and submits a diagnosis. The grader checks against ground truth — **fully deterministic, no LLM judge**.
|
| 40 |
|
| 41 |
+
```
|
| 42 |
+
reset(task_id="hard", seed=42)
|
| 43 |
+
│
|
| 44 |
+
├── Generates: config.yaml, train.log, dataset_stats.json,
|
| 45 |
+
│ preprocessing.py, eval_results.json, model_card.json
|
| 46 |
+
│
|
| 47 |
+
├── Plants: one bug from the task's 3-bug pool
|
| 48 |
+
│
|
| 49 |
+
└── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]
|
| 50 |
+
```
|
| 51 |
|
| 52 |
+
**9 distinct bug types across 3 difficulty tiers. Every episode can have a different bug. Scores vary continuously based on diagnosis precision.**
|
| 53 |
|
| 54 |
---
|
| 55 |
|
| 56 |
+
## Procedural Artifact Generation
|
|
|
|
|
|
|
| 57 |
|
| 58 |
+
Every episode generates 6 internally consistent training artifacts from scratch:
|
| 59 |
|
| 60 |
+
| Artifact | Contents | Role in Investigation |
|
| 61 |
+
|---|---|---|
|
| 62 |
+
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
|
| 63 |
+
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
|
| 64 |
+
| `dataset_stats.json` | Split sizes, class distribution, overlap counts | Detect data issues |
|
| 65 |
+
| `preprocessing.py` | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
|
| 66 |
+
| `eval_results.json` | Final val/test metrics with hardware info | Quantify metric gaps |
|
| 67 |
+
| `model_card.json` | Architecture summary, tokenizer version | Cross-reference versions |
|
| 68 |
|
| 69 |
+
Artifacts are **internally consistent** — config matches logs, dataset stats match preprocessing code — except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.
|
| 70 |
|
| 71 |
---
|
| 72 |
|
| 73 |
+
## Action Space (8 actions)
|
| 74 |
|
| 75 |
```python
|
| 76 |
class MLOpsAction(BaseModel):
|
| 77 |
action_type: Literal[
|
| 78 |
+
"read_config", # Full training configuration
|
| 79 |
+
"read_logs", # Training logs (filterable: keyword or "epoch:N-M")
|
| 80 |
+
"check_dataset_stats", # Split sizes, class distribution, overlap counts
|
| 81 |
+
"inspect_preprocessing", # Full preprocessing pipeline code
|
| 82 |
+
"read_eval_results", # Final val/test metrics
|
| 83 |
+
"run_sanity_check", # Computed diagnostic check (8 types)
|
| 84 |
+
"query_artifact", # Specific field from any artifact (dot notation)
|
| 85 |
+
"submit_diagnosis", # Final answer — triggers grading
|
| 86 |
]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 87 |
```
|
| 88 |
|
| 89 |
+
**Sanity check types** (computed diagnostics, not just artifact reads):
|
| 90 |
+
`label_consistency` | `data_leakage` | `gradient_norms` | `class_balance` | `feature_statistics` | `encoder_version_match` | `loss_trajectory` | `metric_gap_analysis`
|
| 91 |
+
|
| 92 |
---
|
| 93 |
|
| 94 |
## Observation Space
|
|
|
|
| 99 |
task_description: str # Full task brief with investigation strategy
|
| 100 |
run_id: str # Unique run identifier
|
| 101 |
run_summary: Dict[str, Any] # Model, dataset, training status
|
| 102 |
+
available_artifacts: List[ArtifactMeta] # What can be read (name, description, size)
|
| 103 |
+
artifacts_read: List[str] # Investigation progress tracking
|
| 104 |
last_action_result: Dict[str, Any] # Full content of last action
|
| 105 |
step_count: int
|
| 106 |
max_steps: int
|
|
|
|
| 110 |
|
| 111 |
---
|
| 112 |
|
| 113 |
+
## Tasks & Difficulty Progression
|
| 114 |
|
| 115 |
+
### Task 1 — Config Error Diagnosis `(easy)` | 20 steps max
|
| 116 |
|
| 117 |
**Bug pool (one picked randomly per episode):**
|
| 118 |
+
- `exploding_lr` — `learning_rate: 50.0` causes loss to diverge to NaN by epoch 3
|
| 119 |
+
- `wrong_optimizer` — `SGD(momentum=0.99)` causes loss oscillation with no convergence
|
| 120 |
+
- `batch_size_overflow` — `batch_size: 4096` exceeds dataset size, trivial overfitting
|
|
|
|
|
|
|
| 121 |
|
| 122 |
+
**Signal strength:** High. Symptoms visible immediately in training logs.
|
| 123 |
|
| 124 |
+
### Task 2 — Data Leakage Detection `(medium)` | 30 steps max
|
|
|
|
|
|
|
|
|
|
|
|
|
| 125 |
|
| 126 |
**Bug pool:**
|
| 127 |
- `data_leakage_scaler` — `StandardScaler.fit_transform(X_full)` called before train/val split
|
| 128 |
+
- `data_leakage_overlap` — `train_test_split(random_state=None)` produces overlapping splits
|
| 129 |
+
- `wrong_split_ratio` — `test_size=0.8` trains on 20% and evaluates on 80%
|
| 130 |
|
| 131 |
+
**Signal strength:** Medium. Requires correlating val accuracy anomaly in logs with preprocessing code.
|
| 132 |
|
| 133 |
+
### Task 3 — Silent Evaluation Bug `(hard)` | 40 steps max
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 134 |
|
| 135 |
**Bug pool:**
|
| 136 |
+
- `label_encoder_mismatch` — Train/eval use different `LabelEncoder.fit()` orderings
|
| 137 |
+
- `silent_metric_swap` — `val_accuracy` and `test_accuracy` assignments swapped in eval code
|
| 138 |
+
- `tokenizer_version_drift` — Training uses tokenizer v2, eval uses v1 (847 tokens map to `[UNK]`)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 139 |
|
| 140 |
+
**Signal strength:** Low. Training logs look completely normal. Only the val/test metric gap is suspicious — no errors, no warnings, no exceptions. Requires reasoning about what's *absent*.
|
| 141 |
|
| 142 |
+
**Asymmetric penalty:** Missing a silent evaluation bug is penalized 1.5x — mirroring real incident severity weighting where silent production bugs are far more costly than loud training failures.
|
| 143 |
|
| 144 |
---
|
| 145 |
|
| 146 |
+
## Reward Design
|
| 147 |
|
| 148 |
+
**Dense per-step rewards** (not sparse — provides learning signal throughout the episode):
|
| 149 |
|
| 150 |
```
|
| 151 |
+
Investigation phase:
|
| 152 |
+
+0.02 First time reading an artifact (rewards systematic exploration)
|
| 153 |
+
-0.02 Re-reading same artifact+filter (penalizes brute force)
|
| 154 |
+
+0.01 Running a new sanity check (rewards diagnostic reasoning)
|
| 155 |
+
|
| 156 |
+
Diagnosis grading (4 independent components):
|
| 157 |
+
+0.15 Correct failure_category (what kind of bug?)
|
| 158 |
+
+0.25 Correct root_cause_file (which file contains it?)
|
| 159 |
+
+0.30 Correct root_cause_field (which parameter/function?)
|
| 160 |
+
+0.30 Correct proposed_fix (keyword overlap with gold fix)
|
| 161 |
+
|
| 162 |
+
Task 3 modifier:
|
| 163 |
+
If score < 0.70 → additional 0.5x penalty on missed components
|
| 164 |
+
(silent bugs reaching production are more costly than loud failures)
|
| 165 |
```
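The four grading weights above can be combined in a short sketch. It is a plain weighted sum under stated assumptions: the function name `terminal_score` is hypothetical, the fix component is assumed to scale linearly with a keyword-overlap fraction in [0, 1], and the Task 3 modifier is deliberately left out because its exact formula is not given here. Scores from this sketch may therefore differ from the environment's real output.

```python
WEIGHTS = {"category": 0.15, "file": 0.25, "field": 0.30, "fix": 0.30}


def terminal_score(category_ok: bool, file_ok: bool, field_ok: bool,
                   fix_overlap: float) -> float:
    """Weighted sum of the four components, clamped to [0.01, 0.99]."""
    if not 0.0 <= fix_overlap <= 1.0:
        raise ValueError("fix_overlap must be between 0 and 1")
    raw = (
        WEIGHTS["category"] * category_ok
        + WEIGHTS["file"] * file_ok
        + WEIGHTS["field"] * field_ok
        + WEIGHTS["fix"] * fix_overlap  # assumption: linear in keyword overlap
    )
    # Clamp to the documented reward range and round to 4 decimal places.
    return round(min(0.99, max(0.01, raw)), 4)


partial = terminal_score(True, True, False, 0.0)
perfect = terminal_score(True, True, True, 1.0)
```

Taking booleans for the first three components keeps the matching logic (substring or keyword checks) out of the sketch, since that logic is not shown in this document.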
|
| 166 |
|
| 167 |
+
**Why dense rewards?** Sparse terminal-only rewards make it impossible to distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." Per-step rewards incentivize thorough investigation and penalize lazy repetition, while the 4-component terminal grading gives partial credit for partially correct diagnoses.
|
| 168 |
+
|
| 169 |
+
**Score spectrum:**
|
| 170 |
```
|
| 171 |
+
No investigation, wrong diagnosis → 0.01
|
| 172 |
+
Category only correct → 0.10–0.15
|
| 173 |
+
Category + file correct → 0.35–0.40
|
| 174 |
+
Category + file + field correct → 0.65
|
| 175 |
+
Perfect diagnosis → 0.90–0.99
|
| 176 |
```
|
| 177 |
|
| 178 |
---
|
| 179 |
|
| 180 |
+
## Baseline Scores
|
| 181 |
+
|
| 182 |
+
| Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
|
| 183 |
+
|---|---|---|
|
| 184 |
+
| Easy | ~0.42 | ~0.91 |
|
| 185 |
+
| Medium | ~0.28 | ~0.85 |
|
| 186 |
+
| Hard | ~0.15 | ~0.92 |
|
| 187 |
+
|
| 188 |
+
The baseline agent (no task-specific prompting) struggles significantly on medium and hard tasks, confirming meaningful difficulty progression.
|
| 189 |
+
|
| 190 |
+
---
|
| 191 |
+
|
| 192 |
## Setup & Usage
|
| 193 |
|
| 194 |
### Docker (recommended)
|
|
|
|
| 209 |
### Python Client
|
| 210 |
|
| 211 |
```python
|
|
|
|
| 212 |
from client import MLOpsDebugEnv
|
| 213 |
from models import MLOpsAction
|
| 214 |
|
| 215 |
with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
|
| 216 |
obs = env.reset(task_id="hard", seed=1)
|
|
|
|
| 217 |
|
| 218 |
+
# Investigate
|
| 219 |
r = env.step(MLOpsAction(action_type="read_eval_results"))
|
| 220 |
+
r = env.step(MLOpsAction(action_type="run_sanity_check",
|
| 221 |
+
sanity_check_type="metric_gap_analysis"))
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 222 |
r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
|
|
|
|
| 223 |
|
| 224 |
+
# Diagnose
|
| 225 |
r = env.step(MLOpsAction(
|
| 226 |
action_type="submit_diagnosis",
|
| 227 |
failure_category="label_mismatch",
|
|
|
|
| 233 |
print(f"Score: {r.info['score']}")
|
| 234 |
```
|
| 235 |
|
| 236 |
+
### Inference Script
|
|
|
|
|
|
|
| 237 |
|
| 238 |
```bash
|
| 239 |
+
export GEMINI_API_KEY="your_key"
|
|
|
|
|
|
|
| 240 |
export ENV_BASE_URL="http://localhost:7860"
|
| 241 |
+
python inference.py # all 3 tasks
|
|
|
|
| 242 |
python inference.py --task easy --seed 42
|
| 243 |
```
|
| 244 |
|
| 245 |
+
**Output format (OpenEnv standard):**
|
| 246 |
```
|
| 247 |
+
[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
|
| 248 |
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
|
| 249 |
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
|
| 250 |
[STEP] step=3 action=read_config reward=0.02 done=false error=null
|
| 251 |
+
[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
|
| 252 |
+
[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
|
| 253 |
```
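For downstream analysis, the `[STEP]` lines shown above can be parsed with a regular expression. This is a minimal sketch that assumes every step line follows exactly the field order in the sample output; `parse_step` is a hypothetical helper and is not part of the repository.

```python
import re
from typing import Any, Dict

# Field order follows the sample output: step, action, reward, done, error.
STEP_RE = re.compile(
    r"^\[STEP\]\s+step=(?P<step>\d+)\s+action=(?P<action>\S+)\s+"
    r"reward=(?P<reward>-?\d+(?:\.\d+)?)\s+done=(?P<done>true|false)\s+"
    r"error=(?P<error>\S+)$"
)


def parse_step(line: str) -> Dict[str, Any]:
    """Turn one [STEP] log line into a typed dictionary."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        # The literal string "null" is mapped to None.
        "error": None if match["error"] == "null" else match["error"],
    }


parsed = parse_step(
    "[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null"
)
```

Lines that do not match, such as `[START]` and `[END]`, raise `ValueError`, so a caller can filter on the prefix before parsing.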
|
| 254 |
|
| 255 |
+
---
|
| 256 |
|
| 257 |
+
## Design Decisions
|
|
|
|
|
|
|
|
|
|
|
|
|
| 258 |
|
| 259 |
+
**Why MLOps debugging?** Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark — it models a real workflow.
|
| 260 |
|
| 261 |
+
**Why procedural generation?** Fixed bug scenarios would let agents memorize answers. Our seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.
|
| 262 |
|
| 263 |
+
**Why deterministic grading?** LLM-as-judge introduces variance and bias. Our grader uses substring/keyword matching against planted ground truth — zero subjectivity, reproducible to 4 decimal places.
|
| 264 |
|
| 265 |
+
**Why asymmetric penalties?** In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.
|
| 266 |
|
| 267 |
+
**Why 8 sanity check types?** Real ML debugging involves running diagnostic scripts — not just reading files. Our computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.
|
| 268 |
|
| 269 |
+
---
|
| 270 |
+
|
| 271 |
+
## Project Structure
|
| 272 |
+
|
| 273 |
+
```
|
| 274 |
+
MLops-Openenvhack/
|
| 275 |
+
├── app.py # FastAPI server (REST + WebSocket)
|
| 276 |
+
├── mlops_environment.py # Core environment: reset/step/grading
|
| 277 |
+
├── artifact_generator.py # Procedural artifact + bug generation
|
| 278 |
+
├── models.py # Pydantic models (Action, Observation, State)
|
| 279 |
+
├── inference.py # LLM baseline agent
|
| 280 |
+
├── client.py # Python client library (async + sync)
|
| 281 |
+
├── openenv_state.py # Global state singleton
|
| 282 |
+
├── openenv.yaml # OpenEnv specification
|
| 283 |
+
├── Dockerfile # Container configuration
|
| 284 |
+
├── requirements.txt # Python dependencies
|
| 285 |
+
└── server/ # HF Space deployment copy
|
| 286 |
+
```
|
| 287 |
|
| 288 |
---
|
| 289 |
|
| 290 |
## Environment Variables
|
| 291 |
|
| 292 |
+
| Variable | Required | Default | Description |
|
| 293 |
+
|---|---|---|---|
|
| 294 |
+
| `GEMINI_API_KEY` | Yes (for inference) | — | Gemini API key for baseline agent |
|
| 295 |
+
| `MODEL_NAME` | No | `gemini-2.5-flash` | LLM model identifier |
|
| 296 |
+
| `API_BASE_URL` | No | Gemini endpoint | OpenAI-compatible API base URL |
|
| 297 |
+
| `ENV_BASE_URL` | No | `http://localhost:7860` | Environment server URL |
|
| 298 |
|
| 299 |
---
|
| 300 |
|
| 301 |
## License
|
| 302 |
|
| 303 |
+
MIT
|
openenv.yaml
CHANGED
|
@@ -5,51 +5,114 @@ description: >
|
|
| 5 |
investigating a broken training run. The environment procedurally generates
|
| 6 |
realistic training artifacts (logs, configs, preprocessing code, eval results)
|
| 7 |
with one planted fault. The agent must systematically investigate and submit
|
| 8 |
-
a structured diagnosis. Three tasks: config error (easy)
|
| 9 |
-
|
| 10 |
-
author:
|
| 11 |
license: MIT
|
| 12 |
-
tags: [openenv, rl, mlops, debugging, machine-learning, agents]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
tasks:
|
| 14 |
- id: easy
|
| 15 |
name: Config Error Diagnosis
|
| 16 |
difficulty: easy
|
| 17 |
max_steps: 20
|
| 18 |
bug_pool: [exploding_lr, wrong_optimizer, batch_size_overflow]
|
| 19 |
-
reward_range: [0.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 20 |
- id: medium
|
| 21 |
name: Data Leakage Detection
|
| 22 |
difficulty: medium
|
| 23 |
max_steps: 30
|
| 24 |
bug_pool: [data_leakage_scaler, data_leakage_overlap, wrong_split_ratio]
|
| 25 |
-
reward_range: [0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
- id: hard
|
| 27 |
name: Silent Evaluation Bug
|
| 28 |
difficulty: hard
|
| 29 |
max_steps: 40
|
| 30 |
bug_pool: [label_encoder_mismatch, silent_metric_swap, tokenizer_version_drift]
|
| 31 |
-
reward_range: [0.
|
| 32 |
asymmetric_penalty: true
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
action_space:
|
| 34 |
type: discrete_structured
|
| 35 |
-
actions:
|
| 36 |
-
|
| 37 |
observation_space:
|
| 38 |
type: structured_text
|
| 39 |
-
fields:
|
| 40 |
-
|
| 41 |
reward:
|
| 42 |
type: dense_and_terminal
|
| 43 |
-
per_step:
|
| 44 |
-
|
| 45 |
api:
|
| 46 |
reset: POST /reset
|
| 47 |
step: POST /step
|
| 48 |
state: GET /state
|
| 49 |
health: GET /health
|
|
|
|
|
|
|
| 50 |
websocket: /ws
|
|
|
|
| 51 |
runtime:
|
| 52 |
port: 7860
|
| 53 |
workers: 1
|
| 54 |
framework: fastapi
|
| 55 |
python: "3.11"
|
|
|
|
|
|
| 5 |
investigating a broken training run. The environment procedurally generates
|
| 6 |
realistic training artifacts (logs, configs, preprocessing code, eval results)
|
| 7 |
with one planted fault. The agent must systematically investigate and submit
|
| 8 |
+
a structured diagnosis. Three tasks: config error (easy) -> data leakage (medium)
|
| 9 |
+
-> silent evaluation bug (hard). All graders are fully deterministic.
|
| 10 |
+
author: Code Clashers
|
| 11 |
license: MIT
|
| 12 |
+
tags: [openenv, rl, mlops, debugging, machine-learning, agents, pytorch]
|
| 13 |
+
|
| 14 |
+
grading:
|
| 15 |
+
type: deterministic
|
| 16 |
+
judge: none
|
| 17 |
+
method: keyword_and_substring_matching
|
| 18 |
+
reproducible: true
|
| 19 |
+
|
| 20 |
tasks:
|
| 21 |
- id: easy
|
| 22 |
name: Config Error Diagnosis
|
| 23 |
difficulty: easy
|
| 24 |
max_steps: 20
|
| 25 |
bug_pool: [exploding_lr, wrong_optimizer, batch_size_overflow]
|
| 26 |
+
reward_range: [0.01, 0.99]
|
| 27 |
+
description: >
|
| 28 |
+
Diagnose a training failure caused by a hyperparameter misconfiguration.
|
| 29 |
+
Symptoms are visible in training logs (loss explosion, oscillation, trivial overfitting).
|
| 30 |
+
|
| 31 |
- id: medium
|
| 32 |
name: Data Leakage Detection
|
| 33 |
difficulty: medium
|
| 34 |
max_steps: 30
|
| 35 |
bug_pool: [data_leakage_scaler, data_leakage_overlap, wrong_split_ratio]
|
| 36 |
+
reward_range: [0.01, 0.99]
|
| 37 |
+
description: >
|
| 38 |
+
Identify data leakage in the preprocessing pipeline. Val accuracy is suspiciously
|
| 39 |
+
high from epoch 1, but test performance tells a different story. Requires correlating
|
| 40 |
+
logs, eval results, and preprocessing code.
|
| 41 |
+
|
| 42 |
- id: hard
|
| 43 |
name: Silent Evaluation Bug
|
| 44 |
difficulty: hard
|
| 45 |
max_steps: 40
|
| 46 |
bug_pool: [label_encoder_mismatch, silent_metric_swap, tokenizer_version_drift]
|
| 47 |
+
reward_range: [0.01, 0.99]
|
| 48 |
asymmetric_penalty: true
|
| 49 |
+
penalty_multiplier: 1.5
|
| 50 |
+
description: >
|
| 51 |
+
Find a silent bug in the evaluation pipeline. Training logs look completely normal.
|
| 52 |
+
No errors, no warnings. Only a val/test metric gap reveals the issue. Requires
|
| 53 |
+
reasoning about what is absent rather than what is present.
|
| 54 |
+
|
| 55 |
action_space:
|
| 56 |
type: discrete_structured
|
| 57 |
+
actions:
|
| 58 |
+
- read_config
|
| 59 |
+
- read_logs
|
| 60 |
+
- check_dataset_stats
|
| 61 |
+
- inspect_preprocessing
|
| 62 |
+
- read_eval_results
|
| 63 |
+
- run_sanity_check
|
| 64 |
+
- query_artifact
|
| 65 |
+
- submit_diagnosis
|
| 66 |
+
sanity_check_types:
|
| 67 |
+
- label_consistency
|
| 68 |
+
- data_leakage
|
| 69 |
+
- gradient_norms
|
| 70 |
+
- class_balance
|
| 71 |
+
- feature_statistics
|
| 72 |
+
- encoder_version_match
|
| 73 |
+
- loss_trajectory
|
| 74 |
+
- metric_gap_analysis
|
| 75 |
+
|
| 76 |
observation_space:
|
| 77 |
type: structured_text
|
| 78 |
+
fields:
|
| 79 |
+
- task_id
|
| 80 |
+
- task_description
|
| 81 |
+
- run_id
|
| 82 |
+
- run_summary
|
| 83 |
+
- available_artifacts
|
| 84 |
+
- artifacts_read
|
| 85 |
+
- last_action_result
|
| 86 |
+
- step_count
|
| 87 |
+
- max_steps
|
| 88 |
+
- done
|
| 89 |
+
- messages
|
| 90 |
+
|
| 91 |
reward:
|
| 92 |
type: dense_and_terminal
|
| 93 |
+
per_step:
|
| 94 |
+
new_artifact_read: +0.02
|
| 95 |
+
duplicate_read: -0.02
|
| 96 |
+
new_sanity_check: +0.01
|
| 97 |
+
terminal:
|
| 98 |
+
failure_category: +0.15
|
| 99 |
+
root_cause_file: +0.25
|
| 100 |
+
root_cause_field: +0.30
|
| 101 |
+
proposed_fix: +0.30
|
| 102 |
+
hard_task_penalty: "if score < 0.70, additional 0.5x on missed components"
|
| 103 |
+
|
| 104 |
api:
|
| 105 |
reset: POST /reset
|
| 106 |
step: POST /step
|
| 107 |
state: GET /state
|
| 108 |
health: GET /health
|
| 109 |
+
tasks: GET /tasks
|
| 110 |
+
openenv_state: GET /openenv/state
|
| 111 |
websocket: /ws
|
| 112 |
+
|
| 113 |
runtime:
|
| 114 |
port: 7860
|
| 115 |
workers: 1
|
| 116 |
framework: fastapi
|
| 117 |
python: "3.11"
|
| 118 |
+
container: docker
|
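The `reward.terminal` weights and the hard-task penalty in the openenv.yaml diff above can be read as the following scoring rule. This is a minimal sketch, not the grader in `_handle_submit`: the function name `terminal_score`, the `matched` dictionary, and the exact way the penalty is applied (subtracting 0.5x the weight of the missed components when the raw score is below 0.70) are assumptions based on the `hard_task_penalty` string and `penalty_multiplier: 1.5`. Weights are held as integer hundredths so the 0.70 threshold comparison is exact.

```python
# Weights from `reward.terminal` in openenv.yaml, in hundredths of a point.
TERMINAL_WEIGHTS = {
    "failure_category": 15,
    "root_cause_file": 25,
    "root_cause_field": 30,
    "proposed_fix": 30,
}
REWARD_RANGE = (0.01, 0.99)   # `reward_range` declared for every task
HARD_THRESHOLD = 70           # "if score < 0.70", in hundredths
EXTRA_PENALTY = 0.5           # assumed meaning of "additional 0.5x"


def terminal_score(matched: dict[str, bool], hard: bool = False) -> float:
    """Combine per-component matches into a clamped terminal score.

    `matched` maps component names to whether the diagnosis got them right.
    The hard-task branch is an assumed reading of `hard_task_penalty`.
    """
    earned = sum(w for name, w in TERMINAL_WEIGHTS.items() if matched.get(name, False))
    score = float(earned)
    if hard and earned < HARD_THRESHOLD:
        missed = sum(TERMINAL_WEIGHTS.values()) - earned
        # Assumption: missed components cost an extra 0.5x of their weight.
        score -= EXTRA_PENALTY * missed
    low, high = REWARD_RANGE
    return min(high, max(low, score / 100))
```

Under this reading, a diagnosis that gets only the category and file right scores 0.40 on the easy and medium tasks but 0.10 on the hard task, and a fully correct diagnosis is capped at 0.99 by `reward_range`.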
pyproject.toml
CHANGED
|
@@ -5,7 +5,7 @@ description = "MLOps Pipeline Debugger - OpenEnv-compatible RL environment for M
|
|
| 5 |
readme = "README.md"
|
| 6 |
requires-python = ">=3.11"
|
| 7 |
license = {text = "MIT"}
|
| 8 |
-
authors = [{name = "
|
| 9 |
|
| 10 |
dependencies = [
|
| 11 |
"fastapi>=0.115.0",
|
|
@@ -21,7 +21,8 @@ dependencies = [
|
|
| 21 |
]
|
| 22 |
|
| 23 |
[project.scripts]
|
| 24 |
-
server = "uvicorn:main"
|
| 25 |
|
| 26 |
[project.optional-dependencies]
|
| 27 |
dev = [
|
| 5 |
readme = "README.md"
|
| 6 |
requires-python = ">=3.11"
|
| 7 |
license = {text = "MIT"}
|
| 8 |
+
authors = [{name = "Code Clashers"}]
|
| 9 |
|
| 10 |
dependencies = [
|
| 11 |
"fastapi>=0.115.0",
|
| 21 |
]
|
| 22 |
|
| 23 |
[project.scripts]
|
| 24 |
+
mlops-server = "uvicorn:main"
|
| 25 |
+
mlops-infer = "inference:main"
|
| 26 |
|
| 27 |
[project.optional-dependencies]
|
| 28 |
dev = [
|