Spaces:
Sleeping
Sleeping
File size: 11,243 Bytes
a97c7e7 7e782aa e782518 7e782aa e782518 7e782aa | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 | ---
title: MLOps Pipeline Debugger
emoji: π§
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: "1.0"
python_version: "3.11"
app_file: app.py
pinned: false
---
# MLOps Pipeline Debugger
[](https://github.com/meta-pytorch/OpenEnv)
[](https://www.python.org)
[](LICENSE)
## Latest Baseline Scores
| Task | Score |
|------|-------|
| Easy | 0.91 |
| Medium | 0.85 |
| Hard | 1.00 |
| **Average** | **0.92** |
*Tested with Gemini 2.5 Flash + Gemini 3.1 Pro Preview fallback for hard task*
An **OpenEnv-compatible reinforcement learning environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run.
---
## What Is This?
Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown. An engineer must systematically investigate β reading logs, checking configs, inspecting preprocessing code, running sanity checks β to find the root cause.
This environment simulates that investigation. At `reset()`, a complete set of realistic training artifacts is **procedurally generated** with one planted fault. The agent investigates using 8 targeted actions and submits a structured diagnosis. The grader checks against the planted ground truth β **fully deterministic, no LLM judge needed**.
**9 distinct bug types across 3 tasks. Every episode can have a different bug. Scores vary continuously 0.0 β 1.0 based on diagnosis precision.**
---
## Environment Design
### Procedural Artifact Generation
Every episode generates 6 realistic training artifacts from scratch:
| Artifact | Contents |
|---|---|
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler, augmentation |
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms with realistic timestamps |
| `dataset_stats.json` | Split sizes, class distribution, overlap counts, feature statistics |
| `preprocessing.py` | Full sklearn/PyTorch preprocessing pipeline code |
| `eval_results.json` | Final val/test metrics with hardware info |
| `model_card.json` | Architecture summary, tokenizer version, preprocessing config |
Artifacts are **internally consistent** β config matches logs, dataset stats match preprocessing code β except for the one planted fault. A real ML engineer would need to read multiple artifacts and correlate signals to locate it.
---
## Action Space
```python
class MLOpsAction(BaseModel):
action_type: Literal[
"read_config", # Full config.yaml
"read_logs", # Training logs (filterable: keyword or "epoch:N-M")
"check_dataset_stats", # Split sizes, class distribution, overlap counts
"inspect_preprocessing",# Full preprocessing pipeline code
"read_eval_results", # Final val/test metrics
"run_sanity_check", # Computed diagnostic (see types below)
"query_artifact", # Specific field from any artifact (dot notation)
"submit_diagnosis", # Final answer β triggers grading
]
# Sanity check types:
# label_consistency | data_leakage | gradient_norms | class_balance
# feature_statistics | encoder_version_match | loss_trajectory | metric_gap_analysis
# submit_diagnosis fields:
# failure_category | root_cause_file | root_cause_field | diagnosis | proposed_fix
```
---
## Observation Space
```python
class MLOpsObservation(BaseModel):
task_id: str # easy | medium | hard
task_description: str # Full task brief with investigation strategy
run_id: str # Unique run identifier
run_summary: Dict[str, Any] # Model, dataset, training status
available_artifacts: List[ArtifactMeta] # What can be read
artifacts_read: List[str] # Investigation progress
last_action_result: Dict[str, Any] # Full content of last action
step_count: int
max_steps: int
done: bool
messages: List[str] # System warnings (duplicate reads, etc.)
```
---
## Tasks
### Task 1 β Config Error Diagnosis `(easy)`
**Bug pool (one picked randomly per episode):**
- `exploding_lr` β `learning_rate: 50.0` causes loss β NaN by epoch 3
- `wrong_optimizer` β `SGD(momentum=0.99)` causes oscillation with no convergence
- `batch_size_overflow` β `batch_size: 4096` exceeds dataset size, val accuracy 99.9% trivially
**Signal:** Visible immediately in training logs. Loss curve or accuracy values are obviously wrong.
**Optimal strategy:** `read_logs` β `run_sanity_check(loss_trajectory)` β `read_config` β `submit_diagnosis`
Max steps: **20** | Expected baseline score: ~0.42
---
### Task 2 β Data Leakage Detection `(medium)`
**Bug pool:**
- `data_leakage_scaler` β `StandardScaler.fit_transform(X_full)` called before train/val split
- `data_leakage_overlap` β `train_test_split(random_state=None)` produces non-deterministic overlapping splits
- `wrong_split_ratio` β `test_size=0.8` trains on 20% and evaluates on 80% (inverted)
**Signal:** Val accuracy suspiciously high from epoch 1 in logs; val/test gap in eval results; sample overlap count in dataset stats.
**Optimal strategy:** `read_logs` β `read_eval_results` β `run_sanity_check(data_leakage)` β `inspect_preprocessing` β `submit_diagnosis`
Max steps: **30** | Expected baseline score: ~0.28
---
### Task 3 β Silent Evaluation Bug `(hard)`
**Bug pool:**
- `label_encoder_mismatch` β Train/eval use different `LabelEncoder.fit()` orderings β silent wrong predictions
- `silent_metric_swap` β `val_accuracy` and `test_accuracy` assignments are swapped in eval code
- `tokenizer_version_drift` β Training uses tokenizer v2, eval uses v1 β 847 tokens map to `[UNK]`
**Signal:** Training logs look completely normal. Only the val/test metric gap in eval results is suspicious β no errors, no warnings, no exceptions.
**Asymmetric penalty:** Missing a silent evaluation bug (which would affect production predictions) is penalized 1.5Γ β mirroring real incident severity weighting.
**Optimal strategy:** `read_eval_results` β `run_sanity_check(metric_gap_analysis)` β `inspect_preprocessing` β `run_sanity_check(label_consistency OR encoder_version_match)` β `submit_diagnosis`
Max steps: **40** | Expected baseline score: ~0.15
---
## Reward Function
**Dense per-step rewards** (not sparse):
```
+0.02 First time reading an artifact (rewards systematic exploration)
-0.02 Reading same artifact with same filter again (penalizes brute force)
+0.01 Running a new sanity check (rewards diagnostic reasoning)
At submit_diagnosis:
+0.15 Correct failure_category (config_error / data_leakage / evaluation_bug / ...)
+0.25 Correct root_cause_file (exact match)
+0.30 Correct root_cause_field (substring match, case-insensitive)
+0.30 Correct proposed_fix (keyword overlap with gold fix)
Task 3 modifier: if score < 0.70, additional 0.5Γ penalty on missed components
```
**Score spectrum** (verified):
```
All wrong β 0.00
Category only β 0.10β0.15
Category + file β 0.35β0.40
Category + file + field β 0.65
Perfect diagnosis β 0.90β1.00
```
---
## Setup & Usage
### Docker (recommended)
```bash
docker build -t mlops-debug-env .
docker run -p 7860:7860 mlops-debug-env
curl http://localhost:7860/health
```
### Local Python
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
### Python Client
```python
# Sync usage
from client import MLOpsDebugEnv
from models import MLOpsAction
with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
obs = env.reset(task_id="hard", seed=1)
print(obs.task_description)
# Investigate systematically
r = env.step(MLOpsAction(action_type="read_eval_results"))
print(r.observation.last_action_result["content"])
r = env.step(MLOpsAction(
action_type="run_sanity_check",
sanity_check_type="metric_gap_analysis"
))
# Reveals val/test gap anomaly
r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
# Shows the buggy pipeline code
r = env.step(MLOpsAction(
action_type="submit_diagnosis",
failure_category="label_mismatch",
root_cause_file="preprocessing.py",
root_cause_field="LabelEncoder.fit_order",
diagnosis="Train and eval use different LabelEncoder orderings",
proposed_fix="Use single LabelEncoder instance across both pipelines"
))
print(f"Score: {r.info['score']}")
```
---
## Baseline Inference Script
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_your_token_here"
export ENV_BASE_URL="http://localhost:7860"
python inference.py # all 3 tasks, seed=42
python inference.py --task easy --seed 42
```
**Output format:**
```
[START] task=easy env=mlops-debug-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.95 done=true error=null
[END] success=true steps=4 rewards=0.02,0.01,0.02,0.95
```
**Baseline scores** (Qwen2.5-72B-Instruct, seed=42):
| Task | Score | Notes |
|---|---|---|
| easy | ~0.42 | Gets category right, struggles with exact field name |
| medium | ~0.28 | Often identifies leakage but misidentifies exact mechanism |
| hard | ~0.15 | Silent bugs with normal training logs are genuinely hard |
---
## Why This Environment
**Real problem.** Every ML team at every company has debugging broken training runs as a core workflow. The three bug categories in this environment β config errors, data leakage, silent evaluation bugs β are the actual top-3 failure modes in production ML pipelines.
**Deterministic grading.** The planted bug is ground truth. Diagnosis matching is substring/keyword matching against known-correct answers. Zero subjectivity, zero LLM-as-judge, reproducible across runs.
**Genuinely hard for frontier models.** Task 3 (silent evaluation bugs) requires reasoning about what's *absent* β no error signals, normal training logs β and tracing backwards from a metric anomaly to a pipeline version mismatch. State-of-the-art models score ~0.15 without careful prompting.
**Seed-based reproducibility.** `reset(seed=42)` always produces the same bug, same artifacts, same grading. Baseline scores are reproducible to 4 decimal places.
---
## Environment Variables
| Variable | Description |
|---|---|
| `API_BASE_URL` | LLM API endpoint (OpenAI-compatible) |
| `MODEL_NAME` | Model identifier |
| `HF_TOKEN` | Hugging Face / API token |
| `ENV_BASE_URL` | Environment server URL (default: `http://localhost:7860`) |
---
## License
MIT β see LICENSE
|