---
title: MLOps Pipeline Debugger
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# MLOps Pipeline Debugger
[OpenEnv](https://github.com/meta-pytorch/OpenEnv) | [Python](https://www.python.org) | [License](LICENSE)
An **OpenEnv-compatible RL environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run. Built for the **Meta PyTorch Hackathon x Scaler School of Technology**.
---
## The Real-World Problem
Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown.
A senior engineer must systematically investigate (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause. **This is among the biggest time sinks in production ML operations**, and it's a skill that separates junior from senior ML engineers.
This environment simulates that investigation workflow. It's not a toy problem; it models the **actual top-3 failure modes** from production ML pipelines:
| Failure Mode | Real-World Frequency | Environment Task |
|---|---|---|
| Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
| Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
| Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |
---
## How It Works
At `reset()`, a complete set of **6 realistic training artifacts** is procedurally generated with one planted fault. The agent investigates using **8 structured actions** and submits a diagnosis. The grader checks the diagnosis against ground truth and is **fully deterministic, with no LLM judge**.
```
reset(task_id="hard", seed=42)
│
├── Generates: config.yaml, train.log, dataset_stats.json,
│              preprocessing.py, eval_results.json, model_card.json
│
├── Plants: one bug from the task's 3-bug pool
│
└── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]
```
**9 distinct bug types across 3 difficulty tiers. Every episode can have a different bug. Scores vary continuously based on diagnosis precision.**
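As a rough illustration of how a seed can make the planted bug reproducible, here is a minimal sketch. The bug names come from the task pools listed later in this README; `BUG_POOLS` and `pick_bug` are hypothetical names for illustration, not the environment's actual internals.

```python
import random

# Bug pools copied from the task descriptions in this README.
BUG_POOLS = {
    "easy": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"],
    "medium": ["data_leakage_scaler", "data_leakage_overlap", "wrong_split_ratio"],
    "hard": ["label_encoder_mismatch", "silent_metric_swap", "tokenizer_version_drift"],
}


def pick_bug(task_id: str, seed: int) -> str:
    """Hypothetical helper: choose one bug from the task's pool, reproducibly."""
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    return rng.choice(BUG_POOLS[task_id])


if __name__ == "__main__":
    # The same (task_id, seed) pair always yields the same bug.
    print(pick_bug("hard", 42) == pick_bug("hard", 42))
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps episodes independent of whatever else touches the global RNG.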
---
## Procedural Artifact Generation
Every episode generates 6 internally-consistent training artifacts from scratch:
| Artifact | Contents | Role in Investigation |
|---|---|---|
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
| `dataset_stats.json` | Split sizes, class distribution, overlap counts | Detect data issues |
| `preprocessing.py` | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
| `eval_results.json` | Final val/test metrics with hardware info | Quantify metric gaps |
| `model_card.json` | Architecture summary, tokenizer version | Cross-reference versions |
Artifacts are **internally consistent** (config matches logs, dataset stats match preprocessing code) except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.
---
## Action Space (8 actions)
```python
from typing import Literal

from pydantic import BaseModel


class MLOpsAction(BaseModel):
    # Excerpt: only the action selector is shown; argument fields such as
    # sanity_check_type or failure_category are omitted here.
    action_type: Literal[
        "read_config",            # Full training configuration
        "read_logs",              # Training logs (filterable: keyword or "epoch:N-M")
        "check_dataset_stats",    # Split sizes, class distribution, overlap counts
        "inspect_preprocessing",  # Full preprocessing pipeline code
        "read_eval_results",      # Final val/test metrics
        "run_sanity_check",       # Computed diagnostic check (8 types)
        "query_artifact",         # Specific field from any artifact (dot notation)
        "submit_diagnosis",       # Final answer; triggers grading
    ]
```
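The `query_artifact` action addresses a field by dot notation. A minimal sketch of how such a lookup could work on a parsed artifact follows; `query_field` and the sample config keys are assumptions for illustration, not the environment's real implementation or schema.

```python
from typing import Any


def query_field(artifact: dict, path: str) -> Any:
    """Walk a nested dict using a dotted path such as 'optimizer.lr'."""
    node: Any = artifact
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"no field {path!r} (stopped at {key!r})")
        node = node[key]
    return node


# Hypothetical parsed config.yaml contents.
config = {"optimizer": {"name": "sgd", "lr": 50.0}, "batch_size": 64}

if __name__ == "__main__":
    print(query_field(config, "optimizer.lr"))
```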
**Sanity check types** (computed diagnostics, not just artifact reads):
`label_consistency` | `data_leakage` | `gradient_norms` | `class_balance` | `feature_statistics` | `encoder_version_match` | `loss_trajectory` | `metric_gap_analysis`
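To make "computed diagnostic" concrete, the sketch below shows what a `data_leakage`-style check might compute from sample identifiers. The function name, its inputs, and the shape of the returned report are all assumptions; the environment's actual check may differ.

```python
def leakage_check(train_ids, val_ids):
    """Hypothetical diagnostic: report how many samples appear in both splits."""
    train, val = set(train_ids), set(val_ids)
    shared = train & val
    return {
        "overlap_count": len(shared),
        # Share of the validation split that was also seen during training.
        "overlap_fraction_of_val": len(shared) / len(val) if val else 0.0,
        "leak_suspected": bool(shared),
    }


if __name__ == "__main__":
    print(leakage_check([1, 2, 3, 4], [4, 5]))
```

Unlike a plain artifact read, a check like this derives a new signal (the overlap) rather than echoing stored content.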
---
## Observation Space
```python
from typing import Any, Dict, List

from pydantic import BaseModel


# ArtifactMeta (name, description, size) is declared elsewhere in the project.
class MLOpsObservation(BaseModel):
    task_id: str                             # easy | medium | hard
    task_description: str                    # Full task brief with investigation strategy
    run_id: str                              # Unique run identifier
    run_summary: Dict[str, Any]              # Model, dataset, training status
    available_artifacts: List[ArtifactMeta]  # What can be read (name, description, size)
    artifacts_read: List[str]                # Investigation progress tracking
    last_action_result: Dict[str, Any]       # Full content of last action
    step_count: int
    max_steps: int
    done: bool
    messages: List[str]                      # System warnings (duplicate reads, etc.)
```
---
## Tasks & Difficulty Progression
### Task 1: Config Error Diagnosis `(easy)` | 20 steps max
**Bug pool (one picked randomly per episode):**
- `exploding_lr`: `learning_rate: 50.0` causes loss to diverge to NaN by epoch 3
- `wrong_optimizer`: `SGD(momentum=0.99)` causes loss oscillation with no convergence
- `batch_size_overflow`: `batch_size: 4096` exceeds the dataset size, causing trivial overfitting
**Signal strength:** High. Symptoms visible immediately in training logs.
### Task 2: Data Leakage Detection `(medium)` | 30 steps max
**Bug pool:**
- `data_leakage_scaler`: `StandardScaler.fit_transform(X_full)` called before train/val split
- `data_leakage_overlap`: `train_test_split(random_state=None)` makes the split non-reproducible, so train and evaluation samples overlap
- `wrong_split_ratio`: `test_size=0.8` trains on 20% and evaluates on 80%
**Signal strength:** Medium. Requires correlating val accuracy anomaly in logs with preprocessing code.
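The `data_leakage_scaler` bug can be illustrated without any ML library. The sketch below uses the standard library to mimic what a standard scaler learns (mean and population standard deviation) and compares fitting on the full data against fitting on the training split only. The toy data and helper names are assumptions for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, std) a standard scaler would learn from `values`."""
    return mean(values), pstdev(values)


def transform(values, params):
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last point ends up in validation
train, val = data[:4], data[4:]

leaky = fit_scaler(data)   # BUG: statistics computed before the split
clean = fit_scaler(train)  # correct: statistics come from the training split only

if __name__ == "__main__":
    # The leaky scaler has already "seen" the validation outlier.
    print("leaky:", leaky, transform(val, leaky))
    print("clean:", clean, transform(val, clean))
```

Because the leaky statistics include validation data, validation inputs look more "in distribution" than they should, which is one way validation metrics get inflated.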
### Task 3: Silent Evaluation Bug `(hard)` | 40 steps max
**Bug pool:**
- `label_encoder_mismatch`: Train/eval use different `LabelEncoder.fit()` orderings
- `silent_metric_swap`: `val_accuracy` and `test_accuracy` assignments swapped in eval code
- `tokenizer_version_drift`: Training uses tokenizer v2, eval uses v1 (847 tokens map to `[UNK]`)
**Signal strength:** Low. Training logs look completely normal. Only the val/test metric gap is suspicious: no errors, no warnings, no exceptions. Requires reasoning about what's *absent*.
**Asymmetric penalty:** Missing a silent evaluation bug is penalized 1.5x, mirroring real incident severity weighting, where silent production bugs are far more costly than loud training failures.
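To see why a bug like `label_encoder_mismatch` stays silent, consider the sketch below. It uses a small first-seen-order encoder as a stand-in (scikit-learn's `LabelEncoder` sorts classes, so this is deliberately not its behavior); the class names and data are invented for illustration.

```python
def fit_label_map(labels):
    """Stand-in encoder: assign integer ids in first-seen order."""
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return mapping


train_map = fit_label_map(["cat", "dog", "bird"])
eval_map = fit_label_map(["bird", "cat", "dog"])  # same classes, different order

true_labels = ["cat", "dog", "bird", "cat"]
# A model that is perfectly right in the training id space...
predicted_ids = [train_map[label] for label in true_labels]
# ...is scored against ids produced by the eval-side mapping.
eval_ids = [eval_map[label] for label in true_labels]

accuracy = sum(p == e for p, e in zip(predicted_ids, eval_ids)) / len(true_labels)

if __name__ == "__main__":
    print(train_map, eval_map, accuracy)
```

Nothing raises an exception here: both mappings are valid on their own, and only the resulting metric reveals the mismatch.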
---
## Reward Design
**Dense per-step rewards** (not sparse, so there is a learning signal throughout the episode):
```
Investigation phase:
  +0.02  First time reading an artifact    (rewards systematic exploration)
  -0.02  Re-reading same artifact+filter   (penalizes brute force)
  +0.01  Running a new sanity check        (rewards diagnostic reasoning)

Diagnosis grading (4 independent components):
  +0.15  Correct failure_category          (what kind of bug?)
  +0.25  Correct root_cause_file           (which file contains it?)
  +0.30  Correct root_cause_field          (which parameter/function?)
  +0.30  Correct proposed_fix              (keyword overlap with gold fix)

Task 3 modifier:
  If score < 0.70 → additional 0.5x penalty on missed components
  (silent bugs reaching production are more costly than loud failures)
```
**Why dense rewards?** Sparse terminal-only rewards make it impossible to distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." The per-step rewards incentivize thorough investigation and penalize lazy repetition, while the 4-component terminal grading gives partial credit for partially correct diagnoses.
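The investigation-phase rewards can be sketched as a small tracker. The reward values come from the list above; the class name, the choice to key reads on an `(artifact, filter)` pair, and the zero reward for a repeated sanity check are assumptions, since the README does not spell out those cases.

```python
class StepRewardTracker:
    """Hypothetical tracker for the investigation-phase rewards."""

    def __init__(self):
        self.seen_reads = set()   # (artifact, filter) pairs already read
        self.seen_checks = set()  # sanity check types already run

    def read(self, artifact, filter_=None):
        key = (artifact, filter_)
        if key in self.seen_reads:
            return -0.02          # repeated artifact+filter read
        self.seen_reads.add(key)
        return 0.02               # first read of this artifact+filter

    def sanity_check(self, check_type):
        if check_type in self.seen_checks:
            return 0.0            # assumption: a repeated check earns nothing
        self.seen_checks.add(check_type)
        return 0.01               # new sanity check


if __name__ == "__main__":
    tracker = StepRewardTracker()
    print(tracker.read("train.log"), tracker.read("train.log"))
    print(tracker.sanity_check("data_leakage"))
```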
**Score spectrum:**
```
No investigation, wrong diagnosis  → 0.01
Category only correct              → 0.10-0.15
Category + file correct            → 0.35-0.40
Category + file + field correct    → 0.65
Perfect diagnosis                  → 0.90-0.99
```
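A deterministic grader built on the four component weights might look roughly like the sketch below. It is a simplification under stated assumptions: exact string equality for the first three components, a word-overlap fraction for `proposed_fix`, and clamping to `[0.01, 0.99]`. It omits the Task 3 modifier and any other adjustments, so it is not expected to reproduce the score ranges above exactly.

```python
import re

# Component weights taken from the reward table in this README.
WEIGHTS = {"failure_category": 0.15, "root_cause_file": 0.25, "root_cause_field": 0.30}
FIX_WEIGHT = 0.30


def keyword_overlap(proposed: str, gold: str) -> float:
    """Fraction of gold-fix words that also appear in the proposed fix."""
    gold_words = set(re.findall(r"[a-z0-9_]+", gold.lower()))
    got_words = set(re.findall(r"[a-z0-9_]+", proposed.lower()))
    return len(gold_words & got_words) / len(gold_words) if gold_words else 0.0


def grade(diagnosis: dict, gold: dict) -> float:
    """Hypothetical grader: weighted component matches, clamped to [0.01, 0.99]."""
    score = sum(w for key, w in WEIGHTS.items() if diagnosis.get(key) == gold[key])
    score += FIX_WEIGHT * keyword_overlap(diagnosis.get("proposed_fix", ""), gold["proposed_fix"])
    return round(min(0.99, max(0.01, score)), 4)


# Invented ground truth for illustration only.
gold = {
    "failure_category": "config_error",
    "root_cause_file": "config.yaml",
    "root_cause_field": "learning_rate",
    "proposed_fix": "lower learning_rate to 0.001",
}

if __name__ == "__main__":
    print(grade(gold, gold), grade({"failure_category": "config_error"}, gold))
```

Because the grader only compares strings against planted ground truth, the same diagnosis always receives the same score.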
---
## Baseline Scores
| Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
|---|---|---|
| Easy | ~0.42 | ~0.91 |
| Medium | ~0.28 | ~0.85 |
| Hard | ~0.15 | ~0.92 |
The baseline agent (no task-specific prompting) struggles significantly on medium and hard tasks, confirming meaningful difficulty progression.
---
## Setup & Usage
### Docker (recommended)
```bash
docker build -t mlops-debug-env .
docker run -p 7860:7860 mlops-debug-env
curl http://localhost:7860/health
```
### Local Python
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
### Python Client
```python
from client import MLOpsDebugEnv
from models import MLOpsAction

with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
    obs = env.reset(task_id="hard", seed=1)

    # Investigate
    r = env.step(MLOpsAction(action_type="read_eval_results"))
    r = env.step(MLOpsAction(action_type="run_sanity_check",
                             sanity_check_type="metric_gap_analysis"))
    r = env.step(MLOpsAction(action_type="inspect_preprocessing"))

    # Diagnose
    r = env.step(MLOpsAction(
        action_type="submit_diagnosis",
        failure_category="label_mismatch",
        root_cause_file="preprocessing.py",
        root_cause_field="LabelEncoder.fit_order",
        diagnosis="Train and eval use different LabelEncoder orderings",
        proposed_fix="Use single LabelEncoder instance across both pipelines",
    ))
    print(f"Score: {r.info['score']}")
```
### Inference Script
```bash
export GEMINI_API_KEY="your_key"
export ENV_BASE_URL="http://localhost:7860"
python inference.py # all 3 tasks
python inference.py --task easy --seed 42
```
**Output format (OpenEnv standard):**
```
[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
```
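For post-processing runs, the `[STEP]` lines can be parsed with a regular expression. The sketch below assumes each line follows exactly the layout shown above, with no spaces inside field values; `parse_steps` is a hypothetical helper, not part of the repository.

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(\S+) reward=(-?\d+(?:\.\d+)?) "
    r"done=(true|false) error=(\S+)$"
)


def parse_steps(log_text):
    """Return one dict per [STEP] line; other lines are ignored."""
    steps = []
    for line in log_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps.append({
                "step": int(match.group(1)),
                "action": match.group(2),
                "reward": float(match.group(3)),
                "done": match.group(4) == "true",
                "error": None if match.group(5) == "null" else match.group(5),
            })
    return steps


sample = """[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91"""

if __name__ == "__main__":
    for step in parse_steps(sample):
        print(step)
```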
---
## Design Decisions
**Why MLOps debugging?** Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark; it models a real workflow.
**Why procedural generation?** Fixed bug scenarios would let agents memorize answers. Our seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.
**Why deterministic grading?** LLM-as-judge introduces variance and bias. Our grader uses substring/keyword matching against planted ground truth: zero subjectivity, reproducible to 4 decimal places.
**Why asymmetric penalties?** In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.
**Why 8 sanity check types?** Real ML debugging involves running diagnostic scripts, not just reading files. Our computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.
---
## Project Structure
```
MLops-Openenvhack/
├── app.py                  # FastAPI server (REST + WebSocket)
├── mlops_environment.py    # Core environment: reset/step/grading
├── artifact_generator.py   # Procedural artifact + bug generation
├── models.py               # Pydantic models (Action, Observation, State)
├── inference.py            # LLM baseline agent
├── client.py               # Python client library (async + sync)
├── openenv_state.py        # Global state singleton
├── openenv.yaml            # OpenEnv specification
├── Dockerfile              # Container configuration
├── requirements.txt        # Python dependencies
└── server/                 # HF Space deployment copy
```
---
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `GEMINI_API_KEY` | Yes (for inference) | (none) | Gemini API key for baseline agent |
| `MODEL_NAME` | No | `gemini-2.5-flash` | LLM model identifier |
| `API_BASE_URL` | No | Gemini endpoint | OpenAI-compatible API base URL |
| `ENV_BASE_URL` | No | `http://localhost:7860` | Environment server URL |
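A script could resolve these variables along the lines of the sketch below. The defaults mirror the table; `load_settings` is a hypothetical helper, and the `API_BASE_URL` default is left as `None` because the table does not give the concrete endpoint.

```python
import os


def load_settings(env=None):
    """Hypothetical helper: read the variables above, applying the table's defaults."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required to run the inference script")
    return {
        "api_key": api_key,
        "model_name": env.get("MODEL_NAME", "gemini-2.5-flash"),
        "api_base_url": env.get("API_BASE_URL"),  # default endpoint not specified here
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:7860"),
    }


if __name__ == "__main__":
    print(load_settings({"GEMINI_API_KEY": "your_key"}))
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise without touching the real process environment.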
---
## License
MIT