---
title: MLOps Pipeline Debugger
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# MLOps Pipeline Debugger

[![OpenEnv](https://img.shields.io/badge/OpenEnv-1.0.0-blue)](https://github.com/meta-pytorch/OpenEnv)
[![Python 3.11](https://img.shields.io/badge/python-3.11-green)](https://www.python.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)

An **OpenEnv-compatible RL environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run.

Built for the **Meta PyTorch Hackathon x Scaler School of Technology**.

---

## The Real-World Problem

Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown.

A senior engineer must systematically investigate (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause. **This is the #1 time sink in production ML operations**, and it's a skill that separates junior from senior ML engineers.

This environment simulates that investigation workflow. It's not a toy problem: it models the **actual top-3 failure modes** from production ML pipelines:

| Failure Mode | Real-World Frequency | Environment Task |
|---|---|---|
| Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
| Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
| Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |

---

## How It Works

At `reset()`, a complete set of **6 realistic training artifacts** is procedurally generated with one planted fault. The agent investigates using **8 structured actions** and submits a diagnosis. The grader checks against ground truth: **fully deterministic, no LLM judge**.

```
reset(task_id="hard", seed=42)
    │
    ├── Generates: config.yaml, train.log, dataset_stats.json,
    │              preprocessing.py, eval_results.json, model_card.json
    │
    ├── Plants: one bug from the task's 3-bug pool
    │
    └── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]
```

**9 distinct bug types across 3 difficulty tiers. Every episode can have a different bug. Scores vary continuously based on diagnosis precision.**

---

## Procedural Artifact Generation

Every episode generates 6 internally-consistent training artifacts from scratch:

| Artifact | Contents | Role in Investigation |
|---|---|---|
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
| `dataset_stats.json` | Split sizes, class distribution, overlap counts | Detect data issues |
| `preprocessing.py` | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
| `eval_results.json` | Final val/test metrics with hardware info | Quantify metric gaps |
| `model_card.json` | Architecture summary, tokenizer version | Cross-reference versions |

Artifacts are **internally consistent** (config matches logs, dataset stats match preprocessing code) except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.

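
The idea of seed-driven generation with exactly one planted fault can be sketched as follows. This is an illustrative sketch only: `generate_artifacts`, the `EASY_BUGS` table, and the field names are hypothetical and are not taken from `artifact_generator.py`. The key point it tries to show is that the training log is derived from the already-faulty config, so the symptoms stay consistent with the planted bug, and the same seed always reproduces the same episode.

```python
import math
import random

# Hypothetical bug pool for the easy task; the overrides mirror the bug
# descriptions in this README, not the real generator's internals.
EASY_BUGS = {
    "exploding_lr": {"learning_rate": 50.0},
    "batch_size_overflow": {"batch_size": 4096},
}


def generate_artifacts(seed: int) -> dict:
    """Build a config and a matching training log from one seed."""
    rng = random.Random(seed)  # all randomness flows from the seed
    config = {
        "learning_rate": rng.choice([1e-4, 3e-4, 1e-3]),
        "batch_size": rng.choice([32, 64, 128]),
        "epochs": 5,
    }
    bug = rng.choice(sorted(EASY_BUGS))
    config.update(EASY_BUGS[bug])  # plant exactly one fault

    # Derive the log from the (faulty) config so symptoms agree with it.
    loss, log = 2.0, []
    for epoch in range(1, config["epochs"] + 1):
        if config["learning_rate"] > 1.0:
            # Loss blows up, then becomes NaN from epoch 3 onward.
            loss = loss * 100 if epoch < 3 else math.nan
        else:
            loss *= 0.7  # healthy-looking decay
        log.append({"epoch": epoch, "loss": loss})
    return {"config": config, "train_log": log, "planted_bug": bug}
```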

---

## Action Space (8 actions)

```python
from typing import Literal

from pydantic import BaseModel


class MLOpsAction(BaseModel):
    action_type: Literal[
        "read_config",            # Full training configuration
        "read_logs",              # Training logs (filterable: keyword or "epoch:N-M")
        "check_dataset_stats",    # Split sizes, class distribution, overlap counts
        "inspect_preprocessing",  # Full preprocessing pipeline code
        "read_eval_results",      # Final val/test metrics
        "run_sanity_check",       # Computed diagnostic check (8 types)
        "query_artifact",         # Specific field from any artifact (dot notation)
        "submit_diagnosis",       # Final answer (triggers grading)
    ]
```

**Sanity check types** (computed diagnostics, not just artifact reads):

`label_consistency` | `data_leakage` | `gradient_norms` | `class_balance` | `feature_statistics` | `encoder_version_match` | `loss_trajectory` | `metric_gap_analysis`

---

## Observation Space

```python
from typing import Any, Dict, List

from pydantic import BaseModel


# ArtifactMeta (name, description, size) is defined elsewhere in the project.
class MLOpsObservation(BaseModel):
    task_id: str                            # easy | medium | hard
    task_description: str                   # Full task brief with investigation strategy
    run_id: str                             # Unique run identifier
    run_summary: Dict[str, Any]             # Model, dataset, training status
    available_artifacts: List[ArtifactMeta] # What can be read (name, description, size)
    artifacts_read: List[str]               # Investigation progress tracking
    last_action_result: Dict[str, Any]      # Full content of last action
    step_count: int
    max_steps: int
    done: bool
    messages: List[str]                     # System warnings (duplicate reads, etc.)
```

---

## Tasks & Difficulty Progression

### Task 1: Config Error Diagnosis `(easy)` | 20 steps max

**Bug pool (one picked randomly per episode):**

- `exploding_lr`: `learning_rate: 50.0` causes loss to diverge to NaN by epoch 3
- `wrong_optimizer`: `SGD(momentum=0.99)` causes loss oscillation with no convergence
- `batch_size_overflow`: `batch_size: 4096` exceeds dataset size, trivial overfitting

**Signal strength:** High. Symptoms visible immediately in training logs.

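
To illustrate what a computed check such as `loss_trajectory` might report for the Task 1 bug pool, the sketch below labels a list of per-epoch losses. The function name, the labels, and the direction-flip threshold are assumptions made for this example; the environment's own sanity checks may be implemented quite differently.

```python
import math
from typing import List


def classify_loss_trajectory(losses: List[float]) -> str:
    """Label a per-epoch loss curve as diverged, oscillating, converging, or stalled."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "diverged"  # e.g. an exploding learning rate
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    # Count how often the loss changes direction between consecutive epochs.
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    if deltas and flips >= len(deltas) / 2:
        return "oscillating"  # e.g. momentum set too high
    if losses and losses[-1] < losses[0]:
        return "converging"
    return "stalled"


print(classify_loss_trajectory([2.0, 200.0, float("nan")]))
print(classify_loss_trajectory([2.0, 1.1, 1.9, 1.0, 1.8]))
print(classify_loss_trajectory([2.0, 1.4, 1.0, 0.7]))
```

A heuristic like this separates the loud Task 1 symptoms from each other, which is why the signal strength for this tier is rated high: the log alone is usually enough to narrow down the fault.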

### Task 2: Data Leakage Detection `(medium)` | 30 steps max

**Bug pool:**

- `data_leakage_scaler`: `StandardScaler.fit_transform(X_full)` called before train/val split
- `data_leakage_overlap`: `train_test_split(random_state=None)` produces overlapping splits
- `wrong_split_ratio`: `test_size=0.8` trains on 20% and evaluates on 80%

**Signal strength:** Medium. Requires correlating val accuracy anomaly in logs with preprocessing code.

### Task 3: Silent Evaluation Bug `(hard)` | 40 steps max

**Bug pool:**

- `label_encoder_mismatch`: Train/eval use different `LabelEncoder.fit()` orderings
- `silent_metric_swap`: `val_accuracy` and `test_accuracy` assignments swapped in eval code
- `tokenizer_version_drift`: Training uses tokenizer v2, eval uses v1 (847 tokens map to `[UNK]`)

**Signal strength:** Low. Training logs look completely normal. Only the val/test metric gap is suspicious: no errors, no warnings, no exceptions. Requires reasoning about what's *absent*.

**Asymmetric penalty:** Missing a silent evaluation bug is penalized 1.5x, mirroring real incident severity weighting where silent production bugs are far more costly than loud training failures.

---

## Reward Design

**Dense per-step rewards** (not sparse; they provide a learning signal throughout the episode):

```
Investigation phase:
  +0.02  First time reading an artifact (rewards systematic exploration)
  -0.02  Re-reading same artifact+filter (penalizes brute force)
  +0.01  Running a new sanity check (rewards diagnostic reasoning)

Diagnosis grading (4 independent components):
  +0.15  Correct failure_category  (what kind of bug?)
  +0.25  Correct root_cause_file   (which file contains it?)
  +0.30  Correct root_cause_field  (which parameter/function?)
  +0.30  Correct proposed_fix      (keyword overlap with gold fix)

Task 3 modifier:
  If score < 0.70 → additional 0.5x penalty on missed components
  (silent bugs reaching production are more costly than loud failures)
```

**Why dense rewards?** Sparse terminal-only rewards make it impossible to distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." Our per-step rewards incentivize thorough investigation and penalize lazy repetition, while the 4-component terminal grading gives partial credit for partially-correct diagnoses.

**Score spectrum:**

```
No investigation, wrong diagnosis    → 0.01
Category only correct                → 0.10–0.15
Category + file correct              → 0.35–0.40
Category + file + field correct      → 0.65
Perfect diagnosis                    → 0.90–0.99
```

---

## Baseline Scores

| Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
|---|---|---|
| Easy | ~0.42 | ~0.91 |
| Medium | ~0.28 | ~0.85 |
| Hard | ~0.15 | ~0.92 |

The baseline agent (no task-specific prompting) struggles significantly on medium and hard tasks, confirming meaningful difficulty progression.

---

## Setup & Usage

### Docker (recommended)

```bash
docker build -t mlops-debug-env .
docker run -p 7860:7860 mlops-debug-env
curl http://localhost:7860/health
```

### Local Python

```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```

### Python Client

```python
from client import MLOpsDebugEnv
from models import MLOpsAction

with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
    obs = env.reset(task_id="hard", seed=1)

    # Investigate
    r = env.step(MLOpsAction(action_type="read_eval_results"))
    r = env.step(MLOpsAction(action_type="run_sanity_check",
                             sanity_check_type="metric_gap_analysis"))
    r = env.step(MLOpsAction(action_type="inspect_preprocessing"))

    # Diagnose
    r = env.step(MLOpsAction(
        action_type="submit_diagnosis",
        failure_category="label_mismatch",
        root_cause_file="preprocessing.py",
        root_cause_field="LabelEncoder.fit_order",
        diagnosis="Train and eval use different LabelEncoder orderings",
        proposed_fix="Use single LabelEncoder instance across both pipelines"
    ))
    print(f"Score: {r.info['score']}")
```

### Inference Script

```bash
export GEMINI_API_KEY="your_key"
export ENV_BASE_URL="http://localhost:7860"
python inference.py                        # all 3 tasks
python inference.py --task easy --seed 42
```

**Output format (OpenEnv standard):**

```
[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
```

---

## Design Decisions

**Why MLOps debugging?** Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark; it models a real workflow.

**Why procedural generation?** Fixed bug scenarios would let agents memorize answers.
Our seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.

**Why deterministic grading?** LLM-as-judge introduces variance and bias. Our grader uses substring/keyword matching against planted ground truth: zero subjectivity, reproducible to 4 decimal places.

**Why asymmetric penalties?** In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.

**Why 8 sanity check types?** Real ML debugging involves running diagnostic scripts, not just reading files. Our computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.

---

## Project Structure

```
MLops-Openenvhack/
├── app.py                  # FastAPI server (REST + WebSocket)
├── mlops_environment.py    # Core environment: reset/step/grading
├── artifact_generator.py   # Procedural artifact + bug generation
├── models.py               # Pydantic models (Action, Observation, State)
├── inference.py            # LLM baseline agent
├── client.py               # Python client library (async + sync)
├── openenv_state.py        # Global state singleton
├── openenv.yaml            # OpenEnv specification
├── Dockerfile              # Container configuration
├── requirements.txt        # Python dependencies
└── server/                 # HF Space deployment copy
```

---

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `GEMINI_API_KEY` | Yes (for inference) | none | Gemini API key for baseline agent |
| `MODEL_NAME` | No | `gemini-2.5-flash` | LLM model identifier |
| `API_BASE_URL` | No | Gemini endpoint | OpenAI-compatible API base URL |
| `ENV_BASE_URL` | No | `http://localhost:7860` | Environment server URL |

---

## License

MIT