---
title: MLOps Pipeline Debugger
emoji: π§
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# MLOps Pipeline Debugger

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) | [Python](https://www.python.org) | [License](LICENSE)

An **OpenEnv-compatible RL environment** in which an AI agent acts as a senior ML engineer diagnosing a broken training run. Built for the **Meta PyTorch Hackathon x Scaler School of Technology**.

---
## The Real-World Problem

Every ML team has experienced it: a training job finishes overnight and something is wrong. The loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation, with no error thrown.

A senior engineer must investigate systematically (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause. **This is the #1 time sink in production ML operations**, and it is a skill that separates junior from senior ML engineers.

This environment simulates that investigation workflow. It is not a toy problem; it models the **actual top-3 failure modes** from production ML pipelines:

| Failure Mode | Real-World Frequency | Environment Task |
|---|---|---|
| Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
| Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
| Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |

---
## How It Works

At `reset()`, a complete set of **6 realistic training artifacts** is procedurally generated with one planted fault. The agent investigates using **8 structured actions** and submits a diagnosis. The grader checks the diagnosis against ground truth: **fully deterministic, no LLM judge**.

```
reset(task_id="hard", seed=42)
│
├── Generates: config.yaml, train.log, dataset_stats.json,
│              preprocessing.py, eval_results.json, model_card.json
│
├── Plants: one bug from the task's 3-bug pool
│
└── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]
```

**9 distinct bug types across 3 difficulty tiers. Every episode can have a different bug. Scores vary continuously based on diagnosis precision.**
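
The seed-driven bug selection described above could look roughly like the following. This is a minimal sketch, not the project's generator: `BUG_POOLS` and `pick_bug` are hypothetical names, and the assumption is that the bug is drawn from the task's pool with a seeded random generator, so the same `task_id` and `seed` always plant the same fault. The bug identifiers are the ones listed under each task in this README.

```python
import random

# Bug identifiers as listed in the task descriptions of this README.
BUG_POOLS = {
    "easy": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"],
    "medium": ["data_leakage_scaler", "data_leakage_overlap", "wrong_split_ratio"],
    "hard": ["label_encoder_mismatch", "silent_metric_swap", "tokenizer_version_drift"],
}


def pick_bug(task_id: str, seed: int) -> str:
    """Hypothetical helper: choose one bug from the task's pool, reproducibly."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.choice(BUG_POOLS[task_id])


# The same task and seed always yield the same planted bug.
assert pick_bug("hard", 42) == pick_bug("hard", 42)
```

Using a private `random.Random` instance rather than the module-level functions keeps episodes reproducible even when other code draws random numbers in between.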

---

## Procedural Artifact Generation

Every episode generates 6 internally consistent training artifacts from scratch:

| Artifact | Contents | Role in Investigation |
|---|---|---|
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
| `dataset_stats.json` | Split sizes, class distribution, overlap counts | Detect data issues |
| `preprocessing.py` | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
| `eval_results.json` | Final val/test metrics with hardware info | Quantify metric gaps |
| `model_card.json` | Architecture summary, tokenizer version | Cross-reference versions |

Artifacts are **internally consistent**: the config matches the logs and the dataset stats match the preprocessing code, except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.

---

## Action Space (8 actions)

```python
from typing import Literal

from pydantic import BaseModel


class MLOpsAction(BaseModel):
    action_type: Literal[
        "read_config",            # Full training configuration
        "read_logs",              # Training logs (filterable: keyword or "epoch:N-M")
        "check_dataset_stats",    # Split sizes, class distribution, overlap counts
        "inspect_preprocessing",  # Full preprocessing pipeline code
        "read_eval_results",      # Final val/test metrics
        "run_sanity_check",       # Computed diagnostic check (8 types)
        "query_artifact",         # Specific field from any artifact (dot notation)
        "submit_diagnosis",       # Final answer; triggers grading
    ]
```
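
The `query_artifact` action addresses a field with dot notation. How the server resolves such a path is not shown in this README; the sketch below is one plausible way to do it, with `resolve_path` as a hypothetical helper and the sample config invented for illustration. It assumes artifacts are parsed into nested dicts and lists, and that numeric path segments index into lists.

```python
from typing import Any


def resolve_path(artifact: Any, path: str) -> Any:
    """Hypothetical helper: walk a nested dict/list using a dotted path."""
    node = artifact
    for key in path.split("."):
        if isinstance(node, list):
            node = node[int(key)]  # numeric segment indexes into a list
        else:
            node = node[key]       # raises KeyError if the field is missing
    return node


# Illustrative config, not actual environment output.
config = {"optimizer": {"name": "SGD", "momentum": 0.99}, "layers": [{"units": 128}]}
print(resolve_path(config, "optimizer.momentum"))  # 0.99
print(resolve_path(config, "layers.0.units"))      # 128
```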

**Sanity check types** (computed diagnostics, not just artifact reads):

`label_consistency` | `data_leakage` | `gradient_norms` | `class_balance` | `feature_statistics` | `encoder_version_match` | `loss_trajectory` | `metric_gap_analysis`
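
To illustrate what a computed diagnostic (as opposed to a plain artifact read) might do, here is a rough sketch of two of the checks named above. The function bodies, the 0.10 gap threshold, and the returned keys are assumptions made for this example; only the check names and the `val_accuracy` / `test_accuracy` field names come from this README.

```python
import math


def metric_gap_analysis(eval_results: dict, threshold: float = 0.10) -> dict:
    """Flag a val/test accuracy gap larger than an (assumed) threshold."""
    gap = eval_results["val_accuracy"] - eval_results["test_accuracy"]
    return {"gap": round(gap, 4), "suspicious": abs(gap) > threshold}


def loss_trajectory(losses: list) -> dict:
    """Report whether the per-epoch loss hit NaN or ended above where it started."""
    has_nan = any(math.isnan(value) for value in losses)
    finite = [value for value in losses if math.isfinite(value)]
    rising = len(finite) >= 2 and finite[-1] > finite[0]
    return {"has_nan": has_nan, "diverging": has_nan or rising}


# Illustrative inputs, not actual environment output.
print(metric_gap_analysis({"val_accuracy": 0.94, "test_accuracy": 0.61}))
print(loss_trajectory([2.3, 41.0, float("nan")]))
```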

---

## Observation Space

```python
from typing import Any, Dict, List

from pydantic import BaseModel

# ArtifactMeta (name, description, size) is defined elsewhere in the project.


class MLOpsObservation(BaseModel):
    task_id: str                             # easy | medium | hard
    task_description: str                    # Full task brief with investigation strategy
    run_id: str                              # Unique run identifier
    run_summary: Dict[str, Any]              # Model, dataset, training status
    available_artifacts: List[ArtifactMeta]  # What can be read (name, description, size)
    artifacts_read: List[str]                # Investigation progress tracking
    last_action_result: Dict[str, Any]       # Full content of last action
    step_count: int
    max_steps: int
    done: bool
    messages: List[str]                      # System warnings (duplicate reads, etc.)
```
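
An agent can drive its exploration directly from these fields. The sketch below is a deliberately simple policy, assuming the observation has been converted to a plain dict: it reads each artifact once and submits when nothing is left or the step budget is nearly spent. `ARTIFACT_ACTIONS` and `next_action` are hypothetical names, and the artifact-to-action mapping is inferred from the tables above rather than taken from the project code.

```python
# Assumed mapping from artifact name to the action that reads it.
ARTIFACT_ACTIONS = {
    "config.yaml": "read_config",
    "train.log": "read_logs",
    "dataset_stats.json": "check_dataset_stats",
    "preprocessing.py": "inspect_preprocessing",
    "eval_results.json": "read_eval_results",
}


def next_action(observation: dict) -> dict:
    """Pick the next unread artifact, or submit when out of artifacts or steps."""
    out_of_steps = observation["step_count"] >= observation["max_steps"] - 1
    already_read = set(observation["artifacts_read"])
    if not out_of_steps:
        for artifact in observation["available_artifacts"]:
            action = ARTIFACT_ACTIONS.get(artifact["name"])
            if action is not None and artifact["name"] not in already_read:
                return {"action_type": action}
    return {"action_type": "submit_diagnosis"}
```

Because re-reading the same artifact is penalized, tracking `artifacts_read` before every step avoids losing reward to duplicate reads.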

---

## Tasks & Difficulty Progression

### Task 1: Config Error Diagnosis `(easy)` | 20 steps max

**Bug pool (one picked randomly per episode):**

- `exploding_lr`: `learning_rate: 50.0` causes the loss to diverge to NaN by epoch 3
- `wrong_optimizer`: `SGD(momentum=0.99)` causes loss oscillation with no convergence
- `batch_size_overflow`: `batch_size: 4096` exceeds the dataset size, leading to trivial overfitting

**Signal strength:** High. Symptoms are visible immediately in the training logs.
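
Why an oversized learning rate diverges so quickly can be seen on a toy problem. The sketch below is not the environment's training code; it runs plain gradient descent on f(w) = w², where each update multiplies w by (1 - 2·lr). With `lr = 50.0` that factor is -99, so the iterate grows by two orders of magnitude per step, while a small rate shrinks it.

```python
def gradient_descent(lr: float, steps: int, w0: float = 1.0) -> list:
    """Minimize f(w) = w**2 with plain gradient descent; return every iterate."""
    trajectory = [w0]
    w = w0
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w = w - lr * grad
        trajectory.append(w)
    return trajectory


print(gradient_descent(lr=50.0, steps=3))  # [1.0, -99.0, 9801.0, -970299.0]
print(gradient_descent(lr=0.1, steps=3))   # magnitude shrinks by 0.8x per step
```

In a real network the same blow-up overflows to `inf` and then NaN, which is the symptom the training log shows.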

### Task 2: Data Leakage Detection `(medium)` | 30 steps max

**Bug pool:**

- `data_leakage_scaler`: `StandardScaler.fit_transform(X_full)` is called before the train/val split
- `data_leakage_overlap`: `train_test_split(random_state=None)` is unseeded, so splits regenerated on separate calls overlap
- `wrong_split_ratio`: `test_size=0.8` trains on 20% and evaluates on 80%

**Signal strength:** Medium. Requires correlating the val accuracy anomaly in the logs with the preprocessing code.
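
The scaler variant of this bug is easy to reproduce in miniature. The sketch below uses only the standard library as a stand-in for `StandardScaler` (an assumption made to keep the example dependency-free); `fit_scaler` and `transform` are hypothetical helpers and the data is invented. It shows that statistics fitted on the full dataset differ from those fitted on the training split, which means validation data has influenced the features the model trains on.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return (mean, std), the quantities a standard scaler learns in fit()."""
    return mean(values), pstdev(values)


def transform(values, stats):
    mu, sigma = stats
    return [(value - mu) / sigma for value in values]


data = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
train, val = data[:4], data[4:]

leaky_stats = fit_scaler(data)   # bug: fitted before the split, sees validation rows
clean_stats = fit_scaler(train)  # fix: fitted on the training split only

print(leaky_stats[0], clean_stats[0])  # the two means differ
train_scaled = transform(train, clean_stats)
val_scaled = transform(val, clean_stats)  # validation reuses the training statistics
```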

### Task 3: Silent Evaluation Bug `(hard)` | 40 steps max

**Bug pool:**

- `label_encoder_mismatch`: Train/eval use different `LabelEncoder.fit()` orderings
- `silent_metric_swap`: `val_accuracy` and `test_accuracy` assignments are swapped in the eval code
- `tokenizer_version_drift`: Training uses tokenizer v2, eval uses v1 (847 tokens map to `[UNK]`)

**Signal strength:** Low. Training logs look completely normal. Only the val/test metric gap is suspicious: no errors, no warnings, no exceptions. Requires reasoning about what's *absent*.

**Asymmetric penalty:** Missing a silent evaluation bug is penalized 1.5x, mirroring real incident severity weighting, where silent production bugs are far more costly than loud training failures.
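
A label encoder mismatch produces exactly this kind of silent failure. The following sketch mimics how scikit-learn's `LabelEncoder` assigns integer ids (sorted unique labels) without importing it. The scenario, an eval split that lacks one class and is fitted with its own encoder, is an assumption chosen for illustration; the environment's planted bug may differ in detail.

```python
def fit_label_encoder(labels):
    """Mimic LabelEncoder.fit: ids follow the sorted unique labels."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


train_labels = ["cat", "dog", "fish", "dog", "cat"]
eval_labels = ["cat", "fish", "fish", "cat"]     # no "dog" in this split

train_encoder = fit_label_encoder(train_labels)  # {'cat': 0, 'dog': 1, 'fish': 2}
eval_encoder = fit_label_encoder(eval_labels)    # {'cat': 0, 'fish': 1}

# A hypothetical perfect model predicts ids in the training label space.
predictions = [train_encoder[label] for label in eval_labels]

buggy = accuracy(predictions, [eval_encoder[label] for label in eval_labels])
fixed = accuracy(predictions, [train_encoder[label] for label in eval_labels])
print(buggy, fixed)  # 0.5 1.0
```

Nothing raises an exception here: the only symptom is a metric that is lower than it should be, which is why reusing a single fitted encoder across training and evaluation is the fix.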

---

## Reward Design

**Dense per-step rewards** (not sparse, so there is a learning signal throughout the episode):

```
Investigation phase:
  +0.02  First time reading an artifact       (rewards systematic exploration)
  -0.02  Re-reading same artifact+filter      (penalizes brute force)
  +0.01  Running a new sanity check           (rewards diagnostic reasoning)

Diagnosis grading (4 independent components):
  +0.15  Correct failure_category             (what kind of bug?)
  +0.25  Correct root_cause_file              (which file contains it?)
  +0.30  Correct root_cause_field             (which parameter/function?)
  +0.30  Correct proposed_fix                 (keyword overlap with gold fix)

Task 3 modifier:
  If score < 0.70 → additional 0.5x penalty on missed components
  (silent bugs reaching production are more costly than loud failures)
```

**Why dense rewards?** Sparse terminal-only rewards make it impossible to distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." The per-step rewards incentivize thorough investigation and penalize lazy repetition, while the 4-component terminal grading gives partial credit for partially correct diagnoses.
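
The terminal grading can be sketched as a weighted sum of the four components. This is an illustration under stated assumptions, not the project's grader: `WEIGHTS`, `keyword_overlap`, and `grade` are hypothetical names; the first three fields are compared by case-insensitive equality; and the Task 3 modifier is interpreted as subtracting an extra 0.5 times the missed weight when the raw score is below 0.70. The component weights and the [0.01, 0.99] clamp come from this README.

```python
import re

WEIGHTS = {
    "failure_category": 0.15,
    "root_cause_file": 0.25,
    "root_cause_field": 0.30,
    "proposed_fix": 0.30,
}


def keyword_overlap(proposed: str, gold: str) -> float:
    """Fraction of the gold fix's words that also appear in the proposed fix."""
    gold_words = set(re.findall(r"[a-z0-9_]+", gold.lower()))
    proposed_words = set(re.findall(r"[a-z0-9_]+", proposed.lower()))
    return len(gold_words & proposed_words) / len(gold_words) if gold_words else 0.0


def grade(diagnosis: dict, gold: dict, task_id: str) -> float:
    earned = 0.0
    for field in ("failure_category", "root_cause_file", "root_cause_field"):
        if diagnosis.get(field, "").strip().lower() == gold[field].lower():
            earned += WEIGHTS[field]
    overlap = keyword_overlap(diagnosis.get("proposed_fix", ""), gold["proposed_fix"])
    earned += WEIGHTS["proposed_fix"] * overlap
    if task_id == "hard" and earned < 0.70:
        earned -= 0.5 * (1.0 - earned)  # assumed reading of the Task 3 modifier
    return round(min(0.99, max(0.01, earned)), 4)
```

Under these assumptions, a diagnosis with only the category and file correct scores 0.4 on an easy task but 0.1 on the hard task, which is the asymmetry the penalty is meant to create.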

**Score spectrum:**

```
No investigation, wrong diagnosis   → 0.01
Category only correct               → 0.10-0.15
Category + file correct             → 0.35-0.40
Category + file + field correct     → 0.65
Perfect diagnosis                   → 0.90-0.99
```

---
## Baseline Scores

| Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
|---|---|---|
| Easy | ~0.42 | ~0.91 |
| Medium | ~0.28 | ~0.85 |
| Hard | ~0.15 | ~0.92 |

The baseline agent (no task-specific prompting) struggles significantly on medium and hard tasks, confirming meaningful difficulty progression.

---
## Setup & Usage

### Docker (recommended)

```bash
docker build -t mlops-debug-env .
docker run -p 7860:7860 mlops-debug-env
curl http://localhost:7860/health
```

### Local Python

```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```

### Python Client

```python
from client import MLOpsDebugEnv
from models import MLOpsAction

with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
    obs = env.reset(task_id="hard", seed=1)

    # Investigate
    r = env.step(MLOpsAction(action_type="read_eval_results"))
    r = env.step(MLOpsAction(action_type="run_sanity_check",
                             sanity_check_type="metric_gap_analysis"))
    r = env.step(MLOpsAction(action_type="inspect_preprocessing"))

    # Diagnose
    r = env.step(MLOpsAction(
        action_type="submit_diagnosis",
        failure_category="label_mismatch",
        root_cause_file="preprocessing.py",
        root_cause_field="LabelEncoder.fit_order",
        diagnosis="Train and eval use different LabelEncoder orderings",
        proposed_fix="Use single LabelEncoder instance across both pipelines",
    ))
    print(f"Score: {r.info['score']}")
```

### Inference Script

```bash
export GEMINI_API_KEY="your_key"
export ENV_BASE_URL="http://localhost:7860"
python inference.py                        # all 3 tasks
python inference.py --task easy --seed 42
```

**Output format (OpenEnv standard):**

```
[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
```
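
Because the log lines are plain `key=value` pairs, they are straightforward to post-process. The sketch below parses a `[STEP]` line into a dict; `STEP_PATTERN` and `parse_step` are hypothetical names, and the pattern is inferred only from the sample output above, so other field orders or extra fields would need a looser parser.

```python
import re
from typing import Optional

STEP_PATTERN = re.compile(
    r"^\[STEP\]\s+step=(?P<step>\d+)\s+action=(?P<action>\w+)\s+"
    r"reward=(?P<reward>-?\d+(?:\.\d+)?)\s+done=(?P<done>true|false)\s+"
    r"error=(?P<error>.+)$"
)


def parse_step(line: str) -> Optional[dict]:
    """Parse one [STEP] line; return None for any other kind of line."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        return None
    error = match["error"]
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if error == "null" else error,
    }


print(parse_step("[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null"))
```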

---

## Design Decisions

**Why MLOps debugging?** Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark; it models a real workflow.

**Why procedural generation?** Fixed bug scenarios would let agents memorize answers. The seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.

**Why deterministic grading?** LLM-as-judge introduces variance and bias. The grader uses substring/keyword matching against the planted ground truth, so there is zero subjectivity and scores are reproducible to 4 decimal places.

**Why asymmetric penalties?** In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.

**Why 8 sanity check types?** Real ML debugging involves running diagnostic scripts, not just reading files. The computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.

---

## Project Structure

```
MLops-Openenvhack/
├── app.py                 # FastAPI server (REST + WebSocket)
├── mlops_environment.py   # Core environment: reset/step/grading
├── artifact_generator.py  # Procedural artifact + bug generation
├── models.py              # Pydantic models (Action, Observation, State)
├── inference.py           # LLM baseline agent
├── client.py              # Python client library (async + sync)
├── openenv_state.py       # Global state singleton
├── openenv.yaml           # OpenEnv specification
├── Dockerfile             # Container configuration
├── requirements.txt       # Python dependencies
└── server/                # HF Space deployment copy
```

---

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `GEMINI_API_KEY` | Yes (for inference) | none | Gemini API key for baseline agent |
| `MODEL_NAME` | No | `gemini-2.5-flash` | LLM model identifier |
| `API_BASE_URL` | No | Gemini endpoint | OpenAI-compatible API base URL |
| `ENV_BASE_URL` | No | `http://localhost:7860` | Environment server URL |
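
A script consuming these variables might load them as shown below. This is a sketch rather than the code in `inference.py`: `InferenceConfig` and `load_config` are hypothetical names, and because the table does not spell out the Gemini endpoint URL, `api_base_url` is left as `None` when unset on the assumption that the API client then falls back to its own default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    gemini_api_key: str
    model_name: str
    api_base_url: Optional[str]  # None means "use the client's default endpoint"
    env_base_url: str


def load_config(env: Optional[Mapping[str, str]] = None) -> InferenceConfig:
    """Read settings from the environment, applying the defaults from the table."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required to run inference")
    return InferenceConfig(
        gemini_api_key=api_key,
        model_name=env.get("MODEL_NAME", "gemini-2.5-flash"),
        api_base_url=env.get("API_BASE_URL"),
        env_base_url=env.get("ENV_BASE_URL", "http://localhost:7860"),
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes the loader easy to exercise in tests without mutating the process environment.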

---

## License

MIT