---
title: MLOps Pipeline Debugger
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# MLOps Pipeline Debugger
[![OpenEnv](https://img.shields.io/badge/OpenEnv-1.0.0-blue)](https://github.com/meta-pytorch/OpenEnv)
[![Python 3.11](https://img.shields.io/badge/python-3.11-green)](https://www.python.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
An **OpenEnv-compatible RL environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run. Built for the **Meta PyTorch Hackathon x Scaler School of Technology**.
---
## The Real-World Problem
Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown.
A senior engineer must systematically investigate (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause. **This is the #1 time sink in production ML operations**, and it's a skill that separates junior from senior ML engineers.
This environment simulates that investigation workflow. It's not a toy problem; it models the **actual top-3 failure modes** from production ML pipelines:
| Failure Mode | Real-World Frequency | Environment Task |
|---|---|---|
| Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
| Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
| Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |
---
## How It Works
At `reset()`, a complete set of **6 realistic training artifacts** is procedurally generated with one planted fault. The agent investigates using **8 structured actions** and submits a diagnosis. The grader checks the diagnosis against ground truth and is **fully deterministic, with no LLM judge**.
```
reset(task_id="hard", seed=42)
│
├── Generates: config.yaml, train.log, dataset_stats.json,
│              preprocessing.py, eval_results.json, model_card.json
│
├── Plants: one bug from the task's 3-bug pool
│
└── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]
```
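As a rough illustration of how a seed can make bug selection reproducible, here is a minimal sketch. The `pick_bug` helper and the `BUG_POOLS` dictionary are hypothetical names used only for this example; the bug identifiers are the ones listed in the task sections of this README, and the actual generator may choose its bug differently.

```python
import random

# Bug pools as listed in the task sections of this README.
BUG_POOLS = {
    "easy": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"],
    "medium": ["data_leakage_scaler", "data_leakage_overlap", "wrong_split_ratio"],
    "hard": ["label_encoder_mismatch", "silent_metric_swap", "tokenizer_version_drift"],
}


def pick_bug(task_id: str, seed: int) -> str:
    """Pick one bug from the task's pool, reproducibly for a given seed.

    Hypothetical helper: the real environment may derive its choice differently.
    """
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    return rng.choice(BUG_POOLS[task_id])


# The same (task_id, seed) pair always yields the same planted bug.
assert pick_bug("hard", 42) == pick_bug("hard", 42)
```

Using a private `random.Random(seed)` instance rather than the module-level functions keeps episodes independent of any other code that touches the global RNG.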
**9 distinct bug types across 3 difficulty tiers. Every episode can have a different bug. Scores vary continuously based on diagnosis precision.**
---
## Procedural Artifact Generation
Every episode generates 6 internally-consistent training artifacts from scratch:
| Artifact | Contents | Role in Investigation |
|---|---|---|
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
| `dataset_stats.json` | Split sizes, class distribution, overlap counts | Detect data issues |
| `preprocessing.py` | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
| `eval_results.json` | Final val/test metrics with hardware info | Quantify metric gaps |
| `model_card.json` | Architecture summary, tokenizer version | Cross-reference versions |
Artifacts are **internally consistent** (config matches logs, dataset stats match preprocessing code) except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.
---
## Action Space (8 actions)
```python
from typing import Literal

from pydantic import BaseModel


class MLOpsAction(BaseModel):
    action_type: Literal[
        "read_config",            # Full training configuration
        "read_logs",              # Training logs (filterable: keyword or "epoch:N-M")
        "check_dataset_stats",    # Split sizes, class distribution, overlap counts
        "inspect_preprocessing",  # Full preprocessing pipeline code
        "read_eval_results",      # Final val/test metrics
        "run_sanity_check",       # Computed diagnostic check (8 types)
        "query_artifact",         # Specific field from any artifact (dot notation)
        "submit_diagnosis",       # Final answer; triggers grading
    ]
    # Per-action fields (e.g. sanity_check_type, failure_category) are not shown in this excerpt.
```
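Two of the actions take small string arguments: `read_logs` accepts an `"epoch:N-M"` filter and `query_artifact` accepts a dot-notation path. The sketch below shows one plausible way to interpret both. The helper names and the `optimizer.lr` field are assumptions made for illustration, not the environment's actual parsing code or config schema.

```python
from typing import Any


def parse_epoch_filter(spec: str) -> range:
    """Turn an "epoch:N-M" filter into an inclusive range of epoch numbers."""
    prefix, _, bounds = spec.partition(":")
    if prefix != "epoch":
        raise ValueError(f"not an epoch filter: {spec!r}")
    start, _, end = bounds.partition("-")
    return range(int(start), int(end) + 1)


def query_field(artifact: dict[str, Any], path: str) -> Any:
    """Resolve a dot-notation path such as "optimizer.lr" in a nested dict."""
    node: Any = artifact
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"field not found: {path!r} (stopped at {key!r})")
        node = node[key]
    return node


# Hypothetical config contents, used only to exercise the helpers.
config = {"optimizer": {"name": "sgd", "lr": 50.0}, "batch_size": 64}
assert list(parse_epoch_filter("epoch:2-4")) == [2, 3, 4]
assert query_field(config, "optimizer.lr") == 50.0
```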
**Sanity check types** (computed diagnostics, not just artifact reads):
`label_consistency` | `data_leakage` | `gradient_norms` | `class_balance` | `feature_statistics` | `encoder_version_match` | `loss_trajectory` | `metric_gap_analysis`
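To make the difference between a computed check and a plain artifact read concrete, here is a minimal sketch of what a `data_leakage` check might compute. The function name, the use of sample IDs, and the shape of the returned report are all assumptions for illustration; the environment's own checks may report different fields.

```python
from typing import Hashable, Iterable


def data_leakage_check(train_ids: Iterable[Hashable], test_ids: Iterable[Hashable]) -> dict:
    """Report how many sample IDs appear in both the train and test splits."""
    train_set, test_set = set(train_ids), set(test_ids)
    overlap = train_set & test_set
    return {
        "check": "data_leakage",
        "overlap_count": len(overlap),
        # Fraction of distinct test samples that were also seen during training.
        "overlap_fraction_of_test": len(overlap) / len(test_set) if test_set else 0.0,
        "passed": not overlap,
    }


report = data_leakage_check(train_ids=[1, 2, 3], test_ids=[3, 4])
assert report["overlap_count"] == 1 and report["passed"] is False
```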
---
## Observation Space
```python
from typing import Any, Dict, List

from pydantic import BaseModel


# ArtifactMeta is a separate model (name, description, size); its definition is not shown here.
class MLOpsObservation(BaseModel):
    task_id: str                             # easy | medium | hard
    task_description: str                    # Full task brief with investigation strategy
    run_id: str                              # Unique run identifier
    run_summary: Dict[str, Any]              # Model, dataset, training status
    available_artifacts: List[ArtifactMeta]  # What can be read (name, description, size)
    artifacts_read: List[str]                # Investigation progress tracking
    last_action_result: Dict[str, Any]       # Full content of last action
    step_count: int
    max_steps: int
    done: bool
    messages: List[str]                      # System warnings (duplicate reads, etc.)
```
---
## Tasks & Difficulty Progression
### Task 1: Config Error Diagnosis `(easy)` | 20 steps max
**Bug pool (one picked randomly per episode):**
- `exploding_lr`: `learning_rate: 50.0` causes loss to diverge to NaN by epoch 3
- `wrong_optimizer`: `SGD(momentum=0.99)` causes loss oscillation with no convergence
- `batch_size_overflow`: `batch_size: 4096` exceeds dataset size, trivial overfitting
**Signal strength:** High. Symptoms visible immediately in training logs.
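A loud failure such as `exploding_lr` can be spotted mechanically from the per-epoch losses. The sketch below is one simple heuristic, assuming positive loss values; the function name and the `blowup_factor` threshold are illustrative choices, not part of the environment.

```python
import math
from typing import Optional, Sequence


def first_divergent_epoch(losses: Sequence[float], blowup_factor: float = 10.0) -> Optional[int]:
    """Return the 1-based epoch where the loss turns NaN/inf or exceeds
    blowup_factor times the first epoch's loss, or None if it never does.

    Assumes positive loss values (for example cross-entropy).
    """
    if not losses:
        return None
    baseline = losses[0]
    for epoch, loss in enumerate(losses, start=1):
        if not math.isfinite(loss) or loss > blowup_factor * baseline:
            return epoch
    return None


assert first_divergent_epoch([2.3, 5.0, float("nan")]) == 3
assert first_divergent_epoch([2.3, 1.9, 1.5]) is None
```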
### Task 2: Data Leakage Detection `(medium)` | 30 steps max
**Bug pool:**
- `data_leakage_scaler`: `StandardScaler.fit_transform(X_full)` called before train/val split
- `data_leakage_overlap`: `train_test_split(random_state=None)` produces overlapping splits
- `wrong_split_ratio`: `test_size=0.8` trains on 20% and evaluates on 80%
**Signal strength:** Medium. Requires correlating val accuracy anomaly in logs with preprocessing code.
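The `data_leakage_scaler` bug comes down to where the scaling statistics are computed. The following plain-Python stand-in (no scikit-learn required, toy numbers invented for the example) shows how fitting before the split lets a held-out point shift the statistics that the training data is scaled with.

```python
from statistics import mean, pstdev


def fit_scaler(values: list[float]) -> tuple[float, float]:
    """Return (mean, population std) for a list of values."""
    return mean(values), pstdev(values)


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # toy feature; the last point is held out
train, val = data[:4], data[4:]

leaky_stats = fit_scaler(data)   # bug: statistics computed before the split
clean_stats = fit_scaler(train)  # fix: statistics computed on the training split only

# The held-out point drags the leaky mean far away from the training mean.
assert clean_stats[0] == 2.5
assert leaky_stats[0] == 22.0
```

In the leaky version, information about the validation sample reaches the model through the scaling parameters even though the sample itself never appears in the training set.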
### Task 3: Silent Evaluation Bug `(hard)` | 40 steps max
**Bug pool:**
- `label_encoder_mismatch`: Train/eval use different `LabelEncoder.fit()` orderings
- `silent_metric_swap`: `val_accuracy` and `test_accuracy` assignments swapped in eval code
- `tokenizer_version_drift`: Training uses tokenizer v2, eval uses v1 (847 tokens map to `[UNK]`)
**Signal strength:** Low. Training logs look completely normal. Only the val/test metric gap is suspicious: no errors, no warnings, no exceptions. Requires reasoning about what's *absent*.
**Asymmetric penalty:** Missing a silent evaluation bug is penalized 1.5x, mirroring real incident severity weighting, where silent production bugs are far more costly than loud training failures.
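To see why a bug like `label_encoder_mismatch` raises no error, consider this small sketch. It assumes an encoder that assigns indices to the sorted distinct labels it was fitted on, and shows one way a mismatch can arise (refitting on eval data that lacks a class). The helper is a stand-in for illustration, and the planted bug in the environment may differ in detail.

```python
def fit_label_map(labels: list[str]) -> dict[str, int]:
    """Map each distinct label to an index, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


train_map = fit_label_map(["cat", "dog", "fish", "dog"])
eval_map = fit_label_map(["dog", "fish"])  # refit on eval data that has no "cat"

# Labels whose integer code differs between training and evaluation.
mismatched = {
    label: (train_map[label], eval_map[label])
    for label in eval_map
    if train_map[label] != eval_map[label]
}
assert mismatched == {"dog": (1, 0), "fish": (2, 1)}
```

Both encoders work without complaint; only the metrics computed from the shifted indices reveal that something is wrong.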
---
## Reward Design
**Dense per-step rewards** (not sparse β€” provides learning signal throughout the episode):
```
Investigation phase:
+0.02 First time reading an artifact (rewards systematic exploration)
-0.02 Re-reading same artifact+filter (penalizes brute force)
+0.01 Running a new sanity check (rewards diagnostic reasoning)
Diagnosis grading (4 independent components):
+0.15 Correct failure_category (what kind of bug?)
+0.25 Correct root_cause_file (which file contains it?)
+0.30 Correct root_cause_field (which parameter/function?)
+0.30 Correct proposed_fix (keyword overlap with gold fix)
Task 3 modifier:
If score < 0.70 → additional 0.5x penalty on missed components
(silent bugs reaching production are more costly than loud failures)
```
**Why dense rewards?** Sparse terminal-only rewards make it impossible to distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." Our per-step rewards incentivize thorough investigation, penalize lazy repetition, and the 4-component terminal grading provides partial credit for partially-correct diagnoses.
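The investigation-phase rewards listed above can be sketched as a small tracker. The class and method names are hypothetical. Two behaviours are assumptions that the reward table does not specify: reading an already-seen artifact with a new filter earns 0.0, and repeating a sanity check earns 0.0.

```python
from typing import Optional


class InvestigationTracker:
    """Track reads and sanity checks and return the per-step reward."""

    def __init__(self) -> None:
        self.seen_artifacts: set[str] = set()
        self.seen_reads: set[tuple[str, Optional[str]]] = set()
        self.seen_checks: set[str] = set()

    def reward_read(self, artifact: str, log_filter: Optional[str] = None) -> float:
        key = (artifact, log_filter)
        if key in self.seen_reads:
            return -0.02  # re-reading the same artifact+filter
        first_time = artifact not in self.seen_artifacts
        self.seen_reads.add(key)
        self.seen_artifacts.add(artifact)
        # Assumption: a new filter on a known artifact is neutral.
        return 0.02 if first_time else 0.0

    def reward_sanity_check(self, check_type: str) -> float:
        if check_type in self.seen_checks:
            return 0.0  # assumption: repeated checks earn nothing
        self.seen_checks.add(check_type)
        return 0.01


tracker = InvestigationTracker()
assert tracker.reward_read("train.log") == 0.02
assert tracker.reward_read("train.log") == -0.02
assert tracker.reward_sanity_check("gradient_norms") == 0.01
```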
**Score spectrum:**
```
No investigation, wrong diagnosis → 0.01
Category only correct            → 0.10–0.15
Category + file correct          → 0.35–0.40
Category + file + field correct  → 0.65
Perfect diagnosis                → 0.90–0.99
```
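The four-component terminal grading can be approximated with a few lines of deterministic code. This is a sketch under stated assumptions, not the environment's grader: the first three fields are compared by case-insensitive equality, the fix is scored by the fraction of gold keywords found, `fix_keywords` is a hypothetical field name, and the Task 3 modifier and per-step rewards are left out, so the numbers will not reproduce the spectrum above exactly.

```python
WEIGHTS = {
    "failure_category": 0.15,
    "root_cause_file": 0.25,
    "root_cause_field": 0.30,
    "proposed_fix": 0.30,
}


def keyword_overlap(proposed: str, gold_keywords: list[str]) -> float:
    """Fraction of gold keywords that occur as substrings of the proposed fix."""
    if not gold_keywords:
        return 0.0
    text = proposed.lower()
    return sum(kw.lower() in text for kw in gold_keywords) / len(gold_keywords)


def grade(diagnosis: dict, gold: dict) -> float:
    """Score a diagnosis against ground truth, clamped to [0.01, 0.99]."""
    score = 0.0
    for field in ("failure_category", "root_cause_file", "root_cause_field"):
        if diagnosis.get(field, "").strip().lower() == gold[field].lower():
            score += WEIGHTS[field]
    fix = diagnosis.get("proposed_fix", "")
    score += WEIGHTS["proposed_fix"] * keyword_overlap(fix, gold["fix_keywords"])
    return round(min(0.99, max(0.01, score)), 4)


gold = {
    "failure_category": "label_mismatch",
    "root_cause_file": "preprocessing.py",
    "root_cause_field": "LabelEncoder.fit_order",
    "fix_keywords": ["single", "labelencoder"],  # hypothetical gold keywords
}
assert grade({}, gold) == 0.01
assert grade({"failure_category": "label_mismatch"}, gold) == 0.15
```

Because the grader only uses string comparison and arithmetic, the same diagnosis always receives the same score.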
---
## Baseline Scores
| Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
|---|---|---|
| Easy | ~0.42 | ~0.91 |
| Medium | ~0.28 | ~0.85 |
| Hard | ~0.15 | ~0.92 |
The baseline agent (no task-specific prompting) struggles significantly on medium and hard tasks, confirming meaningful difficulty progression.
---
## Setup & Usage
### Docker (recommended)
```bash
docker build -t mlops-debug-env .
docker run -p 7860:7860 mlops-debug-env
curl http://localhost:7860/health
```
### Local Python
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
### Python Client
```python
from client import MLOpsDebugEnv
from models import MLOpsAction

with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
    obs = env.reset(task_id="hard", seed=1)

    # Investigate
    r = env.step(MLOpsAction(action_type="read_eval_results"))
    r = env.step(MLOpsAction(action_type="run_sanity_check",
                             sanity_check_type="metric_gap_analysis"))
    r = env.step(MLOpsAction(action_type="inspect_preprocessing"))

    # Diagnose
    r = env.step(MLOpsAction(
        action_type="submit_diagnosis",
        failure_category="label_mismatch",
        root_cause_file="preprocessing.py",
        root_cause_field="LabelEncoder.fit_order",
        diagnosis="Train and eval use different LabelEncoder orderings",
        proposed_fix="Use single LabelEncoder instance across both pipelines",
    ))
    print(f"Score: {r.info['score']}")
```
### Inference Script
```bash
export GEMINI_API_KEY="your_key"
export ENV_BASE_URL="http://localhost:7860"
python inference.py # all 3 tasks
python inference.py --task easy --seed 42
```
**Output format (OpenEnv standard):**
```
[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
```
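If you want to post-process these logs, the `[STEP]` lines are regular enough to parse with a single expression. The sketch below is based only on the sample output above; the helper name is hypothetical, and real logs may contain variations this pattern does not cover.

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def parse_steps(log_text: str) -> list[dict]:
    """Extract one dict per [STEP] line; other lines are ignored."""
    steps = []
    for line in log_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps.append({
                "step": int(match["step"]),
                "action": match["action"],
                "reward": float(match["reward"]),
                "done": match["done"] == "true",
                "error": None if match["error"] == "null" else match["error"],
            })
    return steps


SAMPLE = """\
[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
"""
steps = parse_steps(SAMPLE)
assert [s["reward"] for s in steps] == [0.02, 0.91]
```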
---
## Design Decisions
**Why MLOps debugging?** Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark; it models a real workflow.
**Why procedural generation?** Fixed bug scenarios would let agents memorize answers. Our seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.
**Why deterministic grading?** LLM-as-judge introduces variance and bias. Our grader uses substring/keyword matching against planted ground truth, with zero subjectivity and scores reproducible to 4 decimal places.
**Why asymmetric penalties?** In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.
**Why 8 sanity check types?** Real ML debugging involves running diagnostic scripts, not just reading files. Our computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.
---
## Project Structure
```
MLops-Openenvhack/
├── app.py                  # FastAPI server (REST + WebSocket)
├── mlops_environment.py    # Core environment: reset/step/grading
├── artifact_generator.py   # Procedural artifact + bug generation
├── models.py               # Pydantic models (Action, Observation, State)
├── inference.py            # LLM baseline agent
├── client.py               # Python client library (async + sync)
├── openenv_state.py        # Global state singleton
├── openenv.yaml            # OpenEnv specification
├── Dockerfile              # Container configuration
├── requirements.txt        # Python dependencies
└── server/                 # HF Space deployment copy
```
---
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `GEMINI_API_KEY` | Yes (for inference) | - | Gemini API key for baseline agent |
| `MODEL_NAME` | No | `gemini-2.5-flash` | LLM model identifier |
| `API_BASE_URL` | No | Gemini endpoint | OpenAI-compatible API base URL |
| `ENV_BASE_URL` | No | `http://localhost:7860` | Environment server URL |
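A script that consumes these variables might load them as follows. This is a minimal sketch: `load_settings` is a hypothetical helper, the defaults are the ones in the table above, and `API_BASE_URL` is left as `None` when unset because the table does not spell out the default endpoint URL.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the settings from the table above, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required to run inference")
    return {
        "api_key": api_key,
        "model_name": env.get("MODEL_NAME", "gemini-2.5-flash"),
        # None means: let the caller fall back to its own default endpoint.
        "api_base_url": env.get("API_BASE_URL"),
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:7860"),
    }


settings = load_settings({"GEMINI_API_KEY": "your_key"})
assert settings["model_name"] == "gemini-2.5-flash"
assert settings["env_base_url"] == "http://localhost:7860"
```

Passing a mapping explicitly, as in the usage lines, makes the helper easy to test without touching the real process environment.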
---
## License
MIT