	---
	title: MLOps Pipeline Debugger
	emoji: 🔧
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_port: 7860
	pinned: false
	---

	# MLOps Pipeline Debugger

	[![OpenEnv](https://img.shields.io/badge/OpenEnv-1.0.0-blue)](https://github.com/meta-pytorch/OpenEnv)
	[![Python 3.11](https://img.shields.io/badge/python-3.11-green)](https://www.python.org)
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)

	An OpenEnv-compatible RL environment where an AI agent acts as a senior ML engineer diagnosing a broken training run. Built for the Meta PyTorch Hackathon x Scaler School of Technology.

	---

	## The Real-World Problem

	Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown.

	A senior engineer must systematically investigate — reading logs, checking configs, inspecting preprocessing code, running sanity checks — to find the root cause. This is the #1 time sink in production ML operations, and it's a skill that separates junior from senior ML engineers.

	This environment simulates that investigation workflow. It's not a toy problem — it models the actual top-3 failure modes from production ML pipelines:

	| Failure Mode | Real-World Frequency | Environment Task |
	|---|---|---|
	| Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
	| Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
	| Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |

	---

	## How It Works

	At `reset()`, a complete set of 6 realistic training artifacts is procedurally generated with one planted fault. The agent investigates using 8 structured actions and submits a diagnosis. The grader checks against ground truth — fully deterministic, no LLM judge.

	```
	reset(task_id="hard", seed=42)
	│
	├── Generates: config.yaml, train.log, dataset_stats.json,
	│ preprocessing.py, eval_results.json, model_card.json
	│
	├── Plants: one bug from the task's 3-bug pool
	│
	└── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]
	```

	The bug pool holds 9 distinct bug types, 3 per difficulty tier, so consecutive episodes can plant different bugs. Scores vary continuously with the precision of the diagnosis.
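
The seed-driven bug selection described above could be sketched as follows. The names `BUG_POOLS` and `pick_bug` are hypothetical and only illustrate the idea; the bug identifiers are the ones listed in the task sections of this README, and the real selection logic in the environment may differ.

```python
import random

# Bug pools as listed in the task descriptions of this README.
BUG_POOLS = {
    "easy": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"],
    "medium": ["data_leakage_scaler", "data_leakage_overlap", "wrong_split_ratio"],
    "hard": ["label_encoder_mismatch", "silent_metric_swap", "tokenizer_version_drift"],
}


def pick_bug(task_id: str, seed: int) -> str:
    """Pick one bug from the task's pool, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG, so no global random state is touched
    return rng.choice(BUG_POOLS[task_id])
```

Using a local `random.Random(seed)` keeps an episode reproducible: the same `task_id` and `seed` always select the same bug.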

	---

	## Procedural Artifact Generation

	Every episode generates 6 internally consistent training artifacts from scratch:

	| Artifact | Contents | Role in Investigation |
	|---|---|---|
	| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
	| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
	| `dataset_stats.json` | Split sizes, class distribution, overlap counts | Detect data issues |
	| `preprocessing.py` | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
	| `eval_results.json` | Final val/test metrics with hardware info | Quantify metric gaps |
	| `model_card.json` | Architecture summary, tokenizer version | Cross-reference versions |

	Artifacts are internally consistent — config matches logs, dataset stats match preprocessing code — except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.
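
One way to keep artifacts consistent with each other is to derive all of them from a single set of sampled parameters. The sketch below is an illustration under that assumption, not the actual generator: `generate_artifacts`, the value ranges, and the log line format are all hypothetical, and the config is kept as a plain dict rather than serialized YAML.

```python
import random


def generate_artifacts(seed: int, learning_rate: float = 0.001, epochs: int = 3) -> dict:
    """Build a config and a training log from the same sampled parameters."""
    rng = random.Random(seed)
    batch_size = rng.choice([16, 32, 64])  # hypothetical value range

    # The config and the log header are written from the same variables,
    # so they cannot disagree unless a fault is planted on purpose.
    config = {"learning_rate": learning_rate, "batch_size": batch_size, "epochs": epochs}
    log_lines = [f"config: lr={learning_rate} batch_size={batch_size}"]

    loss = 2.0
    for epoch in range(1, epochs + 1):
        loss *= rng.uniform(0.7, 0.9)  # healthy run: loss shrinks every epoch
        log_lines.append(f"epoch={epoch} loss={loss:.4f}")

    return {"config.yaml": config, "train.log": "\n".join(log_lines)}
```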

	---

	## Action Space (8 actions)

	```python
	from typing import Literal

	from pydantic import BaseModel


	class MLOpsAction(BaseModel):
	    action_type: Literal[
	        "read_config",            # Full training configuration
	        "read_logs",              # Training logs (filterable: keyword or "epoch:N-M")
	        "check_dataset_stats",    # Split sizes, class distribution, overlap counts
	        "inspect_preprocessing",  # Full preprocessing pipeline code
	        "read_eval_results",      # Final val/test metrics
	        "run_sanity_check",       # Computed diagnostic check (8 types)
	        "query_artifact",         # Specific field from any artifact (dot notation)
	        "submit_diagnosis",       # Final answer; triggers grading
	    ]
	    # Fields used by specific actions (e.g. sanity_check_type,
	    # failure_category, proposed_fix) are omitted from this excerpt.
	```
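
The `"epoch:N-M"` log filter and the dot-notation lookup mentioned in the comments above could be resolved along these lines. This is a minimal sketch: `parse_epoch_filter` and `query_field` are hypothetical helper names, the inclusive upper bound is an assumption, and the example field path is made up for illustration.

```python
from typing import Any


def parse_epoch_filter(spec: str) -> range:
    """Turn 'epoch:3-5' into range(3, 6), assuming both ends are inclusive."""
    prefix, _, bounds = spec.partition(":")
    if prefix != "epoch":
        raise ValueError(f"not an epoch filter: {spec!r}")
    start, _, end = bounds.partition("-")
    return range(int(start), int(end) + 1)


def query_field(artifact: dict, path: str) -> Any:
    """Resolve a dotted path such as 'optimizer.learning_rate' in a nested dict."""
    node: Any = artifact
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"field not found: {path!r}")
        node = node[key]
    return node
```

For example, `query_field({"optimizer": {"learning_rate": 50.0}}, "optimizer.learning_rate")` returns `50.0`.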

	Sanity check types (computed diagnostics, not just artifact reads):
	`label_consistency` | `data_leakage` | `gradient_norms` | `class_balance` | `feature_statistics` | `encoder_version_match` | `loss_trajectory` | `metric_gap_analysis`
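
To illustrate what "computed" means here, a `metric_gap_analysis` check might compare validation and test accuracy and flag a large gap. The function below is a hedged sketch: the signature, the returned keys, and the 0.05 threshold are assumptions, not the environment's actual implementation.

```python
def metric_gap_analysis(val_accuracy: float, test_accuracy: float,
                        threshold: float = 0.05) -> dict:
    """Report the val/test accuracy gap and flag it when it exceeds the threshold."""
    gap = val_accuracy - test_accuracy
    return {
        "val_accuracy": val_accuracy,
        "test_accuracy": test_accuracy,
        "gap": round(gap, 4),
        "suspicious": abs(gap) > threshold,  # threshold is an illustrative choice
    }
```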

	---

	## Observation Space

	```python
	from typing import Any, Dict, List

	from pydantic import BaseModel


	# ArtifactMeta is another Pydantic model (name, description, size), not shown here.
	class MLOpsObservation(BaseModel):
	    task_id: str                             # easy | medium | hard
	    task_description: str                    # Full task brief with investigation strategy
	    run_id: str                              # Unique run identifier
	    run_summary: Dict[str, Any]              # Model, dataset, training status
	    available_artifacts: List[ArtifactMeta]  # What can be read (name, description, size)
	    artifacts_read: List[str]                # Investigation progress tracking
	    last_action_result: Dict[str, Any]       # Full content of last action
	    step_count: int
	    max_steps: int
	    done: bool
	    messages: List[str]                      # System warnings (duplicate reads, etc.)
	```

	---

	## Tasks & Difficulty Progression

	### Task 1: Config Error Diagnosis `(easy)` | 20 steps max

	Bug pool (one picked randomly per episode):
	- `exploding_lr` — `learning_rate: 50.0` causes loss to diverge to NaN by epoch 3
	- `wrong_optimizer` — `SGD(momentum=0.99)` causes loss oscillation with no convergence
	- `batch_size_overflow` — `batch_size: 4096` exceeds dataset size, trivial overfitting

	Signal strength: High. Symptoms visible immediately in training logs.
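
A symptom such as the `exploding_lr` divergence can be spotted mechanically from the per-epoch losses. The helper below is an illustrative sketch; `first_divergent_epoch` and its 10x blow-up factor are assumptions and not part of the environment.

```python
import math
from typing import Optional


def first_divergent_epoch(losses: list[float], blowup: float = 10.0) -> Optional[int]:
    """Return the 1-based epoch where the loss turns NaN/inf or exceeds
    `blowup` times the first epoch's loss, or None if training stays stable."""
    if not losses:
        return None
    baseline = losses[0]
    for epoch, loss in enumerate(losses, start=1):
        if not math.isfinite(loss) or loss > blowup * baseline:
            return epoch
    return None
```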

	### Task 2: Data Leakage Detection `(medium)` | 30 steps max

	Bug pool:
	- `data_leakage_scaler` — `StandardScaler.fit_transform(X_full)` called before train/val split
	- `data_leakage_overlap` — `train_test_split(random_state=None)` produces overlapping splits
	- `wrong_split_ratio` — `test_size=0.8` trains on 20% and evaluates on 80%

	Signal strength: Medium. Requires correlating val accuracy anomaly in logs with preprocessing code.
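
The `data_leakage_scaler` bug in the pool above comes down to where the scaling statistics are computed. The sketch below reproduces the idea with the standard library only, so it runs without scikit-learn; the helper names and the toy data are illustrative.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return the (mean, std) that a standard scaler would learn from `values`."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]
val = [10.0, 12.0]

# Leaky: statistics are computed on train + val together,
# so information about the validation split reaches training.
leaky_mean, leaky_std = fit_scaler(train + val)

# Correct: statistics come from the training split only
# and are then reused unchanged for validation.
clean_mean, clean_std = fit_scaler(train)
train_scaled = transform(train, clean_mean, clean_std)
val_scaled = transform(val, clean_mean, clean_std)
```

In the leaky variant the fitted mean shifts toward the validation values, which is the kind of signal an agent can look for when comparing preprocessing code against dataset statistics.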

	### Task 3: Silent Evaluation Bug `(hard)` | 40 steps max

	Bug pool:
	- `label_encoder_mismatch` — Train/eval use different `LabelEncoder.fit()` orderings
	- `silent_metric_swap` — `val_accuracy` and `test_accuracy` assignments swapped in eval code
	- `tokenizer_version_drift` — Training uses tokenizer v2, eval uses v1 (847 tokens map to `[UNK]`)

	Signal strength: Low. Training logs look completely normal. Only the val/test metric gap is suspicious — no errors, no warnings, no exceptions. Requires reasoning about what's absent.

	Asymmetric penalty: Missing a silent evaluation bug is penalized 1.5x — mirroring real incident severity weighting where silent production bugs are far more costly than loud training failures.
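
To see why a bug like `label_encoder_mismatch` stays silent, consider two encoders fitted independently on different label sets. The sketch below mimics a sorted label-to-integer mapping in plain Python; `fit_label_encoder` is a hypothetical stand-in, and this is only one way such a mismatch can arise.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer id, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


train_labels = ["cat", "dog", "fish", "dog"]
eval_labels = ["dog", "fish", "fish"]  # "cat" never appears at eval time

train_encoder = fit_label_encoder(train_labels)  # {'cat': 0, 'dog': 1, 'fish': 2}
eval_encoder = fit_label_encoder(eval_labels)    # {'dog': 0, 'fish': 1}

# Same label, different integer id: predictions get scored against the
# wrong class, yet nothing raises an error or a warning.
mismatched = {
    label for label in eval_encoder
    if eval_encoder[label] != train_encoder[label]
}
```

Here `mismatched` contains both `"dog"` and `"fish"`, even though every individual step ran without an exception.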

	---

	## Reward Design

	Dense per-step rewards (not sparse — provides learning signal throughout the episode):

	```
	Investigation phase:
	+0.02 First time reading an artifact (rewards systematic exploration)
	-0.02 Re-reading same artifact+filter (penalizes brute force)
	+0.01 Running a new sanity check (rewards diagnostic reasoning)

	Diagnosis grading (4 independent components):
	+0.15 Correct failure_category (what kind of bug?)
	+0.25 Correct root_cause_file (which file contains it?)
	+0.30 Correct root_cause_field (which parameter/function?)
	+0.30 Correct proposed_fix (keyword overlap with gold fix)

	Task 3 modifier:
	If score < 0.70 → additional 0.5x penalty on missed components
	(silent bugs reaching production are more costly than loud failures)
	```

	Why dense rewards? Sparse terminal-only rewards make it impossible to distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." The per-step rewards incentivize thorough investigation and penalize lazy repetition, while the 4-component terminal grading gives partial credit for partially correct diagnoses.
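
The investigation-phase rewards listed above could be tracked with a small bookkeeping class. This is a sketch under assumptions: `StepRewarder` is a hypothetical name, reads are keyed by (artifact, filter) so a new filter on the same artifact counts as a first read, and a repeated sanity check earns 0.0 because the README does not state a penalty for it.

```python
FIRST_READ_REWARD = 0.02
REPEAT_READ_PENALTY = -0.02
NEW_CHECK_REWARD = 0.01


class StepRewarder:
    """Remembers what the agent has already done and returns per-step rewards."""

    def __init__(self) -> None:
        self.seen_reads: set[tuple[str, str]] = set()
        self.seen_checks: set[str] = set()

    def read(self, artifact: str, log_filter: str = "") -> float:
        key = (artifact, log_filter)  # assumption: filter is part of the identity
        if key in self.seen_reads:
            return REPEAT_READ_PENALTY
        self.seen_reads.add(key)
        return FIRST_READ_REWARD

    def sanity_check(self, check_type: str) -> float:
        if check_type in self.seen_checks:
            return 0.0  # assumption: no reward and no penalty for a repeated check
        self.seen_checks.add(check_type)
        return NEW_CHECK_REWARD
```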

	Score spectrum:
	```
	No investigation, wrong diagnosis → 0.01
	Category only correct → 0.10–0.15
	Category + file correct → 0.35–0.40
	Category + file + field correct → 0.65
	Perfect diagnosis → 0.90–0.99
	```
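
The four-component terminal grading can be sketched as a weighted sum clamped to the [0.01, 0.99] range. The weights come from the reward table above; everything else is an assumption made for illustration: the `grade` and `keyword_overlap` names, exact matching for the first three fields, the `fix_keywords` entry in the gold record, and the omission of investigation rewards and the Task 3 modifier.

```python
WEIGHTS = {
    "failure_category": 0.15,
    "root_cause_file": 0.25,
    "root_cause_field": 0.30,
    "proposed_fix": 0.30,
}


def keyword_overlap(proposed: str, gold_keywords: list[str]) -> float:
    """Fraction of gold keywords found (case-insensitively) in the proposed fix."""
    if not gold_keywords:
        return 0.0
    text = proposed.lower()
    hits = sum(1 for kw in gold_keywords if kw.lower() in text)
    return hits / len(gold_keywords)


def grade(diagnosis: dict, gold: dict) -> float:
    """Score a diagnosis against ground truth; deterministic, no LLM judge."""
    score = 0.0
    for field in ("failure_category", "root_cause_file", "root_cause_field"):
        if diagnosis.get(field, "").strip().lower() == gold[field].lower():
            score += WEIGHTS[field]
    score += WEIGHTS["proposed_fix"] * keyword_overlap(
        diagnosis.get("proposed_fix", ""), gold["fix_keywords"]
    )
    return round(min(0.99, max(0.01, score)), 4)  # clamp to the documented range
```

Because each component is scored independently, a diagnosis that names the right category and file but the wrong field still earns partial credit.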

	---

	## Baseline Scores

	| Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
	|---|---|---|
	| Easy | ~0.42 | ~0.91 |
	| Medium | ~0.28 | ~0.85 |
	| Hard | ~0.15 | ~0.92 |

	The baseline agent (no task-specific prompting) scores progressively lower from easy to hard, which suggests a meaningful difficulty progression.

	---

	## Setup & Usage

	### Docker (recommended)

	```bash
	docker build -t mlops-debug-env .
	docker run -p 7860:7860 mlops-debug-env
	curl http://localhost:7860/health
	```

	### Local Python

	```bash
	pip install -r requirements.txt
	uvicorn app:app --host 0.0.0.0 --port 7860
	```

	### Python Client

	```python
	from client import MLOpsDebugEnv
	from models import MLOpsAction

	with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
	    obs = env.reset(task_id="hard", seed=1)

	    # Investigate
	    r = env.step(MLOpsAction(action_type="read_eval_results"))
	    r = env.step(MLOpsAction(action_type="run_sanity_check",
	                             sanity_check_type="metric_gap_analysis"))
	    r = env.step(MLOpsAction(action_type="inspect_preprocessing"))

	    # Diagnose
	    r = env.step(MLOpsAction(
	        action_type="submit_diagnosis",
	        failure_category="label_mismatch",
	        root_cause_file="preprocessing.py",
	        root_cause_field="LabelEncoder.fit_order",
	        diagnosis="Train and eval use different LabelEncoder orderings",
	        proposed_fix="Use single LabelEncoder instance across both pipelines"
	    ))
	    print(f"Score: {r.info['score']}")
	```

	### Inference Script

	```bash
	export GEMINI_API_KEY="your_key"
	export ENV_BASE_URL="http://localhost:7860"
	python inference.py # all 3 tasks
	python inference.py --task easy --seed 42
	```

	Output format (OpenEnv standard):
	```
	[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
	[STEP] step=1 action=read_logs reward=0.02 done=false error=null
	[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
	[STEP] step=3 action=read_config reward=0.02 done=false error=null
	[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
	[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
	```
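
If the run output needs to be post-processed, the `[STEP]` lines shown above can be parsed with a regular expression. This is a minimal sketch based only on the sample output; `parse_step` is a hypothetical helper and the pattern may need adjusting if real logs contain other fields.

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.+)$"
)


def parse_step(line: str) -> dict:
    """Parse one '[STEP] ...' line into typed fields."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }
```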

	---

	## Design Decisions

	Why MLOps debugging? Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark — it models a real workflow.

	Why procedural generation? Fixed bug scenarios would let agents memorize answers. Our seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.

	Why deterministic grading? LLM-as-judge introduces variance and bias. Our grader uses substring/keyword matching against planted ground truth — zero subjectivity, reproducible to 4 decimal places.

	Why asymmetric penalties? In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.

	Why 8 sanity check types? Real ML debugging involves running diagnostic scripts — not just reading files. Our computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.

	---

	## Project Structure

	```
	MLops-Openenvhack/
	├── app.py # FastAPI server (REST + WebSocket)
	├── mlops_environment.py # Core environment: reset/step/grading
	├── artifact_generator.py # Procedural artifact + bug generation
	├── models.py # Pydantic models (Action, Observation, State)
	├── inference.py # LLM baseline agent
	├── client.py # Python client library (async + sync)
	├── openenv_state.py # Global state singleton
	├── openenv.yaml # OpenEnv specification
	├── Dockerfile # Container configuration
	├── requirements.txt # Python dependencies
	└── server/ # HF Space deployment copy
	```

	---

	## Environment Variables

	| Variable | Required | Default | Description |
	|---|---|---|---|
	| `GEMINI_API_KEY` | Yes (for inference) | (none) | Gemini API key for baseline agent |
	| `MODEL_NAME` | No | `gemini-2.5-flash` | LLM model identifier |
	| `API_BASE_URL` | No | Gemini endpoint | OpenAI-compatible API base URL |
	| `ENV_BASE_URL` | No | `http://localhost:7860` | Environment server URL |
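
Reading these variables with their documented defaults might look like the sketch below. `InferenceSettings` and `load_settings` are hypothetical names, not part of `inference.py`; because the table does not spell out the Gemini endpoint URL, `API_BASE_URL` is left as `None` when unset rather than guessed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceSettings:
    gemini_api_key: str
    model_name: str
    api_base_url: Optional[str]  # None: caller falls back to its own default endpoint
    env_base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> InferenceSettings:
    """Read settings from the environment, applying the documented defaults."""
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY must be set to run inference")
    return InferenceSettings(
        gemini_api_key=api_key,
        model_name=env.get("MODEL_NAME", "gemini-2.5-flash"),
        api_base_url=env.get("API_BASE_URL"),
        env_base_url=env.get("ENV_BASE_URL", "http://localhost:7860"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.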

	---

	## License

	MIT