title: MLOps Pipeline Debugger
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false

MLOps Pipeline Debugger

OpenEnv | Python 3.11 | License: MIT

An OpenEnv-compatible RL environment where an AI agent acts as a senior ML engineer diagnosing a broken training run. Built for the Meta PyTorch Hackathon x Scaler School of Technology.


The Real-World Problem

Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown.

A senior engineer must systematically investigate (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause. This is the #1 time sink in production ML operations, and it's a skill that separates junior from senior ML engineers.

This environment simulates that investigation workflow. It is not a toy problem; it models the top-3 failure modes from production ML pipelines:

| Failure Mode | Real-World Frequency | Environment Task |
|---|---|---|
| Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
| Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
| Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |

How It Works

At reset(), a complete set of 6 realistic training artifacts is procedurally generated with one planted fault. The agent investigates using 8 structured actions and submits a diagnosis. The grader checks the diagnosis against ground truth: fully deterministic, no LLM judge.

reset(task_id="hard", seed=42)
    │
    ├── Generates: config.yaml, train.log, dataset_stats.json,
    │              preprocessing.py, eval_results.json, model_card.json
    │
    ├── Plants: one bug from the task's 3-bug pool
    │
    └── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]

9 distinct bug types across 3 difficulty tiers. Every episode can have a different bug. Scores vary continuously based on diagnosis precision.
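
As a minimal sketch of how a seed could drive reproducible bug selection: the pools below are copied from the task descriptions in this README, while `pick_bug` and its seeding scheme are hypothetical and not taken from artifact_generator.py.

```python
import random

# Bug pools as listed in the task descriptions of this README.
BUG_POOLS = {
    "easy": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"],
    "medium": ["data_leakage_scaler", "data_leakage_overlap", "wrong_split_ratio"],
    "hard": ["label_encoder_mismatch", "silent_metric_swap", "tokenizer_version_drift"],
}


def pick_bug(task_id: str, seed: int) -> str:
    """Hypothetical helper: choose one bug from the task's pool, reproducibly."""
    # A local Random instance keeps the choice independent of global RNG state.
    # Seeding with a string is supported by the standard library.
    rng = random.Random(f"{task_id}:{seed}")
    return rng.choice(BUG_POOLS[task_id])
```

The same (task_id, seed) pair always yields the same bug, which is the property that makes episodes replayable for grading and debugging.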


Procedural Artifact Generation

Every episode generates 6 internally-consistent training artifacts from scratch:

| Artifact | Contents | Role in Investigation |
|---|---|---|
| config.yaml | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
| train.log | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
| dataset_stats.json | Split sizes, class distribution, overlap counts | Detect data issues |
| preprocessing.py | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
| eval_results.json | Final val/test metrics with hardware info | Quantify metric gaps |
| model_card.json | Architecture summary, tokenizer version | Cross-reference versions |

Artifacts are internally consistent (config matches logs, dataset stats match preprocessing code) except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.


Action Space (8 actions)

from typing import Literal

from pydantic import BaseModel


class MLOpsAction(BaseModel):
    action_type: Literal[
        "read_config",           # Full training configuration
        "read_logs",             # Training logs (filterable: keyword or "epoch:N-M")
        "check_dataset_stats",   # Split sizes, class distribution, overlap counts
        "inspect_preprocessing", # Full preprocessing pipeline code
        "read_eval_results",     # Final val/test metrics
        "run_sanity_check",      # Computed diagnostic check (8 types)
        "query_artifact",        # Specific field from any artifact (dot notation)
        "submit_diagnosis",      # Final answer; triggers grading
    ]

Sanity check types (computed diagnostics, not just artifact reads): label_consistency | data_leakage | gradient_norms | class_balance | feature_statistics | encoder_version_match | loss_trajectory | metric_gap_analysis
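
The two filter styles mentioned in the action comments (a keyword or "epoch:N-M" for read_logs, dot notation for query_artifact) could be handled roughly as follows. This is a sketch under stated assumptions: `filter_log_lines` and `query_field` are hypothetical names, and the log format (each epoch line containing "epoch <number>") is assumed rather than taken from the environment.

```python
import re
from typing import Any, List


def filter_log_lines(lines: List[str], log_filter: str) -> List[str]:
    """Return lines matching a keyword, or an inclusive "epoch:N-M" range."""
    range_match = re.fullmatch(r"epoch:(\d+)-(\d+)", log_filter)
    if range_match is None:
        # Plain keyword: case-insensitive substring match.
        return [ln for ln in lines if log_filter.lower() in ln.lower()]
    first, last = int(range_match.group(1)), int(range_match.group(2))
    kept = []
    for ln in lines:
        # Assumed log format: epoch lines contain "epoch <number>".
        epoch_match = re.search(r"epoch\s+(\d+)", ln, flags=re.IGNORECASE)
        if epoch_match and first <= int(epoch_match.group(1)) <= last:
            kept.append(ln)
    return kept


def query_field(artifact: Any, path: str) -> Any:
    """Walk nested dicts/lists with a dotted path such as "optimizer.lr"."""
    node = artifact
    for key in path.split("."):
        # List elements are addressed by integer index, dict entries by key.
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node
```

For example, `query_field({"optimizer": {"lr": 50.0}}, "optimizer.lr")` walks two levels of nesting and returns the learning rate.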


Observation Space

from typing import Any, Dict, List

from pydantic import BaseModel

# ArtifactMeta (name, description, size) is assumed to be defined elsewhere.


class MLOpsObservation(BaseModel):
    task_id: str                          # easy | medium | hard
    task_description: str                 # Full task brief with investigation strategy
    run_id: str                           # Unique run identifier
    run_summary: Dict[str, Any]           # Model, dataset, training status
    available_artifacts: List[ArtifactMeta]  # What can be read (name, description, size)
    artifacts_read: List[str]             # Investigation progress tracking
    last_action_result: Dict[str, Any]    # Full content of last action
    step_count: int
    max_steps: int
    done: bool
    messages: List[str]                   # System warnings (duplicate reads, etc.)

Tasks & Difficulty Progression

Task 1: Config Error Diagnosis (easy) | 20 steps max

Bug pool (one picked randomly per episode):

  • exploding_lr - learning_rate: 50.0 causes loss to diverge to NaN by epoch 3
  • wrong_optimizer - SGD(momentum=0.99) causes loss oscillation with no convergence
  • batch_size_overflow - batch_size: 4096 exceeds dataset size, trivial overfitting

Signal strength: High. Symptoms visible immediately in training logs.

Task 2: Data Leakage Detection (medium) | 30 steps max

Bug pool:

  • data_leakage_scaler - StandardScaler.fit_transform(X_full) called before train/val split
  • data_leakage_overlap - train_test_split(random_state=None) produces overlapping splits
  • wrong_split_ratio - test_size=0.8 trains on 20% and evaluates on 80%

Signal strength: Medium. Requires correlating val accuracy anomaly in logs with preprocessing code.
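
To make the data_leakage_scaler bug concrete, here is a dependency-free sketch of the same mistake on a single feature. It imitates standardization with the standard library instead of calling scikit-learn, and the data values are made up for illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_scaler(values: List[float]) -> Tuple[float, float]:
    """Return (mean, population std), the statistics a standard scaler learns."""
    return mean(values), pstdev(values)


def transform(values: List[float], stats: Tuple[float, float]) -> List[float]:
    """Standardize values with previously fitted statistics."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


# Illustrative data: the last two rows become the validation split.
full = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, val = full[:4], full[4:]

leaky_stats = fit_scaler(full)   # BUG: statistics are computed on train + val
clean_stats = fit_scaler(train)  # FIX: fit on the training split only

# With leaky statistics the validation rows look far closer to the training
# distribution than they really are, which inflates validation metrics.
leaky_val = transform(val, leaky_stats)
clean_val = transform(val, clean_stats)
```

The fix is purely a matter of ordering: split first, fit the scaler on the training rows, then apply the fitted statistics to the validation rows.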

Task 3: Silent Evaluation Bug (hard) | 40 steps max

Bug pool:

  • label_encoder_mismatch - Train/eval use different LabelEncoder.fit() orderings
  • silent_metric_swap - val_accuracy and test_accuracy assignments swapped in eval code
  • tokenizer_version_drift - Training uses tokenizer v2, eval uses v1 (847 tokens map to [UNK])

Signal strength: Low. Training logs look completely normal. Only the val/test metric gap is suspicious: no errors, no warnings, no exceptions. Requires reasoning about what's absent.
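
Since the metric gap is the only visible symptom, a metric_gap_analysis style check could be as small as the sketch below. The function signature and the 0.05 threshold are assumptions for illustration; only the val_accuracy and test_accuracy names come from this README.

```python
def metric_gap_analysis(val_accuracy: float, test_accuracy: float,
                        max_gap: float = 0.05) -> dict:
    """Flag a validation/test accuracy gap larger than max_gap (assumed threshold)."""
    gap = val_accuracy - test_accuracy
    return {
        "val_accuracy": val_accuracy,
        "test_accuracy": test_accuracy,
        "gap": round(gap, 4),
        # Use the absolute gap so a test score well above validation is
        # flagged too, which is what a swapped-metric bug could produce.
        "suspicious": abs(gap) > max_gap,
    }
```

A flagged gap does not identify the bug by itself; it only tells the agent that the evaluation path deserves a closer look than the training logs suggest.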

Asymmetric penalty: Missing a silent evaluation bug is penalized 1.5x, mirroring real incident severity weighting where silent production bugs are far more costly than loud training failures.


Reward Design

Dense per-step rewards (not sparse, so there is a learning signal throughout the episode):

Investigation phase:
  +0.02  First time reading an artifact     (rewards systematic exploration)
  -0.02  Re-reading same artifact+filter    (penalizes brute force)
  +0.01  Running a new sanity check         (rewards diagnostic reasoning)

Diagnosis grading (4 independent components):
  +0.15  Correct failure_category           (what kind of bug?)
  +0.25  Correct root_cause_file            (which file contains it?)
  +0.30  Correct root_cause_field           (which parameter/function?)
  +0.30  Correct proposed_fix               (keyword overlap with gold fix)

Task 3 modifier:
  If score < 0.70 → additional 0.5x penalty on missed components
  (silent bugs reaching production are more costly than loud failures)
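
The four-component grading above can be sketched as follows. The weights come from this README; everything else is an assumption made for illustration: exact (case-insensitive) matching for the first three fields, word-level overlap for the fix, the reading of the Task 3 modifier as "subtract half of the missed weight", and the names `grade` and `keyword_overlap`. Scores are computed in integer points to avoid floating-point edge cases at the 0.70 boundary.

```python
from typing import Dict

# Component weights from the README, expressed in points out of 100.
POINTS = {
    "failure_category": 15,
    "root_cause_file": 25,
    "root_cause_field": 30,
    "proposed_fix": 30,
}


def keyword_overlap(proposed: str, gold: str) -> float:
    """Fraction of gold-fix words that also appear in the proposed fix."""
    gold_words = set(gold.lower().split())
    if not gold_words:
        return 0.0
    return len(gold_words & set(proposed.lower().split())) / len(gold_words)


def grade(diagnosis: Dict[str, str], gold: Dict[str, str], task_id: str) -> float:
    """Hypothetical grader: partial credit per component, clamped to [0.01, 0.99]."""
    points = 0.0
    for field in ("failure_category", "root_cause_file", "root_cause_field"):
        if diagnosis.get(field, "").strip().lower() == gold[field].strip().lower():
            points += POINTS[field]
    points += POINTS["proposed_fix"] * keyword_overlap(
        diagnosis.get("proposed_fix", ""), gold["proposed_fix"]
    )
    if task_id == "hard" and points < 70:
        # Assumed reading of the Task 3 modifier: subtract half the missed weight.
        points -= 0.5 * (100 - points)
    return round(min(0.99, max(0.01, points / 100)), 4)
```

Under these assumptions, getting only the category and file right scores 0.40 on an easy task but 0.10 on the hard task, which is how the asymmetric penalty shows up in practice.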

Why dense rewards? Sparse terminal-only rewards make it impossible to distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." Per-step rewards incentivize thorough investigation and penalize lazy repetition, while the 4-component terminal grading gives partial credit for partially correct diagnoses.
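
The investigation-phase rewards could be tracked with a small amount of state, as in this sketch. `StepRewardTracker` is a hypothetical class; treating each distinct (artifact, filter) pair as a first read, and giving zero reward for a repeated sanity check, are assumptions where the README does not specify the behavior.

```python
from typing import Optional, Set, Tuple


class StepRewardTracker:
    """Hypothetical per-episode bookkeeping for the investigation-phase rewards."""

    def __init__(self) -> None:
        self.seen_reads: Set[Tuple[str, Optional[str]]] = set()
        self.seen_checks: Set[str] = set()

    def reward_read(self, artifact: str, log_filter: Optional[str] = None) -> float:
        key = (artifact, log_filter)
        if key in self.seen_reads:
            return -0.02  # re-reading the same artifact + filter
        self.seen_reads.add(key)
        return 0.02       # first read of this artifact + filter

    def reward_sanity_check(self, check_type: str) -> float:
        if check_type in self.seen_checks:
            return 0.0    # assumption: repeats are neither rewarded nor penalized
        self.seen_checks.add(check_type)
        return 0.01       # new sanity check
```

A tracker like this would be created at reset() and consulted on every step, so the reward depends only on the episode's own history.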

Score spectrum:

No investigation, wrong diagnosis  →  0.01
Category only correct              →  0.10–0.15
Category + file correct            →  0.35–0.40
Category + file + field correct    →  0.65
Perfect diagnosis                  →  0.90–0.99

Baseline Scores

| Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
|---|---|---|
| Easy | ~0.42 | ~0.91 |
| Medium | ~0.28 | ~0.85 |
| Hard | ~0.15 | ~0.92 |

The baseline agent (no task-specific prompting) struggles significantly on medium and hard tasks, confirming meaningful difficulty progression.


Setup & Usage

Docker (recommended)

docker build -t mlops-debug-env .
docker run -p 7860:7860 mlops-debug-env
curl http://localhost:7860/health

Local Python

pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860

Python Client

from client import MLOpsDebugEnv
from models import MLOpsAction

with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
    obs = env.reset(task_id="hard", seed=1)

    # Investigate
    r = env.step(MLOpsAction(action_type="read_eval_results"))
    r = env.step(MLOpsAction(action_type="run_sanity_check",
                             sanity_check_type="metric_gap_analysis"))
    r = env.step(MLOpsAction(action_type="inspect_preprocessing"))

    # Diagnose
    r = env.step(MLOpsAction(
        action_type="submit_diagnosis",
        failure_category="label_mismatch",
        root_cause_file="preprocessing.py",
        root_cause_field="LabelEncoder.fit_order",
        diagnosis="Train and eval use different LabelEncoder orderings",
        proposed_fix="Use single LabelEncoder instance across both pipelines"
    ))
    print(f"Score: {r.info['score']}")

Inference Script

export GEMINI_API_KEY="your_key"
export ENV_BASE_URL="http://localhost:7860"
python inference.py                    # all 3 tasks
python inference.py --task easy --seed 42

Output format (OpenEnv standard):

[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
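
For post-processing runs, the [STEP] lines shown above can be parsed with a regular expression. This is a sketch that assumes the exact field order in the sample output and an error value without spaces; `parse_steps` is a hypothetical helper, not part of inference.py.

```python
import re
from typing import Dict, List

# Matches lines such as:
# [STEP] step=1 action=read_logs reward=0.02 done=false error=null
STEP_RE = re.compile(
    r"\[STEP\] step=(\d+) action=(\w+) reward=(-?\d+(?:\.\d+)?) "
    r"done=(true|false) error=(\S+)"
)


def parse_steps(output: str) -> List[Dict[str, object]]:
    """Extract one dict per [STEP] line; other lines are ignored."""
    steps = []
    for line in output.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps.append({
                "step": int(match.group(1)),
                "action": match.group(2),
                "reward": float(match.group(3)),
                "done": match.group(4) == "true",
                "error": None if match.group(5) == "null" else match.group(5),
            })
    return steps
```

Summing the parsed rewards gives the episode return, and the last entry's `done` flag confirms that the run ended with a submitted diagnosis.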

Design Decisions

Why MLOps debugging? Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark; it models a real workflow.

Why procedural generation? Fixed bug scenarios would let agents memorize answers. Our seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.

Why deterministic grading? LLM-as-judge introduces variance and bias. Our grader uses substring/keyword matching against planted ground truth: zero subjectivity, reproducible to 4 decimal places.

Why asymmetric penalties? In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.

Why 8 sanity check types? Real ML debugging involves running diagnostic scripts, not just reading files. Our computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.


Project Structure

MLops-Openenvhack/
├── app.py                  # FastAPI server (REST + WebSocket)
├── mlops_environment.py    # Core environment: reset/step/grading
├── artifact_generator.py   # Procedural artifact + bug generation
├── models.py               # Pydantic models (Action, Observation, State)
├── inference.py            # LLM baseline agent
├── client.py               # Python client library (async + sync)
├── openenv_state.py        # Global state singleton
├── openenv.yaml            # OpenEnv specification
├── Dockerfile              # Container configuration
├── requirements.txt        # Python dependencies
└── server/                 # HF Space deployment copy

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| GEMINI_API_KEY | Yes (for inference) | (none) | Gemini API key for baseline agent |
| MODEL_NAME | No | gemini-2.5-flash | LLM model identifier |
| API_BASE_URL | No | Gemini endpoint | OpenAI-compatible API base URL |
| ENV_BASE_URL | No | http://localhost:7860 | Environment server URL |
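
A script consuming these variables might load them as in the sketch below. The `Settings` dataclass and `load_settings` function are hypothetical; the defaults mirror the table, and API_BASE_URL is left as None when unset because the README does not spell out the Gemini endpoint URL.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    model_name: str
    api_base_url: Optional[str]  # None means "use the client's default endpoint"
    env_base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables, applying README defaults."""
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required to run inference")
    return Settings(
        gemini_api_key=api_key,
        model_name=env.get("MODEL_NAME", "gemini-2.5-flash"),
        api_base_url=env.get("API_BASE_URL"),
        env_base_url=env.get("ENV_BASE_URL", "http://localhost:7860"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.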

License

MIT