Spaces: Rockerleo/mlops-openenv

Rockerleo committed on Apr 11

Commit

1e82f9d

1 Parent(s): a744b64

Upload folder using huggingface_hub

Files changed (4)

ARCHITECTURE.md +124 -0
README.md +147 -138
openenv.yaml +76 -13
pyproject.toml +3 -2

ARCHITECTURE.md ADDED

	@@ -0,0 +1,124 @@

+# Architecture
+## System Overview
+```
+Agent (inference.py)
+    │
+    │  POST /reset, POST /step
+    ▼
+FastAPI Server (app.py)
+    │
+    │  reset(), step()
+    ▼
+MLOpsEnvironment (mlops_environment.py)
+    │
+    ├── ArtifactGenerator (artifact_generator.py)
+    │   └── BUG_CATALOGUE: 9 bug specs across 3 tiers
+    │   └── Procedural generation: config, logs, stats, code, eval, model card
+    │
+    ├── Sanity Check Engine (artifact_generator.py)
+    │   └── 8 computed diagnostics grounded in generated artifacts
+    │
+    ├── Grader (_handle_submit)
+    │   └── 4-component scoring: category + file + field + fix
+    │
+    └── Models (models.py)
+        └── MLOpsAction, MLOpsObservation, MLOpsState, ArtifactMeta
+```
+## Data Flow
+### Episode Lifecycle
+```
+1. reset(task_id, seed)
+   ├── Random(seed) selects bug from task pool
+   ├── ArtifactGenerator creates 6 consistent artifacts with planted fault
+   └── Returns: MLOpsObservation with task description + artifact metadata
+2. step(action) × N
+   ├── read_* actions → return artifact content (reward: +0.02 new, -0.02 duplicate)
+   ├── run_sanity_check → compute diagnostic from artifacts (reward: +0.01 new)
+   ├── query_artifact → return specific field via dot notation
+   └── submit_diagnosis → grade against ground truth (terminal)
+3. Grading (_handle_submit)
+   ├── Compare 4 components against BugSpec ground truth
+   ├── Apply hard task penalty if score < 0.70
+   └── Return: score ∈ [0.01, 0.99], breakdown, ground truth
+```
+### Determinism Guarantees
+- `random.Random(seed)` for bug selection and artifact variation
+- `np.random.RandomState(seed)` for numeric distributions
+- No external state, no network calls during generation
+- Same (task_id, seed) always produces identical episode
+## Component Responsibilities
+### app.py — API Layer
+- FastAPI server on port 7860
+- REST endpoints: `/reset`, `/step`, `/state`, `/health`, `/tasks`
+- WebSocket endpoint: `/ws` for streaming interaction
+- Stateless request handling; delegates to MLOpsEnvironment
+### mlops_environment.py — Core Logic
+- Episode state management (step count, artifacts read, score)
+- Action routing to handlers
+- Grading logic with 4-component scoring
+- `grade_task()` standalone grader for OpenEnv validation
+### artifact_generator.py — Content Generation
+- `BugSpec` dataclass: category, file, field, gold_fix, difficulty
+- `BUG_CATALOGUE`: 9 bug specifications
+- `ArtifactGenerator`: produces 6 artifacts per episode
+- `run_sanity_check()`: 8 computed diagnostic checks
+### models.py — Data Models
+- `MLOpsAction`: 8 action types with typed parameters
+- `MLOpsObservation`: full agent observation per step
+- `MLOpsState`: internal state for debugging/RL harness
+- `ArtifactMeta`: artifact metadata (name, description, size hint)
+### inference.py — Baseline Agent
+- LLM-powered agent using Gemini via OpenAI-compatible API
+- Investigation phase: reads artifacts, runs sanity checks
+- Diagnosis phase: submits structured diagnosis
+- Fallback logic for unparseable LLM output
+- Rate limiting with exponential backoff
+### client.py — Client Library
+- `MLOpsDebugEnv`: async httpx client
+- `SyncMLOpsDebugEnv`: synchronous wrapper
+- Context manager support for connection lifecycle
+## API Endpoints
+| Method | Path | Description |
+|--------|------|-------------|
+| GET | `/` | API info |
+| GET | `/health` | Health check |
+| GET | `/tasks` | List available tasks |
+| POST | `/reset` | Start new episode |
+| POST | `/step` | Execute action |
+| GET | `/state` | Current episode state |
+| GET | `/openenv/state` | OpenEnv framework state |
+| WS | `/ws` | WebSocket interface |
+## Reward Architecture
+The reward function has two layers:
+**Per-step (dense):** Encourages systematic investigation
+- New artifact read: +0.02 (explore broadly)
+- Duplicate read: -0.02 (don't brute force)
+- New sanity check: +0.01 (use diagnostics)
+**Terminal (graded):** Evaluates diagnosis quality
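+- 4 independent components with weights summing to 1.0; the reported score is bounded to [0.01, 0.99]
+- Keyword/substring matching (no LLM judge)
+- Hard task asymmetric penalty: if score < 0.70, missed components incur an additional 0.5x penalty (1.5x in total)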
+This two-layer design means an agent that investigates thoroughly but diagnoses wrong still earns per-step rewards, while an agent that submits immediately with a lucky guess earns terminal reward but misses exploration bonuses.
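
Reviewer sketch (not part of the commit): the terminal grading described under "Reward Architecture" could look roughly like the Python below. Only the four weights, the match types (exact file, case-insensitive substring for the field, keyword overlap for the fix), the 0.70 threshold and the [0.01, 0.99] bounds come from the diff. The `GoldBug` class, the `grade` and `keyword_overlap` helpers, the direction of the substring test, and the reading of "additional 0.5x penalty" as subtracting half of the missed weight are all assumptions made for illustration; the real `_handle_submit` may differ.

```python
from dataclasses import dataclass

# Component weights as listed in the diff.
WEIGHTS = {"category": 0.15, "file": 0.25, "field": 0.30, "fix": 0.30}


@dataclass(frozen=True)
class GoldBug:
    """Hypothetical stand-in for the BugSpec ground truth."""
    category: str
    file: str
    field: str
    gold_fix: str


def keyword_overlap(proposed: str, gold: str) -> float:
    """Fraction of gold-fix words that also appear in the proposed fix."""
    gold_words = set(gold.lower().split())
    if not gold_words:
        return 0.0
    return len(gold_words & set(proposed.lower().split())) / len(gold_words)


def grade(diagnosis: dict, gold: GoldBug, hard: bool = False) -> float:
    """Score a diagnosis against the planted bug (illustrative only)."""
    earned = {
        # Exact match on category and file.
        "category": float(diagnosis.get("failure_category", "") == gold.category),
        "file": float(diagnosis.get("root_cause_file", "") == gold.file),
        # Assumed direction: the gold field must occur inside the answer.
        "field": float(
            gold.field.lower() in diagnosis.get("root_cause_field", "").lower()
        ),
        "fix": keyword_overlap(diagnosis.get("proposed_fix", ""), gold.gold_fix),
    }
    score = sum(WEIGHTS[k] * earned[k] for k in WEIGHTS)
    if hard and score < 0.70:
        # Assumed reading of the hard-task modifier: lose an extra
        # 0.5x of the weight of every missed component.
        missed = sum(WEIGHTS[k] * (1.0 - earned[k]) for k in WEIGHTS)
        score -= 0.5 * missed
    # Bound the reported score to the documented range.
    return round(min(0.99, max(0.01, score)), 4)
```

Under these assumptions a fully correct diagnosis is capped at 0.99 and an entirely wrong one is floored at 0.01, which matches the score range stated in the diff.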

README.md CHANGED

@@ -14,73 +14,81 @@ pinned: false
 [![Python 3.11](https://img.shields.io/badge/python-3.11-green)](https://www.python.org)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
-## Latest Baseline Scores
-| Task | Score |
-|------|-------|
-| Easy | 0.91 |
-| Medium | 0.85 |
-| Hard | 1.00 |
-| **Average** | **0.92** |
-*Tested with Gemini 2.5 Flash + Gemini 3.1 Pro Preview fallback for hard task*
-An **OpenEnv-compatible reinforcement learning environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run.
 ---
-## What Is This?
-Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown. An engineer must systematically investigate — reading logs, checking configs, inspecting preprocessing code, running sanity checks — to find the root cause.
-This environment simulates that investigation. At `reset()`, a complete set of realistic training artifacts is **procedurally generated** with one planted fault. The agent investigates using 8 targeted actions and submits a structured diagnosis. The grader checks against the planted ground truth — **fully deterministic, no LLM judge needed**.
-**9 distinct bug types across 3 tasks. Every episode can have a different bug. Scores vary continuously 0.0 → 1.0 based on diagnosis precision.**
 ---
-## Environment Design
-### Procedural Artifact Generation
-Every episode generates 6 realistic training artifacts from scratch:
-| Artifact | Contents |
-|---|---|
-| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler, augmentation |
-| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms with realistic timestamps |
-| `dataset_stats.json` | Split sizes, class distribution, overlap counts, feature statistics |
-| `preprocessing.py` | Full sklearn/PyTorch preprocessing pipeline code |
-| `eval_results.json` | Final val/test metrics with hardware info |
-| `model_card.json` | Architecture summary, tokenizer version, preprocessing config |
-Artifacts are **internally consistent** — config matches logs, dataset stats match preprocessing code — except for the one planted fault. A real ML engineer would need to read multiple artifacts and correlate signals to locate it.
 ---
-## Action Space
 ```python
 class MLOpsAction(BaseModel):
     action_type: Literal[
-        "read_config",          # Full config.yaml
-        "read_logs",            # Training logs (filterable: keyword or "epoch:N-M")
-        "check_dataset_stats",  # Split sizes, class distribution, overlap counts
-        "inspect_preprocessing",# Full preprocessing pipeline code
-        "read_eval_results",    # Final val/test metrics
-        "run_sanity_check",     # Computed diagnostic (see types below)
-        "query_artifact",       # Specific field from any artifact (dot notation)
-        "submit_diagnosis",     # Final answer — triggers grading
     ]
-    # Sanity check types:
-    # label_consistency | data_leakage | gradient_norms | class_balance
-    # feature_statistics | encoder_version_match | loss_trajectory | metric_gap_analysis
-    # submit_diagnosis fields:
-    # failure_category | root_cause_file | root_cause_field | diagnosis | proposed_fix
 ```
 ---
 ## Observation Space
@@ -91,8 +99,8 @@ class MLOpsObservation(BaseModel):
     task_description: str                 # Full task brief with investigation strategy
     run_id: str                           # Unique run identifier
     run_summary: Dict[str, Any]           # Model, dataset, training status
-    available_artifacts: List[ArtifactMeta]  # What can be read
-    artifacts_read: List[str]             # Investigation progress
     last_action_result: Dict[str, Any]    # Full content of last action
     step_count: int
     max_steps: int
@@ -102,84 +110,85 @@ class MLOpsObservation(BaseModel):
 ---
-## Tasks
-### Task 1 — Config Error Diagnosis `(easy)`
 **Bug pool (one picked randomly per episode):**
-- `exploding_lr` — `learning_rate: 50.0` causes loss → NaN by epoch 3
-- `wrong_optimizer` — `SGD(momentum=0.99)` causes oscillation with no convergence
-- `batch_size_overflow` — `batch_size: 4096` exceeds dataset size, val accuracy 99.9% trivially
-**Signal:** Visible immediately in training logs. Loss curve or accuracy values are obviously wrong.
-**Optimal strategy:** `read_logs` → `run_sanity_check(loss_trajectory)` → `read_config` → `submit_diagnosis`
-Max steps: **20** | Expected baseline score: ~0.42
----
-### Task 2 — Data Leakage Detection `(medium)`
 **Bug pool:**
 - `data_leakage_scaler` — `StandardScaler.fit_transform(X_full)` called before train/val split
-- `data_leakage_overlap` — `train_test_split(random_state=None)` produces non-deterministic overlapping splits
-- `wrong_split_ratio` — `test_size=0.8` trains on 20% and evaluates on 80% (inverted)
-**Signal:** Val accuracy suspiciously high from epoch 1 in logs; val/test gap in eval results; sample overlap count in dataset stats.
-**Optimal strategy:** `read_logs` → `read_eval_results` → `run_sanity_check(data_leakage)` → `inspect_preprocessing` → `submit_diagnosis`
-Max steps: **30** | Expected baseline score: ~0.28
----
-### Task 3 — Silent Evaluation Bug `(hard)`
 **Bug pool:**
-- `label_encoder_mismatch` — Train/eval use different `LabelEncoder.fit()` orderings → silent wrong predictions
-- `silent_metric_swap` — `val_accuracy` and `test_accuracy` assignments are swapped in eval code
-- `tokenizer_version_drift` — Training uses tokenizer v2, eval uses v1 → 847 tokens map to `[UNK]`
-**Signal:** Training logs look completely normal. Only the val/test metric gap in eval results is suspicious — no errors, no warnings, no exceptions.
-**Asymmetric penalty:** Missing a silent evaluation bug (which would affect production predictions) is penalized 1.5× — mirroring real incident severity weighting.
-**Optimal strategy:** `read_eval_results` → `run_sanity_check(metric_gap_analysis)` → `inspect_preprocessing` → `run_sanity_check(label_consistency OR encoder_version_match)` → `submit_diagnosis`
-Max steps: **40** | Expected baseline score: ~0.15
 ---
-## Reward Function
-**Dense per-step rewards** (not sparse):
 ```
-+0.02  First time reading an artifact (rewards systematic exploration)
--0.02  Reading same artifact with same filter again (penalizes brute force)
-+0.01  Running a new sanity check (rewards diagnostic reasoning)
-At submit_diagnosis:
-+0.15  Correct failure_category  (config_error / data_leakage / evaluation_bug / ...)
-+0.25  Correct root_cause_file   (exact match)
-+0.30  Correct root_cause_field  (substring match, case-insensitive)
-+0.30  Correct proposed_fix      (keyword overlap with gold fix)
-Task 3 modifier: if score < 0.70, additional 0.5× penalty on missed components
 ```
-**Score spectrum** (verified):
 ```
-All wrong            → 0.00
-Category only        → 0.10–0.15
-Category + file      → 0.35–0.40
-Category + file + field → 0.65
-Perfect diagnosis    → 0.90–1.00
 ```
 ---
 ## Setup & Usage
 ### Docker (recommended)
@@ -200,27 +209,19 @@ uvicorn app:app --host 0.0.0.0 --port 7860
 ### Python Client
 ```python
-# Sync usage
 from client import MLOpsDebugEnv
 from models import MLOpsAction
 with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
     obs = env.reset(task_id="hard", seed=1)
-    print(obs.task_description)
-    # Investigate systematically
     r = env.step(MLOpsAction(action_type="read_eval_results"))
-    print(r.observation.last_action_result["content"])
-    r = env.step(MLOpsAction(
-        action_type="run_sanity_check",
-        sanity_check_type="metric_gap_analysis"
-    ))
-    # Reveals val/test gap anomaly
     r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
-    # Shows the buggy pipeline code
     r = env.step(MLOpsAction(
         action_type="submit_diagnosis",
         failure_category="label_mismatch",
@@ -232,63 +233,71 @@ with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
     print(f"Score: {r.info['score']}")
 ```
----
-## Baseline Inference Script
 ```bash
-export API_BASE_URL="https://router.huggingface.co/v1"
-export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
-export HF_TOKEN="hf_your_token_here"
 export ENV_BASE_URL="http://localhost:7860"
-python inference.py          # all 3 tasks, seed=42
 python inference.py --task easy --seed 42
 ```
-**Output format:**
 ```
-[START] task=easy env=mlops-debug-env model=Qwen/Qwen2.5-72B-Instruct
 [STEP] step=1 action=read_logs reward=0.02 done=false error=null
 [STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
 [STEP] step=3 action=read_config reward=0.02 done=false error=null
-[STEP] step=4 action=submit_diagnosis reward=0.95 done=true error=null
-[END] success=true steps=4 rewards=0.02,0.01,0.02,0.95
 ```
-**Baseline scores** (Qwen2.5-72B-Instruct, seed=42):
-| Task | Score | Notes |
-|---|---|---|
-| easy | ~0.42 | Gets category right, struggles with exact field name |
-| medium | ~0.28 | Often identifies leakage but misidentifies exact mechanism |
-| hard | ~0.15 | Silent bugs with normal training logs are genuinely hard |
----
-## Why This Environment
-**Real problem.** Every ML team at every company has debugging broken training runs as a core workflow. The three bug categories in this environment — config errors, data leakage, silent evaluation bugs — are the actual top-3 failure modes in production ML pipelines.
-**Deterministic grading.** The planted bug is ground truth. Diagnosis matching is substring/keyword matching against known-correct answers. Zero subjectivity, zero LLM-as-judge, reproducible across runs.
-**Genuinely hard for frontier models.** Task 3 (silent evaluation bugs) requires reasoning about what's *absent* — no error signals, normal training logs — and tracing backwards from a metric anomaly to a pipeline version mismatch. State-of-the-art models score ~0.15 without careful prompting.
-**Seed-based reproducibility.** `reset(seed=42)` always produces the same bug, same artifacts, same grading. Baseline scores are reproducible to 4 decimal places.
 ---
 ## Environment Variables
-| Variable | Description |
-|---|---|
-| `API_BASE_URL` | LLM API endpoint (OpenAI-compatible) |
-| `MODEL_NAME` | Model identifier |
-| `HF_TOKEN` | Hugging Face / API token |
-| `ENV_BASE_URL` | Environment server URL (default: `http://localhost:7860`) |
 ---
 ## License
-MIT — see LICENSE

 [![Python 3.11](https://img.shields.io/badge/python-3.11-green)](https://www.python.org)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)
+An **OpenEnv-compatible RL environment** where an AI agent acts as a senior ML engineer diagnosing a broken training run. Built for the **Meta PyTorch Hackathon x Scaler School of Technology**.
+---
+## The Real-World Problem
+Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown.
+A senior engineer must systematically investigate (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause. **This is a major time sink in production ML operations**, and it's a skill that separates junior from senior ML engineers.
+This environment simulates that investigation workflow. It's not a toy problem — it models the **actual top-3 failure modes** from production ML pipelines:
+| Failure Mode | Real-World Frequency | Environment Task |
+|---|---|---|
+| Hyperparameter misconfiguration | ~40% of training failures | Task 1 (Easy) |
+| Data leakage / preprocessing bugs | ~35% of silent accuracy inflation | Task 2 (Medium) |
+| Silent evaluation pipeline bugs | ~25% of post-deployment incidents | Task 3 (Hard) |
 ---
+## How It Works
+At `reset()`, a complete set of **6 realistic training artifacts** is procedurally generated with one planted fault. The agent investigates using **8 structured actions** and submits a diagnosis. The grader checks against ground truth — **fully deterministic, no LLM judge**.
+```
+reset(task_id="hard", seed=42)
+    │
+    ├── Generates: config.yaml, train.log, dataset_stats.json,
+    │              preprocessing.py, eval_results.json, model_card.json
+    │
+    ├── Plants: one bug from the task's 3-bug pool
+    │
+    └── Agent investigates → submits diagnosis → grader scores [0.01, 0.99]
+```
+**9 distinct bug types across 3 difficulty tiers. Every episode can have a different bug. Scores vary continuously based on diagnosis precision.**
 ---
+## Procedural Artifact Generation
+Every episode generates 6 internally consistent training artifacts from scratch:
+| Artifact | Contents | Role in Investigation |
+|---|---|---|
+| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler | Check hyperparameters |
+| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms | Identify symptom patterns |
+| `dataset_stats.json` | Split sizes, class distribution, overlap counts | Detect data issues |
+| `preprocessing.py` | Full sklearn/PyTorch pipeline code | Find pipeline bugs |
+| `eval_results.json` | Final val/test metrics with hardware info | Quantify metric gaps |
+| `model_card.json` | Architecture summary, tokenizer version | Cross-reference versions |
+Artifacts are **internally consistent** — config matches logs, dataset stats match preprocessing code — except for the one planted fault. An agent must read multiple artifacts and correlate signals across them to locate the bug.
 ---
+## Action Space (8 actions)
 ```python
 class MLOpsAction(BaseModel):
     action_type: Literal[
+        "read_config",           # Full training configuration
+        "read_logs",             # Training logs (filterable: keyword or "epoch:N-M")
+        "check_dataset_stats",   # Split sizes, class distribution, overlap counts
+        "inspect_preprocessing", # Full preprocessing pipeline code
+        "read_eval_results",     # Final val/test metrics
+        "run_sanity_check",      # Computed diagnostic check (8 types)
+        "query_artifact",        # Specific field from any artifact (dot notation)
+        "submit_diagnosis",      # Final answer — triggers grading
     ]
 ```
+**Sanity check types** (computed diagnostics, not just artifact reads):
+`label_consistency` | `data_leakage` | `gradient_norms` | `class_balance` | `feature_statistics` | `encoder_version_match` | `loss_trajectory` | `metric_gap_analysis`
 ---
 ## Observation Space
     task_description: str                 # Full task brief with investigation strategy
     run_id: str                           # Unique run identifier
     run_summary: Dict[str, Any]           # Model, dataset, training status
+    available_artifacts: List[ArtifactMeta]  # What can be read (name, description, size)
+    artifacts_read: List[str]             # Investigation progress tracking
     last_action_result: Dict[str, Any]    # Full content of last action
     step_count: int
     max_steps: int
 ---
+## Tasks & Difficulty Progression
+### Task 1 — Config Error Diagnosis `(easy)` | 20 steps max
 **Bug pool (one picked randomly per episode):**
+- `exploding_lr` — `learning_rate: 50.0` causes loss to diverge to NaN by epoch 3
+- `wrong_optimizer` — `SGD(momentum=0.99)` causes loss oscillation with no convergence
+- `batch_size_overflow` — `batch_size: 4096` exceeds dataset size, trivial overfitting
+**Signal strength:** High. Symptoms visible immediately in training logs.
+### Task 2 — Data Leakage Detection `(medium)` | 30 steps max
 **Bug pool:**
 - `data_leakage_scaler` — `StandardScaler.fit_transform(X_full)` called before train/val split
+- `data_leakage_overlap` — `train_test_split(random_state=None)` produces overlapping splits
+- `wrong_split_ratio` — `test_size=0.8` trains on 20% and evaluates on 80%
+**Signal strength:** Medium. Requires correlating val accuracy anomaly in logs with preprocessing code.
+### Task 3 — Silent Evaluation Bug `(hard)` | 40 steps max
 **Bug pool:**
+- `label_encoder_mismatch` — Train/eval use different `LabelEncoder.fit()` orderings
+- `silent_metric_swap` — `val_accuracy` and `test_accuracy` assignments swapped in eval code
+- `tokenizer_version_drift` — Training uses tokenizer v2, eval uses v1 (847 tokens map to `[UNK]`)
+**Signal strength:** Low. Training logs look completely normal. Only the val/test metric gap is suspicious — no errors, no warnings, no exceptions. Requires reasoning about what's *absent*.
+**Asymmetric penalty:** Missing a silent evaluation bug is penalized 1.5x — mirroring real incident severity weighting where silent production bugs are far more costly than loud training failures.
 ---
+## Reward Design
+**Dense per-step rewards** (not sparse — provides learning signal throughout the episode):
 ```
+Investigation phase:
+  +0.02  First time reading an artifact     (rewards systematic exploration)
+  -0.02  Re-reading same artifact+filter    (penalizes brute force)
+  +0.01  Running a new sanity check         (rewards diagnostic reasoning)
+Diagnosis grading (4 independent components):
+  +0.15  Correct failure_category           (what kind of bug?)
+  +0.25  Correct root_cause_file            (which file contains it?)
+  +0.30  Correct root_cause_field           (which parameter/function?)
+  +0.30  Correct proposed_fix               (keyword overlap with gold fix)
+Task 3 modifier:
+  If score < 0.70 → additional 0.5x penalty on missed components
+  (silent bugs reaching production are more costly than loud failures)
 ```
+**Why dense rewards?** Sparse terminal-only rewards cannot distinguish "investigated well but diagnosed wrong" from "didn't investigate at all." The per-step rewards incentivize thorough investigation and penalize lazy repetition, while the 4-component terminal grading gives partial credit for partially correct diagnoses.
+**Score spectrum:**
 ```
+No investigation, wrong diagnosis  →  0.01
+Category only correct              →  0.10–0.15
+Category + file correct            →  0.35–0.40
+Category + file + field correct    →  0.65
+Perfect diagnosis                  →  0.90–0.99
 ```
 ---
+## Baseline Scores
+| Task | Baseline (Qwen2.5-72B) | Optimized (Gemini 2.5 Flash) |
+|---|---|---|
+| Easy | ~0.42 | ~0.91 |
+| Medium | ~0.28 | ~0.85 |
+| Hard | ~0.15 | ~0.92 |
+The baseline agent (no task-specific prompting) struggles significantly on medium and hard tasks, confirming meaningful difficulty progression.
+---
 ## Setup & Usage
 ### Docker (recommended)
 ### Python Client
 ```python
 from client import MLOpsDebugEnv
 from models import MLOpsAction
 with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
     obs = env.reset(task_id="hard", seed=1)
+    # Investigate
     r = env.step(MLOpsAction(action_type="read_eval_results"))
+    r = env.step(MLOpsAction(action_type="run_sanity_check",
+                             sanity_check_type="metric_gap_analysis"))
     r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
+    # Diagnose
     r = env.step(MLOpsAction(
         action_type="submit_diagnosis",
         failure_category="label_mismatch",
     print(f"Score: {r.info['score']}")
 ```
+### Inference Script
 ```bash
+export GEMINI_API_KEY="your_key"
 export ENV_BASE_URL="http://localhost:7860"
+python inference.py                    # all 3 tasks
 python inference.py --task easy --seed 42
 ```
+**Output format (OpenEnv standard):**
 ```
+[START] task=easy env=mlops-debug-env model=gemini-2.5-flash
 [STEP] step=1 action=read_logs reward=0.02 done=false error=null
 [STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
 [STEP] step=3 action=read_config reward=0.02 done=false error=null
+[STEP] step=4 action=submit_diagnosis reward=0.91 done=true error=null
+[END] success=true steps=4 score=0.9100 rewards=0.02,0.01,0.02,0.91
 ```
+---
+## Design Decisions
+**Why MLOps debugging?** Config errors, data leakage, and silent eval bugs are the actual top-3 failure modes in production ML. Every ML team at every company deals with these. This isn't a synthetic benchmark — it models a real workflow.
+**Why procedural generation?** Fixed bug scenarios would let agents memorize answers. Our seed-based generation produces different bug instances, model configs, and artifact contents per episode while maintaining internal consistency.
+**Why deterministic grading?** LLM-as-judge introduces variance and bias. Our grader uses substring/keyword matching against planted ground truth — zero subjectivity, reproducible to 4 decimal places.
+**Why asymmetric penalties?** In production, a loud training crash (Task 1) is caught immediately. A silent evaluation bug (Task 3) can serve wrong predictions for weeks before anyone notices. The 1.5x penalty on Task 3 mirrors this real-world cost asymmetry.
+**Why 8 sanity check types?** Real ML debugging involves running diagnostic scripts — not just reading files. Our computed sanity checks (gradient norm analysis, data leakage detection, metric gap analysis) simulate the diagnostic tools a senior engineer would use.
+---
+## Project Structure
+```
+MLops-Openenvhack/
+├── app.py                  # FastAPI server (REST + WebSocket)
+├── mlops_environment.py    # Core environment: reset/step/grading
+├── artifact_generator.py   # Procedural artifact + bug generation
+├── models.py               # Pydantic models (Action, Observation, State)
+├── inference.py            # LLM baseline agent
+├── client.py               # Python client library (async + sync)
+├── openenv_state.py        # Global state singleton
+├── openenv.yaml            # OpenEnv specification
+├── Dockerfile              # Container configuration
+├── requirements.txt        # Python dependencies
+└── server/                 # HF Space deployment copy
+```
 ---
 ## Environment Variables
+| Variable | Required | Default | Description |
+|---|---|---|---|
+| `GEMINI_API_KEY` | Yes (for inference) | — | Gemini API key for baseline agent |
+| `MODEL_NAME` | No | `gemini-2.5-flash` | LLM model identifier |
+| `API_BASE_URL` | No | Gemini endpoint | OpenAI-compatible API base URL |
+| `ENV_BASE_URL` | No | `http://localhost:7860` | Environment server URL |
 ---
 ## License
+MIT
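
Reviewer sketch (not part of the commit): the per-step rewards listed under "Reward Design" in the new README could be tracked along these lines. The three reward values and the action names come from the diff. The `StepRewardTracker` class, the choice to key duplicates on the (action, filter) pair, and the zero reward for repeated sanity checks and for `query_artifact` are assumptions, since the diff does not specify those cases.

```python
# Read-style actions taken from the MLOpsAction literal in the README.
READ_ACTIONS = {
    "read_config",
    "read_logs",
    "check_dataset_stats",
    "inspect_preprocessing",
    "read_eval_results",
}


class StepRewardTracker:
    """Hypothetical helper that assigns the dense per-step rewards."""

    def __init__(self):
        self.reads_seen = set()   # (action_type, log_filter) pairs
        self.checks_seen = set()  # sanity_check_type values

    def reward(self, action_type, log_filter=None, sanity_check_type=None):
        if action_type == "run_sanity_check":
            if sanity_check_type in self.checks_seen:
                return 0.0  # assumption: repeats are neither rewarded nor penalized
            self.checks_seen.add(sanity_check_type)
            return 0.01
        if action_type in READ_ACTIONS:
            key = (action_type, log_filter)
            if key in self.reads_seen:
                return -0.02  # same artifact and same filter again
            self.reads_seen.add(key)
            return 0.02
        # assumption: query_artifact and submit_diagnosis earn no step reward here
        return 0.0
```

With this keying, reading `train.log` twice with no filter earns +0.02 and then -0.02, while a second read with a new filter such as `"epoch:1-3"` counts as new. Whether the real environment treats a new filter as a new read is not stated in the diff.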

openenv.yaml CHANGED

@@ -5,51 +5,114 @@ description: >
   investigating a broken training run. The environment procedurally generates
   realistic training artifacts (logs, configs, preprocessing code, eval results)
   with one planted fault. The agent must systematically investigate and submit
-  a structured diagnosis. Three tasks: config error (easy) → data leakage (medium)
-  → silent evaluation bug (hard). All graders are fully deterministic.
-author: Mohit Goyal
 license: MIT
-tags: [openenv, rl, mlops, debugging, machine-learning, agents]
 tasks:
   - id: easy
     name: Config Error Diagnosis
     difficulty: easy
     max_steps: 20
     bug_pool: [exploding_lr, wrong_optimizer, batch_size_overflow]
-    reward_range: [0.0, 1.0]
   - id: medium
     name: Data Leakage Detection
     difficulty: medium
     max_steps: 30
     bug_pool: [data_leakage_scaler, data_leakage_overlap, wrong_split_ratio]
-    reward_range: [0.0, 1.0]
   - id: hard
     name: Silent Evaluation Bug
     difficulty: hard
     max_steps: 40
     bug_pool: [label_encoder_mismatch, silent_metric_swap, tokenizer_version_drift]
-    reward_range: [0.0, 1.0]
     asymmetric_penalty: true
 action_space:
   type: discrete_structured
-  actions: [read_config, read_logs, check_dataset_stats, inspect_preprocessing,
-            read_eval_results, run_sanity_check, query_artifact, submit_diagnosis]
 observation_space:
   type: structured_text
-  fields: [task_id, run_summary, available_artifacts, artifacts_read,
-           last_action_result, step_count, max_steps, done, messages]
 reward:
   type: dense_and_terminal
-  per_step: "+0.02 new artifact read, -0.02 duplicate read, +0.01 new sanity check"
-  terminal: "0.15 category + 0.25 file + 0.30 field + 0.30 fix. Hard task 1.5x penalty."
 api:
   reset: POST /reset
   step: POST /step
   state: GET /state
   health: GET /health
   websocket: /ws
 runtime:
   port: 7860
   workers: 1
   framework: fastapi
   python: "3.11"

   investigating a broken training run. The environment procedurally generates
   realistic training artifacts (logs, configs, preprocessing code, eval results)
   with one planted fault. The agent must systematically investigate and submit
+  a structured diagnosis. Three tasks: config error (easy) -> data leakage (medium)
+  -> silent evaluation bug (hard). All graders are fully deterministic.
+author: Code Clashers
 license: MIT
+tags: [openenv, rl, mlops, debugging, machine-learning, agents, pytorch]
+grading:
+  type: deterministic
+  judge: none
+  method: keyword_and_substring_matching
+  reproducible: true
 tasks:
   - id: easy
     name: Config Error Diagnosis
     difficulty: easy
     max_steps: 20
     bug_pool: [exploding_lr, wrong_optimizer, batch_size_overflow]
+    reward_range: [0.01, 0.99]
+    description: >
+      Diagnose a training failure caused by a hyperparameter misconfiguration.
+      Symptoms are visible in training logs (loss explosion, oscillation, trivial overfitting).
   - id: medium
     name: Data Leakage Detection
     difficulty: medium
     max_steps: 30
     bug_pool: [data_leakage_scaler, data_leakage_overlap, wrong_split_ratio]
+    reward_range: [0.01, 0.99]
+    description: >
+      Identify data leakage in the preprocessing pipeline. Val accuracy is suspiciously
+      high from epoch 1, but test performance tells a different story. Requires correlating
+      logs, eval results, and preprocessing code.
   - id: hard
     name: Silent Evaluation Bug
     difficulty: hard
     max_steps: 40
     bug_pool: [label_encoder_mismatch, silent_metric_swap, tokenizer_version_drift]
+    reward_range: [0.01, 0.99]
     asymmetric_penalty: true
+    penalty_multiplier: 1.5
+    description: >
+      Find a silent bug in the evaluation pipeline. Training logs look completely normal.
+      No errors, no warnings. Only a val/test metric gap reveals the issue. Requires
+      reasoning about what is absent rather than what is present.
 action_space:
   type: discrete_structured
+  actions:
+    - read_config
+    - read_logs
+    - check_dataset_stats
+    - inspect_preprocessing
+    - read_eval_results
+    - run_sanity_check
+    - query_artifact
+    - submit_diagnosis
+  sanity_check_types:
+    - label_consistency
+    - data_leakage
+    - gradient_norms
+    - class_balance
+    - feature_statistics
+    - encoder_version_match
+    - loss_trajectory
+    - metric_gap_analysis
 observation_space:
   type: structured_text
+  fields:
+    - task_id
+    - task_description
+    - run_id
+    - run_summary
+    - available_artifacts
+    - artifacts_read
+    - last_action_result
+    - step_count
+    - max_steps
+    - done
+    - messages
 reward:
   type: dense_and_terminal
+  per_step:
+    new_artifact_read: +0.02
+    duplicate_read: -0.02
+    new_sanity_check: +0.01
+  terminal:
+    failure_category: +0.15
+    root_cause_file: +0.25
+    root_cause_field: +0.30
+    proposed_fix: +0.30
+  hard_task_penalty: "if score < 0.70, additional 0.5x on missed components"
 api:
   reset: POST /reset
   step: POST /step
   state: GET /state
   health: GET /health
+  tasks: GET /tasks
+  openenv_state: GET /openenv/state
   websocket: /ws
 runtime:
   port: 7860
   workers: 1
   framework: fastapi
   python: "3.11"
+  container: docker
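
Reviewer sketch (not part of the commit): now that `openenv.yaml` spells out the reward weights and ranges as structured fields, a small consistency check becomes possible. The values in `SPEC` are hand-copied from the new side of the diff; a real check would load the file with a YAML parser instead. The `check_spec` function and its three rules are hypothetical and are not something the OpenEnv framework is known to run.

```python
import math

# Hand-copied subset of the new openenv.yaml (assumption of this sketch).
SPEC = {
    "tasks": [
        {"id": "easy", "max_steps": 20, "reward_range": [0.01, 0.99],
         "bug_pool": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"]},
        {"id": "medium", "max_steps": 30, "reward_range": [0.01, 0.99],
         "bug_pool": ["data_leakage_scaler", "data_leakage_overlap",
                      "wrong_split_ratio"]},
        {"id": "hard", "max_steps": 40, "reward_range": [0.01, 0.99],
         "bug_pool": ["label_encoder_mismatch", "silent_metric_swap",
                      "tokenizer_version_drift"]},
    ],
    "reward": {
        "terminal": {"failure_category": 0.15, "root_cause_file": 0.25,
                     "root_cause_field": 0.30, "proposed_fix": 0.30},
    },
}


def check_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    # The four terminal components should add up to a full score.
    total = sum(spec["reward"]["terminal"].values())
    if not math.isclose(total, 1.0):
        problems.append(f"terminal weights sum to {total}, expected 1.0")
    # Each bug should belong to exactly one task pool.
    bugs = [bug for task in spec["tasks"] for bug in task["bug_pool"]]
    if len(set(bugs)) != len(bugs):
        problems.append("a bug appears in more than one task pool")
    # Reward ranges should be ordered and lie inside [0, 1].
    for task in spec["tasks"]:
        low, high = task["reward_range"]
        if not 0.0 <= low < high <= 1.0:
            problems.append(f"task {task['id']} has invalid reward_range")
    return problems
```

For the values copied above, `check_spec(SPEC)` returns an empty list: the weights sum to 1.0 and the nine bugs are distinct across the three pools.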

pyproject.toml CHANGED

@@ -5,7 +5,7 @@ description = "MLOps Pipeline Debugger - OpenEnv-compatible RL environment for M
 readme = "README.md"
 requires-python = ">=3.11"
 license = {text = "MIT"}
-authors = [{name = "MLOps Team"}]
 dependencies = [
     "fastapi>=0.115.0",
@@ -21,7 +21,8 @@ dependencies = [
 ]
 [project.scripts]
-server = "uvicorn:main"
 [project.optional-dependencies]
 dev = [

 readme = "README.md"
 requires-python = ">=3.11"
 license = {text = "MIT"}
+authors = [{name = "Code Clashers"}]
 dependencies = [
     "fastapi>=0.115.0",
 ]
 [project.scripts]
+mlops-server = "uvicorn:main"
+mlops-infer = "inference:main"
 [project.optional-dependencies]
 dev = [