mlops-openenv / ARCHITECTURE.md

Architecture

System Overview

Agent (inference.py)
    │
    │  POST /reset, POST /step
    ▼
FastAPI Server (app.py)
    │
    │  reset(), step()
    ▼
MLOpsEnvironment (mlops_environment.py)
    │
    ├── ArtifactGenerator (artifact_generator.py)
    │   ├── BUG_CATALOGUE: 9 bug specs across 3 tiers
    │   └── Procedural generation: config, logs, stats, code, eval, model card
    │
    ├── Sanity Check Engine (artifact_generator.py)
    │   └── 8 computed diagnostics grounded in generated artifacts
    │
    ├── Grader (_handle_submit)
    │   └── 4-component scoring: category + file + field + fix
    │
    └── Models (models.py)
        └── MLOpsAction, MLOpsObservation, MLOpsState, ArtifactMeta

Data Flow

Episode Lifecycle

1. reset(task_id, seed)
   ├── Random(seed) selects bug from task pool
   ├── ArtifactGenerator creates 6 consistent artifacts with planted fault
   └── Returns: MLOpsObservation with task description + artifact metadata

2. step(action) × N
   ├── read_* actions → return artifact content (reward: +0.02 new, -0.02 duplicate)
   ├── run_sanity_check → compute diagnostic from artifacts (reward: +0.01 new)
   ├── query_artifact → return specific field via dot notation
   └── submit_diagnosis → grade against ground truth (terminal)

3. Grading (_handle_submit)
   ├── Compare 4 components against BugSpec ground truth
   ├── Apply hard task penalty if score < 0.70
   └── Return: score ∈ (0.01, 0.99), breakdown, ground truth
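
The `query_artifact` action in step 2 returns a field addressed by dot notation. A minimal sketch of such a lookup, assuming artifacts are held as nested dicts; `get_by_path` and the sample `config` are hypothetical names for illustration and are not taken from mlops_environment.py:

```python
from typing import Any


def get_by_path(artifact: dict, path: str) -> Any:
    """Resolve a dotted path such as 'optimizer.lr' in a nested dict."""
    current: Any = artifact
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"field not found: {path!r}")
        current = current[key]
    return current


# Hypothetical artifact content, for illustration only.
config = {"optimizer": {"name": "adam", "lr": 0.001}}
```

Raising `KeyError` on a missing segment is one possible design; the real handler may instead return an error observation to the agent.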

Determinism Guarantees

  • random.Random(seed) for bug selection and artifact variation
  • np.random.RandomState(seed) for numeric distributions
  • No external state, no network calls during generation
  • Same (task_id, seed) always produces identical episode
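
The guarantees above rest on drawing every random choice from a generator seeded per episode. The sketch below illustrates the idea with the standard library only; `generate_episode`, the bug pool, and the noise values are hypothetical, and `np.random.RandomState(seed)` would play the same role for numeric distributions:

```python
import random


def generate_episode(task_id: str, seed: int) -> dict:
    """Derive every random choice from a private RNG seeded by `seed`."""
    rng = random.Random(seed)  # private instance; global random state is untouched
    bug_pool = ["bug_a", "bug_b", "bug_c"]  # hypothetical pool for one task
    return {
        "task_id": task_id,
        "bug": rng.choice(bug_pool),
        "loss_noise": [round(rng.gauss(0.0, 0.05), 6) for _ in range(3)],
    }
```

Because the generator is a local instance, calls to the module-level `random` functions elsewhere in the process do not change what a given `(task_id, seed)` pair produces.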

Component Responsibilities

app.py: API Layer

  • FastAPI server on port 7860
  • REST endpoints: /reset, /step, /state, /health, /tasks
  • WebSocket endpoint: /ws for streaming interaction
  • Stateless request handling; delegates to MLOpsEnvironment

mlops_environment.py: Core Logic

  • Episode state management (step count, artifacts read, score)
  • Action routing to handlers
  • Grading logic with 4-component scoring
  • grade_task() standalone grader for OpenEnv validation

artifact_generator.py: Content Generation

  • BugSpec dataclass: category, file, field, gold_fix, difficulty
  • BUG_CATALOGUE: 9 bug specifications
  • ArtifactGenerator: produces 6 artifacts per episode
  • run_sanity_check(): 8 computed diagnostic checks
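
As a rough illustration of the `BugSpec` shape listed above, the sketch below uses the five documented field names. The field types, the `frozen=True` setting, and every value in `EXAMPLE_BUG` are assumptions; the real catalogue entries are not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # immutability is an assumption, not confirmed by the source
class BugSpec:
    category: str
    file: str
    field: str
    gold_fix: str
    difficulty: str  # assumed to name one of the three tiers


# Hypothetical entry for illustration; not a real BUG_CATALOGUE item.
EXAMPLE_BUG = BugSpec(
    category="data_leakage",
    file="preprocessing.py",
    field="train_test_split",
    gold_fix="fit the scaler on the training split only",
    difficulty="medium",
)
```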

models.py: Data Models

  • MLOpsAction: 8 action types with typed parameters
  • MLOpsObservation: full agent observation per step
  • MLOpsState: internal state for debugging/RL harness
  • ArtifactMeta: artifact metadata (name, description, size hint)

inference.py: Baseline Agent

  • LLM-powered agent using Gemini via OpenAI-compatible API
  • Investigation phase: reads artifacts, runs sanity checks
  • Diagnosis phase: submits structured diagnosis
  • Fallback logic for unparseable LLM output
  • Rate limiting with exponential backoff
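
The rate-limiting bullet can be pictured with a generic retry helper. This is a sketch of exponential backoff in general, not the code in inference.py: `RateLimitError`, `call_with_backoff`, and the default delays are all hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the provider's rate-limit exception."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `fn` on RateLimitError, doubling the delay after each failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))  # add up to 10% jitter
    raise RuntimeError("max_retries must be at least 1")
```

The `sleep` parameter is injectable so the schedule can be exercised in tests without actually waiting.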

client.py: Client Library

  • MLOpsDebugEnv: async httpx client
  • SyncMLOpsDebugEnv: synchronous wrapper
  • Context manager support for connection lifecycle
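
One common way to build a synchronous wrapper over an async client is to drive each coroutine on a private event loop. The sketch below shows that pattern with a stub in place of the httpx-based client; `AsyncEnvStub` and `SyncEnv` are hypothetical names, and the real `SyncMLOpsDebugEnv` may be implemented differently.

```python
import asyncio


class AsyncEnvStub:
    """Hypothetical stand-in for the async client; performs no network I/O."""

    async def reset(self, task_id: str, seed: int) -> dict:
        return {"task_id": task_id, "seed": seed, "step": 0}

    async def close(self) -> None:
        pass


class SyncEnv:
    """Synchronous facade: runs each coroutine to completion on a private loop."""

    def __init__(self, async_env: AsyncEnvStub) -> None:
        self._env = async_env
        self._loop = asyncio.new_event_loop()

    def reset(self, task_id: str, seed: int) -> dict:
        return self._loop.run_until_complete(self._env.reset(task_id, seed))

    def close(self) -> None:
        self._loop.run_until_complete(self._env.close())
        self._loop.close()

    def __enter__(self) -> "SyncEnv":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()  # release the client and the loop even if the body raised
```

Used as `with SyncEnv(AsyncEnvStub()) as env: ...`, the wrapper closes both the underlying client and its event loop when the block exits.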

API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | / | API info |
| GET | /health | Health check |
| GET | /tasks | List available tasks |
| POST | /reset | Start new episode |
| POST | /step | Execute action |
| GET | /state | Current episode state |
| GET | /openenv/state | OpenEnv framework state |
| WS | /ws | WebSocket interface |
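
To make the two POST endpoints concrete, the sketch below builds (but does not send) JSON requests with the standard library. The port comes from the API layer notes; the host and every payload field name (`task_id`, `seed`, `action_type`) are assumptions, since the request schema is not given here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port from the API layer notes; host is assumed


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Field names below are assumptions, not the documented schema.
reset_req = build_post("/reset", {"task_id": "easy", "seed": 42})
step_req = build_post("/step", {"action_type": "run_sanity_check"})

# To send, pass a request to urllib.request.urlopen() with the server running.
```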

Reward Architecture

The reward function has two layers:

Per-step (dense): Encourages systematic investigation

  • New artifact read: +0.02 (explore broadly)
  • Duplicate read: -0.02 (don't brute force)
  • New sanity check: +0.01 (use diagnostics)
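
The per-step rules above can be sketched as a small function. The three reward values come from this document; the function name, the `(action_type, target)` bookkeeping, and the 0.0 reward for a repeated sanity check are assumptions.

```python
def step_reward(action_type: str, target: str, seen: set) -> float:
    """Return the dense reward for one action and record it in `seen`."""
    key = (action_type, target)
    is_new = key not in seen
    seen.add(key)
    if action_type.startswith("read_"):
        return 0.02 if is_new else -0.02
    if action_type == "run_sanity_check":
        # Reward for a repeated check is not documented; 0.0 is assumed here.
        return 0.01 if is_new else 0.0
    return 0.0
```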

Terminal (graded): Evaluates diagnosis quality

  • 4 independent components sum to max 1.0
  • Keyword/substring matching (no LLM judge)
  • Hard task asymmetric penalty (1.5x on missed components)
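
A minimal sketch of how such a terminal grader could look, under stated assumptions: equal weights of 0.25 per component, case-insensitive substring matching, a penalty that charges missed components at 1.5x when a hard task scores below 0.70, and clamping to the 0.01 to 0.99 range. The actual weights and penalty formula in `_handle_submit` may differ.

```python
WEIGHTS = {"category": 0.25, "file": 0.25, "field": 0.25, "fix": 0.25}  # assumed equal


def grade(diagnosis: dict, truth: dict, hard: bool = False) -> float:
    """Score a diagnosis by case-insensitive substring match per component."""
    earned = 0.0
    for name, weight in WEIGHTS.items():
        expected = truth[name].lower()
        submitted = str(diagnosis.get(name, "")).lower()
        if expected in submitted:
            earned += weight
    if hard and earned < 0.70:
        missed = 1.0 - earned
        earned = 1.0 - 1.5 * missed  # assumed reading of the 1.5x penalty
    return min(0.99, max(0.01, earned))  # clamp to the documented score range
```

Under these assumptions, a hard-task diagnosis with two of four components correct scores 0.25 rather than 0.5, while three correct components (0.75) escape the penalty.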

This two-layer design means that an agent that investigates thoroughly but diagnoses incorrectly still earns per-step rewards, while an agent that submits immediately with a lucky guess earns the terminal reward but forgoes the exploration bonuses.