Architecture
System Overview
Agent (inference.py)
  │
  │ POST /reset, POST /step
  ▼
FastAPI Server (app.py)
  │
  │ reset(), step()
  ▼
MLOpsEnvironment (mlops_environment.py)
  │
  ├── ArtifactGenerator (artifact_generator.py)
  │   ├── BUG_CATALOGUE: 9 bug specs across 3 tiers
  │   └── Procedural generation: config, logs, stats, code, eval, model card
  │
  ├── Sanity Check Engine (artifact_generator.py)
  │   └── 8 computed diagnostics grounded in generated artifacts
  │
  ├── Grader (_handle_submit)
  │   └── 4-component scoring: category + file + field + fix
  │
  └── Models (models.py)
      └── MLOpsAction, MLOpsObservation, MLOpsState, ArtifactMeta
Data Flow
Episode Lifecycle
1. reset(task_id, seed)
   ├── Random(seed) selects bug from task pool
   ├── ArtifactGenerator creates 6 consistent artifacts with planted fault
   └── Returns: MLOpsObservation with task description + artifact metadata
2. step(action) × N
   ├── read_* actions → return artifact content (reward: +0.02 new, -0.02 duplicate)
   ├── run_sanity_check → compute diagnostic from artifacts (reward: +0.01 new)
   ├── query_artifact → return specific field via dot notation
   └── submit_diagnosis → grade against ground truth (terminal)
3. Grading (_handle_submit)
   ├── Compare 4 components against BugSpec ground truth
   ├── Apply hard task penalty if score < 0.70
   └── Return: score ∈ (0.01, 0.99), breakdown, ground truth
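The query_artifact action above returns a field addressed with dot notation. A minimal sketch of such a lookup, assuming artifacts are nested dictionaries; `resolve_path` and the sample config are hypothetical names for illustration, not the environment's actual code:

```python
from typing import Any


def resolve_path(artifact: dict, path: str) -> Any:
    """Walk a nested dict using a dotted path such as 'optimizer.lr'."""
    current: Any = artifact
    for key in path.split("."):
        # Stop with a clear error as soon as a path segment is missing.
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"field not found: {path!r} (stopped at {key!r})")
        current = current[key]
    return current


# Hypothetical artifact content, for illustration only.
config = {"optimizer": {"name": "adam", "lr": 0.001}, "epochs": 10}
learning_rate = resolve_path(config, "optimizer.lr")
```

Raising on a missing segment, rather than returning None, lets a handler distinguish a field that is absent from a field whose value is null.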
Determinism Guarantees
- random.Random(seed) for bug selection and artifact variation
- np.random.RandomState(seed) for numeric distributions
- No external state, no network calls during generation
- Same (task_id, seed) always produces identical episode
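The guarantees above rest on giving each episode its own seeded generator instead of touching global random state. A minimal sketch of that pattern, using only the standard library; `TASK_POOL`, `select_bug`, and `sample_noise` are hypothetical names, and the numeric side would use np.random.RandomState(seed) in the same way:

```python
import random

# Hypothetical task pool; the real catalogue holds 9 bug specs across 3 tiers.
TASK_POOL = ["bug_a", "bug_b", "bug_c"]


def select_bug(seed: int) -> str:
    """Pick a bug with a private RNG so the global random state is untouched."""
    rng = random.Random(seed)
    return rng.choice(TASK_POOL)


def sample_noise(seed: int, n: int = 3) -> list:
    """Draw n Gaussian values that depend only on the seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Consuming the global generator in between does not change the results.
first = (select_bug(42), sample_noise(42))
random.random()
second = (select_bug(42), sample_noise(42))
```

Because each function builds its generator from the seed alone, `first` and `second` are identical regardless of what else the process has sampled.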
Component Responsibilities
app.py: API Layer
- FastAPI server on port 7860
- REST endpoints: /reset, /step, /state, /health, /tasks
- WebSocket endpoint: /ws for streaming interaction
- Stateless request handling; delegates to MLOpsEnvironment
mlops_environment.py: Core Logic
- Episode state management (step count, artifacts read, score)
- Action routing to handlers
- Grading logic with 4-component scoring
- grade_task(): standalone grader for OpenEnv validation
artifact_generator.py: Content Generation
- BugSpec dataclass: category, file, field, gold_fix, difficulty
- BUG_CATALOGUE: 9 bug specifications
- ArtifactGenerator: produces 6 artifacts per episode
- run_sanity_check(): 8 computed diagnostic checks
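The BugSpec fields listed above could be declared roughly as follows. This is a sketch under assumptions: the field names come from the list, but the types, the frozen setting, and every value in the example entry are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BugSpec:
    """Ground truth for one planted fault (field types are assumed)."""
    category: str    # kind of bug being planted
    file: str        # artifact that contains the fault
    field: str       # location of the fault inside that artifact
    gold_fix: str    # reference fix used by the grader
    difficulty: str  # assumed tier label, e.g. "easy", "medium", "hard"


# Hypothetical catalogue entry, for illustration only.
example = BugSpec(
    category="data_leakage",
    file="preprocessing.py",
    field="scaler.fit",
    gold_fix="fit the scaler on the training split only",
    difficulty="medium",
)
```

Freezing the dataclass is a design choice in this sketch: it keeps the ground truth immutable once an episode has been generated.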
models.py: Data Models
- MLOpsAction: 8 action types with typed parameters
- MLOpsObservation: full agent observation per step
- MLOpsState: internal state for debugging/RL harness
- ArtifactMeta: artifact metadata (name, description, size hint)
inference.py: Baseline Agent
- LLM-powered agent using Gemini via OpenAI-compatible API
- Investigation phase: reads artifacts, runs sanity checks
- Diagnosis phase: submits structured diagnosis
- Fallback logic for unparseable LLM output
- Rate limiting with exponential backoff
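The rate limiting with exponential backoff mentioned above can be sketched as a small retry helper. This is an illustrative version, not the agent's actual code: `call_with_backoff` and its parameters are hypothetical, and a real agent would likely catch only its client's rate-limit error rather than every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Out of retries: surface the last error to the caller.
            if attempt == max_retries:
                raise
            # Double the wait after each failure, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError("max_retries must be non-negative")
```

The `sleep` parameter is injectable so the schedule can be checked without actually waiting, for example by passing `delays.append` and inspecting the recorded values.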
client.py: Client Library
- MLOpsDebugEnv: async httpx client
- SyncMLOpsDebugEnv: synchronous wrapper
- Context manager support for connection lifecycle
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | / | API info |
| GET | /health | Health check |
| GET | /tasks | List available tasks |
| POST | /reset | Start new episode |
| POST | /step | Execute action |
| GET | /state | Current episode state |
| GET | /openenv/state | OpenEnv framework state |
| WS | /ws | WebSocket interface |
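As a rough illustration of calling the two POST endpoints, the sketch below builds the requests with the standard library without sending them. The host, the `build_post` helper, the task ID, and the JSON field names are assumptions; the payloads mirror reset(task_id, seed) and the run_sanity_check action, but the server's actual schema may differ.

```python
import json
import urllib.request

# Port 7860 comes from the document; the host is an assumption.
BASE_URL = "http://localhost:7860"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payload shapes and a hypothetical task ID, for illustration only.
reset_req = build_post("/reset", {"task_id": "easy", "seed": 42})
step_req = build_post("/step", {"action_type": "run_sanity_check"})

# To send one against a running server:
# with urllib.request.urlopen(reset_req) as resp:
#     observation = json.load(resp)
```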
Reward Architecture
The reward function has two layers:
Per-step (dense): Encourages systematic investigation
- New artifact read: +0.02 (explore broadly)
- Duplicate read: -0.02 (don't brute force)
- New sanity check: +0.01 (use diagnostics)
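The per-step rules above amount to remembering what has already been seen. A minimal sketch, assuming string identifiers for artifacts and checks; `StepRewardTracker` is a hypothetical name, and the zero reward for a repeated sanity check is an assumption, since only the reward for a new check is stated:

```python
NEW_READ_REWARD = 0.02
DUPLICATE_READ_PENALTY = -0.02
NEW_CHECK_REWARD = 0.01


class StepRewardTracker:
    """Tracks which artifacts and checks an episode has already used."""

    def __init__(self) -> None:
        self.artifacts_read = set()
        self.checks_run = set()

    def on_read(self, artifact: str) -> float:
        # Re-reading an artifact is penalized to discourage brute force.
        if artifact in self.artifacts_read:
            return DUPLICATE_READ_PENALTY
        self.artifacts_read.add(artifact)
        return NEW_READ_REWARD

    def on_sanity_check(self, check: str) -> float:
        if check in self.checks_run:
            return 0.0  # assumption: a repeated check earns nothing
        self.checks_run.add(check)
        return NEW_CHECK_REWARD
```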
Terminal (graded): Evaluates diagnosis quality
- 4 independent components sum to max 1.0
- Keyword/substring matching (no LLM judge)
- Hard task asymmetric penalty (1.5x on missed components)
This two-layer design means an agent that investigates thoroughly but reaches the wrong diagnosis still earns per-step rewards, while an agent that submits immediately with a lucky guess earns the terminal reward but misses the exploration bonuses.
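The terminal score can be sketched as follows. Several details here are assumptions rather than the grader's confirmed behavior: equal weights of 0.25 per component, case-insensitive substring matching, and one possible reading of the 1.5x penalty in which each missed component on a hard task costs an extra half of its weight when the raw score is below 0.70. The `grade` and `matches` names are hypothetical.

```python
COMPONENTS = ("category", "file", "field", "fix")
WEIGHT = 0.25  # assumption: four equal weights summing to 1.0


def matches(expected: str, submitted: str) -> bool:
    """Case-insensitive substring match; no LLM judge involved."""
    return expected.strip().lower() in submitted.strip().lower()


def grade(truth: dict, diagnosis: dict, hard: bool = False):
    """Return (score, breakdown) for a submitted diagnosis."""
    breakdown = {
        name: WEIGHT if matches(truth[name], diagnosis.get(name, "")) else 0.0
        for name in COMPONENTS
    }
    score = sum(breakdown.values())
    if hard and score < 0.70:
        # Assumed penalty: a missed component counts 1.5x, so subtract
        # an extra 0.5 * WEIGHT for each one.
        missed = sum(1 for value in breakdown.values() if value == 0.0)
        score -= 0.5 * WEIGHT * missed
    # Keep the result within the documented 0.01 to 0.99 range.
    return min(0.99, max(0.01, score)), breakdown
```

Under these assumptions, a hard-task diagnosis that gets two of four components right scores 0.25 instead of 0.50, while one that gets three right is above the 0.70 threshold and keeps its 0.75.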